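Roof-mounted spinning LiDAR sensors are widely used by autonomous vehicles, driving the need for real-time processing of 3D point sequences. However, most LiDAR semantic segmentation datasets and algorithms split these acquisitions into $360^\circ$ frames, leading to an acquisition latency that is incompatible with realistic real-time applications and evaluations. We address this issue with two key contributions. First, we introduce HelixNet, a 10 billion point dataset with fine-grained labels, timestamps, and sensor rotation information that allows an accurate assessment of the real-time readiness of segmentation algorithms. Second, we propose Helix4D, a compact and efficient spatio-temporal transformer architecture specifically designed for rotating LiDAR point sequences. Helix4D operates on acquisition slices that correspond to a fraction of a full rotation of the sensor, significantly reducing the total latency. We present an extensive benchmark of the performance and real-time readiness of several state-of-the-art models on HelixNet and SemanticKITTI. Helix4D reaches accuracy on par with the best segmentation algorithms with a reduction of more than $5\times$ in latency and $50\times$ in model size. Code and data are available at: https://romainloiseau.fr/helixnet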
translated by 谷歌翻译
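The idea of operating on acquisition slices rather than full $360^\circ$ frames can be illustrated with a minimal sketch. It assumes each point carries an unwrapped sensor rotation angle in radians; the function names and the choice of 5 slices per rotation are hypothetical and are not taken from the Helix4D code.

```python
import math

def slice_index(rotation_angle_rad, slices_per_rotation=5):
    """Map an unwrapped sensor rotation angle (radians) to a slice id.

    slices_per_rotation is an illustrative value, not the authors' setting.
    """
    slice_width = 2.0 * math.pi / slices_per_rotation
    return int(rotation_angle_rad // slice_width)

def split_into_slices(points, slices_per_rotation=5):
    """Group points (x, y, z, rotation_angle_rad) into consecutive slices."""
    slices = {}
    for point in points:
        key = slice_index(point[3], slices_per_rotation)
        slices.setdefault(key, []).append(point)
    # Return the slices in acquisition order.
    return [slices[key] for key in sorted(slices)]

def acquisition_latency(rotation_period_s, slices_per_rotation):
    """Time spent waiting for one slice to be fully acquired."""
    return rotation_period_s / slices_per_rotation
```

Under these assumptions, a sensor spinning at 10 Hz needs 0.1 s to acquire a full frame, whereas a slice covering one fifth of a rotation is complete after about 0.02 s, so processing can start earlier.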
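The success of the transformer architecture in natural language processing has recently drawn attention in the computer vision field. Thanks to its ability to learn long-range dependencies, the transformer has been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, in which several state-of-the-art methods rely on transformers for better learning. In computer vision, the 3D field has also witnessed an increase in the use of transformers to augment 3D convolutional neural networks and multi-layer perceptron networks. Although a number of surveys have focused on transformers in vision in general, 3D vision requires special attention because its data representation and processing differ from those of 2D vision. In this work, we present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss transformer design in 3D vision, which allows it to process data with various 3D representations. For each application, we highlight the key properties and contributions of the transformer-based methods. To assess the competitiveness of these methods, we compare their performance to common non-transformer methods on 12 3D benchmarks. We conclude the survey by discussing different open directions and challenges for transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.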
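Accurate moving object segmentation is an essential task for autonomous driving. It can provide effective information for many downstream tasks, such as collision avoidance, path planning, and static map construction. How to effectively exploit spatial-temporal information is a critical question for 3D LiDAR moving object segmentation (LiDAR-MOS). In this work, we propose a novel deep neural network that exploits both spatial-temporal information and different representation modalities of LiDAR scans to improve LiDAR-MOS performance. Specifically, we first use a range-image-based dual-branch structure to separately process the spatial and temporal information that can be obtained from sequential LiDAR scans, and then combine them using motion-guided attention modules. We also use a point refinement module based on 3D sparse convolution to fuse the information from the LiDAR range image and point cloud representations and to reduce artifacts on object boundaries. We verify the effectiveness of our proposed approach on the LiDAR-MOS benchmark of SemanticKITTI. Our method significantly outperforms the state-of-the-art methods in terms of LiDAR-MOS IoU. Benefiting from the devised coarse-to-fine architecture, our method operates online at sensor frame rate. The implementation of our method is available as open source at: https://github.com/haomo-ai/motionseg3d.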
Scene understanding is crucial for autonomous robots in dynamic environments for making future state predictions, avoiding collisions, and path planning. Camera and LiDAR perception made tremendous progress in recent years, but face limitations under adverse weather conditions. To leverage the full potential of multi-modal sensor suites, radar sensors are essential for safety critical tasks and are already installed in most new vehicles today. In this paper, we address the problem of semantic segmentation of moving objects in radar point clouds to enhance the perception of the environment with another sensor modality. Instead of aggregating multiple scans to densify the point clouds, we propose a novel approach based on the self-attention mechanism to accurately perform sparse, single-scan segmentation. Our approach, called Gaussian Radar Transformer, includes the newly introduced Gaussian transformer layer, which replaces the softmax normalization by a Gaussian function to decouple the contribution of individual points. To tackle the challenge of the transformer to capture long-range dependencies, we propose our attentive up- and downsampling modules to enlarge the receptive field and capture strong spatial relations. We compare our approach to other state-of-the-art methods on the RadarScenes data set and show superior segmentation quality in diverse environments, even without exploiting temporal information.
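The decoupling effect of replacing softmax with a Gaussian function can be illustrated with a small sketch. The abstract does not give the exact functional form, so the zero-mean Gaussian of the attention score used here, the `sigma` parameter, and the function names are all assumptions made for illustration only.

```python
import math

def softmax_weights(scores):
    """Standard softmax: every weight depends on all scores via the sum."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_weights(scores, sigma=1.0):
    """Assumed Gaussian weighting: each weight depends on its own score only."""
    return [math.exp(-(s * s) / (2.0 * sigma * sigma)) for s in scores]
```

With softmax, changing the score of one neighboring point rescales the weights of all others, because the weights must sum to one. With the elementwise Gaussian in this sketch, the weight of a point is unaffected by the scores of the other points, which is one way to read "decouple the contribution of individual points".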
In this work, we address the problem of unsupervised moving object segmentation (MOS) in 4D LiDAR data recorded from a stationary sensor, where no ground truth annotations are involved. Deep learning-based state-of-the-art methods for LiDAR MOS strongly depend on annotated ground truth data, which is expensive to obtain and scarce in existence. To close this gap in the stationary setting, we propose a novel 4D LiDAR representation based on multivariate time series that relaxes the problem of unsupervised MOS to a time series clustering problem. More specifically, we propose modeling the change in occupancy of a voxel by a multivariate occupancy time series (MOTS), which captures spatio-temporal occupancy changes on the voxel level and its surrounding neighborhood. To perform unsupervised MOS, we train a neural network in a self-supervised manner to encode MOTS into voxel-level feature representations, which can be partitioned by a clustering algorithm into moving or stationary. Experiments on stationary scenes from the Raw KITTI dataset show that our fully unsupervised approach achieves performance that is comparable to that of supervised state-of-the-art approaches.
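A multivariate occupancy time series of the kind described above can be sketched as follows. This is a minimal illustration, assuming binary occupancy, a cubic neighborhood of 3 x 3 x 3 voxels, and scans already expressed in the stationary sensor frame; the voxel size, the neighborhood radius, and the function names are hypothetical.

```python
import math
from itertools import product

def voxelize(points, voxel_size):
    """Return the set of integer voxel coordinates occupied by a scan."""
    return {tuple(int(math.floor(c / voxel_size)) for c in p) for p in points}

def multivariate_occupancy_series(scans, voxel, voxel_size=0.5, radius=1):
    """Build a T x K binary matrix for one voxel.

    Row t holds the occupancy of the voxel and its neighbors in scan t,
    with K = (2 * radius + 1) ** 3 neighborhood cells.
    """
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    series = []
    for scan in scans:
        occupied = voxelize(scan, voxel_size)
        row = [
            1 if (voxel[0] + dx, voxel[1] + dy, voxel[2] + dz) in occupied else 0
            for dx, dy, dz in offsets
        ]
        series.append(row)
    return series
```

In this sketch, a voxel crossed by a moving object produces rows that change over time, while a static voxel produces constant rows. The resulting matrices are the kind of input that an encoder followed by a clustering step could separate into moving and stationary groups.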
Our dataset provides dense annotations for each scan of all sequences from the KITTI Odometry Benchmark [19]. Here, we show multiple scans aggregated using pose information estimated by a SLAM approach.
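In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to the input scene size is significantly smaller than in 2D detection. Overlooking this difference, many 3D detectors directly follow the common practice of 2D detectors, which downsample the feature maps even after quantizing the point clouds. In this paper, we start by rethinking how this multi-stride stereotype affects LiDAR-based 3D object detectors. Our experiments indicate that the downsampling operations bring few advantages and lead to inevitable information loss. To remedy this issue, we propose the Single-stride Sparse Transformer (SST), which maintains the original resolution from the beginning to the end of the network. Armed with transformers, our method addresses the problem of insufficient receptive field in single-stride architectures. It also cooperates well with the sparsity of point clouds and naturally avoids expensive computation. Eventually, our SST achieves state-of-the-art results on the large-scale Waymo Open Dataset. It is worth mentioning that, thanks to its single-stride design, our method achieves exciting performance (83.8 Level 1 AP) on small object (pedestrian) detection. Code will be released at https://github.com/tusimple/sst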
3D autonomous driving semantic segmentation using deep learning has become a well-studied subject, providing methods that can reach very high performance. Nonetheless, because of the limited size of the training datasets, these models cannot see every type of object and scene found in real-world applications. The ability to be reliable in these various unknown environments is called domain generalization. Despite its importance, domain generalization is relatively unexplored in the case of 3D autonomous driving semantic segmentation. To fill this gap, this paper presents the first benchmark for this application by testing state-of-the-art methods and discussing the difficulty of tackling LiDAR domain shifts. We also propose the first method designed to address this domain generalization, which we call 3DLabelProp. This method leverages the geometry and sequentiality of the LiDAR data to enhance its generalization performance by working on partially accumulated point clouds. It reaches a mIoU of 52.6% on SemanticPOSS while being trained only on SemanticKITTI, making it the state-of-the-art method for generalization (+7.4% better than the second best method). The code for this method will be available on GitHub.
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
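Point clouds are a key modality used for perception in autonomous vehicles, providing the means for a robust geometric understanding of the surrounding environment. However, although the sensor outputs of autonomous vehicles are naturally sequential, exploiting point cloud sequences for 3D semantic segmentation has received limited exploration. In this paper, we propose a novel Sparse Temporal Local Attention (STELA) module, which aggregates intermediate features from a local neighborhood in previous point cloud frames to provide a rich temporal context to the decoder. Using the sparse local neighborhood enables our approach to gather features more flexibly than methods that directly match point features, and more efficiently than methods that perform expensive global attention over the whole point cloud frame. We achieve a competitive mIoU of 64.3% on the SemanticKITTI dataset, and demonstrate a significant improvement over the single-frame baseline in our ablation studies.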
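Several of the abstracts in this list report results as mIoU. For reference, the metric can be sketched as below: per class, the intersection over union is TP / (TP + FP + FN), and the mIoU is the average over classes. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch; benchmark toolkits may handle that case differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over per-point class labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = tp + fp + fn
        if denom > 0:  # assumed convention: ignore classes absent from both
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with predictions `[0, 0, 1, 1]` and labels `[0, 1, 1, 1]`, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mIoU is 7/12.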
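This work addresses a gap in semantic scene completion (SSC) data by creating a novel outdoor dataset with accurate and complete dynamic scenes. Our dataset is formed from randomly sampled views of the world at each time step, which supervises generalizability to complete scenes without occlusions or traces. We create SSC baselines from state-of-the-art open-source networks and construct a benchmark real-time dense local semantic mapping algorithm, MotionSC, by leveraging recent 3D deep learning architectures to enhance SSC with temporal information. Our network shows that the proposed dataset can quantify and supervise accurate scene completion in the presence of dynamic objects, which can lead to the development of improved dynamic mapping algorithms. All software is available at https://github.com/umich-curly/3dmapping.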
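We introduce PointConvFormer, a novel building block for point-cloud-based deep neural network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are based only on relative position, and from transformers, which use feature-based attention. In PointConvFormer, the feature difference between points in a neighborhood serves as an indicator to re-weight the convolutional weights. Hence, we preserve the invariances of the point convolution operation, whereas attention is used to select the relevant points in the neighborhood for convolution. To validate the effectiveness of PointConvFormer, we experiment on both semantic segmentation and scene flow estimation tasks on point clouds, with multiple datasets including ScanNet, SemanticKITTI, FlyingThings3D, and KITTI. Our results show that PointConvFormer outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with smaller, more efficient networks. Visualizations show that PointConvFormer performs similarly to convolution on flat surfaces, whereas the neighborhood selection effect is stronger on object boundaries, showing that it gets the best of both worlds.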
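LiDAR sensors are essential to the perception systems of autonomous vehicles and intelligent robots. To fulfill the real-time requirements of real-world applications, it is necessary to segment LiDAR scans efficiently. Most previous approaches directly project the 3D point cloud onto a 2D spherical range image so that they can use efficient 2D convolutional operations for image segmentation. Although these approaches have achieved encouraging results, the neighborhood information is not well preserved in the spherical projection. Moreover, temporal information is not taken into account in the single-scan segmentation task. To tackle these problems, we propose a novel approach to semantic segmentation of LiDAR sequences named Meta-RangeSeg, in which a new range residual image representation is introduced to capture spatial-temporal information. Specifically, a Meta-Kernel is employed to extract meta features, which reduces the inconsistency between the 2D range image coordinate input and the 3D Cartesian coordinate output. An efficient U-Net backbone is used to obtain multi-scale features. In addition, the Feature Aggregation Module (FAM) strengthens the role of the range channel and aggregates features at different levels. We have conducted extensive experiments for evaluation on SemanticKITTI and SemanticPOSS. The promising results show that our proposed Meta-RangeSeg method is more efficient and effective than existing approaches. Our full implementation is publicly available at https://github.com/songw-zju/meta-rangeseg.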
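The spherical projection that range-image methods rely on, together with a simple range residual between two scans, can be sketched as follows. The image size and vertical field of view are hypothetical sensor parameters, and the per-pixel absolute difference is only one plausible form of a range residual; the abstract does not specify how Meta-RangeSeg computes it. The sketch also assumes that the previous scan has already been transformed into the current sensor frame.

```python
import math

def spherical_projection(points, height=64, width=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points onto a height x width range image."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]  # 0.0 marks an empty pixel
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        col = int(math.floor(0.5 * (1.0 - yaw / math.pi) * width))
        row = int(math.floor((1.0 - (pitch - fov_down) / fov) * height))
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # When several points fall into one pixel, keep the closest one.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image

def range_residual(current, previous):
    """Assumed residual: |r_cur - r_prev| where both pixels are valid."""
    return [
        [abs(c - p) if c > 0.0 and p > 0.0 else 0.0 for c, p in zip(cr, pr)]
        for cr, pr in zip(current, previous)
    ]
```

Pixels that belong to static structure yield residuals near zero after alignment, while pixels on moving objects yield larger values, which is why residual channels can carry temporal cues into a 2D network.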
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information, that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method to learn useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO
Automotive radar sensors provide valuable information for advanced driving assistance systems (ADAS). Radars can reliably estimate the distance to an object and the relative velocity, regardless of weather and light conditions. However, radar sensors suffer from low resolution and huge intra-class variations in the shape of objects. Exploiting the time information (e.g., multiple frames) has been shown to help to better capture the dynamics of objects and, therefore, the variation in the shape of objects. Most temporal radar object detectors use 3D convolutions to learn spatial and temporal information. However, these methods are often non-causal and unsuitable for real-time applications. This work presents RECORD, a new recurrent CNN architecture for online radar object detection. We propose an end-to-end trainable architecture mixing convolutions and ConvLSTMs to learn spatio-temporal dependencies between successive frames. Our model is causal and requires only the past information encoded in the memory of the ConvLSTMs to detect objects. Our experiments show the relevance of such a method for detecting objects in different radar representations (range-Doppler, range-angle) and show that it outperforms state-of-the-art models on the ROD2021 and CARRADA datasets while being less computationally expensive. The code will be available soon.
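Unprecedented access to multi-temporal satellite imagery has opened new perspectives for a variety of Earth observation tasks. Among them, pixel-precise panoptic segmentation of agricultural parcels has major economic and environmental implications. While researchers have explored this problem for single images, we argue that the complex temporal patterns of crop phenology are better addressed with temporal sequences of images. In this paper, we present the first end-to-end, single-stage method for panoptic segmentation of Satellite Image Time Series (SITS). This module can be combined with our novel image sequence encoding network, which relies on temporal self-attention to extract rich and adaptive multi-scale spatio-temporal features. We also introduce PASTIS, the first open-access SITS dataset with panoptic annotations. We demonstrate the superiority of our encoder for semantic segmentation against multiple competing architectures, and establish the first state of the art for panoptic segmentation of SITS. Our implementation and PASTIS are publicly available.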
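LiDAR-based 3D object detection, semantic segmentation, and panoptic segmentation are usually implemented in specialized networks with distinctive architectures that are difficult to adapt to each other. This paper presents LidarMultiNet, a LiDAR-based multi-task network that unifies these three major LiDAR perception tasks. Among its many benefits, a multi-task network can reduce the overall cost by sharing weights and computation among multiple tasks. However, it typically underperforms compared to independently combined single-task models. The proposed LidarMultiNet aims to bridge the performance gap between the multi-task network and multiple single-task networks. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture with a Global Context Pooling (GCP) module that extracts global contextual features from a LiDAR frame. Task-specific heads are added on top of the network to perform the three LiDAR perception tasks. More tasks can be implemented simply by adding new task-specific heads, at little additional cost. A second stage is also proposed to refine the first-stage segmentation and generate accurate panoptic segmentation results. LidarMultiNet is extensively tested on both the Waymo Open Dataset and the nuScenes dataset, demonstrating for the first time that the major LiDAR perception tasks can be unified in a single strong network that is trained end-to-end and achieves state-of-the-art performance. Notably, LidarMultiNet reaches the highest mIoU and the best accuracy for most of the 22 classes on the test set of the Waymo Open Dataset 3D semantic segmentation challenge 2022, using only LiDAR points as input. It also sets a new state of the art for a single model on the Waymo 3D object detection benchmark and on three nuScenes benchmarks.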
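Multi-beam LiDAR sensors, as used on autonomous vehicles and mobile robots, acquire sequences of 3D range scans ("frames"). Each frame covers the scene sparsely, due to limited angular scanning resolution and occlusion. The sparsity restricts the performance of downstream processes such as semantic segmentation or surface reconstruction. Luckily, when the sensor moves, frames are captured from a sequence of different viewpoints. This provides complementary information and, when accumulated in a common scene coordinate frame, yields a denser sampling and a more complete coverage of the underlying 3D scene. However, the scanned scenes often contain moving objects. Points on those objects are not correctly aligned by just undoing the scanner's ego-motion. In this paper, we explore multi-frame point cloud accumulation as a mid-level representation of 3D scan sequences, and develop a method that exploits inductive biases of outdoor street scenes, including their geometric layout and object-level rigidity. Compared to state-of-the-art scene flow estimators, our proposed approach aims to align all 3D points in a common reference frame, correctly accumulating the points on the individual objects. Our approach greatly reduces the alignment errors on several benchmark datasets. Moreover, the accumulated point clouds benefit high-level tasks such as surface reconstruction.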
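The baseline step of accumulating frames by undoing the sensor's ego-motion can be sketched as below. This is not the authors' method: it is only the naive accumulation that the abstract describes as insufficient for moving objects. The sketch assumes each frame comes with a known 4 x 4 sensor-to-world pose matrix given as nested lists; the function names are hypothetical.

```python
def transform_point(pose, point):
    """Apply a 4x4 rigid transform (rotation and translation) to a 3D point."""
    x, y, z = point
    return tuple(
        pose[i][0] * x + pose[i][1] * y + pose[i][2] * z + pose[i][3]
        for i in range(3)
    )

def accumulate(frames, poses):
    """Merge frames into one cloud expressed in the common world frame."""
    cloud = []
    for points, pose in zip(frames, poses):
        cloud.extend(transform_point(pose, p) for p in points)
    return cloud
```

Static structure lines up after this step, but a car that moved between two frames appears twice or smeared, because its own motion is not captured by the sensor poses. Handling that residual motion per object is the problem the abstract addresses.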
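A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual decoder, a task-aware attention module (TAM), and a cluster-free instance segmentation head. TAM is designed to force the two decoders to learn rich task-aware features for semantic and instance embedding. Moreover, CPSeg incorporates a new cluster-free instance segmentation head that dynamically pillarizes foreground points according to the learned embedding. It then acquires instance labels by finding connected pillars with a pairwise embedding comparison. Thus, the conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress the instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate the surface normals of each point cloud in real time. The proposed method is benchmarked on two large-scale autonomous driving datasets, namely SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves state-of-the-art results among real-time approaches on both datasets.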
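The step of obtaining instance labels from a pairwise embedding comparison can be illustrated with a small sketch. It is not the CPSeg head: here the binary comparison is a simple Euclidean distance threshold on given embeddings, whereas the abstract describes a learned binary segmentation of the comparison matrix. The threshold value and the function name are hypothetical.

```python
import math

def instance_labels(embeddings, threshold=0.5):
    """Label items so that connected items share an instance id.

    Two items are connected when their embeddings are closer than
    `threshold`; instances are the connected components of that graph.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(embeddings[i], embeddings[j]) < threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Map each component root to a consecutive instance id.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

For example, `instance_labels([(0, 0), (0.1, 0), (5, 5), (5.2, 5)])` groups the first two and the last two items into separate instances. No iterative clustering is involved, only a pass over the binary comparison matrix.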
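Panoptic segmentation of point clouds is a crucial task that enables autonomous vehicles to comprehend their vicinity using their highly accurate and reliable LiDAR sensors. Existing top-down approaches tackle this problem either by combining independent task-specific networks or by translating methods from the image domain, ignoring the intricacies of LiDAR data and thus often resulting in sub-optimal performance. In this paper, we present the novel top-down Efficient LiDAR Panoptic Segmentation (EfficientLPS) architecture, which addresses multiple challenges in segmenting LiDAR point clouds, including distance-dependent sparsity, severe occlusions, large scale variations, and re-projection errors. EfficientLPS comprises a novel shared backbone that encodes with strengthened geometric transformation modeling capacity and aggregates semantically rich, range-aware multi-scale features. It incorporates new scale-invariant semantic and instance segmentation heads, along with a panoptic fusion module supervised by our proposed panoptic periphery loss function. Additionally, we formulate a regularized pseudo-labeling framework to further improve the performance of EfficientLPS by training on unlabeled data. We benchmark our proposed model on two large-scale LiDAR datasets: nuScenes, for which we also provide ground truth annotations, and SemanticKITTI. Notably, EfficientLPS sets a new state of the art on both datasets.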