In this paper, we propose a graph neural network to detect objects from a LiDAR point cloud. Towards this end, we encode the point cloud efficiently in a fixed radius near-neighbors graph. We design a graph neural network, named Point-GNN, to predict the category and shape of the object that each vertex in the graph belongs to. In Point-GNN, we propose an auto-registration mechanism to reduce translation variance, and also design a box merging and scoring operation to combine detections from multiple vertices accurately. Our experiments on the KITTI benchmark show the proposed approach achieves leading accuracy using the point cloud alone and can even surpass fusion-based algorithms. Our results demonstrate the potential of using the graph neural network as a new approach for 3D object detection. The code is available at https://github.com/WeijingShi/Point-GNN.
translated by 谷歌翻译
从点云的准确3D对象检测已成为自动驾驶中的重要组成部分。但是,前面的作品中的体积表示和投影方法无法在本地点集之间建立关系。在本文中,我们提出了稀疏的Voxel-Graph注意网络(SVGA-Net),一种新型端到端培训网络,主要包含Voxel-Traph模块和稀疏 - 致密的回归模块,以实现RAW的可比3D检测任务LIDAR数据。具体地,SVGA-NET通过所有体素构建每个分割的3D球形体素和全局KNN图中的本地完整图。本地和全局图作为增强提取特征的注意机制。此外,新颖的稀疏 - 密集的回归模块通过不同级别的特征映射聚合来增强3D盒估计精度。 KITTI检测基准测试的实验证明将图形表示扩展到3D对象检测的效率,并且所提出的SVGA-NET可以实现体面的检测精度。
translated by 谷歌翻译
两阶段探测器在3D对象检测中已广受欢迎。大多数两阶段的3D检测器都使用网格点,体素电网或第二阶段的ROI特征提取的采样关键点。但是,这种方法在处理不均匀分布和稀疏的室外点方面效率低下。本文在三个方面解决了这个问题。 1)动态点聚集。我们建议补丁搜索以快速在本地区域中为每个3D提案搜索点。然后,将最远的体素采样采样用于均匀采样点。特别是,体素尺寸沿距离变化,以适应点的不均匀分布。 2)Ro-Graph Poling。我们在采样点上构建本地图,以通过迭代消息传递更好地模型上下文信息和地雷关系。 3)视觉功能增强。我们引入了一种简单而有效的融合策略,以补偿具有有限语义提示的稀疏激光雷达点。基于这些模块,我们将图形R-CNN构建为第二阶段,可以将其应用于现有的一阶段检测器,以始终如一地提高检测性能。广泛的实验表明,图R-CNN的表现优于最新的3D检测模型,而Kitti和Waymo Open DataSet的差距很大。我们在Kitti Bev汽车检测排行榜上排名第一。代码将在\ url {https://github.com/nightmare-n/graphrcnn}上找到。
translated by 谷歌翻译
We present a new two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD). The first stage is a bottom-up proposal generation network that uses raw point cloud as input to generate accurate proposals by seeding each point with a new spherical anchor. It achieves a high recall with less computation compared with prior works. Then, PointsPool is applied for generating proposal features by transforming their interior point features from sparse expression to compact representation, which saves even more computation time. In box prediction, which is the second stage, we implement a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance. We conduct experiments on KITTI dataset, and evaluate our method in terms of 3D object and Bird's Eye View (BEV) detection. Our method outperforms other stateof-the-arts by a large margin, especially on the hard set, with inference speed more than 10 FPS.
translated by 谷歌翻译
它得到了很好的认识到,从深度感知的LIDAR点云和语义富有的立体图像中融合互补信息将有利于3D对象检测。然而,探索稀疏3D点和密集2D像素之间固有的不自然相互作用并不重要。为了简化这种困难,最近的建议通常将3D点投影到2D图像平面上以对图像数据进行采样,然后聚合点处的数据。然而,这种方法往往遭受点云和RGB图像的分辨率之间的不匹配,导致次优性能。具体地,作为多模态数据聚合位置的稀疏点导致高分辨率图像的严重信息丢失,这反过来破坏了多传感器融合的有效性。在本文中,我们呈现VPFNET - 一种新的架构,可以在“虚拟”点处巧妙地对齐和聚合点云和图像数据。特别地,它们的密度位于3D点和2D像素的密度之间,虚拟点可以很好地桥接两个传感器之间的分辨率间隙,从而保持更多信息以进行处理。此外,我们还研究了可以应用于点云和RGB图像的数据增强技术,因为数据增强对迄今为止对3D对象探测器的贡献不可忽略。我们对Kitti DataSet进行了广泛的实验,与最先进的方法相比,观察到了良好的性能。值得注意的是,我们的VPFNET在KITTI测试集上实现了83.21 \%中等3D AP和91.86 \%适度的BEV AP,自2021年5月21日起排名第一。网络设计也考虑了计算效率 - 我们可以实现FPS 15对单个NVIDIA RTX 2080TI GPU。该代码将用于复制和进一步调查。
translated by 谷歌翻译
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A 2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our new-designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A 2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
translated by 谷歌翻译
目前,现有的最先进的3D对象检测器位于两阶段范例中。这些方法通常包括两个步骤:1)利用区域提案网络以自下而上的方式提出少数高质量的提案。 2)调整拟议区域的语义特征的大小和汇集,以总结Roi-Wise表示进一步改进。注意,步骤2中的这些ROI-WISE表示在馈送到遵循检测标题之后,在步骤2中的循环表示作为不相关的条目。然而,我们观察由步骤1所产生的这些提案,以某种方式从地面真理偏移,在局部邻居中兴起潜在的概率。在该提案在很大程度上用于由于坐标偏移而导致其边界信息的情况下出现挑战,而现有网络缺乏相应的信息补偿机制。在本文中,我们向点云进行了3D对象检测的$ BADET $。具体地,而不是以先前的工作独立地将每个提议进行独立地改进每个提议,我们将每个提议代表作为在给定的截止阈值内的图形构造的节点,局部邻域图形式的提案,具有明确利用的对象的边界相关性。此外,我们设计了轻量级区域特征聚合模块,以充分利用Voxel-Wise,Pixel-Wise和Point-Wise特征,具有扩展的接收领域,以实现更多信息ROI-WISE表示。我们在广泛使用的基提数据集中验证了坏人,并且具有高度挑战的Nuscenes数据集。截至4月17日,2021年,我们的坏账在基蒂3D检测排行榜上实现了Par表演,并在Kitti Bev检测排行榜上排名在$ 1 ^ {st} $ in $ superge $难度。源代码可在https://github.com/rui-qian/badet中获得。
translated by 谷歌翻译
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. * Majority of the work done as an intern at Nuro, Inc. depth to point cloud 2D region (from CNN) to 3D frustum 3D box (from PointNet)
translated by 谷歌翻译
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encode both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.
translated by 谷歌翻译
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2 -4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
translated by 谷歌翻译
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
translated by 谷歌翻译
近年来,由于深度学习技术的发展,LiDar Point Clouds的3D对象检测取得了长足的进步。尽管基于体素或基于点的方法在3D对象检测中很受欢迎,但它们通常涉及耗时的操作,例如有关体素的3D卷积或点之间的球查询,从而使所得网络不适合时间关键应用程序。另一方面,基于2D视图的方法具有较高的计算效率,而通常比基于体素或基于点的方法获得的性能低。在这项工作中,我们提出了一个基于实时视图的单阶段3D对象检测器,即CVFNET完成此任务。为了在苛刻的效率条件下加强跨视图的学习,我们的框架提取了不同视图的特征,并以有效的渐进式方式融合了它们。我们首先提出了一个新颖的点范围特征融合模块,该模块在多个阶段深入整合点和范围视图特征。然后,当将所获得的深点视图转换为鸟类视图时,特殊的切片柱旨在很好地维护3D几何形状。为了更好地平衡样品比率,提出了一个稀疏的柱子检测头,将检测集中在非空网上。我们对流行的Kitti和Nuscenes基准进行了实验,并以准确性和速度来实现最先进的性能。
translated by 谷歌翻译
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method significantly improves state-of-the-art monocular 3D detectors by a significant margin (The improvements under the moderate setting on KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D}. Codes have been released at https://github.com/mrsempress/OBMO.
translated by 谷歌翻译
This paper aims at high-accuracy 3D object detection in autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmarkshow that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
translated by 谷歌翻译
We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixelwise neural network predictions. The input representation, network architecture, and model optimization are especially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still runs at > 28 FPS.
translated by 谷歌翻译
由于缺乏深度信息,单眼3D对象检测在自主驾驶中非常具有挑战性。本文提出了一种基于多尺度深度分层的单眼单目眼3D对象检测算法,它使用锚定方法检测每像素预测中的3D对象。在所提出的MDS-Net中,开发了一种新的基于深度的分层结构,以通过在对象的深度和图像尺寸之间建立数学模型来改善网络的深度预测能力。然后开发出新的角度损耗功能,以进一步提高角度预测的精度并提高训练的收敛速度。最终在后处理阶段最终应用优化的软,以调整候选盒的置信度。基蒂基准测试的实验表明,MDS-Net在3D检测中优于现有的单目3D检测方法,并在满足实时要求时进行3D检测和BEV检测任务。
translated by 谷歌翻译
In this paper, we propose PointRCNN for 3D object detection from raw point cloud. The whole framework is composed of two stages: stage-1 for the bottom-up 3D proposal generation and stage-2 for refining proposals in the canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which is combined with global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point cloud as input. The code is available at https://github.com/sshaoshuai/PointRCNN.
translated by 谷歌翻译
由于经过验证的2D检测技术的适用性,大多数当前点云检测器都广泛采用了鸟类视图(BEV)。但是,现有方法通过简单地沿高度尺寸折叠的体素或点特征来获得BEV特征,从而导致3D空间信息的重丢失。为了减轻信息丢失,我们提出了一个基于多级特征降低降低策略的新颖点云检测网络,称为MDRNET。在MDRNET中,空间感知的维度降低(SDR)旨在在体素至BEV特征转换过程中动态关注对象的宝贵部分。此外,提出了多级空间残差(MSR),以融合BEV特征图中的多级空间信息。关于Nuscenes的广泛实验表明,该提出的方法的表现优于最新方法。该代码将在出版时提供。
translated by 谷歌翻译
近年来,自主驾驶LIDAR数据的3D对象检测一直在迈出卓越的进展。在最先进的方法中,已经证明了将点云进行编码为鸟瞰图(BEV)是有效且有效的。与透视图不同,BEV在物体之间保留丰富的空间和距离信息;虽然在BEV中相同类型的更远物体不会较小,但它们包含稀疏点云特征。这一事实使用共享卷积神经网络削弱了BEV特征提取。为了解决这一挑战,我们提出了范围感知注意网络(RAANET),提取更强大的BEV功能并产生卓越的3D对象检测。范围感知的注意力(RAA)卷曲显着改善了近距离的特征提取。此外,我们提出了一种新的辅助损耗,用于密度估计,以进一步增强覆盖物体的Raanet的检测精度。值得注意的是,我们提出的RAA卷积轻量级,并兼容,以集成到用于BEV检测的任何CNN架构中。 Nuscenes DataSet上的广泛实验表明,我们的提出方法优于基于LIDAR的3D对象检测的最先进的方法,具有16 Hz的实时推断速度,为LITE版本为22 Hz。该代码在匿名GitHub存储库HTTPS://github.com/Anonymous0522 / ange上公开提供。
translated by 谷歌翻译
We present a novel and high-performance 3D object detection framework, named PointVoxel-RCNN (PV-RCNN), for accurate 3D object detection from point clouds. Our proposed method deeply integrates both 3D voxel Convolutional Neural Network (CNN) and PointNet-based set abstraction to learn more discriminative point cloud features. It takes advantages of efficient learning and high-quality proposals of the 3D voxel CNN and the flexible receptive fields of the PointNet-based networks. Specifically, the proposed framework summarizes the 3D scene with a 3D voxel CNN into a small set of keypoints via a novel voxel set abstraction module to save follow-up computations and also to encode representative scene features. Given the highquality 3D proposals generated by the voxel CNN, the RoIgrid pooling is proposed to abstract proposal-specific features from the keypoints to the RoI-grid points via keypoint set abstraction with multiple receptive fields. Compared with conventional pooling operations, the RoI-grid feature points encode much richer context information for accurately estimating object confidences and locations. Extensive experiments on both the KITTI dataset and the Waymo Open dataset show that our proposed PV-RCNN surpasses state-of-the-art 3D detection methods with remarkable margins by using only point clouds. Code is available at https://github.com/open-mmlab/OpenPCDet.
translated by 谷歌翻译