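Recently, promising applications in robotics and augmented reality have drawn considerable attention to 3D object detection from point clouds. In this paper, we present FCAF3D, a first-in-class fully convolutional anchor-free indoor 3D object detection method. It is a simple yet effective method that uses a voxel representation of a point cloud and processes voxels with sparse convolutions. FCAF3D can handle large-scale scenes with minimal runtime through a single fully convolutional feed-forward pass. Existing 3D object detection methods make prior assumptions about the geometry of objects, and we argue that this limits their generalization ability. To get rid of any prior assumptions, we propose a novel parametrization of oriented bounding boxes that obtains better results in a purely data-driven way. The proposed method achieves state-of-the-art 3D object detection results on the ScanNet V2 (+4.5), SUN RGB-D (+3.5), and S3DIS (+20.5) datasets. The code and models are available at https://github.com/samsunglabs/fcaf3d.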
translated by 谷歌翻译
3D object detection from LiDAR point clouds is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A^2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. First, the part-aware stage, for the first time, fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high-quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our newly designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A^2 net outperforms all existing 3D detection methods and achieves a new state of the art on the KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird's eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds. In this work, we return to first principles to construct a 3D detection pipeline for point cloud data that is as generic as possible. However, due to the sparse nature of the data (samples from 2D manifolds in 3D space), we face a major challenge when directly predicting bounding box parameters from scene points: a 3D object centroid can be far from any surface point and is thus hard to regress accurately in one step. To address the challenge, we propose VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting. Our model achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size and high efficiency. Remarkably, VoteNet outperforms previous methods by using purely geometric information without relying on color images.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. (* Majority of the work done as an intern at Nuro, Inc.) [Figure: depth to point cloud; 2D region (from CNN) to 3D frustum; 3D box (from PointNet).]
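The "popping up" of a depth image into a point cloud and the lifting of a 2D detection into a 3D frustum can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and a depth map given as nested lists; the function names are illustrative and are not taken from the authors' code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Back-project every valid depth pixel (u, v, z) to a 3D camera-frame point."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # assumption: non-positive depth marks a missing reading
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def points_in_frustum(points: Sequence[Point],
                      box: Tuple[float, float, float, float],
                      fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Keep points whose image projection falls inside a 2D box (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points:
        u = fx * x / z + cx  # pinhole projection back to pixel coordinates
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept
```

In a pipeline of this kind, the 2D box would come from an image detector, and the points kept by `points_in_frustum` would be passed to a point-based network for box estimation.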
In this paper, we propose PointRCNN for 3D object detection from raw point cloud. The whole framework is composed of two stages: stage-1 for the bottom-up 3D proposal generation and stage-2 for refining proposals in the canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which is combined with global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point cloud as input. The code is available at https://github.com/sshaoshuai/PointRCNN.
Three-dimensional objects are commonly represented as 3D boxes in a point cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single-model methods by a large margin and ranks first among all LiDAR-only submissions. The code and pretrained models are available at https://github.com/tianweiy/CenterPoint.
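Greedy closest-point matching, as mentioned in the abstract above, can be sketched roughly as follows. This is an illustrative version only: the distance threshold `max_dist`, the function name, and the tie-breaking order are assumptions, and any velocity-based compensation of the centers is left out.

```python
import math
from typing import List, Sequence, Tuple


def greedy_match(track_centers: Sequence[Sequence[float]],
                 det_centers: Sequence[Sequence[float]],
                 max_dist: float) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Pair tracks with detections, closest pair first, each used at most once."""
    candidates = []
    for t, tc in enumerate(track_centers):
        for d, dc in enumerate(det_centers):
            dist = math.dist(tc, dc)  # Euclidean distance between centers
            if dist <= max_dist:      # hypothetical gating threshold
                candidates.append((dist, t, d))
    candidates.sort()  # smallest distance first

    used_tracks, used_dets, matches = set(), set(), []
    for _dist, t, d in candidates:
        if t in used_tracks or d in used_dets:
            continue
        used_tracks.add(t)
        used_dets.add(d)
        matches.append((t, d))

    # Detections left unmatched would typically start new tracks.
    unmatched = [d for d in range(len(det_centers)) if d not in used_dets]
    return matches, unmatched
```

Unlike an optimal assignment (for example, the Hungarian algorithm), the greedy pass commits to the closest pair first and never revisits it, which keeps the procedure simple at the cost of global optimality.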
Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
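The first step described above, dividing a point cloud into equally spaced voxels and grouping the points that fall in each, can be sketched as below. This is a simplified illustration: the cap `max_points_per_voxel` keeps the first points encountered rather than sampling them, and the learned VFE layer itself is not modeled.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelIndex = Tuple[int, ...]


def voxelize(points: Sequence[Point],
             voxel_size: Sequence[float],
             max_points_per_voxel: int) -> Dict[VoxelIndex, List[Point]]:
    """Group points by the integer index of the voxel that contains them."""
    voxels = defaultdict(list)
    for p in points:
        # floor() rather than int() so negative coordinates land in the right cell
        key = tuple(math.floor(c / s) for c, s in zip(p, voxel_size))
        if len(voxels[key]) < max_points_per_voxel:  # simplification: keep first N
            voxels[key].append(p)
    return dict(voxels)
```

Only non-empty voxels appear in the returned dictionary, which reflects the sparsity that makes voxel-based processing of LiDAR data tractable.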
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders: fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2-4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.
We present a new two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD). The first stage is a bottom-up proposal generation network that uses raw point clouds as input to generate accurate proposals by seeding each point with a new spherical anchor. It achieves a high recall with less computation compared with prior works. Then, PointsPool is applied to generate proposal features by transforming their interior point features from a sparse expression to a compact representation, which saves even more computation time. In box prediction, which is the second stage, we implement a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance. We conduct experiments on the KITTI dataset and evaluate our method in terms of 3D object and Bird's Eye View (BEV) detection. Our method outperforms other state-of-the-art methods by a large margin, especially on the hard set, with an inference speed of more than 10 FPS.
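Current 3D segmentation methods rely heavily on large-scale point-cloud datasets, which are notoriously laborious to annotate. Few attempts have been made to circumvent the need for dense per-point annotations. In this work, we look at weakly supervised 3D semantic instance segmentation. The key idea is to leverage 3D bounding box labels, which are easier and faster to annotate. Indeed, we show that it is possible to train dense segmentation models using only bounding box labels. At the core of our method, Box2Mask, lies a deep model, inspired by classical Hough voting, that directly votes for bounding box parameters, together with a clustering method specifically tailored to bounding box votes. This goes beyond commonly used center votes, which would not fully exploit the bounding box annotations. On the ScanNet test set, our weakly supervised model attains leading performance among weakly supervised approaches (+18 mAP@50). Remarkably, it also achieves 97% of the mAP@50 score of current fully supervised models. To further illustrate the practicality of our work, we train Box2Mask on the recently released ARKitScenes dataset, which is annotated with 3D bounding boxes only, and show, for the first time, compelling 3D instance segmentation masks.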
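3D object detection has achieved remarkable progress by taking point clouds as the only input. However, point clouds often suffer from incomplete geometric structures and a lack of semantic information, which makes it hard for detectors to accurately classify detected objects. In this work, we focus on how to effectively utilize object-level information from images to boost the performance of point-based 3D detectors. We present DeMF, a simple yet effective method to fuse image information into point features. Given a set of point features and image feature maps, DeMF adaptively aggregates image features by taking the projected 2D location of the 3D point as reference. We evaluate our method on the challenging SUN RGB-D dataset, improving state-of-the-art results (+2.1 mAP@0.25 and +2.3 mAP@0.5). Code is available at https://github.com/haoy945/demf.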
We focus on the task of amodal 3D object detection in RGB-D images, which aims to produce a 3D bounding box of an object in metric form at its full extent. We introduce Deep Sliding Shapes, a 3D ConvNet formulation that takes a 3D volumetric scene from an RGB-D image as input and outputs 3D object bounding boxes. In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D. In particular, we handle objects of various sizes by training an amodal RPN at two different scales and an ORN to regress 3D bounding boxes. Experiments show that our algorithm outperforms the state of the art by 13.8 in mAP and is 200× faster than the original Sliding Shapes. Source code and pre-trained models are available.
Figure 1: Results obtained from our single-image, monocular 3D object detection network MonoDIS on a KITTI3D test image with the corresponding bird's-eye view, showing its ability to estimate the size and orientation of objects at different scales.
We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical as detection is a necessary component for safety. Existing approaches are, however, computationally expensive due to the high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. The input representation, network architecture, and model optimization are specially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. On both datasets we show that the proposed detector notably surpasses other state-of-the-art methods in terms of Average Precision (AP), while still running at > 28 FPS.
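As a rough illustration of what representing a scene from the bird's eye view involves, the sketch below rasterizes a point cloud into a binary 2D occupancy grid. It is a simplification under stated assumptions: the ranges, the `resolution` parameter, and the single-channel output are illustrative, and height slices or intensity channels that a real BEV input might carry are omitted.

```python
from typing import List, Sequence, Tuple


def bev_occupancy(points: Sequence[Tuple[float, float, float]],
                  x_range: Tuple[float, float],
                  y_range: Tuple[float, float],
                  resolution: float) -> List[List[int]]:
    """Mark each ground-plane cell that contains at least one point."""
    n_rows = int(round((x_range[1] - x_range[0]) / resolution))
    n_cols = int(round((y_range[1] - y_range[0]) / resolution))
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:  # height is discarded in this single-channel sketch
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # drop points outside the region of interest
        row = min(int((x - x_range[0]) / resolution), n_rows - 1)
        col = min(int((y - y_range[0]) / resolution), n_cols - 1)
        grid[row][col] = 1
    return grid
```

A grid like this can be fed to an ordinary 2D convolutional network, which is the property that makes BEV inputs attractive for real-time detection.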
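Learning powerful representations in bird's-eye view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations become more complex, integrating multi-source information from different sensors and representing features in a unified view becomes vitally important. BEV perception has several advantages: representing surrounding scenes in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review the most recent work on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of BEV approaches from industry are described as well. Furthermore, we introduce a full suite of practical guidelines to improve the performance of BEV perception tasks, including camera, LiDAR and fusion inputs. Finally, we point out future research directions in this area. We hope this report will shed some light on the community and encourage more research effort on BEV perception. We keep an active repository to collect the most recent work and provide a toolbox for a bag of tricks at https://github.com/openperceptionx/bevperception-survey-recipe.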
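3D scene understanding from point clouds plays a vital role in various robotic applications. Unfortunately, current state-of-the-art methods use separate neural networks for different tasks such as object detection or room layout estimation. Such a scheme has two limitations: 1) storing and running several networks for different tasks is expensive for typical robotic platforms; 2) the intrinsic structure of the separate outputs is ignored and potentially violated. To this end, we propose the first transformer architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that estimate either layout keypoints or edges, we directly parameterize the room layout as a set of quads. As such, the proposed architecture is termed P(oint)Q(uad)-Transformer. Along with the novel quad representation, we propose a tailored physical constraint loss function that discourages object-layout interference. Quantitative and qualitative evaluations on the public benchmark ScanNet show that the proposed PQ-Transformer succeeds in jointly parsing 3D objects and layouts, running at a quasi-real-time (8.91 FPS) rate without efficiency-oriented optimization. Moreover, the new physical constraint loss can improve strong baselines, and the F1-score for the room layout is significantly improved from 37.9% to 57.9%.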
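Detecting 3D objects from point clouds is a practical yet challenging task that has attracted increasing attention recently. In this paper, we propose a Label-Guided auxiliary training method for 3D object detection (LG3D), which serves as an auxiliary network to enhance the feature learning of existing 3D object detectors. Specifically, we propose two novel modules: a Label-Annotation-Inducer that maps the annotations and point clouds in bounding boxes to task-specific representations, and a Label-Knowledge-Mapper that assists the original features in obtaining detection-critical representations. The proposed auxiliary network is discarded at inference and thus incurs no extra computational cost at test time. We conduct extensive experiments on both indoor and outdoor datasets to verify the effectiveness of our approach. For example, our proposed LG3D improves VoteNet by 2.5% and 3.1% mAP on the SUN RGB-D and ScanNetV2 datasets, respectively.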
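We propose an approach to instance segmentation of 3D point clouds based on dynamic convolution. This enables it to adapt, at inference time, to varying feature and object scales. Doing so avoids some pitfalls of bottom-up approaches, including a dependence on hyper-parameter tuning and heuristic post-processing pipelines to compensate for the inevitable variability in object sizes, even within a single scene. The representation capability of the network is greatly improved by gathering homogeneous points that share the same semantic category and cast close votes for the geometric centroids. Instances are then decoded via several simple convolution layers, whose parameters are generated conditioned on the input. The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance. A lightweight transformer, built on the bottleneck layer, allows the model to capture long-range dependencies with limited computational overhead. The result is a simple, efficient, and robust approach that yields strong performance on various datasets: ScanNetV2, S3DIS, and PartNet. The consistent improvements on both voxel- and point-based architectures imply the effectiveness of the proposed method. Code is available at: https://git.io/dyco3d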
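3D object detection from LiDAR data for autonomous driving has been making remarkable strides in recent years. Among state-of-the-art methods, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Unlike perspective views, BEV preserves rich spatial and distance information between objects; and while farther objects of the same type do not appear smaller in BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction using shared-weight convolutional neural networks. To address this challenge, we propose the Range-Aware Attention Network (RAANet), which extracts more powerful BEV features and produces superior 3D object detection. The range-aware attention (RAA) convolutions significantly improve feature extraction for near as well as far objects. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. It is worth noting that our proposed RAA convolution is lightweight and can be integrated into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset demonstrate that our proposed approach outperforms state-of-the-art methods for LiDAR-based 3D object detection, with a real-time inference speed of 16 Hz for the full version and 22 Hz for the lite version. The code is publicly available at an anonymous GitHub repository, HTTPS://github.com/Anonymous0522 / ange.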
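Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision, with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small in size, and approaches specialize in a few object categories and specific domains, e.g., urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark called Omni3D. Omni3D re-purposes and combines existing datasets, resulting in 234k images annotated with more than 3 million instances and 97 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on both the larger Omni3D and existing benchmarks. Finally, we demonstrate that Omni3D is a powerful dataset for 3D object recognition, showing that it improves single-dataset performance and can accelerate learning on new, smaller datasets via pre-training.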