Object detection in autonomous driving applications implies the detection and tracking of semantic objects that are typically native to urban driving environments, such as pedestrians and vehicles. One of the major challenges in state-of-the-art deep-learning-based object detection is false positives that occur with overconfident scores. This is highly undesirable in autonomous driving and other critical robotic-perception domains because of safety concerns. This paper proposes an approach to alleviate the problem of overconfident predictions by introducing a novel probabilistic layer into deep object detection networks at test time. The suggested approach avoids the traditional sigmoid or softmax prediction layer, which often produces overconfident predictions. It is demonstrated that the proposed technique reduces overconfidence in false positives without degrading the performance on true positives. The approach is validated on 2D KITTI pedestrian detection with YOLOv4 and SECOND (a LiDAR-based detector). The proposed method enables interpretable probabilistic predictions without requiring the network to be retrained, and is therefore very practical.
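A minimal sketch of the kind of test-time probabilistic scoring the abstract argues for; the specific construction here (per-class Gaussians fitted to held-out logits, replacing softmax at inference) is an illustrative assumption, not necessarily the paper's exact layer:

```python
import numpy as np

def fit_class_gaussians(logits, labels, num_classes):
    """Fit a diagonal Gaussian over logit vectors for each class on held-out data."""
    params = []
    for c in range(num_classes):
        x = logits[labels == c]
        params.append((x.mean(axis=0), x.var(axis=0) + 1e-6))
    return params

def probabilistic_scores(logit_vec, params):
    """Posterior-like scores from per-class likelihoods (uniform prior assumed)."""
    log_liks = []
    for mu, var in params:
        ll = -0.5 * np.sum((logit_vec - mu) ** 2 / var + np.log(2 * np.pi * var))
        log_liks.append(ll)
    log_liks = np.array(log_liks)
    log_liks -= log_liks.max()  # numerical stability
    p = np.exp(log_liks)
    return p / p.sum()
```

Unlike softmax, the unnormalized likelihoods also carry an absolute signal: a logit vector far from every class cluster scores low for all classes and can be thresholded to suppress overconfident false positives.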
3D object detection has been intensively investigated in recent years, especially for robot perception systems. However, existing 3D object detection operates under a closed-set condition, meaning the network can only output boxes of the classes it was trained on. Unfortunately, this closed-set condition is not robust enough for practical use, as it will wrongly identify unknown objects. In this paper, we therefore propose an open-set 3D object detector, which aims to (1) identify known objects, as in closed-set detection, and (2) identify unknown objects and provide their accurate bounding boxes. Specifically, we divide the open-set 3D object detection problem into two steps: (1) finding the regions that contain unknown objects with high probability, and (2) enclosing the points of these regions with proper bounding boxes. The first step builds on the finding that unknown objects are often classified as known objects with low confidence, and we show that the metric-learning-based sum of Euclidean distances is a better confidence score than the naive softmax probability for distinguishing unknown objects from known ones. On this basis, unsupervised clustering is used to refine the bounding boxes of unknown objects. The proposed method, combining metric learning and unsupervised clustering, is called the MLUC network. Our experiments show that our MLUC network achieves state-of-the-art performance and identifies both known and unknown objects as expected.
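The two steps translate naturally into a short sketch; the prototype-based distance score and the DBSCAN parameters are illustrative assumptions rather than the exact MLUC design:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def unknown_confidence(embedding, prototypes):
    """Sum of Euclidean distances to the known-class prototypes.
    Large sums indicate the object resembles no known class."""
    return sum(np.linalg.norm(embedding - p) for p in prototypes)

def boxes_for_unknowns(points, eps=0.5, min_samples=10):
    """Step 2: group candidate-region points with unsupervised clustering
    and enclose each cluster in an axis-aligned box."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    boxes = []
    for c in set(labels) - {-1}:  # -1 marks DBSCAN noise
        cluster = points[labels == c]
        boxes.append(np.concatenate([cluster.min(axis=0), cluster.max(axis=0)]))
    return boxes
```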
3D object detection from LiDAR data for autonomous driving has been making remarkable strides in recent years. Among the state-of-the-art methods, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Unlike the perspective view, BEV preserves rich spatial and distance information between objects; and while objects of the same type at farther distances do not appear smaller in BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction with a shared convolutional neural network. To address this challenge, we propose the range-aware attention network (RAANet), which extracts more powerful BEV features and generates superior 3D object detections. The range-aware attention (RAA) convolution significantly improves the feature extraction of distant objects. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. Notably, our proposed RAA convolution is lightweight and compatible for integration into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset demonstrate that our proposed method outperforms state-of-the-art LiDAR-based 3D object detection methods, with a real-time inference speed of 16 Hz, and 22 Hz for the lite version. The code is publicly available at the anonymous GitHub repository https://github.com/Anonymous0522/ange.
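A rough sketch of a range-conditioned attention convolution for BEV feature maps; the gating design below is an assumption for illustration, not the paper's exact RAA layer:

```python
import torch
import torch.nn as nn

class RangeAwareAttentionConv(nn.Module):
    """Convolution whose output is gated by attention computed from the
    features plus a per-pixel range channel (distance from the ego vehicle)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.att = nn.Sequential(nn.Conv2d(in_ch + 1, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        b, _, h, w = x.shape
        # per-pixel distance from the BEV grid center (the ego vehicle)
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        rng = torch.sqrt(xx**2 + yy**2).expand(b, 1, h, w)
        return self.conv(x) * self.att(torch.cat([x, rng], dim=1))
```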
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.
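A simplified sketch of the continuous-fusion idea, aggregating the image features of the k nearest LiDAR points for each BEV cell with an MLP over geometric offsets (a reduction of the layer described above, with illustrative shapes):

```python
import torch
import torch.nn as nn

class ContinuousFusion(nn.Module):
    """For each BEV cell, aggregate image features of nearby LiDAR points,
    weighted by an MLP over the continuous geometric offsets."""
    def __init__(self, img_ch, out_ch, k=4):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(img_ch + 3, out_ch), nn.ReLU(),
            nn.Linear(out_ch, out_ch))

    def forward(self, bev_xyz, pt_xyz, pt_img_feat):
        # bev_xyz: (M, 3) cell centers; pt_xyz: (N, 3) LiDAR points;
        # pt_img_feat: (N, C) image features gathered by projecting points.
        d = torch.cdist(bev_xyz, pt_xyz)             # (M, N) pairwise distances
        dist, idx = d.topk(self.k, largest=False)    # k nearest points per cell
        offsets = bev_xyz[:, None, :] - pt_xyz[idx]  # (M, k, 3) geometric offsets
        feats = torch.cat([pt_img_feat[idx], offsets], dim=-1)
        return self.mlp(feats).sum(dim=1)            # (M, out_ch) fused feature
```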
We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixelwise neural network predictions. The input representation, network architecture, and model optimization are especially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still running at >28 FPS.
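A sketch of the per-pixel decoding step; the output layout (cos θ, sin θ, offsets, log sizes) follows the paper's pixel-wise parameterization as we read it, so treat the exact ordering as an assumption:

```python
import numpy as np

def decode_pixor_pixel(pred, px, py, meters_per_pixel):
    """Decode one pixel's regression output into an oriented BEV box.
    pred layout assumed: (cos t, sin t, dx, dy, log w, log l)."""
    cos_t, sin_t, dx, dy, log_w, log_l = pred
    theta = np.arctan2(sin_t, cos_t)
    # offsets are relative to this pixel's location in metric space
    cx = px * meters_per_pixel + dx
    cy = py * meters_per_pixel + dy
    return cx, cy, np.exp(log_w), np.exp(log_l), theta
```

Boxes decoded from all pixels above the objectness threshold are then merged with non-maximum suppression.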
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. (Pipeline figure: depth is lifted to a point cloud; a 2D region from a CNN defines a 3D frustum; the 3D box is estimated by PointNet.)
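The "popping up" of RGB-D scans reduces, per 2D detection, to keeping the points whose image projection falls inside the 2D box; a minimal sketch:

```python
import numpy as np

def frustum_points(points, box2d, P):
    """Keep points whose image projection falls inside a 2D box.
    points: (N, 3) in the camera frame; P: (3, 4) projection matrix;
    box2d: (xmin, ymin, xmax, ymax)."""
    N = points.shape[0]
    hom = np.hstack([points, np.ones((N, 1))])  # homogeneous coordinates
    uvw = hom @ P.T
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    mask = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax) \
           & (points[:, 2] > 0)                 # keep points in front of camera
    return points[mask]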
The inherent ambiguity in ground-truth annotations of 3D bounding boxes, caused by occlusion, signal missing, or manual annotation errors, can confuse deep 3D object detectors during training and thus deteriorate detection accuracy. However, existing methods largely overlook such issues and treat labels as deterministic. In this paper, we propose GLENet, a generative label uncertainty estimation framework adapted from conditional variational auto-encoders, to model the one-to-many relationship between a typical 3D object and its potentially plausible ground-truth bounding boxes with latent variables. The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors to build probabilistic detectors and supervise the learning of localization uncertainty. Besides, we propose an uncertainty-aware quality estimator architecture in probabilistic detectors to guide the training of the IoU branch with predicted localization uncertainty. We incorporate the proposed methods into various popular 3D detectors and observe that their performance is significantly boosted to the current state of the art on the Waymo Open Dataset and the KITTI dataset.
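One way to realize the one-to-many label model described above is a small conditional VAE over box parameters; this sketch is a generic CVAE, with sizes and structure as illustrative assumptions rather than GLENet's architecture:

```python
import torch
import torch.nn as nn

class BoxCVAE(nn.Module):
    """Conditional VAE over 7-DoF boxes (x, y, z, w, l, h, yaw), conditioned
    on an object's point-cloud feature."""
    def __init__(self, feat_dim=128, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(feat_dim + 7, 2 * z_dim)  # -> (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(feat_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 7))
        self.z_dim = z_dim

    def forward(self, feat, box):
        mu, logvar = self.enc(torch.cat([feat, box], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(torch.cat([feat, z], -1))
        kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(-1).mean()
        return recon, kl

    def sample_boxes(self, feat, n=10):
        """Draw n plausible boxes for one object feature."""
        z = torch.randn(n, self.z_dim)
        return self.dec(torch.cat([feat.expand(n, -1), z], -1))
```

Sampling several boxes for one object and measuring their spread yields the label-uncertainty estimate that can supervise a detector's localization-uncertainty branch.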
Figure 1: Results obtained from our single image, monocular 3D object detection network MonoDIS on a KITTI3D test image with corresponding birds-eye view, showing its ability to estimate size and orientation of objects at different scales.
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A^2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our newly designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A^2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
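A much-simplified sketch of RoI-aware point cloud pooling: point features inside a proposal are averaged per voxel of a fixed grid, and empty voxels are kept as zeros so the proposal geometry is encoded (axis-aligned here for brevity; the real module handles yaw):

```python
import numpy as np

def roi_aware_pool(points, feats, box, grid=6):
    """Voxelize a proposal's interior into a fixed grid and average point
    features per voxel. box: (cx, cy, cz, l, w, h), axis-aligned."""
    c = np.array(box[:3]); size = np.array(box[3:6])
    local = (points - c) / size + 0.5            # normalize into [0, 1)
    inside = np.all((local >= 0) & (local < 1), axis=1)
    idx = (local[inside] * grid).astype(int)     # voxel indices per point
    out = np.zeros((grid, grid, grid, feats.shape[1]))
    cnt = np.zeros((grid, grid, grid, 1))
    for (i, j, k), f in zip(idx, feats[inside]):
        out[i, j, k] += f
        cnt[i, j, k] += 1
    return out / np.maximum(cnt, 1)              # empty voxels stay zero
```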
We propose DeepFusion, a modular multi-modal architecture that fuses LiDAR, camera, and radar in different combinations for 3D object detection. Specialized feature extractors take advantage of each modality and can be exchanged easily, making the approach simple and flexible. Extracted features are transformed into a bird's-eye view as a common representation for fusion. Spatial and semantic alignment is performed before the modalities are fused in feature space. Finally, a detection head exploits the rich multi-modal features to improve 3D detection performance. Experimental results for LiDAR-camera, LiDAR-camera-radar, and camera-radar fusion show the flexibility and effectiveness of our fusion approach. In the process, we study the largely unexplored task of faraway car detection up to 225 meters, showing the benefits of LiDAR-camera fusion. Furthermore, we investigate the required density of LiDAR points for 3D object detection and illustrate the implications with an example of robustness against adverse weather conditions. Moreover, ablation studies on our camera-radar fusion highlight the importance of accurate depth estimation.
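The modular BEV fusion can be pictured along these lines, assuming per-modality feature maps already spatially aligned in BEV; the 1x1-conv mixing is an illustrative stand-in for the paper's fusion block:

```python
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    """Concatenate aligned per-modality BEV feature maps and mix them.
    Any subset of modalities (LiDAR, camera, radar) can be plugged in."""
    def __init__(self, ch_per_modality, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(sum(ch_per_modality), out_ch, 1)

    def forward(self, bev_maps):  # list of (B, C_i, H, W) tensors
        return self.mix(torch.cat(bev_maps, dim=1))
```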
Reliable scene understanding is indispensable for modern autonomous systems. Current learning-based methods typically try to maximize their performance with respect to segmentation metrics that only consider the quality of the segmentation. However, for the safe operation of a system in the real world, it is also essential to consider the uncertainty of predictions. In this work, we introduce the novel task of uncertainty-aware panoptic segmentation, which aims to predict per-pixel semantic and instance segmentations, together with per-pixel uncertainty estimates. We define two novel metrics to facilitate its quantitative analysis, the uncertainty-aware Panoptic Quality (uPQ) and the panoptic Expected Calibration Error (pECE). We further propose the novel top-down Evidential Panoptic Segmentation Network (EvPSNet) to solve this task. Our architecture employs a simple yet effective probabilistic fusion module that leverages the predicted uncertainties. Moreover, we propose a new Lovász evidential loss function to optimize the IoU for segmentation using the probabilities provided by deep evidential learning. Additionally, we provide several strong baselines combining state-of-the-art panoptic segmentation networks with sampling-free uncertainty estimation techniques. Extensive evaluations show that our EvPSNet achieves a new state of the art for the standard Panoptic Quality (PQ), as well as for our uncertainty-aware panoptic metrics.
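The deep-evidential-learning probabilities such a loss operates on can be sketched generically (following the common softplus-evidence Dirichlet head, not necessarily EvPSNet's exact head):

```python
import torch
import torch.nn.functional as F

def evidential_probs(logits):
    """Non-negative evidence parameterizes a Dirichlet; expected class
    probabilities and a per-pixel uncertainty fall out in closed form.
    logits: (B, K, H, W)."""
    evidence = F.softplus(logits)
    alpha = evidence + 1.0              # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)  # Dirichlet strength
    probs = alpha / S                   # expected class probabilities
    K = logits.shape[1]
    uncertainty = K / S                 # vacuity: high where evidence is low
    return probs, uncertainty.squeeze(1)
```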
Estimating the uncertainty of a neural network plays a fundamental role in safety-critical settings. In perception for autonomous driving, measuring the uncertainty means providing additional calibration information to downstream tasks, such as path planning, that can use it for safe navigation. In this work, we propose a novel sampling-free uncertainty estimation method for object detection. We call it CertainNet, and it is the first to provide separate uncertainties for each output signal: objectness, class, location, and size. To achieve this, we propose an uncertainty-aware heatmap and exploit the neighboring bounding boxes provided by the detector at inference time. We evaluate the detection performance and the quality of the different uncertainty estimates separately, also on challenging out-of-domain samples: BDD100K and nuImages with models trained on KITTI. Additionally, we propose a new metric to evaluate location and size uncertainties. When transferring to unseen datasets, CertainNet generalizes substantially better than previous methods and an ensemble, while being real-time and providing high-quality and comprehensive uncertainty estimates.
We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod
Methods for incorporating the uncertainty of data-driven object detectors into object tracking algorithms are explored. Object tracking methods rely on a measurement error model, typically in the form of measurement noise, false positive rates, and missed detection rates. In general, these quantities may depend on object or measurement location. However, for detections generated from neural-network-processed camera inputs, these measurement error statistics are insufficient to represent the primary source of error, namely the dissimilarity between the run-time sensor input and the training data on which the detector was trained. To this end, we investigate incorporating data uncertainty into object tracking methods, for example to improve the ability to track objects, particularly those outside the training data. The proposed methods are validated on an object tracking benchmark as well as in experiments with a real autonomous aircraft.
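A generic sketch of the coupling the abstract describes: feed a detector's per-detection uncertainty into the tracker as the measurement noise of a standard Kalman update, so uncertain detections pull the track less:

```python
import numpy as np

def kalman_update(x, P, z, R, H):
    """Standard Kalman measurement update where the measurement noise R
    comes from the detector's per-detection uncertainty rather than a
    fixed error model."""
    y = z - H @ x                   # innovation
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

Detections whose uncertainty signals strong dissimilarity to the training data can additionally be down-weighted or gated out before the update.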
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
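The pseudo-label construction exploits a projective fact: scaling a camera-frame point by a depth ratio leaves its image projection unchanged. A minimal sketch, with illustrative depth offsets:

```python
import numpy as np

def obmo_pseudo_labels(center, deltas=(-0.6, -0.3, 0.3, 0.6)):
    """Shift a 3D box center along the camera viewing ray: scaling the
    center by (z + dz) / z keeps its image projection (and hence the 2D
    box position) unchanged while varying depth."""
    center = np.asarray(center, dtype=float)  # (x, y, z) in the camera frame
    z = center[2]
    return [center * (z + dz) / z for dz in deltas]
```

Since u = f_x * x / z + c_x (and likewise for v), multiplying all of (x, y, z) by the same factor cancels in the projection, which is why these shifted boxes share one 2D bounding box.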
We present a new two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD). The first stage is a bottom-up proposal generation network that uses raw point cloud as input to generate accurate proposals by seeding each point with a new spherical anchor. It achieves a high recall with less computation compared with prior works. Then, PointsPool is applied for generating proposal features by transforming their interior point features from sparse expression to compact representation, which saves even more computation time. In box prediction, which is the second stage, we implement a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance. We conduct experiments on KITTI dataset, and evaluate our method in terms of 3D object and Bird's Eye View (BEV) detection. Our method outperforms other state-of-the-art methods by a large margin, especially on the hard set, with inference speed more than 10 FPS.
LiDAR-based 3D object detectors have achieved impressive performances in many benchmarks; however, multi-sensor fusion-based techniques are promising to further improve the results. PointPainting, as a recently proposed framework, can add the semantic information from the 2D image into the 3D LiDAR point by the painting operation to boost the detection performance. However, due to the limited resolution of 2D feature maps, a severe boundary-blurring effect occurs during re-projection of 2D semantic segmentation into the 3D point clouds. To handle this limitation well, a general multimodal fusion framework MSF has been proposed to fuse the semantic information from both the 2D image and 3D points scene parsing results. Specifically, MSF includes three main modules. First, SOTA off-the-shelf 2D/3D semantic segmentation approaches are employed to generate the parsing results for 2D images and 3D point clouds. The 2D semantic information is further re-projected into the 3D point clouds with calibrated parameters. To handle the misalignment between the 2D and 3D parsing results, an AAF module is proposed to fuse them by learning an adaptive fusion score. Then the point cloud with the fused semantic label is sent to the following 3D object detectors. Furthermore, we propose a DFF module to aggregate deep features in different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing with different baselines. The experimental results show that the proposed fusion strategies can significantly improve the detection performance compared to the methods using only point clouds and the methods using only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets new SOTA results on the nuScenes testing benchmark.
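The painting step itself is compact: project each LiDAR point into the 2D segmentation map and append the class scores found there. A sketch under assumed array shapes:

```python
import numpy as np

def paint_points(points, seg_scores, P):
    """Append 2D semantic scores to LiDAR points by projecting each point
    into the segmentation map. points: (N, 3) in the camera frame;
    seg_scores: (H, W, C); P: (3, 4) projection matrix."""
    N = points.shape[0]
    uvw = np.hstack([points, np.ones((N, 1))]) @ P.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    H, W, C = seg_scores.shape
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted = np.zeros((N, C))
    painted[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, painted])  # (N, 3 + C) painted point cloud
```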
Autonomous driving datasets are often skewed and, in particular, lack training data for objects at far distances from the ego vehicle. The imbalance of the data leads to a performance drop as the distance of detected objects increases. In this paper, we propose pattern-aware ground truth sampling, a data augmentation technique that downsamples an object's point cloud based on the characteristics of LiDAR. Specifically, we mimic the natural divergence of object point patterns with depth to simulate farther distances. The network therefore sees more diverse training examples and can generalize to detecting farther objects more effectively. We evaluate existing data augmentation techniques that use point dropping or perturbation methods and find that our method outperforms all of them. In addition, we propose using equal-element AP bins to evaluate the performance of 3D object detectors across distances. We improve the performance of PV-RCNN on the car class on the KITTI validation split at distances greater than 25 m.
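A minimal sketch of simulating a farther range by downsampling an object's points; the quadratic falloff of returns with distance is an illustrative modeling assumption:

```python
import numpy as np

def simulate_distance(obj_points, d_orig, d_target, rng=None):
    """Downsample an object's points to mimic a farther range: LiDAR
    returns per object fall off roughly with the square of distance."""
    rng = np.random.default_rng() if rng is None else rng
    keep_ratio = min(1.0, (d_orig / d_target) ** 2)
    n_keep = max(1, int(len(obj_points) * keep_ratio))
    idx = rng.choice(len(obj_points), n_keep, replace=False)
    return obj_points[idx]
```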
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
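The core transformation is the classic pinhole back-projection of a depth map into a point cloud:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud using the pinhole
    model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    depth: (H, W) in meters; fx, fy, cx, cy: camera intrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) points
```

The resulting points can then be fed to any LiDAR-based detector, which is the representational swap the abstract argues for.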
Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and single-stage detector to produce the final output detections. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. We provide a code release for CaDDN which is made available here.
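The lifting step the abstract describes is essentially an outer product between image features and a per-pixel categorical depth distribution; a minimal sketch:

```python
import torch

def lift_to_frustum(image_feats, depth_logits):
    """Lift image features into a frustum grid by weighting each pixel's
    feature with its categorical depth distribution.
    image_feats: (B, C, H, W); depth_logits: (B, D, H, W) over depth bins."""
    depth_probs = depth_logits.softmax(dim=1)             # (B, D, H, W)
    # (B, C, 1, H, W) * (B, 1, D, H, W) -> (B, C, D, H, W) frustum features
    return image_feats.unsqueeze(2) * depth_probs.unsqueeze(1)
```

A sharp depth distribution places a pixel's feature in one depth interval, while an uncertain one smears it, which is what lets depth estimation stay differentiable end-to-end.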