3D object detection is an essential task in autonomous driving. Recent techniques achieve highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state of the art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state of the art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
translated by Google Translate
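The depth-map-to-pseudo-LiDAR conversion described in the abstract above can be illustrated with a pinhole back-projection. This is a minimal sketch, not the authors' released code: the function name, the `max_depth` cutoff, and the toy intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions made for illustration, and the points stay in camera coordinates rather than being transformed into a LiDAR frame.

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, max_depth=80.0):
    """Back-project a dense depth map into 3D points (camera coordinates).

    depth is a list of rows; depth[v][u] is the estimated depth in metres
    at pixel column u, row v. Pixels with no valid depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0 or z > max_depth:
                continue  # invalid or implausibly distant estimate
            x = (u - cx) * z / fx  # pinhole model: u = fx * x / z + cx
            y = (v - cy) * z / fy  # pinhole model: v = fy * y / z + cy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with hypothetical intrinsics.
depth_map = [[10.0, 20.0],
             [0.0, 40.0]]
cloud = depth_to_pseudo_lidar(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The resulting list of `(x, y, z)` tuples plays the role of a LiDAR sweep, which is what allows a point-cloud detector to consume image-derived depth.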
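The introduction of the pseudo-LiDAR representation has significantly narrowed the gap between vision-based and LiDAR-based 3D object detection. However, current research focuses exclusively on pushing the accuracy of pseudo-LiDAR by exploiting complex and time-consuming neural networks. The deeper characteristics of the pseudo-LiDAR representation have seldom been explored for opportunities to improve it. In this paper, we dive into the pseudo-LiDAR representation and argue that the performance of 3D object detection does not depend entirely on high-precision stereo depth estimation. We demonstrate that, even with unreliable depth estimates, comparable 3D object detection accuracy can be reached through proper data processing and refinement. With this finding, we further show the possibility of using fast but inaccurate stereo matching algorithms in a pseudo-LiDAR system to achieve low-latency responsiveness. In our experiments, we develop a system with a less powerful stereo matching predictor and adopt the proposed refinement schemes to improve accuracy. Evaluation on the KITTI benchmark shows that the proposed system achieves accuracy competitive with state-of-the-art methods using only 23 ms of computation, indicating that it is a suitable candidate for deployment in real car-hold applications.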
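Most autonomous vehicles are equipped with LiDAR sensors and stereo cameras. The former are very accurate but produce sparse data, whereas the latter are dense and rich in texture and color information, but robust 3D representations are difficult to extract from them. In this paper, we propose a novel data fusion algorithm that combines accurate point clouds with dense but less accurate point clouds obtained from stereo pairs. We develop a framework that integrates this algorithm into various 3D object detection methods. Our framework starts with 2D detections in both RGB images, computes the frustums and their intersection, creates pseudo-LiDAR data from the stereo images, and fills in the parts of the intersection region where LiDAR data is lacking with dense pseudo-LiDAR points. We train multiple 3D object detection methods and show that our fusion strategy consistently improves detector performance.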
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
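The idea of shifting a 3D box along the viewing frustum can be sketched as scaling the box centre along the ray from the camera through that centre, so that its image projection is unchanged while its depth varies. This is an illustrative sketch under simplifying assumptions, not the OBMO implementation: only the box centre is moved, box size and orientation are left untouched, the two label scoring strategies are not modelled, and all names and numbers are hypothetical.

```python
def shift_along_ray(center, dz):
    """Move a 3D box centre (x, y, z) to depth z + dz along its viewing ray."""
    x, y, z = center
    z_new = z + dz
    scale = z_new / z  # points on the same ray share x/z and y/z
    return (x * scale, y * scale, z_new)


def project(center, fx, fy, cx, cy):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x, y, z = center
    return (fx * x / z + cx, fy * y / z + cy)


def make_pseudo_labels(center, offsets):
    """Generate shifted centres, skipping offsets that would place the box behind the camera."""
    return [shift_along_ray(center, dz) for dz in offsets if center[2] + dz > 0]


original = (2.0, 1.0, 20.0)
pseudo = make_pseudo_labels(original, [-2.0, 2.0, -25.0])
```

Each shifted centre projects to the same pixel as the original, which is the depth ambiguity the abstract describes: the image alone cannot tell these candidates apart.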
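Camera-based 3D object detectors are welcome because they can be deployed more widely and cost less than LiDAR sensors. We first revisit the prior stereo detector DSGN and the way it constructs a stereo volume to represent both 3D geometry and semantics. We polish the stereo modeling and propose an advanced version, DSGN++, which aims to enhance the effective information flow throughout the 2D-to-3D pipeline in three main aspects. First, to effectively lift 2D information to the stereo volume, we propose depth-wise plane sweeping (DPS), which allows denser connections and extracts depth-guided features. Second, to grasp differently spaced features, we present a novel stereo volume, the Dual-view Stereo Volume (DSV), which integrates front-view and top-view features and reconstructs sub-voxel depth in the camera frustum. Third, as the foreground region becomes less dominant in 3D space, we propose a multi-modal data editing strategy, Stereo-LiDAR Copy-Paste, which ensures cross-modal alignment and improves data efficiency. Without bells and whistles, extensive experiments in various modality setups on the popular KITTI benchmark show that our method consistently outperforms other camera-based 3D detectors for all categories. Code is available at https://github.com/chenyilun95/dsgn2.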
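Perceiving 3D objects from monocular inputs is crucial for robotic systems, given its economy compared to multi-sensor settings. It is notably difficult because a single image cannot provide any clues for predicting absolute depth values. Motivated by binocular methods for 3D object detection, we take advantage of the strong geometric structure provided by camera ego-motion for accurate object depth estimation and detection. We first make a theoretical analysis of this general two-view case and notice two challenges: 1) cumulative errors from multiple estimations, which make direct prediction intractable; 2) inherent dilemmas caused by static cameras and matching ambiguity. Accordingly, we establish stereo correspondence with a geometry-aware cost volume as the alternative for depth estimation, and further compensate it with monocular understanding to address the second problem. Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to 3D space and detects 3D objects there. We also present a pose-free DfM to make it usable when the camera pose is unavailable. Our framework outperforms state-of-the-art methods on the KITTI benchmark. Detailed quantitative and qualitative analyses also validate our theoretical conclusions. The code will be released at https://github.com/tai-wang/depth-from-motion.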
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. * Majority of the work done as an intern at Nuro, Inc.
[Figure labels: depth to point cloud; 2D region (from CNN) to 3D frustum; 3D box (from PointNet)]
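The "2D region to 3D frustum" step can be sketched as a filter: project every 3D point into the image and keep the points that fall inside a 2D detection box. This is a simplified illustration, not the paper's pipeline: it assumes the points are already in camera coordinates, uses a plain pinhole model, and all names and values are hypothetical.

```python
def points_in_frustum(points, box2d, fx, fy, cx, cy):
    """Keep the points whose image projection lies inside a 2D box.

    box2d is (u_min, v_min, u_max, v_max) in pixels. Points behind the
    camera (z <= 0) cannot project into the image and are dropped.
    """
    u_min, v_min, u_max, v_max = box2d
    kept = []
    for x, y, z in points:
        if z <= 0.0:
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


scene = [(0.0, 0.0, 10.0),   # projects to the box centre
         (5.0, 0.0, 10.0),   # projects far to the right of the box
         (0.0, 0.0, -5.0),   # behind the camera
         (0.5, 0.2, 20.0)]   # farther away but still inside the box
frustum_points = points_in_frustum(scene, (40.0, 40.0, 60.0, 60.0),
                                   fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

The kept points span all depths along the box's viewing rays, which is why a subsequent 3D network is still needed to separate the object from foreground and background clutter.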
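Attempts have been made to detect 3D objects by fusing stereo camera images with LiDAR sensor data, or by using LiDAR for pre-training and only monocular images for testing, but there have been fewer attempts to use only monocular image sequences because of their low accuracy. In addition, when depth is predicted from monocular images alone, only scale-inconsistent depth can be predicted, which is why researchers are reluctant to use monocular images on their own. We therefore propose a method that predicts absolute depth and detects 3D objects using only monocular image sequences, by enabling end-to-end learning of the detection network and the depth prediction network. As a result, the proposed method surpasses other existing methods in performance on the KITTI 3D dataset. Even when monocular images and 3D LiDAR are used together during training to improve performance, our method exhibits the best performance compared with other methods using the same input. Furthermore, end-to-end learning not only improves depth prediction performance but also enables absolute depth prediction, because our network exploits the fact that the size of a 3D object such as a car is approximately known.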
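Monocular 3D object detection is a fundamental but very important task for many applications, including autonomous driving, robotic grasping, and augmented reality. Existing leading methods tend to first estimate the depth of the input image and then detect 3D objects based on the resulting point cloud. This routine suffers from the inherent gap between depth estimation and object detection. In addition, the accumulation of prediction errors also affects performance. In this paper, a novel method named MonopCN is proposed. The insight behind MonopCN is that we propose to simulate the feature learning behavior of a point-cloud-based detector during training. Hence, during inference, the learned features and predictions will be similar to those of the point-cloud-based detector. To achieve this, we propose a scene-level simulation module, an RoI-level simulation module, and a response-level simulation module, which are progressively applied across the detector's full feature learning and prediction pipeline. We apply our method to the well-known M3D-RPN and CaDDN detectors and conduct extensive experiments on the KITTI and Waymo Open datasets. The results show that our method consistently improves the performance of different monocular detectors by different margins, without changing the network architecture. Our method finally achieves state-of-the-art performance.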
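Monocular 3D object detection is very challenging in autonomous driving due to the lack of depth information. This paper proposes a one-stage monocular 3D object detection algorithm based on multi-scale depth stratification, which uses an anchor-free method to detect 3D objects through per-pixel prediction. In the proposed MDS-Net, a novel depth-based stratification structure is developed to improve the network's depth prediction ability by establishing a mathematical model between the depth of an object and its size in the image. A new angle loss function is then developed to further improve the accuracy of angle prediction and increase the convergence speed of training. Finally, an optimized Soft-NMS is applied in the post-processing stage to adjust the confidence of candidate boxes. Experiments on the KITTI benchmark show that MDS-Net outperforms existing monocular 3D detection methods in the 3D detection and BEV detection tasks while meeting real-time requirements.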
In this paper, we propose a novel 3D object detector that can exploit both LIDAR and cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features and continuous geometric information. This enables us to design a novel, reliable, and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI and a large-scale 3D object detection benchmark shows significant improvements over the state of the art.
Camera and lidar are important sensor modalities for robotics in general and self-driving cars in particular. The sensors provide complementary information, offering an opportunity for tight sensor fusion. Surprisingly, lidar-only methods outperform fusion methods on the main benchmark datasets, suggesting a gap in the literature. In this work, we propose PointPainting: a sequential fusion method to fill this gap. PointPainting works by projecting lidar points into the output of an image-only semantic segmentation network and appending the class scores to each point. The appended (painted) point cloud can then be fed to any lidar-only method. Experiments show large improvements on three different state-of-the-art methods, PointRCNN, VoxelNet, and PointPillars, on the KITTI and nuScenes datasets. The painted version of PointRCNN represents a new state of the art on the KITTI leaderboard for the bird's-eye view detection task. In ablation, we study how the effects of Painting depend on the quality and format of the semantic segmentation output, and demonstrate how latency can be minimized through pipelining.
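The painting step (project each point into the segmentation output and append the class scores) can be sketched as follows. This is a minimal illustration rather than the published implementation: it assumes the points have already been transformed into the camera frame, uses nearest-pixel lookup with a pinhole model, and simply drops points that project outside the image, which is one possible choice. All names and values are hypothetical.

```python
def paint_points(points, seg_scores, fx, fy, cx, cy):
    """Append per-class segmentation scores to each 3D point.

    seg_scores[v][u] is the list of class scores at pixel (u, v).
    Returns tuples (x, y, z, score_0, ..., score_{C-1}).
    """
    height = len(seg_scores)
    width = len(seg_scores[0])
    painted = []
    for x, y, z in points:
        if z <= 0.0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            painted.append((x, y, z, *seg_scores[v][u]))
    return painted


# Toy 2x2 segmentation output with two classes per pixel.
scores = [[[0.9, 0.1], [0.2, 0.8]],
          [[0.5, 0.5], [0.3, 0.7]]]
lidar = [(10.0, 0.0, 10.0), (0.0, 10.0, 10.0), (30.0, 0.0, 10.0)]
painted_cloud = paint_points(lidar, scores, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Because the output is still a list of points, only with extra channels, it can be passed to a lidar-only detector whose input layer accepts the additional per-point features.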
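Monocular 3D object detection is a challenging task in the autonomous driving and computer vision communities. As a common practice, most previous works use manually annotated 3D box labels, for which the annotation process is expensive. In this paper, we find that precisely and carefully annotated labels may be unnecessary in monocular 3D detection, which is an interesting and counterintuitive finding. Using rough labels that are randomly disturbed, the detector can achieve accuracy very close to that obtained with ground-truth labels. We delve into the underlying mechanism and find empirically that, with respect to label accuracy, the 3D location part of the label matters more than its other parts. Motivated by these conclusions and considering the precise 3D measurements of LiDAR, we propose a simple and effective framework, dubbed LiDAR point cloud guided monocular 3D object detection (LPCG). This framework is capable of either reducing annotation costs or considerably boosting detection accuracy without introducing extra annotation costs. Specifically, it generates pseudo labels from unlabeled LiDAR point clouds. Thanks to the accurate LiDAR measurements in 3D space, such pseudo labels can replace manually annotated labels in the training of monocular 3D detectors, since their 3D location information is precise. LPCG can be applied to any monocular 3D detector to make full use of the massive unlabeled data in an autonomous driving system. As a result, on the KITTI benchmark, we take first place in both monocular 3D and BEV (bird's-eye-view) detection by a significant margin. On the Waymo benchmark, our method using 10% labeled data achieves accuracy comparable to the baseline detector using 100% labeled data. The code is released at https://github.com/spengliang/lpcg.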
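Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensor are easily visible, but far-away or small objects comprise only one or two measurements. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points that augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard lidar-based 3D detector along with regular lidar measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/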
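Monocular 3D object detection is a critical yet challenging task for autonomous driving, due to the lack of the accurate depth information captured by LiDAR sensors. In this paper, we propose a stereo-guided monocular 3D object detection network, termed SGM3D, which leverages robust 3D features extracted from stereo images to enhance the features learned from the monocular image. We innovatively investigate a multi-granularity domain adaptation module (MG-DA) to exploit the network's ability to generate stereo-mimicking features based only on monocular cues. Coarse BEV feature-level as well as fine anchor-level domain adaptation are leveraged to guide the monocular branch. We present an IoU matching-based alignment module (IoU-MA) for object-level domain adaptation between the stereo and monocular domains to alleviate the mismatches from previous stages. We conduct extensive experiments on the most challenging KITTI and Lyft datasets and achieve new state-of-the-art performance. Furthermore, our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.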
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D, and bird's-eye-view object detection, while being real-time. * Equal contribution. † Work done as part of the Uber AI Residency program.
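It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantically rich stereo images benefits 3D object detection. Nevertheless, it is not trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolutions of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at "virtual" points. In particular, with their density lying between that of the 3D points and that of the 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we also investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution to 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset and have observed good performance compared to state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking first since May 21st, 2021. The network design also takes computational efficiency into consideration: we can achieve 15 FPS on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.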
3D object detection is vital because it lets us capture objects' sizes, orientations, and positions in the world. As a result, we can use 3D detection in real-world applications such as augmented reality (AR), self-driving cars, and robotics, which perceive the world the same way we do as humans. Monocular 3D object detection is the task of drawing 3D bounding boxes around objects in a single 2D RGB image. It is a localization task, but without any extra information such as depth, other sensors, or multiple images. Monocular 3D object detection is an important yet challenging task. Beyond the significant progress in image-based 2D object detection, 3D understanding of real-world objects is an open challenge that has not been explored extensively thus far. In addition to the most closely related studies.
Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with a simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and a single-stage detector to produce the final output detections. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. We provide a code release for CaDDN which is made available here.
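Projecting a pixel's feature to depth intervals with a categorical distribution can be sketched for a single pixel: a softmax over depth-bin logits gives the distribution, and the feature vector is weighted by each bin's probability. This is a toy illustration under stated assumptions, not the CaDDN code: the function names are hypothetical, the bin count and feature size are arbitrary, and a real network would do this for every pixel with tensor operations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel_feature(feature, depth_logits):
    """Spread one pixel's feature vector over its depth bins.

    Returns a list with one weighted copy of the feature per depth bin,
    i.e. the outer product of the depth distribution and the feature.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# Two equally likely depth bins split the feature evenly between them.
lifted = lift_pixel_feature([2.0, 4.0], [0.0, 0.0])
```

When the distribution is sharply peaked, almost all of the feature lands in a single depth bin; when it is uncertain, the feature is spread across several bins instead of being committed to one possibly wrong depth.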
This paper aims at high-accuracy 3D object detection in autonomous driving scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's-eye-view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state of the art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state of the art on the hard data among the LIDAR-based methods.
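A bird's-eye-view encoding of a point cloud can be sketched by discretizing the ground plane into cells and recording simple per-cell statistics. The abstract does not specify the channels, so the choice below (point count and maximum height per cell) is an assumption for illustration, as are the function name, grid ranges, and cell size.

```python
def bev_encode(points, x_range, y_range, cell):
    """Rasterize 3D points into bird's-eye-view count and max-height grids.

    Returns (count, max_height), each indexed as grid[i][j] where i runs
    along x and j along y. Empty cells have count 0 and height None.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    nx = int(round((x_max - x_min) / cell))
    ny = int(round((y_max - y_min) / cell))
    count = [[0] * ny for _ in range(nx)]
    max_height = [[None] * ny for _ in range(nx)]
    for x, y, z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # outside the encoded area
        # Clamp to guard against floating-point rounding at the upper edge.
        i = min(int((x - x_min) / cell), nx - 1)
        j = min(int((y - y_min) / cell), ny - 1)
        count[i][j] += 1
        if max_height[i][j] is None or z > max_height[i][j]:
            max_height[i][j] = z
    return count, max_height


sweep = [(0.5, 0.5, 1.0), (0.7, 0.2, 2.0), (1.5, 0.5, -0.5), (5.0, 0.0, 1.0)]
counts, heights = bev_encode(sweep, x_range=(0.0, 2.0), y_range=(0.0, 1.0), cell=1.0)
```

The resulting grids are ordinary 2D images, so a standard 2D convolutional network can generate proposals from them while objects keep their physical size regardless of distance.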