3D object detection is vital because it captures objects' size, orientation, and position in the world, enabling real-world applications such as augmented reality (AR), self-driving cars, and robotics, where machines must perceive the world much as humans do. Monocular 3D object detection is the task of drawing a 3D bounding box around each object in a single 2D RGB image: a localization task performed without any extra information such as depth, other sensors, or multiple views. It is an important yet challenging problem; despite the significant progress in image-based 2D object detection, 3D understanding of real-world objects remains an open challenge that has not been explored extensively thus far. The following abstracts summarize the most closely related studies.
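To make the task's output concrete, the sketch below shows one common 7-degree-of-freedom parameterization of a 3D bounding box (center, dimensions, yaw) and how its eight corners are recovered. The class and field names are illustrative, not taken from any specific paper, and a z-up coordinate frame is assumed.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Box3D:
    """A common 7-DoF 3D box parameterization (names are illustrative)."""
    x: float; y: float; z: float      # center coordinates (m), z-up frame assumed
    l: float; w: float; h: float      # length, width, height (m)
    yaw: float                        # rotation around the vertical axis (rad)

    def corners(self) -> np.ndarray:
        """Return the 8 box corners as an (8, 3) array."""
        dx, dy, dz = self.l / 2, self.w / 2, self.h / 2
        # Axis-aligned corners around the origin, then rotate and translate.
        corners = np.array([[sx * dx, sy * dy, sz * dz]
                            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
        c, s = np.cos(self.yaw), np.sin(self.yaw)
        rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # yaw about z
        return corners @ rot.T + np.array([self.x, self.y, self.z])
```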
LiDAR-based sensing drives current autonomous vehicles. Despite rapid progress, current LiDAR sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one or two measurements. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into LiDAR-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points that augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard LiDAR-based 3D detector alongside regular LiDAR measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/
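As a rough sketch of the virtual-point idea described above (my own illustrative reconstruction, not the authors' released code): sample 2D pixels inside a detected instance, borrow each sampled pixel's depth from its nearest projected LiDAR point, and unproject it back to 3D.

```python
import numpy as np

def virtual_points(mask_pixels, lidar_uvz, K, n_samples=50, rng=None):
    """Generate dense 3D virtual points inside one 2D detection (illustrative).

    mask_pixels: (M, 2) pixel coords (u, v) inside the instance mask.
    lidar_uvz:   (N, 3) LiDAR points projected to the image as (u, v, depth).
    K:           (3, 3) camera intrinsics.
    """
    rng = rng or np.random.default_rng()
    samples = mask_pixels[rng.choice(len(mask_pixels), n_samples)]
    # Borrow depth from the nearest projected LiDAR point (brute force for clarity).
    d2 = ((samples[:, None, :] - lidar_uvz[None, :, :2]) ** 2).sum(-1)
    depth = lidar_uvz[d2.argmin(axis=1), 2]
    # Unproject sampled pixels with the borrowed depth: X = depth * K^-1 [u, v, 1]^T.
    uv1 = np.concatenate([samples, np.ones((n_samples, 1))], axis=1)
    return (np.linalg.inv(K) @ uv1.T).T * depth[:, None]
```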
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
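The core conversion the paper describes, from a per-pixel depth map to a pseudo-LiDAR point cloud, follows standard pinhole back-projection. A minimal sketch, assuming a simple 3x3 intrinsics matrix (not the authors' exact code):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project an (H, W) depth map into an (H*W, 3) point cloud.

    Each pixel (u, v) with depth z maps to z * K^-1 [u, v, 1]^T,
    mimicking the 3D signal a LiDAR would have produced.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    uv1 = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    return (np.linalg.inv(K) @ uv1.T).T * depth.reshape(-1, 1)
```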
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view object detection, while being real-time.
Learning powerful representations in bird's-eye view (BEV) for perception tasks is a trend attracting extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in the frontal or perspective view. As sensor configurations become increasingly complex, integrating multi-source information from different sensors and representing features in a unified view becomes essential. BEV perception inherits several advantages: representing the surrounding scene in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems of BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; (c) how to formulate a pipeline that incorporates features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review recent work on BEV perception and provide an in-depth analysis of different solutions. In addition, several system designs of BEV approaches in industry are described. Furthermore, we introduce a full suite of practical recipes for improving the performance of BEV perception tasks, covering camera, LiDAR, and fusion inputs. Finally, we point out future research directions in this field. We hope this report sheds light on the community and encourages more research on BEV perception. We keep an active repository to collect the latest work and provide a toolbox with a bag of tricks at https://github.com/openperceptionx/bevperception-survey-recipe.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. [Figure: pipeline from depth to point cloud, to a 2D region (from a CNN), to a 3D frustum, to a 3D box (from a PointNet).]
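To illustrate the frustum step in the pipeline just described (an illustrative sketch under pinhole assumptions, not the released code): a 2D box plus a near/far depth range defines a 3D search frustum by back-projecting the box's image corners at both depths.

```python
import numpy as np

def frustum_corners(box2d, K, z_near=1.0, z_far=80.0):
    """Extrude a 2D box (u1, v1, u2, v2) into the 8 corners of a 3D frustum.

    The corners are the box's four image corners back-projected
    at z_near and z_far along their camera rays.
    """
    u1, v1, u2, v2 = box2d
    pix = np.array([[u1, v1, 1], [u2, v1, 1], [u2, v2, 1], [u1, v2, 1]], float)
    rays = (np.linalg.inv(K) @ pix.T).T           # one ray per image corner
    return np.concatenate([rays * z_near, rays * z_far])  # (8, 3)
```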
We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical, as detection is a necessary component for safety. Existing approaches are, however, computationally expensive due to the high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixelwise neural network predictions. The input representation, network architecture, and model optimization are especially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. On both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still running at over 28 FPS.
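A minimal sketch of the kind of BEV input such pixelwise detectors consume, as a simplified reconstruction: a binary occupancy grid with vertical height slices. The grid extents and resolution below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def points_to_bev(points, x_range=(0, 70), y_range=(-40, 40),
                  z_range=(-2.5, 1.0), res=0.1, n_slices=35):
    """Rasterize an (N, 3) point cloud into an (H, W, n_slices) occupancy grid."""
    grid = np.zeros((int((y_range[1] - y_range[0]) / res),
                     int((x_range[1] - x_range[0]) / res), n_slices), np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x_range[0] <= x) & (x < x_range[1]) &
            (y_range[0] <= y) & (y < y_range[1]) &
            (z_range[0] <= z) & (z < z_range[1]))
    # Map each kept point to a BEV cell and a height slice, then mark it occupied.
    xi = ((x[keep] - x_range[0]) / res).astype(int)
    yi = ((y[keep] - y_range[0]) / res).astype(int)
    zi = ((z[keep] - z_range[0]) / (z_range[1] - z_range[0]) * n_slices).astype(int)
    grid[yi, xi, np.clip(zi, 0, n_slices - 1)] = 1.0
    return grid
```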
Monocular 3D object detection is a challenging task in the autonomous driving and computer vision communities. As a common practice, most previous works use manually annotated 3D box labels, where the annotation process is expensive. In this paper, we find that precisely and carefully annotated labels may be unnecessary in monocular 3D detection, which is an interesting and counterintuitive finding. Using rough labels that are randomly disturbed, the detector can achieve accuracy very close to that obtained with ground-truth labels. We delve into this underlying mechanism and then empirically find that, with respect to label accuracy, the 3D location part of the label matters more than the other parts. Motivated by the conclusion above and considering the precision of LiDAR 3D measurements, we propose a simple and effective framework, dubbed LiDAR point cloud guided monocular 3D object detection (LPCG). This framework is able to reduce annotation costs or considerably boost detection accuracy without introducing extra annotation costs. Specifically, it generates pseudo labels from unlabeled LiDAR point clouds. Thanks to accurate LiDAR 3D measurements, such pseudo labels can replace manually annotated labels in the training of monocular 3D detectors, since their 3D location information is precise. LPCG can be applied to any monocular 3D detector to fully use the massive amounts of unlabeled data in autonomous driving systems. As a result, on the KITTI benchmark, we achieve clearly improved results in both monocular 3D and BEV (bird's-eye-view) detection. On the Waymo benchmark, our method using 10% of the labeled data achieves accuracy comparable to the baseline detector using 100% of the labeled data. The code is released at https://github.com/spengliang/lpcg.
Recently, over-height vehicle strikes have occurred frequently, causing great economic cost and serious safety problems. Hence, an alert system that can accurately discover possible height-limiting devices in advance needs to be employed in modern large or medium-sized vehicles, such as touring cars. Detecting height-limiting devices and estimating their heights is the key to a successful height-limit alert system. Though some works research height limit estimation, existing methods are either too computationally expensive or not accurate enough. In this paper, we propose a novel stereo-based pipeline named SHLE for height limit estimation. Our SHLE pipeline consists of two stages. In stage 1, a novel device detection and tracking scheme is introduced, which accurately locates the height-limit devices in the left or right image. Then, in stage 2, the depth is temporally measured, extracted, and filtered to calculate the height of the height-limit device. To benchmark the height limit estimation task, we build a large-scale dataset named "Disparity Height", where stereo images, pre-computed disparities, and ground-truth height limit annotations are provided. We conducted extensive experiments on "Disparity Height" and the results show that SHLE achieves an average error below 10 cm even when the car is 70 m away from the devices. Our method also outperforms all compared baselines and achieves state-of-the-art performance. Code is available at https://github.com/Yang-Kaixing/SHLE.
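The underlying stereo geometry can be sketched briefly: disparity gives depth via Z = f·B/d, and the device's height over the road follows from its image row. The sketch below is a simplified reconstruction of that geometry under a pinhole, level-camera assumption, not SHLE's actual pipeline; all parameter names are illustrative.

```python
def device_height(disparity, v_pixel, f, baseline, cy, cam_height):
    """Estimate a height-limit device's height above the road (simplified).

    depth:                Z = f * baseline / disparity
    height above camera:  Y = (cy - v_pixel) * Z / f   (image v grows downward)
    device height over road = cam_height + Y
    """
    z = f * baseline / disparity
    y_above_camera = (cy - v_pixel) * z / f
    return cam_height + y_above_camera
```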
Visual perception plays an important role in autonomous driving. One of the primary tasks is object detection and identification. Since the vision sensor is rich in color and texture information, it can quickly and accurately identify various road information. The commonly used technique is based on extracting and calculating various features of the image. Recently developed deep learning-based methods have better reliability and processing speed and a greater advantage in recognizing complex elements. For depth estimation, vision sensors are also used for ranging due to their small size and low cost. A monocular camera uses image data from a single viewpoint as input to estimate object depth. In contrast, stereo vision is based on parallax and matching feature points between different views, and the application of deep learning further improves its accuracy. In addition, Simultaneous Localization and Mapping (SLAM) can establish a model of the road environment, helping the vehicle perceive the surrounding environment and complete its tasks. In this paper, we introduce and compare various methods of object detection and identification, then explain the development of depth estimation and compare various methods based on monocular, stereo, and RGB-D sensors, next review and compare various methods of SLAM, and finally summarize the current problems and present the future development trends of vision technologies.
3D object detection from LiDAR point clouds is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A² net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage, for the first time, fully utilizes free-of-charge part supervision derived from 3D ground-truth boxes to simultaneously predict high-quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our newly designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A² net outperforms all existing 3D detection methods and achieves new state-of-the-art on the KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
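The "free-of-charge part supervision" mentioned above can be made concrete under one common reading: a point's intra-object part location is its position inside the ground-truth box, expressed in the box's canonical frame and normalized to [0, 1] per axis. A sketch under that assumption, not the authors' code:

```python
import numpy as np

def part_locations(points, center, dims, yaw):
    """Relative (0..1)^3 part location of each point inside a 3D box.

    points: (N, 3); center: (3,); dims: (l, w, h); yaw about the z axis.
    Assumes the box x-axis is length, y is width, z is height.
    """
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    local = (points - center) @ rot.T        # into the box's canonical frame
    return local / np.asarray(dims) + 0.5    # box spans [-dim/2, dim/2] per axis
```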
This paper aims at high-accuracy 3D object detection in autonomous driving scenarios. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point clouds and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of the 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod
Vision-centric BEV perception has recently received attention from both industry and academia due to its inherent merits, including presenting a natural representation of the world and being fusion-friendly. With the rapid development of deep learning, numerous methods have been proposed to address vision-centric BEV perception. However, there has been no recent survey of this novel and constantly evolving research field. To stimulate its future research, this paper presents a comprehensive survey of vision-centric BEV perception and its extensions. It collects and organizes recent knowledge and gives a systematic review and summary of commonly used algorithms. It also provides in-depth analyses and comparative results on several BEV perception tasks, facilitating the comparison of future works and inspiring future research directions. In addition, empirical implementation details are discussed and shown to be beneficial to the development of related algorithms.
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
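The label-shifting idea above can be sketched simply: slide the 3D center along the camera ray through the object (the viewing frustum), so the projected 2D location stays fixed while depth varies. This reconstruction is illustrative, not the released OBMO code, and the offset values are arbitrary examples.

```python
import numpy as np

def obmo_pseudo_labels(center, depth_offsets=(-1.0, -0.5, 0.5, 1.0)):
    """Shift a 3D box center along its viewing ray to form pseudo labels.

    center: (3,) object center in camera coordinates (z = depth).
    Scaling the whole center by (z + dz) / z keeps it on the same camera
    ray, so the projected 2D location is unchanged while depth shifts.
    """
    z = center[2]
    return [center * ((z + dz) / z) for dz in depth_offsets]
```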
Leveraging multi-modal fusion, especially between cameras and LiDAR, has become essential for building accurate and robust 3D object detection systems for autonomous vehicles. Until recently, point-decorating approaches, in which points in the LiDAR point cloud are augmented with camera features, have been the dominant approach in the field. However, these methods fail to utilize the higher-resolution images from cameras. Recent works projecting camera features to the bird's-eye view (BEV) fusion space have also been proposed; however, they require projecting millions of pixels, most of which contain only background information. In this work, we propose a novel approach, Center Feature Fusion (CFF), in which we leverage center-based detection networks in both the camera and LiDAR streams to identify relevant object locations. We then use the center-based detections to identify the locations of pixel features relevant to object locations, a small fraction of the total number in the image. These are then projected and fused in the BEV frame. On the nuScenes dataset, we outperform a LiDAR-only baseline by 4.9% mAP while fusing up to 100x fewer features than other fusion methods.
It is well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantically rich stereo images would benefit 3D object detection. However, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent approaches typically project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolutions of the point cloud and the RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at "virtual" points. In particular, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we also investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution to 3D object detectors so far. We have conducted extensive experiments on the KITTI dataset and observed good performance compared to state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021. The network design also takes computational efficiency into consideration: we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.
Monocular 3D object detection is a fundamental but very important task for many applications, including autonomous driving, robotic grasping, and augmented reality. Existing leading methods tend to first estimate the depth of the input image and detect 3D objects based on a point cloud. This routine suffers from the inherent gap between depth estimation and object detection; furthermore, the accumulation of prediction errors also affects performance. In this paper, a novel method named MonoPCN is proposed. The insight behind MonoPCN is that we propose to simulate the feature learning behavior of a point-cloud-based detector during training; hence, during inference, the learned features and predictions will resemble those of a point-cloud-based detector. To achieve this, we propose a scene-level simulation module, an RoI-level simulation module, and a response-level simulation module, which are progressively applied across the detector's full feature learning and prediction pipeline. We apply our method to the well-known M3D-RPN and CaDDN detectors and conduct extensive experiments on the KITTI and Waymo Open datasets. Results show that our method consistently improves the performance of different monocular detectors by different margins without changing their network architectures. Our method finally achieves state-of-the-art performance.
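One common way to realize the kind of "simulation" described above is a feature-imitation loss that pulls the monocular detector's intermediate features toward those of a frozen point-cloud-based teacher. The sketch below is a generic reconstruction under that assumption, not the paper's exact modules.

```python
import numpy as np

def imitation_loss(student_feat, teacher_feat, fg_mask=None):
    """Mean-squared feature imitation loss between student and teacher maps.

    student_feat, teacher_feat: (C, H, W) spatially aligned feature maps.
    fg_mask: optional (H, W) boolean mask restricting the loss to foreground
    regions (e.g., inside projected ground-truth boxes).
    """
    diff = (student_feat - teacher_feat) ** 2
    if fg_mask is not None:
        diff = diff[:, fg_mask]               # keep only foreground locations
    return diff.mean()
```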
Monocular 3D object detection is very challenging in autonomous driving due to the lack of depth information. This paper proposes a monocular 3D object detection algorithm based on multi-scale depth stratification (MDS-Net), which detects 3D objects with per-pixel predictions. In the proposed MDS-Net, a novel depth-based stratification structure is developed to improve the network's depth prediction ability by establishing a mathematical model between the depth and image size of objects. A new angle loss function is then developed to further improve the accuracy of angle prediction and increase the convergence speed of training. An optimized soft-NMS is finally applied in the post-processing stage to adjust the confidence of candidate boxes. Experiments on the KITTI benchmark show that MDS-Net outperforms existing monocular 3D detection methods on 3D detection and BEV detection tasks while fulfilling real-time requirements.
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable, and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large-scale 3D object detection benchmark shows significant improvements over the state of the art.
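A rough sketch of the continuous-fusion idea, as my own simplified reconstruction: for each BEV cell, gather its k nearest LiDAR points, look up the image features at their projections, and fuse the features together with geometric offsets through a small MLP. The projection and feature-lookup callables and the one-layer MLP below are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def continuous_fusion_cell(bev_xy, lidar_pts, img_feats, proj, W, k=3):
    """Fuse image features into one BEV cell via its k nearest LiDAR points.

    bev_xy:    (2,) BEV cell center in ground-plane coordinates.
    lidar_pts: (N, 3) LiDAR points.
    img_feats: function (u, v) -> (C,) image feature at a pixel (assumed).
    proj:      function (point3d) -> (u, v) camera projection (assumed).
    W:         (C + 3, C_out) weights of a one-layer MLP standing in for
               the learned continuous convolution.
    """
    d2 = ((lidar_pts[:, :2] - bev_xy) ** 2).sum(axis=1)
    fused = []
    for i in np.argsort(d2)[:k]:                      # k nearest LiDAR points
        u, v = proj(lidar_pts[i])
        offset = lidar_pts[i] - np.array([*bev_xy, 0.0])  # geometric offset
        fused.append(np.concatenate([img_feats(u, v), offset]))
    return np.maximum(np.stack(fused) @ W, 0).sum(axis=0)  # ReLU MLP + sum
```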