Monocular 3D object detection aims to localize 3D bounding boxes in an input single 2D image. It is a highly challenging problem and remains open, especially when no extra information (e.g., depth, LiDAR and/or multi-frames) can be leveraged in training and/or inference. This paper proposes a simple yet effective formulation for monocular 3D object detection without exploiting any extra information. It presents the MonoCon method, which learns monocular contexts in training to help monocular 3D object detection. The key idea is that, with the annotated 3D bounding boxes of objects in an image, there is a rich set of well-posed projected 2D supervision signals available in training, such as the projected corner keypoints and their associated offset vectors with respect to the center of the 2D bounding box, which should be exploited as auxiliary tasks in training. The proposed MonoCon is motivated by the Cramér–Wold theorem in measure theory at a high level. In implementation, it utilizes a very simple end-to-end design to justify the effectiveness of learning auxiliary monocular contexts, which consists of three components: a deep neural network (DNN) based feature backbone, a number of regression head branches for learning the essential parameters used in 3D bounding box prediction, and a number of regression head branches for learning auxiliary contexts. After training, the auxiliary context regression branches are discarded for better inference efficiency. In experiments, the proposed MonoCon is tested on the KITTI benchmark (car, pedestrian and cyclist). It outperforms all prior arts in the leaderboard on the car category and obtains comparable performance on pedestrian and cyclist in terms of accuracy. Thanks to its simple design, the proposed MonoCon method obtains the fastest inference speed of 38.7 fps in comparisons.
translated by 谷歌翻译
Due to the lack of depth information in images and the resulting poor detection accuracy of monocular 3D object detection, we propose an instance-depth-based multi-scale monocular 3D object detection method. Firstly, to enhance the model's ability to process targets of different scales, a multi-scale perception module based on dilated convolution is designed, and, considering the inconsistency between feature maps of different scales, the depth features containing multi-scale information are re-refined along both the spatial and channel directions. Secondly, so that the model obtains better 3D perception, this paper proposes to use instance depth information as an auxiliary learning task to enhance the spatial depth features of the 3D target, supervising the auxiliary task with sparse instance depth. Finally, by verifying the proposed algorithm on the KITTI test set and evaluation set, the experimental results show that, compared with the baseline method, the proposed method improves AP40 by 5.27\% in the car category, effectively improving the detection performance of the monocular 3D object detection algorithm.
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
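The core OBMO idea of shifting a ground-truth box along the viewing frustum can be sketched as follows. This is a toy illustration; the function name, the shift values, and the omission of the paper's label scoring strategies are our simplifications, not the actual implementation:

```python
import numpy as np

def shift_along_frustum(center, shifts):
    """Generate pseudo 3D box centers by sliding the ground-truth center
    along the camera viewing ray (the frustum direction), so each pseudo
    label projects to (nearly) the same 2D bounding box.

    center: (3,) ground-truth box center in camera coordinates.
    shifts: iterable of signed offsets in meters (positive = farther away).
    """
    center = np.asarray(center, dtype=float)
    ray = center / np.linalg.norm(center)  # unit vector from camera to object
    return [center + s * ray for s in shifts]

# Ground-truth car center roughly 20 m ahead, slightly right of the camera;
# generate a nearer and a farther pseudo label around it.
pseudo_centers = shift_along_frustum([2.0, 1.0, 20.0], shifts=[-1.0, 0.0, 1.0])
```

In the paper these pseudo labels additionally carry quality scores, so the network sees a soft depth range rather than one hard depth target.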
In recent years, 3D object detection from LiDAR data for autonomous driving has been making remarkable progress. Among state-of-the-art methods, encoding point clouds into a bird's-eye view (BEV) has been demonstrated to be both effective and efficient. Unlike the perspective view, BEV preserves rich spatial and distance information between objects; and while more distant objects of the same type do not appear smaller in BEV, they contain sparser point cloud features. This fact weakens BEV feature extraction with a shared convolutional neural network. To address this challenge, we propose the Range-Aware Attention Network (RAANet), which extracts more powerful BEV features and produces superior 3D object detection. The range-aware attention (RAA) convolution significantly improves feature extraction at close range. Moreover, we propose a novel auxiliary loss for density estimation to further enhance the detection accuracy of RAANet for occluded objects. Notably, our proposed RAA convolution is lightweight and compatible to be integrated into any CNN architecture used for BEV detection. Extensive experiments on the nuScenes dataset show that our proposed method outperforms state-of-the-art LiDAR-based 3D object detection methods, with a real-time inference speed of 16 Hz, and 22 Hz for the lite version. The code is publicly available at the anonymous GitHub repository https://github.com/Anonymous0522/ange.
Due to the lack of depth information, monocular 3D object detection is very challenging in autonomous driving. This paper proposes a monocular 3D object detection algorithm based on multi-scale depth stratification (MDS-Net), which uses an anchor-free method to detect 3D objects in a per-pixel prediction. In the proposed MDS-Net, a novel depth-based stratification structure is developed to improve the network's capability of depth prediction by establishing a mathematical model between the depth of an object and its image size. A new angle loss function is then developed to further improve the accuracy of angle prediction and accelerate the convergence of training. An optimized Soft-NMS is finally applied in the post-processing stage to adjust the confidence of candidate boxes. Experiments on the KITTI benchmark show that the MDS-Net outperforms existing monocular 3D detection methods in 3D detection and BEV detection tasks while fulfilling real-time requirements.
Monocular 3D object detection has attracted great attention for its advantages in simplicity and cost. Due to the ill-posed nature of the 2D-to-3D mapping in the monocular imaging process, monocular 3D object detection suffers from inaccurate depth estimation and thus poor 3D detection results. To alleviate this problem, we propose to introduce the ground plane as a prior in monocular 3D object detection. The ground plane prior serves as an additional geometric condition for the ill-posed mapping and an extra source for depth estimation. In this way, we can obtain a more accurate depth estimation from the ground. Meanwhile, to take full advantage of the ground plane prior, we propose a depth-align training strategy and a precise two-stage depth inference method tailored to the ground plane. It is worth noting that the introduced ground plane prior requires no extra data sources such as LiDAR, stereo images, or depth information. Extensive experiments on the KITTI benchmark show that our method can achieve state-of-the-art results compared with other methods while maintaining a very fast speed. Our code and models are available at https://github.com/cfzd/monoground.
Monocular 3D object detection is a critical yet challenging task in autonomous driving due to the lack of the accurate depth information captured by LiDAR sensors. In this paper, we propose a stereo-guided monocular 3D object detection network, termed SGM3D, which leverages robust 3D features extracted from stereo images to enhance the features learned from monocular images. We innovatively investigate a multi-granularity domain adaptation module (MG-DA) to exploit the network's ability to generate stereo-mimicking features based on monocular cues alone. The coarse BEV feature-level as well as the fine anchor-level domain adaptation are leveraged to guide the monocular branch. We present an IoU-matching-based alignment module (IoU-MA) for object-level domain adaptation between the stereo and monocular domains to alleviate mismatches from the previous stages. We conduct extensive experiments on the most challenging KITTI and Lyft datasets and achieve new state-of-the-art performance. Furthermore, our method can be integrated into many other monocular approaches to boost their performance without introducing any extra computational cost.
3D object detection is an important capability needed in various practical applications such as driver assistance systems. Monocular 3D detection, a representative setting of image-based methods, provides a more economical solution than conventional settings relying on LiDARs, but still yields unsatisfactory results. This paper first presents a systematic study of this problem. We observe that current monocular 3D detection can be simplified as an instance depth estimation problem: inaccurate instance depth blocks all other 3D attribute predictions from improving the overall detection performance. Moreover, recent methods directly estimate depth based on isolated instances or pixels while ignoring the geometric relations across different objects. To this end, we construct geometric relation graphs across predicted objects and use the graph to facilitate depth estimation. As the preliminary depth estimation of each instance is usually inaccurate in this ill-posed setting, we incorporate a probabilistic representation to capture the uncertainty. It provides an important indicator to identify confident predictions and further guide the depth propagation. Despite the simplicity of the basic idea, our method, PGD, obtains significant improvements on the KITTI and nuScenes benchmarks, achieving 1st place among all monocular vision-only methods while still maintaining real-time efficiency. Code and models will be released at https://github.com/open-mmlab/mmdetection3d.
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A² net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our newly designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A² net outperforms all existing 3D detection methods and achieves new state-of-the-art on the KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving, whereas its accuracy is still far from satisfactory. In this paper, we dig into the 3D object detection task and reformulate it as the sub-tasks of object localization and appearance perception, which benefits the deep excavation of the reciprocal information underlying the entire task. We introduce a Dynamic Feature Reflecting Network, named DFR-Net, which contains two novel standalone modules: (i) the Appearance-Localization Feature Reflecting module (ALFR) that first separates task-specific features and then self-mutually reflects the reciprocal features; (ii) the Dynamic Intra-Trading module (DIT) that adaptively realigns the training processes of the various sub-tasks in a self-learning manner. Extensive experiments on the challenging KITTI dataset demonstrate the effectiveness and generalization of DFR-Net. We rank 1st among all monocular 3D object detectors on the KITTI test set (till March 16th, 2021). The proposed method is also easy to plug into many cutting-edge 3D detection frameworks at negligible cost to boost performance. The code will be made publicly available.
Surround-view fisheye perception under valet parking scenes is fundamental and crucial in autonomous driving. Environmental conditions in parking lots differ from those in the common public datasets, with imperfect light and opacity that substantially impact perception performance. Most existing networks based on public datasets may produce suboptimal results when generalized to these valet parking scenes, further affected by the fisheye distortion. In this article, we introduce a new large-scale fisheye dataset called the Fisheye Parking Dataset (FPD) to promote research on diverse real-world surround-view parking cases. Notably, our compiled FPD exhibits excellent characteristics for different surround-view perception tasks. In addition, we also propose our real-time distortion-insensitive multi-task framework, the Fisheye Perception Network (FPNet), which improves surround-view fisheye BEV perception by enhancing the fisheye distortion operation and multi-task lightweight designs. Extensive experiments validate the effectiveness of our approach and the dataset's exceptional generalizability.
Monocular 3D object detection is of great significance for autonomous driving but remains challenging. The core challenge is to predict the distance of objects in the absence of explicit depth information. Unlike regressing the distance as a single variable as in most existing methods, we propose a novel geometry-based distance decomposition to recover the distance from its factors. The decomposition factors the distance of objects into the most representative and stable variables, i.e., the physical height and the projected visual height in the image plane. Moreover, the decomposition maintains the self-consistency between the two heights, leading to robust distance prediction when both predicted heights are inaccurate. The decomposition also enables us to trace the causes of distance uncertainty in different scenarios, making the distance prediction interpretable, accurate, and robust. Our method directly predicts 3D bounding boxes from RGB images with a compact architecture, making training and inference simple and efficient. Experimental results show that our method achieves state-of-the-art performance on the monocular 3D object detection and bird's-eye view tasks of the KITTI dataset, and can generalize to images with different camera intrinsics.
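The decomposition described above rests on the pinhole relation between an object's physical height and its projected visual height; a minimal sketch (the function name and the numbers are illustrative, not from the paper):

```python
def distance_from_heights(focal_px, physical_height_m, visual_height_px):
    """Recover object distance via the pinhole relation Z = f * H / h:
    focal length (pixels) times physical height (meters), divided by the
    projected visual height (pixels)."""
    return focal_px * physical_height_m / visual_height_px

# A 1.5 m tall car imaged 54 px tall by a camera with f = 720 px
# sits 20 m away.
z = distance_from_heights(720.0, 1.5, 54.0)  # -> 20.0
```

Because the estimate is a ratio of two predicted heights, a consistent over- or under-estimate of both partially cancels, which is the self-consistency property the abstract refers to.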
Figure 1: Results obtained from our single image, monocular 3D object detection network MonoDIS on a KITTI3D test image with corresponding birds-eye view, showing its ability to estimate size and orientation of objects at different scales.
Compared to LiDAR, camera and radar sensors have significant advantages in cost, reliability, and maintenance. Existing fusion methods often fuse the outputs of single modalities at the result level, called the late fusion strategy. This benefits from using off-the-shelf single-sensor detection algorithms, but late fusion cannot fully exploit the complementary properties of the sensors, and thus has limited performance despite the huge potential of camera-radar fusion. Here we propose a novel proposal-level early fusion approach that effectively exploits both the spatial and contextual properties of camera and radar for 3D object detection. Our fusion framework first associates image proposals with radar points in the polar coordinate system to efficiently handle the discrepancy between the coordinate systems and spatial properties. Using this as a first stage, consecutive cross-attention-based feature fusion layers then adaptively exchange spatio-contextual information between camera and radar, leading to robust and attentive fusion. Our camera-radar fusion approach achieves the state-of-the-art 41.1% mAP and 52.3% NDS on the nuScenes test set, 8.7 and 10.8 points higher than the camera-only baseline, and achieves competitive performance against LiDAR-based methods.
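The polar-coordinate association step can be illustrated with a toy sketch. The hard azimuth-interval test and all names here are our simplifications for illustration only; the actual method uses learned, attention-based association:

```python
import math

def associate_radar_to_proposal(radar_points, az_min, az_max):
    """Toy proposal-level association in polar coordinates: convert each
    BEV radar point (x, z) to (range, azimuth) and keep points whose
    azimuth falls inside the proposal's azimuth interval [az_min, az_max]
    (radians). Polar coordinates fit the camera frustum geometry: an image
    proposal spans an azimuth wedge, not an axis-aligned Cartesian box."""
    kept = []
    for x, z in radar_points:
        rng = math.hypot(x, z)   # distance from the sensor
        az = math.atan2(x, z)    # 0 rad = straight ahead
        if az_min <= az <= az_max:
            kept.append((rng, az))
    return kept

# Three radar returns; the proposal's wedge covers azimuths [-0.1, 0.9] rad,
# so the far-left point is dropped.
points = [(0.0, 10.0), (5.0, 5.0), (-4.0, 3.0)]
hits = associate_radar_to_proposal(points, az_min=-0.1, az_max=0.9)
```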
3D object detection with surround-view cameras is a promising direction for autonomous driving. In this paper, we present SimMOD, a simple baseline for multi-camera object detection. To incorporate multi-view information, and building on previous efforts in monocular 3D object detection, the framework is built on sample-wise object proposals and is designed to work in a two-stage manner. First, we extract multi-scale features and generate perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view and multi-scale visual features in the DETR3D style. The refined proposals are decoded into detection results end-to-end. To further boost performance, we add auxiliary branches alongside proposal generation to enhance feature learning. We also design methods of target filtering and teacher forcing to promote the consistency of the two-stage training. We conduct extensive experiments on the nuScenes 3D object detection benchmark to demonstrate the effectiveness of SimMOD and achieve new state-of-the-art performance. Code will be available at https://github.com/zhangyp15/simmod.
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies: a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
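The pseudo-LiDAR conversion described above is, at its core, a per-pixel back-projection of the depth map through the camera intrinsics; a minimal numpy sketch (function name and intrinsic values are illustrative):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into a 3D point cloud in the camera
    frame, mimicking a LiDAR signal: for each pixel (u, v) with depth z,
    x = (u - cx) * z / fx and y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 10.0)  # toy 4x4 depth map, every pixel at 10 m
cloud = depth_to_pseudo_lidar(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
# 16 points; the pixel at the principal point maps to (0, 0, 10)
```

The resulting (N, 3) array can then be fed to an existing LiDAR-based detector in place of a real scan, which is the representation change the paper argues matters most.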
Monocular 3D object detection is an important task in autonomous driving. It can easily become intractable when the ego-car pose changes w.r.t. the ground plane, which is common due to slight fluctuations of road smoothness and slope. Owing to the lack of insight from industrial applications, existing methods on open datasets neglect camera pose information, which inevitably leaves the detector susceptible to camera extrinsic parameters. Such perturbation of objects occurs in most autonomous driving cases for industrial products. To this end, we propose a novel method to capture the camera pose and formulate a detector free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting the vanishing point and horizon change. A converter is designed to rectify perturbative features in the latent space. By doing so, our 3D detector works independently of the extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, which almost all existing monocular detectors cannot handle. Experiments demonstrate that our method yields the best performance compared with the other state-of-the-art methods by a large margin on both the KITTI 3D and nuScenes datasets.
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability. (* Majority of the work done as an intern at Nuro, Inc.) (Figure labels: depth to point cloud; 2D region (from CNN) to 3D frustum; 3D box (from PointNet).)
Perceiving 3D objects from monocular inputs is crucial for robotic systems, given its economy compared to multi-sensor settings. It is notably difficult as a single image cannot provide any clue for predicting absolute depth values. Motivated by binocular methods for 3D object detection, we take advantage of the strong geometric structure provided by camera ego-motion for accurate object depth estimation and detection. We first make a theoretical analysis of this general two-view case and notice two challenges: 1) cumulative errors from multiple estimations that make direct prediction intractable; 2) inherent dilemmas caused by static cameras and matching ambiguity. Accordingly, we establish the stereo correspondence with a geometry-aware cost volume as the alternative for depth estimation and further compensate it with monocular understanding to address the second problem. Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon. We also present a pose-free DfM to make it usable when the camera pose is unavailable. Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark. Detailed quantitative and qualitative analyses also validate our theoretical conclusions. The code will be released at https://github.com/tai-wang/depth-from-motion.
As Intersection-over-Union (IoU) based optimization maintains the consistency between the final IoU prediction metric and the losses, it has been widely used in both the regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods have adopted IoU-based optimization and directly replaced the 2D IoU with the 3D IoU. However, such a direct computation in 3D is very costly due to the complex implementation and inefficient backward operations. Moreover, 3D IoU-based optimization is sub-optimal as it is sensitive to rotation and thus can cause training instability and deterioration of detection performance. In this paper, we propose a novel Rotation-Decoupled IoU (RDIoU) method that can mitigate the rotation-sensitivity issue and produce a more efficient optimization objective compared with 3D IoU during the training stage. Specifically, our RDIoU simplifies the complex interactions of the regression parameters by decoupling the rotation variable as an independent term, yet preserving the geometry of 3D IoU. By incorporating RDIoU into both the regression and classification branches, the network is encouraged to learn more precise bounding boxes and concurrently overcome the misalignment issue between classification and regression. Extensive experiments on the benchmark KITTI and Waymo Open datasets validate that our RDIoU method can bring substantial improvements for single-stage 3D object detection.
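A hedged sketch of the rotation-decoupling idea: instead of computing the exact IoU of two rotated cuboids, treat the heading as an independent fourth axis-aligned dimension, so the overlap shrinks smoothly as the headings diverge. The constants and the exact mapping below are illustrative, not the paper's formulation:

```python
def rdiou_like(box_a, box_b, rot_len=1.0):
    """IoU of two boxes (x, y, z, l, w, h, theta), with rotation decoupled
    into a 4th axis-aligned dimension of fixed extent `rot_len`. Each
    dimension contributes an independent 1D overlap, so no rotated-polygon
    clipping (and its costly backward pass) is needed."""
    inter = 1.0
    vol_a = vol_b = 1.0
    ca = list(box_a[:3]) + [box_a[6]]   # centers: 3 spatial dims + rotation
    cb = list(box_b[:3]) + [box_b[6]]
    sa = list(box_a[3:6]) + [rot_len]   # sizes likewise
    sb = list(box_b[3:6]) + [rot_len]
    for c1, c2, s1, s2 in zip(ca, cb, sa, sb):
        lo = max(c1 - s1 / 2, c2 - s2 / 2)
        hi = min(c1 + s1 / 2, c2 + s2 / 2)
        if hi <= lo:
            return 0.0                  # disjoint in some dimension
        inter *= hi - lo
        vol_a *= s1
        vol_b *= s2
    return inter / (vol_a + vol_b - inter)

# Identical boxes give IoU 1; rotating one by 0.5 rad lowers the score
# without any rotated-geometry computation.
same = rdiou_like((0, 0, 0, 4, 2, 1.5, 0.0), (0, 0, 0, 4, 2, 1.5, 0.0))
off = rdiou_like((0, 0, 0, 4, 2, 1.5, 0.0), (0, 0, 0, 4, 2, 1.5, 0.5))
```

Note how the score varies monotonically and smoothly with the heading gap, which is the stability property the paper targets, while exact rotated 3D IoU can change abruptly with rotation.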