Due to the lack of depth information of images and poor detection accuracy in monocular 3D object detection, we proposed the instance depth for multi-scale monocular 3D object detection method. Firstly, to enhance the model's processing ability for different scale targets, a multi-scale perception module based on dilated convolution is designed, and the depth features containing multi-scale information are re-refined from both spatial and channel directions considering the inconsistency between feature maps of different scales. Firstly, we designed a multi-scale perception module based on dilated convolution to enhance the model's processing ability for different scale targets. The depth features containing multi-scale information are re-refined from spatial and channel directions considering the inconsistency between feature maps of different scales. Secondly, so as to make the model obtain better 3D perception, this paper proposed to use the instance depth information as an auxiliary learning task to enhance the spatial depth feature of the 3D target and use the sparse instance depth to supervise the auxiliary task. Finally, by verifying the proposed algorithm on the KITTI test set and evaluation set, the experimental results show that compared with the baseline method, the proposed method improves by 5.27\% in AP40 in the car category, effectively improving the detection performance of the monocular 3D object detection algorithm.
translated by 谷歌翻译
单眼3D对象检测旨在将3D边界框本地化在输入单个2D图像中。这是一个非常具有挑战性的问题并且仍然是开放的,特别是当没有额外的信息时(例如,深度,激光雷达和/或多帧)可以利用训练和/或推理。本文提出了一种对单眼3D对象检测的简单而有效的配方,而无需利用任何额外信息。它介绍了从训练中学习单眼背景的单片方法,以帮助单目3D对象检测。关键的想法是,通过图像中的对象的注释3D边界框,在训练中有一个丰富的良好的投影2D监控信号,例如投影的角键点及其相关联的偏移向量相对于中心在2D边界框中,应该被开发为培训中的辅助任务。拟议的单一的单一的机动在衡量标准理论中的克拉默 - Wold定理在高水平下。在实施中,它利用非常简单的端到端设计来证明学习辅助单眼环境的有效性,它由三个组成组成:基于深度神经网络(DNN)的特征骨干,一些回归头部分支用于学习用于3D边界框预测的基本参数,以及用于学习辅助上下文的许多回归头分支。在训练之后,丢弃辅助上下文回归分支以获得更好的推理效率。在实验中,拟议的单一组在基蒂基准(汽车,Pedestrain和骑自行车的人)中测试。它超越了汽车类别上排行榜中的所有现有技术,并在准确性方面获得了行人和骑自行车者的可比性。由于简单的设计,所提出的单控制方法在比较中获得了38.7 FP的最快推断速度
translated by 谷歌翻译
由于缺乏深度信息,单眼3D对象检测在自主驾驶中非常具有挑战性。本文提出了一种基于多尺度深度分层的单眼单目眼3D对象检测算法,它使用锚定方法检测每像素预测中的3D对象。在所提出的MDS-Net中,开发了一种新的基于深度的分层结构,以通过在对象的深度和图像尺寸之间建立数学模型来改善网络的深度预测能力。然后开发出新的角度损耗功能,以进一步提高角度预测的精度并提高训练的收敛速度。最终在后处理阶段最终应用优化的软,以调整候选盒的置信度。基蒂基准测试的实验表明,MDS-Net在3D检测中优于现有的单目3D检测方法,并在满足实时要求时进行3D检测和BEV检测任务。
translated by 谷歌翻译
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method significantly improves state-of-the-art monocular 3D detectors by a significant margin (The improvements under the moderate setting on KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D}. Codes have been released at https://github.com/mrsempress/OBMO.
translated by 谷歌翻译
低成本单眼的3D对象检测在自主驾驶中起着基本作用,而其精度仍然远非令人满意。在本文中,我们挖掘了3D对象检测任务,并将其重构为对象本地化和外观感知的子任务,这有​​利于整个任务的互惠信息的深度挖掘。我们介绍了一个名为DFR-Net的动态特征反射网络,其中包含两种新的独立模块:(i)首先将任务特征分开的外观定位特征反射模块(ALFR),然后自相互反映互核特征; (ii)通过自学习方式自适应地重建各个子任务的培训过程的动态内部交易模块(DIT)。关于挑战基蒂数据集的广泛实验证明了DFR网的有效性和泛化。我们在基蒂测试集中的所有单眼3D对象探测器中排名第一(直到2021年3月16日)。所提出的方法在许多尖端的3D检测框架中也容易在较忽略的成本下以忽略的成本来播放。该代码将公开可用。
translated by 谷歌翻译
由于LIDAR传感器捕获的精确深度信息缺乏准确的深度信息,单眼3D对象检测是一个关键而挑战的自主驾驶任务。在本文中,我们提出了一种立体引导的单目3D对象检测网络,称为SGM3D,其利用立体图像提取的鲁棒3D特征来增强从单眼图像中学到的特征。我们创新地研究了多粒度域适配模块(MG-DA)以利用网络的能力,以便仅基于单手套提示产生立体模拟功能。利用粗均衡特征级以及精细锚级域适配,以引导单眼分支。我们介绍了一个基于IOO匹配的对齐模块(iou-ma),用于立体声和单眼域之间的对象级域适应,以减轻先前阶段中的不匹配。我们对最具挑战性的基蒂和Lyft数据集进行了广泛的实验,并实现了新的最先进的性能。此外,我们的方法可以集成到许多其他单眼的方法中以提高性能而不引入任何额外的计算成本。
translated by 谷歌翻译
单眼3D对象检测在简单性和成本方面的优势引起了极大的关注。由于单眼成像过程的2D至3D映射本质不足,因此单眼3D对象检测的深度估计不准确,因此3D检测结果较差。为了减轻这个问题,我们建议将地面作为单眼3D对象检测中的先验引入。地面先验是对不足的映射的额外几何条件,并且是深入估算的额外源。这样,我们可以从地面获得更准确的深度估计。同时,为了获得地面平面的充分优势,我们提出了一种深度对准训练策略和精确的两阶段深度推理方法,该方法是为地面平面量身定制的。值得注意的是,引入的地面之前不需要额外的数据源,例如LIDAR,立体声图像和深度信息。 Kitti基准测试的广泛实验表明,与其他方法相比,我们的方法可以在保持非常快速的速度的同时获得最新的结果。我们的代码和型号可在https://github.com/cfzd/monoground上找到。
translated by 谷歌翻译
近年来,自主驾驶LIDAR数据的3D对象检测一直在迈出卓越的进展。在最先进的方法中,已经证明了将点云进行编码为鸟瞰图(BEV)是有效且有效的。与透视图不同,BEV在物体之间保留丰富的空间和距离信息;虽然在BEV中相同类型的更远物体不会较小,但它们包含稀疏点云特征。这一事实使用共享卷积神经网络削弱了BEV特征提取。为了解决这一挑战,我们提出了范围感知注意网络(RAANET),提取更强大的BEV功能并产生卓越的3D对象检测。范围感知的注意力(RAA)卷曲显着改善了近距离的特征提取。此外,我们提出了一种新的辅助损耗,用于密度估计,以进一步增强覆盖物体的Raanet的检测精度。值得注意的是,我们提出的RAA卷积轻量级,并兼容,以集成到用于BEV检测的任何CNN架构中。 Nuscenes DataSet上的广泛实验表明,我们的提出方法优于基于LIDAR的3D对象检测的最先进的方法,具有16 Hz的实时推断速度,为LITE版本为22 Hz。该代码在匿名GitHub存储库HTTPS://github.com/Anonymous0522 / ange上公开提供。
translated by 谷歌翻译
3D对象检测是各种实际应用所需的重要功能,例如驾驶员辅助系统。单眼3D检测作为基于图像的方法的代表性的常规设置,提供比依赖Lidars的传统设置更经济的解决方案,但仍然产生不令人满意的结果。本文首先提出了对这个问题的系统研究。我们观察到,目前的单目3D检测可以简化为实例深度估计问题:不准确的实例深度阻止所有其他3D属性预测改善整体检测性能。此外,最近的方法直接估计基于孤立的实例或像素的深度,同时忽略不同对象的几何关系。为此,我们在跨预测对象构建几何关系图,并使用该图来促进深度估计。随着每个实例的初步深度估计通常在这种不均匀的环境中通常不准确,我们纳入了概率表示以捕获不确定性。它提供了一个重要的指标,以确定自信的预测并进一步引导深度传播。尽管基本思想的简单性,但我们的方法,PGD对基蒂和NUSCENES基准的显着改进,尽管在所有单眼视觉的方法中实现了第1个,同时仍保持实时效率。代码和模型将在https://github.com/open-mmlab/mmdetection3d发布。
translated by 谷歌翻译
已经尝试通过融合立体声摄像机图像和激光镜传感器数据或使用LIDAR进行预训练,而仅用于测试的单眼图像来检测3D对象,但是由于精确度较低而仅尝试使用单眼图像序列的尝试较少。另外,当仅使用单眼图像的深度预测时,只能预测尺度不一致的深度,这就是研究人员不愿单独使用单眼图像的原因。因此,我们提出了一种通过仅使用单眼图像序列来预测绝对深度和检测3D对象的方法,通过启用检测网络和深度预测网络的端到端学习。结果,所提出的方法超过了Kitti 3D数据集中性能的其他现有方法。即使在训练期间一起使用单眼图像和3D激光雷达以提高性能,与使用相同输入的其他方法相比,我们的展览也是最佳性能。此外,端到端学习不仅可以改善深度预测性能,而且还可以实现绝对深度预测,因为我们的网络利用了这样一个事实,即3D对象(例如汽车)的大小由大约大小确定。
translated by 谷歌翻译
In recent years, object detection has achieved a very large performance improvement, but the detection result of small objects is still not very satisfactory. This work proposes a strategy based on feature fusion and dilated convolution that employs dilated convolution to broaden the receptive field of feature maps at various scales in order to address this issue. On the one hand, it can improve the detection accuracy of larger objects. On the other hand, it provides more contextual information for small objects, which is beneficial to improving the detection accuracy of small objects. The shallow semantic information of small objects is obtained by filtering out the noise in the feature map, and the feature information of more small objects is preserved by using multi-scale fusion feature module and attention mechanism. The fusion of these shallow feature information and deep semantic information can generate richer feature maps for small object detection. Experiments show that this method can have higher accuracy than the traditional YOLOv3 network in the detection of small objects and occluded objects. In addition, we achieve 32.8\% Mean Average Precision on the detection of small objects on MS COCO2017 test set. For 640*640 input, this method has 88.76\% mAP on the PASCAL VOC2012 dataset.
translated by 谷歌翻译
Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and single-stage detector to produce the final output detections. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1 st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. We provide a code release for CaDDN which is made available here.
translated by 谷歌翻译
对于许多应用程序,包括自动驾驶,机器人抓握和增强现实,单眼3D对象检测是一项基本但非常重要的任务。现有的领先方法倾向于首先估算输入图像的深度,并基于点云检测3D对象。该例程遭受了深度估计和对象检测之间固有的差距。此外,预测误差积累也会影响性能。在本文中,提出了一种名为MonopCN的新方法。引入单频道的洞察力是,我们建议在训练期间模拟基于点云的探测器的特征学习行为。因此,在推理期间,学习的特征和预测将与基于点云的检测器相似。为了实现这一目标,我们建议一个场景级仿真模块,一个ROI级别的仿真模块和一个响应级仿真模块,这些模块逐渐用于检测器的完整特征学习和预测管道。我们将我们的方法应用于著名的M3D-RPN检测器和CADDN检测器,并在Kitti和Waymo Open数据集上进行了广泛的实验。结果表明,我们的方法始终提高不同边缘的不同单眼探测器的性能,而无需更改网络体系结构。我们的方法最终达到了最先进的性能。
translated by 谷歌翻译
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view object detection, while being real-time. * Equal contribution.† Work done as part of Uber AI Residency program.
translated by 谷歌翻译
在鸟眼中学习强大的表现(BEV),以进行感知任务,这是趋势和吸引行业和学术界的广泛关注。大多数自动驾驶算法的常规方法在正面或透视视图中执行检测,细分,跟踪等。随着传感器配置变得越来越复杂,从不同的传感器中集成了多源信息,并在统一视图中代表功能至关重要。 BEV感知继承了几个优势,因为代表BEV中的周围场景是直观和融合友好的。对于BEV中的代表对象,对于随后的模块,如计划和/或控制是最可取的。 BEV感知的核心问题在于(a)如何通过从透视视图到BEV来通过视图转换来重建丢失的3D信息; (b)如何在BEV网格中获取地面真理注释; (c)如何制定管道以合并来自不同来源和视图的特征; (d)如何适应和概括算法作为传感器配置在不同情况下各不相同。在这项调查中,我们回顾了有关BEV感知的最新工作,并对不同解决方案进行了深入的分析。此外,还描述了该行业的BEV方法的几种系统设计。此外,我们推出了一套完整的实用指南,以提高BEV感知任务的性能,包括相机,激光雷达和融合输入。最后,我们指出了该领域的未来研究指示。我们希望该报告能阐明社区,并鼓励对BEV感知的更多研究。我们保留一个活跃的存储库来收集最新的工作,并在https://github.com/openperceptionx/bevperception-survey-recipe上提供一包技巧的工具箱。
translated by 谷歌翻译
3D object detection from LiDAR point cloud is a challenging problem in 3D scene understanding and has many practical applications. In this paper, we extend our preliminary work PointRCNN to a novel and strong point-cloud-based 3D object detection framework, the part-aware and aggregation neural network (Part-A 2 net). The whole framework consists of the part-aware stage and the part-aggregation stage. Firstly, the part-aware stage for the first time fully utilizes free-of-charge part supervisions derived from 3D ground-truth boxes to simultaneously predict high quality 3D proposals and accurate intra-object part locations. The predicted intra-object part locations within the same proposal are grouped by our new-designed RoI-aware point cloud pooling module, which results in an effective representation to encode the geometry-specific features of each 3D proposal. Then the part-aggregation stage learns to re-score the box and refine the box location by exploring the spatial relationship of the pooled intra-object part locations. Extensive experiments are conducted to demonstrate the performance improvements from each component of our proposed framework. Our Part-A 2 net outperforms all existing 3D detection methods and achieves new state-of-the-art on KITTI 3D object detection dataset by utilizing only the LiDAR point cloud data. Code is available at https://github.com/sshaoshuai/PointCloudDet3D.
translated by 谷歌翻译
Surround-view fisheye perception under valet parking scenes is fundamental and crucial in autonomous driving. Environmental conditions in parking lots perform differently from the common public datasets, such as imperfect light and opacity, which substantially impacts on perception performance. Most existing networks based on public datasets may generalize suboptimal results on these valet parking scenes, also affected by the fisheye distortion. In this article, we introduce a new large-scale fisheye dataset called Fisheye Parking Dataset(FPD) to promote the research in dealing with diverse real-world surround-view parking cases. Notably, our compiled FPD exhibits excellent characteristics for different surround-view perception tasks. In addition, we also propose our real-time distortion-insensitive multi-task framework Fisheye Perception Network (FPNet), which improves the surround-view fisheye BEV perception by enhancing the fisheye distortion operation and multi-task lightweight designs. Extensive experiments validate the effectiveness of our approach and the dataset's exceptional generalizability.
translated by 谷歌翻译
自我监督的学习已经为单眼深度估计显示出非常有希望的结果。场景结构和本地细节都是高质量深度估计的重要线索。最近的作品遭受了场景结构的明确建模,并正确处理细节信息,这导致了预测结果中的性能瓶颈和模糊人工制品。在本文中,我们提出了具有两个有效贡献的通道 - 明智的深度估计网络(Cadepth-Net):1)结构感知模块采用自我关注机制来捕获远程依赖性并聚合在信道中的识别特征尺寸,明确增强了场景结构的感知,获得了更好的场景理解和丰富的特征表示。 2)细节强调模块重新校准通道 - 方向特征映射,并选择性地强调信息性功能,旨在更有效地突出至关重要的本地细节信息和熔断器不同的级别功能,从而更精确,更锐化深度预测。此外,广泛的实验验证了我们方法的有效性,并表明我们的模型在基蒂基准和Make3D数据集中实现了最先进的结果。
translated by 谷歌翻译
鉴于其经济性与多传感器设置相比,从单眼输入中感知的3D对象对于机器人系统至关重要。它非常困难,因为单个图像无法提供预测绝对深度值的任何线索。通过双眼方法进行3D对象检测,我们利用了相机自我运动提供的强几何结构来进行准确的对象深度估计和检测。我们首先对此一般的两视案例进行了理论分析,并注意两个挑战:1)来自多个估计的累积错误,这些估计使直接预测棘手; 2)由静态摄像机和歧义匹配引起的固有难题。因此,我们建立了具有几何感知成本量的立体声对应关系,作为深度估计的替代方案,并以单眼理解进一步补偿了它,以解决第二个问题。我们的框架(DFM)命名为深度(DFM),然后使用已建立的几何形状将2D图像特征提升到3D空间并检测到其3D对象。我们还提出了一个无姿势的DFM,以使其在摄像头不可用时可用。我们的框架在Kitti基准测试上的优于最先进的方法。详细的定量和定性分析也验证了我们的理论结论。该代码将在https://github.com/tai-wang/depth-from-motion上发布。
translated by 谷歌翻译
它得到了很好的认识到,从深度感知的LIDAR点云和语义富有的立体图像中融合互补信息将有利于3D对象检测。然而,探索稀疏3D点和密集2D像素之间固有的不自然相互作用并不重要。为了简化这种困难,最近的建议通常将3D点投影到2D图像平面上以对图像数据进行采样,然后聚合点处的数据。然而,这种方法往往遭受点云和RGB图像的分辨率之间的不匹配,导致次优性能。具体地,作为多模态数据聚合位置的稀疏点导致高分辨率图像的严重信息丢失,这反过来破坏了多传感器融合的有效性。在本文中,我们呈现VPFNET - 一种新的架构,可以在“虚拟”点处巧妙地对齐和聚合点云和图像数据。特别地,它们的密度位于3D点和2D像素的密度之间,虚拟点可以很好地桥接两个传感器之间的分辨率间隙,从而保持更多信息以进行处理。此外,我们还研究了可以应用于点云和RGB图像的数据增强技术,因为数据增强对迄今为止对3D对象探测器的贡献不可忽略。我们对Kitti DataSet进行了广泛的实验,与最先进的方法相比,观察到了良好的性能。值得注意的是,我们的VPFNET在KITTI测试集上实现了83.21 \%中等3D AP和91.86 \%适度的BEV AP,自2021年5月21日起排名第一。网络设计也考虑了计算效率 - 我们可以实现FPS 15对单个NVIDIA RTX 2080TI GPU。该代码将用于复制和进一步调查。
translated by 谷歌翻译