Monocular 3D object detection is a low-cost but challenging task, as it requires accurate 3D localization from a single input image. Recently developed depth-assisted methods show promising results by using explicit depth maps as intermediate features, which are either precomputed by monocular depth estimation networks or estimated jointly with 3D object detection. However, inevitable errors in the estimated depth priors may misalign semantic information and 3D localization, resulting in feature smearing and suboptimal predictions. To mitigate this issue, we propose ADD, an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding. Unlike previous knowledge distillation frameworks that adopt stereo- or LiDAR-based teachers, we build our teacher with the same architecture as the student but with extra ground-truth depth as input. Thanks to this teacher design, our framework is seamless, free of domain gap, easy to implement, and compatible with object-wise ground-truth depth. Specifically, we leverage intermediate features and responses for knowledge distillation. To account for long-range 3D dependencies, we propose \emph{3D-aware self-attention} and \emph{target-aware cross-attention} modules for student adaptation. Extensive experiments on the challenging KITTI 3D object detection benchmark verify the effectiveness of our framework. We implement our framework on three representative monocular detectors and achieve state-of-the-art performance with no additional inference cost relative to the baseline models. Our code is available at https://github.com/rockywind/ADD.
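3D object detection from LiDAR or camera sensors is essential for autonomous driving. Pioneering attempts at multi-modality fusion complement sparse LiDAR point clouds with rich semantic texture information from images, at the cost of extra network designs and overhead. In this work, we propose a novel semantic passing framework, named SPNet, to boost the performance of existing LiDAR-based 3D detection models with the guidance of rich context painting, with no extra computation cost during inference. Our key design is to first exploit the potentially instructive semantic knowledge within the ground-truth labels by training a semantic-painted teacher model, and then guide the pure-LiDAR network to learn the semantic-painted representation via knowledge passing modules at different granularities: class-wise passing, pixel-wise passing, and instance-wise passing. Experimental results show that the proposed SPNet can seamlessly cooperate with most existing 3D detection frameworks with an AP gain of 1~5%, and even achieves new state-of-the-art 3D detection performance on the KITTI test benchmark. Code is available at: https://github.com/jb892/spnet.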
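Monocular 3D object detection is a critical yet challenging task for autonomous driving, because it lacks the accurate depth information captured by LiDAR sensors. In this paper, we propose a stereo-guided monocular 3D object detection network, termed SGM3D, which leverages robust 3D features extracted from stereo images to enhance the features learned from the monocular image. We investigate a multi-granularity domain adaptation module (MG-DA) that exploits the network's ability to generate stereo-mimicking features based only on monocular cues. Coarse BEV feature-level as well as fine anchor-level domain adaptation are leveraged to guide the monocular branch. We present an IoU matching-based alignment module (IoU-MA) for object-level domain adaptation between the stereo and monocular domains to alleviate the mismatches from the previous stages. We conduct extensive experiments on the challenging KITTI and Lyft datasets and achieve new state-of-the-art performance. Furthermore, our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.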
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection and has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information leads current approaches to adopt large models for good performance. Therefore, it is essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques for training efficient yet accurate models. However, to the best of our knowledge, BEV KD is still under-explored. Unlike image classification, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer knowledge in a teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfactory results because of the heavy background information in RGB cameras. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate a foreground mask and a view-dependent mask for the teacher-student paradigm. Note that our method uses LiDAR points only to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation into our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
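The mask-guided distillation described in this abstract can be sketched as follows. This is a minimal illustration and not the BEV-LGKD implementation: the function names `lidar_to_bev_mask` and `masked_feature_distillation` are hypothetical, the features are single-channel nested lists for brevity, and treating every BEV cell that contains at least one LiDAR point as foreground is an assumption made here.

```python
def lidar_to_bev_mask(points, x_min, y_min, cell_size, height, width):
    """Rasterize LiDAR points (x, y, z) into a binary BEV occupancy mask.

    Assumption: a cell counts as foreground if any point falls inside it.
    """
    mask = [[0] * width for _ in range(height)]
    for x, y, _z in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        # Points outside the BEV grid are ignored.
        if 0 <= row < height and 0 <= col < width:
            mask[row][col] = 1
    return mask


def masked_feature_distillation(teacher, student, mask):
    """Mean squared error between teacher and student BEV features,
    averaged over the cells selected by the mask only."""
    total, count = 0.0, 0
    for t_row, s_row, m_row in zip(teacher, student, mask):
        for t, s, m in zip(t_row, s_row, m_row):
            if m:
                total += (t - s) ** 2
                count += 1
    return total / count if count else 0.0


points = [(0.5, 0.5, 0.0), (3.2, 1.1, 0.3), (100.0, 100.0, 0.0)]
mask = lidar_to_bev_mask(points, 0.0, 0.0, 1.0, height=2, width=4)
teacher = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
student = [[0.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 6.0]]
loss = masked_feature_distillation(teacher, student, mask)
```

Restricting the loss to masked cells keeps background regions, which dominate a BEV grid, from swamping the distillation signal.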
To achieve accurate and low-cost 3D object detection, existing methods propose to benefit camera-based multi-view detectors with spatial cues provided by the LiDAR modality, e.g., dense depth supervision and bird's-eye-view (BEV) feature distillation. However, they directly conduct point-to-point mimicking from LiDAR to camera, which neglects the inner geometry of foreground targets and suffers from the modality gap between 2D and 3D features. In this paper, we propose a scheme, termed TiG-BEV, for learning Target Inner-Geometry from the LiDAR modality in camera-based BEV detectors, for both dense depth and BEV features. First, we introduce an inner-depth supervision module to learn the low-level relative depth relations between different foreground pixels. This enables the camera-based detector to better understand object-wise spatial structures. Second, we design an inner-feature BEV distillation module to imitate the high-level semantics of different keypoints within foreground targets. To further alleviate the BEV feature gap between the two modalities, we adopt both inter-channel and inter-keypoint distillation for feature-similarity modeling. With our target inner-geometry distillation, TiG-BEV boosts BEVDepth by +2.3% NDS and +2.4% mAP, and BEVDet by +9.1% NDS and +10.3% mAP, on the nuScenes val set. Code will be available at https://github.com/ADLab3Ds/TiG-BEV.
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
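The idea of shifting a 3D box along the viewing frustum can be sketched as follows, assuming a pinhole camera with the box center given in camera coordinates. Scaling the center by the ratio of new to old depth moves it along the viewing ray, so its 2D projection stays fixed. The linear quality score is a hypothetical stand-in chosen for illustration; it is not claimed to be either of the two scoring strategies the abstract mentions, and all function names are assumptions.

```python
def shift_along_ray(center, dz):
    """Move a 3D center (x, y, z) in camera coordinates along its viewing ray
    so that its depth becomes z + dz."""
    x, y, z = center
    new_z = z + dz
    if z <= 0 or new_z <= 0:
        raise ValueError("depth must stay positive")
    scale = new_z / z
    return (x * scale, y * scale, new_z)


def project(center, focal, cu, cv):
    """Pinhole projection of a 3D point to pixel coordinates (u, v)."""
    x, y, z = center
    return (focal * x / z + cu, focal * y / z + cv)


def make_pseudo_labels(center, depth_offsets):
    """Return (shifted_center, score) pairs for each depth offset.

    Hypothetical score: decays linearly with the relative depth shift.
    """
    labels = []
    for dz in depth_offsets:
        shifted = shift_along_ray(center, dz)
        score = max(0.0, 1.0 - abs(dz) / center[2])
        labels.append((shifted, score))
    return labels


center = (2.0, 1.0, 20.0)
labels = make_pseudo_labels(center, [-2.0, 0.0, 2.0])
```

Every pseudo label projects to the same pixel as the original center, which is what makes the labels consistent with the single observed 2D box.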
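Knowledge distillation has shown great success in classification; however, it remains challenging for detection. In a typical image for detection, representations from different locations may contribute differently to the detection targets, making the distillation hard to balance. In this paper, we propose a conditional distillation framework to distill the desired knowledge, namely knowledge that is beneficial to both classification and localization for every instance. The framework introduces a learnable conditional decoding module, which retrieves information given each target instance as a query. Specifically, we encode the condition information as the query and use the teacher's representations as the key. The attention between query and key is used to measure the contribution of different features, guided by a localization-recognition-sensitive auxiliary task. Extensive experiments demonstrate the efficacy of our method: we observe impressive improvements under various settings. Notably, under the 1x schedule we boost RetinaNet with a ResNet-50 backbone from 37.4 to 40.7 mAP (+3.3). Code has been released at https://github.com/megvii-research/icd.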
Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with a simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and a single-stage detector to produce the final output detections. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. We provide a code release for CaDDN.
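The per-pixel lifting step in this abstract can be illustrated with a small sketch: a softmax over depth-bin logits gives a categorical depth distribution, and the outer product of that distribution with the pixel's feature vector places the feature into depth bins in proportion to their probability. This is a simplified, single-pixel sketch; the function names and the plain-list representation are assumptions and do not reflect the released CaDDN code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, depth_logits):
    """Outer product of a categorical depth distribution and a feature vector.

    Returns a D x C nested list: row d holds the pixel feature weighted by
    the probability that the pixel lies in depth bin d.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# A uniform distribution spreads the feature evenly across the four bins.
uniform = lift_pixel([1.0, 2.0], [0.0, 0.0, 0.0, 0.0])
# A confident prediction concentrates the feature in a single bin.
peaked = lift_pixel([1.0, 2.0], [0.0, 50.0, 0.0])
```

Because the weights sum to one, summing the rows over the depth axis recovers the original feature, so a confident prediction concentrates the feature in one bin while an uncertain one spreads it out.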
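Low-cost monocular 3D object detection plays a fundamental role in autonomous driving, yet its accuracy is still far from satisfactory. In this paper, we dig into the 3D object detection task and reformulate it as the sub-tasks of object localization and appearance perception, which helps to exploit the reciprocal information underlying the entire task. We introduce a Dynamic Feature Reflecting Network, named DFR-Net, which contains two novel standalone modules: (i) the Appearance-Localization Feature Reflecting module (ALFR), which first separates task-specific features and then self-mutually reflects the reciprocal features; and (ii) the Dynamic Intra-Trading module (DIT), which adaptively realigns the training processes of the various sub-tasks in a self-learning manner. Extensive experiments on the challenging KITTI dataset demonstrate the effectiveness and generalization of DFR-Net. We rank 1st among all monocular 3D object detectors on the KITTI test set (as of March 16, 2021). The proposed method can also be plugged into many cutting-edge 3D detection frameworks at negligible cost to boost performance. The code will be made publicly available.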
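Vision-centric BEV perception has recently received increased attention from both industry and academia due to its inherent merits, including presenting a natural representation of the world and being fusion-friendly. With the rapid development of deep learning, numerous methods have been proposed to address vision-centric BEV perception. However, there is no recent survey of this novel and growing research field. To stimulate future research, this paper presents a comprehensive survey of vision-centric BEV perception and its extensions. It collects and organizes recent knowledge, and gives a systematic review and summary of commonly used algorithms. It also provides in-depth analyses and comparative results on several BEV perception tasks, facilitating comparisons in future work and inspiring future research directions. Moreover, empirical implementation details are discussed and shown to benefit the development of related algorithms.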
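In this paper, we propose LiDAR Distillation to bridge the domain gap induced by different LiDAR beams for 3D object detection. In many real-world applications, the LiDAR sensors used by mass-produced robots and vehicles usually have fewer beams than those in large-scale public datasets. Moreover, as LiDARs are upgraded to other product models with different numbers of beams, it becomes challenging to utilize the labeled data captured by the high-resolution sensors of previous versions. Despite recent progress on domain-adaptive 3D detection, most methods struggle to eliminate the beam-induced domain gap. We find that it is essential to align the point cloud density of the source domain with that of the target domain during training. Inspired by this finding, we propose a progressive framework to mitigate the beam-induced domain shift. In each iteration, we first generate low-beam pseudo LiDAR by downsampling the high-beam point clouds. Then, a teacher-student framework is employed to distill rich information from the data with more beams. Extensive experiments on the Waymo, nuScenes, and KITTI datasets with three different LiDAR-based detectors demonstrate the effectiveness of our LiDAR Distillation. Notably, our approach adds no computation cost at inference.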
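The step of generating low-beam pseudo LiDAR from a high-beam point cloud can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's procedure: beams are assigned by uniformly binning each point's elevation angle (a real pipeline might instead use the sensor's ring index or cluster the angles), and the function names and default field-of-view values are hypothetical.

```python
import math


def beam_index(point, num_beams, min_elev_deg, max_elev_deg):
    """Assign a point (x, y, z) to a beam by uniformly binning its elevation angle."""
    x, y, z = point
    elev = math.degrees(math.atan2(z, math.hypot(x, y)))
    frac = (elev - min_elev_deg) / (max_elev_deg - min_elev_deg)
    idx = int(frac * num_beams)
    # Clamp points that fall slightly outside the assumed field of view.
    return min(max(idx, 0), num_beams - 1)


def downsample_beams(points, num_beams, keep_every,
                     min_elev_deg=-24.8, max_elev_deg=2.0):
    """Keep only points on every `keep_every`-th beam.

    The default field of view is an assumed 64-beam-style vertical range.
    """
    return [
        p for p in points
        if beam_index(p, num_beams, min_elev_deg, max_elev_deg) % keep_every == 0
    ]


# Four synthetic points at elevations -15, -5, 5, and 15 degrees.
points = [
    (10.0 * math.cos(math.radians(e)), 0.0, 10.0 * math.sin(math.radians(e)))
    for e in (-15.0, -5.0, 5.0, 15.0)
]
pseudo_low_beam = downsample_beams(points, num_beams=4, keep_every=2,
                                   min_elev_deg=-20.0, max_elev_deg=20.0)
```

With `keep_every=2`, a 64-beam cloud becomes a 32-beam pseudo cloud, which roughly matches the density alignment between source and target domains that the abstract describes.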
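Monocular 3D object detection aims to localize 3D bounding boxes in a single input 2D image. It is a highly challenging problem and remains open, especially when no extra information (e.g., depth, LiDAR, and/or multiple frames) can be leveraged in training and/or inference. This paper proposes a simple yet effective formulation for monocular 3D object detection without exploiting any extra information. It presents the MonoCon method, which learns monocular contexts as auxiliary tasks in training to help monocular 3D object detection. The key idea is that, given the annotated 3D bounding boxes of objects in an image, there is a rich set of well-posed projected 2D supervision signals available in training, such as the projected corner keypoints and their associated offset vectors with respect to the center of the 2D bounding box, which should be exploited as auxiliary tasks in training. At a high level, the proposed MonoCon is motivated by the Cramer-Wold theorem in measure theory. In implementation, it uses a very simple end-to-end design to justify the effectiveness of learning auxiliary monocular contexts. The design consists of three components: a deep neural network (DNN) based feature backbone, a number of regression head branches for learning the essential parameters used in 3D bounding box prediction, and a number of regression head branches for learning auxiliary contexts. After training, the auxiliary context regression branches are discarded for better inference efficiency. In experiments, the proposed MonoCon is tested on the KITTI benchmark (car, pedestrian, and cyclist). It outperforms all prior art on the leaderboard for the car category and obtains comparable accuracy for pedestrian and cyclist. Thanks to the simple design, the proposed MonoCon method obtains the fastest inference speed in the comparisons, at 38.7 fps.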
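Monocular 3D object detection is a fundamental yet important task for many applications, including autonomous driving, robotic grasping, and augmented reality. Existing leading methods tend to estimate the depth of the input image first and then detect 3D objects based on the resulting point cloud. This routine suffers from the inherent gap between depth estimation and object detection. In addition, the accumulation of prediction errors also affects performance. In this paper, a novel method named MonopCN is proposed. The insight behind MonopCN is to simulate, during training, the feature learning behavior of a point cloud based detector. As a result, during inference, the learned features and predictions are similar to those of a point cloud based detector. To achieve this, we propose a scene-level simulation module, an RoI-level simulation module, and a response-level simulation module, which are applied progressively across the detector's full feature learning and prediction pipeline. We apply our method to the well-known M3D-RPN and CaDDN detectors and conduct extensive experiments on the KITTI and Waymo Open datasets. Results show that our method consistently improves the performance of different monocular detectors by a large margin without changing the network architecture. Our method ultimately achieves state-of-the-art performance.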
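Knowledge distillation (KD) is a widely used technique for transferring information from cumbersome teacher models to compact student models, thereby achieving model compression and acceleration. Compared with image classification, object detection is a more complex task, and designing KD methods specific to object detection is non-trivial. In this work, we carefully study the behavioral differences between teacher and student detection models and obtain two intriguing observations. First, the teacher and the student rank their detected candidate boxes quite differently, which leads to their precision discrepancy. Second, there is a considerable gap between the feature response differences and the prediction differences between teacher and student, indicating that imitating all of the teacher's feature maps equally is a suboptimal choice for improving the student's accuracy. Based on these two observations, we propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI), respectively, for distilling one-stage detectors. RM takes the rank of candidate boxes from the teacher as a new form of knowledge to distill, which consistently outperforms traditional soft-label distillation. PFI attempts to correlate feature differences with prediction differences, so that feature imitation directly helps to improve the student's accuracy. Extensive experiments on the MS COCO and PASCAL VOC benchmarks are conducted on various detectors with different backbones to validate the effectiveness of our method. Specifically, RetinaNet with ResNet50 achieves 40.4% mAP on MS COCO, which is 3.5% higher than its baseline and also outperforms previous KD methods.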
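3D object detection with surrounding cameras is a promising direction for autonomous driving. In this paper, we present SimMOD, a simple baseline for multi-camera object detection, to solve the problem. To incorporate multi-view information as well as build upon previous efforts on monocular 3D object detection, the framework is built on sample-wise object proposals and designed to work in a two-stage manner. First, we extract multi-scale features and generate perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view and multi-scale visual features in the DETR3D style. The refined proposals are decoded end-to-end into the detection results. To further boost performance, we add auxiliary branches alongside proposal generation to enhance feature learning. In addition, we design target filtering and teacher forcing methods to promote the consistency of two-stage training. We conduct extensive experiments on the nuScenes 3D object detection benchmark to demonstrate the effectiveness of SimMOD and achieve new state-of-the-art performance. Code will be available at https://github.com/zhangyp15/simmod.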
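Conventional knowledge distillation (KD) methods for object detection mainly concentrate on homogeneous teacher-student detectors. However, the design of a lightweight detector for deployment often differs significantly from that of a high-capacity detector. We therefore investigate KD between heterogeneous teacher-student pairs for wider applicability. We observe that the core difficulty of heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors, caused by their different optimization manners. Conventional homogeneous KD (homo-KD) methods suffer from this gap and struggle to obtain satisfactory performance directly in hetero-KD. In this paper, we propose the HEtero-Assists Distillation (HEAD) framework, which leverages heterogeneous detection heads as assistants to guide the optimization of the student detector and reduce this gap. In HEAD, the assistant is an additional detection head, with an architecture homogeneous to the teacher head, attached to the student backbone. Thus, hetero-KD is transformed into homo-KD, allowing efficient knowledge transfer from the teacher to the student. Moreover, we extend HEAD to a Teacher-Free HEAD (TF-HEAD) framework for cases where a well-trained teacher detector is unavailable. Our method achieves significant improvements over current detection KD methods. For example, on the MS-COCO dataset, TF-HEAD helps R18 RetinaNet achieve 33.9 mAP (+2.2), while HEAD further pushes the limit to 36.2 mAP (+4.5).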
To address the lack of depth information in images and the poor detection accuracy of monocular 3D object detection, we propose an instance-depth-based multi-scale monocular 3D object detection method. First, to enhance the model's ability to handle targets of different scales, we design a multi-scale perception module based on dilated convolution. The depth features containing multi-scale information are then re-refined along both the spatial and channel dimensions, taking into account the inconsistency between feature maps of different scales. Second, to give the model better 3D perception, we propose to use instance depth information as an auxiliary learning task that enhances the spatial depth features of the 3D target, and we use sparse instance depth to supervise the auxiliary task. Finally, we verify the proposed algorithm on the KITTI test set and evaluation set. The experimental results show that, compared with the baseline method, the proposed method improves AP40 in the car category by 5.27\%, effectively improving the detection performance of the monocular 3D object detection algorithm.
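In this paper, we propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation). Previous studies rely on a strong pretrained teacher to provide instructive knowledge, which may be unavailable in real-world scenarios. Instead, we generate instructive knowledge through inter-object and intra-object relation modeling, requiring only student representations and regular labels. Concretely, our framework involves sparse label-appearance encoding, inter-object relation adaptation, and intra-object knowledge mapping to obtain the instructive knowledge. Together, they form an implicit teacher during the training phase, dynamically dependent on the labels and the evolving student representations. The modules in LGD are trained end-to-end with the student detector and are discarded at inference. Experimentally, LGD obtains decent results on various detectors, datasets, and a broad range of tasks such as instance segmentation. For example, on the MS-COCO dataset, LGD improves RetinaNet with ResNet-50 under 2x single-scale training from 36.2% to 39.0% mAP (+2.8%). It also boosts much stronger detectors such as FCOS with ResNeXt-101 DCN v2 under 2x multi-scale training from 46.1% to 47.9% (+1.8%). Compared with the classical teacher-based method FGFI, LGD not only performs better without requiring a pretrained teacher but also reduces the training cost beyond inherent student learning by 51%.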
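Detecting 3D objects from point clouds is a practical yet challenging task that has attracted increasing attention recently. In this paper, we propose a Label-Guided auxiliary training method for 3D object detection (LG3D), which serves as an auxiliary network to enhance the feature learning of existing 3D object detectors. Specifically, we propose two novel modules: a Label-Annotation-Inducer, which maps the annotations and the point clouds inside bounding boxes to task-specific representations, and a Label-Knowledge-Mapper, which helps the original features obtain detection-critical representations. The proposed auxiliary network is discarded at inference and thus incurs no extra computational cost at test time. We conduct extensive experiments on both indoor and outdoor datasets to verify the effectiveness of our approach. For example, our proposed LG3D improves VoteNet by 2.5% and 3.1% mAP on the SUN RGB-D and ScanNetV2 datasets, respectively.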
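3D object detection in autonomous driving aims to reason about "what" and "where" the objects of interest are in a 3D world. Following the conventional wisdom of previous 2D object detection, existing methods often adopt the canonical Cartesian coordinate system with perpendicular axes. However, we argue that this does not fit the nature of the ego car's perspective, as each onboard camera perceives the world as a wedge shape intrinsic to the imaging geometry, with radial (non-perpendicular) axes. Hence, in this paper we advocate the use of the polar coordinate system and propose a new Polar Transformer (PolarFormer) for more accurate 3D object detection in the bird's-eye-view (BEV), taking only multi-camera 2D images as input. Specifically, we design a cross-attention based polar detection head that places no restriction on the shape of the input structure, in order to handle irregular polar grids. To tackle the unconstrained object scale variations along the polar distance dimension, we further introduce a multi-scale polar representation learning strategy. As a result, our model can make the best use of the polar representation by attending to the corresponding image observations in a sequence-to-sequence fashion, subject to geometric constraints. Thorough experiments on the nuScenes dataset demonstrate that PolarFormer significantly outperforms state-of-the-art 3D object detection alternatives and yields competitive performance on the BEV semantic segmentation task.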