Most autonomous vehicles are equipped with LiDAR sensors and stereo cameras. The former is very accurate but generates sparse data, whereas the latter is dense and carries rich texture and color information, yet it is difficult to extract robust 3D representations from it. In this paper, we propose a novel data fusion algorithm that combines the accurate LiDAR point cloud with the dense but less accurate point cloud obtained from stereo pairs. We develop a framework that integrates this algorithm into various 3D object detection methods. Our framework starts with 2D detections in both RGB images, computes the frustums and their intersection, creates pseudo-LiDAR data from the stereo images, and fills the parts of the intersection region where LiDAR data is missing with dense pseudo-LiDAR points. We train multiple 3D object detection methods and show that our fusion strategy consistently improves detector performance.
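As a rough sketch of the fusion step described above, the snippet below densifies the frustum-intersection region of a sparse LiDAR cloud with pseudo-LiDAR points; the function name, the boolean frustum masks, and the density threshold are our assumptions, not the paper's implementation.

```python
# Hypothetical sketch: inside the intersection of the two detection frustums,
# densify sparse LiDAR with pseudo-LiDAR points when coverage is too thin.
import numpy as np

def fuse_point_clouds(lidar_pts, pseudo_pts, in_frustum_lidar, in_frustum_pseudo,
                      min_lidar_pts=50):
    """lidar_pts: (N,3), pseudo_pts: (M,3); the boolean masks mark membership in
    the frustum-intersection region computed from the two 2D detections."""
    region_lidar = lidar_pts[in_frustum_lidar]
    # If LiDAR coverage of the region is too sparse, add dense pseudo-LiDAR points.
    if region_lidar.shape[0] < min_lidar_pts:
        fill = pseudo_pts[in_frustum_pseudo]
        return np.concatenate([lidar_pts, fill], axis=0)
    return lidar_pts
```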
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies - a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations - essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance - raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
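The depth-map-to-pseudo-LiDAR conversion mentioned above is essentially a pinhole back-projection. A minimal sketch, assuming intrinsics (fu, fv, cu, cv) and a metric depth map, might look like this (variable names are ours):

```python
# Back-project every pixel of a depth map into a 3D point in camera coordinates.
import numpy as np

def depth_to_pseudo_lidar(depth, fu, fv, cu, cv):
    """depth: (H, W) metric depth map; returns (H*W, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cu) * z / fu   # horizontal back-projection
    y = (v - cv) * z / fv   # vertical back-projection
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```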
The proposal of the pseudo-LiDAR representation has significantly narrowed the gap between vision-based and LiDAR-based 3D object detection. However, current research focuses only on pushing the accuracy of pseudo-LiDAR by exploiting complex and time-consuming neural networks; the deeper characteristics of the pseudo-LiDAR representation are seldom explored for further gains. In this paper, we dive deep into the pseudo-LiDAR representation and argue that the performance of 3D object detection does not fully depend on high-precision stereo depth estimation. We demonstrate that even with unreliable depth estimates, comparable 3D object detection accuracy can be achieved through proper data processing and refinement. With this finding, we further show the possibility of using a fast but inaccurate stereo matching algorithm in the pseudo-LiDAR system to achieve low-latency responses. In the experiments, we develop a system with a less capable stereo matching predictor and adopt the proposed refinement scheme to improve the accuracy. Evaluation on the KITTI benchmark shows that the proposed system achieves accuracy competitive with state-of-the-art methods using only 23 ms of computation, indicating that it is a suitable candidate for deployment in real-car applications.
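To illustrate the "fast but inaccurate stereo matching" idea, one could plug in a classical block matcher such as OpenCV's StereoBM and convert disparity to depth; the parameters below are placeholders, not the system described in the abstract.

```python
# Coarse stereo depth from a classical block matcher; f is the focal length in
# pixels and b the stereo baseline in meters.
import cv2
import numpy as np

def fast_stereo_depth(left_gray, right_gray, f, b, num_disp=128, block=9):
    matcher = cv2.StereoBM_create(numDisparities=num_disp, blockSize=block)
    disp = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0  # fixed-point -> pixels
    disp[disp <= 0] = np.nan      # mark invalid matches
    return f * b / disp           # depth = f * baseline / disparity
```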
It is well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantically rich stereo images would benefit 3D object detection. However, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent proposals typically project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. This approach, however, often suffers from the mismatch between the resolutions of the point cloud and the RGB image, leading to sub-optimal performance. Specifically, using the sparse points as the locations for multi-modal data aggregation causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates point cloud and image data at "virtual" points. In particular, with a density lying between that of the 3D points and the 2D pixels, virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. We also investigate data augmentation techniques that can be applied to both point clouds and RGB images, since data augmentation has so far made a non-negligible contribution to 3D object detectors. We conduct extensive experiments on the KITTI dataset and observe good performance compared to state-of-the-art methods. Notably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking first as of May 21, 2021. The network design also takes computational efficiency into account: we achieve 15 FPS on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.
There have been attempts to detect 3D objects by fusing stereo camera images and LiDAR sensor data, or by using LiDAR only for pre-training and monocular images alone for testing, but few attempts use only monocular image sequences because of their lower accuracy. Moreover, depth prediction from monocular images alone can only recover scale-inconsistent depth, which is why researchers are reluctant to rely on monocular images by themselves. We therefore propose a method that predicts absolute depth and detects 3D objects using only monocular image sequences, by enabling end-to-end learning of the detection network and the depth prediction network. As a result, the proposed method surpasses other existing methods on the KITTI 3D dataset. Even compared with methods that use monocular images together with 3D LiDAR during training to boost performance, ours shows the best performance among methods with the same input. Furthermore, end-to-end learning not only improves depth prediction but also enables absolute depth prediction, because our network exploits the fact that 3D objects such as cars have roughly known physical sizes.
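The scale cue mentioned at the end, that objects such as cars have roughly known physical sizes, can be illustrated with the pinhole relation depth ≈ f · H_real / h_pixel; the prior height used below is an assumed value.

```python
# Toy illustration: a known object height fixes an absolute depth estimate.
def depth_from_object_height(focal_px, pixel_height, real_height_m=1.5):
    """Pinhole model: depth ≈ f * H_real / h_pixel. real_height_m is an assumed prior."""
    return focal_px * real_height_m / pixel_height

# e.g. a 1.5 m tall car spanning 60 px under a 720 px focal length sits near 18 m.
print(depth_from_object_height(720.0, 60.0))  # -> 18.0
```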
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and bird's eye view object detection, while being real-time.
LiDAR-based sensing drives current autonomous vehicles. Despite rapid progress, current LiDAR sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensor are easily visible, but far-away or small objects comprise only one or two measurements. This is a problem, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in the onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into LiDAR-based 3D recognition. Our approach takes a set of 2D detections and generates dense 3D virtual points to augment an otherwise sparse 3D point cloud. These virtual points integrate naturally into any standard LiDAR-based 3D detector alongside regular LiDAR measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/
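A simplified sketch of generating virtual points from a single 2D detection might sample pixels inside the box, borrow depth from the nearest projected LiDAR point, and unproject them; the nearest-neighbour completion and all names below are our assumptions, not the released implementation.

```python
# Generate 3D "virtual points" for one 2D box by depth completion from projected LiDAR.
import numpy as np

def virtual_points_from_box(box, lidar_uv, lidar_depth, fu, fv, cu, cv, n_samples=50):
    """box: (x1, y1, x2, y2); lidar_uv: (N,2) projected LiDAR pixels; lidar_depth: (N,)."""
    x1, y1, x2, y2 = box
    u = np.random.uniform(x1, x2, n_samples)
    v = np.random.uniform(y1, y2, n_samples)
    # Depth completion via nearest projected LiDAR point (brute force for clarity).
    d2 = (lidar_uv[:, 0, None] - u) ** 2 + (lidar_uv[:, 1, None] - v) ** 2
    z = lidar_depth[np.argmin(d2, axis=0)]
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    return np.stack([x, y, z], axis=-1)   # (n_samples, 3) virtual points
```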
In recent years, 3D object detection from LiDAR point clouds has made great progress thanks to the development of deep learning techniques. Although voxel-based and point-based methods are popular for 3D object detection, they usually involve time-consuming operations such as 3D convolutions over voxels or ball queries among points, making the resulting networks unsuitable for time-critical applications. On the other hand, 2D view-based methods feature high computational efficiency but usually achieve lower performance than voxel- or point-based methods. In this work, we present a real-time, view-based, single-stage 3D object detector, namely CVFNet, to fulfill this task. To strengthen cross-view feature learning under demanding efficiency constraints, our framework extracts features from different views and fuses them in an efficient progressive manner. We first propose a novel point-range feature fusion module that deeply integrates point and range view features over multiple stages. Then, a special slice pillar design is used to maintain the 3D geometry well when transforming the obtained deep point-view features into the bird's eye view. To better balance the sample ratio, a sparse pillar detection head is proposed to focus detection on non-empty grids. We conduct experiments on the popular KITTI and nuScenes benchmarks and achieve state-of-the-art performance in both accuracy and speed.
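For context on the range view used by the point-range fusion module, the following is a generic spherical projection of a LiDAR cloud onto a range image; the image size and field-of-view bounds are placeholder values rather than CVFNet's configuration.

```python
# Generic spherical (range-view) projection of a point cloud onto an H x W image.
import numpy as np

def range_view_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3); returns integer (row, col) pixel indices plus per-point range."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-6
    yaw = np.arctan2(y, x)                          # azimuth
    pitch = np.arcsin(z / r)                        # elevation
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    col = ((0.5 * (yaw / np.pi + 1.0)) * w).astype(int) % w
    row = ((1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * h).astype(int)
    row = np.clip(row, 0, h - 1)
    return row, col, r
```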
We present DeepFusion, a modular multi-modal architecture that fuses LiDAR, camera, and radar in different combinations for 3D object detection. Specialized feature extractors exploit each modality and can be exchanged easily, making the approach simple and flexible. The extracted features are transformed into a bird's eye view as the common representation for fusion. Spatial and semantic alignment is performed before fusing the modalities in feature space. Finally, a detection head exploits the rich multi-modal features to improve 3D detection performance. Experimental results for LiDAR-camera, LiDAR-camera-radar, and camera-radar fusion show the flexibility and effectiveness of our fusion approach. In the process, we study the largely unexplored task of detecting distant cars up to 225 meters away, showing the benefits of LiDAR-camera fusion. Furthermore, we investigate the density of LiDAR points required for 3D object detection and illustrate the implications with an example on robustness to adverse weather conditions. Moreover, ablation studies on our camera-radar fusion highlight the importance of accurate depth estimation.
LiDAR-based 3D object detectors have achieved impressive performance on many benchmarks; however, multi-sensor fusion-based techniques are promising to further improve the results. PointPainting, a recently proposed framework, adds semantic information from the 2D image onto the 3D LiDAR points through a painting operation to boost detection performance. However, due to the limited resolution of 2D feature maps, a severe boundary-blurring effect occurs when re-projecting 2D semantic segmentation into the 3D point cloud. To handle this limitation, a general multimodal fusion framework, MSF, is proposed to fuse the semantic information from both 2D image and 3D point cloud scene parsing results. Specifically, MSF includes three main modules. First, SOTA off-the-shelf 2D/3D semantic segmentation approaches are employed to generate the parsing results for 2D images and 3D point clouds. The 2D semantic information is then re-projected onto the 3D point cloud using the calibration parameters. To handle the misalignment between the 2D and 3D parsing results, an AAF module is proposed to fuse them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then sent to the subsequent 3D object detector. Furthermore, we propose a DFF module to aggregate deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparing with different baselines. The experimental results show that the proposed fusion strategies can significantly improve detection performance compared to methods using only point clouds or only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets new SOTA results on the nuScenes testing benchmark.
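As a minimal sketch of the painting operation referred to above, LiDAR points can be projected into the image with the calibration matrix and concatenated with per-pixel semantic scores; the projection convention and names below are assumptions.

```python
# Append per-pixel semantic scores to each LiDAR point that projects into the image.
import numpy as np

def paint_points(points, seg_scores, P):
    """points: (N, 3) LiDAR xyz; seg_scores: (H, W, C) 2D semantic scores;
    P: (3, 4) projection matrix from LiDAR to image coordinates."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])      # (N, 4)
    uvw = homo @ P.T                                               # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    h, w, _ = seg_scores.shape
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    painted = np.zeros((points.shape[0], seg_scores.shape[2]))
    painted[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, painted])                            # (N, 3 + C)
```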
We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.
Camera-based 3D object detectors are welcome for their lower cost and wider deployment compared with LiDAR sensors. We first revisit the prior stereo detector DSGN for its way of constructing stereo volumes that represent 3D geometry and semantics. We polish the stereo modeling and propose an advanced version, DSGN++, aiming to enhance effective information flow throughout the whole 2D-to-3D pipeline in three main aspects. First, to effectively lift 2D information into the stereo volume, we propose depth-wise plane sweeping (DPS), which allows denser connections and extracts depth-guided features. Second, to grasp features of different spacings, we present a novel stereo volume, the Dual-view Stereo Volume (DSV), which integrates front-view and top-view features and reconstructs sub-voxel depth in the camera frustum. Third, as the foreground region becomes less dominant in 3D space, we propose a multi-modal data editing strategy, stereo-LiDAR copy-paste, which ensures cross-modal alignment and improves data efficiency. Without bells and whistles, extensive experiments in various modality setups on the popular KITTI benchmark show that our method consistently outperforms camera-based 3D detectors across all categories. Code is available at https://github.com/chenyilun95/dsgn2.
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.
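A toy version of the pseudo-label generation described above could slide a 3D box centre along its viewing ray so that the 2D projection stays roughly unchanged; the offsets and the exponential quality score below are illustrative assumptions, not the authors' label-scoring strategies.

```python
# Generate depth-shifted pseudo labels along the viewing ray of a 3D box centre.
import numpy as np

def obmo_pseudo_labels(center, depth_offsets=(-1.0, -0.5, 0.5, 1.0)):
    """center: (3,) box centre in camera coordinates; returns shifted centres with
    a simple quality score that decays with the size of the shift."""
    ray = center / np.linalg.norm(center)          # unit viewing ray through the centre
    labels = []
    for off in depth_offsets:
        shifted = center + off * ray
        score = float(np.exp(-abs(off)))           # soft label quality: smaller shift -> higher
        labels.append((shifted, score))
    return labels
```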
Monocular 3D object detection is a fundamental yet very important task for many applications, including autonomous driving, robotic grasping, and augmented reality. Existing leading methods tend to first estimate depth from the input image and then detect 3D objects based on the resulting point cloud. This routine suffers from the inherent gap between depth estimation and object detection, and the accumulation of prediction errors also hurts performance. In this paper, a new method named MonoPCN is proposed. The insight behind it is that we propose to simulate the feature learning behavior of a point-cloud-based detector during training, so that during inference the learned features and predictions resemble those of a point-cloud-based detector. To achieve this, we propose a scene-level simulation module, an RoI-level simulation module, and a response-level simulation module, which are applied progressively over the detector's full feature learning and prediction pipeline. We apply our method to the well-known M3D-RPN and CaDDN detectors and conduct extensive experiments on the KITTI and Waymo Open datasets. The results show that our method consistently improves the performance of different monocular detectors by various margins without changing the network architecture, and it finally achieves state-of-the-art performance.
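The simulation idea can be pictured as a feature-imitation loss between the monocular detector and a frozen point-cloud-based teacher; the sketch below is our own simplification and does not reproduce the paper's three-level modules.

```python
# Simple L2 feature-imitation loss between a monocular student and a frozen teacher.
import torch
import torch.nn.functional as F

def simulation_loss(student_feat, teacher_feat, weight=1.0):
    """student_feat / teacher_feat: (B, C, H, W) feature maps of matching shape;
    the teacher features are detached so only the student is updated."""
    return weight * F.mse_loss(student_feat, teacher_feat.detach())
```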
Many LiDAR-based methods still perform well for detecting large objects, for single-class object detection, or in easy scenarios. However, because they fail to exploit image semantics, their performance at detecting small objects or in difficult scenarios does not surpass that of fusion-based methods. To elevate detection performance in complex environments, this paper proposes a deep-learning-embedded, fusion-based multi-class 3D object detection network that accepts both LiDAR and camera sensor data streams, named the Voxel-Pixel Fusion Network (VPFNet). Inside this network, the key novel component is the Voxel-Pixel Fusion (VPF) layer, which exploits the geometric relation of a voxel-pixel pair and fuses voxel features and pixel features with an appropriate mechanism. Moreover, several parameters are specifically designed to guide and strengthen the fusion effect, taking the characteristics of voxel-pixel pairs into account. Finally, the proposed method is evaluated on the KITTI benchmark for the multi-class 3D object detection task across multiple difficulty levels and is shown to outperform all state-of-the-art methods in mean average precision (mAP). It is also noteworthy that our method ranks first on the challenging pedestrian class.
This paper aims at high-accuracy 3D object detection in autonomous driving scenario. We propose Multi-View 3D networks (MV3D), a sensory-fusion framework that takes both LIDAR point cloud and RGB images as input and predicts oriented 3D bounding boxes. We encode the sparse 3D point cloud with a compact multi-view representation. The network is composed of two subnetworks: one for 3D object proposal generation and another for multi-view feature fusion. The proposal network generates 3D candidate boxes efficiently from the bird's eye view representation of 3D point cloud. We design a deep fusion scheme to combine region-wise features from multiple views and enable interactions between intermediate layers of different paths. Experiments on the challenging KITTI benchmark show that our approach outperforms the state-of-the-art by around 25% and 30% AP on the tasks of 3D localization and 3D detection. In addition, for 2D detection, our approach obtains 14.9% higher AP than the state-of-the-art on the hard data among the LIDAR-based methods.
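For reference, the bird's-eye-view encoding that such proposal networks operate on can be sketched as binning points into a grid of height and density maps; the grid extents and resolution below are placeholder values, not MV3D's exact encoding.

```python
# Bin a LiDAR cloud into bird's-eye-view height and density maps.
import numpy as np

def bev_maps(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)
    height = np.full((nx, ny), -np.inf)
    density = np.zeros((nx, ny))
    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    valid = (xi >= 0) & (xi < nx) & (yi >= 0) & (yi < ny)
    for i, j, z in zip(xi[valid], yi[valid], points[valid, 2]):
        height[i, j] = max(height[i, j], z)   # max height per cell
        density[i, j] += 1                    # point count per cell
    height[np.isinf(height)] = 0.0
    return height, density
```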
3D object detection with multiple sensors is essential for accurate and reliable perception systems in autonomous driving and robotics. Existing 3D detectors significantly improve accuracy by adopting a two-stage paradigm that relies solely on LiDAR point clouds for the refinement of 3D proposals. Though impressive, the sparsity of point clouds, especially for far-away points, makes it difficult for a LiDAR-only refinement module to recognize and localize objects accurately. To address this problem, we propose a novel multi-modal two-stage approach, FusionRCNN, that effectively and efficiently fuses point clouds and camera images within regions of interest (RoI). FusionRCNN adaptively integrates the sparse geometric information from LiDAR and the dense texture information from the camera in a unified attention mechanism. Specifically, in the RoI extraction step it first utilizes RoIPooling to obtain image crops of a unified size and gathers point sets by sampling raw points within the proposals; it then leverages intra-modality self-attention to enhance the domain-specific features, followed by a carefully designed cross-attention to fuse the information from the two modalities. FusionRCNN is fundamentally plug-and-play and supports different one-stage methods with almost no architectural changes. Extensive experiments on the KITTI and Waymo benchmarks demonstrate that our method significantly boosts the performance of popular detectors. Remarkably, FusionRCNN improves the strong SECOND baseline by 6.14% mAP on Waymo and outperforms competing two-stage approaches. The code will be released soon at https://github.com/xxlbigbrother/fusion-rcnn.
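A rough sketch of the RoI-level attention fusion described above: self-attention within each modality followed by cross-attention from point tokens to image tokens; the layer sizes and single-layer depth are our assumptions rather than the paper's configuration.

```python
# Minimal RoI-level fusion: per-modality self-attention, then point-to-image cross-attention.
import torch
import torch.nn as nn

class RoICrossFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.point_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_tokens, image_tokens):
        """point_tokens: (B, Np, C) sampled points per RoI; image_tokens: (B, Ni, C)."""
        p, _ = self.point_self(point_tokens, point_tokens, point_tokens)
        i, _ = self.image_self(image_tokens, image_tokens, image_tokens)
        fused, _ = self.cross(p, i, i)   # queries from points, keys/values from image
        return fused
```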
3D object detection from LiDAR or camera sensors is essential for autonomous driving. Pioneering attempts at multi-modal fusion complement sparse LiDAR point clouds with rich semantic texture information from images, at the cost of extra network designs and overhead. In this work, we propose a novel semantic passing framework, named SPNet, to boost the performance of existing LiDAR-based 3D detection models with the guidance of rich context painting, at no extra computation cost during inference. Our key design is to first exploit the instructive semantic knowledge latent in the ground-truth labels by training a semantically painted teacher model, and then guide the pure-LiDAR network to learn the semantically painted representation via knowledge passing modules at different granularities: class-wise passing, pixel-wise passing, and instance-wise passing. Experimental results show that the proposed SPNet can seamlessly cooperate with most existing 3D detection frameworks with a 1-5% AP gain, and it even achieves new state-of-the-art 3D detection performance on the KITTI test benchmark. Code is available at: https://github.com/jb892/spnet.
Monocular 3D object detection is a critical yet challenging task for autonomous driving, due to the lack of the accurate depth information that LiDAR sensors capture. In this paper, we propose a stereo-guided monocular 3D object detection network, termed SGM3D, which leverages robust 3D features extracted from stereo images to enhance the features learned from monocular images. We innovatively investigate a multi-granularity domain adaptation module (MG-DA) to exploit the network's ability to generate stereo-mimicking features based only on monocular cues. Coarse BEV feature-level as well as fine anchor-level domain adaptation are leveraged to guide the monocular branch. We present an IoU-matching-based alignment module (IoU-MA) for object-level domain adaptation between the stereo and monocular domains to alleviate mismatches from the previous stages. We conduct extensive experiments on the most challenging KITTI and Lyft datasets and achieve new state-of-the-art performance. Furthermore, our method can be integrated into many other monocular approaches to boost their performance without introducing any extra computational cost.