在本文中,我们提出了PETRV2,这是来自多视图图像的3D感知统一框架。基于PETR,PETRV2探讨了时间建模的有效性,该时间建模利用先前帧的时间信息来增强3D对象检测。更具体地说,我们扩展了PETR中的3D位置嵌入(3D PE)进行时间建模。 3D PE可以在不同帧的对象位置上实现时间对齐。进一步引入了特征引导的位置编码器,以提高3D PE的数据适应性。为了支持高质量的BEV分割,PETRV2通过添加一组分割查询提供了简单而有效的解决方案。每个分割查询负责分割BEV映射的一个特定补丁。 PETRV2在3D对象检测和BEV细分方面实现了最先进的性能。在PETR框架上还进行了详细的鲁棒性分析。我们希望PETRV2可以作为3D感知的强大基准。代码可在\ url {https://github.com/megvii-research/petr}中获得。
translated by 谷歌翻译
在本文中,我们开发了用于多视图3D对象检测的位置嵌入转换(PETR)。PETR将3D坐标的位置信息编码为图像特征,从而产生3D位置感知功能。对象查询可以感知3D位置感知功能并执行端到端对象检测。PETR在标准Nuscenes数据集上实现了最先进的性能(50.4%NDS和44.1%的地图),并在基准中排名第一。它可以作为未来研究的简单但强大的基准。代码可在\ url {https://github.com/megvii-research/petr}中获得。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
3D视觉感知任务,包括基于多相机图像的3D检测和MAP分割,对于自主驾驶系统至关重要。在这项工作中,我们提出了一个称为BeVformer的新框架,该框架以时空变压器学习统一的BEV表示,以支持多个自主驾驶感知任务。简而言之,Bevormer通过通过预定义的网格形BEV查询与空间和时间空间进行交互来利用空间和时间信息。为了汇总空间信息,我们设计了空间交叉注意,每个BEV查询都从相机视图中从感兴趣的区域提取了空间特征。对于时间信息,我们提出暂时的自我注意力,以将历史bev信息偶尔融合。我们的方法在Nuscenes \ texttt {test} set上,以NDS度量为单位达到了新的最新56.9 \%,该设置比以前的最佳艺术高9.0分,并且与基于LIDAR的盆地的性能相当。我们进一步表明,BeVormer明显提高了速度估计的准确性和在低可见性条件下对象的回忆。该代码可在\ url {https://github.com/zhiqi-li/bevformer}中获得。
translated by 谷歌翻译
以视觉为中心的BEV感知由于其固有的优点,最近受到行业和学术界的关注,包括展示世界自然代表和融合友好。随着深度学习的快速发展,已经提出了许多方法来解决以视觉为中心的BEV感知。但是,最近没有针对这个小说和不断发展的研究领域的调查。为了刺激其未来的研究,本文对以视觉为中心的BEV感知及其扩展进行了全面调查。它收集并组织了最近的知识,并对常用算法进行了系统的综述和摘要。它还为几项BEV感知任务提供了深入的分析和比较结果,从而促进了未来作品的比较并激发了未来的研究方向。此外,还讨论了经验实现细节并证明有利于相关算法的开发。
translated by 谷歌翻译
一个自动驾驶感知模型旨在将3D语义表示从多个相机集体提取到自我汽车的鸟类视图(BEV)坐标框架中,以使下游规划师接地。现有的感知方法通常依赖于整个场景的容易出错的深度估计,或者学习稀疏的虚拟3D表示没有目标几何结构,这两者在性能和/或能力上仍然有限。在本文中,我们介绍了一种新颖的端到端体系结构,用于自我3D表示从任意数量的无限摄像机视图中学习。受射线追踪原理的启发,我们将“想象眼睛”的两极分化网格设计为可学习的自我3D表示,并通过适应性注意机制与3D到2D投影一起以自适应注意机制的形式制定学习过程。至关重要的是,该公式允许从2D图像中提取丰富的3D表示,而无需任何深度监督,并且内置的几何结构一致W.R.T. bev。尽管具有简单性和多功能性,但对标准BEV视觉任务(例如,基于摄像机的3D对象检测和BEV细分)进行了广泛的实验表明,我们的模型的表现均优于所有最新替代方案,从多任务学习。
translated by 谷歌翻译
在鸟眼中学习强大的表现(BEV),以进行感知任务,这是趋势和吸引行业和学术界的广泛关注。大多数自动驾驶算法的常规方法在正面或透视视图中执行检测,细分,跟踪等。随着传感器配置变得越来越复杂,从不同的传感器中集成了多源信息,并在统一视图中代表功能至关重要。 BEV感知继承了几个优势,因为代表BEV中的周围场景是直观和融合友好的。对于BEV中的代表对象,对于随后的模块,如计划和/或控制是最可取的。 BEV感知的核心问题在于(a)如何通过从透视视图到BEV来通过视图转换来重建丢失的3D信息; (b)如何在BEV网格中获取地面真理注释; (c)如何制定管道以合并来自不同来源和视图的特征; (d)如何适应和概括算法作为传感器配置在不同情况下各不相同。在这项调查中,我们回顾了有关BEV感知的最新工作,并对不同解决方案进行了深入的分析。此外,还描述了该行业的BEV方法的几种系统设计。此外,我们推出了一套完整的实用指南,以提高BEV感知任务的性能,包括相机,激光雷达和融合输入。最后,我们指出了该领域的未来研究指示。我们希望该报告能阐明社区,并鼓励对BEV感知的更多研究。我们保留一个活跃的存储库来收集最新的工作,并在https://github.com/openperceptionx/bevperception-survey-recipe上提供一包技巧的工具箱。
translated by 谷歌翻译
伯德眼景(BEV)中的语义细分是自动驾驶的重要任务。尽管这项任务吸引了大量的研究工作,但灵活应对在自动驾驶汽车上配备的任意(单个或多个)摄像头传感器仍然具有挑战性。在本文中,我们介绍了BEVSEGFORMER,这是一种有效的基于变压器的方法,用于从任意摄像机钻机中进行BEV语义分割。具体而言,我们的方法首先编码带有共享骨架的任意摄像机的图像功能。然后,这些图像功能通过基于变压器的编码器增强。此外,我们引入了BEV变压器解码器模块以解析BEV语义分割结果。有效的多相机可变形注意单元旨在进行BEV-to-to-image视图转换。最后,查询是根据BEV中网格的布局重塑的,并以监督方式进行了更大的采样以产生语义分割结果。我们在公共Nuscenes数据集和自收集的数据集上评估了所提出的算法。实验结果表明,我们的方法在任意摄像机钻机上实现了BEV语义分割的有希望的性能。我们还通过消融研究证明了每个组件的有效性。
translated by 谷歌翻译
3D object detection with surround-view images is an essential task for autonomous driving. In this work, we propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images. We design a novel projective cross-attention mechanism for query-image interaction to address the limitations of existing methods in terms of geometric cue exploitation and information loss for cross-view objects. In addition, we introduce a heatmap generation technique that bridges 3D and 2D spaces efficiently via query initialization. Furthermore, unlike the common practice of fusing intermediate spatial features for temporal aggregation, we provide a new perspective by introducing a novel hybrid approach that performs cross-frame fusion over past object queries and image features, enabling efficient and robust modeling of temporal information. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of the proposed DETR4D.
translated by 谷歌翻译
Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works are proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.
translated by 谷歌翻译
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR with instance-guided supervision and spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the consumption of global attention. Due to the highly parallelized implementation and down-sampling strategy, our model, without depth supervision, achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code will be made publicly available.
translated by 谷歌翻译
Motion prediction is highly relevant to the perception of dynamic objects and static map elements in the scenarios of autonomous driving. In this work, we propose PIP, the first end-to-end Transformer-based framework which jointly and interactively performs online mapping, object detection and motion prediction. PIP leverages map queries, agent queries and mode queries to encode the instance-wise information of map elements, agents and motion intentions, respectively. Based on the unified query representation, a differentiable multi-task interaction scheme is proposed to exploit the correlation between perception and prediction. Even without human-annotated HD map or agent's historical tracking trajectory as guidance information, PIP realizes end-to-end multi-agent motion prediction and achieves better performance than tracking-based and HD-map-based methods. PIP provides comprehensive high-level information of the driving scene (vectorized static map and dynamic objects with motion information), and contributes to the downstream planning and control. Code and models will be released for facilitating further research.
translated by 谷歌翻译
3D车道检测是自动驾驶系统的组成部分。以前的CNN和基于变压器的方法通常首先从前视图图像中生成鸟类视图(BEV)特征映射,然后使用带有BEV功能映射的子网络作为输入来预测3D车道。这种方法需要在BEV和前视图之间进行明确的视图转换,这本身仍然是一个具有挑战性的问题。在本文中,我们提出了一种基于单阶段变压器的方法,该方法直接计算3D车道参数并可以规避困难的视图变换步骤。具体而言,我们通过使用曲线查询来将3D车道检测作为曲线传播问题。 3D车道查询由动态和有序的锚点集表示。通过这种方式,在变压器解码器迭代中具有曲线表示的查询可完善3D车道检测结果。此外,引入了曲线交叉意见模块,以计算曲线查询和图像特征之间的相似性。此外,提供了可以捕获曲线查询更多相对图像特征的上下文采样模块,以进一步提高3D车道检测性能。我们评估了合成数据集和现实数据集的3D车道检测方法,实验结果表明,与最先进的方法相比,我们的方法实现了有希望的性能。每个组件的有效性也通过消融研究验证。
translated by 谷歌翻译
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection, which has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information makes current approaches adopt large models for good performance. Therefore, it is essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques to train efficient yet accurate models. However, BEV KD is still under-explored to the best of our knowledge. Different from image classification tasks, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer the knowledge in the teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfying results due to heavy background information in RGB cameras. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate the foreground mask and view-dependent mask for the teacher-student paradigm. It is to be noted that our method only uses LiDAR points to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation to our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
translated by 谷歌翻译
LiDAR and camera are two essential sensors for 3D object detection in autonomous driving. LiDAR provides accurate and reliable 3D geometry information while the camera provides rich texture with color. Despite the increasing popularity of fusing these two complementary sensors, the challenge remains in how to effectively fuse 3D LiDAR point cloud with 2D camera images. Recent methods focus on point-level fusion which paints the LiDAR point cloud with camera features in the perspective view or bird's-eye view (BEV)-level fusion which unifies multi-modality features in the BEV representation. In this paper, we rethink these previous fusion strategies and analyze their information loss and influences on geometric and semantic features. We present SemanticBEVFusion to deeply fuse camera features with LiDAR features in a unified BEV representation while maintaining per-modality strengths for 3D object detection. Our method achieves state-of-the-art performance on the large-scale nuScenes dataset, especially for challenging distant objects. The code will be made publicly available.
translated by 谷歌翻译
Bird's Eye View(BEV)语义分割在自动驾驶的空间传感中起着至关重要的作用。尽管最近的文献在BEV MAP的理解上取得了重大进展,但它们都是基于基于摄像头的系统,这些系统难以处理遮挡并检测复杂的交通场景中的遥远对象。车辆到车辆(V2V)通信技术使自动驾驶汽车能够共享感应信息,与单代理系统相比,可以显着改善感知性能和范围。在本文中,我们提出了Cobevt,这是可以合作生成BEV MAP预测的第一个通用多代理多机构感知框架。为了有效地从基础变压器体系结构中的多视图和多代理数据融合相机功能,我们设计了融合的轴向注意力或传真模块,可以捕获跨视图和代理的局部和全局空间交互。 V2V感知数据集OPV2V的广泛实验表明,COBEVT实现了合作BEV语义分段的最新性能。此外,COBEVT被证明可以推广到其他任务,包括1)具有单代理多摄像机的BEV分割和2)具有多代理激光雷达系统的3D对象检测,并实现具有实时性能的最新性能时间推理速度。
translated by 谷歌翻译
在这项工作中,我们为基于视觉的不均衡的BEV表示学习提出了PolarBev。为了适应摄像机成像的预先处理效果,我们将BEV空间横向和辐射上栅格化,并引入极性嵌入分解,以模拟极性网格之间的关联。极性网格被重新排列到类似阵列的常规表示,以进行有效处理。此外,为了确定2到3D对应关系,我们根据假设平面迭代更新BEV表面,并采用基于高度的特征转换。PolarBev在单个2080TI GPU上保持实时推理速度,并且在BEV语义分割和BEV实例分割方面都优于其他方法。展示彻底消融以验证设计。该代码将在\ url {https://github.com/superz-liu/polarbev}上发布。
translated by 谷歌翻译
最近已经提出了3D车道检测的方法,以解决许多自动驾驶场景(上坡/下坡,颠簸等)中不准确的车道布局问题。先前的工作在复杂的情况下苦苦挣扎,因为它们对前视图和鸟类视图(BEV)之间的空间转换以及缺乏现实数据集的简单设计。在这些问题上,我们介绍了Persformer:具有新型基于变压器的空间特征变换模块的端到端单眼3D车道检测器。我们的模型通过参考摄像头参数来参与相关的前视本地区域来生成BEV功能。 Persformer采用统一的2D/3D锚设计和辅助任务,以同时检测2D/3D车道,从而提高功能一致性并分享多任务学习的好处。此外,我们发布了第一个大型现实世界3D车道数据集之一:OpenLane,具有高质量的注释和场景多样性。 OpenLane包含200,000帧,超过880,000个实例级别的车道,14个车道类别,以及场景标签和封闭式对象注释,以鼓励开发车道检测和更多与工业相关的自动驾驶方法。我们表明,在新的OpenLane数据集和Apollo 3D Lane合成数据集中,Persformer在3D车道检测任务中的表现明显优于竞争基线,并且在OpenLane上的2D任务中也与最新的算法相当。该项目页面可在https://github.com/openperceptionx/persformer_3dlane上找到,OpenLane数据集可在https://github.com/openperceptionx/openlane上提供。
translated by 谷歌翻译
自动驾驶中的3D对象检测旨在推理3D世界中感兴趣的对象的“什么”和“在哪里”。遵循先前2D对象检测的传统智慧,现有方法通常采用垂直轴的规范笛卡尔坐标系。但是,我们共轭这并不符合自我汽车的视角的本质,因为每个板载摄像头都以激进(非垂体)轴的成像几何形状感知到了楔形的楔形世界。因此,在本文中,我们主张对极性坐标系的开发,并提出一个新的极性变压器(极性形式),以在Bird's-eye-View(BEV)中更准确的3D对象检测(BEV),仅作为输入仅作为输入的多相机2D图像。具体而言,我们设计了一个基于交叉注意的极性检测头,而无需限制输入结构的形状以处理不规则的极性网格。为了解决沿极性距离维度的不受约束的物体量表变化,我们进一步引入了多个层状表示策略。结果,我们的模型可以通过参与序列到序列时尚的相应图像观察来充分利用极性表示,但要受几何约束。对Nuscenes数据集进行的彻底实验表明,我们的极性形式的表现明显优于最先进的3D对象检测替代方案,并且在BEV语义分割任务上产生了竞争性能。
translated by 谷歌翻译
Accurate localization ability is fundamental in autonomous driving. Traditional visual localization frameworks approach the semantic map-matching problem with geometric models, which rely on complex parameter tuning and thus hinder large-scale deployment. In this paper, we propose BEV-Locator: an end-to-end visual semantic localization neural network using multi-view camera images. Specifically, a visual BEV (Birds-Eye-View) encoder extracts and flattens the multi-view images into BEV space. While the semantic map features are structurally embedded as map queries sequence. Then a cross-model transformer associates the BEV features and semantic map queries. The localization information of ego-car is recursively queried out by cross-attention modules. Finally, the ego pose can be inferred by decoding the transformer outputs. We evaluate the proposed method in large-scale nuScenes and Qcraft datasets. The experimental results show that the BEV-locator is capable to estimate the vehicle poses under versatile scenarios, which effectively associates the cross-model information from multi-view images and global semantic maps. The experiments report satisfactory accuracy with mean absolute errors of 0.052m, 0.135m and 0.251$^\circ$ in lateral, longitudinal translation and heading angle degree.
translated by 谷歌翻译