Current outdoor LiDAR-based 3D object detection methods mainly adopt the training-from-scratch paradigm. Unfortunately, this paradigm relies heavily on large-scale labeled data, whose collection can be expensive and time-consuming. Self-supervised pre-training is an effective and desirable way to alleviate this dependence on extensive annotated data. Recently, masked modeling has become a successful self-supervised learning approach for point clouds. However, current works mainly focus on synthetic or indoor datasets; when applied to large-scale and sparse outdoor point clouds, they fail to yield satisfactory results. In this work, we present BEV-MAE, a simple masked autoencoder pre-training framework for 3D object detection on outdoor point clouds. Specifically, we first propose a bird's eye view (BEV) guided masking strategy that guides the 3D encoder to learn feature representations from a BEV perspective and avoids complex decoder design during pre-training. Besides, we introduce a learnable point token to keep the receptive field size of the 3D encoder on masked point cloud inputs consistent with that at fine-tuning. Finally, based on the property of outdoor point clouds that distant objects are covered by sparser points, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection. Experimental results show that BEV-MAE achieves new state-of-the-art self-supervised results on both Waymo and nuScenes with diverse 3D object detectors. Furthermore, with only 20% of the data and 7% of the training cost during pre-training, BEV-MAE achieves performance comparable to the state-of-the-art method ProposalContrast. The source code and pre-trained models will be made publicly available.
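To make the BEV-guided masking and density-prediction objectives concrete, the following is a minimal PyTorch-style sketch, assuming a square BEV grid, a 70% mask ratio, and a log-normalized point count as the density target; the names, shapes, and defaults are illustrative assumptions, not the official BEV-MAE implementation:

import torch

def bev_guided_mask_and_density(points, grid=(128, 128), pc_range=(-50.0, 50.0), mask_ratio=0.7):
    """Illustrative sketch: build a BEV occupancy grid from LiDAR points (N, 3),
    mask a large fraction of the occupied cells, and return a per-masked-cell
    point-density target."""
    lo, hi = pc_range
    cell = (hi - lo) / grid[0]
    i = ((points[:, 0] - lo) / cell).long().clamp(0, grid[0] - 1)
    j = ((points[:, 1] - lo) / cell).long().clamp(0, grid[1] - 1)
    flat = i * grid[1] + j
    counts = torch.bincount(flat, minlength=grid[0] * grid[1]).float()
    occupied = counts.nonzero(as_tuple=True)[0]
    # Randomly hide a large fraction of the occupied BEV cells from the encoder.
    num_mask = int(mask_ratio * occupied.numel())
    masked_cells = occupied[torch.randperm(occupied.numel())[:num_mask]]
    mask = torch.zeros(grid[0] * grid[1], dtype=torch.bool)
    mask[masked_cells] = True
    # Density target: log-normalized point count per masked cell.
    density_target = torch.log1p(counts[masked_cells])
    return mask.view(grid[0], grid[1]), masked_cells, density_target

A regression head on the encoder's BEV features could then be trained (e.g., with a Smooth-L1 loss) against the density target of the masked cells.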
Despite the tremendous progress of masked autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to their inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a Generative Decoder for MAE (GD-MAE) that automatically merges the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach is free from heuristic decoder designs and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than 12% latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Not only does our method achieve state-of-the-art results, but remarkably, it reaches comparable accuracy even with 20% of the labeled data on the Waymo dataset. The code will be released at https://github.com/Nightmare-n/GD-MAE.
Mask-based pre-training has achieved great success in self-supervised learning on images, video, and language without supervision from manual annotations. However, it has not yet been studied in the field of 3D object detection on information-redundant point clouds. Since the point clouds in 3D object detection are large-scale, it is impossible to reconstruct the input point clouds. In this paper, we propose a masked voxel classification network for pre-training on large-scale point clouds. Our key idea is to divide the point cloud into a voxel representation and classify whether each voxel contains points. This simple strategy makes the network voxel-aware of object shape, thus improving the performance of 3D object detection. Extensive experiments show the effectiveness of our pre-trained model with 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes). The code is publicly available at https://github.com/chaytonmin/voxel-mae.
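A minimal sketch of the masked-voxel classification idea described above, with assumed tensor names and shapes rather than the repository's actual code: each masked voxel is labeled as occupied or empty, and a binary classifier on the decoder output is trained against that label.

import torch
import torch.nn.functional as F

def masked_voxel_occupancy_loss(decoder_logits, voxel_point_counts, mask):
    """decoder_logits: (V,) logits for all voxels in the grid.
    voxel_point_counts: (V,) number of raw points falling into each voxel.
    mask: (V,) bool, True for voxels whose content was hidden from the encoder."""
    target = (voxel_point_counts > 0).float()   # 1 if the voxel contains points
    return F.binary_cross_entropy_with_logits(decoder_logits[mask], target[mask])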
We show how the inherent, but often neglected, properties of large-scale LiDAR point clouds can be exploited for effective self-supervised representation learning. To this end, we design a highly data-efficient feature pre-training backbone that significantly reduces the amount of tedious 3D annotations needed to train state-of-the-art object detectors. In particular, we propose a Masked AutoEncoder (MAELi) that intuitively utilizes the sparsity of the LiDAR point clouds in both the encoder and the decoder during reconstruction. This results in more expressive and useful features, directly applicable to downstream perception tasks, such as 3D object detection for autonomous driving. In a novel reconstruction scheme, MAELi distinguishes between free and occluded space and leverages a new masking strategy which targets the LiDAR's inherent spherical projection. To demonstrate the potential of MAELi, we pre-train one of the most widespread 3D backbones in an end-to-end fashion and show the merit of our fully unsupervised pre-trained features on several 3D object detection architectures. Given only a tiny fraction of labeled frames to fine-tune such detectors, we achieve significant performance improvements. For example, with only ~800 labeled frames, MAELi features improve a SECOND model by +10.09 APH/LEVEL 2 on Waymo Vehicles.
Existing approaches to unsupervised point cloud pre-training are constrained to either scene-level or point/voxel-level instance discrimination. Scene-level methods tend to lose local details that are crucial for recognizing road objects, while point/voxel-level methods inherently suffer from limited receptive fields that are incapable of perceiving large objects or the contextual environment. Considering that region-level representations are more suitable for 3D object detection, we devise a new unsupervised point cloud pre-training framework, called ProposalContrast, that learns robust 3D representations by contrasting region proposals. Specifically, with an exhaustive set of region proposals sampled from each point cloud, the geometric point relations within each proposal are modeled to create expressive proposal representations. To better accommodate the properties of 3D detection, ProposalContrast is optimized with both inter-cluster and inter-proposal separation, i.e., improving the discriminativeness of proposal representations across semantic classes and object instances. The generalizability and transferability of ProposalContrast are verified on various 3D detectors (i.e., PV-RCNN, CenterPoint, PointPillars, and PointRCNN) and datasets (i.e., KITTI, Waymo, and ONCE).
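A minimal sketch of proposal-level contrastive learning in the spirit described above, where the projection head, temperature, and one-to-one pairing of proposals across two augmented views are assumptions for illustration: embeddings of the same region proposal are pulled together and all other proposals act as negatives.

import torch
import torch.nn.functional as F

def proposal_info_nce(z1, z2, temperature=0.1):
    """z1, z2: (P, D) embeddings of the same P region proposals under two
    augmented views of a point cloud. Standard InfoNCE across proposals."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (P, P) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)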
Pre-training by numerous image data has become de-facto for robust 2D representations. In contrast, due to the expensive data acquisition and annotation, a paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. For one, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. For another, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% to the second-best, demonstrating superior transferable capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
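One way to picture the 2D-guided masking is to keep the point tokens with the highest projected 2D saliency visible to the encoder. The sketch below assumes a precomputed per-token saliency score aggregated from off-the-shelf multi-view 2D features and a simple top-k selection; it is an illustrative assumption, not the official I2P-MAE strategy.

import torch

def saliency_guided_visible_tokens(saliency, visible_ratio=0.4):
    """saliency: (T,) per-token importance aggregated from multi-view 2D features.
    Keeps the top-scoring tokens visible; the rest are masked for the encoder."""
    num_visible = max(1, int(visible_ratio * saliency.numel()))
    visible_idx = torch.topk(saliency, num_visible).indices
    mask = torch.ones_like(saliency, dtype=torch.bool)   # True = masked
    mask[visible_idx] = False
    return visible_idx, mask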
Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised learning on outdoor-scene point clouds remains challenging, because previous methods need to reconstruct the whole scene and capture partial views as contrastive targets, which is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representations for outdoor-scene point clouds in an unsupervised manner. CO^3 has several merits compared to existing methods. (1) It utilizes LiDAR point clouds from both the vehicle side and the infrastructure side to build views that differ sufficiently yet maintain common semantic information for contrastive learning, which are more suitable than the views constructed by previous methods. (2) Alongside the contrastive objective, contextual shape prediction is proposed as a pre-training goal and brings more task-relevant information for unsupervised 3D point cloud representation learning, which is beneficial when transferring the learned representations to downstream detection tasks. (3) Compared with previous methods, the representations learned by CO^3 can be transferred to different outdoor-scene datasets collected by different types of LiDAR sensors. (4) CO^3 improves the current state-of-the-art methods on the ONCE and KITTI datasets by up to 2.58 mAP. The code and models will be released. We believe CO^3 will help the understanding of LiDAR point clouds in outdoor scenes.
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of LiDARs, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO.
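A hedged sketch of a surface-reconstruction pretext task of this kind, under simplifying assumptions that are not the paper's exact scheme: query points are sampled slightly in front of and behind each observed LiDAR point along the sensor ray, labeled free or occupied, and a small occupancy head on the backbone's latent vectors is trained with binary cross-entropy.

import torch
import torch.nn.functional as F

def surface_occupancy_queries(points, sensor_origin, delta=0.1):
    """Illustrative query sampling: for each observed point (N, 3), one query
    slightly in front of the surface (free space, label 0) and one slightly
    behind it (occupied, label 1), along the ray from the sensor origin (3,)."""
    dirs = F.normalize(points - sensor_origin, dim=-1)
    free_q = points - delta * dirs
    occ_q = points + delta * dirs
    queries = torch.cat([free_q, occ_q], dim=0)
    labels = torch.cat([torch.zeros(len(points)), torch.ones(len(points))])
    return queries, labels

def occupancy_pretext_loss(occupancy_logits, labels):
    # Binary occupancy prediction on the sampled queries.
    return F.binary_cross_entropy_with_logits(occupancy_logits, labels)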
Learning powerful representations in bird's eye view (BEV) for perception tasks is a trend that is attracting extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a frontal or perspective view. As sensor configurations become increasingly complex, integrating multi-source information from different sensors and representing features in a unified view becomes vital. BEV perception inherits several advantages: representing the surrounding scene in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems of BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review recent work on BEV perception and provide an in-depth analysis of different solutions. Moreover, several system designs of BEV approaches from industry are described. Furthermore, we introduce a full suite of practical guidelines for improving performance on BEV perception tasks, covering camera, LiDAR, and fusion inputs. Finally, we point out future research directions in this field. We hope this report will shed light for the community and encourage more research on BEV perception. We maintain an active repository to collect the latest work and provide a toolbox with a bag of tricks at https://github.com/openperceptionx/bevperception-survey-recipe.
Masked autoencoding has achieved great success in self-supervised learning for the image and language domains. However, mask-based pre-training has yet to show benefits for point cloud understanding, likely because standard backbones such as PointNet cannot properly handle the train-versus-test distribution mismatch introduced by masking during training. In this paper, we bridge this gap by proposing a discriminative mask pre-training Transformer framework, MaskPoint, for point clouds. Our key idea is to represent the point cloud as discrete occupancy values (1 if part of the point cloud; 0 if not), and to perform simple binary classification between masked object points and sampled noise points as the proxy task. In this way, our approach is robust to the point sampling variance in point clouds and facilitates learning rich representations. We evaluate the pre-trained models on several downstream tasks, including 3D shape classification, segmentation, and real-world object detection, and demonstrate state-of-the-art results while achieving a significant pre-training speedup (e.g., 4.1x on ScanNet) over the prior state-of-the-art Transformer baseline. The code is available at https://github.com/haotian-liu/maskpoint.
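In code, the proxy task sketched above can be phrased as binary classification between real (but masked) object points and uniformly sampled noise points; the query construction below is an illustrative assumption about shapes and sampling, not the official MaskPoint implementation.

import torch
import torch.nn.functional as F

def maskpoint_style_queries(masked_points, num_noise, pc_range=(-1.0, 1.0)):
    """masked_points: (M, 3) points hidden from the encoder (positives, label 1).
    Noise points are sampled uniformly over the scene extent (negatives, label 0)."""
    lo, hi = pc_range
    noise = torch.rand(num_noise, 3) * (hi - lo) + lo
    queries = torch.cat([masked_points, noise], dim=0)
    labels = torch.cat([torch.ones(len(masked_points)), torch.zeros(num_noise)])
    return queries, labels

# A decoder would attend from these queries to the visible-token features and
# produce one occupancy logit per query; training then uses plain binary
# cross-entropy: loss = F.binary_cross_entropy_with_logits(logits, labels)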
We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds. Inspired by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local point patches, and a point cloud tokenizer with a discrete variational autoencoder (dVAE) is designed to generate discrete point tokens containing meaningful local information. We then randomly mask some patches of the input point cloud and feed them into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of the point tokens obtained by the tokenizer. Extensive experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers. Equipped with our pre-training strategy, we show that a pure Transformer architecture attains 93.8% accuracy on ModelNet40 and 83.1% accuracy on the hardest setting of ScanObjectNN, surpassing carefully designed point cloud models with far fewer hand-crafted designs. We also demonstrate that the representations learned by Point-BERT transfer well to new tasks and domains, and our models largely advance the state of the art in few-shot point cloud classification. The code and pre-trained models are available at https://github.com/lulutang0608/Point-BERT
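A minimal sketch of the masked point modeling objective, with the tokenizer treated as a given pre-trained module and the shapes and names assumed for illustration: patches are mapped to discrete token ids, some patches are masked, and the Transformer is trained to predict the ids of the masked patches with cross-entropy.

import torch
import torch.nn.functional as F

def masked_point_modeling_loss(pred_logits, patch_token_ids, mask):
    """pred_logits: (G, K) Transformer predictions over a K-way token vocabulary
    for G local patches. patch_token_ids: (G,) long tensor of discrete ids from a
    frozen dVAE tokenizer. mask: (G,) bool, True where the patch was masked."""
    return F.cross_entropy(pred_logits[mask], patch_token_ids[mask])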
Transformer-based self-supervised representation learning methods learn generic features from unlabeled datasets to provide useful network initialization parameters for downstream tasks. Recently, self-supervised learning based on masking local surface patches of 3D point cloud data has been insufficiently explored. In this paper, we propose Masked Autoencoders in 3D point cloud representation learning (abbreviated MAE3D), a novel autoencoding paradigm for self-supervised learning. We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract features from the unmasked patches. Second, we employ patch-wise MAE3D Transformers to learn both local features of point cloud patches and the high-level contextual relationships between patches, and to complete the latent representations of the masked patches. We use our Point Cloud Reconstruction Module with a multi-task loss to complete the incomplete point cloud. We conduct self-supervised pre-training on ShapeNet55 with the point cloud completion pretext task and fine-tune the pre-trained model on ModelNet40 and ScanObjectNN (PB_T50_RS, the hardest variant). Comprehensive experiments demonstrate that the local features extracted by MAE3D from point cloud patches are beneficial for downstream classification tasks, outperforming state-of-the-art methods (93.4% and 86.2% classification accuracy, respectively).
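Completion objectives of this kind are typically measured with a Chamfer-style distance between predicted and ground-truth patch points; the following is a small illustrative implementation of that distance, not the authors' multi-task loss.

import torch

def chamfer_distance(pred, gt):
    """pred: (M, 3) predicted points of masked patches, gt: (N, 3) ground truth.
    Symmetric Chamfer distance used as a reconstruction objective."""
    d = torch.cdist(pred, gt)                      # (M, N) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()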
Real-time and high-performance 3D object detection is critical for autonomous driving. Recent top-performing 3D object detectors mainly rely on point-based or 3D voxel-based convolutions, both of which are computationally inefficient for deployment. In contrast, pillar-based methods use only 2D convolutions and thus consume fewer computational resources, but their detection accuracy lags far behind that of their voxel-based counterparts. In this paper, by examining the main performance gap between pillar-based and voxel-based detectors, we develop a real-time and high-performance pillar-based detector, dubbed PillarNet. The proposed PillarNet consists of a powerful encoder network for effective pillar feature learning, a neck network for spatial-semantic feature fusion, and a commonly used detection head. Using only 2D convolutions, PillarNet is flexible with respect to the optional pillar size and compatible with classical 2D CNN backbones, such as VGGNet and ResNet. Additionally, PillarNet benefits from our designed orientation-decoupled IoU regression loss together with an IoU-aware prediction branch. Extensive experimental results on the large-scale nuScenes and Waymo Open datasets demonstrate that the proposed PillarNet performs well over the state-of-the-art 3D detectors in terms of both effectiveness and efficiency. The source code is available at https://github.com/agent-sgs/pillarnet.git.
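A rough sketch of the pillar encoding that makes a purely 2D-convolutional backbone possible, with the grid size and mean-pooling chosen as illustrative assumptions: points are scattered into vertical pillars on the BEV grid and pooled into a pseudo-image that any 2D CNN (e.g., a ResNet-style encoder) can process.

import torch

def pillarize(points, feats, grid=(512, 512), pc_range=(-51.2, 51.2)):
    """points: (N, 3), feats: (N, C) per-point features.
    Returns a (C, H, W) BEV pseudo-image by mean-pooling features per pillar."""
    lo, hi = pc_range
    cell = (hi - lo) / grid[0]
    i = ((points[:, 0] - lo) / cell).long().clamp(0, grid[0] - 1)
    j = ((points[:, 1] - lo) / cell).long().clamp(0, grid[1] - 1)
    flat = i * grid[1] + j
    C = feats.size(1)
    canvas = torch.zeros(grid[0] * grid[1], C)
    counts = torch.zeros(grid[0] * grid[1], 1)
    canvas.index_add_(0, flat, feats)
    counts.index_add_(0, flat, torch.ones(len(points), 1))
    canvas = canvas / counts.clamp(min=1)
    return canvas.t().reshape(C, grid[0], grid[1])   # ready for a 2D CNN backbone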
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity, considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets, demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning. Our code is publicly available at https://github.com/facebookresearch/PointContrast
To boost a single-frame 3D object detector, we propose a new approach that trains it to mimic the features and responses of a detector trained on multi-frame point clouds. Our approach needs multi-frame point clouds only when training the single-frame detector; once trained, it can detect objects with only single-frame point clouds as input during inference. We design a novel Simulated Multi-Frame Single-Stage object Detector (SMF-SSD) framework to realize this approach: multi-view dense object fusion densifies ground-truth objects to generate the multi-frame point clouds; self-attention voxel distillation facilitates knowledge transfer from multi-frame to single-frame voxels; multi-scale BEV feature distillation transfers knowledge in both low-level spatial and high-level semantic BEV features; and adaptive response distillation activates single-frame responses with high confidence and accurate localization. Experimental results on the Waymo test set show that our SMF-SSD consistently outperforms all state-of-the-art single-frame 3D object detectors in both mAP and mAPH for all object classes at difficulty levels 1 and 2.
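A simple sketch of multi-scale BEV feature distillation as described above, where the loss choice (MSE) and the assumption of scale-aligned feature maps are illustrative rather than the paper's exact formulation: the single-frame student's BEV features are regressed toward the multi-frame teacher's at each scale.

import torch.nn.functional as F

def bev_feature_distillation(student_feats, teacher_feats):
    """student_feats / teacher_feats: lists of (B, C, H, W) BEV feature maps at
    matching scales, from the single-frame student and the multi-frame teacher."""
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))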
The success of Transformers in natural language processing has recently attracted attention in the computer vision field. Owing to their ability to learn long-range dependencies, Transformers have been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, where several state-of-the-art methods rely on Transformers for better learning. In computer vision, the 3D field has also witnessed an increase in the use of Transformers to enhance 3D convolutional neural networks and multi-layer perceptron networks. Although many surveys focus on Transformers in vision, 3D vision requires special attention due to differences in data representation and processing compared to 2D vision. In this work, we present a systematic and thorough review of more than 100 Transformer-based methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss Transformer designs in 3D vision that allow them to process data with various 3D representations. For each application, we highlight the key properties and contributions of the Transformer-based methods. To assess the competitiveness of these methods, we compare their performance against common non-Transformer methods on 12 3D benchmarks. We conclude the survey by discussing different open directions and challenges for Transformers in 3D vision. Besides the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at https://github.com/lahoud/3d-vision-transformers.
Recently, fusing LiDAR point clouds and camera images has improved the performance and robustness of 3D object detection, since the two modalities are naturally highly complementary. In this paper, we propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module and a Multi-Modal Consistency (MC) loss. More specifically, the proposed CB-Fusion module enriches the point features with the abundant semantic information of image features in a cascade bi-directional interaction fusion manner, leading to more comprehensive and discriminative feature representations. The MC loss explicitly guarantees the consistency between the predicted scores to obtain more comprehensive and reliable confidence scores. Experimental results on the KITTI, JRDB, and SUN-RGBD datasets demonstrate the superiority of EPNet++ over state-of-the-art methods. In addition, we highlight a critical but easily overlooked problem, namely exploring the performance and robustness of 3D detectors in sparse scenes. Extensive experiments show that EPNet++ outperforms existing SOTA methods by significant margins in highly sparse point cloud cases, which may be a promising direction for reducing the expensive cost of LiDAR sensors. The code will be released in the future.
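As a hedged illustration of a multi-modal consistency term of this kind (the exact formulation and score pairing in EPNet++ may differ), one can penalize disagreement between the confidence scores predicted by the image branch and by the point branch for the same candidates:

import torch.nn.functional as F

def multimodal_consistency_loss(point_scores, image_scores):
    """point_scores / image_scores: (N,) predicted confidences for the same
    N candidate locations from the LiDAR branch and the image branch."""
    return F.mse_loss(point_scores, image_scores)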
Learning representations of point clouds is an important task in 3D computer vision, especially without manually annotated supervision. Previous methods usually obtain common self-supervision from an autoencoder by reconstructing the input itself. However, existing self-reconstruction-based autoencoders concern only the global shape and ignore the hierarchical context between local and global geometries, which is an important source of supervision for 3D representation learning. To resolve this issue, we propose a novel self-supervised point cloud representation learning framework, named 3D Occlusion Auto-Encoder (3D-OAE). Our key idea is to randomly occlude some local patches of the input point cloud and establish supervision by recovering the occluded patches using the remaining visible ones. Specifically, we design an encoder to learn features of the visible local patches and a decoder to leverage these features to predict the occluded patches. In contrast to previous methods, our 3D-OAE can remove a large proportion of patches and predict them from only a small number of visible patches, which enables us to significantly accelerate training and yield non-trivial self-supervisory performance. The trained encoder can be further transferred to various downstream tasks. We demonstrate our superior performance over state-of-the-art methods in different discriminative and generative applications on widely used benchmarks.
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
In recent years, 3D object detection from LiDAR point clouds has made great progress thanks to the development of deep learning techniques. Although voxel-based and point-based methods are popular for 3D object detection, they usually involve time-consuming operations such as 3D convolutions on voxels or ball queries among points, rendering the resulting networks unsuitable for time-critical applications. On the other hand, 2D view-based methods offer higher computational efficiency but usually obtain lower performance than voxel-based or point-based methods. In this work, we present a real-time view-based single-stage 3D object detector, namely CVFNet, to fulfill this task. To strengthen cross-view feature learning under the demanding efficiency condition, our framework extracts features from different views and fuses them in an efficient progressive manner. We first propose a novel point-range feature fusion module that deeply integrates point and range view features over multiple stages. Then, a special slice pillar is designed to well preserve the 3D geometry when transforming the obtained deep point-view features into a bird's eye view. To better balance the sample ratio, a sparse pillar detection head is proposed to focus detection on non-empty grids. We conduct experiments on the popular KITTI and nuScenes benchmarks and achieve state-of-the-art performance in terms of both accuracy and speed.