The past few years have witnessed the prevalence of self-supervised representation learning within the language and 2D vision communities. However, such advancements have not been fully migrated to the community of 3D point cloud learning. Different from previous pre-training pipelines for 3D point clouds that generally fall into the scope of either generative modeling or contrastive learning, in this paper, we investigate a translative pre-training paradigm, namely PointVST, driven by a novel self-supervised pretext task of cross-modal translation from an input 3D object point cloud to its diverse forms of 2D rendered images (e.g., silhouette, depth, contour). Specifically, we begin with deducing view-conditioned point-wise embeddings via the insertion of the viewpoint indicator, and then adaptively aggregate a view-specific global codeword, which is further fed into the subsequent 2D convolutional translation heads for image generation. We conduct extensive experiments on common task scenarios of 3D shape analysis, where our PointVST shows consistent and prominent performance superiority over current state-of-the-art methods under diverse evaluation protocols. Our code will be made publicly available.
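The following PyTorch sketch illustrates the view-conditioned translation idea described above, under our own assumptions about module names and sizes (it is not the authors' released code): a viewpoint indicator conditions point-wise features, which are adaptively pooled into a view-specific codeword and decoded by a small 2D convolutional translation head.

```python
# Hypothetical sketch of a PointVST-style translation head; all names and
# dimensions are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class ViewConditionedTranslator(nn.Module):
    def __init__(self, feat_dim=256, view_dim=32, img_size=64):
        super().__init__()
        self.view_embed = nn.Linear(3, view_dim)        # viewpoint direction -> embedding
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim + view_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        self.attn = nn.Linear(feat_dim, 1)              # adaptive pooling weights
        s = img_size // 8
        self.to_grid = nn.Linear(feat_dim, 128 * s * s)
        self.decoder = nn.Sequential(                   # 2D convolutional translation head
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1))         # e.g., a depth or silhouette map
        self.s = s

    def forward(self, point_feats, viewpoint):
        # point_feats: (B, N, C); viewpoint: (B, 3) unit view direction
        v = self.view_embed(viewpoint).unsqueeze(1).expand(-1, point_feats.size(1), -1)
        h = self.fuse(torch.cat([point_feats, v], dim=-1))   # view-conditioned embeddings
        w = torch.softmax(self.attn(h), dim=1)
        codeword = (w * h).sum(dim=1)                        # view-specific global codeword
        g = self.to_grid(codeword).view(-1, 128, self.s, self.s)
        return self.decoder(g)                               # (B, 1, img_size, img_size)

feats, view = torch.randn(2, 1024, 256), torch.randn(2, 3)
print(ViewConditionedTranslator()(feats, view).shape)  # torch.Size([2, 1, 64, 64])
```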
As two fundamental representation modalities of 3D objects, 2D multi-view images and 3D point clouds reflect shape information from the perspectives of visual appearance and geometric structure, respectively. Unlike deep-learning-based 2D multi-view image modeling, which demonstrates leading performance in various 3D shape analysis tasks, 3D point-cloud-based geometric modeling still suffers from insufficient learning capacity. In this paper, we innovatively construct a unified cross-modal knowledge transfer framework that distills discriminative visual descriptors of 2D images into geometric descriptors of 3D point clouds. Technically, under the classic teacher-student learning paradigm, we propose multi-view vision-to-geometry distillation, consisting of a deep 2D image encoder as the teacher and a deep 3D point cloud encoder as the student. To achieve heterogeneous feature alignment, we further propose visibility-aware feature projection, through which per-point embeddings can be aggregated into multi-view geometric descriptors. Extensive experiments on 3D shape classification, part segmentation, and unsupervised learning validate the superiority of our method. The code and data will be made publicly available.
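As a rough illustration of the visibility-aware aggregation (our reading of the abstract with hypothetical shapes, not the released code), per-point student features can be pooled into one descriptor per view using a visibility mask and aligned to the frozen 2D teacher's descriptors:

```python
# Hedged sketch of multi-view vision-to-geometry distillation.
import torch
import torch.nn.functional as F

def distill_loss(point_feats, visibility, teacher_desc):
    # point_feats: (N, C) student point embeddings; teacher_desc: (V, C);
    # visibility: (V, N) boolean, True if point i is visible in view v.
    w = visibility.float()
    # Average the embeddings of visible points to get one descriptor per view.
    student_desc = (w @ point_feats) / w.sum(dim=1, keepdim=True).clamp(min=1)
    return F.mse_loss(F.normalize(student_desc, dim=-1),
                      F.normalize(teacher_desc, dim=-1))

loss = distill_loss(torch.randn(1024, 256), torch.rand(6, 1024) > 0.5,
                    torch.randn(6, 256))
print(loss)
```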
Point clouds are characterized by irregularity and unstructuredness, which pose challenges in efficient data exploitation and discriminative feature extraction. In this paper, we present an unsupervised deep neural architecture called Flattening-Net to represent irregular 3D point clouds of arbitrary geometry and topology as a completely regular 2D point geometry image (PGI) structure, in which coordinates of spatial points are captured in colors of image pixels. Intuitively, Flattening-Net implicitly approximates a locally smooth 3D-to-2D surface flattening process while effectively preserving neighborhood consistency. As a generic representation modality, PGI inherently encodes the intrinsic property of the underlying manifold structure and facilitates surface-style point feature aggregation. To demonstrate its potential, we construct a unified learning framework directly operating on PGIs to achieve diverse types of high-level and low-level downstream applications driven by specific task networks, including classification, segmentation, reconstruction, and upsampling. Extensive experiments demonstrate that our methods perform favorably against the current state-of-the-art competitors. We will make the code and data publicly available at https://github.com/keeganhk/Flattening-Net.
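A toy example of the PGI data structure itself (not Flattening-Net, whose flattening is learned): a regular 2D grid whose pixel values store 3D coordinates, here filled with a trivially parameterized sphere. Because the grid is regular, surface-style feature aggregation reduces to ordinary 2D operations over the image.

```python
import torch

def toy_pgi(h=32, w=32):
    v, u = torch.meshgrid(torch.linspace(0, torch.pi, h),
                          torch.linspace(0, 2 * torch.pi, w), indexing="ij")
    # Pixel value = xyz coordinate of the corresponding surface point.
    return torch.stack([u.cos() * v.sin(), u.sin() * v.sin(), v.cos()], dim=-1)

pgi = toy_pgi()                 # (32, 32, 3) regular "point geometry image"
points = pgi.reshape(-1, 3)     # recover the point cloud by flattening pixels
print(pgi.shape, points.shape)  # torch.Size([32, 32, 3]) torch.Size([1024, 3])
```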
Self-supervised learning on point clouds has recently gained much attention, since it addresses the label-efficiency and domain-gap problems in point cloud tasks. In this paper, we propose a novel self-supervised framework for learning informative representations of partial point clouds. We leverage partial point clouds from LiDAR scans, which encompass both content and pose attributes, and show that disentangling these two factors from partial point clouds enhances feature representation learning. To this end, our framework consists of three main parts: 1) a completion network to capture the overall semantics of point clouds; 2) a pose regression network to learn the viewpoint from which partial data are scanned; and 3) a partial reconstruction network to encourage the model to learn both content and pose features. To demonstrate the robustness of the learned feature representations, we conduct several downstream tasks, including classification, part segmentation, and registration, with comparisons against state-of-the-art methods. Our method not only outperforms existing self-supervised methods but also demonstrates better generalizability across synthetic and real-world datasets.
Annotation of large-scale point clouds remains time-consuming and is unavailable for many real-world tasks. Point cloud pre-training is one potential solution for obtaining a scalable model for fast adaptation. Therefore, in this paper, we investigate a new self-supervised learning approach, called Mixing and Disentangling (MD), for point cloud pre-training. As the name implies, we explore how to separate the original point clouds from a mixed point cloud, and leverage this challenging task as a pretext optimization objective for model training. Considering the limited training data in the original dataset, which is far smaller than commonly assumed, the mixing process can effectively generate more high-quality samples. We build a baseline network to verify our intuition, which simply contains two modules: an encoder and a decoder. Given a mixed point cloud, the encoder is first pre-trained to extract the semantic embedding. Then an instance-adaptive decoder is harnessed to disentangle the point clouds according to the embedding. Albeit simple, the encoder is inherently able to capture the point cloud keypoints after training and can be fast adapted to downstream tasks, including classification and segmentation, under the pre-training and fine-tuning paradigm. Extensive experiments on two datasets show that the encoder + ours (MD) significantly surpasses the encoder trained from scratch and converges quickly. In ablation studies, we further study the effect of each component and discuss the advantages of the proposed self-supervised learning strategy. We hope this self-supervised learning attempt on point clouds can pave the way for reducing the dependence of deep learning models on large-scale labeled data and saving considerable annotation costs in the future.
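The mixing step can be sketched in a few lines (our simplification, with hypothetical sampling parameters); the pretext task is then to recover the per-point source assignment, i.e., to disentangle the two constituents:

```python
# Minimal sketch of an MD-style mixing step: two point clouds are merged
# into one mixed sample for the disentangling pretext task.
import torch

def mix(pc_a, pc_b, keep=512):
    # pc_a, pc_b: (N, 3). Randomly subsample each half, then merge and shuffle.
    ia = torch.randperm(pc_a.size(0))[:keep]
    ib = torch.randperm(pc_b.size(0))[:keep]
    mixed = torch.cat([pc_a[ia], pc_b[ib]], dim=0)
    labels = torch.cat([torch.zeros(keep), torch.ones(keep)])  # source of each point
    perm = torch.randperm(mixed.size(0))
    return mixed[perm], labels[perm]

mixed, labels = mix(torch.randn(1024, 3), torch.randn(1024, 3))
print(mixed.shape, labels.shape)  # torch.Size([1024, 3]) torch.Size([1024])
```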
Learning representations of point clouds is an important task in 3D computer vision, especially without manually annotated supervision. Previous methods usually take the common aid of auto-encoders to establish self-supervision by reconstructing the input itself. However, existing self-reconstruction-based auto-encoders merely focus on the global shape and ignore the hierarchical context between local and global geometry, which is an important supervision signal for 3D representation learning. To resolve this issue, we propose a novel self-supervised point cloud representation learning framework, named 3D Occlusion Auto-Encoder (3D-OAE). Our key idea is to randomly occlude some local patches of the input point cloud and establish supervision by recovering the occluded patches using the remaining visible ones. Specifically, we design an encoder to learn the features of the visible local patches, and a decoder to leverage these features to predict the occluded patches. In contrast with previous methods, our 3D-OAE can remove a large proportion of patches and predict them using only a small number of visible patches, which enables us to significantly accelerate training and yield non-trivial self-supervisory performance. The trained encoder can be further transferred to various downstream tasks. We demonstrate our superior performance over state-of-the-art methods in different discriminative and generative applications under widely used benchmarks.
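A rough sketch of the patch-occlusion setup (our simplification, not the authors' code): points are grouped into local patches around random centers, most patches are hidden, and only the few visible ones are fed to the encoder:

```python
# Hedged sketch of splitting a point cloud into visible/occluded patches.
import torch

def occlude_patches(pc, num_patches=64, patch_size=32, mask_ratio=0.75):
    # pc: (N, 3). Pick random patch centers and group by k-nearest neighbors.
    centers = pc[torch.randperm(pc.size(0))[:num_patches]]          # (P, 3)
    d = torch.cdist(centers, pc)                                    # (P, N)
    knn = d.topk(patch_size, largest=False).indices                 # (P, K)
    patches = pc[knn]                                               # (P, K, 3)
    num_masked = int(mask_ratio * num_patches)
    perm = torch.randperm(num_patches)
    return patches[perm[num_masked:]], patches[perm[:num_masked]]   # visible, occluded

visible, occluded = occlude_patches(torch.randn(2048, 3))
print(visible.shape, occluded.shape)  # torch.Size([16, 32, 3]) torch.Size([48, 32, 3])
```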
Pre-training on numerous image data has become the de-facto standard for learning robust 2D representations. In contrast, due to expensive data acquisition and annotation, a paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. For one, we introduce a 2D-guided masking strategy that maintains semantically important point tokens to be visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. For another, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
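A hedged sketch of 2D-guided masking (the real method weights masking by semantic scores projected from multi-view 2D features; here we use random scores and a hard top-k selection for brevity):

```python
# Simplified 2D-guided masking: keep the highest-scoring point tokens
# visible instead of masking uniformly at random.
import torch

def semantic_masking(tokens, saliency, mask_ratio=0.8):
    # tokens: (T, C); saliency: (T,) importance aggregated from 2D features.
    num_visible = int(tokens.size(0) * (1 - mask_ratio))
    visible_idx = saliency.topk(num_visible).indices   # keep important tokens
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[visible_idx] = False                          # True = masked
    return tokens[visible_idx], mask

tokens, saliency = torch.randn(64, 384), torch.rand(64)
visible, mask = semantic_masking(tokens, saliency)
print(visible.shape, int(mask.sum()))  # torch.Size([12, 384]) 52
```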
Many 3D representations (e.g., point clouds) are discrete samples of an underlying continuous 3D surface. This process inevitably introduces sampling variations on the underlying 3D shape. In learning 3D representations, the variations should be ignored, while transferable knowledge of the underlying 3D shape should be captured. This becomes a grand challenge for existing representation learning paradigms. This paper studies autoencoding on point clouds. The standard autoencoding paradigm forces the encoder to capture such sampling variations, as the decoder has to reconstruct the original point cloud that carries the sampling variations. We introduce the Implicit AutoEncoder (IAE), a simple yet effective method that addresses this challenge by replacing the point cloud decoder with an implicit decoder. The implicit decoder outputs a continuous representation that is shared among different point cloud samplings of the same model. Reconstructing under the implicit representation encourages the encoder to discard sampling variations, introducing more space to learn useful features. We theoretically justify this claim under a simple linear autoencoder. Moreover, the implicit decoder offers a rich space to design suitable implicit representations for different tasks. We demonstrate the usefulness of IAE across various self-supervised learning tasks for both 3D objects and 3D scenes. Experimental results show that IAE consistently outperforms the state-of-the-art in each task.
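The core interface change can be sketched as follows (a minimal assumed design, not the released model): the decoder maps a query coordinate plus the shape latent to an implicit value such as a signed distance, so the reconstruction target is independent of any particular sampling:

```python
# Minimal implicit decoder: latent shape code + query coordinate -> SDF value.
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, latent_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # signed distance at the query

    def forward(self, latent, queries):
        # latent: (B, D) shape code; queries: (B, Q, 3) arbitrary 3D positions.
        z = latent.unsqueeze(1).expand(-1, queries.size(1), -1)
        return self.mlp(torch.cat([z, queries], dim=-1)).squeeze(-1)  # (B, Q)

dec = ImplicitDecoder()
print(dec(torch.randn(4, 256), torch.randn(4, 512, 3)).shape)  # torch.Size([4, 512])
```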
Unsupervised learning has witnessed great success in the domains of natural language understanding and, more recently, 2D images. How to leverage the power of unsupervised learning for 3D point cloud analysis remains open. Most existing methods simply adapt techniques used in the 2D domain to the 3D domain, while not fully exploiting the specificity of 3D data. In this work, we propose a point discriminative learning method for unsupervised representation learning on 3D point clouds, which is specifically designed for point cloud data and can learn both local and global shape features. We achieve this by imposing a novel point discrimination loss on the middle-level and global-level features produced by the backbone network. This point discrimination loss enforces the features to be consistent with points belonging to the corresponding local shape region and inconsistent with randomly sampled noisy points. Our method is simple in design: it works by adding an extra adaptation module and a point consistency module for unsupervised training of the encoder backbone. Once trained, these two modules can be discarded during supervised training of the classifier or decoder for downstream tasks. We conduct extensive experiments on 3D object classification and 3D semantic and part segmentation in various settings, achieving new state-of-the-art results. We also conduct a detailed analysis of our method and visually demonstrate that the local shapes reconstructed from our learned unsupervised features are highly consistent with the ground-truth shapes.
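One plausible instantiation of such a point discrimination loss (our formulation from the description above, not the authors' exact objective) scores a region descriptor against features of in-region points versus randomly sampled noisy points:

```python
# Hedged sketch: a region descriptor should score high against points inside
# the region and low against noisy points.
import torch
import torch.nn.functional as F

def point_discrimination_loss(region_desc, pos_feats, neg_feats, tau=0.07):
    # region_desc: (C,); pos_feats: (P, C) points in the region;
    # neg_feats: (Q, C) randomly sampled noisy points.
    q = F.normalize(region_desc, dim=0)
    pos = F.normalize(pos_feats, dim=-1) @ q / tau       # (P,)
    neg = F.normalize(neg_feats, dim=-1) @ q / tau       # (Q,)
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = point_discrimination_loss(torch.randn(128), torch.randn(32, 128),
                                 torch.randn(64, 128))
print(loss)
```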
We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds. Inspired by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local point patches, and a point cloud Tokenizer with a discrete Variational AutoEncoder (dVAE) is designed to generate discrete point tokens containing meaningful local information. Then, we randomly mask out some patches of the input point clouds and feed them into the backbone Transformer. The pre-training objective is to recover the original point tokens at the masked locations under the supervision of the point tokens obtained by the Tokenizer. Extensive experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers. Equipped with our pre-training strategy, we show that a pure Transformer architecture attains 93.8% accuracy on ModelNet40 and 83.1% accuracy on the hardest setting of ScanObjectNN, surpassing carefully designed point cloud models with far fewer hand-crafted designs. We also demonstrate that the representations learned by Point-BERT transfer well to new tasks and domains, where our models largely advance the state-of-the-art of few-shot point cloud classification. The code and pre-trained models are available at https://github.com/lulutang0608/Point-BERT.
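A minimal sketch of the Masked Point Modeling objective (assumed shapes, with the dVAE tokenizer abstracted away as a given token map): the Transformer must classify each masked patch into its discrete point token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, num_patches = 8192, 384, 64
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=4)
head = nn.Linear(dim, vocab_size)                        # predicts a discrete point token

patch_emb = torch.randn(2, num_patches, dim)             # embedded local patches
gt_tokens = torch.randint(vocab_size, (2, num_patches))  # from the dVAE tokenizer
mask = torch.rand(2, num_patches) < 0.6                  # True = masked patch
x = patch_emb.clone()
x[mask] = 0.0  # stand-in for the learnable [MASK] embedding

logits = head(encoder(x))                                # (2, P, vocab_size)
loss = F.cross_entropy(logits[mask], gt_tokens[mask])    # recover masked tokens
print(loss.item())
```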
Self-supervised pre-training for 3D vision has drawn increasing research interest in recent years. To learn informative representations, many previous works exploit invariances of 3D features, e.g., perspective-invariance between views of the same scene, modality-invariance between depth and RGB images, and format-invariance between point clouds and voxels. Although they have achieved promising results, previous researches lack a systematic comparison of these invariances. To address this issue, our work, for the first time, introduces a unified framework under which various pre-training methods can be investigated. We conduct extensive experiments and take a closer look at the contributions of different invariances in 3D pre-training. In addition, we propose a simple but effective method that jointly pre-trains a 3D encoder and a depth map encoder using contrastive learning. Models pre-trained with our method gain a significant performance boost in downstream tasks. For instance, a pre-trained VoteNet outperforms previous methods on the SUN RGB-D and ScanNet object detection benchmarks by a clear margin.
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set, has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets, demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning. Our code is publicly available at https://github.com/facebookresearch/PointContrast
How would you restore a physical object with some missing parts? You may imagine its original shape from previously captured images, first recovering its global yet coarse shape, and then refining its local details. We are motivated to imitate such a physical repair procedure to address point cloud completion. To this end, we propose a cross-modal shape-transfer dual-refinement network (termed CSDN), a coarse-to-fine paradigm with images taking part in the full cycle, for quality point cloud completion. CSDN mainly consists of "shape fusion" and "dual-refinement" modules to tackle the cross-modal challenge. The first module transfers the intrinsic shape characteristics from single images to guide the geometry generation of the missing regions of point clouds, in which we propose IPAdaIN to embed the global features of both the image and the partial point cloud into completion. The second module refines the coarse output by adjusting the positions of the generated points, where the local refinement unit exploits the geometric relation between the novel and input points via graph convolution, and the global constraint unit utilizes the input image to fine-tune the generated offsets. Different from most existing approaches, CSDN not only explores the complementary information from images but also effectively exploits cross-modal data throughout the entire coarse-to-fine completion procedure. Experimental results indicate that CSDN performs favorably against competitors on ten cross-modal benchmarks.
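Our guess at the spirit of IPAdaIN, sketched as a classic AdaIN-style fusion (hypothetical dimensions, not the authors' code): the image's global feature predicts a scale and shift that modulate the partial point cloud's normalized features:

```python
import torch
import torch.nn as nn

class AdaINFusion(nn.Module):
    def __init__(self, img_dim=512, pc_dim=256):
        super().__init__()
        self.to_scale = nn.Linear(img_dim, pc_dim)
        self.to_shift = nn.Linear(img_dim, pc_dim)

    def forward(self, pc_feats, img_feat):
        # pc_feats: (B, N, C) partial point cloud features; img_feat: (B, D).
        mu = pc_feats.mean(dim=1, keepdim=True)
        sigma = pc_feats.std(dim=1, keepdim=True) + 1e-6
        normed = (pc_feats - mu) / sigma                  # normalize per channel
        scale = self.to_scale(img_feat).unsqueeze(1)
        shift = self.to_shift(img_feat).unsqueeze(1)
        return scale * normed + shift                     # image-modulated features

fused = AdaINFusion()(torch.randn(2, 1024, 256), torch.randn(2, 512))
print(fused.shape)  # torch.Size([2, 1024, 256])
```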
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks. This is inspired by the observation that view-based surface representations are more effective at modeling high-resolution surface details and texture than their 3D counterparts based on point clouds or voxel occupancy. Specifically, given a 3D shape, we render it from multiple views and set up a dense correspondence learning task within the contrastive learning framework. As a result, the learned 2D representations are view-invariant and geometrically consistent, leading to better generalization when trained on a limited number of labeled shapes compared to alternatives that utilize self-supervision in 2D or 3D alone. Experiments on textured (RenderPeople) and untextured (PartNet) 3D datasets show that our method outperforms state-of-the-art alternatives in fine-grained part segmentation. The improvements over baselines are greater when only a sparse set of views is available for training or when shapes are textured, indicating that MvDeCor benefits from both 2D processing and 3D geometric reasoning.
Masked auto-encoding is a popular and effective self-supervised learning approach for point cloud learning. However, most existing methods reconstruct only the masked points and overlook the local geometry information, which is also important for understanding point cloud data. In this work, we make the first attempt, to the best of our knowledge, to explicitly consider the local geometry information in masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method. Specifically, given the input point cloud masked at a high ratio, we learn a transformer-based encoder-decoder network to estimate the underlying masked surfels by simultaneously predicting the surfel positions (i.e., points) and per-surfel orientations (i.e., normals). The predictions of points and normals are supervised by the Chamfer distance and a newly introduced position-indexed normal distance in a set-to-set manner. Our MaskSurf is validated on six downstream tasks under three fine-tuning strategies. In particular, MaskSurf outperforms its closest competitor, Point-MAE, on the real-world ScanObjectNN dataset under the OBJ-BG setting, justifying the advantage of masked surfel prediction over masked point cloud reconstruction. Code will be available at https://github.com/ybzh/masksurf.
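A hedged sketch of a set-to-set surfel loss in this spirit (our reading of the "position-indexed" idea, not the authors' exact formulation): Chamfer distance on positions, plus a normal term indexed by the same nearest-neighbor assignment:

```python
import torch

def surfel_loss(pred_xyz, pred_n, gt_xyz, gt_n, alpha=0.1):
    # pred_xyz, gt_xyz: (N, 3) positions; pred_n, gt_n: (N, 3) unit normals.
    d = torch.cdist(pred_xyz, gt_xyz)                    # (N, M) pairwise distances
    fwd, idx_f = d.min(dim=1)                            # pred -> nearest gt
    bwd, idx_b = d.min(dim=0)                            # gt -> nearest pred
    chamfer = fwd.mean() + bwd.mean()
    cos_f = (pred_n * gt_n[idx_f]).sum(-1).abs()         # compare matched normals
    cos_b = (gt_n * pred_n[idx_b]).sum(-1).abs()
    normal = (1 - cos_f).mean() + (1 - cos_b).mean()
    return chamfer + alpha * normal

xyz = torch.randn(128, 3)
n = torch.nn.functional.normalize(torch.randn(128, 3), dim=-1)
print(surfel_loss(xyz, n, xyz, n))  # ~tensor(0.) for a perfect prediction
```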
Transformers have been at the heart of the revolutions in natural language processing (NLP) and computer vision (CV). The notable success in NLP and CV inspires exploring the use of Transformers in point cloud processing. However, how do Transformers cope with the irregularity and unordered nature of point clouds? How suitable are Transformers for different 3D representations (e.g., point-based or voxel-based)? How competent are Transformers for various 3D processing tasks? As of now, there is still no systematic survey of the research on these issues. For the first time, we provide a comprehensive overview of the increasingly popular Transformers for 3D point cloud analysis. We start by introducing the theory of the Transformer architecture and reviewing its applications in 2D/3D fields. Then, we present three different taxonomies (i.e., implementation-based, data representation-based, and task-based), which can classify current Transformer-based methods from multiple perspectives. Furthermore, we present the results of an investigation of the variants and improvements of the self-attention mechanism in 3D. To demonstrate the superiority of Transformers in point cloud analysis, we provide comprehensive comparisons of various Transformer-based methods for classification, segmentation, and object detection. Finally, we suggest three potential research directions, providing a beneficial reference for the development of 3D Transformers.
Recent work on 4D point cloud sequences has attracted a lot of attention. However, obtaining exhaustively labeled 4D datasets is often very expensive and laborious, so it is especially important to investigate how to utilize raw unlabeled data. Most existing self-supervised point cloud representation learning methods only consider geometry from a static snapshot, ignoring the fact that sequential observations of dynamic scenes could reveal more comprehensive geometric details. Meanwhile, video representation learning frameworks mostly model motion as image-space flows and are rarely 3D-geometry-aware. To overcome such issues, this paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation. Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework and let the student learn useful 4D representations with the guidance of the teacher. Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks, including indoor and outdoor scenarios.
We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but also low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods.
Transformer-based self-supervised representation learning methods learn generic features from unlabeled datasets to provide useful network initialization parameters for downstream tasks. Recently, self-supervised learning based on masking local surface patches of 3D point cloud data has been under-explored. In this paper, we propose masked Autoencoders in 3D point cloud representation learning (abbreviated as MAE3D), a novel autoencoding paradigm for self-supervised learning. We first split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract the features of the unmasked patches. Secondly, we employ patch-wise MAE3D Transformers to learn both the local features of point cloud patches and the high-level contextual relationships between patches, and complete the latent representations of the masked patches. We use our Point Cloud Reconstruction Module with a multi-task loss to complete the incomplete point cloud. We conduct self-supervised pre-training on ShapeNet55 with the point cloud completion pretext task and fine-tune the pre-trained model on ModelNet40 and ScanObjectNN (PB_T50_RS, the hardest variant). Comprehensive experiments demonstrate that the local features extracted by our MAE3D from point cloud patches are beneficial for downstream classification tasks, outperforming state-of-the-art methods (93.4% and 86.2% classification accuracy, respectively).
The success of deep learning heavily relies on large-scale data with comprehensive labels, which are more expensive and time-consuming to fetch in 3D than for 2D images or natural languages. This promotes the potential of utilizing models pretrained with data other than 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Code will be released at https://github.com/RunpeiDong/ACT.