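Multi-view projection methods have demonstrated promising performance on 3D understanding tasks such as 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), which represents each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representations with the natural view-awareness of multi-view representations. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification and retrieval on ScanObjectNN, ModelNet40, and ShapeNet Core55. Additionally, we achieve competitive performance for 3D semantic segmentation on ShapeNet Parts. Further analysis shows that VointNet improves robustness to rotation and occlusion compared to other methods.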
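The Voint abstract describes each 3D point as a set of per-view features that can be pooled. Below is a minimal sketch of the simplest such pooling. It assumes the per-view features are already extracted and that pooling means a channel-wise max over the views that actually see the point. The nested-list layout, the zero fallback for unseen points, and the name `voint_max_pool` are illustrative assumptions, not VointNet's learned operators.

```python
from typing import List, Sequence

# Hypothetical layout (not from the paper): voints[n][v] is the C-dim feature of
# point n as seen from view v, and visible[n][v] says whether view v sees point n.
Voints = Sequence[Sequence[Sequence[float]]]


def voint_max_pool(voints: Voints, visible: Sequence[Sequence[bool]]) -> List[List[float]]:
    """Collapse each point's set of per-view features into a single point feature."""
    pooled: List[List[float]] = []
    for feats, vis in zip(voints, visible):
        seen = [f for f, ok in zip(feats, vis) if ok]
        if not seen:
            # Assumption: a point that no view can see gets an all-zero feature.
            pooled.append([0.0] * len(feats[0]))
            continue
        # Channel-wise max over visible views; the result ignores view order.
        pooled.append([max(channel) for channel in zip(*seen)])
    return pooled


# Two points, three views, two channels per view.
voints = [
    [[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]],
    [[0.5, 0.1], [0.2, 0.7], [4.0, 4.0]],
]
visible = [[True, True, False], [True, True, True]]
print(voint_max_pool(voints, visible))  # [[3.0, 5.0], [4.0, 4.0]]
```

Because max is symmetric, the pooled feature does not depend on how the views are ordered, which is the property a set-of-views representation needs.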
Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.
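Learning new representations of 3D point clouds is an active research area in 3D vision, as the order-invariant structure of point clouds still poses challenges to the design of neural network architectures. Recent works have explored learning global features, local features, or both, but none of the earlier methods captured contextual shape information by analyzing the local orientation distribution of points. In this paper, we leverage point orientation distributions around a point to obtain an expressive local neighborhood representation for point clouds. We achieve this by dividing the spherical neighborhood of a given point into predefined cone volumes and using statistics inside each volume as point features. In this way, a local patch is represented not only by the selected point's nearest neighbors but also by a point density distribution defined along multiple orientations around the point. We are then able to construct an orientation distribution function (ODF) neural network built from ODFBlocks that rely on MLP (multi-layer perceptron) layers. The new ODFNet model achieves state-of-the-art accuracy for object classification on the ModelNet40 and ScanObjectNN datasets, and for segmentation on the ShapeNet and S3DIS datasets.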
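To make the cone idea concrete, here is a sketch that computes, for one center point, the fraction of its spherical neighborhood that falls inside each of several cones. The six axis-aligned cones, the 45° half-angle (which lets neighboring cones overlap at their boundaries), and the name `cone_histogram` are assumptions for illustration; the abstract does not specify ODFNet's cone layout or its exact per-cone statistics.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical cone axes: the six coordinate directions (ODFNet defines its own cones).
AXES: List[Vec3] = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                    (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]


def cone_histogram(center: Vec3, points: Sequence[Vec3], radius: float,
                   half_angle_deg: float = 45.0, axes: Sequence[Vec3] = AXES) -> List[float]:
    """Fraction of neighbors within `radius` of `center` that fall inside each cone."""
    cos_limit = math.cos(math.radians(half_angle_deg))
    counts = [0] * len(axes)
    total = 0
    for p in points:
        d = [p[i] - center[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in d))
        if dist == 0.0 or dist > radius:
            continue  # skip the center itself and points outside the spherical neighborhood
        total += 1
        for k, a in enumerate(axes):
            # Cosine of the angle between the offset and the unit cone axis.
            if sum(d[i] * a[i] for i in range(3)) / dist >= cos_limit:
                counts[k] += 1
    return [c / total if total else 0.0 for c in counts]


nbhd = [(1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 2.0)]
print(cone_histogram((0.0, 0.0, 0.0), nbhd, radius=1.5))  # ≈ [0.667, 0.0, 0.0, 0.333, 0.0, 0.0]
```

The point at distance 2.0 lies outside the radius and is ignored, so two of the three remaining neighbors land in the +x cone and one in the -y cone.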
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par with or even better than the state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
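PointNet's permutation invariance follows from applying one shared function to every point and then aggregating with a symmetric operation. The sketch below uses max pooling as that symmetric function, with random, untrained weights. It omits PointNet's transform sub-networks and all biases, and the helper names are hypothetical.

```python
import random
from typing import List, Sequence

Layer = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    # Random weights stand in for trained ones; each row produces one output unit.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def shared_mlp(point: Sequence[float], layers: List[Layer]) -> List[float]:
    """Apply the same MLP (no biases, for brevity) to a single point."""
    h = list(point)
    for w in layers:
        h = [relu(sum(wi * xi for wi, xi in zip(row, h))) for row in w]
    return h


def global_feature(points: Sequence[Sequence[float]], layers: List[Layer]) -> List[float]:
    """Per-point shared MLP followed by a channel-wise max over all points."""
    per_point = [shared_mlp(p, layers) for p in points]
    return [max(ch) for ch in zip(*per_point)]


rng = random.Random(0)
layers = [make_layer(3, 16, rng), make_layer(16, 32, rng)]
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
shuffled = cloud[:]
rng.shuffle(shuffled)
print(global_feature(cloud, layers) == global_feature(shuffled, layers))  # True
```

Each point's feature is computed independently, and max does not care about order, so shuffling the input leaves the global descriptor bit-for-bit unchanged.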
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominant technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
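Point clouds are gaining prominence as a method for representing 3D shapes, but their irregular structure poses a challenge for deep learning methods. In this paper, we propose a novel method for learning 3D shapes using random walks. Previous works attempt to adapt convolutional neural networks (CNNs) or to impose a grid or mesh structure on 3D point clouds. This work presents a different approach for representing and learning the shape of a given point set. The key idea is to impose structure on the point set through multiple random walks through the cloud, which explore different regions of the 3D object. We then learn per-point and per-walk representations and aggregate multiple walk predictions at inference. Our approach achieves state-of-the-art results on two 3D shape analysis tasks: classification and retrieval. In addition, we propose a shape complexity indicator function that uses cross-walk and inter-walk variance measures to subdivide the shape space.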
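A sketch of imposing order on an unordered point set with random walks: build a k-nearest-neighbor graph, then walk it while preferring neighbors not yet visited. The walk policy, the random-jump fallback when a walk gets stuck, and the function names are assumptions; the abstract does not give the exact rules, and the model that learns from each walk is not shown.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def knn_graph(points: Sequence[Point], k: int) -> List[List[int]]:
    """Brute-force k-nearest-neighbor lists (O(N^2), fine for a sketch)."""
    nbrs = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        nbrs.append(order[:k])
    return nbrs


def random_walk(nbrs: List[List[int]], length: int, rng: random.Random) -> List[int]:
    """Walk over the k-NN graph, preferring unvisited neighbors; jump randomly if stuck."""
    n = len(nbrs)
    current = rng.randrange(n)
    walk, visited = [current], {current}
    while len(walk) < length:
        fresh = [j for j in nbrs[current] if j not in visited]
        if fresh:
            current = rng.choice(fresh)
        else:
            # Hypothetical fallback: jump to a random unvisited point (any point once all are visited).
            unvisited = [j for j in range(n) if j not in visited]
            current = rng.choice(unvisited or list(range(n)))
        walk.append(current)
        visited.add(current)
    return walk


rng = random.Random(7)
cloud = [(float(x), float(y), 0.0) for x in range(4) for y in range(3)]  # 12 points
nbrs = knn_graph(cloud, k=4)
walk = random_walk(nbrs, length=10, rng=rng)
print([cloud[i] for i in walk])  # the ordered 3D sequence a per-walk model would consume
```

Each walk turns the unordered set into an ordered sequence of coordinates; running several walks from different starts covers different regions of the shape.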
Unlike images, which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. The most important contribution of this work is a novel reformulation proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. PointConv can also be used as a deconvolution operator to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art results on challenging semantic segmentation benchmarks on 3D point clouds. In addition, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.
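PointConv treats the kernel as a function of local coordinates, combining a weight function (an MLP) with an inverse-density scale from kernel density estimation. The sketch below shows that structure for scalar features in pure Python. Assumptions: a random two-layer MLP stands in for the learned weight function, the inverse density is applied directly (the paper's density scaling may itself be learned), and the efficient reformulation mentioned in the abstract is not reproduced.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
WeightFn = Callable[[Sequence[float]], float]


def kde_density(points: Sequence[Point], bandwidth: float) -> List[float]:
    """Unnormalized Gaussian kernel density estimate at every point."""
    return [sum(math.exp(-math.dist(p, q) ** 2 / (2.0 * bandwidth ** 2)) for q in points)
            for p in points]


def make_weight_fn(hidden: int, rng: random.Random) -> WeightFn:
    """Random two-layer MLP from a 3D offset to a scalar weight (stand-in for a learned one)."""
    w1 = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(hidden)]
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]

    def weight(offset: Sequence[float]) -> float:
        h = [max(0.0, sum(a * b for a, b in zip(row, offset))) for row in w1]
        return sum(a * b for a, b in zip(w2, h))

    return weight


def point_conv(points: Sequence[Point], feats: Sequence[float], radius: float,
               weight: WeightFn, density: Sequence[float]) -> List[float]:
    """Sum of weight(local offset) * feature / density over each radius neighborhood."""
    out = []
    for p in points:
        acc = 0.0
        for q, f, d in zip(points, feats, density):
            if math.dist(p, q) <= radius:
                offset = [qc - pc for qc, pc in zip(q, p)]  # local coordinates of q w.r.t. p
                acc += weight(offset) * f / d
        out.append(acc)
    return out


rng = random.Random(3)
pts = [(rng.random(), rng.random(), rng.random()) for _ in range(30)]
feats = [rng.random() for _ in pts]
weight = make_weight_fn(8, rng)
out = point_conv(pts, feats, 0.3, weight, kde_density(pts, 0.2))
```

Since the kernel only sees offsets and the density only sees pairwise distances, translating the whole cloud leaves the output unchanged, and permuting the points permutes the output in the same way.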
Recent advances in Neural Radiance Fields (NeRFs) treat the problem of novel view synthesis as Sparse Radiance Field (SRF) optimization using sparse voxels for efficient and fast rendering (Plenoxels, InstantNGP). In order to leverage machine learning and adoption of SRFs as a 3D representation, we present SPARF, a large-scale ShapeNet-based synthetic dataset for novel view synthesis consisting of approximately 17 million images rendered from nearly 40,000 shapes at high resolution (400 × 400 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis and includes more than one million 3D-optimized radiance fields with multiple voxel resolutions. Furthermore, we propose a novel pipeline (SuRFNet) that learns to generate sparse voxel radiance fields from only a few views. This is done by using the densely collected SPARF dataset and 3D sparse convolutions. SuRFNet employs partial SRFs from one or a few images and a specialized SRF loss to learn to generate high-quality sparse voxel radiance fields that can be rendered from novel views. Our approach achieves state-of-the-art results in the task of unconstrained novel view synthesis from few views on ShapeNet compared to recent baselines. The SPARF dataset will be made public with the code and models on the project website https://abdullahamdi.com/sparf/ .
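Convolution on 3D point clouds has been widely studied, yet it is far from perfect in geometric deep learning. The conventional wisdom of convolution characterizes feature correspondences indistinguishably among 3D points, an intrinsic limitation that leads to poor distinctive feature learning. In this paper, we propose Adaptive Graph Convolution (AGConv) for wide applications of point cloud analysis. AGConv generates adaptive kernels for points according to their dynamically learned features. Compared with solutions using fixed/isotropic kernels, AGConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts. Unlike popular attentional weight schemes, AGConv implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points. Extensive evaluations clearly show that our method outperforms state-of-the-art methods for point cloud classification and segmentation on various benchmark datasets. Meanwhile, AGConv can be flexibly adopted by more point cloud analysis approaches to boost their performance. To validate its flexibility and effectiveness, we explore AGConv-based paradigms for completion, denoising, upsampling, registration, and circle extraction, which are comparable or even superior to their competitors. Our code is available at https://github.com/hrzhou2/adaptconv-master.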
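The difference between a fixed kernel and an adaptive one can be sketched as follows: for each neighbor pair, a kernel is generated from the pair's feature difference and then applied to the pair's geometric description, with a max over neighbors. The concrete form used here (kernel from `f_j - f_i`, inner product with `[x_i, x_j - x_i]`, ReLU, max), the random kernel-generator matrices, and the function names are assumptions for illustration, not the paper's exact layer.

```python
import random
from typing import List, Sequence


def matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def adaptive_conv(xyz: Sequence[Sequence[float]], feats: Sequence[Sequence[float]],
                  nbrs: Sequence[Sequence[int]],
                  kernel_gens: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """
    xyz[i]: 3D coordinates; feats[i]: C-dim features; nbrs[i]: neighbor indices of point i.
    kernel_gens[m]: a 6 x C matrix turning a feature difference into the m-th kernel.
    """
    out = []
    for i, js in enumerate(nbrs):
        channels = []
        for gen in kernel_gens:
            responses = []
            for j in js:
                df = [a - b for a, b in zip(feats[j], feats[i])]
                kernel = matvec(gen, df)  # the kernel depends on this pair's features
                dx = list(xyz[i]) + [a - b for a, b in zip(xyz[j], xyz[i])]  # [x_i, x_j - x_i]
                responses.append(max(0.0, sum(k * d for k, d in zip(kernel, dx))))
            channels.append(max(responses))  # aggregate over neighbors
        out.append(channels)
    return out


rng = random.Random(5)
C, M = 4, 8  # input feature channels, output channels
xyz = [(rng.random(), rng.random(), rng.random()) for _ in range(6)]
feats = [[rng.random() for _ in range(C)] for _ in range(6)]
nbrs = [[j for j in range(6) if j != i][:3] for i in range(6)]  # toy neighbor lists
gens = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(6)] for _ in range(M)]
out = adaptive_conv(xyz, feats, nbrs, gens)
```

With a fixed kernel, every pair would be scored by the same weights; here the weights change with `f_j - f_i`, so neighbors from different semantic parts can be treated differently.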
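Recent works on 3D semantic segmentation propose to exploit the synergy between images and point clouds by processing each modality with a dedicated network and projecting the learned 2D features onto 3D points. Merging large-scale point clouds and images raises several challenges, such as constructing a mapping between points and pixels and aggregating features across multiple views. Current methods require mesh reconstruction or specialized sensors to recover occlusions, and use heuristics to select and aggregate the available images. In contrast, we propose an end-to-end trainable multi-view aggregation model that leverages the viewing conditions of 3D points to merge features from images taken at arbitrary positions. Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks, without requiring colorization, meshing, or true depth maps. We set a new state of the art for large-scale indoor/outdoor semantic segmentation on S3DIS (74.7 mIoU 6-fold) and KITTI-360 (58.3 mIoU). Our full pipeline is accessible at https://github.com/drprojects/deepviewagg and only requires raw 3D scans and a set of images and poses.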
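The idea of merging per-view image features into a 3D point according to its viewing conditions can be sketched as attention over the views that see the point. This is a minimal sketch, assuming softmax weights whose scores come from a linear function of hand-picked viewing descriptors (for example depth and incidence angle). The scorer, the descriptor choice, and the function names are hypothetical; the paper learns its aggregation end to end.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_feats: Sequence[Sequence[float]],
                    view_conds: Sequence[Sequence[float]],
                    score_weights: Sequence[float]) -> List[float]:
    """
    view_feats[v]: C-dim image feature sampled where the point projects in visible view v.
    view_conds[v]: viewing-condition descriptor for that view, e.g. [depth, cos(incidence)].
    score_weights: hypothetical linear scorer (learned in a real model).
    """
    scores = [sum(w * c for w, c in zip(score_weights, cond)) for cond in view_conds]
    attn = softmax(scores)
    return [sum(a * f[ch] for a, f in zip(attn, view_feats)) for ch in range(len(view_feats[0]))]


# Equal viewing conditions give a plain average of the two views.
print(aggregate_views([[1.0, 0.0], [3.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5]))  # [2.0, 1.0]
```

With a negative weight on depth, for instance, a close-up view dominates a distant one, which is the kind of preference a heuristic selection rule would otherwise hard-code.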
We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but also low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods.
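Point cloud recognition is an essential task in industrial robotics and autonomous driving. Recently, several point cloud processing models have achieved state-of-the-art performance. However, these methods lack rotation robustness: their performance degrades severely under random rotations, and they fail to extend to real-world scenarios with varying orientations. To this end, we propose a method named Self-Contour-based Transformation (SCT), which can be flexibly integrated into various existing point cloud recognition models to handle arbitrary rotations. SCT provides efficient rotation and translation invariance by introducing the Contour-Aware Transformation (CAT), which linearly transforms the Cartesian coordinates of points into translation- and rotation-invariant representations. We prove through theoretical analysis that CAT is a rotation- and translation-invariant transformation. Furthermore, a Frame Alignment module is proposed to enhance discriminative feature extraction by capturing contours and transforming self-contour-based frames into intra-class frames. Extensive experimental results show that SCT outperforms state-of-the-art approaches under arbitrary rotations in both effectiveness and efficiency on synthetic and real-world benchmarks. Moreover, robustness and generality evaluations indicate that SCT is robust and applicable to various point cloud processing models, which highlights its advantages in industrial applications.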
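The abstract does not give the formula for CAT, so the sketch below swaps in a different, well-known linear canonicalization, PCA alignment, purely to show how a linear map of Cartesian coordinates can remove rotation and translation. It is not SCT or CAT. It assumes the covariance has distinct eigenvalues and that each principal axis has a non-zero third moment (used to fix the sign of each eigenvector), and it uses a small Jacobi eigen-solver to stay dependency-free.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def jacobi_eigh(a: List[List[float]], sweeps: int = 50):
    """Eigenvalues and eigenvectors of a symmetric 3x3 matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in a]
    v = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(3) for j in range(3) if i != j) < 1e-24:
            break
        for p in range(3):
            for q in range(p + 1, 3):
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(3):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(3):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(3):  # V <- V J accumulates eigenvectors as columns
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(3)], [[v[k][i] for k in range(3)] for i in range(3)]


def canonicalize(points: Sequence[Point]) -> List[Point]:
    """Express points in their PCA frame, which removes rotation and translation."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)] for i in range(3)]
    vals, vecs = jacobi_eigh(cov)
    columns = []
    for i in sorted(range(3), key=lambda j: -vals[j]):  # largest-variance axis first
        proj = [sum(c[k] * vecs[i][k] for k in range(3)) for c in centered]
        if sum(x ** 3 for x in proj) < 0.0:  # fix the eigenvector sign via the third moment
            proj = [-x for x in proj]
        columns.append(proj)
    return [(columns[0][m], columns[1][m], columns[2][m]) for m in range(n)]


def rigid(p: Point, a: float = 0.7, b: float = -1.2) -> Point:
    """Rotate about z by a, then about x by b, then translate."""
    x, y, z = p
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    y, z = y * math.cos(b) - z * math.sin(b), y * math.sin(b) + z * math.cos(b)
    return (x + 5.0, y - 2.0, z + 1.0)


rng = random.Random(11)
cloud = [(3 * rng.expovariate(1.0), 2 * rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(200)]
moved = [rigid(p) for p in cloud]
diff = max(abs(u - w) for p, q in zip(canonicalize(cloud), canonicalize(moved)) for u, w in zip(p, q))
print(diff < 1e-6)  # True
```

PCA frames are known to be fragile when eigenvalues are close or the shape is symmetric, which is one reason methods such as SCT design their own invariant transformations.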
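Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand-new point-set learning framework, PRIN (Point-wise Rotation-Invariant Network), which focuses on rotation-invariant feature extraction in point cloud analysis. We construct spherical signals by density-aware adaptive sampling to deal with distorted point distributions in spherical space. Spherical voxel convolution and point re-sampling are proposed to extract rotation-invariant features for each point. In addition, we extend PRIN to a sparse version called SPRIN, which operates directly on sparse point clouds. Both PRIN and SPRIN can be applied to tasks ranging from object classification and part segmentation to 3D feature matching and label alignment. Results show that, on the dataset with randomly rotated point clouds, SPRIN demonstrates better performance than state-of-the-art methods without any data augmentation. We also provide thorough theoretical proof and analysis of the point-wise rotation invariance achieved by our methods. Our code is available at https://github.com/qq456cvb/sprin.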
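We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work on implicit neural scene representations, in which 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields, on which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method that produces a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates quality comparable to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentations that require only 2D supervision for training and no semantic input for inference on novel scenes. We encourage readers to visit the project website.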
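One common way to supervise a 3D semantic field with posed 2D semantic maps is to volume-render it along camera rays using weights from the density field, as in NeRF, and compare the result to the 2D labels. The sketch below shows that compositing for a single ray. It is an assumption about what such a renderer looks like, not code from NeSF, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def render_semantics(sigmas: Sequence[float], deltas: Sequence[float],
                     class_probs: Sequence[Sequence[float]]) -> List[float]:
    """
    Composite per-sample semantic distributions along one camera ray.
    sigmas[i]: density from the (pre-trained, frozen) density field at sample i.
    deltas[i]: distance from sample i to sample i + 1.
    class_probs[i]: semantic class distribution predicted at sample i.
    """
    transmittance = 1.0
    out = [0.0] * len(class_probs[0])
    for sigma, delta, probs in zip(sigmas, deltas, class_probs):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha           # probability the ray terminates here
        out = [o + weight * p for o, p in zip(out, probs)]
        transmittance *= 1.0 - alpha
    return out


# A very dense first sample hides everything behind it.
print(render_semantics([1000.0, 1000.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))  # [1.0, 0.0]
```

Because the weights come only from the density field, any method that produces a density field can be plugged in, which matches the compatibility claim in the abstract.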
We present a network architecture for processing point clouds that directly operates on a collection of points represented as a sparse set of samples in a high-dimensional lattice. Naïvely applying convolutions on this lattice scales poorly, both in terms of memory and computational cost, as the size of the lattice increases. Instead, our network uses sparse bilateral convolutional layers as building blocks. These layers maintain efficiency by using indexing structures to apply convolutions only on occupied parts of the lattice, and allow flexible specifications of the lattice structure enabling hierarchical and spatially-aware feature learning, as well as joint 2D-3D reasoning. Both point-based and image-based representations can be easily incorporated in a network with such layers and the resulting model can be trained in an end-to-end manner. We present results on 3D segmentation tasks where our approach outperforms existing state-of-the-art techniques.
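The core trick of sparse lattice convolution, computing only at occupied cells found through a hash table, can be sketched in three steps: splat, convolve, slice. This sketch swaps SPLATNet's permutohedral lattice and bilateral filtering for a plain cubic grid with nearest-cell splatting and scalar features, so it illustrates the sparsity idea only. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int, int]


def cell_of(p: Sequence[float], cell_size: float) -> Cell:
    return (int(p[0] // cell_size), int(p[1] // cell_size), int(p[2] // cell_size))


def splat(points: Sequence[Sequence[float]], feats: Sequence[float], cell_size: float) -> Dict[Cell, float]:
    """Accumulate scalar point features into the lattice cells they fall in."""
    grid: Dict[Cell, float] = defaultdict(float)
    for p, f in zip(points, feats):
        grid[cell_of(p, cell_size)] += f
    return dict(grid)  # plain dict: lookups of empty cells will not create entries


def box_kernel() -> Dict[Cell, float]:
    """3x3x3 all-ones kernel, keyed by integer offset."""
    return {o: 1.0 for o in product((-1, 0, 1), repeat=3)}


def sparse_conv(grid: Dict[Cell, float], kernel: Dict[Cell, float]) -> Dict[Cell, float]:
    """Convolve only at occupied cells; empty neighbors contribute nothing."""
    out = {}
    for cell in grid:
        acc = 0.0
        for offset, w in kernel.items():
            nb = (cell[0] + offset[0], cell[1] + offset[1], cell[2] + offset[2])
            acc += w * grid.get(nb, 0.0)
        out[cell] = acc
    return out


def slice_back(points: Sequence[Sequence[float]], grid: Dict[Cell, float], cell_size: float) -> List[float]:
    """Read each point's feature back from its cell."""
    return [grid[cell_of(p, cell_size)] for p in points]


pts = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.5, 0.2, 0.2), (5.0, 5.0, 5.0)]
grid = splat(pts, [1.0, 2.0, 4.0, 8.0], cell_size=1.0)
out = sparse_conv(grid, box_kernel())
print(slice_back(pts, out, 1.0))  # [7.0, 7.0, 7.0, 8.0]
```

Memory and compute scale with the number of occupied cells (three here) rather than with the full extent of the lattice, which is the efficiency argument made in the abstract.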
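Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...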
3D point clouds are rich in geometric structure information, while 2D images contain important and continuous texture information. Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding. Despite this success, it remains unclear how to fuse and process the cross-dimensional features from these two distinct spaces. Existing state-of-the-art methods usually exploit bidirectional projection to align the cross-dimensional features and perform both 2D and 3D semantic segmentation. However, to enable bidirectional mapping, this framework often requires a symmetrical 2D-3D network structure, which limits the network's flexibility. Meanwhile, such dual-task settings may easily distract the network and lead to over-fitting in the 3D segmentation task. Because of the network's inflexibility, fused features can only pass through a decoder network, which hurts model performance due to insufficient depth. To alleviate these drawbacks, in this paper we argue that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space, aligned with 3D deep semantic features, could lead to better feature fusion. On the one hand, unidirectional projection forces our model to focus more on the core task, i.e., 3D segmentation; on the other hand, relaxing bidirectional projection to unidirectional projection enables deeper cross-domain semantic alignment and the flexibility to fuse better and more complex features from very different spaces. Among joint 2D-3D approaches, our proposed method achieves superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.
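The unidirectional 2D-to-3D direction argued for in this abstract can be sketched as follows: project each 3D point into every view with a pinhole camera, read the 2D feature at that pixel, and average across views. Nearest-pixel sampling, the absence of an occlusion test, mean aggregation, and the `views` tuple layout are all illustrative assumptions. The lifted features would then be fused with the 3D branch's features, which is not shown.

```python
from typing import List, Optional, Sequence, Tuple

FeatureMap = Sequence[Sequence[Sequence[float]]]  # fmap[v][u] -> C-dim feature at pixel (u, v)
View = Tuple[FeatureMap, Sequence[Sequence[float]], Sequence[float], Tuple[float, float, float, float]]


def project(point: Sequence[float], R: Sequence[Sequence[float]], t: Sequence[float],
            fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a world point to pixel coordinates (None if behind the camera)."""
    xc = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    return fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy


def lift_features(points: Sequence[Sequence[float]], views: Sequence[View]) -> List[Optional[List[float]]]:
    """Each 3D point averages the 2D features of the pixels it projects to across all views."""
    lifted: List[Optional[List[float]]] = []
    for p in points:
        samples = []
        for fmap, R, t, (fx, fy, cx, cy) in views:
            uv = project(p, R, t, fx, fy, cx, cy)
            if uv is None:
                continue
            u, v = round(uv[0]), round(uv[1])  # nearest pixel
            if 0 <= v < len(fmap) and 0 <= u < len(fmap[0]):
                samples.append(fmap[v][u])
        lifted.append([sum(ch) / len(samples) for ch in zip(*samples)] if samples else None)
    return lifted


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fmap = [[[10.0 * u + v] for u in range(3)] for v in range(3)]  # fmap[v][u] = [10u + v]
view = (fmap, identity, [0.0, 0.0, 0.0], (1.0, 1.0, 1.0, 1.0))
print(lift_features([(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)], [view, view]))
# [[11.0], [21.0], None]
```

Nothing flows from 3D back to the image branch here, so the 2D network does not need to mirror the 3D one, which is the flexibility the abstract attributes to unidirectional projection.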
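Traditional approaches for learning 3D object categories have been predominantly trained and evaluated on synthetic datasets, due to the unavailability of real 3D-annotated category-centric data. Our main goal is to facilitate advances in this field by collecting real-world data at a magnitude similar to that of the existing synthetic counterparts. The principal contribution of this work is thus a large-scale dataset, called Common Objects in 3D, with real multi-view images of object categories annotated with camera poses and ground-truth 3D point clouds. The dataset contains a total of 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories and, as such, it is significantly larger than alternatives in terms of both the number of categories and the number of objects. We exploit this new dataset to conduct one of the first large-scale "in-the-wild" evaluations of several novel-view-synthesis and category-centric 3D reconstruction methods. Finally, we contribute a novel neural rendering method that leverages a powerful Transformer to reconstruct an object given a small number of its views. The CO3D dataset is available at https://github.com/facebookresearch/co3d.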
Scene understanding is a major challenge in today's computer vision. Central to this task is image segmentation, since scenes are often provided as a set of pictures. Nowadays, many such datasets also provide 3D geometry information in the form of a 3D point cloud acquired by a laser scanner or a depth camera. To exploit this geometric information, many current approaches rely on both a 2D loss and a 3D loss, requiring not only 2D per-pixel labels but also 3D per-point labels. However, obtaining 3D ground truth is challenging, time-consuming, and error-prone. In this paper, we show that image segmentation can benefit from 3D geometric information without requiring any 3D ground truth, by training the geometric feature extraction with a 2D segmentation loss in an end-to-end fashion. Our method starts by extracting a map of 3D features directly from the point cloud using a lightweight and simple 3D encoder neural network. The 3D feature map is then used as an additional input to a classical image segmentation network. During training, the 3D feature extraction is optimized for the segmentation task by back-propagation through the entire pipeline. Our method exhibits state-of-the-art performance with much lighter input dataset requirements, since no 3D ground truth is required.
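The success of the transformer architecture in natural language processing has recently attracted attention in the computer vision field. Due to its ability to learn long-range dependencies, the transformer has been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, in which several state-of-the-art methods rely on transformers for better learning. In computer vision, the 3D field has also witnessed an increase in employing the transformer alongside 3D convolutional neural networks and multi-layer perceptron networks. Although a number of surveys have focused on transformers in vision in general, 3D vision requires special attention due to the differences in data representation and processing compared to 2D vision. In this work, we present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss transformer design in 3D vision, which allows it to process data with various 3D representations. For each application, we highlight key properties and contributions of the proposed transformer-based methods. To assess the competitiveness of these methods, we compare their performance to common non-transformer methods on 12 3D benchmarks. We conclude the survey by discussing different open directions and challenges for transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.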