3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, two types of CNNs have been developed: CNNs based upon volumetric representations and CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.
A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.
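To make the view-pooling idea concrete, here is a minimal PyTorch sketch: a shared 2D backbone encodes each rendered view independently, and an element-wise max across views produces a single compact shape descriptor. The ResNet-18 backbone, 12 views, and 40 classes are illustrative assumptions, not the paper's exact configuration (the original work used a VGG-style network).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiViewCNN(nn.Module):
    """Minimal multi-view CNN: a shared 2D backbone encodes each rendered
    view; element-wise max pooling across views yields one shape descriptor."""
    def __init__(self, num_classes=40):
        super().__init__()
        backbone = models.resnet18(weights=None)        # any 2D CNN works here
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, views):                           # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        f = self.features(views.reshape(b * v, c, h, w))  # (B*V, 512, 1, 1)
        f = f.reshape(b, v, -1)                         # (B, V, 512)
        pooled, _ = f.max(dim=1)                        # view pooling: max over V
        return self.classifier(pooled)

logits = MultiViewCNN()(torch.randn(2, 12, 3, 224, 224))  # 12 views per shape
```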
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
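The permutation invariance described above comes from applying a shared per-point function and aggregating with a symmetric operator such as max pooling. The following is a minimal sketch of that idea; it omits PointNet's input and feature transform networks, and the layer widths are illustrative.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP followed by a symmetric max pooling, which makes
    the output invariant to the ordering of input points."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.point_mlp = nn.Sequential(          # applied to every point alike
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pts):                      # pts: (B, N, 3), any order
        per_point = self.point_mlp(pts)          # (B, N, 1024)
        global_feat, _ = per_point.max(dim=1)    # symmetric aggregation over N
        return self.head(global_feat)

model = TinyPointNet()
x = torch.rand(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(model(x), model(x[:, perm]), atol=1e-5)  # order-invariant
```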
Our method completes a partial 3D scan using a 3D Encoder-Predictor network that leverages semantic features from a 3D classification network. The predictions are correlated with a shape database, which we use in a multi-resolution 3D shape synthesis step. We obtain completed high-resolution meshes that are inferred from partial, low-resolution input scans.
Robust object recognition is a crucial skill for robots operating autonomously in real world environments. Range sensors such as LiDAR and RGBD cameras are increasingly found in modern robotic systems, providing a rich source of 3D information that can aid in this task. However, many current systems do not fully utilize this information and have trouble efficiently dealing with large amounts of point cloud data. In this paper, we propose VoxNet, an architecture to tackle this problem by integrating a volumetric Occupancy Grid representation with a supervised 3D Convolutional Neural Network (3D CNN). We evaluate our approach on publicly available benchmarks using LiDAR, RGBD, and CAD data. VoxNet achieves accuracy beyond the state of the art while labeling hundreds of instances per second.
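A rough sketch of the volumetric approach, assuming a 32³ binary occupancy grid as input; the layer sizes loosely follow the kind of small 3D CNN the paper describes but should be read as illustrative rather than the exact VoxNet architecture.

```python
import torch
import torch.nn as nn

class TinyVoxNet(nn.Module):
    """Minimal VoxNet-style 3D CNN over a 32^3 occupancy grid."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2), nn.ReLU(),  # 32^3 -> 14^3
            nn.Conv3d(32, 32, kernel_size=3), nn.ReLU(),           # 14^3 -> 12^3
            nn.MaxPool3d(2),                                       # 12^3 -> 6^3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6 * 6, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, grid):            # grid: (B, 1, 32, 32, 32) occupancy
        return self.head(self.conv(grid))

logits = TinyVoxNet()(torch.rand(2, 1, 32, 32, 32))
```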
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available; current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
Simultaneous object recognition and pose estimation are two key capabilities for robots to interact safely with humans and the environment. Although both object recognition and pose estimation use visual input, most state-of-the-art methods tackle them as two separate problems, since the former needs a view-invariant representation while object pose estimation requires a view-dependent description. Nowadays, multi-view convolutional neural network (MVCNN) approaches show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation, and even less on addressing both problems simultaneously. The poses of the virtual cameras in MVCNN approaches are usually pre-defined, which limits the applicability of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep object-agnostic entropy estimation model capable of predicting the optimal viewpoints of a given 3D object. The obtained views of the object are then fed to the network to simultaneously predict the pose and the category label of the target object. Experimental results show that the views obtained from such positions are descriptive enough to achieve good accuracy scores. In addition, we designed a real-life serve-drink scenario to demonstrate how well the proposed approach works in a real robot task. Code is available online: github.com/subhadityamukherjee/more_mvcnn
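The abstract does not specify how viewpoint quality is scored, so the sketch below is one hypothetical instantiation: rank candidate views by the Shannon entropy of a normalized depth render's histogram and keep the highest-scoring one. The paper's actual entropy model is learned rather than computed this way.

```python
import torch

def view_entropy(depth, bins=64):
    """Shannon entropy of a rendered view's depth histogram -- a plausible
    scalar target for an 'optimal viewpoint' scorer (hypothetical; the
    paper's exact entropy definition is not given in the abstract)."""
    hist = torch.histc(depth.flatten(), bins=bins, min=0.0, max=1.0)
    p = hist / hist.sum()
    p = p[p > 0]                           # drop empty bins: 0 * log(0) = 0
    return -(p * p.log()).sum()

# Pick the most "informative" of several candidate renders.
renders = torch.rand(12, 224, 224)         # 12 normalized depth images
best = max(range(12), key=lambda i: view_entropy(renders[i]).item())
```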
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.
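The core of the idea, predicting camera view-points from the shape itself rather than fixing them, can be sketched as a small regressor over a global point-cloud feature. The differentiable renderer that lets classification gradients flow back into these angles is omitted here, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ViewpointRegressor(nn.Module):
    """Predict per-view camera angles from a global point-cloud feature
    instead of fixing them a priori. Layer sizes are illustrative."""
    def __init__(self, num_views=8):
        super().__init__()
        self.num_views = num_views
        self.encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, 256))
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, num_views * 2), nn.Tanh())

    def forward(self, pts):                       # pts: (B, N, 3)
        feat, _ = self.encoder(pts).max(dim=1)    # permutation-invariant (B, 256)
        angles = self.head(feat).reshape(-1, self.num_views, 2)
        azim = angles[..., 0] * 180.0             # degrees in [-180, 180]
        elev = angles[..., 1] * 90.0              # degrees in [-90, 90]
        return azim, elev

azim, elev = ViewpointRegressor()(torch.rand(2, 1024, 3))
```

In the full pipeline these angles would parameterize cameras in a differentiable renderer, so that the downstream classification loss can be backpropagated into the regressor end-to-end.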
Convolutional neural networks (CNNs) have been widely used in various vision tasks, such as image classification and semantic segmentation. Unfortunately, standard 2D CNNs are not well suited to spherical signals such as panoramic images or spherical projections, because the sphere is an unstructured grid. In this paper, we present the Spherical Transformer, which can transform spherical signals into vectors that can be directly processed by standard CNNs, so that many well-designed CNN architectures can be reused across tasks and datasets via pre-processing. To this end, the proposed method first uses a local structured sampling method such as HEALPix to construct a transformer grid from the information of spherical points and their neighboring points, and then transforms the spherical signal into a vector through the grid. By building the spherical transformer module, we can use multiple CNN architectures directly. We evaluate our approach on the tasks of spherical MNIST recognition, 3D object classification, and omnidirectional image semantic segmentation. For 3D object classification, we further propose a rendering-based projection method to improve performance, and a rotation-equivariant model to improve robustness to rotation. Experimental results on the three tasks show that our approach achieves superior performance over state-of-the-art methods.
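A minimal sketch of the grid transformation step, assuming the neighbor index map has already been built by a local structured sampling scheme such as HEALPix (here it is random, purely for illustration): gathering each grid cell's K spherical neighbors yields a regular tensor that a standard 2D convolution can consume.

```python
import torch
import torch.nn as nn

def resample_sphere_to_grid(signal, idx):
    """Gather a spherical signal onto a regular 2D grid so a standard CNN can
    process it. signal: (N, C) features on N sphere points; idx: (H, W, K)
    precomputed indices of the K sphere points associated with each grid cell
    (e.g. from a HEALPix-style local sampling). Returns (C*K, H, W)."""
    h, w, k = idx.shape
    gathered = signal[idx.reshape(-1)]                 # (H*W*K, C)
    grid = gathered.reshape(h, w, k, -1)               # (H, W, K, C)
    return grid.permute(3, 2, 0, 1).reshape(-1, h, w)  # (C*K, H, W)

# Toy usage with random geometry; a real idx comes from the sampling scheme.
n_points, channels = 3072, 4
signal = torch.rand(n_points, channels)
idx = torch.randint(n_points, (32, 64, 9))             # 3x3-style neighborhoods
grid = resample_sphere_to_grid(signal, idx).unsqueeze(0)   # (1, C*K, 32, 64)
out = nn.Conv2d(channels * 9, 16, kernel_size=1)(grid)     # standard 2D conv
```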
We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches and recent learning based methods.
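The unprojection step can be sketched as projecting voxel centers into the image and bilinearly sampling the feature map there, which keeps the whole operation differentiable; the camera values below are toy assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def unproject_features(feat, K, Rt, voxel_xyz):
    """Differentiably lift 2D image features onto 3D voxels ('unprojection'):
    project each voxel center into the image with the camera matrices, then
    bilinearly sample the feature map there.
    feat: (C, H, W); K: (3, 3) intrinsics; Rt: (3, 4) extrinsics;
    voxel_xyz: (V, 3) voxel centers in world coordinates."""
    c, h, w = feat.shape
    homo = torch.cat([voxel_xyz, torch.ones(len(voxel_xyz), 1)], dim=1)  # (V, 4)
    cam = (K @ Rt @ homo.T).T                     # (V, 3) pixel-homogeneous
    uv = cam[:, :2] / cam[:, 2:].clamp(min=1e-6)  # perspective divide -> pixels
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,   # normalize to [-1, 1]
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                 # (1, 1, V, 2) for grid_sample
    sampled = F.grid_sample(feat.unsqueeze(0), grid, align_corners=True)
    return sampled.view(c, -1).T                  # (V, C) per-voxel features

# Toy camera looking at voxels centered on the origin.
feat = torch.rand(64, 60, 80)
K = torch.tensor([[100., 0., 40.], [0., 100., 30.], [0., 0., 1.]])
Rt = torch.cat([torch.eye(3), torch.tensor([[0.], [0.], [2.]])], dim=1)
vox = unproject_features(feat, K, Rt, torch.rand(32**3, 3) - 0.5)  # (V, 64)
```

Per-view voxel features produced this way can then be fused across views (e.g. by pooling or a recurrent unit) before decoding the 3D reconstruction.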
Recent advances in implicit 3D representations, i.e., Neural Radiance Fields (NeRFs), make accurate and photorealistic 3D reconstruction possible in a differentiable manner. This new representation can effectively convey the information of hundreds of high-resolution images in one compact format and allows the photorealistic synthesis of novel views. In this work, using a variant of NeRF called Plenoxels, we create the first large-scale implicit-representation dataset for perception tasks, called PeRFception, which consists of two parts covering both object-centric and scene-centric scans for classification and segmentation. It shows a significant memory compression rate (96.4%) from the original dataset, while containing both 2D and 3D information in a unified form. We construct classification and segmentation models that directly take this implicit format as input, and we also propose a novel augmentation technique to avoid overfitting to the backgrounds of images. Code and data are publicly available at https://postech-cvlab.github.io/perfception.
Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel-viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we also cover neural scene representations for modeling non-rigidly deforming objects...
We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways of representing its knowledge: as a global 3D grid of features and as an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible in terms of both memory and computation. We attach a DETR-style head on top of the 3D feature grid to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single-stage, end-to-end trainable, and can reason holistically about a scene from multiple video frames without a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos, and (2) a strong alternative method combining multi-view stereo with RGB-D CAD alignment. We plan to release our source code.
Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand-new point-set learning framework, PRIN, namely Point-wise Rotation Invariant Network, focusing on rotation-invariant feature extraction in point cloud analysis. We construct spherical signals by density-aware adaptive sampling to deal with distorted point distributions in spherical space. Spherical voxel convolution and point re-sampling are proposed to extract rotation-invariant features for each point. In addition, we extend PRIN to a sparse version called SPRIN, which directly operates on sparse point clouds. Both PRIN and SPRIN can be applied to tasks ranging from object classification and part segmentation to 3D feature matching and label alignment. Results show that, on a dataset with randomly rotated point clouds, SPRIN demonstrates better performance than state-of-the-art methods without any data augmentation. We also provide thorough theoretical proof and analysis of the point-wise rotation invariance achieved by our methods. Our code is available at https://github.com/qq456cvb/sprin.
where the highest resolution is required, using facial performance capture as a case in point.
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG, a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task. The dataset, code and pretrained model will be available online upon acceptance.
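A hedged sketch of what a dilation-based 3D context module might look like: parallel 3D convolutions with growing dilation rates expand the receptive field at fixed cost, and a 1×1×1 convolution fuses the branches. Branch count and rates are illustrative, not SSCNet's exact choices.

```python
import torch
import torch.nn as nn

class Dilated3DContext(nn.Module):
    """Parallel dilated 3D convolutions enlarge the receptive field cheaply;
    their concatenated outputs are fused by a 1x1x1 convolution."""
    def __init__(self, channels=32, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv3d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):                     # x: (B, C, D, H, W)
        ctx = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(torch.relu(ctx))

y = Dilated3DContext()(torch.rand(1, 32, 16, 32, 32))   # spatial shape preserved
```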
The success of transformers in natural language processing has recently attracted attention in the computer vision field. Owing to their ability to learn long-range dependencies, transformers have been used as a replacement for the widely used convolution operators. This replacement has proven successful in numerous tasks, in which several state-of-the-art methods rely on transformers for better learning. In computer vision, the 3D field has also witnessed a growing use of transformers to boost 3D convolutional neural networks and multi-layer perceptron networks. Although a number of surveys focus on transformers in vision in general, 3D vision requires special attention due to the differences in data representation and processing compared to 2D vision. In this work, we present a systematic and thorough review of more than 100 transformer-based methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss transformer design in 3D vision, which allows it to process data with various 3D representations. For each application, we highlight key properties and contributions of the transformer-based methods. To assess the competitiveness of these methods, we compare their performance to common non-transformer methods on 12 3D benchmarks. We conclude the survey with a discussion of the different open directions and challenges for transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.
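The common thread across the surveyed methods is treating 3D entities (points, voxels, patches) as tokens for standard self-attention. The generic sketch below illustrates that pattern only; it is not any specific method from the survey, and the embedding and model sizes are assumptions.

```python
import torch
import torch.nn as nn

# Treat 3D entities (here, raw points) as tokens and let standard
# self-attention learn long-range dependencies among them.
points = torch.rand(2, 1024, 3)                       # (B, N, xyz)
embed = nn.Linear(3, 128)                             # per-token embedding
layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
tokens = layer(embed(points))                         # (B, 1024, 128)
global_feat = tokens.max(dim=1).values                # set-level descriptor
```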
Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D clothed humans. However, such methods suffer heavily from the depth ambiguities and occlusions inherent to single-view input. In this paper, we address this issue by considering a small set of input views and investigating the best strategy to suitably exploit information from these views. We propose a data-driven end-to-end approach that reconstructs an implicit 3D representation of clothed humans from sparse camera views. Specifically, we introduce three key components: first, a spatially consistent reconstruction that uses a perspective camera model, allowing arbitrary placement of the person in the input views; second, an attention-based fusion layer for aggregating visual information from the multiple viewpoints; and third, a mechanism that encodes local 3D patterns under the multi-view context. In the experiments, we show that the proposed approach outperforms the state of the art on standard data, both quantitatively and qualitatively. To demonstrate the spatially consistent reconstruction, we apply our approach to dynamic scenes. In addition, we apply our method to real data acquired with a multi-camera platform and demonstrate that our approach can obtain results comparable to multi-view stereo with dramatically fewer views.
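The attention-based fusion layer can be sketched as a learned softmax weighting over the per-view features of each 3D query point; the dimensions and the single-linear scoring function below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionViewFusion(nn.Module):
    """Each 3D query point carries one feature per input view; a learned
    attention weights and mixes them into a single fused feature."""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, view_feats):            # (B, P, V, C): P points, V views
        w = torch.softmax(self.score(view_feats), dim=2)   # weights over views
        return (w * view_feats).sum(dim=2)    # (B, P, C) fused per-point feature

fused = AttentionViewFusion()(torch.rand(2, 4096, 3, 64))  # 3 input views
```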