We present an automatic method for annotating images of indoor scenes with the CAD models of the objects by relying on RGB-D scans. Through a visual evaluation by 3D experts, we show that our method retrieves annotations that are at least as accurate as manual annotations, and can thus be used as ground truth without the burden of manually annotating 3D data. We do this using an analysis-by-synthesis approach, which compares renderings of the CAD models with the captured scene. We introduce a 'cloning procedure' that identifies objects that have the same geometry, to annotate these objects with the same CAD models. This allows us to obtain complete annotations for the ScanNet dataset and the recent ARKitScenes dataset.
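AR/VR applications and robots need to know when a scene has changed, for example when objects are moved, added, or removed. We propose a 3D object discovery method that is based solely on scene changes. Our method does not need to encode any assumptions about what an object is; instead, it discovers objects by exploiting their coherent motion. Changes are initially detected as differences between depth maps and are segmented as objects if they undergo rigid motion. A graph cut optimization then propagates the changing labels to geometrically consistent regions. Experiments show that our method achieves state-of-the-art performance on the 3RScan dataset against competitive baselines. The source code of our method can be found at https://github.com/katadam/objectscanmove.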
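As a rough illustration of the first step described in this abstract, where changes are initially detected as differences between depth maps, the following sketch thresholds the per-pixel difference of two aligned depth maps. The function name `depth_change_mask`, the 5 cm threshold, and the convention that a depth of 0 means "no measurement" are assumptions made for this example and are not taken from the paper; the rigid-motion segmentation and graph cut steps are not covered.

```python
import numpy as np

def depth_change_mask(depth_a, depth_b, threshold=0.05):
    """Return a boolean mask of pixels whose depth changed by more than `threshold`.

    Assumes both maps come from the same camera pose, share the same units
    (metres here), and use 0 to mark missing measurements (all hypothetical).
    """
    depth_a = np.asarray(depth_a, dtype=float)
    depth_b = np.asarray(depth_b, dtype=float)
    # Only compare pixels that have a measurement in both maps.
    valid = (depth_a > 0) & (depth_b > 0)
    return valid & (np.abs(depth_a - depth_b) > threshold)

# Toy example: a flat wall at 2 m, then an object appears 0.5 m in front of it.
before = np.full((4, 4), 2.0)
after = before.copy()
after[1:3, 1:3] = 1.5
after[0, 0] = 0.0  # a missing measurement must not count as a change
mask = depth_change_mask(before, after)
```

In this toy case only the four pixels covered by the new object are flagged; the pixel with the missing measurement is ignored.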
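Scene understanding is an active research area. Commercial depth sensors such as Kinect have enabled the release of several RGB-D datasets over the past few years, which in turn spawned new methods for 3D scene understanding. More recently, the launch of the LiDAR sensor in Apple's iPads and iPhones has made high-quality RGB-D data accessible on devices that people commonly use. This opens a whole new era for the computer vision community as well as for app developers. Fundamental research in scene understanding, together with advances in machine learning, can now affect people's everyday experiences. However, turning these scene understanding methods into real-world experiences requires additional innovation and development. In this paper we introduce ARKitScenes. It is not only the first RGB-D dataset captured with a now widely available depth sensor, but to the best of our knowledge it is also the largest indoor scene understanding dataset released. In addition to the raw and processed data from the mobile device, ARKitScenes includes high-resolution depth maps captured with a stationary laser scanner, as well as manually labeled 3D oriented bounding boxes for a large taxonomy of furniture. We further analyze the usefulness of the data for two downstream tasks: 3D object detection and color-guided depth upsampling. We show that our dataset can help push the boundaries of existing state-of-the-art methods, and that it introduces new challenges that better represent real-world scenarios.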
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS), a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
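To make the step of combining per-pixel NOCS predictions with the depth map more concrete, the sketch below fits a scale, rotation, and translation between predicted canonical coordinates and the corresponding back-projected camera-space points. It uses the closed-form Umeyama least-squares solution, which is one standard choice for this kind of alignment; the abstract does not say which solver the authors use, so treat this as an assumption. The function name `fit_similarity`, the centred coordinate range, and the synthetic data are hypothetical, and outlier handling (for example RANSAC) is left out.

```python
import numpy as np

def fit_similarity(nocs_pts, cam_pts):
    """Least-squares s, R, t with cam_pts ~= s * R @ nocs_pts + t (Umeyama)."""
    nocs_pts = np.asarray(nocs_pts, dtype=float)
    cam_pts = np.asarray(cam_pts, dtype=float)
    n = len(nocs_pts)
    mu_n, mu_c = nocs_pts.mean(axis=0), cam_pts.mean(axis=0)
    xn, xc = nocs_pts - mu_n, cam_pts - mu_c
    # Cross-covariance between the centred point sets.
    cov = xc.T @ xn / n
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_n = (xn ** 2).sum() / n
    s = (S * np.diag(D)).sum() / var_n
    t = mu_c - s * R @ mu_n
    return s, R, t

# Synthetic check: canonical points, then a known metric transform.
rng = np.random.default_rng(0)
nocs = rng.uniform(-0.5, 0.5, size=(100, 3))  # assumed centred canonical coords
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 0.3, np.array([0.1, -0.2, 1.5])
cam = s_true * nocs @ R_true.T + t_true
s, R, t = fit_similarity(nocs, cam)
```

Here `R` and `t` give the 6D pose, while `s` relates the normalized coordinates to metric size; on noise-free data the fit recovers the transform used to generate the points.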
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available: current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
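Semantic scene reconstruction from point clouds is an essential task for 3D scene understanding. It requires not only recognizing each instance in the scene, but also recovering its geometry from the partially observed point cloud. Existing methods usually attempt to directly predict the occupancy values of the complete object from incomplete point cloud proposals produced by a detection-based backbone. However, this framework consistently fails to reconstruct high-fidelity meshes, because it is hindered by the many false positive object proposals and by the ambiguity of learning occupancy values for complete objects from incomplete point observations. To circumvent these obstacles, we propose a Disentangled Instance Mesh Reconstruction (DIMR) framework for effective point scene understanding. A segmentation-based backbone is adopted to reduce false positive object proposals, which further benefits our exploration of the relationship between recognition and reconstruction. Based on the accurate proposals, we leverage a mesh-aware latent code space to disentangle the processes of shape completion and mesh generation, which relieves the ambiguity caused by incomplete point observations. Furthermore, with access to a pool of CAD models at test time, our model can also improve reconstruction quality by performing mesh retrieval without extra training. We thoroughly evaluate the quality of the reconstructed meshes with multiple metrics and demonstrate the superiority of our method on the challenging ScanNet dataset. Code is available at https://github.com/ashawkey/dimr.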
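6D object pose estimation is one of the fundamental problems in computer vision and robotics research. Although many recent efforts have been made to generalize pose estimation to novel object instances within the same category (that is, category-level 6D pose estimation), it remains restricted to constrained environments because annotated data is limited. In this paper, we collect Wild6D, a new unlabeled RGB-D object video dataset with diverse instances and backgrounds. We use this data to generalize category-level 6D object pose estimation in the wild with semi-supervised learning. We propose a new model, the Rendering for Pose estimation network (RePoNet), which is jointly trained using free ground truth from synthetic data and a silhouette matching objective function on real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods on the previous dataset and on our Wild6D test set (manually annotated for evaluation) by a large margin. Project page with Wild6D data: https://oasisyang.github.io/semi-pose.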
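We introduce a dataset of 998 3D models of everyday tabletop objects along with their 847,000 real-world RGB and depth images. Accurate annotations of the camera pose and object pose for each image are produced in a semi-automated fashion to facilitate the use of the dataset in a variety of 3D applications, such as shape reconstruction, object pose estimation, and shape retrieval. We focus on 3D reconstruction, a task that has lacked an appropriate real-world benchmark, and demonstrate that our dataset can fill that gap. The entire annotated dataset, along with the source code for the annotation tools and evaluation baselines, is available at http://www.ocrtoc.org/3d-reconstruction.html.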
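We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically based materials that correspond to real household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state of the art on three open problems in understanding real-world 3D objects: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.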
Although RGB-D sensors have enabled major breakthroughs for several vision tasks, such as 3D reconstruction, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main reasons is the lack of a large-scale benchmark with 3D annotations and 3D evaluation metrics. In this paper, we introduce an RGB-D benchmark suite for the goal of advancing the state of the art in all major scene understanding tasks. Our dataset is captured by four different sensors and contains 10,335 RGB-D images, at a similar scale as PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset enables us to train data-hungry algorithms for scene-understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias.
Testing set | Training set: NYU (795 images) | Training set: SUN RGB-D (5,285 images)
NYU | 32.50 | 34.33
SUN RGB-D | 15.78 | 33.20
Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered robotics and AR/VR applications. In this direction, we explore the tasks of 3D human semantic-, instance- and multi-human body-part segmentation. Few works have attempted to directly segment humans in point clouds (or depth maps), which is largely due to the lack of training data on humans interacting with 3D scenes. We address this challenge and propose a framework for synthesizing virtual humans in realistic 3D scenes. Synthetic point cloud data is attractive since the domain gap between real and synthetic depth is small compared to images. Our analysis of different training schemes using a combination of synthetic and realistic data shows that synthetic data for pre-training improves performance in a wide variety of segmentation tasks and models. We further propose the first end-to-end model for 3D multi-human body-part segmentation, called Human3D, that performs all the above segmentation tasks in a unified manner. Remarkably, Human3D even outperforms previous task-specific state-of-the-art methods. Finally, we manually annotate humans in test scenes from EgoBody to compare the proposed training schemes and segmentation models.
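Predicting the 3D shape and pose of static objects from a single RGB image is an important research area in modern computer vision. Its applications range from augmented reality to robotics and digital content creation. Typically, this task is performed through direct object shape and pose predictions, which are inaccurate. A promising research direction ensures meaningful shape predictions by retrieving CAD models from large-scale databases and aligning them to the objects observed in the image. However, existing work does not take the object geometry into account, which leads to inaccurate object pose predictions, especially for unseen objects. In this work, we demonstrate how cross-domain keypoint matches from an RGB image to a rendered CAD model allow for more precise object pose predictions than those obtained through direct predictions. We further show that keypoint matches can be used not only to estimate the pose of an object, but also to modify the shape of the object itself. This is important because the accuracy that can be achieved with object retrieval alone is inherently limited by the available CAD models. Allowing shape adaptation bridges the gap between the retrieved CAD model and the observed shape. We demonstrate our approach on the challenging Pix3D dataset. The proposed geometric shape prediction improves the AP mesh from 33.2 to 37.8 on seen objects and from 8.2 to 17.1 on unseen objects. Furthermore, when following the proposed shape adaptation, we demonstrate more accurate shape predictions without closely matching CAD models. Code is publicly available at HTTPS://github.com/florianlanger/leveraging_geometry_for_shape_eStimation.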
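We present ROCA, a novel end-to-end approach that retrieves and aligns 3D CAD models from a shape database to a single input image. This enables 3D perception of an observed scene from a 2D RGB observation, characterized as a lightweight, compact, clean CAD representation. Core to our approach is a differentiable alignment optimization based on dense 2D-3D object correspondences and Procrustes alignment. ROCA can thus provide robust CAD alignment while simultaneously informing CAD retrieval by leveraging the 2D-3D correspondences to learn geometrically similar CAD models. Experiments on real-world images from ScanNet show that ROCA significantly improves on the state of the art, raising retrieval-aware CAD alignment accuracy from 9.5% to 17.6%.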
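We present MonteBoxFinder, a method that, given a noisy input point cloud, fits cuboids to the input scene. Our main contribution is a discrete optimization algorithm that, starting from a set of initially detected cuboids, efficiently filters the good boxes from the noisy ones. Inspired by recent applications of Monte Carlo tree search (MCTS) to scene understanding problems, we develop a stochastic algorithm that is, by design, more efficient for our task. Indeed, the quality of the fit of a cuboid arrangement is invariant to the order in which the cuboids are added to the scene. We develop several search baselines for our problem and demonstrate on the ScanNet dataset that our approach is more efficient and more precise. Finally, we strongly believe that our core algorithm is very general and that it could be extended to many other problems in 3D scene understanding.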
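In this work, we explore reconstructing hand-object interactions in the wild. The core challenge of this problem is the lack of appropriate 3D labeled data. To overcome this issue, we propose an optimization-based procedure that does not require direct 3D supervision. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, in-the-lab MoCap) to provide constraints for the 3D reconstruction. Rather than optimizing the hand and the object individually, we optimize them jointly, which allows us to impose additional constraints based on hand-object contact, collision, and occlusion. Our method produces compelling reconstructions on the EPIC Kitchens and 100 Days of Hands datasets, across a range of object categories. Quantitatively, we demonstrate that our approach compares favorably to existing approaches in lab settings where ground truth 3D annotations are available.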
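Current 3D segmentation methods rely heavily on large-scale point cloud datasets, which are notoriously laborious to annotate. Few attempts have been made to circumvent the need for per-point annotations. In this work, we look at weakly supervised 3D semantic instance segmentation. The key idea is to leverage 3D bounding box labels, which are easier and faster to annotate. Indeed, we show that it is possible to train dense segmentation models using only bounding box labels. At the core of our method, Box2Mask, lies a deep model, inspired by classical Hough voting, that directly votes for bounding box parameters, together with a clustering method specifically tailored to bounding box votes. This goes beyond the commonly used center votes, which would not fully exploit the bounding box annotations. On the ScanNet test set, our weakly supervised model attains leading performance among weakly supervised approaches (+18 mAP@50). Remarkably, it also achieves 97% of the mAP@50 score of current fully supervised models. To further illustrate the practicality of our work, we train Box2Mask on the recently released ARKitScenes dataset, which is annotated with 3D bounding boxes only, and show compelling 3D instance segmentation masks for the first time.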
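We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods, which are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities to be used (RGB or depth). Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusions are that RGB excels in correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We perform an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality and the type of training data used.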
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG, a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task. The dataset, code and pretrained model will be available online upon acceptance.
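The claim that dilation efficiently expands the receptive field can be checked with simple arithmetic: for stride-1 convolutions, each layer with kernel size k and dilation d widens the receptive field along one axis by (k - 1) * d voxels. The sketch below applies that formula to two hypothetical three-layer stacks. The layer configurations are illustrative assumptions and are not the actual SSCNet architecture.

```python
def receptive_field(layers):
    """Receptive field, in voxels along one axis, of stacked stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Three 3x3x3 layers without dilation versus with dilations 1, 2, 4.
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
```

With the same number of layers and the same number of weights per layer, the dilated stack covers 15 voxels per axis instead of 7.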
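Advances in deep learning recognition have led to accurate object detection in 2D images. However, these 2D perception methods are insufficient for recovering complete 3D world information. Meanwhile, advanced 3D shape estimation approaches focus on the shape itself without considering metric scale, so they cannot determine the accurate location and orientation of objects. To tackle this problem, we propose a framework that jointly estimates metric-scale shape and pose from a single RGB image. Our framework has two branches: the Metric Scale Object Shape branch (MSOS) and the Normalized Object Coordinate Space branch (NOCS). The MSOS branch estimates the metric-scale shape observed in camera coordinates. The NOCS branch predicts the normalized object coordinate space (NOCS) map and performs a similarity transformation with the depth map rendered from the predicted metric-scale mesh to obtain the 6D pose and size. Additionally, we introduce Normalized Object Center Estimation (NOCE) to estimate the geometrically aligned distance from the camera to the object center. We validate our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.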
We introduce Similarity Group Proposal Network (SGPN), a simple and intuitive deep learning framework for 3D object instance segmentation on point clouds. SGPN uses a single network to predict point grouping proposals and a corresponding semantic class for each proposal, from which we can directly extract instance segmentation results. Important to the effectiveness of SGPN is its novel representation of 3D instance segmentation results in the form of a similarity matrix that indicates the similarity between each pair of points in embedded feature space, thus producing an accurate grouping proposal for each point. Experimental results on various 3D scenes show the effectiveness of our method on 3D instance segmentation, and we also evaluate the capability of SGPN to improve 3D object detection and semantic segmentation results. We also demonstrate its flexibility by seamlessly incorporating 2D CNN features into the framework to boost performance.
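The similarity-matrix idea in this abstract can be sketched in a few lines: given one embedding vector per point, compute all pairwise distances in feature space and read each row as a grouping proposal for the corresponding point. The use of squared L2 distance, the fixed threshold, and the names `similarity_matrix` and `group_proposals` are assumptions for this illustration; the abstract does not specify them, and the network that produces the embeddings is not shown.

```python
import numpy as np

def similarity_matrix(features):
    """Pairwise squared L2 distances between per-point embeddings, shape (N, N)."""
    features = np.asarray(features, dtype=float)
    sq_norms = (features ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.maximum(dist, 0.0)

def group_proposals(features, threshold):
    """Row i is a boolean mask of the points proposed to share an instance with point i."""
    return similarity_matrix(features) < threshold

# Four points whose (hypothetical) embeddings form two well-separated clusters.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
proposals = group_proposals(feats, threshold=1.0)
```

Each row of `proposals` groups a point with the other point in its cluster and excludes the two points of the other cluster; merging duplicate rows would then yield the instance masks.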