Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform state-of-the-art zero-shot 3D semantic segmentation, it first infers CLIP features for every 3D point and then classifies each point based on its similarity to the embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications never demonstrated before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
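The zero-shot classification step described above reduces to a nearest-neighbor lookup in CLIP space: each point takes the label whose text embedding it is most similar to. A minimal NumPy sketch, with illustrative names and toy 2-D "embeddings" standing in for real CLIP features:

```python
import numpy as np

def classify_points(point_feats: np.ndarray, label_embs: np.ndarray) -> np.ndarray:
    """Assign each 3D point the label whose text embedding is most similar.

    point_feats: (N, D) per-point features co-embedded in CLIP space.
    label_embs:  (K, D) CLIP text embeddings of arbitrary class names.
    Returns an (N,) array of label indices.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sim = p @ t.T              # (N, K) similarity matrix
    return sim.argmax(axis=1)  # most similar label per point

# Toy example: two points, two labels, 2-D stand-in embeddings.
points = np.array([[1.0, 0.1], [0.1, 1.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
print(classify_points(points, labels))  # → [0 1]
```

Because the label set enters only through `label_embs`, swapping in embeddings of new class names requires no retraining, which is what makes the queries open-vocabulary.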
Neural Radiance Fields (NeRFs) have demonstrated an amazing ability to synthesize images of 3D scenes from novel viewpoints. However, they rely on specialized volumetric rendering algorithms based on ray marching that are mismatched to the capabilities of widely deployed graphics hardware. This paper introduces a new NeRF representation based on textured polygons that can synthesize novel images efficiently with standard rendering pipelines. The NeRF is represented as a set of polygons whose textures encode binary opacities and feature vectors. Traditional rendering of the polygons with a z-buffer yields an image with a feature at every pixel, which is interpreted by a small, view-dependent MLP running in a fragment shader to produce the final pixel color. This approach enables NeRFs to be rendered with the traditional polygon rasterization pipeline, which provides massive pixel-level parallelism, achieving interactive frame rates on a wide range of compute platforms, including mobile phones.
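The deferred-shading step can be illustrated outside the GPU: the sketch below mimics, in NumPy, a tiny view-dependent MLP that maps a per-pixel texture feature plus a view direction to an RGB color. The layer sizes and random weights are purely illustrative, not the paper's trained shader:

```python
import numpy as np

def fragment_shader(feat, view_dir, W1, b1, W2, b2):
    """Tiny view-dependent MLP, conceptually one invocation per rasterized pixel.

    feat:     (F,) feature vector sampled from the polygon texture.
    view_dir: (3,) unit view direction for this pixel.
    The weights are small enough to evaluate inside a GPU fragment shader.
    """
    x = np.concatenate([feat, view_dir])
    h = np.maximum(W1 @ x + b1, 0.0)             # ReLU hidden layer
    rgb = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid -> color in [0, 1]
    return rgb

# Illustrative sizes: 8-D features, 16 hidden units, RGB output.
rng = np.random.default_rng(0)
F, H = 8, 16
W1, b1 = rng.normal(size=(H, F + 3)), np.zeros(H)
W2, b2 = rng.normal(size=(3, H)), np.zeros(3)
color = fragment_shader(rng.normal(size=F), np.array([0.0, 0.0, 1.0]), W1, b1, W2, b2)
print(color.shape)  # (3,)
```

Rasterization with a z-buffer resolves visibility for free, so only this small per-pixel network remains, which is why the approach maps well onto commodity graphics hardware.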
We investigate pneumatic non-prehensile manipulation (i.e., blowing) as a means of efficiently moving scattered objects into a target receptacle. Due to the chaotic nature of aerodynamic forces, a blowing controller must (i) continually adapt to unexpected changes caused by its own actions, (ii) maintain fine-grained control, since the slightest misstep can have large unintended consequences (e.g., scattering objects that are already in a pile), and (iii) infer long-range plans (e.g., moving the robot to strategic blowing locations). We tackle these challenges in the context of deep reinforcement learning, introducing a multi-frequency version of the spatial action maps framework. This formulation enables efficient learning of vision-based policies that effectively combine high-level planning with low-level closed-loop control for dynamic mobile manipulation. Experiments show that our system learns efficient behaviors for the task, in particular demonstrating that blowing achieves better downstream performance than pushing, and that our policies improve performance over baselines. Moreover, we show that our system naturally encourages emergent specialization between the different subpolicies spanning low-level fine-grained control and high-level planning. On a real mobile robot equipped with a miniature air blower, we show that our simulation-trained policies transfer well to a real environment and can generalize to novel objects.
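One way to picture a multi-frequency spatial action map is as a pyramid of per-pixel Q-value maps at different spatial scales, from which the agent picks the single best cell. The sketch below is a hypothetical selection rule for illustration, not the paper's architecture:

```python
import numpy as np

def select_action(q_maps):
    """Pick the highest-value action across a pyramid of Q-value maps.

    q_maps: list of (H_i, W_i) arrays, one per spatial frequency/scale.
    Returns (scale_index, row, col) of the argmax cell.
    """
    best = None
    for s, q in enumerate(q_maps):
        r, c = np.unravel_index(np.argmax(q), q.shape)
        if best is None or q[r, c] > best[0]:
            best = (q[r, c], s, r, c)
    _, s, r, c = best
    return s, r, c

# Two hypothetical scales: a coarse 4x4 map (long-range relocation)
# and a fine 16x16 map (precise local blowing).
coarse = np.zeros((4, 4)); coarse[1, 2] = 0.7
fine = np.zeros((16, 16)); fine[10, 3] = 0.9
print(select_action([coarse, fine]))  # → (1, 10, 3)
```

Under this framing, specialization can emerge because coarse cells naturally encode navigation-like moves while fine cells encode precise local control.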
3D object detection is a key module for safety-critical robotics applications such as autonomous driving. For these applications, we care most about how detection affects the ego agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of an object's geometry when it is more likely to interfere with the ego agent's motion trajectory. However, current detection metrics based on box intersection-over-union (IoU) are object-centric and are not designed to capture the spatio-temporal relationship between objects and the ego agent. To address this issue, we propose a new egocentric measure for evaluating 3D object detection, namely Support Distance Error (SDE). Our SDE-based analysis reveals that egocentric detection quality is bounded by the coarse geometry of bounding boxes. Given the insight that SDE would benefit from more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego agent's safety compared to IoU, and that the contours estimated by StarPoly consistently improve the egocentric detection quality of recent 3D object detectors.
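To make the idea concrete, the sketch below computes an illustrative support-distance error: the difference in closest ego-to-contour distance between a predicted and a ground-truth star-shaped polygon. The sampling scheme and names are assumptions for illustration, not the paper's exact definition:

```python
import numpy as np

def support_distance(ego, contour):
    """Closest distance from the ego position to a sampled object contour."""
    return np.min(np.linalg.norm(contour - ego, axis=1))

def star_polygon(center, radii):
    """Star-shaped polygon: one radius per evenly spaced angle around the center."""
    angles = np.linspace(0.0, 2 * np.pi, len(radii), endpoint=False)
    return center + np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

ego = np.array([0.0, 0.0])
gt = star_polygon(np.array([5.0, 0.0]), np.full(16, 1.0))    # ground-truth contour
pred = star_polygon(np.array([5.0, 0.0]), np.full(16, 1.2))  # slightly inflated prediction
sde = abs(support_distance(ego, pred) - support_distance(ego, gt))
print(round(sde, 2))  # → 0.2
```

Unlike IoU, this quantity is measured from the ego agent's position, so the same geometric error matters more when the object is closer to the ego trajectory.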
The goal of this work is 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images of small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise the density on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views than both traditional methods (e.g., COLMAP) and recent neural representations (e.g., Mip-NeRF).
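The sky supervision idea can be sketched as a simple loss: rays whose pixels a 2D segmenter labels as sky should accumulate zero opacity, i.e., terminate at infinity. A hypothetical NumPy version of such a term, not the paper's implementation:

```python
import numpy as np

def accumulated_opacity(sigmas, deltas):
    """Total opacity 1 - T accumulated along a ray under volume rendering."""
    return 1.0 - np.exp(-np.sum(sigmas * deltas))

def sky_loss(sigmas, deltas):
    """Penalize any density on a ray labeled as sky by a 2D segmenter:
    a sky ray should accumulate zero opacity."""
    return accumulated_opacity(sigmas, deltas) ** 2

# An empty ray incurs zero loss; residual density gets a positive penalty.
deltas = np.full(8, 0.5)
print(sky_loss(np.zeros(8), deltas))          # → 0.0
print(sky_loss(np.full(8, 0.1), deltas) > 0)  # → True
```

Driving density toward zero on sky rays removes the floating artifacts that otherwise appear where the radiance field has no geometric supervision.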
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
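The classic volume rendering the method relies on composites per-sample colors with transmittance-weighted opacities along each ray. A minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def render_ray(rgbs, sigmas, deltas):
    """Classic volume rendering: alpha-composite samples along a ray.

    rgbs:   (S, 3) predicted colors at each sample.
    sigmas: (S,)   predicted volume densities.
    deltas: (S,)   distances between adjacent samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]
    weights = trans * alphas                 # compositing weights
    return (weights[:, None] * rgbs).sum(axis=0), weights

# A fully opaque first sample occludes everything behind it.
rgbs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sigmas = np.array([1000.0, 1000.0])
color, weights = render_ray(rgbs, sigmas, np.array([1.0, 1.0]))
print(np.round(color, 3))  # → [1. 0. 0.]
```

Every operation here is differentiable in `rgbs` and `sigmas`, which is what allows training from posed multiview images alone.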
Figure 1: (a) Training parts from ShapeNet. (b) t-SNE plot of part embeddings. (c) Reconstructing entire scenes with Local Implicit Grids. We learn an embedding of parts from objects in ShapeNet [3] using a part autoencoder with an implicit decoder. We show that this representation of parts is generalizable across object categories, and easily scalable to large scenes. By localizing implicit functions in a grid, we are able to reconstruct entire scenes from points via optimization of the latent grid.
Figure 1. This paper introduces Local Deep Implicit Functions, a 3D shape representation that decomposes an input shape (mesh on left in every triplet) into a structured set of shape elements (colored ellipses on right) whose contributions to an implicit surface reconstruction (middle) are represented by latent vectors decoded by a deep network. Project video and website at ldif.cs.princeton.edu.
Figure 1. Shapes from the ShapeNet [8] database, fit to a structured implicit template, and arranged by template parameters using t-SNE [52]. Similar shape classes, such as airplanes, cars, and chairs, naturally cluster by template parameters.
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.