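We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically based materials that correspond to real household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state of the art on three open problems in real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.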
translated by 谷歌翻译
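We present a dataset of 998 3D models of everyday tabletop objects together with 847,000 real-world RGB and depth images of them. Accurate annotation of the camera pose and object pose for each image is performed in a semi-automated fashion, to facilitate the use of the dataset in a variety of 3D applications such as shape reconstruction, object pose estimation, and shape retrieval. We focus on 3D reconstruction, motivated by the lack of an appropriate real-world benchmark for the task, and demonstrate that our dataset can fill that gap. The entire annotated dataset, along with the source code for the annotation tools and evaluation baselines, is available at http://www.ocrtoc.org/3d-reconstruction.html.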
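A 3D scene consists of a set of objects, each with a shape and a layout giving its position in space. Understanding 3D scenes from 2D images is an important goal, with applications in robotics and graphics. Although there have been recent advances in predicting 3D shape and layout from a single image, most approaches rely on 3D ground truth for training, which is expensive to collect at scale. We overcome these limitations and propose a method that learns to predict the 3D shape and layout of objects without any ground-truth shape or layout information: instead, we rely on multi-view images with 2D supervision, which can be collected at scale more easily. Through extensive experiments on 3D Warehouse, Hypersim, and ScanNet, we demonstrate that our approach scales to large datasets of realistic images and compares favorably to methods that rely on 3D ground truth. On Hypersim and ScanNet, where reliable 3D ground truth is not available, our approach outperforms supervised approaches trained on smaller and less diverse datasets.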
Understanding the 3D world without supervision is currently a major challenge in computer vision as the annotations required to supervise deep networks for tasks in this domain are expensive to obtain on a large scale. In this paper, we address the problem of unsupervised viewpoint estimation. We formulate this as a self-supervised learning task, where image reconstruction provides the supervision needed to predict the camera viewpoint. Specifically, we make use of pairs of images of the same object at training time, from unknown viewpoints, to self-supervise training by combining the viewpoint information from one image with the appearance information from the other. We demonstrate that using a perspective spatial transformer allows efficient viewpoint learning, outperforming existing unsupervised approaches on synthetic data, and obtains competitive results on the challenging PASCAL3D+ dataset.
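We propose Differentiable Stereopsis, a multi-view stereo approach that reconstructs shape and texture from few input views and noisy cameras. We pair traditional stereopsis with modern differentiable rendering to build an end-to-end model that predicts textured 3D meshes of objects with varying topologies and shapes. We frame stereopsis as an optimization problem and simultaneously update shape and cameras via simple gradient descent. We run an extensive quantitative analysis and compare against traditional multi-view stereo techniques and state-of-the-art learning-based methods. We show compelling reconstructions on challenging real-world scenes and for a wide range of object types with complex shape, topology, and texture. Project webpage: https://shubham-goel.github.io/ds/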
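Intrinsic imaging, or intrinsic image decomposition, has traditionally been described as the problem of decomposing an image into two layers: a reflectance layer, describing the reflectance of the material; and a shading layer, produced by the interaction between light and geometry. Deep learning techniques have been broadly applied in recent years to increase the accuracy of these separations. In this survey, we review those results in the context of well-known intrinsic image datasets and the relevant metrics used in the literature, and discuss their suitability for predicting a desirable intrinsic image decomposition. Although the Lambertian assumption is still the foundation of many methods, we show that there is increasing awareness of the potential of more sophisticated, physically principled components of the image formation process, that is, optically accurate material models and geometry, and more complete inverse light transport estimation. We classify these methods by the type of decomposition, considering the priors and models used as well as the learning architecture and methodology driving the decomposition process. Given the recent advances in neural, inverse, and differentiable rendering techniques, we also provide insights into future research directions.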
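The two-layer model described above can be illustrated with a minimal sketch. It assumes the simplest Lambertian, single-channel form, in which each pixel is the product of a reflectance value and a shading value; the function names `compose` and `recover_reflectance` are hypothetical and are not taken from any surveyed method. Real decomposition is much harder, because neither layer is known in advance.

```python
def compose(reflectance, shading):
    """Lambertian image formation: per-pixel product of reflectance and shading."""
    return [[r * s for r, s in zip(r_row, s_row)]
            for r_row, s_row in zip(reflectance, shading)]


def recover_reflectance(image, shading, eps=1e-8):
    """Invert the model for reflectance, assuming the shading layer is known.

    Hypothetical helper: eps guards against division by zero in dark pixels.
    """
    return [[i / max(s, eps) for i, s in zip(i_row, s_row)]
            for i_row, s_row in zip(image, shading)]


# Toy 2x2 grayscale example.
reflectance = [[0.8, 0.2], [0.5, 0.5]]
shading = [[1.0, 0.5], [0.25, 1.0]]
image = compose(reflectance, shading)
recovered = recover_reflectance(image, shading)
```

The ambiguity is visible even in this toy case: scaling the reflectance by a constant and dividing the shading by the same constant yields the same image, which is one reason the priors discussed in the survey are needed.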
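Approaches to single-view reconstruction typically rely on viewpoint annotations, silhouettes, the absence of background, multiple views of the same instance, a template shape, or symmetry. We avoid all such supervision and assumptions by explicitly leveraging the consistency between images of different object instances. As a result, our method can learn from large collections of unlabelled images depicting the same object category. Our main contributions are two ways of leveraging cross-instance consistency: (i) progressive conditioning, a training strategy that gradually specializes the model from category to instances in a curriculum-learning fashion; and (ii) neighbor reconstruction, a loss enforcing consistency between instances with similar shape or texture. Also critical to the success of our method are: our structured autoencoding architecture, which decomposes an image into explicit shape, texture, pose, and background; an adapted formulation of differentiable rendering; and a new optimization scheme that alternates between 3D and pose learning. We compare our approach, UNICORN, both on the diverse synthetic ShapeNet dataset (the classical benchmark for methods requiring multiple views as supervision) and on standard real-image benchmarks (Pascal3D+ Car, CUB), for which most methods require known templates and silhouette annotations. We also showcase its applicability to more challenging real-world collections (CompCars, LSUN), where silhouettes are not available and images are not cropped around the object.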
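Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...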
Pixel-aligned Implicit Function (PIFu): We present the pixel-aligned implicit function (PIFu), which allows recovery of high-resolution 3D textured surfaces of clothed humans from a single input image (top row). Our approach can digitize intricate variations in clothing, such as wrinkled skirts and high heels, as well as complex hairstyles. The shape and textures can be fully recovered, including largely unseen regions such as the back of the subject. PIFu can also be naturally extended to multi-view input images (bottom row).
Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object. Our system, called Mesh R-CNN, augments Mask R-CNN with a mesh prediction branch that outputs meshes with varying topological structure by first predicting coarse voxel representations which are converted to meshes and refined with a graph convolution network operating over the mesh's vertices and edges. We validate our mesh prediction branch on ShapeNet, where we outperform prior work on single-image shape prediction. We then deploy our full Mesh R-CNN system on Pix3D, where we jointly detect objects and predict their 3D shapes. Project page: https://gkioxari.github.io/meshrcnn/.
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS), a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
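As a rough illustration of how per-pixel canonical coordinates and a depth map can be combined, the sketch below back-projects pixels to 3D with a pinhole camera model and then estimates only the metric scale between the canonical and observed point sets. It is a simplified sketch under stated assumptions, not the paper's pipeline: the function names and the intrinsics `fx, fy, cx, cy` are hypothetical, correspondences are assumed noise-free, and rotation and translation (which a full 6D pose estimate also needs) are omitted.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D camera-frame point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def spread(points):
    """Square root of the summed squared distances to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sum((p[i] - c[i]) ** 2 for i in range(3)) for p in points))


def estimate_scale(canonical_points, observed_points):
    """Scale s with observed ~ s * R * canonical + t.

    The ratio of spreads is unaffected by the rotation R and translation t,
    so it can be computed before either is known.
    """
    return spread(observed_points) / spread(canonical_points)
```

For example, `backproject(420, 240, 2.0, 500, 500, 320, 240)` places the pixel 0.4 m to the right of the optical axis at 2 m depth. The estimated scale is what lets dimensions be reported in metric units rather than in the normalized canonical space.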
The 6D object pose estimation problem has been studied extensively in computer vision and robotics. It has a wide range of applications, such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of deep learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we explore the available methods according to input modality, problem formulation, and whether they take a category-level or instance-level approach. As part of our discussion, we focus on how 6D object pose estimation can be used for understanding 3D scenes.
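Most indoor 3D scene reconstruction methods focus on recovering 3D geometry and scene layout. In this work, we go beyond this and propose PhotoScene, a framework that takes input image(s) of a scene along with approximately aligned CAD geometry (either reconstructed automatically or specified manually) and builds a photorealistic digital twin with high-quality materials and similar lighting. We model scene materials using procedural material graphs; such graphs represent photorealistic and resolution-independent materials. We optimize the parameters of these graphs and their texture scale and rotation, as well as the scene lighting, to best match the input image via a differentiable rendering layer. We evaluate our technique on object and layout reconstructions from ScanNet, SUN RGB-D, and stock photographs, and demonstrate that our method reconstructs high-quality, fully relightable 3D scenes that can be re-rendered under arbitrary viewpoints, zooms, and lighting.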
Figure 1: NeRF from one or few images. We present pixelNeRF, a learning framework that predicts a Neural Radiance Field (NeRF) representation from a single (top) or few posed images (bottom). PixelNeRF can be trained on a set of multi-view images, allowing it to generate plausible novel view synthesis from very few input images without test-time optimization (bottom left). In contrast, NeRF has no generalization capabilities and performs poorly when only three input views are available (bottom right).
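Learning deformable 3D objects from 2D images is often an ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability to objects "in the wild". A more natural way of establishing correspondences is to watch videos of objects moving around. In this paper, we present DOVE, a method that learns textured 3D models from monocular videos available online, without keypoint, viewpoint, or template shape supervision. By resolving symmetry-induced pose ambiguities and leveraging temporal correspondences in videos, the model automatically learns to factor out 3D shape, articulated pose, and texture from each individual RGB frame, and is ready for single-image inference at test time. In our experiments, we show that existing methods fail to learn sensible 3D shapes without additional keypoint or template supervision, whereas our method produces temporally consistent 3D models that can be animated and rendered from arbitrary viewpoints.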
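Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D dressed humans. However, such methods suffer heavily from the depth ambiguities and occlusions inherent to single-view input. In this paper, we tackle this problem by considering a small set of input views and investigating the best strategy for exploiting information from these views. We propose a data-driven end-to-end approach that reconstructs an implicit 3D representation of dressed humans from sparse camera views. Specifically, we introduce three key components: first, a spatially consistent reconstruction that uses a perspective camera model and allows arbitrary placement of the person in the input views; second, an attention-based fusion layer that aggregates visual information from several viewpoints; and third, a mechanism that encodes local 3D patterns under the multi-view context. In our experiments, we show that the proposed approach outperforms the state of the art on standard data, both quantitatively and qualitatively. To demonstrate the spatially consistent reconstruction, we apply our approach to dynamic scenes. In addition, we apply our method to real data acquired with a multi-camera platform and demonstrate that it can obtain results comparable to multi-view stereo with dramatically fewer views.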
We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn, additionally helps modeling accessories, hair, and loose clothing. Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, as a result of a single end-to-end model, trained semi-supervised, and with no additional postprocessing. We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view, in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
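We present a Transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways of representing its knowledge: as a global 3D grid of features and as an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible in terms of both memory and computation. We attach a DETR-style head on top of the 3D feature grid to detect the objects in the scene and predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single-stage and end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining multi-view stereo with RGB-D CAD alignment. We plan to release our source code.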
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
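The reprojection error minimized in object-level bundle adjustment can be sketched generically as follows. This is an illustrative sketch under stated assumptions rather than the paper's exact objective: it uses a plain pinhole camera, a sum of squared pixel residuals, and hypothetical names (`transform`, `project`, `reprojection_error`), and it leaves out the symmetry handling and robustness to outliers mentioned above.

```python
def transform(R, t, p):
    """Apply the rigid transform x = R p + t, with R given as three rows."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(points_obj, obj_pose, cam_pose, K, observed_px):
    """Sum of squared pixel residuals for one object seen in one view.

    obj_pose maps object coordinates to world coordinates; cam_pose maps
    world coordinates to camera coordinates; K is (fx, fy, cx, cy).
    """
    R_o, t_o = obj_pose
    R_c, t_c = cam_pose
    fx, fy, cx, cy = K
    err = 0.0
    for p, (u_obs, v_obs) in zip(points_obj, observed_px):
        p_cam = transform(R_c, t_c, transform(R_o, t_o, p))
        u, v = project(p_cam, fx, fy, cx, cy)
        err += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return err
```

A refinement step would sum this quantity over all objects and views and adjust the object and camera poses to reduce it, for instance with a nonlinear least-squares solver.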