我们提出了一种从有限重叠的图像中对场景进行平面表面重建的方法。此重构任务是具有挑战性的,因为它需要共同推理单个图像3D重建,图像之间的对应关系以及图像之间的相对摄像头姿势。过去的工作提出了基于优化的方法。我们引入了一种更简单的方法,即平面形式,该方法使用应用于3D感知平面令牌的变压器执行3D推理。我们的实验表明,我们的方法比以前的工作更有效,并且几项3D特定的设计决策对于成功的成功至关重要。
translated by 谷歌翻译
我们提出了一个简单的基线,用于直接估计两个图像之间的相对姿势(旋转和翻译,包括比例)。深度方法最近显示出很强的进步,但通常需要复杂或多阶段的体系结构。我们表明,可以将少数修改应用于视觉变压器(VIT),以使其计算接近八点算法。这种归纳偏见使一种简单的方法在多种环境中具有竞争力,通常在有限的数据制度中具有强劲的性能增长,从而实质上有所改善。
translated by 谷歌翻译
This paper studies the challenging two-view 3D reconstruction in a rigorous sparse-view configuration, which is suffering from insufficient correspondences in the input image pairs for camera pose estimation. We present a novel Neural One-PlanE RANSAC framework (termed NOPE-SAC in short) that exerts excellent capability to learn one-plane pose hypotheses from 3D plane correspondences. Building on the top of a siamese plane detection network, our NOPE-SAC first generates putative plane correspondences with a coarse initial pose. It then feeds the learned 3D plane parameters of correspondences into shared MLPs to estimate the one-plane camera pose hypotheses, which are subsequently reweighed in a RANSAC manner to obtain the final camera pose. Because the neural one-plane pose minimizes the number of plane correspondences for adaptive pose hypotheses generation, it enables stable pose voting and reliable pose refinement in a few plane correspondences for the sparse-view inputs. In the experiments, we demonstrate that our NOPE-SAC significantly improves the camera pose estimation for the two-view inputs with severe viewpoint changes, setting several new state-of-the-art performances on two challenging benchmarks, i.e., MatterPort3D and ScanNet, for sparse-view 3D reconstruction. The source code is released at https://github.com/IceTTTb/NopeSAC for reproducible research.
translated by 谷歌翻译
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed Cosy-Pose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage. 5
translated by 谷歌翻译
Camera pose estimation is a key step in standard 3D reconstruction pipelines that operate on a dense set of images of a single object or scene. However, methods for pose estimation often fail when only a few images are available because they rely on the ability to robustly identify and match visual features between image pairs. While these methods can work robustly with dense camera views, capturing a large set of images can be time-consuming or impractical. We propose SparsePose for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refine them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations. We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object.
translated by 谷歌翻译
我们提出了PlanarRecon-从摆姿势的单眼视频中对3D平面进行全球连贯检测和重建的新型框架。与以前的作品从单个图像中检测到2D的平面不同,PlanarRecon逐步检测每个视频片段中的3D平面,该片段由一组关键帧组成,由一组关键帧组成,使用神经网络的场景体积表示。基于学习的跟踪和融合模块旨在合并以前片段的平面以形成连贯的全球平面重建。这种设计使PlanarRecon可以在每个片段中的多个视图中整合观察结果,并在不同的信息中整合了时间信息,从而使场景抽象的准确且相干地重建具有低聚合物的几何形状。实验表明,所提出的方法在实时时可以在扫描仪数据集上实现最先进的性能。
translated by 谷歌翻译
While object reconstruction has made great strides in recent years, current methods typically require densely captured images and/or known camera poses, and generalize poorly to novel object categories. To step toward object reconstruction in the wild, this work explores reconstructing general real-world objects from a few images without known camera poses or object categories. The crux of our work is solving two fundamental 3D vision problems -- shape reconstruction and pose estimation -- in a unified approach. Our approach captures the synergies of these two problems: reliable camera pose estimation gives rise to accurate shape reconstruction, and the accurate reconstruction, in turn, induces robust correspondence between different views and facilitates pose estimation. Our method FORGE predicts 3D features from each view and leverages them in conjunction with the input images to establish cross-view correspondence for estimating relative camera poses. The 3D features are then transformed by the estimated poses into a shared space and are fused into a neural radiance field. The reconstruction results are rendered by volume rendering techniques, enabling us to train the model without 3D shape ground-truth. Our experiments show that FORGE reliably reconstructs objects from five views. Our pose estimation method outperforms existing ones by a large margin. The reconstruction results under predicted poses are comparable to the ones using ground-truth poses. The performance on novel testing categories matches the results on categories seen during training. Project page: https://ut-austin-rpl.github.io/FORGE/
translated by 谷歌翻译
我们为RGB视频提供了基于变压器的神经网络体系结构,用于多对象3D重建。它依赖于表示知识的两种替代方法:作为特征的全局3D网格和一系列特定的2D网格。我们通过专用双向注意机制在两者之间逐步交换信息。我们利用有关图像形成过程的知识,以显着稀疏注意力重量矩阵,从而使我们的体系结构在记忆和计算方面可行。我们在3D特征网格的顶部附上一个detr风格的头,以检测场景中的对象并预测其3D姿势和3D形状。与以前的方法相比,我们的体系结构是单阶段,端到端可训练,并且可以从整体上考虑来自多个视频帧的场景,而无需脆弱的跟踪步骤。我们在挑战性的SCAN2CAD数据集上评估了我们的方法,在该数据集中,我们的表现要优于RGB视频的3D对象姿势估算的最新最新方法; (2)将多视图立体声与RGB-D CAD对齐结合的强大替代方法。我们计划发布我们的源代码。
translated by 谷歌翻译
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over stateof-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
translated by 谷歌翻译
我们提出了一种称为DPODV2(密集姿势对象检测器)的三个阶段6 DOF对象检测方法,该方法依赖于致密的对应关系。我们将2D对象检测器与密集的对应关系网络和多视图姿势细化方法相结合,以估计完整的6 DOF姿势。与通常仅限于单眼RGB图像的其他深度学习方法不同,我们提出了一个统一的深度学习网络,允许使用不同的成像方式(RGB或DEPTH)。此外,我们提出了一种基于可区分渲染的新型姿势改进方法。主要概念是在多个视图中比较预测并渲染对应关系,以获得与所有视图中预测的对应关系一致的姿势。我们提出的方法对受控设置中的不同数据方式和培训数据类型进行了严格的评估。主要结论是,RGB在对应性估计中表现出色,而如果有良好的3D-3D对应关系,则深度有助于姿势精度。自然,他们的组合可以实现总体最佳性能。我们进行广泛的评估和消融研究,以分析和验证几个具有挑战性的数据集的结果。 DPODV2在所有这些方面都取得了出色的成果,同时仍然保持快速和可扩展性,独立于使用的数据模式和培训数据的类型
translated by 谷歌翻译
本文介绍了一种新型的多视图6 DOF对象姿势细化方法,重点是改进对合成数据训练的方法。它基于DPOD检测器,该检测器会在每个帧中产生密集的2D-3D对应关系。我们选择使用多个具有已知相机转换的帧,因为它允许通过可解释的ICP样损耗函数引入几何约束。损耗函数是通过可区分的渲染器实现的,并经过迭代进行了优化。我们还证明,仅根据合成数据训练的完整检测和完善管道可用于自动标记的真实数据。我们对linemod,caslusion,自制和YCB-V数据集执行定量评估,并与对合成和真实数据训练的最新方法相比,报告出色的性能。我们从经验上证明,我们的方法仅需要几个帧,并且可以在外部摄像机校准中关闭相机位置和噪音,从而使其实际用法更加容易且无处不在。
translated by 谷歌翻译
我们的方法从单个RGB-D观察中研究了以对象为中心的3D理解的复杂任务。由于这是一个不适的问题,因此现有的方法在3D形状和6D姿势和尺寸估计中都遭受了遮挡的复杂多对象方案的尺寸估计。我们提出了Shapo,这是一种联合多对象检测的方法,3D纹理重建,6D对象姿势和尺寸估计。 Shapo的关键是一条单杆管道,可回归形状,外观和构成潜在的代码以及每个对象实例的口罩,然后以稀疏到密集的方式进一步完善。首先学到了一种新颖的剖面形状和前景数据库,以将对象嵌入各自的形状和外观空间中。我们还提出了一个基于OCTREE的新颖的可区分优化步骤,使我们能够以分析的方式进一步改善对象形状,姿势和外观。我们新颖的联合隐式纹理对象表示使我们能够准确地识别和重建新颖的看不见的对象,而无需访问其3D网格。通过广泛的实验,我们表明我们的方法在模拟的室内场景上进行了训练,可以准确地回归现实世界中新颖物体的形状,外观和姿势,并以最小的微调。我们的方法显着超过了NOCS数据集上的所有基准,对于6D姿势估计,MAP的绝对改进为8%。项目页面:https://zubair-irshad.github.io/projects/shapo.html
translated by 谷歌翻译
最近的3D Vision进步已经受到包括几何诱导偏见的专业架构的推动。在本文中,我们使用域名可靠架构来解决3D重建,并将与模型的额外输入直接注入相同类型的电感偏差。这种方法使得可以应用现有的一般模型,例如感知的富域,而无需架构更改,同时保持定制模型的数据效率。特别是我们研究如何将摄像机,投影射线发射和末端几何形式作为模型输入进行编码,并在多个基准上展示竞争性的多视图深度估计性能。
translated by 谷歌翻译
我们描述了一种数据驱动的方法,用于指定任意对象的多个图像,以推断相机观点。该任务是经典几何管道(例如SFM和SLAM)的核心组成部分,也是当代神经方法(例如NERF)的至关重要的预处理要求,以对象重建和视图合成。与现有的对应驱动的方法相反,鉴于稀疏视图的表现不佳,我们提出了一种基于自上而下的预测方法来估计相机观点。我们的主要技术见解是使用基于能量的公式来表示相对摄像机旋转的分布,从而使我们能够明确表示由对象对称或视图引起的多个摄像机模式。利用这些相对预测,我们共同估计了来自多个图像的一致摄像机旋转集。我们表明,我们的方法优于最先进的SFM和SLAM方法,并且在可见和看不见的类别上都稀疏图像。此外,我们的概率方法显着优于直接回归相对姿势的表现,这表明对多模型建模对于相干关节重建很重要。我们证明,我们的系统可以是从多视图数据集中进行野外重建的垫脚石。可以在https://jasonyzhang.com/relpose上找到带有代码和视频的项目页面。
translated by 谷歌翻译
从极端视图图像中恢复相机的空间布局和场景的几何形状是计算机视觉的长期挑战。盛行的3D重建算法通常采用匹配范式的图像,并假定场景的一部分是可以在图像上进行的,当输入之间几乎没有重叠时的性能较差。相比之下,人类可以通过形状的先验知识将一个图像中的可见部分与另一个图像中相应的不可见组件相关联。受这个事实的启发,我们提出了一个名为虚拟通信(VC)的新颖概念。 VC是来自两个图像的一对像素,它们的相机射线在3D中相交。与经典的对应关系相似,VC符合异性几何形状;与经典的信件不同,VC不需要在视图上可以共同提供。因此,即使图像不重叠,也可以建立和利用VC。我们介绍了一种方法,可以在场景中找到基于人类的虚拟对应关系。我们展示了如何与经典捆绑捆绑调整无缝集成的风险投资,以恢复跨极视图的相机姿势。实验表明,在具有挑战性的情况下,我们的方法显着优于最先进的摄像头姿势估计方法,并且在传统的密集捕获的设置中是可比的。我们的方法还释放了多个下游任务的潜力,例如在极端视图场景中从多视图立体声和新型视图合成中进行场景重建。
translated by 谷歌翻译
We present ObjectMatch, a semantic and object-centric camera pose estimation for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences of overlapping regions between frames; however, they cannot align camera frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another frame, we can provide additional pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine in our energy formulation with state-of-the-art keypoint matching solved with a joint Gauss-Newton optimization. In a pairwise setting, our method improves registration recall of state-of-the-art feature matching from 77% to 87% overall and from 21% to 52% in pairs with 10% or less inter-frame overlap. In registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes.
translated by 谷歌翻译
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
translated by 谷歌翻译
Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
translated by 谷歌翻译
现代计算机视觉已超越了互联网照片集的领域,并进入了物理世界,通过非结构化的环境引导配备摄像头的机器人和自动驾驶汽车。为了使这些体现的代理与现实世界对象相互作用,相机越来越多地用作深度传感器,重建了各种下游推理任务的环境。机器学习辅助的深度感知或深度估计会预测图像中每个像素的距离。尽管已经在深入估算中取得了令人印象深刻的进步,但仍然存在重大挑战:(1)地面真相深度标签很难大规模收集,(2)通常认为相机信息是已知的,但通常是不可靠的,并且(3)限制性摄像机假设很常见,即使在实践中使用了各种各样的相机类型和镜头。在本论文中,我们专注于放松这些假设,并描述将相机变成真正通用深度传感器的最终目标的贡献。
translated by 谷歌翻译
We introduce ViewNeRF, a Neural Radiance Field-based viewpoint estimation method that learns to predict category-level viewpoints directly from images during training. While NeRF is usually trained with ground-truth camera poses, multiple extensions have been proposed to reduce the need for this expensive supervision. Nonetheless, most of these methods still struggle in complex settings with large camera movements, and are restricted to single scenes, i.e. they cannot be trained on a collection of scenes depicting the same object category. To address these issues, our method uses an analysis by synthesis approach, combining a conditional NeRF with a viewpoint predictor and a scene encoder in order to produce self-supervised reconstructions for whole object categories. Rather than focusing on high fidelity reconstruction, we target efficient and accurate viewpoint prediction in complex scenarios, e.g. 360{\deg} rotation on real data. Our model shows competitive results on synthetic and real datasets, both for single scenes and multi-instance collections.
translated by 谷歌翻译