We present ObjectMatch, a semantic and object-centric camera pose estimator for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences between overlapping regions of frames; consequently, they cannot align frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another, we can still provide pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine with state-of-the-art keypoint matches in an energy formulation solved by joint Gauss-Newton optimization. In a pairwise setting, our method improves the registration recall of state-of-the-art feature matching from 77% to 87% overall, and from 21% to 52% on pairs with 10% or less inter-frame overlap. When registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low-frame-rate scenarios, achieving more than a 35% reduction in trajectory error in multiple scenes.
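To make the role of these indirect constraints concrete, a schematic form of such a joint energy can be written as follows (the notation below is illustrative, not the paper's):

```latex
% Frames a, b with poses T_a, T_b; object j with pose T^o_j and scale s_j.
% The first sum covers direct keypoint matches (p, q); the second covers
% pixels x whose predicted canonical object coordinates are c. \rho is a
% robust kernel and \lambda balances the two kinds of constraints.
E(\{T_i\}, \{T^o_j\}) =
    \sum_{(p,\,q)} \rho\big( \lVert T_a\,p - T_b\,q \rVert^2 \big)
  + \lambda \sum_{(x,\,c)} \rho\big( \lVert T_a\,x - T^o_j\,(s_j\,c) \rVert^2 \big)
```

Because two frames with no shared surface can still both anchor terms to the same object pose T^o_j, they become indirectly constrained even without direct keypoint matches.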
We present ROCA, a novel end-to-end approach that retrieves and aligns 3D CAD models from a shape database to a single input image. This enables 3D perception of an observed scene from a 2D RGB observation, characterized as a lightweight, compact, and clean CAD representation. At the core of our approach is a differentiable alignment optimization based on dense 2D-3D object correspondences and Procrustes alignment. ROCA can thus provide robust CAD alignments while simultaneously informing CAD retrieval by leveraging the 2D-3D correspondences to learn geometrically similar CAD models. Experiments on challenging real-world imagery from ScanNet show that ROCA significantly improves on the state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS)-a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
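Combining predicted NOCS maps with depth amounts to a similarity (7-DoF) alignment between the canonical coordinates and the back-projected depth points; the Umeyama algorithm is the standard closed-form solver for this step (a minimal numpy sketch under that assumption, typically wrapped in RANSAC for robustness):

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Closed-form similarity transform (s, R, t) with dst[i] ~= s * R @ src[i] + t
    (Umeyama, 1991). Here `src` would hold predicted NOCS coordinates and
    `dst` the corresponding back-projected depth points, both (N, 3).
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)              # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = src_c.var(axis=0).sum()             # mean squared deviation of src
    s = np.trace(np.diag(D) @ S) / var_src        # isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered scale s gives the metric object dimensions, while (R, t) give the 6D pose.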
Estimating the pose of a moving camera from monocular video is a challenging problem, especially due to the presence of moving objects in dynamic environments, where the performance of existing camera pose estimation methods is susceptible to geometrically inconsistent pixels. To tackle this challenge, we present a robust dense indirect structure-from-motion method for videos that is based on dense correspondences initialized from pairwise optical flow. Our key idea is to optimize long-range video correspondences as dense point trajectories, and to use them to learn robust motion segmentation. A novel neural network architecture is proposed to process irregular point trajectory data. Camera poses are then estimated and optimized with global bundle adjustment over the portion of the long-range point trajectories classified as static. Experiments on the MPI Sintel dataset show that our system produces significantly more accurate camera trajectories than existing state-of-the-art methods. In addition, our method retains reasonable accuracy of camera poses on fully static scenes, consistently outperforming strong state-of-the-art dense correspondence methods based on end-to-end deep learning, which demonstrates the potential of dense indirect methods based on optical flow and point trajectories. Since the point trajectory representation is general, we further present results and comparisons on in-the-wild monocular videos with complex motions of dynamic objects. Code is available at https://github.com/bytedance/particle-sfm.
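As a rough sketch of how pairwise optical flow can be chained into the long-range point trajectories described above (illustrative only; the paper's trajectory construction includes checks such as forward-backward consistency that are omitted here):

```python
import numpy as np

def chain_flow_trajectories(flows, stride=8):
    """Chain consecutive-frame dense optical flow into long-range point
    trajectories. `flows` is a list of (H, W, 2) forward flow fields; a
    regular grid of points (spacing `stride`) is advected frame to frame,
    and tracks that leave the image are dropped.
    """
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[0:H:stride, 0:W:stride]
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    alive = np.ones(len(pts), dtype=bool)
    traj = [pts.copy()]
    for flow in flows:
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, H - 1)
        pts = pts + flow[yi, xi]                       # advect by nearest flow
        inside = ((pts[:, 0] >= 0) & (pts[:, 0] < W) &
                  (pts[:, 1] >= 0) & (pts[:, 1] < H))
        alive &= inside
        traj.append(pts.copy())
    return np.stack(traj, axis=1)[alive]               # (N_alive, T+1, 2)
```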
We propose a novel approach for joint 3D multi-object tracking and reconstruction from RGB-D sequences in indoor environments. To this end, we detect and reconstruct objects in each frame while predicting dense correspondence maps into a normalized object space. We leverage these correspondences to inform a graph neural network that solves for the optimal, temporally consistent 7-DoF pose trajectories of all objects. The novelty of our method is twofold: first, we propose a new graph-based approach for differentiable pose estimation over time, learning optimal pose trajectories; second, we present a joint formulation of reconstruction and pose estimation along the time axis that enables robust and geometrically consistent multi-object tracking. To validate our approach, we introduce a new synthetic dataset comprising 2381 unique indoor sequences with a total of 60k rendered RGB-D images for multi-object tracking, with moving objects and camera positions derived from the synthetic 3D-FRONT dataset. We demonstrate that our method improves the accumulated MOTA score over all test sequences by 24.8% compared to existing state-of-the-art methods. In several ablations on synthetic and real-world sequences, we show that our graph-based, fully end-to-end learned approach significantly improves tracking performance.
Erroneous feature matches have a severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs compared to SuperGlue by 6.7% on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.8% compared to SuperGlue.
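The differentiable pose estimation that the weighted matches feed into can be realized with a closed-form weighted rigid alignment, so gradients from the pose loss reach the confidence weights; here is a minimal PyTorch sketch of that idea (the function name and exact solver are assumptions, not the paper's code):

```python
import torch

def weighted_kabsch(P, Q, w):
    """Differentiable rigid alignment minimizing sum_i w_i * ||R @ P[i] + t - Q[i]||^2.

    P, Q are (N, 3) matched 3D points and w is (N,) confidence weights from
    the matcher. Because the solution is a closed-form SVD, gradients of a
    pose loss flow back into w -- the mechanism by which end-to-end training
    can learn to down-weight outliers.
    """
    w = (w / w.sum()).unsqueeze(-1)                   # (N, 1) normalized weights
    mu_p, mu_q = (w * P).sum(0), (w * Q).sum(0)       # weighted centroids
    Pc, Qc = P - mu_p, Q - mu_q
    H = Pc.T @ (w * Qc)                               # weighted cross-covariance
    U, _, Vh = torch.linalg.svd(H)
    V, Ut = Vh.transpose(-2, -1), U.transpose(-2, -1)
    d = torch.sign(torch.det(V @ Ut))                 # reflection guard
    V = torch.cat([V[:, :2], V[:, 2:] * d], dim=1)
    R = V @ Ut
    t = mu_q - R @ mu_p
    return R, t
```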
We propose a three-stage 6-DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence network and a multi-view pose refinement method to estimate a full 6-DoF pose. Unlike other deep learning methods that are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities (RGB or depth) to be used. Moreover, we propose a novel pose refinement method based on differentiable rendering. The main idea is to compare predicted and rendered correspondences in multiple views to obtain a pose consistent with the predicted correspondences in all views. Our proposed method is rigorously evaluated on different data modalities and types of training data in controlled setups. The main conclusions are that RGB excels at correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available; naturally, their combination achieves the best overall performance. We conduct extensive evaluations and ablation studies to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
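The iterative matching loop can be summarized in a few lines; the sketch below is a hedged paraphrase in which `net` and `render` are assumed placeholder callables rather than the authors' API:

```python
import numpy as np

def apply_delta(pose, delta_R, delta_t):
    """Compose a predicted relative update onto a 4x4 pose matrix, with the
    rotation and translation applied as disentangled components."""
    out = pose.copy()
    out[:3, :3] = delta_R @ pose[:3, :3]
    out[:3, 3] = pose[:3, 3] + delta_t
    return out

def refine_pose(net, render, observed, pose, n_iters=4):
    """Render-and-compare refinement loop. `render(pose)` is assumed to
    produce an image of the object under the current estimate, and
    `net(observed, rendered)` to regress a relative (delta_R, delta_t);
    both stand in for whatever renderer/network is at hand."""
    for _ in range(n_iters):
        delta_R, delta_t = net(observed, render(pose))
        pose = apply_delta(pose, delta_R, delta_t)
    return pose
```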
In this paper, we propose to go beyond the established approach to vision-based localization that relies on visual descriptor matching between a query image and a 3D point cloud. While matching keypoints via visual descriptors makes localization highly accurate, it entails significant storage requirements, raises privacy concerns, and requires long-term updates to the descriptors. To gracefully address these practical challenges of large-scale localization, we present GoMatch, an alternative to visual-based matching that relies solely on geometric information for matching image keypoints to maps, represented as sets of bearing vectors. Our novel bearing-vector representation of 3D points significantly alleviates the cross-modality challenge in geometry-based matching that prevented prior work from tackling localization in realistic environments. With additional careful architectural design, GoMatch improves over prior geometry-based matching work with reductions of (10.67m, 95.7°) and (1.43m, 34.7°) in average median pose error on Cambridge Landmarks and 7-Scenes, while requiring as little as 1.5/1.7% of the storage capacity of the best visual-based matching methods. This confirms its potential and feasibility for real-world localization, and opens the door for future efforts toward city-scale visual localization methods that do not require storing visual descriptors.
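A bearing vector is simply a unit-normalized back-projected camera ray, which is what makes the map representation descriptor-free; a minimal sketch (exact preprocessing in the paper may differ):

```python
import numpy as np

def bearing_vectors(keypoints_px, K):
    """Back-project (N, 2) pixel keypoints through the 3x3 intrinsics K and
    normalize to unit length, yielding the bearing vectors that map points
    are matched against."""
    uv1 = np.concatenate([keypoints_px, np.ones((len(keypoints_px), 1))], axis=1)
    rays = (np.linalg.inv(K) @ uv1.T).T               # rays in camera coordinates
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)
```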
Camera pose estimation is a key step in standard 3D reconstruction pipelines that operate on a dense set of images of a single object or scene. However, methods for pose estimation often fail when only a few images are available because they rely on the ability to robustly identify and match visual features between image pairs. While these methods can work robustly with dense camera views, capturing a large set of images can be time-consuming or impractical. We propose SparsePose for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refine them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations. We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object.
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
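The differentiable optimal transport step is commonly implemented with Sinkhorn iterations over a score matrix augmented by a "dustbin" for unmatched points; the following is a simplified sketch of that mechanism (SuperGlue's actual layer uses a learned dustbin score and a normalization tailored to partial assignment, both replaced by zeros and plain Sinkhorn here):

```python
import torch

def sinkhorn_matching(scores, n_iters=20):
    """Soft partial assignment from an (M, N) match-score matrix, using
    Sinkhorn iterations in log-space on a matrix augmented with a dustbin
    row and column that absorb unmatched points.
    """
    M, N = scores.shape
    bin_score = scores.new_zeros(())                   # stand-in for the learned score
    aug = torch.cat(
        [torch.cat([scores, bin_score.expand(M, 1)], dim=1),
         bin_score.expand(1, N + 1)], dim=0)           # (M+1, N+1)
    log_P = aug
    for _ in range(n_iters):                           # alternate row/col normalization
        log_P = log_P - torch.logsumexp(log_P, dim=1, keepdim=True)
        log_P = log_P - torch.logsumexp(log_P, dim=0, keepdim=True)
    return log_P[:M, :N].exp()                         # soft correspondence weights
```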
Recent advances in 3D perception have shown impressive progress in understanding the geometric structure of 3D shapes and even scenes. Inspired by these advances in geometric understanding, we aim to imbue image-based perception with representations learned under geometric constraints. We introduce an approach to learn view-invariant, geometry-aware representations for network pre-training, based on multi-view RGB-D data, which can then be effectively transferred to downstream 2D tasks. We propose to employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into the learned 2D representations. This results not only in improvements over 2D-only representation learning on the image-based tasks of semantic segmentation, instance segmentation, and object detection, but moreover provides significant improvements in low-data regimes. We show a significant gain of 6.0% on semantic segmentation with full data, and of 11.9% over baselines with 20% of the data, on ScanNet.
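The contrastive objective over geometrically linked pixels can be instantiated as a standard InfoNCE loss; a generic PyTorch sketch under that assumption (not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def pixel_infonce(feat_a, feat_b, temperature=0.07):
    """InfoNCE over pixel features from two views. Row i of `feat_a` and
    `feat_b` are assumed to be features of pixels linked by multi-view
    geometry (i.e., observing the same 3D surface point); every other row
    serves as a negative. Shapes: (N, C)."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature                   # (N, N) cosine similarities
    targets = torch.arange(len(a), device=logits.device)
    return F.cross_entropy(logits, targets)            # positives on the diagonal
```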
Video provides us with the spatio-temporal consistency needed for visual learning. Recent approaches have utilized this signal to learn correspondence estimation from close-by frame pairs. However, by only relying on close-by frame pairs, those approaches miss out on the richer long-range consistency between distant overlapping frames. To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences. Our approach combines pairwise correspondence estimation and registration with a novel SE(3) transformation synchronization algorithm. Our key insight is that self-supervised multiview registration allows us to obtain correspondences over longer time frames, increasing both the diversity and difficulty of sampled pairs. We evaluate our approach on indoor scenes for correspondence estimation and RGB-D point cloud registration and find that we perform on par with supervised approaches.
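One common way to realize the rotation part of SE(3) transformation synchronization is spectral synchronization over the pairwise estimates; a numpy sketch of that classic construction (the paper's algorithm may differ in details, and the result is recovered only up to a global rotation):

```python
import numpy as np

def spectral_rotation_sync(rel_R, n):
    """Recover absolute rotations R_0..R_{n-1} (up to a global rotation)
    from pairwise estimates via spectral relaxation. `rel_R` maps index
    pairs (i, j) to measured R_ij with R_j ~= R_ij @ R_i.
    """
    M = np.zeros((3 * n, 3 * n))
    for i in range(n):                                 # identity on the diagonal
        M[3*i:3*i+3, 3*i:3*i+3] = np.eye(3)
    for (i, j), Rij in rel_R.items():                  # symmetric off-diagonal blocks
        M[3*j:3*j+3, 3*i:3*i+3] = Rij
        M[3*i:3*i+3, 3*j:3*j+3] = Rij.T
    _, vecs = np.linalg.eigh(M)
    V = vecs[:, -3:]                                   # top-3 eigenvector block
    rotations = []
    for i in range(n):                                 # project each block onto SO(3)
        U, _, Vt = np.linalg.svd(V[3*i:3*i+3])
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        rotations.append(U @ D @ Vt)
    return rotations
```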
In this work, we explore the use of objects in simultaneous localization and mapping in unseen worlds and propose an object-aided system, OA-SLAM. More precisely, we show that, compared to low-level points, the major benefit of objects lies in their higher-level semantic and discriminative power. Conversely, points have better spatial localization accuracy than the generic coarse models (cuboids or ellipsoids) used to represent objects. We show that combining points and objects is of great interest for addressing the problem of camera pose recovery. Our main contributions are: (1) we improve the relocalization ability of a SLAM system using high-level object landmarks; (2) we build an automatic system capable of identifying, tracking, and reconstructing objects with 3D ellipsoids; (3) we show that object-based localization can be used to reinitialize or resume camera tracking. Our fully automatic system allows on-the-fly object mapping and enhanced pose-tracking recovery, which we believe can greatly benefit the AR community. Our experiments show that the camera can be relocalized from viewpoints where classical methods fail. We demonstrate that this localization allows a SLAM system to keep working despite tracking losses, which can happen frequently with inexperienced users. Our code and test data are released at gitlab.inria.fr/tangram/oa-slam.
Combining simultaneous localization and mapping (SLAM) estimation with dynamic scene modeling can greatly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene, without any prior knowledge of the objects' shapes or geometric models. The proposed approach identifies and tracks dynamic objects and static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot trajectory and the full SE(3) motions of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from the objects' SE(3) motions, an important capability for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available.
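Extracting a linear velocity from an object's estimated SE(3) motion reduces to tracking the displacement of a representative point over the frame interval; a simplified sketch of that reading (variable names are illustrative, not the paper's formulation):

```python
import numpy as np

def object_linear_velocity(H, center, dt):
    """Linear velocity of an object from its frame-to-frame SE(3) motion
    `H` (4x4, expressed in the world frame), read off as the displacement
    of the object center `center` (3,) over the frame interval `dt`."""
    c = np.append(center, 1.0)                         # homogeneous coordinates
    return (H @ c - c)[:3] / dt                        # meters per second
```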
Indirect methods for visual SLAM are popular due to their robustness to environmental changes. ORB-SLAM2 is a benchmark method in this domain; however, it computes descriptors that are never reused unless a frame is selected as a keyframe. We present FastORB-SLAM, which is lightweight and efficient because it tracks keypoints between adjacent frames without computing descriptors. To this end, we propose a two-stage, coarse-to-fine, descriptor-independent keypoint matching method based on sparse optical flow. In the first stage, we predict initial keypoint correspondences with a simple but effective motion model, and then robustly establish the correspondences via pyramid-based sparse optical flow tracking. In the second stage, we leverage motion-smoothness and epipolar-geometry constraints to refine the correspondences. In particular, our method computes descriptors only for keyframes. We test FastORB-SLAM on the TUM and ICL-NUIM RGB-D datasets and compare its accuracy and efficiency with nine existing RGB-D SLAM methods. Qualitative and quantitative results show that our method achieves state-of-the-art accuracy and runs about twice as fast as ORB-SLAM2.
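The two-stage matching scheme maps naturally onto standard OpenCV primitives; the sketch below (an assumption-laden paraphrase, not the authors' code) seeds pyramidal LK tracking with a motion-model prediction and then applies an epipolar RANSAC check:

```python
import cv2
import numpy as np

def coarse_to_fine_track(prev_img, cur_img, prev_pts, pred_disp):
    """Two-stage descriptor-free keypoint matching: (1) seed pyramidal LK
    optical flow with positions predicted by a motion model, (2) refine by
    rejecting matches that violate epipolar geometry. `prev_pts` is (N, 2)
    and `pred_disp` is the motion-model displacement per keypoint."""
    p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    guess = (prev_pts + pred_disp).reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, p0, guess, winSize=(21, 21), maxLevel=3,
        flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    ok = status.ravel() == 1
    p0, p1 = p0[ok], p1[ok]
    if len(p0) >= 8:  # stage two: epipolar consistency via RANSAC
        _, mask = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.999)
        if mask is not None:
            keep = mask.ravel() == 1
            p0, p1 = p0[keep], p1[keep]
    return p0.reshape(-1, 2), p1.reshape(-1, 2)
```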
We present a dataset of 998 3D models of everyday tabletop objects along with 847,000 real-world RGB and depth images of them. Accurate camera-pose and object-pose annotations for each image are produced in a semi-automated fashion, to facilitate using the dataset for a variety of 3D applications such as shape reconstruction, object pose estimation, and shape retrieval. We focus on 3D reconstruction, owing to the lack of an appropriate real-world benchmark for the task, and demonstrate that our dataset can fill that gap. The entire annotated dataset, along with the source code of the annotation tools and evaluation baselines, is available at http://www.ocrtoc.org/3d-reconstruction.html.
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
6D object pose estimation has been extensively studied in the fields of computer vision and robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of deep learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we survey the available methods based on input modality, problem formulation, and whether the approach is category-level or instance-level. As part of our discussion, we focus on how 6D object pose estimation can be used for understanding 3D scenes.
Although category-level 9-DoF object pose estimation has emerged recently, prior correspondence-based or direct-regression methods are both limited in accuracy by large intra-category variations in object shape, color, and so on. In contrast, this work presents CATRE, a category-level object pose and size refiner that iteratively enhances pose estimates from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts the relative transformation between the initial pose and the ground truth by aligning the partially observed point cloud with an abstract shape prior. Specifically, we propose a novel disentangled architecture that accounts for the inherent differences between rotation and translation/size estimation. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on the REAL275, CAMERA25, and LM benchmarks, running at up to ~85.32 Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen categories. Code and trained models are available.