We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
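The object-level bundle adjustment described here minimizes the reprojection error of object model points over all cameras and objects. Below is a minimal sketch of such a residual function, assuming known intrinsics `K` and rotation-translation pairs for the camera and object poses; all names are illustrative, not taken from the CosyPose codebase.

```python
import numpy as np

def project(K, R_wc, t_wc, X_w):
    """Project world points X_w (N,3) into a camera with world-to-camera pose (R_wc, t_wc)."""
    X_c = X_w @ R_wc.T + t_wc          # world -> camera frame
    x = X_c @ K.T                      # apply intrinsics
    return x[:, :2] / x[:, 2:3]        # perspective divide -> pixel coordinates (N,2)

def reprojection_residuals(K, cam_poses, obj_poses, model_points, detections):
    """Residuals of an object-level bundle adjustment.

    cam_poses[v]   = (R_wc, t_wc)  world -> camera v
    obj_poses[o]   = (R_ow, t_ow)  object o -> world
    detections     = list of (v, o, uv) with uv the (M,2) points observed
                     for object o in view v (e.g. projected model keypoints).
    """
    residuals = []
    for v, o, uv in detections:
        R_ow, t_ow = obj_poses[o]
        X_w = model_points[o] @ R_ow.T + t_ow      # object model -> world
        R_wc, t_wc = cam_poses[v]
        uv_hat = project(K, R_wc, t_wc, X_w)
        residuals.append((uv_hat - uv).ravel())
    return np.concatenate(residuals)
```

In practice such residuals would be fed to a robust non-linear least-squares solver (e.g. `scipy.optimize.least_squares`) that updates camera and object poses jointly; the symmetry handling and robust kernels the paper relies on are omitted from this sketch.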
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
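The render-and-compare refinement described above boils down to a simple loop: render the object at the current pose estimate, let a network predict a relative correction from the (observed, rendered) image pair, and compose that correction with the current estimate. A minimal sketch follows, with `render` and `refiner_net` as hypothetical stand-ins for a renderer and the trained matching network, not DeepIM's actual API; the composition is simplified to a plain 4x4 matrix product, whereas the paper uses a disentangled rotation/translation update.

```python
import numpy as np

def refine_pose(observed_rgb, cad_model, pose_init, render, refiner_net, n_iters=4):
    """Iterative render-and-compare pose refinement (schematic).

    pose_init : 4x4 object-to-camera transform.
    render(cad_model, pose)     -> synthetic image of the object at the given pose.
    refiner_net(obs, rendered)  -> 4x4 relative transform correcting the pose.
    """
    pose = pose_init.copy()
    for _ in range(n_iters):
        rendered = render(cad_model, pose)
        delta = refiner_net(observed_rgb, rendered)  # predicted relative pose
        pose = delta @ pose                          # left-compose the update
    return pose
```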
We present a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate the full 6 DoF pose. Unlike other deep learning methods that are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities (RGB or depth) to be used. Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views in order to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in controlled settings. The main conclusions are that RGB excels at correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We conduct an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.
We propose a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for both symmetric and asymmetric objects. To the best of our knowledge, our system is among the first to leverage camera pose information from SLAM as prior knowledge for tracking keypoints on symmetric objects, ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict a Gaussian covariance for each keypoint that captures the true error of the prediction, and therefore serves not only as a weight for the residuals in the system's optimization problem, but also as a means of detecting harmful statistical outliers without requiring a manually chosen threshold. Experiments show that our method achieves competitive performance with the state of the art in 6DoF object pose estimation at real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.
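The predicted per-keypoint covariance plays two roles in such a system: it whitens (weights) the residual in the optimization, and it gives a principled chi-square test for rejecting outlier measurements. Below is a minimal numpy/scipy sketch of both uses for 2D keypoint residuals; it is illustrative only, not the SUO-SLAM code.

```python
import numpy as np
from scipy.stats import chi2

def whitened_residual(r, cov):
    """Whiten a 2D keypoint residual r with its predicted covariance.

    The whitened residual L^-1 r has unit covariance, so its squared norm is
    the Mahalanobis distance used both as the optimization cost and as the
    outlier test statistic.
    """
    L = np.linalg.cholesky(cov)          # cov = L @ L.T
    return np.linalg.solve(L, r)

def is_inlier(r, cov, confidence=0.95):
    """Chi-square gating: keep the measurement if its Mahalanobis distance
    stays below the chi-square quantile for 2 degrees of freedom."""
    w = whitened_residual(r, cov)
    return float(w @ w) < chi2.ppf(confidence, df=2)

# Example: a 3-pixel error with a confident (small) covariance is rejected,
# while the same error with a large predicted covariance is kept.
r = np.array([3.0, 0.0])
print(is_inlier(r, np.eye(2) * 0.5))   # False
print(is_inlier(r, np.eye(2) * 5.0))   # True
```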
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
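The translation parameterization described above (a 2D object center plus its distance from the camera) can be turned back into a metric 3D translation by back-projecting the center through the camera intrinsics. A small worked sketch, assuming a pinhole camera matrix `K`; the names and numbers are illustrative.

```python
import numpy as np

def translation_from_center(K, center_uv, depth_z):
    """Recover the 3D translation from a predicted 2D object center and its
    predicted distance along the optical axis: T = Z * K^-1 [u, v, 1]^T."""
    uv1 = np.array([center_uv[0], center_uv[1], 1.0])
    return depth_z * (np.linalg.inv(K) @ uv1)

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(translation_from_center(K, (350.0, 260.0), 0.9))
# -> [0.045, 0.03, 0.9]: 30 px right of and 20 px below the principal point
#    at 0.9 m gives x = 0.9 * 30/600 = 0.045 m and y = 0.9 * 20/600 = 0.03 m.
```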
This paper introduces a novel multi-view 6 DoF object pose refinement method, with a focus on improving methods trained on synthetic data. It is based on the DPOD detector, which produces dense 2D-3D correspondences in each frame. We choose to use multiple frames with known camera transformations, as this allows geometric constraints to be introduced through an interpretable ICP-like loss function. The loss function is implemented with a differentiable renderer and optimized iteratively. We also demonstrate that a full detection and refinement pipeline trained solely on synthetic data can be used to automatically label real data. We perform a quantitative evaluation on the LineMOD, Occlusion, HomebrewedDB, and YCB-V datasets and report excellent performance compared to state-of-the-art methods trained on synthetic and real data. We empirically demonstrate that our method requires only a few frames and is robust to close camera placements and to noise in the extrinsic camera calibration, making its practical use easier and more ubiquitous.
Estimating the 6D pose of objects is an essential computer vision task. However, most conventional methods rely on camera data from a single perspective and therefore suffer from occlusions. We overcome this problem with our novel multi-view 6D pose estimation method called MV6D, which accurately predicts the 6D poses of all objects in a cluttered scene from RGB-D images captured from multiple perspectives. We base our method on the PVN3D network, which uses a single RGB-D image to predict keypoints of the target objects. We extend this approach by using a combined point cloud from multiple views and fusing the images from each view with a DenseFusion layer. In contrast to current multi-view detection networks such as CosyPose, our MV6D learns the fusion of multiple perspectives in an end-to-end manner and requires neither multiple prediction stages nor subsequent fine-tuning of the predictions. Furthermore, we present three novel photorealistic datasets of cluttered scenes with heavy occlusions. All of them contain RGB-D images from multiple perspectives along with annotations for semantic segmentation and 6D pose estimation. MV6D significantly outperforms the state of the art in multi-view 6D pose estimation even when the camera poses are known only inaccurately. Furthermore, we show that our method is robust to dynamic camera setups and that its accuracy increases incrementally with an increasing number of perspectives.
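Combining depth maps from several calibrated views into one point cloud, as described above, amounts to back-projecting each depth image through its intrinsics and mapping the result into a common world frame with the known camera poses. A minimal sketch assuming pinhole intrinsics and camera-to-world poses; this is a generic construction, not the MV6D implementation.

```python
import numpy as np

def backproject(depth, K):
    """Turn a depth image (H,W) in meters into an (N,3) point cloud in the
    camera frame, keeping only pixels with valid (positive) depth."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)[valid]

def fuse_views(depths, Ks, cam_to_world_poses):
    """Merge per-view point clouds into a single world-frame cloud."""
    clouds = []
    for depth, K, T_cw in zip(depths, Ks, cam_to_world_poses):
        pts_c = backproject(depth, K)
        pts_w = pts_c @ T_cw[:3, :3].T + T_cw[:3, 3]   # camera -> world
        clouds.append(pts_w)
    return np.concatenate(clouds, axis=0)
```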
We introduce a novel method for 3D object detection and pose estimation from color images only. We first use segmentation to detect the objects of interest in 2D even in presence of partial occlusions and cluttered background. By contrast with recent patch-based methods, we rely on a "holistic" approach: We apply to the detected objects a Convolutional Neural Network (CNN) trained to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes. This, however, is not sufficient for handling objects from the recent T-LESS dataset: These objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging. We solve this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it. We also use an optional additional step that refines the predicted poses. We improve the state-of-the-art on the LINEMOD dataset from 73.7% [2] to 89.3% of correctly registered RGB frames. We are also the first to report results on the Occlusion dataset [1] using color images only. We obtain 54% of frames passing the Pose 6D criterion on average on several sequences of the T-LESS dataset, compared to the 67% of the state-of-the-art [10] on the same sequences which uses both color and depth. The full approach is also scalable, as a single network can be trained for multiple objects simultaneously.
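Once a network predicts the 2D projections of the eight 3D bounding-box corners, the 6D pose follows from a standard PnP solve against the known corner coordinates in the object frame. A minimal sketch with OpenCV; the corner ordering and the surrounding pipeline are assumptions, not the paper's code.

```python
import numpy as np
import cv2

def pose_from_bbox_corners(corners_3d, corners_2d, K):
    """Solve for the object pose from the 3D bounding-box corners (8,3) in the
    object frame and their predicted 2D projections (8,2) in the image."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> 3x3 matrix
    return R, tvec.reshape(3)
```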
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [11] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster (50 fps on a Titan X (Pascal) GPU) and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [28,29] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent approaches [26] when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of these methods, but at 10 fps or less, they are much slower than our method.
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. Moving from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully connected neural network to predict: (i) the corresponding 3D object coordinate, and (ii) the signed distance to the object surface, with the first defined only for query points in the surface vicinity. We refer to the mapping realized by this network as a Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown to be superior especially in challenging cases with occlusion. The project website is at: linhuang17.github.io/ncf.
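The final pose in this pipeline comes from rigidly aligning predicted 3D-3D correspondences. The core of Kabsch-RANSAC is the Kabsch (orthogonal Procrustes) step, which recovers the least-squares rotation and translation between two matched point sets via an SVD. A small self-contained sketch of that step follows; the RANSAC hypothesis sampling around it is omitted.

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) minimizing sum ||R @ P_i + t - Q_i||^2 over
    matched point sets P, Q of shape (N, 3)."""
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
    t = Qc - R @ Pc
    return R, t

# Sanity check: recover a known rotation and translation from noiseless matches.
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
P = np.random.default_rng(0).normal(size=(50, 3))
Q = P @ R_true.T + t_true
R_est, t_est = kabsch(P, Q)
assert np.allclose(R_est, R_true, atol=1e-6) and np.allclose(t_est, t_true, atol=1e-6)
```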
In this work, we explore the use of objects in Simultaneous Localization and Mapping in unseen worlds and propose an object-aided system (OA-SLAM). More precisely, we show that, compared to low-level points, the major benefit of objects lies in their higher-level semantic and discriminative power. Conversely, points have better spatial localization accuracy than the generic coarse models used to represent objects (cuboid or ellipsoid). We show that combining points and objects is of great interest for addressing the problem of camera pose recovery. Our main contributions are: (1) we improve the relocalization ability of a SLAM system using high-level object landmarks; (2) we build an automatic system capable of identifying, tracking, and reconstructing objects with 3D ellipsoids; (3) we show that object-based localization can be used to reinitialize or resume camera tracking. Our fully automatic system allows on-the-fly object mapping and enhanced pose tracking recovery, which we believe could greatly benefit the AR community. Our experiments show that the camera can be relocalized from viewpoints where classical methods fail. We demonstrate that this localization allows a SLAM system to continue working despite tracking losses, which can frequently happen with inexperienced users. Our code and test data are released at gitlab.inria.fr/tangram/oa-slam.
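Ellipsoid landmarks are convenient because a 3D ellipsoid, written as a dual quadric Q*, projects to a 2D ellipse (dual conic) through the relation C* = P Q* P^T, where P = K [R | t] is the camera projection matrix. The sketch below shows this standard projection machinery that ellipsoid-based object SLAM systems build on; it is generic background, not OA-SLAM's implementation.

```python
import numpy as np

def dual_quadric(center, semi_axes):
    """Dual quadric of an axis-aligned ellipsoid with the given center and semi-axes."""
    Q = np.diag([semi_axes[0]**2, semi_axes[1]**2, semi_axes[2]**2, -1.0])
    T = np.eye(4)
    T[:3, 3] = center                  # dual quadrics transform as Q* -> H Q* H^T
    return T @ Q @ T.T

def project_to_dual_conic(Q_star, K, R, t):
    """Project a dual quadric to its image dual conic: C* = P Q* P^T, P = K [R|t].
    The 2D ellipse (center, axes, orientation) can be read off C* after fixing
    its arbitrary projective scale."""
    P = K @ np.hstack([R, t.reshape(3, 1)])
    return P @ Q_star @ P.T
```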
We present a novel method for detecting 3D model instances and estimating their 6D poses from RGB data in a single shot. To this end, we extend the popular SSD paradigm to cover the full 6D pose space and train on synthetic model data only. Our approach competes or surpasses current state-of-the-art methods that leverage RGB-D data on multiple challenging datasets. Furthermore, our method produces these results at around 10Hz, which is many times faster than the related methods. For the sake of reproducibility, we make our trained networks and detection code publicly available.
We present a dataset of 998 3D models of everyday tabletop objects along with 847,000 real-world RGB and depth images of them. Accurate annotation of the camera pose and object poses for each image is performed in a semi-automated fashion to facilitate the use of the dataset for a variety of 3D applications such as shape reconstruction, object pose estimation, and shape retrieval. 3D reconstruction in particular lacks appropriate real-world benchmarks, and we demonstrate that our dataset can fill that gap. The entire annotated dataset, along with the source code of the annotation tools and evaluation baselines, is available at http://www.ocrtoc.org/3d-reconstruction.html.
We present an approach for planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging because it requires jointly reasoning about single-image 3D reconstruction, correspondence between images, and the relative camera poses between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, which uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success.
Estimating the camera poses associated with a set of images typically relies on feature matching between the images. In contrast, we are the first to address this challenge by using objectness regions, rather than explicit semantic object detection, to guide the pose estimation problem. We propose the Pose Refiner Network (PoserNet), a lightweight graph neural network, to refine approximate pairwise relative camera poses. PoserNet exploits associations between objectness regions, concisely expressed as bounding boxes, across multiple views to globally refine sparsely connected view graphs. We evaluate on the 7-Scenes dataset across graphs of varying sizes and show how this process can be beneficial to optimization-based motion averaging algorithms, improving the median rotation error by 62 degrees with respect to the initial estimates obtained from the bounding boxes. Code and data are available at https://github.com/iit-pavis/posernet.
Estimating the relative pose of a new object without prior knowledge is a difficult problem, yet it is an ability much needed in robotics and augmented reality. We present a method for tracking the 6D motion of objects in video sequences when neither training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore consider unknown objects in the open world instantly, without any prior information or a specific training phase. We consider two architectures, one based on two frames and the other relying on a Transformer encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data). Our source code is available at https://github.com/nv-nguyen/pizza.
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
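The core idea of the dense fusion step is to keep color and geometry features aligned per pixel/point and concatenate them before pose regression, rather than fusing a single global descriptor. Below is a minimal PyTorch sketch of such a per-point fusion module; the layer sizes and structure are illustrative, not the DenseFusion architecture.

```python
import torch
import torch.nn as nn

class DenseFuse(nn.Module):
    """Concatenate per-point color and geometry embeddings, then append a
    global feature so every point carries both local and scene-level context."""
    def __init__(self, rgb_dim=32, geo_dim=32, global_dim=128):
        super().__init__()
        self.global_mlp = nn.Sequential(
            nn.Conv1d(rgb_dim + geo_dim, global_dim, 1), nn.ReLU())

    def forward(self, rgb_feat, geo_feat):
        # rgb_feat, geo_feat: (B, C, N) features sampled at the same N pixels/points
        local = torch.cat([rgb_feat, geo_feat], dim=1)             # (B, 2C, N)
        global_feat = self.global_mlp(local).max(dim=2, keepdim=True).values
        global_feat = global_feat.expand(-1, -1, local.shape[2])   # broadcast to all N points
        return torch.cat([local, global_feat], dim=1)              # per-point fused feature

fused = DenseFuse()(torch.randn(2, 32, 500), torch.randn(2, 32, 500))
print(fused.shape)  # torch.Size([2, 192, 500])
```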
We introduce CenDerNet, a framework for 6D pose estimation from multi-view images based on center and curvature representations. Finding precise poses for reflective, textureless objects is a key challenge in industrial robotics. Our approach consists of three stages: first, a fully convolutional neural network predicts center and curvature heatmaps for each view; second, the center heatmaps are used to detect object instances and find their 3D centers; third, 6D object poses are estimated from the 3D centers and curvature heatmaps. By jointly optimizing poses across views with a render-and-compare approach, our method naturally handles occlusions and object symmetries. We show that CenDerNet outperforms previous methods on two industry-relevant datasets: DIMO and T-LESS.
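Detecting object instances from a center heatmap, as in the second stage above, is commonly done by finding local maxima above a score threshold. A minimal sketch with scipy follows; the threshold and window size are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_peaks(heatmap, threshold=0.3, window=5):
    """Return (row, col) coordinates of local maxima in a 2D heatmap that
    exceed the score threshold; each peak is one candidate object center."""
    local_max = heatmap == maximum_filter(heatmap, size=window)
    peaks = np.argwhere(local_max & (heatmap > threshold))
    return [tuple(p) for p in peaks]

h = np.zeros((64, 64))
h[20, 30] = 0.9     # two synthetic object centers
h[45, 10] = 0.7
print(heatmap_peaks(h))   # [(20, 30), (45, 10)]
```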
The 6D object pose estimation problem has been extensively studied in the fields of Computer Vision and Robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of Deep Learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we explore the available methods based on input modality, problem formulation, and whether they are category-level or instance-level approaches. As part of our discussion, we focus on how 6D object pose estimation can be used for understanding 3D scenes.