Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimate, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
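The abstract does not give the exact form of the disentangled representation. As a hedged sketch, the code below assumes one parameterization of this kind: the rotation update is applied about the object's own center, so it does not move the object, and the translation update is expressed as a pixel shift of the projected center (`vx`, `vy`) plus a log depth ratio `vz`. The function name, argument names, and intrinsics are hypothetical and chosen only for illustration.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_relative_pose(R_src, t_src, R_delta, v, fx, fy):
    """Apply a predicted relative pose to the current pose estimate.

    Assumed parameterization (not stated in the abstract):
      R_delta    rotation about the object center, expressed in camera axes
      v[0], v[1] shift of the projected object center, in pixels
      v[2]       log(z_src / z_new), a scale-like depth change
    """
    x, y, z = t_src
    vx, vy, vz = v
    z_new = z / math.exp(vz)
    # Move the projected center by (vx, vy) pixels, then lift it to depth z_new.
    x_new = (vx / fx + x / z) * z_new
    y_new = (vy / fy + y / z) * z_new
    # Rotation and translation are updated independently of each other.
    R_new = matmul3(R_delta, R_src)
    return R_new, (x_new, y_new, z_new)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# One refinement step: shift the center 8 px to the right and halve the depth.
R1, t1 = apply_relative_pose(IDENTITY, (0.0, 0.0, 1.0), IDENTITY,
                             (8.0, 0.0, math.log(2.0)), fx=800.0, fy=800.0)
print(t1)
```

In an iterative refiner, a step like this would be repeated: render the object at the updated pose, predict a new relative pose, and apply it again. Because the rotation and translation terms do not interact in this form, a pure rotation update leaves the object's image position unchanged.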
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
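The step from a localized 2D center and a predicted distance to a 3D translation follows from the pinhole camera model. The sketch below illustrates that back-projection; the function name and the intrinsics values are hypothetical and not taken from the paper.

```python
def backproject_center(cx, cy, tz, fx, fy, px, py):
    """Recover the 3D translation (tx, ty, tz) of an object center.

    (cx, cy): 2D object center in pixels
    tz:       distance of the center along the camera's optical axis
    (fx, fy): focal lengths in pixels; (px, py): principal point
    """
    # Invert the pinhole projection cx = fx * tx / tz + px (same for y).
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return tx, ty, tz


# Illustrative values only.
print(backproject_center(cx=400.0, cy=300.0, tz=2.0,
                         fx=800.0, fy=800.0, px=320.0, py=240.0))
```

Splitting translation this way means the network only has to predict two image-space quantities and one scalar depth, rather than regress a 3D vector directly.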
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
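Although category-level 9DoF object pose estimation has emerged recently, previous correspondence-based and direct regression methods are both limited in accuracy by large intra-category variations in object shape, color, and other properties. This work presents CATRE, a category-level object pose and size refiner that iteratively improves pose estimates from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts the relative transformation between the initial pose and the ground truth by aligning the partially observed point cloud with an abstract shape prior. Specifically, we propose a novel disentangled architecture that accounts for the inherent differences between rotation estimation and translation/size estimation. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on the REAL275, CAMERA25, and LM benchmarks while running at up to ~85.32 Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen categories. Code and trained models are available.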
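We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods, which are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities (RGB or depth) to be used. Moreover, we propose a novel pose refinement method based on differentiable rendering. The main idea is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusions are that RGB excels in correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the overall best performance. We perform an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality and the type of training data used.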
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
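Precise 6D pose estimation of rigid objects from RGB images is a crucial task in robotics and augmented reality. To address this problem, we propose DeepRM, a novel recurrent network architecture for 6D pose refinement. DeepRM leverages an initial coarse pose estimate to render a synthetic image of the target object. The rendered image is then matched with the observed image to predict a rigid transformation that updates the previous pose estimate. This process is repeated to incrementally refine the estimate at each iteration. LSTM units are used to propagate information through each refinement step, significantly improving overall performance. In contrast to many two-stage Perspective-n-Point based solutions, DeepRM is trained end-to-end and uses a scalable backbone that can be tuned via a single parameter for accuracy and efficiency. During training, a multi-scale optical flow head is added to predict the optical flow between the observed and synthetic images. Optical flow prediction stabilizes the training process and enforces the learning of features that are relevant to the pose estimation task. Our results demonstrate that DeepRM achieves state-of-the-art performance on two widely accepted challenging datasets.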
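This paper proposes IDAPS, a category-level 6D object pose and shape estimation approach that allows tracking the 6D poses of objects in a category and estimating their 3D shapes. We develop a category-level auto-encoder network that takes depth images as input, in which the feature embeddings from the auto-encoder encode the poses of objects in a category. The auto-encoder can be used in a particle filter framework to estimate and track the poses of objects in a category. By exploiting an implicit shape representation based on signed distance functions, we build a latent network to estimate a latent representation of the 3D shape given the estimated pose of an object. The estimated pose and shape can then be used to update each other in an iterative way. Our category-level 6D object pose and shape estimation pipeline requires only 2D detection and segmentation for initialization. We evaluate our approach on a publicly available dataset and demonstrate its effectiveness. In particular, our method achieves comparably high accuracy on shape estimation.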
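Estimating the relative pose of a new object without prior knowledge is a hard problem, yet it is a highly desirable capability in robotics and augmented reality. We present a method for tracking the 6D motion of objects when neither training images nor the 3D geometry of the objects are available. Thus, in contrast to previous works, our method can immediately handle unknown objects in the open world without requiring any prior information or a specific training phase. We consider two architectures, one based on two frames and the other relying on a Transformer encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require more information (training images of the target objects, 3D models, and/or depth data). Our source code is available at https://github.com/nv-nguyen/pizza.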
A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method on a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
We introduce a novel method for 3D object detection and pose estimation from color images only. We first use segmentation to detect the objects of interest in 2D even in presence of partial occlusions and cluttered background. By contrast with recent patch-based methods, we rely on a "holistic" approach: We apply to the detected objects a Convolutional Neural Network (CNN) trained to predict their 3D poses in the form of 2D projections of the corners of their 3D bounding boxes. This, however, is not sufficient for handling objects from the recent T-LESS dataset: These objects exhibit an axis of rotational symmetry, and the similarity of two images of such an object under two different poses makes training the CNN challenging. We solve this problem by restricting the range of poses used for training, and by introducing a classifier to identify the range of a pose at run-time before estimating it. We also use an optional additional step that refines the predicted poses. We improve the state-of-the-art on the LINEMOD dataset from 73.7% [2] to 89.3% of correctly registered RGB frames. We are also the first to report results on the Occlusion dataset [1] using color images only. We obtain 54% of frames passing the Pose 6D criterion on average on several sequences of the T-LESS dataset, compared to the 67% of the state-of-the-art [10] on the same sequences which uses both color and depth. The full approach is also scalable, as a single network can be trained for multiple objects simultaneously.
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
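The voting idea can be illustrated with a small RANSAC-style sketch: pick two pixels at random, intersect the lines defined by their predicted unit vectors to get a keypoint hypothesis, and score the hypothesis by how many pixels point toward it. This is a minimal illustration under assumed names and thresholds (`vote_keypoint`, `n_hyp`, `cos_thresh`), not the authors' implementation.

```python
import random


def intersect(p1, d1, p2, d2):
    """Intersect the 2D lines p1 + s*d1 and p2 + u*d2; None if near-parallel."""
    det = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(det) < 1e-9:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    s = (rx * d2[1] - ry * d2[0]) / det  # Cramer's rule
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def vote_keypoint(pixels, vectors, n_hyp=50, cos_thresh=0.9999, seed=0):
    """Return (hypothesis, score) with the most agreeing pixels."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], vectors[i], pixels[j], vectors[j])
        if h is None:
            continue
        score = 0
        for p, v in zip(pixels, vectors):
            dx, dy = h[0] - p[0], h[1] - p[1]
            norm = (dx * dx + dy * dy) ** 0.5
            # A pixel votes for h if its unit vector points toward h.
            if norm > 0 and (dx * v[0] + dy * v[1]) / norm >= cos_thresh:
                score += 1
        if score > best_score:
            best, best_score = h, score
    return best, best_score


# Synthetic example: a 5x5 patch of object pixels and a keypoint at (10, 5),
# which lies outside the patch (as an occluded or truncated keypoint would).
keypoint = (10.0, 5.0)
pixels, vectors = [], []
for y in range(5):
    for x in range(5):
        dx, dy = keypoint[0] - x, keypoint[1] - y
        n = (dx * dx + dy * dy) ** 0.5
        pixels.append((float(x), float(y)))
        # Corrupt one column to simulate wrong predictions.
        vectors.append((-1.0, 0.0) if x == 2 else (dx / n, dy / n))

estimate, votes = vote_keypoint(pixels, vectors)
print(estimate, votes)
```

Because every visible pixel contributes a direction, the keypoint can be recovered even when it is not itself visible, which is the property the abstract highlights. The spread of high-scoring hypotheses could also serve as an uncertainty estimate, though that step is not shown here.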
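Advances in deep learning recognition have led to accurate object detection with 2D images. However, these 2D perception methods are insufficient for complete 3D world information. Meanwhile, advanced 3D shape estimation approaches focus on the shape itself without considering metric scale. These methods cannot determine the accurate location and orientation of objects. To tackle this problem, we propose a framework that jointly estimates metric-scale shape and pose from a single RGB image. Our framework has two branches: the Metric Scale Object Shape branch (MSOS) and the Normalized Object Coordinate Space branch (NOCS). The MSOS branch estimates the metric-scale shape observed in camera coordinates. The NOCS branch predicts the normalized object coordinate space (NOCS) map and performs a similarity transformation with the depth map rendered from the predicted metric-scale mesh to obtain the 6D pose and size. Additionally, we introduce Normalized Object Center Estimation (NOCE) to estimate the geometrically aligned distance from the camera to the object center. We validate our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.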
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS), a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [11] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster, at 50 fps on a Titan X (Pascal) GPU, and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [28,29] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent approaches [26] when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.
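To make the prediction target concrete, the sketch below projects the eight corners of an object-frame 3D bounding box into the image with a pinhole camera. These 2D points are the kind of quantity such a network would regress; a PnP solver would then recover the pose from the resulting 2D-3D correspondences. The function name, box size, and intrinsics are hypothetical, and the PnP step itself is not shown.

```python
from itertools import product


def project_bbox_corners(half_extents, R, t, fx, fy, px, py):
    """Project the 8 corners of an object-frame 3D box into the image.

    half_extents: (hx, hy, hz) half side lengths of the box
    R, t:         object-to-camera rotation (3x3 nested list) and translation
    fx, fy:       focal lengths in pixels; (px, py): principal point
    """
    hx, hy, hz = half_extents
    points_2d = []
    for sx, sy, sz in product((-1, 1), repeat=3):
        X = (sx * hx, sy * hy, sz * hz)
        # Transform the corner into the camera frame: Xc = R X + t.
        xc, yc, zc = (sum(R[i][k] * X[k] for k in range(3)) + t[i]
                      for i in range(3))
        # Pinhole projection to pixel coordinates.
        points_2d.append((fx * xc / zc + px, fy * yc / zc + py))
    return points_2d


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A 20 cm cube placed 2 m in front of the camera (illustrative values).
corners = project_bbox_corners((0.1, 0.1, 0.1), IDENTITY, (0.0, 0.0, 2.0),
                               fx=800.0, fy=800.0, px=320.0, py=240.0)
for u, v in corners:
    print(round(u, 2), round(v, 2))
```

Predicting box corners rather than the pose itself keeps the network output in image space, which is where a convolutional detector is naturally accurate.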
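This paper introduces a novel multi-view 6 DoF object pose refinement approach focusing on improving methods trained on synthetic data. It is based on the DPOD detector, which produces dense 2D-3D correspondences in each frame. We opt to use multiple frames with known camera transformations, as this allows geometric constraints to be introduced via an interpretable ICP-like loss function. The loss function is implemented with a differentiable renderer and is optimized iteratively. We also demonstrate that a full detection and refinement pipeline trained solely on synthetic data can be used for auto-labeling real data. We perform quantitative evaluation on the LINEMOD, Occlusion, HomebrewedDB, and YCB-V datasets and report excellent performance in comparison to state-of-the-art methods trained on synthetic and real data. We demonstrate empirically that our approach requires only a few frames and is robust to closely placed camera positions and to noise in the extrinsic camera calibration, making its practical usage easier and more ubiquitous.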
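This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither a 3D CAD model of the object nor any prior knowledge of its symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets, T-LESS, YCB-V, and ITODD, and achieves state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/sc6d-pose.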
The 6D object pose estimation problem has been extensively studied in the fields of computer vision and robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of deep learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we explore the available methods based on input modality, problem formulation, and whether the approach is category-level or instance-level. As part of our discussion, we focus on how 6D object pose estimation can be used for understanding 3D scenes.
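We present a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for symmetric and asymmetric objects alike. To the best of our knowledge, our system is among the first to utilize the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects, ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict the Gaussian covariance of the keypoints, which captures the true error of the prediction. The covariance is therefore useful not only as a weight for the residuals in the system's optimization problems, but also as a means to detect harmful statistical outliers without choosing a manual threshold. Experiments show that our method provides performance competitive with the state of the art in 6DoF object pose estimation, at real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.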
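We present a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the first defined only for query points in the vicinity of the surface. We call the mapping realized by this network a Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown to be superior in challenging cases with occlusion. The project website is at: linhuang17.github.io/ncf.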