In RGB-D based 6D pose estimation, direct regression approaches predict the 3D rotation and translation directly from RGB-D data, allowing for quick deployment and efficient inference. However, directly regressing the absolute translation of the pose suffers from the gap between the object translation distributions of the training and testing datasets, which is usually caused by the diversity of object poses in 3D physical space. To this end, we generalize the pin-hole camera projection model to a residual-based projection model and propose the projective residual regression (Res6D) mechanism. Given a reference point for each object in an RGB-D image, Res6D not only reduces the distribution gap and shrinks the regression target to a small range by regressing the residual between the target and the reference point, but also aligns its output residual and its input to follow the projection equation between the 2D plane and 3D space. By plugging Res6D into the latest direct regression methods, we achieve state-of-the-art overall results on the Occlusion LineMOD (ADD(S): 79.7%), LineMOD (ADD(S): 99.5%), and YCB-Video (AUC of ADD(S): 95.4%) datasets.
translated by 谷歌翻译
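Few 6D pose estimation methods use a backbone network to extract features from both RGB and depth images, and Uni6D is the pioneer in doing so. We find that the main causes of the performance limitation in Uni6D are Instance-Outside and Instance-Inside noise. Uni6D inevitably introduces Instance-Outside noise from background pixels in the receptive field because of its inherently straightforward pipeline design, and it ignores the Instance-Inside noise in the input depth data. In this work, we propose a two-step denoising method to handle the aforementioned noise in Uni6D. In the first step, an instance segmentation network is used to crop and mask the instance to remove noise from non-instance regions. In the second step, a lightweight depth denoising module is proposed to calibrate the depth features before they are fed into the pose regression network. Extensive experiments show that our method, called Uni6Dv2, eliminates the noise effectively and robustly, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data, which requires costly labeling.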
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
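This paper presents an efficient symmetry-agnostic and correspondence-free framework, called SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither a 3D CAD model of the object nor any prior knowledge of its symmetries. Pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets (T-LESS, YCB-V, and ITODD) and achieves state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/sc6d-pose.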
A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method on a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
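6D pose estimation of rigid objects from RGB-D images is crucial for object grasping and manipulation in robotics. Although the RGB channels and the depth (D) channel are often complementary, providing appearance and geometry information respectively, how to fully benefit from the two cross-modal data sources remains non-trivial. We start from a simple yet new observation: when an object rotates, its semantic label is invariant to the pose, while its keypoint offset direction varies with the pose. To this end, we present SO(3)-Pose, a new representation learning network that explores SO(3)-equivariant and SO(3)-invariant features from the depth channel for pose estimation. The SO(3)-invariant features help learn more distinctive representations for segmenting objects with similar appearance in the RGB channels. The SO(3)-equivariant features communicate with the RGB features to deduce the (missing) geometry for detecting keypoints of objects with reflective surfaces from the depth channel. Unlike most existing pose estimation methods, our SO(3)-Pose not only implements information communication between the RGB and depth channels, but also naturally absorbs SO(3)-equivariant geometry knowledge from depth images, leading to better appearance and geometry representation learning. Comprehensive experiments show that our method achieves state-of-the-art performance on three benchmarks.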
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
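To make the translation step above concrete, here is a minimal sketch of how a 2D object center and a predicted distance determine the 3D translation, assuming a standard pin-hole camera with focal lengths fx, fy and principal point (px, py). The function names and the intrinsics are illustrative assumptions and are not taken from the PoseCNN code.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def backproject_center(cx: float, cy: float, tz: float,
                       fx: float, fy: float, px: float, py: float) -> Vec3:
    """Recover the 3D translation (tx, ty, tz) from the projected 2D object
    center (cx, cy) and the object's distance tz along the camera z-axis,
    by inverting the pin-hole projection."""
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return (tx, ty, tz)


def project(t: Vec3, fx: float, fy: float, px: float, py: float) -> Tuple[float, float]:
    """Pin-hole projection of a 3D point in the camera frame onto the image plane."""
    tx, ty, tz = t
    return (fx * tx / tz + px, fy * ty / tz + py)
```

Projecting the recovered translation with `project` returns the original 2D center, which is a quick sanity check on the intrinsics being used.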
In this paper, we propose a novel 3D graph convolution based pipeline for category-level 6D pose and size estimation from monocular RGB-D images. The proposed method leverages an efficient 3D data augmentation and a novel vector-based decoupled rotation representation. Specifically, we first design an orientation-aware autoencoder with 3D graph convolution for latent feature learning. The learned latent feature is insensitive to point shift and size thanks to the shift- and scale-invariance properties of the 3D graph convolution. Then, to efficiently decode the rotation information from the latent feature, we design a novel flexible vector-based decomposable rotation representation that employs two decoders to complementarily access the rotation information. The proposed rotation representation has two major advantages: 1) its decoupled nature makes the rotation estimation easier; 2) the flexible length and rotated angle of the vectors allow us to find a more suitable vector representation for a specific pose estimation task. Finally, we propose a 3D deformation mechanism to increase the generalization ability of the pipeline. Extensive experiments show that the proposed pipeline achieves state-of-the-art performance on category-level tasks. Further, the experiments demonstrate that the proposed rotation representation is more suitable for pose estimation tasks than other rotation representations.
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
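The voting idea above can be illustrated with a simplified sketch: pairs of pixels are sampled, their predicted direction lines are intersected to form a keypoint hypothesis, and each hypothesis is scored by how many pixels point toward it. This is an assumption-laden toy version in plain Python, not the PVNet implementation; the names `vote_keypoint` and `unit_direction`, the cosine threshold, and the hypothesis count are all hypothetical choices.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def unit_direction(src: Point, dst: Point) -> Point:
    """Unit vector pointing from src to dst (stands in for a network prediction)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    n = math.hypot(dx, dy)
    return (dx / n, dy / n)


def cross(a: Point, b: Point) -> float:
    return a[0] * b[1] - a[1] * b[0]


def intersect(p1: Point, d1: Point, p2: Point, d2: Point) -> Optional[Point]:
    """Intersect the lines p1 + s*d1 and p2 + t*d2; None if (nearly) parallel."""
    denom = cross(d1, d2)
    if abs(denom) < 1e-9:
        return None
    r = (p2[0] - p1[0], p2[1] - p1[1])
    s = cross(r, d2) / denom
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def vote_keypoint(pixels: List[Point], directions: List[Point],
                  n_hypotheses: int = 50, cos_thresh: float = 0.99,
                  seed: int = 0) -> Tuple[Optional[Point], int]:
    """RANSAC-style voting: return the hypothesis supported by the most pixels."""
    rng = random.Random(seed)
    best, best_votes = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], directions[i], pixels[j], directions[j])
        if h is None:
            continue
        votes = 0
        for p, d in zip(pixels, directions):
            vx, vy = h[0] - p[0], h[1] - p[1]
            norm = math.hypot(vx, vy)
            # A pixel votes for h if its predicted direction agrees with the
            # direction from the pixel to h (or if the pixel coincides with h).
            if norm < 1e-9 or (vx * d[0] + vy * d[1]) / norm >= cos_thresh:
                votes += 1
        if votes > best_votes:
            best, best_votes = h, votes
    return best, best_votes
```

Because a hypothesis only needs two consistent pixels, a keypoint that lies under an occluder or outside the image can still be recovered from the visible part of the object.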
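Estimating the 6D poses of objects is an essential computer vision task. However, most conventional approaches rely on camera data from a single perspective and therefore suffer from occlusions. We overcome this issue with our novel multi-view 6D pose estimation method called MV6D, which accurately predicts the 6D poses of all objects in a cluttered scene based on RGB-D images from multiple perspectives. We base our approach on the PVN3D network, which uses a single RGB-D image to predict keypoints of the target objects. We extend this approach by using a combined point cloud from multiple views and fusing the images from each view with a DenseFusion layer. In contrast to current multi-view pose detection networks such as CosyPose, our MV6D can learn the fusion of multiple perspectives in an end-to-end manner and does not require multiple prediction stages or subsequent fine-tuning of the prediction. Furthermore, we present three novel photorealistic datasets of cluttered scenes with heavy occlusions. All of them contain RGB-D images from multiple perspectives and the ground truth for instance semantic segmentation and 6D pose estimation. MV6D significantly outperforms the state of the art in multi-view 6D pose estimation, even when the camera poses are known inaccurately. Furthermore, we show that our approach is robust towards dynamic camera setups and that its accuracy increases incrementally with a growing number of perspectives.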
The 6D object pose estimation problem has been extensively studied in the fields of computer vision and robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of deep learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we explore the available methods, organized by input modality, problem formulation, and whether they are category-level or instance-level approaches. As part of our discussion, we focus on how 6D object pose estimation can be used for understanding 3D scenes.
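While category-level 9DoF object pose estimation has emerged recently, previous correspondence-based and direct regression methods are both limited in accuracy because of the large intra-category variances in object shape, color, and so on. We present CATRE, a category-level object pose and size refiner that iteratively enhances pose estimates from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts the relative transformation between the initial pose and the ground truth by aligning the partially observed point cloud with an abstract shape prior. Specifically, we propose a novel disentangled architecture that is aware of the inherent distinctions between rotation and translation/size estimation. Extensive experiments show that our approach outperforms state-of-the-art methods on the REAL275, CAMERA25, and LM benchmarks at a speed of up to ~85.32 Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen categories. Code and trained models are available.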
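We present a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods, which are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities to be used (RGB or depth). Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusions are that RGB excels in correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We perform an extensive evaluation and ablation studies to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.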
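Estimating the 6D pose of unseen objects is in great demand for many real-world applications. However, current state-of-the-art pose estimation methods can only handle objects that they were previously trained on. In this paper, we propose a new task that enables algorithms to estimate the 6D pose of novel objects during testing. We collect a dataset with both real and synthetic images and up to 48 unseen objects in the test set. We also propose a new metric named Infimum ADD (IADD), which is an invariant measurement for objects with different types of pose ambiguity. A two-stage baseline solution for this task is provided as well. By training an end-to-end 3D correspondence network, our method finds corresponding points between an unseen object and a partial-view RGBD image accurately and efficiently. It then calculates the 6D pose from the correspondences using an algorithm that is robust to object symmetry. Extensive experiments show that our method outperforms several intuitive baselines, thus verifying its effectiveness. All data, code, and models will be made publicly available. Project page: www.graspnet.net/unseen6d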
6D object pose estimation has been a research topic in the field of computer vision and robotics. Many modern applications, such as robot grasping, manipulation, and autonomous navigation, require the correct pose of the objects present in a scene to perform their specific tasks. The problem becomes even harder when the objects are placed in a cluttered scene and the level of occlusion is high. Prior works have tried to overcome this problem but could not achieve accuracy that can be considered reliable in real-world applications. In this paper, we present an architecture that, unlike prior work, is context-aware: it utilizes the context information available about the objects. Our proposed architecture treats objects separately according to their type, i.e., symmetric or non-symmetric. A deeper estimator and refiner network pair is used for non-symmetric objects than for symmetric ones because of their intrinsic differences. Our experiments show an accuracy improvement of about 3.2% over the prior state-of-the-art DenseFusion on the LineMOD dataset, which is considered a benchmark for pose estimation in occluded and cluttered scenes. Our results also show that the inference time is sufficient for real-time usage.
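Current RGB-based 6D object pose estimation methods have achieved noticeable performance on datasets and in real-world applications. However, predicting the 6D pose from single 2D image features is susceptible to disturbance from changes in the environment and from textureless or similar-looking object surfaces. Hence, RGB-based methods generally achieve less competitive results than RGBD-based methods, which deploy both image features and 3D structure features. To narrow this performance gap, this paper proposes a framework for 6D object pose estimation that learns implicit 3D information from 2 RGB images. Combining the learned 3D information with 2D image features, we establish more stable correspondences between the scene and the object models. To find the best way of using 3D information from RGB inputs, we investigate three different approaches: early fusion, mid fusion, and late fusion. We find that mid fusion is the best approach for recovering the most precise 3D keypoints, which are useful for object pose estimation. The experiments show that our method outperforms state-of-the-art RGB-based methods and achieves results comparable to RGBD-based methods.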
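We present a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully-connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the former defined only for query points in the vicinity of the surface. We call the mapping realized by this network a Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown to be superior especially in challenging cases with occlusion. The project website is at: linhuang17.github.io/ncf.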
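Advances in deep learning recognition have led to accurate object detection with 2D images. However, these 2D perception methods are insufficient for obtaining complete 3D world information. Meanwhile, advanced 3D shape estimation approaches focus on the shape itself without considering metric scale, so they cannot determine the accurate location and orientation of objects. To tackle this problem, we propose a framework that jointly estimates a metric-scale shape and pose from a single RGB image. Our framework has two branches: the Metric Scale Object Shape branch (MSOS) and the Normalized Object Coordinate Space branch (NOCS). The MSOS branch estimates the metric-scale shape observed in camera coordinates. The NOCS branch predicts the normalized object coordinate space (NOCS) map and performs a similarity transformation with the depth map rendered from the predicted metric-scale mesh to obtain the 6D pose and size. Additionally, we introduce Normalized Object Center Estimation (NOCE) to estimate the geometrically aligned distance from the camera to the object center. We validate our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.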
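6D object pose estimation is one of the fundamental problems in computer vision and robotics research. While many recent efforts have been made to generalize pose estimation to novel object instances within the same category, namely category-level 6D pose estimation, it is still restricted to constrained environments given the limited amount of annotated data. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We utilize this data to generalize category-level 6D object pose estimation in the wild with semi-supervised learning. We propose a new model, called Rendering for Pose estimation network (RePoNet), that is jointly trained using the free ground truth of synthetic data and a silhouette matching objective function on real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods on the previous dataset and on our Wild6D test set (with manual annotations for evaluation) by a large margin. Project page with Wild6D data: https://oasisyang.github.io/semi-pose.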
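We propose a novel keypoint voting scheme based on intersecting spheres that is more accurate than existing schemes and allows for fewer, more disperse keypoints. The scheme is based on the distance between points, which as a 1D quantity can be regressed more accurately than the 2D and 3D vector and offset quantities regressed in previous work, yielding more accurate keypoint localization. The scheme forms the basis of the proposed RCVPose method for 6 DoF pose estimation in RGB-D data, which is particularly effective at handling occlusions. A CNN is trained to estimate the distance between the 3D point corresponding to the depth mode of each RGB pixel and a set of 3 disperse keypoints defined in the object frame. At inference, a sphere is generated at each 3D point, with a radius equal to this estimated distance. The surfaces of these spheres vote to increment a 3D accumulator space, the peaks of which indicate keypoint locations. The proposed radial voting scheme is more accurate than previous vector or offset schemes and is robust to disperse keypoints. Experiments show that RCVPose is highly accurate and competitive, achieving state-of-the-art results on the LINEMOD (99.7%) and YCB-Video (97.2%) datasets, and notably scoring +7.9% higher (71.1%) than previous methods on the challenging Occlusion LINEMOD dataset.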
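The radial voting step described above can be sketched as follows, assuming a small regular voxel grid as the accumulator and a brute-force loop over voxels. The grid layout, the half-voxel shell tolerance, and the function name `radial_vote` are illustrative assumptions and not details of the RCVPose implementation.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def radial_vote(points: Sequence[Vec3], radii: Sequence[float],
                grid_min: Vec3, grid_size: int, voxel: float) -> Tuple[Vec3, int]:
    """Each 3D point votes for every voxel whose center lies on the surface of
    the sphere centered at that point with the estimated radius (within half a
    voxel). The accumulator peak is returned as the keypoint estimate."""
    acc: Counter = Counter()
    half = voxel / 2.0
    for p, r in zip(points, radii):
        for i in range(grid_size):
            for j in range(grid_size):
                for k in range(grid_size):
                    center = (grid_min[0] + (i + 0.5) * voxel,
                              grid_min[1] + (j + 0.5) * voxel,
                              grid_min[2] + (k + 0.5) * voxel)
                    if abs(math.dist(center, p) - r) <= half:
                        acc[(i, j, k)] += 1
    if not acc:
        raise ValueError("no voxel received a vote")
    (i, j, k), votes = acc.most_common(1)[0]
    peak = (grid_min[0] + (i + 0.5) * voxel,
            grid_min[1] + (j + 0.5) * voxel,
            grid_min[2] + (k + 0.5) * voxel)
    return peak, votes
```

Since every vote depends only on a scalar distance, a keypoint can be placed far from the object surface and still be located wherever enough sphere surfaces intersect.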