Few 6D pose estimation methods use a single backbone network to extract features from both RGB and depth images; Uni6D was the pioneer in doing so. We find that the primary causes of the performance limitation in Uni6D are instance-outside and instance-inside noise. Uni6D inevitably introduces instance-outside noise from background pixels in the receptive field due to its inherently straightforward pipeline design, and it ignores the instance-inside noise in the input depth data. In this work, we propose a two-step denoising method to handle the aforementioned noise in Uni6D. In the first step, an instance segmentation network is used to crop and mask the instance, removing noise from non-instance regions. In the second step, a lightweight depth denoising module is proposed to calibrate the depth feature before feeding it into the pose regression network. Extensive experiments show that our method, called Uni6Dv2, eliminates the noise effectively and robustly, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data, which requires costly labeling.
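A minimal sketch of the first denoising step described above, assuming a binary instance mask predicted by a segmentation network; pixels outside the instance are zeroed and the patch is cropped so background (instance-outside) noise cannot leak into the receptive field of the pose regressor. This is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def crop_and_mask_instance(rgb, depth, mask, pad=8):
    """Step 1 (sketch): keep only pixels inside the predicted instance mask.

    rgb:   (H, W, 3) uint8 image
    depth: (H, W)    float32 depth map
    mask:  (H, W)    bool instance mask from a segmentation network
    Returns cropped-and-masked RGB and depth patches.
    """
    ys, xs = np.where(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, mask.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, mask.shape[1])
    rgb_crop = rgb[y0:y1, x0:x1] * mask[y0:y1, x0:x1, None]
    depth_crop = depth[y0:y1, x0:x1] * mask[y0:y1, x0:x1]
    return rgb_crop, depth_crop
```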
In RGB-D based 6D pose estimation, direct regression approaches can directly predict the 3D rotation and translation from RGB-D data, allowing for quick deployment and efficient inference. However, directly regressing the absolute translation of the pose suffers from diverse object translation distribution between the training and testing datasets, which is usually caused by the diversity of pose distribution of objects in 3D physical space. To this end, we generalize the pin-hole camera projection model to a residual-based projection model and propose the projective residual regression (Res6D) mechanism. Given a reference point for each object in an RGB-D image, Res6D not only reduces the distribution gap and shrinks the regression target to a small range by regressing the residual between the target and the reference point, but also aligns its output residual and its input to follow the projection equation between the 2D plane and 3D space. By plugging Res6D into the latest direct regression methods, we achieve state-of-the-art overall results on the Occlusion LineMOD (ADD(S): 79.7%), LineMOD (ADD(S): 99.5%), and YCB-Video (AUC of ADD(S): 95.4%) datasets.
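The sketch below illustrates one reading of the residual-based projection idea (not the authors' exact formulation): a small regressed residual relative to a reference point is mapped back to an absolute 3D translation through the pinhole camera model, so the network only has to cover a narrow target range.

```python
import numpy as np

def residual_to_translation(ref_uvz, pred_residual, K):
    """Hypothetical residual-based back-projection.

    ref_uvz:       (u_r, v_r, z_r) reference pixel and its depth
    pred_residual: (du, dv, dz) regressed residual w.r.t. the reference point
    K:             3x3 camera intrinsic matrix
    Returns the absolute 3D translation obtained by adding the residual to the
    reference point and back-projecting through the pinhole model.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u = ref_uvz[0] + pred_residual[0]
    v = ref_uvz[1] + pred_residual[1]
    z = ref_uvz[2] + pred_residual[2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```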
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
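A toy RANSAC voting routine in the spirit of the pixel-wise vector-field representation described above (a simplified sketch, not the official PVNet implementation): keypoint hypotheses are generated by intersecting rays from random pixel pairs and scored by how many pixels' predicted directions agree with them.

```python
import numpy as np

def ransac_vote_keypoint(pixels, directions, n_iter=128, inlier_thresh=0.99):
    """pixels: (N, 2) object-pixel coordinates; directions: (N, 2) unit vectors
    predicted at those pixels, each pointing towards the (possibly occluded)
    keypoint. Returns the keypoint hypothesis supported by the most pixels."""
    rng = np.random.default_rng(0)
    best_hyp, best_score = None, -1
    for _ in range(n_iter):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + t*d_i and p_j + s*d_j.
        A = np.stack([directions[i], -directions[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue
        t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        hyp = pixels[i] + t * directions[i]
        # A pixel is an inlier if its direction agrees with the direction
        # from that pixel towards the hypothesis (cosine threshold).
        to_hyp = hyp - pixels
        to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-8
        score = np.sum(np.sum(to_hyp * directions, axis=1) > inlier_thresh)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```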
In this work, we present a data generation pipeline that leverages the 3D suite Blender to produce synthetic RGBD image datasets with 6D poses. The proposed pipeline can efficiently generate large amounts of photo-realistic RGBD images for the object of interest. In addition, a collection of domain randomization techniques is introduced to bridge the gap between real and synthetic data. Furthermore, we develop a real-time two-stage 6D pose estimation approach for time-sensitive robotic applications by integrating the object detector YOLO-v4-tiny with the 6D pose estimation algorithm PVN3D. With the proposed data generation pipeline, our pose estimation approach can be trained from scratch using only synthetic data, without any pre-trained models. The resulting network shows competitive performance compared with state-of-the-art methods when evaluated on the LineMOD dataset. We also demonstrate the proposed approach in robotic experiments, grasping household objects from cluttered backgrounds under different lighting conditions.
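A minimal sketch of the kind of domain-randomized rendering loop such a Blender pipeline can use, written against the Blender Python API; the object and light names, paths, and value ranges here are hypothetical placeholders rather than the authors' configuration, and exporting depth and 6D pose labels is omitted.

```python
import math
import random
import bpy

obj = bpy.data.objects["target_object"]   # placeholder name of the object of interest
light = bpy.data.objects["Light"]         # placeholder name of a scene light

for i in range(1000):
    # Randomize object pose within an assumed workspace.
    obj.rotation_euler = [random.uniform(0, 2 * math.pi) for _ in range(3)]
    obj.location = (random.uniform(-0.2, 0.2),
                    random.uniform(-0.2, 0.2),
                    random.uniform(0.3, 0.8))
    # Randomize lighting strength.
    light.data.energy = random.uniform(100, 1000)
    # Render an RGB frame; depth maps and pose labels would be exported alongside.
    bpy.context.scene.render.filepath = f"/tmp/synthetic/rgb_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
```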
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
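Two small helpers illustrate the pose parameterization described above; a sketch under the assumption of known camera intrinsics, not the PoseCNN code itself: the 3D translation follows from the localized object center and the predicted distance via the pinhole model, and the regressed quaternion is converted to a rotation matrix.

```python
import numpy as np

def center_depth_to_translation(cx_px, cy_px, z, K):
    """Recover the 3D translation from the object's image center (cx_px, cy_px)
    and its predicted distance z to the camera, given intrinsics K."""
    fx, fy = K[0, 0], K[1, 1]
    px, py = K[0, 2], K[1, 2]
    tx = (cx_px - px) * z / fx
    ty = (cy_px - py) * z / fy
    return np.array([tx, ty, z])

def quaternion_to_rotation(q):
    """Convert a regressed quaternion (w, x, y, z) into a rotation matrix,
    normalizing it to unit length first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
```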
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
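A minimal sketch of pixel-wise dense fusion, simplified from the description above; the channel sizes and the single pooling layer are illustrative assumptions, not the authors' exact configuration: per-pixel color features and per-point geometry features are concatenated point by point and enriched with a pooled global feature.

```python
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    """Concatenate per-point RGB and geometry features and append a global
    feature pooled over all points (sketch only)."""
    def __init__(self, c_rgb=32, c_geo=32, c_global=128):
        super().__init__()
        self.global_mlp = nn.Conv1d(c_rgb + c_geo, c_global, 1)

    def forward(self, f_rgb, f_geo):
        # f_rgb, f_geo: (B, C, N) features sampled at the same N pixels/points
        per_point = torch.cat([f_rgb, f_geo], dim=1)                 # (B, 64, N)
        glob = self.global_mlp(per_point).max(dim=2, keepdim=True).values
        glob = glob.expand(-1, -1, per_point.shape[2])               # (B, 128, N)
        return torch.cat([per_point, glob], dim=1)                   # (B, 192, N)
```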
Object pose estimation has several important applications, such as robotic grasping and augmented reality. We present a method for estimating the 6D pose of objects that improves on the accuracy of current proposals while still running in real time. Our method uses RGB-D data as input to segment objects and estimate their poses. It uses a neural network with multiple heads: one head estimates the object class and generates the mask, a second estimates the values of the translation vector, and a final head estimates the values of the quaternion that represents the object's rotation. These heads leverage a pyramid architecture used during feature extraction and feature fusion. Our method runs in real time, with a low inference time of 0.12 seconds, and achieves high accuracy. With this combination of fast inference and good accuracy, our method can be used in robotic pick-and-place tasks and/or augmented reality applications.
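An illustrative sketch of the multi-head design described above; the shared backbone and layer sizes are placeholders rather than the authors' architecture: one head produces class logits, one the translation vector, and one a unit quaternion for the rotation.

```python
import torch
import torch.nn as nn

class MultiHeadPoseNet(nn.Module):
    """Shared features feed three heads: classification, translation, rotation."""
    def __init__(self, feat_dim=256, n_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(1024, feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, n_classes)    # class / mask logits
        self.trans_head = nn.Linear(feat_dim, 3)          # translation vector
        self.rot_head = nn.Linear(feat_dim, 4)            # quaternion

    def forward(self, fused_features):
        f = self.backbone(fused_features)
        quat = nn.functional.normalize(self.rot_head(f), dim=-1)
        return self.cls_head(f), self.trans_head(f), quat
```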
6D pose estimation of rigid objects from RGB-D images is crucial for object grasping and manipulation in robotics. Although the RGB channels and the depth (D) channel are usually complementary, providing appearance and geometry information respectively, it remains non-trivial to fully benefit from the two cross-modal data sources. Starting from the simple yet new observation that, when an object rotates, its semantic label is pose-invariant while its keypoint offset directions are pose-variant, we propose SO(3)-Pose, a new representation learning network that explores SO(3)-equivariant and SO(3)-invariant features from the depth channel for pose estimation. The SO(3)-invariant features facilitate learning more distinctive representations for segmenting objects that look similar in the RGB channels. The SO(3)-equivariant features communicate with the RGB features to deduce the (missing) geometry for detecting keypoints of objects with reflective surfaces from the depth channel. Unlike most existing pose estimation methods, our SO(3)-Pose not only enables information communication between the RGB and depth channels, but also naturally absorbs SO(3)-equivariance geometric knowledge from depth images, leading to better appearance and geometry representation learning. Comprehensive experiments show that our method achieves state-of-the-art performance on three benchmarks.
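A toy numerical illustration of the two feature types named above (not the paper's network): pairwise point distances are SO(3)-invariant, g(Rx) = g(x), while the point centroid is SO(3)-equivariant, f(Rx) = R f(x), i.e. it rotates together with the input.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))                      # toy point cloud
R = Rotation.random(random_state=1).as_matrix()    # random rotation in SO(3)
rotated = pts @ R.T

# Invariant feature: pairwise distances are unchanged by the rotation.
dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
dists_rot = np.linalg.norm(rotated[:, None] - rotated[None, :], axis=-1)
assert np.allclose(dists, dists_rot)

# Equivariant feature: the centroid transforms with the same rotation.
assert np.allclose(rotated.mean(axis=0), R @ pts.mean(axis=0))
```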
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
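A structural sketch of the render-and-compare refinement loop described above; `render` and `predict_delta` stand in for a renderer and the learned matching network and are hypothetical placeholders, not DeepIM's modules.

```python
import numpy as np

def render_and_compare_refinement(T_init, render, observed, predict_delta, n_iters=4):
    """T_init: 4x4 initial pose estimate. At each iteration the object is
    rendered at the current estimate, compared against the observed image, and
    the predicted relative transform is composed onto the pose."""
    T = T_init.copy()
    for _ in range(n_iters):
        synthetic = render(T)                        # render object at current pose
        delta = predict_delta(synthetic, observed)   # predicted 4x4 relative transform
        T = delta @ T                                # update the pose estimate
    return T
```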
Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best results. Analysing all visible objects demands multiple inferences, which is memory- and time-consuming. We present a new single-stage architecture called CASAPose that determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass. It is fast and memory efficient, and achieves high accuracy for multiple objects by exploiting the output of a semantic segmentation decoder as control input to a keypoint recognition decoder via local class-adaptive normalisation. Our new differentiable regression of keypoint locations significantly contributes to a faster closing of the domain gap between real test and synthetic training data. We apply segmentation-aware convolutions and upsampling operations to increase the focus inside the object mask and to reduce mutual interference of occluding objects. For each inserted object, the network grows by only one output segmentation map and a negligible number of parameters. We outperform state-of-the-art approaches in challenging multi-object scenes with inter-object occlusion and synthetic training.
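A toy sketch of class-adaptive normalisation as a way of conditioning a keypoint decoder on a segmentation output; channel sizes and the exact conditioning scheme are assumptions for illustration, not the CASAPose layers: each class owns a learned scale and shift that are selected per pixel by the predicted class map.

```python
import torch
import torch.nn as nn

class ClassAdaptiveNorm(nn.Module):
    """Normalize decoder features, then modulate them with per-class
    scale/shift parameters gathered from the segmentation map."""
    def __init__(self, n_classes, n_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(n_channels, affine=False)
        self.gamma = nn.Embedding(n_classes, n_channels)
        self.beta = nn.Embedding(n_classes, n_channels)

    def forward(self, feat, seg):
        # feat: (B, C, H, W) decoder features, seg: (B, H, W) integer class map
        g = self.gamma(seg).permute(0, 3, 1, 2)   # (B, C, H, W)
        b = self.beta(seg).permute(0, 3, 1, 2)
        return self.norm(feat) * g + b
```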
6D object pose estimation problem has been extensively studied in the field of Computer Vision and Robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of Deep Learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we will explore the available methods based on input modality, problem formulation, and whether the approach is category-level or instance-level. As a part of our discussion, we will focus on how 6D object pose estimation can be used for understanding 3D scenes.
This paper presents an efficient symmetry-agnostic and correspondence-free framework, called SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of its symmetries. Pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance (translation along the z-axis) estimation via classification. SC6D is evaluated on three benchmark datasets (T-LESS, YCB-V, and ITODD) and achieves state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally much more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/sc6d-pose.
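The sketch below is one reading of the distance-by-classification sub-task, not the released SC6D code: the network outputs logits over discrete, scale-invariant distance bins, and the soft expectation is mapped back to a metric z-translation with a crop-dependent scale factor (the exact normalization is an assumption here).

```python
import numpy as np

def zbins_to_distance(logits, z_bins, scale_factor):
    """logits: (K,) classification scores over K distance bins;
    z_bins: (K,) bin centers of the crop-normalized distance;
    scale_factor: crop-dependent factor that restores the metric scale."""
    p = np.exp(logits - logits.max())   # stable softmax
    p /= p.sum()
    return float(p @ z_bins) * scale_factor
```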
Recently, RGBD-based category-level 6D object pose estimation has achieved promising performance improvements; however, the requirement for depth information prohibits broader applications. To relieve this problem, this paper proposes a novel approach, named Object-Level Depth reconstruction Network (OLD-Net), that takes only RGB images as input for category-level 6D object pose estimation. We propose to directly predict object-level depth from a monocular RGB image by deforming a category-level shape prior into object-level depth and a canonical NOCS representation. Two modules, named Normalized Global Position Hints (NGPH) and Shape-aware Decoupled Depth Reconstruction (SDDR), are introduced to learn high-fidelity object-level depth and delicate shape representations. Finally, the 6D object pose is solved by aligning the predicted canonical representation with the back-projected object-level depth. Extensive experiments on the challenging CAMERA25 and REAL275 datasets show that our model, although simple, achieves state-of-the-art performance.
We introduce a simple yet effective algorithm that uses convolutional neural networks to estimate object poses directly from videos. Our approach leverages the temporal information of a video sequence and is computationally efficient and robust enough to support robotic and AR domains. Our proposed network takes a pre-trained 2D object detector as input and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art algorithms. Furthermore, running at 30 FPS, it is also more efficient than the state of the art and is therefore applicable to a variety of applications that require real-time object pose estimation.
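A bare-bones sketch of recurrent temporal aggregation as described above; the feature dimensions, the GRU choice, and the 7-dimensional pose output (quaternion plus translation) are placeholders, not the authors' network.

```python
import torch
import torch.nn as nn

class TemporalPoseAggregator(nn.Module):
    """Aggregate per-frame visual features with a recurrent unit and predict a
    pose at every frame."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 7)   # quaternion (4) + translation (3)

    def forward(self, frame_features):          # (B, T, feat_dim)
        h, _ = self.gru(frame_features)
        return self.pose_head(h)                # (B, T, 7), one pose per frame
```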
Estimating the 6D pose of objects is an essential computer vision task. However, most conventional approaches rely on camera data from a single perspective and therefore suffer from occlusions. We overcome this issue with a novel multi-view 6D pose estimation method called MV6D, which accurately predicts the 6D poses of all objects in a cluttered scene based on RGB-D images from multiple perspectives. We base our approach on the PVN3D network, which uses a single RGB-D image to predict keypoints of the target objects. We extend this approach by using a combined point cloud from multiple views and fusing the image features from each view with a DenseFusion layer. In contrast to current multi-view pose detection networks such as CosyPose, our MV6D learns the fusion of multiple perspectives in an end-to-end manner and requires neither multiple prediction stages nor subsequent fine-tuning of the predictions. Furthermore, we introduce three novel photorealistic datasets of cluttered scenes with heavy occlusions. All of them contain RGB-D images from multiple perspectives, together with annotations such as semantic segmentation and 6D poses. MV6D significantly outperforms the state of the art in multi-view 6D pose estimation even when the camera poses are known imprecisely. Furthermore, we show that our approach is robust towards dynamic camera setups and that its accuracy increases incrementally with an increasing number of perspectives.
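A simplified sketch of the multi-view aggregation step mentioned above (not the MV6D implementation): per-view point clouds, given in their own camera frames, are transformed with the known camera poses into a common world frame and concatenated into one combined cloud.

```python
import numpy as np

def merge_view_point_clouds(clouds, cam_poses):
    """clouds: list of (N_i, 3) point clouds in camera coordinates;
    cam_poses: list of 4x4 camera-to-world transforms.
    Returns a single (sum N_i, 3) cloud in the shared world frame."""
    merged = []
    for pts, T in zip(clouds, cam_poses):
        homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # homogeneous
        merged.append((homo @ T.T)[:, :3])
    return np.concatenate(merged, axis=0)
```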
Real-time robotic grasping, in support of subsequent precise in-hand manipulation tasks, is a priority target for highly advanced autonomous systems. However, an algorithm that performs sufficiently accurate grasping with time efficiency has yet to be found. This paper proposes a novel two-stage method that combines fast 2D object recognition using a deep neural network with a subsequent accurate and fast 6D pose estimation based on the point pair feature framework, forming a real-time 3D object recognition and grasping solution capable of handling multi-object-class scenes. The proposed solution has the potential to perform robustly in real-time applications that require both efficiency and accuracy. To validate our method, we conducted extensive and thorough experiments, including the laborious preparation of our own dataset. The experimental results show that the proposed method scores 97.37% accuracy under the 5cm5deg metric and 99.37% under the Average Distance metric. The results also show an overall relative improvement of 62% (5cm5deg metric) and 52.48% (Average Distance metric) achieved by the proposed method. Furthermore, pose estimation also shows an average improvement of 47.6% in running time. Finally, to illustrate the overall efficiency of the system in real-time operation, a pick-and-place robotic experiment was conducted and showed a convincing success rate of 90%. A video of this experiment is available at https://sites.google.com/view/dl-ppf6dpose/.
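For reference, the classic point pair feature underlying the PPF framework mentioned above is F(p1, p2) = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)) for an oriented point pair, where d = p2 - p1; the snippet below computes it (a standalone illustration, not the paper's code).

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Compute the 4D point pair feature for two oriented points (p, n)."""
    d = p2 - p1
    dist = np.linalg.norm(d)

    def angle(a, b):
        a = a / (np.linalg.norm(a) + 1e-12)
        b = b / (np.linalg.norm(b) + 1e-12)
        return np.arccos(np.clip(a @ b, -1.0, 1.0))

    return np.array([dist, angle(n1, d), angle(n2, d), angle(n1, n2)])
```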
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS)-a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
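A sketch of how NOCS predictions can be combined with the depth map to recover a metric pose and size: solving the Umeyama-style similarity transform (scale, rotation, translation) between predicted NOCS coordinates and the back-projected depth points of the same pixels. Outlier handling such as RANSAC is omitted for brevity; this illustrates the alignment step, not the released NOCS code.

```python
import numpy as np

def umeyama_similarity(nocs_pts, depth_pts):
    """nocs_pts:  (N, 3) predicted NOCS coordinates of object pixels.
    depth_pts: (N, 3) back-projected camera-frame points of the same pixels.
    Returns (s, R, t) such that depth_pts ≈ s * R @ nocs_pts + t."""
    mu_n, mu_d = nocs_pts.mean(0), depth_pts.mean(0)
    xn, xd = nocs_pts - mu_n, depth_pts - mu_d
    cov = xd.T @ xn / len(nocs_pts)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:        # guard against reflections
        D[2, 2] = -1
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xn.var(axis=0).sum()
    t = mu_d - s * R @ mu_n
    return s, R, t
```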
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage manner. Moreover, the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in pre-processing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, in which each head network shares the same feature map and outputs the hand center, pose, and texture, respectively. Besides, we adopt a weakly-supervised scheme to relieve the burden of expensive 3D real-world data annotation. To this end, we propose a series of losses optimized by a stage-wise training scheme, in which a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly-supervised model, we adopt several feature consistency constraints in both single-hand and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks, including FreiHAND, HO3D, InterHand2.6M, and RHD, demonstrate that our method outperforms state-of-the-art model-based methods in both weakly-supervised and fully-supervised settings. Code and models are available at https://github.com/zijinxuxu/smhr.
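An illustrative sketch of the feature-consistency idea described above (not the released loss): 2D keypoints estimated from local features should agree with the reprojection of the 3D keypoints predicted from global features, assuming known camera intrinsics.

```python
import torch

def keypoint_consistency_loss(local_kpts, global_kpts_3d, cam_K):
    """local_kpts:     (B, J, 2) pixel keypoints from the local branch.
    global_kpts_3d: (B, J, 3) camera-frame 3D keypoints from the global branch.
    cam_K:          (3, 3) camera intrinsics as a tensor."""
    proj = global_kpts_3d @ cam_K.T                       # pinhole projection
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)  # divide by depth
    return torch.nn.functional.l1_loss(local_kpts, proj)
```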
We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate the full 6 DoF pose. Unlike other deep learning methods that are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities (RGB or depth) to be used. Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views to obtain a pose consistent with the predicted correspondences in all views. Our proposed method is rigorously evaluated on different data modalities and types of training data in controlled setups. The main conclusions are that RGB excels in correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We conduct an extensive evaluation and ablation studies to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the used data modality and the type of training data.
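A high-level sketch of what a differentiable-rendering refinement over multiple views can look like; the structure only is shown here, and `render_corrs` is a hypothetical placeholder for a real differentiable renderer, not part of DPODv2's code: the pose is optimized so that rendered correspondences agree with the predicted correspondences in all views.

```python
import torch

def refine_pose_multiview(pose_init, predicted_corrs, render_corrs, n_steps=100, lr=1e-2):
    """pose_init:       (6,) tensor, e.g. axis-angle + translation.
    predicted_corrs: list of per-view dense correspondence maps (tensors).
    render_corrs:    callable(pose, view_idx) -> rendered correspondence map."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # Sum the discrepancy between rendered and predicted maps over all views.
        loss = sum(((render_corrs(pose, v) - c) ** 2).mean()
                   for v, c in enumerate(predicted_corrs))
        loss.backward()
        opt.step()
    return pose.detach()
```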
Precise 6D pose estimation of rigid objects from RGB images is a critical task in robotics and augmented reality. To address this problem, we propose DeepRM, a novel recurrent network architecture for 6D pose refinement. DeepRM leverages an initial coarse pose estimate to render a synthetic image of the target object. The rendered image is then matched with the observed image to predict a rigid transform that updates the previous pose estimate. This process is repeated to incrementally refine the estimate at each iteration. LSTM units are used to propagate information through each refinement step, significantly improving overall performance. In contrast to many 2-stage Perspective-n-Point based solutions, DeepRM is trained end-to-end and uses a scalable backbone that can be tuned via a single parameter to trade off accuracy and efficiency. During training, a multi-scale optical flow head is added to predict the optical flow between the observed and synthetic images. Optical flow prediction stabilizes the training process and enforces the learning of features that are relevant to the pose estimation task. Our results demonstrate that DeepRM achieves state-of-the-art performance on two widely accepted, challenging datasets.