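Predicting the pose of an object is a core computer vision task. Deep learning-based pose estimation methods require CAD data in order to use 3D intermediate representations or to project 2D appearance, so they cannot be used when CAD data for the objects of interest are unavailable. In addition, existing methods do not precisely reflect perspective distortion in the learning process, and information loss due to self-occlusion has not been studied well. We therefore propose a new pose estimation system built around a space carving module that reconstructs a reference 3D feature to replace the CAD data. Our new transformation module, the Dynamic Projective Spatial Transformer (DProST), transforms the reference 3D feature to reflect the pose while accounting for perspective distortion. We also overcome the self-occlusion problem with a new Bidirectional Z-buffering (Bi Z-buffer) method, which extracts both the front view and the self-occluded back view of the object. Finally, we propose a Perspective Grid Distance Loss (PGDL), which enables stable learning of the pose estimator without CAD data. Experimental results show that our method outperforms the state-of-the-art method on the LineMod dataset and achieves comparable performance on the LineMod-Occlusion dataset, even when compared with methods that require CAD data during network training.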
translated by Google Translate
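The bidirectional Z-buffering idea in the DProST abstract above (keeping both the front view and the self-occluded back view) can be illustrated with a minimal sketch. This is not the authors' implementation: it works on raw point depths rather than 3D features, and the function name `bidirectional_z_buffer`, the pinhole parameters `fx, fy, cx, cy`, and the nearest-pixel rounding are all assumptions made for illustration.

```python
def bidirectional_z_buffer(points, fx, fy, cx, cy, width, height):
    """Keep the nearest (front) and farthest (back) depth seen at each pixel.

    points: iterable of (x, y, z) in the camera frame (hypothetical input).
    Returns two height x width grids; pixels hit by no point stay None.
    """
    front = [[None] * width for _ in range(height)]
    back = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the camera
        # Pinhole projection to the nearest pixel (an assumed simplification).
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue
        # A standard Z-buffer keeps only the nearest depth ...
        if front[v][u] is None or z < front[v][u]:
            front[v][u] = z
        # ... the second buffer additionally keeps the farthest depth.
        if back[v][u] is None or z > back[v][u]:
            back[v][u] = z
    return front, back
```

For a closed object, the `front` grid corresponds to the visible surface and the `back` grid to the surface hidden behind it along the same viewing ray.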
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
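We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. The move from pixels to 3D points, which is inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with a pixel-aligned image feature, we train a fully connected neural network to predict (i) the corresponding 3D object coordinates and (ii) the signed distance to the object surface, with the former defined only for query points near the surface. We call the mapping realized by this network a Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is shown to be superior in challenging cases with occlusion. The project website is at: linhuang17.github.io/ncf.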
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [11] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster, running at 50 fps on a Titan X (Pascal) GPU, and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [28,29] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent methods [… 26] when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.
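The 2D targets described in the abstract above, the projected vertices of the object's 3D bounding box, can be sketched with a plain pinhole projection. This is only an illustration of the forward model that a PnP solver later inverts, not the paper's network or training code. The helper names `bbox_corners` and `project`, the box parameterization by half extents, and the intrinsics `fx, fy, cx, cy` are assumptions.

```python
from itertools import product


def bbox_corners(half_extents):
    """Eight corners of an axis-aligned box centred at the object origin."""
    hx, hy, hz = half_extents
    return [(sx * hx, sy * hy, sz * hz) for sx, sy, sz in product((-1, 1), repeat=3)]


def project(points, R, t, fx, fy, cx, cy):
    """Project object-frame points into the image for pose (R, t).

    R is a 3x3 rotation given as nested lists, t a 3-vector; both map
    object coordinates to camera coordinates (an assumed convention).
    """
    pixels = []
    for p in points:
        xc = sum(R[0][i] * p[i] for i in range(3)) + t[0]
        yc = sum(R[1][i] * p[i] for i in range(3)) + t[1]
        zc = sum(R[2][i] * p[i] for i in range(3)) + t[2]
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels


# Usage: a 1 m cube placed 1.5 m in front of the camera, no rotation.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
corners_2d = project(bbox_corners((0.5, 0.5, 0.5)), identity, (0.0, 0.0, 1.5),
                     fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Given the eight 3D corners and their predicted 2D locations, any standard PnP solver can recover `R` and `t`; the sketch only produces the correspondences.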
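Advances in deep learning-based recognition have led to accurate object detection from 2D images. However, these 2D perception methods are insufficient for capturing complete 3D world information. Meanwhile, advanced 3D shape estimation approaches focus on the shape itself without considering metric scale, so they cannot determine the accurate location and orientation of objects. To address this problem, we propose a framework that jointly estimates metric-scale shape and pose from a single RGB image. Our framework has two branches: the Metric Scale Object Shape (MSOS) branch and the Normalized Object Coordinate Space (NOCS) branch. The MSOS branch estimates the metric-scale shape observed in camera coordinates. The NOCS branch predicts the normalized object coordinate space (NOCS) map and performs a similarity transformation with the depth map rendered from the predicted metric-scale mesh to obtain the 6D pose and size. In addition, we introduce Normalized Object Center Estimation (NOCE) to estimate the geometrically aligned distance from the camera to the object center. We validate our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.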
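This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither a 3D CAD model of the object nor any prior knowledge of its symmetries. Pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets (T-LESS, YCB-V, and ITODD) and achieves state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/sc6d-pose.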
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
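The quantity minimized by the object-level bundle adjustment mentioned above, the reprojection error over all views, can be sketched as follows. This is a simplified illustration for one object and a shared pinhole camera model; it ignores object symmetries and robust weighting, which the abstract says the actual method handles. The function names, the `(R, t)` pose convention (object to world, then world to camera), and the intrinsics tuple `K = (fx, fy, cx, cy)` are assumptions.

```python
def transform(R, t, p):
    """Apply the rigid transform (R, t) to the 3D point p."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def reprojection_error(points, object_pose, camera_poses, observations, K):
    """Sum of squared pixel errors of one object's points over all views.

    points:       object-frame 3D points
    object_pose:  (R, t) mapping object frame to world frame
    camera_poses: list of (R, t) mapping world frame to each camera frame
    observations: per view, the observed 2D location of every point
    """
    fx, fy, cx, cy = K
    R_o, t_o = object_pose
    total = 0.0
    for (R_c, t_c), observed in zip(camera_poses, observations):
        for p, (u_obs, v_obs) in zip(points, observed):
            x, y, z = transform(R_c, t_c, transform(R_o, t_o, p))
            u = fx * x / z + cx
            v = fy * y / z + cy
            total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total
```

A refinement step would adjust `object_pose` and `camera_poses` to lower this value, for example with a nonlinear least-squares solver; the sketch only evaluates the objective.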
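6D object pose estimation is one of the fundamental problems in computer vision and robotics research. Although much recent effort has gone into generalizing pose estimation to novel object instances within the same category (category-level 6D pose estimation), it remains restricted to constrained environments because of the limited amount of annotated data. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We use this data to generalize category-level 6D object pose estimation to in-the-wild settings with semi-supervised learning. We propose a new model, the Rendering for Pose estimation network (RePoNet), which is jointly trained using free ground truth from synthetic data and a silhouette matching objective on real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods by a large margin on the previous dataset and on our Wild6D test set (which is manually annotated for evaluation). Project page with Wild6D data: https://oasisyang.github.io/semi-pose.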
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
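The translation recovery described above (object center in the image plus distance from the camera) follows from the pinhole camera model. The sketch below assumes that the predicted distance is the depth along the optical axis and that the intrinsics are a focal length `fx, fy` and principal point `px, py`; the function name `backproject_center` is hypothetical.

```python
def backproject_center(center_px, depth, fx, fy, px, py):
    """Recover the 3D translation from a 2D object center and its depth.

    Inverts the pinhole projection u = fx * X / Z + px, v = fy * Y / Z + py
    for the object center, given Z = depth.
    """
    u, v = center_px
    tx = (u - px) * depth / fx
    ty = (v - py) * depth / fy
    return (tx, ty, depth)


# Usage: a center 100 px right of the principal point at 2 m depth.
translation = backproject_center((420.0, 240.0), 2.0,
                                 fx=500.0, fy=500.0, px=320.0, py=240.0)
```

Splitting translation this way means the network only has to predict quantities that are directly observable in the image (a 2D location and a scalar depth).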
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
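The voting scheme described above can be sketched in a few lines: pairs of pixels are sampled, their predicted direction lines are intersected to form a keypoint hypothesis, and each hypothesis is scored by how many pixels point toward it. This is a minimal RANSAC-style illustration, not the released PVNet code; the function names, the cosine threshold, the number of hypotheses, and the fixed seed are assumptions.

```python
import math
import random


def intersect(p1, d1, p2, d2):
    """Intersect the lines p1 + s*d1 and p2 + t*d2; None if (nearly) parallel."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-9:
        return None
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    s = (bx * d2[1] - by * d2[0]) / cross
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])


def vote_keypoint(pixels, directions, n_hypotheses=50, cos_threshold=0.99, seed=0):
    """Return the hypothesis supported by the most pixels, and its vote count."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], directions[i], pixels[j], directions[j])
        if h is None:
            continue
        score = 0
        for p, d in zip(pixels, directions):
            vx, vy = h[0] - p[0], h[1] - p[1]
            norm = math.hypot(vx, vy)
            # A pixel votes for h if its unit vector points toward h.
            if norm > 0 and (vx * d[0] + vy * d[1]) / norm >= cos_threshold:
                score += 1
        if score > best_score:
            best, best_score = h, score
    return best, best_score
```

Because every visible pixel of the object casts a vote, a keypoint can still be localized when it is itself occluded or outside the image, which is the property the abstract emphasizes.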
A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches on two datasets, YCB-Video and LineMOD. We also deploy our proposed method on a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
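While category-level 9DoF object pose estimation has emerged recently, previous correspondence-based and direct regression methods are both limited in accuracy because of large intra-category variations in object shape, color, and so on. This work presents CATRE, a category-level object pose and size refiner that iteratively enhances pose estimates from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts the relative transformation between the initial pose and the ground truth by aligning the partially observed point cloud with an abstract shape prior. Specifically, we propose a novel disentangled architecture that is aware of the inherent distinctions between rotation estimation and translation/size estimation. Extensive experiments show that our approach outperforms state-of-the-art methods on the REAL275, CAMERA25, and LM benchmarks at speeds of up to ~85.32 Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen categories. Code and trained models are available.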
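We present a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods, which are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities to be used (RGB or depth). Moreover, we propose a novel pose refinement method based on differentiable rendering. The main idea is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusion is that RGB excels at correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We perform an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.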
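In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce computational redundancy in preprocessing and feature extraction, we propose a concise but efficient single-stage pipeline. Specifically, we design a multi-head auto-encoder structure for multi-hand reconstruction, in which the head networks share the same feature map and output the hand center, pose, and texture, respectively. In addition, we adopt a weakly supervised scheme to alleviate the burden of expensive 3D real-world data annotation. To this end, we propose a series of losses optimized by a stage-wise training scheme, in which a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single-hand and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M, and RHD demonstrate that our method outperforms state-of-the-art model-based methods in both weakly supervised and fully supervised settings. The code and models are available at https://github.com/zijinxuxu/smhr.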
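This paper introduces a novel multi-view 6 DoF object pose refinement approach that focuses on improving methods trained on synthetic data. It is based on the DPOD detector, which produces dense 2D-3D correspondences in each frame. We opt to use multiple frames with known relative camera transformations, because this allows geometric constraints to be introduced via an interpretable ICP-like loss function. The loss function is implemented with a differentiable renderer and is optimized iteratively. We also demonstrate that a full detection and refinement pipeline trained solely on synthetic data can be used for auto-labeling real data. We perform a quantitative evaluation on the LineMOD, Occlusion, Homebrewed, and YCB-V datasets and report excellent performance compared with state-of-the-art methods trained on synthetic and real data. We demonstrate empirically that our approach requires only a few frames and is robust to close camera locations and to noise in the extrinsic camera calibration, which makes its practical use easier and more widespread.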
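Estimating the relative pose of a new object without prior knowledge is a hard problem, yet it is a highly desirable capability in robotics and augmented reality. We present a method for tracking the 6D motion of objects when neither training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore handle unknown objects in an open world instantly, without requiring any prior information or a specific training phase. We consider two architectures, one based on two frames and the other relying on a Transformer encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data). Our source code is available at https://github.com/nv-nguyen/pizza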
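We present a new framework for reconstructing holistic 3D indoor scenes, including both the room background and indoor objects, from single-view images. Because of the heavy occlusion in indoor scenes, existing methods can only produce 3D shapes of indoor objects with limited geometric quality. To solve this, we propose an instance-aligned implicit function (InstPIFu) for detailed object reconstruction. Combined with an instance-aligned attention module, our method is able to decouple mixed local features toward the occluded instances. Additionally, unlike previous methods that represent the room background only as a 3D bounding box, a depth map, or a set of planes, we recover the fine geometry of the background via an implicit representation. Extensive experiments on the SUN RGB-D, Pix3D, 3D-FUTURE, and 3D-FRONT datasets demonstrate that our method outperforms existing approaches in both background and foreground object reconstruction. Our code and model will be made publicly available.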
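State-of-the-art object pose estimation handles multiple instances in a test image by using multi-model formulations: detection as a first stage, followed by separately trained networks per object for 2D-3D geometric correspondence prediction as a second stage. Poses are subsequently estimated at runtime using the Perspective-n-Points algorithm. Unfortunately, multi-model formulations are slow and do not scale well with the number of object instances involved. Recent approaches show that direct 6D object pose estimation is feasible when derived from the aforementioned geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress the 6D poses of all instances in a test image. The inherent end-to-end trainability removes the need to process individual object instances separately. Pose hypotheses are clustered into distinct instances by computing their mutual Intersection-over-Unions, which yields negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that the pose estimation performance is superior to single-model state-of-the-art approaches despite being more than 35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images in which more than 90 object instances are present. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose.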
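Our method studies the complex task of object-centric 3D understanding from a single RGB-D observation. Because this is an ill-posed problem, existing methods perform poorly at both 3D shape estimation and 6D pose and size estimation in complex multi-object scenarios with occlusions. We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, and 6D object pose and size estimation. Key to ShAPO is a single-shot pipeline that regresses shape, appearance, and pose latent codes along with the masks of each object instance, which are then further refined in a sparse-to-dense fashion. A novel disentangled database of shape and appearance priors is first learned to embed objects in their respective shape and appearance spaces. We also propose a novel octree-based differentiable optimization step, which allows us to further improve object shape, pose, and appearance in an analysis-by-synthesis fashion. Our novel joint implicit textured object representation allows us to accurately identify and reconstruct novel unseen objects without access to their 3D meshes. Through extensive experiments, we show that our method, trained on simulated indoor scenes, accurately regresses the shape, appearance, and pose of novel objects in the real world with minimal fine-tuning. Our method significantly outperforms all baselines on the NOCS dataset, with an 8% absolute improvement in mAP for 6D pose estimation. Project page: https://zubair-irshad.github.io/projects/shapo.html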