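Estimating the 6D pose of rigid objects is a fundamental problem in computer vision. Traditionally, pose estimation is concerned with determining a single best estimate. However, a single estimate cannot express visual ambiguity, which in many cases is unavoidable due to object symmetries or occlusion of identifying features. Failing to account for pose ambiguity can cause subsequent methods to fail, which is unacceptable when the cost of failure is high. Estimating a full pose distribution, in contrast to a single estimate, is well suited to expressing pose uncertainty. Motivated by this, we propose a novel pose distribution estimation method. An implicit formulation of the probability distribution over object pose is derived from an intermediate representation of the object as a set of keypoints. This ensures that the pose distribution estimates are highly interpretable. Furthermore, our method is based on conservative approximations, which leads to reliable estimates. The method has been evaluated on the task of rotation distribution estimation on the YCB-V and T-LESS datasets and performs reliably on all objects.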
Translated by Google Translate
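Single-image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer from not fully modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly symmetric objects. We require no supervision on pose uncertainty: the model trains only with a single pose per example. Nonetheless, our implicit model is highly expressive and handles complex distributions over 3D poses, while still obtaining accurate pose estimates in standard non-ambiguous settings, achieving state-of-the-art performance on the Pascal3D+ and ModelNet10-SO(3) benchmarks.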
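The idea of scoring candidate poses and normalizing over a grid can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_score` is a hypothetical stand-in for the trained network, and the candidate grid is restricted to rotations about the z axis rather than covering all of SO(3).

```python
import numpy as np

def rotation_about_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def toy_score(rotation, true_angle=0.3):
    """Hypothetical stand-in for the network: an unnormalized log-score.

    It peaks at `true_angle` and at `true_angle + pi`, mimicking an object
    with a two-fold symmetry about the z axis.
    """
    angle = np.arctan2(rotation[1, 0], rotation[0, 0])
    return 20.0 * np.cos(2.0 * (angle - true_angle))

def pose_distribution(candidates, score_fn):
    """Softmax over candidate scores: a discrete distribution over the grid."""
    logits = np.array([score_fn(r) for r in candidates])
    logits -= logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Evaluate the score on a regular 1-degree grid of candidate rotations.
angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)
grid = [rotation_about_z(a) for a in angles]
probs = pose_distribution(grid, toy_score)

# The two most probable grid angles sit near the two symmetric modes.
top_two = np.sort(angles[np.argsort(probs)[-2:]])
```

Because the distribution is evaluated at every candidate, both symmetric modes stay visible instead of being collapsed into a single estimate.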
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be available at https://zju-3dv.github.io/pvnet/.
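A minimal sketch of RANSAC-style voting from pixel-wise unit vectors, written under stated assumptions rather than taken from the PVNet code: the function name `vote_keypoint`, the number of hypotheses, and the cosine threshold are all illustrative choices. Each hypothesis is the intersection of two pixel rays, and a pixel votes for it when its predicted direction agrees with the direction toward the hypothesis.

```python
import numpy as np

def vote_keypoint(pixels, directions, n_hypotheses=128, inlier_cos=0.99, seed=0):
    """Estimate a 2D keypoint from per-pixel unit vectors pointing toward it.

    pixels:     (N, 2) pixel coordinates.
    directions: (N, 2) unit vectors, one per pixel.
    Returns the hypothesis with the most votes and its vote count.
    """
    rng = np.random.default_rng(seed)
    best_point, best_votes = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the rays p_i + s * d_i and p_j + t * d_j.
        A = np.column_stack([directions[i], -directions[j]])
        if abs(np.linalg.det(A)) < 1e-8:
            continue  # (nearly) parallel rays give no usable intersection
        s, _t = np.linalg.solve(A, pixels[j] - pixels[i])
        hypothesis = pixels[i] + s * directions[i]
        # A pixel votes if its direction agrees with the direction to the hypothesis.
        offsets = hypothesis - pixels
        norms = np.linalg.norm(offsets, axis=1)
        valid = norms > 1e-8
        cosines = np.zeros(len(pixels))
        cosines[valid] = np.sum(offsets[valid] * directions[valid], axis=1) / norms[valid]
        votes = int(np.sum(cosines >= inlier_cos))
        if votes > best_votes:
            best_point, best_votes = hypothesis, votes
    return best_point, best_votes
```

Because only the direction toward the keypoint matters, the hypothesis may lie outside the visible pixels, which is what makes this kind of representation usable for occluded or truncated keypoints.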
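We present a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for both symmetric and asymmetric objects. To the best of our knowledge, our system is among the first to use the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects, ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict a Gaussian covariance for the keypoints that captures the true error of the prediction. The covariance therefore serves not only as a weight for the residuals in the system's optimization problems, but also as a means to detect harmful statistical outliers without choosing a manual threshold. Experiments show that our method provides performance competitive with the state of the art in 6DoF object pose estimation, at real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.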
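One common way to use a predicted 2D covariance for both weighting and outlier rejection is a Mahalanobis distance with a chi-square gate. The sketch below assumes this standard construction; the function names and the 95% confidence level are illustrative and are not taken from the paper's code.

```python
import numpy as np

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
CHI2_95_2DOF = 5.991

def mahalanobis_sq(residual, covariance):
    """Squared Mahalanobis distance r^T C^{-1} r of a 2D residual."""
    return float(residual @ np.linalg.solve(covariance, residual))

def is_outlier(predicted_uv, reprojected_uv, covariance, threshold=CHI2_95_2DOF):
    """Flag a keypoint whose covariance-weighted residual fails the chi-square gate."""
    residual = np.asarray(predicted_uv, float) - np.asarray(reprojected_uv, float)
    return mahalanobis_sq(residual, covariance) > threshold
```

The gate comes from a confidence level rather than a hand-tuned pixel distance, so the same residual can be accepted for an uncertain keypoint and rejected for a confident one.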
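We present an approach to learn dense, continuous 2D-3D correspondence distributions over the surface of objects from data, with no prior knowledge of visual ambiguities such as symmetry. We also present a new method for 6D pose estimation of rigid objects that uses the learned distributions to sample, score, and refine pose hypotheses. The correspondence distributions are learned with a contrastive loss and represented in object-specific latent spaces by an encoder-decoder query model and a small fully connected key model. Our method requires no supervision with respect to visual ambiguities, yet we show that the query and key models learn to represent accurate multi-modal surface distributions. Our pose estimation method significantly improves the state of the art on the comprehensive BOP Challenge when trained purely on synthetic data, even compared with methods trained on real data. The project site is at https://surfemb.github.io/.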
We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark [2] both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors [4] and sub-category detection [23][24]. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset [26].
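A hybrid discrete-continuous orientation output can be decoded roughly as follows. This is a sketch under assumptions that the abstract does not spell out: the angle range is split into equally spaced bins, the network emits one confidence per bin, and each bin carries a residual angle encoded as a (cos, sin) pair. The function name `decode_orientation` is hypothetical.

```python
import numpy as np

def decode_orientation(bin_scores, residual_cos_sin):
    """Combine a discrete bin choice with a continuous in-bin residual.

    bin_scores:       (B,) confidence per orientation bin.
    residual_cos_sin: (B, 2) per-bin residual angle encoded as (cos, sin).
    Returns an angle in radians wrapped to [-pi, pi).
    """
    bin_scores = np.asarray(bin_scores, float)
    residual_cos_sin = np.asarray(residual_cos_sin, float)
    n_bins = len(bin_scores)
    centers = 2.0 * np.pi * np.arange(n_bins) / n_bins  # assumed bin layout
    k = int(np.argmax(bin_scores))                      # discrete part
    cos_r, sin_r = residual_cos_sin[k]
    angle = centers[k] + np.arctan2(sin_r, cos_r)       # continuous part
    return (angle + np.pi) % (2.0 * np.pi) - np.pi
```

Splitting the prediction this way lets the classification handle the coarse, multi-modal choice while the regression only has to cover a small residual.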
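This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither a 3D CAD model of the object nor any prior knowledge of its symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets (T-LESS, YCB-V, and ITODD) and achieves state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/sc6d-pose.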
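Once a 2D object center and a distance along the z-axis are available, the 3D translation follows from pinhole back-projection. The sketch below assumes a standard intrinsic matrix `K` and a metric depth that has already been recovered from the scale-invariant prediction; the function name is illustrative and not part of the SC6D code.

```python
import numpy as np

def backproject_center(center_uv, depth_z, K):
    """Recover the 3D translation from a 2D object center and a depth along z.

    center_uv: (u, v) pixel location of the projected object center.
    depth_z:   distance along the camera z-axis (assumed metric).
    K:         3x3 pinhole intrinsic matrix.
    """
    u, v = center_uv
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))  # K^{-1} [u, v, 1]^T
    return depth_z * ray
```

For example, with focal length 500 and principal point (320, 240), a center at (420, 240) and a depth of 2 gives a translation of (0.4, 0, 2).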
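Current RGB-based 6D object pose estimation methods have achieved noticeable performance on datasets and in real-world applications. However, predicting the 6D pose from single 2D image features is susceptible to disturbance from changes in the environment and from textureless or similar-looking object surfaces. Hence, RGB-based methods generally achieve less competitive results than RGBD-based methods, which use both image features and 3D structure features. To narrow this performance gap, this paper proposes a framework for 6D object pose estimation that learns implicit 3D information from two RGB images. By combining the learned 3D information with 2D image features, we establish more stable correspondences between the scene and the object models. To find the best way of using 3D information from RGB inputs, we investigate three different approaches: early fusion, mid fusion, and late fusion. We find that mid fusion is the best approach for recovering the most precise 3D keypoints, which are useful for object pose estimation. The experiments show that our method outperforms state-of-the-art RGB-based methods and achieves results comparable to those of RGBD-based methods.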
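In many robotics applications, the environment in which the 6-DoF pose of a known, rigid object is estimated and the object is subsequently grasped remains nearly unchanged and may even be known to the robot in advance. In this paper, we refer to this problem as instance-specific pose estimation: the robot is expected to estimate the pose with a high degree of accuracy in only a limited set of familiar scenarios. Minor changes in the scene, including variations in lighting conditions and background appearance, are acceptable, but no major alterations are anticipated. To this end, we present a method to rapidly train and deploy a pipeline for estimating the continuous 6-DoF pose of an object from a single RGB image. The key idea is to leverage known camera poses and rigid-body geometry to partially automate the generation of a large labeled dataset. The dataset, together with sufficient domain randomization, is then used to supervise the training of deep neural networks that predict semantic keypoints. Experimentally, we demonstrate the convenience and effectiveness of the proposed method for accurately estimating object pose while requiring only a small amount of manual annotation for training.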
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
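Dense object tracking, the ability to localize specific object points with pixel-level accuracy, is an important computer vision task with numerous downstream applications in robotics. Existing approaches either compute dense keypoint embeddings in a single forward pass, meaning the model is trained to track everything at once, or allocate their full capacity to a sparse, predefined set of points, trading generality for accuracy. In this paper, we explore a middle ground based on the observation that the number of relevant points at a given time is typically relatively small, e.g. grasp points on a target object. Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on an embedding of the keypoint to be tracked. Our central finding is that this approach provides the generality of dense-embedding models while offering accuracy much closer to that of sparse-keypoint approaches. We present results illustrating this capacity vs. accuracy trade-off and demonstrate the ability to transfer to new object instances (within class) using a real-robot picking task.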
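Most real-time human pose estimation approaches are based on detecting joint positions. Using the detected joint positions, the yaw and pitch of the limbs can be computed. However, the roll along the limb, which is critical for applications such as sports analysis and computer animation, cannot be computed because this axis of rotation remains unobserved. In this paper, we introduce orientation keypoints, a novel approach for estimating the full position and rotation of skeletal joints using only single-frame RGB images. Inspired by how motion-capture systems use a set of point markers to estimate full bone rotations, our method uses virtual markers to generate sufficient information to accurately infer rotations with simple post-processing. The rotation predictions improve upon the best reported mean error for joint angles by 48% and achieve 93% accuracy across 15 bone rotations. The method also improves the current state-of-the-art results for joint positions by 14%, as measured by MPJPE on the principal dataset, and generalizes to in-the-wild datasets.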
Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the most likely 3D pose for a given input image. However, a single image can be highly ambiguous and induces multiple plausible solutions for the 2D-3D lifting step which results in overly confident 3D pose predictors. To this end, we propose DiffPose, a conditional diffusion model, that predicts multiple hypotheses for a given input image. In comparison to similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle a problem of the common two-step approach that first estimates a distribution of 2D joint locations via joint-wise heatmaps and consecutively approximates them based on first- or second-moment statistics. Since such a simplification of the heatmaps removes valid information about possibly correct, though labeled unlikely, joint locations, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples we introduce our embedding transformer that conditions the diffusion model. Experimentally, we show that DiffPose slightly improves upon the state of the art for multi-hypothesis pose estimation for simple poses and outperforms it by a large margin for highly ambiguous poses.
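Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, so that 2D-3D point correspondences can be partly learned by backpropagating the gradient w.r.t. the object pose. Yet, learning the entire set of unrestricted 2D-3D points from scratch fails to converge with existing approaches, since the deterministic pose is inherently non-differentiable. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose on the SE(3) manifold, essentially bringing categorical Softmax to the continuous domain. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle unifies the existing approaches and resembles the attention mechanism. EPro-PnP significantly outperforms competitive baselines, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation and nuScenes 3D object detection benchmarks.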
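The "Softmax in the continuous domain" idea can be written out as a short derivation. This is a sketch under assumptions: the symbols below (pose $y$, correspondences $X$, weighted reprojection residuals $f_i(y)$, ground-truth pose $y_{\mathrm{gt}}$) are notation introduced here for illustration, and the target distribution is assumed to be a Dirac delta at the ground-truth pose.

```latex
% Likelihood from the summed squared weighted reprojection residuals:
p(X \mid y) = \exp\Big(-\tfrac{1}{2}\sum_{i=1}^{N} \lVert f_i(y)\rVert^{2}\Big)

% Normalizing over all poses (flat prior) gives the pose distribution,
% the continuous counterpart of a categorical Softmax:
p(y \mid X) = \frac{\exp\big(-\tfrac{1}{2}\sum_{i} \lVert f_i(y)\rVert^{2}\big)}
                   {\int \exp\big(-\tfrac{1}{2}\sum_{i} \lVert f_i(y')\rVert^{2}\big)\,\mathrm{d}y'}

% With a Dirac target at y_gt, the KL divergence reduces (up to a constant)
% to the negative log-likelihood of the ground-truth pose:
L_{\mathrm{KL}} = \tfrac{1}{2}\sum_{i} \lVert f_i(y_{\mathrm{gt}})\rVert^{2}
                + \log \int \exp\Big(-\tfrac{1}{2}\sum_{i} \lVert f_i(y)\rVert^{2}\Big)\,\mathrm{d}y
```

The first term pulls the residuals at the ground-truth pose toward zero, while the log-integral term penalizes probability mass placed on other poses, mirroring the two parts of a cross-entropy loss over a Softmax.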
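Great success has been achieved in 6-DoF grasp learning from point cloud input, yet the computational cost caused by the orderlessness of point sets remains a concern. As an alternative, this paper explores grasp generation from RGB-D input. The proposed solution, Keypoint-GraspNet, detects the projections of the gripper keypoints in image space and then recovers the SE(3) poses with a PnP algorithm. A synthetic dataset based on primitive shapes and grasp families is constructed to examine the idea. Metric-based evaluation shows that our method outperforms the baselines in terms of grasp proposal accuracy, diversity, and time cost. Finally, robot experiments show a high success rate, demonstrating the potential of the idea in real-world applications.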
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed Cosy-Pose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [11] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster, at 50 fps on a Titan X (Pascal) GPU, and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [28,29] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent approaches [26] when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.
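The 2D targets described above are the image projections of the 3D bounding-box vertices. The sketch below shows only that forward projection with a pinhole camera, which is the mapping a PnP solver then inverts to recover the pose; the function names and the box parameterization by its dimensions are assumptions made for illustration.

```python
import numpy as np
from itertools import product

def box_corners(dimensions):
    """8 corners of an axis-aligned box centered at the object origin, shape (8, 3)."""
    half = np.asarray(dimensions, float) / 2.0
    signs = np.array(list(product([-1.0, 1.0], repeat=3)))
    return signs * half

def project(points, R, t, K):
    """Project 3D object points into the image with a pinhole camera.

    points: (N, 3) points in the object frame.
    R, t:   object-to-camera rotation (3x3) and translation (3,).
    K:      3x3 intrinsic matrix.
    Returns (N, 2) pixel coordinates.
    """
    cam = points @ R.T + t   # object frame -> camera frame
    uvw = cam @ K.T          # apply the intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division
```

Given the predicted 2D locations and the known 3D corners, a PnP algorithm searches for the `R` and `t` that make this projection match the predictions.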
Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, assuming a multiple-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First of all, each person must be uniquely identified in the different views to separate the 2D information provided by the cameras. Secondly, the 3D pose estimation process from the multi-view 2D information of each person must be robust against noise and potential occlusions in the scenario. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scenario along with a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person. These two models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.
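We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods that are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities to be used (RGB or depth). Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusion is that RGB excels in correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the overall best performance. We perform an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.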