In this paper, we propose an iterative self-training framework for sim-to-real 6D object pose estimation to facilitate cost-effective robotic grasping. Given a bin-picking scenario, we establish a photorealistic simulator to synthesize abundant virtual data, and use this data to train an initial pose estimation network. The network then takes the role of a teacher model, which generates pose predictions for unlabeled real data. With these predictions, we further design a comprehensive adaptive selection scheme to distinguish reliable results, and leverage them as pseudo labels to update a student model for pose estimation on real data. To continuously improve the quality of the pseudo labels, we iterate the above steps by taking the trained student model as the new teacher and re-labeling the real data with this refined teacher model. We evaluate our method on a public benchmark and our newly released dataset, improving on prior methods by 11.49% and 22.62%, respectively. Our method is also able to improve the robotic bin-picking success rate by 19.54%, demonstrating the potential of iterative sim-to-real solutions for robotic applications.
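A minimal sketch of this teacher-student loop might look as follows; the `train`/`predict` interfaces are hypothetical stand-ins for the paper's networks, and the single confidence threshold stands in for its comprehensive adaptive selection scheme:

```python
import copy

def iterative_self_training(model, synthetic_data, real_images,
                            rounds=3, confidence_threshold=0.9):
    """Sketch of sim-to-real self-training: a teacher pseudo-labels unlabeled
    real images, and a student is retrained on the kept labels. `train` and
    `predict` are assumed interfaces, not the paper's actual API."""
    # Round 0: train the initial estimator purely on simulated data.
    teacher = train(model, synthetic_data)

    for _ in range(rounds):
        # Teacher predicts poses (with confidences) on the unlabeled real data.
        pseudo_labels = [
            (img, pose) for img in real_images
            for pose, conf in [teacher.predict(img)]
            if conf >= confidence_threshold  # keep only reliable predictions
        ]
        # Student learns from synthetic data plus the selected pseudo labels.
        student = train(copy.deepcopy(teacher), synthetic_data + pseudo_labels)
        # The refined student becomes the teacher for the next round.
        teacher = student
    return teacher
```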
Learning to estimate object pose often requires ground-truth (GT) labels, such as CAD models and absolute-scale object poses, which are expensive and laborious to obtain in the real world. To tackle this problem, we propose an unsupervised domain adaptation (UDA) method for category-level object pose estimation, called \textbf{UDA-COPE}. Inspired by recent multi-modal UDA techniques, the proposed method exploits a teacher-student self-supervised learning scheme to train a pose estimation network without using target-domain labels. We also introduce a bidirectional filtering method between the predicted normalized object coordinate space (NOCS) map and the observed point cloud, which not only makes our teacher network more robust to the target domain but also provides more reliable pseudo labels for training the student network. Extensive experimental results demonstrate the effectiveness of the proposed method both quantitatively and qualitatively. Notably, without leveraging target-domain GT labels, it achieves performance comparable to, and sometimes better than, existing methods that depend on GT labels.
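One plausible reading of such bidirectional filtering is a mutual nearest-neighbour distance test, sketched below; it assumes both point sets are already expressed in a common frame (e.g. after a coarse similarity alignment), and the percentile cut-off is illustrative rather than the paper's criterion:

```python
import numpy as np

def bidirectional_filter(nocs_pts, obs_pts, percentile=90):
    """Drop points in each set whose nearest neighbour in the other set is
    unusually far, keeping only mutually consistent correspondences.
    Assumes nocs_pts and obs_pts (both Nx3) live in a common frame."""
    d = np.linalg.norm(nocs_pts[:, None, :] - obs_pts[None, :, :], axis=-1)
    a2b = d.min(axis=1)   # each NOCS point -> closest observed point
    b2a = d.min(axis=0)   # each observed point -> closest NOCS point
    keep_a = a2b <= np.percentile(a2b, percentile)
    keep_b = b2a <= np.percentile(b2a, percentile)
    return nocs_pts[keep_a], obs_pts[keep_b]
```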
Commercial depth sensors usually generate noisy and missing depths, especially for specular and transparent objects, which poses critical problems for downstream depth-based or point-cloud-based tasks. To mitigate this issue, we propose a powerful RGBD fusion network, SwinDRNet, for depth restoration. We further propose a Domain Randomization-Enhanced Depth Simulation (DREDS) approach that simulates an active stereo depth system using physically based rendering, and generate a large-scale synthetic dataset containing 130K photorealistic RGB images along with simulated depths that carry realistic sensor noise. To evaluate depth restoration methods, we also curate a real-world dataset, STD, capturing 30 cluttered scenes composed of 50 objects with materials ranging from specular and transparent to diffuse. Experiments show that the proposed DREDS dataset bridges the sim-to-real domain gap: trained on it, our SwinDRNet generalizes seamlessly to other real depth datasets, e.g. ClearGrasp, and outperforms competing depth restoration methods at real-time speed. We further show that our depth restoration effectively boosts the performance of downstream tasks, including category-level pose estimation and grasping. Our data and code are available at https://github.com/pku-epic/dreds
In this paper, we introduce neural texture learning for 6D object pose estimation from synthetic data and a few unlabelled real images. Our major contribution is a novel learning scheme which removes the drawbacks of previous works, namely the strong dependency on co-modalities or additional refinement. These have been previously necessary to provide training signals for convergence. We formulate such a scheme as two sub-optimisation problems on texture learning and pose learning. We separately learn to predict realistic texture of objects from real image collections and learn pose estimation from pixel-perfect synthetic data. Combining these two capabilities then allows us to synthesise photorealistic novel views to supervise the pose estimator with accurate geometry. To alleviate pose noise and segmentation imperfection present during the texture learning phase, we propose a surfel-based adversarial training loss together with texture regularisation from synthetic data. We demonstrate that the proposed approach significantly outperforms the recent state-of-the-art methods without ground-truth pose annotations and demonstrates substantial generalisation improvements towards unseen scenes. Remarkably, our scheme improves the adopted pose estimators substantially even when initialised with much inferior performance.
6D object pose estimation is one of the fundamental problems in computer vision and robotics research. Although much recent effort has gone into generalizing pose estimation to novel object instances within the same category (i.e., category-level 6D pose estimation), it remains restricted to constrained environments given the limited amount of annotated data. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We leverage this data to generalize category-level 6D object pose estimation in the wild via semi-supervised learning. We propose a new model, the Rendering for Pose estimation Network (RePoNet), that is jointly trained using the free ground truths that come with the synthetic data and a silhouette-matching objective function on the real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods on the previous dataset and on our Wild6D test set (with manual annotations for evaluation) by a large margin. Project page with the Wild6D data: https://oasisyang.github.io/semi-pose.
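The silhouette-matching idea can be illustrated with a soft-IoU loss between a differentiably rendered silhouette and the observed instance mask; this is a generic formulation, not necessarily RePoNet's exact objective:

```python
import torch

def silhouette_loss(rendered_mask, observed_mask, eps=1e-6):
    """Soft-IoU loss between a differentiably rendered silhouette and the
    observed instance mask (both HxW tensors with values in [0, 1])."""
    inter = (rendered_mask * observed_mask).sum()
    union = (rendered_mask + observed_mask - rendered_mask * observed_mask).sum()
    return 1.0 - inter / (union + eps)  # 0 when silhouettes agree perfectly
```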
Semantic analyses of object point clouds are largely driven by the release of benchmark datasets, including synthetic ones whose instances are sampled from object CAD models. However, learning from synthetic data may not generalize to practical scenarios, where point clouds are typically incomplete, non-uniformly distributed, and noisy. Such a simulation-to-reality (sim2real) domain gap can be mitigated by learned domain adaptation algorithms; however, we argue that generating synthetic point clouds with more physically realistic rendering is a powerful alternative, since systematic non-uniform noise patterns can then be captured. To this end, we propose an integrated scheme consisting of a physically realistic synthesis of object point clouds, obtained by rendering stereo images via the projection of speckle patterns onto CAD models, and a novel quasi-balanced self-training that achieves a more balanced data distribution through sparsity-driven selection of pseudo-labeled samples for long-tailed classes. Experimental results verify the effectiveness of our method and of both of its modules for unsupervised domain adaptation on point cloud classification, achieving state-of-the-art performance. The source code and the SpeckleNet synthetic dataset are available at https://github.com/gorilla-lab-scut/qs3.
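A simplified stand-in for quasi-balanced selection is to keep a fixed fraction of the most confident predictions within each class, rather than applying one global cut-off that would starve tail classes; the paper's sparsity-driven criterion is more involved than this sketch:

```python
import numpy as np

def quasi_balanced_select(confidences, labels, num_classes, base_ratio=0.2):
    """Class-wise pseudo-label selection sketch: for each predicted class,
    keep the top `base_ratio` fraction of its most confident samples, so
    long-tailed classes still contribute pseudo labels."""
    keep = np.zeros(len(labels), dtype=bool)
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue
        k = max(1, int(base_ratio * idx.size))
        top = idx[np.argsort(confidences[idx])[-k:]]  # most confident in class c
        keep[top] = True
    return keep
```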
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
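The pixel-wise fusion idea can be sketched as follows: per-point geometric features and the colour features sampled at the same pixels are concatenated and mixed, and a pooled global feature is appended to every point. Dimensions and layer choices here are illustrative, not DenseFusion's exact architecture:

```python
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    """Sketch of dense per-pixel RGBD fusion (illustrative dimensions)."""
    def __init__(self, rgb_dim=32, geo_dim=32, out_dim=128):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv1d(rgb_dim + geo_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, out_dim, 1),
        )

    def forward(self, rgb_feat, geo_feat):
        # rgb_feat, geo_feat: (B, C, N) features for the same N pixels/points.
        fused = torch.cat([rgb_feat, geo_feat], dim=1)
        per_point = self.mix(fused)                      # dense per-point embedding
        global_feat = per_point.max(dim=2, keepdim=True).values
        # Append the pooled global feature to every per-point embedding.
        return torch.cat([per_point, global_feat.expand_as(per_point)], dim=1)
```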
In this work, we present a data generation pipeline that leverages the 3D suite Blender to produce synthetic RGBD image datasets with 6D pose annotations. The proposed pipeline can efficiently generate large amounts of photorealistic RGBD images for the objects of interest, and a collection of domain randomization techniques is introduced to bridge the gap between real and synthetic data. Furthermore, we develop a real-time two-stage 6D pose estimation approach for time-sensitive robotic applications by integrating the object detector YOLO-V4-tiny with the 6D pose estimation algorithm PVN3D. With the proposed data generation pipeline, our pose estimation approach can be trained from scratch using only synthetic data, without any pre-trained models. When evaluated on the LineMod dataset, the resulting network shows performance competitive with state-of-the-art methods. We also demonstrate the proposed approach in robotic experiments, grasping household objects from cluttered backgrounds under different lighting conditions.
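A hedged sketch of the two-stage inference flow, with `detector` and `pose_net` as assumed interfaces standing in for YOLO-V4-tiny and PVN3D:

```python
def estimate_poses(rgb, depth, detector, pose_net):
    """Two-stage pipeline sketch: a 2D detector proposes object regions,
    then the pose network runs only on the cropped RGBD patches."""
    results = []
    for (x0, y0, x1, y1), cls in detector(rgb):      # stage 1: 2D detection
        rgb_crop = rgb[y0:y1, x0:x1]
        depth_crop = depth[y0:y1, x0:x1]
        pose = pose_net(rgb_crop, depth_crop, cls)   # stage 2: 6D pose
        results.append((cls, pose))
    return results
```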
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
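The coarse stage can be sketched as hypothesis scoring: render the CAD model under each candidate pose and keep the candidate whose rendering the classifier judges most correctable by the refiner. Here `render` and `scorer` are assumed interfaces, not MegaPose's actual API:

```python
import torch

def coarse_pose_estimate(image_crop, pose_hypotheses, render, scorer):
    """Sketch of classification-based coarse pose estimation: pick the pose
    hypothesis whose synthetic rendering best matches the observed crop."""
    scores = []
    for pose in pose_hypotheses:
        rendering = render(pose)                      # synthetic view of the CAD model
        scores.append(scorer(image_crop, rendering))  # "refinable?" score (0-dim tensor)
    best = int(torch.stack(scores).argmax())
    return pose_hypotheses[best]
```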
For monocular depth estimation, acquiring ground truth for real data is not easy, so domain adaptation methods with supervised synthetic data are commonly adopted. However, this can still leave a large domain gap due to the lack of supervision on real data. In this paper, we develop a domain adaptation framework that generates reliable pseudo ground truth from real data to provide direct supervision. Specifically, we propose two mechanisms for pseudo-labeling: 1) 2D-based pseudo labels obtained by measuring the consistency of depth predictions across images that share the same content but differ in style; 2) 3D-aware pseudo labels obtained via a point cloud completion network that learns to complete depth values in 3D space, providing more structural information about a scene to refine and generate more reliable pseudo labels. In experiments, we show that our pseudo-labeling methods improve depth estimation in various settings, including the use of stereo pairs during training. Furthermore, the proposed method performs favorably against several state-of-the-art unsupervised domain adaptation approaches on real-world datasets.
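The first mechanism can be illustrated as follows: predict depth on several stylized copies of the same image and mask out pixels whose predictions disagree. The relative-deviation threshold `tau` and the interfaces are illustrative assumptions:

```python
import torch

def consistency_pseudo_label(depth_net, image, stylized_images, tau=0.05):
    """Sketch of consistency-based 2D pseudo-labeling: depth predictions on
    stylized copies of one image should agree; inconsistent pixels are
    excluded from the pseudo ground truth."""
    with torch.no_grad():
        preds = torch.stack([depth_net(s) for s in [image, *stylized_images]])
    mean = preds.mean(dim=0)
    std = preds.std(dim=0)
    valid = std / (mean + 1e-6) < tau   # keep pixels with consistent predictions
    return mean, valid                  # pseudo depth and its validity mask
```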
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
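The render-and-match loop can be sketched as below, with `render` and `match_net` as assumed interfaces and poses represented as 4x4 homogeneous matrices (the rotation/translation disentanglement lives inside the network):

```python
def refine_pose(pose, observed_image, render, match_net, iterations=4):
    """Sketch of iterative render-and-match refinement: at each step the
    object is rendered at the current estimate and the network predicts a
    relative SE(3) update that is composed with the estimate."""
    for _ in range(iterations):
        rendered = render(pose)
        delta = match_net(observed_image, rendered)  # relative 4x4 transform
        pose = delta @ pose                          # compose update with estimate
    return pose
```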
Semantic segmentation plays a fundamental role in a broad variety of computer vision applications, providing key information for the global understanding of an image. However, state-of-the-art models rely on large amounts of annotated samples, which are more expensive to obtain than in tasks such as image classification. Since unlabeled data is comparatively cheaper to acquire, it is unsurprising that unsupervised domain adaptation has achieved broad success within the semantic segmentation community. This survey is an effort to summarize five years of this incredibly rapidly growing field, which embraces the importance of semantic segmentation itself and the critical need for adapting segmentation models to new environments. We present the most important semantic segmentation methods; we provide a comprehensive survey of domain adaptation techniques for semantic segmentation; we unveil newer trends such as multi-domain learning, domain generalization, test-time adaptation, and source-free domain adaptation; and we conclude by describing the datasets and benchmarks most widely used in semantic segmentation research. We hope this survey will provide researchers in both academia and industry with a comprehensive reference guide and help them foster new research directions in the field.
Visual perception tasks often require vast amounts of labeled data, including 3D poses and image-space segmentation masks. The process of creating such training datasets can be difficult or time-consuming to scale to general use. Consider the task of pose estimation for rigid objects: deep neural network based approaches have shown good performance when trained on large public datasets, but adapting these networks to novel objects, or fine-tuning existing models for different environments, requires a significant time investment to generate newly labeled instances. To this end, we propose ProgressLabeller, a method for more efficiently generating large amounts of 6D pose training data from color image sequences in a scalable manner. ProgressLabeller is also designed to support transparent and translucent objects, on which previous methods based on dense depth reconstruction fail. We demonstrate the effectiveness of ProgressLabeller by quickly creating a dataset of over 1M samples, with which we fine-tune a state-of-the-art pose estimation network to significantly improve downstream robotic grasping. ProgressLabeller is open source at https://github.com/huijiezh/progresslabeller.
The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to "instance-level" 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce Normalized Object Coordinate Space (NOCS)-a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
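Given pixel-wise NOCS predictions and the corresponding depth-backprojected points, the metric pose and scale can be recovered with the standard Umeyama similarity alignment, for example:

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src -> dst; src are NOCS coordinates, dst the matching metric
    points backprojected from the depth map (both Nx3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                        # resolve the reflection ambiguity
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t                          # metric size follows from s
```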
Supervised multi-view stereo (MVS) methods have achieved remarkable progress in reconstruction quality, but suffer from the challenge of collecting large-scale ground-truth depth. In this paper, we propose a novel self-supervised training pipeline for MVS based on knowledge distillation, termed \textit{KD-MVS}, which mainly consists of self-supervised teacher training and distillation-based student training. Specifically, the teacher model is trained in a self-supervised fashion using photometric and featuremetric consistency. We then distill the knowledge of the teacher model into the student model through probabilistic knowledge transfer. Supervised by this validated knowledge, the student model is able to outperform its teacher by a large margin. Extensive experiments on multiple datasets show that our method can even outperform supervised methods.
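Probabilistic knowledge transfer is commonly realized as a temperature-softened KL divergence between teacher and student distributions (e.g., over per-pixel depth hypotheses); a generic sketch follows, though KD-MVS's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Generic KD loss: pull the student's distribution towards the teacher's,
    softened by temperature T. Logits have shape (B, num_hypotheses, ...)."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # T*T rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```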
We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate the full 6 DoF pose. Unlike other deep learning methods, which are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities (RGB or depth) to be used. Moreover, we propose a novel pose refinement method based on differentiable rendering, whose main concept is to compare predicted and rendered correspondences in multiple views in order to obtain a pose consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in controlled setups. The main conclusions are that RGB excels at correspondence estimation, while depth contributes to pose accuracy when good 3D-3D correspondences are available; naturally, their combination achieves the best overall performance. We conduct extensive evaluations and ablation studies to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality and the type of training data used.
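The multi-view refinement concept can be sketched as a consistency objective: a single shared pose is rendered into every view and compared against the correspondences predicted in that view. Here `render_corrs` and the data layout are assumed interfaces; the actual method scores dense correspondences through a differentiable renderer:

```python
import torch

def multiview_refinement_loss(pose, predicted, render_corrs, views):
    """Sketch: one object pose is rendered into each view and compared with
    that view's predicted dense correspondences, so the refined pose must
    agree with the predictions in all views simultaneously."""
    loss = torch.zeros(())
    for v in views:
        rendered = render_corrs(pose, v)          # differentiably rendered corrs
        loss = loss + (rendered - predicted[v]).abs().mean()
    return loss / len(views)  # minimise w.r.t. pose by gradient descent
```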
6D object pose estimation problem has been extensively studied in the field of Computer Vision and Robotics. It has a wide range of applications such as robot manipulation, augmented reality, and 3D scene understanding. With the advent of Deep Learning, many breakthroughs have been made; however, approaches continue to struggle when they encounter unseen instances, new categories, or real-world challenges such as cluttered backgrounds and occlusions. In this study, we will explore the available methods based on input modality, problem formulation, and whether it is a category-level or instance-level approach. As a part of our discussion, we will focus on how 6D object pose estimation can be used for understanding 3D scenes.
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
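The symmetry-handling idea can be illustrated with an ADD-S-style distance, in which each transformed model point is matched to its closest counterpart so that symmetry-equivalent poses score identically; this is a simplified, non-differentiable NumPy version rather than PoseCNN's training loss:

```python
import numpy as np

def shapematch_distance(R_est, t_est, R_gt, t_gt, model_pts):
    """Symmetry-aware pose distance: each estimated model point is matched to
    the *closest* ground-truth point, so all poses equivalent under the
    object's symmetry yield the same distance. model_pts is Nx3."""
    est = model_pts @ R_est.T + t_est
    gt = model_pts @ R_gt.T + t_gt
    d = np.linalg.norm(est[:, None, :] - gt[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```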
Real-time robotic grasping, in support of subsequent precise object manipulation tasks, is a priority target for highly advanced autonomous systems. However, an algorithm that can perform sufficiently accurate grasping with time efficiency has yet to be found. This paper proposes a novel two-stage method that combines fast 2D object recognition using a deep neural network with a subsequent accurate and fast 6D pose estimation based on the Point Pair Feature framework, forming a real-time 3D object recognition and grasping solution capable of handling multi-object-class scenes. The proposed solution has the potential to perform robustly in real-time applications that require both efficiency and accuracy. To validate our method, we conducted extensive and thorough experiments involving the laborious preparation of our own dataset. The results show that the proposed method scores 97.37% accuracy on the 5cm5deg metric and 99.37% on the Average Distance metric, an overall relative improvement of 62% (5cm5deg metric) and 52.48% (Average Distance metric). The pose estimation stage also showed an average runtime improvement of 47.6%. Finally, to illustrate the overall efficiency of the system in real-time operation, a pick-and-place robotic experiment was conducted and showed a convincing success rate of 90%. A video of this experiment is available at https://sites.google.com/view/dl-ppf6dpose/.
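For reference, the 5cm5deg criterion used above can be computed as follows (rotations as 3x3 matrices, translations in metres):

```python
import numpy as np

def within_5cm5deg(R_est, t_est, R_gt, t_gt):
    """5cm5deg criterion: a pose is correct if the translation error is under
    5 cm and the geodesic rotation error is under 5 degrees."""
    t_err = np.linalg.norm(t_est - t_gt)                  # metres
    cos = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return t_err < 0.05 and r_err < 5.0
```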
Semi-supervised object detection is important for 3D scene understanding because obtaining large-scale 3D bounding box annotations on point clouds is time-consuming and labor-intensive. Existing semi-supervised methods usually employ teacher-student knowledge distillation together with an augmentation strategy to leverage unlabeled point clouds. However, these methods adopt global augmentation with scene-level transformations and hence are sub-optimal for instance-level object detection. In this work, we propose an object-level point augmentor (OPA) that performs local transformations for semi-supervised 3D object detection. In this way, the resultant augmentor is derived to emphasize object instances rather than irrelevant backgrounds, making the augmented data more useful for object detector training. Extensive experiments on the ScanNet and SUN RGB-D datasets show that the proposed OPA performs favorably against the state-of-the-art methods under various experimental settings. The source code will be available at https://github.com/nomiaro/OPA.
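Object-level augmentation can be sketched as applying an independent random rigid transform to the points inside each object box, rather than one scene-level transform. The axis-aligned boxes and parameter ranges here are illustrative simplifications, and OPA additionally *learns* its augmentations:

```python
import numpy as np

def object_level_augment(points, boxes, max_angle=np.pi / 12, max_shift=0.1):
    """Sketch of object-level point augmentation: points inside each box
    (given as (min_xyz, max_xyz) arrays) get their own random rotation about
    the vertical axis and their own random translation."""
    out = points.copy()
    for lo, hi in boxes:
        inside = np.all((points >= lo) & (points <= hi), axis=1)
        centre = (lo + hi) / 2.0
        a = np.random.uniform(-max_angle, max_angle)
        R = np.array([[np.cos(a), -np.sin(a), 0.0],
                      [np.sin(a),  np.cos(a), 0.0],
                      [0.0, 0.0, 1.0]])
        shift = np.random.uniform(-max_shift, max_shift, size=3)
        # Rotate about the object's centre, then translate the object.
        out[inside] = (points[inside] - centre) @ R.T + centre + shift
    return out
```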