在本文中,我们解决了从单眼图像估算以前未见对象的3D方向的任务。该任务与大多数现有深度学习方法所考虑的任务形成对比,后者通常假设在训练过程中观察到测试对象。为了处理看不见的对象,我们遵循基于检索的策略,并通过计算查询图像和合成生成的参考图像之间的多尺度局部相似性来防止网络学习特定于对象的特征。然后,我们引入了一个自适应融合模块,该模块可稳健地将局部相似性汇总到成对图像的全局相似性评分中。此外,我们通过制定快速检索策略来加快检索过程。我们在LineMod,LineMod-Ocluded和T-less数据集上进行的实验表明,与以前的工作相比,我们的方法对看不见的对象产生了明显的概括。我们的代码和预训练模型可在https://sailor-z.github.io/projects/unseen_object_pose.html上找到。
translated by 谷歌翻译
We present a new method which provides object location priors for previously unseen object 6D pose estimation. Existing approaches build upon a template matching strategy and convolve a set of reference images with the query. Unfortunately, their performance is affected by the object scale mismatches between the references and the query. To address this issue, we present a finer-grained correlation estimation module, which handles the object scale mismatches by computing correlations with adjustable receptive fields. We also propose to decouple the correlations into scale-robust and scale-aware representations to estimate the object location and size, respectively. Our method achieves state-of-the-art unseen object localization and 6D pose estimation results on LINEMOD and GenMOP. We further construct a challenging synthetic dataset, where the results highlight the better robustness of our method to varying backgrounds, illuminations, and object sizes, as well as to the reference-query domain gap.
translated by 谷歌翻译
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over stateof-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
translated by 谷歌翻译
我们提出了一种称为DPODV2(密集姿势对象检测器)的三个阶段6 DOF对象检测方法,该方法依赖于致密的对应关系。我们将2D对象检测器与密集的对应关系网络和多视图姿势细化方法相结合,以估计完整的6 DOF姿势。与通常仅限于单眼RGB图像的其他深度学习方法不同,我们提出了一个统一的深度学习网络,允许使用不同的成像方式(RGB或DEPTH)。此外,我们提出了一种基于可区分渲染的新型姿势改进方法。主要概念是在多个视图中比较预测并渲染对应关系,以获得与所有视图中预测的对应关系一致的姿势。我们提出的方法对受控设置中的不同数据方式和培训数据类型进行了严格的评估。主要结论是,RGB在对应性估计中表现出色,而如果有良好的3D-3D对应关系,则深度有助于姿势精度。自然,他们的组合可以实现总体最佳性能。我们进行广泛的评估和消融研究,以分析和验证几个具有挑战性的数据集的结果。 DPODV2在所有这些方面都取得了出色的成果,同时仍然保持快速和可扩展性,独立于使用的数据模式和培训数据的类型
translated by 谷歌翻译
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provide accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
translated by 谷歌翻译
In this paper, we propose a novel 3D graph convolution based pipeline for category-level 6D pose and size estimation from monocular RGB-D images. The proposed method leverages an efficient 3D data augmentation and a novel vector-based decoupled rotation representation. Specifically, we first design an orientation-aware autoencoder with 3D graph convolution for latent feature learning. The learned latent feature is insensitive to point shift and size thanks to the shift and scale-invariance properties of the 3D graph convolution. Then, to efficiently decode the rotation information from the latent feature, we design a novel flexible vector-based decomposable rotation representation that employs two decoders to complementarily access the rotation information. The proposed rotation representation has two major advantages: 1) decoupled characteristic that makes the rotation estimation easier; 2) flexible length and rotated angle of the vectors allow us to find a more suitable vector representation for specific pose estimation task. Finally, we propose a 3D deformation mechanism to increase the generalization ability of the pipeline. Extensive experiments show that the proposed pipeline achieves state-of-the-art performance on category-level tasks. Further, the experiments demonstrate that the proposed rotation representation is more suitable for the pose estimation tasks than other rotation representations.
translated by 谷歌翻译
尽管提取了通过手工制作和基于学习的描述符实现的本地特征的进步,但它们仍然受到不符合非刚性转换的不变性的限制。在本文中,我们提出了一种计算来自静止图像的特征的新方法,该特征对于非刚性变形稳健,以避免匹配可变形表面和物体的问题。我们的变形感知当地描述符,命名优惠,利用极性采样和空间变压器翘曲,以提供旋转,尺度和图像变形的不变性。我们通过将等距非刚性变形应用于模拟环境中的对象作为指导来提供高度辨别的本地特征来培训模型架构端到端。该实验表明,我们的方法优于静止图像中的实际和现实合成可变形对象的不同数据集中的最先进的手工制作,基于学习的图像和RGB-D描述符。描述符的源代码和培训模型在https://www.verlab.dcc.ufmg.br/descriptors/neUrips2021上公开可用。
translated by 谷歌翻译
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
translated by 谷歌翻译
从RGB图像中对刚性对象进行精确的6D构成估计是机器人技术和增强现实中的一项至关重要的任务。为了解决这个问题,我们提出了DeepRM,这是一种新型的经过精炼的新型经过的网络体系结构。 DeepRM利用初始粗姿势估计来渲染目标对象的合成图像。然后将渲染图像与观察到的图像匹配,以预测更新先前姿势估计值的刚性变换。重复此过程以逐步完善每次迭代的估计值。 LSTM单元用于通过每个完善步骤来传播信息,从而显着提高整体性能。与许多基于2阶段的透视点解决方案相反,DEEPRM是端到端训练的,并使用可扩展的主链,可以通过单个参数调整以提高准确性和效率。在训练过程中,添加了多尺度的光流头,以预测观察到的和合成图像之间的光流。光流预测稳定了训练过程,并强制学习与姿势估计任务相关的功能。我们的结果表明,DEEPRM在两个广泛接受的具有挑战性的数据集上实现了最先进的性能。
translated by 谷歌翻译
对象姿态估计有多个重要应用,例如机器人抓握和增强现实。我们提出了一种估计了提高当前提案的准确性的6D对象的6D姿势,仍然可以实时使用。我们的方法使用RGB-D数据作为段对象的输入并估计它们的姿势。它使用具有多个头部的神经网络,一个头估计对象分类并生成掩码,第二估计转换向量的值,最后一个头估计表示对象旋转的四元轴的值。这些头部利用特征提取和特征融合期间使用的金字塔架构。我们的方法可以实时使用,其低推理时间为0.12秒并具有高精度。通过这种快速推理和良好准确性的组合,可以在机器人挑选和放置任务和/或增强现实应用中使用我们的方法。
translated by 谷歌翻译
RGB图像的刚性对象的可伸缩6D构成估计旨在处理多个对象并推广到新物体。我们建立在一个著名的自动编码框架的基础上,以应对对象对称性和缺乏标记的训练数据,我们通过将自动编码器的潜在表示形状分解为形状并构成子空间来实现可伸缩性。潜在形状空间通过对比度度量学习模型不同对象的相似性,并将潜在姿势代码与旋转检索的规范旋转进行比较。由于不同的对象对称会诱导不一致的潜在姿势空间,因此我们用规范旋转重新输入形状表示,以生成形状依赖的姿势代码簿以进行旋转检索。我们在两个基准上显示了最新的性能,其中包含无类别和每日对象的无纹理CAD对象,并通过扩展到跨类别的每日对象的更具挑战性的设置,进一步证明了可扩展性。
translated by 谷歌翻译
最新的6D对象构成估计方法,包括无监督的方法,需要许多真实的训练图像。不幸的是,对于某些应用,例如在空间或深水下的应用程序,几乎是不可能获取真实图像的,即使是未注释的。在本文中,我们提出了一种可以仅在合成图像上训练的方法,也可以选择使用一些其他真实图像。鉴于从第一个网络获得的粗糙姿势估计,它使用第二个网络来预测使用粗糙姿势和真实图像呈现的图像之间的密集2D对应场,并渗透了所需的姿势校正。与最新方法相比,这种方法对合成图像和真实图像之间的域变化敏感得多。它与需要注释的真实图像进行训练时的方法表现出色,并且在使用二十个真实图像的情况下,它们的表现要优于它们。
translated by 谷歌翻译
This work addresses cross-view camera pose estimation, i.e., determining the 3-DoF camera pose of a given ground-level image w.r.t. an aerial image of the local area. We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images. Given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of rotational equivariant pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-view guided aerial feature selection, and utilizes the geometric projection of the ground camera's viewing frustum on the aerial image to pool features. The efficient construction of aerial descriptors is achieved by using precomputed masks and by re-assembling the aerial descriptors for rotated poses. SliceMatch is trained using contrastive learning and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors. SliceMatch outperforms the state-of-the-art by 19% and 62% in median localization error on the VIGOR and KITTI datasets, with 3x FPS of the fastest baseline.
translated by 谷歌翻译
本文提出了一种类别级别的6D对象姿势和形状估计方法IDAPS,其允许在类别中跟踪6D姿势并估计其3D形状。我们使用深度图像作为输入开发类别级别自动编码器网络,其中来自自动编码器编码的特征嵌入在类别中对象的姿势。自动编码器可用于粒子过滤器框架,以估计和跟踪类别中的对象的姿势。通过利用基于符号距离函数的隐式形状表示,我们构建延迟网络以估计给定对象的估计姿势的3D形状的潜在表示。然后,估计的姿势和形状可用于以迭代方式互相更新。我们的类别级别6D对象姿势和形状估计流水线仅需要2D检测和分段进行初始化。我们在公开的数据集中评估我们的方法,并展示其有效性。特别是,我们的方法在形状估计上实现了相对高的准确性。
translated by 谷歌翻译
This paper addresses the challenge of 6DoF pose estimation from a single RGB image under severe occlusion or truncation. Many recent works have shown that a two-stage approach, which first detects keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation, achieves remarkable performance. However, most of these methods only localize a set of sparse keypoints by regressing their image coordinates or heatmaps, which are sensitive to occlusion and truncation. Instead, we introduce a Pixel-wise Voting Network (PVNet) to regress pixel-wise unit vectors pointing to the keypoints and use these vectors to vote for keypoint locations using RANSAC. This creates a flexible representation for localizing occluded or truncated keypoints. Another important feature of this representation is that it provides uncertainties of keypoint locations that can be further leveraged by the PnP solver. Experiments show that the proposed approach outperforms the state of the art on the LINEMOD, Occlusion LINEMOD and YCB-Video datasets by a large margin, while being efficient for real-time pose estimation. We further create a Truncation LINEMOD dataset to validate the robustness of our approach against truncation. The code will be avaliable at https://zju-3dv.github.io/pvnet/.
translated by 谷歌翻译
近年来,已经产生了大量的视觉内容,并从许多领域共享,例如社交媒体平台,医学成像和机器人。这种丰富的内容创建和共享引入了新的挑战,特别是在寻找类似内容内容的图像检索(CBIR)-A的数据库中,即长期建立的研究区域,其中需要改进的效率和准确性来实时检索。人工智能在CBIR中取得了进展,并大大促进了实例搜索过程。在本调查中,我们审查了最近基于深度学习算法和技术开发的实例检索工作,通过深网络架构类型,深度功能,功能嵌入方法以及网络微调策略组织了调查。我们的调查考虑了各种各样的最新方法,在那里,我们识别里程碑工作,揭示各种方法之间的联系,并呈现常用的基准,评估结果,共同挑战,并提出未来的未来方向。
translated by 谷歌翻译
类别级的对象姿势估计旨在预测已知类别集的任意对象的6D姿势以及3D度量大小。最近的方法利用了先验改编的形状,以将观察到的点云映射到规范空间中,并应用Umeyama算法以恢复姿势和大小。然而,它们的形状先验整合策略间接增强了姿势估计,从而导致姿势敏感的特征提取和推理速度缓慢。为了解决这个问题,在本文中,我们提出了一个新颖的几何形状引导的残留对象边界框投影网络RBP置rbp置置,该框架共同预测对象的姿势和残留的矢量,描述了从形状优先指示的对象表面投影中的位移迈向真实的表面投影。残留矢量的这种定义本质上是零均值且相对较小,并且明确封装了3D对象的空间提示,以进行稳健和准确的姿势回归。我们强制执行几何学意识的一致性项,以使预测的姿势和残留向量对齐以进一步提高性能。
translated by 谷歌翻译
准确且强大的视觉对象跟踪是最具挑战性和最基本的计算机视觉问题之一。它需要在图像序列中估计目标的轨迹,仅给出其初始位置和分段,或者在边界框的形式中粗略近似。判别相关滤波器(DCF)和深度暹罗网络(SNS)被出现为主导跟踪范式,这导致了重大进展。在过去十年的视觉对象跟踪快速演变之后,该调查介绍了90多个DCFS和暹罗跟踪器的系统和彻底审查,基于九个跟踪基准。首先,我们介绍了DCF和暹罗跟踪核心配方的背景理论。然后,我们在这些跟踪范式中区分和全面地审查共享以及具体的开放研究挑战。此外,我们彻底分析了DCF和暹罗跟踪器对九个基准的性能,涵盖了视觉跟踪的不同实验方面:数据集,评估度量,性能和速度比较。通过提出根据我们的分析提出尊重开放挑战的建议和建议来完成调查。
translated by 谷歌翻译
虽然最近出现了类别级的9DOF对象姿势估计,但由于较大的对象形状和颜色等类别内差异,因此,先前基于对应的或直接回归方法的准确性均受到限制。 - 级别的物体姿势和尺寸炼油机Catre,能够迭代地增强点云的姿势估计以产生准确的结果。鉴于初始姿势估计,Catre通过对齐部分观察到的点云和先验的抽象形状来预测初始姿势和地面真理之间的相对转换。具体而言,我们提出了一种新颖的分离体系结构,以了解旋转与翻译/大小估计之间的固有区别。广泛的实验表明,我们的方法在REAL275,Camera25和LM基准测试中的最先进方法高达〜85.32Hz,并在类别级别跟踪上取得了竞争成果。我们进一步证明,Catre可以对看不见的类别进行姿势改进。可以使用代码和训练有素的型号。
translated by 谷歌翻译
基于无人机(UAV)基于无人机的视觉对象跟踪已实现了广泛的应用,并且由于其多功能性和有效性而引起了智能运输系统领域的越来越多的关注。作为深度学习革命性趋势的新兴力量,暹罗网络在基于无人机的对象跟踪中闪耀,其准确性,稳健性和速度有希望的平衡。由于开发了嵌入式处理器和深度神经网络的逐步优化,暹罗跟踪器获得了广泛的研究并实现了与无人机的初步组合。但是,由于无人机在板载计算资源和复杂的现实情况下,暹罗网络的空中跟踪仍然在许多方面都面临严重的障碍。为了进一步探索基于无人机的跟踪中暹罗网络的部署,这项工作对前沿暹罗跟踪器进行了全面的审查,以及使用典型的无人机板载处理器进行评估的详尽无人用分析。然后,进行板载测试以验证代表性暹罗跟踪器在现实世界无人机部署中的可行性和功效。此外,为了更好地促进跟踪社区的发展,这项工作分析了现有的暹罗跟踪器的局限性,并进行了以低弹片评估表示的其他实验。最后,深入讨论了基于无人机的智能运输系统的暹罗跟踪的前景。领先的暹罗跟踪器的统一框架,即代码库及其实验评估的结果,请访问https://github.com/vision4robotics/siamesetracking4uav。
translated by 谷歌翻译