Vision transformers have emerged as powerful tools for many computer vision tasks. It has been shown that their features and class tokens can be used for salient object segmentation. However, the properties of segmentation transformers remain largely unstudied. In this work we conduct an in-depth study of the spatial attentions of different backbone layers of semantic segmentation transformers and uncover interesting properties. The spatial attentions of a patch intersecting with an object tend to concentrate within the object, whereas the attentions of larger, more uniform image areas rather follow a diffusive behavior. In other words, vision transformers trained to segment a fixed set of object classes generalize to objects well beyond this set. We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds, such as obstacles in traffic scenes. Our method is training-free and its computational overhead negligible. We use off-the-shelf transformers trained for street-scene segmentation to process other scene types.
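Below is a minimal sketch of how such attention heatmaps could be extracted, assuming the self-attention weights of one backbone layer have already been hooked; the shapes, the seed-patch selection, and the min-max normalization are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_heatmap(attn, seed_patches, grid_hw):
    """attn: (heads, N, N) softmaxed self-attention over N patches.
    seed_patches: indices of patches intersecting a coarse object region.
    grid_hw: (H, W) patch grid with H * W == N."""
    h, w = grid_hw
    # Average over heads, then over the attention rows of the seed patches:
    # how strongly the object region attends to every other patch.
    rows = attn.mean(dim=0)[seed_patches]            # (S, N)
    heat = rows.mean(dim=0).reshape(h, w)            # (H, W)
    # Normalize to [0, 1] so the map can be thresholded into a segment.
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Dummy usage: random weights stand in for hooked attention of a real backbone.
N, heads = 14 * 14, 6
attn = torch.softmax(torch.randn(heads, N, N), dim=-1)
print(attention_heatmap(attn, torch.tensor([3, 4, 17]), (14, 14)).shape)
```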
Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar, without ground-truth input-translation pairs. Existing UEI2I methods represent style using either a global, image-level feature vector, or one vector per object instance/class but requiring knowledge of the scene semantics. Here, by contrast, we propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information. We then rely on perceptual and adversarial losses to disentangle our dense style and content representations, and exploit unsupervised cross-domain semantic correspondences to warp the exemplar style to the source content. We demonstrate the effectiveness of our method on two datasets using standard metrics together with a new localized style metric measuring style similarity in a class-wise manner. Our results evidence that the translations produced by our approach are more diverse and closer to the exemplars than those of the state-of-the-art methods while nonetheless preserving the source content.
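A hedged sketch of the style-warping step described above: dense exemplar style codes are transported to the source layout via softmax-weighted feature correspondences. The encoder outputs, the temperature, and the cosine-similarity matching are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def warp_style(content_feat, exemplar_feat, exemplar_style, tau=0.05):
    """content_feat: (C, H, W) source content features.
    exemplar_feat: (C, H, W) exemplar features from the same encoder.
    exemplar_style: (S, H, W) dense style codes of the exemplar."""
    C, H, W = content_feat.shape
    q = F.normalize(content_feat.reshape(C, -1), dim=0)       # (C, HW)
    k = F.normalize(exemplar_feat.reshape(C, -1), dim=0)      # (C, HW)
    corr = torch.softmax(q.t() @ k / tau, dim=-1)             # source-to-exemplar matches
    v = exemplar_style.reshape(exemplar_style.shape[0], -1)   # (S, HW)
    return (corr @ v.t()).t().reshape(-1, H, W)               # style in source layout

# Dummy tensors standing in for encoder outputs.
warped = warp_style(torch.randn(64, 32, 32), torch.randn(64, 32, 32),
                    torch.randn(8, 32, 32))
print(warped.shape)  # torch.Size([8, 32, 32])
```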
We present a new method which provides object location priors for previously unseen object 6D pose estimation. Existing approaches build upon a template matching strategy and convolve a set of reference images with the query. Unfortunately, their performance is affected by the object scale mismatches between the references and the query. To address this issue, we present a finer-grained correlation estimation module, which handles the object scale mismatches by computing correlations with adjustable receptive fields. We also propose to decouple the correlations into scale-robust and scale-aware representations to estimate the object location and size, respectively. Our method achieves state-of-the-art unseen object localization and 6D pose estimation results on LINEMOD and GenMOP. We further construct a challenging synthetic dataset, where the results highlight the better robustness of our method to varying backgrounds, illuminations, and object sizes, as well as to the reference-query domain gap.
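The sketch below illustrates the general idea of correlating a reference template with the query at several receptive-field sizes, then reading off a scale-robust location map and a scale-aware size cue; it is a simplification under assumed shapes and fusion choices, not the paper's module.

```python
import torch
import torch.nn.functional as F

def multiscale_correlation(query, reference, scales=(0.5, 1.0, 2.0)):
    """query: (C, H, W) query features; reference: (C, h, w) template features."""
    q = F.normalize(query, dim=0)[None]                        # (1, C, H, W)
    loc_maps, peak_scores = [], []
    for s in scales:
        ref = F.interpolate(reference[None], scale_factor=s,
                            mode='bilinear', align_corners=False)
        ref = F.normalize(ref, dim=1)
        pad = (ref.shape[-2] // 2, ref.shape[-1] // 2)
        # Cross-correlate the resized template with the query; the template
        # size acts as an adjustable receptive field.
        corr = F.conv2d(q, ref, padding=pad)[0, 0]
        corr = corr[:query.shape[1], :query.shape[2]]          # crop to (H, W)
        loc_maps.append(corr)
        peak_scores.append(corr.max())
    stack = torch.stack(loc_maps)                              # (num_scales, H, W)
    location_map = stack.max(dim=0).values                     # scale-robust location evidence
    best_scale = scales[int(torch.stack(peak_scores).argmax())]  # scale-aware size cue
    return location_map, best_scale

loc, scale = multiscale_correlation(torch.randn(32, 60, 80), torch.randn(32, 16, 16))
print(loc.shape, scale)
```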
Recent approaches to drape garments quickly over arbitrary human bodies leverage self-supervision to eliminate the need for large training sets. However, they are designed to train one network per clothing item, which severely limits their generalization abilities. In our work, we rely on self-supervision to train a single network to drape multiple garments. This is achieved by predicting a 3D deformation field conditioned on the latent codes of a generative network, which models garments as unsigned distance fields. Our pipeline can generate and drape previously unseen garments of any topology, whose shape can be edited by manipulating their latent codes. Being fully differentiable, our formulation makes it possible to recover accurate 3D models of garments from partial observations -- images or 3D scans -- via gradient descent. Our code will be made publicly available.
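A minimal sketch of a deformation field conditioned on a garment latent code and body parameters, in the spirit of the pipeline above; the MLP layout, the 82-dimensional SMPL-style body parameterization, and the direct displacement prediction are assumptions.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    def __init__(self, latent_dim=32, body_dim=10 + 72, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim + body_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                    # per-point 3D displacement
        )

    def forward(self, points, garment_code, body_params):
        """points: (N, 3) canonical garment points; garment_code: (latent_dim,);
        body_params: (body_dim,) shape/pose vector (dimension is an assumption)."""
        n = points.shape[0]
        cond = torch.cat([garment_code, body_params]).expand(n, -1)
        return points + self.mlp(torch.cat([points, cond], dim=-1))

field = DeformationField()
draped = field(torch.rand(1024, 3), torch.randn(32), torch.randn(82))
print(draped.shape)  # torch.Size([1024, 3])
```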
Unsupervised self-rehabilitation exercises and physical training can lead to serious injuries if performed incorrectly. We introduce a learning-based framework that identifies the mistakes made by a user and proposes corrective measures for easier and safer individual training. Our framework does not rely on hard-coded heuristic rules; instead, it learns from data, which facilitates its adaptation to specific user needs. To this end, we use a Graph Convolutional Network (GCN) architecture acting on the user's pose sequence to model the relationships between body joint trajectories. To evaluate our approach, we introduce a dataset with three different physical exercises. Our approach yields 90.9% mistake identification accuracy and successfully corrects 94.2% of the mistakes.
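A compact sketch of a graph-convolutional classifier over pose sequences in the spirit of the framework above; the joint count, adjacency, normalization, and number of mistake classes are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PoseGCNClassifier(nn.Module):
    def __init__(self, adjacency, in_dim=3, hidden=64, num_errors=4):
        super().__init__()
        # Row-normalized adjacency with self-loops couples neighbouring joints.
        a = adjacency + torch.eye(adjacency.shape[0])
        self.register_buffer('a_norm', a / a.sum(dim=1, keepdim=True))
        self.gc1 = nn.Linear(in_dim, hidden)
        self.gc2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, num_errors)

    def forward(self, poses):
        """poses: (T, J, 3) sequence of T frames with J 3D joints."""
        x = torch.relu(self.a_norm @ self.gc1(poses))   # spatial message passing
        x = torch.relu(self.a_norm @ self.gc2(x))
        x = x.mean(dim=(0, 1))                          # pool over time and joints
        return self.head(x)                             # logits over mistake types

J = 17
adj = torch.zeros(J, J)
adj[torch.arange(J - 1), torch.arange(1, J)] = 1        # toy chain skeleton
model = PoseGCNClassifier(adj + adj.t())
print(model(torch.randn(50, J, 3)).shape)  # torch.Size([4])
```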
Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's \emph{distribution} of local predictions into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.
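As a purely illustrative stand-in (the divergence used in the paper may differ), the snippet below supervises the distribution of local 2D predictions with a symmetric Chamfer-style set distance rather than one-to-one prediction matching.

```python
import torch

def set_distillation_loss(student_votes, teacher_votes):
    """student_votes: (N, 2) and teacher_votes: (M, 2) local 2D predictions."""
    d = torch.cdist(student_votes, teacher_votes)        # (N, M) pairwise distances
    # Each student vote should land near some teacher vote and vice versa,
    # constraining the overall spread of predictions rather than their ordering.
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

loss = set_distillation_loss(torch.randn(128, 2), torch.randn(64, 2))
print(loss)
```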
In this work, we tackle the task of estimating the 6D pose of an object from point cloud data. While recent learning-based approaches to this task have shown great success on synthetic datasets, we observe that they fail in the presence of real-world data. We therefore analyze the causes of these failures, which we trace back to the difference between the feature distributions of the source and target point clouds, and to the sensitivity of the widely used SVD-based loss function to the range of rotation between the two point clouds. We address the first challenge with a new normalization strategy, Match Normalization, and the second by introducing a loss function based on the negative log-likelihood of point correspondences. Our two contributions are general and can be applied to many existing learning-based 3D object registration frameworks, which we illustrate by implementing them in two of them, DCP and IDAM. Our experiments on the real-world TUD-L, LineMod and Occluded-LineMod datasets evidence the benefits of our strategies. They allow learning-based 3D object registration methods to achieve meaningful results on real-world data for the first time. We therefore expect them to be key to the future development of point cloud registration methods.
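A hedged sketch of the two ingredients as read from this abstract: a per-point-cloud feature normalization standing in for Match Normalization (its exact formulation here is an assumption) and a negative log-likelihood loss over soft point correspondences.

```python
import torch

def match_normalize(feat, eps=1e-6):
    """feat: (N, C) per-point features of one point cloud, standardized so that
    source and target feature distributions share the same scale per instance."""
    return (feat - feat.mean(dim=0)) / (feat.std(dim=0) + eps)

def correspondence_nll(src_feat, tgt_feat, gt_match, tau=0.1):
    """gt_match[i] = index of the target point matching source point i."""
    logits = match_normalize(src_feat) @ match_normalize(tgt_feat).t() / tau
    log_p = torch.log_softmax(logits, dim=1)             # (N, M) soft matches
    return -log_p[torch.arange(len(gt_match)), gt_match].mean()

loss = correspondence_nll(torch.randn(256, 64), torch.randn(256, 64),
                          torch.randint(0, 256, (256,)))
print(loss)
```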
State-of-the-art 6D object pose estimation methods, including unsupervised ones, require many real training images. Unfortunately, for some applications, such as those in space or deep underwater, acquiring real images, even unannotated ones, is virtually impossible. In this paper, we propose a method that can be trained solely on synthetic images, or optionally with a few additional real ones. Given a rough pose estimate obtained from a first network, it uses a second network to predict a dense 2D correspondence field between the image rendered using the rough pose and the real image, and infers the required pose correction. This approach is much less sensitive to the domain shift between synthetic and real images than state-of-the-art methods. It performs on par with methods that require annotated real images for training, and outperforms them when only twenty real images are used.
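The sketch below shows one way the correction step could work, assuming a dense 2D correspondence (flow) field has already been predicted by the second network: project model points with the coarse pose, displace them by the flow, and re-solve PnP. It is a reading of the abstract, not the released code.

```python
import numpy as np
import cv2

def refine_pose(model_pts, rvec_coarse, tvec_coarse, flow, K):
    """model_pts: (N, 3) object points; flow: (H, W, 2) field mapping the render
    at the coarse pose to the real image; K: (3, 3) camera intrinsics."""
    proj, _ = cv2.projectPoints(model_pts, rvec_coarse, tvec_coarse, K, None)
    proj = proj.reshape(-1, 2)
    # Look up where each rendered model point moved to in the real image.
    u = np.clip(proj[:, 0].round().astype(int), 0, flow.shape[1] - 1)
    v = np.clip(proj[:, 1].round().astype(int), 0, flow.shape[0] - 1)
    corrected = proj + flow[v, u]                          # (N, 2) 2D targets
    ok, rvec, tvec = cv2.solvePnP(model_pts, corrected, K, None)
    return rvec, tvec

# Dummy example: a zero flow field should (nearly) reproduce the coarse pose.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = np.random.rand(30, 3)
rvec, tvec = refine_pose(pts, np.zeros(3), np.array([0., 0., 2.]),
                         np.zeros((480, 640, 2)), K)
print(rvec.ravel(), tvec.ravel())
```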
In this paper, we address the task of estimating the 3D orientation of previously unseen objects from monocular images. This contrasts with the setting considered by most existing deep learning methods, which typically assume that the test objects have been observed during training. To handle unseen objects, we follow a retrieval-based strategy and prevent the network from learning object-specific features by computing multi-scale local similarities between the query image and synthetically generated reference images. We then introduce an adaptive fusion module that robustly aggregates the local similarities into a global similarity score for an image pair. Furthermore, we speed up the retrieval process by developing a fast retrieval strategy. Our experiments on the LineMod, LineMod-Occluded, and T-LESS datasets show that our method yields significantly better generalization to unseen objects than previous work. Our code and pre-trained models are available at https://sailor-z.github.io/projects/unseen_object_pose.html.
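A hedged, simplified sketch of the pairwise scoring idea: local cosine similarities computed at several feature scales and adaptively fused into one global query-reference score; the learned fusion weights and the feature pyramids are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSimilarity(nn.Module):
    def __init__(self, num_scales=3):
        super().__init__()
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))  # learned fusion weights

    def forward(self, query_feats, ref_feats):
        """query_feats, ref_feats: lists of (C_i, H_i, W_i) feature maps per scale."""
        scores = []
        for q, r in zip(query_feats, ref_feats):
            q = F.normalize(q.flatten(1), dim=0)           # (C, HW) local descriptors
            r = F.normalize(r.flatten(1), dim=0)
            local_sim = (q.t() @ r).max(dim=1).values      # best match per query location
            scores.append(local_sim.mean())
        w = torch.softmax(self.scale_logits, dim=0)        # adaptive per-scale weighting
        return (w * torch.stack(scores)).sum()             # global similarity score

sim = AdaptiveSimilarity()
feats = [torch.randn(64, 32, 32), torch.randn(128, 16, 16), torch.randn(256, 8, 8)]
print(sim(feats, [f + 0.1 * torch.randn_like(f) for f in feats]))
```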
Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the difficulty of an instance, we analyze the model's behavior on training instances of different difficulty levels. This lets us show that the decay in generalization performance of adversarial training is a result of the model's attempt to fit hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances, and that this generalization gap increases with the size of the adversarial budget. Finally, we conduct case studies on several methods that mitigate adversarial overfitting. Our analysis shows that methods that successfully mitigate adversarial overfitting all avoid fitting hard adversarial instances, while the ones that fit hard adversarial instances do not achieve true robustness.
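The sketch below pairs standard PGD adversarial training with a simple per-instance difficulty proxy (the adversarial loss of each training example); the paper's actual difficulty metric is not reproduced here, so this is an assumption-laden illustration only.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    # L-infinity PGD: iteratively follow the sign of the input gradient.
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y):
    x_adv = pgd_attack(model, x, y)
    per_sample = F.cross_entropy(model(x_adv), y, reduction='none')
    optimizer.zero_grad()
    per_sample.mean().backward()
    optimizer.step()
    return per_sample.detach()      # higher adversarial loss -> harder instance

# Tiny demo on random data with a linear classifier standing in for a real model.
net = torch.nn.Linear(3 * 32 * 32, 10)
model = lambda x: net(x.flatten(1))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
print(adversarial_training_step(model, opt, x, y))
```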