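We tackle the essential task of finding dense visual correspondences between a pair of images. This is a challenging problem due to various factors such as poor texture, repetitive patterns, illumination variation, and motion blur. In contrast to methods that use dense correspondence ground truth as direct supervision for local feature matching training, we train 3DG-STFM: a multi-modal matching model (teacher) that enforces depth consistency under 3D dense correspondence supervision and transfers the knowledge to a 2D unimodal matching model (student). Both the teacher and student models consist of two transformer-based matching modules that obtain dense correspondences in a coarse-to-fine manner. The teacher model guides the student model to learn RGB-induced depth information for matching on both the coarse and fine branches. We also evaluate 3DG-STFM on a model compression task. To the best of our knowledge, 3DG-STFM is the first student-teacher learning method for the local feature matching task. The experiments show that our method outperforms state-of-the-art methods on indoor and outdoor camera pose estimation and on homography estimation problems. Code is available at: https://github.com/ryan-prime/3dg-stfm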
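As a rough illustration of the teacher-to-student transfer described above, the following is a minimal sketch of a generic temperature-softened distillation loss over candidate match scores. It is not the loss used by 3DG-STFM; the function names, the temperature value, and the toy scores are all assumptions made for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical match scores of one pixel against three candidate pixels.
teacher = [4.0, 1.0, 0.5]        # teacher (RGB-D input) is confident in candidate 0
student_good = [3.5, 1.2, 0.4]   # student (RGB only) agrees with the teacher
student_bad = [0.5, 1.0, 4.0]    # student prefers a different candidate

loss_good = distillation_loss(teacher, student_good)
loss_bad = distillation_loss(teacher, student_bad)
print(loss_good < loss_bad)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two disagree, which is what lets the depth-aware teacher act as a soft supervision signal.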
We present a novel method for local image feature matching. Instead of performing image feature detection, description, and matching sequentially, we propose to first establish pixel-wise dense matches at a coarse level and later refine the good matches at a fine level. In contrast to dense methods that use a cost volume to search correspondences, we use self and cross attention layers in Transformer to obtain feature descriptors that are conditioned on both images. The global receptive field provided by Transformer enables our method to produce dense matches in low-texture areas, where feature detectors usually struggle to produce repeatable interest points. The experiments on indoor and outdoor datasets show that LoFTR outperforms state-of-the-art methods by a large margin. LoFTR also ranks first on two public benchmarks of visual localization among the published methods. Code is available at our project page: https://zju3dv.github.io/loftr/.
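The idea of descriptors "conditioned on both images" can be sketched with plain scaled dot-product cross attention, where features of one image query the features of the other. This is an assumption-laden toy: the softmax attention, the function name `cross_attention`, and the 2-D feature vectors are illustrative only, and the paper's layers may use a more efficient attention variant.

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is a convex combination of the value rows.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(out_dim)])
    return outputs

# Hypothetical coarse features: two positions in image A, three in image B.
feat_a = [[1.0, 0.0], [0.0, 1.0]]
feat_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Features of A updated with information gathered from B.
out = cross_attention(feat_a, feat_b, feat_b)
```

Because every query position sees every key position, the receptive field is global, which is the property the abstract credits for matches in low-texture areas.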
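Generating robust and reliable correspondences across images is a fundamental task for a diversity of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher built on a hierarchical attention structure, which adopts a novel attention operation capable of adjusting the attention span in a self-adaptive manner. To achieve this goal, first, flow maps are regressed in each cross-attention phase to locate the center of the search region. Next, a sampling grid is generated around the center, whose size, instead of being empirically configured as fixed, is adaptively computed from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across the two images within the derived regions, referred to as the attention span. By these means, we are able not only to maintain long-range dependencies, but also to enable fine-grained attention among pixels of high relevance, which compensates for the essential locality and piece-wise smoothness in matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method.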
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
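The optimal-transport step can be illustrated with a bare-bones Sinkhorn normalisation that turns a score matrix into a soft assignment. This is a simplified sketch, not SuperGlue's implementation: it omits the extra "dustbin" handling for non-matchable points, and the function name, iteration count, and toy scores are assumptions.

```python
import math

def sinkhorn(scores, iterations=50):
    """Alternately normalise rows and columns of exp(scores)."""
    n_rows, n_cols = len(scores), len(scores[0])
    peak = max(max(row) for row in scores)
    P = [[math.exp(s - peak) for s in row] for row in scores]
    for _ in range(iterations):
        for i in range(n_rows):            # make each row sum to 1
            row_sum = sum(P[i])
            P[i] = [p / row_sum for p in P[i]]
        for j in range(n_cols):            # make each column sum to 1
            col_sum = sum(P[i][j] for i in range(n_rows))
            for i in range(n_rows):
                P[i][j] /= col_sum
    return P

# Hypothetical matching scores between three keypoints in each image.
scores = [[2.0, 0.5, 0.0],
          [0.0, 1.5, 0.5],
          [0.5, 0.0, 2.0]]
P = sinkhorn(scores)
matches = [max(range(3), key=lambda j: row[j]) for row in P]
```

For a square score matrix the result approaches a doubly stochastic matrix, so each keypoint's assignment mass is shared consistently in both directions rather than decided row by row.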
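Local image feature matching, which aims to identify and match similar regions across image pairs, is an essential concept in computer vision. Most existing image matching approaches follow a one-to-one assignment principle and employ mutual nearest neighbors to guarantee unique correspondences between local features across images. However, images captured under different conditions may exhibit large scale variations or viewpoint diversity, so that one-to-one assignment may cause ambiguous or missing representations in dense matching. In this paper, we introduce AdaMatcher, a novel detector-free local feature matching method, which first correlates dense features through a lightweight feature interaction module and estimates the co-visible area of the paired images, then performs a patch-level many-to-one assignment to predict match proposals, and finally refines them with a one-to-one refinement module. Extensive experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Additionally, the many-to-one assignment and one-to-one refinement module can be used as a refinement network for other matching methods, such as SuperGlue, to further boost their performance. Code will be available upon publication.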
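The one-to-one baseline that this abstract contrasts against, mutual nearest neighbor matching, can be sketched as follows. The function name and the toy similarity matrix are assumptions for illustration; this is the conventional check, not AdaMatcher's many-to-one assignment.

```python
def mutual_nearest_neighbors(similarity):
    """Return (i, j) pairs where i's best match is j and j's best match is i."""
    n_rows, n_cols = len(similarity), len(similarity[0])
    best_col = [max(range(n_cols), key=lambda j: similarity[i][j])
                for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: similarity[i][j])
                for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]

# Hypothetical similarities: rows are patches in image A, columns in image B.
# Patches 0 and 1 of A both look most like patch 0 of B (e.g. a scale change).
similarity = [[0.9, 0.1, 0.2],
              [0.8, 0.3, 0.1],
              [0.1, 0.2, 0.7]]
pairs = mutual_nearest_neighbors(similarity)
print(pairs)  # [(0, 0), (2, 2)]
```

Patch 1 of image A is left without a match even though it strongly resembles patch 0 of image B, which is the kind of missing correspondence that motivates a many-to-one assignment.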
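In this paper, we propose to go beyond the well-established approach to vision-based localization that relies on visual descriptor matching between a query image and a 3D point cloud. While matching keypoints via visual descriptors makes localization highly accurate, it has significant storage demands, raises privacy concerns, and requires the descriptors to be updated in the long term. To elegantly address these practical challenges for large-scale localization, we present GoMatch, an alternative to visual-based matching that relies solely on geometric information for matching image keypoints to maps, represented as sets of bearing vectors. Our novel bearing-vector representation of 3D points significantly relieves the cross-modal challenge in geometric-based matching that prevented prior work from tackling localization in realistic environments. With additional careful architecture design, GoMatch improves over prior geometric-based matching work with a reduction of (10.67m, 95.7°) and (1.43m, 34.7°) in average median pose errors on Cambridge Landmarks and 7-Scenes, while requiring as little as 1.5/1.7% of the storage capacity of the best visual-based matching methods. This confirms its potential and feasibility for real-world localization and opens the door to future efforts in advancing city-scale visual localization methods that do not require storing visual descriptors.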
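A bearing vector is simply a unit direction from the camera center. The sketch below, assuming a standard pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, shows how a 2D keypoint and a 3D point expressed in the camera frame map to the same kind of representation. It illustrates the representation only, not GoMatch's matching network.

```python
import math

def keypoint_bearing(u, v, fx, fy, cx, cy):
    """Unit ray through pixel (u, v) for a pinhole camera with the given intrinsics."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def point_bearing(point_in_camera_frame):
    """Unit direction towards a 3D point already expressed in a camera frame."""
    px, py, pz = point_in_camera_frame
    norm = math.sqrt(px * px + py * py + pz * pz)
    return (px / norm, py / norm, pz / norm)

# Hypothetical intrinsics for a 640x480 image.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
center_ray = keypoint_bearing(320.0, 240.0, fx, fy, cx, cy)   # optical axis
corner_ray = keypoint_bearing(0.0, 0.0, fx, fy, cx, cy)
point_ray = point_bearing((0.0, 0.0, 4.0))                    # point on the axis
```

Both keypoints and map points end up as directions on the unit sphere, so they can be compared without storing any visual descriptor.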
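Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Furthermore, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art results on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc).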
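Finding correspondences across images is an important task in many visual applications. Recent state-of-the-art methods focus on end-to-end learning-based architectures designed in a coarse-to-fine manner. They use a very deep CNN or a multi-block Transformer to learn robust representations, which requires high computational power. Moreover, these methods learn features without reasoning about the objects and shapes inside images, and thus lack interpretability. In this paper, we propose an architecture for image matching that is efficient, robust, and interpretable. More specifically, we introduce a novel feature matching module called TopicFM, which can roughly organize the same spatial structure across images into a topic and then augment the features inside each topic for accurate matching. To infer topics, we first learn global embeddings of topics and then use a latent-variable model to detect image structures and assign them to topics. Our method can perform matching only in co-visibility regions to reduce computation. Extensive experiments on both outdoor and indoor datasets show that our method outperforms recent methods in terms of matching performance and computational efficiency. The code is available at https://github.com/truongkhang/topicfm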
Erroneous feature matches have severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs compared to SuperGlue by 6.7% on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.8% compared to SuperGlue.
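Establishing dense correspondence between two images is a fundamental computer vision problem, which is typically tackled by matching local feature descriptors. However, without global awareness, such local features are often insufficient for disambiguating similar regions. Moreover, computing the pairwise feature correlation across images is both computationally expensive and memory-intensive. To make the local features aware of the global context and improve their matching accuracy, we introduce DenseGAP, a new solution for efficient dense correspondence learning with a graph-structured neural network conditioned on anchor points. Specifically, we first propose a graph structure that utilizes anchor points to provide a sparse but reliable prior on inter- and intra-image context and propagates it to all image points via directed edges. We also design a graph-structured network that broadcasts multi-level contexts via lightweight message-passing layers and generates high-resolution feature maps at low memory cost. Finally, based on the predicted feature maps, we introduce a coarse-to-fine framework for accurate correspondence prediction using cycle consistency. Our feature descriptors capture both local and global information, thus enabling a continuous feature field for querying arbitrary points at high resolution. Through comprehensive ablation experiments and evaluations on large-scale indoor and outdoor datasets, we demonstrate that our method advances the state of the art in correspondence learning on most benchmarks.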
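Keypoint matching is a key component of many image-related applications, such as image stitching and visual simultaneous localization and mapping (SLAM). Both handcrafted and recently emerged deep learning-based keypoint matching methods rely merely on keypoints and local features, while losing sight of other sensors available in these applications, such as the inertial measurement unit (IMU). In this paper, we demonstrate that motion estimation from IMU integration can be used to exploit a prior on the spatial distribution of keypoints between images. To this end, a probabilistic perspective on the attention formulation is proposed to naturally integrate the spatial distribution prior into an attentional graph neural network. With the help of the spatial distribution prior, the effort the network spends on modeling hidden features can be reduced. Furthermore, we present a projection loss for the proposed keypoint matching network, which provides a smooth edge between matching and non-matching keypoints. Image matching experiments on visual SLAM datasets demonstrate the effectiveness and efficiency of the presented method.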
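Supervised multi-view stereo (MVS) methods have achieved remarkable progress in terms of reconstruction quality, but suffer from the challenge of collecting large-scale ground-truth depth. In this paper, we propose a novel self-supervised training pipeline for MVS based on knowledge distillation, termed \textit{KD-MVS}, which mainly consists of self-supervised teacher training and distillation-based student training. Specifically, the teacher model is trained in a self-supervised fashion using both photometric and feature consistency. We then distill the knowledge of the teacher model into the student model through probabilistic knowledge transfer. Under the supervision of validated knowledge, the student model is able to outperform its teacher by a large margin. Extensive experiments on multiple datasets show that our method can even outperform supervised methods.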
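Modeling sparse and dense image matching within a unified functional correspondence model has recently attracted increasing research interest. However, existing efforts mainly focus on improving matching accuracy while ignoring efficiency, which is crucial for real-world applications. In this paper, we propose an efficient structure that finds correspondences in a coarse-to-fine manner, significantly improving the efficiency of the functional correspondence model. To achieve this, multiple transformer blocks are connected stage-wise to gradually refine the predicted coordinates on top of a shared multi-scale feature extraction network. Given a pair of images and arbitrary query coordinates, all correspondences are predicted within a single feed-forward pass. We further propose an adaptive query-clustering strategy and an uncertainty-based outlier detection module that cooperate with the proposed framework for faster and better predictions. Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness over existing state-of-the-art methods.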
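Wearable sensor-based Human Action Recognition (HAR) has achieved remarkable success recently. However, the accuracy of wearable sensor-based HAR still lags far behind that of systems based on visual modalities (i.e., RGB video, skeleton, and depth). Diverse input modalities can provide complementary cues and thus improve the accuracy of HAR, but how to take advantage of multi-modal data in wearable sensor-based HAR has rarely been explored. Currently, wearable devices, i.e., smartwatches, can only capture limited kinds of non-visual modality data. This hinders multi-modal HAR association, as visual and non-visual modality data cannot be used simultaneously. Another major challenge lies in how to efficiently utilize multi-modal data on wearable devices with limited computational resources. In this work, we propose a novel Progressive Skeleton-to-sensor Knowledge Distillation (PSKD) model, which utilizes only time-series data, i.e., accelerometer data, from a smartwatch to solve the wearable sensor-based HAR problem. Specifically, we construct multiple teacher models using data from both the teacher (human skeleton sequence) and student (time-series accelerometer data) modalities. In addition, we propose an effective progressive learning scheme to eliminate the performance gap between the teacher and student models. We also design a novel loss function called Adaptive-Confidence Semantic (ACS), which allows the student model to adaptively select either one of the teacher models or the ground-truth label as the target it needs to mimic. To demonstrate the effectiveness of our proposed PSKD method, we conduct extensive experiments on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets. The results confirm that the proposed PSKD method has competitive performance compared to previous single-sensor-based HAR methods.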
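Most existing learning-based image matching pipelines are designed for better feature detectors and descriptors that are robust to repeated textures, viewpoint changes, etc., while little attention has been paid to rotation invariance. As a consequence, owing to the lack of keypoint orientation prediction, these approaches usually show inferior performance compared to handcrafted algorithms on data with significant rotation. To address this issue efficiently, an approach based on knowledge distillation is proposed to improve rotation robustness without extra computational cost. Specifically, building on the base model, we propose Multi-Oriented Feature Aggregation (MOFA), which is subsequently adopted as the teacher in the distillation pipeline. Moreover, Rotated Kernel Fusion (RKF) is applied to each convolution kernel of the student model to facilitate learning rotation-invariant features. Finally, experiments show that our proposals generalize successfully under various rotations without additional cost at the inference stage.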
In this paper, we propose an end-to-end framework that jointly learns keypoint detection, descriptor representation and cross-frame matching for the task of image-based 3D localization. Prior art has tackled each of these components individually, purportedly aiming to alleviate difficulties in effectively training a holistic network. We design a self-supervised image warping correspondence loss for both feature detection and matching, a weakly-supervised epipolar constraints loss on relative camera pose learning, and a directional matching scheme that detects keypoint features in a source image and performs coarse-to-fine correspondence search on the target image. We leverage this framework to enforce cycle consistency in our matching module. In addition, we propose a new loss to robustly handle both definite inlier/outlier matches and less-certain matches. The integration of these learning mechanisms enables end-to-end training of a single network performing all three localization components. Benchmarking our approach on public datasets exemplifies how such an end-to-end framework is able to yield more accurate localization that outperforms both traditional methods and state-of-the-art weakly supervised methods.
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection, which has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information makes current approaches adopt large models for good performance. Therefore, it is essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques to train efficient yet accurate models. However, BEV KD is still under-explored to the best of our knowledge. Different from image classification tasks, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer the knowledge in the teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfying results due to heavy background information in RGB cameras. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate the foreground mask and view-dependent mask for the teacher-student paradigm. It is to be noted that our method only uses LiDAR points to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation to our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The enhanced descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and the state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2ms on desktop GPU and 27ms on embedded GPU to process 2000 features, which is fast enough to be applied to a practical system.
Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's \emph{distribution} of local predictions into the student network, facilitating its training. Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures.
This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our algorithm predicts the relative rotation and translation between the two respective cameras. Despite recent progress in the field, current deep-based methods exhibit only limited generalization to scenes not seen in training. Our approach introduces a network architecture that extracts a grid of coarse features for each input image using the pre-trained LoFTR network. It subsequently relates corresponding features in the two images, and finally uses a convolutional network to recover the relative rotation and translation between the respective cameras. Our experiments indicate that the proposed architecture can generalize to novel scenes, obtaining higher accuracy than existing deep-learning-based methods in various settings and datasets, in particular with limited training data.