In many computer vision pipelines, establishing a set of sparse keypoint correspondences between images is a fundamental task. Typically, this translates into a computationally expensive nearest-neighbor search, in which every keypoint descriptor of one image must be compared against all descriptors of the other image. To lower the computational cost of the matching stage, we propose a deep feature-extraction network that detects complementary sets of keypoints in each image. Since only descriptors within the same set need to be compared across images, the computational complexity of the matching stage decreases with the number of sets. We train our network to predict keypoints and jointly compute the corresponding descriptors. In particular, to learn complementary sets of keypoints, we introduce a novel unsupervised loss that penalizes intersections among the different sets. Additionally, we propose a novel descriptor-based weighting scheme designed to penalize the detection of keypoints with non-discriminative descriptors. Through extensive experiments, we show that our feature-extraction network, trained only on synthetically warped images and in a fully unsupervised manner, achieves competitive results on 3D reconstruction and re-localization tasks while reducing matching complexity.
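To make the complexity argument concrete, here is a minimal numpy sketch (all function and variable names are ours, not the paper's) of set-wise mutual nearest-neighbor matching: with K complementary sets, descriptors are compared only within the same set, so the number of descriptor comparisons drops by roughly a factor of K.

```python
import numpy as np

def match_within_sets(desc_a, sets_a, desc_b, sets_b, num_sets):
    """Match descriptors only within the same complementary set.

    desc_*: (N, D) L2-normalized descriptors; sets_*: (N,) set index
    per keypoint. Compared to exhaustive matching, the number of
    descriptor comparisons shrinks roughly by the number of sets.
    """
    matches = []
    for k in range(num_sets):
        ia = np.flatnonzero(sets_a == k)
        ib = np.flatnonzero(sets_b == k)
        if ia.size == 0 or ib.size == 0:
            continue
        sim = desc_a[ia] @ desc_b[ib].T           # cosine similarity
        nn_ab = sim.argmax(axis=1)                # best match a -> b
        nn_ba = sim.argmax(axis=0)                # best match b -> a
        for row, col in enumerate(nn_ab):         # keep mutual NNs only
            if nn_ba[col] == row:
                matches.append((ia[row], ib[col]))
    return matches
```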
Sparse local feature extraction plays an important role in typical vision tasks such as simultaneous localization and mapping, image matching, and 3D reconstruction. However, it still has deficiencies that need further improvement, mainly the discriminative power of extracted local descriptors, the localization accuracy of detected keypoints, and the efficiency of local feature learning. This paper focuses on promoting the currently popular sparse local feature learning with camera pose supervision. To this end, it proposes a Shared Coupling-bridge scheme with four lightweight yet effective improvements for weakly-supervised local feature (SCFeat) learning. It mainly contains: i) a Feature-Fusion-ResUNet Backbone (F2R-Backbone) for local descriptor learning, ii) a shared coupling-bridge normalization to improve the decoupled training of the description network and the detection network, iii) an improved detection network with peakiness measurement to detect keypoints, and iv) the fundamental-matrix error as a reward factor to further optimize feature detection training. Extensive experiments show that our SCFeat improvements are effective, often achieving state-of-the-art performance on classic image matching and visual localization, and still competitive results on 3D reconstruction. Our source code is available at https://github.com/sunjiayuanro/SCFeat.git.
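The abstract does not define its peakiness measurement precisely; a common formulation (ASLFeat-style, so the paper's exact variant may differ) scores a location highly when its response stands out both from its spatial neighborhood and from the other channels, as in this hedged PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def peakiness_score(feat):
    """Peakiness of a dense feature map as a keypoint score.

    feat: (B, C, H, W) dense features. A location scores high when
    its response exceeds both its local spatial average and the
    per-location channel average (a common formulation; the paper's
    exact measurement may differ).
    """
    avg_spatial = F.avg_pool2d(feat, kernel_size=3, stride=1, padding=1)
    alpha = F.softplus(feat - avg_spatial)                    # spatial peakiness
    beta = F.softplus(feat - feat.mean(dim=1, keepdim=True))  # channel peakiness
    score = (alpha * beta).max(dim=1).values                  # best channel wins
    return score / (score.amax(dim=(1, 2), keepdim=True) + 1e-8)
```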
Existing methods detect keypoints in a non-differentiable way, so they cannot directly optimize keypoint positions through back-propagation. To address this issue, we present a differentiable keypoint detection module that outputs accurate sub-pixel keypoints. A reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and a dispersity peak loss is presented for accurate keypoint regularization. We also extract descriptors in a sub-pixel manner, and they are trained with a stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves performance on par with state-of-the-art approaches while greatly reducing inference time.
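The abstract does not spell out the detector's exact form; one standard way to obtain differentiable sub-pixel keypoints, sketched below with an assumed window size and temperature, is a soft-argmax over a local patch of the score map, so gradients can flow through the keypoint coordinates.

```python
import torch
import torch.nn.functional as F

def soft_argmax_refine(score_map, peaks, radius=2, temperature=0.1):
    """Refine integer peak locations to sub-pixel accuracy.

    score_map: (H, W) detection scores; peaks: (N, 2) integer (y, x)
    coordinates, assumed at least `radius` pixels from the border.
    A softmax-weighted average of coordinates inside a local window
    replaces the non-differentiable argmax.
    """
    offsets = torch.arange(-radius, radius + 1, dtype=score_map.dtype)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")
    refined = []
    for y, x in peaks:
        patch = score_map[y - radius:y + radius + 1,
                          x - radius:x + radius + 1]
        w = F.softmax(patch.flatten() / temperature, dim=0)
        refined.append(torch.stack([
            y + (w * dy.flatten()).sum(),   # sub-pixel row
            x + (w * dx.flatten()).sum(),   # sub-pixel column
        ]))
    return torch.stack(refined)              # (N, 2) float coordinates
```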
Despite the advances in local feature extraction achieved by handcrafted and learning-based descriptors, they are still limited by a lack of invariance to non-rigid transformations. In this paper, we present a new approach for computing features from still images that are robust to non-rigid deformations, to circumvent the problem of matching deformable surfaces and objects. Our deformation-aware local descriptor, named DEAL, leverages polar sampling and a spatial transformer warping to provide invariance to rotation, scale, and image deformations. We train the model architecture end-to-end by applying isometric non-rigid deformations to objects in a simulated environment as guidance to provide highly discriminative local features. The experiments show that our method outperforms state-of-the-art handcrafted, learning-based image, and RGB-D descriptors on different datasets with real and realistic synthetic deformable objects in still images. The source code and trained model of the descriptor are publicly available at https://www.verlab.dcc.ufmg.br/descriptors/neUrips2021.
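As a rough illustration of the polar-sampling component, using OpenCV's log-polar warp as a stand-in for the paper's learned sampler: under log-polar sampling, rotations and scalings of the image become translations of the patch, which is what underpins the rotation and scale invariance.

```python
import cv2

def log_polar_patch(img, center, max_radius=32.0, size=32):
    """Resample a local region on a log-polar grid around a keypoint.

    img: grayscale or color image; center: (x, y) keypoint location.
    Rotation/scale changes of the input turn into shifts of the patch.
    (The paper additionally applies a learned spatial-transformer warp
    to handle non-rigid deformation, which this sketch omits.)
    """
    return cv2.warpPolar(
        img, (size, size), center, max_radius,
        cv2.WARP_POLAR_LOG | cv2.INTER_LINEAR)
```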
We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The enhanced descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2 ms on a desktop GPU and 27 ms on an embedded GPU to process 2000 features, which is fast enough to be applied to a practical system.
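A condensed PyTorch sketch of the two-stage idea follows; the dimensions, the geometric encoding, and the layer counts are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DescriptorBooster(nn.Module):
    """Self-boost each descriptor with an MLP, then cross-boost all
    descriptors of the image jointly with a Transformer encoder."""

    def __init__(self, desc_dim=256, geom_dim=4):
        super().__init__()
        self.self_boost = nn.Sequential(            # per-descriptor MLP
            nn.Linear(desc_dim + geom_dim, desc_dim),
            nn.ReLU(),
            nn.Linear(desc_dim, desc_dim),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=desc_dim, nhead=4, batch_first=True)
        self.cross_boost = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, desc, geom):
        # desc: (B, N, desc_dim) descriptors; geom: (B, N, geom_dim)
        # keypoint properties (e.g. x, y, scale, orientation).
        x = self.self_boost(torch.cat([desc, geom], dim=-1))
        x = self.cross_boost(x)                     # attend across keypoints
        return nn.functional.normalize(x, dim=-1)   # re-normalize descriptors
```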
Weakly supervised learning can help local feature methods overcome the obstacle of acquiring large-scale datasets with densely labeled correspondences. However, since weak supervision cannot distinguish the losses caused by the detection step from those caused by the description step, directly conducting weakly supervised learning within a joint describe-then-detect pipeline yields limited performance. In this paper, we propose a decoupled describe-then-detect pipeline tailored for weakly supervised local feature learning. Within our pipeline, the detection step is decoupled from the description step and postponed until discriminative and robust descriptors have been learned. In addition, we introduce a line-to-window search strategy to explicitly use the camera pose information for better descriptor learning. Extensive experiments show that our method, namely PoSFeat (camera Pose Supervised Feature), outperforms previous fully and weakly supervised methods and achieves state-of-the-art performance on various downstream tasks.
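The line-to-window search can be pictured as restricting candidate matches to a band around the epipolar line induced by the camera poses; the following numpy sketch is a hedged illustration of that idea (the fundamental matrix F and the band width are assumed inputs, and the paper's coarse-to-fine refinement is omitted).

```python
import numpy as np

def epipolar_window_search(kpt, desc, F, kpts_tgt, descs_tgt, width=4.0):
    """Match one source keypoint by searching near its epipolar line.

    kpt: (2,) source pixel; desc: (D,) its descriptor; F: (3, 3)
    fundamental matrix derived from the camera poses. Candidates are
    target keypoints within `width` pixels of the line l = F [x, y, 1]^T.
    """
    l = F @ np.array([kpt[0], kpt[1], 1.0])
    dist = np.abs(kpts_tgt @ l[:2] + l[2]) / np.hypot(l[0], l[1])
    cand = np.flatnonzero(dist < width)        # keypoints inside the band
    if cand.size == 0:
        return None
    sims = descs_tgt[cand] @ desc              # cosine similarity
    return cand[sims.argmax()]                 # index into kpts_tgt
```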
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at github.com/magicleap/SuperGluePretrainedNetwork.
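The optimal-transport step can be illustrated in a few lines. Below is a log-domain Sinkhorn iteration over a predicted score matrix; it omits SuperGlue's learned dustbin row/column for unmatchable points, so it is an illustration of the mechanism rather than the released implementation.

```python
import torch

def log_sinkhorn(scores, num_iters=20):
    """Differentiable (log-domain) Sinkhorn normalization of a score
    matrix into a soft assignment with near-uniform marginals.

    scores: (M, N) matching scores predicted by the graph neural
    network. Returns log-assignment; exp() gives an (approximately)
    doubly stochastic matrix.
    """
    log_p = scores
    for _ in range(num_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # cols
    return log_p

# Hard matches are then typically extracted by mutual-max selection
# plus a confidence threshold on exp(log_p).
```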
In this paper, we propose to go beyond the well-established approach to vision-based localization that relies on visual descriptor matching between a query image and a 3D point cloud. While matching keypoints via visual descriptors makes localization highly accurate, it comes with significant storage demands, raises privacy concerns, and requires long-term updates of the descriptors. To elegantly address these practical challenges of large-scale localization, we present GoMatch, an alternative to visual-based matching that relies solely on geometric information for matching image keypoints to maps, represented as sets of bearing vectors. Our novel bearing-vector representation of 3D points significantly mitigates the cross-modal challenge in geometric-based matching that prevented prior work from tackling localization in realistic environments. With additional careful architecture design, GoMatch improves over prior geometric-based matching work with a reduction of (10.67m, 95.7°) and (1.43m, 34.7°) in average median pose errors on the Cambridge Landmarks and 7-Scenes benchmarks, while requiring as little as 1.5/1.7% of the storage capacity of the best visual-based matching methods. This confirms its potential and feasibility for real-world localization, and opens the door to future efforts on city-scale visual localization methods that do not require storing visual descriptors.
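The bearing-vector representation itself is simple: each keypoint is reduced to a unit ray through the camera, so no visual descriptor needs to be stored. A small numpy sketch, assuming a pinhole camera model:

```python
import numpy as np

def keypoints_to_bearing_vectors(kpts, K):
    """Convert pixel keypoints to unit bearing vectors.

    kpts: (N, 2) pixel coordinates (x, y); K: (3, 3) pinhole
    intrinsics. Each keypoint becomes the unit-norm ray
    K^{-1} [x, y, 1]^T, which is all the geometric matcher sees.
    """
    ones = np.ones((kpts.shape[0], 1))
    rays = np.linalg.solve(K, np.hstack([kpts, ones]).T).T   # (N, 3)
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)
```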
In this paper, we address the problem of estimating scale factors between images. We formulate the scale estimation problem as a prediction of a probability distribution over scale factors. We design a new architecture, ScaleNet, that exploits dilated convolutions as well as self- and cross-correlation layers to predict the scale between images. We demonstrate that rectifying images with the estimated scale leads to significant performance improvements for various tasks and methods. Specifically, we show how ScaleNet can be combined with sparse local features and dense correspondence networks to improve camera pose estimation, 3D reconstruction, or dense geometric matching across different benchmarks and datasets. We provide an extensive evaluation on multiple tasks and analyze the computational overhead of ScaleNet. The code, evaluation protocols, and trained models are publicly available at https://github.com/axelbarroso/scalenet.
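Since scale estimation is framed as a distribution over scale factors, a hedged sketch of reducing that prediction to a single ratio and rectifying an image pair follows; the log-space bins, the even split of the correction between the two images, and the use of OpenCV resizing are our assumptions.

```python
import numpy as np
import cv2

def rectify_pair(img_a, img_b, probs, log2_bins):
    """Resize an image pair so their scales roughly agree.

    probs: predicted distribution over discrete scale ratios;
    log2_bins: log2 of the candidate ratios scale(a)/scale(b).
    The expected ratio is computed in log-space, then the
    correction is shared evenly between the two images.
    """
    log2_ratio = float(np.dot(probs, log2_bins))   # expected log2 ratio
    s = 2.0 ** (log2_ratio / 2.0)                  # split the correction
    a = cv2.resize(img_a, None, fx=1.0 / s, fy=1.0 / s)
    b = cv2.resize(img_b, None, fx=s, fy=s)
    return a, b
```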
Missions to small celestial bodies rely heavily on optical feature tracking for characterization and relative navigation. While deep learning has led to enormous progress in feature detection and description, training and validating data-driven models for space applications is challenging due to the limited availability of large-scale, annotated datasets. This paper introduces AstroVision, a large-scale dataset comprised of 115,970 densely annotated, real images of 16 different small bodies captured during past and ongoing missions. We leverage AstroVision to develop a set of standardized benchmarks and conduct an exhaustive evaluation of both handcrafted and data-driven feature detection and description methods. Next, we employ AstroVision for end-to-end training of a state-of-the-art, deep feature detection and description network and demonstrate improved performance on multiple benchmarks. The full benchmark pipeline and the dataset will be made publicly available to facilitate the advancement of computer vision algorithms for space applications.
In this paper, we propose an end-to-end framework that jointly learns keypoint detection, descriptor representation and cross-frame matching for the task of image-based 3D localization. Prior art has tackled each of these components individually, purportedly to alleviate the difficulty of effectively training a holistic network. We design a self-supervised image-warping correspondence loss for both feature detection and matching, a weakly-supervised epipolar-constraint loss on relative camera pose learning, and a directional matching scheme that detects keypoint features in a source image and performs coarse-to-fine correspondence search on the target image. We leverage this framework to enforce cycle consistency in our matching module. In addition, we propose a new loss to robustly handle both definite inlier/outlier matches and less-certain matches. The integration of these learning mechanisms enables end-to-end training of a single network performing all three localization components. Benchmarking our approach on public datasets exemplifies how such an end-to-end framework yields more accurate localization, outperforming both traditional methods and state-of-the-art weakly supervised methods.
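The cycle-consistency idea can be illustrated with a forward-backward check: a correspondence survives only if matching source-to-target and back returns to the starting keypoint. This is a generic numpy sketch of the principle, not the paper's exact formulation.

```python
import numpy as np

def cycle_consistent_matches(desc_src, desc_tgt):
    """Keep only matches that survive a source->target->source cycle.

    desc_*: (N, D) L2-normalized descriptors. Equivalent to mutual
    nearest-neighbor filtering on the cosine-similarity matrix.
    """
    sim = desc_src @ desc_tgt.T
    fwd = sim.argmax(axis=1)                 # source -> target
    bwd = sim.argmax(axis=0)                 # target -> source
    src_idx = np.arange(desc_src.shape[0])
    keep = bwd[fwd] == src_idx               # cycle returns home
    return np.stack([src_idx[keep], fwd[keep]], axis=1)
```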
Erroneous feature matches have a severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs by 6.7% over SuperGlue on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.8% compared to SuperGlue.
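To see how confidence weights can enter a differentiable pose solver, consider a weighted Kabsch alignment of 3D-3D correspondences: low-weight matches barely influence the pose, which is what lets the network learn to down-weight outliers without a RANSAC loop. This standalone numpy version only illustrates the "weighted constraints" idea and is not the paper's solver.

```python
import numpy as np

def weighted_kabsch(p, q, w):
    """Rigid transform (R, t) minimizing sum_i w_i ||R p_i + t - q_i||^2.

    p, q: (N, 3) corresponding 3D points; w: (N,) confidence weights
    predicted by the matcher. Differentiable through the SVD, so pose
    gradients can flow back into the matching network.
    """
    w = w / w.sum()
    mu_p = w @ p                                   # weighted centroids
    mu_q = w @ q
    cov = (p - mu_p).T @ np.diag(w) @ (q - mu_q)   # weighted covariance
    u, _, vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(vt.T @ u.T))         # avoid reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_q - r @ mu_p
    return r, t
```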
This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.
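In outline, Homographic Adaptation aggregates the detector's response over many random homographies of the input and averages the unwarped heatmaps. A condensed sketch follows, with homography sampling simplified to random corner jitter; the sampling scheme and window counts are assumptions, not the paper's exact recipe.

```python
import numpy as np
import cv2

def homographic_adaptation(img, detect_fn, num_h=32, jitter=0.15):
    """Aggregate a detector's heatmap over random homographies.

    detect_fn: img -> (H, W) interest-point heatmap. Each warped view
    is scored, the heatmap is warped back, and responses are averaged
    where valid, boosting repeatability of the final detections.
    """
    h, w = img.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    acc = detect_fn(img).astype(np.float32)
    count = np.ones((h, w), np.float32)
    for _ in range(num_h):
        jittered = (corners + np.random.uniform(
            -jitter, jitter, (4, 2)) * [w, h]).astype(np.float32)
        H = cv2.getPerspectiveTransform(corners, jittered)
        warped = cv2.warpPerspective(img, H, (w, h))
        heat = detect_fn(warped).astype(np.float32)
        acc += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
        count += cv2.warpPerspective(np.ones((h, w), np.float32),
                                     np.linalg.inv(H), (w, h))
    return acc / count
```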
Most image matching methods perform poorly when encountering large scale changes between images. To solve this problem, we first propose a scale-difference-aware image matching method (SDAIM) that reduces image scale differences before local feature extraction by resizing both images of an image pair according to an estimated scale ratio. Second, in order to accurately estimate the scale ratio, we propose a covisibility-attention-reinforced matching module (CVARM) and then design a novel neural network based on CVARM, termed Scale-Net. The proposed CVARM lays more stress on covisible areas within the image pair and suppresses the distraction from areas visible in only one image. Quantitative and qualitative experiments confirm that the proposed Scale-Net has higher scale ratio estimation accuracy and much better generalization ability than all existing scale ratio estimation methods. Further experiments on image matching and relative pose estimation tasks demonstrate that our SDAIM and Scale-Net are able to greatly boost the performance of representative local features and state-of-the-art local feature matching methods.
We introduce a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each one of these problems individually, we show how to learn to do all three in a unified manner while preserving end-to-end differentiability. We then demonstrate that our Deep pipeline outperforms state-of-the-art methods on a number of benchmark datasets, without the need of retraining.
Accurate camera pose estimation is a fundamental requirement for many applications, such as autonomous driving, mobile robotics, and augmented reality. In this work, we address the problem of estimating the global 6-DoF camera pose from a single RGB image in a given environment. Previous works consider every part of the image valuable for localization. However, many image regions, such as the sky, occlusions, and repetitive non-distinctive patterns, cannot be used for localization. Besides adding unnecessary computational effort, extracting and matching features from such regions produces many wrong matches, which decreases localization accuracy and efficiency. Our work addresses this particular issue and shows that, by leveraging the concept of informative 3D models, we can exploit discriminative environment parts and avoid useless image regions for single-image localization. Interestingly, by avoiding the selection of keypoints from unreliable image regions such as trees, bushes, cars, pedestrians, and occlusions, our work acts naturally as an outlier filter. This makes our system highly efficient, as it needs a minimal set of correspondences, and highly accurate, owing to the small number of outliers. Our work exceeds state-of-the-art methods on the outdoor Cambridge Landmarks dataset. Relying only on a single image at inference, it exceeds the accuracy of methods that exploit pose priors and/or reference 3D models, while being much faster. By selecting as few as 100 correspondences, it surpasses similar methods that localize from thousands of correspondences, while being more efficient; in particular, it achieves a 33% localization improvement over these methods in the Old Hospital scene. Moreover, it even outperforms direct pose regressors learned from image sequences.
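The described "natural outlier filter" amounts to discarding keypoints that fall on unreliable semantic classes before matching. A minimal sketch, assuming a precomputed segmentation mask and using hypothetical class ids of our own choosing:

```python
import numpy as np

# Hypothetical label ids for classes deemed unreliable for localization.
UNRELIABLE = {"sky": 10, "tree": 8, "car": 13, "pedestrian": 11}

def filter_keypoints(kpts, seg_mask):
    """Drop keypoints lying on transient or non-distinctive regions.

    kpts: (N, 2) pixel (x, y); seg_mask: (H, W) per-pixel class ids.
    Far fewer correspondences are then needed downstream, because few
    of the surviving keypoints produce mismatches.
    """
    labels = seg_mask[kpts[:, 1].astype(int), kpts[:, 0].astype(int)]
    keep = ~np.isin(labels, list(UNRELIABLE.values()))
    return kpts[keep]
```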
Point cloud registration is a fundamental task for many applications, such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Existing learning-based methods require high computational power to process a large number of raw points simultaneously. Although these approaches achieve convincing results, they are hard to apply in real-world situations due to the high computational cost. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of DFGAT is responsible for finding highly reliable keypoints in large raw datasets. The descriptor of DFGAT combines these keypoints with their neighbors to extract invariant dense features in preparation for matching. The graph attention network uses an attention mechanism that enriches the relationships between point clouds. Finally, we treat this as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We conduct thorough tests on the KITTI dataset to evaluate the effectiveness of this approach. The results show that, compared with other state-of-the-art methods, this efficient and compact keypoint selection and description achieves the best matching metrics and reaches the highest registration success rate of 99.88%.
Local feature detection is a key ingredient of many image processing and computer vision applications, such as visual odometry and localization. Most existing algorithms focus on feature detection from a sharp image. They would thus have degraded performance once the image is blurred, which could happen easily under low-lighting conditions. To address this issue, we propose a simple yet both efficient and effective keypoint detection method that is able to accurately localize the salient keypoints in a blurred image. Our method takes advantage of a novel multi-layer perceptron (MLP) based architecture that significantly improves the detection repeatability for a blurred image. The network is also light-weight and able to run in real-time, which enables its deployment for time-constrained applications. Extensive experimental results demonstrate that our detector is able to improve the detection repeatability with blurred images, while keeping comparable performance as existing state-of-the-art detectors for sharp images.
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with a jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images is incompletely labeled and used to supervise the network to detect keypoints on vascular trees. To attack the incompleteness of manual labeling, we propose Progressive Keypoint Expansion to enrich the keypoint labels at each training epoch. By utilizing a keypoint-based improved triplet loss as its description loss, SuperRetina produces highly discriminative descriptors at full input image size. Extensive experiments on multiple real-world datasets justify the viability of SuperRetina. Even with manual labeling replaced by auto labeling, thus making the training process fully manual-annotation free, SuperRetina compares favorably against a number of strong baselines on two RIM tasks, i.e., image registration and identity verification. SuperRetina will be open source.
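Progressive Keypoint Expansion can be pictured as a pseudo-labeling loop: after an epoch, confident detections that are far from every existing label are promoted to labels for the next epoch. The sketch below is schematic; the threshold, the distance criterion, and the omitted non-maximum suppression are our assumptions.

```python
import numpy as np

def expand_labels(labels, heatmap, score_thresh=0.5, min_dist=8):
    """Add confident, novel detections to the keypoint label set.

    labels: (M, 2) current (x, y) labels (assumed non-empty);
    heatmap: (H, W) detector output. Peaks above score_thresh that lie
    at least min_dist pixels from every existing label become labels
    for the next training epoch.
    """
    ys, xs = np.where(heatmap > score_thresh)
    new = []
    for x, y in zip(xs, ys):
        d = np.hypot(labels[:, 0] - x, labels[:, 1] - y)
        if d.min() > min_dist:
            new.append((x, y))
    if new:
        labels = np.vstack([labels, np.array(new)])
    return labels
```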
Feature matching and finding correspondences between endoscopic images is a key step in many clinical applications, such as rapid anomaly localization from clinical sequences. Nonetheless, due to the high texture variability present in endoscopic images, developing robust and accurate feature matching remains a challenging task. Recently, deep learning techniques that extract features with convolutional neural networks (CNNs) have gained traction in a variety of computer vision tasks. However, they all follow a supervised learning scheme, in which a large amount of annotated data is required to achieve good performance; such data is not always available for medical databases. To overcome the limitation related to the scarcity of labeled data, the self-supervised learning paradigm has recently shown great success in many applications. This paper proposes a novel self-supervised approach for endoscopic image matching based on deep learning techniques. Compared with standard handcrafted local feature descriptors, our method outperforms them in terms of precision and recall. Furthermore, our self-supervised descriptor provides competitive performance in comparison with state-of-the-art deep-learning-based supervised approaches in terms of precision and matching score.