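A distinctive representation of image patches in the form of features is a key component of many computer vision and robotics tasks, such as image matching, image retrieval, and visual localization. State-of-the-art descriptors, from hand-crafted descriptors such as SIFT to learned ones such as HardNet, are usually high-dimensional: 128 dimensions or even more. The higher the dimensionality, the larger the memory consumption and computational time of approaches that use such descriptors. In this paper, we investigate multi-layer perceptrons (MLPs) to extract low-dimensional but high-quality descriptors. We thoroughly analyze our method in unsupervised, self-supervised, and supervised settings, and evaluate the dimensionality reduction results on four representative descriptors. We consider different applications, including visual localization, patch verification, image matching, and retrieval. The experiments show that our lightweight MLPs achieve better dimensionality reduction than PCA. The lower-dimensional descriptors generated by our approach outperform the original higher-dimensional descriptors in downstream tasks, especially for the hand-crafted ones. The code will be available at https://github.com/prbonn/descriptor-dr.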
We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The enhanced descriptors can be either real-valued or binary. We use the proposed network to boost both hand-crafted (ORB, SIFT) and state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2 ms on a desktop GPU and 27 ms on an embedded GPU to process 2000 features, which is fast enough to be applied to a practical system.
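Feature matching and finding correspondences between endoscopic images is a key step in many clinical applications for quick abnormality localization from clinical sequences. Nonetheless, due to the high texture variability present in endoscopic images, developing robust and accurate feature matching is a challenging task. Recently, deep learning techniques based on features extracted by convolutional neural networks (CNNs) have gained traction in a wide range of computer vision tasks. However, they all follow a supervised learning scheme in which a large amount of annotated data is required to reach good performance, and such data is often not available for medical databases. To overcome the limitations related to the scarcity of labeled data, the self-supervised learning paradigm has recently shown great success in a number of applications. This paper proposes a novel self-supervised approach for endoscopic image matching based on deep learning techniques. Our method outperforms standard hand-crafted local feature descriptors in terms of precision and recall. Furthermore, our self-supervised descriptor provides competitive performance in terms of precision and matching score when compared with a selection of state-of-the-art supervised deep learning methods.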
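Despite the advances in local feature extraction achieved by hand-crafted and learning-based descriptors, they are still limited by their lack of invariance to non-rigid transformations. In this paper, we present a new approach to compute features from still images that are robust to non-rigid deformations, in order to circumvent the problem of matching deformable surfaces and objects. Our deformation-aware local descriptor, named DEAL, leverages polar sampling and spatial transformer warping to provide invariance to rotation, scale, and image deformations. We train the model architecture end-to-end by applying isometric non-rigid deformations to objects in a simulated environment as guidance to provide highly discriminative local features. The experiments show that our method outperforms state-of-the-art hand-crafted, learning-based image, and RGB-D descriptors on different datasets with both real and realistic synthetic deformable objects in still images. The source code and trained model of the descriptor are publicly available at https://www.verlab.dcc.ufmg.br/descriptors/neUrips2021.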
Advanced visual localization techniques, such as hierarchical localization, encompass both image retrieval and 6 Degree-of-Freedom (DoF) camera pose estimation. Thus, they must extract global and local features from input images. Previous methods have achieved this through resource-intensive or accuracy-reducing means, such as combinatorial pipelines or multi-task distillation. In this study, we present a novel method called SuperGF, which effectively unifies local and global features for visual localization, leading to a better trade-off between localization accuracy and computational efficiency. Specifically, SuperGF is a transformer-based aggregation model that operates directly on image-matching-specific local features and generates global features for retrieval. We conduct experimental evaluations of our method in terms of both accuracy and efficiency, demonstrating its advantages over other methods. We also provide implementations of SuperGF using various types of local features, including dense and sparse learning-based or hand-crafted descriptors.
Sparse local feature extraction is widely regarded as important for typical vision tasks such as simultaneous localization and mapping, image matching, and 3D reconstruction. It still has deficiencies that need further improvement, mainly in the discriminative power of the extracted local descriptors, the localization accuracy of the detected keypoints, and the efficiency of local feature learning. This paper focuses on improving the currently popular sparse local feature learning with camera pose supervision. It proposes a Shared Coupling-bridge scheme with four lightweight yet effective improvements for weakly-supervised local feature (SCFeat) learning: i) the Feature-Fusion-ResUNet Backbone (F2R-Backbone) for local descriptor learning, ii) a shared coupling-bridge normalization to improve the decoupled training of the description network and the detection network, iii) an improved detection network with peakiness measurement to detect keypoints, and iv) the fundamental matrix error as a reward factor to further optimize feature detection training. Extensive experiments show that the SCFeat improvements are effective. SCFeat often obtains state-of-the-art performance on classic image matching and visual localization, and it achieves competitive results on 3D reconstruction. Our source code is available at https://github.com/sunjiayuanro/SCFeat.git.
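In light of recent analyses of privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other, such that the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results show that the visual descriptors obtained with our method significantly deteriorate image reconstruction quality with minimal impact on correspondence matching and camera localization performance.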
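In recent years, a vast amount of visual content has been generated and shared in many fields, such as social media platforms, medical imaging, and robotics. This abundance of content creation and sharing has introduced new challenges, particularly in searching databases for similar content: Content-Based Image Retrieval (CBIR), a long-established research area in which improved efficiency and accuracy are needed for real-time retrieval. Artificial intelligence has made progress in CBIR and has significantly facilitated instance search. In this survey, we review recent instance retrieval works developed on the basis of deep learning algorithms and techniques, organizing the survey by deep network architecture type, deep features, feature embedding methods, and network fine-tuning strategies. Our survey considers a wide variety of recent methods; we identify milestone works, reveal connections among the various methods, present the commonly used benchmarks, evaluation results, and common challenges, and propose future directions.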
We introduce a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each one of these problems individually, we show how to learn to do all three in a unified manner while preserving end-to-end differentiability. We then demonstrate that our Deep pipeline outperforms state-of-the-art methods on a number of benchmark datasets, without the need of retraining.
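We study the problem of learning to assign a characteristic pose, i.e., scale and orientation, to an image region of interest. Despite its apparent simplicity, the problem is non-trivial: it is hard to obtain a large-scale set of image regions with explicit pose annotations from which a model can learn directly. To tackle this issue, we propose a self-supervised learning framework with a histogram alignment technique. It generates pairs of image patches by random rescaling and rotating, and then trains an estimator to predict their scale and orientation values so that their relative difference is consistent with the rescaling and rotation used. The estimator learns to predict a non-parametric histogram distribution of scale and orientation without any supervision. Experiments show that it significantly outperforms previous methods in scale and orientation estimation, and that it also improves image matching and 6 DoF camera pose estimation when our patch poses are incorporated into the matching process.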
Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
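As a rough illustration of the Generalized-Mean (GeM) pooling mentioned above, the sketch below pools each channel of a feature map with a fixed exponent `p`. It is a minimal stand-in, not the authors' layer: in the abstract `p` is trainable, whereas here it is a plain argument, and the function name `gem_pool`, the nested-list input layout, and the `eps` clamp are assumptions made for this example.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over each channel.

    feature_map: list of channels, each a flat list of activations
    (hypothetical layout chosen for this sketch).
    p = 1 gives average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to a small positive value so the power is well defined.
        clamped = [max(x, eps) for x in channel]
        mean_pow = sum(x ** p for x in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


if __name__ == "__main__":
    fmap = [[1.0, 3.0], [2.0, 2.0]]
    print(gem_pool(fmap, p=1.0))   # average pooling per channel
    print(gem_pool(fmap, p=50.0))  # close to max pooling per channel
```

Setting `p` between these extremes interpolates between average and max pooling, which is the sense in which GeM generalizes both.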
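Weakly supervised learning can help local feature methods overcome the obstacle of acquiring a large-scale dataset with densely labeled correspondences. However, because weak supervision cannot distinguish the losses caused by the detection step from those caused by the description step, directly conducting weakly supervised learning within a joint describe-then-detect pipeline yields limited performance. In this paper, we propose a decoupled describe-then-detect pipeline tailored to weakly supervised local feature learning. Within our pipeline, the detection step is decoupled from the description step and postponed until discriminative and robust descriptors have been learned. In addition, we introduce a line-to-window search strategy that explicitly uses the camera pose information for better descriptor learning. Extensive experiments show that our method, PoSFeat (Camera Pose Supervised Feature), outperforms previous fully and weakly supervised methods and achieves state-of-the-art performance on a wide range of downstream tasks.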
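Existing methods detect keypoints in a non-differentiable way, so they cannot directly optimize the positions of keypoints through back-propagation. To address this issue, we present a differentiable keypoint detection module that outputs accurate sub-pixel keypoints. A reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and a dispersity peak loss is presented for accurate keypoint regularization. We also extract the descriptors in a sub-pixel way, and they are trained with a stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves performance equivalent to that of state-of-the-art approaches while greatly reducing the inference time.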
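Point cloud registration is a fundamental task for many applications such as localization, mapping, tracking, and reconstruction. Successful registration relies on extracting robust and discriminative geometric features. Existing learning-based methods require high computing capacity to process a large number of raw points simultaneously. Although these approaches achieve convincing results, they are difficult to apply in real-world situations because of their high computational cost. In this paper, we introduce a framework that efficiently and economically extracts dense features using a graph attention network for point cloud matching and registration (DFGAT). The detector of DFGAT is responsible for finding highly reliable keypoints in large raw data sets. The descriptor of DFGAT combines these keypoints with their neighbors to extract invariant dense features in preparation for matching. The graph attention network uses an attention mechanism that enriches the relationships between point clouds. Finally, we treat matching as an optimal transport problem and use the Sinkhorn algorithm to find positive and negative matches. We perform thorough tests on the KITTI dataset and evaluate the effectiveness of this approach. The results show that, with its efficient and compact keypoint selection and description, this method achieves the best matching metrics and reaches the highest registration success rate, 99.88%, in comparison with other state-of-the-art approaches.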
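To make the optimal transport step concrete, here is a minimal sketch of entropy-regularized Sinkhorn iterations on a small cost matrix. It is an illustration under stated assumptions, not the DFGAT implementation: the uniform marginals, the regularization strength `reg`, the iteration count, and the function name `sinkhorn` are all choices made for this example, and any mechanism the paper uses to reject unmatched points is not modeled.

```python
import math


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Return a soft assignment (transport plan) for a cost matrix.

    cost: list of rows, cost[i][j] = matching cost between source
    feature i and target feature j. Marginals are assumed uniform.
    Very large cost / reg ratios can underflow to zero in exp().
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-value / reg) for value in row] for row in cost]
    row_marginal = [1.0 / n] * n
    col_marginal = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
    for row in plan:
        print(row)
```

In this toy case most of the mass lands on the diagonal, so taking the largest entry per row recovers the low-cost pairing.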
Efficient detection and description of geometric regions in images is a prerequisite in visual systems for localization and mapping. Such systems still rely on traditional hand-crafted methods for efficient generation of lightweight descriptors, a common limitation of the more powerful neural network models that come with high compute and specific hardware requirements. In this paper, we focus on the adaptations required by detection and description neural networks to enable their use in computationally limited platforms such as robots, mobile, and augmented reality devices. To that end, we investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms. In addition, we revisit common practices in descriptor quantization and propose the use of a binary descriptor normalization layer, enabling the generation of distinctive binary descriptors with a constant number of ones. ZippyPoint, our efficient quantized network with binary descriptors, improves the network runtime speed, the descriptor matching speed, and the 3D model size, by at least an order of magnitude when compared to full-precision counterparts. These improvements come at a minor performance degradation as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization. Code and trained models will be released upon acceptance.
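The idea of binary descriptors with a constant number of ones can be illustrated with a simple top-k binarization followed by Hamming-distance comparison. This is only an inference-time stand-in written for this listing, not ZippyPoint's normalization layer, which is part of a trained network; the function names `binarize_top_k` and `hamming` and the choice of `k` are assumptions.

```python
def binarize_top_k(descriptor, k):
    """Set the k largest entries to 1 and the rest to 0.

    Every output therefore has exactly k ones (assuming
    k <= len(descriptor)). Ties keep the earlier index.
    """
    order = sorted(range(len(descriptor)),
                   key=lambda i: descriptor[i], reverse=True)
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(descriptor))]


def hamming(a, b):
    """Number of positions where two binary descriptors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    d1 = binarize_top_k([0.1, 0.9, -0.3, 0.5], k=2)
    d2 = binarize_top_k([0.8, 0.7, -0.1, 0.2], k=2)
    print(d1, d2, hamming(d1, d2))
```

With a fixed number of ones, the Hamming distance between two descriptors depends only on how many of their active positions overlap.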
We tackle the problem of large-scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
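We propose a method for Saimaa ringed seal (Pusa hispida saimensis) re-identification. Access to large image volumes through camera trapping and crowdsourcing provides new possibilities for animal monitoring and conservation and calls for automatic analysis methods, in particular for re-identifying individual animals in images. The proposed method, NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA), utilizes the permanent and unique pelage pattern of Saimaa ringed seals together with content-based image retrieval techniques. First, the query image is preprocessed and each seal instance is segmented. Next, the seal's pelage pattern is extracted using a U-net encoder-decoder based method. Then, CNN-based affine-invariant features are embedded and aggregated into Fisher Vectors. Finally, the cosine distance between Fisher Vectors is used to find the best match in a database of known individuals. We perform extensive experiments with various modifications of the method on a new, challenging Saimaa ringed seal re-identification dataset. In comparisons with alternative approaches, the proposed method is shown to produce the best re-identification accuracy on our dataset.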
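The final retrieval step described above, finding the database entry with the smallest cosine distance to the query, can be sketched as follows. The vectors here are tiny placeholders rather than real Fisher Vectors, and the names `cosine_distance`, `best_match`, and the dictionary-based database are assumptions for this example.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def best_match(query, database):
    """Return the key of the database vector closest to the query.

    database: dict mapping an individual's identifier to its
    aggregated descriptor (hypothetical layout for this sketch).
    """
    return min(database, key=lambda name: cosine_distance(query, database[name]))


if __name__ == "__main__":
    known = {"seal_a": [1.0, 0.0], "seal_b": [0.0, 1.0]}
    print(best_match([0.9, 0.1], known))
```

Because cosine distance ignores vector length, the ranking depends only on the direction of the aggregated descriptors.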
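Visual place recognition is a challenging task for applications such as autonomous driving navigation and mobile robot localization. Distracting elements present in complex scenes often lead to deviations in the perception of a visual place. To address this problem, it is crucial to integrate information from only the task-relevant regions into image representations. In this paper, we introduce TransVPR, a novel holistic place recognition model based on vision Transformers. It benefits from the desirable property of the self-attention operation in Transformers, which can naturally aggregate task-relevant features. Attentions from multiple levels of the Transformer, which focus on different regions of interest, are combined to generate a global image representation. In addition, the output tokens of the Transformer layers, filtered by the fused attention mask, are treated as key-patch descriptors and used to perform spatial matching to re-rank the candidates retrieved by the global image features. The whole model allows end-to-end training with a single objective and image-level supervision. TransVPR achieves state-of-the-art performance on several real-world benchmarks while maintaining low computational time and storage requirements.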
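Dimensionality reduction methods are unsupervised approaches that learn low-dimensional spaces in which some properties of the initial space, typically the notion of "neighborhood", are preserved. Such methods usually require propagation on large k-NN graphs or complicated optimization solvers. On the other hand, self-supervised learning approaches, typically used to learn representations from scratch, rely on simple and more scalable frameworks for learning. In this paper, we propose TLDR, a dimensionality reduction method for generic input spaces that ports the recent self-supervised learning framework of Zbontar et al. (2021) to the specific task of dimensionality reduction over arbitrary representations. We propose to use nearest neighbors to build pairs from a training set, and a redundancy reduction loss to learn an encoder that produces representations invariant across such pairs. TLDR is simple, easy to train, and broadly applicable. It consists of an offline nearest-neighbor computation step that can be highly approximated, and a straightforward learning process. Aiming for scalability, we focus on improving linear dimensionality reduction and show consistent gains on image and document retrieval tasks, e.g., gaining +4% mAP over PCA on ROxford for GeM-AP, and improving the performance of DINO on ImageNet or retaining it with a 10x compression.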
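The redundancy reduction loss referred to above can be sketched as follows, in the form commonly associated with the framework of Zbontar et al. (2021): the cross-correlation matrix between two batches of embeddings is pushed toward the identity. This is a plain-Python illustration, not the TLDR code; the function names, the nested-list batch layout, and the off-diagonal weight `lam=0.005` are assumptions made for this example.

```python
import math


def standardize(batch):
    """Zero-mean, unit-variance scaling of each dimension over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Fall back to 1.0 for constant dimensions to avoid dividing by zero.
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n) or 1.0
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def redundancy_reduction_loss(za, zb, lam=0.005):
    """Penalize deviation of the cross-correlation matrix from identity.

    za, zb: embeddings of the two elements of each pair, one row per pair.
    Diagonal terms enforce invariance; off-diagonal terms, weighted by
    lam, discourage redundant dimensions.
    """
    za, zb = standardize(za), standardize(zb)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            corr = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            loss += (1.0 - corr) ** 2 if i == j else lam * corr ** 2
    return loss


if __name__ == "__main__":
    decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    redundant = [[1.0, 1.0], [-1.0, -1.0]]
    print(redundancy_reduction_loss(decorrelated, decorrelated))
    print(redundancy_reduction_loss(redundant, redundant))
```

In the setting the abstract describes, the two rows of each pair would be the encoder outputs for a training point and one of its nearest neighbors.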
In this paper, we propose an end-to-end framework that jointly learns keypoint detection, descriptor representation, and cross-frame matching for the task of image-based 3D localization. Prior art has tackled each of these components individually, purportedly to alleviate the difficulty of effectively training a holistic network. We design a self-supervised image warping correspondence loss for both feature detection and matching, a weakly supervised epipolar constraint loss on relative camera pose learning, and a directional matching scheme that detects keypoint features in a source image and performs a coarse-to-fine correspondence search on the target image. We leverage this framework to enforce cycle consistency in our matching module. In addition, we propose a new loss to robustly handle both definite inlier/outlier matches and less certain matches. The integration of these learning mechanisms enables end-to-end training of a single network that performs all three localization components. Benchmarking our approach on public datasets shows that such an end-to-end framework yields more accurate localization, outperforming both traditional methods and state-of-the-art weakly supervised methods.