Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
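As a rough illustration of how a generalized mean can interpolate between average and max pooling, the following minimal sketch pools each channel of a feature map with a fixed exponent `p`. The function name `gem_pool`, the clamping constant `eps`, and the toy inputs are assumptions made for this example; in the abstract `p` is trainable, whereas here it is a plain argument.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over each channel.

    feature_map: list of channels, each a flat list of activations.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to a small positive value so fractional powers are defined.
        clamped = [max(x, eps) for x in channel]
        mean_pow = sum(x ** p for x in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# Toy feature map with two channels of four spatial locations each.
channels = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
avg_like = gem_pool(channels, p=1.0)   # behaves like average pooling
max_like = gem_pool(channels, p=50.0)  # close to max pooling
```

With `p=1.0` the first channel pools to its mean, and with a large exponent it moves toward its maximum, which is the sense in which the layer generalizes both operations.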
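In recent years, a vast amount of visual content has been generated and shared from many fields, such as social media platforms, medical imaging, and robotics. This abundance of content creation and sharing has introduced new challenges, particularly that of searching databases for similar content, known as content-based image retrieval (CBIR), a long-established research area in which improved efficiency and accuracy are needed for real-time retrieval. Artificial intelligence has made progress in CBIR and has significantly facilitated the process of instance search. In this survey, we review recent instance retrieval works developed on the basis of deep learning algorithms and techniques, with the survey organized by deep network architecture types, deep features, feature embedding methods, and network fine-tuning strategies. Our survey considers a wide variety of recent methods, in which we identify milestone work, reveal connections among various methods, present the commonly used benchmarks, evaluation results, and common challenges, and propose promising future directions.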
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
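The idea of a VLAD layer that can be trained by backpropagation can be sketched as follows: the hard assignment of each local descriptor to its nearest cluster is replaced by a softmax, so every step is differentiable. This is a minimal forward-pass sketch only. The function name `netvlad`, the parameter layout (`weights`, `biases`, `centers`), the two normalization steps, and the toy values are assumptions for illustration and are not taken from the abstract.

```python
import math


def netvlad(descriptors, centers, weights, biases):
    """Soft-assignment VLAD aggregation (forward pass only).

    descriptors: list of D-dimensional local descriptors.
    centers, weights: K lists of D values; biases: K values.
    Returns a flattened, L2-normalized vector of length K * D.
    """
    K, D = len(centers), len(centers[0])
    vlad = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Softmax over clusters replaces the hard nearest-center assignment.
        logits = [sum(w * v for w, v in zip(weights[k], x)) + biases[k]
                  for k in range(K)]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        for k in range(K):
            a = exps[k] / total
            for d in range(D):
                # Accumulate the weighted residual to cluster center k.
                vlad[k][d] += a * (x[d] - centers[k][d])
    # Normalize each cluster's residual sum, then the whole vector.
    flat = []
    for row in vlad:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        flat.extend(v / norm for v in row)
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# Toy example: three 2-D descriptors, two clusters (all values hypothetical).
toy_descriptors = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]]
toy_centers = [[1.0, 0.0], [0.0, 1.0]]
toy_weights = [[2.0, 0.0], [0.0, 2.0]]
toy_biases = [0.0, 0.0]
v = netvlad(toy_descriptors, toy_centers, toy_weights, toy_biases)
```

In a trainable layer, `weights`, `biases`, and `centers` would be learned parameters; here they are fixed lists so the computation can be followed by hand.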
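This work aims to improve instance retrieval with self-supervision. We find that fine-tuning with recently developed self-supervised learning (SSL) methods, such as SimCLR and MoCo, fails to improve the performance of instance retrieval. In this work, we identify that the representations learned for instance retrieval should be invariant to large variations in viewpoint, background, and so on, whereas the self-augmented positives used by current SSL methods cannot provide strong enough signals for learning robust instance-level representations. To overcome this problem, we propose InsCLR, a new SSL method built on instance-level contrast, which learns intra-class invariance by dynamically mining mini-batches and a memory bank during training. Extensive experiments show that InsCLR achieves similar or even better performance than state-of-the-art SSL methods on instance retrieval. Code is available at https://github.com/zeludeng/insclr.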
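Instance-level image retrieval (IIR), or simply instance retrieval, deals with the problem of finding all images in a dataset that contain a query instance (e.g., an object). This paper makes the first attempt to tackle this problem using contrastive learning (CL) based on instance discrimination. Although CL has shown impressive performance on many computer vision tasks, similar success has never been found in the field of IIR. In this work, we approach the problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models. First, we investigate the efficacy of transfer learning in IIR by comparing off-the-shelf features learned by a pre-trained deep neural network (DNN) classifier with features learned by a CL model. The findings inspired us to propose a new training strategy that optimizes CL towards learning IIR-oriented features, using an average precision (AP) loss together with a fine-tuning method to learn contrastive feature representations tailored to IIR. Our empirical evaluation demonstrates a significant performance improvement over the off-the-shelf features learned by a pre-trained DNN classifier on the challenging Oxford and Paris datasets.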
Deep convolutional networks have proven to be very successful in learning task-specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges when approaching a new task. In this paper, we aim for generic feature learning and present an approach for training a convolutional network using only unlabeled data. To this end, we train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled 'seed' image patch. In contrast to supervised network training, the resulting feature representation is not class-specific. It rather provides robustness to the transformations that have been applied during training. This generic feature representation allows for classification results that outperform the state of the art for unsupervised learning on several popular datasets. While such generic features cannot compete with class-specific features from supervised training on a classification task, we show that they are advantageous on geometric matching problems, where they also outperform the SIFT descriptor.
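Visual place recognition (VPR) is a challenging task because of the imbalance between enormous computational cost and high recognition performance. Thanks to the practical feature extraction ability of lightweight convolutional neural networks (CNNs) and the trainability of the vector of locally aggregated descriptors (VLAD) layer, we propose a lightweight, weakly supervised, end-to-end neural network consisting of a front-end perception model called GhostCNN and a learnable VLAD layer as the back end. GhostCNN is based on Ghost modules, which are lightweight CNN architectures. They can generate redundant feature maps using linear operations instead of the traditional convolution process, making a good trade-off between computational resources and recognition accuracy. To further enhance the proposed lightweight model, we add dilated convolutions to the Ghost module to obtain features containing more spatial semantic information, which improves accuracy. Finally, extensive experiments on a commonly used public benchmark and our private dataset validate the proposed neural network, which reduces the FLOPs and parameters of VGG16-NetVLAD by 99.04% and 80.16%, respectively. In addition, both models achieve similar accuracy.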
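We propose a method for Saimaa ringed seal (Pusa hispida saimensis) re-identification. Access to large image volumes through camera trapping and crowdsourcing provides novel possibilities for animal monitoring and conservation and calls for automatic analysis methods, in particular for re-identifying individual animals from images. The proposed method, NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA), utilizes the permanent and unique pelage pattern of Saimaa ringed seals and content-based image retrieval techniques. First, the query image is preprocessed and each seal instance is segmented. Next, the seal's pelage pattern is extracted using a method based on a U-Net encoder-decoder. Then, CNN-based affine-invariant features are embedded and aggregated into Fisher vectors. Finally, the cosine distance between the Fisher vectors is used to find the best match from a database of known individuals. We perform extensive experiments with various modifications of the method on a new, challenging Saimaa ringed seal re-identification dataset. In comparisons with alternative approaches, the proposed method is shown to produce the best re-identification accuracy on our dataset.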
A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.
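Quality feature representation is key to instance image retrieval. To attain it, existing methods usually resort to a deep model pre-trained on benchmark datasets or fine-tune a model with a task-dependent labelled auxiliary dataset. Although it achieves promising results, this approach is restricted by two issues: 1) the domain gap between the benchmark datasets and the dataset of a given retrieval task; 2) the required auxiliary dataset cannot be readily obtained. In light of this situation, this work looks into a different approach that has not been well investigated for instance image retrieval before: can we learn feature representation specific to a given retrieval task in order to achieve excellent retrieval? Our finding is encouraging. By adding an object proposal generator to generate image regions for self-supervised learning, the investigated approach can successfully learn feature representation specific to a given dataset for retrieval. This representation can be made even more effective by boosting it with image similarity information mined from the dataset. As experimentally validated, this simple "self-supervised learning + self-boosting" approach competes well with the relevant state-of-the-art retrieval methods. An ablation study is conducted to show the appeal of this approach and its limitation in generalizing across datasets.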
We introduce a novel Deep Network architecture that implements the full feature point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each one of these problems individually, we show how to learn to do all three in a unified manner while preserving end-to-end differentiability. We then demonstrate that our Deep pipeline outperforms state-of-the-art methods on a number of benchmark datasets, without the need of retraining.
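The concept of geo-localization refers to the process of determining where on Earth some "entity" is located, typically using Global Positioning System (GPS) coordinates. The entity of interest may be an image, a sequence of images, a video, a satellite image, or even objects visible within an image. As massive datasets of GPS-tagged media have rapidly become available thanks to smartphones and the internet, and as deep learning has risen to enhance the performance of machine learning models, the fields of visual and object geo-localization have emerged because of their significant impact on a wide range of applications such as augmented reality, robotics, self-driving vehicles, road maintenance, and 3D reconstruction. This paper provides a comprehensive survey of geo-localization involving images, which covers either determining where an image was captured (image geo-localization) or geo-locating objects within an image (object geo-localization). We provide an in-depth study, including a summary of popular algorithms, a description of proposed datasets, and an analysis of performance results, to illustrate the current state of each field.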
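Visual place recognition is a challenging task for applications such as autonomous driving navigation and mobile robot localization. Distracting elements present in complex scenes often lead to deviations in the perception of a visual place. To address this problem, it is crucial to integrate information from only task-relevant regions into image representations. In this paper, we introduce TransVPR, a novel holistic place recognition model based on vision Transformers. It benefits from the desirable property of the self-attention operation in Transformers, which can naturally aggregate task-relevant features. Attention from multiple levels of the Transformer, each focusing on different regions of interest, is combined to generate a global image representation. In addition, the output tokens from the Transformer layers, filtered by the fused attention mask, are treated as key-patch descriptors, which are used to perform spatial matching to re-rank the candidates retrieved by the global image features. The whole model allows end-to-end training with a single objective and image-level supervision. TransVPR achieves state-of-the-art performance on several real-world benchmarks while maintaining low computational time and storage requirements.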
We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using outer product at each location of the image and pooled to obtain an image descriptor. This architecture can model local pairwise feature interactions in a translationally invariant manner which is particularly useful for fine-grained categorization. It also generalizes various orderless texture descriptors such as the Fisher vector, VLAD and O2P. We present experiments with bilinear models where the feature extractors are based on convolutional neural networks. The bilinear form simplifies gradient computation and allows end-to-end training of both networks using image labels only. Using networks initialized from the ImageNet dataset followed by domain-specific fine-tuning, we obtain 84.1% accuracy on the CUB-200-2011 dataset, requiring only category labels at training time. We present experiments and visualizations that analyze the effects of fine-tuning and the choice of the two networks on the speed and accuracy of the models. Results show that the architecture compares favorably to the existing state of the art on a number of fine-grained datasets while being substantially simpler and easier to train. Moreover, our most accurate model is fairly efficient, running at 8 frames/sec on an NVIDIA Tesla K40 GPU.
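The pooling step described above (an outer product of the two extractors' outputs at each location, summed over locations) can be sketched in a few lines. The function name `bilinear_pool` and the toy feature values are assumptions for this example; the feature extractors themselves are replaced by hand-written lists, and any normalization applied afterwards is omitted.

```python
def bilinear_pool(feats_a, feats_b):
    """Sum-pooled outer products of two feature streams.

    feats_a[l] and feats_b[l] are the feature vectors that the two
    extractors produce at image location l. The result is a flattened
    M x N descriptor (row-major), independent of location order.
    """
    M, N = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (M * N)
    for fa, fb in zip(feats_a, feats_b):
        for i in range(M):
            for j in range(N):
                # Pairwise interaction between channel i of A and j of B.
                pooled[i * N + j] += fa[i] * fb[j]
    return pooled


# Two locations; stream A has 2 channels, stream B has 3 (toy values).
stream_a = [[1.0, 2.0], [3.0, 4.0]]
stream_b = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
descriptor = bilinear_pool(stream_a, stream_b)

# Reversing the location order leaves the descriptor unchanged (orderless).
shuffled = bilinear_pool(stream_a[::-1], stream_b[::-1])
```

Because the sum runs over locations, permuting them does not change the descriptor, which is what makes the representation orderless.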
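In this work, we propose a landmark retrieval method that utilizes both global and local features. A Siamese network is used for global feature extraction and metric learning, which gives an initial ranking of the landmark search. We use the feature maps extracted by the Siamese architecture as local descriptors, and the search results are then further refined using the cosine similarity between local descriptors. We conduct a deeper analysis of the Google Landmark dataset, which is used for evaluation, and augment the dataset to handle various intra-class variations. Furthermore, we conduct several experiments to compare the effects of transfer learning and metric learning, as well as experiments using other local descriptors. We show that re-ranking with local features can improve the search results. We believe that the proposed local feature extraction using cosine similarity is a simple approach that can be extended to many other retrieval tasks.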
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [21] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
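In recent years, the robotics community has extensively examined methods for the place recognition task within the scope of simultaneous localization and mapping applications. This article proposes an appearance-based loop closure detection pipeline named "FILD++" (Fast and Incremental Loop closure Detection). First, the system is fed with consecutive images, and global and local deep features are extracted by passing each image twice through a single convolutional neural network. Next, a hierarchical navigable small-world graph incrementally builds a visual database representing the robot's traversed path, based on the computed global features. Finally, a query image, grabbed at each time step, is used to retrieve similar locations along the traversed route. An image-to-image pairing follows, which exploits the local features to evaluate spatial information. Thus, in this article we propose a single network for global and local feature extraction, in contrast to our previous work (FILD), while an exhaustive search is adopted for the verification process over the generated deep local features, avoiding the use of hash codes. Exhaustive experiments on eleven public datasets demonstrate the system's high performance (achieving the highest recall score on eight of them) and low execution time (22.05 ms on average on New College, the largest dataset, containing 52480 images) compared with other state-of-the-art approaches.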
Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNN. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision. That is, two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We also show that our unsupervised network can perform competitively in other tasks such as surface-normal estimation.
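One way to read the ranking loss on triplets is as a hinge on distances: the patch connected to the anchor by a track should be closer to it than an unrelated patch, by at least a margin. The sketch below assumes cosine distance and a margin of 0.5; these choices, the function names, and the toy feature vectors are illustrative assumptions and are not stated in the abstract.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: zero once the positive is closer than the negative
    by at least `margin`, positive otherwise."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


# Toy features: the tracked patch matches the anchor, the random one does not.
anchor_feat = [1.0, 0.0]
tracked_feat = [1.0, 0.0]
random_feat = [0.0, 1.0]

satisfied = triplet_ranking_loss(anchor_feat, tracked_feat, random_feat)
violated = triplet_ranking_loss(anchor_feat, random_feat, tracked_feat)
```

The first triplet already respects the margin, so it contributes no loss; the second, with the roles swapped, is penalized.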
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.
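Dimensionality reduction methods are unsupervised approaches that learn low-dimensional spaces in which some properties of the initial space, typically the notion of "neighborhood", are preserved. Such methods usually require propagation on large k-NN graphs or complicated optimization solvers. On the other hand, self-supervised learning approaches, typically used to learn representations from scratch, rely on simple and more scalable frameworks for learning. In this paper, we propose TLDR, a dimensionality reduction method for generic input spaces that ports the recent self-supervised learning framework of Zbontar et al. (2021) to the specific task of dimensionality reduction over arbitrary representations. We propose to use nearest neighbors to build pairs from the training set, and a redundancy reduction loss to learn an encoder that produces representations invariant across such pairs. TLDR is simple, easy to train, and broadly applicable. It consists of an offline nearest-neighbor computation step that can be highly approximated, and a straightforward learning process. Aiming for scalability, we focus on improving linear dimensionality reduction and show consistent gains on image and document retrieval tasks, e.g., gaining +4% mAP over PCA on ROxford for GeM-AP, and improving the performance of DINO on ImageNet or retaining it with a 10x compression.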