Visual place recognition (VPR) is generally concerned with localizing outdoor images. However, localizing indoor scenes that contain part of an outdoor scene can be of large value for a wide range of applications. In this paper, we introduce Inside Out Visual Place Recognition (IOVPR), a task that aims to localize images based on outdoor scenes visible through windows. For this task, we present the new large-scale dataset Amsterdam-XXXL, with images taken in Amsterdam, consisting of 6.4 million panoramic street-view images and 1,000 user-generated indoor queries. Additionally, we introduce a new training protocol, Inside Out Data Augmentation, to adapt visual place recognition methods for localizing indoor images, demonstrating the potential of inside-out visual place recognition. We empirically show the benefits of our proposed data augmentation scheme at a smaller scale, while demonstrating the difficulty this large-scale dataset poses for existing methods. With this new task, we aim to encourage the development of methods for IOVPR. The dataset and code are available for research purposes at https://github.com/saibr/iovpr
Geo-localization refers to the process of determining the position of a certain "entity" on Earth, typically using Global Positioning System (GPS) coordinates. The entity of interest may be an image, a sequence of images, a video, a satellite image, or even objects visible within an image. As massive datasets of GPS-tagged media have rapidly become available thanks to smartphones and the internet, and as deep learning has risen to enhance the performance capabilities of machine learning models, the fields of visual and object geo-localization have emerged, owing to their significant impact on a broad range of applications such as augmented reality, robotics, self-driving vehicles, road maintenance, and 3D reconstruction. This paper provides a comprehensive survey of geo-localization involving images, covering both geo-localizing the place where an image was captured (image geo-localization) and geo-localizing objects within an image (object geo-localization). We provide an in-depth study, including a summary of popular algorithms, a description of the proposed datasets, and an analysis of performance results, to illustrate the current state of each field.
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
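The VLAD-style aggregation that NetVLAD generalizes can be sketched in a few lines. The following is a toy NumPy version with fixed centers and a hand-set assignment softness; in the actual layer, the centers and assignment parameters are learned end-to-end, and the function name is our own.

```python
import numpy as np

def netvlad_aggregate(descriptors, centers, alpha=10.0):
    """Toy NetVLAD-style aggregation (a sketch, not the trained layer).

    descriptors: (N, D) local descriptors, e.g. from a CNN feature map.
    centers:     (K, D) cluster centers (learned in the real layer).
    alpha:       softness of the cluster assignment.
    Returns a (K * D,) L2-normalized global descriptor.
    """
    # Soft assignment: softmax over negative squared distances to centers.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                      # (N, K) soft weights

    # Accumulate soft-weighted residuals to each center.
    resid = descriptors[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)             # (K, D)

    # Intra-normalize per cluster, then L2-normalize the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)
```

Because the soft assignment is differentiable (unlike the hard argmin of classical VLAD), gradients can flow back through the aggregation into the feature extractor, which is what makes the layer trainable by backpropagation.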
The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
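The VOC detection protocol counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box exceeds 0.5, and summarizes a precision/recall curve with interpolated average precision. A minimal sketch of both, using the 11-point interpolation employed in early challenge editions (function names are ours):

```python
def iou(box_a, box_b):
    # Boxes given as (xmin, ymin, xmax, ymax).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def voc_ap_11point(recalls, precisions):
    # 11-point interpolated AP: average, over recall levels 0.0, 0.1, ..., 1.0,
    # of the maximum precision achieved at recall >= that level.
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        ps = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(ps) if ps else 0.0
    return ap / 11
```

Later editions of the challenge switched to the area under the full interpolated precision/recall curve, but the IoU > 0.5 matching criterion remained the core of the procedure.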
Predicting the country where a picture has been taken has many potential applications, such as checking false claims, identifying impostors, preventing disinformation campaigns, and identifying fake news. Previous works have mostly focused on estimating the geo-coordinates where a picture was taken. Yet, from a semantic and forensic point of view, recognizing the country where an image was taken may be more important than determining its spatial coordinates. So far, only a few works have addressed this task, mostly relying on images containing characteristic landmarks, such as iconic monuments. Within the above framework, this paper provides two main contributions. First, we introduce a new dataset, the VIPPGeo dataset, containing almost 4 million images that can be used to train DL models for country classification. The dataset contains only images that are relevant for country recognition, and it was built by taking care to remove non-significant images, such as images portraying faces or specific, non-relevant objects like airplanes or ships. Second, we use the dataset to train a deep learning architecture that casts the country recognition problem as a classification problem. The experiments we performed show that our network provides better results than the current state of the art. In particular, we found that asking the network to identify the country directly provides better results than first estimating the geo-coordinates and then using them to trace back the country where the picture was taken.
Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include the diversity of images caused by different environmental scenarios, drastic changes in the appearance of the same location depending on the time of day, weather, and season, and, more importantly, the fact that the prediction must be made from a single image that may contain only a few geographic cues. For these reasons, most existing works are restricted to specific cities, imagery, or worldwide landmarks. In this work, we focus on developing an efficient solution for planet-scale single-image geo-localization. To this end, we propose TransLocator, a unified dual-branch transformer network that attends to fine details over the entire image and produces robust feature representations under extreme appearance variations. TransLocator takes an RGB image and its semantic segmentation map as inputs, interacts between its two parallel branches after each transformer layer, and simultaneously performs geo-localization and scene recognition in a multi-task fashion. We evaluate TransLocator on four benchmark datasets (Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k) and obtain 5.5%, 14.1%, 4.9%, and 9.9% improvements in continent-level accuracy over the state of the art. TransLocator is also validated on real-world test images and found to be more effective than previous methods.
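Continent-level accuracy, like the other geo-localization accuracy levels, is typically computed by thresholding the great-circle distance between the predicted and true coordinates. A minimal sketch using the haversine formula; the threshold values listed are the commonly used Im2GPS-style levels, and the function names are ours:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres (mean Earth radius 6371 km).
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Commonly used evaluation thresholds (km):
# street 1, city 25, region 200, country 750, continent 2500.
def accuracy_at(preds, truths, threshold_km):
    """Fraction of (lat, lon) predictions within threshold_km of the truth."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, truths))
    return hits / len(preds)
```

A model is then scored at each threshold independently, which is why papers in this area report a row of percentages (street/city/region/country/continent) per dataset.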
This work addresses visual cross-view metric localization for outdoor robotics. Given a ground-level color image and a satellite patch that contains the local surroundings, the task is to identify the location of the ground camera within the satellite patch. Related work has addressed this task for range sensors (LiDAR, radar), but for vision only as a secondary regression step after an initial cross-view image retrieval step. Since the local satellite patch could also be retrieved via any rough localization prior (e.g. from GPS/GNSS or temporal filtering), we drop the image retrieval objective and focus on metric localization only. We devise a novel network architecture with dense satellite descriptors, similarity matching at the bottleneck (rather than at the output as in image retrieval), and a dense spatial distribution as output to capture multi-modal localization ambiguities. We compare against a state-of-the-art regression baseline that uses global image descriptors. Quantitative and qualitative experimental results on the recently proposed VIGOR and the Oxford RobotCar datasets validate our design. The produced probabilities are correlated with localization accuracy and can even be used to roughly estimate the ground camera's heading when its orientation is unknown. Overall, our method reduces the median metric localization error by 51%, 37%, and 28% compared to the state of the art, when generalizing respectively in the same area, across areas, and across time.
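A dense spatial distribution over the satellite patch can be reduced to a metric location estimate plus a simple uncertainty proxy. The following is only an illustrative sketch of that post-processing idea, not the paper's architecture; the function name, the probability-weighted mean readout, and the entropy proxy are our own choices:

```python
import numpy as np

def localize_from_heatmap(logits, patch_size_m):
    """Turn a dense matching-score map over a square satellite patch
    into a metric location estimate and an uncertainty proxy."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over all cells
    h, w = probs.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Probability-weighted mean position, mapped from cell indices
    # (cell centers) to metres within the patch.
    y_m = (probs * (ys + 0.5)).sum() / h * patch_size_m
    x_m = (probs * (xs + 0.5)).sum() / w * patch_size_m
    # Entropy as a rough proxy for multi-modality / ambiguity.
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    return (x_m, y_m), entropy
```

Note that a probability-weighted mean is a poor summary when the distribution really is multi-modal (it averages the modes); the high entropy value in that case is the signal to fall back on the full distribution.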
Visual place recognition (VPR) is usually considered as a specific image retrieval problem. Limited by existing training frameworks, most deep learning-based works cannot extract sufficiently stable global features from RGB images and rely on a time-consuming re-ranking step to exploit spatial structural information for better performance. In this paper, we propose StructVPR, a novel training architecture for VPR, to enhance structural knowledge in RGB global features and thus improve feature stability in a constantly changing environment. Specifically, StructVPR uses segmentation images as a more definitive source of structural knowledge input into a CNN network and applies knowledge distillation to avoid online segmentation and inference of seg-branch in testing. Considering that not all samples contain high-quality and helpful knowledge, and some even hurt the performance of distillation, we partition samples and weigh each sample's distillation loss to enhance the expected knowledge precisely. Finally, StructVPR achieves impressive performance on several benchmarks using only global retrieval and even outperforms many two-stage approaches by a large margin. After adding additional re-ranking, ours achieves state-of-the-art performance while maintaining a low computational cost.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
While designing sustainable and resilient urban built environments is increasingly promoted around the world, significant data gaps have challenged research on pressing sustainability issues. Sidewalks are known to have strong economic and environmental impacts; however, most cities lack a spatial catalog of their surfaces due to the cost-prohibitive and time-consuming nature of data collection. Recent advances in computer vision, together with the availability of street-level imagery, provide new opportunities for cities to extract large-scale built-environment data with lower implementation costs and higher accuracy. In this paper, we propose an active-learning-based framework that leverages computer vision techniques for classifying sidewalk materials using widely available street-level images. We trained the framework on images from New York City and Boston, and the evaluation results show a 90.5% mIoU score. Furthermore, we evaluated the framework on images from six different cities, showing that it can be applied to regions with distinct urban fabrics, even outside the domain of the training data. CitySurfaces can provide researchers and city agencies with a low-cost, accurate, and scalable method to collect sidewalk material data, which plays a critical role in addressing major sustainability issues, including climate change and surface water management.
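The mIoU score reported above is the mean, over classes, of the per-class intersection-over-union, usually computed from a confusion matrix of pixel counts. A minimal sketch (function name is ours):

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a pixel confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    Classes absent from both prediction and ground truth are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # true positives
        fp = sum(confusion[r][c] for r in range(n)) - tp   # false positives
        fn = sum(confusion[c]) - tp                        # false negatives
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

Averaging over classes (rather than pixels) prevents frequent surface types from dominating the score, which matters for material catalogs where some materials are rare.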
Large amounts of geo-referenced panoramic images are freely available for cities across the globe, as well as detailed maps with location and meta-data on a great variety of urban objects. They provide a potential source of information on urban objects, but manual annotation for object detection is costly, laborious, and difficult. Can we exploit such multimedia sources to automatically annotate street-level images as an inexpensive alternative to manual labeling? With the PanorAMS framework, we introduce a method to automatically generate bounding-box annotations for panoramic images based on urban context information. Following this method, we acquire large-scale, albeit noisy, annotations for an urban dataset solely from open data sources, in a fast and automatic manner. The dataset covers the city of Amsterdam and includes over 14 million noisy bounding-box annotations of 22 object categories present in 771,299 panoramic images. For many objects, further fine-grained information is available from geospatial meta-data, such as building value, function, and average surface area. Such information would have been difficult, if not impossible, to acquire based on the images alone. For detailed evaluation, we introduce an efficient crowdsourcing protocol for bounding-box annotation in panoramic images, which we deploy to acquire 147,075 ground-truth object annotations for a subset of 7,348 images, the PanorAMS-clean dataset. For our PanorAMS-noisy dataset, we provide an extensive analysis of the noise and of how different types of noise affect image classification and object detection performance. We make both datasets, PanorAMS-noisy and PanorAMS-clean, as well as benchmarks and tools, publicly available.
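Generating boxes from geo-metadata hinges on projecting an object's geo-location into the panorama. For an equirectangular panorama whose columns sweep the full 360° of bearing, the horizontal component reduces to a bearing computation. This is a simplified sketch of that idea only, with function names of our own; the actual PanorAMS pipeline also accounts for object geometry and camera metadata:

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2,
    in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360

def panorama_x(cam_lat, cam_lon, cam_heading_deg, obj_lat, obj_lon, width):
    """Column of an equirectangular panorama (width pixels for 360 degrees)
    where the object's geo-location projects, assuming column 0 lies at the
    camera heading."""
    rel = (bearing_deg(cam_lat, cam_lon, obj_lat, obj_lon) - cam_heading_deg) % 360
    return int(rel / 360 * width)
```

Errors in the recorded camera heading or object position shift the projected column, which is one concrete source of the bounding-box noise the abstract analyzes.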
Visual localization tackles the challenge of estimating the camera pose from a query image, using correspondence analysis between the query and a map. The task is compute- and data-intensive, which makes thoroughly evaluating methods on diverse datasets challenging. Yet, to further advance the state of the art, we claim that robust visual localization algorithms should be evaluated on multiple datasets covering a broad domain variety. To facilitate this, we introduce kapture, a new, flexible, unified data format and toolbox for visual localization and structure-from-motion (SfM). It enables the easy use of different datasets as well as efficient and reusable data processing. To demonstrate this, we present a versatile pipeline for visual localization that facilitates the use of different local and global features, 3D data (e.g. depth maps), non-vision sensor data (e.g. IMU, GPS, WiFi), and various processing algorithms. Using multiple pipeline configurations, we show the great versatility of kapture in our experiments. In addition, we evaluate our methods on eight public datasets, where they rank at the top on many of them. To foster future research, we release the code, models, and all datasets used in this paper under a permissive BSD license at github.com/naver/kapture and github.com/naver/kapture-localization.
Visual Place Recognition is an essential component of systems for camera localization and loop closure detection, and it has attracted widespread interest in multiple domains such as computer vision, robotics and AR/VR. In this work, we propose a faster, lighter and stronger approach that can generate models with fewer parameters and can spend less time in the inference stage. We designed RepVGG-lite as the backbone network in our architecture; it is more discriminative than other general networks in the Place Recognition task. RepVGG-lite has more speed advantages while achieving higher performance. We extract only one scale of patch-level descriptors from global descriptors in the feature extraction stage. Then we design a trainable feature matcher, based on the attention mechanism, to exploit both the spatial relationships of the features and their visual appearance. Comprehensive experiments on challenging benchmark datasets demonstrate that the proposed method outperforms other recent state-of-the-art learned approaches while achieving even higher inference speed. Our system has 14 times fewer parameters than Patch-NetVLAD, 6.8 times lower theoretical FLOPs, and runs 21 and 33 times faster in feature extraction and feature matching, respectively. Moreover, the performance of our approach is 0.5% better than Patch-NetVLAD in Recall@1. We used subsets of the Mapillary Street Level Sequences dataset to conduct experiments for all other challenging conditions.
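Recall@1, the metric quoted above, is the standard VPR score: the fraction of queries whose best-ranked database image is a correct match (correctness is usually defined by a geographic distance threshold around the query's true location). A minimal sketch, with names of our own choosing:

```python
def recall_at_n(ranked_db_ids, ground_truth_ids, n=1):
    """Fraction of queries whose top-n retrieved database images
    contain at least one correct match.

    ranked_db_ids:    per query, database ids sorted by descriptor distance.
    ground_truth_ids: per query, the set of database ids counted as correct
                      (e.g. those captured within 25 m of the query).
    """
    hits = sum(bool(set(ranked[:n]) & truth)
               for ranked, truth in zip(ranked_db_ids, ground_truth_ids))
    return hits / len(ranked_db_ids)
```

Two-stage systems report this metric after re-ranking the top candidates, which is why single-stage global-retrieval numbers and re-ranked numbers are usually tabulated separately.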
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse panoramic set of views over entire buildings enable a variety of supervised and self-supervised computer vision tasks, including keypoint matching, view overlap prediction, normal prediction from color, semantic segmentation, and region classification.
Cityscapes dataset (TU Dresden, www.cityscapes-dataset.net): train/val, fine annotation, 3,475 images; train, coarse annotation, 20,000 images; test, fine annotation, 1,525 images.
This paper presents OmniCity, a new dataset for omnipotent city understanding from multi-level and multi-view images. More precisely, OmniCity contains multi-view satellite images as well as street-level panorama and mono-view images, constituting over 100K pixel-wise annotated images that are well-aligned and collected from 25K geo-locations in New York City. To alleviate the substantial pixel-wise annotation effort, we propose an efficient street-view image annotation pipeline that leverages the existing label maps of the satellite view and the transformation relations between different views (satellite, panorama, and mono-view). With the new OmniCity dataset, we provide benchmarks for a variety of tasks, including building footprint extraction, height estimation, and building plane/instance/fine-grained segmentation. We also analyze the impact of viewpoints on each task, the performance of different models, the limitations of existing methods, etc. Compared with existing multi-level and multi-view benchmarks, OmniCity contains a larger number of images with richer annotation types and more views, provides more baseline results obtained with state-of-the-art models, and introduces a novel task for fine-grained building instance segmentation in street-level panorama images. Moreover, OmniCity provides new problem settings for existing tasks, such as cross-view matching, synthesis, segmentation, and detection, and facilitates the development of new methods for large-scale city understanding, reconstruction, and simulation. The OmniCity dataset and benchmarks will be available at https://city-super.github.io/omnicity.
Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training of CNNs, either from scratch or fine-tuning, requires a large amount of annotated data, where a high quality of annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by the state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. CNN descriptor whitening discriminatively learned from the same training data outperforms commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard benchmarks: Oxford Buildings, Paris, and Holidays datasets.
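GeM pooling has a compact closed form: for each feature channel d, f_d = ((1/|X|) * sum over x in X of x_d^p)^(1/p), which recovers average pooling at p = 1 and approaches max pooling as p grows. A minimal NumPy sketch with a fixed exponent (in the paper the pooling parameter p is learned by backpropagation; the function name is ours):

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-Mean (GeM) pooling over spatial positions.

    features: (H, W, D) non-negative CNN activations (e.g. post-ReLU).
    p = 1 recovers average pooling; large p approaches max pooling.
    """
    x = np.clip(features, eps, None)   # GeM assumes positive activations
    return (x ** p).mean(axis=(0, 1)) ** (1.0 / p)
```

Because the whole expression is differentiable in both the activations and p, the exponent can be treated as just another network parameter and tuned per task.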
Although it has been extensively shown that retrieval-specific training of deep neural networks is beneficial for nearest-neighbor image search quality, most of these models are trained and tested in the domain of landmark images. However, some applications use images from various other domains and therefore need a network with good generalization properties: a general-purpose CBIR model. To the best of our knowledge, no testing protocol has so far been introduced to benchmark models with respect to general image retrieval quality. After analyzing popular image retrieval test sets, we decided to manually curate GPR1200, an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories. This benchmark is subsequently used to evaluate various pretrained models of different architectures on their generalization qualities. We show that large-scale pretraining significantly improves retrieval performance, and we present experiments on how to further increase these properties with appropriate fine-tuning. With these promising results, we hope to increase interest in the research topic of general-purpose CBIR.
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
Analyzing 88 sources published from 2011 to 2021, this paper presents a first systematic review of computer-vision-based analysis of buildings and the built environment, to assess its value for architectural and urban design research. Following a multi-stage selection process, the types of relevant architectural applications are discussed, such as building classification, detail classification, qualitative environmental analysis, building condition surveys, and building value estimation. This reveals current research gaps and trends, and highlights two main categories of research aims. First, to use or optimize computer vision methods for architectural image data, which can then help automate time-consuming, labor-intensive, or complex visual analysis tasks. Second, to explore the methodological benefits of machine learning approaches for investigating new questions about the built environment by finding patterns and relationships between visual, statistical, and qualitative data, which can overcome the limitations of conventional manual analysis. The growing body of research offers new methods for architectural and design research, and the paper identifies future research challenges and directions.