Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
translated by 谷歌翻译
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach nearhuman semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using the state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines, that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high-coverage and high-diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.
translated by 谷歌翻译
Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods. Additionally, we study a finer-grained scene representation to detect scenes embedded inside of larger scenes.
translated by 谷歌翻译
令人难忘性测量在闪光后将容易记忆的难忘,这可能有助于设计杂志盖板,旅游宣传材料等。最近的作品对令人难忘的通用图像,对象图像或面部照片的可视化功能。然而,这些方法不能有效地预测户外自然场景图像的令人难忘性。为了克服以前作品的这种缺点,在本文中,我们提供了回答:“究竟是什么让户外自然场景令人难忘的东西”。为此,我们首先建立大规模的户外自然场景图像难忘(LNSIM)数据库,其中包含2,632个户外自然场景图像,其基础令人难忘分数和多标签场景类别注释。然后,类似于以前的作品,我们挖掘了我们的数据库,调查了如何影响户外自然场景的令人难忘程度,中高水平和高水平的手工业。特别是,我们发现场景类别的高级特征与户外自然场景难忘相当相关,深神经网络(DNN)学习的深度特征在预测令人难忘分数方面也是有效的。此外,将具有类别特征的深度特征组合可以进一步提高难忘预测的性能。因此,我们提出了基于端到端的DNN的户外自然场景难忘(DeepnSM)预测器,其利用了学习的类别相关的特征。然后,实验结果验证了我们深度的模型的有效性,超出了最先进的方法。最后,我们试图了解我们Deepnsm模型的良好表现的原因,并研究了我们的Deepnsm模型成功或未能准确预测户外自然场景的令人难忘的情况。代码:github.com/jiaxinlu-home/natural-cene-memorability-dataset。
translated by 谷歌翻译
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
translated by 谷歌翻译
Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval it consistently outperforms low memory footprint methods except for sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
translated by 谷歌翻译
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
translated by 谷歌翻译
场景分类已确定为一个具有挑战性的研究问题。与单个对象的图像相比,场景图像在语义上可能更为复杂和抽象。它们的差异主要在于识别的粒度水平。然而,图像识别是场景识别良好表现的关键支柱,因为从对象图像中获得的知识可用于准确识别场景。现有场景识别方法仅考虑场景的类别标签。但是,我们发现包含详细的本地描述的上下文信息也有助于允许场景识别模型更具歧视性。在本文中,我们旨在使用对象中编码的属性和类别标签信息来改善场景识别。基于属性和类别标签的互补性,我们提出了一个多任务属性识别识别(MASR)网络,该网络学习一个类别嵌入式,同时预测场景属性。属性采集和对象注释是乏味且耗时的任务。我们通过提出部分监督的注释策略来解决该问题,其中人类干预大大减少。该策略为现实世界情景提供了更具成本效益的解决方案,并且需要减少注释工作。此外,考虑到对象检测到的分数所指示的重要性水平,我们重新进行了权威预测。使用提出的方法,我们有效地注释了四个大型数据集的属性标签,并系统地研究场景和属性识别如何相互受益。实验结果表明,与最先进的方法相比
translated by 谷歌翻译
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
translated by 谷歌翻译
Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. Totally there are 25k images of the complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both of the benchmarks and re-implement the state-ofthe-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects 1 .
translated by 谷歌翻译
Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test," asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.
translated by 谷歌翻译
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% -15% on CIFAR-10 and 11% -14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
translated by 谷歌翻译
We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a broad data set of visual concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are given labels across a range of objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability of units is equivalent to random linear combinations of units, then we apply our method to compare the latent representations of various networks when trained to solve different supervised and self-supervised training tasks. We further analyze the effect of training iterations, compare networks trained with different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power.
translated by 谷歌翻译
In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on simple actions and movements occurring on manually trimmed videos. In this paper we introduce ActivityNet, a new largescale video benchmark for human activity understanding. Our benchmark aims at covering a wide range of complex human activities that are of interest to people in their daily living. In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. We illustrate three scenarios in which ActivityNet can be used to compare algorithms for human activity understanding: untrimmed video classification, trimmed activity classification and activity detection.
translated by 谷歌翻译
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
translated by 谷歌翻译
给定空中图像,空中场景解析(ASP)目标,以解释图像内容的语义结构,例如,通过将语义标签分配给图像的每个像素来解释图像内容的语义结构。随着数据驱动方法的推广,过去几十年通过在使用高分辨率航空图像时,通过接近基于瓦片级场景分类或分段的图像分析的方案来解决了对ASP的有希望的进展。然而,前者的方案通常会产生瓷砖技术边界的结果,而后者需要处理从像素到语义的复杂建模过程,这通常需要具有像素 - 明智语义标签的大规模和良好的图像样本。在本文中,我们在ASP中解决了这些问题,从瓷砖级场景分类到像素明智语义标签的透视图。具体而言,我们首先通过文献综述重新审视空中图像解释。然后,我们提出了一个大规模的场景分类数据集,其中包含一百万个空中图像被称为百万援助。使用所提出的数据集,我们还通过经典卷积神经网络(CNN)报告基准测试实验。最后,我们通过统一瓦片级场景分类和基于对象的图像分析来实现ASP,以实现像素明智的语义标记。密集实验表明,百万援助是一个具有挑战性但有用的数据集,可以作为评估新开发的算法的基准。当从百万辅助救援方面传输知识时,百万辅助的微调CNN模型始终如一,而不是那些用于空中场景分类的预磨料想象。此外,我们设计的分层多任务学习方法实现了对挑战GID的最先进的像素 - 明智的分类,拓宽了用于航空图像解释的像素明智语义标记的瓦片级场景分类。
translated by 谷歌翻译
室内场景识别是一种不断增长的领域,具有巨大的行为理解,机器人本地化和老年人监测等。在这项研究中,我们使用从社交媒体收集的多模态学习和视频数据来从新的角度来看场景识别的任务。社交媒体视频的可访问性和各种可以为现代场景识别技术和应用提供现实数据。我们提出了一种基于转录语音的融合到文本和视觉功能的模型,用于在名为Instaindoor的室内场景的社交媒体视频的新型数据集上进行分类。我们的模型可实现高达70%的精度和0.7 F1分数。此外,我们通过在室内场景的YouTube-8M子集上基准测试,我们突出了我们的方法的潜力,在那里它达到了74%的精度和0.74f1分数。我们希望这项工作的贡献铺平了在挑战领域的室内场景认可领域的新型研究。
translated by 谷歌翻译
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon the ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade and improve over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis 1 .
translated by 谷歌翻译
Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill.To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译