In this paper we present the first large-scale scene attribute database. First, we perform crowdsourced human studies to find a taxonomy of 102 discriminative attributes. We discover attributes related to materials, surface properties, lighting, affordances, and spatial layout. Next, we build the "SUN attribute database" on top of the diverse SUN categorical database. We use crowdsourcing to annotate attributes for 14,340 images from 707 scene categories. We perform numerous experiments to study the interplay between scene attributes and scene categories. We train and evaluate attribute classifiers and then study the feasibility of attributes as an intermediate scene representation for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing natural images. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. The experiments suggest that scene attributes are an effective low-dimensional feature for capturing high-level context and semantics in scenes.
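The pipeline described above, attribute classifiers trained independently and their predictions reused as a compact scene descriptor, can be summarized in a short sketch. This is an illustrative outline with synthetic placeholder data and scikit-learn models, not the authors' code or features.

```python
# Minimal sketch (not the authors' code): using predicted scene attributes as a
# low-dimensional intermediate representation for scene classification.
# X_*, A_train (binary attribute labels), and y_* are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_train, n_test, n_dims, n_attr, n_scenes = 500, 100, 512, 102, 10
X_train = rng.normal(size=(n_train, n_dims))          # low-level image features
X_test = rng.normal(size=(n_test, n_dims))
A_train = rng.integers(0, 2, size=(n_train, n_attr))  # crowd-sourced attribute labels
y_train = rng.integers(0, n_scenes, size=n_train)     # scene-category labels
y_test = rng.integers(0, n_scenes, size=n_test)

# 1) Train one independent binary classifier per attribute.
attr_models = [LogisticRegression(max_iter=1000).fit(X_train, A_train[:, j])
               for j in range(n_attr)]

def attribute_features(X):
    """Stack per-attribute confidences into a 102-D scene-attribute descriptor."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in attr_models])

# 2) Use the attribute descriptor as the feature for scene classification.
scene_clf = LinearSVC().fit(attribute_features(X_train), y_train)
print("scene accuracy:", scene_clf.score(attribute_features(X_test), y_test))
```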
more "winter" more "night" more "warm" more "moist" more "rain" more "autumn" Figure 1: Our method enables high-level editing of outdoor photographs. In this example, the user provides an input image (left) and six attribute queries corresponding to the desired changes, such as more "autumn". Our method hallucinates six plausible versions of the scene with the desired attributes (right), by learning local color transforms from a large dataset of annotated outdoor webcams. Abstract We live in a dynamic visual world where the appearance of scenes changes dramatically from hour to hour or season to season. In this work we study "transient scene attributes"-high level properties which affect scene appearance, such as "snow", "autumn", "dusk", "fog". We define 40 transient attributes and use crowd-sourcing to annotate thousands of images from 101 webcams. We use this "transient attribute database" to train regressors that can predict the presence of attributes in novel images. We demonstrate a photo organization method based on predicted attributes. Finally we propose a high-level image editing method which allows a user to adjust the attributes of a scene, e.g. change a scene to be "snowy" or "sunset". To support attribute manipulation we introduce a novel appearance transfer technique which is simple and fast yet competitive with the state-of-the-art. We show that we can convincingly modify many transient attributes in outdoor scenes.
Progress in scene understanding requires reasoning about the rich and diverse visual environments that make up our daily experience. To this end, we propose the Scene Understanding database, a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. Given this data with both scene and object labels available, we perform in-depth analysis of co-occurrence statistics and the contextual relationship. To better understand this large-scale taxonomy of scene categories, we perform two human experiments: we quantify human scene recognition accuracy, and we measure how typical each image is of its assigned scene category. Next, we perform computational experiments: scene recognition with global image features, indoor versus outdoor classification, and "scene detection," in which we relax the assumption that one image depicts only one scene category. Finally, we relate human experiments to machine performance and explore the relationship between human and machine recognition errors and the relationship between image "typicality" and machine recognition accuracy.
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the object recognition problem in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a deformable parts model.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually since 2010, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been made possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned over the five years of the challenge, and propose future directions and improvements.
Traditional supervised visual learning simply asks annotators "what" label an image should have. We propose an approach for image classification problems requiring subjective judgment that also asks "why", and uses that information to enrich the learned model. We develop two forms of visual annotator rationales: in the first, the annotator highlights the spatial region of interest he found most influential to the label selected, and in the second, he comments on the visual attributes that were most important. For either case, we show how to map the response to synthetic contrast examples, and then exploit an existing large-margin learning technique to refine the decision boundary accordingly. Results on multiple scene categorization and human attractiveness tasks show the promise of our approach, which can more accurately learn complex categories with the explanations behind the label choices.
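One way to picture the rationale-to-contrast-example mapping is the sketch below: a contrast example is the original with the annotator-highlighted evidence suppressed, and the difference between the two is handed to a large-margin learner as an extra constraint. The binary task, feature-dimension rationales, and the approximation of each constraint as an additional training example are illustrative assumptions, not the authors' exact formulation.

```python
# Rough sketch of the contrast-example idea (not the paper's code): each
# contrast example is the original with its rationale features zeroed out, and
# the difference (original - contrast) is added as an extra large-margin
# constraint, here approximated as an additional example with the same label.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 100))                      # placeholder image features
y = rng.integers(0, 2, size=200) * 2 - 1             # labels in {-1, +1}
rationales = [rng.choice(100, size=10, replace=False) for _ in range(200)]

def contrast_example(x, rationale_dims):
    """Weaken the evidence the annotator found most influential."""
    v = x.copy()
    v[rationale_dims] = 0.0
    return v

diffs = np.array([x - contrast_example(x, r) for x, r in zip(X, rationales)])
X_aug = np.vstack([X, diffs])                        # originals + contrast constraints
y_aug = np.concatenate([y, y])                       # each constraint shares the label

clf = LinearSVC(C=1.0).fit(X_aug, y_aug)
print("training accuracy on originals:", clf.score(X, y))
```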
While there has been remarkable progress in the performance of visual recognition algorithms, the state-of-the-art models tend to be exceptionally data-hungry. Large labeled training datasets, expensive and tedious to produce, are required to optimize millions of parameters in deep network models. Lagging behind the growth in model capacity, the available datasets are quickly becoming outdated in terms of size and density. To circumvent this bottleneck, we propose to amplify human effort through a partially automated labeling scheme, leveraging deep learning with humans in the loop. Starting from a large set of candidate images for each category, we iteratively sample a subset, ask people to label them, classify the others with a trained model, split the set into positives, negatives, and unlabeled based on the classification confidence, and then iterate with the unlabeled set. To assess the effectiveness of this cascading procedure and enable further progress in visual recognition research, we construct a new image dataset, LSUN. It contains around one million labeled images for each of 10 scene categories and 20 object categories. We experiment with training popular convolutional networks and find that they achieve substantial performance gains when trained on this dataset.
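The cascading labeling procedure reads naturally as a loop; the sketch below mirrors its steps with simulated data. The confidence thresholds, batch size, classifier, and the ask_humans stand-in are illustrative assumptions, not the actual LSUN pipeline.

```python
# Schematic sketch of a human-in-the-loop cascade: sample a subset, have people
# label it, train on all labels so far, split the rest by confidence, and keep
# only the ambiguous middle for the next round.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_humans(features):
    """Stand-in for one crowdsourcing round; returns simulated binary labels."""
    return (features[:, 0] > 0).astype(int)

rng = np.random.default_rng(3)
unlabeled = rng.normal(size=(5000, 64))      # candidate images for one category
X_lab = np.empty((0, 64))
y_lab = np.empty(0, dtype=int)
positives, negatives = [], []

for round_id in range(5):
    # 1) Sample a subset and have people label it.
    idx = rng.choice(len(unlabeled), size=min(500, len(unlabeled)), replace=False)
    X_lab = np.vstack([X_lab, unlabeled[idx]])
    y_lab = np.concatenate([y_lab, ask_humans(unlabeled[idx])])
    unlabeled = np.delete(unlabeled, idx, axis=0)
    if len(unlabeled) == 0:
        break
    # 2) Train on all labels gathered so far and score the remaining images.
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    conf = model.predict_proba(unlabeled)[:, 1]
    # 3) Split by confidence; only the ambiguous middle goes to the next round.
    positives.append(unlabeled[conf > 0.95])
    negatives.append(unlabeled[conf < 0.05])
    unlabeled = unlabeled[(conf >= 0.05) & (conf <= 0.95)]
    print(f"round {round_id}: {len(unlabeled)} images remain unlabeled")
```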
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offers a novel resource to guide future progress on scene recognition problems.
Several recent works have explored the benefits of providing more detailed annotations for object recognition. These annotations provide information beyond object names, and allow a detector to reason and describe individual instances in plain English. However, by demanding more specific details from annotators, new difficulties arise, such as stronger language dependencies and limited annotator attention. In this work, we present the challenges of constructing such a detailed dataset, and discuss why the benefits of using this data outweigh the difficulties of collecting it.
Human-nameable visual attributes offer many advantages when used as mid-level features for object recognition, but existing techniques to gather relevant attributes can be inefficient (costing substantial effort or expertise) and/or insufficient (descriptive properties need not be discriminative). We introduce an approach to define a vocabulary of attributes that is both human understandable and discriminative. The system takes object/scene-labeled images as input, and returns as output a set of attributes elicited from human annotators that distinguish the categories of interest. To ensure a compact vocabulary and efficient use of annotators' effort, we 1) show how to actively augment the vocabulary such that new attributes resolve inter-class confusions, and 2) propose a novel "nameability" manifold that prioritizes candidate attributes by their likelihood of being associated with a nameable property. We demonstrate the approach with multiple datasets, and show its clear advantages over baselines that lack a nameability model or rely on a list of expert-provided attributes.
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNNs, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in computer vision research, even though it occurs frequently: the world contains tens of thousands of different object classes, and image collections have been formed and suitably annotated for only a few of them. To tackle the problem we introduce attribute-based classification: objects are identified based on a high-level description that is phrased in terms of semantic attributes, such as the object's color or shape. Because the identification of each such property transcends the specific learning task at hand, the attribute classifiers can be pre-learned independently, e.g. from existing image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In this paper we also introduce a new dataset, Animals with Attributes, of over 30,000 images of 50 animal classes, annotated with 85 semantic attributes. Extensive experiments on this and two more datasets show that attribute-based classification is indeed able to categorize images without access to any training images of the target classes.
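A condensed sketch of attribute-based zero-shot classification in this spirit is shown below: attribute classifiers are pre-learned on seen classes, and an unseen class is recognized by matching predicted attribute probabilities against its human-given attribute signature. The data, class names, and scoring rule are illustrative assumptions rather than the paper's exact model.

```python
# Sketch of attribute-based zero-shot classification (illustrative data, not
# the paper's code). Unseen classes are described only by attribute signatures.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_attr = 85
X_seen = rng.normal(size=(600, 128))                 # images of *seen* classes
A_seen = rng.integers(0, 2, size=(600, n_attr))      # their attribute labels

# Pre-learn one probabilistic classifier per attribute on seen classes only.
attr_models = [LogisticRegression(max_iter=1000).fit(X_seen, A_seen[:, j])
               for j in range(n_attr)]

# Unseen classes are defined purely by human-given attribute signatures.
unseen_signatures = {"zebra": rng.integers(0, 2, size=n_attr),
                     "panda": rng.integers(0, 2, size=n_attr)}

def classify_unseen(x):
    p = np.array([m.predict_proba(x[None, :])[0, 1] for m in attr_models])
    # Score each unseen class by how well predicted attributes match its signature.
    scores = {c: np.sum(np.where(sig == 1, np.log(p + 1e-9), np.log(1 - p + 1e-9)))
              for c, sig in unseen_signatures.items()}
    return max(scores, key=scores.get)

print(classify_unseen(rng.normal(size=128)))
```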
We consider the task of learning visual connections between object categories using the ImageNet dataset, which is a large-scale dataset ontology containing more than 15 thousand object classes. We want to discover visual relationships between the classes that are currently missing (such as similar colors or shapes or textures). In this work we learn 20 visual attributes and use them in a zero-shot transfer learning experiment as well as to make visual connections between semantically unrelated object categories.
When glancing at a magazine, or browsing the Internet, we are continuously exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds while others are quickly forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is an intrinsic and stable property of an image that is shared across different viewers, and remains stable across delays. We introduce a database for which we have measured the probability that each picture will be recognized after a single view. We analyze a collection of image features, labels, and attributes that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. While making memorable images is a challenging task in visualization, photography, and education, this work is a first attempt to quantify this useful property of images.
We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real world videos, including massive data sets unprecedented for their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications. (A preliminary version of this work appeared in ECCV 2010 by Vondrick et al.)
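The key-frame idea underlying the interpolation strategies is easiest to see in its simplest form: boxes are annotated only on sparse key frames and filled in between. The sketch below shows only a plain linear-interpolation baseline with a made-up box format and frame numbers; the paper's approach goes further by exploiting pixel-based features under a fixed budget.

```python
# Toy sketch of key-frame annotation: workers label boxes on sparse key frames
# and the frames in between are hallucinated by interpolation (linear baseline
# shown here for intuition only).
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)

def interpolate_track(keyframes: Dict[int, Box], frame: int) -> Box:
    """Linearly interpolate a box for `frame` from the two surrounding key frames."""
    times = sorted(keyframes)
    prev = max(t for t in times if t <= frame)
    nxt = min(t for t in times if t >= frame)
    if prev == nxt:
        return keyframes[prev]
    alpha = (frame - prev) / (nxt - prev)
    a, b = keyframes[prev], keyframes[nxt]
    return tuple((1 - alpha) * ai + alpha * bi for ai, bi in zip(a, b))

labeled = {0: (10, 20, 50, 80), 30: (40, 25, 55, 85)}   # worker-labeled key frames
print(interpolate_track(labeled, 15))                   # box estimated at frame 15
```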
Despite progress on perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is central to tasks that involve not just recognizing, but reasoning about, our visual world. However, models used to tackle the rich content of images for cognitive tasks are still being trained on the same datasets designed for perceptual tasks. To succeed at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", a computer needs to identify the objects in the image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images, where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
Given two images, we want to predict which exhibits a particular visual attribute more than the other, even when the two images are quite similar. Existing relative attribute methods rely on global ranking functions; yet rarely will the visual cues relevant to a comparison be constant for all data, nor will humans' perception of the attribute necessarily permit a global ordering. To address these issues, we propose a local learning approach for fine-grained visual comparisons. Given a novel pair of images, we learn a local ranking model on the fly, using only analogous training comparisons. We show how to identify these analogous pairs using learned metrics. With results on three challenging datasets, including a large newly curated dataset for fine-grained comparisons, our method outperforms state-of-the-art methods for relative attribute prediction.
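The local learning recipe can be sketched as: gather the training comparisons most analogous to the novel pair, fit a ranking model on just those, and apply it to the pair. In the sketch below, plain Euclidean distance stands in for the learned metric and a linear SVM on difference vectors stands in for the ranking model; the data are synthetic placeholders, not the paper's implementation.

```python
# Rough sketch of local, on-the-fly ranking for fine-grained comparisons.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)
feats = rng.normal(size=(300, 64))                       # training image features
pairs = rng.integers(0, 300, size=(1000, 2))             # (i, j) comparisons
labels = rng.integers(0, 2, size=1000) * 2 - 1           # +1 if i shows the attribute more

def predict_comparison(x_a, x_b, k=100):
    # 1) Find the k training comparisons most analogous to the novel pair.
    pair_feats = np.hstack([feats[pairs[:, 0]], feats[pairs[:, 1]]])
    query = np.hstack([x_a, x_b])
    nearest = np.argsort(np.linalg.norm(pair_feats - query, axis=1))[:k]
    # 2) Train a ranking model on the fly from only those neighbors.
    diffs = feats[pairs[nearest, 0]] - feats[pairs[nearest, 1]]
    local_ranker = LinearSVC().fit(diffs, labels[nearest])
    # 3) Rank the novel pair with the local model (positive means x_a "more").
    return local_ranker.decision_function((x_a - x_b)[None, :])[0]

print(predict_comparison(rng.normal(size=64), rng.normal(size=64)))
```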
With the recent renaissance of deep convolutional neural networks, encouraging breakthroughs have been achieved on supervised recognition tasks, where each class has sufficient and fully annotated training data. However, scaling recognition to a large number of classes with few or no training samples per class remains an unsolved problem. One approach to scaling up recognition is to develop models capable of recognizing unseen categories without any training instances, i.e. zero-shot recognition/learning. This article provides a comprehensive review of existing zero-shot recognition techniques, covering aspects ranging from model representations to datasets and evaluation settings. We also overview related recognition tasks, including one-shot and open set recognition, which can be used as natural extensions of zero-shot recognition when a limited number of class samples becomes available or when zero-shot recognition is implemented in a real-world setting. Importantly, we highlight the limitations of existing approaches and point out future research directions in this new research area.
Recognizing visual content in unconstrained videos has become a very important problem for many applications. Existing corpora for video analysis lack scale and/or content diversity, and thus have limited the needed progress in this critical area. In this paper, we describe and release a new database called CCV, containing 9,317 web videos over 20 semantic categories, including events like "baseball" and "parade", scenes like "beach", and objects like "cat". The database was collected with extra care to ensure relevance to consumer interest and originality of video content without post-editing. Such videos typically have very little textual annotation and thus can benefit from the development of automatic content analysis techniques. We used the Amazon MTurk platform to perform manual annotation, and studied the behaviors and performance of human annotators on MTurk. We also compared the abilities of humans and machines in understanding consumer video content. For the latter, we implemented automatic classifiers using a state-of-the-art multi-modal approach that achieved top performance in the recent TRECVID multimedia event detection task. Results confirmed that classifiers fusing audio and video features significantly outperform single-modality solutions. We also found that humans are much better at understanding categories of nonrigid objects such as "cat", while current automatic techniques are relatively close to humans in recognizing categories that have distinctive background scenes or audio patterns.
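The audio-visual fusion finding can be illustrated with a minimal late-fusion sketch: one classifier per modality, with their class probabilities combined by a weighted average. The features, weights, and single binary category below are illustrative assumptions, not the system used in the paper.

```python
# Minimal late-fusion sketch: separate audio and visual classifiers whose
# probabilities are averaged with fixed weights (placeholder data throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 400
X_video = rng.normal(size=(n, 128))      # visual features (e.g. bag-of-words)
X_audio = rng.normal(size=(n, 40))       # audio features (e.g. MFCC statistics)
y = rng.integers(0, 2, size=n)           # one semantic category, e.g. "parade"

vid_clf = LogisticRegression(max_iter=1000).fit(X_video, y)
aud_clf = LogisticRegression(max_iter=1000).fit(X_audio, y)

def fused_score(xv, xa, w_video=0.6, w_audio=0.4):
    """Late fusion: weighted average of per-modality class probabilities."""
    pv = vid_clf.predict_proba(xv[None, :])[0, 1]
    pa = aud_clf.predict_proba(xa[None, :])[0, 1]
    return w_video * pv + w_audio * pa

print(fused_score(rng.normal(size=128), rng.normal(size=40)))
```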