Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
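As a concrete reference point for the joint-embedding family mentioned above, the sketch below combines a convolutional image encoder with an LSTM question encoder and fuses the two into a common feature space before classifying over a fixed answer vocabulary. It is a minimal illustration rather than any specific published model; the encoder choices, dimensions, and element-wise fusion are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class JointEmbeddingVQA(nn.Module):
    """Minimal CNN + LSTM joint-embedding baseline for VQA (illustrative only)."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # global image feature
        self.img_proj = nn.Linear(512, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        v = self.cnn(image).flatten(1)                  # (B, 512) image feature
        v = torch.tanh(self.img_proj(v))                # (B, H)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = torch.tanh(h[-1])                           # (B, H) final question state
        return self.classifier(v * q)                   # element-wise fusion -> answer logits

# Example usage with random inputs:
# logits = JointEmbeddingVQA(10000, 3000)(torch.randn(2, 3, 224, 224),
#                                         torch.randint(0, 10000, (2, 12)))
```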
When building artificial intelligence systems that can reason about and answer questions on visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but models can exploit strong biases to answer questions correctly without reasoning. They also conflate multiple sources of error, making it hard to pinpoint a model's shortcomings. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning. MAC moves away from monolithic black-box neural architectures towards a design that encourages both transparency and versatility. The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory. By stringing the cells together and imposing structural constraints that regulate their interaction, MAC effectively learns to perform iterative reasoning processes that are directly inferred from the data in an end-to-end approach. We demonstrate the model's strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the model is computationally-efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.
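To make the control/memory separation concrete, here is a heavily simplified single-cell sketch: a control unit attends over question words, a read unit attends over image-region features, and a write unit updates the memory. The published MAC cell includes additional projections and gating, so the layer names and dimensions below are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MACCell(nn.Module):
    """Simplified sketch of one MAC reasoning cell (illustrative, not the paper's code)."""
    def __init__(self, dim):
        super().__init__()
        self.control_q = nn.Linear(2 * dim, dim)   # combines previous control with the question
        self.control_attn = nn.Linear(dim, 1)
        self.read_mem = nn.Linear(dim, dim)
        self.read_kb = nn.Linear(dim, dim)
        self.read_attn = nn.Linear(dim, 1)
        self.write = nn.Linear(2 * dim, dim)

    def forward(self, words, knowledge, control, memory, question):
        # words:     (B, L, d) contextual question-word representations
        # knowledge: (B, N, d) image-region features (the knowledge base)
        # control, memory, question: (B, d)
        # --- control unit: decide which question words matter at this step ---
        cq = self.control_q(torch.cat([control, question], dim=1))
        attn = F.softmax(self.control_attn(cq.unsqueeze(1) * words), dim=1)
        control = (attn * words).sum(dim=1)
        # --- read unit: retrieve relevant information from the knowledge base ---
        interact = self.read_mem(memory).unsqueeze(1) * self.read_kb(knowledge)
        rattn = F.softmax(self.read_attn(interact * control.unsqueeze(1)), dim=1)
        retrieved = (rattn * knowledge).sum(dim=1)
        # --- write unit: integrate the retrieved information into memory ---
        memory = self.write(torch.cat([retrieved, memory], dim=1))
        return control, memory
```

Stringing several such cells together, each sharing the question but carrying its own control and memory states, yields the iterative reasoning process the abstract describes.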
Despite progress on perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not only recognizing but also reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained with the same datasets designed for perceptual tasks. To succeed at cognitive tasks, models need to understand the interactions and relationships between objects in an image. For example, when asked what vehicle is being ridden, a computer needs to identify the objects in the image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that the person is riding a horse-drawn carriage. In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images, where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
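The annotations described above are naturally represented as a scene graph per image. The sketch below is an illustrative data structure (the field names are assumptions, not the dataset's actual schema) showing how objects, attributes, relationships, region descriptions, and QA pairs might be tied together and canonicalized to WordNet synsets.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    name: str                  # e.g. "horse"
    synset: str                # WordNet canonicalization, e.g. "horse.n.01"
    bbox: Tuple[int, int, int, int]                        # (x, y, w, h)
    attributes: List[str] = field(default_factory=list)    # e.g. ["brown"]

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str             # e.g. "pulling"
    obj: SceneObject

@dataclass
class SceneGraph:
    image_id: int
    objects: List[SceneObject]
    relationships: List[Relationship]
    region_descriptions: List[str]           # e.g. "a horse pulling a carriage"
    qa_pairs: List[Tuple[str, str]]          # (question, answer) grounded in the graph

horse = SceneObject("horse", "horse.n.01", (10, 40, 120, 90), ["brown"])
carriage = SceneObject("carriage", "carriage.n.02", (130, 50, 160, 100))
graph = SceneGraph(1, [horse, carriage],
                   [Relationship(horse, "pulling", carriage)],
                   ["a horse pulling a carriage"],
                   [("What is the horse pulling?", "a carriage")])
```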
Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have shown that state-of-the-art systems achieve good performance by exploiting flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
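The automatic evaluation mentioned above is typically performed against the multiple human answers collected per question. The snippet below sketches the consensus-style accuracy associated with this benchmark, where a prediction gets full credit if at least three annotators gave the same answer; the answer normalization and averaging over annotator subsets used in the official evaluation code are omitted here, so treat this as an approximation.

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """Consensus accuracy: min(#annotators agreeing with the prediction / 3, 1)."""
    matches = sum(1 for a in human_answers
                  if a.strip().lower() == predicted.strip().lower())
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "carriage"
print(vqa_accuracy("carriage", ["carriage"] * 4 + ["cart"] * 6))   # -> 1.0
```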
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories.
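One way to compensate for over-represented question types, in the spirit of the evaluation schemes described above, is to score each of the 12 categories separately and then average across categories; a harmonic mean additionally penalizes models that fail an entire category. The sketch below is an illustrative implementation of that idea, not TDIUC's official scoring code.

```python
from collections import defaultdict
from statistics import harmonic_mean

def per_type_accuracies(records):
    """records: iterable of (question_type, is_correct) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for qtype, is_correct in records:
        total[qtype] += 1
        correct[qtype] += int(is_correct)
    return {t: correct[t] / total[t] for t in total}

def mean_per_type(records):
    accs = list(per_type_accuracies(records).values())
    arithmetic = sum(accs) / len(accs)
    harmonic = harmonic_mean(accs) if all(a > 0 for a in accs) else 0.0
    return arithmetic, harmonic

# Example: a model that aces a huge easy category but fails a small hard one
records = [("color", True)] * 900 + [("counting", False)] * 100
print(mean_per_type(records))   # -> (0.5, 0.0): per-type means expose the weakness
```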
Commonsense knowledge and commonsense reasoning are major bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason about and understand natural language text. As these tasks become instruments and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for example, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. In this paper, we formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true. We introduce a new dataset, VCR, consisting of 290k multiple-choice QA problems derived from 110k movie scenes. The key to generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach that transforms rich annotations into multiple-choice questions with minimal bias. To move towards cognition-level image understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the layered inferences necessary for grounding, contextualization, and reasoning. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art models struggle (~45%). Our R2C helps narrow this gap (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
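Adversarial Matching, as described above, can be viewed as an assignment problem: each question needs distractor answers that are highly relevant to it (so they are hard) yet dissimilar from its correct answer (so they remain wrong). The sketch below is a loose illustration of that trade-off using a Hungarian assignment over a toy score matrix; the relevance and similarity models used in the actual dataset construction are not reproduced here, and the penalty weighting is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pick_distractors(relevance, similarity, penalty=10.0):
    """relevance[i, j]: how plausible answer j sounds for question i.
    similarity[i, j]: how close answer j is to question i's correct answer.
    We want distractors that maximize relevance while penalizing similarity."""
    score = relevance - penalty * similarity
    rows, cols = linear_sum_assignment(-score)   # maximize the total score
    return dict(zip(rows, cols))                 # question index -> assigned distractor

rng = np.random.default_rng(0)
relevance = rng.random((4, 4))
similarity = np.eye(4)        # answer i is the correct answer to question i
print(pick_distractors(relevance, similarity))  # never assigns an answer to its own question
```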
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, the inherent structure of our world and biases in language tend to provide a simpler learning signal than visual modalities, resulting in models that ignore visual information and leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated not with a single image, but rather with a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at www.visualqa.org as part of the second iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model which, in addition to providing an answer to a given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image but that it believes has a different answer to the same question. This can help build trust in machines among their users.
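The counter-example explanation described at the end of the abstract can be sketched as a simple retrieval step: among images that are visually similar to the input, return one for which the model itself predicts a different answer to the same question. This is only an illustrative outline under assumed helper functions (`image_feature`, `predict_answer`), not the paper's actual model.

```python
import numpy as np

def counter_example(image, question, candidate_images,
                    image_feature, predict_answer, k=24):
    """Return a visually similar image that the model believes has a different answer."""
    answer = predict_answer(image, question)
    query = image_feature(image)
    feats = np.stack([image_feature(c) for c in candidate_images])
    # rank candidates by cosine similarity to the query image
    sims = feats @ query / (np.linalg.norm(feats, axis=1) * np.linalg.norm(query) + 1e-8)
    for idx in np.argsort(-sims)[:k]:
        candidate = candidate_images[idx]
        if predict_answer(candidate, question) != answer:
            return candidate
    return None   # no convincing counter-example among the top-k neighbours
```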
Methods for teaching machines to answer visual questions have made remarkable progress in the past few years, but, while demonstrating impressive results on specific datasets, these methods lack some important human capabilities, including integrating new visual classes and concepts in a structured manner, providing explanations for their answers, and handling new domains without new examples. In this paper, we present a system that achieves state-of-the-art results on the CLEVR dataset without any question-answer training, utilizes real visual estimators, and explains its answers. The system consists of a question representation stage followed by an answering procedure that invokes an extendable set of visual estimators. It can explain its answers, including its failures, and provide alternatives for negative answers. The scheme builds on a recently proposed framework, with extensions that allow the system to handle new domains without relying on training examples.
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures, without explicitly modeling the underlying reasoning process. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator, which constructs an explicit representation of the reasoning process to be performed, and an execution engine, which executes the resulting program to produce an answer. Both the program generator and the execution engine are implemented as neural networks and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.
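The two-stage design described above separates what to compute from how to compute it. The toy sketch below illustrates that split: a program is a sequence of module names (here hard-coded rather than predicted by a sequence-to-sequence generator), and an execution engine looks up a small neural module per name and applies them in turn over image features. The module inventory, shapes, and the purely sequential (rather than tree-structured) execution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Generic residual conv block used for every program token (illustrative)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.conv(x))

class ExecutionEngine(nn.Module):
    def __init__(self, module_names, dim=64, num_answers=28):
        super().__init__()
        self.modules_by_name = nn.ModuleDict({name: ResidualModule(dim)
                                              for name in module_names})
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(dim, num_answers))

    def forward(self, image_features, program):
        x = image_features
        for name in program:               # execute the program step by step
            x = self.modules_by_name[name](x)
        return self.classifier(x)

engine = ExecutionEngine(["filter_red", "filter_cube", "count"])
logits = engine(torch.randn(1, 64, 14, 14), ["filter_red", "filter_cube", "count"])
```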
The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content. This can hinder progress in pushing state of art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of concepts inquired in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and otherwise "no". Abstract scenes play two roles: (1) They allow us to focus on the high-level semantics of the VQA task as opposed to the low-level recognition problems, and perhaps more importantly, (2) They provide us the modality to balance the dataset such that language priors are controlled, and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene, and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.
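The verification view described above reduces a binary question to checking whether a summarized concept is present in the scene. The toy sketch below assumes the question has already been parsed into a (primary object, relation, secondary object) tuple and the abstract scene is available as a list of such ground-truth tuples; both the tuple format and the exact-match rule are illustrative assumptions.

```python
def answer_binary_question(concept, scene_tuples):
    """concept: (primary, relation, secondary), e.g. ("dog", "next to", "tree").
    scene_tuples: ground-truth facts extracted from the abstract scene.
    Returns "yes" if the inquired concept is found in the scene, else "no"."""
    primary, relation, secondary = concept
    for p, r, s in scene_tuples:
        if p == primary and r == relation and s == secondary:
            return "yes"
    return "no"

scene = [("dog", "next to", "tree"), ("boy", "holding", "ball")]
print(answer_binary_question(("dog", "next to", "tree"), scene))   # -> "yes"
print(answer_binary_question(("dog", "under", "table"), scene))    # -> "no"
```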
Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either a generation problem or as a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally we extrapolate future directions in the area of automatic image description generation.
A question about an image defines a specific visual task that must be performed in order to produce an appropriate answer. The answer may depend on details in the image and may require complex reasoning and the use of prior knowledge. When humans perform this task, they do so in a flexible and robust manner, integrating any novel visual capability in a modular fashion with a wide variety of task specifics. In contrast, current approaches to solving this problem by machine treat it as an end-to-end learning problem and lack these capabilities. We propose a different approach, inspired by the human capabilities described above. The approach is based on the compositional structure of the question. The underlying idea is that a question has an abstract representation based on its structure, which is inherently compositional. A question can therefore be answered by composing procedures that correspond to its substructures. The basic elements of the representation are logical patterns, which are combined to represent the question. These patterns include parameterized representations of object classes, attributes, and relations. Each basic pattern is mapped to a basic procedure that carries out a meaningful visual task, and the patterns are composed to produce the overall answering procedure. The UnCoRd (Understand, Compose and Respond) system, based on this approach, integrates existing detection and classification schemes for a set of object classes, attributes, and relations. These schemes are combined in a modular manner, providing detailed answers and corrections for negative answers. In addition, an external knowledge base is queried for required commonsense knowledge. We perform a qualitative analysis of the system, demonstrating its representational power and suggesting possible future developments.
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacity for deeper reasoning. Recently, the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions via object-level grounding. In addition to the textual answers used in previous work, it enables a new type of QA with visual answers. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
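A minimal version of the spatial-attention idea mentioned above: the question state produced by an LSTM is used to score each cell of a convolutional feature map, and the attention-weighted image feature is then available to be fused with the question state for answer ranking. This sketch is illustrative; the dimensions and the scoring function are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Attend over a flattened H x W grid of image features using the question state."""
    def __init__(self, img_dim=512, q_dim=512, hidden=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_grid, q_state):
        # img_grid: (B, H*W, img_dim) flattened conv features; q_state: (B, q_dim)
        joint = torch.tanh(self.img_proj(img_grid) + self.q_proj(q_state).unsqueeze(1))
        weights = F.softmax(self.score(joint), dim=1)      # (B, H*W, 1) attention map
        attended = (weights * img_grid).sum(dim=1)         # (B, img_dim) attended feature
        return attended, weights

attn = SpatialAttention()
feat, w = attn(torch.randn(2, 196, 512), torch.randn(2, 512))   # 14x14 grid example
```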
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We provide additional insights into the problem by analyzing how much information is contained only in the language part, for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Moreover, we also extend our analysis to VQA, a large-scale question answering about images dataset, where we investigate some particular design choices and show the importance of stronger visual models. At the same time, we achieve strong performance of our model that still uses a global image representation. Finally, based on such analysis, we refine our Ask Your Neurons on DAQUAR, which also leads to a better performance on this challenging task.
The ability to ask questions is a powerful tool to gather information in order to learn about the world and resolve ambiguities. In this paper, we explore a novel problem of generating discriminative questions to help disambiguate visual instances. Our work can be seen as a complement and new extension to the rich research studies on image captioning and question answering. We introduce the first large-scale dataset with over 10,000 carefully annotated image-question tuples to facilitate benchmarking. In particular, each tuple consists of a pair of images and 4.6 discriminative questions (as positive samples) and 5.9 non-discriminative questions (as negative samples) on average. In addition, we present an effective method for visual discriminative question generation. The method can be trained in a weakly supervised manner without discriminative image-question tuples but just existing visual question answering datasets. Promising results are shown against representative baselines through quantitative evaluations and user studies.