When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
A critical aspect of human visual perception is the ability to decompose visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures can induce rich semantic concepts and relations, and thus play an important role in the interpretation and organization of visual signals as well as in the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning due to finer-grained concepts, richer geometric relations, and more complex physics. Therefore, to better serve part-based conceptual, relational, and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, which all come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. A careful analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding of vision and language.
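As a rough illustration of the "functional programs" mentioned above, here is a minimal sketch of executing a compositional question over a scene graph; the operation names and scene layout are illustrative assumptions, not GQA's actual program vocabulary.

```python
# Hypothetical scene graph and program operations (not the GQA schema).
scene = {
    "objects": [
        {"id": 0, "name": "table", "attributes": {"color": "brown"}, "relations": []},
        {"id": 1, "name": "apple", "attributes": {"color": "red"},
         "relations": [{"name": "on", "object": 0}]},
    ]
}

def select(objects, name):
    """Keep objects whose category matches `name`."""
    return [o for o in objects if o["name"] == name]

def relate(objects, relation, targets):
    """Keep objects that stand in `relation` to any object in `targets`."""
    target_ids = {t["id"] for t in targets}
    return [o for o in objects
            if any(r["name"] == relation and r["object"] in target_ids
                   for r in o["relations"])]

def query_attr(objects, attribute):
    """Read an attribute of the (single) remaining object."""
    return objects[0]["attributes"][attribute]

# "What color is the apple on the table?"
tables = select(scene["objects"], "table")
apples_on_table = relate(select(scene["objects"], "apple"), "on", tables)
print(query_attr(apples_on_table, "color"))  # -> "red"
```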
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
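To make the pairwise structure described above concrete, below is a minimal PyTorch-style sketch of an RN module; the layer sizes are arbitrary assumptions, and the question embedding that the full model concatenates to each object pair is omitted for brevity.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Score every ordered pair of objects with a shared MLP g, sum, then apply f."""
    def __init__(self, obj_dim=32, hidden=128, out_dim=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):                    # objects: (batch, n_objects, obj_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)
        o_j = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([o_i, o_j], dim=-1)      # all ordered pairs: (b, n, n, 2*d)
        relations = self.g(pairs).sum(dim=(1, 2))  # aggregate over pairs
        return self.f(relations)
```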
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a subfield of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. As a result, the integration of vision and language has attracted a great deal of attention. The associated tasks have been designed so that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and we discuss existing datasets, methods, problem formulations, and evaluation measures for VQA and visual reasoning tasks, in order to understand vision-and-language representation learning. We also present some potential future directions in this field of research, in the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning (answering image-related questions that require a multi-step, high-level process), a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
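The transformation itself is compact. Below is a minimal sketch of a FiLM layer, with assumed dimensions and without the surrounding residual blocks used in the paper's full architecture.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """A conditioning vector (e.g., a question embedding) predicts a per-channel
    scale (gamma) and shift (beta) that modulate a convolutional feature map."""
    def __init__(self, cond_dim=256, n_channels=128):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, features, conditioning):
        # features: (batch, channels, H, W); conditioning: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(conditioning).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # broadcast over spatial dims
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * features + beta
```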
We introduce CLEVR-Math, a multi-modal math word problem dataset consisting of simple math word problems involving addition and subtraction, represented partly by a textual description and partly by an image illustrating the scene. The text describes actions performed on the scene depicted in the image. Since the question posed may not concern the scene in the image itself but the state of the scene before or after the actions are applied, the solver has to envision or imagine the state changes caused by these actions. Solving these word problems requires a combination of language, visual, and mathematical reasoning. We apply state-of-the-art neural and neuro-symbolic visual question answering models to CLEVR-Math and empirically evaluate their performance. Our results show how the two approaches generalize to chains of operations. We discuss the limitations of both in solving the task of multimodal word problem solving.
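A made-up example of the kind of problem described above, with the object counts coming from the image and the action and question coming from the text:

```python
# Counts extracted (by a perception module) from the image.
scene = {"cubes": 3, "spheres": 2, "cylinders": 1}

# Text: "Remove all spheres. How many objects are left?"
scene_after = dict(scene, spheres=0)       # imagine the post-action state
print(sum(scene_after.values()))           # 3 + 0 + 1 = 4
```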
We present a new AI task, Embodied Question Answering (EmbodiedQA), where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ('orange'). This challenging task requires a range of AI skills: active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
In recent years, Visual Question Answering (VQA) has gained a lot of traction in the machine learning community due to the challenge of understanding information coming from multiple modalities (i.e., images and language). In VQA, a series of questions is posed based on a set of images, and the task at hand is to arrive at the answer. To achieve this, we take a symbolic reasoning approach using a formal logic framework. Images and questions are converted into symbolic representations on which explicit reasoning is performed. We propose a formal logic framework where (i) images are converted into logical background facts with the help of scene graphs, (ii) questions are translated into first-order predicate logic clauses by a transformer-based deep learning model, and (iii) a satisfiability check is performed by grounding the predicate clauses against the background knowledge to obtain the answer. Our proposed method is highly interpretable, and each step in the pipeline can easily be analyzed by a human. We validate our approach on the CLEVR and GQA datasets. We achieve near-perfect accuracy of 99.6% on the CLEVR dataset, comparable to state-of-the-art models, showing that formal logic is a viable tool for solving visual question answering. Our model is also data-efficient, achieving 99.1% accuracy on CLEVR when trained on only 10% of the training data.
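A heavily simplified, hypothetical sketch of this pipeline: scene-graph triples serve as background facts, the question becomes a conjunctive predicate-logic query, and the answer is obtained by grounding the query against the facts. The fact schema and query below are illustrative assumptions, not the paper's actual representation.

```python
# Background facts derived from a scene graph.
facts = {
    ("color", "obj1", "red"), ("shape", "obj1", "cube"),
    ("color", "obj2", "blue"), ("shape", "obj2", "sphere"),
    ("left_of", "obj1", "obj2"),
}

# "What is the color of the cube left of the sphere?"
# exists X, Y: shape(X, cube) & shape(Y, sphere) & left_of(X, Y) & color(X, Answer)
def answer_query(facts):
    objects = {o for _, o, _ in facts}
    for x in objects:
        for y in objects:
            if (("shape", x, "cube") in facts and ("shape", y, "sphere") in facts
                    and ("left_of", x, y) in facts):
                for pred, obj, value in facts:
                    if pred == "color" and obj == x:
                        return value
    return None

print(answer_query(facts))  # -> "red"
```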
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
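For reference, the automatic evaluation used with this dataset compares an open-ended answer against the ten human-provided answers per question; a simplified version of the metric (omitting the answer normalization and the averaging over annotator subsets done by the official evaluation script) is:

```python
def vqa_accuracy(predicted, human_answers):
    """An answer counts as fully correct if at least 3 annotators gave it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "orange".
print(vqa_accuracy("orange", ["orange"] * 4 + ["red"] * 6))  # -> 1.0
```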
Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
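As a toy illustration of dynamic module composition, the sketch below chains symbolic stand-ins for modules according to a layout such as describe[where](find[dog]); in the actual approach each module is a small, jointly trained neural network operating on image features, and the layout is derived from a parse of the question.

```python
# Toy "scene" and symbolic module stand-ins (the real modules are neural).
scene = [
    {"name": "dog", "location": "on the sofa", "color": "brown"},
    {"name": "cat", "location": "under the table", "color": "gray"},
]

def find(name):
    return lambda objects: [o for o in objects if o["name"] == name]

def describe(attribute, sub_module):
    return lambda objects: [o[attribute] for o in sub_module(objects)]

# Layout for "Where is the dog?" -> describe[location](find[dog])
network = describe("location", find("dog"))
print(network(scene))  # -> ['on the sofa']
```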
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.
Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural symbolic methods NSCL and NSVQA, and two non-symbolic methods FiLM and mDETR; and our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, form a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.
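A hypothetical sketch of one way to combine symbolic execution with perceptual uncertainty, in the spirit of the probabilistic extension described above: instead of committing to argmax attribute predictions, the executor weights each object by the perception module's attribute probabilities (the scene and numbers below are made up).

```python
# Per-object attribute distributions produced by a (hypothetical) perception module.
scene = [
    {"shape": {"cube": 0.7, "sphere": 0.3}, "color": {"red": 0.9, "blue": 0.1}},
    {"shape": {"cube": 0.2, "sphere": 0.8}, "color": {"red": 0.4, "blue": 0.6}},
]

def prob_match(obj, constraints):
    """Probability that an object satisfies all attribute constraints."""
    p = 1.0
    for attr, value in constraints.items():
        p *= obj[attr].get(value, 0.0)
    return p

# "How many red cubes are there?" -> expected count under perceptual uncertainty.
expected_count = sum(prob_match(o, {"color": "red", "shape": "cube"}) for o in scene)
print(expected_count)  # 0.7*0.9 + 0.2*0.4 = 0.71
```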
'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine the aforementioned encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
3D scene understanding is a relatively emerging research field. In this paper, we introduce the task of Visual Question Answering in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, we propose the first VQA-3D dataset, CLEVR3D, which contains 60K questions over 1,129 real-world scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions, covering object attributes (i.e., size, color, and material) and their spatial relationships. Built upon this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts a carefully designed Transformer architecture and achieves superior VQA-3D performance compared with a pure language baseline and previous 3D reasoning methods applied directly to 3D scenes. Experimental results verify that taking VQA-3D as an auxiliary task can boost the performance of 3D scene understanding, including scene graph analysis for node-wise classification and whole-graph recognition.
Current visual question answering (VQA) tasks mainly consider answering human-annotated questions about natural images. However, aside from natural images, abstract diagrams with semantic richness remain understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA), whose goal is to answer questions in the context of icon images. We release IconQA, a large-scale dataset consisting of 107,439 questions and three sub-tasks: multi-image choice, multi-text choice, and fill-in-the-blank. The IconQA dataset is inspired by real-world diagrams and highlights the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills such as object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models in learning semantic representations of icon images, we further release an icon dataset, Icon645, which contains 645,687 colored icons in 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Furthermore, we develop a strong IconQA baseline, Patch-TRM, which applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image.
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
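A greedy, hypothetical simplification of the distractor-selection idea behind Adversarial Matching: reuse answers written for other questions, preferring candidates that are relevant to the current question yet dissimilar from its correct answer. The real procedure solves a global matching problem with learned similarity models; the word-overlap scores below are stand-ins.

```python
def relevance(question, answer):
    # Stand-in for a learned question-answer relevance model.
    return len(set(question.lower().split()) & set(answer.lower().split()))

def similarity(a, b):
    # Stand-in for a learned answer-answer similarity model.
    return len(set(a.lower().split()) & set(b.lower().split()))

def pick_distractors(question, correct, candidate_pool, k=3, lam=1.0):
    """Pick k distractors: relevant to the question, unlike the correct answer."""
    scored = sorted(
        (a for a in candidate_pool if a != correct),
        key=lambda a: relevance(question, a) - lam * similarity(correct, a),
        reverse=True,
    )
    return scored[:k]
```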
Charts are a popular and effective form of data visualization. Chart question answering (CQA) is a task used to assess chart comprehension, and it is fundamentally different from understanding natural images. CQA requires analyzing the relationships between the textual and visual components of a chart in order to answer general questions or infer numerical values. Most existing CQA datasets and models are based on simplifying assumptions that often allow models to surpass human performance. In this work, we further explore the reasons behind this outcome and propose a new model that jointly learns classification and regression. Our language-vision setup with co-attention transformers captures the complex interactions between the question and the textual elements that commonly appear in real-world charts. We validate these conclusions with extensive experiments and breakdown analyses on the realistic PlotQA dataset, outperforming previous approaches by a large margin while showing competitive performance. Our model's edge is particularly pronounced on questions with out-of-vocabulary answers, many of which require regression. We hope this work will pave the way for further research on the challenging and highly practical task of chart comprehension.
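A minimal sketch of a joint classification/regression output head, which is one way to realize the co-learning described above; the dimensions, the answer-type gate, and the upstream question-chart fusion are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DualAnswerHead(nn.Module):
    """Predict a vocabulary answer, a numeric value, and which of the two to use."""
    def __init__(self, fused_dim=512, vocab_size=1000):
        super().__init__()
        self.classifier = nn.Linear(fused_dim, vocab_size)  # categorical answers
        self.regressor = nn.Linear(fused_dim, 1)            # numeric answers
        self.answer_type = nn.Linear(fused_dim, 2)          # categorical vs. numeric

    def forward(self, fused):                               # fused: (batch, fused_dim)
        return {
            "class_logits": self.classifier(fused),
            "value": self.regressor(fused).squeeze(-1),
            "type_logits": self.answer_type(fused),
        }
```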