Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle with domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated so that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural symbolic methods, NSCL and NSVQA, and two non-symbolic methods, FiLM and mDETR, as well as our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, forms a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.
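A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures can induce a rich set of semantic concepts and relations, and thus play an important role in the interpretation and organization of visual signals as well as in the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning because of finer-grained concepts, richer geometric relations, and more complex physics. Therefore, to better serve part-based conceptual, relational and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground-truth object-level and part-level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning, making the dataset a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.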
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, which all come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. A careful analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding of vision and language.
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. As a result, the integration of vision and language has attracted a lot of attention. The tasks have been created in such a way that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and discuss existing datasets, methods, problem formulations and evaluation measures for VQA and visual reasoning tasks, in order to understand vision and language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
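We introduce CLEVR-Math, a multi-modal math word problem dataset consisting of simple math word problems involving addition/subtraction, represented partly by a textual description and partly by an image illustrating the scenario. The text describes actions performed on the scene depicted in the image. Since the question posed may not be about the scene in the image, but about the state of the scene before or after the actions are applied, the solver must envision or imagine the state changes caused by these actions. Solving these word problems requires a combination of language, visual and mathematical reasoning. We apply state-of-the-art neural and neuro-symbolic models for visual question answering to CLEVR-Math and empirically evaluate their performance. Our results show how well the two approaches generalize to chains of operations. We discuss the limitations of both in addressing the task of multi-modal word problem solving.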
Videos often capture objects, their visible properties, their motion, and the interactions between different objects. Objects also have physical properties such as mass, which the imaging pipeline is unable to directly capture. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. In this paper, we introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene. CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning about the effect of actions, questions about planning in order to reach a goal, and descriptive questions about visible properties of objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings -- videos with objects with masses, coefficients of friction, and initial velocities that are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).
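Service robots should be able to interact naturally with non-expert users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, and therefore rely heavily on large datasets. Additionally, their transfer performance on RGB-D datasets suffers due to the high visual discrepancy between the benchmarks and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and on two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-to-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.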
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data can help understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at https://github.com/Lizw14/visual_probing.
A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers. GVQA is built off an existing VQA model, Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.
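The effect of changing answer priors can be illustrated with a minimal sketch. It is not the VQA-CP split procedure or the GVQA model; the toy samples, the function names `answer_priors` and `prior_only_accuracy`, and the "prior-only" baseline are all assumptions made for illustration. The baseline ignores the image and always returns the most frequent training answer for the question type, so its accuracy drops once the test prior differs from the training prior.

```python
from collections import Counter, defaultdict

def answer_priors(samples):
    """Map each question type to a Counter of the answers seen for it."""
    priors = defaultdict(Counter)
    for question_type, answer in samples:
        priors[question_type][answer] += 1
    return priors

def prior_only_accuracy(train, test):
    """Accuracy of a hypothetical baseline that never looks at the image.

    It answers every question with the most frequent training answer
    for that question type.
    """
    priors = answer_priors(train)
    correct = 0
    for question_type, answer in test:
        counts = priors.get(question_type)
        guess = counts.most_common(1)[0][0] if counts else None
        correct += int(guess == answer)
    return correct / len(test)

# Toy data (invented): in training, "what color" is mostly answered "white".
train = [("what color", "white")] * 8 + [("what color", "red")] * 2
# Test set drawn from the same prior as training.
same_prior_test = [("what color", "white")] * 8 + [("what color", "red")] * 2
# Test set with a changed prior: "red" is now the dominant answer.
changed_prior_test = [("what color", "white")] * 2 + [("what color", "red")] * 8

print(prior_only_accuracy(train, same_prior_test))
print(prior_only_accuracy(train, changed_prior_test))
```

On the toy data the baseline answers "white" every time, so it scores well only when the test set shares the training prior. A model that grounds its answer in the image should be far less sensitive to this shift.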
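Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most efforts focus only on 2D image question answering tasks. In this paper, we present the first attempt at extending VQA to the 3D domain, which can facilitate artificial intelligence's perception of 3D real-world scenarios. Unlike image-based VQA, 3D Question Answering (3DQA) takes a colored point cloud as input and requires both appearance and 3D geometry comprehension to answer 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders for exploiting the appearance and geometry information, respectively. The multi-modal information of appearance, geometry, and the linguistic question can finally attend to each other via a 3D-linguistic BERT to predict the target answers. To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset, "ScanQA", which builds on the ScanNet dataset and contains ∼6K questions and ∼30K answers for 806 scenes. Extensive experiments on this dataset demonstrate the clear superiority of our proposed 3DQA framework over existing VQA frameworks, as well as the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate research in this direction.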
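3D scene understanding is a relatively emerging research field. In this paper, we introduce the Visual Question Answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, we propose the first VQA-3D dataset, namely CLEVR3D, which contains 60K questions in 1,129 real-world scenes. Specifically, we develop a question engine that leverages 3D scene graph structures to generate diverse reasoning questions, covering questions about objects' attributes (i.e., size, color, and material) and their spatial relationships. Building on this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts well-designed Transformer architectures to achieve superior VQA-3D performance compared with the pure language baseline and with previous 3D reasoning methods directly applied to 3D scenarios. Experimental results verify that using VQA-3D as an auxiliary task can boost the performance of 3D scene understanding, including scene graph analysis for node-wise classification and whole-graph recognition.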
Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
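The idea of assembling a question-specific network from reusable modules can be sketched in a few lines. This is a toy illustration only: the "modules" below are plain Python functions over a hand-written symbolic scene, not trained neural networks, and the names `find`, `describe`, `assemble`, and the `"describe:color"` layout notation are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

# Toy symbolic scene: each object is a dict of attributes (invented data).
Scene = List[Dict[str, str]]

def find(name: str) -> Callable[[Scene], Scene]:
    """Reusable module that selects the objects of one category."""
    return lambda scene: [obj for obj in scene if obj["name"] == name]

def describe(attribute: str) -> Callable[[Scene], str]:
    """Reusable module that reads one attribute off the selected objects."""
    def run(selected: Scene) -> str:
        return selected[0][attribute] if selected else "unknown"
    return run

def assemble(layout: List[str]) -> Callable[[Scene], str]:
    """Instantiate a question-specific pipeline from a two-step layout.

    A layout such as ["describe:color", "find:dog"] stands in for the
    linguistic substructure extracted from the question.
    """
    reader_spec, finder_spec = layout
    attribute = reader_spec.split(":")[1]
    target = finder_spec.split(":")[1]
    finder, reader = find(target), describe(attribute)
    return lambda scene: reader(finder(scene))

scene = [
    {"name": "dog", "color": "brown", "where": "on the couch"},
    {"name": "cat", "color": "black", "where": "under the table"},
]

# The three questions from the abstract share substructure:
# the same find/describe modules are recombined for each one.
where_is_the_dog = assemble(["describe:where", "find:dog"])
what_color_is_the_dog = assemble(["describe:color", "find:dog"])
where_is_the_cat = assemble(["describe:where", "find:cat"])

print(where_is_the_dog(scene))
print(what_color_is_the_dog(scene))
print(where_is_the_cat(scene))
```

In an actual neural module network each module would be a small differentiable network operating on image features and attention maps, and the modules would be trained jointly through whichever layouts the questions induce.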
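A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look forward to a higher level of understanding of, and reasoning about, visual scenes. For example, given an image, we want not only to detect and recognize the objects in it, but also to know the relationships between objects (visual relationship detection) and to generate a text description based on the image content (image captioning). Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even to remove the dog from the image and find similar images (image editing and retrieval). These tasks require a higher level of understanding and reasoning in image vision tasks, and the scene graph is exactly such a powerful tool for scene understanding. Scene graphs have therefore attracted the attention of a large number of researchers, and the related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey provides a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, and then give a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.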
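Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial, temporal, and spatio-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization is becoming a challenge for existing visual models. Most existing methods tend to fit the original data or variable distributions and ignore the essential causal relations behind the multi-modal knowledge; as a result, there is no unified guidance or analysis explaining why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to achieve robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and highlight the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.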
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
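As background on why open-ended answers can be scored automatically, here is a minimal sketch of the consensus-based accuracy commonly associated with this benchmark: a predicted answer counts as fully correct when at least three human annotators gave the same answer. The abstract above does not spell this rule out, so treat the formula, the function name `vqa_accuracy`, the simple lowercase/strip normalization, and the example answers as assumptions rather than the official evaluation script.

```python
from collections import Counter
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Consensus accuracy: min(number of humans who gave this answer / 3, 1).

    Normalization here is only lowercasing and whitespace stripping,
    which is an assumed simplification for this sketch.
    """
    counts = Counter(answer.strip().lower() for answer in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# Invented example: ten human answers to one question about an image.
human_answers = ["yes"] * 2 + ["no"] * 8

print(vqa_accuracy("no", human_answers))     # agreed with by 8 humans
print(vqa_accuracy("yes", human_answers))    # agreed with by 2 humans
print(vqa_accuracy("maybe", human_answers))  # agreed with by nobody
```

Because the score depends only on string matches against a small set of short human answers, it can be computed without human judges, which is the property the abstract relies on.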
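Recently, 3D vision-and-language tasks have attracted increasing research interest. Compared to other vision-and-language tasks, the 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity. Meanwhile, the few recently proposed 3D VQA datasets do not support the 3D VQA task well because of their limited scale and annotation methods. In this work, we formally define and address a 3D grounded VQA task by collecting a new 3D VQA dataset, referred to as FE-3DGQA, with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations. To obtain more explainable answers, we label the objects that appear in the complex QA pairs with different semantic types, including answer-grounded objects (both those that appear in the question and those that do not) and contextual objects for the answer-grounded objects. We also propose a new 3D VQA framework to effectively predict completely visually grounded and explainable answers. Extensive experiments verify that our newly collected benchmark dataset can be effectively used to evaluate various 3D VQA methods from different aspects, and that our newly proposed framework achieves state-of-the-art performance on the new benchmark dataset. Both the newly collected dataset and our code will be publicly available at http://github.com/zlccccc/3dgqa.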