We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. The integration of vision and language has sparked a lot of attention as a result of this. The tasks have been created in such a way that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and an extensive review of the state of the arts approaches, key models design principles and discuss existing datasets, methods, their problem formulation and evaluation measures for VQA and Visual reasoning tasks to understand vision and language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
translated by 谷歌翻译
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
translated by 谷歌翻译
A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers. GVQA is built off an existing VQA model -Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.
translated by 谷歌翻译
We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge. * Contributed equally. † Work done as an undergraduate at Cornell University. 1 In parts of this paper, we use the term compositional differently than it is commonly used in linguistics to refer to reasoning that requires composition. This type of reasoning often manifests itself in highly compositional language.2 Appendix G contains license information for all photographs used in this paper. 3 The top example is True, while the bottom is False.
translated by 谷歌翻译
视觉问题回答(VQA)近年来见证了巨大进展。但是,大多数努力只关注2D图像问题应答任务。在本文中,我们介绍了将VQA扩展到3D域的第一次尝试,这可以促进人工智能对3D现实世界情景的看法。与基于图像的VQA不同,3D问题应答(3DQA)将颜色点云作为输入,需要外观和3D几何理解能力来回答3D相关问题。为此,我们提出了一种基于新颖的基于变换器的3DQA框架\ TextBF {“3DQA-TR”},其包括两个编码器,分别用于利用外观和几何信息。外观,几何和的多模码信息语言问题最终可以通过3D语言伯特互相参加,以预测目标答案。要验证我们提出的3DQA框架的有效性,我们还开发了第一个建立的3DQA DataSet \ TextBF {“scanqa”} SCANNet DataSet并包含$ \ SIM $ 6K问题,$ \ SIM $ 30k答案,可满足806美元的场景。在此数据集上的广泛实验展示了我们提出的3DQA框架在现有的VQA框架上的明显优势,以及我们主要设计的有效性。我们的代码和数据集将公开可用于促进此方向的研究。
translated by 谷歌翻译
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
translated by 谷歌翻译
视觉问题应答(VQA)任务利用视觉图像和语言分析来回回答图像的文本问题。它是一个流行的研究课题,在过去十年中越来越多的现实应用。本文介绍了我们最近对AliceMind-MMU的研究(阿里巴巴的编码器 - 解码器来自Damo Academy - 多媒体理解的机器智能实验室),其比人类在VQA上获得相似甚至略微更好的结果。这是通过系统地改善VQA流水线来实现的,包括:(1)具有全面的视觉和文本特征表示的预培训; (2)与学习参加的有效跨模型互动; (3)一个新颖的知识挖掘框架,具有专门的专业专家模块,适用于复杂的VQA任务。处理不同类型的视觉问题,需要具有相应的专业知识在提高我们的VQA架构的表现方面发挥着重要作用,这取决于人力水平。进行了广泛的实验和分析,以证明新的研究工作的有效性。
translated by 谷歌翻译
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, which all come with functional programs that represent their semantics. We use the programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding and plausibility. A careful analysis is performed for baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1%, and strong VQA models achieve 54.1%, human performance tops at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and deeper semantic understanding of vision and language.
translated by 谷歌翻译
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a largescale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
translated by 谷歌翻译
The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in the recent past. While most of the existing VQA models focus on improving the accuracy of VQA, the way models arrive at an answer is oftentimes a black box. As a step towards making the VQA task more explainable and interpretable, our method is built upon the SOTA VQA framework by augmenting it with an end-to-end explanation generation module. In this paper, we investigate two network architectures, including Long Short-Term Memory (LSTM) and Transformer decoder, as the explanation generator. Our method generates human-readable textual explanations while maintaining SOTA VQA accuracy on the GQA-REX (77.49%) and VQA-E (71.48%) datasets. Approximately 65.16% of the generated explanations are approved by humans as valid. Roughly 60.5% of the generated explanations are valid and lead to the correct answers.
translated by 谷歌翻译
在过去的几年中,在文化遗产领域中使用深度学习和计算机视觉在文化遗产领域变得非常相关,其中包括有关音频智能指南,互动博物馆和增强现实的大量应用。所有这些技术都需要大量数据才能有效工作并对用户有用。在艺术品的背景下,专家在昂贵且耗时的过程中注释了此类数据。特别是,对于每件艺术品,必须收集艺术品和描述表的图像,以执行诸如视觉问题回答之类的常见任务。在本文中,我们提出了一种视觉问题回答的方法,该方法允许在运行时生成一个描述表,该表可用于回答有关艺术品的视觉和上下文问题,从而完全避免了图像和注释过程。为此,我们研究了使用GPT-3来生成描述用于艺术品,以分析通过字幕指标分析生成的描述的质量。最后,我们评估了视觉问答答案和字幕任务的性能。
translated by 谷歌翻译
Figure 1: Our MovieQA dataset contains 14,944 questions about 408 movies. It contains multiple sources of information: plots, subtitles, video clips, scripts, and DVS transcriptions. In this figure we show example QAs from The Matrix and localize them in the timeline.
translated by 谷歌翻译
目前的视觉问题应答(VQA)任务主要考虑回答自然图像的人为注释问题。然而,除了自然图像之外,在视觉理解和推理研究中仍然可以解读具有语义丰富性的抽象图。在这项工作中,我们介绍了ICON问题的新挑战(ICONQA),其目标是在图标图像上下文中回答问题。我们发布了ICONQA,这是一个由107,439个问题和三个子任务组成的大型数据集:多图像选择,多文本选择和填充空白。 ICONQA数据集是由真实世界图中的启发,突出了抽象图理解和综合认知推理的重要性。因此,ICONQA不仅需要对象识别和文本理解等感知技能,而且还需要多种认知推理技能,例如几何推理,致辞推理和算术推理。为了促进潜在的iconqa模型来学习图标图像的语义表示,我们进一步发布了一个图标数据集图标645,其中包含377级上的645,687个彩色图标。我们进行广泛的用户研究和盲目实验,并重现各种先进的VQA方法来基准iconQA任务。此外,我们开发了一个强大的ICONQA基线Patch-TRM,它应用金字塔跨模型变压器,其中包含在图标数据集上预先培训的输入图嵌入式。 iconqa和图标645可在https://iconqa.github.io提供。
translated by 谷歌翻译
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use semantic representation of a question as query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. The visualization of the attention layers illustrates the progress that the SAN locates the relevant visual clues that lead to the answer of the question layer-by-layer.
translated by 谷歌翻译
通用视觉(GPV)系统是旨在解决各种视觉任务的模型,而无需进行架构更改。如今,GPV主要从大型完全监督的数据集中学习技能和概念。通过获取数据以迅速学习每个技能的每个概念,将GPV扩展到数万个概念都变得令人望而却步。这项工作提出了一种有效且廉价的替代方法:从监督数据集中学习技能,从Web图像搜索中学习概念,并利用GPV的关键特征:跨技能传递视觉知识的能力。我们使用跨越10K+视觉概念的1M+图像的数据集来演示3个基准上的两个现有GPV(GPV-1和VL-T5)的Webly Supumented概念扩展:5个基于可可的数据集(80个主要概念),这是一个新的策划系列,这是一个新的策划系列。基于OpenImages和VisualGenome存储库(〜500个概念)以及Web衍生的数据集(10K+概念)的5个数据集。我们还提出了一种新的体系结构GPV-2,该架构支持各种任务 - 从分类和本地化等视觉任务到Qu Viewer+语言任务,例如QA和字幕,再到更多的利基市场,例如人类对象互动检测。 GPV-2从Web数据中受益匪浅,并且在这些基准测试中胜过GPV-1和VL-T5。我们的数据,代码和Web演示可在https://prior.allenai.org/projects/gpv2上获得。
translated by 谷歌翻译
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo~\cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
translated by 谷歌翻译