The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in the recent past. While most of the existing VQA models focus on improving the accuracy of VQA, the way models arrive at an answer is oftentimes a black box. As a step towards making the VQA task more explainable and interpretable, our method is built upon the SOTA VQA framework by augmenting it with an end-to-end explanation generation module. In this paper, we investigate two network architectures, including Long Short-Term Memory (LSTM) and Transformer decoder, as the explanation generator. Our method generates human-readable textual explanations while maintaining SOTA VQA accuracy on the GQA-REX (77.49%) and VQA-E (71.48%) datasets. Approximately 65.16% of the generated explanations are approved by humans as valid. Roughly 60.5% of the generated explanations are valid and lead to the correct answers.
translated by 谷歌翻译
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
translated by 谷歌翻译
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. The integration of vision and language has sparked a lot of attention as a result of this. The tasks have been created in such a way that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and an extensive review of the state of the arts approaches, key models design principles and discuss existing datasets, methods, their problem formulation and evaluation measures for VQA and Visual reasoning tasks to understand vision and language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
translated by 谷歌翻译
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo~\cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
translated by 谷歌翻译
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models. While current models offer impressive performance on task accuracy and explanation plausibility, they suffer from a range of issues: Some models feature a modular design where the explanation generation module is poorly integrated with a separate module for task-answer prediction, employ backbone models trained on limited sets of tasks, or incorporate ad hoc solutions to increase performance on single datasets. We propose to evade these limitations by applying recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks. Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets. As a novel challenge in VL-NLE research, we propose the problem of multi-task VL-NLE and show that jointly training on multiple tasks can increase the explanation quality. We discuss the ethical implications of high-quality NLE generation and other issues in recent VL-NLE research.
translated by 谷歌翻译
在过去的几年中,在文化遗产领域中使用深度学习和计算机视觉在文化遗产领域变得非常相关,其中包括有关音频智能指南,互动博物馆和增强现实的大量应用。所有这些技术都需要大量数据才能有效工作并对用户有用。在艺术品的背景下,专家在昂贵且耗时的过程中注释了此类数据。特别是,对于每件艺术品,必须收集艺术品和描述表的图像,以执行诸如视觉问题回答之类的常见任务。在本文中,我们提出了一种视觉问题回答的方法,该方法允许在运行时生成一个描述表,该表可用于回答有关艺术品的视觉和上下文问题,从而完全避免了图像和注释过程。为此,我们研究了使用GPT-3来生成描述用于艺术品,以分析通过字幕指标分析生成的描述的质量。最后,我们评估了视觉问答答案和字幕任务的性能。
translated by 谷歌翻译
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译
在回答问题时,人类会利用跨不同模式可用的信息来综合一致,完整的思想链(COT)。在深度学习模型(例如大规模语言模型)的情况下,这个过程通常是黑匣子。最近,科学问题基准已用于诊断AI系统的多跳推理能力和解释性。但是,现有数据集无法为答案提供注释,或仅限于仅文本模式,小尺度和有限的域多样性。为此,我们介绍了科学问题答案(SQA),这是一个新的基准,由〜21k的多模式多种选择问题组成,其中包含各种科学主题和答案的注释,并提供相应的讲座和解释。我们进一步设计语言模型,以学习将讲座和解释作为思想链(COT),以模仿回答SQA问题时的多跳上推理过程。 SQA在语言模型中展示了COT的实用性,因为COT将问题的答案绩效提高了1.20%的GPT-3和3.99%的unifiedqa。我们还探索了模型的上限,以通过喂食输入中的那些来利用解释;我们观察到它将GPT-3的少量性能提高了18.96%。我们的分析进一步表明,与人类类似的语言模型受益于解释,从较少的数据中学习并仅使用40%的数据实现相同的性能。
translated by 谷歌翻译
从预训练的语言模型中进行的引导已被证明是用于建立基础视觉模型(VLM)的有效方法,例如图像字幕或视觉问题的答案。但是,很难用它来使模型符合用户的理由来获得特定答案。为了引起和加强常识性原因,我们提出了一个迭代采样和调整范式,称为Illume,执行以下循环:给定图像问题提示提示,VLM采样了多个候选人,并通过人类评论家通过偏好提供最小的反馈。选择,用于微调。该循环增加了训练数据,并逐渐雕刻出VLM的合理化功能。我们的详尽实验表明,Illume在使用较少的培训数据的同时,仅需要最少的反馈,与标准监督的微调竞争。
translated by 谷歌翻译
与自然语言解释的视觉结合旨在推断文本图像对之间的关​​系并生成句子以解释决策过程。先前的方法主要依靠预先训练的视觉模型来执行关系推断和语言模型来生成相应的解释。但是,预训练的视觉模型主要在文本和图像之间建立令牌级别的对齐,但忽略了短语(块)和视觉内容之间的高级语义对齐,这对于视觉推理至关重要。此外,仅基于编码的联合表示形式的解释生成器并未明确考虑关键的关系推理的决策点。因此,产生的解释不太忠于视觉语言推理。为了减轻这些问题,我们提出了一种统一的块意见对齐和基于词汇约束的方法,称为CALEC。它包含一个块感知的语义交互器(ARR。CSI),一个关系属性和词汇约束感知的发生器(arr。Lecg)。具体而言,CSI利用语言和各个图像区域固有的句子结构来构建块感知语义对齐。关系下属使用基于注意力的推理网络来合并令牌级别和块级视觉语言表示。 LECG利用词汇约束来将关系下列者重点关注的单词或块纳入解释世代,从而提高了解释的忠诚和信息性。我们在三个数据集上进行了广泛的实验,实验结果表明,CALEC在推理准确性和生成的解释的质量方面显着优于其他竞争者模型。
translated by 谷歌翻译
在本文中,我们提出了端到端的结构化多峰关注(SMA)神经网络,主要解决了上述前两个问题。 SMA首先使用结构图表示来编码图像中出现的对象对象,对象文本和文本文本关系,然后设计多模式图注意网络以推理它。最后,由上述模块的输出由全局本地注意力应答模块处理,以通过跟随M4C迭代地生成从两个OCR和常规词汇拼接的答案。我们所提出的模型优于TextVQA数据集上的SOTA模型以及除基于预先训练的水龙头之外的所有模型中的所有模型中的ST-VQA数据集的两个任务。展示了强大的推理能力,它还在TextVQA挑战中获得了第一名的第一名。我们在几种推理模型中广泛测试了不同的OCR方法,并调查了逐步提高了OCR性能对TextVQA基准的影响。通过更好的OCR结果,不同的型号对VQA准确性的戏剧性提高,但我们的模型受益最强烈的文本视觉推理能力。要授予我们的方法,并为进一步作品提供公平的测试基础,我们还为TextVQA数据集提供人为的地面实际OCR注释,这些ocr注释未在原始版本中提供。 TextVQA数据集的代码和地面ocr注释在https://github.com/chenyugao-cs/sma提供
translated by 谷歌翻译
最近,3D视觉和语言任务吸引了不断增长的研究兴趣。与其他视觉和语言任务相比,3D视觉问题回答(VQA)任务的利用较小,并且更容易受到语言先验和共同参考的歧义。同时,由于规模和注释方法有限,最近提出的几个3D VQA数据集并不能很好地支持3D VQA任务。在这项工作中,我们通过收集一个新的3D VQA数据集(称为FE-3DGQA),正式定义和解决3D接地的VQA任务,并具有多样化且相对自由形式的提问,以及密集和完全接地的边界框注释。为了获得更多可解释的答案,我们标记了出现在复杂的质量检查对中的对象,该对象具有不同的语义类型,包括答案接地的对象(均出现并未出现在问题中),以及用于答案的对象的上下文对象。我们还提出了一个新的3D VQA框架,以有效地预测完全视觉扎根和可解释的答案。广泛的实验证明,我们新收集的基准数据集可有效地用于评估不同方面的各种3D VQA方法,而我们新提出的框架也可以在新的基准数据集中实现最新的性能。新收集的数据集和我们的代码都将在http://github.com/zlccccc/3dgqa上公开获得。
translated by 谷歌翻译
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
translated by 谷歌翻译
在传统的视觉问题(VQG)中,大多数图像具有多个概念(例如,对象和类别),可以生成问题,但培训模型以模仿培训数据中给出的任意选择概念。这使得训练困难并且还造成评估问题 - 对于大多数图像而言,存在多个有效问题,但人类参考资料只捕获一个或多个。我们呈现指导视觉问题 - VQG的变体,它根据对问题类型和应该探索的对象的期望来解决基于分类信息的问题生成器。我们提出了两个变体:(i)明确指导的模型,使演员(人机或自动化)能够选择哪些对象和类别来生成问题; (ii)基于离散潜在变量的基于离散潜变量,了解了一个隐式导游的模型,该模型将了解条件的哪些对象和类别。在答案类别增强VQA数据集上评估所提出的模型,我们的定量结果显示了对现有技术的大大改进(超过9bleu-4增加)。人类评估验证指导有助于生成语法相干的问题,并与给定的图像和对象相关。
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
基于知识的视觉问题答案(KVQA)任务旨在回答需要其他外部知识以及对图像和问题的理解的问题。关于KVQA的最新研究以多模式形式注入外部知识,并且随着更多的知识,可能会添加无关紧要的信息,并且可能会混淆问题的回答。为了正确使用知识,本研究提出了以下内容:1)我们介绍了根据标题不确定性和语义相似性计算出的新型语义不一致度量;2)我们建议一种基于语义不一致度量的新的外部知识同化方法,并将其应用于集成KVQA的明确知识和隐性知识;3)使用OK-VQA数据集评估所提出的方法并实现最新性能。
translated by 谷歌翻译
视觉问题应答(VQA)任务利用视觉图像和语言分析来回回答图像的文本问题。它是一个流行的研究课题,在过去十年中越来越多的现实应用。本文介绍了我们最近对AliceMind-MMU的研究(阿里巴巴的编码器 - 解码器来自Damo Academy - 多媒体理解的机器智能实验室),其比人类在VQA上获得相似甚至略微更好的结果。这是通过系统地改善VQA流水线来实现的,包括:(1)具有全面的视觉和文本特征表示的预培训; (2)与学习参加的有效跨模型互动; (3)一个新颖的知识挖掘框架,具有专门的专业专家模块,适用于复杂的VQA任务。处理不同类型的视觉问题,需要具有相应的专业知识在提高我们的VQA架构的表现方面发挥着重要作用,这取决于人力水平。进行了广泛的实验和分析,以证明新的研究工作的有效性。
translated by 谷歌翻译
人们说:“一张照片值一千字”。那么,我们如何从图像中获取丰富的信息?我们认为,通过使用视觉线索来桥接大型的识别视觉基础模型和语言模型,我们可以无需任何额外的跨模式训练。得益于基础模型的强大零拍功能,我们首先构建图像的丰富语义表示(例如,图像标签,对象属性 /位置,字幕)作为结构化的文本提示,称为视觉线索,使用视觉基础模型。基于视觉线索,我们使用大型语言模型为视觉内容生成一系列综合描述,然后再次通过视觉模型验证,以选择与图像最合适的候选人。我们通过定量和定性测量评估生成的描述的质量。结果证明了这种结构化语义表示的有效性。
translated by 谷歌翻译