Visual relation reasoning has become a core component of recent cross-modal analysis tasks, aiming to reason about the visual relationships between objects and their attributes. These relationships convey rich semantics and help enhance visual representations for better cross-modal analysis. Prior work has successfully designed latent-relation or rigid categorical relation modeling strategies and achieved performance gains. However, such approaches overlook the ambiguity inherent in relationships, since the same relation can carry different semantics under different visual appearances. In this work, we explore modeling relations through context-sensitive embeddings based on human prior knowledge. We propose a novel plug-and-play relation reasoning module that injects relation embeddings to enhance the image encoder. Specifically, we design an upgraded graph convolutional network (GCN) that exploits both object embeddings and the directionality of relations between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relation reasoning module by applying it to the visual question answering (VQA) and cross-modal information retrieval (CMIR) tasks. Extensive experiments are conducted on the VQA 2.0 and CMPlaces datasets, and superior performance is reported in comparison with state-of-the-art work.
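The abstract does not give implementation details; as a rough sketch under assumed names (not the authors' code), a direction-aware GCN layer of the kind described could combine each object's embedding with relation embeddings from its incoming and outgoing edges:

```python
# Hypothetical sketch of a relation-aware GCN layer: objects are nodes, directed edges carry
# relation embeddings (e.g. from a prior-knowledge relation predictor). All names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    def __init__(self, obj_dim, rel_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(obj_dim, out_dim)           # transform of the node itself
        self.w_in = nn.Linear(obj_dim + rel_dim, out_dim)   # messages along incoming edges
        self.w_out = nn.Linear(obj_dim + rel_dim, out_dim)  # messages along outgoing edges

    def forward(self, obj_feats, edges, rel_embeds):
        # obj_feats: (N, obj_dim); edges: list of (src, dst) index pairs; rel_embeds: (E, rel_dim)
        msgs = obj_feats.new_zeros(obj_feats.size(0), self.w_self.out_features)
        for e, (src, dst) in enumerate(edges):
            # direction-aware messages: subject->object and object->subject use different weights
            msgs[dst] += self.w_in(torch.cat([obj_feats[src], rel_embeds[e]]))
            msgs[src] += self.w_out(torch.cat([obj_feats[dst], rel_embeds[e]]))
        return F.relu(self.w_self(obj_feats) + msgs)  # relation-aware object representations
```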
People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
Multimodal attention networks are currently the state-of-the-art models for visual question answering (VQA) tasks involving real images. Although attention focuses on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning required by VQA and other high-level tasks. In this paper, we propose MuRel, a multimodal relational network that learns to reason over real images end-to-end. Our first contribution is the MuRel cell, an atomic reasoning primitive that represents interactions between the question and image regions through rich vector representations, and models region relations with pairwise combinations. Second, we integrate the cell into a full MuRel network that progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies and demonstrate its superiority over attention-based methods on three datasets: VQA 2.0, VQA-CP v2, and TDIUC. Our final MuRel network is competitive with or exceeds state-of-the-art results in this challenging setting. Our code is available at: https://github.com/Cadene/murel.bootstrap.pytorch
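A minimal, assumed simplification of such a pairwise relational cell (module and tensor names are illustrative, not the authors' implementation) might look like:

```python
# Sketch of a MuRel-style cell: fuse each region with the question vector, then refine each
# region with an aggregate over pairwise region-region interaction vectors. Assumed design.
import torch
import torch.nn as nn

class MuRelCellSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())      # question-region fusion
        self.pairwise = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # region-region interaction

    def forward(self, regions, question):
        # regions: (N, dim); question: (dim,)
        q = question.expand(regions.size(0), -1)
        m = self.fuse(torch.cat([regions, q], dim=1))                       # (N, dim)
        n = m.size(0)
        # all ordered region pairs, aggregated with a max, as a stand-in for pairwise modeling
        pairs = torch.cat([m.unsqueeze(1).expand(n, n, -1),
                           m.unsqueeze(0).expand(n, n, -1)], dim=2)         # (N, N, 2*dim)
        rel = self.pairwise(pairs).max(dim=1).values                        # (N, dim)
        return regions + m + rel                                            # residual update of region states
```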
Collaborative reasoning for understanding each image-question pair is critical for interpretable visual question answering systems, yet remains under-explored. Although recent work has attempted to compose the multiple sub-tasks embedded in a question through explicit compositional processes, these models rely on annotations or hand-crafted rules to obtain a valid reasoning procedure, leading to either a heavy labeling workload or poor performance on compositional scheduling. In this paper, to better align the image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; we call our model the Parse-Tree-Guided Reasoning Network (PTGRN). The network consists of three collaborative modules: i) an attention module to exploit local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives image cues following the question-driven parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and visualization results highlight the explainability of our reasoning system.
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture [3, 2] implements this approach to question answering by parsing questions into linguistic sub-structures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. A demo and code are provided.
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures, without explicitly modeling the underlying reasoning process. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator, which constructs an explicit representation of the reasoning process to be performed, and an execution engine, which executes the generated program to produce the answer. Both the program generator and the execution engine are implemented as neural networks, and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and the semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable symbolic programs. To bridge the learning of the two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogous to human concept learning, the perception module learns visual concepts based on the language descriptions of the objects being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the search over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model in learning visual concepts, word representations, and the semantic parsing of sentences. Furthermore, our method allows easy generalization to new object attributes, compositions, language concepts, scenes, and questions, and even new program domains. It also supports applications including visual question answering and bidirectional image-text retrieval.
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
Visual dialog entails answering a sequence of questions grounded in an image, using the dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution, which involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in the image. This is crucial, especially for pronouns (e.g., "it"), as the dialog agent must first link them to a previous coreference (e.g., "boat") before it can rely on the visual grounding of that coreferent "boat" to reason about the pronoun. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over the history, or (b) at a coarse level for the entire question, rather than explicitly at the finer granularity of words and phrases. In this work, we propose a neural module network architecture for visual dialog, introducing two novel modules, Refer and Exclude, that perform explicit, grounded coreference resolution at the finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-heavy dataset, by achieving near-perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches and is qualitatively more interpretable, grounded, and consistent.
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
Recently, visual question answering (VQA) has become one of the most important tasks in multimodal learning, as it requires understanding both the visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn a joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies exploit external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be irrelevant to the VQA task and have limited semantic capacity. To better utilize semantic knowledge in images, we propose a new framework to learn visual relation facts for VQA. Specifically, we build the Relation-VQA (R-VQA) dataset on top of the Visual Genome dataset via a semantic similarity module, in which each data instance consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A well-defined relation detector is then adopted to predict relation facts relevant to the visual question. We further propose a multi-step attention model composed of visual attention and semantic attention to extract the related visual knowledge and semantic knowledge. We conduct comprehensive experiments on two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both the question and the image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable the textual and visual pipelines to mutually control each other. Experiments show that CMM significantly outperforms most related models, and reaches state-of-the-art performance on two visual reasoning benchmarks, CLEVR and NLVR, collected from synthetic and natural languages respectively. Ablation studies confirm that both our multi-step framework and our visual-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR.
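FiLM itself is a general-purpose conditioning layer; the sketch below shows only the basic feature-wise linear modulation primitive (the full CMM cascade and its mutual text/vision control are not reproduced here, and the class and parameter names are assumptions):

```python
# Minimal FiLM sketch: a conditioning vector predicts a per-channel scale and shift
# that modulate a convolutional feature map.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat_map, cond):
        # feat_map: (B, C, H, W); cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1), broadcast over spatial dims
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feat_map + beta             # feature-wise linear modulation
```

In CMM, as described above, such modulation is applied in both directions at each step, so that the question representation steers the visual features and the visual features in turn steer the language pipeline.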
A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to the correct answer prediction.
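As a rough illustration (my own simplification, not the paper's code), one dense, bi-directional attention step can be written as an affinity matrix normalized along each axis, so that words attend to regions and regions attend to words symmetrically:

```python
# Sketch of symmetric word-region co-attention; shapes and the residual update are assumptions.
import torch

def co_attend(words, regions):
    # words: (T, d) question word features; regions: (R, d) image region features
    affinity = words @ regions.t() / (words.size(1) ** 0.5)      # (T, R) word-region affinities
    word_to_region = torch.softmax(affinity, dim=1) @ regions    # each word summarizes regions, (T, d)
    region_to_word = torch.softmax(affinity, dim=0).t() @ words  # each region summarizes words, (R, d)
    return words + word_to_region, regions + region_to_word      # both modalities updated; stackable
```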
Humans learn to solve increasingly complex tasks by building on previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn: most do not require completely independent solutions, but can be decomposed into simpler subtasks. We propose to represent the solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a manner similar to a functional program. Lower-level modules are black boxes to the calling module, and communicate only via queries and outputs. A module for a new task thus learns to query existing modules and compose their outputs to produce its own output. Our model effectively combines previously acquired skills, does not suffer from forgetting, and is fully differentiable. We test our model on learning a set of visual reasoning tasks, and demonstrate improved performance on all tasks through progressive learning. By evaluating the reasoning process with human judges, we show that our model is more interpretable than an attention-based baseline.
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper, we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We test RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR, we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
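The RN composite function is commonly written as RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ), with the question vector appended to each object pair for VQA; a minimal sketch follows (layer sizes and module names are assumptions, not the published implementation):

```python
# Sketch of a Relation Network: g scores every ordered pair of objects conditioned on the
# question, the pair outputs are summed, and f maps the sum to answer logits.
import torch
import torch.nn as nn

class RelationNetworkSketch(nn.Module):
    def __init__(self, obj_dim, q_dim, hidden, out_dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects, question):
        # objects: (N, obj_dim); question: (q_dim,)
        n = objects.size(0)
        o_i = objects.unsqueeze(1).expand(n, n, -1)
        o_j = objects.unsqueeze(0).expand(n, n, -1)
        q = question.expand(n, n, -1)
        pair_feats = self.g(torch.cat([o_i, o_j, q], dim=2))  # (N, N, hidden), one vector per pair
        return self.f(pair_feats.sum(dim=(0, 1)))             # aggregate over all pairs, then answer
```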
An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
Inspired by the recent success of text-based question answering, visual question answering (VQA) is proposed to automatically answer natural language questions with the reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process on the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention. First, we generate semantic concepts from high-level semantics in convolutional neural networks (CNN) and select those question-related concepts as semantic attention. Second, we encode region-based middle-level outputs from the CNN into a spatially-embedded representation by a bidirectional recurrent neural network, and further pinpoint the answer-related regions by a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention, and question embedding by a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach outperforms the state of the art on two challenging VQA datasets.
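A hedged sketch of the two-level idea, combining question-guided attention over semantic concept embeddings with attention over region features before a softmax classifier (all module and dimension names are assumptions, not the authors' implementation):

```python
# Sketch of multi-level attention: one attention head over concept embeddings (semantic
# attention), one over region features (visual attention), fused with the question for answering.
import torch
import torch.nn as nn

class MultiLevelAttentionSketch(nn.Module):
    def __init__(self, dim, num_answers):
        super().__init__()
        self.sem_score = nn.Linear(2 * dim, 1)   # scores a (concept, question) pair
        self.vis_score = nn.Linear(2 * dim, 1)   # scores a (region, question) pair
        self.classifier = nn.Linear(3 * dim, num_answers)

    def attend(self, items, question, scorer):
        q = question.expand(items.size(0), -1)
        weights = torch.softmax(scorer(torch.cat([items, q], dim=1)), dim=0)  # (N, 1)
        return (weights * items).sum(dim=0)                                   # weighted context vector

    def forward(self, concepts, regions, question):
        # concepts: (C, dim); regions: (R, dim); question: (dim,)
        sem_ctx = self.attend(concepts, question, self.sem_score)
        vis_ctx = self.attend(regions, question, self.vis_score)
        logits = self.classifier(torch.cat([sem_ctx, vis_ctx, question]))
        return torch.log_softmax(logits, dim=0)  # distribution over candidate answers
```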
When building artificial intelligence systems that can reason about and answer questions on visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but models can exploit strong biases to answer questions correctly without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?". This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning *neural module networks*, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
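To make the modular idea concrete, here is a toy sketch; the module names and the hand-assembled layout are illustrative assumptions, whereas the paper derives the layout from the question's linguistic structure:

```python
# Toy neural-module sketch: a reusable "find" module produces word-conditioned attention over
# regions, and a "describe" module answers a property question from the attended features,
# e.g. describe[color](find[dog]) for "what color is the dog?".
import torch
import torch.nn as nn

class Find(nn.Module):
    """Attention map over image regions, conditioned on a word embedding."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, regions, word):
        w = word.expand(regions.size(0), -1)
        return torch.softmax(self.score(torch.cat([regions, w], dim=1)), dim=0)  # (R, 1)

class Describe(nn.Module):
    """Answer logits from the attended region features and a property word."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.out = nn.Linear(2 * dim, num_answers)

    def forward(self, regions, attention, word):
        attended = (attention * regions).sum(dim=0)
        return self.out(torch.cat([attended, word]))

# Hand-assembled layout for illustration (learned from the parse in the actual approach):
# answer_logits = describe(regions, find(regions, emb("dog")), emb("color"))
```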