Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. Additionally to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping nice interpretable fusion relations. We show how our MUTAN model generalizes some of the latest VQA architectures, providing state-of-the-art results.
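As an illustration of the idea, the sketch below shows a Tucker-style bilinear fusion in PyTorch: each modality is projected onto a small factor space and the two factors are combined through a learned core tensor. The class name and all dimensions are illustrative assumptions, not the authors' exact MUTAN implementation.

```python
# Minimal Tucker-style bilinear fusion sketch (hypothetical dimensions).
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, dim_q=2400, dim_v=2048, dim_hq=310, dim_hv=310, dim_out=510):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, dim_hq)   # factor matrix for the question
        self.proj_v = nn.Linear(dim_v, dim_hv)   # factor matrix for the image
        self.core = nn.Parameter(torch.randn(dim_hq, dim_hv, dim_out) * 0.01)

    def forward(self, q, v):
        hq = torch.tanh(self.proj_q(q))          # (batch, dim_hq)
        hv = torch.tanh(self.proj_v(v))          # (batch, dim_hv)
        # Contract both factors with the core tensor: a full bilinear interaction
        # in the reduced space, far cheaper than a dim_q x dim_v x dim_out tensor.
        return torch.einsum('bi,bj,ijo->bo', hq, hv, self.core)

fusion = TuckerFusion()
out = fusion(torch.randn(4, 2400), torch.randn(4, 2048))  # (4, 510)
```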
Multimodal representation learning is gaining more and more attention within the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both the concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows us to define new ways of optimizing the trade-off between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with state-of-the-art multimodal fusion models for both the VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.
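For comparison with the Tucker sketch above, here is a hedged sketch of a block-superdiagonal ("block-term") fusion: the projected factors are split into chunks and each chunk pair gets its own small core tensor. Chunk counts and sizes are illustrative assumptions, not the settings from the paper or the linked repository.

```python
# Block-superdiagonal fusion sketch: R independent small cores instead of one big core.
import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    def __init__(self, dim_q=2400, dim_v=2048, chunks=4, chunk_in=80, chunk_out=128):
        super().__init__()
        self.chunks = chunks
        self.proj_q = nn.Linear(dim_q, chunks * chunk_in)
        self.proj_v = nn.Linear(dim_v, chunks * chunk_in)
        self.cores = nn.Parameter(torch.randn(chunks, chunk_in, chunk_in, chunk_out) * 0.01)

    def forward(self, q, v):
        hq = self.proj_q(q).chunk(self.chunks, dim=-1)  # list of (batch, chunk_in)
        hv = self.proj_v(v).chunk(self.chunks, dim=-1)
        # One small Tucker-style contraction per block, then concatenate the outputs.
        outs = [torch.einsum('bi,bj,ijo->bo', hq[r], hv[r], self.cores[r])
                for r in range(self.chunks)]
        return torch.cat(outs, dim=-1)                  # (batch, chunks * chunk_out)
```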
Visual relation reasoning is a central component of recent cross-modal analysis tasks, aiming to reason about the visual relationships between objects and their properties. These relationships convey rich semantics that help enhance visual representations and improve cross-modal analysis. Previous works have successfully designed strategies for modeling latent relations or rigidly-categorized relations and achieved performance gains. However, such approaches ignore the ambiguity inherent in relations, which arises from the diverse relation semantics of different visual appearances. In this work, we explore modeling relations with context-sensitive embeddings grounded in human prior knowledge. We propose a novel plug-and-play relation reasoning module that injects relation embeddings to enhance the image encoder. Specifically, we design an upgraded Graph Convolutional Network (GCN) that exploits object embeddings and the directionality of the relations between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relation reasoning module by applying it to Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments are conducted on the VQA 2.0 and CMPlaces datasets, and superior performance is reported in comparison with state-of-the-art works.
Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies employ external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be irrelevant to the VQA task and have limited semantic capacity. To better utilize the semantic knowledge in images, we propose a new framework to learn visual relation facts for VQA. Specifically, we build the Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each data instance consists of an image, a corresponding question, a correct answer, and a supporting relation fact. A well-defined relation detector is then adopted to predict visual question-related relation facts. We further propose a multi-step attention model composed of visual attention and semantic attention to extract the related visual knowledge and semantic knowledge. We conduct comprehensive experiments on two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
Inspired by the recent success of text-based question answering, visual question answering (VQA) is proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process in the visual domain requires both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention. First, we generate semantic concepts from high-level semantics in convolutional neural networks (CNN) and select those question-related concepts as semantic attention. Second, we encode region-based middle-level outputs from the CNN into a spatially-embedded representation by a bidirectional recurrent neural network, and further pinpoint the answer-related regions by a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention and question embedding by a softmax classifier to infer the final answer. Extensive experiments show the proposed approach outperforms the state of the art on two challenging VQA datasets.
Attention mechanisms in biological perception are thought to select subsets of perceptual information for more elaborate processing that would be prohibitive to perform on all sensory inputs. In computer vision, however, hard attention, where some information is selectively ignored, has been relatively under-explored, in spite of the success of soft attention, where information is re-weighted and aggregated but never filtered out. Here, we introduce a new approach to hard attention and find that it achieves very competitive performance on a recently-released visual question answering dataset, in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we find that feature magnitudes correlate with semantic relevance and provide a useful signal for our mechanism's attentional selection criterion. Because hard attention selects the important features of the input information, it can also be more efficient than comparable soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, where computational and memory costs are quadratic in the size of the feature set.
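A minimal sketch of norm-based hard selection in this spirit (the function interface and the fixed top-k rule are assumptions drawn from the description above): keep only the k spatial features with the largest L2 norm and discard the rest, rather than softly re-weighting all of them.

```python
# Hard attention by feature magnitude: select top-k spatial cells by L2 norm.
import torch

def hard_attention_topk(features: torch.Tensor, k: int) -> torch.Tensor:
    """features: (batch, num_locations, channels) -> (batch, k, channels)."""
    norms = features.norm(dim=-1)                        # (batch, num_locations)
    idx = norms.topk(k, dim=-1).indices                  # indices of the k strongest cells
    idx = idx.unsqueeze(-1).expand(-1, -1, features.size(-1))
    return features.gather(1, idx)                       # only the selected subset survives

selected = hard_attention_topk(torch.randn(2, 196, 512), k=16)  # (2, 16, 512)
```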
Existing attention mechanisms for Visual Question Answering (VQA) attend either to local image grids or to object-level features. Motivated by the observation that a question may relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers the reciprocal relationships between these two levels of visual detail. The resulting bottom-up attention is further merged with top-down information to focus only on the scene elements most relevant to the given question. Our design hierarchically fuses the multimodal information, i.e., language, object-level, and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, a significant boost.
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a 'co-attention' mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code available at https://github.com/yuzcccc/mfb.
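A hedged PyTorch sketch of the factorized bilinear pooling step described above (the factor size, output size, and normalization epsilon are illustrative; see the linked repository for the authors' implementation): the two projections are multiplied elementwise, sum-pooled over the factor dimension, then passed through signed square-root and L2 normalization.

```python
# MFB-style pooling sketch with assumed dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    def __init__(self, dim_q=2048, dim_v=2048, out_dim=1000, factor=5):
        super().__init__()
        self.factor = factor
        self.proj_q = nn.Linear(dim_q, out_dim * factor)
        self.proj_v = nn.Linear(dim_v, out_dim * factor)

    def forward(self, q, v):
        joint = self.proj_q(q) * self.proj_v(v)                      # (batch, out_dim * factor)
        joint = joint.view(q.size(0), -1, self.factor).sum(-1)       # sum-pool over factors
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)   # signed square root
        return F.normalize(joint, dim=-1)                            # L2 normalization

z = MFB()(torch.randn(4, 2048), torch.randn(4, 2048))                # (4, 1000)
```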
Current Visual Question Answering (VQA) systems can answer intelligent questions about 'known' visual content. However, their performance drops significantly when questions about visually and linguistically 'unknown' concepts are presented at inference time (the 'open-world' scenario). A practical VQA system should be able to deal with novel concepts in real-world settings. To address this problem, we propose an exemplar-based approach that transfers learning (i.e., knowledge) from previously 'known' concepts to answer questions about the 'unknown'. We learn a highly discriminative joint embedding space in which visual and semantic features are fused to give a unified representation. Once novel concepts are presented to the model, it looks for the closest match in an exemplar set within the joint embedding space. This auxiliary information is used alongside the given image-question pair to refine visual attention in a hierarchical fashion. Since handling high-dimensional exemplars on large datasets can be a significant challenge, we introduce an efficient matching scheme that uses a compact feature description for search and retrieval. To evaluate our model, we propose a new split for VQA that separates unknown visual and semantic concepts from the training set. Our approach shows significant improvements over state-of-the-art VQA models on the proposed Open-World VQA dataset as well as on standard VQA datasets.
Visual question answering is a challenging problem that requires combining concepts from computer vision and natural language processing. Most existing approaches use a two-stream strategy, computing image and question features that are then merged using a variety of techniques. Nonetheless, very few rely on higher-level image representations, which can capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for visual question answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture augmented by the proposed graph learner module. We obtain state-of-the-art results with an accuracy of 66.18% and demonstrate the interpretability of the proposed method.
Solving grounded language tasks often requires reasoning about the relationships between objects in the context of a given task. For example, to answer the question "What color is the mug on the plate?" we must check the color of the particular mug that satisfies the "on" relationship with respect to the plate. Recent work has proposed various methods for complex relational reasoning. However, most of their power lies in the inference structure, while the scene is represented with simple local appearance features. In this paper, we take an alternative approach and build contextualized representations for objects in a visual scene to support relational reasoning. We propose a general framework of Language-Conditioned Graph Networks (LCGN), in which each node represents an object and is described by a context-aware representation gathered from related objects through iterative message passing conditioned on the textual input. For example, conditioned on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can then easily be consumed by a simple classifier for answer prediction. We experimentally show that our LCGN approach effectively supports relational reasoning and improves performance across several tasks and datasets.
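The sketch below shows one round of language-conditioned message passing under assumed shapes; the attention-style edge weighting and the update rule are simplifications for illustration, not the authors' exact LCGN parameterization.

```python
# One simplified round of text-conditioned message passing between object nodes.
import torch
import torch.nn as nn

class LanguageConditionedRound(nn.Module):
    def __init__(self, dim=512, txt_dim=512):
        super().__init__()
        self.query = nn.Linear(dim + txt_dim, dim)   # text-conditioned query per object
        self.key = nn.Linear(dim, dim)
        self.message = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, objects, text):
        # objects: (batch, N, dim); text: (batch, txt_dim)
        t = text.unsqueeze(1).expand(-1, objects.size(1), -1)
        q = self.query(torch.cat([objects, t], dim=-1))
        attn = (q @ self.key(objects).transpose(1, 2)).softmax(-1)   # (batch, N, N) edge weights
        msgs = attn @ self.message(objects)                          # messages gathered per node
        return torch.relu(self.update(torch.cat([objects, msgs], dim=-1)))
```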
A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train the attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that is automatically obtained from available region descriptions and object annotations. We also show that our model, trained with this automatically-obtained supervision, generates visual groundings that achieve a higher correlation with manually-annotated groundings, while achieving state-of-the-art VQA accuracy.
A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to the correct answer prediction.
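A compact sketch of such symmetric, bi-directional attention under assumed dimensions (the single bilinear affinity and the residual update are simplifications of the described architecture, not its exact layers): one affinity matrix between words and regions is normalized in both directions, so each word attends over regions and each region attends over words.

```python
# Symmetric co-attention sketch: one affinity matrix, softmax in both directions.
import torch
import torch.nn as nn

class DenseCoAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.bilinear = nn.Linear(dim, dim, bias=False)

    def forward(self, words, regions):
        # words: (batch, T, dim); regions: (batch, N, dim)
        affinity = self.bilinear(words) @ regions.transpose(1, 2)   # (batch, T, N)
        attn_w2r = affinity.softmax(dim=-1)      # each word attends over regions
        attn_r2w = affinity.softmax(dim=1)       # each region attends over words
        words_ctx = attn_w2r @ regions                               # region-informed words
        regions_ctx = attn_r2w.transpose(1, 2) @ words               # word-informed regions
        return words + words_ctx, regions + regions_ctx              # residual update; stackable
```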
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process which constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and obtain improved results by adding a second attention hop which considers the whole question to choose visual evidence based on the results of the first hop. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the attention weights. We evaluate our model on two published visual question answering datasets, DAQUAR [1] and VQA [2], and obtain improved results compared to a strong deep baseline model (iBOWIMG) which concatenates image and question features to predict the answer [3].
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from neural networks and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with the visual features in an early stage to capture textual-visual relationships. The question-guided convolution can tightly couple the textual and visual information, but it also introduces more parameters when learning the kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method can further improve performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
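A simplified sketch of question-guided convolution follows (all dimensions and the per-sample loop are assumptions for illustration): a linear predictor maps the question embedding to convolution kernels, which are then applied to the visual feature map. The paper's group-convolution design, which mixes question-dependent and question-independent kernels, would shrink the predicted kernel size considerably; this sketch omits that.

```python
# Question-guided convolution sketch: kernels predicted from the question embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedConv(nn.Module):
    def __init__(self, q_dim=1024, in_ch=64, out_ch=64, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.kernel_pred = nn.Linear(q_dim, out_ch * in_ch * k * k)

    def forward(self, feat, question):
        # feat: (batch, in_ch, H, W); question: (batch, q_dim)
        kernels = self.kernel_pred(question).view(-1, self.out_ch, self.in_ch, self.k, self.k)
        # Each sample gets its own question-dependent kernels.
        outs = [F.conv2d(feat[b:b + 1], kernels[b], padding=self.k // 2)
                for b in range(feat.size(0))]
        return torch.cat(outs, dim=0)                    # (batch, out_ch, H, W)
```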
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state of the art on the Visual7W dataset and the VQA challenge.
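A hedged sketch of compact bilinear pooling via count sketch and FFT, in the spirit described above: each modality is count-sketched into a d-dimensional vector and the two sketches are circularly convolved in the frequency domain, approximating the outer product without materializing it. The sketch dimension and the random hash/sign construction are illustrative choices, not the authors' exact configuration.

```python
# Compact bilinear pooling sketch: count sketch + FFT-based circular convolution.
import torch

def count_sketch(x, h, s, d):
    """x: (batch, dim); h: (dim,) bucket indices in [0, d); s: (dim,) signs in {-1, +1}."""
    out = torch.zeros(x.size(0), d, dtype=x.dtype)
    out.index_add_(1, h, x * s)          # scatter signed values into d buckets
    return out

def mcb(q, v, d=16000, seed=0):
    g = torch.Generator().manual_seed(seed)
    hq = torch.randint(0, d, (q.size(1),), generator=g)
    hv = torch.randint(0, d, (v.size(1),), generator=g)
    sq = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    sv = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    # Elementwise product in the Fourier domain = circular convolution of the sketches.
    fq = torch.fft.rfft(count_sketch(q, hq, sq, d), n=d)
    fv = torch.fft.rfft(count_sketch(v, hv, sv, d), n=d)
    return torch.fft.irfft(fq * fv, n=d)  # (batch, d)

z = mcb(torch.randn(4, 2048), torch.randn(4, 2400))
```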
It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MODERN), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.
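A minimal sketch of conditioning batch-normalization parameters on a language embedding, in the spirit of the approach above (the single linear predictor and all shapes are assumptions): a predictor maps the language embedding to per-channel shifts of the scale and bias, which modulate a normalization layer inside the visual network.

```python
# Conditional batch normalization sketch: language-predicted shifts of gamma and beta.
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_channels, lang_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)  # plain normalization
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))
        self.delta = nn.Linear(lang_dim, 2 * num_channels)    # predicts (dgamma, dbeta)

    def forward(self, x, lang):
        dgamma, dbeta = self.delta(lang).chunk(2, dim=-1)     # (batch, C) each
        gamma = (self.gamma + dgamma).unsqueeze(-1).unsqueeze(-1)
        beta = (self.beta + dbeta).unsqueeze(-1).unsqueeze(-1)
        return gamma * self.bn(x) + beta                      # language-modulated normalization

cbn = ConditionalBatchNorm2d(256, lang_dim=1024)
y = cbn(torch.randn(2, 256, 14, 14), torch.randn(2, 1024))
```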
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
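A hedged sketch of the top-down weighting over bottom-up region features (the scoring network and dimensions are assumptions; the detector that produces the region features is treated as given): each region's feature is scored against the question representation, the scores are normalized with a softmax, and the image is summarized as the weighted sum of region features.

```python
# Top-down attention sketch over bottom-up (detector-provided) region features.
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    def __init__(self, dim_v=2048, dim_q=512, dim_h=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim_v + dim_q, dim_h), nn.ReLU(), nn.Linear(dim_h, 1))

    def forward(self, regions, question):
        # regions: (batch, K, dim_v) from a detector; question: (batch, dim_q)
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        weights = self.score(torch.cat([regions, q], dim=-1)).softmax(dim=1)  # (batch, K, 1)
        return (weights * regions).sum(dim=1)     # question-conditioned image summary

att = TopDownAttention()
img_vec = att(torch.randn(2, 36, 2048), torch.randn(2, 512))  # (2, 2048)
```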