Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high-level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. In addition to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping interpretable fusion relations. We show how our MUTAN model generalizes some of the latest VQA architectures, providing state-of-the-art results.
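As an illustration of the Tucker idea described above, the following PyTorch-style sketch contracts projected question and image features with a small learnable core tensor. All class names and dimensions are illustrative assumptions, not the paper's, and the additional low-rank constraint on the core slices is omitted.

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Minimal sketch of a Tucker-style bilinear fusion.

    The full bilinear tensor in R^{dq x dv x do} is factorized into input
    projections, an output projection, and a small core tensor, which keeps
    the parameter count low. Dimensions below are illustrative only.
    """

    def __init__(self, dim_q=2400, dim_v=2048, dim_out=3000,
                 t_q=310, t_v=310, t_o=510):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, t_q)
        self.proj_v = nn.Linear(dim_v, t_v)
        self.core = nn.Parameter(torch.randn(t_q, t_v, t_o) * 0.01)
        self.proj_o = nn.Linear(t_o, dim_out)

    def forward(self, q, v):
        q_tilde = torch.tanh(self.proj_q(q))      # (B, t_q)
        v_tilde = torch.tanh(self.proj_v(v))      # (B, t_v)
        # Contract both projected inputs with the core tensor.
        z = torch.einsum('bi,ijk,bj->bk', q_tilde, self.core, v_tilde)
        return self.proj_o(z)                     # answer scores

if __name__ == "__main__":
    fusion = TuckerFusion()
    scores = fusion(torch.randn(4, 2400), torch.randn(4, 2048))
    print(scores.shape)  # torch.Size([4, 3000])
```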
Visual relational reasoning is a central component of recent cross-modal analysis tasks, aiming to reason about the visual relationships between objects and their attributes. These relationships convey rich semantics, which help enhance visual representations and improve cross-modal analysis. Previous works have successfully designed strategies for modeling latent relations or rigid-categorical relations, achieving performance improvements. However, such approaches ignore the ambiguity inherent in relations, which arises from the varied relational semantics of different visual appearances. In this work, we explore modeling relations through context-sensitive embeddings grounded in human prior knowledge. We propose a novel plug-and-play relation reasoning module that injects relation embeddings to enhance the image encoder. Specifically, we design an upgraded graph convolutional network (GCN) that exploits the information in object embeddings and the directionality of the relations between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relation reasoning module by applying it to the Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments are conducted on the VQA 2.0 and CMPlaces datasets, and superior performance is reported in comparison with state-of-the-art works.
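A minimal sketch of what a directionality-aware graph convolution over detected objects could look like. The layer below uses separate weights for incoming and outgoing edges, which is one simple interpretation of the described GCN and not necessarily the paper's exact formulation; names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    """Sketch of a directed graph-convolution layer over object features.

    Nodes are detected objects; `adj` encodes directed relations between them.
    Separate weights for incoming and outgoing edges are one simple way to
    respect relation directionality.
    """

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.w_self = nn.Linear(dim_in, dim_out)
        self.w_in = nn.Linear(dim_in, dim_out)
        self.w_out = nn.Linear(dim_in, dim_out)

    def forward(self, x, adj):
        # x: (B, N, dim_in) object embeddings, adj: (B, N, N) directed adjacency.
        deg_in = adj.sum(dim=1, keepdim=True).clamp(min=1)   # (B, 1, N)
        deg_out = adj.sum(dim=2, keepdim=True).clamp(min=1)  # (B, N, 1)
        msg_in = torch.bmm(adj.transpose(1, 2), x) / deg_in.transpose(1, 2)
        msg_out = torch.bmm(adj, x) / deg_out
        h = self.w_self(x) + self.w_in(msg_in) + self.w_out(msg_out)
        return F.relu(h)

if __name__ == "__main__":
    layer = RelationGCNLayer(512, 512)
    objs = torch.randn(2, 36, 512)
    adj = (torch.rand(2, 36, 36) > 0.8).float()
    print(layer(objs, adj).shape)  # torch.Size([2, 36, 512])
```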
Multimodal representation learning is gaining increasing attention in the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical use within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes the notions of rank and mode ranks of tensors already used for multimodal fusion. It allows defining new ways to optimize the trade-off between the expressiveness and the complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful unimodal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures to represent the relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with state-of-the-art multimodal fusion models for both the VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch.
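The following sketch illustrates one way a block-superdiagonal (block-term) core can be realized: the projected inputs are split into chunks and each chunk pair is fused with its own small core, after which the results are concatenated. Names, block counts and dimensions are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    """Sketch of a block-term (block-superdiagonal) bilinear fusion."""

    def __init__(self, dim_q=2400, dim_v=2048, dim_out=3000,
                 n_blocks=10, chunk_q=24, chunk_v=24, chunk_o=32):
        super().__init__()
        self.n_blocks = n_blocks
        self.proj_q = nn.Linear(dim_q, n_blocks * chunk_q)
        self.proj_v = nn.Linear(dim_v, n_blocks * chunk_v)
        # One small core tensor per block of the superdiagonal.
        self.cores = nn.Parameter(
            torch.randn(n_blocks, chunk_q, chunk_v, chunk_o) * 0.01)
        self.proj_o = nn.Linear(n_blocks * chunk_o, dim_out)

    def forward(self, q, v):
        B = q.size(0)
        q_chunks = self.proj_q(q).view(B, self.n_blocks, -1)  # (B, R, cq)
        v_chunks = self.proj_v(v).view(B, self.n_blocks, -1)  # (B, R, cv)
        # One small Tucker contraction per block, then concatenate.
        z = torch.einsum('bri,rijk,brj->brk', q_chunks, self.cores, v_chunks)
        return self.proj_o(z.reshape(B, -1))

if __name__ == "__main__":
    fusion = BlockFusion()
    print(fusion(torch.randn(4, 2400), torch.randn(4, 2048)).shape)  # (4, 3000)
```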
Attention mechanisms in biological perception are thought to select subsets of perceptual information for more elaborate processing that would be prohibitive to perform on all sensory inputs. In computer vision, however, hard attention, where some information is selectively ignored, has been explored relatively little, despite the success of soft attention, where information is re-weighted and aggregated but never filtered out. Here, we introduce a new hard-attention approach that achieves very competitive performance on a recently released visual question answering dataset, in some cases surpassing similar soft attention architectures while entirely ignoring some features. Although hard attention mechanisms are thought to be non-differentiable, we find that feature magnitudes correlate with semantic relevance and provide a useful signal for our mechanism's attentional selection criterion. Because hard attention selects the important features of the input, it can also be more efficient than analogous soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, where computational and memory costs are quadratic in the size of the feature set.
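A minimal sketch of the norm-based selection criterion mentioned above: keep the k feature vectors with the largest L2 norm and drop the rest. The function name and the value of k are illustrative assumptions.

```python
import torch

def hard_attention_by_norm(features, k):
    """Sketch of norm-based hard attention.

    Keeps only the k spatial feature vectors with the largest L2 norm and
    discards the rest, following the observation that feature magnitude
    correlates with semantic relevance. `features` is (B, N, D).
    """
    norms = features.norm(dim=-1)                      # (B, N)
    _, idx = norms.topk(k, dim=-1)                     # (B, k) indices kept
    idx = idx.unsqueeze(-1).expand(-1, -1, features.size(-1))
    return features.gather(1, idx)                     # (B, k, D)

if __name__ == "__main__":
    feats = torch.randn(2, 196, 512)   # e.g. a 14x14 CNN grid, flattened
    print(hard_attention_by_norm(feats, k=16).shape)  # torch.Size([2, 16, 512])
```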
Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning, as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanisms. Some recent studies exploit external, VQA-independent models to detect candidate entities or attributes in the image, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes may be irrelevant to the VQA task and of limited semantic value. To better utilize the semantic knowledge in images, we propose a new framework to learn visual relation facts for VQA. Specifically, we build a Relation-VQA (R-VQA) dataset on top of the Visual Genome dataset via a semantic similarity module, in which each instance consists of an image, a corresponding question, a correct answer and a supporting relation fact. A well-defined relation detector is then adopted to predict visual relation facts related to the question. We further propose a multi-step attention model composed of visual attention and semantic attention to extract the related visual and semantic knowledge. Comprehensive experiments on two benchmark datasets demonstrate that our model achieves state-of-the-art performance and verify the benefit of considering visual relation facts.
Inspired by the recent success of text-based question answering, visual question answering (VQA) has been proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because reasoning in the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap through semantic attention and benefit fine-grained spatial inference through visual attention. First, we generate semantic concepts from the high-level semantics in convolutional neural networks (CNN) and select those question-related concepts as semantic attention. Second, we encode region-based mid-level outputs from the CNN into a spatially embedded representation with a bidirectional recurrent neural network, and further pinpoint the answer-related regions with a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention and question embedding through a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach outperforms the state of the art on two challenging VQA datasets.
Existing attention mechanisms attend either to local image grids or to object-level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate both to object instances and to their parts, we propose a novel attention mechanism that jointly considers the reciprocal relationships between these two levels of visual detail. The resulting bottom-up attention is further coupled with top-down information to focus only on the scene elements most relevant to the given question. Our design hierarchically fuses multi-modal information, i.e., language, object-level and grid-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single-model performance from 67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, a significant boost.
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner, and to fuse these multi-modal features, play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a 'co-attention' mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB-with-co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code available at https://github.com/yuzcccc/mfb.
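The MFB operation described above reduces to two projections, an element-wise product, sum pooling over windows of size k, and power plus L2 normalization; the sketch below follows that recipe with illustrative dimensions and names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Sketch of Multi-modal Factorized Bilinear (MFB) pooling.

    Both inputs are projected to a (out_dim * k)-dimensional space, multiplied
    element-wise, sum-pooled over windows of size k, then passed through a
    signed square root and L2 normalization.
    """

    def __init__(self, dim_q=1024, dim_v=2048, out_dim=1000, factor_k=5):
        super().__init__()
        self.k = factor_k
        self.proj_q = nn.Linear(dim_q, out_dim * factor_k)
        self.proj_v = nn.Linear(dim_v, out_dim * factor_k)

    def forward(self, q, v):
        joint = self.proj_q(q) * self.proj_v(v)            # (B, out_dim * k)
        joint = joint.view(q.size(0), -1, self.k).sum(-1)  # sum-pool: (B, out_dim)
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power norm
        return F.normalize(joint, dim=-1)                  # L2 norm

if __name__ == "__main__":
    mfb = MFB()
    print(mfb(torch.randn(4, 1024), torch.randn(4, 2048)).shape)  # (4, 1000)
```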
Visual question answering is a challenging problem requiring a combination of concepts from computer vision and natural language processing. Most existing approaches use a two-stream strategy, computing image and question features that are then merged using a variety of techniques. Nonetheless, very few rely on higher-level image representations, which allow capturing semantic and spatial relationships. In this paper, we propose a novel graph-based approach for visual question answering. Our method combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain state-of-the-art results with 66.18% accuracy and demonstrate the interpretability of the proposed method.
A key solution to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boosting the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, leading to correct answer prediction.
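A simplified, single-layer sketch of the symmetric attention described above: one affinity matrix between words and regions is normalized in both directions, so each modality attends over the other. The residual connections, single-head form and dimensions are assumptions; the paper stacks several such layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseCoAttention(nn.Module):
    """Sketch of one symmetric, dense co-attention step."""

    def __init__(self, dim=512):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, words, regions):
        # words: (B, T, D), regions: (B, N, D)
        A = torch.bmm(self.affinity(words), regions.transpose(1, 2))  # (B, T, N)
        # Each word attends over regions; each region attends over words.
        words_ctx = torch.bmm(F.softmax(A, dim=2), regions)                  # (B, T, D)
        regions_ctx = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)  # (B, N, D)
        return words + words_ctx, regions + regions_ctx

if __name__ == "__main__":
    coatt = DenseCoAttention()
    w, r = coatt(torch.randn(2, 14, 512), torch.randn(2, 36, 512))
    print(w.shape, r.shape)  # torch.Size([2, 14, 512]) torch.Size([2, 36, 512])
```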
Current Visual Question Answering (VQA) systems can answer intelligent questions about "known" visual content. However, their performance drops significantly when questions about visually and linguistically "unknown" concepts are presented at inference time ("open-world" scenarios). A practical VQA system should be able to deal with novel concepts in real-world settings. To address this problem, we propose an exemplar-based approach that transfers learning (i.e., knowledge) from previously "known" concepts to answer questions about the "unknown". We learn a highly discriminative joint embedding space where visual and semantic features are fused to give a unified representation. Once a novel concept is presented to the model, it looks for the closest match in an exemplar set within the joint embedding space. This auxiliary information is used alongside the given image-question pair to refine visual attention in a hierarchical fashion. Since handling high-dimensional exemplars on large datasets can be a significant challenge, we introduce an efficient matching scheme that uses a compact feature description for search and retrieval. To evaluate our model, we propose a new VQA split that separates unknown visual and semantic concepts from the training set. Our approach shows significant improvements over state-of-the-art VQA models on the proposed Open-World VQA dataset and on standard VQA datasets.
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models.
It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic inputs are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet, significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.
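A minimal sketch of conditioning batch-normalization parameters on a language embedding, in the spirit of the approach above. Emulating the frozen statistics of a pretrained ResNet with a plain BatchNorm2d is a simplification, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Sketch of batch normalization modulated by a language embedding.

    A small linear layer predicts per-channel deltas for the affine BN
    parameters from the question embedding, so the visual pipeline can be
    conditioned on language at every stage.
    """

    def __init__(self, num_channels, lang_dim=2048):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.delta = nn.Linear(lang_dim, 2 * num_channels)
        # Start from the unmodulated network: predict zero deltas initially.
        nn.init.zeros_(self.delta.weight)
        nn.init.zeros_(self.delta.bias)

    def forward(self, feat, lang):
        d_gamma, d_beta = self.delta(lang).chunk(2, dim=-1)   # (B, C) each
        gamma = (1.0 + d_gamma).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = d_beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * self.bn(feat) + beta

if __name__ == "__main__":
    cbn = ConditionalBatchNorm2d(256)
    out = cbn(torch.randn(4, 256, 14, 14), torch.randn(4, 2048))
    print(out.shape)  # torch.Size([4, 256, 14, 14])
```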
A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train the attention mechanisms within VQA architectures. Unfortunately, obtaining human annotations specific to visual grounding is difficult and costly. In this work, we demonstrate that a VQA architecture can be effectively trained with grounding supervision obtained automatically from available region descriptions and object annotations. We also show that our model, trained with this automatically obtained supervision, produces visual groundings that achieve higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy.
Human activity recognition is typically addressed by detecting key concepts such as global and local motion, features related to the object classes present in the scene, and features related to the global context. The next open challenges in activity recognition require a level of understanding that goes beyond this and calls for models with fine-grained discrimination capabilities and a detailed understanding of the interactions between the actors and objects in the scene. We propose a model capable of learning to reason about semantically meaningful spatio-temporal interactions in videos. Key to our approach is the choice to perform this reasoning at the object level by integrating state-of-the-art object detection networks. This allows the model to learn detailed spatial interactions that exist at a semantic, object-interaction relevant level. We evaluate our method on three standard datasets (Twenty-BN Something-Something, VLOG and EPIC Kitchens) and achieve state-of-the-art results on all of them. Finally, we show visualizations of the interactions learned by the model, which illustrate object classes and their interactions corresponding to different activity classes.
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse high-level textual and visual features from neural networks and abandon the visual spatial information when learning multi-modal features. To address this problem, question-guided kernels generated from the input question are designed to convolve with the visual features to capture textual-visual relationships at an early stage. Question-guided convolution can tightly couple textual and visual information, but it also introduces more parameters when learning the kernels. We apply group convolution, consisting of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed method is also complementary to existing bilinear pooling fusion and attention-based VQA methods; by integrating with them, our method can further improve performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
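One possible reading of the hybrid group convolution is sketched below: half of the output channels come from ordinary, question-independent grouped kernels and half from small grouped kernels predicted from the question embedding. Group counts, kernel sizes and the 1x1 choice for the predicted kernels are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedHybridConv(nn.Module):
    """Sketch of a question-guided hybrid (group) convolution."""

    def __init__(self, channels=256, groups=8, q_dim=1024):
        super().__init__()
        self.groups = groups
        self.c_per_group = channels // groups
        # Question-independent part: a standard grouped convolution.
        self.static_conv = nn.Conv2d(channels, channels // 2, 3,
                                     padding=1, groups=groups)
        # Question-dependent part: predict grouped 1x1 kernels from the question.
        half_out_per_group = (channels // 2) // groups
        self.kernel_pred = nn.Linear(
            q_dim, groups * half_out_per_group * self.c_per_group)

    def forward(self, feat, q):
        B, C, H, W = feat.shape
        static = self.static_conv(feat)                     # (B, C/2, H, W)
        # Predicted kernels: one grouped 1x1 convolution per sample in the batch.
        w = self.kernel_pred(q).view(B * (C // 2), self.c_per_group, 1, 1)
        dynamic = F.conv2d(feat.view(1, B * C, H, W), w,
                           groups=B * self.groups)
        dynamic = dynamic.view(B, C // 2, H, W)
        return torch.cat([static, dynamic], dim=1)          # (B, C, H, W)

if __name__ == "__main__":
    qghc = QuestionGuidedHybridConv()
    out = qghc(torch.randn(2, 256, 14, 14), torch.randn(2, 1024))
    print(out.shape)  # torch.Size([2, 256, 14, 14])
```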
We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning. MAC moves away from monolithic black-box neural architectures towards a design that encourages both transparency and versatility. The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory. By stringing the cells together and imposing structural constraints that regulate their interaction, MAC effectively learns to perform iterative reasoning processes that are directly inferred from the data in an end-to-end approach. We demonstrate the model's strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the model is computationally-efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.
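As a rough illustration of the control/memory separation, the sketch below implements only a simplified control unit that re-attends over the question words at each reasoning step; the read and write units, and the paper's exact per-step question projections, are omitted. Names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MACControlUnit(nn.Module):
    """Sketch of the control unit of a MAC-style cell.

    At each reasoning step the unit combines the previous control state with a
    step-specific projection of the question, attends over the question words,
    and outputs a new control state as their weighted average.
    """

    def __init__(self, dim=512):
        super().__init__()
        self.step_proj = nn.Linear(dim, dim)      # step-specific question projection
        self.combine = nn.Linear(2 * dim, dim)
        self.attn = nn.Linear(dim, 1)

    def forward(self, prev_control, question, words):
        # prev_control: (B, D), question: (B, D), words: (B, T, D)
        cq = self.combine(torch.cat([prev_control, self.step_proj(question)], -1))
        scores = self.attn(cq.unsqueeze(1) * words).squeeze(-1)   # (B, T)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), words).squeeze(1)  # new control (B, D)

if __name__ == "__main__":
    control = MACControlUnit()
    c = control(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 12, 512))
    print(c.shape)  # torch.Size([2, 512])
```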
Collaborative reasoning for understanding each image-question pair is critical yet underexplored for interpretable visual question answering systems. Although recent works have also attempted to use explicit compositional processes to assemble the multiple sub-tasks embedded in a question, their models rely on annotations or handcrafted rules to obtain valid reasoning processes, leading either to heavy workloads or to poor performance on compositional reasoning. In this paper, to better align the image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question, and we call our model a Parse-Tree-Guided Reasoning Network (PTGRN). The network consists of three collaborative modules: i) an attention module to exploit the local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives image cues following a question-driven parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.
Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve a task given in the form of natural language. As the solutions inherently require combining visual and natural language processing with abstract reasoning, the problem is considered AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction, we developed a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm the promise and show significant, few-percent improvements of accuracy on the complex "counting" task.