2017-05-18
Bilinear models provide an appealing framework for mixing and merginginformation in Visual Question Answering (VQA) tasks. They help to learn highlevel associations between question meaning and visual concepts in the image,but they suffer from huge dimensionality issues. We introduce MUTAN, amultimodal tensor-based Tucker decomposition to efficiently parametrizebilinear interactions between visual and textual representations. Additionallyto the Tucker framework, we design a low-rank matrix-based decomposition toexplicitly constrain the interaction rank. With MUTAN, we control thecomplexity of the merging scheme while keeping nice interpretable fusionrelations. We show how our MUTAN model generalizes some of the latest VQAarchitectures, providing state-of-the-art results.
translated by 谷歌翻译

2018-12-23

translated by 谷歌翻译

2019-01-31

translated by 谷歌翻译

translated by 谷歌翻译

translated by 谷歌翻译
Inspired by the recent success of text-based question answering , visual question answering (VQA) is proposed to automatically answer natural language questions with the reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process on visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from the abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To solve the challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention. First, we generate semantic concepts from high-level semantics in convolutional neural networks (CNN) and select those question-related concepts as semantic attention. Second, we encode region-based middle-level outputs from CNN into spatially-embedded representation by a bidirec-tional recurrent neural network, and further pinpoint the answer-related regions by multiple layer perceptron as visual attention. Third, we jointly optimize semantic attention , visual attention and question embedding by a softmax classifier to infer the final answer. Extensive experiments show the proposed approach outperforms the-state-of-arts on two challenging VQA datasets.
translated by 谷歌翻译

translated by 谷歌翻译
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and questions and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Fac-torized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilin-ear pooling approaches. For fine-grained image and question representation, we develop a 'co-attention' mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code available at https://github.com/yuzcccc/mfb.
translated by 谷歌翻译

2018-06-19

translated by 谷歌翻译

2018-04-03
A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions , which leads to the correct answer prediction.
translated by 谷歌翻译

2018-11-30

translated by 谷歌翻译
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
translated by 谷歌翻译

2016-07-20
Visual Question Answering (VQA) is a challenging task that has receivedincreasing attention from both the computer vision and the natural languageprocessing communities. Given an image and a question in natural language, itrequires reasoning over visual elements of the image and general knowledge toinfer the correct answer. In the first part of this survey, we examine thestate of the art by comparing modern approaches to the problem. We classifymethods by their mechanism to connect the visual and textual modalities. Inparticular, we examine the common approach of combining convolutional andrecurrent neural networks to map images and questions to a common featurespace. We also discuss memory-augmented and modular architectures thatinterface with structured knowledge bases. In the second part of this survey,we review the datasets available for training and evaluating VQA systems. Thevarious datatsets contain questions at different levels of complexity, whichrequire different capabilities and types of reasoning. We examine in depth thequestion/answer pairs from the Visual Genome project, and evaluate therelevance of the structured annotations of images with scene graphs for VQA.Finally, we discuss promising future directions for the field, in particularthe connection to structured knowledge bases and the use of natural languageprocessing models.
translated by 谷歌翻译

2017-07-02
It is commonly assumed that language refers to high-level visual conceptswhile leaving low-level visual processing unaffected. This view dominates thecurrent literature in computational models for language-vision tasks, wherevisual and linguistic input are mostly processed independently before beingfused into a single representation. In this paper, we deviate from this classicpipeline and propose to modulate the \emph{entire visual processing} bylinguistic input. Specifically, we condition the batch normalization parametersof a pretrained residual network (ResNet) on a language embedding. Thisapproach, which we call MOdulated RESnet (\MRN), significantly improves strongbaselines on two visual question answering tasks. Our ablation study shows thatmodulating from the early stages of the visual processing is beneficial.
translated by 谷歌翻译

translated by 谷歌翻译

2018-06-16

translated by 谷歌翻译

translated by 谷歌翻译

2018-03-08
We present the MAC network, a novel fully differentiable neural networkarchitecture, designed to facilitate explicit and expressive reasoning. MACmoves away from monolithic black-box neural architectures towards a design thatencourages both transparency and versatility. The model approaches problems bydecomposing them into a series of attention-based reasoning steps, eachperformed by a novel recurrent Memory, Attention, and Composition (MAC) cellthat maintains a separation between control and memory. By stringing the cellstogether and imposing structural constraints that regulate their interaction,MAC effectively learns to perform iterative reasoning processes that aredirectly inferred from the data in an end-to-end approach. We demonstrate themodel's strength, robustness and interpretability on the challenging CLEVRdataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy,halving the error rate of the previous best model. More importantly, we showthat the model is computationally-efficient and data-efficient, in particularrequiring 5x less data than existing models to achieve strong results.
translated by 谷歌翻译

2018-09-06

translated by 谷歌翻译
Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural language processing with abstract reasoning, the problem is considered as AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction we decided to develop a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm the promises and show significant, few percent improvements of accuracy on the complex "counting" task.
translated by 谷歌翻译
${authors} 分类：${tags}
${pubdate}${abstract_cn}
translated by 谷歌翻译