Bilinear models provide an appealing framework for mixing and merginginformation in Visual Question Answering (VQA) tasks. They help to learn highlevel associations between question meaning and visual concepts in the image,but they suffer from huge dimensionality issues. We introduce MUTAN, amultimodal tensor-based Tucker decomposition to efficiently parametrizebilinear interactions between visual and textual representations. Additionallyto the Tucker framework, we design a low-rank matrix-based decomposition toexplicitly constrain the interaction rank. With MUTAN, we control thecomplexity of the merging scheme while keeping nice interpretable fusionrelations. We show how our MUTAN model generalizes some of the latest VQAarchitectures, providing state-of-the-art results.
translated by 谷歌翻译