The complex compositional structure of language makes problems at the intersection of vision and language challenging. But language also provides a strong prior that can result in good superficial performance without the underlying models truly understanding the visual content. This can hinder progress in pushing the state of the art in the computer vision aspects of multi-modal AI. In this paper, we address binary Visual Question Answering (VQA) on abstract scenes. We formulate this problem as visual verification of the concepts inquired about in the questions. Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image. If the concept can be found in the image, the answer to the question is "yes", and "no" otherwise. Abstract scenes play two roles: (1) they allow us to focus on the high-level semantics of the VQA task as opposed to low-level recognition problems, and, perhaps more importantly, (2) they give us the means to balance the dataset such that language priors are controlled and the role of vision is essential. In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is "yes" for one scene and "no" for the other for the exact same question. Indeed, language priors alone do not perform better than chance on our balanced dataset. Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset.
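
To make the verification formulation concrete, below is a minimal Python sketch of the pipeline the abstract describes: a question is reduced to a tuple summarizing the visual concept, and the answer is "yes" exactly when that concept is verified in the scene. The (subject, relation, object) tuple form, the regex-based parser, and the scene-as-a-set-of-tuples representation are illustrative assumptions, not the paper's actual models.

# Minimal sketch of tuple-based visual verification for binary VQA.
# The tuple form, the toy parser, and the scene representation are
# illustrative assumptions, not the paper's actual implementation.

import re
from typing import NamedTuple, Optional, Set

class ConceptTuple(NamedTuple):
    """Concise summary of the visual concept a binary question asks about."""
    subject: str                  # e.g. "boy"
    relation: str                 # e.g. "holding"
    obj: Optional[str] = None     # e.g. "bat"; None for attribute questions

# A toy scene annotation: the set of concept tuples that hold in the image.
Scene = Set[ConceptTuple]

def parse_question(question: str) -> ConceptTuple:
    """Toy question-to-tuple converter (a stand-in for a learned parser)."""
    q = question.strip().lower()
    m = re.fullmatch(r"is the (\w+) (\w+) the (\w+)\?", q)
    if m:                         # e.g. "Is the boy holding the bat?"
        return ConceptTuple(m.group(1), m.group(2), m.group(3))
    m = re.fullmatch(r"is the (\w+) (\w+)\?", q)
    if m:                         # e.g. "Is the dog sleeping?"
        return ConceptTuple(m.group(1), m.group(2))
    raise ValueError(f"unsupported question pattern: {question!r}")

def answer(question: str, scene: Scene) -> str:
    """Answer 'yes' exactly when the summarized concept is found in the scene."""
    return "yes" if parse_question(question) in scene else "no"

# Balancing idea: the same question paired with two fine-grained scenes,
# one where the answer is "yes" and one where it is "no".
scene_yes: Scene = {ConceptTuple("boy", "holding", "bat")}
scene_no: Scene = {ConceptTuple("boy", "holding", "ball")}
question = "Is the boy holding the bat?"
print(answer(question, scene_yes))  # -> yes
print(answer(question, scene_no))   # -> no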