Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.