Learning a generative model from partial data (data with missingness) is a challenging area of machine learning research. We study a specific implementation of the Auto-Encoding Variational Bayes (AEVB) algorithm, named in this paper as a Variational Auto-Decoder (VAD). VAD is a generic framework which uses Variational Bayes and Markov Chain Monte Carlo (MCMC) methods to learn a generative model from partial data. The main distinction between VAD and the Variational Auto-Encoder (VAE) is the encoder component, as VAD does not have one. Using a proposed efficient inference method from a multivariate Gaussian approximate posterior, VAD models allow inference to be performed via simple gradient ascent rather than MCMC sampling from a probabilistic decoder. This technique reduces the inference computational cost, allows for using more complex optimization techniques during latent space inference (which are shown to be crucial due to a high degree of freedom in the VAD latent space), and keeps the framework simple to implement. Through extensive experiments over several datasets and different missing ratios, we show that encoders cannot efficiently marginalize the input volatility caused by imputed missing values. We study multimodal datasets in this paper, which are a particular area of impact for VAD models.
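The gradient-ascent inference described above can be illustrated with a minimal sketch; the decoder architecture, hyperparameters, and squared-error reconstruction term below are assumptions, not the paper's implementation. Per-sample posterior parameters are optimized against a (pre-trained) decoder, with the reconstruction loss restricted to observed entries.

```python
# Minimal sketch of decoder-only (VAD-style) inference on partial data:
# the posterior parameters (mu, log_var) for one example are optimized by
# gradient ascent on an ELBO that scores only the observed entries of x.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
for p in decoder.parameters():
    p.requires_grad_(False)  # decoder assumed already trained and frozen

x = torch.randn(data_dim)                     # a data point (placeholder)
mask = (torch.rand(data_dim) > 0.5).float()   # 1 = observed, 0 = missing

# Per-sample variational parameters of a diagonal Gaussian posterior q(z | x_obs).
mu = torch.zeros(latent_dim, requires_grad=True)
log_var = torch.zeros(latent_dim, requires_grad=True)
opt = torch.optim.Adam([mu, log_var], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    eps = torch.randn(latent_dim)
    z = mu + eps * torch.exp(0.5 * log_var)   # reparameterized sample
    recon = decoder(z)
    recon_loss = ((recon - x) ** 2 * mask).sum()   # observed dimensions only
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum()
    loss = recon_loss + kl                    # negative ELBO (up to constants)
    loss.backward()
    opt.step()
```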
Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed through the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input, and as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence-to-sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a way of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbed or missing information in the other modalities. We train our model with a coupled translation-prediction objective and achieve new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
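A rough sketch of the cyclic-translation idea follows; the GRU-based modules, feature dimensions, and equal loss weights are assumptions rather than the paper's exact model. The source modality is translated into the target, translated back to enforce cycle consistency, and the sentiment prediction is made from the intermediate joint representation, so only the source modality is needed at test time.

```python
# Translation with cycle consistency plus a coupled prediction objective.
import torch
import torch.nn as nn

d_src, d_tgt, d_hid = 74, 35, 64

enc_src = nn.GRU(d_src, d_hid, batch_first=True)   # source -> joint representation
dec_tgt = nn.Linear(d_hid, d_tgt)                  # joint -> target modality
enc_tgt = nn.GRU(d_tgt, d_hid, batch_first=True)   # back-translation encoder
dec_src = nn.Linear(d_hid, d_src)                  # joint -> source modality
predictor = nn.Linear(d_hid, 1)                    # sentiment regression head

src = torch.randn(8, 20, d_src)                    # (batch, time, features)
tgt = torch.randn(8, 20, d_tgt)
label = torch.randn(8, 1)

h_src, _ = enc_src(src)
tgt_hat = dec_tgt(h_src)                           # forward translation
h_back, _ = enc_tgt(tgt_hat)
src_hat = dec_src(h_back)                          # back translation

translation_loss = nn.functional.mse_loss(tgt_hat, tgt)
cycle_loss = nn.functional.mse_loss(src_hat, src)  # cycle consistency
pred_loss = nn.functional.mse_loss(predictor(h_src[:, -1]), label)

loss = pred_loss + translation_loss + cycle_loss   # coupled objective
loss.backward()
# At test time only `src` is needed: encode it and apply `predictor`.
```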
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
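A hedged sketch of the word-shifting idea appears below; the dimensions, GRU summarizers, and scalar gate are illustrative assumptions, not RAVEN's exact architecture. Visual and acoustic frames within a word segment are summarized and mapped to a shift vector that is added to the word embedding.

```python
# Shift a word embedding according to its accompanying nonverbal context.
import torch
import torch.nn as nn

d_word, d_vis, d_ac = 300, 47, 74

visual_enc = nn.GRU(d_vis, 32, batch_first=True)
acoustic_enc = nn.GRU(d_ac, 32, batch_first=True)
shift_proj = nn.Linear(32 + 32, d_word)
gate = nn.Linear(d_word + 64, 1)

word = torch.randn(8, d_word)            # word embeddings for a batch of tokens
vis_seq = torch.randn(8, 15, d_vis)      # visual frames inside each word segment
ac_seq = torch.randn(8, 30, d_ac)        # acoustic frames inside each word segment

_, h_v = visual_enc(vis_seq)
_, h_a = acoustic_enc(ac_seq)
nonverbal = torch.cat([h_v[-1], h_a[-1]], dim=-1)      # nonverbal summary per word

alpha = torch.sigmoid(gate(torch.cat([word, nonverbal], dim=-1)))  # per-word gate
shift = shift_proj(nonverbal)
shifted_word = word + alpha * shift      # nonverbal-shifted word representation
```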
Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone. Although a large body of research has modeled drivers' distraction behavior under different conditions, accurate automatic detection using multiple modalities, and especially the contribution of the speech modality to improving accuracy, has received little attention. This paper introduces a new multimodal dataset for distracted driving behavior and discusses automatic distraction detection using features from three modalities: facial expression, speech, and car signals. A detailed multimodal feature analysis shows that adding more modalities monotonically increases the predictive accuracy of the model. Finally, a simple and effective multimodal fusion technique using a polynomial fusion layer shows superior distraction detection results compared to the baseline SVM and neural network models.
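The exact form of the polynomial fusion layer is not specified in the abstract; the sketch below is one plausible reading and should be treated as an assumption. Unimodal features are projected to a common space and combined through elementwise multiplicative (higher-order polynomial) interactions before classification.

```python
# Assumed multiplicative (polynomial-style) fusion of the three modalities.
import torch
import torch.nn as nn

d_face, d_speech, d_car, d = 128, 40, 10, 32

proj_f = nn.Linear(d_face, d)
proj_s = nn.Linear(d_speech, d)
proj_c = nn.Linear(d_car, d)
clf = nn.Linear(d, 2)                        # distracted vs. not distracted

face = torch.randn(4, d_face)                # facial expression features
speech = torch.randn(4, d_speech)            # speech features
car = torch.randn(4, d_car)                  # car signal features

# Higher-order interaction term: elementwise product of projected modalities.
fused = proj_f(face) * proj_s(speech) * proj_c(car)
logits = clf(fused)
```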
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each of them focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations.
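A compact sketch of the multistage fusion step follows, with an assumed number of stages and layer sizes. At each stage an attention vector highlights a subset of the concatenated multimodal signal, and the fused representation is built on top of the previous stage's output before being handed back to a recurrent model.

```python
# Multistage fusion at one time step: highlight, fuse, repeat.
import torch
import torch.nn as nn

d_multi, d_fuse, num_stages = 64 + 32 + 32, 64, 3

highlight = nn.ModuleList([nn.Linear(d_multi + d_fuse, d_multi) for _ in range(num_stages)])
fuse = nn.ModuleList([nn.Linear(d_multi + d_fuse, d_fuse) for _ in range(num_stages)])

multimodal = torch.randn(8, d_multi)       # concatenated intra-modal states at one time step
z = torch.zeros(8, d_fuse)                 # fused representation, built stage by stage

for t in range(num_stages):
    inp = torch.cat([multimodal, z], dim=-1)
    attn = torch.sigmoid(highlight[t](inp))    # which signals this stage attends to
    z = torch.tanh(fuse[t](torch.cat([attn * multimodal, z], dim=-1)))
# z now summarizes cross-modal interactions for this time step and can be fed
# into the recurrent networks that model temporal and intra-modal dynamics.
```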
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It is a significant technical challenge since humans display their emotions through complex idiosyncratic combinations of the language, visual, and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both direct person-independent and relative person-dependent perspectives. The direct person-independent perspective follows the conventional emotion recognition approach, which infers absolute emotion labels directly from observed multimodal features. The relative person-dependent perspective approaches emotion recognition in a relative manner by comparing partial video segments to determine whether emotion intensity has increased or decreased. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask involves a multimodal local ranking of relative emotion intensities between two short segments of a video. The second subtask uses a Bayesian ranking algorithm to infer global relative emotion ranks from the local rankings. The third subtask combines direct predictions from observed multimodal behaviors with the relative emotion ranks from local-global ranking for final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
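As a generic stand-in for the local-to-global ranking step (not the paper's Bayesian algorithm), the pairwise outcomes produced by the local ranker can be aggregated into per-segment intensity scores with a Bradley-Terry-style fit, from which a global ordering is read off.

```python
# Aggregate pairwise "i is more intense than j" judgments into a global ranking.
import torch

num_segments = 6
# (i, j) pairs meaning "segment i judged more emotionally intense than segment j".
pairs = [(1, 0), (2, 1), (3, 2), (3, 1), (4, 3), (5, 4)]

scores = torch.zeros(num_segments, requires_grad=True)
opt = torch.optim.Adam([scores], lr=0.1)

for _ in range(300):
    opt.zero_grad()
    nll = 0.0
    for i, j in pairs:
        # P(i beats j) = sigmoid(score_i - score_j); maximize its log-likelihood.
        nll = nll - torch.log(torch.sigmoid(scores[i] - scores[j]))
    nll.backward()
    opt.step()

global_rank = torch.argsort(scores.detach(), descending=True)  # global relative ranks
```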
Learning representations of multimodal data is a fundamentally complex research problem due to the presence of multiple sources of information. To address the complexities of multimodal data, we argue that a suitable representation learning model should: 1) factorize representations according to independent factors of variation in the data, capture important features for both 2) discriminative and 3) generative tasks, and 4) couple both modality-specific and multimodal information. To encapsulate all these properties, we propose the Multimodal Factorization Model (MFM), which factorizes multimodal representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as predicting sentiment. The modality-specific generative factors are unique to each modality and contain the information required for generating data. Our experimental results show that our model is able to learn meaningful multimodal representations and achieve state-of-the-art or competitive performance on five multimodal datasets. Our model also demonstrates flexible generative capabilities by conditioning on the independent factors. We further interpret the factorized representations to understand the interactions that influence multimodal learning.
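A sketch of this factorization under placeholder encoders and decoders is given below: a shared discriminative factor drives the prediction, while each modality is reconstructed from the shared factor together with its own modality-specific generative factor. The linear modules and dimensions are illustrative assumptions.

```python
# Factorized multimodal representation: shared discriminative factor f_y plus
# one generative factor per modality; reconstruct each modality from [f_y, f_a_m].
import torch
import torch.nn as nn

dims = {"language": 300, "visual": 47, "acoustic": 74}
d_y, d_a = 32, 16

enc_y = nn.Linear(sum(dims.values()), d_y)                                  # shared discriminative factor
enc_a = nn.ModuleDict({m: nn.Linear(d, d_a) for m, d in dims.items()})      # modality-specific factors
dec = nn.ModuleDict({m: nn.Linear(d_y + d_a, d) for m, d in dims.items()})  # generative decoders
clf = nn.Linear(d_y, 1)                                                     # sentiment predictor

x = {m: torch.randn(8, d) for m, d in dims.items()}
y = torch.randn(8, 1)

f_y = enc_y(torch.cat([x[m] for m in dims], dim=-1))
recon_loss = sum(
    nn.functional.mse_loss(dec[m](torch.cat([f_y, enc_a[m](x[m])], dim=-1)), x[m])
    for m in dims
)
disc_loss = nn.functional.mse_loss(clf(f_y), y)
(disc_loss + recon_loss).backward()
```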
Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred from perceptual context. In a machine learning setting, this is challenging: it is hard to collect enough annotated data to learn this reasoning process from scratch, and hard to implement it with generic sequence models. Here, we describe an approach to vision-and-language navigation that addresses both of these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. These steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and the panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate of the best existing approach on a standard benchmark.
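The pragmatic reasoning step can be sketched as a simple reranking rule; the candidate scores and mixing weight below are placeholders. Candidate routes proposed by the follower are rescored by how well the speaker model explains the given instruction for each route, and the best-scoring route is executed.

```python
# Pragmatic rescoring: combine follower and speaker log-probabilities.
def pragmatic_rerank(candidates, speaker_log_prob, lam=0.5):
    """candidates: list of (route, follower_log_prob); speaker_log_prob: route -> float."""
    scored = [
        (route, (1 - lam) * f_lp + lam * speaker_log_prob(route))
        for route, f_lp in candidates
    ]
    return max(scored, key=lambda pair: pair[1])[0]

# Toy usage with made-up scores.
candidates = [("route_A", -2.3), ("route_B", -1.9), ("route_C", -2.8)]
speaker_scores = {"route_A": -4.0, "route_B": -7.5, "route_C": -3.1}
best = pragmatic_rerank(candidates, lambda r: speaker_scores[r])
```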
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
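The intra- and inter-modality dynamics can be illustrated with a small outer-product fusion sketch (dimensions are illustrative): appending a constant 1 to each unimodal embedding before taking outer products yields a single tensor containing unimodal, bimodal, and trimodal interaction terms.

```python
# Outer-product fusion of language, visual, and acoustic embeddings.
import torch

def with_one(v):
    # v: (batch, d) -> (batch, d + 1) with a trailing constant 1.
    return torch.cat([v, torch.ones(v.size(0), 1)], dim=-1)

z_l = torch.randn(8, 32)   # language embedding
z_v = torch.randn(8, 16)   # visual embedding
z_a = torch.randn(8, 16)   # acoustic embedding

# Batched outer products: a (d_l+1) x (d_v+1) x (d_a+1) fusion tensor per example.
fused = torch.einsum("bi,bj,bk->bijk", with_one(z_l), with_one(z_v), with_one(z_a))
fused = fused.flatten(start_dim=1)  # flatten for a downstream sentiment classifier
```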
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.