Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence-to-sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore several different variations on the multimodal inputs and outputs of these seq2seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, in particular in the bimodal case where our model is able to improve the F1 score by 12 points. We also discuss future directions for multimodal Seq2Seq methods.
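To make the modality-translation idea concrete, here is a minimal sketch (not the authors' exact architecture; PyTorch, with illustrative feature sizes): one modality sequence is encoded with an LSTM, another modality is decoded from the encoder state, and that same encoder state serves as the joint representation for sentiment prediction.

```python
import torch
import torch.nn as nn

class ModalityTranslationSeq2Seq(nn.Module):
    """Encode a source-modality sequence and decode a target-modality
    sequence; the encoder state doubles as the joint representation."""
    def __init__(self, src_dim, tgt_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(src_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(tgt_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, tgt_dim)   # predict target features
        self.sentiment = nn.Linear(hidden_dim, 1)       # downstream regression head

    def forward(self, src_seq, tgt_seq):
        _, (h, c) = self.encoder(src_seq)               # h: (1, batch, hidden)
        dec_out, _ = self.decoder(tgt_seq, (h, c))      # teacher-forced decoding
        recon = self.project(dec_out)                   # translated modality sequence
        sentiment = self.sentiment(h[-1])               # prediction from joint representation
        return recon, sentiment

# toy usage: translate 20-step text features (300-d) into acoustic features (74-d)
model = ModalityTranslationSeq2Seq(src_dim=300, tgt_dim=74)
text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
recon, score = model(text, audio)
loss = nn.functional.mse_loss(recon, audio)  # translation loss; sentiment loss added when labels exist
```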
Learning representations of multimodal data is a fundamentally complex research problem due to the presence of multiple sources of information. To address the complexity of multimodal data, we argue that a suitable representation learning model should: 1) factorize representations according to independent factors of variation in the data, capture important features for both 2) discriminative and 3) generative tasks, and 4) couple both modality-specific and multimodal information. To encapsulate all these properties, we propose the Multimodal Factorization Model (MFM), which factorizes multimodal representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as predicting sentiment. The modality-specific generative factors are unique to each modality and contain the information required for generating data. Our experimental results show that our model is able to learn meaningful multimodal representations and achieves state-of-the-art or competitive performance on five multimodal datasets. Our model also demonstrates flexible generative capabilities by conditioning on the independent factors. We further interpret the factorized representations to understand the interactions that influence multimodal learning.
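A rough sketch of the factorization idea, assuming simple feed-forward encoders and decoders (the paper's actual inference and generative networks are more elaborate): a shared discriminative factor is inferred from all modalities, modality-specific factors are inferred per modality, and each modality is reconstructed from its specific factor combined with the shared one.

```python
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    """Split representations into a shared discriminative factor and
    per-modality generative factors, in the spirit of MFM."""
    def __init__(self, dims, shared_dim=64, private_dim=32):
        super().__init__()
        self.shared_enc = nn.Linear(sum(dims), shared_dim)          # joint discriminative factor
        self.private_enc = nn.ModuleList([nn.Linear(d, private_dim) for d in dims])
        self.decoders = nn.ModuleList(
            [nn.Linear(shared_dim + private_dim, d) for d in dims]) # generate each modality
        self.classifier = nn.Linear(shared_dim, 1)                  # e.g. sentiment

    def forward(self, inputs):                                      # list of (batch, dim) tensors
        shared = torch.relu(self.shared_enc(torch.cat(inputs, dim=-1)))
        privates = [torch.relu(enc(x)) for enc, x in zip(self.private_enc, inputs)]
        recons = [dec(torch.cat([shared, p], dim=-1))
                  for dec, p in zip(self.decoders, privates)]
        return recons, self.classifier(shared)

model = FactorizedMultimodal(dims=[300, 74, 35])
xs = [torch.randn(8, d) for d in (300, 74, 35)]
recons, pred = model(xs)
recon_loss = sum(nn.functional.mse_loss(r, x) for r, x in zip(recons, xs))
```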
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual and acoustic modalities. Understanding multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each of which focuses on a subset of the multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN demonstrates state-of-the-art performance in modeling human multimodal language across three public datasets covering multimodal sentiment analysis, emotion recognition, and speaker trait recognition. We provide visualizations showing that each stage of fusion focuses on a different subset of the multimodal signals, learning increasingly discriminative multimodal representations.
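A hedged sketch of the multistage fusion step only (illustrative; the published RMFN integrates this with a system of LSTMs and uses its own highlighting and fusion modules): at each stage an attention vector selects a subset of the concatenated modality features, and the fused summary is updated on top of the previous stage.

```python
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    """At each stage, a softmax attention highlights a subset of the
    concatenated modality features, which is fused into a running summary."""
    def __init__(self, in_dim, fused_dim=64, stages=3):
        super().__init__()
        self.stages = stages
        self.highlight = nn.ModuleList(
            [nn.Linear(in_dim + fused_dim, in_dim) for _ in range(stages)])
        self.fuse = nn.ModuleList(
            [nn.Linear(in_dim + fused_dim, fused_dim) for _ in range(stages)])

    def forward(self, modality_feats):                 # (batch, in_dim) concatenated features
        summary = torch.zeros(modality_feats.size(0), self.fuse[0].out_features,
                              device=modality_feats.device)
        for k in range(self.stages):
            state = torch.cat([modality_feats, summary], dim=-1)
            attn = torch.softmax(self.highlight[k](state), dim=-1)   # which features this stage uses
            fused_in = torch.cat([attn * modality_feats, summary], dim=-1)
            summary = torch.tanh(self.fuse[k](fused_in))             # build on previous stages
        return summary

fusion = MultistageFusion(in_dim=300 + 74 + 35)
z = fusion(torch.randn(8, 300 + 74 + 35))              # (8, 64) multimodal representation
```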
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication, however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
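A minimal sketch of the two ideas named above, gated multimodal embeddings and temporal attention, under simplified assumptions (single sigmoid gates per modality, a vanilla LSTM, illustrative feature sizes; not the exact GME-LSTM(A) architecture):

```python
import torch
import torch.nn as nn

class GatedWordFusion(nn.Module):
    """Word-level fusion: gates attenuate noisy acoustic/visual features
    before they are concatenated with the word embedding, and a temporal
    attention pools the fused sequence."""
    def __init__(self, text_dim, audio_dim, video_dim, hidden=128):
        super().__init__()
        self.audio_gate = nn.Linear(text_dim + audio_dim, 1)
        self.video_gate = nn.Linear(text_dim + video_dim, 1)
        self.lstm = nn.LSTM(text_dim + audio_dim + video_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, 1)

    def forward(self, text, audio, video):                              # (batch, steps, dim) each
        ga = torch.sigmoid(self.audio_gate(torch.cat([text, audio], -1)))  # 0 = filter out
        gv = torch.sigmoid(self.video_gate(torch.cat([text, video], -1)))
        fused = torch.cat([text, ga * audio, gv * video], dim=-1)
        h, _ = self.lstm(fused)
        w = torch.softmax(self.attn(h), dim=1)                          # attend to important words
        return self.out((w * h).sum(dim=1))

model = GatedWordFusion(300, 74, 35)
pred = model(torch.randn(4, 20, 300), torch.randn(4, 20, 74), torch.randn(4, 20, 35))
```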
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
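The intra- and inter-modality dynamics in Tensor Fusion are commonly modeled as an outer product of the per-modality embeddings, each extended with a constant 1 so that unimodal, bimodal and trimodal terms all appear in the fused tensor. A small sketch of that fusion step (dimensions illustrative):

```python
import torch
import torch.nn as nn

def tensor_fusion(z_l, z_a, z_v):
    """Outer product of modality embeddings, each extended with a constant 1
    so that unimodal, bimodal and trimodal interaction terms all appear."""
    ones = lambda z: torch.cat([z, torch.ones(z.size(0), 1)], dim=-1)
    zl, za, zv = ones(z_l), ones(z_a), ones(z_v)
    fused = torch.einsum('bi,bj,bk->bijk', zl, za, zv)     # (batch, L+1, A+1, V+1)
    return fused.flatten(start_dim=1)

z_l, z_a, z_v = torch.randn(8, 32), torch.randn(8, 16), torch.randn(8, 16)
fused = tensor_fusion(z_l, z_a, z_v)                       # (8, 33*17*17)
head = nn.Sequential(nn.Linear(33 * 17 * 17, 64), nn.ReLU(), nn.Linear(64, 1))
sentiment = head(fused)
```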
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
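A simplified sketch of the word-shifting idea (not the full RAVEN, which models the fine-grained nonverbal subword sequences with their own recurrent encoders): gates weight the accompanying acoustic and visual features, a shift vector is computed from them, and it is added to the word embedding.

```python
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    """Compute a shift vector from the acoustic and visual behaviours that
    accompany a word, and add it to the word embedding."""
    def __init__(self, word_dim, audio_dim, video_dim):
        super().__init__()
        self.audio_attn = nn.Linear(word_dim + audio_dim, 1)
        self.video_attn = nn.Linear(word_dim + video_dim, 1)
        self.shift = nn.Linear(audio_dim + video_dim, word_dim)

    def forward(self, word, audio, video):                 # (batch, dim) per word segment
        wa = torch.sigmoid(self.audio_attn(torch.cat([word, audio], -1)))
        wv = torch.sigmoid(self.video_attn(torch.cat([word, video], -1)))
        shift = self.shift(torch.cat([wa * audio, wv * video], -1))
        return word + shift                                # nonverbally shifted word vector

shifter = NonverbalShift(word_dim=300, audio_dim=74, video_dim=35)
shifted = shifter(torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35))
```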
Multi-view sequential learning is a fundamental problem in machine learning dealing with multi-view sequences. In a multi-view sequence, there exists two forms of interactions between different views: view-specific interactions and cross-view interactions. In this paper, we present a new neural architecture for multi-view sequential learning called the Memory Fusion Network (MFN) that explicitly accounts for both interactions in a neural architecture and continuously models them through time. The first component of the MFN is called the System of LSTMs, where view-specific interactions are learned in isolation through assigning an LSTM function to each view. The cross-view interactions are then identified using a special attention mechanism called the Delta-memory Attention Network (DMAN) and summarized through time with a Multi-view Gated Memory. Through extensive experimentation, MFN is compared to various proposed approaches for multi-view sequential learning on multiple publicly available benchmark datasets. MFN outperforms all the existing multi-view approaches. Furthermore, MFN outperforms all current state-of-the-art models, setting new state-of-the-art results for these multi-view datasets.
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Understanding affect in video segments has brought together researchers from the language, audio and video domains. Most current multimodal research in this area deals with various techniques for fusing the modalities and mostly treats the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, the Relational Tensor Network, in which we use the inter-modal interactions within a segment (intra-segment) and consider the sequence of segments in a video to model inter-segment inter-modal interactions. We also generate rich representations of the text and audio modalities by leveraging richer audio and linguistic context and by fusing fine-grained, knowledge-based polarity scores derived from the text. We present the results of our model on the CMU-MOSEI dataset and show that our model outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition.
We propose a multimodal data fusion method that accounts for the high-order relationships between M modalities and the output layer of a neural network model by forming an (M+1)-dimensional tensor. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, removes information that is redundant with respect to the model output and results in fewer model parameters with minimal loss of performance. This factorization method acts as a regularizer, yielding a less complex model and avoiding overfitting. In addition, the modality-based factorization approach helps in understanding the amount of useful information in each modality. We have applied this method to three different multimodal datasets for sentiment analysis, personality trait recognition, and emotion recognition. The results show that, compared with the state of the art in all of these areas, the method improves several evaluation metrics by 1% to 4%.
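As an illustration of factorized fusion without materializing the full (M+1)-dimensional tensor, here is a CP/low-rank style sketch with one factor per modality (a simplification; the paper's modality-based decomposition and per-modality rank choices differ):

```python
import torch
import torch.nn as nn

class FactorizedFusion(nn.Module):
    """Fuse M modalities with the output layer without materialising the full
    (M+1)-dimensional tensor: each modality gets its own rank-R factor and the
    factors are combined by elementwise product."""
    def __init__(self, dims, out_dim, rank=4):
        super().__init__()
        # one factor per modality (input extended with a constant 1)
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims])

    def forward(self, inputs):                              # list of (batch, dim) tensors
        fused = None
        for x, factor in zip(inputs, self.factors):
            x1 = torch.cat([x, torch.ones(x.size(0), 1, device=x.device)], dim=-1)
            proj = torch.einsum('bd,rdo->bro', x1, factor)  # (batch, rank, out_dim)
            fused = proj if fused is None else fused * proj # implicit high-order interaction
        return fused.sum(dim=1)                             # collapse the rank dimension

fusion = FactorizedFusion(dims=[300, 74, 35], out_dim=64)
out = fusion([torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)])
```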
Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the inter-dependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.
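A minimal sketch of the contextual idea (the paper also studies hierarchical and attention-based variants): run a bidirectional LSTM over the sequence of utterance features of a video so that each utterance is classified with its surrounding context.

```python
import torch
import torch.nn as nn

class ContextualUtteranceLSTM(nn.Module):
    """Let each utterance capture information from its surroundings in the
    same video by running a bidirectional LSTM over the utterance sequence."""
    def __init__(self, utt_dim, hidden=64, classes=2):
        super().__init__()
        self.context = nn.LSTM(utt_dim, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, classes)

    def forward(self, utterances):                     # (videos, utterances, utt_dim)
        ctx, _ = self.context(utterances)              # context-aware utterance states
        return self.classify(ctx)                      # per-utterance sentiment logits

model = ContextualUtteranceLSTM(utt_dim=100)
logits = model(torch.randn(4, 30, 100))               # 4 videos, 30 utterances each
```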
Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daumé III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence to sequence tasks: word ordering, parsing, and machine translation.
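The beam-search training scheme itself is more involved, but the beam search it builds on can be sketched in a few lines of plain Python; `step_scores` below is a hypothetical callback returning next-token log-probabilities for a given prefix.

```python
import math

def beam_search(step_scores, beam_size=3, steps=4):
    """Plain beam search over a step function that returns
    {token: log_prob} given the prefix generated so far."""
    beams = [([], 0.0)]                                  # (prefix, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_scores(prefix).items():
                candidates.append((prefix + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# toy next-token distribution: prefers to alternate 'a' and 'b'
def toy_step(prefix):
    last = prefix[-1] if prefix else None
    return {'a': math.log(0.7 if last != 'a' else 0.3),
            'b': math.log(0.3 if last != 'a' else 0.7)}

print(beam_search(toy_step))
```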
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
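The maximum-correlation loss term mentioned in (ii) can be sketched as a negative per-dimension Pearson correlation between two views' representations (a simplification of the full CorrRNN objective, which combines it with reconstruction terms):

```python
import torch

def correlation_loss(x, y, eps=1e-8):
    """Negative mean per-dimension Pearson correlation between two views;
    minimising it pushes the representations to be maximally correlated."""
    xc = x - x.mean(dim=0, keepdim=True)
    yc = y - y.mean(dim=0, keepdim=True)
    corr = (xc * yc).sum(dim=0) / (
        xc.norm(dim=0) * yc.norm(dim=0) + eps)          # correlation per feature
    return -corr.mean()

x, y = torch.randn(32, 64), torch.randn(32, 64)
loss = correlation_loss(x, y)        # combined with reconstruction / task losses in training
```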
Over the past decade, video blogs (vlogs) have become an extremely popular way for people to express their sentiment. The ubiquity of these videos has increased the importance of multimodal fusion models, which combine video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. When detecting sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze the sentiment of spoken sentences. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on large-scale real-world data. We select high-level features for our model that have been successful in non-affect domains in order to test their generalizability in the sentiment detection domain. We train and test our model on the recently released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out test set.
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It poses a significant technical challenge since humans display their emotions through complex, idiosyncratic combinations of the language, visual and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both direct person-independent and relative person-dependent perspectives. The direct person-independent perspective follows the conventional emotion recognition approach, which directly infers absolute emotion labels from observed multimodal features. The relative person-dependent perspective approaches emotion recognition in a relative manner by comparing partial video segments to determine whether emotion intensity has increased or decreased. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask involves a multimodal local ranking of the relative emotion intensities between two short segments of a video. The second subtask uses a Bayesian ranking algorithm to infer global relative emotion ranks from the local rankings. The third subtask combines the direct predictions from observed multimodal behaviors and the relative emotion ranks from the local-global ranking for the final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
Deep learning methods employ multiple processing layers to learn hierarchical representations of data and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have flourished in the context of natural language processing (NLP). In this paper, we review significant deep learning related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare and contrast the various models and offer a detailed understanding of the past, present and future of deep learning in NLP.
Current multimodal sentiment analysis frames sentiment score prediction as a machine learning task. However, what the sentiment score actually represents has often been overlooked. As a measurement of opinions and affective states, a sentiment score generally comprises two aspects: polarity and intensity. We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal channels in a naturalistic monologue setting. In particular, we build unimodal and multimodal multi-task learning models with sentiment score prediction as the main task and polarity and/or intensity classification as the auxiliary tasks. Our experiments show that sentiment analysis benefits from multi-task learning, and that the individual modalities differ when conveying the polarity and intensity aspects of sentiment.
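A minimal sketch of such a multi-task setup (hypothetical head sizes and class granularities, not the paper's exact configuration): a shared encoder feeds a main sentiment-score regression head and auxiliary polarity and intensity classification heads, whose losses are combined with weights.

```python
import torch
import torch.nn as nn

class SentimentMultiTask(nn.Module):
    """Shared encoder with a main sentiment-score head and auxiliary
    polarity / intensity classification heads."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)        # main task: sentiment score regression
        self.polarity = nn.Linear(hidden, 3)     # auxiliary: negative / neutral / positive
        self.intensity = nn.Linear(hidden, 3)    # auxiliary: weak / medium / strong

    def forward(self, x):
        h = self.encoder(x)
        return self.score(h), self.polarity(h), self.intensity(h)

model = SentimentMultiTask(in_dim=409)           # e.g. concatenated multimodal features
score, pol, inten = model(torch.randn(8, 409))
# total loss = MSE(score) + w1 * CE(polarity) + w2 * CE(intensity)
```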
In natural language processing (NLP), it is important to detect the relationship between two sequences or to generate a sequence of tokens given another observed sequence. We refer to this type of problem of modeling sequence pairs as sequence-to-sequence (seq2seq) mapping problems. A great deal of research has been devoted to finding ways to tackle these problems, with traditional approaches relying on a combination of hand-crafted features, alignment models, segmentation heuristics, and external linguistic resources. Although great progress has been made, these traditional approaches suffer from various drawbacks, such as complicated pipelines, laborious feature engineering, and difficulties in domain adaptation. Recently, neural networks have emerged as a solution to many problems in NLP, speech recognition, and computer vision. Neural models are powerful because they can be trained end to end, generalize well to unseen examples, and the same framework can easily be adapted to a new domain. The aim of this thesis is to advance the state of the art in seq2seq mapping problems with neural networks. We explore solutions from three major aspects: investigating neural models for representing sequences, modeling the interactions between sequences, and using unpaired data to improve the performance of neural models. For each aspect, we propose novel models and evaluate their efficacy on various seq2seq mapping tasks.
Multimodal sentiment analysis is a very active area of research. A promising area of opportunity in this field is improving the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two at a time and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to a 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, for which the current state of the art incorporates contextual information from other utterances of the same clip, our hierarchical fusion gives up to 2.4% improvement (almost a 10% error rate reduction) over the currently used concatenation. The implementation of our method is publicly available in the form of open-source code.
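A bare-bones sketch of hierarchical fusion as described (the published model additionally uses context modeling across utterances; layer sizes here are illustrative): fuse each pair of modalities first, then fuse the three bimodal representations into a trimodal one.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Fuse modalities two at a time, then fuse the three bimodal
    representations into a single trimodal one."""
    def __init__(self, dl, da, dv, hidden=64):
        super().__init__()
        self.la = nn.Linear(dl + da, hidden)       # language + acoustic
        self.lv = nn.Linear(dl + dv, hidden)       # language + visual
        self.av = nn.Linear(da + dv, hidden)       # acoustic + visual
        self.tri = nn.Linear(3 * hidden, hidden)   # all three, via the bimodal codes
        self.out = nn.Linear(hidden, 1)

    def forward(self, l, a, v):
        f_la = torch.relu(self.la(torch.cat([l, a], -1)))
        f_lv = torch.relu(self.lv(torch.cat([l, v], -1)))
        f_av = torch.relu(self.av(torch.cat([a, v], -1)))
        f_tri = torch.relu(self.tri(torch.cat([f_la, f_lv, f_av], -1)))
        return self.out(f_tri)

model = HierarchicalFusion(dl=300, da=74, dv=35)
pred = model(torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35))
```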