Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
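The abstract above does not detail the fusion operator itself. As a rough illustration, the tensor-fusion idea is often described as an outer product of the per-modality embeddings, each padded with a constant 1 so that unimodal and bimodal terms survive alongside the trimodal ones; the sketch below assumes exactly that, and every class, function, and dimension name in it is hypothetical.

```python
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    """Sketch: outer-product fusion of three modality embeddings."""
    def __init__(self, d_text, d_audio, d_vision, d_out):
        super().__init__()
        fused_dim = (d_text + 1) * (d_audio + 1) * (d_vision + 1)
        self.head = nn.Linear(fused_dim, d_out)

    def forward(self, z_t, z_a, z_v):
        # Append a constant 1 so unimodal and bimodal terms survive the outer product.
        ones = torch.ones(z_t.size(0), 1, device=z_t.device)
        z_t = torch.cat([z_t, ones], dim=1)          # (B, d_t + 1)
        z_a = torch.cat([z_a, ones], dim=1)          # (B, d_a + 1)
        z_v = torch.cat([z_v, ones], dim=1)          # (B, d_v + 1)
        # Batched 3-way outer product -> (B, d_t+1, d_a+1, d_v+1)
        fused = torch.einsum('bi,bj,bk->bijk', z_t, z_a, z_v)
        return self.head(fused.flatten(start_dim=1))

# Toy usage with hypothetical embedding sizes.
tfn = TensorFusion(d_text=32, d_audio=16, d_vision=16, d_out=1)
score = tfn(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
```

Because the fused vector has (d_t + 1)(d_a + 1)(d_v + 1) entries, its size grows multiplicatively with the number of modalities, which is the main practical cost of this style of fusion.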
Sentiment analysis research has developed rapidly over the past decade and has attracted extensive attention from both academia and industry, most of it based on text. However, information in the real world usually comes in different modalities. In this paper, we consider the task of multimodal sentiment analysis using the audio and text modalities, and propose a fusion strategy that includes both multi-feature fusion and multi-modality fusion to improve the accuracy of audio-text sentiment analysis. We call it the Deep Feature Fusion - Audio and Text Modality Fusion (DFF-ATMF) model, and the features learned from it are complementary to each other and robust. Experiments with the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show very competitive results for our proposed model. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that our proposed fusion strategy also generalizes strongly to multimodal emotion recognition.
Affective computing is an emerging interdisciplinary research field bringing together researchers and practitioners from various fields, ranging from artificial intelligence, natural language processing, to cognitive and social sciences. With the proliferation of videos posted online (e.g., on YouTube, Facebook, Twitter) for product reviews, movie reviews, political views, and more, affective computing research has increasingly evolved from conventional unimodal analysis to more complex forms of multimodal analysis. This is the primary motivation behind our first-of-its-kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of the state of the art in multimodal affect analysis frameworks, which this review aims to address. Multimodality is defined by the presence of more than one modality or channel, e.g., visual, audio, text, gestures, and eye gaze. In this paper, we focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. Following an overview of different techniques for unimodal affect analysis, we outline existing methods for fusing information from different modalities. As part of this review, we carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
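As a rough sketch of the shifting idea described above, the snippet below computes a shift vector from the acoustic and visual features that co-occur with a word and adds it to the word embedding under a learned gate. It is an assumption-laden simplification: the published model first attends over subword-level nonverbal sequences, and all names and sizes here are hypothetical.

```python
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    """Sketch: shift a word embedding by a gated vector computed from
    the acoustic and visual features co-occurring with that word."""
    def __init__(self, d_word, d_audio, d_vision):
        super().__init__()
        self.to_shift = nn.Linear(d_audio + d_vision, d_word)
        self.gate = nn.Linear(d_word + d_audio + d_vision, 1)

    def forward(self, e_word, h_audio, h_vision):
        nonverbal = torch.cat([h_audio, h_vision], dim=-1)
        shift = self.to_shift(nonverbal)                       # candidate shift direction
        g = torch.sigmoid(self.gate(torch.cat([e_word, nonverbal], dim=-1)))
        return e_word + g * shift                              # shifted word representation

shifter = NonverbalShift(d_word=300, d_audio=74, d_vision=47)
e = shifter(torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 47))
```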
Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the inter-dependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.
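A minimal sketch of the contextual idea described above, assuming utterance-level feature vectors have already been extracted for each video: a bidirectional LSTM lets each utterance representation absorb information from its neighbours in the same video before classification. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ContextualUtteranceLSTM(nn.Module):
    """Sketch: classify each utterance using context from neighbouring
    utterances in the same video via a bidirectional LSTM."""
    def __init__(self, d_utt, d_hidden, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(d_utt, d_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, utterances):                 # (B, T, d_utt) per-video sequences
        context, _ = self.lstm(utterances)         # (B, T, 2*d_hidden)
        return self.classifier(context)            # one prediction per utterance

model = ContextualUtteranceLSTM(d_utt=100, d_hidden=64, n_classes=2)
logits = model(torch.randn(2, 20, 100))            # 2 videos, 20 utterances each
```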
We compile baselines, along with dataset split, for multimodal sentiment analysis. In this paper, we explore three different deep-learning based architectures for multimodal sentiment classification, each improving upon the previous. Further, we evaluate these architectures with multiple datasets with fixed train/test partition. We also discuss some major issues, frequently ignored in multimodal sentiment analysis research, e.g., role of speaker-exclusive models, importance of different modalities, and generalizability. This framework illustrates the different facets of analysis to be considered while performing multimodal sentiment analysis and, hence, serves as a new benchmark for future research in this emerging field.
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication, however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
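The block below is a heavily simplified, assumption-laden sketch of the multi-attention idea: K parallel softmax attentions over the concatenated per-modality hidden states yield K attended copies, which are compressed into a cross-modal code that a recurrent memory could store. It does not reproduce the published MAB/LSTHM equations; all names are hypothetical.

```python
import torch
import torch.nn as nn

class MultiAttentionBlock(nn.Module):
    """Sketch: K parallel attentions over concatenated modality states."""
    def __init__(self, d_hidden, k, d_out):
        super().__init__()
        self.k = k
        self.score = nn.Linear(d_hidden, k * d_hidden)   # one score per head and dimension
        self.reduce = nn.Linear(k * d_hidden, d_out)

    def forward(self, h_cat):                            # (B, D) concatenated LSTM states
        B, D = h_cat.shape
        attn = self.score(h_cat).view(B, self.k, D).softmax(dim=-1)
        attended = attn * h_cat.unsqueeze(1)             # (B, K, D) attended copies
        return self.reduce(attended.reshape(B, -1))      # cross-modal code for a memory

mab = MultiAttentionBlock(d_hidden=3 * 64, k=4, d_out=128)   # e.g. three 64-dim modality states
code = mab(torch.randn(2, 3 * 64))
```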
Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis and relies on the effectiveness of acoustic features extracted from speech. However, current progress in audio analysis mostly focuses on extracting homogeneous acoustic features or fails to fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model with a parallel combination of convolutional neural network (CNN) and long short-term memory (LSTM) based networks to obtain a representative feature, termed the Audio Sentiment Vector (ASV), that maximally reflects the sentiment information in the audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from the two branches. In the CNN branch, a spectrogram generated from the signal is fed as input, while the input of the LSTM branch consists of spectral features and cepstral coefficients extracted from the dependent utterances in the audio. Furthermore, a bidirectional long short-term memory (BiLSTM) mechanism is used for feature fusion. Extensive experiments show that our model can recognize audio sentiment precisely and quickly, and demonstrate that the ASV is better than traditional acoustic feature vectors extracted by other deep learning models. Moreover, the experimental results show that the proposed model outperforms the state-of-the-art methods by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
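A minimal sketch of the two-branch design described above: a CNN branch reads the spectrogram, an LSTM branch reads frame-level spectral and cepstral features, and a BiLSTM fuses the two branch outputs before classification. Layer shapes and names are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class AudioSentimentVector(nn.Module):
    """Sketch: parallel CNN (spectrogram) and LSTM (frame features) branches
    whose outputs are fused by a BiLSTM into an utterance-level prediction."""
    def __init__(self, d_frame, d_hidden, n_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(16, d_hidden))
        self.lstm = nn.LSTM(d_frame, d_hidden, batch_first=True)
        self.fuse = nn.LSTM(d_hidden, d_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, spectrogram, frames):
        # spectrogram: (B, 1, freq, time); frames: (B, T, d_frame)
        z_cnn = self.cnn(spectrogram)                       # (B, d_hidden)
        _, (h_n, _) = self.lstm(frames)                     # final hidden state
        branches = torch.stack([z_cnn, h_n[-1]], dim=1)     # (B, 2, d_hidden)
        fused, _ = self.fuse(branches)                      # BiLSTM over the two branches
        return self.classifier(fused[:, -1])                # sentiment logits

model = AudioSentimentVector(d_frame=40, d_hidden=128, n_classes=2)
logits = model(torch.randn(4, 1, 64, 200), torch.randn(4, 50, 40))
```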
Technology has enabled anyone with an Internet connection to easily create and share their ideas, opinions and content with millions of other people around the world. Much of the content being posted and consumed online is multimodal. With billions of phones, tablets and PCs shipping today with built-in cameras and a host of new video-equipped wearables like Google Glass on the horizon, the amount of video on the Internet will only continue to increase. It has become increasingly difficult for researchers to keep up with this deluge of multimodal content, let alone organize or make sense of it. Mining useful knowledge from video is a critical need that will grow exponentially, in pace with the global growth of content. This is particularly important in sentiment analysis, as both service and product reviews are gradually shifting from unimodal to multimodal. We present a novel method to extract features from visual and textual modalities using deep convolutional neural networks. By feeding such features to a multiple kernel learning classifier, we significantly outperform the state of the art of multimodal emotion recognition and sentiment analysis on different datasets.
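A small illustration of the classification stage described above, assuming per-modality CNN features have already been extracted (random stand-ins are used here). True multiple kernel learning would learn the kernel combination weights; this sketch fixes them at 0.5 each for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical CNN features for two modalities (random stand-ins).
X_text, X_visual = np.random.randn(100, 128), np.random.randn(100, 64)
y = np.random.randint(0, 2, size=100)

# Fixed-weight combination of per-modality kernels as a stand-in for learned MKL weights.
K = 0.5 * rbf_kernel(X_text) + 0.5 * rbf_kernel(X_visual)
clf = SVC(kernel='precomputed').fit(K, y)
pred = clf.predict(K)   # at test time, use kernels between test and training samples
```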
People are sharing their opinions, stories and reviews through online video sharing websites every day. Studying sentiment and subjectivity in these opinion videos is experiencing growing attention from academia and industry. While sentiment analysis has been successful for text, it is an understudied research question for videos and multimedia content. The biggest setbacks for studies in this direction are the lack of a proper dataset, methodology, baselines and statistical analysis of how information from different modality sources relates to each other. This paper introduces to the scientific community the first opinion-level annotated corpus of sentiment and subjectivity analysis in online videos, called the Multimodal Opinion-level Sentiment Intensity dataset (MOSI). The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion annotated visual features, and per-millisecond annotated audio features. Furthermore, we present baselines for future studies in this direction as well as a new multimodal fusion approach that jointly models spoken words and visual gestures.
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Understanding multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of the multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon the intermediate representations of the previous stage. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets covering multimodal sentiment analysis, emotion recognition, and speaker trait recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of the multimodal signals, learning increasingly discriminative multimodal representations.
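A rough, assumption-laden sketch of the multistage idea: at every stage a soft highlighting of the concatenated multimodal signal is integrated into a running fusion representation that builds on the previous stage. It is far simpler than the published RMFN, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    """Sketch: fuse a concatenated multimodal vector over several stages; each
    stage softly highlights part of the signal and updates a running fusion
    representation built on the previous stage."""
    def __init__(self, d_in, d_fused, n_stages=3):
        super().__init__()
        self.n_stages = n_stages
        self.highlight = nn.Linear(d_in + d_fused, d_in)   # soft subset selection
        self.integrate = nn.GRUCell(d_in, d_fused)         # builds on the previous stage

    def forward(self, x):                                  # (B, d_in) concatenated modalities
        fused = x.new_zeros(x.size(0), self.integrate.hidden_size)
        for _ in range(self.n_stages):
            weights = torch.sigmoid(self.highlight(torch.cat([x, fused], dim=-1)))
            fused = self.integrate(weights * x, fused)     # highlight, then integrate
        return fused

fusion = MultistageFusion(d_in=300 + 74 + 47, d_fused=128)
z = fusion(torch.randn(4, 300 + 74 + 47))
```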
Multimodal sentiment analysis is a very active area of research. A promising opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two at a time and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to a 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, for which current state-of-the-art techniques incorporate contextual information from other utterances of the same clip, our hierarchical fusion gives up to a 2.4% improvement (almost a 10% error rate reduction) over the currently used concatenation. An implementation of our method is publicly available in the form of open-source code.
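A minimal sketch of the hierarchy described above, assuming all three unimodal vectors share the same dimensionality: bimodal representations are formed first and then fused into a trimodal one. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Sketch: fuse modalities two at a time, then fuse the three
    bimodal representations into one trimodal representation."""
    def __init__(self, d, d_fused):
        super().__init__()
        self.fuse_av = nn.Linear(2 * d, d_fused)
        self.fuse_at = nn.Linear(2 * d, d_fused)
        self.fuse_vt = nn.Linear(2 * d, d_fused)
        self.fuse_all = nn.Linear(3 * d_fused, d_fused)

    def forward(self, a, v, t):                            # each (B, d)
        av = torch.tanh(self.fuse_av(torch.cat([a, v], dim=-1)))
        at = torch.tanh(self.fuse_at(torch.cat([a, t], dim=-1)))
        vt = torch.tanh(self.fuse_vt(torch.cat([v, t], dim=-1)))
        return torch.tanh(self.fuse_all(torch.cat([av, at, vt], dim=-1)))

fusion = HierarchicalFusion(d=64, d_fused=128)
z = fusion(torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64))
```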
During real-life interactions, people are naturally gesturing and modulating their voice to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual data streams. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
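A minimal sketch of the gating idea described above: a scalar gate computed from the word embedding and the accompanying modality input can attenuate a noisy acoustic or visual signal before fusion. The temporal attention component is omitted, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class GatedModalityInput(nn.Module):
    """Sketch: a scalar gate that can attenuate a noisy acoustic or visual
    input at each word before it is fused with the language stream."""
    def __init__(self, d_modality, d_word):
        super().__init__()
        self.gate = nn.Linear(d_modality + d_word, 1)

    def forward(self, x_modality, e_word):
        g = torch.sigmoid(self.gate(torch.cat([x_modality, e_word], dim=-1)))
        return g * x_modality                              # gated (possibly filtered-out) input

gate = GatedModalityInput(d_modality=74, d_word=300)
x_gated = gate(torch.randn(8, 74), torch.randn(8, 300))
```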
Multimodal machine learning is a core research area spanning the language, visual, and acoustic modalities. A central challenge in multimodal learning is learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence-to-sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore multiple different variations of the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the bimodal case where our model is able to improve the F1 score by 12 points. We also discuss future directions for multimodal Seq2Seq methods.
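A minimal sketch of the modality-translation idea, assuming teacher forcing and a mean-squared reconstruction loss: one modality sequence is encoded, another is decoded from the encoder state, and that state doubles as the joint representation. The GRU choice and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTranslator(nn.Module):
    """Sketch: encode one modality sequence and decode another; the encoder's
    final state doubles as the joint multimodal representation."""
    def __init__(self, d_src, d_tgt, d_hidden):
        super().__init__()
        self.encoder = nn.GRU(d_src, d_hidden, batch_first=True)
        self.decoder = nn.GRU(d_tgt, d_hidden, batch_first=True)
        self.project = nn.Linear(d_hidden, d_tgt)

    def forward(self, src_seq, tgt_seq):
        _, state = self.encoder(src_seq)                   # (1, B, d_hidden) joint code
        shifted = torch.cat([torch.zeros_like(tgt_seq[:, :1]), tgt_seq[:, :-1]], dim=1)
        out, _ = self.decoder(shifted, state)              # teacher-forced decoding
        return self.project(out), state.squeeze(0)         # reconstruction + representation

model = ModalityTranslator(d_src=300, d_tgt=74, d_hidden=128)
src, tgt = torch.randn(4, 20, 300), torch.randn(4, 20, 74)
recon, rep = model(src, tgt)
loss = F.mse_loss(recon, tgt)                              # train to translate one modality into another
```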
We propose a multimodal data fusion method that accounts for the higher-order relationships between M modalities and the output layer of a neural network model by obtaining an (M+1)-dimensional tensor. Applying a modality-based tensor factorization method, which adopts different numbers of factors for different modalities, removes information that is redundant with respect to the model output and yields fewer model parameters with minimal loss of performance. This factorization method acts as a regularizer, leading to a less complicated model and avoiding overfitting. In addition, the modality-based factorization approach helps to understand the amount of useful information in each modality. We have applied this method to three different multimodal datasets for sentiment analysis, personality trait recognition, and emotion recognition. The results show that, compared with the state of the art on all of these tasks, the method gives improvements of 1% to 4% on several evaluation measures.
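A sketch of one way to realise the idea described above without materialising the full (M+1)-dimensional tensor: factor it into per-modality low-rank terms whose elementwise products are summed over the rank dimension. The published method assigns different numbers of factors to different modalities; this sketch uses a single shared rank for brevity, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class FactorizedFusion(nn.Module):
    """Sketch: replace the full (M+1)-dimensional fusion tensor with per-modality
    low-rank factors whose elementwise products are summed over the rank."""
    def __init__(self, dims, d_out, rank=4):
        super().__init__()
        self.factors = nn.ModuleList(nn.Linear(d + 1, rank * d_out) for d in dims)
        self.rank, self.d_out = rank, d_out

    def forward(self, inputs):                             # list of (B, d_m) tensors
        batch = inputs[0].size(0)
        ones = inputs[0].new_ones(batch, 1)                # constant-1 padding, as in tensor fusion
        prod = None
        for x, factor in zip(inputs, self.factors):
            z = factor(torch.cat([x, ones], dim=-1)).view(batch, self.rank, self.d_out)
            prod = z if prod is None else prod * z         # elementwise product across modalities
        return prod.sum(dim=1)                             # (B, d_out) fused output

fusion = FactorizedFusion(dims=[300, 74, 47], d_out=32)
y = fusion([torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 47)])
```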
Multi-view sequential learning is a fundamental problem in machine learning dealing with multi-view sequences. In a multi-view sequence, there exist two forms of interactions between different views: view-specific interactions and cross-view interactions. In this paper, we present a new neural architecture for multi-view sequential learning called the Memory Fusion Network (MFN) that explicitly accounts for both interactions in a neural architecture and continuously models them through time. The first component of the MFN is called the System of LSTMs, where view-specific interactions are learned in isolation through assigning an LSTM function to each view. The cross-view interactions are then identified using a special attention mechanism called the Delta-memory Attention Network (DMAN) and summarized through time with a Multi-view Gated Memory. Through extensive experimentation, MFN is compared to various proposed approaches for multi-view sequential learning on multiple publicly available benchmark datasets. MFN outperforms all the existing multi-view approaches. Furthermore, MFN outperforms all current state-of-the-art models, setting new state-of-the-art results for these multi-view datasets.
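A rough, assumption-laden sketch of one time step of the design described above: per-view LSTM cells are updated in isolation, a simplified attention over their concatenated states extracts cross-view signal, and a gated memory is updated. It does not reproduce the published DMAN or gated-memory equations; all names are hypothetical.

```python
import torch
import torch.nn as nn

class MemoryFusionStep(nn.Module):
    """Sketch of one time step: isolated per-view LSTM cells, a simplified
    attention over their concatenated states, and a gated multi-view memory."""
    def __init__(self, dims, d_hidden, d_mem):
        super().__init__()
        self.cells = nn.ModuleList(nn.LSTMCell(d, d_hidden) for d in dims)
        d_cat = d_hidden * len(dims)
        self.attend = nn.Linear(d_cat, d_cat)
        self.propose = nn.Linear(d_cat, d_mem)
        self.gate = nn.Linear(d_cat + d_mem, d_mem)

    def forward(self, xs, states, memory):
        states = [cell(x, s) for x, s, cell in zip(xs, states, self.cells)]
        h_cat = torch.cat([h for h, _ in states], dim=-1)
        attended = torch.softmax(self.attend(h_cat), dim=-1) * h_cat
        candidate = torch.tanh(self.propose(attended))
        g = torch.sigmoid(self.gate(torch.cat([attended, memory], dim=-1)))
        memory = g * memory + (1 - g) * candidate          # gated multi-view memory update
        return states, memory

dims, d_hidden, d_mem, batch = [300, 74, 47], 32, 64, 2
step = MemoryFusionStep(dims, d_hidden, d_mem)
states = [(torch.zeros(batch, d_hidden), torch.zeros(batch, d_hidden)) for _ in dims]
memory = torch.zeros(batch, d_mem)
states, memory = step([torch.randn(batch, d) for d in dims], states, memory)
```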
Emotion recognition in conversations is a challenging Artificial Intelligence (AI) task. It has recently gained popularity because of its potential applications in many interesting AI tasks such as empathetic dialogue generation and user behavior understanding. To the best of our knowledge, no multimodal multi-party conversational dataset containing more than two speakers in a dialogue is available. In this work, we present the Multimodal EmotionLines Dataset (MELD), which we created by enhancing and extending the previously introduced EmotionLines dataset. MELD contains 13,708 utterances from 1,433 dialogues of the Friends TV series. MELD is superior to other conversational emotion recognition datasets such as SEMAINE and IEMOCAP because it contains multiparty conversations, and the number of utterances in MELD is almost twice that of these two datasets. Every utterance in MELD is associated with an emotion and a sentiment label. The utterances in MELD are multimodal, encompassing audio and visual modalities along with the text. We also address several shortcomings of EmotionLines and propose a strong multimodal baseline. The baseline results show that both contextual and multimodal information play an important role in emotion recognition in conversations.
Current multimodal sentiment analysis frames sentiment score prediction as a general machine learning task. However, what the sentiment score actually represents has often been overlooked. As a measurement of opinions and affective states, a sentiment score generally consists of two aspects: polarity and intensity. We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal ones in a naturalistic monologue setting. In particular, we build unimodal and multimodal multi-task learning models with sentiment score prediction as the main task and polarity and/or intensity classification as auxiliary tasks. Our experiments show that sentiment analysis benefits from multi-task learning, and that the individual modalities differ when conveying the polarity and intensity aspects of sentiment.
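A minimal sketch of the multi-task setup described above: a shared encoder feeds a main sentiment-score regression head plus auxiliary polarity and intensity classification heads, trained with a weighted joint loss. Class counts, sizes, and loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSentiment(nn.Module):
    """Sketch: shared encoder with a main sentiment-score regression head and
    auxiliary polarity / intensity classification heads."""
    def __init__(self, d_in, d_hidden, n_intensity_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.score_head = nn.Linear(d_hidden, 1)                        # main task: sentiment score
        self.polarity_head = nn.Linear(d_hidden, 2)                     # auxiliary: positive / negative
        self.intensity_head = nn.Linear(d_hidden, n_intensity_classes)  # auxiliary: e.g. weak / medium / strong

    def forward(self, x):
        h = self.encoder(x)
        return self.score_head(h).squeeze(-1), self.polarity_head(h), self.intensity_head(h)

model = MultiTaskSentiment(d_in=128, d_hidden=64)
score, polarity, intensity = model(torch.randn(4, 128))
# Joint objective: main regression plus weighted auxiliary classification terms.
loss = (F.mse_loss(score, torch.randn(4))
        + 0.5 * F.cross_entropy(polarity, torch.randint(0, 2, (4,)))
        + 0.5 * F.cross_entropy(intensity, torch.randint(0, 3, (4,))))
```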