Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed through the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input, and as a result the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence-to-sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a way of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for the final sentiment prediction. This ensures that our model remains robust to perturbed or missing information in the other modalities. We train our model with a coupled translation-prediction objective and achieve state-of-the-art results on the multimodal sentiment analysis datasets CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
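As a rough illustration of the coupled translation-prediction idea described above, here is a minimal PyTorch sketch: a toy GRU Seq2Seq translates language features into visual features, a second translator cycles back to the source, and a cycle-consistency term is added to the translation and prediction losses. All module names, dimensions, and the single-layer architecture are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityTranslator(nn.Module):
    """Toy Seq2Seq: encode one modality, decode another (illustrative)."""
    def __init__(self, src_dim, tgt_dim, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_dim)

    def forward(self, src):                       # src: (B, T, src_dim)
        _, h = self.encoder(src)                  # h: (1, B, hidden) = joint rep
        dec_in = h.transpose(0, 1).repeat(1, src.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), h.squeeze(0)    # translated sequence, joint rep

lang2vis = ModalityTranslator(src_dim=300, tgt_dim=35)   # language -> visual
vis2lang = ModalityTranslator(src_dim=35, tgt_dim=300)   # back-translation
predict  = nn.Linear(64, 1)                              # sentiment head
mse = nn.MSELoss()

language = torch.randn(8, 20, 300)   # dummy paired data
visual   = torch.randn(8, 20, 35)
label    = torch.randn(8, 1)

vis_hat, joint = lang2vis(language)               # forward translation
lang_hat, _    = vis2lang(vis_hat)                # cycle back to the source
loss = (mse(vis_hat, visual)                      # translation loss
        + mse(lang_hat, language)                 # cycle-consistency loss
        + mse(predict(joint), label))             # sentiment prediction loss
loss.backward()
```

At test time only the source (language) branch and the prediction head would be needed, which is what makes the representation robust to a missing visual stream.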
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
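A minimal sketch of the gating idea, assuming PyTorch: a learned sigmoid gate, conditioned on the word and the accompanying modality, can scale a noisy acoustic or visual input toward zero before it is fused with the word embedding. The class name GatedModalityInput and the feature dimensions are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class GatedModalityInput(nn.Module):
    """Learn a 0-1 gate that can shut off a noisy modality at each word (sketch)."""
    def __init__(self, word_dim, mod_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(word_dim + mod_dim, 1), nn.Sigmoid())

    def forward(self, word, mod):                 # word: (B, D_w), mod: (B, D_m)
        g = self.gate(torch.cat([word, mod], dim=-1))   # (B, 1); near 0 => filtered out
        return g * mod                            # gated modality embedding

gate_a = GatedModalityInput(word_dim=300, mod_dim=74)    # acoustic gate
word, acoustic = torch.randn(4, 300), torch.randn(4, 74)
fused_input = torch.cat([word, gate_a(word, acoustic)], dim=-1)  # fed to the LSTM at this word
```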
Multi-view sequential learning is a fundamental problem in machine learning dealing with multi-view sequences. In a multi-view sequence, there exist two forms of interactions between different views: view-specific interactions and cross-view interactions. In this paper, we present a new neural architecture for multi-view sequential learning called the Memory Fusion Network (MFN) that explicitly accounts for both interactions in a neural architecture and continuously models them through time. The first component of the MFN is called the System of LSTMs, where view-specific interactions are learned in isolation through assigning an LSTM function to each view. The cross-view interactions are then identified using a special attention mechanism called the Delta-memory Attention Network (DMAN) and summarized through time with a Multi-view Gated Memory. Through extensive experimentation, MFN is compared to various proposed approaches for multi-view sequential learning on multiple publicly available benchmark datasets. MFN outperforms all the existing multi-view approaches. Furthermore, MFN outperforms all current state-of-the-art models, setting new state-of-the-art results for these multi-view datasets.
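A much-simplified sketch of the described architecture, assuming PyTorch: one LSTM per view, an attention over the concatenation of the LSTM memories at consecutive time steps, and a gated memory that summarizes cross-view interactions through time. The names and dimensions (TinyMFN, hidden sizes) are illustrative and omit many details of the actual MFN.

```python
import torch
import torch.nn as nn

class TinyMFN(nn.Module):
    """Simplified sketch: one LSTM per view plus a gated cross-view memory
    updated from the change (delta) in the LSTM memories at each step."""
    def __init__(self, dims=(300, 35, 74), hidden=32, mem=64):
        super().__init__()
        self.hidden, self.mem = hidden, mem
        self.cells = nn.ModuleList(nn.LSTMCell(d, hidden) for d in dims)
        cat = 2 * hidden * len(dims)                       # memories at t-1 and t
        self.attn = nn.Linear(cat, cat)                    # delta-memory attention
        self.update = nn.Linear(cat, mem)                  # proposed memory content
        self.gate = nn.Linear(cat, mem)                    # retain vs. overwrite
        self.out = nn.Linear(mem + hidden * len(dims), 1)

    def forward(self, views):                              # list of (B, T, d_v)
        B, T = views[0].shape[:2]
        h = [torch.zeros(B, self.hidden) for _ in views]
        c = [torch.zeros(B, self.hidden) for _ in views]
        u = torch.zeros(B, self.mem)                       # multi-view gated memory
        for t in range(T):
            prev = torch.cat(c, dim=-1)
            for v, cell in enumerate(self.cells):
                h[v], c[v] = cell(views[v][:, t], (h[v], c[v]))
            delta = torch.cat([prev, torch.cat(c, dim=-1)], dim=-1)
            a = torch.softmax(self.attn(delta), dim=-1) * delta
            g = torch.sigmoid(self.gate(a))
            u = g * u + (1 - g) * torch.tanh(self.update(a))
        return self.out(torch.cat([u] + h, dim=-1))        # sentiment score

model = TinyMFN()
score = model([torch.randn(2, 20, 300), torch.randn(2, 20, 35), torch.randn(2, 20, 74)])
```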
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication; however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
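The intra- and inter-modality dynamics described above can be captured with an outer product of per-modality embeddings, each padded with a constant 1 so that unimodal and bimodal terms survive alongside the trimodal ones. A hedged PyTorch sketch with illustrative dimensions, omitting the modality embedding subnetworks of the full model:

```python
import torch

def tensor_fusion(z_l, z_v, z_a):
    """3-way outer product of per-modality embeddings, each padded with a 1
    so unimodal and bimodal interaction terms are kept alongside the trimodal ones."""
    one = torch.ones(z_l.size(0), 1)
    zl = torch.cat([z_l, one], dim=1)             # (B, d_l + 1)
    zv = torch.cat([z_v, one], dim=1)
    za = torch.cat([z_a, one], dim=1)
    fused = torch.einsum('bi,bj,bk->bijk', zl, zv, za)
    return fused.flatten(start_dim=1)             # (B, (d_l+1)(d_v+1)(d_a+1))

f = tensor_fusion(torch.randn(4, 128), torch.randn(4, 32), torch.randn(4, 32))
print(f.shape)   # torch.Size([4, 140481]) = 129 * 33 * 33
```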
Understanding affect in video segments has brought together researchers from the language, audio, and video domains. Most current multimodal research in this area deals with various techniques for fusing the modalities and treats the segments of a video largely independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, the Relational Tensor Network, where we use inter-modal interactions within a segment (intra-segment) and model the sequence of segments in a video to capture inter-segment inter-modal interactions. We also enrich the text and audio modalities by leveraging richer audio and linguistic context and by fusing fine-grained, knowledge-based polarity scores derived from the text. We present the results of our model on the CMU-MOSEI dataset and show that our model outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition.
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each of them focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets covering multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations showing that each stage of fusion focuses on a different subset of the multimodal signals, learning increasingly discriminative multimodal representations.
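A toy sketch of the multistage idea, assuming PyTorch: each stage highlights a subset of the concatenated multimodal signal and refines an intermediate representation built on the previous stage. The class MultistageFusion and its components are hypothetical simplifications, not the RMFN itself.

```python
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    """Sketch of stage-wise fusion: each stage attends to a subset of the
    concatenated multimodal signal and refines an intermediate representation."""
    def __init__(self, in_dim, rep_dim=64, stages=3):
        super().__init__()
        self.stages = stages
        self.highlight = nn.Linear(in_dim + rep_dim, in_dim)   # which signals to focus on
        self.integrate = nn.GRUCell(in_dim, rep_dim)           # build on the previous stage

    def forward(self, multimodal):                # (B, in_dim) concatenated features
        rep = torch.zeros(multimodal.size(0), self.integrate.hidden_size)
        for _ in range(self.stages):
            weights = torch.sigmoid(self.highlight(torch.cat([multimodal, rep], -1)))
            rep = self.integrate(weights * multimodal, rep)    # specialized fusion step
        return rep

fuse = MultistageFusion(in_dim=300 + 35 + 74)
rep = fuse(torch.randn(8, 409))
```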
Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., it ignores the inter-dependencies and relations among the utterances of a video. In this paper, we propose an LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows a 5-10% performance improvement over the state of the art and generalizes robustly.
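A minimal sketch of this contextual idea, assuming PyTorch and pre-extracted utterance features: a bidirectional LSTM runs over the utterances of one video so that each utterance's representation carries context from its neighbours before classification. Names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ContextualUtteranceLSTM(nn.Module):
    """Sketch: classify each utterance using context from surrounding
    utterances of the same video via a bidirectional LSTM."""
    def __init__(self, feat_dim=100, hidden=64, classes=2):
        super().__init__()
        self.context = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, classes)

    def forward(self, video):                     # (B, n_utterances, feat_dim)
        ctx, _ = self.context(video)              # each utterance sees its neighbours
        return self.classify(ctx)                 # per-utterance sentiment logits

model = ContextualUtteranceLSTM()
logits = model(torch.randn(2, 12, 100))           # 2 videos, 12 utterances each
```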
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
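A hedged PyTorch sketch of the shifting mechanism described above: the visual and acoustic frames aligned with a word are summarized, gated against the word embedding, and mapped to a shift vector that is added to the word representation. The class NonverbalShift and all sizes are assumptions, not the RAVEN code.

```python
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    """Sketch of shifting a word embedding by a vector computed from the
    visual/acoustic frames that co-occur with the word."""
    def __init__(self, word_dim=300, vis_dim=35, acou_dim=74, hidden=32):
        super().__init__()
        self.vis_enc = nn.LSTM(vis_dim, hidden, batch_first=True)
        self.aco_enc = nn.LSTM(acou_dim, hidden, batch_first=True)
        self.gate = nn.Linear(word_dim + 2 * hidden, 2 * hidden)
        self.shift = nn.Linear(2 * hidden, word_dim)

    def forward(self, word, vis_frames, aco_frames):
        _, (hv, _) = self.vis_enc(vis_frames)              # nonverbal subword summaries
        _, (ha, _) = self.aco_enc(aco_frames)
        nonverbal = torch.cat([hv[-1], ha[-1]], dim=-1)
        attn = torch.sigmoid(self.gate(torch.cat([word, nonverbal], dim=-1)))
        return word + self.shift(attn * nonverbal)         # shifted word representation

shift = NonverbalShift()
w = shift(torch.randn(4, 300), torch.randn(4, 10, 35), torch.randn(4, 10, 74))
```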
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Learning representations of multimodal data is a fundamentally complex research problem due to the presence of multiple sources of information. To address the complexities of multimodal data, we argue that a suitable representation learning model should: 1) factorize representations according to independent factors of variation in the data, capture important features for both 2) discriminative and 3) generative tasks, and 4) couple both modality-specific and multimodal information. To encapsulate all of these properties, we propose the Multimodal Factorization Model (MFM), which factorizes multimodal representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as predicting sentiment. The modality-specific generative factors are unique to each modality and contain the information required for generating data. Our experimental results show that our model is able to learn meaningful multimodal representations and achieve state-of-the-art or competitive performance on five multimodal datasets. Our model also demonstrates flexible generative capabilities by conditioning on the independent factors. We further interpret the factorized representations to understand the interactions that influence multimodal learning.
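A deterministic toy sketch of this factorization, assuming PyTorch: a shared encoder yields the multimodal discriminative factor used for prediction, per-modality encoders yield the generative factors, and decoders reconstruct each modality from the pair. The real MFM is variational with deeper networks; everything here (TinyMFM, layer sizes) is illustrative.

```python
import torch
import torch.nn as nn

class TinyMFM(nn.Module):
    """Deterministic sketch: one shared discriminative factor plus a
    modality-specific generative factor per modality."""
    def __init__(self, dims=(300, 35, 74), f_shared=32, f_specific=16):
        super().__init__()
        self.shared = nn.Linear(sum(dims), f_shared)                         # discriminative factor
        self.specific = nn.ModuleList(nn.Linear(d, f_specific) for d in dims)  # generative factors
        self.decoders = nn.ModuleList(nn.Linear(f_shared + f_specific, d) for d in dims)
        self.classify = nn.Linear(f_shared, 1)

    def forward(self, mods):                       # list of (B, d_m)
        fy = torch.relu(self.shared(torch.cat(mods, dim=-1)))
        recons = [dec(torch.cat([fy, torch.relu(enc(x))], dim=-1))
                  for x, enc, dec in zip(mods, self.specific, self.decoders)]
        return self.classify(fy), recons           # sentiment prediction + reconstructions

model = TinyMFM()
pred, recons = model([torch.randn(4, 300), torch.randn(4, 35), torch.randn(4, 74)])
```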
Technology has enabled anyone with an Internet connection to easily create and share their ideas, opinions and content with millions of other people around the world. Much of the content being posted and consumed online is multimodal. With billions of phones, tablets and PCs shipping today with built-in cameras and a host of new video-equipped wearables like Google Glass on the horizon, the amount of video on the Internet will only continue to increase. It has become increasingly difficult for researchers to keep up with this deluge of multimodal content, let alone organize or make sense of it. Mining useful knowledge from video is a critical need that will grow exponentially, in pace with the global growth of content. This is particularly important in sentiment analysis, as both service and product reviews are gradually shifting from unimodal to multimodal. We present a novel method to extract features from visual and textual modalities using deep convolutional neural networks. By feeding such features to a multiple kernel learning classifier, we significantly outperform the state of the art of multimodal emotion recognition and sentiment analysis on different datasets.
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
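The maximum correlation loss term mentioned in (ii) can be sketched as a negative Pearson correlation between two views' hidden representations; minimizing it pushes the views toward sharing cross-modal information. A small PyTorch sketch with assumed batch and feature sizes:

```python
import torch

def correlation_loss(h_a, h_b, eps=1e-8):
    """Negative mean per-dimension Pearson correlation between two views'
    representations; minimizing it encourages maximally correlated views."""
    a = h_a - h_a.mean(dim=0, keepdim=True)
    b = h_b - h_b.mean(dim=0, keepdim=True)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()

h_video, h_audio = torch.randn(32, 64), torch.randn(32, 64)
loss = correlation_loss(h_video, h_audio)   # combined with reconstruction/prediction losses
```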
Audio sentiment analysis is a popular research area that extends conventional text-based sentiment analysis and relies on the effectiveness of acoustic features extracted from speech. However, current progress in audio analysis mainly focuses on extracting homogeneous acoustic features or fails to fuse heterogeneous features effectively. In this paper, we propose an utterance-based deep neural network model with a parallel combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) based networks to obtain representative features, termed the Audio Sentiment Vector (ASV), that maximally reflect the sentiment information in the audio. Specifically, our model is trained with utterance-level labels, and the ASV is extracted and fused from the two branches. In the CNN branch, the spectrogram generated from the signal is fed as input, while in the LSTM branch the input consists of spectral features and cepstral coefficients extracted from the constituent utterances of the audio. Moreover, a Bidirectional Long Short-Term Memory (BiLSTM) mechanism is used for feature fusion. Extensive experiments show that our model can recognize audio sentiment precisely and quickly, and demonstrate that our ASV outperforms traditional acoustic feature vectors extracted by other deep learning models. Furthermore, the experimental results indicate that the proposed model outperforms the state-of-the-art approach by 9.33% on the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset.
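A rough PyTorch sketch of the two-branch design, under assumed feature shapes: a small CNN over the spectrogram runs in parallel with an LSTM over frame-level spectral/cepstral features, and a BiLSTM fuses the two branch outputs. TinyASV and its layer choices are illustrative, not the paper's network.

```python
import torch
import torch.nn as nn

class TinyASV(nn.Module):
    """Sketch: CNN over the spectrogram in parallel with an LSTM over
    frame-level features, fused by a BiLSTM into an audio sentiment score."""
    def __init__(self, n_mels=64, frame_dim=39, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(), nn.Linear(16 * 16, hidden))
        self.lstm = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.fuse = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, spectrogram, frames):       # (B, 1, n_mels, T), (B, T, frame_dim)
        a = self.cnn(spectrogram)                 # (B, hidden) CNN-branch summary
        _, (h, _) = self.lstm(frames)
        b = h[-1]                                 # (B, hidden) LSTM-branch summary
        pair = torch.stack([a, b], dim=1)         # treat the two branches as a 2-step sequence
        fused, _ = self.fuse(pair)
        return self.out(fused[:, -1])             # audio sentiment score

model = TinyASV()
score = model(torch.randn(2, 1, 64, 100), torch.randn(2, 100, 39))
```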
Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state-of-the-art prediction results. Along with the success of deep learning in many other application domains, deep learning is also popularly used in sentiment analysis in recent years. This paper first gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.
Multimodal sentiment analysis is a very active research area. A promising area of opportunity in this field is to improve the multimodal fusion mechanism. We present a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities two at a time and only then fusing all three modalities. On multimodal sentiment analysis of individual utterances, our strategy outperforms conventional concatenation of features by 1%, which amounts to a 5% reduction in error rate. On utterance-level multimodal sentiment analysis of multi-utterance video clips, where the current state of the art incorporates contextual information from other utterances of the same clip, our hierarchical fusion gives up to a 2.4% improvement (almost a 10% error rate reduction) over the currently used concatenation. The implementation of our method is publicly available in the form of open-source code.
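A minimal sketch of the hierarchical strategy, assuming PyTorch: the three modalities are fused pairwise first, and the three bimodal vectors are then fused into a trimodal representation, instead of a single flat concatenation. Names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Sketch: fuse modalities two at a time, then fuse the three bimodal
    vectors into a single trimodal representation."""
    def __init__(self, d_l=300, d_v=35, d_a=74, h=64):
        super().__init__()
        self.lv = nn.Linear(d_l + d_v, h)
        self.la = nn.Linear(d_l + d_a, h)
        self.va = nn.Linear(d_v + d_a, h)
        self.tri = nn.Linear(3 * h, h)

    def forward(self, l, v, a):
        bi = [torch.relu(f(torch.cat(p, dim=-1)))
              for f, p in ((self.lv, (l, v)), (self.la, (l, a)), (self.va, (v, a)))]
        return torch.relu(self.tri(torch.cat(bi, dim=-1)))   # trimodal representation

fuse = HierarchicalFusion()
z = fuse(torch.randn(4, 300), torch.randn(4, 35), torch.randn(4, 74))
```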
We compile baselines, along with dataset split, for multimodal sentiment analysis. In this paper, we explore three different deep-learning based architectures for multimodal sentiment classification, each improving upon the previous. Further, we evaluate these architectures with multiple datasets with fixed train/test partition. We also discuss some major issues, frequently ignored in multimodal sentiment analysis research, e.g., role of speaker-exclusive models, importance of different modalities, and generalizability. This framework illustrates the different facets of analysis to be considered while performing multimodal sentiment analysis and, hence, serves as a new benchmark for future research in this emerging field.
Affective computing is an emerging interdisciplinary research field bringing together researchers and practitioners from various fields, ranging from artificial intelligence, natural language processing, to cognitive and social sciences. With the proliferation of videos posted online (e.g., on YouTube, Facebook, Twitter) for product reviews, movie reviews, political views, and more, affective computing research has increasingly evolved from conventional unimodal analysis to more complex forms of multimodal analysis. This is the primary motivation behind our first of its kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of state of the art in multimodal affect analysis frameworks, which this review aims to address. Multimodality is defined by the presence of more than one modality or channel, e.g., visual, audio, text, gestures, and eye gaze. In this paper, we focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. Following an overview of different techniques for unimodal affect analysis, we outline existing methods for fusing information from different modalities. As part of this review, we carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.
Over the past decade, video blogs (vlogs) have become an extremely popular way for people to express their emotions. The ubiquity of these videos has increased the importance of multimodal fusion models, which combine video and audio features with traditional textual features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human observers. When detecting sentiment in these videos, acoustic and video features provide clarity for otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze the sentiment of spoken sentences. We discard traditional transcript features in order to minimize human intervention and maximize the deployability of our model on large-scale real-world data. We select high-level features from models that have been successful in non-affect domains in order to test their generalizability to the sentiment detection domain. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out test set.
Current multimodal sentiment analysis frames sentiment score prediction as a general machine learning task. However, what the sentiment score actually represents has often been overlooked. As a measurement of opinions and affective states, a sentiment score generally consists of two aspects: polarity and intensity. We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal ones in a naturalistic monologue setting. In particular, we build unimodal and multimodal multi-task learning models with sentiment score prediction as the main task and polarity and/or intensity classification as the auxiliary tasks. Our experiments show that sentiment analysis benefits from multi-task learning, and that individual modalities differ when conveying the polarity and intensity aspects of sentiment.
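A small PyTorch sketch of the multi-task setup described above: a shared encoder feeds a main regression head for the sentiment score and auxiliary classification heads for polarity and intensity. The class SentimentMultitask, the input dimension, and the number of intensity bins are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SentimentMultitask(nn.Module):
    """Sketch: shared encoder with a main regression head for the sentiment
    score and auxiliary heads for polarity and intensity classification."""
    def __init__(self, in_dim=409, hidden=64, intensity_bins=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)                    # main task: sentiment score
        self.polarity = nn.Linear(hidden, 2)                 # auxiliary: positive/negative
        self.intensity = nn.Linear(hidden, intensity_bins)   # auxiliary: e.g. weak/medium/strong

    def forward(self, x):
        h = self.encoder(x)
        return self.score(h), self.polarity(h), self.intensity(h)

model = SentimentMultitask()
score, pol, inten = model(torch.randn(8, 409))
loss = (nn.functional.mse_loss(score, torch.randn(8, 1))
        + nn.functional.cross_entropy(pol, torch.randint(2, (8,)))
        + nn.functional.cross_entropy(inten, torch.randint(3, (8,))))
```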