Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence-to-sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore multiple different variations on the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the bimodal case where our model is able to improve the F1 score by 12 points. We also discuss future directions for multimodal Seq2Seq methods.
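To make the modality-translation idea concrete, the sketch below encodes one modality's feature sequence with an LSTM, trains a decoder to reconstruct another modality's sequence, and reuses the encoder state as the joint representation for sentiment regression. This is a minimal illustration, not the paper's exact architecture; the feature dimensions (300-d text, 74-d acoustic), hidden size and regression head are assumptions.

```python
import torch
import torch.nn as nn

class Seq2SeqModalityTranslator(nn.Module):
    """Encode a source modality, decode a target modality; the final encoder
    state doubles as a joint multimodal representation (illustrative only)."""
    def __init__(self, src_dim, tgt_dim, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(src_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(tgt_dim, hidden, batch_first=True)
        self.project = nn.Linear(hidden, tgt_dim)   # reconstruct target frames
        self.sentiment = nn.Linear(hidden, 1)       # downstream regression head

    def forward(self, src_seq, tgt_seq):
        _, (h, c) = self.encoder(src_seq)           # h: joint representation
        dec_out, _ = self.decoder(tgt_seq, (h, c))  # teacher-forced decoding
        return self.project(dec_out), self.sentiment(h[-1])

# toy usage: translate assumed 300-d text features into assumed 74-d acoustic features
model = Seq2SeqModalityTranslator(src_dim=300, tgt_dim=74)
text, audio = torch.randn(8, 20, 300), torch.randn(8, 20, 74)
recon, score = model(text, audio)
translation_loss = nn.functional.mse_loss(recon, audio)
```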
Learning representations of multimodal data is a fundamentally complex research problem due to the presence of multiple sources of information. To address the complexities of multimodal data, we argue that a suitable representation learning model should: 1) factorize representations according to independent factors of variation in the data, capture important features for both 2) discriminative and 3) generative tasks, and 4) couple both modality-specific and multimodal information. To encapsulate all of these properties, we propose the Multimodal Factorization Model (MFM), which factorizes multimodal representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as predicting sentiment. The modality-specific generative factors are unique to each modality and contain the information required for generating data. Our experimental results show that our model is able to learn meaningful multimodal representations and achieve state-of-the-art or competitive performance on five multimodal datasets. Our model also demonstrates flexible generative capabilities by conditioning on the independent factors. We further interpret the factorized representations to understand the interactions that influence multimodal learning.
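A toy rendering of the factorization idea is sketched below: one encoder produces a shared discriminative factor used for prediction, a per-modality encoder produces a generative factor, and each modality is reconstructed from the concatenation of the two. The MFM paper formulates this generatively with variational inference; this deterministic sketch, with assumed feature dimensions, only illustrates the factor structure.

```python
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    """Toy factorization: a shared discriminative factor plus one
    modality-specific generative factor per modality (dimensions assumed)."""
    def __init__(self, dims=(300, 74, 35), d_disc=32, d_gen=16):
        super().__init__()
        self.disc_enc = nn.Linear(sum(dims), d_disc)   # shared factor
        self.gen_encs = nn.ModuleList([nn.Linear(d, d_gen) for d in dims])
        self.decoders = nn.ModuleList([nn.Linear(d_disc + d_gen, d) for d in dims])
        self.classifier = nn.Linear(d_disc, 1)

    def forward(self, xs):
        f_disc = torch.relu(self.disc_enc(torch.cat(xs, dim=-1)))
        recons = []
        for x, enc, dec in zip(xs, self.gen_encs, self.decoders):
            f_gen = torch.relu(enc(x))                            # modality-specific factor
            recons.append(dec(torch.cat([f_disc, f_gen], dim=-1)))
        return recons, self.classifier(f_disc)

xs = [torch.randn(4, d) for d in (300, 74, 35)]
recons, pred = FactorizedMultimodal()(xs)
recon_loss = sum(nn.functional.mse_loss(r, x) for r, x in zip(recons, xs))
```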
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each of which focuses on a subset of the multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon intermediate representations from previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets covering multimodal sentiment analysis, emotion recognition and speaker traits recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of the multimodal signals, learning increasingly discriminative multimodal representations.
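A rough sketch of stage-wise fusion is given below: at every stage an attention vector softly highlights a subset of the concatenated modality signals, and a recurrent cell folds that subset into the fusion state. It is a loose approximation of the multistage idea under assumed dimensions, not the RMFN equations.

```python
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    """Illustrative multistage fusion: each stage softly selects a subset of
    the concatenated modality signals and updates a shared fusion state."""
    def __init__(self, d_concat, d_fuse=64, stages=3):
        super().__init__()
        self.stages = stages
        self.highlight = nn.Linear(d_concat + d_fuse, d_concat)  # soft subset selection
        self.fuse = nn.GRUCell(d_concat, d_fuse)

    def forward(self, z):                          # z: concatenated modality features
        s = z.new_zeros(z.size(0), self.fuse.hidden_size)
        for _ in range(self.stages):
            attn = torch.sigmoid(self.highlight(torch.cat([z, s], dim=-1)))
            s = self.fuse(attn * z, s)             # fuse only the highlighted subset
        return s                                    # final multimodal representation

z = torch.cat([torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16)], dim=-1)
representation = MultistageFusion(d_concat=64)(z)
```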
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication, however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
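The sketch below approximates the Multi-attention Block: K softmax attentions over the concatenation of the per-modality hidden states each pick out one cross-modal pattern, and the attended copies are compressed into a code that would be written back into the hybrid memory. Hidden sizes and the number of attentions are assumptions, and the LSTHM recurrence itself is omitted.

```python
import torch
import torch.nn as nn

class MultiAttentionBlock(nn.Module):
    """Illustrative Multi-attention Block: K attentions over the concatenated
    per-modality hidden states extract K candidate cross-modal interactions."""
    def __init__(self, d_total, k=4, d_code=32):
        super().__init__()
        self.k = k
        self.attend = nn.Linear(d_total, k * d_total)
        self.reduce = nn.Linear(k * d_total, d_code)

    def forward(self, h_cat):                        # h_cat = [h_text; h_audio; h_video]
        b, d = h_cat.shape
        weights = torch.softmax(self.attend(h_cat).view(b, self.k, d), dim=-1)
        attended = (weights * h_cat.unsqueeze(1)).reshape(b, -1)
        return self.reduce(attended)                 # cross-modal code for the hybrid memory

h_cat = torch.cat([torch.randn(4, 64), torch.randn(4, 32), torch.randn(4, 32)], dim=-1)
code = MultiAttentionBlock(d_total=128)(h_cat)
```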
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
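The core fusion step can be illustrated directly: appending a constant 1 to each modality embedding and taking the outer product yields a tensor whose blocks correspond to unimodal, bimodal and trimodal interaction terms. The embedding sizes below are arbitrary placeholders, and the sub-networks that produce the embeddings and consume the fused tensor are omitted.

```python
import torch

def tensor_fusion(z_l, z_a, z_v):
    """Outer-product fusion: with a constant 1 appended to each embedding,
    the product contains unimodal, bimodal and trimodal interaction terms."""
    one = z_l.new_ones(z_l.size(0), 1)
    z_l, z_a, z_v = [torch.cat([z, one], dim=1) for z in (z_l, z_a, z_v)]
    fused = torch.einsum('bi,bj,bk->bijk', z_l, z_a, z_v)   # (B, d_l+1, d_a+1, d_v+1)
    return fused.flatten(start_dim=1)                        # input to a downstream classifier

fused = tensor_fusion(torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 32))
print(fused.shape)   # torch.Size([8, 129 * 33 * 33])
```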
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
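The word-level gating idea can be pictured with a small module: a gate computed from the word vector and the accompanying modality feature scales that feature before it is merged with the word, so a noisy visual or acoustic input can be largely shut off. The sigmoid gate here is a soft stand-in for the paper's gating mechanism, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GatedModalityInput(nn.Module):
    """Illustrative word-level gate: decide how much of a possibly noisy
    modality feature is passed on before fusion with the word embedding."""
    def __init__(self, d_word, d_mod):
        super().__init__()
        self.gate = nn.Linear(d_word + d_mod, 1)
        self.embed = nn.Linear(d_mod, d_word)

    def forward(self, word, mod):
        g = torch.sigmoid(self.gate(torch.cat([word, mod], dim=-1)))
        return word + g * self.embed(mod)      # gated contribution of the modality

word = torch.randn(4, 300)     # e.g. a word embedding
visual = torch.randn(4, 35)    # e.g. facial-expression features aligned to that word
fused = GatedModalityInput(300, 35)(word, visual)
```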
Sentiment analysis research has developed rapidly over the past decade and has attracted widespread attention from both academia and industry, most of it based on text. However, information in the real world usually comes in different modalities. In this paper, we consider the task of multimodal sentiment analysis using the audio and text modalities, and propose a fusion strategy that includes both multi-feature fusion and multi-modality fusion to improve the accuracy of audio-text sentiment analysis. We call this the Deep Feature Fusion - Audio and Text Modal Fusion (DFF-ATMF) model; the features it learns are complementary to each other and robust. Experiments on the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show very competitive results for our proposed model. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy also generalizes extremely well to multimodal emotion recognition.
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
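The shifting mechanism can be sketched as follows: gates conditioned on the word and on each nonverbal embedding weight the visual and acoustic evidence, a linear map turns the weighted evidence into a shift vector, and the shift is added to the original word embedding. Feature sizes are assumptions, and the subword-level recurrent encoders of the nonverbal streams are left out.

```python
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    """Illustrative nonverbal shift: visual and acoustic context produce a
    shift vector that is added to the word representation."""
    def __init__(self, d_word, d_vis, d_aco):
        super().__init__()
        self.g_vis = nn.Linear(d_word + d_vis, 1)
        self.g_aco = nn.Linear(d_word + d_aco, 1)
        self.shift = nn.Linear(d_vis + d_aco, d_word)

    def forward(self, word, visual, acoustic):
        gv = torch.sigmoid(self.g_vis(torch.cat([word, visual], dim=-1)))
        ga = torch.sigmoid(self.g_aco(torch.cat([word, acoustic], dim=-1)))
        h = self.shift(torch.cat([gv * visual, ga * acoustic], dim=-1))
        return word + h                        # dynamically shifted word representation

shifted = NonverbalShift(300, 35, 74)(
    torch.randn(4, 300), torch.randn(4, 35), torch.randn(4, 74))
```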
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Multi-view sequential learning is a fundamental problem in machine learning dealing with multi-view sequences. In a multi-view sequence, there exist two forms of interactions between different views: view-specific interactions and cross-view interactions. In this paper, we present a new neural architecture for multi-view sequential learning called the Memory Fusion Network (MFN) that explicitly accounts for both interactions in a neural architecture and continuously models them through time. The first component of the MFN is called the System of LSTMs, where view-specific interactions are learned in isolation through assigning an LSTM function to each view. The cross-view interactions are then identified using a special attention mechanism called the Delta-memory Attention Network (DMAN) and summarized through time with a Multi-view Gated Memory. Through extensive experimentation, MFN is compared to various proposed approaches for multi-view sequential learning on multiple publicly available benchmark datasets. MFN outperforms all the existing multi-view approaches. Furthermore, MFN outperforms all current state-of-the-art models, setting new state-of-the-art results for these multi-view datasets.
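A compact way to picture the DMAN and the gated memory is sketched below: the per-view LSTM memories of two consecutive time steps are concatenated, an attention highlights the dimensions that changed across views, and gates decide how much of the proposed update enters the shared multi-view memory. Dimensions are assumed, and the surrounding System of LSTMs is omitted.

```python
import torch
import torch.nn as nn

class DeltaMemoryAttention(nn.Module):
    """Illustrative Delta-memory attention with a gated multi-view memory:
    attend over consecutive LSTM memories, then gate the shared memory update."""
    def __init__(self, d_mems, d_mem=64):
        super().__init__()
        self.attn = nn.Linear(2 * d_mems, 2 * d_mems)
        self.candidate = nn.Linear(2 * d_mems, d_mem)
        self.retain = nn.Linear(2 * d_mems, d_mem)

    def forward(self, c_prev, c_now, u):
        delta = torch.cat([c_prev, c_now], dim=-1)        # view memories at t-1 and t
        highlighted = torch.softmax(self.attn(delta), dim=-1) * delta
        u_hat = torch.tanh(self.candidate(highlighted))   # proposed memory content
        g = torch.sigmoid(self.retain(highlighted))
        return g * u + (1 - g) * u_hat                    # gated multi-view memory

dman = DeltaMemoryAttention(d_mems=96)
memory = torch.zeros(4, 64)
memory = dman(torch.randn(4, 96), torch.randn(4, 96), memory)
```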
We propose a multimodal data fusion method that accounts for the higher-order relationships between $M$ modalities and the output layer of a neural network model by forming an $(M+1)$-dimensional tensor. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, removes information that is redundant with respect to the model output and leads to fewer model parameters with minimal loss of performance. The factorization acts as a regularizer, yielding a less complicated model and avoiding overfitting. In addition, the modality-based factorization helps in understanding the amount of useful information in each modality. We have applied this method to three different multimodal datasets for sentiment analysis, personality trait recognition and emotion recognition. The results show that, compared to the state of the art for all three tasks, the method achieves improvements of 1% to 4% on several evaluation measures.
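One simple member of this family of factorized fusions is a CP-style low-rank decomposition in which each modality keeps its own factor and the full $(M+1)$-dimensional weight tensor is never materialized. The sketch below shows that variant with assumed dimensions and a single shared rank; the paper's modality-specific ranks and exact factorization are not reproduced.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """CP-style low-rank fusion: one small factor per modality replaces the
    full (d1+1) x (d2+1) x (d3+1) x d_out weight tensor."""
    def __init__(self, dims, d_out, rank=4):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(rank, d + 1, d_out)) for d in dims])

    def forward(self, xs):
        one = xs[0].new_ones(xs[0].size(0), 1)
        prod = 1.0
        for x, factor in zip(xs, self.factors):
            x1 = torch.cat([x, one], dim=1)                       # append constant 1
            prod = prod * torch.einsum('bd,rdo->bro', x1, factor)
        return prod.sum(dim=1)                                    # sum over rank components

fusion = LowRankFusion(dims=(128, 32, 32), d_out=16)
out = fusion([torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 32)])
```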
Understanding affect from video segments has brought together researchers from the language, audio and video domains. Most current multimodal research in this area deals with various techniques for fusing the modalities and mostly treats the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, the Relational Tensor Network, where we use the inter-modal interactions within a segment (intra-segment) and also consider the sequence of segments in a video to model inter-segment, inter-modal interactions. We also generate rich representations of the text and audio modalities by leveraging richer audio and linguistic context, and by fusing fine-grained, knowledge-based polarity scores from the text. We present the results of our model on the CMU-MOSEI dataset and show that it outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition.
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
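The maximum-correlation term can be written down directly as a negative Pearson correlation between the encodings of two modalities, averaged over dimensions; adding it to the reconstruction and prediction losses encourages cross-modal structure in the joint representation. This is one standard form of such a loss, not necessarily the exact objective used in the paper.

```python
import torch

def max_correlation_loss(h_a, h_b, eps=1e-8):
    """Negative mean Pearson correlation between two modality encodings;
    minimizing it pulls the encodings toward correlated, shared structure."""
    a = h_a - h_a.mean(dim=0, keepdim=True)
    b = h_b - h_b.mean(dim=0, keepdim=True)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()

loss = max_correlation_loss(torch.randn(32, 64), torch.randn(32, 64))
```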
Machine translation has recently achieved impressive results thanks to advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet they still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using a single parallel sentence at training time.
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that focus on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other a sequence-to-sequence loss, both built on top of the Transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new audio-visual speech recognition dataset, LRS2-BBC, consisting of thousands of natural sentences from British television. The models we train surpass the performance of all previous work on a lip reading benchmark dataset.
Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the inter-dependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.
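The contextual idea is straightforward to sketch: instead of classifying each utterance in isolation, a bidirectional LSTM runs over the sequence of utterance-level feature vectors of a video so that every prediction sees its neighbours. Feature and hidden sizes below are assumptions.

```python
import torch
import torch.nn as nn

class ContextualUtteranceLSTM(nn.Module):
    """Illustrative contextual classifier: a bidirectional LSTM over the
    utterance sequence of a video gives each utterance its surrounding context."""
    def __init__(self, d_utt, d_hidden=64, n_classes=2):
        super().__init__()
        self.context = nn.LSTM(d_utt, d_hidden, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, utterances):                 # (videos, utterances, features)
        h, _ = self.context(utterances)
        return self.classify(h)                    # one prediction per utterance

logits = ContextualUtteranceLSTM(d_utt=100)(torch.randn(2, 30, 100))
```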
Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daumé III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence to sequence tasks: word ordering, parsing, and machine translation.
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It is a significant technical challenge because humans display their emotions through complex, idiosyncratic combinations of the language, visual and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both a direct, person-independent perspective and a relative, person-dependent perspective. The person-independent perspective follows the conventional approach to emotion recognition, which directly infers absolute emotion labels from observed multimodal features. The relative, person-dependent perspective treats emotion recognition in a relative manner by comparing partial video segments to determine whether emotional intensity increased or decreased. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask is a multimodal local ranking of relative emotion intensities between two short segments of a video. The second subtask infers global relative emotion ranks from the local rankings using a Bayesian ranking algorithm. The third subtask combines direct predictions from the observed multimodal behaviors with the relative emotion ranks from the local-global ranking for the final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
In natural language processing (NLP), it is important to detect the relationship between two sequences or to generate a sequence of tokens given another observed sequence. We refer to this type of problem, modeling sequence pairs, as a sequence-to-sequence (seq2seq) mapping problem. A great deal of research has been devoted to finding ways of tackling these problems, with traditional approaches relying on a combination of hand-crafted features, alignment models, segmentation heuristics and external linguistic resources. Although great progress has been made, these traditional approaches suffer from various drawbacks, such as complicated pipelines, laborious feature engineering, and difficulty in domain adaptation. Recently, neural networks have emerged as a solution to many problems in NLP, speech recognition and computer vision. Neural models are powerful because they can be trained end to end, generalise well to unseen examples, and the same framework can easily be adapted to a new domain. The aim of this thesis is to advance the state of the art in seq2seq mapping problems with neural networks. We explore solutions from three major aspects: investigating neural models for representing sequences, modeling the interactions between sequences, and using unpaired data to improve the performance of neural models. For each aspect, we propose novel models and evaluate their efficacy on various seq2seq mapping tasks.
Multimodal learning has lacked principled ways of combining information from different modalities and learning meaningful, low-dimensional representations. We study multimodal learning and sensor fusion from a latent-variable perspective. We first propose a regularized recurrent attention filter for sensor fusion. This algorithm can dynamically combine information from different types of sensors in sequential decision-making tasks. Each sensor is coupled with a modular neural network to maximize the utility of its own information, and a gating modular neural network dynamically generates a set of mixing weights for the outputs of the sensor networks by balancing the utility of all sensors' information. We design a co-learning mechanism to encourage simultaneous, adaptive yet independent learning for each sensor, and propose a regularization-based co-learning method. In the second part, we focus on recovering the diversity of the latent representations. We propose a co-learning approach using a probabilistic graphical model that imposes a structural prior on the generative model: the Multimodal Variational RNN (MVRNN) model, and derive a variational lower bound on its objective function. In the third part, we extend the Siamese structure to sensor fusion for robust acoustic event detection. We are conducting experiments to study the extracted latent representations; the work will be completed in the coming months. Our experiments show that the recurrent attention filter can dynamically combine different sensor inputs according to the information carried in the inputs. We believe the MVRNN can identify latent representations that are useful for many downstream tasks such as speech synthesis, activity recognition, and control and planning. Both algorithms are general frameworks that can be applied to other tasks where different types of sensors are jointly used for decision making.