Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence-to-sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore multiple different variations of the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, in particular in the bimodal case where our model is able to improve the F1 score by 12 points. We also discuss future directions for multimodal Seq2Seq methods.
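The abstract gives no code; as a rough, minimal sketch of the modality-translation idea (assuming PyTorch, a GRU encoder-decoder, and made-up feature sizes such as 300-d text and 74-d acoustic vectors), one modality's sequence can be encoded and another modality's sequence decoded from the final hidden state, which then doubles as the joint representation used for sentiment prediction:

```python
# Minimal sketch of a Seq2Seq modality translation model (not the authors' code).
# Assumption: text features are translated into acoustic features; the encoder's
# final hidden state is reused as the joint multimodal representation.
import torch
import torch.nn as nn

class ModalityTranslator(nn.Module):
    def __init__(self, src_dim, tgt_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, tgt_dim)

    def forward(self, src_seq, tgt_seq):
        # src_seq: (batch, T_src, src_dim), tgt_seq: (batch, T_tgt, tgt_dim)
        _, h = self.encoder(src_seq)             # h: (1, batch, hidden_dim)
        dec_out, _ = self.decoder(tgt_seq, h)    # teacher forcing with the target modality
        recon = self.project(dec_out)            # predicted target-modality features
        joint_repr = h.squeeze(0)                # joint representation for a sentiment head
        return recon, joint_repr

model = ModalityTranslator(src_dim=300, tgt_dim=74)   # assumed text/acoustic feature sizes
text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
recon, joint = model(text, audio)
loss = nn.functional.mse_loss(recon, audio)           # unsupervised reconstruction objective
```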
Learning representations of multimodal data is a fundamentally complex research problem due to the presence of multiple sources of information. To address the complexities of multimodal data, we argue that a suitable representation learning model should: 1) factorize representations according to independent factors of variation in the data, capture important features for both 2) discriminative and 3) generative tasks, and 4) couple both modality-specific and multimodal information. To encapsulate all these properties, we propose the Multimodal Factorization Model (MFM), which factorizes multimodal representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The multimodal discriminative factors are shared across all modalities and contain the joint multimodal features required for discriminative tasks such as predicting sentiment. The modality-specific generative factors are unique to each modality and contain the information required for generating data. Our experimental results show that our model is able to learn meaningful multimodal representations and achieves state-of-the-art or competitive performance on five multimodal datasets. Our model also demonstrates flexible generative capabilities by conditioning on the independent factors. We further interpret the factorized representations to understand the interactions that influence multimodal learning.
Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual and acoustic modalities. Understanding multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of the multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds on the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations showing that each stage of fusion focuses on a different subset of the multimodal signals, learning increasingly discriminative multimodal representations.
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
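A hedged illustration of the gating idea, not the GME-LSTM(A) implementation itself: a learned scalar gate, conditioned on the word-level features of all modalities, decides how much of a possibly noisy acoustic-visual vector is allowed into the fused word embedding (the dimensions below are assumed):

```python
# Illustrative word-level gated fusion (assumed dimensions; not the paper's code).
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    def __init__(self, text_dim, av_dim, out_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim + av_dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(text_dim + av_dim, out_dim)

    def forward(self, text_t, av_t):
        # text_t: (batch, text_dim), av_t: (batch, av_dim) for one word/time step
        g = self.gate(torch.cat([text_t, av_t], dim=-1))   # scalar gate in (0, 1)
        gated_av = g * av_t                                # suppress the modality if noisy
        return self.fuse(torch.cat([text_t, gated_av], dim=-1))

fusion = GatedModalityFusion(text_dim=300, av_dim=120, out_dim=128)
word_vec = torch.randn(4, 300)
audio_visual = torch.randn(4, 120)
fused = fusion(word_vec, audio_visual)   # would then feed an LSTM with temporal attention
```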
Human face-to-face communication is a complex multimodal signal. We use words (language modality), gestures (vision modality) and changes in tone (acoustic modality) to convey our intentions. Humans easily process and understand face-to-face communication, however, comprehending this form of communication remains a significant challenge for Artificial Intelligence (AI). AI must understand each modality and the interactions between them that shape human communication. In this paper, we present a novel neural architecture for understanding human communication called the Multi-attention Recurrent Network (MARN). The main strength of our model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent component called the Long-short Term Hybrid Memory (LSTHM). We perform extensive comparisons on six publicly available datasets for multimodal sentiment analysis, speaker trait recognition and emotion recognition. MARN shows state-of-the-art performance on all the datasets.
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
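A minimal sketch of the tensor-fusion step, assuming PyTorch and arbitrary unimodal embedding sizes: each modality embedding is extended with a constant 1 and the three vectors are combined by an outer product, so that unimodal, bimodal and trimodal interaction terms all appear in the fused tensor:

```python
# Sketch of the tensor-fusion step (outer product of modality embeddings with an
# appended constant 1), under assumed feature sizes; not the authors' release.
import torch

def tensor_fusion(z_text, z_audio, z_video):
    # z_*: (batch, d_*) unimodal embeddings
    ones = z_text.new_ones(z_text.size(0), 1)
    zt = torch.cat([z_text, ones], dim=1)     # (batch, d_t + 1)
    za = torch.cat([z_audio, ones], dim=1)
    zv = torch.cat([z_video, ones], dim=1)
    # The 3-way outer product captures uni-, bi- and tri-modal interaction terms.
    fused = torch.einsum('bi,bj,bk->bijk', zt, za, zv)
    return fused.flatten(start_dim=1)         # (batch, (d_t+1)*(d_a+1)*(d_v+1))

fused = tensor_fusion(torch.randn(2, 32), torch.randn(2, 16), torch.randn(2, 16))
print(fused.shape)  # (2, 33 * 17 * 17)
```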
Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
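As a simplified, hypothetical sketch of the shifting idea (not the released RAVEN code), a shift vector can be computed from the word-aligned visual and acoustic features and added, with a learned scaling, to the word embedding:

```python
# Rough sketch of shifting a word representation by a vector computed from the
# accompanying nonverbal features (assumed sizes; a simplification of the idea).
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    def __init__(self, word_dim, visual_dim, acoustic_dim):
        super().__init__()
        self.to_shift = nn.Linear(visual_dim + acoustic_dim, word_dim)
        self.scale = nn.Linear(word_dim * 2, 1)

    def forward(self, word, visual, acoustic):
        # word: (batch, word_dim); visual/acoustic summarize the word-aligned segment
        shift = self.to_shift(torch.cat([visual, acoustic], dim=-1))
        alpha = torch.sigmoid(self.scale(torch.cat([word, shift], dim=-1)))
        return word + alpha * shift        # nonverbally shifted word representation

shifter = NonverbalShift(word_dim=300, visual_dim=47, acoustic_dim=74)
shifted = shifter(torch.randn(4, 300), torch.randn(4, 47), torch.randn(4, 74))
```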
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Multi-view sequential learning is a fundamental problem in machine learning dealing with multi-view sequences. In a multi-view sequence, there exist two forms of interactions between different views: view-specific interactions and cross-view interactions. In this paper, we present a new neural architecture for multi-view sequential learning called the Memory Fusion Network (MFN) that explicitly accounts for both interactions in a neural architecture and continuously models them through time. The first component of the MFN is called the System of LSTMs, where view-specific interactions are learned in isolation through assigning an LSTM function to each view. The cross-view interactions are then identified using a special attention mechanism called the Delta-memory Attention Network (DMAN) and summarized through time with a Multi-view Gated Memory. Through extensive experimentation, MFN is compared to various proposed approaches for multi-view sequential learning on multiple publicly available benchmark datasets. MFN outperforms all the existing multi-view approaches. Furthermore, MFN outperforms all current state-of-the-art models, setting new state-of-the-art results for these multi-view datasets.
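A much-simplified sketch of the cross-view step, under assumed dimensions and not the MFN reference implementation: attention is placed over the concatenated view-specific LSTM memories of two consecutive time steps, and the attended result drives a gated update of a shared multi-view memory:

```python
# Simplified sketch of a delta-memory attention step followed by a gated memory
# update (assumed sizes; illustrative only, not the MFN code).
import torch
import torch.nn as nn

class DeltaMemoryFusion(nn.Module):
    def __init__(self, cat_dim, mem_dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * cat_dim, 2 * cat_dim), nn.Softmax(dim=-1))
        self.update = nn.Linear(2 * cat_dim, mem_dim)
        self.retain = nn.Sequential(nn.Linear(2 * cat_dim, mem_dim), nn.Sigmoid())
        self.write = nn.Sequential(nn.Linear(2 * cat_dim, mem_dim), nn.Sigmoid())

    def forward(self, c_prev, c_now, memory):
        # c_prev, c_now: concatenated view-specific LSTM memories at t-1 and t
        delta = torch.cat([c_prev, c_now], dim=-1)
        attended = self.attn(delta) * delta                  # highlight cross-view changes
        candidate = torch.tanh(self.update(attended))
        return self.retain(attended) * memory + self.write(attended) * candidate

fuse = DeltaMemoryFusion(cat_dim=3 * 64, mem_dim=128)        # e.g. three 64-d view LSTMs
memory = torch.zeros(4, 128)
memory = fuse(torch.randn(4, 192), torch.randn(4, 192), memory)
```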
Understanding affect from video segments has brought researchers from the language, audio and video domains together. Most of the current multimodal research in this area deals with various techniques for fusing the modalities and mostly treats the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, the Relational Tensor Network, where we use the inter-modal interactions within a segment (intra-segment) and also consider the sequence of segments in a video to model inter-segment inter-modal interactions. We also generate rich representations of the text and audio modalities by leveraging richer audio and linguistic context and by fusing fine-grained knowledge-based polarity scores from the text. We present the results of our model on the CMU-MOSEI dataset and show that our model outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition.
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
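Of the three listed ingredients, the maximum correlation loss is the most self-contained; a hedged sketch (batch-wise Pearson correlation between two projected modalities, negated so that minimizing the loss increases cross-modal correlation, rather than the exact CorrRNN objective) could look like:

```python
# Sketch of a maximum-correlation loss term between two modality representations;
# a simplification for illustration, not the CorrRNN objective verbatim.
import torch

def correlation_loss(h_a, h_b, eps=1e-8):
    # h_a, h_b: (batch, dim) joint-space projections of two modalities
    a = h_a - h_a.mean(dim=0, keepdim=True)
    b = h_b - h_b.mean(dim=0, keepdim=True)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()      # combined with reconstruction/task losses via a weight

loss = correlation_loss(torch.randn(32, 64), torch.randn(32, 64))
```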
We propose a multimodal data fusion method that accounts for the high-order relationships between M modalities and the output layer of a neural network model by forming an (M+1)-dimensional tensor. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, removes information that is redundant with respect to the model output and leads to fewer model parameters with minimal loss of performance. This factorization method works as a regularizer, yielding a less complicated model and avoiding overfitting. In addition, the modality-based factorization approach helps to understand the amount of useful information in each modality. We have applied this method to three different multimodal datasets for sentiment analysis, personality trait recognition, and emotion recognition. The results show that the method yields a 1% to 4% improvement on several evaluation measures compared to the state of the art on all three datasets.
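A speculative sketch of the general idea rather than the paper's exact factorization: the (M+1)-way weight tensor that maps M = 3 modalities to the output is never materialized; instead each modality is projected to its own (assumed) rank and contracted against a small core tensor:

```python
# Sketch of fusing three modalities through a factorized 4-way weight tensor,
# with a separate (assumed) rank per modality; illustrative only.
import torch
import torch.nn as nn

class FactorizedFusion(nn.Module):
    def __init__(self, dims=(32, 16, 16), ranks=(8, 4, 4), out_dim=1):
        super().__init__()
        self.factors = nn.ModuleList([nn.Linear(d, r, bias=False) for d, r in zip(dims, ranks)])
        self.core = nn.Parameter(torch.randn(*ranks, out_dim) * 0.01)

    def forward(self, z_text, z_audio, z_video):
        pt, pa, pv = (f(z) for f, z in zip(self.factors, (z_text, z_audio, z_video)))
        # Contract the projected modalities against the small core tensor.
        return torch.einsum('bi,bj,bk,ijko->bo', pt, pa, pv, self.core)

fusion = FactorizedFusion()
y = fusion(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))  # (4, 1)
```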
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It is a significant technical challenge because humans display their emotions through complex idiosyncratic combinations of the language, visual and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both a direct person-independent perspective and a relative person-dependent perspective. The person-independent perspective follows the conventional approach to emotion recognition, which directly infers absolute emotion labels from observed multimodal features. The relative person-dependent perspective approaches emotion recognition in a relative manner by comparing partial video segments to determine whether emotion intensity has increased or decreased. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask involves a multimodal local ranking of relative emotion intensities between two short segments of a video. The second subtask uses a Bayesian ranking algorithm to infer global relative emotion ranks from the local rankings. The third subtask combines direct predictions from the observed multimodal behaviors with the relative emotion ranks from local-global ranking for the final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the inter-dependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.
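A minimal sketch of the contextual idea under assumed feature sizes (not the authors' code): a bidirectional LSTM runs over the sequence of utterance-level feature vectors of one video, so every utterance representation absorbs its surroundings before classification:

```python
# Contextual utterance modeling: one video is a sequence of utterance feature
# vectors; a BiLSTM provides surrounding context for each utterance (assumed sizes).
import torch
import torch.nn as nn

class ContextualUtteranceEncoder(nn.Module):
    def __init__(self, utt_dim=100, hidden_dim=64, num_classes=2):
        super().__init__()
        self.context = nn.LSTM(utt_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, utterances):
        # utterances: (batch_videos, num_utterances, utt_dim)
        contextual, _ = self.context(utterances)
        return self.classify(contextual)       # one sentiment prediction per utterance

model = ContextualUtteranceEncoder()
logits = model(torch.randn(2, 30, 100))        # (2, 30, 2)
```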
Machine translation has recently achieved impressive results thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet they still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using a single parallel sentence at training time.
Over the last decade, video blogs (vlogs) have become an extremely popular way for people to express their emotions. The ubiquity of these videos has increased the importance of multimodal fusion models, which combine video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In detecting sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze the sentiment of spoken sentences. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on large-scale real-world data. We select high-level features that have been successful in non-affect domains in order to test their generalizability to the sentiment detection domain. We train and test our model on the recently released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out test set.
By describing the same content in different ways, multiple modalities can provide more valuable information than a single one. It is therefore highly desirable to obtain effective joint representations by fusing the features of different modalities. However, previous methods mainly focus on fusing shallow features or high-level representations generated by unimodal deep networks, which capture only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, which we call Dense Multimodal Fusion (DMF). The joint representations in different shared layers can capture correlations at different levels, and the connections between shared layers also provide an efficient way to learn the dependencies among these hierarchical correlations. These two properties jointly account for the multiple learning paths in DMF, which lead to faster convergence, lower training loss, and better performance. We evaluate our model on three typical multimodal learning tasks, including audiovisual speech recognition, cross-modal retrieval, and multimodal classification. The noticeable performance gains in the experiments demonstrate that our model can learn more effective joint representations.
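As a hedged illustration of dense fusion with assumed layer sizes (not the DMF implementation), each level can feed both modality branches, together with the previous shared representation, into the next shared layer:

```python
# Rough sketch of densely stacked shared layers between two modality-specific
# networks (assumed sizes; illustrative only).
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    def __init__(self, a_dim, b_dim, shared_dim):
        super().__init__()
        self.branch_a = nn.Linear(a_dim, a_dim)
        self.branch_b = nn.Linear(b_dim, b_dim)
        self.shared = nn.Linear(a_dim + b_dim + shared_dim, shared_dim)

    def forward(self, a, b, s):
        a = torch.relu(self.branch_a(a))
        b = torch.relu(self.branch_b(b))
        s = torch.relu(self.shared(torch.cat([a, b, s], dim=-1)))
        return a, b, s

blocks = nn.ModuleList([DenseFusionBlock(64, 48, 32) for _ in range(3)])
a, b = torch.randn(4, 64), torch.randn(4, 48)
s = torch.zeros(4, 32)
for block in blocks:
    a, b, s = block(a, b, s)   # s: joint representation after the last shared layer
```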
In this work, we propose to model the interaction between visual and textual features for multimodal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a stochastic embedding that is used in the target-language decoder and also to predict image features. Importantly, even though our model formulation captures correlations between visual and textual features, we do not require images to be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including the multi-task learning approach of Elliott and Kadar (2017) and the conditional variational auto-encoder approach of Toyama et al. (2016). Finally, in an ablation study, we show that (i) predicting image features in addition to only conditioning on them and (ii) imposing a constraint on the minimum amount of information encoded in the latent variable slightly improve translation.
Natural language processing is Anglo-centric, while the need for models that work in languages other than English is greater than ever. Yet the task of transferring a model from one language to another can be expensive in terms of annotation costs, engineering time, and effort. In this paper, we present a general framework for easily and effectively transferring neural models from English to other languages. The framework, which relies on task representations as a form of weak supervision, is model and task agnostic, meaning that many existing neural architectures can be ported to other languages with minimal effort. The only requirements are unlabeled parallel data and a loss defined over task representations. We evaluate our framework by transferring an English sentiment classifier to three different languages. On a battery of tests, we find that our models outperform a number of strong baselines and rival state-of-the-art results, which rely on more complex approaches and significantly more resources and data. Moreover, we find that the framework proposed in this paper is able to capture semantically rich and meaningful representations across languages, despite the lack of direct supervision.
Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naïve concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naïve concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
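A rough sketch of the modality-level attention, with assumed feature sizes (e.g. 2048-d image, 1024-d motion, 128-d audio summaries) rather than the paper's configuration: the decoder state scores each modality's context vector, and the softmax-weighted sum becomes the fused context for the next word:

```python
# Per-word attention over modality-level context vectors (image, motion, audio);
# assumed sizes, not the paper's implementation.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, dims=(2048, 1024, 128), dec_dim=512, fused_dim=512):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.score = nn.ModuleList([nn.Linear(fused_dim + dec_dim, 1) for _ in dims])

    def forward(self, contexts, dec_state):
        # contexts: list of (batch, d_m) temporal-attention summaries, one per modality
        projected = [p(c) for p, c in zip(self.proj, contexts)]
        scores = torch.cat(
            [s(torch.cat([c, dec_state], dim=-1)) for s, c in zip(self.score, projected)],
            dim=-1)                                           # (batch, num_modalities)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (torch.stack(projected, dim=1) * weights).sum(dim=1)  # fused context

attn = ModalityAttention()
fused = attn([torch.randn(2, 2048), torch.randn(2, 1024), torch.randn(2, 128)],
             torch.randn(2, 512))
```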