We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining for this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training, over a number of scenarios, and establish a benchmark for this novel task; and finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.
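The abstract does not give the loss, but a minimal sketch of a cross-modal contrastive objective with curriculum-driven hard negative mining might look as follows; the function name, the hinge form, and the idea of annealing a `hardness` fraction over training are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def curriculum_contrastive_loss(face_emb, voice_emb, hardness=0.5, margin=0.6):
    """Contrastive loss over a batch of (face, voice) pairs from the same videos.

    face_emb, voice_emb: (B, D) L2-normalised embeddings; row i of each comes
    from the same identity. `hardness` in [0, 1] controls the fraction of the
    hardest negatives per anchor that is used, and would be raised over
    training as a simple curriculum.
    """
    sim = face_emb @ voice_emb.t()                   # (B, B) cosine similarities
    pos = sim.diag()                                 # matching face/voice pairs
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, float('-inf'))        # exclude the positives

    # keep only the top-k hardest (most similar) negatives per face anchor
    k = max(1, int(hardness * (B - 1)))
    hard_neg, _ = neg.topk(k, dim=1)

    # hinge: negatives should sit at least `margin` below the positive
    return F.relu(hard_neg - pos.unsqueeze(1) + margin).mean()

# usage: start training with a small hardness (easy negatives) and anneal it up
faces = F.normalize(torch.randn(8, 256), dim=1)
voices = F.normalize(torch.randn(8, 256), dim=1)
print(curriculum_contrastive_loss(faces, voices, hardness=0.3))
```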
In this paper, we study the associations between human faces and voices. Audiovisual integration, and in particular the integration of face and voice information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset that we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with better than chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation correlates with certain demographic attributes and with features obtained from either the visual or the auditory modality alone. We release our dataset of audiovisual recordings of people reading the short texts used in our study, together with demographic annotations.
We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).
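A hedged sketch of what the N-way matching setup could look like in code: a voice embedding is scored against N candidate face embeddings and trained with cross-entropy over the N scores. The linear encoders stand in for the paper's CNN streams; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VoiceFaceMatcher(nn.Module):
    """Toy stand-in for two-stream matching: one encoder per modality projects
    into a shared space, and candidates are scored by dot product."""
    def __init__(self, voice_dim=512, face_dim=512, joint_dim=128):
        super().__init__()
        self.voice_net = nn.Sequential(nn.Linear(voice_dim, joint_dim), nn.ReLU(),
                                       nn.Linear(joint_dim, joint_dim))
        self.face_net = nn.Sequential(nn.Linear(face_dim, joint_dim), nn.ReLU(),
                                      nn.Linear(joint_dim, joint_dim))

    def forward(self, voice_feat, face_feats):
        # voice_feat: (B, voice_dim); face_feats: (B, N, face_dim) candidates
        v = self.voice_net(voice_feat)                 # (B, joint_dim)
        f = self.face_net(face_feats)                  # (B, N, joint_dim)
        logits = torch.einsum('bd,bnd->bn', v, f)      # (B, N) match scores
        return logits                                  # train with cross-entropy

matcher = VoiceFaceMatcher()
scores = matcher(torch.randn(4, 512), torch.randn(4, 10, 512))  # 10-way matching
print(scores.argmax(dim=1))  # index of the predicted face for each voice
```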
Person identification in the wild is very challenging due to great variation in pose, face quality, clothing, makeup, and so on. Traditional research, such as face recognition, person re-identification, and speaker recognition, often focuses on a single modality of information, which is insufficient to handle all situations in practice. Multi-modal person identification is a more promising approach in which face, head, body, audio, and other features can be exploited jointly. In this paper, we introduce iQIYI-VID, the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, and TV series to news broadcasts. All video clips pass through a careful human annotation process, and the label error rate is lower than 0.2%. We evaluated state-of-the-art models for face recognition, person re-identification, and speaker recognition on the iQIYI-VID dataset. Experimental results show that these models are still far from adequate for the task of person identification in the wild. We further demonstrate that a simple fusion of multi-modal features can improve person identification considerably. We have released the dataset online to promote research on multi-modal person identification.
We propose a tri-modal architecture to predict Big Five personality trait scores from video clips with different channels for audio, text, and video data. For each channel, stacked Convolutional Neural Networks are employed. The channels are fused both on decision-level and by concatenating their respective fully connected layers. It is shown that a multimodal fusion approach outperforms each single modality channel, with an improvement of 9.4% over the best individual modality (video). Full backpropagation is also shown to be better than a linear combination of modalities, meaning complex interactions between modalities can be leveraged to build better models. Furthermore, we can see the prediction relevance of each modality for each trait. The described model can be used to increase the emotional intelligence of virtual agents.
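A minimal sketch of the concatenation-based fusion described above: per-channel encoders (stand-ins for the stacked CNNs) whose fully connected outputs are concatenated and regressed onto the five trait scores. Layer sizes and input dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, video_dim=512, hidden=64):
        super().__init__()
        # one small encoder per channel (stand-ins for the stacked CNNs)
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.video = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # fusion by concatenating the fully connected layers of each channel
        self.head = nn.Linear(3 * hidden, 5)     # Big Five trait scores

    def forward(self, a, t, v):
        fused = torch.cat([self.audio(a), self.text(t), self.video(v)], dim=1)
        return torch.sigmoid(self.head(fused))   # traits scored in [0, 1]

model = TriModalFusion()
traits = model(torch.randn(2, 128), torch.randn(2, 300), torch.randn(2, 512))
print(traits.shape)   # (2, 5)
```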
Multimodal learning has lacked principled ways of combining information from different modalities and learning meaningful low-dimensional representations. We study multimodal learning and sensor fusion from a latent-variable perspective. We first present a regularized recurrent attention filter for sensor fusion. The algorithm can dynamically combine information from different types of sensors in a sequential decision-making task. Each sensor is paired with a modular neural network to maximize the utility of its own information. A gating modular neural network dynamically generates a set of mixing weights for the outputs of the sensor networks by balancing the utility of all sensors' information. We design a co-learning mechanism to encourage adaptive and independent learning of each sensor simultaneously, and propose a regularization-based co-learning method. In the second part, we focus on recovering the diversity of latent representations. We propose a co-learning approach using a probabilistic graphical model that imposes a structural prior on the generative model: the multimodal variational RNN (MVRNN) model, and derive the variational lower bound of its objective function. In the third part, we extend a Siamese structure to sensor fusion for robust acoustic event detection. We conduct experiments to investigate the extracted latent representations; the work will be completed in the next few months. Our experiments show that the recurrent attention filter can dynamically combine different sensor inputs according to the information carried in the inputs. We believe the MVRNN can identify latent representations that are useful for many downstream tasks such as speech synthesis, activity recognition, and control and planning. Both algorithms are general frameworks that can be applied to other tasks where different types of sensors are jointly used for decision making.
We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Unlike existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, DIMNet learns a shared representation for the different modalities by mapping them individually to their common covariates. These shared representations can then be used to find the correspondences between the modalities. We show empirically that DIMNet achieves better performance than other existing methods, while being conceptually simpler and less data-intensive.
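A sketch of the disjoint-mapping idea as described: each modality is embedded by its own encoder, and a classifier shared across modalities predicts a common covariate (identity in this sketch), so voice-face pairs are never needed during training. The architecture and covariate choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DIMNetSketch(nn.Module):
    """Each modality maps separately into a common space; a shared classifier
    predicts covariates (here identity), so the modalities are never paired."""
    def __init__(self, voice_dim=512, face_dim=512, emb=128, n_identities=1000):
        super().__init__()
        self.voice_enc = nn.Linear(voice_dim, emb)
        self.face_enc = nn.Linear(face_dim, emb)
        self.covariate_head = nn.Linear(emb, n_identities)  # shared across modalities

    def forward(self, x, modality):
        z = self.voice_enc(x) if modality == 'voice' else self.face_enc(x)
        return z, self.covariate_head(z)

net = DIMNetSketch()
_, voice_logits = net(torch.randn(4, 512), 'voice')   # supervised by identity labels
_, face_logits = net(torch.randn(4, 512), 'face')     # supervised by identity labels
# at test time, matching uses similarity between the two learned embeddings
```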
Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong facial emotion recognition network that achieves state-of-the-art performance on a standard benchmark; (ii) we use this teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
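A minimal sketch of the cross-modal distillation step, assuming a standard softened-KL objective: a frozen face-emotion teacher provides soft targets for a speech student on temporally aligned face/voice pairs, so no audio labels are used. The temperature and the linear stand-ins for both networks are placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher (face) and student (speech)
    emotion distributions, as in standard knowledge distillation."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * T * T

n_emotions = 8
face_teacher = nn.Linear(512, n_emotions)    # stand-in for the frozen face CNN
speech_student = nn.Linear(40, n_emotions)   # stand-in for the speech CNN

face_feat, speech_feat = torch.randn(16, 512), torch.randn(16, 40)  # aligned pairs
with torch.no_grad():
    teacher_logits = face_teacher(face_feat)  # no labels, only soft targets
loss = distillation_loss(speech_student(speech_feat), teacher_logits)
loss.backward()   # gradients flow only into the student
```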
We present a method for gesture detection and localisation based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at three temporal scales. Key to our technique is a training strategy which exploits: i) careful initialization of individual modalities; and ii) gradual fusion involving random dropping of separate channels (dubbed ModDrop) for learning cross-modality correlations while preserving uniqueness of each modality-specific representation. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams. Fusing multiple modalities at several spatial and temporal scales leads to a significant increase in recognition rates, allowing the model to compensate for errors of the individual classifiers as well as noise in the separate channels. Furthermore, the proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels to produce meaningful predictions from any number of available modalities. In addition, we demonstrate the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
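A hedged sketch of the ModDrop idea in isolation: during training, each modality's features are zeroed out per sample with some probability before fusion, so the fused classifier learns cross-modality correlations while staying robust to missing channels. The drop probability and feature shapes are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class ModDrop(nn.Module):
    """Randomly drops whole modality channels during training (ModDrop-style)."""
    def __init__(self, p_drop=0.2):
        super().__init__()
        self.p_drop = p_drop

    def forward(self, modality_features):
        # modality_features: list of (B, D_i) tensors, one per modality
        if not self.training:
            return modality_features
        dropped = []
        for feat in modality_features:
            # one Bernoulli draw per sample: keep (1) or drop (0) the whole channel
            keep = (torch.rand(feat.size(0), 1, device=feat.device) > self.p_drop).float()
            dropped.append(feat * keep)
        return dropped

moddrop = ModDrop(p_drop=0.2)
moddrop.train()
feats = moddrop([torch.randn(4, 64), torch.randn(4, 128), torch.randn(4, 32)])
fused = torch.cat(feats, dim=1)   # downstream fusion sees occasional missing channels
```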
With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.
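A simplified sketch of the gating component: at each word, the acoustic and visual features pass through learned gates (soft sigmoid gates here, a simplification of the paper's on/off gating) before being fused with the language features and fed to the downstream LSTM with temporal attention. Feature dimensions follow common CMU-MOSI feature sizes but are assumptions.

```python
import torch
import torch.nn as nn

class GatedWordFusion(nn.Module):
    """Per-word gating of noisy acoustic/visual features before fusion with text
    (a soft-gate simplification of the Gated Multimodal Embedding idea)."""
    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35, out_dim=128):
        super().__init__()
        self.audio_gate = nn.Linear(text_dim + audio_dim, 1)
        self.visual_gate = nn.Linear(text_dim + visual_dim, 1)
        self.fuse = nn.Linear(text_dim + audio_dim + visual_dim, out_dim)

    def forward(self, text, audio, visual):
        # inputs: (B, T, D) sequences aligned at the word level
        g_a = torch.sigmoid(self.audio_gate(torch.cat([text, audio], dim=-1)))
        g_v = torch.sigmoid(self.visual_gate(torch.cat([text, visual], dim=-1)))
        fused = self.fuse(torch.cat([text, g_a * audio, g_v * visual], dim=-1))
        return fused   # feed into an LSTM with temporal attention downstream

fusion = GatedWordFusion()
out = fusion(torch.randn(2, 20, 300), torch.randn(2, 20, 74), torch.randn(2, 20, 35))
print(out.shape)   # (2, 20, 128)
```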
Currently, datasets that support audiovisual recognition of people in videos are scarce and limited. In this paper, we introduce an expansion of video data from the IARPA Janus program to support this research area. We refer to the expanded set, which adds labels for voice to the already-existing face labels, as the Janus Multimedia dataset. We first describe the speaker labeling process, which involved a combination of automatic and manual criteria. We then discuss two evaluation settings for this data. In the core condition, the voice and face of the labeled individual are present in every video. In the full condition, no such guarantee is made. The power of audiovisual fusion is then shown using these publicly-available videos and labels, showing significant improvement over only recognizing voice or face alone. In addition to this work, several other possible paths for future research with this dataset are discussed.
Sentiment analysis research has developed rapidly over the last decade and has attracted widespread attention from both academia and industry, most of it based on text. However, information in the real world usually comes in multiple modalities. In this paper, we consider the task of multimodal sentiment analysis using the audio and text modalities, and propose a fusion strategy that combines multi-feature fusion and multi-modality fusion to improve the accuracy of audio-text sentiment analysis. We call it the Deep Feature Fusion-Audio and Text Modality Fusion (DFF-ATMF) model, and the features it learns are complementary to each other and robust. Experiments on the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for YouTube video sentiment analysis show very competitive results for our proposed model. Surprisingly, our method also achieves state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy also generalizes well to multimodal emotion recognition.
Affective computing is an emerging interdisciplinary research field bringing together researchers and practitioners from various fields, ranging from artificial intelligence, natural language processing, to cognitive and social sciences. With the proliferation of videos posted online (e.g., on YouTube, Facebook, Twitter) for product reviews, movie reviews, political views, and more, affective computing research has increasingly evolved from conventional unimodal analysis to more complex forms of multimodal analysis. This is the primary motivation behind our first of its kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of state of the art in multimodal affect analysis frameworks, which this review aims to address. Multimodality is defined by the presence of more than one modality or channel, e.g., visual, audio, text, gestures, and eye gaze. In this paper, we focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. Following an overview of different techniques for unimodal affect analysis, we outline existing methods for fusing information from different modalities. As part of this review, we carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.
The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2, which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous work on a benchmark dataset.
This paper reviews recent results in audiovisual fusion and discusses main challenges in the area, with a focus on desynchronization of the two modalities and the issue of training and testing where one of the modalities might be absent from testing. In this paper, we review recent results on audiovisual (AV) fusion. We also discuss some of the challenges and report on approaches to address them. One important issue in AV fusion is how the modalities interact and influence each other. This review will address this question in the context of AV speech processing, and especially speech recognition, where one of the issues is that the modalities both interact but also sometimes appear to desynchronize from each other. An additional issue that sometimes arises is that one of the modalities may be missing at test time, although it is available at training time; for example, it may be possible to collect AV training data while only having access to audio at test time. We will review approaches to address this issue from the area of multiview learning, where the goal is to learn a model or representation for each of the modalities separately while taking advantage of the rich multimodal training data. In addition to multiview learning, we also discuss the recent application of deep learning (DL) toward AV fusion. We finally draw conclusions and offer our assessment of the future in the area of AV fusion.
Explainability and interpretability are two critical aspects of decision support systems. Within computer vision, they are critical for certain tasks related to human behavior analysis, such as health care applications. Despite their importance, it is only recently that researchers have started exploring these aspects. This paper provides an introduction to explainability and interpretability in the context of computer vision, with an emphasis on looking-at-people tasks. Specifically, we review and study these mechanisms in the context of first impressions analysis. To the best of our knowledge, this is the first effort in this direction. Additionally, we describe a challenge we organized on explainability in first impressions analysis from video. We analyze in detail the newly introduced dataset and the evaluation protocol, and we summarize the results of the challenge. Finally, derived from our study, we outline research opportunities that we foresee will be decisive in the near future for the development of the explainable computer vision field.
Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Although there is no consensus on its definition, human emotional states can usually be perceived through the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to exploit audio and visual information simultaneously in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which hardly captures the complex association between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the features of audio and video. Specifically, the features are selected through an embedded attention mechanism for each modality to obtain the emotion-related regions. The whole pipeline can be completed within a neural network. Validated on the AFEW database of the audio-video sub-challenge of EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.
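A sketch of factorized bilinear pooling (MFB-style) between an audio and a video feature vector: each is projected to k·o dimensions, combined elementwise, sum-pooled over the k factors, then power- and L2-normalised. The attention mechanism is omitted, and the dimensions and factor count are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedBilinearPooling(nn.Module):
    """Factorized bilinear pooling of an audio and a video feature vector."""
    def __init__(self, audio_dim=1582, video_dim=1024, factor=5, out_dim=256):
        super().__init__()
        self.k = factor
        self.out_dim = out_dim
        self.proj_a = nn.Linear(audio_dim, factor * out_dim)
        self.proj_v = nn.Linear(video_dim, factor * out_dim)

    def forward(self, a, v):
        joint = self.proj_a(a) * self.proj_v(v)                    # (B, k * out_dim)
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)    # sum-pool over factors
        # signed square-root and L2 normalisation, as is common for bilinear features
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, dim=1)

fbp = FactorizedBilinearPooling()
fused = fbp(torch.randn(4, 1582), torch.randn(4, 1024))
print(fused.shape)   # (4, 256), input to an emotion classifier
```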
When video is shot in a noisy environment, the voice of a speaker seen in the video can be enhanced using the visible mouth movements, reducing the background noise. While most existing methods use audio-only input, our visual speech enhancement, based on an audio-visual neural network, obtains improved performance. We include in the training data videos to which we added the voice of the target speaker as background noise. Since the audio input alone is not sufficient to separate the speaker's voice from his own voice, the trained model better exploits the visual input and generalizes well to different noise types. The proposed model outperforms prior audio-visual methods on two public lipreading datasets. It is also the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama.
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with the speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, requiring only that the user specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantages over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (requiring a separate model to be trained for every speaker of interest).
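A heavily simplified sketch of mask-based audio-visual separation in the spirit of the abstract: a face embedding of the chosen speaker conditions a recurrent network that predicts a multiplicative mask over the noisy spectrogram. This is not the authors' architecture; the BLSTM, magnitude masking, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AVMaskNet(nn.Module):
    """Toy audio-visual separator: a speaker's face embedding conditions a
    recurrent network that predicts a mask over the noisy spectrogram."""
    def __init__(self, n_freq=257, face_dim=256, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq + face_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_spec, face_emb):
        # noisy_spec: (B, T, n_freq) magnitudes; face_emb: (B, face_dim)
        face_seq = face_emb.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        h, _ = self.blstm(torch.cat([noisy_spec, face_seq], dim=-1))
        mask = torch.sigmoid(self.mask(h))        # (B, T, n_freq) in [0, 1]
        return mask * noisy_spec                  # estimated target speech

net = AVMaskNet()
clean_est = net(torch.randn(2, 100, 257).abs(), torch.randn(2, 256))
print(clean_est.shape)   # (2, 100, 257)
```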
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
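A sketch of outer-product (tensor) fusion of the three per-modality embeddings, assuming the usual formulation in which a constant 1 is appended to each embedding so that unimodal and bimodal interaction terms are retained alongside the trimodal ones; the embedding sizes and the regression head are illustrative.

```python
import torch
import torch.nn as nn

class TensorFusionSketch(nn.Module):
    """Outer-product fusion of per-modality embeddings (Tensor Fusion style):
    appending a constant 1 to each embedding keeps the unimodal and bimodal
    interaction terms alongside the trimodal ones."""
    def __init__(self, d_text=32, d_audio=16, d_visual=16):
        super().__init__()
        fused_dim = (d_text + 1) * (d_audio + 1) * (d_visual + 1)
        self.head = nn.Linear(fused_dim, 1)   # sentiment regression

    def forward(self, h_t, h_a, h_v):
        ones = torch.ones(h_t.size(0), 1, device=h_t.device)
        t = torch.cat([h_t, ones], dim=1)     # (B, d_text + 1)
        a = torch.cat([h_a, ones], dim=1)
        v = torch.cat([h_v, ones], dim=1)
        fused = torch.einsum('bi,bj,bk->bijk', t, a, v).flatten(1)
        return self.head(fused)

tfn = TensorFusionSketch()
score = tfn(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
print(score.shape)   # (4, 1)
```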