Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
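A minimal sketch (not the authors' code) of the query-conditioning idea described above: a CLIP embedding of the query modulates a mask-predicting separator. The `QueryConditionedSeparator` module, its FiLM-style channel gains, and all sizes are illustrative assumptions; in CLIPSep the query vector would come from CLIP's image encoder during training and from its text encoder at test time.

```python
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    """Toy query-conditioned separator: a placeholder conv trunk predicts a
    spectrogram mask, modulated by a CLIP query embedding."""

    def __init__(self, clip_dim=512, hidden=64):
        super().__init__()
        # Placeholder trunk; the real model would be a U-Net over spectrograms.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Project the CLIP query vector to per-channel gains (FiLM-style conditioning).
        self.query_proj = nn.Linear(clip_dim, hidden)
        self.mask_head = nn.Conv2d(hidden, 1, 1)

    def forward(self, mixture_spec, query_emb):
        # mixture_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        # query_emb:    (B, clip_dim) CLIP embedding of an image (train) or text (test)
        feats = self.trunk(mixture_spec)                      # (B, H, F, T)
        gains = self.query_proj(query_emb)[:, :, None, None]  # (B, H, 1, 1)
        mask = torch.sigmoid(self.mask_head(feats * gains))   # (B, 1, F, T)
        return mask * mixture_spec                            # estimated target spectrogram

# At training time query_emb comes from CLIP's image encoder on a video frame;
# at test time it comes from CLIP's text encoder, enabling zero-shot text queries.
mix = torch.randn(2, 1, 512, 64).abs()
query = torch.randn(2, 512)
print(QueryConditionedSeparator()(mix, query).shape)  # torch.Size([2, 1, 512, 64])
```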
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system that is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of prior work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in the training and evaluation data, and failure to account for the trade-off between preserving on-screen sounds and suppressing off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a fine resolution over time, and we also propose efficient separable variants that can scale to longer videos without sacrificing much performance. We also find that pre-training the model on audio only greatly improves the results. For training and evaluation, we collected new human annotations of on-screen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows exact tuning of on-screen reconstruction versus off-screen suppression, which greatly simplifies comparing the performance of models with different operating points. Overall, our experimental results demonstrate improved on-screen separation performance under more general conditions than previous methods, with minimal additional computational complexity.
Current audio-visual separation methods share a standard architecture design where an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instrument as Query (iQuery), with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that our iQuery improves audio-visual sound source separation performance.
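The "extra query as an audio prompt" idea can be sketched as follows, assuming a generic transformer-decoder separator; `QueryBankSeparator`, its sizes, and the freezing logic are illustrative, not the iQuery implementation.

```python
import torch
import torch.nn as nn

class QueryBankSeparator(nn.Module):
    """Toy query-based separator: one learnable query per instrument attends to
    fused audio-visual features and predicts a per-query mask."""

    def __init__(self, n_classes=4, d_model=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_classes, d_model))  # one query per instrument
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.mask_head = nn.Linear(d_model, 512)  # per-query spectrogram-mask logits (toy size)

    def forward(self, av_memory):
        # av_memory: (B, T, d_model) fused audio-visual features
        q = self.queries.unsqueeze(0).expand(av_memory.size(0), -1, -1)
        out = self.decoder(q, av_memory)   # cross-attention: queries attend to the features
        return self.mask_head(out)         # (B, n_queries, 512)

model = QueryBankSeparator()
print(model(torch.randn(2, 30, 256)).shape)  # torch.Size([2, 4, 512])

# To add one new instrument class: append a fresh query ("audio prompt") and freeze
# everything else, so only the query bank is updated.
new_query = torch.randn(1, 256)
model.queries = nn.Parameter(torch.cat([model.queries.data, new_query], dim=0))
for name, p in model.named_parameters():
    p.requires_grad = name == "queries"  # in practice one would restrict gradients to the new row
```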
Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline that can train a universal audio source separator from a large but weakly labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly labeled training data. Second, we design a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode the queries that specify the audio targets for separation, allowing zero-shot generalization. Our approach uses a single model for source separation of multiple sound types and relies solely on weakly labeled data for training. Moreover, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources never seen in training. To evaluate the separation performance, we test our model on a held-out benchmark while training on the disjoint AudioSet. We further verify the zero-shot performance with another experiment on audio source types held out from training. The model achieves source-to-distortion ratio (SDR) performance comparable to current supervised models in both cases.
The thud of a bouncing ball, the onset of speech as lips open: when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory.
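A toy sketch of the self-supervised alignment objective, under simplifying assumptions: the paper's fused multisensory network and its negatives (the same clip's audio shifted in time) are replaced here with linear stand-in encoders and other-clip negatives for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoders; the paper uses a fused 3D-conv multisensory network.
video_enc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64, 128), nn.ReLU())
audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 32, 128), nn.ReLU())
classifier = nn.Linear(256, 1)  # predicts "aligned" vs. "misaligned"

def alignment_loss(video, audio):
    # video: (B, 16, 64) frame features; audio: (B, 16, 32) audio features
    b = video.size(0)
    misaligned = torch.roll(audio, shifts=1, dims=0)  # pair each clip with another clip's audio
    pos = classifier(torch.cat([video_enc(video), audio_enc(audio)], dim=-1))
    neg = classifier(torch.cat([video_enc(video), audio_enc(misaligned)], dim=-1))
    logits = torch.cat([pos, neg], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(b), torch.zeros(b)])  # 1 = aligned, 0 = misaligned
    return F.binary_cross_entropy_with_logits(logits, labels)

loss = alignment_loss(torch.randn(8, 16, 64), torch.randn(8, 16, 32))
loss.backward()
```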
Recent successes suggest that an image can be manipulated by a text prompt, e.g., a scene on a sunny day can be manipulated into the same scene on a rainy day, driven by the text input "raining". These approaches often utilize a style-based image generator, which leverages a multi-modal (text and image) embedding space. However, we observe that such text inputs are often bottlenecked in providing and synthesizing rich semantic cues, e.g., differentiating heavy rain from ordinary rain. To address this, we advocate leveraging another modality, sound, which has notable advantages for image manipulation, as it can convey more diverse semantic cues (vivid emotions or dynamic expressions of the natural world) than text. In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and applies a direct latent optimization method to manipulate a given image based on audio input, e.g., the sound of rain. Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than state-of-the-art text- and sound-guided image manipulation methods, which is further confirmed by our human evaluation. Our downstream task evaluations also show that the learned image-text embedding space extended with sound effectively encodes sound inputs.
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, Audioset and ESC-50 when compared to previous self-supervised work. Our models are publicly available [1, 2, 3].
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
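A minimal sketch of the cycle-consistency objective behind this audio-visual random walk: a two-step walk from each image node to the sound nodes and back should return to its starting node with high probability. The embeddings below are random stand-ins for the learned image and separated-sound encoders, and the two-step formulation is a simplification of the paper's graph.

```python
import torch
import torch.nn.functional as F

def walk_loss(img_emb, snd_emb, temp=0.07):
    # img_emb: (N, D) image-node embeddings; snd_emb: (N, D) separated-sound embeddings
    img_emb = F.normalize(img_emb, dim=-1)
    snd_emb = F.normalize(snd_emb, dim=-1)
    sim = img_emb @ snd_emb.t() / temp
    p_i2s = sim.softmax(dim=-1)        # image -> sound transition probabilities
    p_s2i = sim.t().softmax(dim=-1)    # sound -> image transition probabilities
    p_round_trip = p_i2s @ p_s2i       # two-step walk: image -> sound -> image
    targets = torch.arange(img_emb.size(0))
    # Encourage a high probability of returning to the starting node.
    return F.nll_loss(torch.log(p_round_trip + 1e-8), targets)

loss = walk_loss(torch.randn(16, 128, requires_grad=True),
                 torch.randn(16, 128, requires_grad=True))
loss.backward()
```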
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound as if it were recorded in a target environment. Given an image of the target environment and a waveform of the source audio, the goal is to re-synthesize the audio to match the target room acoustics suggested by the visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech into a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
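The pre-training objective described above can be sketched in a few lines, following the symmetric cross-entropy pseudocode given in the CLIP paper; the image and text encoders are omitted here, and `img_emb`/`txt_emb` stand in for their outputs on a batch of N pairs.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, D) embeddings of N paired images and captions
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (N, N) pairwise similarities
    targets = torch.arange(img_emb.size(0))        # the i-th image matches the i-th caption
    loss_i = F.cross_entropy(logits, targets)      # image -> text classification
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image classification
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(32, 512, requires_grad=True),
                 torch.randn(32, 512, requires_grad=True))
print(float(loss))
```

At inference, zero-shot classification reuses the same similarity: class names are embedded as text prompts and an image is assigned to the most similar one.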
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with noise-invariant visual information, helping the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setting; progress has therefore been hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by about 50% (28.0% vs. 14.1% WER) in the presence of babble noise while using less than 10% of the labeled data (433 hours vs. 30 hours), and on average reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%).
Audio-visual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects, but it is quite challenging for machines to achieve class-aware sounding object localization without category annotations, i.e., to localize the sounding objects and recognize their categories. To address this problem, we propose a two-stage step-by-step learning framework that localizes and recognizes sounding objects in complex audio-visual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding areas via coarse-grained audio-visual correspondence in the single-source case. Then, visual features in the sounding areas are leveraged as candidate object representations to build a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audio-visual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audio-visual consistency as supervision to achieve fine-grained alignment between the audio and sounding object distributions. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as in filtering out silent ones. We also transfer the learned audio-visual network to the unsupervised object detection task, obtaining reasonable performance.
This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, a specialist model can adapt its enhancement function to a particular speaker's voice and is expected to solve a narrower problem. Hence, in addition to reduced computational complexity, specialists can achieve better performance. However, naive personalization methods may require clean speech from the target user, which is inconvenient to obtain, e.g., due to subpar recording conditions. To this end, we pose personalization either as a zero-shot task, in which no additional clean speech of the target speaker is used for training, or as a few-shot learning task, in which the goal is to minimize the duration of clean speech used for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both zero-shot and few-shot personalization tasks. The proposed methods are designed to learn personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) without knowing the corresponding clean sources. Our experiments investigate three different self-supervised learning mechanisms. The results show that the zero-shot models, using fewer model parameters and less clean data from the target user, achieve the data-efficiency and model-compression goals.
Music tagging and content-based retrieval systems have traditionally been constructed using pre-determined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language descriptions. MuLan takes the form of a two-tower joint audio-text embedding model trained on 44 million music recordings (370,000 hours) with weakly associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionality. We demonstrate the versatility of the MuLan embeddings with a range of experiments, including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. We introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
While audio-visual models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio/audio-visual/visual input.
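A rough sketch of the modality-dropout idea mentioned above: during pre-training the audio and/or video stream is randomly zeroed out, so a single shared model learns to operate on audio-visual, audio-only, and visual-only input. The probabilities and the fusion-by-concatenation below are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def modality_dropout(audio_feats, video_feats, p_drop_audio=0.25, p_drop_video=0.25):
    # audio_feats, video_feats: (B, T, D) frame-synchronous features
    if torch.rand(()) < p_drop_audio:
        audio_feats = torch.zeros_like(audio_feats)      # train on video-only input
    elif torch.rand(()) < p_drop_video:                  # never drop both streams at once
        video_feats = torch.zeros_like(video_feats)      # train on audio-only input
    return torch.cat([audio_feats, video_feats], dim=-1) # fused input to the transformer

fused = modality_dropout(torch.randn(4, 100, 256), torch.randn(4, 100, 256))
print(fused.shape)  # torch.Size([4, 100, 512])
```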
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representations that benefit both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained on a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 combined with self-training. Using our audio-visual representations on the same benchmark for audio-only speech recognition yields a 40% relative WER reduction over the state-of-the-art performance (1.3% vs. 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
Multi-modal learning from video data has recently gained attention, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities, and the implicit properties of the transformer allow it to handle inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
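A rough sketch of such a combinatorial contrastive loss, under the assumption that matching terms are summed over modality pairs and over fused pairs against the remaining modality; the embeddings and the sum-based fusion are toy stand-ins for the fusion transformer's single- and dual-modality outputs.

```python
import torch
import torch.nn.functional as F

def nce(a, b, temp=0.07):
    # Symmetric InfoNCE between two batches of embeddings of matched clips.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

v, a, t = (torch.randn(16, 256, requires_grad=True) for _ in range(3))
fuse = lambda x, y: x + y  # placeholder for the transformer's fused dual-modality embedding

loss = (nce(v, a) + nce(v, t) + nce(a, t)          # single-modality pairs
        + nce(fuse(v, a), t) + nce(fuse(v, t), a)  # fused pair vs. remaining modality
        + nce(fuse(a, t), v))
loss.backward()
```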
Machines that can represent and describe environmental audio have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms rely on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between a bi-modal image-text representation and a bi-modal image-audio representation; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In the difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC-50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio supervision and find that, for example, just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity, our empirical scaling experiments suggest that we would need about 2^21 (roughly 2M) supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
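A toy illustration of the image-as-pivot idea: because the text encoder (from an image-text model) and the audio encoder (from an image-audio model) are both aligned to the same image embedding space, their outputs can be compared directly for zero-shot audio classification, despite never seeing paired audio-text data. The embeddings below are random stand-ins for those encoders' outputs.

```python
import torch
import torch.nn.functional as F

class_text_emb = F.normalize(torch.randn(50, 512), dim=-1)  # e.g. embeddings of 50 label prompts
audio_emb = F.normalize(torch.randn(8, 512), dim=-1)        # embeddings of 8 audio clips

scores = audio_emb @ class_text_emb.t()  # cosine similarities in the shared (pivoted) space
pred = scores.argmax(dim=-1)             # zero-shot class prediction for each clip
print(pred.tolist())
```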