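Self-supervised learning (SSL) has achieved great success in speech recognition, while only limited exploration has been attempted for other speech processing tasks. Because the speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. To tackle this problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. In this way, WavLM not only keeps its speech content modeling capability through masked speech prediction, but also improves its potential for non-ASR tasks through speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.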
translated by 谷歌翻译
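Recently, pioneering work found that speech pre-trained models can solve full-stack speech processing tasks, because the model uses its bottom layers to learn speaker-related information and its top layers to encode content-related information. Since network capacity is limited, we believe speech recognition performance could be further improved if the model were dedicated to learning audio content information. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT, achieving relative word error rate reductions of 23.5%/11.6% for the base/large models in the setting without a language model. Detailed analysis shows that the bottom layers of our model correlate better with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.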
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and the fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained model with the SS network under a limited computation budget, including a low frame rate SSL model training setup and a fine-tuning scheme that uses only part of the pre-trained model. Compared with a supervised baseline and with the WavLM-based SS model using feature embeddings obtained from the previously released WavLM trained on 94K hours, our proposed model obtains relative word error rate (WER) reductions of 15.9% and 11.2%, respectively, on a simulated far-field speech mixture test set. For conversation transcription on real meeting recordings using continuous speech separation, the proposed model achieves relative WER reductions of 6.8% and 10.6% over the purely supervised baseline on the AMI and ICSI evaluation sets, respectively, while reducing the computational cost by 38%.
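The relative WER reductions quoted in these abstracts follow the usual definition: the absolute drop in WER divided by the baseline WER. The sketch below is a minimal illustration of that arithmetic, assuming WER is computed as the word-level edit distance divided by the reference length. The function names `word_error_rate` and `relative_reduction` are hypothetical helpers chosen for this example and do not come from any of the papers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline
```

With made-up numbers, a baseline WER of 10.0% and a proposed WER of 8.0% give `relative_reduction(10.0, 8.0) == 20.0`, i.e. a 20% relative reduction even though the absolute drop is only 2 points.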
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
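The idea of applying the prediction loss over the masked regions only can be sketched as a cross-entropy that skips every unmasked frame. This is a toy, standard-library illustration under stated assumptions, not the HuBERT implementation: the function name `masked_prediction_loss` is hypothetical, the logits are plain Python lists, and the targets stand in for cluster ids produced by an offline clustering step.

```python
import math


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits:  list of per-frame score lists, one score per cluster id
    targets: list of target cluster ids, one per frame
    mask:    list of bools, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(frame_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    if count == 0:
        raise ValueError("at least one frame must be masked")
    return total / count
```

Because unmasked frames are skipped, changing their logits leaves the loss unchanged; the model is only rewarded for inferring the hidden units of frames it could not observe directly.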
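Recently, masked prediction pre-training has made remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, which makes it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance as well as pre-training efficiency: either decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or clustering the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech that is used for fine-tuning. Experiments demonstrate significant superiority of our methods over various SSL and self-training baselines, with a relative WER reduction of up to 17.0%. Our pre-trained models also show good transferability to a non-ASR speech task.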
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Code and models are available at https://github.com/pytorch/fairseq
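A contrastive task of this kind can be illustrated for a single masked time step: the context vector should be more similar to the true quantized latent than to a set of distractors. The sketch below assumes cosine similarity scaled by a temperature, followed by a softmax cross-entropy with the true latent as the correct class. The names `cosine` and `contrastive_loss`, the plain-list vectors, and the default temperature of 0.1 are illustrative assumptions, and the sketch leaves out everything else in the training objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, positive, distractors, temperature=0.1):
    """-log softmax score of the positive among positive + distractors."""
    sims = [cosine(context, positive)]
    sims += [cosine(context, d) for d in distractors]
    scaled = [s / temperature for s in sims]
    # Numerically stable log-sum-exp over all candidates.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[0]  # index 0 is the positive candidate
```

The loss shrinks as the context vector aligns with the true latent and moves away from the distractors, which is the behaviour a contrastive objective is meant to encourage.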
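Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representations that benefit both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark, LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31k hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookResearch/av_hubert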
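We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained on large, diverse unlabeled datasets containing approximately one million hours of audio. We find that the combination of pre-training, self-training, and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve on SoTA with the full training set. We also report on the universal benefits gained from using large pre-trained and self-trained models for a broad set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude in dataset size, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representations of pre-trained networks to achieve SoTA results on non-ASR tasks.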
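Self-supervised speech models such as wav2vec 2.0 and HuBERT are making revolutionary progress in automatic speech recognition (ASR). However, self-supervised models have not been fully proven to produce better performance on tasks other than ASR. In this work, we explore partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: speech emotion recognition, speaker verification, and spoken language understanding. We also compare pre-trained models with and without ASR fine-tuning. With simple downstream frameworks, the best scores reach 79.58% weighted accuracy for speech emotion recognition on IEMOCAP, 2.36% equal error rate for speaker verification on VoxCeleb1, and 87.51% accuracy for intent classification and 75.32% F1 for slot filling on SLURP, thus setting a new state of the art on these three benchmarks and showing that fine-tuned wav2vec 2.0 and HuBERT models can better learn prosodic, voice-print, and semantic representations.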
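This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling on the encoder output, as in the HuBERT model, while the other lets the decoder learn to reconstruct the pseudo codes autoregressively instead of generating text transcripts. In this way, the decoder learns to reconstruct the original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training, and also significantly outperforms the state-of-the-art wav2vec 2.0 and HuBERT on the 10h and 100h fine-tuning subsets. We release our code and model at https://github.com/microsoft/speecht5/tree/main/main/speech2c.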
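In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representations can be used in speech applications that require discriminative representations of attributes that are consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but these models are usually large. There are also SSL techniques that learn utterance-level representations. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (the anchor). However, without labels this does not ensure that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapt DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compare DINO to an x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperforms the x-vector. We study the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpasses fine-tuning everything at once. Using shorter chunk lengths, although they generate more diverse inputs, does not necessarily improve performance, implying that speech segments of at least a specific length are required for better performance in each application. Augmentation is helpful for SER.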
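The reason a DINO-style objective needs no negative sampling can be seen from the shape of its loss: a student output is trained to match a teacher distribution computed from another view of the same utterance, with no other samples involved. The sketch below follows the loss as described in the original computer-vision DINO work, assuming a centered and sharpened teacher softmax and a cross-entropy against the student softmax. The function names, the plain-list inputs, and the default temperatures are illustrative assumptions, and the moving-average updates of the teacher and the center are omitted.

```python
import math


def softmax(xs, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def log_softmax(xs, temperature):
    """Temperature-scaled log-softmax, computed stably."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions.

    The teacher output is centered, then sharpened with a low temperature.
    Only two views of one utterance are compared, so no negatives are needed.
    """
    centered = [t - c for t, c in zip(teacher_out, center)]
    p_teacher = softmax(centered, teacher_temp)
    log_p_student = log_softmax(student_out, student_temp)
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))
```

In this formulation, centering and sharpening the teacher output are what guard against collapse to a trivial solution, a role that negative samples play in contrastive methods.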
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations that replace traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker recognition, and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
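This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes, and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice, and VoxPopuli, lowering error rates by 14-34% relative. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pre-training can outperform English-only pre-training when translating English speech into other languages, a setting which favors monolingual pre-training. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.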
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
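This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conduct extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes well to speaker-related downstream tasks, improving label efficiency by roughly ten-fold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in clean conditions and by 75% in noisy conditions.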
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
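Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the VoxCeleb-1 dataset suggest that the benefit of SSL for the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.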
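Self-supervised learning (SSL) of high-level speech representations has been a popular approach to building automatic speech recognition (ASR) systems in low-resource settings. However, the common assumption in the literature is that a considerable amount of unlabeled data from the same domain or language is available for SSL pre-training, which we acknowledge is not feasible in real-world settings. In this paper, as part of the Interspeech Gram Vaani ASR challenge, we study the effect of domain, language, dataset size, and other aspects of the upstream SSL pre-training data on the final performance of the low-resource downstream ASR task. We also build on the continued pre-training paradigm to study the effect of the prior knowledge possessed by models trained with SSL. Extensive experiments and studies reveal that the performance of ASR systems is sensitive to the data used for SSL pre-training. Their performance improves as the similarity and volume of the pre-training data increase. We believe our work will help the speech community build better ASR systems in low-resource settings and steer research towards improving the generalization of SSL-based pre-training for speech systems.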
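Self-supervised learning (SSL) in speech involves training a speech representation network on a large-scale unannotated speech corpus and then applying the learned representations to downstream tasks. Since the majority of downstream tasks for SSL in speech focus largely on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information can easily result in a loss of content as well, and the damage from the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks and observe a consistent and notable performance advantage for our speaker-disentangled representations.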
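Recent methods in speech and language technology pre-train very large models which are then fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource-rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low-resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains, including education, news, technology, and finance. Second, using this raw speech data, we pre-train several variants of wav2vec-style models for 40 Indian languages. Third, we analyze the pre-trained models to find key features: codebook vectors of similar-sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often attend within small local windows. Fourth, we fine-tune downstream ASR models for 9 languages and obtain state-of-the-art results on 3 public datasets, including very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pre-training is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.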