Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks is expensive, since the foundation model is usually very large. Parameter-efficient fine-tuning methods (e.g., adapters and sparse update methods) offer an alternative paradigm in which a small set of parameters is updated to adapt the foundation model to new tasks. However, these methods still suffer from high computational memory cost and slow training speed, because they require backpropagation through the entire neural network at each step. In this paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on speech recognition tasks than existing algorithms, with fewer trainable parameters, lower computational memory cost, and faster training speed. When combined with adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.
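The abstract does not spell out the fusion mechanism, so the following is only a minimal sketch of one plausible form of hierarchical feature fusion: a learnable, softmax-normalized weighted sum over frozen encoder layer outputs. All names, dimensions, and the fusion form itself are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn

class WeightedLayerFusion(nn.Module):
    """Hedged sketch: fuse frozen encoder layer outputs with learnable
    softmax-normalized scalar weights; only these weights (plus a small
    downstream head) would be trained, so no gradients need to flow
    into the frozen foundation model."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs: list[torch.Tensor]) -> torch.Tensor:
        # layer_outputs: list of (batch, time, dim) tensors, one per encoder layer,
        # computed under torch.no_grad() from the frozen foundation model.
        stacked = torch.stack(layer_outputs, dim=0)            # (L, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)     # (L,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Hypothetical usage: 12 frozen layers, fused features feed a small ASR head.
fusion = WeightedLayerFusion(num_layers=12)
dummy_layers = [torch.randn(2, 50, 768) for _ in range(12)]
fused = fusion(dummy_layers)   # (2, 50, 768)
```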
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language. Since low-resource languages have limited training data, speech recognition models can easily overfit. In this paper, we propose to use adapters to investigate the performance of multiple adapters for parameter-efficient cross-lingual speech adaptation. Based on our previous MetaAdapter, which implicitly leverages adapters, we propose a novel algorithm called SimAdapter for explicitly learning knowledge from adapters. Our algorithms are built on adapters, which can be easily integrated into the Transformer structure. MetaAdapter leverages meta-learning to transfer general knowledge from the training data to the test language. SimAdapter aims to learn the similarities between the source and target languages during fine-tuning using the adapters. We conduct extensive experiments on five low-resource languages from the Common Voice dataset. The results show that, compared with a strong full-model fine-tuning baseline, our MetaAdapter and SimAdapter methods can reduce WER by 2.98% and 2.55%, respectively, with only 2.5% and 15.5% of the trainable parameters. Moreover, we also show that these two novel algorithms can be integrated for better performance, with a relative WER reduction of up to 3.55%.
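As a rough illustration of the SimAdapter idea of learning similarities between source and target languages, the sketch below fuses the outputs of several frozen source-language adapters with an attention-style weighting. The exact formulation in the paper may differ; all names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class SimAdapterFusion(nn.Module):
    """Hedged sketch of attention-style fusion over frozen source-language
    adapter outputs; not the paper's exact formulation."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, adapter_outputs: list[torch.Tensor]) -> torch.Tensor:
        # hidden: (B, T, D) target-language representation from the backbone.
        # adapter_outputs: outputs of the frozen source-language adapters, each (B, T, D).
        stacked = torch.stack(adapter_outputs, dim=2)              # (B, T, N, D)
        q = self.query(hidden).unsqueeze(2)                        # (B, T, 1, D)
        k = self.key(stacked)                                      # (B, T, N, D)
        scores = (q * k).sum(-1) / hidden.size(-1) ** 0.5          # (B, T, N)
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)         # (B, T, N, 1)
        return hidden + (attn * stacked).sum(dim=2)                # residual fusion
```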
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels generated by a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and performs a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, this design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves similar word error rates to previous work using non-streaming models, and provides lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks, the approach also provides significant improvements over wav2vec 2.0 and w2v-BERT.
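The quantizer itself is simple enough to sketch: project the speech features with a frozen random matrix and take the nearest entry of a frozen random codebook as the frame label. Dimensions and normalization below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook; nothing here is trained."""

    def __init__(self, input_dim: int, codebook_size: int, code_dim: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.projection = torch.randn(input_dim, code_dim, generator=g)
        self.codebook = F.normalize(torch.randn(codebook_size, code_dim, generator=g), dim=-1)

    @torch.no_grad()
    def __call__(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, input_dim) speech features, e.g. log-mel frames (an assumption).
        projected = F.normalize(features @ self.projection, dim=-1)
        # With unit-norm vectors, the nearest codebook entry is the one with highest cosine similarity.
        return (projected @ self.codebook.T).argmax(dim=-1)       # (batch, time) discrete targets

# The SSL model is then trained to predict these labels for masked frames.
quantizer = RandomProjectionQuantizer(input_dim=80, codebook_size=8192, code_dim=16)
labels = quantizer(torch.randn(2, 100, 80))
```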
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced adapters, small lightweight modules commonly used in Natural Language Processing (NLP), to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layers and fail to adapt the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech models that applies CNN adapters at the feature extractor. Using this method, we fine-tune fewer than 5% of parameters per task compared to full fine-tuning, and achieve better and more stable performance. We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks. For instance, the accuracy of SID is improved from 87.71 to 91.56, and the accuracy of ER is improved by 5%.
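A minimal sketch of what a CNN adapter at the feature extractor could look like, assuming a lightweight residual 1-D convolutional bottleneck inserted after a frozen feature-extractor layer; channel sizes and kernel width are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNAdapter(nn.Module):
    """Hedged sketch of a lightweight CNN adapter for a frozen conv feature extractor."""

    def __init__(self, channels: int, bottleneck: int = 32, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Conv1d(channels, bottleneck, kernel_size, padding=kernel_size // 2)
        self.act = nn.GELU()
        self.up = nn.Conv1d(bottleneck, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) output of one frozen feature-extractor conv layer.
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen path intact

adapter = CNNAdapter(channels=512)
out = adapter(torch.randn(2, 512, 100))   # same shape as the input
```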
Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage. This has led to a new research direction on parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) as the pre-trained model. This creates a limit because, in some specific modalities (e.g., video understanding), strong pre-trained models with sufficient knowledge are less available or unavailable. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With built-in spatio-temporal reasoning capability in a compact design, the ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, while enjoying the advantage of parameter efficiency.
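As a hedged illustration, a spatio-temporal adapter of this kind can be sketched as a bottleneck with a depthwise 3D convolution that lets a frozen image backbone mix information across frames; the exact layout used by ST-Adapter (projection sizes, kernel shape) may differ.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Hedged sketch of a spatio-temporal adapter: bottleneck + depthwise 3D conv + residual."""

    def __init__(self, dim: int, bottleneck: int = 128, kernel=(3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel,
                              padding=tuple(k // 2 for k in kernel), groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height, width, dim) patch tokens from a frozen image model.
        z = self.down(x)                                   # (B, T, H, W, bottleneck)
        z = z.permute(0, 4, 1, 2, 3)                       # (B, C, T, H, W) for Conv3d
        z = self.conv(z).permute(0, 2, 3, 4, 1)            # back to (B, T, H, W, C)
        return x + self.up(z)                              # residual adapter output
```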
Self-supervised learning (SSL) of high-level speech representations has been a popular approach to building automatic speech recognition (ASR) systems in low-resource settings. However, a common assumption made in the literature is that a large amount of unlabeled data, from the same domain or language as that used for SSL pre-training, is available, which we acknowledge is not feasible in the real world. In this paper, as part of the Interspeech Gram Vaani ASR challenge, we study the effect of the domain, language, and dataset size of the upstream SSL pre-training data on the final performance of the downstream ASR task. We also build on the continued pre-training paradigm to study the effect of the prior knowledge possessed by models trained with SSL. Extensive experiments and studies reveal that the performance of ASR systems is susceptible to the data used for SSL pre-training; their performance improves with an increase in the similarity and volume of the pre-training data. We believe our work will help the speech community build better ASR systems in low-resource settings and steer research toward improving the generalization of SSL-based pre-training for speech systems.
We summarize the results of a large-scale effort to use giant automatic speech recognition (ASR) models pre-trained on a large, diverse unlabeled dataset containing approximately one million hours of audio. We find that the combination of pre-training, self-training, and scaling up model size greatly increases data efficiency, even for very large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8-billion-parameter pre-trained Conformer model we can match state-of-the-art (SOTA) performance with only 3% of the training data, and significantly improve on SOTA with the full training set. We also report the universal benefits obtained from using large pre-trained and self-trained models for a set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude in dataset size, including obtaining SOTA performance on many public benchmarks. In addition, we leverage the learned representations of the pre-trained networks to achieve SOTA results on non-ASR tasks.
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology is limited by the need for additional end-to-end training data. In a second step, we propose an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase the data efficiency and improve the translation quality by 6 BLEU points in scenarios with limited end-to-end training data.
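A minimal sketch of such a similarity loss, assuming mean pooling over time and an MSE objective; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def similarity_loss(speech_hidden: torch.Tensor, text_hidden: torch.Tensor) -> torch.Tensor:
    """Hedged sketch: pull the speech encoder's pooled hidden representation toward
    the pooled representation of the corresponding transcript."""
    # speech_hidden: (B, T_speech, D), text_hidden: (B, T_text, D)
    return F.mse_loss(speech_hidden.mean(dim=1), text_hidden.mean(dim=1))
```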
Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of them optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pre-trained models. We hypothesize that this results in a richer feature representation, and show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, wav2vec 2.0, and WavLM. We explore model ensembles and feature ensembles for the ASR task, using embeddings obtained from the pre-trained models for the downstream task. We obtain improved performance over single models and pre-trained features on the downstream task, using the LibriSpeech (100h) and WSJ datasets.
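A feature ensemble of this kind can be as simple as concatenating frame-aligned embeddings from the frozen models before the downstream ASR head; the sketch below assumes concatenation, which may not be the exact combination used in the paper.

```python
import torch

def ensemble_features(hubert_feats: torch.Tensor,
                      wav2vec2_feats: torch.Tensor,
                      wavlm_feats: torch.Tensor) -> torch.Tensor:
    """Hedged sketch: concatenate frame-aligned (batch, time, dim) embeddings
    from several frozen SSL models along the feature dimension."""
    return torch.cat([hubert_feats, wav2vec2_feats, wavlm_feats], dim=-1)
```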
Fine-tuning large pre-trained models on downstream tasks has recently been adopted in a variety of domains. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g., using only 2% of the parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that further reduces training memory requirements. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from the backbone network and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but only through the side network and ladder connections. We evaluate our method with various models (T5, CLIP-T5) on NLP (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, while other methods only save 26% of it under similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The same trend also holds in our experiments on VL tasks.
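A hedged sketch of the ladder-side idea: a small trainable side network consumes intermediate activations of the frozen backbone through "ladder" projections, so backpropagation never enters the backbone. The gating scheme and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LadderSideNetwork(nn.Module):
    """Hedged sketch of a side network fed by frozen backbone activations."""

    def __init__(self, backbone_dim: int, side_dim: int, num_layers: int):
        super().__init__()
        self.ladders = nn.ModuleList(nn.Linear(backbone_dim, side_dim) for _ in range(num_layers))
        self.blocks = nn.ModuleList(nn.Linear(side_dim, side_dim) for _ in range(num_layers))
        self.gates = nn.Parameter(torch.zeros(num_layers))  # mix side state with each ladder input

    def forward(self, backbone_states: list[torch.Tensor]) -> torch.Tensor:
        # backbone_states: per-layer activations, each (B, T, backbone_dim),
        # computed once with torch.no_grad() from the frozen backbone.
        h = None
        for state, ladder, block, gate in zip(backbone_states, self.ladders, self.blocks, self.gates):
            shortcut = ladder(state)               # project frozen activation into side width
            mix = torch.sigmoid(gate)
            h = shortcut if h is None else mix * h + (1.0 - mix) * shortcut
            h = torch.relu(block(h))
        return h                                   # fed to a small task head
```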
Recently, pioneering work found that speech pre-trained models can solve full-stack speech processing tasks, because such models use the bottom layers to learn speaker-related information and the top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model were dedicated to learning audio content information. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly, achieving relative word error rate reductions of 23.5%/11.6% in the w/o language model setting for the base/large models, respectively. A detailed analysis shows that the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.
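The core idea can be sketched as adding the same masked-prediction SSL loss at selected intermediate layers in addition to the top layer; the layer indices and equal weighting below are assumptions for illustration.

```python
# Hedged sketch of the ILS-SSL objective: extra SSL supervision below the top layer.
def ils_ssl_loss(layer_states, targets, mask, ssl_loss_fn, intermediate_layers=(4, 8)):
    """layer_states: list of per-layer hidden states; targets and mask define the
    masked-prediction objective; ssl_loss_fn computes the SSL loss for one layer."""
    total = ssl_loss_fn(layer_states[-1], targets, mask)          # standard top-layer loss
    for layer_index in intermediate_layers:                       # extra supervision below the top
        total = total + ssl_loss_fn(layer_states[layer_index], targets, mask)
    return total
```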
Multilingual end-to-end models have shown great improvement over monolingual systems. With the development of pre-training methods for speech, self-supervised multilingual speech representation learning such as XLSR has shown success in improving the performance of multilingual automatic speech recognition (ASR). However, similar to supervised learning, multilingual pre-training may also suffer from language interference, which can further affect the application of multilingual systems. In this paper, we introduce several techniques for improving self-supervised multilingual pre-training by leveraging auxiliary language information, including language adversarial training, language embeddings, and language adaptive training during the pre-training stage. We conduct experiments on a multilingual ASR task consisting of 16 languages. Our experimental results demonstrate a 14.3% relative gain over the standard XLSR model and a 19.8% relative gain over the multilingual model without pre-training.
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As a speech signal contains multi-faceted information, including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle this problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. In this way, WavLM not only keeps the speech content modeling capability via masked speech prediction, but also improves its potential for non-ASR tasks via speech denoising. In addition, WavLM employs a gated relative position bias in the Transformer structure to better capture the sequence ordering of the input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe that self-supervised pre-trained models are more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
Adapting large-scale pre-trained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful, as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches still either require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pre-trained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By training only 0.047% of a pre-trained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning in low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.
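A hedged sketch of a PHM-style layer in the spirit of Compacter, where the weight is a sum of Kronecker products between small shared "slow" matrices and per-layer "fast" rank-one factors; shapes and initialization are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    """Hedged sketch: weight built as sum_i kron(A_i, s_i @ t_i^T)."""

    def __init__(self, in_features: int, out_features: int, n: int = 4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.n = n
        # Shared "slow" weights (in Compacter these are shared across adapter layers).
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)
        # Per-layer "fast" rank-one factors building the (in/n, out/n) blocks.
        self.s = nn.Parameter(torch.randn(n, in_features // n, 1) * 0.01)
        self.t = nn.Parameter(torch.randn(n, 1, out_features // n) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = self.s @ self.t                                   # (n, in/n, out/n) rank-one blocks
        # Full weight of shape (in_features, out_features) as a sum of Kronecker products.
        W = sum(torch.kron(self.A[i], B[i]) for i in range(self.n))
        return x @ W
```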
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-code pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Librilight benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
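A minimal sketch of the two pieces described above: a context module that pools the last-layer states into a context embedding, and an auxiliary loss that pulls it toward the embeddings of neighboring segments. The pooling choice and the cosine objective are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModule(nn.Module):
    """Hedged sketch: pool top-layer states of the current segment into a context embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, D) from the last layer of the pre-trained model.
        return self.proj(hidden_states.mean(dim=1))           # (B, D) context embedding

def context_auxiliary_loss(current_ctx: torch.Tensor, neighbor_ctx: torch.Tensor) -> torch.Tensor:
    # Encourage the segment's context embedding to match its neighbors',
    # so the neighbors are not needed at inference time.
    return 1.0 - F.cosine_similarity(current_ctx, neighbor_ctx, dim=-1).mean()
```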
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapters' effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
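The adapter module itself can be sketched as a residual bottleneck (down-projection, nonlinearity, up-projection) initialized near the identity; the bottleneck width and activation below are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter in the spirit of Houlsby et al., inserted after a Transformer sub-layer."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)        # near-identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only adapter (and task head / layer-norm) parameters are trained; the backbone stays frozen.
        return x + self.up(torch.relu(self.down(x)))
```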
Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training is task-agnostic, i.e., it can be applied to various downstream tasks. Although this enlarges the scope of its application, the capacity of the pre-trained model is not fully utilized for the ASR task, and the learned representations may not be optimal for ASR. In this work, in order to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, where we use task-specific semi-supervised pre-training to refine the self-supervised pre-training, so that the pre-trained model more effectively exploits its capacity to generate task-specific representations for ASR. Experiments show that, compared with wav2vec 2.0, wav2vec-S only requires a marginal increase in pre-training time but can significantly improve ASR performance on in-domain, cross-domain, and cross-lingual datasets; the average relative WER reductions are 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, we show through canonical correlation analysis that semi-supervised pre-training can bridge the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model.
We present CLSRIL-23, a self-supervised learning based audio pre-trained model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is solved by training a contrastive task over masked latent speech representations and jointly learning a quantization of the latents shared across all languages. We compare the language-wise loss during pre-training to compare the effects of monolingual and multilingual pre-training. Performance on some downstream fine-tuning tasks is also compared, and our experiments show that multilingual pre-training outperforms monolingual training, both in terms of learning speech representations that encode phonetic similarity of languages and in terms of downstream performance. When the multilingual pre-trained model is used for fine-tuning in Hindi, a decrease of 5% is observed in WER, along with a decrease of 9.5%. All the code and models are also open-sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data, to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art results will be created using self-supervised approaches, especially for low-resource Indic languages.