Self-training has been shown to help address data scarcity in many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unlabeled data and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed setup: joint transcription and translation of speech, which suffers from a lack of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can differ significantly in domain from the supervised data, which degrades pseudo-label quality. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that this pseudo-label analysis and processing yields additional gains on top of the vanilla pseudo-labeling setup, for total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
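As a rough illustration of the pseudo-label filtering idea in the abstract above, the sketch below keeps only hypotheses whose length-normalized log-probability clears a threshold. The abstract does not state which filtering criterion the authors use; the `PseudoLabel` fields, the `filter_pseudo_labels` name, and the threshold value are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoLabel:
    utterance_id: str
    text: str          # hypothesis produced by the labeling model
    log_prob: float    # total log-probability of the hypothesis (hypothetical field)
    num_tokens: int    # number of tokens in the hypothesis


def filter_pseudo_labels(
    labels: List[PseudoLabel], min_avg_log_prob: float = -0.5
) -> List[PseudoLabel]:
    """Keep pseudo-labels whose average per-token log-probability is high enough.

    The threshold is an illustrative default, not a value from the paper.
    """
    kept = []
    for label in labels:
        if label.num_tokens == 0:
            # Empty hypotheses carry no training signal; drop them.
            continue
        if label.log_prob / label.num_tokens >= min_avg_log_prob:
            kept.append(label)
    return kept
```

In practice the threshold would be tuned on held-out supervised data, since a fixed value trades pseudo-label quantity against quality.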
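Recent methods in speech and language technology pretrain very large models, which are then fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource-rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low-resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains, including education, news, technology, and finance. Second, using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages. Third, we analyze the pretrained models to find key features: codebook vectors of similar-sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often attend within small local windows. Fourth, we fine-tune downstream ASR models for 9 languages and obtain state-of-the-art results on 3 public datasets, including very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.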
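This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task, which translates from English audio to German, Chinese, and Japanese. The YiTrans system is built on large-scale pre-trained encoder-decoder models. More specifically, we first design a multi-stage pre-training strategy to build a multi-modality model with a large amount of labeled and unlabeled data. We then fine-tune the corresponding components of the model for the downstream speech translation tasks. Moreover, we make various efforts to improve performance, such as data filtering, data augmentation, speech segmentation, and model ensembling. Experimental results show that our YiTrans system obtains a significant improvement over the strong baseline in three translation directions, and it achieves a +5.2 BLEU improvement over last year's best end-to-end system on tst2021 English-German. According to the automatic evaluation metrics, our final submissions rank first among English-German and English-Chinese end-to-end systems. We make our code and models publicly available.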
Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. The ST task has long been addressed using a pipeline approach with two modules: first an Automatic Speech Recognition (ASR) system in the source language, followed by a text-to-text Machine Translation (MT) system. In the past few years, we have seen a paradigm shift towards end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic-to-English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data were used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios, including transfer learning and data augmentation techniques.
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
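This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes, and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice, and VoxPopuli, lowering error rates by 14-34% relative. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting that favors monolingual pretraining. We hope XLS-R can help improve speech processing tasks for many more languages of the world.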
Data scarcity is one of the main issues with the end-to-end approach for Speech Translation, as compared to the cascaded one. Although most data resources for Speech Translation are originally document-level, they offer a sentence-level view, which can be directly used during training. But this sentence-level view is single and static, potentially limiting the utility of the data. Our proposed data augmentation method SegAugment challenges this idea and aims to increase data availability by providing multiple alternative sentence-level views of a dataset. Our method heavily relies on an Audio Segmentation system to re-segment the speech of each document, after which we obtain the target text with alignment methods. The Audio Segmentation system can be parameterized with different length constraints, thus giving us access to multiple and diverse sentence-level views for each document. Experiments in MuST-C show consistent gains across 8 language pairs, with an average increase of 2.2 BLEU points, and up to 4.7 BLEU for lower-resource scenarios in mTEDx. Additionally, we find that SegAugment is also applicable to purely sentence-level data, as in CoVoST, and that it enables Speech Translation models to completely close the gap between the gold and automatic segmentation at inference time.
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
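We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without the need for any text data. Unlike existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the system with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce variations due to accents while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain an average gain of 3.2 BLEU when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.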
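The cross-domain performance of automatic speech recognition (ASR) can be severely hampered by the mismatch between training and testing distributions. Since the target domain usually lacks labeled data, and domain shifts exist at both the acoustic and linguistic levels, performing unsupervised domain adaptation (UDA) for ASR is challenging. Previous work has shown that self-supervised learning (SSL) or pseudo-labeling (PL) is effective for UDA by exploiting the self-supervision of unlabeled data. However, this self-supervision also suffers performance degradation under mismatched domain distributions, which previous work fails to address. This work presents a systematic UDA framework that fully utilizes unlabeled data with self-supervision in the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay techniques to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on the PL technique with three unique modifications: first, we design a dual-branch PL method to decrease the sensitivity to erroneous pseudo-labels; second, we devise an uncertainty-aware confidence filtering strategy to improve pseudo-label correctness; third, we introduce a two-step PL approach to incorporate target-domain linguistic knowledge, thus generating more accurate target-domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts cross-domain performance and significantly outperforms previous approaches.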
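Developing speech technologies is a challenge for low-resource languages, for which both annotated and raw speech data are sparse. Maltese is one such language. Recent years have seen increased interest in the computational processing of Maltese, including speech technologies, but resources for the latter remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for such languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training, and the use of synthesized speech as training data. The goal is to determine which of these techniques, or which combination of them, is most effective for improving speech recognition for languages where the starting point is approximately 7 hours of transcribed speech. Our results show that combining the three data augmentation techniques studied here leads to a 15% absolute improvement without the use of a language model.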
End-to-end Speech Translation (E2E ST) aims to translate source speech into target translation without generating the intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with its embedding similarity from speech and transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is bridging word-level representations for both modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1-hour parallel data. Code is available at https://anonymous.4open.science/r/WACO .
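To make the idea of contrastively aligning word-level speech and text representations more concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss over aligned word vectors. The abstract above does not give the exact form of WACO's objective, so the cosine similarity, the temperature value, and the function names are assumptions for illustration only; the sketch also assumes all vectors are non-zero.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_contrastive_loss(
    speech_words: List[Sequence[float]],
    text_words: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss: speech word i should match text word i.

    The other text words in the list act as negatives for speech word i.
    """
    if not speech_words or len(speech_words) != len(text_words):
        raise ValueError("need equally sized, non-empty lists of aligned word vectors")
    total = 0.0
    for i, speech_vec in enumerate(speech_words):
        logits = [cosine(speech_vec, t) / temperature for t in text_words]
        # Numerically stable log-sum-exp over all candidate text words.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(speech_words)
```

The loss is small when each speech word embedding is closest to its own text word embedding and grows when the pairing is scrambled, which is the property a word-level alignment objective relies on.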
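We propose a two-stage training approach for developing a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 25 languages. We find that this model can generalize to zero-shot translation on unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then train with successive rounds of back-translation. The final model extends to the English-to-Many direction while retaining Many-to-English performance. We call our approach EcXTra (English-centric Crosslingual (X) Transfer). Our approach sequentially leverages auxiliary parallel data and monolingual data, and is conceptually simple, using only a standard cross-entropy objective in both stages. The final EcXTra model is evaluated on unsupervised NMT for 8 low-resource languages, achieving a new state of the art for English-to-Kazakh (22.3 > 10.4 BLEU) and competitive performance for the other 15 translation directions.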
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
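End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder with source transcripts via speech recognition or text translation tasks, without which translation performance drops substantially. However, transcripts are not always available, and how significant such pretraining is for E2E ST has rarely been studied in the literature. In this paper, we revisit this question and explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved. We re-examine several techniques previously shown to benefit ST, and offer a set of best practices that bias a Transformer-based E2E ST system toward training from scratch. In addition, we propose a parameterized distance penalty to facilitate locality modeling in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pretraining, the proposed system reaches and even outperforms previous studies that adopt pretraining, although a gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features directly from raw speech signals, with the goal of simplifying inductive biases and giving the model more freedom in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.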
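Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR). This is evident in the performance of the wav2vec 2.0-based pretrained XLSR-53 model across many languages when fine-tuned with available labeled data. However, the performance obtained by fine-tuning these models can depend on the amount of in-language or similar-language data included in the pretraining dataset. In this paper, we investigate continued pretraining (CoPT) of the XLSR-53 pretrained model in several low-resource languages. CoPT is more efficient than semi-supervised training (SST), the standard approach to using unlabeled data in ASR, since it omits the need for pseudo-labeling of the unlabeled data. We show that CoPT results in word error rates (WERs) equal to or slightly better than those obtained with SST. In addition, we show that using the CoPT model for pseudo-labeling, and using these labels in SST, yields further improvements in WER.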
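For readers unfamiliar with the word error rate (WER) metric that this abstract reports, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; published results typically also apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.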
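Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work, we focus on low-resource spoken named entity recognition (NER) and address the question: beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches (speech recognition followed by a text NER model). We find that several of these approaches improve performance in resource-constrained settings beyond the benefits of pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing, for example, that end-to-end models are able to focus on the more NER-specific words.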
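In this paper, we describe the submission of the Samsung Research Philippines-Konvergen AI team to the WMT'21 Large Scale Multilingual Translation Task, Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set and 22.97 average BLEU on the contest's hidden test set, ranking sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.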
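In this paper, we introduce a high-quality, large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours, consisting of 331K triplets of (sentence-length audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments using strong baselines and find that the traditional "cascaded" approach still outperforms the modern "end-to-end" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope that both our publicly available dataset and our study can serve as a starting point for future research and applications in English-Vietnamese speech translation. Our dataset is available at https://github.com/vinairesearch/phost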