我们介绍了一种无线文字语音转换(S2ST)系统,可以将来自一种语言的语音转换为另一种语言,并且可以在不需要任何文本数据的情况下构建。与文献中的现有工作不同,我们解决了模拟多扬声器目标语音的挑战,并用现实世界的S2ST数据训练系统。我们方法的关键是一种自我监督的单位语音标准化技术,该标准化技术将预先训练的语音编码器具有来自多个扬声器的配对声音,以及单个参考扬声器,以减少由于复印件引起的变化,同时保留词汇内容。只有10分钟的语音标准化的配对数据,我们在培训\ vp〜s2st数据集上的S2ST模型时获得平均3.2 BLEU增益,而不是在未标准化的语音目标上培训的基线。我们还将自动开采的S2ST数据纳入并显示额外的2.0 BLEU增益。据我们所知,我们是第一个建立无线的S2ST技术,可以用真实世界的数据培训,并为多种语言配对工作。
translated by 谷歌翻译
直接语音到语音翻译(S2ST)模型与传统级联系统可用的数据量相比,几乎没有平行的S2ST数据遇到数据稀缺问题,该数据包括自动语音识别(ASR),机器翻译(MT)和文本到语音(TTS)合成。在这项工作中,我们使用未标记的语音数据和数据扩展来探索自我监督的预训练,以解决此问题。我们利用了最近提出的语音到单位翻译(S2UT)框架,该框架将目标语音编码为离散表示形式,并转移前训练前和有效的部分填充技术,可很好地适用于语音到文本翻译(S2T)通过研究语音编码器和离散单位解码器预训练,S2UT域。我们在西班牙语 - 英语翻译上进行的实验表明,与多任务学习相比,自我监督的预训练始终如一地提高模型性能,平均为6.6-12.1 BLEU增长,并且可以与数据增强技术相结合,以应用MT来创建弱监督监督的培训数据。音频样本可在以下网址获得:https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html。
translated by 谷歌翻译
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
translated by 谷歌翻译
我们介绍了CVSS,这是一种大规模的多语言对语音转换(S2ST)语料库,从21种语言覆盖了21种语言的句子级并行S2ST对。通过将Covost 2从Covost 2的翻译文本综合将翻译文本与最先进的TTS系统合成语音,源自公共语音语音语料库和COVOST 2语音到文本转换(ST)语料库。提供了两个版本的翻译演讲:1)CVSS-C:所有翻译演讲都是一种高质量的规范声音; 2)CVSS-T:翻译语音从相应的源语音传输。此外,CVSS提供标准化的翻译文本,它与翻译语音中的发音匹配。在每个版本的CVSS上,我们建立了基线多语言直接S2ST模型和Cascade S2ST模型,验证了语料库的有效性。为了构建强大的Cascade S2ST基准,我们在Covost 2上培训了St模型,这优于前一种最先进的培训,而无需额外的数据。尽管如此,直接S2ST模型的性能在从头开始训练时接近强级联基线,并且在匹配ST模型中初始化时,仅在ASR转换转换时的0.1或0.7bleu差异。
translated by 谷歌翻译
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, {\textit UnitY}, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
translated by 谷歌翻译
我们呈现TranslatOrron 2,一个神经直接语音转换转换模型,可以训练结束到底。 TranslatOrron 2由语音编码器,音素解码器,MEL谱图合成器和连接所有前三个组件的注意模块组成。实验结果表明,翻译ron 2在翻译质量和预测的语音自然方面,通过大幅度优于原始翻译,并且通过减轻超越,例如唠叨或长暂停来大幅提高预测演讲的鲁棒性。我们还提出了一种在翻译语音中保留源代言人声音的新方法。训练有素的模型被限制为保留源扬声器的声音,但与原始翻译ron不同,它无法以不同的扬声器的语音产生语音,使模型对生产部署更加强大,通过减轻潜在的滥用来创建欺骗音频伪影。当新方法与基于简单的替代的数据增强一起使用时,训练的翻译器2模型能够保留每个扬声器的声音,以便用扬声器转动输入输入。
translated by 谷歌翻译
本文介绍了基于Wav2VEC 2.0的跨语言语音表示学习的大规模模型。我们在128种语言中培训最多2B个公共讲话音频的近半小时的型号的模型,比公共数据的数量级比最大的已知事先工作。我们的评估涵盖了广泛的任务,域,数据制度和语言,都是高低资源。在Covost-2语音翻译基准测试中,我们将先前的最先进的状态平均为7.4 BLEU超过21个翻译方向进入英语。对于语音识别,XLS-R在Babel,MLS,CommonVoice以及Voxpopuli上的最佳已知工作中提高,降低了相对的误差率14-34%。 XLS-R还在Voxlingua107语言识别上设置了新的技术状态。此外,我们表明,具有足够的模型规模,交叉思维预先预测可以在将英语演讲翻译成其他语言时才能优于英语撇印,这是一个有利于单晶的预借预制的设置。我们希望XLS-R可以帮助改善世界上更多语言的语音处理任务。
translated by 谷歌翻译
Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. ST task has been addressed, for a long time, using a pipeline approach with two modules : first an Automatic Speech Recognition (ASR) in the source language followed by a text-to-text Machine translation (MT). In the past few years, we have seen a paradigm shift towards the end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data was used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios including transfer learning and data augmentation techniques.
translated by 谷歌翻译
本文介绍了我们针对IWSLT 2022离线任务的端到端Yitrans语音翻译系统的提交,该任务从英语音频转换为德语,中文和日语。 Yitrans系统建立在大规模训练的编码器模型上。更具体地说,我们首先设计了多阶段的预训练策略,以建立具有大量标记和未标记数据的多模式模型。然后,我们为下游语音翻译任务微调模型的相应组件。此外,我们做出了各种努力,以提高性能,例如数据过滤,数据增强,语音细分,模型集合等。实验结果表明,我们的Yitrans系统比在三个翻译方向上的强基线取得了显着改进,并且比去年在TST2021英语 - 德国人中的最佳端到端系统方面的改进+5.2 BLEU改进。根据自动评估指标,我们的最终意见在英语 - 德国和英语端到端系统上排名第一。我们使代码和模型公开可用。
translated by 谷歌翻译
语音情感转换是修改语音话语的感知情绪的任务,同时保留词汇内容和扬声器身份。在这项研究中,我们将情感转换问题作为口语翻译任务。我们将演讲分解为离散和解散的学习表现,包括内容单位,F0,扬声器和情感。首先,我们通过将内容单元转换为目标情绪来修改语音内容,然后基于这些单元预测韵律特征。最后,通过将预测的表示馈送到神经声码器中来生成语音波形。这样的范式允许我们超越信号的光谱和参数变化,以及模型非口头发声,例如笑声插入,打开拆除等。我们客观地和主观地展示所提出的方法在基础上优于基线感知情绪和音频质量。我们严格评估了这种复杂系统的所有组成部分,并通过广泛的模型分析和消融研究结束,以更好地强调建议方法的建筑选择,优势和弱点。示例和代码将在以下链接下公开使用:https://speechbot.github.io/emotion。
translated by 谷歌翻译
最近,用于语音处理的自我监督模型最近作为语音处理管道中流行的基础块出现。这些模型在未标记的音频数据上进行了预训练,然后用于语音处理下游任务,例如自动语音识别(ASR)或语音翻译(ST)。由于这些模型现在都用于研究和工业系统,因此有必要理解某些特征在培训数据中的性别分布等特征所引起的影响。我们以法语为我们的调查语言,训练和比较性别特定的WAV2VEC 2.0模型与在其预训练数据中包含不同性别平衡的模型。通过将这些模型应用于两个语音到文本下游任务:ASR和ST进行比较。结果显示了下游集成的类型。在微调端到端ASR系统之前,我们使用性别特定的预训练观察到较低的总体性能。但是,当将自我监督模型用作特征提取器时,总体ASR和ST结果遵循更复杂的模式,在这种模式下,平衡的预训练模型不一定会带来最佳结果。最后,我们粗制的“公平”度量标准(男性测试集之间测量的相对性能差异)并未显示出从平衡到特定性别的预训练的Preaded Wav2Vec 2.0模型的强烈变化。
translated by 谷歌翻译
我们提出了Maestro,这是一种自制的培训方法,可以统一从语音和文本方式中学到的表示形式。从语音信号中进行的自我监督学习旨在学习信号中固有的潜在结构,而从文本尝试捕获词汇信息的文本尝试中学习。从不配对的语音和文本序列中学习对齐表示是一项具有挑战性的任务。先前的工作要么隐含地强制执行从这两种方式中学到的表示形式,要通过多任务和参数共享在潜在空间中对齐,或通过语音综合通过模态转换而明确地进行。前者受到两种方式之间的干扰,而后者则引入了额外的复杂性。在本文中,我们提出了一种新颖的算法Maestro,旨在同时从这两种方式中学习统一的表示,可以转移到各种下游任务,例如自动语音识别(ASR)和语音翻译(ST)。 Maestro通过序列比对,持续时间预测和匹配的嵌入在学习空间中通过对齐的蒙版模型损失来学习统一的表示形式。我们在Voxpopuli多语言ASR上建立了一个新的最先进(SOTA),单词错误率相对相对降低8%(WER),多域Speetstew ASR(相对3.7%)和21种英语多语言ST在Covost 2上2.8 BLEU的改善平均21种语言。
translated by 谷歌翻译
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
translated by 谷歌翻译
无监督的文本到语音综合(TTS)系统学会通过观察以下语言来生成与任何语言中任何书面句子相对应的语音波形:1)用该语言收集的未转录语音波形的集合; 2)用该语言编写的文本集合,无需访问任何抄录的语音。开发这种系统可以显着提高语言技术对语言的可用性,而无需大量平行的语音和文本数据。本文提出了一个基于对齐模块的无监督的TTS系统,该模块输出了伪文本和另一个使用伪文本进行训练和真实文本进行推理的合成模块。我们的无监督系统可以以七种语言的方式实现与监督系统相当的性能,每种语音约10-20小时。还对文本单元和声码器的效果进行了仔细的研究,以更好地了解哪些因素可能影响无监督的TTS性能。可以在https://cactuswiththoughts.github.io/unsuptts-demo上找到我们的模型生成的样品,可以在https://github.com/lwang114/unsuptts上找到我们的代码。
translated by 谷歌翻译
语音中的自我监督学习涉及在大规模的未注释的语音语料库上训练语音表示网络,然后将学习的表示形式应用于下游任务。由于语音中SSL学习的大多数下游任务主要集中在语音中的内容信息上,因此最理想的语音表示形式应该能够将不需要的变化(例如说话者的变化)从内容中删除。但是,解开扬声器非常具有挑战性,因为删除说话者的信息也很容易导致内容丢失,而后者的损害通常远远超过了前者的好处。在本文中,我们提出了一种新的SSL方法,该方法可以实现扬声器分解而不会严重丢失内容。我们的方法是根据休伯特框架改编的,并结合了解开机制,以使教师标签和博学的代表规范化。我们在一组与内容相关的下游任务上评估了说话者分解的好处,并观察到我们的扬声器示词表示的一致且著名的性能优势。
translated by 谷歌翻译
无监督的语音识别表现出了使每种语言都可以访问的自动语音识别(ASR)系统的巨大潜力。但是,现有方法仍然严重依赖手工制作的预处理。与端到端进行监督语音识别的趋势类似,我们介绍了WAV2VEC-U 2.0,它消除了所有音频端的预处理,并通过更好的体系结构提高了准确性。此外,我们引入了一个辅助自我监督的目标,该目标将模型的预测与输入联系起来。实验表明,WAV2VEC-U 2.0在概念上更简单的同时,可以改善不同语言的无监督识别结果。
translated by 谷歌翻译
End-to-end Speech Translation (E2E ST) aims to translate source speech into target translation without generating the intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with its embedding similarity from speech and transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is bridging word-level representations for both modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1-hour parallel data. Code is available at https://anonymous.4open.science/r/WACO .
translated by 谷歌翻译
自我监督学习(SSL)在语音识别方面取得了巨大的成功,而有限的探索已尝试完成其他语音处理任务。由于语音信号包含多方面的信息,包括说话者身份,副语言学,口语内容等,学习所有语音任务的通用表示都具有挑战性。为了解决该问题,我们提出了一个新的预培训模型WAVLM,以解决全堆栈的下游语音任务。 Wavlm共同学习了蒙面的语音预测和预训练。通过这种方式,WAVLM不仅可以通过掩盖的语音预测来保持语音内容建模能力,而且还可以通过语音denoing来提高非ASR任务的潜力。此外,WAVLM还采用封闭式的变压器结构的封闭相对位置偏置,以更好地捕获输入语音的序列排序。我们还将培训数据集从60k小时扩展到94K小时。 WAVLM大型在精湛的基准上实现了最先进的性能,并在其代表性基准上为各种语音处理任务带来了重大改进。代码和预培训模型可在https://aka.ms/wavlm上找到。
translated by 谷歌翻译
Data scarcity is one of the main issues with the end-to-end approach for Speech Translation, as compared to the cascaded one. Although most data resources for Speech Translation are originally document-level, they offer a sentence-level view, which can be directly used during training. But this sentence-level view is single and static, potentially limiting the utility of the data. Our proposed data augmentation method SegAugment challenges this idea and aims to increase data availability by providing multiple alternative sentence-level views of a dataset. Our method heavily relies on an Audio Segmentation system to re-segment the speech of each document, after which we obtain the target text with alignment methods. The Audio Segmentation system can be parameterized with different length constraints, thus giving us access to multiple and diverse sentence-level views for each document. Experiments in MuST-C show consistent gains across 8 language pairs, with an average increase of 2.2 BLEU points, and up to 4.7 BLEU for lower-resource scenarios in mTEDx. Additionally, we find that SegAugment is also applicable to purely sentence-level data, as in CoVoST, and that it enables Speech Translation models to completely close the gap between the gold and automatic segmentation at inference time.
translated by 谷歌翻译
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
translated by 谷歌翻译