端到端的语音到语音翻译(S2ST)而不依赖中间文本表示是一个快速新兴的研究领域。最近的作品表明,这种直接S2ST系统的性能正在接近常规级联S2ST时,在可比较的数据集中进行了培训。但是,实际上,直接S2ST的性能受到配对S2ST培训数据的可用性。在这项工作中,我们探索了多种方法,用于利用更广泛的无监督和弱监督的语音和文本数据,以改善基于Translatotron 2的直接S2ST的性能2.使用我们最有效的方法,我们的最有效的方法是21号直接S2ST的平均翻译质量与没有其他数据的先前最新的训练相比,CVSS-C语料库上的语言对改善了+13.6 BLEU(OR +113%)。低资源语言的改进更加显着(平均+398%)。我们的比较研究表明,S2ST和语音表示学习的未来研究方向。
translated by 谷歌翻译
我们提出了Maestro,这是一种自制的培训方法,可以统一从语音和文本方式中学到的表示形式。从语音信号中进行的自我监督学习旨在学习信号中固有的潜在结构,而从文本尝试捕获词汇信息的文本尝试中学习。从不配对的语音和文本序列中学习对齐表示是一项具有挑战性的任务。先前的工作要么隐含地强制执行从这两种方式中学到的表示形式,要通过多任务和参数共享在潜在空间中对齐,或通过语音综合通过模态转换而明确地进行。前者受到两种方式之间的干扰,而后者则引入了额外的复杂性。在本文中,我们提出了一种新颖的算法Maestro,旨在同时从这两种方式中学习统一的表示,可以转移到各种下游任务,例如自动语音识别(ASR)和语音翻译(ST)。 Maestro通过序列比对,持续时间预测和匹配的嵌入在学习空间中通过对齐的蒙版模型损失来学习统一的表示形式。我们在Voxpopuli多语言ASR上建立了一个新的最先进(SOTA),单词错误率相对相对降低8%(WER),多域Speetstew ASR(相对3.7%)和21种英语多语言ST在Covost 2上2.8 BLEU的改善平均21种语言。
translated by 谷歌翻译
我们介绍了CVSS,这是一种大规模的多语言对语音转换(S2ST)语料库,从21种语言覆盖了21种语言的句子级并行S2ST对。通过将Covost 2从Covost 2的翻译文本综合将翻译文本与最先进的TTS系统合成语音,源自公共语音语音语料库和COVOST 2语音到文本转换(ST)语料库。提供了两个版本的翻译演讲:1)CVSS-C:所有翻译演讲都是一种高质量的规范声音; 2)CVSS-T:翻译语音从相应的源语音传输。此外,CVSS提供标准化的翻译文本,它与翻译语音中的发音匹配。在每个版本的CVSS上,我们建立了基线多语言直接S2ST模型和Cascade S2ST模型,验证了语料库的有效性。为了构建强大的Cascade S2ST基准,我们在Covost 2上培训了St模型,这优于前一种最先进的培训,而无需额外的数据。尽管如此,直接S2ST模型的性能在从头开始训练时接近强级联基线,并且在匹配ST模型中初始化时,仅在ASR转换转换时的0.1或0.7bleu差异。
translated by 谷歌翻译
直接语音到语音翻译(S2ST)模型与传统级联系统可用的数据量相比,几乎没有平行的S2ST数据遇到数据稀缺问题,该数据包括自动语音识别(ASR),机器翻译(MT)和文本到语音(TTS)合成。在这项工作中,我们使用未标记的语音数据和数据扩展来探索自我监督的预训练,以解决此问题。我们利用了最近提出的语音到单位翻译(S2UT)框架,该框架将目标语音编码为离散表示形式,并转移前训练前和有效的部分填充技术,可很好地适用于语音到文本翻译(S2T)通过研究语音编码器和离散单位解码器预训练,S2UT域。我们在西班牙语 - 英语翻译上进行的实验表明,与多任务学习相比,自我监督的预训练始终如一地提高模型性能,平均为6.6-12.1 BLEU增长,并且可以与数据增强技术相结合,以应用MT来创建弱监督监督的培训数据。音频样本可在以下网址获得:https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html。
translated by 谷歌翻译
我们呈现TranslatOrron 2,一个神经直接语音转换转换模型,可以训练结束到底。 TranslatOrron 2由语音编码器,音素解码器,MEL谱图合成器和连接所有前三个组件的注意模块组成。实验结果表明,翻译ron 2在翻译质量和预测的语音自然方面,通过大幅度优于原始翻译,并且通过减轻超越,例如唠叨或长暂停来大幅提高预测演讲的鲁棒性。我们还提出了一种在翻译语音中保留源代言人声音的新方法。训练有素的模型被限制为保留源扬声器的声音,但与原始翻译ron不同,它无法以不同的扬声器的语音产生语音,使模型对生产部署更加强大,通过减轻潜在的滥用来创建欺骗音频伪影。当新方法与基于简单的替代的数据增强一起使用时,训练的翻译器2模型能够保留每个扬声器的声音,以便用扬声器转动输入输入。
translated by 谷歌翻译
Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, {\textit UnitY}, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.
translated by 谷歌翻译
本文介绍了我们针对IWSLT 2022离线任务的端到端Yitrans语音翻译系统的提交,该任务从英语音频转换为德语,中文和日语。 Yitrans系统建立在大规模训练的编码器模型上。更具体地说,我们首先设计了多阶段的预训练策略,以建立具有大量标记和未标记数据的多模式模型。然后,我们为下游语音翻译任务微调模型的相应组件。此外,我们做出了各种努力,以提高性能,例如数据过滤,数据增强,语音细分,模型集合等。实验结果表明,我们的Yitrans系统比在三个翻译方向上的强基线取得了显着改进,并且比去年在TST2021英语 - 德国人中的最佳端到端系统方面的改进+5.2 BLEU改进。根据自动评估指标,我们的最终意见在英语 - 德国和英语端到端系统上排名第一。我们使代码和模型公开可用。
translated by 谷歌翻译
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6\% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
translated by 谷歌翻译
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
translated by 谷歌翻译
End-to-end Speech Translation (E2E ST) aims to translate source speech into target translation without generating the intermediate transcript. However, existing approaches for E2E ST degrade considerably when only limited ST data are available. We observe that an ST model's performance strongly correlates with its embedding similarity from speech and transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a novel method for few-shot speech-to-text translation. Our key idea is bridging word-level representations for both modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark. Our experiments demonstrate that WACO outperforms the best baseline methods by 0.7-8.5 BLEU points with only 1-hour parallel data. Code is available at https://anonymous.4open.science/r/WACO .
translated by 谷歌翻译
我们介绍了一种无线文字语音转换(S2ST)系统,可以将来自一种语言的语音转换为另一种语言,并且可以在不需要任何文本数据的情况下构建。与文献中的现有工作不同,我们解决了模拟多扬声器目标语音的挑战,并用现实世界的S2ST数据训练系统。我们方法的关键是一种自我监督的单位语音标准化技术,该标准化技术将预先训练的语音编码器具有来自多个扬声器的配对声音,以及单个参考扬声器,以减少由于复印件引起的变化,同时保留词汇内容。只有10分钟的语音标准化的配对数据,我们在培训\ vp〜s2st数据集上的S2ST模型时获得平均3.2 BLEU增益,而不是在未标准化的语音目标上培训的基线。我们还将自动开采的S2ST数据纳入并显示额外的2.0 BLEU增益。据我们所知,我们是第一个建立无线的S2ST技术,可以用真实世界的数据培训,并为多种语言配对工作。
translated by 谷歌翻译
Speech translation (ST) is the task of directly translating acoustic speech signals in a source language into text in a foreign language. ST task has been addressed, for a long time, using a pipeline approach with two modules : first an Automatic Speech Recognition (ASR) in the source language followed by a text-to-text Machine translation (MT). In the past few years, we have seen a paradigm shift towards the end-to-end approaches using sequence-to-sequence deep neural network models. This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system. Starting from independent ASR and MT LDC releases, we were able to identify about 92 hours of Arabic audio recordings for which the manual transcription was also translated into English at the segment level. These data was used to train and compare pipeline and end-to-end speech translation systems under multiple scenarios including transfer learning and data augmentation techniques.
translated by 谷歌翻译
我们对真正低资源语言的神经机翻译(NMT)进行了实证研究,并提出了一个训练课程,适用于缺乏并行培训数据和计算资源的情况,反映了世界上大多数世界语言和研究人员的现实致力于这些语言。以前,已经向低资源语言储存了使用后翻译(BT)和自动编码(AE)任务的无监督NMT。我们证明利用可比的数据和代码切换作为弱监管,与BT和AE目标相结合,即使仅使用适度的计算资源,低资源语言也会显着改进。在这项工作中提出的培训课程实现了Bleu分数,可通过+12.2 Bleu为古吉拉特和+3.7 Bleu为哈萨克斯培训的监督NMT培训,展示了弱势监督的巨大监督态度资源语言。在受到监督数据的培训时,我们的培训课程达到了索马里数据集(索马里29.3的BLEU的最先进的结果)。我们还观察到增加更多时间和GPU来培训可以进一步提高性能,强调报告在MT研究中的报告资源使用的重要性。
translated by 谷歌翻译
本文介绍了基于Wav2VEC 2.0的跨语言语音表示学习的大规模模型。我们在128种语言中培训最多2B个公共讲话音频的近半小时的型号的模型,比公共数据的数量级比最大的已知事先工作。我们的评估涵盖了广泛的任务,域,数据制度和语言,都是高低资源。在Covost-2语音翻译基准测试中,我们将先前的最先进的状态平均为7.4 BLEU超过21个翻译方向进入英语。对于语音识别,XLS-R在Babel,MLS,CommonVoice以及Voxpopuli上的最佳已知工作中提高,降低了相对的误差率14-34%。 XLS-R还在Voxlingua107语言识别上设置了新的技术状态。此外,我们表明,具有足够的模型规模,交叉思维预先预测可以在将英语演讲翻译成其他语言时才能优于英语撇印,这是一个有利于单晶的预借预制的设置。我们希望XLS-R可以帮助改善世界上更多语言的语音处理任务。
translated by 谷歌翻译
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
translated by 谷歌翻译
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
translated by 谷歌翻译
When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology has a limitation on the need for additional end-to-end training data. In a second step, we proposed an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase the data efficiency and improve the translation quality by 6 BLEU points in scenarios with limited end-to-end training data.
translated by 谷歌翻译
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. At the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
translated by 谷歌翻译
机器翻译系统(MTS)是通过将文本或语音从一种语言转换为另一种语言的有效工具。在像印度这样的大型多语言环境中,对有效的翻译系统的需求变得显而易见,英语和一套印度语言(ILS)正式使用。与英语相反,由于语料库的不可用,IL仍然被视为低资源语言。为了解决不对称性质,多语言神经机器翻译(MNMT)系统会发展为在这个方向上的理想方法。在本文中,我们提出了一个MNMT系统,以解决与低资源语言翻译有关的问题。我们的模型包括两个MNMT系统,即用于英语印度(一对多),另一个用于指示英语(多一对多),其中包含15个语言对(30个翻译说明)的共享编码器码头。由于大多数IL对具有很少的平行语料库,因此不足以训练任何机器翻译模型。我们探索各种增强策略,以通过建议的模型提高整体翻译质量。最先进的变压器体系结构用于实现所提出的模型。大量数据的试验揭示了其优越性比常规模型的优势。此外,本文解决了语言关系的使用(在方言,脚本等方面),尤其是关于同一家族的高资源语言在提高低资源语言表现方面的作用。此外,实验结果还表明了ILS的倒退和域适应性的优势,以提高源和目标语言的翻译质量。使用所有这些关键方法,我们提出的模型在评估指标方面比基线模型更有效,即一组ILS的BLEU(双语评估研究)得分。
translated by 谷歌翻译
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need of cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we significantly improve the state-of-the-art for zero-shot speech translation on Must-C. Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
translated by 谷歌翻译