Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) is primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with fewer than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low- and high-resource language pairs effectively, and can lead to superior performance overall.
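To make the temperature knob concrete, here is a minimal sketch (not from the paper; the dataset sizes are hypothetical) of how a sampling temperature T turns raw language-pair sizes n_i into sampling probabilities p_i proportional to (n_i / sum_j n_j)^(1/T): T = 1 keeps the natural proportions, while larger T moves toward uniform sampling.

```python
# Minimal sketch of temperature-based data sampling for multilingual MT.
# The dataset sizes and temperature values below are illustrative only.

def sampling_probs(pair_sizes, temperature):
    """Map raw language-pair sizes to probabilities p_i ∝ (n_i / N)^(1/T)."""
    total = sum(pair_sizes.values())
    weights = {pair: (n / total) ** (1.0 / temperature) for pair, n in pair_sizes.items()}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}

sizes = {"en-fr": 40_000_000, "en-de": 4_000_000, "en-kk": 200_000}  # hypothetical sizes
for T in (1.0, 5.0, 100.0):  # T=1 keeps natural proportions; large T approaches uniform
    print(T, sampling_probs(sizes, T))
```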
Achieving universal translation between all human language pairs is the holy grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to this goal, it is becoming evident that extending multilingual MT systems simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is forbiddingly limited. To this end, we present a pragmatic approach to building multilingual MT models covering hundreds of languages, using a mixture of supervised and self-supervised objectives depending on the data availability of different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatch, and the amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with more than 200 languages and demonstrate high performance on zero-resource translation for several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.
Multilingual neural machine translation (MNMT) trained on multiple language pairs has attracted considerable attention because it shares knowledge across languages with fewer model parameters and lower training cost than maintaining separate bilingual systems. Nevertheless, multilingual training suffers from degradation caused by language interference in the shared parameters, due to negative interference among different translation directions, especially on high-resource languages. In this paper, we propose a multilingual translation model with High-resource Language-specific Training (HLT-MT) to alleviate negative interference, which adopts two-stage training with a language-specific selection mechanism. Specifically, we first train the multilingual model only on high-resource pairs and select language-specific modules at the top of the decoder to enhance translation quality in high-resource directions. Next, the model is further trained on all available corpora to transfer knowledge from high-resource languages (HRLs) to low-resource languages (LRLs). Experimental results show that HLT-MT outperforms various strong baselines on the WMT-10 and OPUS-100 benchmarks. Furthermore, analytic experiments validate the effectiveness of our method in mitigating negative interference in multilingual training.
Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to transfer information, and their total number of parameters becomes prohibitively expensive as the number of languages grows. In this work, we overcome these drawbacks using hyper-adapters -- hyper-networks that generate adapters from language and layer embeddings. While past work had poor results when scaling hyper-networks, we propose a rescaling fix that significantly improves convergence and enables training larger hyper-networks. We find that hyper-adapters are more parameter efficient than regular adapters, reaching the same performance with up to 12 times fewer parameters. When using the same number of parameters and FLOPS, our approach consistently outperforms regular adapters. Also, hyper-adapters converge faster than alternative approaches and scale better than regular dense networks. Our analysis shows that hyper-adapters learn to encode language relatedness, enabling positive transfer across languages.
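As a rough illustration of the hyper-adapter idea, the following PyTorch sketch generates the weights of a bottleneck adapter from concatenated language and layer embeddings. The dimensions and layout are assumptions for illustration, not the paper's exact architecture (which also includes the rescaling fix mentioned above).

```python
# Illustrative sketch: a hyper-network produces adapter weights conditioned on
# (language, layer) embeddings, so all adapters share one set of generator parameters.
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64, d_embed=32, n_langs=20, n_layers=6):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, d_embed)
        self.layer_emb = nn.Embedding(n_layers, d_embed)
        # Hyper-network: maps the (language, layer) context to the adapter's two matrices.
        self.gen_down = nn.Linear(2 * d_embed, d_model * bottleneck)
        self.gen_up = nn.Linear(2 * d_embed, bottleneck * d_model)
        self.d_model, self.bottleneck = d_model, bottleneck

    def forward(self, hidden, lang_id, layer_id):
        ctx = torch.cat([self.lang_emb(lang_id), self.layer_emb(layer_id)], dim=-1)
        w_down = self.gen_down(ctx).view(self.d_model, self.bottleneck)
        w_up = self.gen_up(ctx).view(self.bottleneck, self.d_model)
        # Standard bottleneck adapter with a residual connection.
        return hidden + torch.relu(hidden @ w_down) @ w_up

adapter = HyperAdapter()
h = torch.randn(2, 7, 512)  # (batch, length, d_model)
out = adapter(h, lang_id=torch.tensor(3), layer_id=torch.tensor(1))
print(out.shape)  # torch.Size([2, 7, 512])
```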
We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes an encoder, decoder and attention module, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on WMT'14 and WMT'15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we present analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
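Since the only change to the input is the artificial token, a minimal sketch of the preprocessing step might look like the following; the `<2xx>` token format is an illustrative choice rather than the exact one used in the paper.

```python
# Minimal sketch: prepend an artificial target-language token to the source sentence.
# The model architecture itself is left untouched.

def add_target_token(source_sentence, target_lang):
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "es"))  # "<2es> How are you?"
print(add_target_token("How are you?", "ja"))  # "<2ja> How are you?"
```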
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, for low-resource tasks, MoE models severely over-fit. We show effective regularization strategies, namely dropout techniques for MoE layers in EOM and FOM, Conditional MoE Routing and Curriculum Learning methods that prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about +1 chrF++ improvement in very low resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.
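As one hedged reading of Expert Output Masking (EOM), the sketch below zeroes the combined expert output of an MoE layer for a random fraction of tokens during training; the exact formulation and masking rates in the paper may differ.

```python
# Hedged sketch of expert-output masking as a regularizer for MoE layers:
# during training, drop the MoE contribution for a random subset of tokens.
import torch

def expert_output_masking(moe_output, mask_rate=0.2, training=True):
    """moe_output: (batch, length, d_model) combined output of the routed experts."""
    if not training or mask_rate == 0.0:
        return moe_output
    keep = torch.rand(moe_output.shape[:2], device=moe_output.device) > mask_rate
    return moe_output * keep.unsqueeze(-1).to(moe_output.dtype)

x = torch.randn(4, 10, 512)
print(expert_output_masking(x).shape)  # torch.Size([4, 10, 512])
```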
Multilingual NMT has become an attractive solution for MT deployment in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments show that combining a shallow decoder with vocabulary filtering yields more than twice faster inference with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), robustness evaluation and human evaluation.
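A hedged sketch of what target-side vocabulary filtering can look like at inference time: restrict the output distribution to subword ids observed for the target language, which shrinks the output projection and speeds up decoding. How the per-language vocabularies are collected here is an assumption, not the paper's recipe.

```python
# Illustrative sketch of target-language vocabulary filtering at decoding time.
from collections import defaultdict

def build_language_vocabs(training_pairs):
    """training_pairs: iterable of (target_lang, token_ids). Returns lang -> allowed id set."""
    vocabs = defaultdict(set)
    for lang, token_ids in training_pairs:
        vocabs[lang].update(token_ids)
    return vocabs

def filter_logits(logits, allowed_ids):
    """Mask out token ids never observed for the target language."""
    return [score if i in allowed_ids else float("-inf") for i, score in enumerate(logits)]

vocabs = build_language_vocabs([("de", [5, 9, 12]), ("de", [5, 7]), ("fr", [3, 4])])
print(filter_logits([0.1, 0.3, 0.2, 0.9, 0.5, 0.8, 0.0, 0.4, 0.1, 0.6, 0.2, 0.3, 0.7], vocabs["de"]))
```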
Multilingual neural machine translation (MNMT) enables one system to translate sentences from multiple source languages into multiple target languages, greatly reducing deployment costs compared with conventional bilingual systems. However, the MNMT training benefit is often limited to many-to-one directions: the model performs worse on one-to-many directions and suffers in the zero-shot setting. To address this issue, this paper discusses how to practically build MNMT systems that serve arbitrary X-Y translation directions while leveraging multilinguality with a two-stage training strategy of pretraining and finetuning. Experimenting with the WMT'21 multilingual translation task, we demonstrate that our systems outperform the conventional baselines of direct bilingual models and pivot translation models in most directions, by +6.0 and +4.1 BLEU on average respectively, without requiring architecture changes or extra data collection. Moreover, we also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.
Multilingual models are parameter-efficient and especially effective at improving low-resource languages by leveraging crosslingual transfer. Despite recent advances in massively multilingual translation with ever-growing models and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, creates optimization tension between high-resource and low-resource languages, where the found multilingual solution is often sub-optimal for low-resource languages. We show that the common training practice of upsampling low-resource data cannot robustly optimize the population loss, risking either underfitting high-resource languages or overfitting low-resource ones. Drawing on recent findings on the geometry of the loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature-Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta-objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages. We run experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improves multilingual optimization, showing consistent gains on low-resource languages (+0.8 to +2.2 BLEU) without hurting high-resource ones. In addition, CATS is robust to overparameterization and large-batch training, making it a promising training method for massively multilingual models that truly improve low-resource languages.
Multilingual neural machine translation (MNMT) aims to translate multiple languages with a single model and has proven successful thanks to effective knowledge transfer among different languages with shared parameters. However, it remains an open question which parameters should be shared and which need to be task-specific. Currently, the common practice is to heuristically design or search for language-specific modules, which makes it difficult to find the optimal configuration. In this paper, we propose a novel parameter-differentiation-based method that allows the model to determine which parameters should be language-specific during training. Inspired by cellular differentiation, each shared parameter in our method can dynamically differentiate into more specialized types. We further define the differentiation criterion as inter-task gradient similarity: parameters whose inter-task gradients conflict are more likely to become language-specific. Extensive experiments on multilingual datasets show that our method significantly outperforms various strong baselines with different parameter-sharing configurations. Further analysis reveals that the parameter-sharing configurations obtained by our method correlate well with linguistic proximity.
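The differentiation criterion lends itself to a short sketch: compare the gradients a shared parameter receives from two translation directions and flag it for differentiation when they conflict. The threshold and tensor shapes below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of an inter-task gradient-similarity check for a shared parameter.
import torch
import torch.nn.functional as F

def should_differentiate(grad_task_a, grad_task_b, threshold=-0.1):
    """Return (flag, similarity): flag is True when the two task gradients conflict."""
    sim = F.cosine_similarity(grad_task_a.flatten(), grad_task_b.flatten(), dim=0).item()
    return sim < threshold, sim

g_en_de = torch.randn(512, 512)
g_en_zh = -g_en_de + 0.1 * torch.randn(512, 512)  # mostly conflicting gradients
print(should_differentiate(g_en_de, g_en_zh))      # (True, ~-0.99)
```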
Non-autoregressive (NAR) machine translation has recently achieved significant improvements and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often implemented with multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and the Imputer as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, which quantifies its performance relative to the AR model as model scale increases.
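For readers unfamiliar with CTC-based NAR translation, here is a minimal PyTorch sketch of the training loss: the model emits an upsampled output sequence and `nn.CTCLoss` marginalizes over alignments (with blanks) to the shorter target. Shapes and vocabulary size are illustrative.

```python
# Minimal sketch of a CTC training step for non-autoregressive translation.
import torch
import torch.nn as nn

vocab_size, blank_id = 1000, 0
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

T, N = 24, 2                                                # upsampled length, batch size
log_probs = torch.randn(T, N, vocab_size).log_softmax(-1)   # (T, N, V) model outputs
targets = torch.randint(1, vocab_size, (N, 10))             # target token ids (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 7])

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```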
This paper proposes a simple yet effective method to improve direct (X-to-Y) translation in two scenarios: zero-shot and when direct data is available. We modify the input tokens of both the encoder and the decoder to include signals for the source and target languages. We show performance gains when training from scratch or when finetuning a pretrained model with the proposed setup. In our experiments, the method shows gains of nearly 10.0 BLEU points on an internal dataset, depending on the checkpoint-selection criteria. In a WMT evaluation campaign, English-to-X performance improves by 4.17 and 2.87 BLEU points in the zero-shot setting and when direct data is available for training, respectively, while X-to-Y improves by 1.29 BLEU over the zero-shot baseline and by 0.44 over the many-to-many baseline. In the low-resource setting, we see improvements of 1.5 to 1.7 points when finetuning on X-to-Y domain data.
In multilingual neural machine translation models that fully share parameters across all languages, an artificial language token is usually used to guide translation into the desired target language. However, recent studies show that prepending language tokens sometimes fails to navigate multilingual neural machine translation models into the right translation directions, especially in zero-shot translation. To alleviate this issue, we propose two methods, language embedding embodiment and language-aware multi-head attention, to learn informative language representations that channel translation into the right directions. The former embodies language embeddings at different critical switching points along the information flow from source to target, aiming to amplify the translation-direction guiding signal. The latter uses a matrix, instead of a vector, to represent a language in the continuous space; the matrix is split into multiple heads to learn language representations in multiple subspaces. Experimental results on two datasets for massively multilingual neural machine translation show that language-aware multi-head attention benefits both supervised and zero-shot translation and significantly alleviates the off-target translation issue. Further linguistic typology prediction experiments show that the matrix-based language representations learned by our methods are able to capture rich linguistic typology features.
A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered harmful, a sampling strategy is usually used to balance the languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that, while relatively better performance is often observed when languages are sampled more equally, downstream performance is more robust to language imbalance than we usually expect. Two features, the UNK rate and closeness to the character level, can warn of poor downstream performance before the task is performed. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.
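The two warning signals can be computed cheaply on tokenized text; the sketch below uses one plausible operationalization (share of `<unk>` tokens, and average characters per token as a proxy for closeness to the character level), which may differ from the paper's exact definitions.

```python
# Hedged sketch of the two warning metrics on a tokenized sample.

def unk_rate(tokens, unk_token="<unk>"):
    return sum(t == unk_token for t in tokens) / max(len(tokens), 1)

def chars_per_token(tokens, unk_token="<unk>"):
    real = [t for t in tokens if t != unk_token]
    return sum(len(t) for t in real) / max(len(real), 1)  # lower = closer to character level

tokens = ["▁multi", "lingual", "▁to", "ken", "izer", "<unk>", "▁训", "练"]
print(f"UNK rate: {unk_rate(tokens):.2f}, chars/token: {chars_per_token(tokens):.2f}")
```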
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
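A hedged sketch of the kind of full-text denoising objective involved: mask contiguous spans of the input and train a sequence-to-sequence model to reconstruct the original sentence. The span sampling and mask rate below are simplified assumptions, not mBART's exact noise function.

```python
# Simplified span-masking noise for denoising pre-training; the decoder target
# is the original (unmasked) sentence.
import random

def mask_spans(tokens, mask_ratio=0.35, mean_span=3, mask_token="<mask>", seed=0):
    rng = random.Random(seed)
    out, i, masked = [], 0, 0
    n_to_mask = int(len(tokens) * mask_ratio)
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < mask_ratio:
            span = max(1, min(rng.randint(1, 2 * mean_span), len(tokens) - i))
            out.append(mask_token)          # the whole span collapses to one mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

src = "Multilingual denoising pre-training produces gains across many MT tasks".split()
print(mask_spans(src))   # noisy encoder input
```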
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
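The obfuscation approach can be sketched as a deterministic word-to-nonsense-token mapping that preserves sentence structure and co-occurrence statistics while discarding surface forms; the token format here is an illustrative choice, not the paper's.

```python
# Minimal sketch: map every word type in a corpus to a 'nonsense' token id.

def build_obfuscation_map(sentences):
    vocab = {}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, f"tok{len(vocab):05d}")
    return vocab

def obfuscate(sentence, mapping):
    return " ".join(mapping[w] for w in sentence.split())

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
mapping = build_obfuscation_map(corpus)
print([obfuscate(s, mapping) for s in corpus])
# e.g. ['tok00000 tok00001 tok00002 tok00003 tok00000 tok00004', ...]
```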
There has been recent success in pre-training on monolingual data and fine-tuning on machine translation (MT), but it remains unclear how best to leverage a pre-trained model for a given MT task. This paper studies the benefits and drawbacks of freezing parameters when fine-tuning pre-trained models on MT. We focus on 1) fine-tuning BART, a model trained only on English monolingual data, and 2) fine-tuning mBART, a model trained on monolingual data from 25 languages. For BART, we obtain the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART, we match the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When restricting ourselves to an out-of-domain training set for Vietnamese-to-English, we see the largest improvements over the baseline.
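The freezing recipe translates naturally into a few lines of PyTorch: freeze most pre-trained parameters and keep only a chosen subset (for instance cross-attention and newly added positional embeddings) trainable. The name-matching rules and the toy model below are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch of selective parameter freezing for fine-tuning.
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_substrings=("encoder_attn", "new_pos_emb")):
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {n_train}/{n_total} parameters")

# Toy stand-in for a pre-trained seq2seq model.
toy = nn.ModuleDict({
    "encoder_layer": nn.Linear(512, 512),
    "decoder_encoder_attn": nn.Linear(512, 512),   # cross-attention stays trainable
    "new_pos_emb": nn.Embedding(1024, 512),        # newly added positional embeddings
})
freeze_except(toy)
```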
We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test sets. Enabled by the multilinguality of SpeechMatrix, we also explore multilingual speech-to-speech translation, a topic which was addressed by few other works. We also demonstrate that model pre-training and sparse scaling using Mixture-of-Experts bring large gains to translation performance. The mined data and models are freely available.
This paper presents a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high- and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice and VoxPopuli, lowering error rates by 14-34% relative. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pre-training can outperform English-only pre-training when translating English speech into other languages, a setting which favors monolingual pre-training. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
Previous work mainly focuses on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or on improving supervised machine translation with BERT. However, it is under-explored whether an MPE can help facilitate the cross-lingual transferability of an NMT model. In this paper, we focus on the zero-shot cross-lingual transfer task in NMT. In this task, the NMT model is trained with a parallel dataset of only one language pair and an off-the-shelf MPE, and is then directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gains further improvement from a position-disentangled encoder and a capacity-enhanced decoder. With this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.