Multi-source translation (MST), which typically receives multiple source sentences with the same meaning in different languages, has been shown to be superior to single-source translation. As the quantity of multi-source parallel data is limited, it remains a challenge to take full advantage of single-source data and the limited multi-source data so that models perform well when given as many sources as possible. Unlike previous work, which is mostly devoted to supervised scenarios, we focus on zero-shot MST: we expect models to be able to process unseen combinations of multiple sources, e.g., unseen language combinations, during inference. We propose a simple yet effective parameter-efficient method, named Prompt Gating, which appends prompts to the model inputs and attaches gates to the extended hidden states in each encoder layer. It shows strong zero-shot transferability (up to +9.0 BLEU points) and remarkable compositionality (up to +15.6 BLEU points) on MST, and also outperforms baselines on lexically constrained translation.
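The mechanism described above (prompts appended to the encoder input, with gates on the extended hidden states of each layer) can be illustrated with a short sketch. Below is a minimal PyTorch illustration of that general idea, not the authors' implementation; the prompt length, initialization, and sigmoid gating are assumptions.

```python
import torch
import torch.nn as nn

class PromptGatedEncoderLayer(nn.Module):
    """Illustrative sketch: a Transformer encoder layer whose input is extended
    with trainable prompt vectors, with a learned gate applied to the prompt
    positions of the extended hidden states (hypothetical, simplified)."""

    def __init__(self, d_model=512, nhead=8, prompt_len=16):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)  # trainable prompts
        self.gate = nn.Parameter(torch.zeros(prompt_len, d_model))           # gate on prompt states

    def forward(self, x):                                 # x: (batch, src_len, d_model)
        p = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.layer(torch.cat([p, x], dim=1))          # extended hidden states
        h_prompt, h_src = h[:, :p.size(1)], h[:, p.size(1):]
        h_prompt = torch.sigmoid(self.gate) * h_prompt    # gate the prompt positions
        return torch.cat([h_prompt, h_src], dim=1)

x = torch.randn(2, 10, 512)
print(PromptGatedEncoderLayer()(x).shape)                 # torch.Size([2, 26, 512])
```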
Previous work has mainly focused on either improving cross-lingual transfer on NLU tasks with multilingual pretrained encoders (MPE), or improving the performance of supervised machine translation with BERT. However, it is under-explored whether an MPE can help improve the cross-lingual transferability of NMT models. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT. In this task, the NMT model is trained with a parallel dataset of only one language pair and an off-the-shelf MPE, and is then directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and is further improved with a disentangled encoder and a capacity-enhanced decoder. With this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model explicitly designed for NMT, improving over it by an average of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.
In multilingual neural machine translation models that fully share parameters across all languages, an artificial language token is usually used to guide translation into the desired target language. However, recent studies show that prepended language tokens sometimes fail to navigate multilingual neural machine translation models into the right translation direction, especially on zero-shot translation. To mitigate this issue, we propose two methods, language embedding embodiment and language-aware multi-head attention, to learn informative language representations that channel translation into the correct direction. The former embodies a language at different critical switching points along the information flow from source to target, aiming to amplify the signal that guides the translation direction. The latter uses a matrix, instead of a vector, to represent a language in continuous space. The matrix is split into multiple heads so as to learn language representations in multiple subspaces. Experimental results on two datasets for massively multilingual neural machine translation demonstrate that language-aware multi-head attention benefits both supervised and zero-shot translation and significantly alleviates the off-target translation issue. Further linguistic typology prediction experiments show that the matrix-based language representations learned with our methods are able to capture rich linguistic typology features.
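As a rough illustration of the matrix-based language representation, the hedged sketch below represents each language as one vector per attention head and adds the flattened per-head slices to the attention queries; the exact injection points and combination used in the paper may differ.

```python
import torch
import torch.nn as nn

class LanguageAwareAttention(nn.Module):
    """Hedged sketch (not the paper's exact formulation): each language is a
    matrix with one row per head; its flattened form biases the queries."""

    def __init__(self, d_model=512, nhead=8, num_langs=100):
        super().__init__()
        self.lang_matrix = nn.Parameter(torch.randn(num_langs, nhead, d_model // nhead) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x, lang_id):                          # x: (batch, len, d_model)
        bias = self.lang_matrix[lang_id].reshape(1, 1, -1)  # (1, 1, d_model), split over heads
        out, _ = self.attn(x + bias, x, x)                  # language-aware queries
        return out

x = torch.randn(2, 7, 512)
print(LanguageAwareAttention()(x, lang_id=3).shape)         # torch.Size([2, 7, 512])
```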
Fine-tuning large pre-trained language models on downstream tasks has become the de facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that fine-tune only a small number of (extra) parameters while attaining strong performance. While effective, the critical ingredients for success and the connections among the various methods remain poorly understood. In this paper, we break down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections among them. Specifically, we re-frame them as modifications to specific hidden states of the pre-trained model and define a set of design dimensions along which the methods vary, such as the function used to compute the modification and the position at which the modification is applied. Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, we use the unified view to identify important design choices in previous methods. Furthermore, our unified framework enables transferring design elements across different approaches, allowing us to instantiate new parameter-efficient fine-tuning methods that tune fewer parameters than previous methods while being more effective, achieving results comparable to fine-tuning all parameters on all four tasks.
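The unified view (each method computes a small modification and applies it to a chosen hidden state) can be written down compactly. The sketch below is an illustrative bottleneck instance of the h ← h + s·Δh formulation, with names and sizes chosen for the example rather than taken from the paper.

```python
import torch
import torch.nn as nn

class HiddenStateModifier(nn.Module):
    """Sketch of the unified view: compute a modification delta_h from an input
    x and add it, scaled, to a chosen hidden state h of the frozen model.
    Adapters, prefix tuning, and LoRA differ mainly in which state is modified
    and how delta_h is computed; the bottleneck below is one illustrative choice."""

    def __init__(self, d_model=512, bottleneck=32, scale=1.0):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.scale = scale

    def forward(self, h, x):
        delta_h = self.up(torch.relu(self.down(x)))   # small trainable modification
        return h + self.scale * delta_h               # h <- h + s * delta_h

h = x = torch.randn(2, 7, 512)
print(HiddenStateModifier()(h, x).shape)              # torch.Size([2, 7, 512])
```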
We propose a two-stage training approach for developing a single NMT model that translates unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights and then perform multilingual fine-tuning on parallel data in 25 languages to English. We find that this model can generalize to zero-shot translation for unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets and then train with successive rounds of back-translation. The final model extends to the English-to-many direction while maintaining many-to-English performance. We call our approach EcXTra (English-centric Crosslingual (X) Transfer). Our approach sequentially leverages auxiliary parallel data and monolingual data and is conceptually simple, using only a standard cross-entropy objective in both stages. The final EcXTra model is evaluated on unsupervised NMT for 8 low-resource languages, achieving a new state of the art for English-to-Kazakh (22.3 vs. 10.4 BLEU) and competitive performance for the other 15 translation directions.
Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to transfer information, and their total number of parameters becomes prohibitively expensive as the number of languages grows. In this work, we overcome these drawbacks using hyper-adapters -- hyper-networks that generate adapters from language and layer embeddings. While past work had poor results when scaling hyper-networks, we propose a rescaling fix that significantly improves convergence and enables training larger hyper-networks. We find that hyper-adapters are more parameter-efficient than regular adapters, reaching the same performance with up to 12 times fewer parameters. When using the same number of parameters and FLOPs, our approach consistently outperforms regular adapters. Also, hyper-adapters converge faster than alternative approaches and scale better than regular dense networks. Our analysis shows that hyper-adapters learn to encode language relatedness, enabling positive transfer across languages.
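To make the hyper-adapter idea concrete, here is a hedged sketch in which a small hyper-network maps language and layer embeddings to the weights of a bottleneck adapter; the dimensions, the embedding combination, and the omission of the rescaling fix are all simplifications for illustration.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Sketch: generate per-(language, layer) adapter weights from embeddings
    instead of storing a separate adapter per language (simplified)."""

    def __init__(self, d_model=512, bottleneck=32, num_langs=100, num_layers=6, d_embed=64):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        self.lang_emb = nn.Embedding(num_langs, d_embed)
        self.layer_emb = nn.Embedding(num_layers, d_embed)
        n_weights = 2 * d_model * bottleneck           # down- and up-projection
        self.hyper = nn.Sequential(nn.Linear(2 * d_embed, 128), nn.ReLU(),
                                   nn.Linear(128, n_weights))

    def forward(self, x, lang_id, layer_id):           # x: (batch, len, d_model)
        ctx = torch.cat([self.lang_emb(torch.tensor(lang_id)),
                         self.layer_emb(torch.tensor(layer_id))])
        w = self.hyper(ctx)                            # generated adapter weights
        w_down = w[: self.d_model * self.bottleneck].view(self.d_model, self.bottleneck)
        w_up = w[self.d_model * self.bottleneck:].view(self.bottleneck, self.d_model)
        return x + torch.relu(x @ w_down) @ w_up       # residual bottleneck adapter

x = torch.randn(2, 7, 512)
print(HyperAdapter()(x, lang_id=3, layer_id=1).shape)  # torch.Size([2, 7, 512])
```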
Directly training a document-to-document (Doc2Doc) neural machine translation (NMT) model via Transformer from scratch, especially on small datasets, usually fails to converge. Our dedicated probing tasks show that 1) both the absolute and relative position information gets gradually weakened or even vanishes once it reaches the upper encoder layers, and 2) the vanishing of absolute position information in the encoder output causes the training failure of Doc2Doc NMT. To alleviate this problem, we propose a position-aware Transformer (P-Transformer) to enhance both the absolute and relative position information in both self-attention and cross-attention. Specifically, we integrate absolute positional information, i.e., position embeddings, into the query-key pairs in both self-attention and cross-attention through a simple yet effective addition operation. Moreover, we also integrate relative position encoding in self-attention. The proposed P-Transformer utilizes sinusoidal position encoding and does not require any task-specific position embedding, segment embedding, or attention mechanism. Through the above methods, we build a Doc2Doc NMT model with P-Transformer, which ingests the source document and completely generates the target document in a sequence-to-sequence (seq2seq) way. In addition, P-Transformer can be applied to seq2seq-based document-to-sentence (Doc2Sent) and sentence-to-sentence (Sent2Sent) translation. Extensive experimental results on Doc2Doc NMT show that P-Transformer significantly outperforms strong baselines on 9 widely used document-level datasets in 7 language pairs, covering small, middle, and large scales, and achieves a new state-of-the-art. Experimentation on discourse phenomena shows that our Doc2Doc NMT models improve translation quality in terms of both BLEU and discourse coherence. We make our code available on GitHub.
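The addition of absolute position information to the query-key pairs can be sketched directly. The snippet below is a simplified single-head illustration of that addition operation with sinusoidal encodings, not the full P-Transformer (relative positions and multi-head details are omitted).

```python
import math
import torch

def sinusoidal_positions(length, d_model):
    """Standard sinusoidal position encoding of shape (length, d_model)."""
    pos = torch.arange(length, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def position_aware_attention(q, k, v):
    """Add absolute position encodings to queries and keys (not values)
    before computing attention, re-injecting position at every layer."""
    q = q + sinusoidal_positions(q.size(1), q.size(-1))
    k = k + sinusoidal_positions(k.size(1), k.size(-1))
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
print(position_aware_attention(q, k, v).shape)   # torch.Size([2, 10, 64])
```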
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and causal language modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20-billion-parameter multilingual seq2seq model called the Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming the much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA on 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT-3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for large language model (LLM) training.
We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no changes to the model architecture from a standard NMT system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes an encoder, decoder and attention module, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on the WMT'14 and WMT'15 benchmarks, respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples obtained when mixing languages.
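The artificial target-language token amounts to a one-line preprocessing step; the snippet below illustrates the idea (the exact token format used in production systems may vary).

```python
def add_target_token(source: str, target_lang: str) -> str:
    """Prepend a token telling the shared model which language to translate into."""
    return f"<2{target_lang}> {source}"

print(add_target_token("Hello, how are you?", "es"))  # <2es> Hello, how are you?
print(add_target_token("Hello, how are you?", "de"))  # <2de> Hello, how are you?
```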
Pre-trained multilingual language models show significant performance gains for zero-shot cross-lingual model transfer on a wide range of natural language understanding (NLU) tasks. Previously, for zero-shot cross-lingual evaluation, pre-trained models were only fine-tuned on English data and tested on a variety of target languages. In this paper, we perform cross-lingual evaluation on various NLU tasks (sentence classification, sequence labeling, question answering) using prompt tuning and compare it with fine-tuning. The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets, with only 0.1% to 0.3% of the parameters tuned. Additionally, our analysis demonstrates that prompt tuning yields better cross-lingual transferability of representations on downstream tasks, with better-aligned decision boundaries.
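A minimal sketch of prompt tuning with a frozen backbone is shown below; the toy encoder stands in for a real pretrained multilingual model, and the prompt length is an assumption, but it makes the 0.1%-0.3% trainable-parameter budget concrete.

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Sketch: freeze the pretrained encoder and embeddings; train only soft
    prompt vectors prepended to the input embeddings."""

    def __init__(self, encoder, embed, prompt_len=20, d_model=512):
        super().__init__()
        self.encoder, self.embed = encoder, embed
        for p in self.parameters():
            p.requires_grad = False                       # freeze pretrained parts
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_ids):                          # (batch, len)
        x = self.embed(input_ids)
        p = self.soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([p, x], dim=1))

embed = nn.Embedding(30000, 512)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2)
model = SoftPromptModel(encoder, embed)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)                                           # only the soft prompt: 20 * 512
```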
Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work.
Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages and study their few- and zero-shot learning capabilities on a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4% absolute accuracy improvement in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the Flores-101 machine translation benchmark, our model outperforms GPT-3 on 171 of the 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement in surface-form robustness and in adapting to tasks that do not have a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to comparably sized GPT-3 models.
Recently, a large number of tuning strategies have been proposed to adapt pre-trained language models to downstream tasks. In this paper, we perform an extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization. Specifically, we explore the relative advantages of three families of multilingual tuning strategies (a total of five models) and empirically evaluate them for summarization over 45 languages. Experimentally, we not only establish a new state-of-the-art on the XL-Sum dataset but also derive a series of observations that we hope can provide hints for future research on the design of multilingual tuning strategies.
Neural machine translation (NMT) models have proven effective on large bilingual datasets. However, existing methods and techniques show that model performance is highly dependent on the number of examples in the training data. For many languages, having such an amount of corpora is a distant dream. Taking inspiration from monolingual speakers who explore new languages using bilingual dictionaries, we investigate the applicability of bilingual dictionaries for languages with extremely low amounts of bilingual corpora, or none at all. In this paper, we explore methods that use bilingual dictionaries with an NMT model to improve translation for extremely low-resource languages. We extend this work to multilingual systems, which exhibit zero-shot properties. We present a detailed analysis of the effects of dictionary quality, training dataset size, language family, and other factors on translation quality. Results on multiple low-resource test languages show a clear improvement of our bilingual dictionary based method over the baselines.
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
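The BART-style denoising input can be approximated with a small helper; the sketch below permutes sentences and masks words, a simplified stand-in for the span-infilling noise actually used in mBART pre-training.

```python
import random

def bart_style_noise(sentences, mask_token="<mask>", mask_prob=0.35, seed=0):
    """Simplified noising: permute sentences and mask words; the training
    target is the original, uncorrupted text (illustrative rates)."""
    rng = random.Random(seed)
    sentences = sentences[:]
    rng.shuffle(sentences)                                 # sentence permutation
    noised = []
    for sent in sentences:
        words = [mask_token if rng.random() < mask_prob else w for w in sent.split()]
        noised.append(" ".join(words))                     # (simplified) infilling
    return " ".join(noised)

doc = ["The cat sat on the mat.", "It was a sunny day.", "Birds sang outside."]
print(bart_style_noise(doc))
```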
Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Using the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate that its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models such as GPT-3 and XGLM (Lin et al., 2021), despite mT5 having approximately 50% fewer parameters. We further show that SAP is effective on question answering and summarization. Our results demonstrate, for the first time, that prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
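The sequential loop at the heart of SAP can be sketched with Hugging Face mT5; this is only an assumption-laden illustration (the checkpoint, chunk size, and stopping rule are arbitrary, and raw mT5 output will be poor without the few-shot exemplars the paper uses), not the authors' procedure.

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

prompt = "Translate English to German: The weather is nice today. German:"
generated = ""
for _ in range(20):                                  # repeatedly ask the model to fill the next span
    text = f"{prompt} {generated} <extra_id_0>"
    inputs = tokenizer(text, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=3)
    piece = tokenizer.decode(out[0], skip_special_tokens=True).strip()
    if not piece:                                    # nothing more to add
        break
    generated = f"{generated} {piece}".strip()       # append and prompt again
print(generated)
```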
We introduce EdgeFormer -- a parameter-efficient Transformer for on-device seq2seq generation under strict computation and memory constraints. Compared with previous parameter-efficient Transformers, EdgeFormer applies two novel principles for cost-effective parameterization, allowing it to perform better given the same parameter budget; moreover, EdgeFormer is further enhanced by a layer adaptation innovation proposed to improve a network with shared layers. Extensive experiments show EdgeFormer can effectively outperform previous parameter-efficient Transformer baselines and achieve competitive results under both computation and memory constraints. Given the promising results, we release EdgeLM -- the pretrained version of EdgeFormer, which is the first publicly available pretrained on-device seq2seq model that can be easily fine-tuned for seq2seq tasks with strong results, facilitating on-device seq2seq generation in practice.
In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we try to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT), in which the input documents, which are shuffled and masked, need to be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pre-training, including (1) the effect of pre-training data quality and (2) the effect of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.
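Building one DrMT-style training example can be sketched as follows; the masking granularity and rates are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def drmt_example(src_sentences, tgt_sentences, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Sketch: shuffle and partially mask the source document; the target is the
    translated document in the original order."""
    rng = random.Random(seed)
    order = list(range(len(src_sentences)))
    rng.shuffle(order)                                     # document reordering
    noised_src = []
    for i in order:
        words = [mask_token if rng.random() < mask_prob else w
                 for w in src_sentences[i].split()]
        noised_src.append(" ".join(words))                 # masking
    return {"source": " ".join(noised_src), "target": " ".join(tgt_sentences)}

src = ["Guten Morgen.", "Wie geht es dir?", "Ich bin froh."]
tgt = ["Good morning.", "How are you?", "I am glad."]
print(drmt_example(src, tgt))
```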
There has been recent success in pre-training on monolingual data and fine-tuning on machine translation (MT), but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters when fine-tuning a pre-trained model on MT. We focus on 1) fine-tuning a model trained only on English monolingual data, BART, and 2) fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART, we obtain the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART, we match the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When constraining ourselves to an out-of-domain training set for Vietnamese to English, we see the largest improvements over the baseline.
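A hedged sketch of the mBART freezing recipe is shown below: freeze everything except the encoder-decoder (cross-)attention before fine-tuning. The Hugging Face checkpoint and the parameter-name matching are assumptions about the library's naming, not the authors' code.

```python
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Freeze the encoder and most of the decoder; keep cross-attention trainable.
for name, param in model.named_parameters():
    param.requires_grad = "encoder_attn" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / total:.1%} of parameters")
```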