Noisy or non-standard input text can cause catastrophic mistranslations in most modern machine translation (MT) systems, and there has been growing research interest in creating noise-robust MT systems. However, as of yet there are very few publicly available parallel corpora with naturally occurring noisy inputs and translations, and thus previous work has resorted to evaluating on synthetically created datasets. In this paper, we propose a benchmark dataset for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on Reddit (www.reddit.com) and professionally sourced translations. We commissioned translations of English comments into French and Japanese, as well as French and Japanese comments into English, with roughly 7k-37k sentence pairs per language pair. We qualitatively and quantitatively examine the types of noise included in this dataset, then demonstrate that existing MT models fail on many noise-related phenomena, even after adaptation on a small in-domain training set. This indicates that this dataset can provide an attractive testbed for methods tailored to handling noisy text in MT. The data is publicly available at www.cs.cmu.edu/~pmichel1/mtnt/.
In this paper, we propose a novel domain adaptation method named "mixed fine tuning" for neural machine translation (NMT). We combine two existing approaches, namely fine tuning and multi-domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine-tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi-domain methods and discuss its benefits and shortcomings.
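To make the tagging step concrete, here is a minimal sketch of how the corpora could be augmented with artificial domain tokens and mixed for the fine-tuning stage. The tag strings <2in>/<2out>, the helper names, and the example sentences are our own illustration; the paper does not prescribe an implementation.

```python
# Minimal sketch of "mixed fine tuning" data preparation.
# Tag strings and function names are illustrative assumptions.

def tag_corpus(src_lines, tag):
    """Prefix every source sentence with an artificial domain token."""
    return [f"{tag} {line.strip()}" for line in src_lines]

def build_mixed_corpus(in_domain_src, out_domain_src):
    # Step 1: an NMT model is first trained on the (tagged) out-of-domain data.
    out_tagged = tag_corpus(out_domain_src, "<2out>")
    # Step 2: fine tuning then continues on a mix of in- and out-of-domain
    # data, each sentence carrying its own domain tag.
    in_tagged = tag_corpus(in_domain_src, "<2in>")
    return in_tagged + out_tagged

if __name__ == "__main__":
    out_domain = ["the parliament adopted the resolution ."]
    in_domain = ["patient exhibits elevated blood pressure ."]
    for line in build_mixed_corpus(in_domain, out_domain):
        print(line)
```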
Generating natural language requires conveying content in an appropriate style. We explore two related tasks on generating text of varying formality: monolingual formality transfer and formality-sensitive machine translation. We propose to solve these tasks jointly using multi-task learning, and show that our models achieve state-of-the-art performance for formality transfer and are able to perform formality-sensitive translation without being explicitly trained on style-annotated translation examples.
In this paper we present the ADAPT system built for the Basque-to-English low-resource MT evaluation campaign. Basque is a low-resource, morphologically rich language. This poses a challenge for neural machine translation models, which generally achieve better performance when trained with large datasets. Accordingly, we used synthetic data to improve the translation quality produced by a model built using only authentic data. Our proposal uses back-translated data to: (a) create new sentences, so the system can be trained with more data; and (b) translate sentences that are close to the test set, so the model can be fine-tuned to the documents to be translated.
An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampled or noisy synthetic data provides a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT'14 English-German test set.
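As a rough illustration of the difference between greedy/beam and sampled back-translation, here is a toy sketch. The hand-written word distributions stand in for a trained target-to-source NMT model; everything in it is an assumption for demonstration purposes.

```python
import random

# Toy stand-in for a target-to-source model: for each target sentence we
# pretend the model defines a distribution over candidate source words at
# each position. A real system would use a trained NMT model here.
FAKE_DISTRIBUTIONS = {
    "guten morgen": [
        [("good", 0.7), ("nice", 0.2), ("fine", 0.1)],
        [("morning", 0.9), ("day", 0.1)],
    ],
}

def greedy_backtranslate(tgt):
    """Greedy/beam-style output: always the most probable word."""
    return " ".join(max(dist, key=lambda wp: wp[1])[0]
                    for dist in FAKE_DISTRIBUTIONS[tgt])

def sampled_backtranslate(tgt, rng):
    """Sampling-style output: draw each word from the model distribution,
    yielding more diverse (and, per the paper, more informative)
    synthetic sources."""
    out = []
    for dist in FAKE_DISTRIBUTIONS[tgt]:
        words, probs = zip(*dist)
        out.append(rng.choices(words, weights=probs)[0])
    return " ".join(out)

rng = random.Random(0)
tgt = "guten morgen"
print("greedy :", greedy_backtranslate(tgt))
for _ in range(3):
    print("sampled:", sampled_backtranslate(tgt, rng))
```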
We examine how various types of noise in the parallel training data impact the quality of neural machine translation systems. We create five types of artificial noise and analyze how they degrade performance in neural and statistical machine translation. We find that neural models are generally more harmed by noise than statistical models. For one especially egregious type of noise, they learn to just copy the input sentence.
Every person speaks or writes their own flavor of their native language, influenced by a number of factors: the content they tend to talk about, their gender, their social status, or their geographical origin. When attempting to perform Machine Translation (MT), these variations have a significant effect on how the system should perform translation, but this is not captured well by standard one-size-fits-all models. In this paper, we propose a simple and parameter-efficient adaptation technique that only requires adapting the bias of the output softmax to each particular user of the MT system, either directly or through a factored approximation. Experiments on TED talks in three languages demonstrate improvements in translation accuracy, and better reflection of speaker traits in the target text.
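A minimal PyTorch sketch of the "full bias" variant of this idea follows; all names and dimensions are chosen for illustration. Only the per-user bias on the output softmax is trained during adaptation, while the shared projection stays frozen.

```python
import torch
import torch.nn as nn

class UserBiasedSoftmax(nn.Module):
    """Output layer whose logits receive a per-user additive bias.
    Dimensions are illustrative; the factored approximation from the
    paper would replace the full per-user vocabulary vector."""

    def __init__(self, hidden_dim, vocab_size, num_users):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)
        # One full-vocabulary bias vector per user ("full bias" variant).
        self.user_bias = nn.Embedding(num_users, vocab_size)
        nn.init.zeros_(self.user_bias.weight)

    def forward(self, hidden, user_id):
        return self.proj(hidden) + self.user_bias(user_id)

# Adaptation: freeze the shared projection, train only the user bias.
layer = UserBiasedSoftmax(hidden_dim=512, vocab_size=32000, num_users=100)
for p in layer.proj.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(layer.user_bias.parameters(), lr=0.1)

hidden = torch.randn(8, 512)              # decoder states for one user's batch
user = torch.full((8,), 3, dtype=torch.long)
logits = layer(hidden, user)              # shape (8, 32000)
```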
A prerequisite for training corpus-based machine translation (MT) systems, either Statistical MT (SMT) or Neural MT (NMT), is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, back-translation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus, both as a separate standalone dataset and combined with human-generated parallel data, affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.
Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to unidirectional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.
Machine translation has recently achieved impressive results thanks to advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using a single parallel sentence at training time.
Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and aims to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT training on data with severe noise.
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, about 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
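The cross-entropy difference criterion behind this kind of selection can be sketched at toy scale as follows. Add-one-smoothed unigram language models stand in for the n-gram models actually used, and the 1% cutoff mirrors the subcorpus size mentioned above; all names are our own.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Add-one-smoothed unigram LM from a list of tokenized sentences."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sent):
    """Per-word cross-entropy (bits) of a sentence under an LM."""
    return -sum(math.log2(lm(w)) for w in sent) / max(len(sent), 1)

def select(general_pool, in_domain, keep_fraction=0.01):
    """Rank general-domain sentences by cross-entropy difference
    H_in(s) - H_gen(s); low scores look in-domain but unlike the
    general corpus, i.e. the 'pseudo in-domain' candidates."""
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general_pool)
    scored = sorted(general_pool,
                    key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s))
    return scored[:max(1, int(len(scored) * keep_fraction))]
```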
We empirically investigate learning from partial feedback in neural machine translation (NMT), where partial feedback is collected by asking users to highlight a correct chunk of a translation. We propose a simple and effective way of utilizing such feedback in NMT training, and demonstrate how the common problem of domain mismatch between training and deployment can be reduced based solely on chunk-level user feedback. We conduct a series of simulation experiments to test the effectiveness of the proposed method. Our results show that chunk-level feedback outperforms sentence-level feedback by up to 2.61% BLEU absolute.
Neural machine translation (NMT) has obtained state-of-the-art performance for several language pairs, while using only parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with automatic back-translations, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English<->German (+2.8-3.7 BLEU) and the low-resource IWSLT 14 task Turkish->English (+2.1-3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English->German.
This paper examines the problem of adapting neural machine translation systems to new, low-resource languages (LRLs) as effectively and rapidly as possible. We propose methods based on starting with massively multilingual "seed models", which can be trained ahead of time, and then continuing training on data related to the LRL. We contrast a number of strategies, leading to a novel, simple, yet effective method of "similar-language regularization", where we jointly train on both the LRL of interest and a similar high-resource language to prevent overfitting to the small LRL data. Experiments show that massively multilingual models, even without any explicit adaptation, are surprisingly effective, achieving BLEU scores of up to 15.5 with no data from the LRL, and that the proposed similar-language regularization improves over other adaptation methods by an average of 1.7 BLEU points across 4 LRL settings. Code to reproduce the experiments is available at https://github.com/neubig/rapid-adaptation.
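A small sketch of the corpus-combination step behind similar-language regularization follows. Mixing by simple concatenation and shuffling is one assumption; the paper contrasts several corpus-combination strategies, and the training loop itself belongs to whatever toolkit built the seed model.

```python
import random

def continued_training_batches(lrl_pairs, hrl_pairs, batch_size, rng=None):
    """Yield mixed batches for 'similar-language regularization': the
    adaptation corpus is the union of the low-resource language (LRL)
    data and data from a related high-resource language (HRL), so
    training continued from the multilingual seed model does not
    overfit the tiny LRL set."""
    rng = rng or random.Random(0)
    pool = list(lrl_pairs) + list(hrl_pairs)
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```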
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models: an unsupervised one that relies only on monolingual data, and a supervised one that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, and on unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
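The supervised objective, translation language modeling (TLM), can be sketched as follows. The special-token layout and masking are simplified relative to the paper, which uses BERT-style token replacement and resets positions for the target sentence; everything here is an illustrative assumption.

```python
import random

MASK = "<mask>"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=None):
    """Build one Translation Language Modeling (TLM) example: the parallel
    pair is concatenated into a single stream and tokens are masked on
    BOTH sides, so predicting a masked source word can attend to its
    target-side translation (and vice versa)."""
    rng = rng or random.Random(0)
    stream = src_tokens + ["</s>"] + tgt_tokens  # simplified separator
    inputs, labels = [], []
    for tok in stream:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the masked stream
            labels.append(tok)    # and must recover this token
        else:
            inputs.append(tok)
            labels.append(None)   # position not predicted
    return inputs, labels

inp, lab = make_tlm_example("the cat sat".split(), "le chat était assis".split())
print(inp)
print(lab)
```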
Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access only to large monolingual corpora in each language. We propose two model variants, a neural model and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models, and automatic generation of parallel data through iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyperparameters. On the widely used WMT'14 English-French and WMT'16 German-English benchmarks, our models obtain 28.1 and 25.2 BLEU points respectively without using a single parallel sentence, outperforming the previous state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods make better use of the scarce available bitexts than semi-supervised and supervised approaches. Our code for NMT and PBSMT is publicly available.
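The denoising ingredient can be illustrated with the noise model commonly used for unsupervised MT: word dropout plus a constrained local shuffle. The parameter values and function names here are assumptions for the sketch.

```python
import random

def add_noise(tokens, drop_prob=0.1, k=3, rng=None):
    """Corrupt a sentence for the denoising objective: randomly drop
    words, then shuffle slightly so no word moves more than ~k positions.
    The denoising autoencoder learns to reconstruct the clean sentence."""
    rng = rng or random.Random(0)
    # 1) word dropout (keep at least one token)
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # 2) local shuffle: sort indices perturbed by uniform noise in [0, k)
    keyed = sorted(range(len(kept)), key=lambda i: i + rng.uniform(0, k))
    return [kept[i] for i in keyed]

print(add_noise("the model learns to reconstruct noisy input".split()))
```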
Intelligent selection of training data has proven a successful technique to simultaneously increase training efficiency and translation performance for phrase-based machine translation (PBMT). With the recent increase in popularity of neural machine translation (NMT), we explore in this paper to what extent and how NMT can also benefit from data selection. While state-of-the-art data selection (Axelrod et al., 2011) consistently performs well for PBMT, we show that gains are substantially lower for NMT. Next, we introduce dynamic data selection for NMT, a method in which we vary the selected subset of training data between different training epochs. Our experiments show that the best results are achieved when applying a technique we call gradual fine-tuning, with improvements up to +2.6 BLEU over the original data selection approach and up to +3.1 BLEU over a general baseline.
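A sketch of the gradual fine-tuning idea: each epoch trains on a smaller, increasingly in-domain top slice of the ranked corpus. The shrink factor and schedule are illustrative; the paper tunes these, and the ranking would come from a selection criterion such as cross-entropy difference.

```python
def gradual_finetuning_schedule(ranked_pairs, epochs, start_fraction=1.0, shrink=0.5):
    """'ranked_pairs' is the training data sorted from most to least
    in-domain. Yield the (shrinking) training subset for each epoch."""
    fraction = start_fraction
    for epoch in range(epochs):
        top_n = max(1, int(len(ranked_pairs) * fraction))
        yield epoch, ranked_pairs[:top_n]
        fraction *= shrink

for epoch, subset in gradual_finetuning_schedule(list(range(1000)), epochs=4):
    print(f"epoch {epoch}: train on top {len(subset)} sentence pairs")
```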
We propose a novel technique for adapting text-based statistical machine translation to deal with input from automatic speech recognition in spoken language translation tasks. We simulate likely misrecognition errors using only a source language pronunciation dictionary and language model (i.e., without an acoustic model), and use these to augment the phrase table of a standard MT system. The augmented system can thus recover from recognition errors during decoding using synthesized phrases. Using the outputs of five different English ASR systems as input, we find consistent and significant improvements in translation quality. Our proposed technique can also be used in conjunction with lattices as ASR output, leading to further improvements.
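A toy sketch of the phrase-table augmentation follows. Exact homophones from a pronunciation dictionary stand in for the full confusion model (which, per the abstract, also uses a source language model to weight likely misrecognitions); the dictionary entries and table format are illustrative assumptions.

```python
from collections import defaultdict

# Toy pronunciation dictionary; a real system would use the ASR lexicon.
PRONUNCIATIONS = {
    "too": "T UW", "two": "T UW", "to": "T UW",
    "see": "S IY", "sea": "S IY",
}

def homophone_groups(prondict):
    """Group words that share a pronunciation."""
    groups = defaultdict(set)
    for word, pron in prondict.items():
        groups[pron].add(word)
    return groups

def augment_phrase_table(phrase_table, prondict):
    """For each source phrase, add variants in which one word is replaced
    by a likely misrecognition. Each variant maps to the SAME translation,
    so decoding can recover from recognition errors."""
    groups = homophone_groups(prondict)
    augmented = dict(phrase_table)
    for src, tgt in phrase_table.items():
        words = src.split()
        for i, w in enumerate(words):
            for confusion in groups.get(PRONUNCIATIONS.get(w, ""), set()) - {w}:
                variant = " ".join(words[:i] + [confusion] + words[i + 1:])
                augmented.setdefault(variant, tgt)
    return augmented

table = {"two tickets": "zwei Karten"}
print(augment_phrase_table(table, PRONUNCIATIONS))
```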