This paper provides an overview of NVIDIA NeMo's neural machine translation systems for the constrained data track of the WMT21 News and Biomedical Shared Translation Tasks. Our news task submissions for English-German (En-De) and English-Russian (En-Ru) are built on top of a baseline transformer-based sequence-to-sequence model. Specifically, we use 1) checkpoint averaging, 2) model scaling, 3) data augmentation with backtranslation and knowledge distillation from right-to-left factorized models, 4) finetuning on test sets from previous years, 5) model ensembling, 6) shallow fusion decoding with transformer language models, and 7) noisy channel re-ranking. Additionally, our biomedical task submission for English-Russian uses a biomedically biased vocabulary and is trained from scratch on news task data, medically relevant text curated from the news task dataset, and the biomedical data provided by the shared task. Our news system achieves a sacreBLEU score of 39.5 on the WMT'20 En-De test set, outperforming last year's best submission of 38.8. Our biomedical task Ru-En and En-Ru systems reach BLEU scores of 43.8 and 40.3 respectively on the WMT'20 Biomedical Task test set, outperforming the previous year's best submissions.
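As a concrete illustration of technique 1) above, here is a minimal sketch of checkpoint averaging in PyTorch. It assumes checkpoints saved as plain state dicts and is not taken from the authors' actual NeMo pipeline.

```python
# Minimal checkpoint-averaging sketch: average the parameters of several
# saved checkpoints of the same model architecture (file names are illustrative).
import torch

def average_checkpoints(paths):
    """Return a state dict whose parameters are the mean over the given checkpoints."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")  # assumed to be a plain state dict
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Usage (hypothetical file names):
# model.load_state_dict(average_checkpoints(["ckpt_1.pt", "ckpt_2.pt", "ckpt_3.pt"]))
```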
This paper introduces WeChat's participation in the WMT 2022 shared biomedical translation task on Chinese to English. Our systems are based on the Transformer, and use several different Transformer structures to improve the quality of translation. In our experiments, we employ data filtering, data generation, several variants of Transformer, fine-tuning and model ensemble. Our Chinese$\to$English system, named Summer, achieves the highest BLEU score among all submissions.
We describe the JD Explore Academy's submission to the WMT 2022 shared general translation task. We participated in all high-resource tracks and one medium-resource track, including Chinese-English, German-English, Czech-English, Russian-English, and Japanese-English. We push the limits of our previous work -- bidirectional training for translation -- by scaling up two main factors, i.e. language pairs and model sizes, namely the \textbf{Vega-MT} system. As for language pairs, we scale the "bidirectional" up to the "multidirectional" setting, covering all participating languages, to exploit the common knowledge across languages and transfer it to downstream bilingual tasks. As for model sizes, we scale the Transformer up to an extremely large model with nearly 4.7 billion parameters to fully enhance the model capacity of our Vega-MT. In addition, we adopt data augmentation strategies such as cycle translation for monolingual data and bidirectional self-training for bilingual and monolingual data, to comprehensively exploit both bilingual and monolingual data. To adapt our Vega-MT to the general-domain test set, generalization tuning is designed. Based on the official automatic scores of constrained systems, in terms of sacreBLEU (shown in Figure 1) we got 1st place on {Zh-En (33.5), En-Zh (49.7), De-En (33.7), En-De (37.8), Cs-En (54.9), En-Cs (41.4) and En-Ru (32.7)}, 2nd place on {Ru-En (45.1) and Ja-En (25.6)}, and 3rd place on {En-Ja (41.5)}; with respect to COMET, we got 1st place on {Zh-En (45.1), En-Zh (61.7), De-En (58.0), En-De (63.2), Cs-En (74.7), Ru-En (64.9), En-Ru (69.6) and En-Ja (65.1)}, and 2nd place on {En-Cs (95.3) and Ja-En (40.6)}. Models will be released to facilitate the MT community via GitHub and the Omniforce platform.
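A minimal sketch of how bidirectional training data can be constructed, in the spirit of the abstract above: every parallel pair is reused in both directions, marked with a target-language tag so a single model learns both. The tag format is an illustrative assumption, not necessarily the one used by Vega-MT.

```python
# Build a bidirectional training set from a list of (source, target) pairs.
# The "<2xx>" target-language tag is a common convention, assumed here for illustration.

def make_bidirectional(pairs, src_lang, tgt_lang):
    data = []
    for src, tgt in pairs:
        data.append((f"<2{tgt_lang}> {src}", tgt))  # forward direction
        data.append((f"<2{src_lang}> {tgt}", src))  # reverse direction
    return data

# Example: make_bidirectional([("你好", "hello")], "zh", "en")
```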
Although many context-aware neural machine translation models have been proposed to incorporate context in translation, most of them are trained end-to-end on parallel documents aligned at the sentence level. Because only a few domains (and language pairs) have such document-level parallel data, we cannot perform accurate context-aware translation in most domains. We therefore propose a simple method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder. Our context-aware decoder is built only on a sentence-level parallel corpus and monolingual corpora; thus no document-level parallel data is needed. From a theoretical viewpoint, the core part of this work is the novel representation of contextual information using pointwise mutual information between the context and the current sentence. We show the effectiveness of our approach in three language pairs, English to French, English to Russian, and Japanese to English, through evaluation and contrastive tests for context-aware translation.
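A hedged sketch of the pointwise-mutual-information idea described above: the document-level language model contributes the PMI between the context c and the current target sentence y, which is added to the sentence-level translation score. The exact combination and weighting used in the paper may differ.

```latex
% PMI between context c and target sentence y under a document-level LM,
% added to the sentence-level translation score (weighting is an assumption).
\mathrm{PMI}(c; y) = \log \frac{p_{\mathrm{LM}}(y \mid c)}{p_{\mathrm{LM}}(y)},
\qquad
\mathrm{score}(y) = \log p_{\mathrm{MT}}(y \mid x) + \mathrm{PMI}(c; y)
```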
Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with an automatic back-translation, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English↔German (+2.8-3.7 BLEU), and for the low-resourced IWSLT 14 task Turkish→English (+2.1-3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English→German.
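A minimal sketch of the back-translation recipe described above: target-side monolingual sentences are translated into the source language by a reverse-direction model, and the resulting synthetic pairs are simply added to the parallel training data. The `reverse_model.translate` call is a hypothetical placeholder, not a specific library API.

```python
# Pair target-side monolingual data with automatic back-translations to form
# additional (synthetic source, genuine target) training pairs.

def build_backtranslated_corpus(target_monolingual, reverse_model):
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        synthetic_src = reverse_model.translate(tgt_sentence)  # target -> source
        synthetic_pairs.append((synthetic_src, tgt_sentence))  # train source -> target
    return synthetic_pairs

# The synthetic pairs are concatenated with the genuine parallel data before
# training (or continuing to train) the forward source -> target model.
```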
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German. Based on the Transformer, we apply several effective variants. In our experiments, we utilize the pre-training-then-fine-tuning paradigm. In the first pre-training stage, we employ data filtering and synthetic data generation (i.e., back-translation, forward-translation, and knowledge distillation). In the second fine-tuning stage, we investigate speaker-aware in-domain data generation, speaker adaptation, prompt-based context modeling, target denoising fine-tuning, and boosted self-COMET-based model ensemble. Our systems achieve 0.810 and 0.946 COMET scores. The COMET scores of English-German and German-English are the highest among all submissions.
Preservation of domain knowledge from source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects where there is hardly any parallel in-domain data. In such scenarios, where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, generating translations that are consistent with the relevant context is challenging. In this work, we propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we use the state-of-the-art Transformer architecture. We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, in both scenarios, our proposed methods achieve improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on the Arabic-to-English and English-to-Arabic language pairs. Furthermore, the results of human evaluation corroborate the automatic evaluation results.
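A small sketch of the mixed fine-tuning step mentioned above, assuming the synthetic in-domain data has already been produced via LM generation and back-translation. The oversampling factor is an illustrative choice, not the paper's value.

```python
# Mixed fine-tuning sketch: continue training on a mixture of the original
# general-domain pairs and (oversampled) synthetic in-domain pairs.
import random

def build_mixed_finetuning_data(general_pairs, synthetic_in_domain_pairs, oversample=2):
    mixed = list(general_pairs) + list(synthetic_in_domain_pairs) * oversample
    random.shuffle(mixed)
    return mixed
```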
Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
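A minimal sketch of the reranking step itself, assuming a trained quality predictor. `quality_model.predict_bleu` is a hypothetical interface standing in for the reranker described above, not a real library call.

```python
# Rerank beam-search candidates by their predicted BLEU scores and return
# the top-scoring hypothesis.

def rerank(source, candidates, quality_model):
    """Return the candidate with the highest predicted BLEU score."""
    scored = [(quality_model.predict_bleu(source, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]
```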
This paper describes our submission to the constrained track of the WMT21 shared news translation task. We focus on three relatively low-resource language pairs: Bengali to and from Hindi, English to and from Hausa, and Xhosa to and from Zulu. To overcome the limitation of relatively scarce parallel data, we train a multilingual model using a multitask objective that employs both parallel and monolingual data. In addition, we augment the data using back-translation. We also train a bilingual model incorporating back-translation and knowledge distillation, and then combine the two models using sequence-to-sequence mapping. We see a relative gain of about 70% in BLEU for English to and from Hausa, and relative improvements of about 25% for Bengali to and from Hindi and for Xhosa to and from Zulu, compared to bilingual baselines.
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation. We participated in all three evaluation tracks, including the Large Track and two Small Tracks, where the former is unconstrained and the latter two are fully constrained. Our model submissions to the shared task were initialized with DeltaLM\footnote{\url{https://aka.ms/deltalm}}, a generic pre-trained multilingual encoder-decoder model, and fine-tuned correspondingly with the vast collected parallel data and the allowed data sources according to the track settings, together with progressive learning and iterative back-translation approaches to further improve performance. Our final submissions ranked first on all three tracks in terms of the automatic evaluation metric.
Multilingual NMT has become an attractive solution for MT deployment in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering leads to more than twice faster inference with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), robustness evaluation, and human evaluation.
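A hedged sketch of target-vocabulary filtering, one of the two speed-ups above: the decoder's output projection is sliced down to the token ids deemed plausible for the current input, shrinking the output softmax. How the allowed set is chosen (e.g. from an alignment dictionary or language-specific vocabulary) is an assumption here, not the paper's exact procedure.

```python
# Restrict the decoder's output projection to an allowed subset of token ids.
import torch

def filter_output_projection(full_projection, allowed_token_ids):
    """full_projection: (vocab_size, hidden). Returns the sliced projection and the kept ids."""
    allowed = torch.tensor(sorted(allowed_token_ids))          # long tensor of kept ids
    return full_projection.index_select(0, allowed), allowed   # (|allowed|, hidden)
```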
In this paper, we describe the submission of the joint Samsung Research Philippines - Konvergen AI team to the WMT'21 Large-Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest's hidden test set, ranking sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.
Sockeye 3 is the latest version of the Sockeye toolkit for Neural Machine Translation (NMT). Now based on PyTorch, Sockeye 3 provides faster model implementations and more advanced features with a further streamlined codebase. This enables broader experimentation with faster iteration, efficient training of stronger and faster models, and the flexibility to quickly move new ideas from research to production. When running comparable models, Sockeye 3 is up to 126% faster than other PyTorch implementations on GPUs and up to 292% faster on CPUs. Sockeye 3 is open-source software released under the Apache 2.0 license.
Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
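A minimal sketch of the first approach above, assuming word-level obfuscation: every word type in the corpus is deterministically replaced by a meaningless placeholder token, so the structure of the parallel data is preserved while its content is hidden. The "nonsense_<id>" token format is an illustrative assumption.

```python
# Map every word type in a corpus to a nonsense token, preserving structure.

def obfuscate_corpus(sentences):
    vocab = {}
    obfuscated = []
    for sentence in sentences:
        new_tokens = []
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = f"nonsense_{len(vocab)}"
            new_tokens.append(vocab[word])
        obfuscated.append(" ".join(new_tokens))
    return obfuscated, vocab
```

Applying the same mapping to both sides of a parallel corpus keeps word alignments intact while removing all real-world lexical content.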
We introduce Bi-SimCut: a simple but effective training strategy to boost neural machine translation (NMT) performance. It consists of two procedures: bidirectional pretraining and unidirectional finetuning. Both procedures use SimCut, a simple regularization method that forces consistency between the output distributions of the original and the cutoff sentence pairs. Without leveraging extra datasets via back-translation or integrating large-scale pretrained models, Bi-SimCut achieves strong translation performance across five translation benchmarks (data sizes ranging from 160K to 20.2M): BLEU scores of 31.16 for en→de and 38.37 for de→en on the IWSLT14 dataset, 30.78 for en→de and 35.15 for de→en on the WMT14 dataset, and 27.17 for zh→en on the WMT17 dataset. SimCut is not a new method, but a version of Cutoff (Shen et al., 2020) simplified and adapted for NMT, and it can be viewed as a perturbation-based method. Given the universality and simplicity of SimCut and Bi-SimCut, we believe they can serve as strong baselines for future NMT research.
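A hedged sketch of a SimCut-style consistency term: the output distributions produced for an original input and for its cutoff (perturbed) copy are pulled together with a symmetric KL penalty. The symmetric form and any weighting are assumptions; see Shen et al. (2020) and the paper above for the exact formulation.

```python
# Consistency loss between model outputs for an original input and its
# cutoff (perturbed) copy, using a symmetric KL divergence.
import torch.nn.functional as F

def consistency_loss(logits_original, logits_cutoff):
    p = F.log_softmax(logits_original, dim=-1)
    q = F.log_softmax(logits_cutoff, dim=-1)
    kl_pq = F.kl_div(q, p, reduction="batchmean", log_target=True)  # KL(P || Q)
    kl_qp = F.kl_div(p, q, reduction="batchmean", log_target=True)  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)
```

This term would be added to the usual cross-entropy translation loss with some weight.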
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -- a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
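A hedged sketch of a BART-style noising function of the kind used for this denoising pre-training: sentence order is permuted and short spans are collapsed to a single mask token. The mask rate and span lengths here are illustrative, not mBART's actual hyperparameters (which sample span lengths from a Poisson distribution).

```python
# BART-style text noising sketch: sentence permutation plus span masking.
import random

def bart_style_noise(sentences, mask_token="<mask>", mask_prob=0.35):
    """Permute sentence order and replace short spans with a single mask token."""
    sentences = list(sentences)
    random.shuffle(sentences)                 # sentence permutation
    noised = []
    for sent in sentences:
        tokens = sent.split()
        out, i = [], 0
        while i < len(tokens):
            if random.random() < mask_prob:
                span = random.randint(1, 3)   # illustrative span length
                out.append(mask_token)        # whole span collapses to one mask
                i += span
            else:
                out.append(tokens[i])
                i += 1
        noised.append(" ".join(out))
    return noised
```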
We present the CUNI-Bergamot submission for the WMT22 General translation task. We compete in English$\rightarrow$Czech direction. Our submission further explores block backtranslation techniques. Compared to the previous work, we measure performance in terms of COMET score and named entities translation accuracy. We evaluate performance of MBR decoding compared to traditional mixed backtranslation training and we show a possible synergy when using both of the techniques simultaneously. The results show that both approaches are effective means of improving translation quality and they yield even better results when combined.
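A minimal sketch of MBR decoding over a candidate list, assuming a generic pairwise utility function: the candidate whose average utility against all other candidates (acting as pseudo-references) is highest is selected. The `utility` argument is a placeholder for whatever metric the submission actually uses for selection.

```python
# Minimum Bayes Risk decoding over a list of sampled candidate translations.

def mbr_decode(candidates, utility):
    best, best_score = None, float("-inf")
    for hyp in candidates:
        # Average utility of this hypothesis against all other candidates.
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = hyp, score
    return best
```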
Transformer-based language models have led to impressive results across all domains in Natural Language Processing. Pretraining these models on language modeling tasks and finetuning them on downstream tasks such as text classification, question answering, and neural machine translation has consistently shown exemplary results. In this work, we propose a multi-task finetuning methodology which combines the bilingual machine translation task with an auxiliary causal language modeling task to improve performance on the former task for Indian languages. We conduct an empirical study on three language pairs, Marathi-Hindi, Marathi-English, and Hindi-English, where we compare the multi-task finetuning approach with the standard finetuning approach, using the mBART50 model. Our study indicates that the multi-task finetuning approach can be a better technique than standard finetuning and can improve bilingual machine translation across language pairs.
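A minimal sketch of the multi-task objective described above, assuming the two losses are simply combined as a weighted sum during finetuning; the weight is an illustrative assumption, not the paper's setting.

```python
# Combine the bilingual translation loss with an auxiliary causal LM loss.

def multitask_loss(translation_loss, causal_lm_loss, lm_weight=0.5):
    """Weighted sum of the primary MT loss and the auxiliary LM loss."""
    return translation_loss + lm_weight * causal_lm_loss
```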
We present Charles University submissions to the WMT22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that the romanization only has a minor effect on the translation quality. Further, we describe Charles Translator, a system that was developed in March 2022 as a response to the migration from Ukraine to the Czech Republic. Compared to our constrained systems, it did not use the romanization and used some proprietary data sources.
This paper presents a new data augmentation method for neural machine translation that can enforce stronger semantic consistency both within and across languages. Our method is based on the Conditional Masked Language Model (CMLM), which is bidirectional and can be conditioned on both left and right context as well as the label. We demonstrate that CMLM is a good technique for generating context-dependent word distributions. In particular, we show that CMLM is capable of enforcing semantic consistency by conditioning on both source and target during substitution. In addition, to enhance diversity, we incorporate the idea of soft word substitution for data augmentation, which replaces a word with a probabilistic distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall solution leads to more realistic data augmentation and better translation quality. Compared with recent state-of-the-art works, our approach consistently achieves the best performance and yields improvements of up to 1.90 BLEU points over the baseline.
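A hedged sketch of the soft word substitution idea: instead of a single replacement word, the chosen position is represented by the expectation of the embedding table under the CMLM's predicted distribution at that position. Shapes and the source of the distribution are assumptions for illustration.

```python
# Soft word substitution: expected embedding under the CMLM's predicted distribution.
import torch

def soft_substitute(embedding_table: torch.Tensor, cmlm_probs: torch.Tensor) -> torch.Tensor:
    """embedding_table: (vocab, dim); cmlm_probs: (vocab,). Returns a (dim,) soft embedding."""
    return cmlm_probs @ embedding_table
```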