The noisy channel model is particularly effective for neural machine translation (NMT). However, recent approaches such as beam search and reranking (BSR) incur significant computational overhead during inference, making real-world adoption infeasible. We aim to build an amortized noisy channel NMT model such that greedy decoding from it produces translations that maximize the same reward as translations generated with BSR. We explore three approaches: knowledge distillation, one-step-deviation imitation learning, and Q-learning. The first approach obtains the noisy channel signal from a pseudo-corpus, while the latter two aim to optimize directly against the noisy channel MT reward. All three approaches speed up inference by one to two orders of magnitude. For all three, the generated translations fail to achieve rewards comparable to BSR, but their translation quality, as approximated by BLEU, is similar to that of the translations produced by BSR.
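The reward targeted by BSR-style decoding is usually a log-linear combination of a direct translation model, a channel (reverse translation) model, and a language model. Below is a minimal sketch of that reranking reward under this standard factorization; the scorer interfaces and weights are illustrative placeholders, not the paper's implementation.

```python
# Hedged sketch of a noisy-channel reranking reward: a weighted log-linear
# combination of direct-model, channel-model, and language-model scores,
# used to pick the best candidate from a beam. All names and weights are
# hypothetical placeholders.
from typing import Callable, List

def noisy_channel_score(
    src: str,
    hyp: str,
    log_p_direct: Callable[[str, str], float],   # log p(hyp | src)
    log_p_channel: Callable[[str, str], float],  # log p(src | hyp)
    log_p_lm: Callable[[str], float],            # log p(hyp)
    w_direct: float = 1.0,
    w_channel: float = 1.0,
    w_lm: float = 0.3,
) -> float:
    return (
        w_direct * log_p_direct(hyp, src)
        + w_channel * log_p_channel(src, hyp)
        + w_lm * log_p_lm(hyp)
    )

def rerank(src: str, candidates: List[str], **scorers) -> str:
    """BSR-style second stage: keep the beam candidate with the highest reward."""
    return max(candidates, key=lambda hyp: noisy_channel_score(src, hyp, **scorers))
```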
Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT), and proximal policy optimization (PPO) in terms of both training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (because of the use of the mean absolute deviation in the importance-weighting computation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends on two variance-reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive- and negative-reward translation examples, and (2) a new robust importance-weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned with the MAD algorithm perform well with both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
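As a rough illustration of per-source-sentence ("conditional") reward normalization, the sketch below centres the rewards of the candidates sampled for one source sentence and scales them by their mean absolute deviation, so every sentence contributes both positive- and negative-advantage examples. The exact normalization in the paper may differ; this only illustrates the idea.

```python
# Hedged sketch of conditional reward normalization in the spirit of MAD:
# rewards for candidates sampled from a single source sentence are centred
# and scaled by their mean absolute deviation. Values are illustrative.
import numpy as np

def normalize_rewards_per_source(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (num_candidates,) for one source sentence."""
    centred = rewards - rewards.mean()
    mad = np.mean(np.abs(centred)) + 1e-8  # mean absolute deviation, echoing the algorithm's name
    return centred / mad

# Example: BLEU-like rewards for 4 sampled translations of one source sentence.
print(normalize_rewards_per_source(np.array([0.31, 0.28, 0.45, 0.22])))
```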
Neural autoregressive sequence models smear probability mass over many possible sequences, including degenerate ones such as empty or repetitive sequences. In this work, we address a specific case in which the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on the model distribution and on decoding performance. We use neural machine translation as the testbed and consider three datasets of different sizes. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, by increasing the oversmoothing loss contribution, the probability and rank of the end-of-sequence token drop at positions where it should not appear. Third, the proposed regularization affects the outcome of beam search, especially when a large beam is used. The degradation of translation quality (measured in BLEU) with a large beam lessens significantly at lower oversmoothing rates, although some degradation relative to smaller beam sizes remains. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in neural autoregressive models.
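One way to operationalize an oversmoothing rate is to check, at each non-final position of a reference translation, whether the model prefers ending the sequence over continuing with the true suffix. The sketch below illustrates that idea from token log-probabilities; it is an assumption about the measurement, not necessarily the paper's exact definition.

```python
# Hedged sketch of measuring an oversmoothing rate: count positions where the
# log-probability of a premature end-of-sequence token exceeds the
# log-probability of the remaining reference suffix. Illustrative only.
from typing import List

def oversmoothing_rate(
    logp_eos_at: List[float],    # log p(<eos> | prefix up to t), for each non-final position t
    logp_token_at: List[float],  # log p(true token y_t | prefix up to t), same positions
) -> float:
    T = len(logp_token_at)
    suffix_logp = 0.0
    violations = 0
    # Walk backwards so suffix_logp accumulates log p(y_t .. y_T | prefix).
    for t in reversed(range(T)):
        suffix_logp += logp_token_at[t]
        if logp_eos_at[t] >= suffix_logp:
            violations += 1
    return violations / T

print(oversmoothing_rate([-6.0, -1.5, -0.2], [-0.4, -0.6, -0.5]))  # -> 0.33...
```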
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other monolingual sampling strategies typical of conventional NMT) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference, sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
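Length-normalized beam scoring with a coverage penalty is commonly written as the hypothesis log-probability divided by a length penalty, plus a term that rewards attending to every source word. The sketch below follows that general form with illustrative constants; it is not guaranteed to match GNMT's exact formulation.

```python
# Hedged sketch of length-normalized beam scoring with a coverage penalty.
# alpha and beta are illustrative hyperparameters, not GNMT's tuned values.
import math
from typing import List

def length_penalty(length: int, alpha: float = 0.6) -> float:
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def coverage_penalty(attention: List[List[float]], beta: float = 0.2) -> float:
    """attention[i][j]: attention weight on source word i at target step j."""
    total = 0.0
    for per_source in attention:
        # Penalize source words whose accumulated attention stays below 1.
        total += math.log(min(sum(per_source), 1.0))
    return beta * total

def beam_score(log_prob: float, length: int, attention: List[List[float]]) -> float:
    return log_prob / length_penalty(length) + coverage_penalty(attention)
```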
Although many context-aware neural machine translation models have been proposed to incorporate context into translation, most are trained end-to-end on parallel documents aligned at the sentence level. Because only a few domains (and language pairs) have such document-level parallel data, accurate context-aware translation is not possible in most domains. We therefore propose a simple method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder. Our context-aware decoder is built only from sentence-level parallel corpora and monolingual corpora; thus, no document-level parallel data are needed. At the theoretical core of this work is a novel representation of contextual information using the pointwise mutual information between the context and the current sentence. We show the effectiveness of our approach on three language pairs, English-to-French, English-to-Russian, and Japanese-to-English, by evaluation with BLEU and contrastive tests for context-aware translation.
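A natural way to read the PMI-based combination is as a context bonus, PMI(context; y) = log p_doc(y | context) - log p_doc(y), added with some weight to the sentence-level translation score. The sketch below shows that combination; the weight and scorer interfaces are hypothetical, not the paper's exact decoding rule.

```python
# Hedged sketch of combining a sentence-level NMT score with a document-level
# language model via pointwise mutual information (PMI). Illustrative only.
def context_aware_score(
    log_p_translation: float,      # log p(y | x) from the sentence-level NMT model
    log_p_doc_conditional: float,  # log p(y | preceding context) from the document LM
    log_p_doc_marginal: float,     # log p(y) from the same LM without context
    weight: float = 1.0,
) -> float:
    pmi = log_p_doc_conditional - log_p_doc_marginal
    return log_p_translation + weight * pmi
```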
This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made great progress recently. In particular, neural metrics fine-tuned on human ratings (e.g., BLEURT or COMET) outperform surface metrics in terms of correlation with human judgments. Our experiments show that combining a neural translation model with a neural reference-based metric, BLEURT, leads to significant improvements in both automatic and human evaluations. This improvement is obtained with translations that differ from classical beam search outputs: these translations have lower likelihood and are less favored by surface metrics like BLEU.
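Sampling-based MBR decoding scores each candidate by its average utility against the other samples, used as pseudo-references, and returns the candidate with the highest expected utility. A minimal sketch, with the utility callable standing in for a learned metric such as BLEURT:

```python
# Minimal sketch of sampling-based Minimum Bayes Risk (MBR) decoding with a
# learned utility. `utility(hyp, ref)` is a placeholder for a metric like
# BLEURT or COMET.
from typing import Callable, List

def mbr_decode(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    def expected_utility(hyp: str) -> float:
        refs = [c for c in candidates if c is not hyp]  # other samples act as pseudo-references
        return sum(utility(hyp, ref) for ref in refs) / max(len(refs), 1)
    return max(candidates, key=expected_utility)
```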
This paper provides an overview of NVIDIA NeMo's neural machine translation systems for the constrained data tracks of the WMT21 News and Biomedical Shared Translation Tasks. Our news task submissions for English-German (En-De) and English-Russian (En-Ru) are built on top of a baseline transformer-based sequence-to-sequence model. Specifically, we use 1) checkpoint averaging, 2) model scaling, 3) data augmentation with backtranslation and knowledge distillation from right-to-left factorized models, 4) fine-tuning on test sets from previous years, 5) model ensembling, 6) shallow fusion decoding with transformer language models, and 7) noisy channel re-ranking. Additionally, our biomedical task submission for English-Russian uses a biomedically biased vocabulary and is trained from scratch on news task data, medically relevant text curated from the news task dataset, and the biomedical data provided by the shared task. Our news system achieves a SacreBLEU score of 39.5 on the WMT'20 En-De test set, outperforming last year's best task submission of 38.8. Our biomedical task Ru-En and En-Ru systems reach BLEU scores of 43.8 and 40.3 respectively on the WMT'20 biomedical task test set, outperforming the previous year's best submissions.
Undirected neural sequence models achieve performance competitive with state-of-the-art directed sequence models that generate monotonically from left to right on machine translation tasks. In this work, we train a policy that learns the generation order for a pretrained, undirected translation model via reinforcement learning. We show that translations decoded by our learned orders achieve higher BLEU scores than outputs decoded from left to right or decoded by the learned order from Mansimov et al. (2019) on the WMT'14 German-English translation task. On examples with a maximum source and target length of 30 from De-En, WMT'16 English-Romanian, and WMT'21 English-Chinese translation tasks, our learned order outperforms heuristic generation orders on four out of six tasks. We then carefully analyze the learned order patterns both qualitatively and quantitatively. We show that our policy generally follows an outer-to-inner order, first predicting the leftmost and rightmost positions and then moving toward the middle, while skipping less important words at the beginning. Furthermore, the policy usually predicts the positions of a single syntactic constituent structure in consecutive steps. We believe our findings can provide more insight into the mechanism of undirected generation models and encourage further research in this direction. Our code is available at https://github.com/jiangyctarheel/undirected-generation
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem that there may exist multiple possible translations of a source sentence, so the reference sentence may be inappropriate for the training when the NAT output is closer to other translations. In response to this problem, we introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output. As we train NAT based on the rephraser output rather than the reference sentence, the rephraser output should fit well with the NAT output and not deviate too far from the reference, which can be quantified as reward functions and optimized by reinforcement learning. Experiments on major WMT benchmarks and NAT baselines show that our approach consistently improves the translation quality of NAT. Specifically, our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.
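The two-part requirement described above (the rephrased target should fit the NAT output while staying close to the reference) can be pictured as a weighted sum of two similarity terms. The sketch below is an illustrative reading of that reward, not the paper's exact formulation; the similarity function and weight are placeholders.

```python
# Hedged sketch of a rephraser reward: balance closeness to the NAT output
# against fidelity to the reference. `similarity` could be a BLEU-like
# sentence similarity; alpha is a hypothetical trade-off weight.
from typing import Callable

def rephraser_reward(
    rephrased: str,
    nat_output: str,
    reference: str,
    similarity: Callable[[str, str], float],
    alpha: float = 0.5,
) -> float:
    return alpha * similarity(rephrased, nat_output) + (1 - alpha) * similarity(rephrased, reference)
```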
Non-autoregressive Transformers (NATs) are a family of text generation models that aim to reduce decoding latency by predicting the whole sentence in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, making NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective to understand existing successes. First, we show that simply training a NAT by maximizing likelihood can lead to an approximation of the marginal distributions while dropping all dependencies between tokens, where the dropped information can be measured by the conditional total correlation of the dataset. Second, we formalize many previous objectives in a unified framework and show that their success can be understood as maximizing the likelihood on a proxy distribution, which reduces this information loss. Empirical studies show that our perspective can explain the phenomena in NAT learning and guide the design of new training methods.
In this paper, we present a new generative model, the Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is applied repeatedly to a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Clean Common Crawl dataset and on a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompting by filling in arbitrary blank patterns in a template.
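The generation loop described above, starting from random tokens and repeatedly applying an improvement operator until the sequence stops changing, can be sketched as follows. The `denoise` callable stands in for the trained model; this is an illustration of the iterative scheme, not SUNDAE's actual training or sampling code.

```python
# Hedged sketch of iterative (non-autoregressive) denoising generation:
# start from random tokens and apply a denoiser until convergence.
import random
from typing import Callable, List

def iterative_denoise(
    denoise: Callable[[List[int]], List[int]],
    vocab_size: int,
    length: int,
    max_steps: int = 10,
) -> List[int]:
    tokens = [random.randrange(vocab_size) for _ in range(length)]  # random start
    for _ in range(max_steps):
        new_tokens = denoise(tokens)
        if new_tokens == tokens:  # converged: the operator no longer changes the sequence
            break
        tokens = new_tokens
    return tokens

# Toy denoiser that nudges every token toward 0.
print(iterative_denoise(lambda ts: [max(t - 1, 0) for t in ts], vocab_size=5, length=6))
```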
The dominant paradigm of neural text generation is left-to-right decoding from autoregressive language models. However, constrained or controllable generation under complex lexical constraints requires foresight to plan ahead over feasible future paths. Drawing inspiration from the A* search algorithm, we propose NeuroLogic A*esque, a decoding algorithm that incorporates heuristic estimates of future cost. We develop lookahead heuristics that are efficient for large-scale language models, making our method a drop-in alternative to common techniques such as beam search and top-k sampling. To enable constrained generation, we build on NeuroLogic decoding (Lu et al., 2021), combining its flexibility in incorporating logical constraints with A*esque estimates of future constraint satisfaction. Our approach outperforms competitive baselines on five generation tasks and achieves new state-of-the-art performance on table-to-text generation, constrained machine translation, and keyword-constrained generation. The improvements are particularly notable on tasks that require complex constraint satisfaction or in few-shot or zero-shot settings. NeuroLogic A*esque illustrates the power of decoding for improving and enabling new capabilities of large-scale language models.
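An A*-like decoding score combines the log-probability accumulated so far with a heuristic estimate of future cost, for example obtained by rolling the model out greedily a few steps ahead. The sketch below shows that combination in the abstract; the rollout interface and weight are hypothetical placeholders, not the paper's lookahead heuristics.

```python
# Hedged sketch of adding a lookahead heuristic to a decoding score, in the
# spirit of A*-style decoding. Interfaces are hypothetical placeholders.
from typing import Callable, List, Tuple

def astar_esque_score(
    prefix_logprob: float,
    rollout_logprob: Callable[[int], float],  # estimated log-prob of the next k greedy steps
    lookahead_steps: int = 3,
    heuristic_weight: float = 1.0,
) -> float:
    return prefix_logprob + heuristic_weight * rollout_logprob(lookahead_steps)

def pick_next(candidates: List[Tuple[str, float, Callable[[int], float]]]) -> str:
    """candidates: (token, prefix log-prob after appending it, rollout estimator)."""
    return max(candidates, key=lambda c: astar_esque_score(c[1], c[2]))[0]
```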
Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as sequence modeling simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Furthermore, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
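"Beam search as a planner" can be pictured as keeping, at each step, the action sequences with the highest predicted cumulative reward rather than the most likely token sequences. The sketch below illustrates that repurposing with a toy proposal function; the model interface is a hypothetical stand-in for a trained sequence model over trajectories.

```python
# Hedged sketch of beam search repurposed as a planner over action sequences:
# expand each beam with candidate actions and keep the sequences with the
# highest predicted cumulative reward. Illustrative only.
from typing import Callable, List, Sequence, Tuple

def beam_plan(
    propose: Callable[[Sequence[int]], List[Tuple[int, float]]],
    horizon: int,
    beam_width: int,
) -> List[int]:
    """propose(prefix) returns (action, predicted_reward) pairs for the next step."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]
    for _ in range(horizon):
        expanded = []
        for actions, total in beams:
            for action, reward in propose(actions):
                expanded.append((actions + [action], total + reward))
        # Keep the action sequences with the highest cumulative predicted reward.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy usage: a fake model that prefers action 1 early and action 0 later.
print(beam_plan(lambda prefix: [(0, 0.1), (1, 0.5 if len(prefix) < 2 else 0.0)],
                horizon=4, beam_width=3))
```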
Minimum Bayes Risk (MBR) decoding emerges as a promising decoding algorithm in Neural Machine Translation. However, MBR performs poorly with label smoothing, which is surprising as label smoothing provides decent improvement with beam search and improves generality in various tasks. In this work, we show that the issue arises from the inconsistency between label smoothing's effects on the token-level and sequence-level distributions. We demonstrate that even though label smoothing causes only a slight change at the token level, the sequence-level distribution is highly skewed. We coin the issue "distributional over-smoothness". To address this issue, we propose a simple and effective method, Distributional Cooling MBR (DC-MBR), which manipulates the entropy of output distributions by tuning down the softmax temperature. We theoretically prove the equivalence between pre-tuning the label smoothing factor and distributional cooling. Experiments on NMT benchmarks validate that distributional cooling improves MBR's efficiency and effectiveness in various settings.
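Tuning down the softmax temperature sharpens (lowers the entropy of) the output distribution: logits are divided by a temperature below 1 before the softmax. A minimal sketch of this "cooling" operation, with illustrative values:

```python
# Minimal sketch of sharpening an output distribution by lowering the softmax
# temperature. Values are illustrative; this is not the paper's code.
import numpy as np

def cooled_softmax(logits: np.ndarray, temperature: float = 0.5) -> np.ndarray:
    z = logits / temperature  # temperature < 1 sharpens the distribution
    z = z - z.max()           # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(cooled_softmax(logits, temperature=1.0))  # regular softmax
print(cooled_softmax(logits, temperature=0.5))  # cooled: lower entropy
```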
Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. The neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of neural machine translation using two models: the RNN Encoder-Decoder and a newly proposed gated recursive convolutional neural network. We show that the neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
Multilingual NMT has become an attractive solution for deploying MT in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering yields inference that is more than twice as fast with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), robustness evaluation, and human evaluation.
Attention-based autoregressive models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Text-To-Speech (TTS) and Neural Machine Translation (NMT), but can be difficult to train. The standard training approach, teacher forcing, guides a model with the reference back-history. During inference, the generated back-history must be used. This mismatch limits the evaluation performance. Attention forcing has been introduced to address the mismatch, guiding the model with the generated back-history and reference attention. While successful in tasks with continuous outputs like TTS, attention forcing faces additional challenges in tasks with discrete outputs like NMT. This paper introduces two extensions of attention forcing to tackle these challenges. (1) Scheduled attention forcing automatically turns attention forcing on and off, which is essential for tasks with discrete outputs. (2) Parallel attention forcing makes training parallel, and is applicable to Transformer-based models. The experiments show that the proposed approaches improve the performance of models based on RNNs and Transformers.