Neural autoregressive sequence models smear probability mass among many possible sequences, including degenerate ones such as empty or repetitive sequences. In this work, we address a specific case in which the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both the model distribution and decoding performance. We use neural machine translation as a testbed and consider three datasets of different sizes. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, as the contribution of the oversmoothing loss is increased, the probability and rank of the end-of-sequence token drop at positions where it should not appear. Third, the proposed regularization affects the outcome of beam search, especially when a large beam is used. The degradation of translation quality (measured in BLEU) with a large beam is significantly reduced at lower oversmoothing rates, but some degradation relative to smaller beam sizes remains. From these observations, we conclude that a high degree of oversmoothing is the main cause behind the degenerate case of overly probable short sequences in neural autoregressive models.
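As a rough illustration of how such an oversmoothing rate could be measured per sentence (the paper's exact definition may differ), the sketch below counts the fraction of non-final reference positions at which the model places more log-probability on the end-of-sequence token than on the reference token; `log_probs`, `targets`, and `eos_id` are placeholder names.

    import torch

    def oversmoothing_rate(log_probs: torch.Tensor, targets: torch.Tensor, eos_id: int) -> float:
        """Fraction of non-final positions where <eos> outscores the reference token.

        log_probs: (T, V) per-position log-probabilities from the decoder
        targets:   (T,)  reference token ids, whose last entry is <eos>
        """
        inner = torch.arange(targets.size(0) - 1)   # exclude the final <eos> position
        eos_lp = log_probs[inner, eos_id]
        ref_lp = log_probs[inner, targets[inner]]
        return (eos_lp > ref_lp).float().mean().item()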
Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference, sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
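For reference, here is a small sketch of the kind of length-normalized beam score with a coverage penalty that the abstract refers to; the exact constants and hyperparameter values (alpha, beta) are tuned on a development set, and the function and argument names are illustrative.

    import math

    def beam_score(log_prob: float, hyp_len: int, attn, alpha: float = 0.6, beta: float = 0.2) -> float:
        """Length-normalized hypothesis score plus a coverage penalty.

        log_prob: total log P(Y|X) of the hypothesis
        hyp_len:  number of target tokens |Y|
        attn:     attn[i][j] = attention weight of target step j on source word i
        """
        length_penalty = ((5.0 + hyp_len) ** alpha) / ((5.0 + 1.0) ** alpha)
        coverage_penalty = beta * sum(
            math.log(min(max(sum(row), 1e-9), 1.0)) for row in attn)
        return log_prob / length_penalty + coverage_penalty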
The noisy channel model has proven particularly effective in neural machine translation (NMT). However, recent approaches such as beam search and rerank (BSR) incur significant computational overhead during inference, making real-world applications infeasible. We aim to build an amortized noisy channel NMT model such that greedily decoding from it produces translations that maximize the same reward as translations generated with BSR. We attempt three approaches: knowledge distillation, one-step-deviation imitation learning, and Q-learning. The first approach obtains the noisy channel signal from a pseudo-corpus, while the latter two aim to optimize directly toward the noisy channel MT reward. All three approaches speed up inference by one to two orders of magnitude. For all of them, the generated translations fail to achieve rewards comparable to BSR, but the translation quality, as approximated by BLEU, is similar to that of translations produced by BSR.
In almost all text generation applications, word sequences are constructed in a left-to-right (L2R) or right-to-left (R2L) manner, as natural language sentences are written either L2R or R2L. However, we find that the natural language written order is not essential for text generation. In this paper, we propose Spiral Language Modeling (SLM), a general approach that enables one to construct natural language sentences beyond the L2R and R2L orders. SLM allows one to start from an arbitrary token inside the resulting text and expand the rest of the tokens around the selected one. It makes the decoding order a new optimization objective in addition to the language model perplexity, which further improves the diversity and quality of the generated text. Furthermore, SLM makes it possible to manipulate the text construction process by selecting a proper starting token. SLM also introduces generation ordering as an additional regularization that improves model robustness in low-resource scenarios. Experiments on 8 widely studied neural machine translation (NMT) tasks show that SLM achieves up to 4.7 BLEU improvement compared to the conventional L2R decoding approach.
After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, which inherently makes it difficult to estimate the right probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a crude heuristic raises the question: Rather than wasting precious compute resources and model capacity for learning this strategy at early training stages, can we initialise our models with this behaviour? Here, we show that we can effectively endow our model with a separate module that reflects unigram frequency statistics as prior knowledge. Standard neural language generation architectures offer a natural opportunity for implementing this idea: by initialising the bias term in a model's final linear layer with the log-unigram distribution. Experiments in neural machine translation demonstrate that this simple technique: (i) improves learning efficiency; (ii) achieves better overall performance; and (iii) appears to disentangle strong frequency effects, encouraging the model to specialise in non-frequency-related aspects of language.
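A minimal PyTorch sketch of the initialization described above, assuming a standard projection layer whose bias is set to the log-unigram distribution of the target training corpus; the function and argument names are illustrative.

    import torch
    import torch.nn as nn

    def init_bias_with_log_unigram(output_layer: nn.Linear, token_counts, smoothing: float = 1.0) -> None:
        """Set the final projection's bias to log-unigram probabilities.

        output_layer: last Linear layer mapping decoder states to vocabulary logits
        token_counts: per-token counts over the target training data, aligned with the vocabulary
        """
        counts = torch.tensor(token_counts, dtype=torch.float) + smoothing  # smooth to avoid log(0)
        with torch.no_grad():
            output_layer.bias.copy_(torch.log(counts / counts.sum()))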
Although many context-aware neural machine translation models have been proposed to incorporate context in translation, most of them are trained end-to-end on parallel documents aligned at the sentence level. Because only a few domains (and language pairs) have such document-level parallel data, accurate context-aware translation cannot be performed in most domains. We therefore present a simple method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder. Our context-aware decoder is built only upon a sentence-level parallel corpus and monolingual corpora; thus no document-level parallel data is needed. From a theoretical viewpoint, the core of this work is a novel representation of contextual information using the pointwise mutual information between the context and the current sentence. We demonstrate the effectiveness of our approach on three language pairs, English to French, English to Russian, and Japanese to English, through both automatic evaluation and contrastive tests for context-aware translation.
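In LaTeX form, the pointwise mutual information signal mentioned in the abstract can be written as below; the combination with the sentence-level translation model shown in the second equation is only one plausible instantiation (the interpolation weight \lambda and the exact formulation used in the paper are assumptions).

    % PMI between the document context c and the next target token y_t,
    % estimated with a document-level language model:
    \mathrm{PMI}(y_t; c) = \log p_{\mathrm{LM}}(y_t \mid c, y_{<t}) - \log p_{\mathrm{LM}}(y_t \mid y_{<t})
    % One plausible way to combine it with a sentence-level translation model:
    s(y_t) = \log p_{\mathrm{TM}}(y_t \mid x, y_{<t})
           + \lambda \left[ \log p_{\mathrm{LM}}(y_t \mid c, y_{<t}) - \log p_{\mathrm{LM}}(y_t \mid y_{<t}) \right]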
In this paper, we present SUNDAE, a Step-unrolled Denoising Autoencoder, a new generative model of text that does not rely on autoregression. Similarly to denoising diffusion techniques, SUNDAE is applied repeatedly to a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results among non-autoregressive methods on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on a colossal cleaned Common Crawl dataset and on a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.
We introduce Bi-SimCut: a simple but effective training strategy to boost neural machine translation (NMT) performance. It consists of two procedures: bidirectional pretraining and unidirectional finetuning. Both procedures utilize SimCut, a simple regularization method that forces consistency between the output distributions of the original and the cutoff sentence pairs. Without leveraging extra datasets via back-translation or integrating large-scale pretrained models, Bi-SimCut achieves strong translation performance across five translation benchmarks (data sizes ranging from 160K to 20.2M): BLEU scores of 31.16 for en->de and 38.37 for de->en on the IWSLT14 dataset, 30.78 for en->de and 35.15 for de->en on the WMT14 dataset, and 27.17 for zh->en on the WMT17 dataset. SimCut is not a new method, but a version of Cutoff (Shen et al., 2020) simplified and adapted for NMT, and it could be considered a perturbation-based method. Given the universality and simplicity of SimCut and Bi-SimCut, we believe they can serve as strong baselines for future NMT research.
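The consistency term can be sketched generically as a symmetric KL divergence between the two forward passes (original and cutoff-perturbed inputs); this is only an illustrative PyTorch sketch, not the exact SimCut loss, whose perturbation scheme and weighting follow Cutoff (Shen et al., 2020) and the paper.

    import torch.nn.functional as F

    def consistency_loss(logits_orig, logits_cut):
        """Symmetric KL between output distributions of the original and perturbed passes."""
        log_p = F.log_softmax(logits_orig, dim=-1)
        log_q = F.log_softmax(logits_cut, dim=-1)
        kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
        kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
        # The total training loss would add this term, scaled by a tuned weight,
        # to the cross-entropy losses of both passes.
        return 0.5 * (kl_pq + kl_qp)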
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT), and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (because it uses the mean absolute deviation in the importance-weighting computation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends on two variance-reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples, and (2) a new robust importance-weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned with the MAD algorithm perform well with both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
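A toy sketch of the first variance-reduction idea, centering rewards within the group of candidates sampled for one source sentence so that each source yields both positive and negative learning signals; the exact normalization used by MAD, and its mean-absolute-deviation importance weights, are defined in the paper and not reproduced here.

    def center_rewards_per_source(rewards):
        """Subtract the per-source mean reward from each sampled candidate's reward."""
        baseline = sum(rewards) / len(rewards)
        return [r - baseline for r in rewards]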
Large pretrained language models generate fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such language models: generating text that satisfies user-defined constraints, while maintaining fluency and the model's performance in a downstream task. We propose MuCoLa -- a sampling procedure that combines the log-likelihood of the language model with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of the energy function. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.
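A generic sketch of one Langevin-dynamics update of the kind the abstract describes, operating on a continuous relaxation of the output sequence (e.g., token embeddings); the energy function, step-size schedule, and projection back to the embedding space used by MuCoLa are detailed in the paper, and the names here are illustrative.

    import torch

    def langevin_step(y: torch.Tensor, energy_fn, step_size: float = 0.1) -> torch.Tensor:
        """One update of y along the negative energy gradient plus Gaussian noise."""
        y = y.detach().requires_grad_(True)
        energy = energy_fn(y)            # scalar: LM negative log-likelihood + constraint terms
        grad, = torch.autograd.grad(energy, y)
        noise = torch.randn_like(y)
        return (y - step_size * grad + (2.0 * step_size) ** 0.5 * noise).detach()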
Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of the context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm to approximate this objective. The algorithm, called greedy rationalization, applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context. This condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the combinatorial objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are most similar to human rationales.
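A model-agnostic sketch of the greedy procedure described above; `predict_fn` is an assumed helper that returns the model's output distribution given an (incomplete) subset of context tokens, and `target` is the token predicted from the full context.

    def greedy_rationale(predict_fn, context_tokens, target):
        """Greedily grow the context subset until it yields the full-context prediction."""
        rationale, remaining = [], list(range(len(context_tokens)))
        while remaining:
            # Add the token whose inclusion gives the target the highest probability.
            best = max(remaining, key=lambda i: predict_fn(
                [context_tokens[j] for j in sorted(rationale + [i])])[target])
            rationale.append(best)
            remaining.remove(best)
            dist = predict_fn([context_tokens[j] for j in sorted(rationale)])
            if max(range(len(dist)), key=dist.__getitem__) == target:
                break  # the subset now predicts the same output as the full sequence
        return sorted(rationale)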
This work applies Minimum Bayes Risk (MBR) decoding to optimize translation quality with diverse automated metrics. Automatic metrics in machine translation have recently made tremendous progress. In particular, neural metrics fine-tuned on human ratings (e.g., BLEURT or COMET) outperform surface metrics in terms of correlation with human judgments. Our experiments show that combining a neural translation model with a neural reference-based metric, BLEURT, yields significant improvements in both automatic and human evaluations. This improvement is obtained with translations that differ from the classical beam-search output: these translations have much lower likelihood and are less favored by surface metrics such as BLEU.
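A bare-bones sketch of MBR decoding as used here: collect candidate translations, score every candidate against the others with a reference-based utility (a neural metric such as BLEURT in this work), and return the candidate with the highest expected utility. The function names are illustrative.

    def mbr_decode(candidates, utility):
        """Return the candidate maximizing expected utility over the other candidates."""
        def expected_utility(hyp):
            refs = [c for c in candidates if c is not hyp]
            return sum(utility(hyp, ref) for ref in refs) / max(len(refs), 1)
        return max(candidates, key=expected_utility)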
Autoregressive generative models are commonly used, especially for tasks involving sequential data. However, they are plagued by a number of inherent flaws arising from the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), which severely limit their ability to model distributions properly. In this paper, we propose a unique method, E-ARM, for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, we can make the autoregressive model itself an energy-based model for measuring the likelihood of the input, without introducing any additional parameters. Furthermore, we show that E-ARM can be trained efficiently and is able to alleviate the exposure bias problem and increase the temporal coherence of autoregressive generative models. Extensive empirical results on benchmarks covering language modeling, neural machine translation, and image generation demonstrate the effectiveness of the proposed approach.
Multilingual NMT has become an attractive solution for MT deployment in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several ways to make multilingual NMT faster at inference without degrading its quality. We experiment with several "light decoder" architectures in two 20-language multi-parallel settings: small-scale on TED Talks and large-scale on ParaCrawl. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering leads to more than twice faster inference with no loss in translation quality. We validate our findings with BLEU and chrF (on 380 language pairs), as well as robustness and human evaluations.
Recently, contrastive learning has attracted increasing interest in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal which is crucial to generation tasks that always rely on auto-regressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CoNT. CoNT addresses bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects -- the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding. We validate CoNT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation and commonsense generation. Experimental results show that CoNT clearly outperforms the conventional training framework on all ten benchmarks with a convincing margin. In particular, CoNT surpasses the previously most competitive contrastive learning method for text generation by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization, respectively. It achieves new state-of-the-art results on summarization, code comment generation (without external data) and data-to-text generation.
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.
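A short usage sketch with the SentencePiece library, which implements both the unigram language model and the on-the-fly segmentation sampling described above; the file names and the sampling hyperparameters (alpha, nbest_size) are illustrative choices, not prescriptions.

    import sentencepiece as spm

    # Train a unigram-LM subword model, then sample a different segmentation
    # of the same sentence at each call (subword regularization).
    spm.SentencePieceTrainer.train(
        input="train.txt", model_prefix="unigram", vocab_size=8000, model_type="unigram")
    sp = spm.SentencePieceProcessor(model_file="unigram.model")

    print(sp.encode("subword regularization", out_type=str))          # best segmentation
    print(sp.encode("subword regularization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))  # sampled segmentation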
Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.
Recently, there has been a surge of research in multimodal machine translation (MMT), where additional modalities such as images are used to improve the translation quality of text-only systems. A particular use for such multimodal systems is the task of simultaneous machine translation, where visual context has been shown to complement the partial information provided by the source sentence, especially in the early phases of translation. In this paper, we propose the first Transformer-based simultaneous MMT architecture, which has not previously been explored in the field. Additionally, we extend this model with an auxiliary supervision signal that guides its visual attention mechanism using labeled phrase-region alignments. We perform comprehensive experiments on three language directions and conduct thorough quantitative and qualitative analyses using both automatic metrics and manual inspection. Our results show that (i) supervised visual attention consistently improves the translation quality of MMT models, and (ii) fine-tuning the MMT with the supervision loss enabled leads to better performance than training the MMT from scratch. Compared to the state-of-the-art, our proposed model achieves improvements of up to 2.3 BLEU and 3.5 METEOR points.