Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L 0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU. 1
translated by 谷歌翻译
Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art natural language processing (NLP) models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention 1 .1 Code to replicate our experiments is provided at https://github.com/pmichel31415/ are-16-heads-really-better-than-1
translated by 谷歌翻译
Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention. 1 Code will be released at https://github.com/ clarkkev/attention-analysis.2 We use the English base-sized model.
translated by 谷歌翻译
Compared to conventional bilingual translation systems, massively multilingual machine translation is appealing because a single model can translate into multiple languages and benefit from knowledge transfer for low resource languages. On the other hand, massively multilingual models suffer from the curse of multilinguality, unless scaling their size massively, which increases their training and inference costs. Sparse Mixture-of-Experts models are a way to drastically increase model capacity without the need for a proportional amount of computing. The recently released NLLB-200 is an example of such a model. It covers 202 languages but requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that allows the removal of up to 80\% of experts with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics allow to identify language-specific experts and prune non-relevant experts for a given language pair.
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.† Work performed while at Google Brain.‡ Work performed while at Google Research.
translated by 谷歌翻译
我用Hunglish2语料库训练神经电脑翻译任务的模型。这项工作的主要贡献在培训NMT模型期间评估不同的数据增强方法。我提出了5种不同的增强方法,这些方法是结构感知的,这意味着而不是随机选择用于消隐或替换的单词,句子的依赖树用作增强的基础。我首先关于神经网络的详细文献综述,顺序建模,神经机翻译,依赖解析和数据增强。经过详细的探索性数据分析和Hunglish2语料库的预处理之后,我使用所提出的数据增强技术进行实验。匈牙利语的最佳型号达到了33.9的BLEU得分,而英国匈牙利最好的模型达到了28.6的BLEU得分。
translated by 谷歌翻译
机器翻译历史上的重要突破之一是变压器模型的发展。不仅对于各种翻译任务,而且对于大多数其他NLP任务都是革命性的。在本文中,我们针对一个基于变压器的系统,该系统能够将德语用源句子转换为其英语的对应目标句子。我们对WMT'13数据集的新闻评论德语 - 英语并行句子进行实验。此外,我们研究了来自IWSLT'16数据集的培训中包含其他通用域数据以改善变压器模型性能的效果。我们发现,在培训中包括IWSLT'16数据集,有助于在WMT'13数据集的测试集中获得2个BLEU得分点。引入定性分析以分析通用域数据的使用如何有助于提高产生的翻译句子的质量。
translated by 谷歌翻译
最近经过彻底调查了变压器多头自我关注机制。一方面,研究人员对理解为什么以及变压器如何工作。另一方面,他们提出了新的注意增强方法,使变压器更准确,高效和可解释。在本文中,我们在循环管道中协同促使这两条研究线,首先找到了重要的任务特定的注意模式。然后应用那些模式,不仅应用于原始模型,还应用于较小的模型,作为人类引导的知识蒸馏过程。在提取摘要任务的情况下,在案例研究中对我们的管道的好处。在受欢迎的Bertsum模型中找到三种有意义的关注模式之后,实验表明,当我们注入这种模式时,原始和较小模型都显示出性能的改进,并且可以说是可争议的解释性。
translated by 谷歌翻译
语言模型偏见已成为NLP社区的重要研究领域。提出了许多偏见技术,但偏见消融仍然是一个未解决的问题。我们展示了一个新颖的框架,用于通过运动修剪来检查预训练的基于变压器的语言模型的偏见。给定模型和一个偏见的目标,我们的框架找到了与原始模型相比,偏差少的模型子集。我们通过对模型进行修剪来实现我们的框架,同时将其按照歧义目标进行微调。优化仅是修剪分数 - 参数以及模型的权重,该参数充当门。我们尝试修剪注意力头,这是变形金刚的重要组成部分:我们修剪正方形块,并建立了一种修剪整个头部的新方法。最后,我们使用性别偏见证明了我们的框架的用法,并且根据我们的发现,我们提出了对现有辩论方法的改进。此外,我们重新发现了偏见 - 绩效权衡:模型执行越好,其包含的偏见就越多。
translated by 谷歌翻译
In Neural Machine Translation (NMT), each token prediction is conditioned on the source sentence and the target prefix (what has been previously translated at a decoding step). However, previous work on interpretability in NMT has mainly focused solely on source sentence tokens' attributions. Therefore, we lack a full understanding of the influences of every input token (source sentence and target prefix) in the model predictions. In this work, we propose an interpretability method that tracks input tokens' attributions for both contexts. Our method, which can be extended to any encoder-decoder Transformer-based model, allows us to better comprehend the inner workings of current NMT models. We apply the proposed method to both bilingual and multilingual Transformers and present insights into their behaviour.
translated by 谷歌翻译
语法纠错(GEC)是检测和纠正句子中语法错误的任务。最近,神经机翻译系统已成为这项任务的流行方法。然而,这些方法缺乏使用句法知识,这在语法错误的校正中起着重要作用。在这项工作中,我们提出了一种语法引导的GEC模型(SG-GEC),它采用图表注意机制来利用依赖树的句法知识。考虑到语法不正确的源句子的依赖性树可以提供不正确的语法知识,我们提出了一个依赖树修正任务来处理它。结合数据增强方法,我们的模型在不使用任何大型预先训练模型的情况下实现了强大的性能。我们评估我们在GEC任务的公共基准上的模型,实现了竞争结果。
translated by 谷歌翻译
已被证明在改善神经电机翻译(NMT)系统方面有效的深度编码器,但是当编码器层数超过18时,它达到了翻译质量的上限。更糟糕的是,更深的网络消耗了很多内存,使其无法实现有效地训练。在本文中,我们呈现了共生网络,其包括完整的网络作为共生主网络(M-Net)和另一个具有相同结构的共享子网,但层数较少为共生子网(S-Net)。我们在变压器深度(M-N)架构上采用共生网络,并在NMT中定义M-Net和S-Net之间的特定正则化损耗$ \ mathcal {l} _ {\ tau} $。我们对共生网络进行联合培训,并旨在提高M净性能。我们拟议的培训策略在CMT'14 en-> De,De-> EN和EN-> FR任务的经典培训下将变压器深(12-6)改善了0.61,0.49和0.69 BLEU。此外,我们的变压器深(12-6)甚至优于经典变压器深度(18-6)。
translated by 谷歌翻译
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. ( 2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graphlabeled inputs.
translated by 谷歌翻译
字符不传达意义,但字符序列会。我们提出了一种无监督的分布方法,以便在一系列字符中学习抽象意义的单元。该模型而不是分割序列,而不是分割序列中的“对象”的连续表示,使用最近提出的架构在称为插槽注意中的图像中的对象发现。我们在不同语言中培训我们的模型,并用探测分类评估所获得的表示的质量。我们的实验表明,我们的单位在更高的抽象层面捕捉意义的能力。
translated by 谷歌翻译
在完全共享所有语言参数的多语言神经机器翻译模型中,通常使用人工语言令牌来指导转换为所需的目标语言。但是,最近的研究表明,预备语言代币有时无法将多语言神经机器翻译模型导航到正确的翻译方向,尤其是在零弹性翻译上。为了减轻此问题,我们提出了两种方法:语言嵌入实施例和语言意识的多头关注,以学习信息丰富的语言表示,以将翻译转换为正确的方向。前者体现了沿着从源到目标的信息流中的不同关键切换点的语言,旨在放大翻译方向引导信号。后者利用矩阵而不是向量来表示连续空间中的语言。矩阵分为多个头,以学习多个子空间中的语言表示。在两个数据集上进行大规模多语言神经机器翻译的实验结果表明,语言意识到的多头注意力受益于监督和零弹性翻译,并大大减轻了脱靶翻译问题。进一步的语言类型学预测实验表明,通过我们的方法学到的基于基质的语言表示能够捕获丰富的语言类型学特征。
translated by 谷歌翻译
最近,多模式机器翻译(MMT)的研究激增,其中其他模式(例如图像)用于提高文本系统的翻译质量。这种多模式系统的特殊用途是同时机器翻译的任务,在该任务中,已证明视觉上下文可以补充源句子提供的部分信息,尤其是在翻译的早期阶段。在本文中,我们提出了第一个基于变压器的同时MMT体系结构,该体系结构以前尚未在现场探索过。此外,我们使用辅助监督信号扩展了该模型,该信号使用标记的短语区域比对来指导其视觉注意机制。我们在三个语言方向上进行全面的实验,并使用自动指标和手动检查进行彻底的定量和定性分析。我们的结果表明,(i)监督视觉注意力一致地提高了MMT模型的翻译质量,并且(ii)通过监督损失对MMT进行微调,比从SCRATCH训练MMT的MMT可以提高性能。与最先进的模型相比,我们提出的模型可实现多达2.3 bleu和3.5 Meteor点的改善。
translated by 谷歌翻译
Syntax is a latent hierarchical structure which underpins the robust and compositional nature of human language. An active line of inquiry is whether large pretrained language models (LLMs) are able to acquire syntax by training on text alone; understanding a model's syntactic capabilities is essential to understanding how it processes and makes use of language. In this paper, we propose a new method, SSUD, which allows for the induction of syntactic structures without supervision from gold-standard parses. Instead, we seek to define formalism-agnostic, model-intrinsic syntactic parses by using a property of syntactic relations: syntactic substitutability. We demonstrate both quantitative and qualitative gains on dependency parsing tasks using SSUD, and induce syntactic structures which we hope provide clarity into LLMs and linguistic representations, alike.
translated by 谷歌翻译
在本文中,我们试图通过引入深度学习模型的句法归纳偏见来建立两所学校之间的联系。我们提出了两个归纳偏见的家族,一个家庭用于选区结构,另一个用于依赖性结构。选区归纳偏见鼓励深度学习模型使用不同的单位(或神经元)分别处理长期和短期信息。这种分离为深度学习模型提供了一种方法,可以从顺序输入中构建潜在的层次表示形式,即更高级别的表示由高级表示形式组成,并且可以分解为一系列低级表示。例如,在不了解地面实际结构的情况下,我们提出的模型学会通过根据其句法结构组成变量和运算符的表示来处理逻辑表达。另一方面,依赖归纳偏置鼓励模型在输入序列中找到实体之间的潜在关系。对于自然语言,潜在关系通常被建模为一个定向依赖图,其中一个单词恰好具有一个父节点和零或几个孩子的节点。将此约束应用于类似变压器的模型之后,我们发现该模型能够诱导接近人类专家注释的有向图,并且在不同任务上也优于标准变压器模型。我们认为,这些实验结果为深度学习模型的未来发展展示了一个有趣的选择。
translated by 谷歌翻译
多语种NMT已成为MT在生产中部署的有吸引力的解决方案。但是要匹配双语质量,它符合较大且较慢的型号。在这项工作中,我们考虑了几种方法在推理时更快地使多语言NMT变得更快而不会降低其质量。我们在两种20语言多平行设置中尝试几个“光解码器”架构:在TED会谈中小规模和帕拉克曲线上的大规模。我们的实验表明,将具有词汇过滤的浅解码器组合在于,在翻译质量下没有损失的速度超过两倍。我们用Bleu和Chrf(380语言对),鲁棒性评估和人类评估验证了我们的研究结果。
translated by 谷歌翻译