The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.† Work performed while at Google Brain.‡ Work performed while at Google Research.
translated by 谷歌翻译
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. ( 2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graphlabeled inputs.
translated by 谷歌翻译
机器翻译历史上的重要突破之一是变压器模型的发展。不仅对于各种翻译任务,而且对于大多数其他NLP任务都是革命性的。在本文中,我们针对一个基于变压器的系统,该系统能够将德语用源句子转换为其英语的对应目标句子。我们对WMT'13数据集的新闻评论德语 - 英语并行句子进行实验。此外,我们研究了来自IWSLT'16数据集的培训中包含其他通用域数据以改善变压器模型性能的效果。我们发现,在培训中包括IWSLT'16数据集,有助于在WMT'13数据集的测试集中获得2个BLEU得分点。引入定性分析以分析通用域数据的使用如何有助于提高产生的翻译句子的质量。
translated by 谷歌翻译
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. 1 Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
translated by 谷歌翻译
Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset. * Equal contribution, alphabetically by last name. † Work performed while at Google Brain.
translated by 谷歌翻译
我用Hunglish2语料库训练神经电脑翻译任务的模型。这项工作的主要贡献在培训NMT模型期间评估不同的数据增强方法。我提出了5种不同的增强方法,这些方法是结构感知的,这意味着而不是随机选择用于消隐或替换的单词,句子的依赖树用作增强的基础。我首先关于神经网络的详细文献综述,顺序建模,神经机翻译,依赖解析和数据增强。经过详细的探索性数据分析和Hunglish2语料库的预处理之后,我使用所提出的数据增强技术进行实验。匈牙利语的最佳型号达到了33.9的BLEU得分,而英国匈牙利最好的模型达到了28.6的BLEU得分。
translated by 谷歌翻译
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L 2 ) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
translated by 谷歌翻译
变压器注意机制中的设计选择,包括弱电感偏置和二次计算复杂性,限制了其用于建模长序列的应用。在本文中,我们介绍了一个简单的,理论上的,单头的门控注意机制,配备了(指数)移动平均线,以将局部依赖性的电感偏置纳入位置 - 敏锐的注意机制中。我们进一步提出了一个具有线性时间和空间复杂性的大型变体,但通过将整个序列分为固定长度的多个块,仅产生最小的质量损失。对广泛的序列建模基准测试的广泛实验,包括远距离竞技场,神经机器翻译,自动回归语言建模以及图像和语音分类,表明,巨人比其他序列模型取得了重大改进,包括变种物的变体和最新的变体模型状态空间模型。
translated by 谷歌翻译
在从训练的数据集中学习后,AI Chatbot提供了令人印象深刻的响应。在这十年中,大多数研究工作都表现出深层神经模型优于任何其他模型。 RNN模型定期用于确定序列相关的问题,如问题和IT答案。这种方法熟悉每个人都是SEQ2SEQ学习。在SEQ2SEQ模型机制中,它具有编码器和解码器。编码器嵌入任何输入序列,以及解码器嵌入输出序列。为了加强SEQ2SEQ模型性能,请将注意力添加到编码器和解码器中。之后,变压器模型已经将其自身作为高性能模型引入,具有多种关注机制,用于解决与序列相关的困境。该模型与基于RNN的模型相比减少了训练时间,并且还实现了序列转换的最先进的性能。在这项研究中,我们基于孟加拉普通知识问题答案(QA)数据集,应用了孟加拉一般知识聊天聊天的变压器模型。它在应用的QA数据上得分为85.0 BLEU。要检查变压器模型性能的比较,我们将注意到SEQ2SEQ模型,请注意我们的数据集得分23.5 BLEU。
translated by 谷歌翻译
随着自然语言处理领域的最新发展,在使用不同架构的神经机翻译中的使用情况上升了。变压器架构用于实现最先进的准确性,但它们是训练的非常昂贵的昂贵。每个人都不能拥有由高端GPU和其他资源组成的等待。我们在低计算资源上培训我们的模型,并调查结果。正如预期的那样,变形金刚表现出其他架构,但结果有一些令人惊讶的结果。由更多编码器和解码器组成的变形金刚需要花更多的时间来训练,但有更少的BLEU分数。LSTM在实验中表现良好,比较少花时间训练而不是变压器,适合在具有时间限制的情况下使用。
translated by 谷歌翻译
由于其二次复杂性,是变压器中的关注模块,其是变压器中的重要组件不能高效地扩展到长序列。许多工作侧重于近似于尺寸的圆点 - 指数的软MAX功能,导致分二次甚至线性复杂性变压器架构。但是,我们表明这些方法不能应用于超出点的指数样式的更强大的注意模块,例如,具有相对位置编码(RPE)的变压器。由于在许多最先进的模型中,相对位置编码被用作默认,设计可以包含RPE的高效变压器是吸引人的。在本文中,我们提出了一种新颖的方法来加速对RPE的转化仪的关注计算在核心化的关注之上。基于观察到相对位置编码形成Toeplitz矩阵,我们数在数学上表明,可以使用快速傅里叶变换(FFT)有效地计算具有RPE的核化注意。使用FFT,我们的方法实现$ \ mathcal {o}(n \ log n)$时间复杂性。有趣的是,我们进一步证明使用相对位置编码适当地可以减轻香草群关注的培训不稳定问题。在广泛的任务上,我们经验证明我们的模型可以从头开始培训,没有任何优化问题。学习模型比许多高效的变压器变体更好地执行,并且在长序列制度中比标准变压器更快。
translated by 谷歌翻译
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M 2 -a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low-and high-level features. Experimentally, we investigate the performance of the M 2 Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performances when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly
translated by 谷歌翻译
事实证明,构象异构体在许多语音处理任务中都是有效的。它结合了使用卷积和使用自我注意的全球依赖性提取本地依赖的好处。受此启发,我们提出了一个更灵活,可解释和可自定义的编码器替代方案,分支机构,并在端到端语音处理中对各种远程依赖关系进行建模。在每个编码器层中,一个分支都采用自我注意事项或其变体来捕获远程依赖性,而另一个分支则利用带有卷积门控(CGMLP)的MLP模块来提取局部关系。我们对几种语音识别和口语理解基准进行实验。结果表明,我们的模型优于变压器和CGMLP。它还与构象异构体获得的最先进结果相匹配。此外,由于两分支结构,我们展示了减少计算的各种策略,包括在单个训练有素的模型中具有可变的推理复杂性的能力。合并分支的权重表明如何在不同层中使用本地和全球依赖性,从而使模型设计受益。
translated by 谷歌翻译
变压器注意机制的二次计算和内存复杂性限制了对长序列建模的可扩展性。在本文中,我们提出了Luna,一种线性统一嵌套关注机制,使Softmax注意力具有两个嵌套线性关注功能,仅产生线性(与二次)的时间和空间复杂度相反。具体地,通过第一注意功能,LUNA将输入序列包装成固定长度的序列。然后,使用第二关注功能未包装包装序列。与更传统的关注机制相比,LUNA引入具有固定长度的附加序列作为输入和额外的相应输出,允许LUNA线性地进行关注操作,同时还存储足够的上下文信息。我们对三个序列建模任务的基准进行了广泛的评估:长上下文序列建模,神经机平移和大型预磨损的屏蔽语言建模。竞争甚至更好的实验结果表明了Luna的有效性和效率与各种各样相比
translated by 谷歌翻译
Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey before. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance of expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic-loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and micro-averaged precision-recall area under curve of 0.87.
translated by 谷歌翻译
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference -sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
translated by 谷歌翻译
最近,基于注意的编码器 - 解码器(AED)模型对多个任务的端到端自动语音识别(ASR)显示了高性能。在此类模型中解决了过度控制,本文介绍了轻松关注的概念,这是一种简单地逐渐注入对训练期间对编码器 - 解码器注意重量的统一分配,其易于用两行代码实现。我们调查轻松关注跨不同AED模型架构和两个突出的ASR任务,华尔街日志(WSJ)和LibRisPeech的影响。我们发现,在用外部语言模型解码时,随着宽松的注意力训练的变压器始终如一地始终如一地遵循标准基线模型。在WSJ中,我们为基于变压器的端到端语音识别设置了一个新的基准,以3.65%的单词错误率,最优于13.1%的相对状态,同时仅引入单个HyperParameter。
translated by 谷歌翻译
基于全注意力的变压器体系结构的强大建模能力通常会导致过度拟合,并且 - 对于自然语言处理任务,导致自动回归变压器解码器中隐式学习的内部语言模型,使外部语言模型的集成变得复杂。在本文中,我们探索了放松的注意力,对注意力的重量进行了简单易于实现的平滑平滑,从编码器。其次,我们表明它自然支持外部语言模型的整合,因为它通过放松解码器中的交叉注意来抑制隐式学习的内部语言模型。我们证明了在几项任务中放松注意力的好处,并与最近的基准方法相结合,并明显改善。具体而言,我们超过了最大的最大公共唇部阅读LRS3基准的26.90%单词错误率的先前最新性能,单词错误率为26.31%,并且我们达到了最佳表现的BLEU分数37.67在IWSLT14(de $ \ rightarrow $ en)的机器翻译任务没有外部语言模型,几乎没有其他模型参数。代码和模型将公开可用。
translated by 谷歌翻译
Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L 0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU. 1
translated by 谷歌翻译
多头注意力是最先进的变压器背后的推动力,它在各种自然语言处理(NLP)和计算机视觉任务中实现了出色的性能。已经观察到,对于许多应用,这些注意力头会学习冗余嵌入,并且大多数可以在不降低模型性能的情况下去除。受到这一观察的启发,我们提出了变压器的混合物(变压器-MGK)的混合物,这是一种新型的变压器架构,用每个头部的钥匙混合了变压器中的冗余头部。这些键的混合物遵循高斯混合模型,并使每个注意力头有效地集中在输入序列的不同部分上。与传统的变压器对应物相比,变压器-MGK会加速训练和推理,具有较少的参数,并且需要更少的拖船来计算,同时实现跨任务的可比性或更高的准确性。 Transformer-MGK也可以轻松扩展到线性注意力。我们从经验上证明了在一系列实用应用中变形金属MGK的优势,包括语言建模和涉及非常长序列的任务。在Wikitext-103和远程竞技场基准中,具有4个头部的变压器MGK具有与基线变压器具有8个头的可比性或更好的性能。
translated by 谷歌翻译