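Transformers struggle when attending to long contexts, since the amount of computation grows with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length. Thus, it is able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the $\infty$-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show the benefits of unbounded long-term memories.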
translated by Google Translate
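Transformer-based models have shown their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer). Memory allows the model to store and process local and global information, and to pass information between segments of a long sequence with the help of recurrence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The Transformer is then trained to control both memory operations and the processing of sequence representations. Experimental results show that our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We show that adding memory tokens to Tr-XL is able to improve its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.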
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories--local, long-term, and external memory--at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
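We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences. Our model outperforms a long-range Transformer-XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source.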
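Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose the Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain comparable or better performance than baseline transformers with 8 heads.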
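Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.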
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
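The reversible residual idea can be illustrated with a minimal sketch. It assumes the common two-stream formulation y1 = x1 + F(x2), y2 = x2 + G(y1); the functions `f` and `g` below are hypothetical stand-ins for the attention and feed-forward sublayers, not the Reformer implementation. The point is only that the inputs can be recomputed from the outputs, so intermediate activations need not be stored per layer.

```python
import math

def f(x):
    # Hypothetical stand-in for an attention sublayer.
    return [math.tanh(v) for v in x]

def g(x):
    # Hypothetical stand-in for a feed-forward sublayer.
    return [0.5 * v * v for v in x]

def reversible_forward(x1, x2):
    # y1 = x1 + F(x2), y2 = x2 + G(y1)
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def reversible_inverse(y1, y2):
    # Undo the two additions in reverse order to recover the inputs.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
```

Here `r1` and `r2` match `x1` and `x2` up to floating-point rounding, which is what lets a backward pass rebuild activations layer by layer instead of keeping all of them in memory.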
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research.
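Inspired by the notion that "to copy is easier than to memorize", in this work we introduce GNN-LM, which extends the vanilla neural language model (LM) by allowing it to reference similar contexts in the entire training corpus. We build a directed heterogeneous graph between an input context and its semantically related neighbors selected from the training corpus, where nodes are tokens in the input context and the retrieved neighbor contexts, and edges represent connections between nodes. Graph neural networks (GNNs) are constructed on the graph to aggregate information from similar contexts to decode the token. This learning paradigm provides direct access to the reference contexts and helps improve a model's generalization ability. We conduct comprehensive experiments to validate the effectiveness of GNN-LM: GNN-LM achieves a new state-of-the-art perplexity of 14.8 on WikiText-103 (a 4.5 point improvement over its vanilla LM counterpart) and shows substantial improvement on the One Billion Word and Enwiki8 datasets against strong baselines. In-depth ablation studies are performed to understand the mechanics of GNN-LM. The code can be found at https://github.com/shannonai/gnn-lm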
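The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of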
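Transformer-based pre-trained models have advanced considerably in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside the Transformer may not be necessary, and both convolutional neural networks and multi-layer perceptron based models have also been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation decoupled from other tokens. The original model performs well in domain-specific text classification under supervised training; however, its potential for learning transferable knowledge in a self-supervised way has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks, for both English and Chinese. As for model efficiency, instead of the quadratic complexity of Transformer-based models, our model has linear complexity and performs more efficiently during inference. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.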
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at https://github.com/microsoft/DeBERTa.
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
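The conjecture about counting predecessors can be made concrete with a toy sketch. It assumes, purely for illustration, that attention weights are uniform over the visible prefix and that only the first token carries a nonzero value; both are hypothetical simplifications rather than the setup studied in the abstract. Under these assumptions the averaged value at position t equals 1/t, so the position can be read back from it.

```python
def causal_uniform_attention(values):
    # Each position averages the values of itself and all of its predecessors,
    # which is what uniform attention under a causal mask computes.
    outputs = []
    running_sum = 0.0
    for t, v in enumerate(values, start=1):
        running_sum += v
        outputs.append(running_sum / t)
    return outputs

# Hypothetical setup: a start-of-sequence marker has value 1, all other tokens 0.
values = [1.0] + [0.0] * 7
averaged = causal_uniform_attention(values)

# The averaged value at position t is 1/t, so inverting it recovers t.
recovered_positions = [round(1.0 / a) for a in averaged]
```

In this toy case `recovered_positions` is the list 1 through 8, even though no positional encoding was supplied.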
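Sequence models are a critical component of modern NLP systems, but their predictions are difficult to explain. We consider model explanations through rationales, subsets of context that can explain individual model predictions. We find sequential rationales by solving a combinatorial optimization: the best rationale is the smallest subset of input tokens that would predict the same output as the full sequence. Enumerating all subsets is intractable, so we propose an efficient greedy algorithm to approximate this objective. The algorithm, called greedy rationalization, applies to any model. For this approach to be effective, the model should form compatible conditional distributions when making predictions on incomplete subsets of the context. This condition can be enforced with a short fine-tuning step. We study greedy rationalization on language modeling and machine translation. Compared to existing baselines, greedy rationalization is best at optimizing the combinatorial objective and provides the most faithful rationales. On a new dataset of annotated sequential rationales, greedy rationales are most similar to human rationales.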
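The greedy procedure described above can be sketched on a toy problem. The "model" here is a hypothetical additive scorer (`TOKEN_VOTES`, `predict_distribution`), invented only so the example runs; a real use would query a fine-tuned sequence model instead. At each step the sketch adds the context position that most raises the probability of the full-context prediction, and stops once the subset alone yields that same prediction.

```python
import math

# Hypothetical toy model: each context token votes for candidate next words.
TOKEN_VOTES = {
    "the": {"mat": 0.1, "dog": 0.1},
    "cat": {"mat": 0.5, "dog": 1.0},
    "sat": {"mat": 1.5, "dog": 0.2},
    "on": {"mat": 1.0, "dog": 0.1},
}
VOCAB = ["mat", "dog"]

def predict_distribution(context, subset):
    """Toy next-word distribution using only the context positions in `subset`."""
    scores = {w: 0.0 for w in VOCAB}
    for i in subset:
        for word, vote in TOKEN_VOTES.get(context[i], {}).items():
            scores[word] += vote
    total = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / total for w, s in scores.items()}

def greedy_rationale(context):
    # The target is whatever the model predicts from the full context.
    full = predict_distribution(context, range(len(context)))
    target = max(full, key=full.get)
    rationale, remaining = [], set(range(len(context)))
    while remaining:
        # Add the position that most increases the target's probability.
        best = max(sorted(remaining),
                   key=lambda i: predict_distribution(context, rationale + [i])[target])
        rationale.append(best)
        remaining.remove(best)
        current = predict_distribution(context, rationale)
        if max(current, key=current.get) == target:
            break  # the subset now predicts the same output as the full context
    return sorted(rationale), target

context = ["the", "cat", "sat", "on", "the"]
rationale, target = greedy_rationale(context)
```

With these toy votes the full context predicts "mat", and the single position holding "sat" is already enough to reproduce that prediction.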
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
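A sparse attention pattern with a constant number of global tokens can be sketched as a boolean mask. The abstract only names the global tokens; the sliding-window and random components below, the function name `sparse_attention_mask`, and all parameter values are assumptions made for illustration, not the released BIGBIRD code. The sketch shows that each non-global row allows a number of positions that does not grow with the sequence length, which is where the linear dependency comes from.

```python
import random

def sparse_attention_mask(seq_len, num_global=1, window=1, num_random=1, seed=0):
    rng = random.Random(seed)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Global tokens attend everywhere and are attended to by everyone.
            is_global = i < num_global or j < num_global
            # Local window: neighbours within `window` positions.
            is_local = abs(i - j) <= window
            mask[i][j] = is_global or is_local
        # A few extra random positions per row (assumed component).
        candidates = [j for j in range(seq_len) if not mask[i][j]]
        for j in rng.sample(candidates, min(num_random, len(candidates))):
            mask[i][j] = True
    return mask

mask = sparse_attention_mask(seq_len=16, num_global=1, window=1, num_random=2)
allowed_per_row = [sum(row) for row in mask]
```

In this configuration the global row allows all 16 positions, while every other row allows at most 6 (one global column, three window positions, two random ones), regardless of how long the sequence is.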
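The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with an (exponential) moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by splitting the whole sequence into multiple chunks of fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.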
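The local inductive bias of an exponential moving average can be seen in a short sketch. This is the textbook recurrence y_t = alpha * x_t + (1 - alpha) * y_{t-1} with a fixed, hypothetical `alpha`; the parameterization used in the paper may well differ (for example, learned or multi-dimensional coefficients). Feeding in a single impulse shows how an input's influence decays with distance.

```python
def exponential_moving_average(sequence, alpha=0.5):
    # y_t = alpha * x_t + (1 - alpha) * y_{t-1}, starting from y_0 = 0.
    smoothed = []
    previous = 0.0
    for x in sequence:
        previous = alpha * x + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed

# A single impulse at the first position, followed by zeros.
impulse = [1.0, 0.0, 0.0, 0.0]
response = exponential_moving_average(impulse, alpha=0.5)
# response == [0.5, 0.25, 0.125, 0.0625]
```

Each step halves the impulse's contribution, so nearby positions dominate the smoothed value; this is the kind of position-aware locality that plain attention does not have built in.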
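In this work we introduce the Kernelized Transformer, a generic, scalable, data-driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that Kernelized Transformers achieve performance comparable to existing efficient Transformer architectures, in terms of both accuracy and computational efficiency. Our study also demonstrates that the choice of kernel has a substantial impact on performance, and that kernel-learning variants are competitive alternatives to fixed-kernel Transformers, in both long- and short-sequence tasks.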
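Position encoding has recently been shown to be effective in the transformer architecture. It provides valuable supervision for modeling dependencies between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation. Notably, RoPE has valuable properties, including flexibility of sequence length, inter-token dependency that decays with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into HuggingFace: https://huggingface.co/docs/transformers/model_doc/roformer.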
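The claim that a rotation applied at absolute positions yields a relative-position dependency can be checked numerically with a small sketch. The helper names `rotate` and `dot`, the example vectors, and the frequency base of 10000 are assumptions for illustration; this is not the HuggingFace implementation. Each consecutive pair of coordinates is rotated by an angle proportional to the position, and the resulting query-key score is compared for two position pairs with the same offset.

```python
import math

def rotate(vec, position, base=10000.0):
    # Rotate each consecutive pair (vec[i], vec[i+1]) by position * base**(-i/d).
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two very different absolute positions.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))
```

The two scores agree up to floating-point rounding, because the product of the two rotations depends only on the difference between the positions.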
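State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, GitHub source code, and arXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with Transformer-based baselines, and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.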
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long-Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long-Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.