我们在变压器中重新审视设计选择,并提出方法来解决它们在处理长序列中的弱点。首先,我们提出了一个名为“门控注意单元”的简单层,该层允许使用较弱的单头注意,而质量损失最小。然后,我们提出了一种与该新层的线性近似方法互补的,该方法对加速器友好且质量高度竞争。最终的型号(名为Flash)与短(512)和长(8K)上下文长度相匹配,在WIKI-40B上达到高达4.9 $ \ times $的训练速度和PG上的12.1 $ \ times $,在PG上达到了4.9 $ \ times $的困惑。-19用于自动回归语言建模,C4的4.8 $ \ times $用于掩盖语言建模。
translated by 谷歌翻译
变压器注意机制的二次计算和内存复杂性限制了对长序列建模的可扩展性。在本文中,我们提出了Luna,一种线性统一嵌套关注机制,使Softmax注意力具有两个嵌套线性关注功能,仅产生线性(与二次)的时间和空间复杂度相反。具体地,通过第一注意功能,LUNA将输入序列包装成固定长度的序列。然后,使用第二关注功能未包装包装序列。与更传统的关注机制相比,LUNA引入具有固定长度的附加序列作为输入和额外的相应输出,允许LUNA线性地进行关注操作,同时还存储足够的上下文信息。我们对三个序列建模任务的基准进行了广泛的评估:长上下文序列建模,神经机平移和大型预磨损的屏蔽语言建模。竞争甚至更好的实验结果表明了Luna的有效性和效率与各种各样相比
translated by 谷歌翻译
许多NLP任务需要处理超出预磨模模型的长度限制的长语境。为了将这些模型扩展到更长的文本序列,已经提出了许多有效的远程注意力变体。尽管沿着这个方向进行了丰富的研究,但仍然难以在实际用例中衡量这些模型的相对有效性,例如,如果我们在预先rain-yfetune范式之后应用这些模型。在这项工作中,我们的目标是对这些具有大规模和受控实验的这些新兴模型进行彻底的分析。对于每个关注变体,我们使用相同的长DOC语料库,然后使用相同的长DOC语料库,然后为现实世界的长情节任务进行芬特这些模型。我们的调查结果揭示了现有广泛使用的远程基准的陷阱,并显示任何经过测试的高效关注可以在标准预介质范式下击败一个简单的本地窗口关注。对本地注意力变化的进一步分析表明,即使是常用的注意力窗口重叠也没有必要实现良好的下游结果 - 使用不相交的本地关注,我们能够构建符合性能的更简单且更高效的Long-Doc QA模型霍尔福勒〜\ citep {longformer}其预先花费的一半。
translated by 谷歌翻译
状态空间模型已显示在建模远距离依赖性方面有效,特别是序列分类任务。在这项工作中,我们着重于对英语书籍,GitHub源代码和Arxiv数学文章的自回旋序列建模。基于围绕封闭激活功能的有效性的最新发展,我们提出了一个名为“封闭状态空间(GSS)”的新层,并表明它的训练速度明显快于TPU的S4(即DSS)的对角线版本,具有相当竞争力 - 基于变压器的基线,并表现出零击向更长的输入,同时直接实施。最后,我们表明,利用自我意见来建模局部依赖性,可以进一步提高GSS的性能。
translated by 谷歌翻译
Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. The SSM augments global information, which complements the lack of long-range dependency issue in local attention methods. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
translated by 谷歌翻译
在深度学习中,模型通常重用所有输入的相同参数。专家的混合(MOE)违反了这一点,而是为每个传入示例选择不同的参数。结果是一个稀疏激活的模型 - 具有残酷数量的参数 - 但恒定的计算成本。然而,尽管MOE取得了一些显着的成功,但复杂性,沟通成本和培训不稳定的阻碍了广泛的采用 - 我们使用Switch Transformer解决了这些领域。我们简化了MOE路由算法和设计直观的改进模型,以降低的通信和计算成本。我们提出的培训技术有助于纠缠不稳定,我们表明稀疏模型可能首次以较低的精度(BFLOAT16)格式进行培训。我们设计了基于T5基数和T5总数的模型,以使用相同的计算资源获得高达7倍的训练速度。这些改进扩展到多语言设置,我们在所有101种语言中衡量对MT5基本版本的收益。最后,我们通过在“巨大的清洁爬行语料库”上预先培训高达数万亿个参数模型,并在T5-XXL模型上实现4倍的速度,从而提高了语言模型的当前规模。
translated by 谷歌翻译
变压器注意机制中的设计选择,包括弱电感偏置和二次计算复杂性,限制了其用于建模长序列的应用。在本文中,我们介绍了一个简单的,理论上的,单头的门控注意机制,配备了(指数)移动平均线,以将局部依赖性的电感偏置纳入位置 - 敏锐的注意机制中。我们进一步提出了一个具有线性时间和空间复杂性的大型变体,但通过将整个序列分为固定长度的多个块,仅产生最小的质量损失。对广泛的序列建模基准测试的广泛实验,包括远距离竞技场,神经机器翻译,自动回归语言建模以及图像和语音分类,表明,巨人比其他序列模型取得了重大改进,包括变种物的变体和最新的变体模型状态空间模型。
translated by 谷歌翻译
在预介质期间,预解压器变压器遭受梯度幅度不匹配:早期层处的梯度远远大于更高层的层。我们所提出的常规程序架构可以减轻这些问题,这为每层增加了三个归一化操作:自我注意后的一层规范,自我注意输出的头部明智的缩放,以及第一完全连接层之后的层标。额外的运营产生忽略不计的计算成本(+ 0.4%的参数增加),但是改善了从12500万到27亿个参数的因果和屏蔽语言模型的预先欣赏困惑和下游任务性能。例如,在我们最强的1.3B参数基线顶部添加NARMFORMER可以在相同的计算预算中更快地达到24%的平等困惑,或者更好地收敛0.27困惑。该模型达到GPT3大(1.3B)零拍摄性能速度快60%。对于屏蔽语言建模,Normformer平均将微调胶水性能提高1.9%。 Fairseq HTTPS://github.com/pytorch/faireq/tree/main/examples/normformer提供培训ormalformer模型的代码。
translated by 谷歌翻译
我们介绍了块状变压器,该变压器以序列的反复方式应用变压器层,并且相对于序列长度具有线性复杂性。我们的复发单元在训练过程中在代币的块而不是单个令牌上运行,并利用块内并行计算,以便有效利用加速器硬件。单元本身非常简单。它仅仅是一个变压器层:它使用自我注意事项和交叉注意力来有效计算大量状态向量和令牌上的复发函数。我们的设计部分受到LSTM单元的启发,它使用LSTM风格的大门,但它可以将典型的LSTM单元缩放为几个数量级。我们的复发实现在计算时间和参数计数中都具有相同的成本作为传统的变压器层,但是在很长的序列中,语言建模任务中的语言建模任务的困惑极大地改善了。我们的模型比远程变压器XL基线的表现宽大,同时运行的速度是两倍。我们证明了它在PG19(书籍),Arxiv论文和GitHub源代码上的有效性。我们的代码已发布为开​​源。
translated by 谷歌翻译
现实世界中的数据是高维的:即使在压缩后,书籍,图像或音乐表演也很容易包含数十万个元素。但是,最常用的自回归模型,变压器非常昂贵,以缩放捕获这种远程结构所需的输入和层数。我们开发了感知者AR,这是一种自回归的模态 - 不合骨架构,它使用交叉注意力将远程输入映射到少数潜在的潜在,同时还可以维护端到端的因果关系掩盖。感知器AR可以直接进行十万个令牌,从而实现了实用的长篇小写密度估计,而无需手工制作的稀疏模式或记忆机制。当对图像或音乐进行培训时,感知器AR会生成具有清晰长期连贯性和结构的输出。我们的架构还获得了长期基准测试的最新可能性,包括64 x 64个Imagenet图像和PG-19书籍。
translated by 谷歌翻译
基于变压器的模型广泛用于自然语言处理(NLP)。变压器模型的核心是自我关注机制,它捕获了输入序列中的令牌对的相互作用,并在序列长度上逐步取决于逐行。在更长的序列上培训此类模型是昂贵的。在本文中,我们表明,基于局部敏感散列(LSH)的伯努利采样注意机制降低了这种模型到线性的二次复杂性。我们通过考虑自我关注作为与Bernoulli随机变量相关的单独令牌的总和来绕过二次成本,原则上可以通过单个哈希进行一次(尽管在实践中,这个数字可能是一个小常数)。这导致了有效的采样方案来估算依赖于LSH的特定修改的自我关注(以便在GPU架构上进行部署)。我们在标准512序列长度上评估了胶水基准的算法,在那里我们看到了相对于标准预磨削变压器的良好性能。在远程竞技场(LRA)基准中,为了评估长序列的性能,我们的方法实现了与Softmax自我关注的结果一致,但具有相当大的加速和内存节省,并且通常优于其他有效的自我关注方法。我们的代码可以在https://github.com/mlpen/yoso获得
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
translated by 谷歌翻译
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
translated by 谷歌翻译
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce mem-ories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories--local, long-term, and external memory--at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
translated by 谷歌翻译
事实证明,构象异构体在许多语音处理任务中都是有效的。它结合了使用卷积和使用自我注意的全球依赖性提取本地依赖的好处。受此启发,我们提出了一个更灵活,可解释和可自定义的编码器替代方案,分支机构,并在端到端语音处理中对各种远程依赖关系进行建模。在每个编码器层中,一个分支都采用自我注意事项或其变体来捕获远程依赖性,而另一个分支则利用带有卷积门控(CGMLP)的MLP模块来提取局部关系。我们对几种语音识别和口语理解基准进行实验。结果表明,我们的模型优于变压器和CGMLP。它还与构象异构体获得的最先进结果相匹配。此外,由于两分支结构,我们展示了减少计算的各种策略,包括在单个训练有素的模型中具有可变的推理复杂性的能力。合并分支的权重表明如何在不同层中使用本地和全球依赖性,从而使模型设计受益。
translated by 谷歌翻译
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attentionkernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
translated by 谷歌翻译
近年来,基于变压器的预训练模型已获得了很大的进步,成为自然语言处理中最重要的骨干之一。最近的工作表明,变压器内部的注意力机制可能不需要,卷积神经网络和基于多层感知器的模型也已被研究为变压器替代方案。在本文中,我们考虑了一个用于语言模型预训练的图形循环网络,该网络通过本地令牌级通信为每个序列构建一个图形结构,以及与其他代币解耦的句子级表示。原始模型在受监督培训下的特定领域特定文本分类中表现良好,但是,其通过自我监督的方式学习转移知识的潜力尚未得到充分利用。我们通过优化体系结构并验证其在更通用的语言理解任务(英语和中文)中的有效性来填补这一空白。至于模型效率,我们的模型在基于变压器的模型中而不是二次复杂性,而是具有线性复杂性,并且在推断过程中的性能更有效。此外,我们发现与现有基于注意力的模型相比,我们的模型可以生成更多样化的输出,而背景化的功能冗余性较小。
translated by 谷歌翻译
The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a later layer, to reduce the sequence length for earlier layers and make the pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) The gathering of contextualized information from unmasked tokens can be conducted with a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
translated by 谷歌翻译
基于变压器的模型在多个领域和任务上显示了它们的有效性。自我注意力允许将所有序列元素的信息结合到上下文感知表示形式中。但是,全球和本地信息必须主要存储在相同的元素表示中。此外,输入序列的长度受到自我注意的二次计算复杂性的限制。在这项工作中,我们提出并研究了一个记忆启动的片段级循环变压器(复发记忆变压器)。内存允许借助复发的帮助存储和处理本地和全局信息,并可以在长序列的段之间传递信息。我们通过将特殊的内存令牌添加到输入或输出序列中,实现了一个内存机制,无需更改变压器模型。然后,对变压器进行了训练,以控制内存操作和序列表示处理。实验的结果表明,我们的模型与Transformer-XL在语言建模上的较小内存大小上的表现相同,并在需要更长序列处理的任务方面胜过它。我们证明,将内存令牌添加到TR-XL可以提高IT性能。这使得反复的内存变压器成为需要学习长期依赖性和内存处理中的通用性(例如算法任务和推理)的应用程序的有前途的体系结构。
translated by 谷歌翻译