Transformers achieve remarkable performance in several tasks but, due to their quadratic complexity with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
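The associativity argument in the abstract above can be illustrated with a small sketch. It assumes the feature map elu(x) + 1 and a non-causal setting, neither of which is stated in the abstract; the function names are hypothetical. Both functions compute the same kernelized attention, but the second never forms the N x N similarity matrix.

```python
import math

def phi(x):
    # elu(x) + 1, a positive feature map (an assumption, not given in the abstract)
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_attention(Q, K, V):
    """Kernelized attention the O(N^2) way: one similarity per (query, key) pair."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same result in O(N): accumulate S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i) once."""
    d, m = len(K[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for j in range(m):
                S[a][j] += fk[a] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][j] for a in range(d)) / denom
                    for j in range(m)])
    return out

if __name__ == "__main__":
    Q = [[0.1, -0.2], [0.5, 0.3]]
    K = [[0.2, 0.1], [-0.3, 0.7]]
    V = [[1.0, 2.0], [0.0, -1.0]]
    print(quadratic_attention(Q, K, V))
    print(linear_attention(Q, K, V))
```

The two printed results should agree up to floating-point rounding, since the only difference is the order in which the matrix products are evaluated.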
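Transformers have achieved remarkable success in sequence modeling and beyond, but they suffer from quadratic computational and memory complexity with respect to the length of the input sequence. Efficient transformers that leverage techniques such as sparse attention, linear attention, and hashing tricks have been proposed to reduce this quadratic complexity, but they significantly degrade accuracy. In response, we first interpret the linear attention and the residual connections used in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which uses momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexity. Furthermore, we develop an adaptive strategy that computes the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrates that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture that uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.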
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
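The memory saving from reversible residual layers mentioned above comes from the fact that a layer's inputs can be recomputed from its outputs. A minimal sketch follows, assuming the common two-stream formulation y1 = x1 + F(x2), y2 = x2 + G(y1); the sub-layers f and g here are hypothetical stand-ins, not the attention and feed-forward blocks of the actual model.

```python
import math

def f(x):
    # Stand-in for the attention sub-layer (hypothetical)
    return [math.tanh(v) for v in x]

def g(x):
    # Stand-in for the feed-forward sub-layer (hypothetical)
    return [0.5 * v * v for v in x]

def rev_forward(x1, x2):
    """Reversible block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so they need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

if __name__ == "__main__":
    y1, y2 = rev_forward([0.3, -1.2], [2.0, 0.1])
    print(rev_inverse(y1, y2))
```

During backpropagation, each block's inputs can be rebuilt this way from the block above it, which is why only one set of activations has to be kept regardless of depth.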
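The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences because of its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, for example Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention computation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we show mathematically that kernelized attention with RPE can be computed efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n \log n)$ time complexity. Interestingly, we further demonstrate that using relative positional encoding properly can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we show empirically that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.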
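The Toeplitz observation above rests on a standard primitive: a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a circulant one and using the FFT. The sketch below shows only that primitive, not the full RPE attention; the function names and the dictionary representation of the offsets are assumptions made for illustration.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def toeplitz_matvec_naive(c, x):
    """y_i = sum_j c[i - j] * x_j in O(n^2); c maps offsets -(n-1)..(n-1) to values."""
    n = len(x)
    return [sum(c[i - j] * x[j] for j in range(n)) for i in range(n)]

def toeplitz_matvec_fft(c, x):
    """Same product via a circulant embedding of size >= 2n and three FFTs."""
    n = len(x)
    size = 1
    while size < 2 * n:
        size *= 2
    col = [0j] * size
    for off in range(-(n - 1), n):
        col[off % size] = c[off]  # negative offsets wrap to the end
    xs = [complex(v) for v in x] + [0j] * (size - n)
    prod = [a * b for a, b in zip(fft(col), fft(xs))]
    y = fft(prod, invert=True)
    return [(v / size).real for v in y[:n]]

if __name__ == "__main__":
    n = 4
    c = {off: 1.0 / (1 + abs(off)) for off in range(-(n - 1), n)}
    x = [1.0, -2.0, 0.5, 3.0]
    print(toeplitz_matvec_naive(c, x))
    print(toeplitz_matvec_fft(c, x))
```

Here c[i - j] plays the role of a position-dependent weight that depends only on the relative offset i - j, which is what makes the matrix Toeplitz.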
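State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work, we focus on autoregressive sequence modeling over English books, GitHub source code, and arXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e., DSS) on TPUs, is fairly competitive with Transformer-based baselines, and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain comparable or better performance than baseline transformers with 8 heads.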
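A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t), \; y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.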
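The state space model in the abstract above can be made concrete with a scalar toy version. The sketch below assumes a bilinear discretization with step size dt and a one-dimensional state; neither detail is given in the abstract, and the low-rank correction and Cauchy kernel of S4 are not implemented here. It shows only that the discretized recurrence and a causal convolution with the unrolled kernel give the same outputs.

```python
def discretize(a, b, dt):
    """Bilinear discretization of x'(t) = a x(t) + b u(t) for a scalar state."""
    denom = 1.0 - dt * a / 2.0
    return (1.0 + dt * a / 2.0) / denom, dt * b / denom

def run_recurrence(a, b, c, d, u, dt):
    """Step the discrete system: x_k = a_bar x_{k-1} + b_bar u_k, y_k = c x_k + d u_k."""
    a_bar, b_bar = discretize(a, b, dt)
    x, ys = 0.0, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append(c * x + d * u_k)
    return ys

def run_convolution(a, b, c, d, u, dt):
    """Same outputs as a causal convolution with kernel K_j = c * a_bar**j * b_bar."""
    a_bar, b_bar = discretize(a, b, dt)
    kernel = [c * (a_bar ** j) * b_bar for j in range(len(u))]
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) + d * u[k]
            for k in range(len(u))]

if __name__ == "__main__":
    u = [1.0, 0.5, -0.3, 0.0, 2.0]
    print(run_recurrence(-1.0, 1.0, 0.7, 0.1, u, dt=0.1))
    print(run_convolution(-1.0, 1.0, 0.7, 0.1, u, dt=0.1))
```

The recurrent form is convenient for step-by-step generation, while the convolutional form can be evaluated in parallel over the whole sequence.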
Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O(n^{1.5} d)$ from $O(n^2 d)$ for sequence length $n$ and hidden dimension $d$. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity), as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192. We open-source the code for Routing Transformer in Tensorflow.
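Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce compute complexity, but they often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, that is, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: a 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, and a $3\times$ speedup on GPT-2 (seq.). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).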
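The tiling idea above relies on being able to compute an exact softmax without seeing all scores at once. The sketch below illustrates that with a running maximum and running normalizer for a single query, processing keys and values block by block. It is a plain-Python illustration under simplifying assumptions (one query, no 1/sqrt(d) scaling, no GPU memory management), not the actual FlashAttention kernel; the function names are hypothetical.

```python
import math

def attention_row(q, K, V):
    """Standard softmax attention for one query; materializes all N scores."""
    s = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    p = [math.exp(v - m) for v in s]
    z = sum(p)
    return [sum(pi * v[j] for pi, v in zip(p, V)) / z for j in range(len(V[0]))]

def attention_row_tiled(q, K, V, block=2):
    """Same result, visiting K and V one block at a time."""
    m, l = float("-inf"), 0.0      # running max and running softmax denominator
    acc = [0.0] * len(V[0])        # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)  # rescale what was accumulated under the old max
        p = [math.exp(v - m_new) for v in s]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

if __name__ == "__main__":
    q = [0.5, -1.0]
    K = [[1.0, 0.0], [0.2, 0.3], [-0.5, 2.0], [0.7, 0.7], [0.0, -1.0]]
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
    print(attention_row(q, K, V))
    print(attention_row_tiled(q, K, V, block=2))
```

Because each block only needs the running statistics from earlier blocks, the full score matrix never has to be written out, which is the property the IO analysis depends on.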
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
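The positive random feature idea above can be sketched for a single query-key pair. The code below assumes the feature map phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) with independent Gaussian projections w_i; it omits the orthogonalization that the "O" in FAVOR+ refers to, and all function names are hypothetical. The inner product of two feature vectors is then a Monte Carlo estimate of the softmax kernel exp(q . k).

```python
import math
import random

def random_projections(m, d, seed=0):
    """Draw m independent standard Gaussian vectors of dimension d."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def positive_features(x, W):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    half_sq = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq)
            for w in W]

def softmax_kernel_estimate(q, k, W):
    """Estimate exp(q . k) as the dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(positive_features(q, W),
                                     positive_features(k, W)))

if __name__ == "__main__":
    q, k = [0.2, -0.1, 0.3], [0.1, 0.25, -0.2]
    W = random_projections(2000, 3)
    exact = math.exp(sum(a * b for a, b in zip(q, k)))
    print(exact, softmax_kernel_estimate(q, k, W))
```

Once attention is expressed through such feature vectors, the same reordering of matrix products used by other kernelized attention methods gives linear cost in sequence length.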
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. * Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research.
We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective "depth" of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements to existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github.com/locuslab/deq.
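The notion of an equilibrium point of a weight-tied layer can be illustrated with a scalar toy example. The sketch below uses plain fixed-point iteration on a hypothetical layer z -> tanh(w z + x) rather than the root-finding solver and implicit differentiation described in the abstract; the layer, its weight, and the tolerances are all assumptions chosen so that the iteration is a contraction.

```python
import math

def layer(z, x, w=0.5):
    # One weight-tied "layer" (hypothetical); |w| < 1 keeps it a contraction
    return math.tanh(w * z + x)

def solve_equilibrium(x, tol=1e-10, max_iter=1000):
    """Apply the same layer repeatedly until the hidden state stops changing."""
    z = 0.0
    for _ in range(max_iter):
        z_next = layer(z, x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

if __name__ == "__main__":
    z_star = solve_equilibrium(0.3)
    # At equilibrium, applying the layer once more leaves z (almost) unchanged
    print(z_star, layer(z_star, 0.3))
```

Only the final state z_star has to be kept, however many iterations were needed to reach it, which is the intuition behind the constant-memory claim.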
