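Standard inference and training with Transformer-based architectures scale quadratically with input sequence length. This cost is prohibitive for a variety of applications, especially web-page translation and query answering. Consequently, several approaches have recently been developed to speed up attention computation by enforcing different attention structures, such as sparsity, low rank, or kernel-based approximation. In this work, we view attention computation as nearest-neighbor retrieval and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers: TF-Attention and TC-Attention. TF-Attention computes attention in a fine-grained style, while TC-Attention is a coarse attention layer that also ensures the gradients are "dense". To optimize such challenging discrete layers, we propose a two-level bootstrapped training method. Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our Treeformer architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer. Compared to Linformer, accuracy can be as much as 12% higher while using similar FLOPs in the attention layer.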
translated by 谷歌翻译
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
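As a rough illustration of why a sparse pattern with O(1) global tokens grows linearly rather than quadratically, the sketch below builds a boolean attention mask and counts the allowed query-key pairs. The sliding window, the function names, and the parameters `num_global` and `window` are assumptions made for this example; the abstract does not spell out the exact pattern, and this is not presented as BIGBIRD's implementation.

```python
def sparse_attention_mask(n, num_global=1, window=1):
    """Return an n x n boolean mask; mask[i][j] is True if token i may attend to token j.

    Hypothetical pattern: the first `num_global` tokens attend to and are attended
    by every token, and every token also sees neighbours within `window` positions.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            is_global = i < num_global or j < num_global
            is_local = abs(i - j) <= window
            mask[i][j] = is_global or is_local
    return mask


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


for n in (8, 16, 32):
    allowed = count_allowed(sparse_attention_mask(n))
    print(n, allowed, n * n)  # sparse pair count vs. full attention
```

With one global token and a window of one, the allowed pairs number 5n - 6, so doubling the sequence length roughly doubles the work instead of quadrupling it. The dense mask is materialised here only to make the counting easy to read; a practical implementation would avoid building it.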
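Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading model performance. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.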
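Transformers have made progress on miscellaneous tasks but suffer from quadratic computational and memory complexity. Recent works propose sparse Transformers, which attend over sparse graphs to reduce complexity while retaining strong performance. Although effective, the crucial question of how dense a graph needs to be to perform well has not been fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function that measures information transfer on a graph and provides an analysis tool for the trade-off between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results than the vanilla Transformer while yielding $O(N \log N)$ complexity for sequence length $N$. Experiments on tasks requiring various sequence lengths validate our graph function.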
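To make the $O(N \log N)$ figure concrete, here is a minimal sketch of a hypercube attention graph. It assumes the standard binary hypercube, with token index `i` used directly as a vertex label and a sequence length that is a power of two; the paper's actual construction may differ, and the function names are hypothetical.

```python
def hypercube_neighbors(i, dim):
    """Neighbours of vertex i in a dim-dimensional binary hypercube.

    Two vertices are adjacent when their labels differ in exactly one bit,
    so each vertex has exactly `dim` neighbours.
    """
    return [i ^ (1 << b) for b in range(dim)]


def hypercube_edges(dim):
    """All directed (query, key) pairs for a sequence of length 2**dim."""
    n = 1 << dim
    return [(i, j) for i in range(n) for j in hypercube_neighbors(i, dim)]


dim = 4                      # sequence length N = 16
edges = hypercube_edges(dim)
print(len(edges))            # N * log2(N) directed pairs
```

Each of the N tokens attends to log2(N) others, so the number of attended pairs is N log2(N), compared with N^2 for full attention. Any two tokens are also at most log2(N) hops apart, which is one intuition for why information can still propagate across the whole sequence over several layers.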
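We revisit the design choices in Transformers and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named the gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches Transformer perplexity over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.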
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa
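Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences, such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection, to model distant correlations, and a short-term attention, to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224x224 ImageNet-1K reaches 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at https://github.com/nvidia/transformer-ls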
Attention-based neural networks, such as Transformers, have become ubiquitous in numerous applications, including computer vision, natural language processing, and time-series analysis. In all kinds of attention networks, the attention maps are crucial as they encode semantic dependencies between input tokens. However, most existing attention networks perform modeling or reasoning based on representations, wherein the attention maps of different layers are learned separately without explicit interactions. In this paper, we propose a novel and generic evolving attention mechanism, which directly models the evolution of inter-token relationships through a chain of residual convolutional modules. The major motivations are twofold. On the one hand, the attention maps in different layers share transferable knowledge, thus adding a residual connection can facilitate the information flow of inter-token relationships across layers. On the other hand, there is naturally an evolutionary trend among attention maps at different abstraction levels, so it is beneficial to exploit a dedicated convolution-based module to capture this process. Equipped with the proposed mechanism, the convolution-enhanced evolving attention networks achieve superior performance in various applications, including time-series representation, natural language understanding, machine translation, and image classification. Especially on time-series representation tasks, Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer outperforms state-of-the-art models significantly, achieving an average of 17% improvement compared to the best SOTA. To the best of our knowledge, this is the first work that explicitly models the layer-wise evolution of attention maps. Our implementation is available at https://github.com/pkuyym/EvolvingAttention
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
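The phrase "sequences of image patches" can be illustrated with a short sketch that cuts an image into non-overlapping square patches and flattens each one into a vector. It assumes a single-channel image stored as a list of rows, and the function name and `patch` parameter are hypothetical. The learned linear projection and position embeddings that would normally follow are left out.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks.

    Patches are returned in row-major order (left to right, top to bottom),
    each flattened row by row into a list of patch * patch values.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches


# A 4 x 4 toy image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, patch=2)
print(len(tokens), tokens[0])
```

The resulting list plays the role that a sequence of word tokens plays in NLP: an H x W image with patch size P yields (H / P) * (W / P) tokens, so the sequence length, and with it the attention cost, depends on the patch size as well as the resolution.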
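In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely activated model with an outrageous number of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability; we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help tame the instabilities, and we show that large sparse models can be trained, for the first time, with lower-precision (bfloat16) formats. We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend to multilingual settings, where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training models with up to a trillion parameters on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.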
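The idea of selecting different parameters per input at constant compute can be sketched as follows. The abstract says only that the routing algorithm is simplified; the sketch assumes top-1 routing, where each token is sent to the single highest-probability expert and the output is scaled by that probability. Scalar tokens, the toy experts, and all function names are illustrative assumptions, not the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Pick the single most probable expert; return (expert index, gate probability)."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(token, router_logits, experts):
    """Run only the selected expert and scale its output by the gate probability."""
    expert, gate = route_top1(router_logits)
    return gate * experts[expert](token)


# Three toy "experts"; in a real model each would be a feed-forward network.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x]
print(route_top1([0.0, 2.0, 1.0]))
print(switch_layer(3.0, [0.0, 2.0, 1.0], experts))
```

Because only one expert runs per token, adding experts increases the parameter count without increasing per-token computation. Multiplying by the gate probability keeps the router differentiable even though the expert choice itself is discrete.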
Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment the bottom layer of SPADE with an SSM, and we employ efficient local attention methods for the other layers. The SSM supplies global information, which compensates for the lack of long-range dependency modeling in local attention methods. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
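The sketch below shows one way positive random features can estimate the softmax kernel exp(q . k) without bias. It is a simplification: it draws independent Gaussian projections rather than the orthogonal ones FAVOR+ uses, and the function names, dimensions, and sample vectors are assumptions chosen for illustration.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to phi(x), with phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m).

    Every feature is strictly positive, and for Gaussian w_i the expectation of
    phi(q) . phi(k) equals exp(q . k).
    """
    m = len(projections)
    sq_norm = sum(v * v for v in x)
    return [math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq_norm / 2) / math.sqrt(m)
            for w in projections]


def approx_softmax_kernel(q, k, projections):
    """Estimate exp(q . k) as a dot product of random feature maps."""
    fq = positive_random_features(q, projections)
    fk = positive_random_features(k, projections)
    return sum(a * b for a, b in zip(fq, fk))


rng = random.Random(0)
d, m = 4, 4000  # input dimension and number of random features (illustrative)
projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

q = [0.3, -0.2, 0.1, 0.4]
k = [0.1, 0.5, -0.3, 0.2]
exact = math.exp(sum(a * b for a, b in zip(q, k)))
approx = approx_softmax_kernel(q, k, projections)
print(exact, approx)
```

Because the kernel factorises into a dot product of feature maps, attention can be computed by first aggregating the key features with the values and then multiplying by the query features, which is linear in sequence length. The n x n attention matrix is never formed.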