Structured distributions, i.e. distributions over combinatorial spaces, are commonly used to learn latent probabilistic representations from observed data. However, scaling these models is bottlenecked by the high computational and memory complexity with respect to the size of the latent representations. Common models such as hidden Markov models (HMMs) and probabilistic context-free grammars (PCFGs) require time and space quadratic and cubic, respectively, in the number of hidden states. This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models. We show that by viewing the central inference step as a matrix-vector product and using a low-rank constraint, we can trade off model expressivity and speed via the rank. Experiments with neurally parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces while providing practical speedups.
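A minimal NumPy sketch of the idea (not the paper's code): the HMM forward recursion's core step is a matrix-vector product with the transition matrix, so factoring that matrix as a rank-r product drops each step from O(m^2) to O(mr). The function and argument names are illustrative, and the constraints that would make the factored matrix a valid stochastic matrix are omitted.

```python
import numpy as np

def forward_lowrank(init, U, V, emit_probs):
    """init: (m,) initial state distribution; U, V: (m, r) factors of the
    transition matrix A = U @ V.T; emit_probs: (T, m) per-step emission probs."""
    alpha = init * emit_probs[0]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for e in emit_probs[1:]:
        alpha = (V @ (U.T @ alpha)) * e      # A.T @ alpha in O(m r), then emissions
        s = alpha.sum()
        logp += np.log(s)                    # accumulate normalizers for stability
        alpha = alpha / s
    return logp                              # log-likelihood of the observations
```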
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10,000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t),\ y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle long-range dependencies mathematically and empirically. However, this approach has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4), based on a new parameterization of the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a low-rank correction, allowing it to be diagonalized stably, and reduces the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91\% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks while performing generation $60\times$ faster, and (iii) SoTA on every task of the Long Range Arena benchmark, including solving the challenging Path-X task of length 16K on which all prior work fails, while being as efficient as all competitors.
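A minimal sketch of the state space model quoted above, discretized with the bilinear transform and unrolled as a linear recurrence. It illustrates the SSM itself, not S4's structured parameterization or its Cauchy-kernel algorithm; all names here are illustrative.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Abar = (I - dt/2 A)^{-1} (I + dt/2 A), Bbar = (I - dt/2 A)^{-1} dt B."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def ssm_recurrence(A, B, C, D, u, dt=1.0):
    """A: (n, n), B: (n,), C: (n,), D: scalar, u: (L,) scalar inputs -> y: (L,)."""
    Abar, Bbar = discretize_bilinear(A, B, dt)
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = Abar @ x + Bbar * u_k            # x_{k+1} = Abar x_k + Bbar u_k
        ys.append(C @ x + D * u_k)           # y_k = C x_k + D u_k
    return np.array(ys)
```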
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, and a 3$\times$ speedup on GPT-2. FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher-quality models (0.7 better perplexity on GPT-2 and a 6.4-point lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
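A simplified NumPy sketch of the tiling plus online-softmax idea behind FlashAttention: keys and values are processed in blocks while keeping running max, denominator, and output accumulators, so the full N x N score matrix is never materialized. The real algorithm is a fused GPU kernel that also tiles queries, handles masking, and recomputes attention in the backward pass; none of that is shown here, and the names are illustrative.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Q, K, V: (N, d). Exact (non-causal) softmax attention, computed tile by tile."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(N, -np.inf)               # running row-wise max of scores
    l = np.zeros(N)                       # running softmax denominator
    acc = np.zeros((N, V.shape[1]))       # running unnormalized output
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                        # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])
        corr = np.exp(m - m_new)                      # rescale previous accumulators
        l = l * corr + P.sum(axis=1)
        acc = acc * corr[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]
```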
We study grammar induction with mildly context-sensitive grammars for unsupervised discontinuous parsing. Using the probabilistic linear context-free rewriting system (LCFRS) formalism, our approach fixes the rule structure in advance and focuses on parameter learning with maximum likelihood. To reduce the computational complexity of both parsing and parameter estimation, we restrict the grammar formalism to LCFRS-2 (i.e., binary LCFRS with fan-out two) and further discard rules that require O(n^6) time to parse, reducing inference to O(n^5). We find that using a large number of nonterminals is beneficial and thus make use of tensor decomposition-based rank-space dynamic programming with an embedding-based parameterization of rule probabilities to scale up the number of nonterminals. Experiments on German and Dutch show that our approach is able to induce linguistically meaningful trees with continuous and discontinuous structures.
Efficiently modeling long-range dependencies is an important goal of sequence modeling. Recently, models using structured state space sequence (S4) layers have achieved state-of-the-art performance on many long-range tasks. The S4 layer combines linear state space models (SSMs) with deep learning techniques and leverages the HiPPO framework for online function approximation to achieve high performance. However, this framework leads to architectural constraints and computational difficulties that make the S4 approach complicated to understand and implement. We revisit the idea that closely following the HiPPO framework is necessary for high performance. Specifically, we replace the bank of many independent single-input, single-output (SISO) SSMs that the S4 layer uses with one multi-input, multi-output (MIMO) SSM with a reduced latent dimension. The reduced latent dimension of the MIMO system allows the use of efficient parallel scans, which simplify the computations required to apply the S5 layer as a sequence-to-sequence transformation. In addition, we initialize the S5 SSM's state matrix with an approximation to the HiPPO-LegS matrix used by S4's SSMs and show that this is an effective initialization for the MIMO setting. S5 matches S4's performance on long-range tasks, including achieving an average of 82.46% on the suite of Long Range Arena benchmarks, compared to S4's 80.48% and the best transformer variant's 61.41%.
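A sketch of the associative-scan view that makes the recurrence parallelizable. For a diagonal state matrix, the step x_k = a * x_{k-1} + b_k can be encoded as the pair (a, b_k), and the combine operator below is associative over such pairs, so a parallel prefix scan over it evaluates the whole recurrence in O(log L) depth. The scan is shown sequentially here for clarity, and all names are illustrative.

```python
import numpy as np

def combine(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2          # composing two recurrence steps

def inclusive_scan(elems):
    """Sequential stand-in for a parallel associative (prefix) scan over `combine`."""
    out = [elems[0]]
    for e in elems[1:]:
        out.append(combine(out[-1], e))
    return out

# Toy usage: a P-dimensional diagonal SSM driven by an H-dimensional input.
L, P, H = 6, 4, 2
rng = np.random.default_rng(0)
a = np.full(P, 0.9)                       # diagonal state matrix, stored as a vector
B, C = rng.standard_normal((P, H)), rng.standard_normal((H, P))
u = rng.standard_normal((L, H))
states = [x for _, x in inclusive_scan([(a, B @ u_k) for u_k in u])]
ys = np.stack([C @ x for x in states])    # (L, H) outputs
```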
In this paper, we attempt to build a connection between the two schools by introducing syntactic inductive biases into deep learning models. We propose two families of inductive biases, one for constituency structure and one for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation gives deep learning models a way to build latent hierarchical representations from sequential inputs: a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators according to their syntactic structure. The dependency inductive bias, on the other hand, encourages models to find latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or more child nodes. After applying this constraint to a Transformer-like model, we find that the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard transformer model on different tasks. We believe these experimental results demonstrate an interesting alternative for the future development of deep learning models.
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
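A simplified sketch of the positive random feature idea underlying FAVOR+: with phi(x) = exp(w.x - ||x||^2/2) and Gaussian w, E[phi(q).phi(k)] = exp(q.k), so the softmax kernel can be estimated with a feature map and attention computed in linear time. The actual FAVOR+ additionally uses orthogonal features, feature redrawing, and numerical stabilizers; names and defaults here are illustrative.

```python
import numpy as np

def positive_features(X, W):
    """X: (N, d); W: (m, d) with i.i.d. N(0, 1) rows. Returns (N, m) features."""
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

def performer_style_attention(Q, K, V, n_features=256, seed=0):
    """Approximates softmax attention in O(N m d) rather than O(N^2 d) time."""
    N, d = Q.shape
    W = np.random.default_rng(seed).standard_normal((n_features, d))
    Qp = positive_features(Q / d**0.25, W)       # fold in the 1/sqrt(d) temperature
    Kp = positive_features(K / d**0.25, W)
    KV = Kp.T @ V                                # (m, d_v), summed over positions
    denom = Qp @ Kp.sum(axis=0)                  # (N,) softmax denominators
    return (Qp @ KV) / denom[:, None]
```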
State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 1.6$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 1.3B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
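A minimal sketch of the long-convolution view of SSM layers that FlashConv accelerates: the layer's output is a causal convolution of the input with a long (typically sequence-length) kernel, computable in O(L log L) with FFTs. The fused block-FFT kernels and state-passing algorithm are hardware-level optimizations not reproduced here; the function name is illustrative.

```python
import numpy as np

def fft_causal_conv(u, k):
    """u, k: (L,) real sequences; zero-padding to 2L avoids circular wrap-around."""
    L = u.shape[-1]
    U = np.fft.rfft(u, n=2 * L)
    Kf = np.fft.rfft(k, n=2 * L)
    return np.fft.irfft(U * Kf, n=2 * L)[..., :L]
```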
State space models have been shown to be effective at modeling long-range dependencies, specifically on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, GitHub source code, and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with Transformer-based baselines, and exhibits zero-shot generalization to longer inputs, while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
Learning hierarchical structures in sequential data -- from simple algorithmic patterns to natural language -- in a reliable, generalizable way remains a challenging problem for neural language models. Past work has shown that recurrent neural networks (RNNs) struggle to generalize on held-out algorithmic or syntactic patterns without supervision or some inductive bias. To remedy this, many papers have explored augmenting RNNs with various differentiable stacks, by analogy with finite automata and pushdown automata (PDAs). In this paper, we improve the performance of our recently proposed Nondeterministic Stack RNN (NS-RNN), which uses a differentiable data structure that simulates a nondeterministic PDA, with two important changes. First, the model now assigns unnormalized positive weights instead of probabilities to stack actions, and we provide an analysis of why this improves training. Second, the model can directly observe the state of the underlying PDA. Our model achieves lower cross-entropy than all previous stack RNNs on five context-free language modeling tasks (within 0.05 nats of the information-theoretic lower bound), including a task on which the NS-RNN previously failed to outperform a deterministic stack RNN baseline. Finally, we propose a restricted version of the NS-RNN that incrementally processes infinitely long sequences, and we present language modeling results on the Penn Treebank.
Multi-head attention is the driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that, for many applications, these attention heads learn redundant embeddings and most of them can be removed without degrading the model's performance. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute, while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the WikiText-103 and Long Range Arena benchmarks, Transformer-MGK with 4 heads attains comparable or better performance than the baseline transformer with 8 heads.
State space models (SSMs) have recently been shown to be very effective as a deep learning layer, providing a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model, when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically, by proving that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study of the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code. It performs comparably to S4 in almost all settings, with state-of-the-art results in the image, audio, and medical time-series domains and an average of 85% on the Long Range Arena benchmark.
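A sketch of the diagonal-SSM kernel computation that S4D reduces to roughly two lines: with a diagonal state matrix, the length-L convolution kernel is a Vandermonde matrix-vector product. Zero-order-hold discretization is assumed, and the final `2 * K.real` assumes parameters are stored as one half of conjugate pairs, as in S4-style parameterizations; names are illustrative, not the paper's code.

```python
import numpy as np

def diagonal_ssm_kernel(A, B, C, dt, L):
    """A, B, C: (N,) complex diagonal SSM parameters; returns a real kernel of length L."""
    Abar = np.exp(dt * A)                        # discretized diagonal state matrix
    Bbar = (Abar - 1.0) / A * B                  # zero-order-hold input matrix
    vandermonde = Abar[:, None] ** np.arange(L)  # (N, L): Abar_n ** l
    K = (Bbar * C) @ vandermonde                 # K[l] = sum_n C_n * Abar_n**l * Bbar_n
    return 2 * K.real

# The kernel is then applied with an FFT convolution, e.g.
# y = np.fft.irfft(np.fft.rfft(u, 2*L) * np.fft.rfft(K, 2*L), 2*L)[:L]
```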
Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but challenges remain, as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices. As butterfly matrices are not hardware efficient, we propose simple variants of butterfly (block and flat) to take advantage of modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity pattern based on flat block butterfly and low-rank matrices to sparsify most network layers (e.g., attention, MLP). We empirically validate that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy-efficiency tradeoffs. On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in accuracy.
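A rough sketch of the "fixed sparsity pattern plus low-rank" layer shape described above. The true flat block butterfly mask also connects blocks across the matrix and is stored in a hardware-friendly block format; a plain block-diagonal mask stands in for it here, and all names are illustrative.

```python
import numpy as np

def block_diagonal_mask(n, n_blocks):
    """A fixed block pattern standing in for the flat-block-butterfly mask."""
    mask = np.zeros((n, n), dtype=bool)
    b = n // n_blocks
    for i in range(n_blocks):
        mask[i * b:(i + 1) * b, i * b:(i + 1) * b] = True
    return mask

class SparsePlusLowRankLinear:
    """y = x @ (mask * W + U @ V.T): a fixed-sparsity term plus a low-rank term."""
    def __init__(self, n, n_blocks=4, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.mask = block_diagonal_mask(n, n_blocks)
        self.W = rng.standard_normal((n, n)) * 0.02
        self.U = rng.standard_normal((n, rank)) * 0.02
        self.V = rng.standard_normal((n, rank)) * 0.02

    def __call__(self, x):
        # A real implementation multiplies the sparse term in a block-sparse format;
        # the dense mask here is only for readability.
        return x @ (self.mask * self.W) + (x @ self.U) @ self.V.T
```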
Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.
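A minimal sketch of the change-of-variables machinery the review describes, using affine coupling layers. Here `shift` and `log_scale` stand for small learned networks; they are illustrative callables, not a specific library API.

```python
import numpy as np

def affine_coupling(x, shift, log_scale):
    """Bijection: keep the first half of x, affinely transform the second half
    conditioned on the first. Returns (y, log|det J|)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y2 = x2 * np.exp(log_scale(x1)) + shift(x1)
    log_det = np.sum(log_scale(x1), axis=-1)
    return np.concatenate([x1, y2], axis=-1), log_det

def flow_log_prob(x, layers, base_log_prob):
    """Change of variables: log p(x) = log p_base(f(x)) + sum of log|det J| terms."""
    z, total = x, 0.0
    for shift, log_scale in layers:
        z, ld = affine_coupling(z, shift, log_scale)
        total = total + ld
    return base_log_prob(z) + total

# Example base density: a standard normal,
# base_log_prob = lambda z: -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)
```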
In recent years, Transformer-based pretrained models have made great progress and become one of the most important backbones in natural language processing. Recent work has shown that the attention mechanism inside Transformers may not be necessary, and convolutional neural networks and multi-layer-perceptron-based models have also been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pretraining, which builds a graph structure for each sequence with local token-level communication, together with a sentence-level representation decoupled from the other tokens. The original model performs well on domain-specific text classification under supervised training; however, its potential for learning transferable knowledge through self-supervised training has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks, in both English and Chinese. As for model efficiency, instead of the quadratic complexity of Transformer-based models, our model has linear complexity and performs more efficiently during inference. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
Acquiring labels for supervised learning can be expensive. In order to improve the sample efficiency of neural network regression, we study active learning methods that adaptively select unlabeled data for labeling. We propose a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width neural tangent kernels, and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression datasets. Our proposed method outperforms the state of the art on our benchmark, scales to large datasets, and works out of the box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used to reproduce our results.
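A small sketch of one selection method such a framework can express: greedy farthest-point selection in a network-dependent feature space (e.g. last-layer or sketched NTK features). The paper's own clustering-based method is more refined; this only illustrates the kernel/feature plus selection-rule structure, and the names are illustrative.

```python
import numpy as np

def farthest_point_selection(pool_feats, train_feats, batch_size):
    """pool_feats: (n_pool, d); train_feats: (n_train, d); returns selected pool indices."""
    # squared distance from each pool point to its nearest labeled point
    d2 = ((pool_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1).min(axis=1)
    selected = []
    for _ in range(batch_size):
        idx = int(np.argmax(d2))                 # most distant point so far
        selected.append(idx)
        d2 = np.minimum(d2, ((pool_feats - pool_feats[idx]) ** 2).sum(-1))
    return selected
```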
The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks to learn operators that map between infinite-dimensional function spaces. We formulate the approximation of operators by a composition of a class of linear integral operators and nonlinear activation functions, so that the composed operator can approximate complex nonlinear operators. We prove a universal approximation theorem for our construction. Furthermore, we introduce four classes of operator parameterizations: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators, and describe efficient algorithms for computing with each one. The proposed neural operators are resolution-invariant: they share the same network parameters between different discretizations of the underlying function spaces and can be used for zero-shot super-resolution. Numerically, the proposed models show superior performance compared to existing machine-learning-based methodologies on Darcy flow and the Navier-Stokes equation, while being far more efficient than conventional PDE solvers.
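A minimal sketch of a Fourier-type operator layer on a 1-D grid: transform the input channels to frequency space, apply a learned linear map to the lowest n_modes frequencies, drop the rest, and transform back. Because the weights live in frequency space, the same layer can be applied at different grid resolutions. A full neural operator adds a pointwise linear term and a nonlinearity per layer; names here are illustrative.

```python
import numpy as np

def fourier_operator_layer(u, weights, n_modes):
    """u: (grid, channels) real; weights: (n_modes, channels, channels) complex."""
    u_hat = np.fft.rfft(u, axis=0)                          # (grid//2 + 1, channels)
    out_hat = np.zeros_like(u_hat)
    out_hat[:n_modes] = np.einsum("kc,kcd->kd", u_hat[:n_modes], weights)
    return np.fft.irfft(out_hat, n=u.shape[0], axis=0)      # back to the spatial grid
```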
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(N)\), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
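A compact sketch of the linearized attention described above, with the elu(x) + 1 feature map: by associativity, phi(Q)(phi(K)^T V) costs O(N) instead of the O(N^2) of softmax(QK^T)V, and the causal variant maintains running sums, which is the recurrent view noted in the abstract. A sketch only; function names are illustrative.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) for x <= 0 (strictly positive)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Non-causal: phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)."""
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.T @ V                                    # (d, d_v), summed over positions
    Z = Qp @ Kp.sum(axis=0)                          # (N,) normalizers
    return (Qp @ KV) / Z[:, None]

def causal_linear_attention(Q, K, V):
    """Autoregressive variant with running sums S and z over past positions."""
    Qp, Kp = feature_map(Q), feature_map(K)
    S = np.zeros((Q.shape[1], V.shape[1]))
    z = np.zeros(Q.shape[1])
    out = np.zeros_like(V, dtype=float)
    for t in range(Q.shape[0]):
        S += np.outer(Kp[t], V[t])
        z += Kp[t]
        out[t] = (Qp[t] @ S) / (Qp[t] @ z)
    return out
```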
Deep-learning-based forecasting methods have become the method of choice in many applications of time series prediction or forecasting, often outperforming other approaches. As a consequence, over the past few years these methods have become ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased academic interest in understanding and improving deep forecasting methods. In this article we provide an introduction and overview of the field: we present, in some depth, the important building blocks of deep forecasting, and then use these building blocks to survey the breadth of the recent deep forecasting literature.
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.