The scaling properties of Transformer models have attracted considerable interest. However, little has been done to investigate how different inductive biases and model architectures affect scaling behaviour. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pre-training) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, dynamic convolutions, Performers, and the recently proposed MLP-Mixer. Through extensive experiments, we show that (1) architecture is indeed an important consideration when scaling, and (2) the best-performing model can change across scales. We believe the findings outlined in this work have important implications for how model architectures are currently evaluated in the community.
In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) models defy this convention and instead select different parameters for each incoming example. The result is a sparsely activated model, with an outrageous number of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability; we address these issues with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive, improved models with reduced communication and computational costs. Our proposed training techniques help tame the instabilities, and we show that large sparse models can, for the first time, be trained in a lower-precision (bfloat16) format. We design models based on T5-Base and T5-Large to obtain up to 7x speedups in pre-training with the same computational resources. These improvements extend to multilingual settings, where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training models of up to a trillion parameters on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
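The core of this kind of routing is a top-1 ("switch") assignment: each token is sent to the single expert with the highest router probability, and that expert's output is scaled by the probability. Below is a minimal numpy sketch of top-1 expert routing under made-up shapes and random weights; it illustrates the idea only and is not the paper's implementation, which additionally uses expert capacity limits and a load-balancing loss.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, d_ff = 8, 16, 4, 32

# Token representations entering the MoE layer (toy shapes, random values).
x = rng.normal(size=(num_tokens, d_model))

# Router: a single linear projection from d_model to num_experts.
w_router = rng.normal(size=(d_model, num_experts)) * 0.02

# Each expert is an independent two-layer feed-forward network.
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.02, rng.normal(size=(d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Top-1 ("switch") routing: each token goes to exactly one expert.
router_probs = softmax(x @ w_router)                       # (num_tokens, num_experts)
expert_index = router_probs.argmax(axis=-1)                # chosen expert per token
gate = router_probs[np.arange(num_tokens), expert_index]   # probability of that choice

y = np.zeros_like(x)
for e, (w_in, w_out) in enumerate(experts):
    tokens = np.where(expert_index == e)[0]
    if tokens.size == 0:
        continue  # this expert receives no tokens in this batch
    h = np.maximum(x[tokens] @ w_in, 0.0)                  # ReLU feed-forward expert
    y[tokens] = gate[tokens, None] * (h @ w_out)

print("expert assignment per token:", expert_index)
```

Because every token activates only one expert's weights, the parameter count grows with the number of experts while the per-token compute stays roughly constant.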
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but appears in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM--the Big Science Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience .
Sparse expert models are a thirty-year-old concept that has re-emerged as a popular architecture in deep learning. This class of architectures includes Mixture-of-Experts, Switch Transformers, routing networks, BASE layers, and others, all unified by the idea that each example is processed by only a subset of the parameters. In this way, sparsity decouples the parameter count from the compute per example, allowing extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances of the deep learning era, and conclude by highlighting areas for future work.
Despite the recent success of multi-task learning and transfer learning in natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks spanning diverse domains and task families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date and analyze co-training transfer among common task families. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can improve models on its own. Finally, we propose ExT5: a model pre-trained with a multi-task objective combining self-supervised span denoising and supervised ExMix. Through extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, and closed-book QA tasks, as well as on several tasks outside of ExMix. ExT5 also significantly improves sample efficiency during pre-training.
Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
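A rough sketch of the upcycling initialization, under assumed parameter names: every expert in the new MoE layer starts as a copy of the dense checkpoint's feed-forward weights, the non-expert parameters are carried over unchanged, and only the router is freshly initialized, so training resumes from the dense model's behaviour rather than from scratch. This illustrates the initialization step only and is not the authors' codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 16, 64, 4

# A stand-in for one Transformer block of a dense checkpoint (hypothetical layout).
dense_block = {
    "attention": rng.normal(size=(d_model, d_model)),
    "ffn_in":    rng.normal(size=(d_model, d_ff)),
    "ffn_out":   rng.normal(size=(d_ff, d_model)),
}

def upcycle_block(block, num_experts):
    """Turn a dense block into an MoE block by replicating its FFN into every expert."""
    return {
        # Non-expert parameters are copied over unchanged.
        "attention": block["attention"].copy(),
        # The router has no dense counterpart, so it is freshly initialized.
        "router": rng.normal(size=(d_model, num_experts)) * 0.02,
        # Each expert starts as an exact copy of the dense feed-forward weights.
        "experts": [
            {"ffn_in": block["ffn_in"].copy(), "ffn_out": block["ffn_out"].copy()}
            for _ in range(num_experts)
        ],
    }

moe_block = upcycle_block(dense_block, num_experts)
assert np.allclose(moe_block["experts"][0]["ffn_in"], dense_block["ffn_in"])
print("experts initialized from dense FFN:", len(moe_block["experts"]))
```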
Scale has been a major driving force in improving machine learning performance, and understanding scaling laws is essential for sustainable growth in model quality, long-term resource planning, and developing efficient system infrastructure to support large-scale models. In this paper, we study empirical scaling laws for DLRM-style recommendation models, in particular click-through rate (CTR) models. We observe that model quality scales as a power law plus a constant in model size, data size, and the amount of compute used for training. We characterize scaling efficiency along three different resource dimensions, namely data, parameters, and compute, by comparing different scaling schemes along these axes. We show that parameter scaling is running out of steam for the model architecture under study, and that until a higher-performing architecture emerges, data scaling is the path forward. The key research questions addressed by this study include: does recommendation model quality scale sustainably as predicted by the scaling law, or are we far off from those predictions? What are the limits of scaling? What are the implications of scaling laws for long-term hardware and system development?
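The "power law plus a constant" form can be written as L(x) = a * x^(-b) + c, where x is the resource (data, parameters, or compute) and c is an irreducible floor. A small sketch of fitting that functional form to synthetic points with scipy; all numbers below are made up for illustration and are not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law_plus_constant(x, a, b, c):
    """Quality metric (e.g. loss) as a power law in a resource x, plus a constant floor."""
    return a * np.power(x, -b) + c

# Synthetic "loss vs. model size (millions of parameters)" points, purely illustrative.
params_m = np.array([10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0])
loss = power_law_plus_constant(params_m, a=4.0, b=0.2, c=0.5)
loss = loss + np.random.default_rng(0).normal(scale=0.01, size=loss.shape)

# Recover a, b, c from the noisy points; p0 gives rough starting values.
(a, b, c), _ = curve_fit(power_law_plus_constant, params_m, loss, p0=(1.0, 0.1, 1.0))
print(f"fitted: a={a:.2f}, b={b:.3f}, c={c:.3f}")
```

The fitted exponent b measures how quickly returns diminish along that resource axis, and comparing exponents across data, parameter, and compute scaling is what the study's efficiency comparison amounts to.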
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
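A minimal numpy sketch of the two layer types described above, with made-up dimensions: the token-mixing MLP operates across patches (applied per channel) and the channel-mixing MLP operates across channels (applied per patch). LayerNorm and skip connections appear here in simplified form, and ReLU stands in for the paper's GELU; this is an illustration, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, channels, d_token, d_channel = 16, 32, 64, 128

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # two-layer MLP with ReLU (paper uses GELU)

# A toy "patches x channels" table of patch embeddings.
x = rng.normal(size=(num_patches, channels))

# Token-mixing weights act on the patch dimension; channel-mixing on the channel dimension.
w_tok1 = rng.normal(size=(num_patches, d_token)) * 0.02
w_tok2 = rng.normal(size=(d_token, num_patches)) * 0.02
w_ch1 = rng.normal(size=(channels, d_channel)) * 0.02
w_ch2 = rng.normal(size=(d_channel, channels)) * 0.02

def mixer_block(x):
    # Token mixing: transpose so the MLP mixes information across patches, per channel.
    y = x + mlp(layer_norm(x).T, w_tok1, w_tok2).T
    # Channel mixing: the MLP mixes information across channels, per patch.
    return y + mlp(layer_norm(y), w_ch1, w_ch2)

print(mixer_block(x).shape)  # (num_patches, channels)
```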
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al.,2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
Mixture-of-Experts (MoE) layers enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models compare with dense models across a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute-efficient. At more modest training budgets, MoEs can match the performance of dense models using roughly 4x less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies significantly across tasks and domains, suggesting that MoE and dense models generalize in different ways that merit further study. We make our code and models publicly available for research use.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
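A minimal sketch, with assumed image and patch sizes, of the step ViT rests on: cut an image into fixed-size patches, flatten each patch, and linearly project it to the model dimension so a standard Transformer can consume the result as a token sequence, with a class token and position embeddings added. The Transformer encoder itself is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: a 64x64 RGB image split into 16x16 patches.
H = W = 64
P = 16               # patch size
C = 3                # image channels
d_model = 128

image = rng.normal(size=(H, W, C))

# 1. Split into non-overlapping PxP patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (num_patches, P*P*C)

# 2. Linear projection of flattened patches to the model dimension.
w_embed = rng.normal(size=(P * P * C, d_model)) * 0.02
tokens = patches @ w_embed                        # (num_patches, d_model)

# 3. Prepend a class token and add position embeddings.
cls_token = rng.normal(size=(1, d_model)) * 0.02
tokens = np.concatenate([cls_token, tokens], axis=0)
pos_embed = rng.normal(size=tokens.shape) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (1 + (H//P) * (W//P), d_model) -> (17, 128)
```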
Scaling language models with more data, compute, and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants. The largest GLaM has 1.2 trillion parameters, roughly 7x larger than GPT-3. It consumes only one third of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a later layer, to reduce the sequence length for earlier layers and make the pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) The gathering of contextualized information from unmasked tokens can be conducted with a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
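A shape-level numpy sketch of the idea: the lower layers process only the unmasked tokens (a shorter, cheaper sequence), and the [MASK] placeholders are appended, with their position embeddings, only for the upper layers. The toy "layer" here is a single-head self-attention without the feed-forward sublayer, and all sizes and the masking choice are made up for illustration; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_lower, num_upper = 12, 32, 4, 2
mask_rate = 0.5

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_layer(h, w_qkv, w_out):
    q, k, v = np.split(h @ w_qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return h + attn @ v @ w_out  # residual connection; feed-forward omitted for brevity

# Token and position embeddings for a toy sequence, plus the shared [MASK] embedding.
token_emb = rng.normal(size=(seq_len, d_model))
pos_emb = rng.normal(size=(seq_len, d_model)) * 0.02
mask_emb = rng.normal(size=(d_model,)) * 0.02

masked = rng.random(seq_len) < mask_rate
unmasked_idx = np.where(~masked)[0]
masked_idx = np.where(masked)[0]

layers = [
    (rng.normal(size=(d_model, 3 * d_model)) * 0.02, rng.normal(size=(d_model, d_model)) * 0.02)
    for _ in range(num_lower + num_upper)
]

# Lower layers see only the unmasked tokens: a shorter sequence.
h = token_emb[unmasked_idx] + pos_emb[unmasked_idx]
for w_qkv, w_out in layers[:num_lower]:
    h = self_attention_layer(h, w_qkv, w_out)

# [MASK] placeholders (with their position embeddings) are appended only now.
h_masks = mask_emb[None, :] + pos_emb[masked_idx]
h = np.concatenate([h, h_masks], axis=0)
for w_qkv, w_out in layers[num_lower:]:
    h = self_attention_layer(h, w_qkv, w_out)

print("lower-layer length:", len(unmasked_idx), "upper-layer length:", h.shape[0])
```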
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long-Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long-Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
We show that autoregressive language models can learn to infill text after a straightforward transformation is applied to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models on a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill in the middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices for training FIM models. We have released our best infilling model, trained with these best practices, in our API, and we release our infilling benchmarks to aid future research.
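The data transformation itself is simple enough to show directly: split a document into a prefix, middle, and suffix, then reorder them so the middle comes last, delimited by sentinel tokens. The sentinel names below (<PRE>, <SUF>, <MID>) are placeholders chosen for illustration; the paper uses dedicated special tokens playing the same roles in prefix-suffix-middle order.

```python
import random

# Placeholder sentinel strings; in practice these are dedicated special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Move a randomly chosen middle span to the end (prefix-suffix-middle order)."""
    # Choose two cut points that split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model is then trained left-to-right on this reordered string,
    # so generating the "middle" becomes an ordinary continuation task.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
doc = "def add(a, b):\n    return a + b\n"
print(fim_transform(doc, rng))
```

At inference time the model is prompted with the prefix and suffix in the same format and asked to continue, which yields the infilled middle span.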
While large pre-trained Transformer models have proven highly capable at solving natural language tasks, handling long-sequence inputs remains a significant challenge. One such task is long-input summarization, where inputs are longer than the maximum input context of most pre-trained models. Through an extensive set of experiments, we investigate which model architectural changes and pre-training paradigms most efficiently adapt a pre-trained Transformer for long-input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance between performance and efficiency, and that an additional pre-training phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long-input pre-training to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long-input summarization tasks, comparable with much larger models, while adding few additional parameters and not requiring model parallelism to train.
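A sketch, under assumed sizes, of the attention pattern described above: each token attends within its local block and to a small set of global tokens, global tokens attend everywhere, and "staggering" offsets the block boundaries in alternating layers so information can cross block edges. Only the boolean attention mask is constructed here, not the full encoder, and the sizes are hypothetical.

```python
import numpy as np

seq_len, block_size, num_global = 16, 4, 2

def block_local_mask(seq_len, block_size, num_global, offset=0):
    """Boolean mask: True where attention is allowed (global tokens occupy the first rows)."""
    total = num_global + seq_len
    mask = np.zeros((total, total), dtype=bool)
    # Global tokens attend to everything, and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    # Regular tokens attend only within their (possibly offset) local block.
    block_id = (np.arange(seq_len) + offset) // block_size
    same_block = block_id[:, None] == block_id[None, :]
    mask[num_global:, num_global:] = same_block
    return mask

# Staggered layers: even layers use aligned blocks, odd layers shift them by half a block.
mask_even = block_local_mask(seq_len, block_size, num_global, offset=0)
mask_odd = block_local_mask(seq_len, block_size, num_global, offset=block_size // 2)

print("allowed attention pairs, even vs. odd layers:", mask_even.sum(), mask_odd.sum())
```

Because each non-global token attends to only O(block_size + num_global) positions, the cost grows linearly with sequence length rather than quadratically.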
During pre-training, Pre-LayerNorm Transformers suffer from a gradient magnitude mismatch: gradients at early layers are much larger than those at later layers. Our proposed NormFormer architecture mitigates these issues by adding three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of the self-attention outputs, and a LayerNorm after the first fully connected layer. The extra operations incur negligible compute cost (a +0.4% parameter increase), but improve pre-training perplexity and downstream task performance for both causal and masked language models ranging from 125 million to 2.7 billion parameters. For example, adding NormFormer on top of our strongest 1.3B-parameter baseline reaches equal perplexity 24% faster, or converges to 0.27 better perplexity within the same compute budget. The model reaches GPT3-Large (1.3B) zero-shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq: https://github.com/pytorch/fairseq/tree/main/examples/normformer
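A numpy sketch of roughly where the three extra operations sit inside one layer, with toy shapes and a single simplified self-attention: (1) a LayerNorm applied to the self-attention output, (2) a learned per-head scale on the attention heads, and (3) a LayerNorm after the first fully connected layer. This illustrates the placements described above under simplifying assumptions (no learned LayerNorm gains, ReLU instead of GELU) and is not the fairseq implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_ff = 8, 32, 4, 64
d_head = d_model // num_heads

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
w_fc1 = rng.normal(size=(d_model, d_ff)) * 0.02
w_fc2 = rng.normal(size=(d_ff, d_model)) * 0.02
head_scale = np.ones(num_heads)  # learned per-head scale, initialized to 1

def normformer_layer(x):
    # --- self-attention sublayer (Pre-LN) ---
    h = layer_norm(x)
    q = (h @ w_q).reshape(seq_len, num_heads, d_head)
    k = (h @ w_k).reshape(seq_len, num_heads, d_head)
    v = (h @ w_v).reshape(seq_len, num_heads, d_head)
    attn = softmax(np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head))
    heads = np.einsum("hqk,khd->qhd", attn, v)
    heads = heads * head_scale[None, :, None]          # (2) head-wise scaling
    attn_out = heads.reshape(seq_len, d_model) @ w_o
    x = x + layer_norm(attn_out)                       # (1) LayerNorm after self-attention
    # --- feed-forward sublayer (Pre-LN) ---
    h = layer_norm(x)
    h = np.maximum(h @ w_fc1, 0.0)
    h = layer_norm(h)                                  # (3) LayerNorm after the first FC layer
    return x + h @ w_fc2

print(normformer_layer(x).shape)  # (seq_len, d_model)
```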
Getting the most out of limited resources allows advances in natural language processing (NLP) research and practice while conserving those resources. The resources in question may be data, time, storage, or energy. Recent work in NLP has yielded interesting results from scaling; however, using scale alone to improve results means that resource consumption also scales. This relationship motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes methods and findings on efficiency in NLP, aiming to guide new researchers in the field and to inspire the development of new methods.