Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size -- splitting the parameters into small independently quantized blocks -- and the quantization data type being used (e.g., Int vs Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
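The "same number of bits" comparison above is simple arithmetic, sketched below. The helper names are hypothetical, and the per-block overhead formula assumes one 16-bit scaling constant stored per block, which is an illustrative assumption rather than a detail stated in the abstract.

```python
def total_model_bits(num_params: int, bits_per_param: float) -> float:
    """Bits needed to store the weights at a given precision."""
    return num_params * bits_per_param


def effective_bits_per_param(k: int, block_size: int, scale_bits: int = 16) -> float:
    """k-bit weights plus the amortized cost of one scale per block.

    Assumption: each block of `block_size` weights stores one
    `scale_bits`-bit scaling constant alongside its k-bit codes.
    """
    return k + scale_bits / block_size


bits_30b_int8 = total_model_bits(30_000_000_000, 8)
bits_60b_int4 = total_model_bits(60_000_000_000, 4)

print(bits_30b_int8 == bits_60b_int4)   # both models occupy the same number of bits
print(bits_30b_int8 / 8 / 1e9)          # weight storage in gigabytes
print(effective_bits_per_param(4, 64))  # 4-bit codes with a block size of 64
```

Under this assumption, smaller blocks cost slightly more bits per parameter, which is why block size enters the bit-level trade-off at all.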
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
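Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters. With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale. Still, using these models requires high-end hardware unavailable to many researchers. In some cases, LLMs can be used more affordably via RAM offloading or hosted APIs. However, these techniques have innate limitations: offloading is too slow for interactive inference, while APIs are not flexible enough for research. In this work, we propose Petals, a system for inference and fine-tuning of large models collaboratively by joining the resources of multiple parties trusted to process client data. We demonstrate that this strategy significantly outperforms offloading for very large models, running inference at approximately 1 step per second. Unlike most inference APIs, Petals also natively exposes the hidden states of served models, allowing its users to train and share custom model extensions based on efficient fine-tuning methods.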
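Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of values are still multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs.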
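The two-part procedure described above can be illustrated with a small pure-Python sketch. It is not the LLM.int8() implementation: the function names and the threshold value are assumptions made for illustration, and ordinary Python floats stand in for the 16-bit path. Rows of the input and columns of the weight matrix each get their own absmax scale (vector-wise quantization), while feature dimensions that contain a large-magnitude value are multiplied at higher precision.

```python
def absmax_quantize(vec):
    """Map a vector to integers in [-127, 127] using a single absmax scale."""
    scale = max((abs(v) for v in vec), default=0.0) / 127.0 or 1.0
    return [round(v / scale) for v in vec], scale


def find_outlier_dims(X, threshold):
    """Feature dimensions (columns of X) holding any value with magnitude >= threshold."""
    return [k for k in range(len(X[0]))
            if any(abs(row[k]) >= threshold for row in X)]


def mixed_precision_matmul(X, W, threshold=6.0):
    """Compute X @ W with int8-style products for regular dims, floats for outliers."""
    n, h, m = len(X), len(W), len(W[0])
    special = find_outlier_dims(X, threshold)
    regular = [k for k in range(h) if k not in special]
    # Vector-wise quantization: one scale per row of X and per column of W.
    x_q = [absmax_quantize([X[i][k] for k in regular]) for i in range(n)]
    w_q = [absmax_quantize([W[k][j] for k in regular]) for j in range(m)]
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            (xi, sx), (wj, sw) = x_q[i], w_q[j]
            int_acc = sum(a * b for a, b in zip(xi, wj))        # integer accumulation
            high = sum(X[i][k] * W[k][j] for k in special)      # outlier dims, unquantized
            out[i][j] = int_acc * sx * sw + high
    return out


X = [[0.5, -1.0, 20.0], [1.5, 0.25, -30.0]]   # third feature dimension is an outlier
W = [[0.1, -0.2], [0.3, 0.4], [0.05, -0.1]]
print(find_outlier_dims(X, 6.0))
print(mixed_precision_matmul(X, W))
```

Keeping the outlier dimension out of the quantized path means its large values do not inflate the row scales, so the small values in the remaining dimensions keep their resolution.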
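We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in-domain and out-of-domain perplexities compared to GPT-style Transformer LMs when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM to a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting that more aggressive parallelism could be used to efficiently train larger models in future work.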
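The "averaged to collapse back to a single LM" step can be pictured as a weighted average of expert parameters. The sketch below is a minimal illustration under assumptions: parameters are plain dictionaries of float lists, the function name is hypothetical, and how the weights would be chosen is not specified in the abstract.

```python
def average_experts(experts, weights=None):
    """Weighted average of parameter dictionaries with identical keys and shapes.

    `experts` is a list of {name: [floats]} dictionaries (a stand-in for real
    model state). `weights` defaults to a uniform average and is normalized
    so that it sums to one.
    """
    if weights is None:
        weights = [1.0] * len(experts)
    total = sum(weights)
    norm = [w / total for w in weights]
    merged = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(size)
        ]
    return merged


science_elm = {"layer.weight": [1.0, 2.0]}   # toy expert parameters
legal_elm = {"layer.weight": [3.0, 4.0]}

print(average_experts([science_elm, legal_elm]))
print(average_experts([science_elm, legal_elm], weights=[0.25, 0.75]))
```

The same routine could also serve the branching step, since a new expert initialized from a mixture of existing ones starts from exactly this kind of average.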
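The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explain how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discuss how viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.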
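Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear optimization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance on a range of tasks, including 1.5B-parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.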
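Block-wise quantization as described above can be sketched in a few lines. This illustration uses plain linear absmax quantization inside each block, which is a simplification and not the dynamic (non-linear) data type the abstract refers to; the function names and toy values are assumptions.

```python
def quantize_blockwise(values, block_size, bits=8):
    """Quantize each block independently with its own absmax scale."""
    levels = 2 ** (bits - 1) - 1            # 127 for 8-bit signed codes
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / levels or 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Invert quantize_blockwise using the scale of the block each code came from."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def max_abs_error(values, block_size, upto):
    """Largest reconstruction error over the first `upto` entries."""
    codes, scales = quantize_blockwise(values, block_size)
    restored = dequantize_blockwise(codes, scales, block_size)
    return max(abs(a - b) for a, b in zip(values[:upto], restored[:upto]))


# Toy optimizer state: small values plus one outlier in the second half.
state = [0.01, -0.02, 0.03, 0.015, 100.0, 0.02, -0.01, 0.005]

print(max_abs_error(state, block_size=8, upto=4))  # outlier shares a scale with everything
print(max_abs_error(state, block_size=4, upto=4))  # outlier is confined to its own block
```

With a single block, the outlier sets the scale for every entry and the small values lose nearly all resolution; with two blocks, only the block containing the outlier is affected. Because blocks are independent, they can also be processed in parallel.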
Link prediction for knowledge graphs is the task of predicting missing relationships between entities. Previous work on link prediction has focused on shallow, fast models which can scale to large knowledge graphs. However, these models learn less expressive features than deep, multi-layer models, which potentially limits performance. In this work we introduce ConvE, a multi-layer convolutional network model for link prediction, and report state-of-the-art results for several established datasets. We also show that the model is highly parameter efficient, yielding the same performance as DistMult and R-GCN with 8x and 17x fewer parameters. Analysis of our model suggests that it is particularly effective at modelling nodes with high indegree, which are common in highly connected, complex knowledge graphs such as Freebase and YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer from test set leakage, due to inverse relations from the training set being present in the test set; however, the extent of this issue has so far not been quantified. We find this problem to be severe: a simple rule-based model can achieve state-of-the-art results on both WN18 and FB15k. To ensure that models are evaluated on datasets where simply exploiting inverse relations cannot yield competitive results, we investigate and validate several commonly used datasets, deriving robust variants where necessary. We then perform experiments on these robust datasets for our own and several previously proposed models, and find that ConvE achieves state-of-the-art Mean Reciprocal Rank across most datasets.
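The test set leakage described above can be made concrete with a toy inverse-relation rule. This is a minimal sketch, not the rule-based model from the paper: the triples, the relation names, the `inverse_of` mapping, and the function names are all hypothetical, and the mapping is given by hand rather than mined from data.

```python
def build_inverse_index(train_triples, inverse_of):
    """Index answers that follow from inverting a training triple.

    For each training triple (s, r, o) whose relation r has a known inverse,
    record that the query (o, inverse(r), ?) has s as an answer.
    """
    index = {}
    for s, r, o in train_triples:
        r_inv = inverse_of.get(r)
        if r_inv is not None:
            index.setdefault((o, r_inv), set()).add(s)
    return index


def leaked_fraction(test_triples, index):
    """Share of test triples the inverse rule answers without any learning."""
    hits = sum(1 for s, r, o in test_triples if o in index.get((s, r), set()))
    return hits / len(test_triples)


train = [("dog", "hypernym", "animal"), ("cat", "hypernym", "animal")]
inverse_of = {"hypernym": "hyponym", "hyponym": "hypernym"}
test = [("animal", "hyponym", "dog"), ("animal", "hyponym", "bird")]

index = build_inverse_index(train, inverse_of)
print(leaked_fraction(test, index))
```

A high leaked fraction on a benchmark would indicate that a model can score well by memorizing inverse pairs, which is the motivation for deriving variants with such relations removed.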
Modeling lies at the core of both the financial and the insurance industry for a wide variety of tasks. The rise and development of machine learning and deep learning models have created many opportunities to improve our modeling toolbox. Breakthroughs in these fields often come with the requirement of large amounts of data. Such large datasets are often not publicly available in finance and insurance, mainly due to privacy and ethics concerns. This lack of data is currently one of the main hurdles in developing better models. One possible option for alleviating this issue is generative modeling. Generative models are capable of simulating fake but realistic-looking data, also referred to as synthetic data, that can be shared more freely. Generative Adversarial Networks (GANs) are one such class of models, increasing our capacity to fit very high-dimensional distributions of data. While research on GANs is an active topic in fields like computer vision, they have found limited adoption within the human sciences, like economics and insurance. The reason for this is that in these fields, most questions are inherently about identification of causal effects, while to this day neural networks, which are at the center of the GAN framework, focus mostly on high-dimensional correlations. In this paper we study the causal preservation capabilities of GANs and whether the produced synthetic data can reliably be used to answer causal questions. This is done by performing causal analyses on the synthetic data produced by a GAN, with increasingly lenient assumptions. We consider the cross-sectional case, the time series case, and the case with a complete structural model. It is shown that in the simple cross-sectional scenario, where correlation equals causation, the GAN preserves causality, but that challenges arise for more advanced analyses.
KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by non-parametric behavioral reference policies and that this allows KL-regularized reinforcement learning to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.