转移学习提供了一种在学习另一个任务时从一个任务中利用知识的方式。执行转移学习通常涉及通过训练数据集上的梯度下降来迭代地更新模型的参数。在本文中,我们介绍了一种基本上不同的方法,用于将知识转移到跨模型,这些方法将多个模型“合并”成一个。我们的方法有效地涉及计算模型参数的加权平均值。我们表明,该平均值相当于从模型权重的后部的大致抽样。在某些情况下使用各向同性高斯近似时,我们还通过Fisher信息近似于精确矩阵来证明优势。总之,我们的方法使得与基于标准梯度的培训相比,可以以极低的计算成本将多种模型中的“知识”组合。我们展示了模型合并在中间任务培训和域适应问题上实现了基于梯度下降的转移学习的可比性。我们还表明,我们的合并程序使得可以以先前未开发的方式结合模型。为了测量我们方法的稳健性,我们对我们算法的设计进行了广泛的消融。
translated by 谷歌翻译
在基于典型的深度神经网络训练期间,所有模型的参数都在每次迭代时更新。最近的工作表明,在训练期间只能更新模型参数的小型子集,这可以减轻存储和通信要求。在本文中,我们表明,可以在模型的参数上诱导一个固定的稀疏掩码,该屏蔽选择要在许多迭代中更新的子集。我们的方法用最大的Fisher信息构造出k $参数的掩码,作为一个简单的近似,与手头的任务最重要的近似值。在参数高效转移学习和分布式培训的实验中,我们表明我们的方法与其他方法的性能相匹配或超出稀疏更新的其他方法的性能,同时在内存使用和通信成本方面更有效。我们公开发布我们的代码,以促进我们的方法的进一步应用。
translated by 谷歌翻译
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well both across all data set domains and can generalize on out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, thus making it applicable to a wider set of scenarios.
translated by 谷歌翻译
最大化模型准确性的常规配方是(1)具有各种超参数的多个模型,以及(2)选择在固定验证集中表现最佳的单个模型,从而丢弃其余部分。在本文中,我们在微调大型预训练的模型的背景下重新审视了该过程的第二步,其中微调模型通常位于单个低误差盆地中。我们表明,平均多种模型的权重以不同的超参数配置进行了微调通常提高准确性和鲁棒性。与传统的合奏不同,我们可能会平均许多模型,而不会产生任何其他推理或记忆成本 - 我们将结果称为“模型汤”。当微调大型预训练的模型,例如夹子,Align和VIT-G在JFT上预先训练的VIT-G时,我们的汤食谱可为ImageNet上的超参数扫描中的最佳模型提供显着改进。所得的VIT-G模型在Imagenet上达到90.94%的TOP-1准确性,实现了新的最新状态。此外,我们表明,模型汤方法扩展到多个图像分类和自然语言处理任务,改善分发性能,并改善新下游任务的零局部性。最后,我们通过分析将权重平衡和与logit浓度的性能相似与预测的损失和信心的平坦度联系起来,并经过经验验证这种关系。代码可从https://github.com/mlfoundations/model-soups获得。
translated by 谷歌翻译
Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around \textit{task vectors}. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D", combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.
translated by 谷歌翻译
在许多图像分类任务中,诸如夹子之类的开放式摄影模型具有高精度。但是,在某些设置中,他们的零拍摄性能远非最佳。我们研究模型修补程序,目的是提高对特定任务的准确性,而不会在表现已经足够的任务上降低准确性。为了实现这一目标,我们引入了油漆,这是一种修补方法,该方法在微调之前使用模型的权重与要修补的任务进行微调后的权重。在零机夹的性能差的九个任务上,油漆可将精度提高15至60个百分点,同时将ImageNet上的精度保留在零拍模型的一个百分点之内。油漆还允许在多个任务上修补单个模型,并通过模型刻度进行改进。此外,我们确定了广泛转移的案例,即使任务不相交,对一个任务进行修补也会提高其他任务的准确性。最后,我们研究了超出常见基准的应用程序,例如计数或减少印刷攻击对剪辑的影响。我们的发现表明,可以扩展一组任务集,开放式摄影模型可实现高精度,而无需从头开始重新训练它们。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
已知生物制剂在他们的生活过程中学习许多不同的任务,并且能够重新审视以前的任务和行为,而没有表现不损失。相比之下,人工代理容易出于“灾难性遗忘”,在以前任务上的性能随着所获取的新的任务而恶化。最近使用该方法通过鼓励参数保持接近以前任务的方法来解决此缺点。这可以通过(i)使用特定的参数正常数来完成,该参数正常数是在参数空间中映射合适的目的地,或(ii)通过将渐变投影到不会干扰先前任务的子空间来指导优化旅程。然而,这些方法通常在前馈和经常性神经网络中表现出子分子表现,并且经常性网络对支持生物持续学习的神经动力学研究感兴趣。在这项工作中,我们提出了自然的持续学习(NCL),一种统一重量正则化和预测梯度下降的新方法。 NCL使用贝叶斯重量正常化来鼓励在收敛的所有任务上进行良好的性能,并将其与梯度投影结合使用先前的精度,这可以防止在优化期间陷入灾难性遗忘。当应用于前馈和经常性网络中的连续学习问题时,我们的方法占据了标准重量正则化技术和投影的方法。最后,训练有素的网络演变了特定于任务特定的动态,这些动态被认为是学习的新任务,类似于生物电路中的实验结果。
translated by 谷歌翻译
具有许多预训练模型(PTM)的模型中心已经是深度学习的基石。尽管以高成本建造,但它们仍然保持\ emph {探索}:从业人员通常会通过普及从提供的模型中心中选择一个PTM,然后对PTM进行微调以解决目标任务。这种na \“我的但共同的实践构成了两个障碍,以充分利用预训练的模型中心:(1)通过受欢迎程度选择的PTM选择没有最佳保证;(2)仅使用一个PTM,而其余的PTM则被忽略。理想情况下。理想情况下。 ,为了最大程度地利用预训练的模型枢纽,需要尝试所有PTM的所有组合和广泛的微调每个PTM组合,这会产生指数组合和不可偿还的计算预算。在本文中,我们提出了一种新的范围排名和调整预训练的模型:(1)我们的会议论文〜\ citep {you_logme:_2021}提出的logMe,以估算预先训练模型提取的标签证据的最大值,该标签证据可以在模型中排名所有PTMS用于各种类型的PTM和任务的枢纽\ Emph {微调之前}。(2)如果我们不偏爱模型的体系结构,则可以对排名最佳的PTM进行微调和部署,或者可以通过TOPE调整目标PTM -k通过t排名PTM他提出了b-tuning算法。排名部分基于会议论文,我们在本文中完成了其理论分析,包括启发式证据最大化程序的收敛证明和特征维度的影响。调整零件引入了一种用于调整多个PTM的新型贝叶斯调整(B-Tuning)方法,该方法超过了专门的方法,该方法旨在调整均匀的PTMS,并为调整异质PTMS设置了一种新的技术。利用PTM枢纽的新范式对于整个机器学习社区的大量受众来说可能会很有趣。
translated by 谷歌翻译
几乎没有射击的内在学习(ICL)使预训练的语言模型能够通过为输入的一部分提供少量的培训示例来执行以前的任务,而无需任何基于梯度的培训。 ICL会产生大量的计算,内存和存储成本,因为它每次进行预测时都涉及处理所有培训示例。参数有效的微调(PEFT)(例如,适配器模块,提示调谐,稀疏更新方法等)提供了替代范式,其中训练了一组少量参数以启用模型来执行新任务。在本文中,我们严格地比较了几个ICL和PEFT,并证明后者提供了更好的准确性,并大大降低了计算成本。在此过程中,我们引入了一种称为(IA)$^3 $的新PEFT方法,该方法通过学习的向量来扩展激活,从而获得更强的性能,同时仅引入相对少量的新参数。我们还提出了一个基于称为T-FEW的T0模型的简单食谱,可以将其应用于新任务,而无需特定于任务的调整或修改。我们通过将T-FEW应用于木筏基准,首次实现超人性能,并以6%的绝对性能优于最先进的方法来验证T-FEW对完全看不见的任务的有效性。我们实验中使用的所有代码均可公开使用。
translated by 谷歌翻译
执行零摄像推理时(即,在特定数据集上不进行微调)时,大型预训练的模型(例如剪辑或ALIGN)在一系列数据分布中提供一致的精度。尽管现有的微调方法显着提高了给定目标分布的准确性,但它们通常会降低分配变化的稳健性。我们通过引入一种简单有效的方法来提高鲁棒性,同时进行微调:结合零拍和微调模型(Wise-ft)的重量。与标准的微调相比,Wise-FT在分配变化下提供了巨大的准确性提高,同时保留了目标分布的高精度。在Imagenet和五个派生的分布变化上,Wise-FT在先前的工作中提高了分布转移的准确性4至6个百分点(PP),同时将Imagenet精度提高1.6pp。Wise-ft的稳健性相似(2至23 pp),明智之前与七个常用的转移学习数据集的标准微调相比,在一组进一步的分配转移的各种集合中,准确性增长率为0.8至3.3 pp。这些改进在微调或推理期间没有任何额外的计算成本。
translated by 谷歌翻译
机器学习中的终身学习范式是一个有吸引力的替代方案,不仅是由于其与生物学学习的相似之处,而且它通过避免过度模型重新训练来减少能量浪费的可能性。对此范式的关键挑战是灾难性遗忘的现象。随着在机器学习中训练有素的模型的越来越受欢迎和成功,我们提出了问题:终身学习中的训练前比赛,特别是关于灾难性的遗忘?我们在大型预先训练模型的上下文中调查现有方法,并在各种文本和图像分类任务中评估其性能,包括使用15个不同的NLP任务的新型数据集进行大规模研究。在所有设置中,我们观察到,通用预训练隐含地减轻了在与随机初始化模型相比依次学习多个任务时灾难性忘记的影响。然后,我们进一步调查为什么预先训练缓解在这个环境中忘记。我们通过分析损失景观来研究这种现象,发现预先训练的重量似乎可以通过导致更宽的最小值来缓解遗忘。基于这一洞察力,我们提出了对当前任务损失和损失盆地锐利的共同优化,以便在连续微调期间明确鼓励更广泛的盆地。我们表明,这种优化方法导致与跨多个设置的任务顺序持续学习的性能相当,而无需保留具有任务数量的大小的内存。
translated by 谷歌翻译
Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes in parallel each specialized model on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
translated by 谷歌翻译
We introduce a conceptually simple and scalable framework for continual learning domains where tasks are learned sequentially. Our method is constant in the number of parameters and is designed to preserve performance on previously encountered tasks while accelerating learning progress on subsequent problems. This is achieved by training a network with two components: A knowledge base, capable of solving previously encountered problems, which is connected to an active column that is employed to efficiently learn the current task. After learning a new task, the active column is distilled into the knowledge base, taking care to protect any previously acquired skills. This cycle of active learning (progression) followed by consolidation (compression) requires no architecture growth, no access to or storing of previous data or tasks, and no task-specific parameters. We demonstrate the progress & compress approach on sequential classification of handwritten alphabets as well as two reinforcement learning domains: Atari games and 3D maze navigation.
translated by 谷歌翻译
我们为大规模训练的大规模训练语言模型提供了更简单,更稀疏,更快的算法,这些算法在许多标准的NLP任务上实现了最新的隐私与实用性权衡。我们为此问题提出了一个元框架,这是受高度参数效率方法进行微调成功的启发。我们的实验表明,这些方法的差异化适应能力在三个重要方面优于以前的私人算法:实用程序,隐私以及私人培训的计算和记忆成本。在许多经常研究的数据集中,私人模型的实用性接近了非私人模型的方法。例如,在MNLI数据集上,我们使用Roberta-large的准确度为87.8 \%$,使用Roberta-Base $ 83.5 \%$,其隐私预算为$ \ Epsilon = 6.7 $。相比之下,缺乏隐私限制,罗伯塔·莱格(Roberta-Large)的准确度为$ 90.2 \%$。我们的发现对于自然语言生成任务类似。与DART,GPT-2-SMALL,GPT-2中,GPT-2-MEDIUM,GPT-2-LARGE和GPT-2-XL的私人微调达到38.5、42.0、43.1和43.8($ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 43.8) epsilon = 6.8,\ delta = $ 1E-5),而非私人基线为$ 48.1 $。我们所有的实验都表明,较大的模型更适合私人微调:虽然众所周知,它们旨在非优先实现卓越的准确性,但我们发现当引入隐私时,它们也更好地保持其准确性。
translated by 谷歌翻译
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
translated by 谷歌翻译
Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, the decomposition method aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Further, we designed a metric to predict when the SVD may introduce a significant performance drop, for which our method can be a rescue strategy. The extensive evaluations demonstrate that our method can perform better than current SOTA methods in compressing Transformer-based language models.
translated by 谷歌翻译
Incremental learning (IL) has received a lot of attention recently, however, the literature lacks a precise problem definition, proper evaluation settings, and metrics tailored specifically for the IL problem. One of the main objectives of this work is to fill these gaps so as to provide a common ground for better understanding of IL. The main challenge for an IL algorithm is to update the classifier whilst preserving existing knowledge. We observe that, in addition to forgetting, a known issue while preserving knowledge, IL also suffers from a problem we call intransigence, inability of a model to update its knowledge. We introduce two metrics to quantify forgetting and intransigence that allow us to understand, analyse, and gain better insights into the behaviour of IL algorithms. We present RWalk, a generalization of EWC++ (our efficient version of EWC [7]) and Path Integral [26] with a theoretically grounded KL-divergence based perspective. We provide a thorough analysis of various IL algorithms on MNIST and CIFAR-100 datasets. In these experiments, RWalk obtains superior results in terms of accuracy, and also provides a better trade-off between forgetting and intransigence.
translated by 谷歌翻译
While deep learning has led to remarkable advances across diverse applications, it struggles in domains where the data distribution changes over the course of learning. In stark contrast, biological neural networks continually adapt to changing domains, possibly by leveraging complex molecular machinery to solve many tasks simultaneously. In this study, we introduce intelligent synapses that bring some of this biological complexity into artificial neural networks. Each synapse accumulates task relevant information over time, and exploits this information to rapidly store new memories without forgetting old ones. We evaluate our approach on continual learning of classification tasks, and show that it dramatically reduces forgetting while maintaining computational efficiency.
translated by 谷歌翻译
基于变压器的NLP模型是使用数亿甚至数十亿个参数训练的,从而限制了其在计算受限环境中的适用性。尽管参数的数量通常与性能相关,但尚不清楚下游任务是否需要整个网络。在最新的修剪和提炼预培训模型的工作中,我们探索了在预训练模型中放下层的策略,并观察修剪对下游胶水任务的影响。我们能够修剪Bert,Roberta和XLNet型号高达40%,同时保持其原始性能的98%。此外,我们证明,在大小和性能方面,您的修剪模型与使用知识蒸馏的型号相提并论。我们的实验产生有趣的观察结果,例如(i)下层对于维持下游任务性能最重要,(ii)某些任务(例如释义检测和句子相似性)对于降低层的降低和(iii)经过训练的模型更强大。使用不同的目标函数表现出不同的学习模式,并且层掉落。
translated by 谷歌翻译