Adapter Tuning, which freezes the pretrained language models (PLMs) and only fine-tunes a few extra modules, becomes an appealing efficient alternative to the full model fine-tuning. Although computationally efficient, the recent Adapters often increase parameters (e.g. bottleneck dimension) for matching the performance of full model fine-tuning, which we argue goes against their original intention. In this work, we re-examine the parameter-efficiency of Adapters through the lens of network pruning (we name such plug-in concept as \texttt{SparseAdapter}) and find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80\%. Based on our findings, we introduce an easy but effective setting ``\textit{Large-Sparse}'' to improve the model capacity of Adapters under the same parameter budget. Experiments on five competitive Adapters upon three advanced PLMs show that with proper sparse method (e.g. SNIP) and ratio (e.g. 40\%) SparseAdapter can consistently outperform their corresponding counterpart. Encouragingly, with the \textit{Large-Sparse} setting, we can obtain further appealing gains, even outperforming the full fine-tuning by a large margin. Our code will be released at: https://github.com/Shwai-He/SparseAdapter.
translated by 谷歌翻译
具有数百万参数的基于变压器的预训练模型需要大量存储。最近的方法通过培训适配器解决了这一缺点,但是这些方法仍然需要相对较大的参数。在这项研究中,提出了一种令人惊讶的简单但有效的适配器体系结构的Adapterbias。AdapterBias向变压器层的隐藏输出添加了代币依赖性转移,以适应仅使用向量和线性层的下游任务。进行了广泛的实验,以证明适配性的有效性。实验表明,与先前的作品相比,我们提出的方法可以大大减少可训练的参数,而任务性能与微调的预训练模型相比最小。我们进一步发现,适应性比亚斯自动学习以将更重要的表示形式分配给与任务相关的代币转移。
translated by 谷歌翻译
微调下游任务的大型预训练语言模型已成为NLP中的事实上学习范式。然而,常规方法微调预先训练模型的所有参数,这变得越来越稳定,因为模型尺寸和增长的任务数量。最近的工作提出了各种参数有效的转移学习方法,只需微调少数(额外)参数以获得强大的性能。虽然有效,但各种方法中的成功和联系的关键成分尚不清楚。在本文中,我们分解了最先进的参数有效的传输学习方法的设计,并提出了一个在它们之间建立连接的统一框架。具体而言,我们将它们重新框架作为预先训练的模型对特定隐藏状态的修改,并定义了一组设计尺寸,不同的方法变化,例如计算修改的功能和应用修改的位置。通过跨机翻译的全面实证研究,文本摘要,语言理解和文本分类基准,我们利用统一的视图来确定以前的方法中的重要设计选择。此外,我们的统一框架使得能够在不同的方法中传输设计元素,因此我们能够实例化新的参数高效的微调方法,该方法比以前的方法更加有效,而是更有效,实现可比的结果在所有四个任务上调整所有参数。
translated by 谷歌翻译
巨大的预训练模型已成为自然语言处理(NLP)的核心,它是针对一系列下游任务进行微调的起点。然而,此范式的两个疼痛点持续:(a)随着预训练的模型的增长越大(例如,GPT-3的175b参数),即使是微调过程也可能是耗时的,并且计算昂贵; (b)默认情况下,微调模型的大小与起点相同,由于其更专业的功能,这既不明智,也不是实际的,因为许多微调模型将部署在资源受限的环境中。为了解决这些疼痛点,我们通过在重量更新和最终模型权重中利用稀疏性来提出一个用于资源和参数有效的微调的框架。我们提出的框架被称为双重稀疏性的有效调整(DSEE),旨在实现两个关键目标:(i)参数有效的微调 - 通过在预训练的权重的顶部强制实施稀疏性的低级更新; (ii)资源有效的推论 - 通过鼓励对最终微调模型的稀疏重量结构。我们通过统一的方法在预训练的语言模型中利用非结构化和结构化的稀疏模式来利用这两个方向的稀疏性。广泛的实验和深入研究,对数十个数据集进行了不同的网络骨干(即Bert,Roberta和GPT-2),始终显示出令人印象深刻的参数 - /推理效率,同时保持竞争性下游性能。例如,DSEE在达到可比性能的同时节省了约25%的推理拖失lo,在BERT上具有0.5%的可训练参数。代码可在https://github.com/vita-group/dsee中找到。
translated by 谷歌翻译
Conventional fine-tuning encounters increasing difficulties given the size of current Pre-trained Language Models, which makes parameter-efficient tuning become the focal point of frontier research. Previous methods in this field add tunable adapters into MHA or/and FFN of Transformer blocks to enable PLMs achieve transferability. However, as an important part of Transformer architecture, the power of layer normalization for parameter-efficent tuning is ignored. In this paper, we first propose LN-tuning, by tuning the gain and bias term of Layer Normalization module with only 0.03\% parameters, which is of high time-efficency and significantly superior to baselines which are less than 0.1\% tunable parameters. Further, we study the unified framework of combining LN-tuning with previous ones and we find that: (1) the unified framework of combining prefix-tuning, the adapter-based method working on MHA, and LN-tuning achieves SOTA performance. (2) unified framework which tunes MHA and LayerNorm simultaneously can get performance improvement but those which tune FFN and LayerNorm simultaneous will cause performance decrease. Ablation study validates LN-tuning is of no abundant parameters and gives a further understanding of it.
translated by 谷歌翻译
Dynamic networks have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert given static layers into fully dynamic ones where all parameters are dynamic and vary with the input. Recent studies empirically show the trend that the more dynamic layers contribute to ever-increasing performance. However, such a fully dynamic setting 1) may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models, and more importantly, 2) contradicts the previous discovery in the human brain that \textit{when human brains process an attention-demanding task, only partial neurons in the task-specific areas are activated by the input, while the rest neurons leave in a baseline state.} Critically, there is no effort to understand and resolve the above contradictory finding, leaving the primal question -- to make the computational parameters fully dynamic or not? -- unanswered. The main contributions of our work are challenging the basic commonsense in dynamic networks, and, proposing and validating the \textsc{cherry hypothesis} -- \textit{A fully dynamic network contains a subset of dynamic parameters that when transforming other dynamic parameters into static ones, can maintain or even exceed the performance of the original network.} Technically, we propose a brain-inspired partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. Also, we further design Iterative Mode Partition to partition the dynamic- and static-subnet, which alleviates the redundancy in traditional fully dynamic networks. Our hypothesis and method are comprehensively supported by large-scale experiments with typical advanced dynamic methods.
translated by 谷歌翻译
最近的参数效率语言模型调整(PELT)方法可以使微调的性能与较少的可训练参数相匹配,并且在训练数据受到限制时尤其表现良好。但是,不同的PELT方法在相同的任务上的性能可能会有所不同,因此为特定任务选择最合适的方法是不平凡的,尤其是考虑到快速增长的新PELT方法和任务。鉴于模型多样性和模型选择的难度,我们提出了一个统一的框架Unipelt,该框架将不同的毛皮方法纳入了子模型,并学会了激活最适合当前数据或通过门控机制设置的方法。在胶水基准上,与最佳的单个毛皮方法相比,UniPelt始终达到1〜4%的增长,而其融合甚至超过了不同设置下的微调。此外,UniPelt通常超过上限,该上限在每个任务上单独使用的所有子模型的最佳性能,表明多种PELT方法的混合物可能本质上比单个方法更有效。
translated by 谷歌翻译
通过微调将大规模的预训练语言模型适应下游任务是实现NLP基准测试最先进性能的标准方法。然而,微调具有数百万或数十亿个参数的所有重量模型是对低资源设置中不稳定的采样低效,并且浪费,因为它需要为每个任务存储模型的单独副本。最近的工作已经开发了参数高效的微调方法,但这些方法仍然需要相对大量的参数或表现不足标准微调。在这项工作中,我们提出了一种特殊调整大型语言模型的方法,其在任务性能和比率参数之间具有更好的权衡的方法,而不是比上事先工作。 Compacter通过构建适配器,低级优化和参数化超复分乘法层的思想之上来实现这一目标。具体地,Compacter将特定于特定的权重矩阵插入到预估计模型的权重中,这些权重被有效地计算为共享的“慢速”权重和“快速”等级 - 每个Compacter层定义的矩阵之间的矩阵产品的总和。仅通过培训0.047%的预磨料模型的参数,Compacter会在胶水上标准微调和胜过标准微调的标准微调和低资源设置。我们的代码在〜\ url {https://github.com/rabeehk/compacter}上公开使用。
translated by 谷歌翻译
最近在各种领域中采用了关于下游任务的大型预训练模型。但是,更新大型预训练模型的整个参数集是昂贵的。尽管最近提出的参数效率转移学习(PETL)技术允许在预先训练的骨干网络内更新一小部分参数(例如,仅使用2%的参数)用于新任务,但它们只能通过最多减少训练记忆要求30%。这是因为可训练参数的梯度计算仍然需要通过大型预训练的骨干模型反向传播。为了解决这个问题,我们提出了梯子侧调(LST),这是一种新的PETL技术,可将训练记忆要求减少更多。与现有的参数效率方法不同,将其他参数插入骨干网络中,我们训练梯子侧网络,梯子侧网络是一个小而独立的网络,将中间激活作为通过快速连接(梯子)从骨干网络中获得的输入作为输入,并进行预测。 LST的内存要求明显低于以前的方法,因为它不需要通过骨干网络反向传播,而是仅通过侧网和梯子连接。我们使用NLP(胶)和视觉语言(VQA,GQA,NLVR2,MSCOCO)任务上的各种模型(T5,CLIP-T5)进行评估。 LST节省了69%的内存成本来微调整个网络,而其他方法仅将其中的26%保存在相似的参数使用中(因此,更多的内存节省了2.7倍)。此外,LST在低内存状态下的适配器和洛拉的精度高。为了进一步显示这种更好的记忆效率的优势,我们还将LST应用于较大的T5型号(T5-Large,T5-3B),比完整的微调和其他PETL方法获得更好的胶水性能。我们对VL任务的实验也完全相同。
translated by 谷歌翻译
Recently, a large number of tuning strategies have been proposed to adapt pre-trained language models to downstream tasks. In this paper, we perform an extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization. Specifically, we explore the relative advantages of three families of multilingual tuning strategies (a total of five models) and empirically evaluate them for summarization over 45 languages. Experimentally, we not only established a new state-of-the-art on the XL-Sum dataset but also derive a series of observations that hopefully can provide hints for future research on the design of multilingual tuning strategies.
translated by 谷歌翻译
我们为大规模训练的大规模训练语言模型提供了更简单,更稀疏,更快的算法,这些算法在许多标准的NLP任务上实现了最新的隐私与实用性权衡。我们为此问题提出了一个元框架,这是受高度参数效率方法进行微调成功的启发。我们的实验表明,这些方法的差异化适应能力在三个重要方面优于以前的私人算法:实用程序,隐私以及私人培训的计算和记忆成本。在许多经常研究的数据集中,私人模型的实用性接近了非私人模型的方法。例如,在MNLI数据集上,我们使用Roberta-large的准确度为87.8 \%$,使用Roberta-Base $ 83.5 \%$,其隐私预算为$ \ Epsilon = 6.7 $。相比之下,缺乏隐私限制,罗伯塔·莱格(Roberta-Large)的准确度为$ 90.2 \%$。我们的发现对于自然语言生成任务类似。与DART,GPT-2-SMALL,GPT-2中,GPT-2-MEDIUM,GPT-2-LARGE和GPT-2-XL的私人微调达到38.5、42.0、43.1和43.8($ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 43.8) epsilon = 6.8,\ delta = $ 1E-5),而非私人基线为$ 48.1 $。我们所有的实验都表明,较大的模型更适合私人微调:虽然众所周知,它们旨在非优先实现卓越的准确性,但我们发现当引入隐私时,它们也更好地保持其准确性。
translated by 谷歌翻译
我们介绍了BitFit,这是一种稀疏的重点方法,其中仅修改了模型的偏差(或其中一个子集)。我们表明,通过在预训练的BERT模型上应用BITFIT的小型至中等训练数据具有竞争力(有时比)对整个模型进行微调。对于较大的数据,该方法与其他稀疏微调方法具有竞争力。除了它们的实际实用性外,这些发现与理解常用的填补过程的问题有关:它们支持以下假设:填充主要是关于揭示通过语言模型培训引起的知识,而不是学习新的任务特定的语言知识。
translated by 谷歌翻译
大型神经模型的培训和推断很昂贵。但是,对于许多应用程序域,虽然新任务和模型经常出现,但建模的基础文档主要保持不变。我们研究如何通过嵌入回收利用(ER)来降低此类设置的计算成本:在执行训练或推理时从以前的模型中重新使用激活。与以前的工作相反,重点是冻结小型分类头进行填充,这通常会导致绩效显着下降,我们提出了从预告片的模型中缓存中间层的输出,并为新任务的剩余层进行填充。我们表明,我们的方法在训练过程中提供了100%的速度和55-86%的推理,并且对科学领域中文本分类和实体识别任务的准确性产生了可观的影响。对于通用域的问答任务,ER提供了类似的加速和少量准确性。最后,我们确定了ER的几个开放挑战和未来的方向。
translated by 谷歌翻译
In computer vision, it has achieved great transfer learning performance via adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different efficient adaptation methods. We conduct an empirical study on each efficient model adaptation method focusing on its performance alongside parameter cost. Furthermore, we propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into subspace for further decomposition via a novel Kronecker Adaptation (KAdaptation) method. We analyze and compare our method with a diverse set of baseline model adaptation methods (including state-of-the-art methods for pretrained language models). Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across 20 image classification datasets under the few-shot setting and 7 image classification datasets under the full-shot setting.
translated by 谷歌翻译
预先接受的语言模型(PLMS)在预训练和微调范式下,在各种自然语言处理(NLP)任务中取得了巨大成功。具有大量参数,PLMS是计算密集型和资源饥饿的。因此,已经引入了模型修剪来压缩大规模的PLM。然而,大多数先前的方法只考虑对下游任务的任务特定知识,但忽略了修剪期间的基本任务无关知识,这可能导致灾难性的遗忘问题并导致普遍性较差。为了在我们的修剪模型中维护任务不可行的特定知识,我们提出了在预训练和微调范式下的对比修剪(盖子)。它设计为一​​般框架,与结构化和非结构化修剪兼容。统一的对比学习,CAP使修剪模型能够从预训练的模型中学到任务无关的知识,以及特定于任务知识的微调模型。此外,为了更好地保留修剪模型的性能,快照(即,每个修剪迭代的中间模型)也是修剪的有效监督。我们广泛的实验表明,采用盖子一致地产生显着的改善,特别是在极高的稀疏性方案中。只有3%的型号参数保留(即97%的稀疏性),CAP成功达到了QQP和MNLI任务的原始BERT性能的99.2%和96.3%。此外,我们的探测实验表明,CAP修剪的模型趋于达到更好的泛化能力。
translated by 谷歌翻译
With increasing privacy concerns on data, recent studies have made significant progress using federated learning (FL) on privacy-sensitive natural language processing (NLP) tasks. Much literature suggests fully fine-tuning pre-trained language models (PLMs) in the FL paradigm can mitigate the data heterogeneity problem and close the performance gap with centralized training. However, large PLMs bring the curse of prohibitive communication overhead and local model adaptation costs for the FL system. To this end, we introduce various parameter-efficient tuning (PETuning) methods into federated learning. Specifically, we provide a holistic empirical study of representative PLMs tuning methods in FL. The experimental results cover the analysis of data heterogeneity levels, data scales, and different FL scenarios. Overall communication overhead can be significantly reduced by locally tuning and globally aggregating lightweight model parameters while maintaining acceptable performance in various FL settings. To facilitate the research of PETuning in FL, we also develop a federated tuning framework FedPETuning, which allows practitioners to exploit different PETuning methods under the FL training paradigm conveniently. The source code is available at \url{https://github.com/iezhuozhuo/FedETuning/tree/deltaTuning}.
translated by 谷歌翻译
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
translated by 谷歌翻译
Recent works on Lottery Ticket Hypothesis have shown that pre-trained language models (PLMs) contain smaller matching subnetworks(winning tickets) which are capable of reaching accuracy comparable to the original models. However, these tickets are proved to be notrobust to adversarial examples, and even worse than their PLM counterparts. To address this problem, we propose a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs. Since the loss is not differentiable for the binary mask, we assign the hard concrete distribution to the masks and encourage their sparsity using a smoothing approximation of L0 regularization.Furthermore, we design an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well bothin accuracy and robustness. Experimental results show the significant improvement of the proposed method over previous work on adversarial robustness evaluation.
translated by 谷歌翻译
基于变压器的语言模型应用于自然语言处理的广泛应用程序。但是,它们效率低,难以部署。近年来,已经提出了许多压缩算法来提高目标硬件上大型变压器的模型的实现效率。在这项工作中,我们通过整合体重修剪和模型蒸馏来提出一种训练稀疏预训练的变压器语言模型的新方法。这些稀疏的预训练型号可用于在维护稀疏模式的同时传输广泛的任务。我们展示了我们有三个已知的架构的方法,以创建稀疏的预训练伯特基,BERT-MAT​​RY和DISTOLBERT。我们展示了压缩稀疏的预训练模型如何培训他们的知识,以最小的精度损失将他们的知识转移到五种不同的下游自然语言任务。此外,我们展示了如何使用量化感知培训进一步将稀疏模型的重量压缩为8位精度。例如,在SQUAdv1.1上使用我们稀疏预训练的BERT频率,并量化为8位,我们为编码器达到40美元的压缩比,而不是1 \%$精度损失。据我们所知,我们的结果表明Bert-Base,Bert-Light和Distilbert的最佳压缩至准确率。
translated by 谷歌翻译
Language models with the Transformers structure have shown great performance in natural language processing. However, there still poses problems when fine-tuning pre-trained language models on downstream tasks, such as over-fitting or representation collapse. In this work, we propose HyPe, a simple yet effective fine-tuning technique to alleviate such problems by perturbing hidden representations of Transformers layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformers layers convey more diverse and meaningful language information. Therefore, making the Transformers layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances generalization of hidden representations from different layers. In addition, HyPe acquires negligible computational overheads, and is better than and compatible with previous state-of-the-art fine-tuning techniques.
translated by 谷歌翻译