具有数百万参数的基于变压器的预训练模型需要大量存储。最近的方法通过培训适配器解决了这一缺点,但是这些方法仍然需要相对较大的参数。在这项研究中,提出了一种令人惊讶的简单但有效的适配器体系结构的Adapterbias。AdapterBias向变压器层的隐藏输出添加了代币依赖性转移,以适应仅使用向量和线性层的下游任务。进行了广泛的实验,以证明适配性的有效性。实验表明,与先前的作品相比,我们提出的方法可以大大减少可训练的参数,而任务性能与微调的预训练模型相比最小。我们进一步发现,适应性比亚斯自动学习以将更重要的表示形式分配给与任务相关的代币转移。
translated by 谷歌翻译
我们介绍了BitFit,这是一种稀疏的重点方法,其中仅修改了模型的偏差(或其中一个子集)。我们表明,通过在预训练的BERT模型上应用BITFIT的小型至中等训练数据具有竞争力(有时比)对整个模型进行微调。对于较大的数据,该方法与其他稀疏微调方法具有竞争力。除了它们的实际实用性外,这些发现与理解常用的填补过程的问题有关:它们支持以下假设:填充主要是关于揭示通过语言模型培训引起的知识,而不是学习新的任务特定的语言知识。
translated by 谷歌翻译
Adapter Tuning, which freezes the pretrained language models (PLMs) and only fine-tunes a few extra modules, becomes an appealing efficient alternative to the full model fine-tuning. Although computationally efficient, the recent Adapters often increase parameters (e.g. bottleneck dimension) for matching the performance of full model fine-tuning, which we argue goes against their original intention. In this work, we re-examine the parameter-efficiency of Adapters through the lens of network pruning (we name such plug-in concept as \texttt{SparseAdapter}) and find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80\%. Based on our findings, we introduce an easy but effective setting ``\textit{Large-Sparse}'' to improve the model capacity of Adapters under the same parameter budget. Experiments on five competitive Adapters upon three advanced PLMs show that with proper sparse method (e.g. SNIP) and ratio (e.g. 40\%) SparseAdapter can consistently outperform their corresponding counterpart. Encouragingly, with the \textit{Large-Sparse} setting, we can obtain further appealing gains, even outperforming the full fine-tuning by a large margin. Our code will be released at: https://github.com/Shwai-He/SparseAdapter.
translated by 谷歌翻译
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
translated by 谷歌翻译
Conventional fine-tuning encounters increasing difficulties given the size of current Pre-trained Language Models, which makes parameter-efficient tuning become the focal point of frontier research. Previous methods in this field add tunable adapters into MHA or/and FFN of Transformer blocks to enable PLMs achieve transferability. However, as an important part of Transformer architecture, the power of layer normalization for parameter-efficent tuning is ignored. In this paper, we first propose LN-tuning, by tuning the gain and bias term of Layer Normalization module with only 0.03\% parameters, which is of high time-efficency and significantly superior to baselines which are less than 0.1\% tunable parameters. Further, we study the unified framework of combining LN-tuning with previous ones and we find that: (1) the unified framework of combining prefix-tuning, the adapter-based method working on MHA, and LN-tuning achieves SOTA performance. (2) unified framework which tunes MHA and LayerNorm simultaneously can get performance improvement but those which tune FFN and LayerNorm simultaneous will cause performance decrease. Ablation study validates LN-tuning is of no abundant parameters and gives a further understanding of it.
translated by 谷歌翻译
通过微调将大规模的预训练语言模型适应下游任务是实现NLP基准测试最先进性能的标准方法。然而,微调具有数百万或数十亿个参数的所有重量模型是对低资源设置中不稳定的采样低效,并且浪费,因为它需要为每个任务存储模型的单独副本。最近的工作已经开发了参数高效的微调方法,但这些方法仍然需要相对大量的参数或表现不足标准微调。在这项工作中,我们提出了一种特殊调整大型语言模型的方法,其在任务性能和比率参数之间具有更好的权衡的方法,而不是比上事先工作。 Compacter通过构建适配器,低级优化和参数化超复分乘法层的思想之上来实现这一目标。具体地,Compacter将特定于特定的权重矩阵插入到预估计模型的权重中,这些权重被有效地计算为共享的“慢速”权重和“快速”等级 - 每个Compacter层定义的矩阵之间的矩阵产品的总和。仅通过培训0.047%的预磨料模型的参数,Compacter会在胶水上标准微调和胜过标准微调的标准微调和低资源设置。我们的代码在〜\ url {https://github.com/rabeehk/compacter}上公开使用。
translated by 谷歌翻译
Parameter-efficient methods (like Prompt or Adapters) for adapting pre-trained language models to downstream tasks have been popular recently. However, hindrances still prevent these methods from reaching their full potential. For example, two significant challenges are few-shot adaptation and cross-task generalization ability. To tackle these issues, we propose a general framework to enhance the few-shot adaptation and cross-domain generalization ability of parameter-efficient methods. In our framework, we prime the self-supervised model for parameter-efficient methods to rapidly adapt to various downstream few-shot tasks. To evaluate the authentic generalization ability of these parameter-efficient methods, we conduct experiments on a few-shot cross-domain benchmark containing 160 diverse NLP tasks. The experiment result reveals that priming by tuning PLM only with extra training tasks leads to the best performance. Also, we perform a comprehensive analysis of various parameter-efficient methods under few-shot cross-domain scenarios.
translated by 谷歌翻译
大型审慎的语言模型(PLM)通常是通过微调或提示来适应域或任务的。填充需要修改所有参数,并具有足够的数据以避免过度拟合,同时提示不需要培训,也不需要示例,而是限制性能。取而代之的是,我们通过学习学习一般和适应性PLM之间的差异来为数据和参数有效适应。通过我们提出的动态低级别重新聚体和学识渊博的体系结构控制器,通过模型权重和子层结构来表示这种差异。实验对话完成,低资源抽象摘要以及多域语言建模的实验显示了通过域自适应预处理进行适应时间和性能的改善。消融表明我们的任务自适应重新聚体化(TARP)和模型搜索(TAMS)组件分别改进了其他参数效率转移(如适配器和结构学习方法),例如学习的稀疏。
translated by 谷歌翻译
最近的参数效率语言模型调整(PELT)方法可以使微调的性能与较少的可训练参数相匹配,并且在训练数据受到限制时尤其表现良好。但是,不同的PELT方法在相同的任务上的性能可能会有所不同,因此为特定任务选择最合适的方法是不平凡的,尤其是考虑到快速增长的新PELT方法和任务。鉴于模型多样性和模型选择的难度,我们提出了一个统一的框架Unipelt,该框架将不同的毛皮方法纳入了子模型,并学会了激活最适合当前数据或通过门控机制设置的方法。在胶水基准上,与最佳的单个毛皮方法相比,UniPelt始终达到1〜4%的增长,而其融合甚至超过了不同设置下的微调。此外,UniPelt通常超过上限,该上限在每个任务上单独使用的所有子模型的最佳性能,表明多种PELT方法的混合物可能本质上比单个方法更有效。
translated by 谷歌翻译
巨大的预训练模型已成为自然语言处理(NLP)的核心,它是针对一系列下游任务进行微调的起点。然而,此范式的两个疼痛点持续:(a)随着预训练的模型的增长越大(例如,GPT-3的175b参数),即使是微调过程也可能是耗时的,并且计算昂贵; (b)默认情况下,微调模型的大小与起点相同,由于其更专业的功能,这既不明智,也不是实际的,因为许多微调模型将部署在资源受限的环境中。为了解决这些疼痛点,我们通过在重量更新和最终模型权重中利用稀疏性来提出一个用于资源和参数有效的微调的框架。我们提出的框架被称为双重稀疏性的有效调整(DSEE),旨在实现两个关键目标:(i)参数有效的微调 - 通过在预训练的权重的顶部强制实施稀疏性的低级更新; (ii)资源有效的推论 - 通过鼓励对最终微调模型的稀疏重量结构。我们通过统一的方法在预训练的语言模型中利用非结构化和结构化的稀疏模式来利用这两个方向的稀疏性。广泛的实验和深入研究,对数十个数据集进行了不同的网络骨干(即Bert,Roberta和GPT-2),始终显示出令人印象深刻的参数 - /推理效率,同时保持竞争性下游性能。例如,DSEE在达到可比性能的同时节省了约25%的推理拖失lo,在BERT上具有0.5%的可训练参数。代码可在https://github.com/vita-group/dsee中找到。
translated by 谷歌翻译
大型神经模型的培训和推断很昂贵。但是,对于许多应用程序域,虽然新任务和模型经常出现,但建模的基础文档主要保持不变。我们研究如何通过嵌入回收利用(ER)来降低此类设置的计算成本:在执行训练或推理时从以前的模型中重新使用激活。与以前的工作相反,重点是冻结小型分类头进行填充,这通常会导致绩效显着下降,我们提出了从预告片的模型中缓存中间层的输出,并为新任务的剩余层进行填充。我们表明,我们的方法在训练过程中提供了100%的速度和55-86%的推理,并且对科学领域中文本分类和实体识别任务的准确性产生了可观的影响。对于通用域的问答任务,ER提供了类似的加速和少量准确性。最后,我们确定了ER的几个开放挑战和未来的方向。
translated by 谷歌翻译
Language models with the Transformers structure have shown great performance in natural language processing. However, there still poses problems when fine-tuning pre-trained language models on downstream tasks, such as over-fitting or representation collapse. In this work, we propose HyPe, a simple yet effective fine-tuning technique to alleviate such problems by perturbing hidden representations of Transformers layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformers layers convey more diverse and meaningful language information. Therefore, making the Transformers layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances generalization of hidden representations from different layers. In addition, HyPe acquires negligible computational overheads, and is better than and compatible with previous state-of-the-art fine-tuning techniques.
translated by 谷歌翻译
This work introduces a new multi-task, parameter-efficient language model (LM) tuning method that learns to transfer knowledge across different tasks via a mixture of soft prompts-small prefix embedding vectors pre-trained for different tasks. Our method, called ATTEMPT (ATTEntional Mixtures of Prompt Tuning), obtains source prompts as encodings of large-scale source tasks into a small number of parameters and trains an attention module to interpolate the source prompts and a newly initialized target prompt for every instance in the target task. During training, only the target task prompt and the attention weights, which are shared between tasks in multi-task training, are updated, while the original LM and source prompts are intact. ATTEMPT is highly parameter-efficient (e.g., updates 2,300 times fewer parameters than full fine-tuning) while achieving high task performance using knowledge from high-resource tasks. Moreover, it is modular using pre-trained soft prompts, and can flexibly add or remove source prompts for effective knowledge transfer. Our experimental results across 21 diverse NLP datasets show that ATTEMPT significantly outperforms prompt tuning and outperforms or matches fully fine-tuned or other parameter-efficient tuning approaches that use over ten times more parameters. Finally, ATTEMPT outperforms previous work in few-shot learning settings.
translated by 谷歌翻译
我们为大规模训练的大规模训练语言模型提供了更简单,更稀疏,更快的算法,这些算法在许多标准的NLP任务上实现了最新的隐私与实用性权衡。我们为此问题提出了一个元框架,这是受高度参数效率方法进行微调成功的启发。我们的实验表明,这些方法的差异化适应能力在三个重要方面优于以前的私人算法:实用程序,隐私以及私人培训的计算和记忆成本。在许多经常研究的数据集中,私人模型的实用性接近了非私人模型的方法。例如,在MNLI数据集上,我们使用Roberta-large的准确度为87.8 \%$,使用Roberta-Base $ 83.5 \%$,其隐私预算为$ \ Epsilon = 6.7 $。相比之下,缺乏隐私限制,罗伯塔·莱格(Roberta-Large)的准确度为$ 90.2 \%$。我们的发现对于自然语言生成任务类似。与DART,GPT-2-SMALL,GPT-2中,GPT-2-MEDIUM,GPT-2-LARGE和GPT-2-XL的私人微调达到38.5、42.0、43.1和43.8($ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 43.8) epsilon = 6.8,\ delta = $ 1E-5),而非私人基线为$ 48.1 $。我们所有的实验都表明,较大的模型更适合私人微调:虽然众所周知,它们旨在非优先实现卓越的准确性,但我们发现当引入隐私时,它们也更好地保持其准确性。
translated by 谷歌翻译
微调下游任务的大型预训练语言模型已成为NLP中的事实上学习范式。然而,常规方法微调预先训练模型的所有参数,这变得越来越稳定,因为模型尺寸和增长的任务数量。最近的工作提出了各种参数有效的转移学习方法,只需微调少数(额外)参数以获得强大的性能。虽然有效,但各种方法中的成功和联系的关键成分尚不清楚。在本文中,我们分解了最先进的参数有效的传输学习方法的设计,并提出了一个在它们之间建立连接的统一框架。具体而言,我们将它们重新框架作为预先训练的模型对特定隐藏状态的修改,并定义了一组设计尺寸,不同的方法变化,例如计算修改的功能和应用修改的位置。通过跨机翻译的全面实证研究,文本摘要,语言理解和文本分类基准,我们利用统一的视图来确定以前的方法中的重要设计选择。此外,我们的统一框架使得能够在不同的方法中传输设计元素,因此我们能够实例化新的参数高效的微调方法,该方法比以前的方法更加有效,而是更有效,实现可比的结果在所有四个任务上调整所有参数。
translated by 谷歌翻译
及时调整是以参数有效的方式对预训练的预训练语言模型的新范式。在这里,我们探讨了超级核武器的使用来产生超预价:我们提出了HyperPrompt,这是一种用于迅速基于变形金刚自我注意的任务调节的新型体系结构。超预要是通过超网络通过一代人来学习的端到端。 HyperPrompt允许网络学习特定于任务的功能地图,其中超预告是要参与的查询的任务全局记忆,同时启用了任务之间的灵活信息共享。我们表明,HyperPrompt与强大的多任务学习基线具有竞争力,其额外的任务条件参数的$ 0.14 \%$ $ \%,实现了出色的参数和计算效率。通过广泛的经验实验,我们证明,超级启示可以比强大的T5多任务学习基准和参数效率高效的适配器变体获得卓越的性能,包括及时调整和SuplyFormer ++在许多模型尺寸的自然语言理解胶水和SuperGrue的基准上。
translated by 谷歌翻译
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer based models such as HuBERT, which consist a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced applying adapters, which are small lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layer, but failed to perform adaptation at the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech model, by applying CNN adapters at the feature extractor. Using this method, we can only fine-tune fewer than 5% of parameters per task compared to fully fine-tuning and achieve better and more stable performance. We empirically found that adding CNN adapters to the feature extractor can help the adaptation on emotion and speaker tasks. For instance, the accuracy of SID is improved from 87.71 to 91.56, and the accuracy of ER is improved by 5%.
translated by 谷歌翻译
最近在各种领域中采用了关于下游任务的大型预训练模型。但是,更新大型预训练模型的整个参数集是昂贵的。尽管最近提出的参数效率转移学习(PETL)技术允许在预先训练的骨干网络内更新一小部分参数(例如,仅使用2%的参数)用于新任务,但它们只能通过最多减少训练记忆要求30%。这是因为可训练参数的梯度计算仍然需要通过大型预训练的骨干模型反向传播。为了解决这个问题,我们提出了梯子侧调(LST),这是一种新的PETL技术,可将训练记忆要求减少更多。与现有的参数效率方法不同,将其他参数插入骨干网络中,我们训练梯子侧网络,梯子侧网络是一个小而独立的网络,将中间激活作为通过快速连接(梯子)从骨干网络中获得的输入作为输入,并进行预测。 LST的内存要求明显低于以前的方法,因为它不需要通过骨干网络反向传播,而是仅通过侧网和梯子连接。我们使用NLP(胶)和视觉语言(VQA,GQA,NLVR2,MSCOCO)任务上的各种模型(T5,CLIP-T5)进行评估。 LST节省了69%的内存成本来微调整个网络,而其他方法仅将其中的26%保存在相似的参数使用中(因此,更多的内存节省了2.7倍)。此外,LST在低内存状态下的适配器和洛拉的精度高。为了进一步显示这种更好的记忆效率的优势,我们还将LST应用于较大的T5型号(T5-Large,T5-3B),比完整的微调和其他PETL方法获得更好的胶水性能。我们对VL任务的实验也完全相同。
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs with domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the previous general knowledge acquired by general PLMs, which leads to a catastrophic forgetting phenomenon and sub-optimal performance. To alleviate this problem, we propose a new framework of General Memory Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM by a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer, and based on it, different augmented strategies are explored to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds (text classification, QA, NER) of tasks, and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks.
translated by 谷歌翻译