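Prompt-tuning, which freezes a pretrained language model (PLM) and fine-tunes only a few parameters of an additional soft prompt, shows competitive performance against full-parameter fine-tuning (i.e., model-tuning) when the PLM has billions of parameters, but still performs poorly with smaller PLMs. Hence, prompt transfer (PoT), which initializes the target prompt with the trained prompt of a similar source task, was recently proposed to improve on prompt-tuning. However, such a vanilla PoT approach usually achieves sub-optimal performance, because (i) PoT is sensitive to the similarity of the source-target pair and (ii) directly fine-tuning a prompt initialized with the source prompt on the target task may lead to catastrophic forgetting of source knowledge. To address these problems, we propose a new metric to accurately predict prompt transferability (regarding (i)), and a novel PoT approach (namely PANDA) that leverages knowledge distillation to transfer the "knowledge" from the source prompt to the target prompt in a subtle manner and effectively alleviates catastrophic forgetting (regarding (ii)). Furthermore, to achieve adaptive prompt transfer for each source-target pair, we use our metric to control the knowledge transfer in our PANDA approach. Extensive and systematic experiments on 189 combinations of 21 source and 9 target datasets across 5 scales of PLMs demonstrate that: 1) our proposed metric predicts prompt transferability well; 2) our PANDA consistently outperforms vanilla PoT by 2.3% in average score (up to 24.1%) across all tasks and model sizes; 3) with our PANDA approach, prompt-tuning can achieve competitive and even better performance than model-tuning across various PLM scales. Code and models will be released upon acceptance.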
translated by 谷歌翻译
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
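Prompt tuning (PT) is a promising parameter-efficient method for utilizing extremely large pre-trained language models (PLMs); it can achieve performance comparable to full-parameter fine-tuning by tuning only a few soft prompts. However, compared with fine-tuning, PT empirically requires many more training steps. To explore whether we can improve the efficiency of PT by reusing trained soft prompts and sharing learned knowledge, we empirically investigate the transferability of soft prompts across different tasks and models. In cross-task transfer, we find that trained soft prompts can transfer to similar tasks and initialize PT for them, accelerating training and improving performance. Moreover, to explore which factors influence the cross-task transferability of prompts, we investigate how to measure prompt similarity and find that the overlapping rate of activated neurons correlates highly with transferability. In cross-model transfer, we explore how to project the prompts of one PLM to another PLM and successfully train a projector that achieves non-trivial transfer performance on similar tasks. However, initializing PT with the projected prompts does not work well, which may be caused by optimization preferences and the high redundancy of PLMs. Our findings show that improving PT with knowledge transfer is possible and promising, and that the cross-task transferability of prompts is generally better than their cross-model transferability.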
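Stance detection aims to determine whether the author of a text is in favor of, against, or neutral toward a given target. The main challenges of this task are two-fold: few-shot learning resulting from the variety of targets, and the lack of contextual information about the targets. Existing works mainly address the second issue by designing attention-based models or introducing noisy external knowledge, while the first issue remains unexplored. In this paper, inspired by the potential of pre-trained language models (PLMs), we propose to introduce prompt-based fine-tuning for stance detection. PLMs can provide essential contextual information for the targets and enable few-shot learning via prompts. Considering the crucial role of the target in the stance detection task, we design target-aware prompts and propose a novel verbalizer. Instead of mapping each label to a concrete word, our verbalizer maps each label to a vector and picks the label that best captures the correlation between the stance and the target. Moreover, to alleviate the possible shortcomings of handling varying targets with a single hand-crafted prompt, we propose to distill the information learned from multiple prompts. Experimental results show the superior performance of our proposed model in both full-data and few-shot scenarios.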
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT4, with 4 layers, is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT6, with 6 layers, performs on par with its teacher BERT-Base.
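As a rough illustration of the teacher-student idea above, the sketch below computes two generic distillation terms in plain Python: a soft cross-entropy between temperature-scaled teacher and student logits, and a mean squared error between hidden vectors. The function names, the temperature default, and the assumption that teacher and student hidden vectors share a dimension are illustrative choices, not details taken from the abstract, and this is not the full Transformer distillation method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def hidden_mse(teacher_hidden, student_hidden):
    """Mean squared error between two hidden vectors of equal length (assumption)."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.5, 0.5, 0.5]
    print(soft_cross_entropy(teacher, student))
    print(hidden_mse([1.0, 2.0], [1.5, 2.5]))
```

The soft cross-entropy is smallest when the student reproduces the teacher's distribution, which is why it can serve as a training signal for the smaller model.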
This work introduces a new multi-task, parameter-efficient language model (LM) tuning method that learns to transfer knowledge across different tasks via a mixture of soft prompts, i.e., small prefix embedding vectors pre-trained for different tasks. Our method, called ATTEMPT (ATTEntional Mixtures of Prompt Tuning), obtains source prompts as encodings of large-scale source tasks into a small number of parameters and trains an attention module to interpolate the source prompts and a newly initialized target prompt for every instance in the target task. During training, only the target task prompt and the attention weights, which are shared between tasks in multi-task training, are updated, while the original LM and source prompts are kept intact. ATTEMPT is highly parameter-efficient (e.g., it updates 2,300 times fewer parameters than full fine-tuning) while achieving high task performance using knowledge from high-resource tasks. Moreover, it is modular thanks to its use of pre-trained soft prompts, and source prompts can be flexibly added or removed for effective knowledge transfer. Our experimental results across 21 diverse NLP datasets show that ATTEMPT significantly outperforms prompt tuning and outperforms or matches fully fine-tuned or other parameter-efficient tuning approaches that use over ten times more parameters. Finally, ATTEMPT outperforms previous work in few-shot learning settings.
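The interpolation step described above can be sketched as a softmax-weighted sum of prompts. This is a minimal sketch under stated assumptions: each prompt is a list of token vectors of identical shape, and the attention scores are passed in directly, whereas in the described method they would come from a trained attention module for each input instance. The name `mix_prompts` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(prompts, scores):
    """Interpolate prompts with attention weights.

    prompts: list of prompts, each a list of token vectors (same shape).
    scores:  one unnormalized attention score per prompt (assumed given).
    """
    weights = softmax(scores)
    n_tokens = len(prompts[0])
    dim = len(prompts[0][0])
    mixed = [[0.0] * dim for _ in range(n_tokens)]
    for weight, prompt in zip(weights, prompts):
        for i in range(n_tokens):
            for j in range(dim):
                mixed[i][j] += weight * prompt[i][j]
    return mixed


if __name__ == "__main__":
    source_prompt = [[1.0, 0.0]]   # one token, two dimensions
    target_prompt = [[0.0, 1.0]]
    print(mix_prompts([source_prompt, target_prompt], [0.0, 0.0]))
```

Because the source prompts are only read, never updated, adding or removing one changes just the list passed to `mix_prompts`.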
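In the past few years, Transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress such models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in Transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through the framework, systematic and extensive experiments that consumed more than 23,000 GPU hours provide an analysis from the perspectives of knowledge types, matching strategies, width-depth trade-off, initialization, model size, and so on. Our empirical results yield a relatively significant improvement over the previous state of the art (SOTA) in pre-trained language models. Finally, we provide a best-practice guideline for KD in Transformer-based models.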
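Adapting large-scale pre-trained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful, as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pre-trained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By training only 0.047% of a pre-trained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning in low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.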
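The weight construction mentioned above, a sum of Kronecker products between shared matrices and per-layer rank-one matrices, can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions: matrices are lists of rows, each rank-one factor is the outer product of two vectors, and the names `kron`, `outer`, and `compose_weight` are hypothetical rather than taken from the released code.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def outer(s, t):
    """Rank-one matrix s t^T from two vectors."""
    return [[s_i * t_j for t_j in t] for s_i in s]


def mat_add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def compose_weight(slow, fast_s, fast_t):
    """W = sum_i kron(A_i, s_i t_i^T).

    slow:   list of shared n x n matrices A_i ("slow" weights).
    fast_s: list of vectors s_i, one per A_i ("fast" weights).
    fast_t: list of vectors t_i, one per A_i ("fast" weights).
    """
    weight = None
    for a_i, s_i, t_i in zip(slow, fast_s, fast_t):
        term = kron(a_i, outer(s_i, t_i))
        weight = term if weight is None else mat_add(weight, term)
    return weight


if __name__ == "__main__":
    slow = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
    w = compose_weight(slow, fast_s=[[1, 2], [3, 4]], fast_t=[[1], [1]])
    print(len(w), len(w[0]))  # rows and columns of the composed weight
```

With n shared matrices of size n x n and vectors of length k/n and d/n, the composed weight has shape k x d while the number of stored values grows with k + d rather than k * d.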
Sequence-to-sequence (seq2seq) learning is a popular approach to large-scale language model pretraining. However, prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder plays an important but under-exploited role compared with the decoder regarding downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves seq2seq models by integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish the noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the ability of the seq2seq model to perform conditional generation accurately. On a wide variety of downstream natural language understanding and generation tasks, E2S2 consistently improves the performance of its powerful backbone models, e.g., BART and T5. For example, with the BART backbone, we achieve a +1.1% average gain on the general language understanding evaluation (GLUE) benchmark and a +1.75% F_0.5 score improvement on the CoNLL2014 dataset. We also provide in-depth analyses showing that the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
Language models with the Transformer structure have shown great performance in natural language processing. However, problems such as over-fitting and representation collapse still arise when fine-tuning pre-trained language models on downstream tasks. In this work, we propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformer layers convey more diverse and meaningful language information. Therefore, making the Transformer layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs as a whole. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers. In addition, HyPe incurs negligible computational overhead, and is better than and compatible with previous state-of-the-art fine-tuning techniques.
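Contrastive learning has been proven suitable for learning sentence embeddings and can significantly improve performance on semantic textual similarity (STS) tasks. Recently, large contrastive learning models, e.g., Sentence-T5, tend to learn more powerful sentence embeddings. Though effective, such large models are hard to serve online due to limits on computational resources or time cost. To tackle this, knowledge distillation (KD) is commonly adopted, which can compress a large "teacher" model into a small "student" model but generally suffers from some performance loss. Here we propose an enhanced KD framework termed Distill-Contrast (DisCo). The proposed DisCo framework first uses KD to transfer the capability of a large sentence embedding model to a small student model on large unlabeled data, and then fine-tunes the student model with contrastive learning on labeled training data. For the KD process in DisCo, we further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model fine-tuning, which may improve performance in a way similar to prompt learning. Extensive experiments on 7 STS benchmarks show that student models trained with the proposed DisCo and CKD suffer little or even no performance loss and consistently outperform counterparts of the same parameter size. Surprisingly, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, i.e., Sentence-T5 (11B), with only 1% of the parameters.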
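Pre-trained language models have achieved state-of-the-art results in various natural language processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and a model with 10 billion parameters was trained with it. ERNIE 3.0 outperformed state-of-the-art models on various NLP tasks. To explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan, with up to 260 billion parameters, on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable text. To reduce computational overhead and carbon emissions, we propose an online distillation framework for ERNIE 3.0 Titan, in which the teacher model teaches students and trains itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that ERNIE 3.0 Titan outperforms state-of-the-art models on 68 NLP datasets.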
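Current pre-trained language models (PLMs) are typically trained with static data, ignoring that in real-world scenarios, streaming data from various sources may continuously grow. This requires PLMs to integrate the information from all sources in a lifelong manner. Although this goal could be achieved by exhaustive pre-training on all existing data, such a process is known to be computationally expensive. To this end, we propose ELLE, aiming at efficient lifelong pre-training for emerging data. Specifically, ELLE consists of (1) function-preserved model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks. We experiment with streaming data from 5 domains on BERT and GPT. The results show the superiority of ELLE over various lifelong learning baselines in both pre-training efficiency and downstream performance. The code is publicly available at https://github.com/thunlp/elle.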
Given the success with in-context learning of large pre-trained language models, we introduce in-context learning distillation to transfer in-context few-shot learning ability from large models to smaller models. We propose to combine in-context learning objectives with language modeling objectives to distill both the ability to read in-context examples and task knowledge to the smaller models. We perform in-context learning distillation under two different few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). Multitask-ICT performs better on multitask few-shot learning but also requires more computation than Meta-ICT. Our method shows consistent improvements for both Meta-ICT and Multitask-ICT on two benchmarks: LAMA and CrossFit. Our extensive experiments and analysis reveal that in-context learning objectives and language modeling objectives are complementary under the Multitask-ICT paradigm. In-context learning objectives achieve the best performance when combined with language modeling objectives.
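Knowledge distillation is an approach for transferring information about representations from a teacher to a student by reducing the difference between them. A challenge of this approach is that it reduces the flexibility of the student's representations, which leads to inaccurate learning of the teacher's knowledge. To resolve this in BERT transfer, we investigate the distillation of representation structures specified as three types: intra-feature, local inter-feature, and global inter-feature structures. To transfer them, we introduce feature structure distillation methods based on Centered Kernel Alignment, which assigns a consistent value to similar feature structures and reveals more informative relations. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. In experiments on the nine language understanding tasks of the GLUE dataset, the proposed methods effectively transfer the three types of structures and improve performance compared to state-of-the-art distillation methods. The code for the methods is available at https://github.com/maroo-sky/fsd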
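Knowledge graph (KG) embedding seeks to learn vector representations for entities and relations. Conventional models reason over graph structures, but they suffer from the problems of graph incompleteness and long-tail entities. Recent studies have used pre-trained language models to learn embeddings based on the textual information of entities and relations, but they cannot take advantage of graph structures. In this paper, we show empirically that these two kinds of features are complementary for KG embedding. To this end, we propose CoLE, a co-distillation learning method for KG embedding that exploits the complementarity of graph structures and text information. Its graph embedding model employs a Transformer to reconstruct the representation of an entity from its neighborhood subgraph. Its text embedding model uses a pre-trained language model to generate entity representations from the soft prompts of their names, descriptions, and relational neighbors. To let the two models promote each other, we propose co-distillation learning, which allows them to distill selective knowledge from each other's prediction logits. In our co-distillation learning, each model serves as both a teacher and a student. Experiments on benchmark datasets demonstrate that the two models outperform their related baselines, and that the ensemble method CoLE with co-distillation learning advances the state of the art of KG embedding.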
Conventional fine-tuning encounters increasing difficulties given the size of current pre-trained language models (PLMs), which makes parameter-efficient tuning the focal point of frontier research. Previous methods in this field add tunable adapters into the MHA and/or FFN of Transformer blocks to enable PLMs to achieve transferability. However, although layer normalization is an important part of the Transformer architecture, its power for parameter-efficient tuning has been ignored. In this paper, we first propose LN-tuning, which tunes only the gain and bias terms of the Layer Normalization module, amounting to only 0.03% of the parameters; it is highly time-efficient and significantly superior to baselines that have fewer than 0.1% tunable parameters. Further, we study unified frameworks that combine LN-tuning with previous methods and find that: (1) the unified framework combining prefix-tuning, the adapter-based method working on MHA, and LN-tuning achieves SOTA performance; (2) unified frameworks that tune MHA and LayerNorm simultaneously improve performance, but those that tune FFN and LayerNorm simultaneously decrease performance. An ablation study validates that LN-tuning contains no redundant parameters and gives a further understanding of it.
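The selection step behind this idea, keeping only the layer normalization gain and bias trainable, can be sketched as a filter over parameter names. This is a minimal sketch under stated assumptions: the model is represented by a plain dictionary from parameter name to parameter count, the parameter names and sizes are made-up toy values, and the `LayerNorm` keyword is an assumed naming convention rather than something stated in the abstract.

```python
def select_layernorm_params(param_sizes, keyword="LayerNorm"):
    """Return the subset of parameters whose name contains the keyword."""
    return {name: size for name, size in param_sizes.items() if keyword in name}


def tunable_fraction(param_sizes, selected):
    """Fraction of all parameters that would be updated."""
    return sum(selected.values()) / sum(param_sizes.values())


if __name__ == "__main__":
    # Toy parameter table (hypothetical names and sizes).
    toy_model = {
        "layer.0.attention.weight": 4000,
        "layer.0.attention.LayerNorm.weight": 10,  # gain
        "layer.0.attention.LayerNorm.bias": 10,    # bias
        "layer.0.ffn.weight": 5960,
        "layer.0.ffn.LayerNorm.weight": 10,
        "layer.0.ffn.LayerNorm.bias": 10,
    }
    selected = select_layernorm_params(toy_model)
    print(sorted(selected))
    print(tunable_fraction(toy_model, selected))
```

In a real training setup the same filter would decide which parameters receive gradient updates while all others stay frozen.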
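Pre-trained models have been shown to be effective in many code intelligence tasks. These models are pre-trained on large-scale unlabeled corpora and then fine-tuned on downstream tasks. However, because the inputs to pre-training and downstream tasks take different forms, it is hard to fully exploit the knowledge of pre-trained models. Besides, the performance of fine-tuning strongly relies on the amount of downstream data, while in practice, scenarios with scarce data are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new paradigm for tuning, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this paper, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on the popular pre-trained models CodeBERT and CodeT5 and experiment with three code intelligence tasks: defect prediction, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all three tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that, instead of fine-tuning, we could adapt prompt tuning for code intelligence tasks to achieve better performance, especially when task-specific data is lacking.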
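Since open social platforms allow a large flow of unverified information, rumors can emerge unexpectedly and spread quickly. However, existing rumor detection (RD) models often assume the same training and testing distributions and cannot cope with the continuously changing social network environment. This paper proposes a Continual Prompt-Tuning RD (CPT-RD) framework, which avoids catastrophic forgetting (CF) of upstream tasks during sequential task learning and enables bidirectional knowledge transfer between domain tasks. Specifically, we propose the following strategies: (a) our design explicitly decouples shared and domain-specific knowledge, thus reducing interference among different domains during optimization; (b) several techniques aim to transfer knowledge of upstream tasks to deal with emergencies; (c) a task-conditioned prompt-wise hypernetwork (TPHNet) is used to consolidate past domains. In addition, CPT-RD avoids CF without the need for a rehearsal buffer.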
Although continually extending an existing NMT model to new domains or languages has attracted intensive interest in recent years, the equally valuable problem of continually improving a given NMT model in its domain by leveraging knowledge from an unlimited number of existing NMT models is not explored yet. To facilitate the study, we propose a formal definition for the problem named knowledge accumulation for NMT (KA-NMT) with corresponding datasets and evaluation metrics and develop a novel method for KA-NMT. We investigate a novel knowledge detection algorithm to identify beneficial knowledge from existing models at token level, and propose to learn from beneficial knowledge and learn against other knowledge simultaneously to improve learning efficiency. To alleviate catastrophic forgetting, we further propose to transfer knowledge from previous to current version of the given model. Extensive experiments show that our proposed method significantly and consistently outperforms representative baselines under homogeneous, heterogeneous, and malicious model settings for different language pairs.