Language models built on the Transformer architecture have shown great performance in natural language processing. However, problems such as over-fitting or representation collapse still arise when fine-tuning pre-trained language models (PLMs) on downstream tasks. In this work, we propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformer layers convey more diverse and meaningful language information. Therefore, making the Transformer layers more robust to hidden-representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers. In addition, HyPe incurs negligible computational overhead, and it outperforms and is compatible with previous state-of-the-art fine-tuning techniques.
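A minimal sketch of the core idea in PyTorch, assuming a plain stack of encoder layers; the noise scale `eps`, the uniform/normal switch, and the class name are illustrative rather than the paper's exact recipe:

```python
import torch
import torch.nn as nn

class HyPeEncoder(nn.Module):
    """Stack of Transformer layers whose inputs are perturbed with small
    noise during training only (a sketch of hidden-representation perturbation)."""

    def __init__(self, layers, eps=1e-5, noise="normal"):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.eps = eps          # noise scale (hypothetical default)
        self.noise = noise      # "normal" or "uniform"

    def forward(self, hidden):
        for layer in self.layers:
            if self.training:
                if self.noise == "normal":
                    hidden = hidden + self.eps * torch.randn_like(hidden)
                else:  # uniform noise in [-eps, eps]
                    hidden = hidden + self.eps * (2 * torch.rand_like(hidden) - 1)
            hidden = layer(hidden)
        return hidden

# usage: perturb the inputs of four vanilla encoder layers
layers = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4)]
enc = HyPeEncoder(layers)
out = enc(torch.randn(2, 10, 64))   # (batch, seq_len, hidden)
```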
Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs on domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the general knowledge previously acquired by general PLMs, which leads to catastrophic forgetting and sub-optimal performance. To alleviate this problem, we propose General Memory Augmented Pre-trained Language Model (G-MAP), a new framework that augments the domain-specific PLM with a memory representation built from the frozen general PLM, without losing any general knowledge. Specifically, we propose a new memory-augmented layer and, based on it, explore different augmentation strategies to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds of tasks (text classification, QA, NER), and the extensive results show that the proposed G-MAP achieves SOTA results on all tasks.
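The fusion step can be pictured as a small layer that lets the domain PLM attend to the frozen general PLM's representation and gate how much of it to absorb. The cross-attention-plus-gate form below is only one plausible instantiation of a memory-augmented layer, not necessarily the one used in G-MAP:

```python
import torch
import torch.nn as nn

class MemoryAugmentedLayer(nn.Module):
    """Sketch: fuse a frozen general-PLM "memory" into a domain PLM's hidden
    states with cross-attention plus a learned gate."""

    def __init__(self, d_model, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, domain_hidden, general_memory):
        # attend from domain hidden states to the frozen general representation
        mem, _ = self.cross_attn(domain_hidden, general_memory, general_memory)
        g = torch.sigmoid(self.gate(torch.cat([domain_hidden, mem], dim=-1)))
        return self.norm(domain_hidden + g * mem)

layer = MemoryAugmentedLayer(d_model=64)
fused = layer(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
```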
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
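The disentangled attention score sums content-to-content, content-to-position, and position-to-content terms computed from separate content and relative-position vectors. The single-head sketch below illustrates that decomposition; DeBERTa's multi-head layout, exact 1/sqrt(3d) scaling, and relative-position bucketing are simplified:

```python
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    """Single-head sketch: attention between tokens i and j sums
    content-to-content, content-to-position and position-to-content scores."""

    def __init__(self, d_model, max_rel_dist=128):
        super().__init__()
        self.q_c, self.k_c, self.v_c = (nn.Linear(d_model, d_model) for _ in range(3))
        self.q_p, self.k_p = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)
        self.max_rel_dist = max_rel_dist
        self.scale = d_model ** -0.5

    def forward(self, h):                                     # h: (batch, seq, d_model)
        L = h.size(1)
        idx = torch.arange(L, device=h.device)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        p = self.rel_emb(rel + self.max_rel_dist)             # (L, L, d) relative-position vectors
        qc, kc, vc = self.q_c(h), self.k_c(h), self.v_c(h)
        c2c = torch.einsum("bid,bjd->bij", qc, kc)
        c2p = torch.einsum("bid,ijd->bij", qc, self.k_p(p))   # query content, key position
        p2c = torch.einsum("ijd,bjd->bij", self.q_p(p), kc)   # query position, key content
        attn = torch.softmax((c2c + c2p + p2c) * self.scale, dim=-1)
        return attn @ vc

out = DisentangledAttention(64)(torch.randn(2, 16, 64))
```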
Prompt tuning, which freezes the pre-trained language model (PLM) and only fine-tunes the parameters of a few additional soft prompts, shows competitive performance against full-parameter fine-tuning (i.e., model tuning) when the PLM has billions of parameters, but still performs poorly with smaller PLMs. Hence, prompt transfer (PoT), which initializes the target prompt with a trained prompt from a similar source task, was recently proposed to improve prompt tuning. However, such a vanilla PoT approach usually achieves sub-optimal performance, as (i) PoT is sensitive to the similarity of the source-target pair, and (ii) directly fine-tuning a target prompt initialized with the source prompt on the target task may lead to catastrophic forgetting of the source knowledge. To tackle these issues, we propose a new metric to accurately predict prompt transferability (regarding (i)), and a novel PoT approach (namely PANDA) that leverages knowledge distillation to transfer the "knowledge" from the source prompt to the target prompt in a subtle manner and effectively alleviates catastrophic forgetting (regarding (ii)). Furthermore, to achieve adaptive prompt transfer for each source-target pair, we use our metric to control the knowledge transfer in the PANDA approach. Extensive and systematic experiments on 189 combinations of 21 source and 9 target datasets across 5 scales of PLMs show that: 1) our proposed metric predicts prompt transferability well; 2) our PANDA consistently outperforms vanilla PoT by 2.3% average score (up to 24.1%) across all tasks and model sizes; and 3) with our PANDA approach, prompt tuning can achieve competitive or even better performance than model tuning across various PLM scales. Code and models will be released upon acceptance.
Activation functions can have a significant impact on reducing the topological complexity of input data and thus improve model performance. Choosing a suitable activation function is an important step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models: their activation functions are chosen beforehand and then kept fixed from pre-training through fine-tuning. As a result, the inductive bias they impose on the model cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow up on previous work (e.g., BERT) and use the same activation function without justification. In this paper, we investigate the effectiveness of using Rational Activation Functions (RAFs) in Transformer architectures. In contrast to conventional, predefined activation functions, RAFs can adaptively learn the optimal activation function from the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms its counterpart on most tasks and settings. For instance, in the low-data regime (with 100 training examples), RAFT outperforms its counterpart by 5.71 points on average on the GLUE benchmark, and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially across different layers of the pre-trained model and mostly look different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models based on their learned activation functions.
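A rational activation is simply a ratio of two learnable polynomials. The sketch below uses the common Padé-style parameterization with an absolute-value denominator to avoid poles; the polynomial degrees and initialization are illustrative, not the paper's settings:

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Learnable rational activation R(x) = P(x) / Q(x) with a safe
    denominator Q(x) = 1 + |b_1 x + ... + b_n x^n|."""

    def __init__(self, num_degree=5, den_degree=4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(num_degree + 1) * 0.1)  # numerator coefficients
        self.b = nn.Parameter(torch.randn(den_degree) * 0.1)      # denominator coefficients

    def forward(self, x):
        num = sum(a_k * x ** k for k, a_k in enumerate(self.a))
        den = 1.0 + torch.abs(sum(b_k * x ** (k + 1) for k, b_k in enumerate(self.b)))
        return num / den

# drop-in replacement for GELU inside a feed-forward block
ffn = nn.Sequential(nn.Linear(64, 256), RationalActivation(), nn.Linear(256, 64))
out = ffn(torch.randn(2, 10, 64))
```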
Owing to their strong performance, pre-trained language models have become the standard approach for many NLP tasks, but they are expensive to train. We propose TLM, a simple and efficient learning framework that does not rely on large-scale pre-training. Given some labeled task data and a large general corpus, TLM uses the task data as queries to retrieve a tiny subset of the general corpus, and jointly optimizes the task objective and a language modeling objective from scratch. On eight classification datasets across four domains, TLM achieves results better than or comparable to pre-trained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. With high accuracy and efficiency, we hope TLM will contribute to democratizing NLP and expediting its development.
Transformer-based pre-trained models with millions of parameters require large storage. Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters. In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed. AdapterBias adds a token-dependent shift to the hidden output of Transformer layers to adapt to downstream tasks using only a vector and a linear layer. Extensive experiments are conducted to demonstrate the effectiveness of AdapterBias. The experiments show that our proposed method can dramatically reduce the number of trainable parameters compared to previous works, with minimal loss in task performance compared to fine-tuning the pre-trained model. We further find that AdapterBias automatically learns to assign larger representation shifts to the tokens related to the task.
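The mechanism is small enough to write out directly: a shared shift vector plus a tiny linear layer that produces one scaling weight per token. This is a sketch of that token-dependent shift; its placement inside the Transformer block and training details are omitted:

```python
import torch
import torch.nn as nn

class AdapterBias(nn.Module):
    """Sketch of a token-dependent shift added to a Transformer layer's
    output: a shared shift vector v scaled per token by a linear layer."""

    def __init__(self, d_model):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))   # shared shift direction
        self.alpha = nn.Linear(d_model, 1)            # per-token scaling weight

    def forward(self, hidden):                        # hidden: (batch, seq, d_model)
        return hidden + self.alpha(hidden) * self.v   # broadcast (B, L, 1) * (d_model,)

shift = AdapterBias(64)
out = shift(torch.randn(2, 10, 64))
```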
The advent of large-scale pre-trained language models has contributed greatly to recent progress in natural language processing. Many state-of-the-art language models are first trained on a large text corpus and then fine-tuned on downstream tasks. Despite its recent success and wide adoption, fine-tuning a pre-trained language model often suffers from over-fitting, which leads to poor generalizability due to the extremely high complexity of the model and the limited training samples of downstream tasks. To address this problem, we propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR). Specifically, we propose to inject standard Gaussian noise or in-manifold noise and regularize the hidden representations of the fine-tuned model. We first provide theoretical analyses to support the efficacy of our method. We then demonstrate the advantages of the proposed method over other state-of-the-art algorithms, including L2-SP, Mixout, and SMART. While these previous works only verify the effectiveness of their methods on relatively simple text classification tasks, we also verify the effectiveness of our method on question answering tasks, where the target problem is much more difficult and more training examples are available. Furthermore, extensive experimental results indicate that the proposed algorithm can not only enhance the in-domain performance of language models but also improve their domain generalization performance on out-of-domain data.
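A rough sketch of the training objective: run the upper layers on clean and noise-perturbed hidden states and penalize how much the outputs move, on top of the usual task loss. The noise level, the pooling, and the detached clean branch are assumptions of this sketch, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lnsr_loss(model_layers, hidden, labels, head, sigma=0.01, lam=1.0):
    """Sketch of layerwise noise stability regularization: penalize the output
    difference between clean and noise-perturbed hidden states."""
    def run(h):
        for layer in model_layers:
            h = layer(h)
        return head(h.mean(dim=1))             # pooled logits

    clean = run(hidden)
    noisy = run(hidden + sigma * torch.randn_like(hidden))
    task = F.cross_entropy(clean, labels)
    stability = F.mse_loss(noisy, clean.detach())
    return task + lam * stability

layers = [nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(2)]
head = nn.Linear(64, 3)
loss = lnsr_loss(layers, torch.randn(2, 10, 64), torch.tensor([0, 2]), head)
```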
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, a systematic study of how the two types of approaches compare, and of how different design choices affect the performance of PLM-based models, is still lacking. To fill in this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices, including in-domain PLMs, PLMs with different pre-training objectives, using PLMs with a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers in the encoder leads to better encoder-decoder models; and (3) introducing four in-domain PLMs, we achieve competitive performance in the news domain and state-of-the-art performance in the scientific domain.
This paper presents DeBERTaV3, a new pre-trained language model that improves the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance, because the training losses of the discriminator and the generator pull token embeddings in different directions, creating a "tug-of-war" dynamic. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamic, improving both training efficiency and the quality of the pre-trained model. We pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3-Large model achieves an average score of 91.37%, which is 1.37% higher than DeBERTa and 1.91% higher than ELECTRA, setting a new state of the art (SOTA) among models with a similar structure. In addition, we pre-trained a multilingual model, mDeBERTa, which shows a larger improvement over strong baselines than its English counterparts. For example, mDeBERTa-Base achieves 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R-Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.
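Gradient-disentangled embedding sharing can be sketched as follows: the discriminator reads the generator's embedding table through a stop-gradient and learns only a small residual table of its own, so the RTD loss cannot pull on the embeddings trained by MLM. The residual initialization and table sizes below are illustrative:

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Sketch of gradient-disentangled embedding sharing for the discriminator."""

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        self.gen_emb = generator_embedding                      # trained by the MLM loss
        self.delta = nn.Embedding(generator_embedding.num_embeddings,
                                  generator_embedding.embedding_dim)
        nn.init.zeros_(self.delta.weight)                       # start from the shared table

    def forward(self, input_ids):
        shared = self.gen_emb(input_ids).detach()               # stop-gradient on the shared part
        return shared + self.delta(input_ids)                   # discriminator-only residual

gen_emb = nn.Embedding(30522, 64)
disc_emb = GDESEmbedding(gen_emb)
vecs = disc_emb(torch.randint(0, 30522, (2, 10)))
```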
At the core of self-supervised learning for pre-training language models are the design of the pre-training task and the appropriate data augmentation. Most data augmentation in language model pre-training is context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA, which achieves state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training the main discrimination network (discriminator). This design, however, introduces additional computation cost for the generator and requires adjusting the relative capability between the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used to conduct both regular pre-training and contextualized data augmentation for the training of later epochs. Essentially, this strategy eliminates the separate generator and uses a single network to jointly perform two pre-training tasks with an MLM (masked language modeling) head and an RTD (replaced token detection) head. It avoids the challenge of finding an appropriately sized generator, which is critical to the performance demonstrated in ELECTRA and its subsequent variant models. In addition, SAS is a general strategy that can be seamlessly combined with many new techniques emerging recently or in the future, such as the disentangled attention mechanism of DeBERTa. Our experiments show that SAS is able to outperform ELECTRA and other state-of-the-art models on GLUE tasks with similar or less computation cost.
Motivation: A perennial challenge for biomedical researchers and clinical practitioners is keeping up with the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pre-training on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for end tasks remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pre-training settings, especially in low-resource domains. Large models have the potential to attain better performance, but increasing model size also exacerbates fine-tuning instability. We therefore conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing the lower layers is helpful for standard BERT-Base models, while layerwise decay is more effective for BERT-Large and ELECTRA models. For low-resource text similarity tasks such as BIOSSES, re-initializing the top layers is the optimal strategy. Overall, domain-specific vocabulary and pre-training facilitate more robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications. Availability and implementation: To facilitate progress in biomedical NLP, we release our state-of-the-art pre-trained and fine-tuned models: https://aka.ms/blurb.
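The stabilization techniques highlighted above (freezing lower layers, re-initializing top layers, and layerwise learning-rate decay) can be combined in a small helper like the sketch below; the layer access pattern, initialization, and default hyper-parameters are assumptions for illustration:

```python
import torch
import torch.nn as nn

def stabilize_for_finetuning(layers: nn.ModuleList, n_freeze=0, n_reinit=0,
                             base_lr=2e-5, decay=0.9):
    """Sketch of three fine-tuning stabilization tricks: freeze the lowest
    layers, re-initialize the top layers, and build layerwise-decayed LRs."""
    for layer in layers[:n_freeze]:                      # 1) freeze the lowest layers
        for p in layer.parameters():
            p.requires_grad = False
    for layer in layers[len(layers) - n_reinit:]:        # 2) re-init the top layers
        for m in layer.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    groups = []                                          # 3) layerwise LR decay
    for i, layer in enumerate(layers):
        params = [p for p in layer.parameters() if p.requires_grad]
        if params:
            groups.append({"params": params,
                           "lr": base_lr * decay ** (len(layers) - 1 - i)})
    return groups

layers = nn.ModuleList(nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(6))
opt = torch.optim.AdamW(stabilize_for_finetuning(layers, n_freeze=2, n_reinit=1))
```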
Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that focuses on detecting the sentiment polarity towards aspects in a sentence. However, it is always sensitive to the multi-aspect challenge, where the features of multiple aspects in a sentence affect each other. To alleviate this issue, we design a novel training framework, called Contrastive Cross-Channel Data Augmentation (C3DA), which leverages an in-domain generator to construct more diverse multi-aspect samples and then boosts the robustness of the ABSA model via contrastive learning on these generated data. In practice, given a generative pre-trained language model and some limited ABSA labeled data, we first employ some parameter-efficient approaches to perform in-domain fine-tuning. Then, the obtained in-domain generator is used to generate synthetic sentences from two channels, i.e., an aspect augmentation channel and a polarity augmentation channel, which generate sentences conditioned on a given aspect and polarity, respectively. Specifically, our C3DA performs sentence generation in a cross-channel manner to obtain more sentences, and proposes an entropy-minimization filter to filter out low-quality generated samples. Extensive experiments show that our C3DA can outperform baselines without further augmentation by about 1% in accuracy and macro-F1. Code and data are released at https://github.com/wangbing1416/c3da.
The quadratic computational and memory complexity of the Transformer's attention mechanism limits its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. The packed sequence is then unpacked using the second attention function. Compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pre-training. Competitive or even better experimental results demonstrate the effectiveness and efficiency of Luna compared to a variety of strong baselines.
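The pack-and-unpack structure can be sketched with two standard multi-head attention calls: a fixed-length learnable sequence attends over the input, and the input then attends over that packed summary, so no L x L attention matrix is ever formed. Details such as carrying the packed sequence across layers and Luna's specific normalization are simplified:

```python
import torch
import torch.nn as nn

class LunaBlock(nn.Module):
    """Sketch of pack-and-unpack attention with a fixed-length extra sequence p."""

    def __init__(self, d_model, pack_len=16, nhead=4):
        super().__init__()
        self.p = nn.Parameter(torch.randn(pack_len, d_model) * 0.02)
        self.pack = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.unpack = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        p = self.p.unsqueeze(0).expand(x.size(0), -1, -1)
        packed, _ = self.pack(p, x, x)                 # (batch, pack_len, d_model)
        unpacked, _ = self.unpack(x, packed, packed)   # (batch, seq_len, d_model)
        return unpacked, packed                        # packed would feed the next layer

block = LunaBlock(64)
y, ctx = block(torch.randn(2, 100, 64))
```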
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
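The probing idea amounts to swapping the input-dependent attention weights for one fixed matrix, e.g. the attention averaged over many inputs, while keeping the value and output projections. The sketch below illustrates that replacement; the buffer shape and the random stand-in for the averaged weights are placeholders:

```python
import torch
import torch.nn as nn

class ConstantAttention(nn.Module):
    """Sketch: every input reuses one fixed (input-independent) attention matrix;
    only the value/output projections still depend on the input."""

    def __init__(self, avg_attn, d_model):
        super().__init__()
        self.register_buffer("attn", avg_attn)        # (max_len, max_len), rows sum to 1
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, hidden):                         # hidden: (batch, seq_len, d_model)
        L = hidden.size(1)
        mix = self.attn[:L, :L] @ self.value(hidden)   # same mixing weights for every input
        return self.out(mix)

avg = torch.softmax(torch.randn(128, 128), dim=-1)     # stand-in for averaged attention weights
layer = ConstantAttention(avg, 64)
out = layer(torch.randn(2, 50, 64))
```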
Recent works on the Lottery Ticket Hypothesis have shown that pre-trained language models (PLMs) contain smaller matching subnetworks (winning tickets) which are capable of reaching accuracy comparable to the original models. However, these tickets are shown to be not robust to adversarial examples, and even worse than their PLM counterparts. To address this problem, we propose a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs. Since the loss is not differentiable with respect to the binary mask, we assign the hard concrete distribution to the masks and encourage their sparsity using a smoothing approximation of L0 regularization. Furthermore, we design an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well both in accuracy and robustness. Experimental results show the significant improvement of the proposed method over previous work on adversarial robustness evaluation.
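A sketch of the masking machinery: hard concrete samples give an approximately binary mask during training, and the closed-form expectation of the number of non-zeros serves as the sparsity penalty. The stretch limits and temperature below are the standard values from the L0-regularization literature rather than the paper's:

```python
import torch
import torch.nn as nn

class HardConcreteMask(nn.Module):
    """Learnable (approximately) binary weight mask via the hard concrete
    distribution, with the expected-L0 penalty for sparsity."""

    def __init__(self, shape, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(shape))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def l0_penalty(self):
        # expected number of non-zero mask entries
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

mask = HardConcreteMask((768, 768))
masked_weight = mask() * torch.randn(768, 768)   # apply to a PLM weight matrix
loss_sparsity = 1e-4 * mask.l0_penalty()
```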
Token uniformity is commonly observed in Transformer-based models, in which different tokens share a large proportion of similar information after passing through stacked self-attention layers in a Transformer. In this paper, we propose to use the distribution of singular values of the outputs of each Transformer layer to characterize the phenomenon of token uniformity, and empirically illustrate that a less skewed singular value distribution can alleviate the "token uniformity" problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighborhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of Transformer-based language models such as BERT, ALBERT, RoBERTa, and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenuni.git.
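The overall procedure can be sketched as: decompose each example's token-embedding matrix with SVD, rescale the singular values with a transformation that flattens their skewed distribution, and reconstruct. The `log1p` transform below is only a stand-in for the paper's transformation function:

```python
import torch

def transform_singular_values(hidden, fn=lambda s: torch.log1p(s)):
    """Sketch: SVD each example's token embeddings (batch, seq, d), reshape the
    singular-value spectrum with a concave transform, and rebuild the embeddings."""
    u, s, vh = torch.linalg.svd(hidden, full_matrices=False)
    s_new = fn(s)
    s_new = s_new * (s.max(dim=-1, keepdim=True).values /
                     s_new.max(dim=-1, keepdim=True).values)   # keep the top singular value's scale
    return u @ torch.diag_embed(s_new) @ vh

out = transform_singular_values(torch.randn(2, 32, 64))
```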
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time. Moreover, TinyBERT6 with 6 layers performs on par with its teacher BERT-Base.
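The layer-to-layer part of Transformer distillation reduces to MSE terms between student and teacher attention matrices and between linearly projected student hidden states and teacher hidden states. The sketch below uses hypothetical shapes and omits the layer-mapping strategy and the embedding/prediction-layer losses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def transformer_distillation_loss(student_hidden, student_attn,
                                  teacher_hidden, teacher_attn, proj):
    """Sketch of layer-to-layer distillation: MSE on attention matrices plus
    MSE on projected hidden states, summed over distilled layer pairs."""
    attn_loss = sum(F.mse_loss(sa, ta) for sa, ta in zip(student_attn, teacher_attn))
    hidden_loss = sum(F.mse_loss(proj(sh), th) for sh, th in zip(student_hidden, teacher_hidden))
    return attn_loss + hidden_loss

# hypothetical shapes: 4 student layers distilled from 4 selected teacher layers
proj = nn.Linear(312, 768)                        # map student width to teacher width
s_h = [torch.randn(2, 10, 312) for _ in range(4)]
t_h = [torch.randn(2, 10, 768) for _ in range(4)]
s_a = [torch.rand(2, 12, 10, 10) for _ in range(4)]
t_a = [torch.rand(2, 12, 10, 10) for _ in range(4)]
loss = transformer_distillation_loss(s_h, s_a, t_h, t_a, proj)
```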
Pre-trained language models (PLMs) have achieved remarkable success on various natural language understanding tasks. On the other hand, simple fine-tuning of PLMs may be sub-optimal for domain-specific tasks because PLMs cannot possibly cover the knowledge of all domains. While adaptive pre-training of PLMs can help them obtain domain-specific knowledge, it requires large training costs. Moreover, adaptive pre-training may harm the PLM's performance on downstream tasks by causing catastrophic forgetting of its general knowledge. To overcome such limitations of adaptive pre-training for PLM adaptation, we propose a novel domain adaptation framework for PLMs coined Knowledge-Augmented Language model Adaptation (KALA), which modulates the intermediate hidden representations of PLMs with domain knowledge consisting of entities and their relational facts. We validate the performance of KALA on question answering and named entity recognition tasks on multiple datasets across various domains. The results show that, despite being computationally efficient, our KALA largely outperforms adaptive pre-training. Code is available at: https://github.com/nardien/kala/.
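One simple way to picture the modulation is a scale-and-shift of each token's hidden state conditioned on an entity embedding aligned to that token, with an identity modulation where no entity is linked. The FiLM-style form below is a sketch of this idea, not necessarily KALA's exact parameterization:

```python
import torch
import torch.nn as nn

class KnowledgeModulation(nn.Module):
    """Sketch: modulate a layer's hidden states with a scale and shift computed
    from entity embeddings aligned to each token; the retrieval of entities and
    relational facts is assumed to happen upstream."""

    def __init__(self, d_model, d_entity):
        super().__init__()
        self.to_scale = nn.Linear(d_entity, d_model)
        self.to_shift = nn.Linear(d_entity, d_model)

    def forward(self, hidden, entity_emb, entity_mask):
        # entity_emb: (batch, seq, d_entity); entity_mask: (batch, seq, 1) in {0, 1}
        scale = 1.0 + entity_mask * torch.tanh(self.to_scale(entity_emb))
        shift = entity_mask * self.to_shift(entity_emb)
        return scale * hidden + shift

mod = KnowledgeModulation(d_model=64, d_entity=32)
h = mod(torch.randn(2, 10, 64), torch.randn(2, 10, 32),
        torch.randint(0, 2, (2, 10, 1)).float())
```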