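Pretrained language models have shown success in various areas of natural language processing, including reading comprehension tasks. However, when applying machine learning methods to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domain-specific downstream tasks. We evaluate zero-shot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labeled data from the target task. Our approach outperforms domain-adaptive pretraining on downstream domain-specific reading comprehension tasks in 3 out of 4 domains.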
translated by 谷歌翻译
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.
Recently, domain-specific PLMs have been proposed to boost task performance in specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs on domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the general knowledge previously acquired by general PLMs, which leads to catastrophic forgetting and sub-optimal performance. To alleviate this problem, we propose a new framework, the General Memory Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM with a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer, and based on it, different augmentation strategies are explored to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds of tasks (text classification, QA, NER), and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks.
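Language models are pre-trained on large general-domain corpora such as Book Corpus, Common Crawl, and Wikipedia, which is essential for the model to learn the linguistic characteristics of the language. Recent studies suggest using Domain-Adaptive Pre-Training (DAPT) and Task-Adaptive Pre-Training (TAPT) as an intermediate step before the final fine-tuning task. This step helps cover the target domain vocabulary and improves model performance on the downstream task. In this work, we study the impact of training only the embedding layer on model performance during TAPT and task-specific fine-tuning. Based on our study, we propose a simple approach that makes the intermediate TAPT step more efficient for BERT-based models by selectively pre-training BERT layers. We show that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance. Our approach is computationally efficient, with 78% fewer parameters trained during TAPT. The proposed embedding-layer fine-tuning approach can also serve as an efficient domain adaptation technique.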
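To make the efficiency claim concrete, the following is a rough parameter count, not the authors' own accounting. It assumes the publicly documented BERT-base configuration (vocabulary 30522, hidden size 768, 12 layers, feed-forward size 3072) and ignores any task or masked-LM head; the abstract does not state which model or head was counted.

```python
# Rough parameter count for an ASSUMED BERT-base-style encoder.
# All sizes below are assumptions taken from the standard BERT-base config,
# not values stated in the abstract.
VOCAB, HIDDEN, MAX_POS, TYPES = 30522, 768, 512, 2
LAYERS, FFN = 12, 3072


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Scale plus shift of a LayerNorm."""
    return 2 * size


# Word, position, and token-type embeddings plus the embedding LayerNorm.
embedding_params = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm(HIDDEN)

# One encoder layer: Q, K, V, and output projections, two LayerNorms,
# and the two feed-forward projections.
per_layer_params = (
    4 * linear(HIDDEN, HIDDEN)
    + layer_norm(HIDDEN)
    + linear(HIDDEN, FFN)
    + linear(FFN, HIDDEN)
    + layer_norm(HIDDEN)
)

pooler_params = linear(HIDDEN, HIDDEN)
total_params = embedding_params + LAYERS * per_layer_params + pooler_params

# Share of parameters that stay trainable if only the embeddings are updated.
trainable_fraction = embedding_params / total_params

print(f"embedding parameters: {embedding_params:,}")
print(f"total parameters:     {total_params:,}")
print(f"trainable share:      {trainable_fraction:.1%}")
```

Under these assumptions the embedding layer holds a little over a fifth of the encoder's parameters, so freezing everything else leaves roughly 78% of the parameters untrained.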
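In recent years, pretrained language models have revolutionized the NLP world, achieving state-of-the-art performance on various downstream tasks. However, in many cases these models do not perform well when labeled data is scarce and the model is expected to perform in a zero-shot or few-shot setting. Recently, several works have shown that continued pretraining, or a second phase of pretraining that is better aligned with the downstream task, can lead to improved results, especially in scarce-data settings. Here, we propose to leverage sentiment-carrying discourse markers to generate large-scale weakly labeled data, which in turn can be used to adapt language models for sentiment analysis. Extensive experimental results show the value of our approach on various benchmark datasets, including the finance domain. Code, models, and data are available at https://github.com/ibm/tslm-discourse-markers.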
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the textual domain, pre-trained language models like BERT have been shown to be helpful, providing a good baseline for further fine-tuning. However, due to the domain knowledge and the many technical terms in cybersecurity, general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model performs best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The dataset used and the trained model are made publicly available.
Recent advances in NLP have been driven by a range of large-scale pretrained language models (PLMs). These PLMs have brought significant performance gains on a range of NLP tasks, circumventing the need to customize complex designs for specific tasks. However, most current work focuses on fine-tuning PLMs on domain-specific datasets, ignoring the fact that the domain gap can lead to overfitting and even a drop in performance. Therefore, it is practically important to find an appropriate method to effectively adapt PLMs to a target domain of interest. Recently, a range of methods have been proposed to achieve this purpose. Early surveys on domain adaptation are not suitable for PLMs, because PLMs behave in more sophisticated ways than traditional models trained from scratch and domain adaptation for PLMs needs to be redesigned to take effect. This paper aims to provide a survey of these newly proposed methods and to shed light on how to apply traditional machine learning methods to newly evolved and future technologies. By examining the issues of deploying PLMs for downstream tasks, we propose a taxonomy of domain adaptation approaches from a machine learning system view, covering methods for input augmentation, model optimization and personalization. We discuss and compare those methods and suggest promising future research directions.
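Pre-trained language models (PLMs) have achieved remarkable success on various natural language understanding tasks. Simple fine-tuning of PLMs, on the other hand, might be suboptimal for domain-specific tasks because they cannot possibly cover knowledge from all domains. While adaptive pre-training of PLMs can help them obtain domain-specific knowledge, it requires a large training cost. Moreover, adaptive pre-training can harm the PLM's performance on the downstream task by causing catastrophic forgetting of its general knowledge. To overcome these limitations of adaptive pre-training for PLM adaptation, we propose a novel domain adaptation framework for PLMs, coined Knowledge-Augmented Language model Adaptation (KALA), which modulates the intermediate hidden representations of PLMs with domain knowledge, consisting of entities and their relational facts. We validate the performance of KALA on question answering and named entity recognition tasks on multiple datasets across various domains. The results show that, despite being computationally efficient, KALA largely outperforms adaptive pre-training. Code is available at: https://github.com/nardien/kala/.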
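This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (fine-tuning language models on a collection of tasks described via instructions) substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components to the success of instruction tuning.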
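As an illustration of what "verbalizing" a task through an instruction template can look like, here is a minimal sketch. The template wording, the field names, and the `verbalize` helper are all hypothetical and are not taken from FLAN; they only show how a labeled example might be turned into an instruction-style prompt and target.

```python
# Hypothetical instruction template for a natural language inference example.
# The wording is illustrative only and is not one of the templates used for FLAN.
TEMPLATE = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\n"
    "OPTIONS:\n{options}"
)


def verbalize(example: dict, template: str = TEMPLATE) -> tuple[str, str]:
    """Turn one labeled example into an (instruction prompt, target text) pair."""
    options = "\n".join(f"- {option}" for option in example["options"])
    prompt = template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=options,
    )
    target = example["options"][example["label"]]
    return prompt, target


example = {
    "premise": "A dog is running through a field.",
    "hypothesis": "An animal is outdoors.",
    "options": ["yes", "no"],
    "label": 0,
}
prompt, target = verbalize(example)
print(prompt)
print("TARGET:", target)
```

Writing several such templates per task and sampling among them is one way to vary the phrasing the model sees during instruction tuning.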
In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can benefit substantially from additional training rounds over large-scale in-domain resources. However, these resources are often out of reach for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to deriving biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from the study offer valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.
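Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast of the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pretraining on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for an end task remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study of fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains. Large models have the potential to attain better performance, but increasing model size also exacerbates fine-tuning instability. We thus conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks such as BIOSSES, reinitializing the top layer is the optimal strategy. Overall, domain-specific vocabulary and pretraining facilitate more robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications. Availability and implementation: To facilitate progress in biomedical NLP, we release our state-of-the-art pretrained and fine-tuned models: https://aka.ms/blurb.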
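The layerwise decay mentioned above can be sketched as follows. This uses a common formulation in which the top encoder layer receives the base learning rate and each layer below it is scaled by a constant factor; the abstract does not give the exact schedule, so the formula, the function name, and the parameter names are assumptions.

```python
def layerwise_learning_rates(base_lr: float, decay: float, num_layers: int) -> list[float]:
    """Return one learning rate per depth, from the embeddings (index 0) to the top layer.

    Assumed formulation: lr(depth) = base_lr * decay ** (num_layers - depth),
    so the top layer trains at base_lr and lower layers train more slowly.
    """
    return [base_lr * decay ** (num_layers - depth) for depth in range(num_layers + 1)]


# Small illustrative setting: embeddings plus three encoder layers.
rates = layerwise_learning_rates(base_lr=1.0, decay=0.5, num_layers=3)
for depth, rate in enumerate(rates):
    name = "embeddings" if depth == 0 else f"layer {depth}"
    print(f"{name:>10}: {rate}")
```

Freezing lower layers can be seen as the limiting case of the same idea, in which the learning rate of the lowest layers is set to zero rather than merely reduced.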
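Pretrained language models have become the standard approach for many NLP tasks due to their strong performance, but they are very expensive to train. We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining. Given some labeled task data and a large general corpus, TLM uses the task data as queries to retrieve a tiny subset of the general corpus and jointly optimizes the task objective and the language modeling objective from scratch. On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing training FLOPs by two orders of magnitude. With its high accuracy and efficiency, we hope TLM will contribute to democratizing NLP and expediting its development.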
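Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints and approach it as a customer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. We then evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which combination strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works better. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.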
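Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data, which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers degrades severely under a domain shift. This limits the use of dense retrieval approaches to the few domains with large training datasets. In this paper, we propose a novel unsupervised domain adaptation method, Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find that the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 8.9 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than previous methods. We further investigate the role of six recent pre-training methods in the scenario of domain adaptation for retrieval tasks, of which only three yield improved results. The best approach, TSDAE (Wang et al., 2021), can be combined with GPL, yielding another average improvement of 1.0 points nDCG@10 across the six tasks.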
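For readers unfamiliar with the metric behind these numbers, the sketch below computes nDCG@10 for a single query. It assumes linear gains and a log2 rank discount, and it derives the ideal ranking from the relevance labels of the returned list only; the abstract does not specify the evaluation toolkit or gain function, so these choices are assumptions.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the documents in ranked order for one hypothetical query.
perfect = ndcg_at_k([1, 1, 0, 0])      # relevant documents ranked first
second_place = ndcg_at_k([0, 1])       # the single relevant document ranked second
no_hits = ndcg_at_k([0, 0, 0])         # nothing relevant retrieved

print(perfect, second_place, no_hits)
```

Scores are usually averaged over all queries and reported on a 0 to 100 scale, so a gain of 8.9 points corresponds to 0.089 on the 0 to 1 scale computed here.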
Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conduct a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) on biomedical corpora of different sizes. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained on general corpora.
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia-style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (i.e., retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict whether intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
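Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative self-training framework, RGX, for automatically generating more non-trivial question-answer pairs to improve model performance. RGX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity recognizer, a question generator, and an answer extractor. Given a passage with a masked entity, the generator generates a question around the entity, and the extractor is trained to extract the masked entity using the generated question and the raw text. The framework allows question generation and answering models to be trained on any text corpus without annotation. Experimental results show that RGX outperforms state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question answering benchmarks, and yields new SOTA performance under the given model size and transfer learning settings.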
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on the text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely studied text classification datasets.
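The Query Focused Text Summarization (QFTS) task aims to build systems that generate a summary of one or more text documents based on a given query. A key challenge in addressing this task is the lack of large labeled datasets for training the summarization model. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective at generating abstractive summaries for the QFTS task, while setting new state-of-the-art results across a set of automatic and human evaluation metrics.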
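NLP is a form of artificial intelligence and machine learning concerned with the ability of a computer or machine to understand and interpret human language. Language models are crucial in text analytics and NLP because they allow computers to interpret qualitative input and convert it into quantitative data that can be used in other tasks. In essence, in the context of transfer learning, language models are typically trained on a large generic corpus, referred to as the pre-training stage, and then fine-tuned on a specific underlying task. As a result, pre-trained language models are mostly used as baseline models that incorporate a broad grasp of context and can be further customized for use in a new NLP task. The majority of pre-trained models are trained on corpora from general domains, such as Twitter, newswire, Wikipedia, and the Web. Such off-the-shelf NLP models trained on general text may be inefficient and inaccurate in specialized fields. In this paper, we propose a cybersecurity language model called SecureBERT, which is able to capture text connotations in the cybersecurity domain and can therefore be used to automate many important cybersecurity tasks that would otherwise rely on human expertise and tedious manual effort. SecureBERT is trained on a large corpus of cybersecurity text that we collected and preprocessed from a variety of sources in cybersecurity and the general computing domain. Using our proposed methods for tokenization and model weight adjustment, SecureBERT is not only able to preserve an understanding of general English, as most pre-trained language models do, but is also effective when applied to text with cybersecurity implications.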