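Language models underpin all modern natural language processing (NLP) tasks. The introduction of the Transformer architecture has made language modeling highly effective across many NLP tasks, leading to significant advances in the field. However, Transformers come with a large computational cost, which grows quadratically with the input length. This presents a challenge, because understanding long texts requires a lot of context. In this paper, we propose a fine-tuning framework, named CoreLM, that extends the architecture of current pretrained language models so that they incorporate explicit entity information. By introducing entity representations, we make available information from outside the model's contextual space, which results in a better language model for a fraction of the computational cost. We implement our approach using GPT2 and compare the fine-tuned model to the original. Our proposed model achieves lower perplexity on the GUMBY and LAMBADA datasets than both GPT2 and a fine-tuned version of GPT2 without any changes. We also compare the models' accuracy on LAMBADA and the Children's Book Test, with and without the use of model-created coreference annotations.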
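The perplexity comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes perplexity is computed as the exponential of the mean negative log-likelihood per token; the per-token probabilities below are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities two models assign to the same 4-token text.
baseline = [math.log(p) for p in (0.10, 0.20, 0.05, 0.25)]
with_entities = [math.log(p) for p in (0.20, 0.25, 0.10, 0.25)]

# A model that assigns higher probability to the observed tokens
# has lower perplexity.
print(perplexity(baseline), perplexity(with_entities))
```

Lower perplexity means the model found the observed text less surprising, which is the sense in which the abstract reports an improvement.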
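Large Transformer-based pre-trained language models have achieved impressive performance on a variety of knowledge-intensive tasks and can capture factual knowledge in their parameters. We argue that storing large amounts of knowledge in the model parameters is sub-optimal, given the ever-growing amount of knowledge and the associated resource requirements. We posit that a more efficient alternative is to give the model explicit access to contextually relevant structured knowledge and train it to use that knowledge. We present LM-CORE, a general framework to achieve this, which decouples language model training from the external knowledge source and allows the latter to be updated without affecting the already trained model. Experimental results show that LM-CORE, with access to external knowledge, significantly and robustly outperforms state-of-the-art knowledge-enhanced language models on knowledge probing tasks; can effectively handle knowledge updates; and performs well on two downstream tasks. We also present a thorough error analysis highlighting the successes and failures of LM-CORE.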
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
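Pre-trained models with knowledge-enhanced language representations have been shown to be more effective than language models such as BERT in knowledge base construction tasks (i.e., relation extraction). These knowledge-enhanced language models incorporate knowledge into pre-training to generate representations of entities or relations. However, existing methods typically represent each entity with a separate embedding. As a result, these methods struggle to represent out-of-vocabulary entities, a large number of parameters must be used on top of their underlying token models (i.e., the Transformer), and the number of entities that can be handled is limited in practice by memory constraints. Moreover, existing models still struggle to represent entities and relations simultaneously. To address these problems, we propose a new pre-trained model that learns representations of entities and relations from token spans and span pairs in the text, respectively. By encoding spans efficiently with span modules, our model can represent both entities and their relations while requiring fewer parameters than existing models. We pre-trained our model on a knowledge graph extracted from Wikipedia and tested it on a broad range of supervised and unsupervised information extraction tasks. Results show that our model learns better representations of both entities and relations than the baselines, while in supervised settings, fine-tuning our model consistently outperforms RoBERTa and achieves competitive results on information extraction tasks.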
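Currently, the most widespread neural network architecture for training language models is the so-called BERT, which has led to improvements in various natural language processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained on these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this article, we investigate various training techniques for smaller BERT models: we combine different methods from other BERT variants, such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications that lead to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small increase in performance compared to classical Multi-Head Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, two of which are introduced by this article.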
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
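Pre-trained language models (PLMs) have achieved remarkable success on various natural language understanding tasks. However, simple fine-tuning of PLMs may be suboptimal for domain-specific tasks, because they cannot possibly cover knowledge from all domains. While adaptive pre-training of PLMs can help them acquire domain-specific knowledge, it incurs a large training cost. Moreover, adaptive pre-training can harm a PLM's performance on downstream tasks by causing catastrophic forgetting of its general knowledge. To overcome these limitations of adaptive pre-training for PLM adaptation, we propose a novel domain adaptation framework for PLMs, coined Knowledge-Augmented Language model Adaptation (KALA), which modulates the intermediate hidden representations of PLMs with domain knowledge consisting of entities and their relational facts. We validate the performance of KALA on question answering and named entity recognition tasks on multiple datasets across various domains. The results show that, despite being computationally efficient, KALA largely outperforms adaptive pre-training. Code is available at: https://github.com/nardien/kala/.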
Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/ERNIE.
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
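We present PanGu-Coder, a pretrained decoder-only language model that adopts the PanGu-Alpha architecture for text-to-code generation, i.e., the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and trains on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs, and demonstrate that it achieves equivalent or better performance than similarly sized models, such as Codex, while attending over a smaller context window and training on less data.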
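To make the two training objectives named in this abstract concrete, here is a minimal, generic sketch of how causal and masked language modelling examples can be built from a token sequence. It is not PanGu-Coder's actual data pipeline: the `[MASK]` symbol, the 15% masking rate, and the function names `clm_example` and `mlm_example` are all illustrative assumptions.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not from the paper)


def clm_example(tokens):
    """Causal LM: at each position, the target is the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens and predict them from the full context.

    Targets hold the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


tokens = "def add ( a , b ) : return a + b".split()
clm_inputs, clm_targets = clm_example(tokens)
mlm_inputs, mlm_targets = mlm_example(tokens)
```

The causal objective only ever sees tokens to the left of the prediction, while the masked objective sees context on both sides of each hidden token.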
The prediction of protein structures from sequences is an important task for function prediction, drug design, and the understanding of related biological processes. Recent advances have proved the power of language models (LMs) in processing protein sequence databases; these models inherit the advantages of attention networks and capture useful information when learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. It seems that, instead of energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as mainstream paradigms in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and to introduce their methodologies, advancements and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, different types of methods for PSP are discussed, particularly how the pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs and tackle challenging problems in this field for practical purposes.
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
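Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as is the case for tasks such as question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without needing as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks from very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.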
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
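The idea of casting every problem into a text-to-text format can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `to_text_to_text`, the task names, and the exact prefix strings are hypothetical choices for this sketch, not the paper's released preprocessing code.

```python
def to_text_to_text(task, **fields):
    """Render one task instance as an (input_text, target_text) string pair.

    Task names and prefix strings here are illustrative assumptions.
    """
    if task == "translation":
        prefix = f"translate {fields['src_lang']} to {fields['tgt_lang']}: "
        return prefix + fields["text"], fields["translation"]
    if task == "summarization":
        return "summarize: " + fields["document"], fields["summary"]
    if task == "classification":
        # The label is emitted as a plain word, not a class index.
        return f"{fields['name']} sentence: {fields['text']}", fields["label"]
    raise ValueError(f"unknown task: {task}")


pair = to_text_to_text(
    "translation",
    src_lang="English",
    tgt_lang="German",
    text="That is good.",
    translation="Das ist gut.",
)
print(pair)
```

Because inputs and targets are both plain strings, a single sequence-to-sequence model with one training objective can be applied to all of these tasks.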
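Relation extraction (RE) is a fundamental task in natural language processing. RE seeks to transform raw, unstructured text into structured knowledge by identifying relational information between entity pairs in the text. RE has numerous uses, such as knowledge graph completion, text summarization, question answering, and search querying. The history of RE methods can be divided into four phases: pattern-based RE, statistical RE, neural RE, and large language model-based RE. This survey begins with an overview of a few exemplary works from the earlier phases of RE, highlighting their limitations and shortcomings to put progress in context. Next, we review popular benchmarks and critically examine the metrics used to assess RE performance. We then discuss distant supervision, a paradigm that has shaped the development of modern RE methods. Lastly, we review recent work focusing on denoising and pre-training methods.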
Privacy preserving deep learning is an emerging field in machine learning that aims to mitigate the privacy risks in the use of deep neural networks. One such risk is training data extraction from language models that have been trained on datasets which contain personal and privacy sensitive information. In our study, we investigate the extent of named entity memorization in fine-tuned BERT models. We use single-label text classification as a representative downstream task and employ three different fine-tuning setups in our experiments, including one with Differential Privacy (DP). We create a large number of text samples from the fine-tuned BERT models utilizing a custom sequential sampling strategy with two prompting strategies. We search these samples for named entities and check whether they are also present in the fine-tuning datasets. We experiment with two benchmark datasets in the domains of emails and blogs. We show that the application of DP has a huge effect on the text generation capabilities of BERT. Furthermore, we show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only. This suggests that BERT is unlikely to emit personal or privacy sensitive named entities. Overall, our results are important for understanding to what extent BERT-based services are prone to training data extraction attacks.
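Large Transformer-based language models demonstrate excellent performance in natural language processing. Considering the transferability of the knowledge these models gain in one domain to other related domains, and the closeness of natural languages to high-level programming languages such as C/C++, this work studies how to leverage (large) Transformer-based language models to detect software vulnerabilities and how well these models perform on vulnerability detection tasks. In this regard, a systematic (cohesive) framework is first presented that details source code translation, model preparation, and inference. Then, an empirical analysis is performed on software vulnerability datasets of C/C++ source code containing multiple vulnerabilities related to library function calls, pointer usage, array usage, and arithmetic expressions. Our empirical results demonstrate the good performance of language models in vulnerability detection. Moreover, these language models achieve better performance metrics, such as F1-score, than contemporary models, namely bidirectional long short-term memory and bidirectional gated recurrent unit networks. Experimenting with language models is always challenging because of their requirements for computing resources, platforms, libraries, and dependencies. Thus, this paper also analyses popular platforms for efficiently fine-tuning these models and presents recommendations for choosing a platform.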
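Despite advances in generating fluent text, existing pre-trained models tend to attach incoherent event sequences to the entities involved when generating narratives such as stories and news. We conjecture that these issues result from representing entities as static embeddings of surface words while neglecting to model their ever-changing states, i.e., the information they carry, as the text unfolds. Therefore, we extend the Transformer model to dynamically perform entity state updates and sentence realization for narrative generation. We propose a contrastive framework to learn state representations in a discrete space, and insert additional attention layers into the decoder to better exploit these states. Experiments on two narrative datasets show that, with the guidance of meaningful entity states, our model can generate more coherent and diverse narratives than baseline models.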
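There is an ongoing debate in the NLP community about whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but still perform well on their pre-training objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates the language modeling objective to linguistic information. This framework also provides a metric for measuring the impact of linguistic information on the word prediction task. We reinforce our analytical results with experiments on both synthetic and real NLP tasks in English.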
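Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pre-training to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged on (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.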