Pre-trained Language Model (PLM) has become a representative foundation model in the natural language processing field. Most PLMs are trained with linguistic-agnostic pre-training tasks on the surface form of the text, such as the masked language model (MLM). To further empower the PLMs with richer linguistic features, in this paper, we aim to propose a simple but effective way to learn linguistic features for pre-trained language models. We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original MLM pre-training task, using a linguistically-informed pre-training (LIP) strategy. We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements over various comparable baselines. Furthermore, we also conduct analytical experiments in various linguistic aspects, and the results prove that the design of LERT is valid and effective. Resources are available at https://github.com/ymcui/LERT
translated by 谷歌翻译
来自变压器(BERT)的双向编码器表示显示了各种NLP任务的奇妙改进,并且已经提出了其连续的变体来进一步提高预先训练的语言模型的性能。在本文中,我们的目标是首先介绍中国伯特的全文掩蔽(WWM)策略,以及一系列中国预培训的语言模型。然后我们还提出了一种简单但有效的型号,称为Macbert,这在几种方面提高了罗伯塔。特别是,我们提出了一种称为MLM作为校正(MAC)的新掩蔽策略。为了展示这些模型的有效性,我们创建了一系列中国预先培训的语言模型,作为我们的基线,包括BERT,Roberta,Electra,RBT等。我们对十个中国NLP任务进行了广泛的实验,以评估创建的中国人托管语言模型以及提议的麦克白。实验结果表明,Macbert可以在许多NLP任务上实现最先进的表演,我们还通过几种可能有助于未来的研究的调查结果来消融细节。我们开源我们的预先培训的语言模型,以进一步促进我们的研究界。资源可用:https://github.com/ymcui/chinese-bert-wwm
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
预训练的语言模型(PLM)在自然语言理解中的许多下游任务中取得了显着的性能增长。已提出了各种中文PLM,以学习更好的中文表示。但是,大多数当前模型都使用中文字符作为输入,并且无法编码中文单词中包含的语义信息。虽然最近的预训练模型同时融合了单词和字符,但它们通常会遭受不足的语义互动,并且无法捕获单词和字符之间的语义关系。为了解决上述问题,我们提出了一个简单而有效的PLM小扣手,该小扣子采用了对单词和性格表示的对比度学习。特别是,Clower通过对多透明信息的对比学习将粗粒的信息(即单词)隐式编码为细粒度表示(即字符)。在现实的情况下,小电动器具有很大的价值,因为它可以轻松地将其纳入任何现有的基于细粒的PLM中而无需修改生产管道。在一系列下游任务上进行的扩展实验表明,小动物的卓越性能超过了几个最先进的实验 - 艺术基线。
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
事实证明,将先验知识纳入预训练的语言模型中对知识驱动的NLP任务有效,例如实体键入和关系提取。当前的培训程序通常通过使用知识掩盖,知识融合和知识更换将外部知识注入模型。但是,输入句子中包含的事实信息尚未完全开采,并且尚未严格检查注射的外部知识。结果,无法完全利用上下文信息,并将引入额外的噪音,或者注入的知识量受到限制。为了解决这些问题,我们提出了MLRIP,该MLRIP修改了Ernie-Baidu提出的知识掩盖策略,并引入了两阶段的实体替代策略。进行全面分析的广泛实验说明了MLRIP在军事知识驱动的NLP任务中基于BERT的模型的优势。
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Span-BERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT large , our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. 1 * Equal contribution. 1 Our code and pre-trained models are available at https://github.com/facebookresearch/ SpanBERT.
translated by 谷歌翻译
Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https:// github.com/thunlp/ERNIE.
translated by 谷歌翻译
蒙版语言建模(MLM)已被广泛用作培训前语言模型(PRLMS)中的剥夺目标。现有的PRLMS通常采用随机掩盖策略,在该策略中应用固定的掩蔽率,并且在整个培训中都有均等的概率掩盖了不同的内容。但是,该模型可能会受到训练前状态的复杂影响,随着训练时间的发展,这种影响会发生相应的变化。在本文中,我们表明这种时间不变的MLM设置对掩盖比和掩盖内容不太可能提供最佳结果,这激发了我们探索时间变化的MLM设置的影响。我们提出了两种计划的掩蔽方法,可在不同的训练阶段适应掩盖比和内容,从而提高了训练前效率和在下游任务上验证的效率。我们的工作是一项关于比率和内容的时间变化掩盖策略的先驱研究,并更好地了解掩盖比率和掩盖内容如何影响MLM的MLM预训练。
translated by 谷歌翻译
本文旨在通过介绍第一个中国数学预训练的语言模型〜(PLM)来提高机器的数学智能,以有效理解和表示数学问题。与其他标准NLP任务不同,数学文本很难理解,因为它们在问题陈述中涉及数学术语,符号和公式。通常,它需要复杂的数学逻辑和背景知识来解决数学问题。考虑到数学文本的复杂性质,我们设计了一种新的课程预培训方法,用于改善由基本和高级课程组成的数学PLM的学习。特别是,我们首先根据位置偏见的掩盖策略执行令牌级预训练,然后设计基于逻辑的预训练任务,旨在分别恢复改组的句子和公式。最后,我们介绍了一项更加困难的预训练任务,该任务强制执行PLM以检测和纠正其生成的解决方案中的错误。我们对离线评估(包括九个与数学相关的任务)和在线$ A/B $测试进行了广泛的实验。实验结果证明了与许多竞争基线相比,我们的方法的有效性。我们的代码可在:\ textColor {blue} {\ url {https://github.com/rucaibox/jiuzhang}}}中获得。
translated by 谷歌翻译
在NLP社区中有一个正在进行的辩论,无论现代语言模型是否包含语言知识,通过所谓的探针恢复。在本文中,我们研究了语言知识是否是现代语言模型良好表现的必要条件,我们称之为\ Texit {重新发现假设}。首先,我们展示了语言模型,这是显着压缩的,但在预先磨普目标上表现良好,以便在语言结构探讨时保持良好的分数。这一结果支持重新发现的假设,并导致我们的论文的第二款贡献:一个信息 - 理论框架,与语言建模目标相关。该框架还提供了测量语言信息对字词预测任务的影响的度量标准。我们通过英语综合和真正的NLP任务加固我们的分析结果。
translated by 谷歌翻译
与伯特(Bert)等语言模型相比,已证明知识增强语言表示的预培训模型在知识基础构建任务(即〜关系提取)中更有效。这些知识增强的语言模型将知识纳入预训练中,以生成实体或关系的表示。但是,现有方法通常用单独的嵌入表示每个实体。结果,这些方法难以代表播出的实体和大量参数,在其基础代币模型之上(即〜变压器),必须使用,并且可以处理的实体数量为由于内存限制,实践限制。此外,现有模型仍然难以同时代表实体和关系。为了解决这些问题,我们提出了一个新的预培训模型,该模型分别从图书中学习实体和关系的表示形式,并分别在文本中跨越跨度。通过使用SPAN模块有效地编码跨度,我们的模型可以代表实体及其关系,但所需的参数比现有模型更少。我们通过从Wikipedia中提取的知识图对我们的模型进行了预训练,并在广泛的监督和无监督的信息提取任务上进行了测试。结果表明,我们的模型比基线学习对实体和关系的表现更好,而在监督的设置中,微调我们的模型始终优于罗伯塔,并在信息提取任务上取得了竞争成果。
translated by 谷歌翻译
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets. 1
translated by 谷歌翻译
在这项工作中,我们探索如何学习专用的语言模型,旨在学习从文本文件中学习关键词的丰富表示。我们在判别和生成设置中进行预训练变压器语言模型(LMS)的不同掩蔽策略。在歧视性设定中,我们引入了一种新的预训练目标 - 关键边界,用替换(kbir)infifiling,在使用Kbir预先训练的LM进行微调时显示出在Sota上的性能(F1中高达9.26点)的大量增益关键酶提取的任务。在生成设置中,我们为BART - 键盘介绍了一个新的预训练设置,可再现与CATSeq格式中的输入文本相关的关键字,而不是Denoised原始输入。这也导致在关键词中的性能(F1 @ M)中的性能(高达4.33点),用于关键正版生成。此外,我们还微调了在命名实体识别(ner),问题应答(qa),关系提取(重新),抽象摘要和达到与SOTA的可比性表现的预训练的语言模型,表明学习丰富的代表关键词确实有利于许多其他基本的NLP任务。
translated by 谷歌翻译
Logical reasoning of text is an important ability that requires understanding the information present in the text, their interconnections, and then reasoning through them to infer new conclusions. Prior works on improving the logical reasoning ability of language models require complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation solutions that restrict the learning of general logical reasoning skills. In this work, we propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities. We select a subset of Wikipedia, based on a set of logical inference keywords, for continued pretraining of a language model. We use two self-supervised loss functions: a modified masked language modeling loss where only specific parts-of-speech words, that would likely require more reasoning than basic language understanding, are masked, and a sentence-level classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed training paradigm is both simple and independent of task formats. We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets. APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.
translated by 谷歌翻译
Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and in some cases produce models that outperform multilingual approaches.
translated by 谷歌翻译
We introduce a linguistically enhanced combination of pre-training methods for transformers. The pre-training objectives include POS-tagging, synset prediction based on semantic knowledge graphs, and parent prediction based on dependency parse trees. Our approach achieves competitive results on the Natural Language Inference task, compared to the state of the art. Specifically for smaller models, the method results in a significant performance boost, emphasizing the fact that intelligent pre-training can make up for fewer parameters and help building more efficient models. Combining POS-tagging and synset prediction yields the overall best results.
translated by 谷歌翻译
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
translated by 谷歌翻译
用于预培训语言模型的自我监督学习的核心包括预训练任务设计以及适当的数据增强。语言模型中的大多数数据增强都是独立于上下文的。最近在电子中提出了一个开创性的增强,并通过引入辅助生成网络(发电机)来实现最先进的性能,以产生用于培训主要辨别网络(鉴别者)的上下文化数据增强。然而,这种设计引入了发电机的额外计算成本,并且需要调整发电机和鉴别器之间的相对能力。在本文中,我们提出了一种自增强策略(SAS),其中单个网络用于审视以后的时期的培训常规预训练和上下文化数据增强。基本上,该策略消除了单独的发电机,并使用单个网络共同执行具有MLM(屏蔽语言建模)和RTD(替换令牌检测)头的两个预训练任务。它避免了寻找适当大小的发电机的挑战,这对于在电子中证明的性能至关重要,以及其随后的变体模型至关重要。此外,SAS是一项常规策略,可以与最近或将来的许多新技术无缝地结合,例如杜伯塔省的解除关注机制。我们的实验表明,SAS能够在具有相似或更少的计算成本中优于胶水任务中的电磁和其他最先进的模型。
translated by 谷歌翻译