Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones. Taking the original text as positive and the corrupted text as negative samples, the PLM can be trained effectively for contextualized representation. However, the training of such PLMs relies heavily on the quality of the automatically constructed samples. Existing PLMs simply treat all corrupted texts as equally negative without any examination, so the resulting model inevitably suffers from the false negative issue: training is carried out on pseudo-negative data, which makes the resulting PLMs less efficient and less robust. In this work, we define the false negative issue in discriminative PLMs, which has long been ignored, and design enhanced pre-training methods that counteract false negative predictions and encourage pre-training on true negatives by correcting the harmful gradient updates caused by false negative predictions. Experimental results on the GLUE and SQuAD benchmarks show that our counter-false-negative pre-training methods bring better performance together with stronger robustness.
translated by Google Translate
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
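The replaced token detection setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `corrupt_for_rtd`, the toy vocabulary, and the replacement probability are assumptions, and uniform sampling stands in for the small generator network. The point it shows is that every input position receives a binary label, not only a masked subset.

```python
import random


def corrupt_for_rtd(tokens, vocab, replace_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); label 1 = replaced, 0 = original."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Assumption: uniform sampling stands in for the generator network.
            sampled = rng.choice(vocab)
        else:
            sampled = tok
        corrupted.append(sampled)
        # The label compares against the original token, so a sample that
        # happens to equal the original is labeled "original" (0).
        labels.append(int(sampled != tok))
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = corrupt_for_rtd(tokens, vocab, replace_prob=0.3)
```

A discriminator would then be trained to predict `labels` from `corrupted`, one prediction per token.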
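Masked language modeling (MLM) has been widely used as the denoising objective in pre-training language models (PrLMs). Existing PrLMs commonly adopt a random masking strategy in which a fixed masking ratio is applied and different contents are masked with equal probability throughout training. However, the model may be affected in complicated ways by its pre-training status, which changes as training proceeds. In this paper, we show that such time-invariant MLM settings for masking ratio and masked content are unlikely to deliver an optimal outcome, which motivates us to explore the influence of time-variant MLM settings. We propose two scheduled masking approaches that adapt the masking ratio and masked content at different training stages, improving pre-training efficiency and effectiveness as verified on downstream tasks. Our work is a pioneering study of time-variant masking strategies for ratio and content, and gives a better understanding of how masking ratio and masked content influence MLM pre-training.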
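A time-variant masking ratio can be sketched as a function of the training step. The abstract does not give the actual schedule, so the linear interpolation, its direction, and the start and end ratios below are all assumptions chosen only for illustration.

```python
def masking_ratio(step, total_steps, start_ratio=0.3, end_ratio=0.15):
    """Linearly interpolate the masking ratio over training.

    Hypothetical schedule: the shape, direction, and both ratios are
    illustrative assumptions, not values taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + (end_ratio - start_ratio) * progress


def num_tokens_to_mask(seq_len, step, total_steps):
    """Number of positions to mask in one sequence at the given step."""
    return max(1, round(seq_len * masking_ratio(step, total_steps)))
```

A fixed-ratio baseline corresponds to setting `start_ratio` equal to `end_ratio`.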
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity measurement, showing promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g. BERT) for parameter initialization, some work focuses on proposing new pre-training tasks that compress the useful semantic information from passages into dense vectors, achieving remarkable performance. However, it is still challenging to effectively capture the rich semantic information and relations among passages in dense vectors via a single pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passage recovery, related passage recovery and PLM output recovery. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and the relationships among them within the pre-training corpus. The third one can capture knowledge beyond the corpus from external PLMs (e.g. GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly released at https://github.com/microsoft/SimXNS
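Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and successive variants have been proposed to further improve the performance of pre-trained language models. In this paper, we first introduce the whole word masking (WWM) strategy for Chinese BERT, along with a series of Chinese pre-trained language models. We then propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways. In particular, we propose a new masking strategy called MLM as correction (Mac). To demonstrate the effectiveness of these models, we create a series of Chinese pre-trained language models as our baselines, including BERT, RoBERTa, ELECTRA, RBT, etc. We carry out extensive experiments on ten Chinese NLP tasks to evaluate the created Chinese pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT achieves state-of-the-art performance on many NLP tasks, and we also ablate details, with several findings that may help future research. We open-source our pre-trained language models to further facilitate our research community. Resources are available at: https://github.com/ymcui/chinese-bert-wwm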
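Whole word masking for character-tokenized Chinese text can be sketched as below. The sketch assumes the sentence has already been split into words by some external segmenter; the function name, the masking probability, and the example sentence are illustrative assumptions. The only property it demonstrates is that the characters of a selected word are masked together rather than independently.

```python
import random


def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask every character of each selected word.

    `words` is a pre-segmented sentence (assumption: segmentation is done
    elsewhere); the output is a character-level token list.
    """
    rng = random.Random(seed)
    out = []
    for word in words:
        if rng.random() < mask_prob:
            # All characters of the word are masked as one unit.
            out.extend([mask_token] * len(word))
        else:
            out.extend(list(word))
    return out


words = ["使用", "语言", "模型", "来", "预测"]
masked = whole_word_mask(words, mask_prob=0.5)
```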
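The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations in language model pre-training are context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA and achieved state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training a main discrimination network (discriminator). This design, however, introduces the extra computation cost of the generator and requires tuning the relative capability of the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used both for regular pre-training and for contextualized data augmentation for training in later epochs. Essentially, this strategy eliminates the separate generator and uses a single network to jointly perform two pre-training tasks with MLM (masked language modeling) and RTD (replaced token detection) heads. It avoids the challenge of searching for an appropriately sized generator, which is critical to performance, as evidenced in ELECTRA and its subsequent variants. In addition, SAS is a general strategy that can be seamlessly combined with many new techniques emerging recently or in the future, such as the disentangled attention mechanism from DeBERTa. Our experiments show that SAS is able to outperform ELECTRA and other state-of-the-art models on the GLUE tasks with similar or lower computation cost.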
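Pre-trained language models (PLMs) have achieved remarkable performance gains across many downstream tasks in natural language understanding. Various Chinese PLMs have been proposed to learn better Chinese language representations. However, most current models use Chinese characters as inputs and cannot encode the semantic information contained in Chinese words. While recent pre-trained models incorporate both words and characters, they usually suffer from insufficient semantic interaction and fail to capture the semantic relations between words and characters. To address these issues, we propose a simple yet effective PLM, CLOWER, which adopts contrastive learning over word and character representations. In particular, CLOWER implicitly encodes coarse-grained information (i.e., words) into fine-grained representations (i.e., characters) through contrastive learning on multi-grained information. CLOWER is of great value in realistic scenarios, since it can be easily incorporated into any existing fine-grained PLM without modifying the production pipeline. Extensive experiments on a range of downstream tasks demonstrate the superior performance of CLOWER over several state-of-the-art baselines.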
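This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM) for effectively understanding and representing mathematical problems. Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement. Solving mathematical problems typically requires complex mathematical logic and background knowledge. Considering the complex nature of mathematical texts, we design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses. Specifically, we first perform token-level pre-training based on a position-biased masking strategy, and then design logic-based pre-training tasks that aim to recover shuffled sentences and shuffled formulas, respectively. Finally, we introduce a more difficult pre-training task that forces the PLM to detect and correct the errors in its generated solutions. We conduct extensive experiments on offline evaluation (including nine math-related tasks) and online A/B testing. Experimental results demonstrate the effectiveness of our approach compared with a number of competitive baselines. Our code is available at: https://github.com/rucaibox/jiuzhang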
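Recent years have witnessed great progress in applying pre-trained language models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in web pages, have been leveraged for designing pre-training objectives. For example, the anchor texts of hyperlinks have been used to simulate queries, thus constructing a tremendous number of query-document pairs for pre-training. However, as a bridge across two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents connected by hyperlinks and design a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.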
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
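The text-to-text idea above, where every task is cast as mapping an input string to a target string, can be sketched as follows. The function name `to_text_to_text`, the field names, and the exact prefix strings are illustrative assumptions and should not be read as the released preprocessing code.

```python
def to_text_to_text(task, **fields):
    """Render one task instance as an (input_text, target_text) pair of strings.

    Assumption: the prefix wording and field names are chosen for
    illustration only.
    """
    if task == "translation":
        source = (
            f"translate {fields['src_lang']} to {fields['tgt_lang']}: "
            f"{fields['text']}"
        )
        target = fields["translation"]
    elif task == "classification":
        source = f"{fields['name']} sentence: {fields['text']}"
        # The class is emitted as a word, not as a class index.
        target = fields["label"]
    elif task == "summarization":
        source = f"summarize: {fields['text']}"
        target = fields["summary"]
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target
```

Because inputs and targets are both plain strings, one model, loss, and decoding procedure can serve every task.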
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
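Masking contiguous random spans, item (1) above, can be sketched as below. The abstract does not specify how span lengths are drawn, so the capped geometric length distribution, the 15% masking budget, and the function name are assumptions of this sketch. It works on token positions only and ignores word boundaries.

```python
import random


def sample_span_mask(seq_len, mask_ratio=0.15, p=0.2, max_span=10, seed=0):
    """Return sorted token positions to mask, chosen as contiguous spans.

    Assumptions: geometric span lengths with parameter `p`, capped at
    `max_span`, and a total budget of `mask_ratio * seq_len` positions.
    """
    rng = random.Random(seed)
    budget = max(1, round(seq_len * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Geometric length: extend the span with probability 1 - p.
        length = 1
        while length < max_span and rng.random() > p:
            length += 1
        # Never exceed the remaining budget.
        length = min(length, budget - len(masked))
        start = rng.randrange(0, seq_len - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)


positions = sample_span_mask(seq_len=100)
```

The span boundary objective in item (2) would then predict each masked token from the representations just outside the span; that part is not shown here.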
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
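The disentangled attention idea can be written out as a short derivation. The notation here is an assumption introduced for illustration: $H_i$ is the content vector of token $i$ and $P_{i|j}$ is the vector encoding the position of token $i$ relative to token $j$. Representing each token by this pair and taking the inner product of two such pairs decomposes the attention score into four terms.

```latex
% Sketch only: H_i (content) and P_{i|j} (relative position) are
% illustrative symbols, and learned projections are omitted.
\begin{aligned}
A_{i,j} &= \{H_i,\, P_{i|j}\} \times \{H_j,\, P_{j|i}\}^{\top} \\
        &= \underbrace{H_i H_j^{\top}}_{\text{content-to-content}}
         + \underbrace{H_i P_{j|i}^{\top}}_{\text{content-to-position}}
         + \underbrace{P_{i|j} H_j^{\top}}_{\text{position-to-content}}
         + \underbrace{P_{i|j} P_{j|i}^{\top}}_{\text{position-to-position}}
\end{aligned}
```

In a standard Transformer, content and position are summed into one vector before attention, so these terms cannot be weighted separately; keeping them apart is what "disentangled" refers to.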
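In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which shows large gains in performance (up to 9.26 points in F1) over the SOTA when an LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over the SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE) and abstractive summarization, and achieve performance comparable to the SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.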
This paper presents a pre-training technique called query-as-context that uses query prediction to improve dense retrieval. Previous research has applied query prediction to document expansion in order to alleviate the problem of lexical mismatch in sparse retrieval. However, query prediction has not yet been studied in the context of dense retrieval. Query-as-context pre-training assumes that the predicted query is a special context for the document and uses contrastive learning or contextual masked auto-encoding learning to compress the document and query into dense vectors. The technique is evaluated on large-scale passage retrieval benchmarks and shows considerable improvements compared to existing strong baselines such as coCondenser and CoT-MAE, demonstrating its effectiveness. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .
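Recently, there has been a surge of interest in applying pre-trained language models (Pr-LMs) to automatic open-domain dialog evaluation. Pr-LMs offer a promising direction for addressing the multi-domain evaluation challenge. Yet, the impact of different Pr-LMs on the performance of automatic metrics is not well understood. This paper examines 8 different Pr-LMs and studies their impact on three typical automatic dialog evaluation metrics across three different dialog evaluation benchmarks. Specifically, we analyze how the choice of Pr-LM affects the performance of automatic metrics. Extensive correlation analyses on each of the metrics are performed to assess the effects of different Pr-LMs along various axes, including pre-training objectives, dialog evaluation criteria, model size, and cross-dataset robustness. This study serves as the first comprehensive assessment of the effects of different Pr-LMs on automatic dialog evaluation.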
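Pretrained language models have become the standard approach for many NLP tasks due to their strong performance, but they are very expensive to train. We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining. Given some labeled task data and a large general corpus, TLM uses the task data as queries to retrieve a tiny subset of the general corpus and jointly optimizes the task objective and the language modeling objective from scratch. On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. With high accuracy and efficiency, we hope TLM will contribute to democratizing NLP and expediting its development.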
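The two steps named in this abstract, retrieving a small corpus subset with task data as queries and then combining the two objectives, can be sketched as below. Everything specific here is an assumption for illustration: token overlap stands in for whatever retriever is actually used, and the function names, `k`, and the loss weight are hypothetical.

```python
def retrieve_subset(task_texts, corpus, k=2):
    """For each task example, keep its top-k corpus documents.

    Assumption: plain token overlap is a stand-in for a real retriever.
    """
    corpus_tokens = [set(doc.lower().split()) for doc in corpus]
    selected = set()
    for query in task_texts:
        q = set(query.lower().split())
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: len(q & corpus_tokens[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    # Return documents in their original corpus order.
    return [corpus[i] for i in sorted(selected)]


def joint_loss(task_loss, lm_loss, lm_weight=1.0):
    """Weighted sum of the task objective and the language modeling objective."""
    return task_loss + lm_weight * lm_loss
```

The model would then be trained from scratch on the retrieved subset plus the task data, minimizing `joint_loss` at each step.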
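Incorporating prior knowledge into pre-trained language models has proven effective for knowledge-driven NLP tasks, such as entity typing and relation extraction. Current pre-training procedures usually inject external knowledge into models through knowledge masking, knowledge fusion and knowledge replacement. However, the factual information contained in the input sentences has not been fully mined, and the external knowledge to be injected has not been strictly checked. As a result, context information cannot be fully exploited and extra noise is introduced, or the amount of injected knowledge is limited. To address these issues, we propose MLRIP, which modifies the knowledge masking strategies proposed by ERNIE-Baidu and introduces a two-stage entity replacement strategy. Extensive experiments with comprehensive analyses illustrate the superiority of MLRIP over BERT-based models on military knowledge-driven NLP tasks.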
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
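Fitting complex patterns in the training data, such as reasoning and controversy, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, the model has likely fitted patterns that frequently co-occur with the mis-predicted word, and these patterns lead to the mis-prediction. If we add regularization that trains the model to rely less on such dominating patterns when a mis-prediction occurs and to focus more on the remaining, more subtle patterns, more information can be fitted efficiently during pre-training. Following this motivation, we propose a new language pre-training method, Mis-Predictions as Harm Alerts (MPA). In MPA, when a mis-prediction occurs during pre-training, we use its co-occurrence information to guide several heads of the self-attention modules. Some self-attention heads in the Transformer modules are optimized to assign lower attention weights to the words in the input sentence that frequently co-occur with the mis-prediction, while assigning higher weights to the other words. By doing so, the Transformer model is trained to rely less on the dominating, frequently co-occurring patterns and to focus more on the remaining, more complex information when a mis-prediction occurs. Our experiments show that MPA expedites the pre-training of BERT and ELECTRA and improves their performance on downstream tasks.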