本文提出了一种新的预先接受训练的语言模型Debertav3,它通过用更换的令牌检测(RTD)更换掩模语言建模(MLM)来改善原始的Deberta模型,更高的预训练任务。我们的分析表明,Vanilla嵌入了电力中的共享损害培训效率和模型性能。这是因为鉴别器的培训损失和发电机的销售损失在不同的方向上拉动令牌嵌入,从而创造“拔河”动态。因此,我们提出了一种新的梯度 - 解开嵌入共享方法,避免了战争动态,提高了训练效率和预训练模型的质量。我们使用与Deberta相同的设置预先接受了培训的Debertav3,以展示其在广泛的下游自然语言理解(NLU)任务上的特殊表现。以八个任务为例,Debertav3大型模型以八个任务为例,平均得分为91.37%,杜伯塔省的1.37%和电力1.91%,在模型中设置新的最先进(SOTA)具有类似的结构。此外,我们预先培训了多语思伯类Mdeberta,与英语模型相比,对强基线的更大改善。例如,Mdeberta基地达到XNLI的79.8%零射频精度和超过XLM-R基础的3.6%的改进,在此基准上创建了一个新的Sota。我们在HTTPS://github.com/microsoft/deberta公开提供我们预先接受的模型和推理码。
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
本文介绍了Z-Code ++,这是一种针对抽象文本摘要优化的新的预训练的语言模型。该模型使用三种技术扩展了艺术编码器模型的状态。首先,我们使用两阶段的预训练过程来改善模型在低资源摘要任务上的性能。该模型首先是使用文本语料库进行语言理解的预先培训的,然后在汇总语料库中不断预先培训,以进行基础文本生成。其次,我们用分离的注意力层代替编码器中的自我发项层,其中每个单词都使用两个向量分别代表其内容和位置。第三,我们使用融合编码器,这是一种以层次方式编码长序列的简单而有效的方法。 Z-Code ++在13个文本摘要任务中的9个跨5种语言中创建了新的艺术状态。我们的模型的参数有效,因为它的表现优于XSUM上600倍较大的Palm-540b,并且在Samsum上的易经的200倍GPT3-175B较大。在零射击和少量设置中,我们的模型大大优于竞争模型。
translated by 谷歌翻译
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
在培训数据中拟合复杂的模式,例如推理和争议,是语言预训练的关键挑战。根据最近的研究和我们的经验观察,一种可能的原因是训练数据中的一些易于适应的模式,例如经常共同发生的单词组合,主导和伤害预训练,使模型很难适合更复杂的信息。我们争辩说,错误预测可以帮助找到危害语言理解的这种主导模式。当发生错误预测时,应该经常与导致MIS预测的模型拟合的MIS预测字相同的模式。如果我们可以添加正规化以培训模型,当MIS预测发生并更多地对待更微妙的模式时,可以在更多信息上缩小到这种主导模式时,可以在预训练中有效地安装更多信息。在此动机之后,我们提出了一种新的语言预培训方法,错误预测作为伤害警报(MPA)。在MPA中,当在预训练期间发生错误预测时,我们使用其共同发生信息来指导自我关注模块的多个头部。变压器模块中的一些自我关注头经过优化,以将更低的注意重量分配给频繁地在误报中的输入句子中的单词,同时将更高权重分配给另一个单词。通过这样做,变压器模型训练,以依赖于主导的频繁共同发生模式,而在误报中,当发生错误预测时,在剩余更复杂的信息上更加关注更多。我们的实验表明,MPA加快了伯特和电器的预训练,并提高了他们对下游任务的表现。
translated by 谷歌翻译
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
translated by 谷歌翻译
来自变压器(BERT)的双向编码器表示显示了各种NLP任务的奇妙改进,并且已经提出了其连续的变体来进一步提高预先训练的语言模型的性能。在本文中,我们的目标是首先介绍中国伯特的全文掩蔽(WWM)策略,以及一系列中国预培训的语言模型。然后我们还提出了一种简单但有效的型号,称为Macbert,这在几种方面提高了罗伯塔。特别是,我们提出了一种称为MLM作为校正(MAC)的新掩蔽策略。为了展示这些模型的有效性,我们创建了一系列中国预先培训的语言模型,作为我们的基线,包括BERT,Roberta,Electra,RBT等。我们对十个中国NLP任务进行了广泛的实验,以评估创建的中国人托管语言模型以及提议的麦克白。实验结果表明,Macbert可以在许多NLP任务上实现最先进的表演,我们还通过几种可能有助于未来的研究的调查结果来消融细节。我们开源我们的预先培训的语言模型,以进一步促进我们的研究界。资源可用:https://github.com/ymcui/chinese-bert-wwm
translated by 谷歌翻译
用于预培训语言模型的自我监督学习的核心包括预训练任务设计以及适当的数据增强。语言模型中的大多数数据增强都是独立于上下文的。最近在电子中提出了一个开创性的增强,并通过引入辅助生成网络(发电机)来实现最先进的性能,以产生用于培训主要辨别网络(鉴别者)的上下文化数据增强。然而,这种设计引入了发电机的额外计算成本,并且需要调整发电机和鉴别器之间的相对能力。在本文中,我们提出了一种自增强策略(SAS),其中单个网络用于审视以后的时期的培训常规预训练和上下文化数据增强。基本上,该策略消除了单独的发电机,并使用单个网络共同执行具有MLM(屏蔽语言建模)和RTD(替换令牌检测)头的两个预训练任务。它避免了寻找适当大小的发电机的挑战,这对于在电子中证明的性能至关重要,以及其随后的变体模型至关重要。此外,SAS是一项常规策略,可以与最近或将来的许多新技术无缝地结合,例如杜伯塔省的解除关注机制。我们的实验表明,SAS能够在具有相似或更少的计算成本中优于胶水任务中的电磁和其他最先进的模型。
translated by 谷歌翻译
多语言语言模型(\ mllms),如mbert,xlm,xlm-r,\ textit {etc。}已成为一种可行的选择,使预先估计到大量语言的力量。鉴于他们的成功在零射击转移学习中,在(i)建立更大的\ mllms〜覆盖了大量语言(ii)创建覆盖更广泛的任务和语言来评估的详尽工作基准mllms〜(iii)分析单音零点,零拍摄交叉和双语任务(iv)对Monolingual的性能,了解\ mllms〜(v)增强(通常)学习的通用语言模式(如果有的话)有限的容量\ mllms〜以提高他们在已见甚至看不见语言的表现。在这项调查中,我们审查了现有的文学,涵盖了上述与\ MLLMS有关的广泛研究领域。根据我们的调查,我们建议您有一些未来的研究方向。
translated by 谷歌翻译
问题答案(QA)是自然语言处理中最具挑战性的最具挑战性的问题之一(NLP)。问答(QA)系统试图为给定问题产生答案。这些答案可以从非结构化或结构化文本生成。因此,QA被认为是可以用于评估文本了解系统的重要研究区域。大量的QA研究致力于英语语言,调查最先进的技术和实现最先进的结果。然而,由于阿拉伯QA中的研究努力和缺乏大型基准数据集,在阿拉伯语问答进展中的研究努力得到了很大速度的速度。最近许多预先接受的语言模型在许多阿拉伯语NLP问题中提供了高性能。在这项工作中,我们使用四个阅读理解数据集来评估阿拉伯QA的最先进的接种变压器模型,它是阿拉伯语 - 队,ArcD,AQAD和TYDIQA-GoldP数据集。我们微调并比较了Arabertv2基础模型,ArabertV0.2大型型号和ARAElectra模型的性能。在最后,我们提供了一个分析,了解和解释某些型号获得的低绩效结果。
translated by 谷歌翻译
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-ofthe-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. * Equal contribution. † Contact person.
translated by 谷歌翻译
多语言预训练的语言模型在跨语言任务上表现出了令人印象深刻的表现。它极大地促进了自然语言处理在低资源语言上的应用。但是,当前的多语言模型仍然有些语言表现不佳。在本文中,我们提出了Cino(中国少数族裔训练的语言模型),这是一种用于中国少数语言的多语言预训练的语言模型。它涵盖了标准的中文,Yue中文和其他六种少数民族语言。为了评估多语言模型在少数族裔语言上的跨语性能力,我们从Wikipedia和新闻网站收集文档,并构建两个文本分类数据集,WCM(Wiki-Chinese-Minority)和CMNEWS(中国最少的新闻)。我们表明,Cino在各种分类任务上的表现明显优于基准。Cino模型和数据集可在http://cino.hfl-rc.com上公开获得。
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather the contextualized information from unmasked tokens to restore the corrupted information. It raises the question of whether we can append [MASK]s at a later layer, to reduce the sequence length for earlier layers and make the pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) The gathering of contextualized information from unmasked tokens can be conducted with a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa-base and RoBERTa-large from scratch with only 78% and 68% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Span-BERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT large , our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. 1 * Equal contribution. 1 Our code and pre-trained models are available at https://github.com/facebookresearch/ SpanBERT.
translated by 谷歌翻译
随着预培训的语言模型变得更加要求资源,因此资源丰富的语言(例如英语和资源筛选)语言之间的不平等正在恶化。这可以归因于以下事实:每种语言中的可用培训数据量都遵循幂律分布,并且大多数语言都属于分布的长尾巴。一些研究领域试图缓解这个问题。例如,在跨语言转移学习和多语言培训中,目标是通过从资源丰富的语言中获得的知识使长尾语言受益。尽管成功,但现有工作主要集中于尝试尽可能多的语言。结果,有针对性的深入分析主要不存在。在这项研究中,我们专注于单一的低资源语言,并使用跨语性培训(XPT)进行广泛的评估和探测实验。为了使转移方案具有挑战性,我们选择韩语作为目标语言,因为它是一种孤立的语言,因此与英语几乎没有类型的分类。结果表明,XPT不仅优于表现或与单语模型相当,该模型训练有大小的数据,而且在传输过程中也很高。
translated by 谷歌翻译
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameterreduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT. * Work done as an intern at Google Research, driving data processing and downstream task evaluations.
translated by 谷歌翻译
变压器模型是置换等分之一的。要提供输入令牌的顺序和类型信息,通常将位置和段嵌入式添加到输入中。最近的作品提出了具有相对位置编码的位置编码的变化,实现了更好的性能。我们的分析表明,增益实际上来自从输入中将位置信息移动到注意层。由此激励,我们介绍了变压器(饮食)的解耦的位置注意,一个简单但有效的机制,将位置和分段信息编码为变压器模型。该方法具有更快的培训和推理时间,同时在胶水,Xtreme和WMT基准上实现竞争性能。我们进一步概括了我们的方法到远程变压器并显示性能增益。
translated by 谷歌翻译