在本文中,我们利用了以前的预训练模型(PTM)的优势,并提出了一种新型的中国预训练的不平衡变压器(CPT)。与以前的中国PTM不同,CPT旨在利用自然语言理解(NLU)和自然语言生成(NLG)之间的共同知识来促进表现。 CPT包括三个部分:共享编码器,一个理解解码器和一代解码器。具有共享编码器的两个特定解码器分别通过蒙版语言建模(MLM)进行了预训练,并分别将自动编码(DAE)任务进行了验证。借助部分共享的体系结构和多任务预培训,CPT可以(1)使用两个解码器学习NLU或NLG任务的特定知识,并且(2)对模型的潜力充分利用了微调。此外,不平衡的变压器节省了计算和存储成本,这使CPT竞争激烈,并极大地加速了文本生成的推断。对各种中国NLU和NLG任务的实验结果显示了CPT的有效性。
translated by 谷歌翻译
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
本文旨在通过介绍第一个中国数学预训练的语言模型〜(PLM)来提高机器的数学智能,以有效理解和表示数学问题。与其他标准NLP任务不同,数学文本很难理解,因为它们在问题陈述中涉及数学术语,符号和公式。通常,它需要复杂的数学逻辑和背景知识来解决数学问题。考虑到数学文本的复杂性质,我们设计了一种新的课程预培训方法,用于改善由基本和高级课程组成的数学PLM的学习。特别是,我们首先根据位置偏见的掩盖策略执行令牌级预训练,然后设计基于逻辑的预训练任务,旨在分别恢复改组的句子和公式。最后,我们介绍了一项更加困难的预训练任务,该任务强制执行PLM以检测和纠正其生成的解决方案中的错误。我们对离线评估(包括九个与数学相关的任务)和在线$ A/B $测试进行了广泛的实验。实验结果证明了与许多竞争基线相比,我们的方法的有效性。我们的代码可在:\ textColor {blue} {\ url {https://github.com/rucaibox/jiuzhang}}}中获得。
translated by 谷歌翻译
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-ofthe-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. * Equal contribution. † Contact person.
translated by 谷歌翻译
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
来自变压器(BERT)的双向编码器表示显示了各种NLP任务的奇妙改进,并且已经提出了其连续的变体来进一步提高预先训练的语言模型的性能。在本文中,我们的目标是首先介绍中国伯特的全文掩蔽(WWM)策略,以及一系列中国预培训的语言模型。然后我们还提出了一种简单但有效的型号,称为Macbert,这在几种方面提高了罗伯塔。特别是,我们提出了一种称为MLM作为校正(MAC)的新掩蔽策略。为了展示这些模型的有效性,我们创建了一系列中国预先培训的语言模型,作为我们的基线,包括BERT,Roberta,Electra,RBT等。我们对十个中国NLP任务进行了广泛的实验,以评估创建的中国人托管语言模型以及提议的麦克白。实验结果表明,Macbert可以在许多NLP任务上实现最先进的表演,我们还通过几种可能有助于未来的研究的调查结果来消融细节。我们开源我们的预先培训的语言模型,以进一步促进我们的研究界。资源可用:https://github.com/ymcui/chinese-bert-wwm
translated by 谷歌翻译
近年来,基于变压器的预训练模型已获得了很大的进步,成为自然语言处理中最重要的骨干之一。最近的工作表明,变压器内部的注意力机制可能不需要,卷积神经网络和基于多层感知器的模型也已被研究为变压器替代方案。在本文中,我们考虑了一个用于语言模型预训练的图形循环网络,该网络通过本地令牌级通信为每个序列构建一个图形结构,以及与其他代币解耦的句子级表示。原始模型在受监督培训下的特定领域特定文本分类中表现良好,但是,其通过自我监督的方式学习转移知识的潜力尚未得到充分利用。我们通过优化体系结构并验证其在更通用的语言理解任务(英语和中文)中的有效性来填补这一空白。至于模型效率,我们的模型在基于变压器的模型中而不是二次复杂性,而是具有线性复杂性,并且在推断过程中的性能更有效。此外,我们发现与现有基于注意力的模型相比,我们的模型可以生成更多样化的输出,而背景化的功能冗余性较小。
translated by 谷歌翻译
预先接受的语言模型实现了最先进的导致各种自然语言处理(NLP)任务。 GPT-3表明,缩放预先训练的语言模型可以进一步利用它们的巨大潜力。最近提出了一个名为Ernie 3.0的统一框架,以预先培训大型知识增强型号,并培训了具有10亿参数的模型。 Ernie 3.0在各种NLP任务上表现出最先进的模型。为了探讨缩放的表现,我们培养了百卢比的3.0泰坦参数型号,在PaddlePaddle平台上有高达260亿参数的泰坦。此外,我们设计了一种自我监督的对抗性损失和可控语言建模损失,以使ERNIE 3.0 TITAN产生可信和可控的文本。为了减少计算开销和碳排放,我们向Ernie 3.0泰坦提出了一个在线蒸馏框架,教师模型将同时教授学生和培训。埃塞尼3.0泰坦是迄今为止最大的中国密集预训练模型。经验结果表明,Ernie 3.0泰坦在68个NLP数据集中优于最先进的模型。
translated by 谷歌翻译
Sequence-to-sequence (seq2seq) learning is a popular fashion for large-scale pretraining language models. However, the prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder takes an important but under-exploitation role than the decoder regarding the downstream performance and neuron activation. Therefore, we propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves the seq2seq models via integrating more efficient self-supervised information into the encoders. Specifically, E2S2 adopts two self-supervised objectives on the encoder side from two aspects: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish the noise tokens and capture high-level (i.e. syntactic and semantic) knowledge, thus strengthening the ability of seq2seq model to accurately achieve the conditional generation. On a large diversity of downstream natural language understanding and generation tasks, E2S2 dominantly improves the performance of its powerful backbone models, e.g. BART and T5. For example, upon BART backbone, we achieve +1.1% averaged gain on the general language understanding evaluation (GLUE) benchmark and +1.75% F_0.5 score improvement on CoNLL2014 dataset. We also provide in-depth analyses to show the improvement stems from better linguistic representation. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
translated by 谷歌翻译
Dense retrieval aims to map queries and passages into low-dimensional vector space for efficient similarity measuring, showing promising effectiveness in various large-scale retrieval tasks. Since most existing methods commonly adopt pre-trained Transformers (e.g. BERT) for parameter initialization, some work focuses on proposing new pre-training tasks for compressing the useful semantic information from passages into dense vectors, achieving remarkable performances. However, it is still challenging to effectively capture the rich semantic information and relations about passages into the dense vectors via one single particular pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passage recovering and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and relationships among them within the pre-training corpus. The third one can capture the knowledge beyond the corpus from external PLMs (e.g. GPT-2). Extensive experiments on several large-scale passage retrieval datasets have shown that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly released in https://github.com/microsoft/SimXNS
translated by 谷歌翻译
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. Span-BERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT large , our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. 1 * Equal contribution. 1 Our code and pre-trained models are available at https://github.com/facebookresearch/ SpanBERT.
translated by 谷歌翻译
预训练的语言模型(PLM)在自然语言理解中的许多下游任务中取得了显着的性能增长。已提出了各种中文PLM,以学习更好的中文表示。但是,大多数当前模型都使用中文字符作为输入,并且无法编码中文单词中包含的语义信息。虽然最近的预训练模型同时融合了单词和字符,但它们通常会遭受不足的语义互动,并且无法捕获单词和字符之间的语义关系。为了解决上述问题,我们提出了一个简单而有效的PLM小扣手,该小扣子采用了对单词和性格表示的对比度学习。特别是,Clower通过对多透明信息的对比学习将粗粒的信息(即单词)隐式编码为细粒度表示(即字符)。在现实的情况下,小电动器具有很大的价值,因为它可以轻松地将其纳入任何现有的基于细粒的PLM中而无需修改生产管道。在一系列下游任务上进行的扩展实验表明,小动物的卓越性能超过了几个最先进的实验 - 艺术基线。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, there lacks a systematic study of how the two types of approaches compare and how different design choices can affect the performance of PLM-based models. To fill in this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices including in-domain PLMs, PLMs with different pre-training objectives, using PLMs with a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers in the encoder leads to better encoder-decoder models; and (3) introducing four in-domain PLMs, we achieve a competitive performance in the news domain and the state-of-the-art performance in the scientific domain.
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
在这项工作中,我们证明了多种语的大规模序列到序列(SEQ2SEQ)模型,该模型是通过Denoising和因果语言建模(CLM)任务的混合物进行训练的,比仅解码器模型更有效地进行了效率的学习者在各种任务上。特别是,我们培训了一个名为Alexa教师模型(Alexatm 20b)的200亿个参数多语言SEQ2SEQ模型,并表明它在1-Shot摘要任务上实现了最先进的(SOTA)性能,超过了更大的540B PALM DOPODER模型。 Alexatm 20b还可以在1-Shot Machine翻译中实现SOTA,尤其是对于低资源语言,几乎所有语言对(阿拉伯语,英语,法语,德语,德语,印地语,意大利语,日语,以及flores-101数据集上的泰卢固语)。我们还显示了零拍设置,AlexATM 20B在SuperGlue和SqueadV2数据集上的表现优于GPT3(175B),并在XNLI,XCOPA,PAWS-X和XWINOGRAD等多语言任务上提供SOTA性能。总体而言,我们的结果为SEQ2SEQ模型提供了一个令人信服的案例,作为大型语言模型(LLM)培训的仅解码器模型的强大替代方法。
translated by 谷歌翻译
如今,基础模型已成为人工智能中的基本基础设施之一,铺平了通往通用情报的方式。但是,现实提出了两个紧急挑战:现有的基础模型由英语社区主导;用户通常会获得有限的资源,因此不能总是使用基础模型。为了支持中文社区的发展,我们介绍了一个名为Fengshenbang的开源项目,该项目由认知计算与自然语言研究中心(CCNL)领导。我们的项目具有全面的功能,包括大型预培训模型,用户友好的API,基准,数据集等。我们将所有这些都包装在三个子项目中:风水次模型,风水框架和狂热基准。 Fengshenbang的开源路线图旨在重新评估中国预培训的大型大型模型的开源社区,促使整个中国大型模型社区的发展。我们还希望构建一个以用户为中心的开源生态系统,以允许个人访问所需的模型以匹配其计算资源。此外,我们邀请公司,大学和研究机构与我们合作建立大型开源模型的生态系统。我们希望这个项目将成为中国认知情报的基础。
translated by 谷歌翻译
在这项工作中,我们探索如何学习专用的语言模型,旨在学习从文本文件中学习关键词的丰富表示。我们在判别和生成设置中进行预训练变压器语言模型(LMS)的不同掩蔽策略。在歧视性设定中,我们引入了一种新的预训练目标 - 关键边界,用替换(kbir)infifiling,在使用Kbir预先训练的LM进行微调时显示出在Sota上的性能(F1中高达9.26点)的大量增益关键酶提取的任务。在生成设置中,我们为BART - 键盘介绍了一个新的预训练设置,可再现与CATSeq格式中的输入文本相关的关键字,而不是Denoised原始输入。这也导致在关键词中的性能(F1 @ M)中的性能(高达4.33点),用于关键正版生成。此外,我们还微调了在命名实体识别(ner),问题应答(qa),关系提取(重新),抽象摘要和达到与SOTA的可比性表现的预训练的语言模型,表明学习丰富的代表关键词确实有利于许多其他基本的NLP任务。
translated by 谷歌翻译
Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
translated by 谷歌翻译