Smart Paper Notes

JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding

Wayne Xin Zhao , Kun Zhou , Zheng Gong , Beichen Zhang , Yuanhang Zhou , Jing Sha , Zhigang Chen , Shijin Wang , Cong Liu , Ji-Rong Wen

Categories: Natural Language Processing | Artificial Intelligence

2022-06-13

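This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM) for effectively understanding and representing mathematical problems. Unlike other standard NLP tasks, mathematical texts are difficult to understand, since their problem statements involve mathematical terminology, symbols, and formulas. Solving mathematical problems typically requires complex mathematical logic and background knowledge. Considering the complex nature of mathematical texts, we design a novel curriculum pre-training approach, consisting of both basic and advanced courses, to improve the learning of mathematical PLMs. Specifically, we first perform token-level pre-training based on a position-biased masking strategy, and then design logic-based pre-training tasks that aim to recover shuffled sentences and shuffled formulas, respectively. Finally, we introduce a more difficult pre-training task that forces the PLM to detect and correct the errors in its generated solutions. We conduct extensive experiments on offline evaluation (including nine math-related tasks) and an online A/B test. Experimental results demonstrate the effectiveness of our approach compared with a number of competitive baselines. Our code is available at https://github.com/rucaibox/jiuzhang.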

Translated by Google Translate

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

Yunfan Shao , Zhichao Geng , Yitao Liu , Junqi Dai , Hang Yan , Fei Yang , Li Zhe , Hujun Bao , Xipeng Qiu

Categories: Natural Language Processing

2021-09-13

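In this paper, we take advantage of previous pre-trained models (PTMs) and propose a novel Chinese Pre-trained Unbalanced Transformer (CPT). Unlike previous Chinese PTMs, CPT is designed to exploit the knowledge shared between natural language understanding (NLU) and natural language generation (NLG) to boost performance. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. The two task-specific decoders, which share one encoder, are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With this partially shared architecture and multi-task pre-training, CPT can (1) learn knowledge specific to NLU or NLG tasks with its two decoders and (2) be fine-tuned flexibly, so that the potential of the model is fully exploited. Moreover, the unbalanced Transformer saves computational and storage cost, which makes CPT competitive and greatly accelerates inference for text generation. Experimental results on a wide range of Chinese NLU and NLG tasks show the effectiveness of CPT.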

MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers

Kun Zhou , Xiao Liu , Yeyun Gong , Wayne Xin Zhao , Daxin Jiang , Nan Duan , Ji-Rong Wen

Categories: Natural Language Processing

2022-12-15

Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity measurement, and has shown promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g., BERT) for parameter initialization, some work focuses on proposing new pre-training tasks that compress the useful semantic information from passages into dense vectors, achieving remarkable performance. However, it is still challenging to effectively capture the rich semantic information about passages, and the relations among them, in dense vectors via a single pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER uses a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passage recovery, related passage recovery, and PLM output recovery. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and the relationships among them within the pre-training corpus. The third captures knowledge beyond the corpus from external PLMs (e.g., GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms the previous state-of-the-art dense retrieval methods. Our code and data are publicly released at https://github.com/microsoft/SimXNS

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Shuohuan Wang , Yu Sun , Yang Xiang , Zhihua Wu , Siyu Ding , Weibao Gong , Shikun Feng , Junyuan Shang , Yanbin Zhao , Chao Pang

Categories: Natural Language Processing

2021-12-23

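Pre-trained language models have achieved state-of-the-art results in various natural language processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and a model with 10 billion parameters was trained with it. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. To explore the effect of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan, with up to 260 billion parameters, on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss so that ERNIE 3.0 Titan generates credible and controllable text. To reduce computation overhead and carbon emissions, we propose an online distillation framework for ERNIE 3.0 Titan, in which the teacher model teaches the students and trains itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.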

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin , Ming-Wei Chang , Kenton Lee , Kristina Toutanova

Categories:

2018-10-11

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 percentage point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui , Wanxiang Che , Ting Liu , Bing Qin , Ziqing Yang

Categories: Natural Language Processing | Machine Learning

2019-06-19

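Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and successive variants have been proposed to further improve the performance of pre-trained language models. In this paper, we first introduce the whole word masking (WWM) strategy for Chinese BERT, along with a series of Chinese pre-trained language models. We then propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways. In particular, we propose a new masking strategy called MLM as correction (Mac). To demonstrate the effectiveness of these models, we create a series of Chinese pre-trained language models as our baselines, including BERT, RoBERTa, ELECTRA, RBT, etc. We carry out extensive experiments on ten Chinese NLP tasks to evaluate the created Chinese pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT achieves state-of-the-art performance on many NLP tasks, and we also ablate details, reporting several findings that may help future research. We open-source our pre-trained language models to further support the research community. Resources are available at https://github.com/ymcui/chinese-bert-wwm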

MLRIP: Pre-training a military language representation model with informative factual knowledge and professional knowledge base

Hui Li , Xuekang Yang , Xin Zhao , Lin Yu , Jiping Zheng , Wei Sun

Categories: Natural Language Processing

2022-07-28

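Incorporating prior knowledge into pre-trained language models has proven effective for knowledge-driven NLP tasks, such as entity typing and relation extraction. Current pre-training procedures usually inject external knowledge into models through knowledge masking, knowledge fusion, and knowledge replacement. However, the factual information contained in the input sentences has not been fully mined, and the external knowledge to be injected has not been strictly checked. As a result, context information cannot be fully exploited, and either extra noise is introduced or the amount of injected knowledge is limited. To address these issues, we propose MLRIP, which modifies the knowledge masking strategies proposed by ERNIE-Baidu and introduces a two-stage entity replacement strategy. Extensive experiments with comprehensive analyses illustrate the superiority of MLRIP over BERT-based models in military knowledge-driven NLP tasks.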

CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations

Borun Chen , Hongyin Tang , Jingang Wang , Qifan Wang , Hai-Tao Zheng , Wei Wu , Liqian Yu

Categories: Natural Language Processing | Artificial Intelligence

2022-08-23

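Pre-trained language models (PLMs) have achieved remarkable performance gains across many downstream tasks in natural language understanding. Various Chinese PLMs have been proposed to learn better Chinese language representations. However, most current models use Chinese characters as inputs and cannot encode the semantic information contained in Chinese words. Although recent pre-trained models incorporate both words and characters, they usually suffer from insufficient semantic interaction and fail to capture the semantic relation between words and characters. To address these issues, we propose CLOWER, a simple yet effective PLM that adopts contrastive learning over word and character representations. In particular, CLOWER implicitly encodes coarse-grained information (i.e., words) into fine-grained representations (i.e., characters) through contrastive learning on multi-grained information. CLOWER is of great value in realistic scenarios, since it can be easily incorporated into any existing fine-grained PLM without modifying the production pipeline. Extensive experiments on a range of downstream tasks demonstrate the superior performance of CLOWER over several state-of-the-art baselines.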

FormLM: Recommending Creation Ideas for Online Forms by Modelling Semantic and Structural Information

Yijia Shao , Mengyu Zhou , Yifan Zhong , Tao Wu , Hongwei Han , Shi Han , Gideon Huang , Dongmei Zhang

Categories: Natural Language Processing

2022-11-10

Online forms are widely used to collect data from people and represent a multi-billion market. Many software products provide online services for creating semi-structured forms in which questions and descriptions are organized by pre-defined structures. However, designing and creating forms is still tedious and requires expert knowledge. To assist form designers, in this work we present FormLM to model online forms (by enhancing a pre-trained language model with form structural information) and to recommend form creation ideas (including question and option recommendations and block type suggestion). For model training and evaluation, we collect the first public online form dataset, with 62K online forms. Experimental results show that FormLM significantly outperforms general-purpose language models on all tasks, with an improvement of 4.71 on Question Recommendation and 10.6 on Block Type Suggestion in terms of ROUGE-1 and Macro-F1, respectively.

Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?

Jiawen Wu , Xinyu Zhang , Yutao Zhu , Zheng Liu , Zikai Guo , Zhaoye Fei , Ruofei Lai , Yongkang Wu , Zhao Cao , Zhicheng Dou

Categories: Artificial Intelligence | Natural Language Processing

2022-09-14

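Recent years have witnessed great progress in applying pre-trained language models (e.g., BERT) to information retrieval (IR) tasks. Hyperlinks, which are commonly used in web pages, have been leveraged for designing pre-training objectives. For example, the anchor texts of hyperlinks have been used to simulate queries, thereby constructing a huge number of query-document pairs for pre-training. However, as a bridge between two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents connected by hyperlinks and on designing a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its ability to capture matching signals. We propose a Progressive Hyperlink Prediction (PHP) framework to explore the use of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.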
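
The four link groups and the "adjacent group" comparison described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the edge-set representation, and in particular the `most_relevant` argument are hypothetical, because the abstract does not say how the "most relevant" symmetric links are selected.

```python
from enum import IntEnum


class LinkGroup(IntEnum):
    """The four relationship groups named in the abstract, ordered weakest to strongest."""
    NO_LINK = 0
    UNIDIRECTIONAL = 1
    SYMMETRIC = 2
    MOST_RELEVANT_SYMMETRIC = 3


def link_group(links, a, b, most_relevant=frozenset()):
    """Classify the relationship between documents a and b.

    links: set of (source, target) hyperlink pairs.
    most_relevant: hypothetical set of frozenset({a, b}) pairs flagged by some
    external relevance signal (an assumption; not specified in the abstract).
    """
    forward = (a, b) in links
    backward = (b, a) in links
    if forward and backward:
        if frozenset((a, b)) in most_relevant:
            return LinkGroup.MOST_RELEVANT_SYMMETRIC
        return LinkGroup.SYMMETRIC
    if forward or backward:
        return LinkGroup.UNIDIRECTIONAL
    return LinkGroup.NO_LINK


def adjacent_pairs(anchor, candidates, links, most_relevant=frozenset()):
    """Return (stronger, weaker) candidate pairs whose groups differ by exactly one level."""
    grouped = [(link_group(links, anchor, c, most_relevant), c) for c in candidates]
    return [
        (strong, weak)
        for g_strong, strong in grouped
        for g_weak, weak in grouped
        if g_strong - g_weak == 1
    ]
```

Each returned pair could serve as one pairwise training example in which the first document should be scored above the second with respect to the anchor document; how the pairs are actually sampled and scored is left open by the abstract.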

A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models

Hanqing Zhang , Haolin Song , Shaoyu Li , Ming Zhou , Dawei Song

Categories: Natural Language Processing

2022-01-14

Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints of practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the last 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review of the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.

MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

Jia Tracy Shen , Michiharu Yamashita , Ethan Prihar , Neil Heffernan , Xintao Wu , Ben Graff , Dongwon Lee

Categories: Natural Language Processing | Artificial Intelligence

2021-06-02

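Since the introduction of the original BERT (i.e., BASE BERT), researchers have developed various customized BERT models that improve performance for specific domains and tasks by exploiting the benefits of transfer learning. Because mathematical texts often use domain-specific vocabulary along with equations and math symbols, we posit that a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this resource paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-K), to high-school, to college graduate level mathematical content. In addition, we select three general NLP tasks that are often used in mathematics education, namely knowledge component prediction, auto-grading of open-ended Q&A, and knowledge tracing, to demonstrate the superiority of MathBERT over BASE BERT. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics-specific vocabulary, 'mathVocab', to train with MathBERT. We find that MathBERT pre-trained with 'mathVocab' outperforms MathBERT trained with the BASE BERT vocabulary (i.e., 'origVocab'). MathBERT is currently being adopted at the participating learning platforms: Stride, Inc, a commercial educational resource provider, and ASSISTments.org, a free online educational platform. We release MathBERT for public use at https://github.com/tbs17/mathbert.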

SpanBERT: Improving Pre-training by Representing and Predicting Spans

Mandar Joshi , Danqi Chen , Yinhan Liu , Daniel S. Weld , Luke Zettlemoyer , Omer Levy

Categories:

2019-07-24

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
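
Point (1) of this abstract, masking contiguous spans instead of independent tokens, can be sketched as follows. This is a simplified illustration, not the released SpanBERT code: the 15% budget, the uniform span-length distribution, the `max_span` cap, and the function name are all assumptions made for the example.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, max_span=3, seed=0, mask_token="[MASK]"):
    """Mask contiguous spans until roughly mask_ratio of the tokens are masked.

    Span lengths are drawn uniformly from 1..max_span here; this is a
    simplification chosen for the sketch, not a detail given in the abstract.
    Returns the masked token list and the sorted masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    budget = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    masked = set()
    while len(masked) < budget:
        # Never exceed the remaining budget with a single span.
        length = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # overlaps an earlier span; resample
        masked.update(span)
    out = [mask_token if i in masked else tok for i, tok in enumerate(tokens)]
    return out, sorted(masked)
```

Calling `mask_spans(sentence.split())` returns the corrupted sequence together with the positions a model would be asked to reconstruct. The span boundary objective in point (2) is not covered by this sketch.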

Unified language model pre-training for natural language understanding and generation

Categories:

This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
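
The idea of using self-attention masks to switch one shared network between the three objectives can be sketched with plain boolean matrices. This is an illustrative reading of the abstract, not the released UNILM code; the function name and the list-of-lists representation are assumptions, and only the left-to-right variant of the unidirectional objective is shown.

```python
def attention_mask(mode, src_len, tgt_len=0):
    """Return mask[i][j] == True when position i may attend to position j.

    mode: "bidirectional", "unidirectional" (left-to-right), or "seq2seq".
    For "seq2seq", the first src_len positions are the source segment and the
    remaining tgt_len positions are the target segment.
    """
    if mode not in {"bidirectional", "unidirectional", "seq2seq"}:
        raise ValueError(f"unknown mode: {mode}")
    n = src_len + tgt_len
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if mode == "bidirectional":
                allowed = True  # every token sees the full context
            elif mode == "unidirectional":
                allowed = j <= i  # only itself and tokens to its left
            else:
                # Source tokens are visible to everyone; a target token is
                # visible only to itself and to later target tokens.
                allowed = j < src_len or (i >= src_len and j <= i)
            row.append(allowed)
        mask.append(row)
    return mask
```

With `attention_mask("seq2seq", 2, 2)`, the two source rows can see only the source columns, while each target row can additionally see the target tokens up to and including itself, so a single set of Transformer weights can be trained under all three objectives by swapping the mask.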

LERT: A Linguistically-motivated Pre-trained Language Model

Yiming Cui , Wanxiang Che , Shijin Wang , Ting Liu

Categories: Natural Language Processing | Machine Learning

2022-11-10

The pre-trained language model (PLM) has become a representative foundation model in the natural language processing field. Most PLMs are trained with linguistic-agnostic pre-training tasks on the surface form of the text, such as the masked language model (MLM). To further empower PLMs with richer linguistic features, in this paper we propose a simple but effective way to learn linguistic features for pre-trained language models. We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original MLM pre-training task, using a linguistically-informed pre-training (LIP) strategy. We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT brings significant improvements over various comparable baselines. Furthermore, we also conduct analytical experiments on various linguistic aspects, and the results show that the design of LERT is valid and effective. Resources are available at https://github.com/ymcui/LERT

A Survey of Natural Language Generation

Chenhe Dong , Yinghui Li , Haifan Gong , Miaoxin Chen , Junxin Li , Ying Shen , Min Yang

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2021-12-22

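This paper offers a comprehensive review of research on natural language generation (NLG) over the past two decades, especially deep learning methods for data-to-text and text-to-text generation, as well as new applications of NLG technology. The survey aims to (a) provide an up-to-date synthesis of work on the core NLG tasks, as well as the architectures adopted in the field; (b) describe various NLG tasks and datasets in detail and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; and (c) highlight some future emphases and relatively recent research issues that arise from the increasing synergy between NLG and other areas of artificial intelligence, such as computer vision, text, and computational creativity.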

A Survey on Knowledge-Enhanced Pre-trained Language Models

Chaoqi Zhen , Yanlei Shang , Xiangyu Liu , Yifei Li , Yong Chen , Dell Zhang

Categories: Natural Language Processing

2022-12-27

Natural Language Processing (NLP) has been revolutionized by the use of Pre-trained Language Models (PLMs) such as BERT. Despite setting new records in nearly every NLP task, PLMs still face a number of challenges including poor interpretability, weak reasoning capability, and the need for a lot of expensive annotated data when applied to downstream tasks. By integrating external knowledge into PLMs, Knowledge-Enhanced Pre-trained Language Models (KEPLMs) have the potential to overcome the above-mentioned limitations. In this paper, we examine KEPLMs systematically through a series of studies. Specifically, we outline the common types and different formats of knowledge to be integrated into KEPLMs, detail the existing methods for building and evaluating KEPLMs, present the applications of KEPLMs in downstream tasks, and discuss the future research directions. Researchers will benefit from this survey by gaining a quick and comprehensive overview of the latest developments in this field.

Learning Rich Representation of Keyphrases from Text

Mayank Kulkarni , Debanjan Mahata , Ravneet Arora , Rajarshi Bhowmik

Categories: Natural Language Processing | Machine Learning

2021-12-16

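In this work, we explore how to train task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in both discriminative and generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which shows large gains in performance (up to 9.26 points in F1) over the state of the art (SOTA) when an LM pre-trained with KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format, instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, and achieve performance comparable to SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.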

ERNIE: Enhanced Language Representation with Informative Entities

Zhengyan Zhang , Xu Han , Zhiyuan Liu , Xin Jiang , Maosong Sun , Qun Liu

Categories:

2019-05-17

Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/ERNIE.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis , Yinhan Liu , Naman Goyal , Marjan Ghazvininejad , Abdelrahman Mohamed , Omer Levy , Ves Stoyanov , Luke Zettlemoyer

Categories:

2019-10-29

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
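
The two noising operations this abstract singles out, sentence shuffling and in-filling with a single mask token per span, can be sketched as below. This is a minimal illustration rather than the BART implementation: the function names, the `<mask>` string, and the choice to pass spans in explicitly (instead of sampling them) are assumptions made to keep the example deterministic.

```python
import random


def shuffle_sentences(sentences, seed=0):
    """Return the sentences in a random order; the model must restore the original order."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill(tokens, spans, mask_token="<mask>"):
    """Replace each half-open (start, end) span with ONE mask token.

    spans must be sorted and non-overlapping. Because a span of any length
    collapses to a single token, the model also has to infer how many tokens
    are missing. A zero-length span (start == end) inserts a mask token
    without removing anything.
    """
    out, cursor = [], 0
    for start, end in spans:
        if start < cursor or end < start or end > len(tokens):
            raise ValueError(f"invalid span: {(start, end)}")
        out.extend(tokens[cursor:start])
        out.append(mask_token)
        cursor = end
    out.extend(tokens[cursor:])
    return out
```

For instance, `infill(["a", "b", "c", "d", "e"], [(1, 3)])` yields `["a", "<mask>", "d", "e"]`: two tokens are replaced by one mask, so the corrupted input is shorter than the text the decoder has to reconstruct.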
