With the boom of digital educational materials and scalable e-learning systems, the potential for realising AI-assisted personalised learning has skyrocketed. In this landscape, the automatic generation of educational questions will play a key role, enabling scalable self-assessment when a global population is manoeuvring their personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our initial experiments demonstrate that EduQG can produce superior educational questions by pre-training on scientific text.
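We introduce GODEL (Grounded Open Dialogue Language Model), a large pre-trained language model for dialog. In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. Experiments against an array of benchmarks that encompass task-oriented dialog, conversational QA, and grounded open-domain dialog show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups, in terms of both human and automatic evaluation. A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses (extrinsic evaluation) in addition to their communicative features (intrinsic evaluation). We show that extrinsic evaluation offers improved inter-annotator agreement and correlation with automated metrics. Code and data processing scripts are publicly available.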
The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large models can rewrite text humans have difficulty identifying as machine-paraphrased (53% mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.
Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents. State-of-the-art QA models are usually pre-trained on domain-general corpora like Wikipedia and thus tend to struggle on out-of-domain documents without fine-tuning. We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models, while still providing significant improvements to QA performance. We present two new tools for this task: A flexible pipeline for validating the synthetic QA data and training downstream models on it, and an online interface to facilitate human annotation of this generated data. Using this interface, crowdworkers labelled 1117 synthetic QA pairs, which we then used to fine-tune downstream models and improve domain-specific QA performance by 8.75 F1.
Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer module, in which we use pre-trained models from the existing literature, and therefore, our metric can be used without further training. We show that RQUGE has a higher correlation with human judgment without relying on the reference question. RQUGE is shown to be significantly more robust to several adversarial corruptions. Additionally, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on the synthetic data generated by a question generation model and re-ranked by RQUGE.
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper https://github.com/asahi417/lm-question-generation, which are also available as a demo https://autoqg.net/.
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
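The NND test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `NNDTest` record, the `run_nnd` function, and the `log_likelihood` callable are hypothetical names, and the rule that a tie counts as a failure is an assumption. In practice the scorer would wrap the NLG model under evaluation.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple, Tuple


class NNDTest(NamedTuple):
    """One test: a context, a high-quality candidate, and a near-negative
    candidate annotated with a known error type (hypothetical record layout)."""
    context: str
    good: str
    near_negative: str
    error_type: str


def run_nnd(
    tests: Iterable[NNDTest],
    log_likelihood: Callable[[str, str], float],
) -> Tuple[float, Counter]:
    """Return the pass rate and a count of failed tests per error type.

    A test passes when the model scores the good candidate strictly higher
    than the near-negative one (ties count as failures; this is an assumption).
    """
    passed = 0
    total = 0
    failures: Counter = Counter()
    for test in tests:
        total += 1
        good_score = log_likelihood(test.context, test.good)
        bad_score = log_likelihood(test.context, test.near_negative)
        if good_score > bad_score:
            passed += 1
        else:
            failures[test.error_type] += 1
    pass_rate = passed / total if total else 0.0
    return pass_rate, failures


# Toy scorer standing in for a real model: a fixed lookup table of scores.
TOY_SCORES = {
    "Who wrote Hamlet?": -1.0,
    "Who wrote Macbeth?": -3.0,
    "When was Hamlet written?": -2.5,
    "When Hamlet was written?": -2.0,
}


def toy_log_likelihood(context: str, candidate: str) -> float:
    return TOY_SCORES[candidate]


demo_tests = [
    NNDTest("ctx", "Who wrote Hamlet?", "Who wrote Macbeth?", "wrong_entity"),
    NNDTest("ctx", "When was Hamlet written?", "When Hamlet was written?", "disfluent"),
]
demo_rate, demo_failures = run_nnd(demo_tests, toy_log_likelihood)
```

With the toy scores, the first test passes and the second fails, so the failure counter records which error type the stand-in model struggles with. This mirrors the two outputs the abstract mentions: the number of tests passed and the distribution over task-specific errors.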
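The academic literature of the social sciences records human civilization and studies the problems of human society. With the large-scale growth of this literature, ways to quickly find existing research on relevant issues have become an urgent need for researchers. Previous studies, such as SciBERT, have shown that pre-training on domain-specific text can improve the performance of natural language processing tasks in those domains. However, there is no pre-trained language model for the social sciences, so this paper proposes pre-trained models built on a large number of abstracts published in Social Science Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/s-t-full-text-knowledge-mining/ssci-bert), show excellent performance on discipline classification and abstract structure-function recognition tasks with social sciences literature.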
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
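We address the product question generation task. For a given product description, our goal is to generate questions that reflect potential user information needs that are either missing from or not well covered by the description. Moreover, we wish to cover diverse user information needs that may span a multitude of product types. To this end, we first show how the T5 pre-trained Transformer encoder-decoder model can be fine-tuned for the task. Yet, while the questions generated by T5 are of reasonable quality compared to the state-of-the-art method for the task (KPCNet), many of them are still too general, resulting in sub-optimal global question diversity. As an alternative, we propose a novel learning-to-diversify (LTD) fine-tuning approach that enriches the language learned by the underlying Transformer model. Our empirical evaluation shows that our approach significantly improves the global diversity of the underlying Transformer model while preserving, as much as possible, its generation relevance.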
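Questions asked by humans during a conversation often contain contextual dependencies, i.e., explicit or implicit references to previous dialogue turns. These dependencies take the form of coreferences (e.g., via pronoun use) or ellipses, and can make the questions difficult for automated systems to understand. One way to facilitate the understanding and subsequent treatment of a question is to rewrite it into an out-of-context form, i.e., a form that can be understood without the conversational context. We propose CoQAR, a corpus containing 4.5K conversations from the conversational question answering dataset CoQA, for a total of 53K follow-up question-answer pairs. Each original question was manually annotated with at least two out-of-context rewritings. CoQAR can be used for the supervised learning of three tasks: question paraphrasing, question rewriting, and conversational question answering. To assess the quality of CoQAR's rewritings, we conduct several experiments consisting of training and evaluating models for these three tasks. Our results support the idea that question rewriting can be used as a preprocessing step for question answering models, thereby improving their performance.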
Incorporating external knowledge into the response generation process is essential to building more helpful and reliable dialog agents. However, collecting knowledge-grounded conversations is often costly, calling for a better pre-trained model for grounded dialog generation that generalizes well w.r.t. different types of knowledge. In this work, we propose KPT (Keyword-guided Pre-Training), a novel self-supervised pre-training method for grounded dialog generation without relying on extra knowledge annotation. Specifically, we use a pre-trained language model to extract the most uncertain tokens in the dialog as keywords. With these keywords, we construct two kinds of knowledge and pre-train a knowledge-grounded response generation model, aiming at handling two different scenarios: (1) the knowledge should be faithfully grounded; (2) it can be selectively used. For the former, the grounding knowledge consists of keywords extracted from the response. For the latter, the grounding knowledge is additionally augmented with keywords extracted from other utterances in the same dialog. Since the knowledge is extracted from the dialog itself, KPT can be easily performed on a large volume and variety of dialogue data. We considered three data sources (open-domain, task-oriented, conversational QA) with a total of 2.5M dialogues. We conduct extensive experiments on various few-shot knowledge-grounded generation tasks, including grounding on dialog acts, knowledge graphs, persona descriptions, and Wikipedia passages. Our comprehensive experiments and analyses demonstrate that KPT consistently outperforms state-of-the-art methods on these tasks with diverse grounding knowledge.
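The keyword-extraction step can be illustrated with a small sketch. It is hypothetical rather than the KPT implementation: the abstract only says that the "most uncertain tokens" are taken as keywords, so measuring uncertainty as surprisal (negative log-probability under the language model), the function name `extract_keywords`, and the cut-off `k` are all assumptions. The per-token probabilities are supplied by hand here; in practice they would come from a pre-trained language model.

```python
import math
from typing import List, Sequence


def extract_keywords(
    tokens: Sequence[str],
    token_probs: Sequence[float],
    k: int = 2,
) -> List[str]:
    """Return the k tokens with the highest surprisal, in their original order.

    Surprisal = -log p(token | context); using it as the uncertainty measure
    is an assumption made for this sketch.
    """
    if len(tokens) != len(token_probs):
        raise ValueError("tokens and token_probs must have the same length")
    surprisal = [-math.log(p) for p in token_probs]
    # Indices sorted from most to least surprising.
    ranked = sorted(range(len(tokens)), key=lambda i: surprisal[i], reverse=True)
    # Keep the top k, then restore the order in which they appear in the text.
    chosen = sorted(ranked[:k])
    return [tokens[i] for i in chosen]


# Made-up probabilities for a toy utterance (not produced by a real model).
demo_tokens = ["i", "love", "jazz", "in", "paris"]
demo_probs = [0.9, 0.4, 0.05, 0.8, 0.02]
demo_keywords = extract_keywords(demo_tokens, demo_probs, k=2)
```

Under these made-up probabilities the two least predictable tokens are "jazz" and "paris". Keywords of this kind, taken from the response alone or also from other utterances in the same dialog, would then serve as the grounding knowledge for the two pre-training scenarios the abstract describes.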
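Recent work has shown that either (1) increasing the input length or (2) increasing the model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and the model size at the same time. Specifically, we integrate attention ideas from long-input transformers (ETC) and adopt pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism but does not require additional side inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.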
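Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as is the case for tasks such as question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without needing as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT, and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.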
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves the state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that the paraphrase generation models trained on MultiPIT_Auto generate more diverse and high-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
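Language is the principal tool for human communication, and humor is one of its most attractive parts. Producing natural language with computers as humans do, a.k.a. natural language generation (NLG), has been widely used for dialogue systems, chatbots, and machine translation, as well as computer-aided creation such as idea generation and scriptwriting. However, the humor aspect of natural language is relatively under-investigated, especially in the age of pre-trained language models. In this work, we aim to preliminarily test whether NLG can generate humor as humans do. We build a new dataset consisting of numerous digitized Chinese Comical Crosstalk scripts (called C$^3$ for short), drawn from a Chinese performing art called "Xiangsheng" that has been popular since the 1800s. (For the convenience of non-Chinese speakers, we refer to "Xiangsheng" as "crosstalk" in this paper.) We benchmark various generation approaches, including Seq2seq models trained from scratch, fine-tuned middle-scale PLMs, and large-scale PLMs (with and without fine-tuning). Moreover, we also conduct a human assessment, showing that 1) large-scale pre-training largely improves crosstalk generation quality; and 2) even the scripts generated by the best PLM are far from what we expect, reaching only 65% of the quality of human-created crosstalk. We conclude that humor generation can be largely improved using large-scale PLMs, but it is still in its infancy. The data and benchmarking code are publicly available at https://github.com/anonno2/crosstalk-generation.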
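Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer among common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency during pre-training.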
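Knowledge base question answering (KBQA) aims to answer natural language questions with the help of an external knowledge base. The core idea is to find the link between the knowledge implicit in a question and the known triples of the knowledge base. The KBQA task pipeline contains several steps, including entity recognition, relation extraction, and entity linking. This kind of pipeline approach means that errors in any step will inevitably propagate to the final prediction. To solve this problem, this paper proposes a Corpus Generation-Retrieve Method (CGRM) that uses a pre-trained language model (PLM) and a knowledge graph (KG). First, based on the mT5 model, we design two new pre-training tasks, knowledge-masked language modeling and paragraph-based question generation, to obtain a knowledge-enhanced T5 (kT5) model. Second, after preprocessing the triples of the knowledge graph with a series of heuristic rules, the kT5 model generates natural language QA pairs from the processed triples. Finally, we solve the QA task directly by retrieving from the synthetic dataset. We test our method on the NLPCC-ICCPOL 2016 KBQA dataset, and the results show that our framework improves KBQA performance and that our straightforward method is competitive with the state of the art.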
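Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast of the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pre-training on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for an end task remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study of fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pre-training settings, especially in low-resource domains. Large models have the potential to attain better performance, but increasing model size also exacerbates fine-tuning instability. We therefore conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks such as BIOSSES, reinitializing the top layer is the optimal strategy. Overall, domain-specific vocabulary and pre-training facilitate more robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications. Availability and implementation: To facilitate progress in biomedical NLP, we release our state-of-the-art pre-trained and fine-tuned models: https://aka.ms/blurb.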