This paper focuses on text data augmentation for few-shot NLP tasks. Existing data augmentation algorithms either use a small training set to generate new synthetic data, leverage task-independent heuristic rules (e.g., synonym replacement), or fine-tune general-purpose pre-trained language models (e.g., GPT-2). Consequently, these methods carry little task-specific knowledge and are limited to producing low-quality synthetic data for weak baselines on simple tasks. To combat this issue, we propose the Knowledge Mixture Data Augmentation model (KnowDA): an encoder-decoder LM pre-trained on a mixture of diverse NLP tasks using Knowledge Mixture Training (KoMT). KoMT is a training procedure that reframes input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs objectives of different granularities to learn to generate partial or complete samples. With the aid of KoMT, KnowDA can implicitly combine the required task-specific knowledge from the learned mixture of tasks and quickly grasp the inherent synthesis law of the target task from a few given instances. To the best of our knowledge, we are the first attempt to scale the number of tasks in multi-task co-training for data augmentation. Extensive experiments show that i) KnowDA successfully improves the performance of ALBERT and DeBERTa by a large margin on few-shot benchmarks, outperforming previous state-of-the-art data augmentation baselines; and ii) KnowDA can also improve model performance on few-shot tasks whose task types were not included in KoMT.
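Below is a minimal Python sketch of the kind of unified text-to-text reformulation KoMT describes: a labeled example from an arbitrary task is serialized to a flat string, and a random span is masked so a seq2seq model can learn to generate partial or complete samples. The field layout, separator, and masking scheme are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch of reframing heterogeneous task examples into a unified
# text-to-text format, in the spirit of KoMT. Field names and the masking
# scheme here are illustrative assumptions, not the paper's exact recipe.
import random

def to_text_to_text(task_name: str, fields: dict) -> str:
    """Serialize an arbitrary labeled example as one flat string."""
    parts = [f"Task: {task_name}"] + [f"{k}: {v}" for k, v in fields.items()]
    return " | ".join(parts)

def make_augmentation_example(serialized: str, mask_ratio: float = 0.3):
    """Mask a random span so the model learns to generate partial samples."""
    tokens = serialized.split()
    span = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(len(tokens) - span + 1)
    source = tokens[:start] + ["<mask>"] + tokens[start + span:]
    target = tokens[start:start + span]
    return " ".join(source), " ".join(target)

example = to_text_to_text("sst2", {"sentence": "a gorgeous, witty film",
                                   "label": "positive"})
print(make_augmentation_example(example))
```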
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but it is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All prompts and trained models are available at https://github.com/bigscience-workshop/promptsource and https://huggingface.co/bigscience/t0pp.
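As a concrete illustration of mapping one supervised example into several human-readable prompted forms, here is a small Python sketch; the NLI templates and verbalizers are assumptions in the spirit of promptsource, not the released templates.

```python
# A minimal sketch of turning one supervised example into several differently
# worded prompts, as in multitask prompted training. The templates below are
# illustrative assumptions, not the exact promptsource templates.
NLI_TEMPLATES = [
    ("Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail "
     "the hypothesis?", {0: "yes", 1: "maybe", 2: "no"}),
    ("{premise} Based on the previous passage, is it true that "
     "\"{hypothesis}\"?", {0: "yes", 1: "maybe", 2: "no"}),
]

def apply_templates(example: dict):
    """Render one (input, target) text pair per template."""
    for template, verbalizer in NLI_TEMPLATES:
        yield template.format(**example), verbalizer[example["label"]]

example = {"premise": "A dog runs in the park.",
           "hypothesis": "An animal is outside.", "label": 0}
for source, target in apply_templates(example):
    print(source, "->", target)
```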
Recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, and SQuAD. In fact, many NLU models now match or exceed "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training; the models are thus provided with far more data than humans need to achieve strong performance. This has motivated a line of work focusing on improving the few-shot learning performance of NLU models. However, the lack of standardized evaluation benchmarks for few-shot learning has led to different experimental settings across papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when given access to large amounts of labeled data, a huge performance gap remains in the few-shot setting for most tasks. We also show differences between alternative model families and adaptation techniques in the few-shot setting. Finally, we discuss principles for designing experimental settings that evaluate true few-shot learning performance, and propose a unified, standardized approach to few-shot learning evaluation. Our goal is to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at https://github.com/microsoft/clues.
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as in tasks like question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval-augmented models are known to excel at knowledge-intensive tasks without needing as many parameters, but it is unclear whether they work in few-shot settings. In this work, we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT, and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.
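The retrieve-then-read pattern behind such models can be sketched in a few lines. The lexical-overlap retriever and string-level fusion below are deliberate simplifications (Atlas itself uses a dense retriever trained jointly with a fusion-in-decoder reader), but they show why swapping documents in the index updates the model's knowledge without retraining.

```python
# A minimal retrieve-then-read sketch in the spirit of retrieval-augmented
# models such as Atlas. The bag-of-words retriever and generic reader input
# below are simplifying assumptions.
from collections import Counter

INDEX = [
    "Paris is the capital and largest city of France.",
    "The Amazon is the largest rainforest on Earth.",
]

def score(query: str, doc: str) -> int:
    """Crude lexical overlap score standing in for dense dot products."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 1):
    return sorted(INDEX, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_reader_input(query: str) -> str:
    passages = " ".join(retrieve(query))
    return f"question: {query} context: {passages}"

# The reader input can be fed to any seq2seq LM; updating INDEX updates the
# system's knowledge without retraining any parameters.
print(build_reader_input("What is the capital of France?"))
```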
Pre-trained language models have achieved state-of-the-art results on various natural language processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and a model with 10 billion parameters was trained. ERNIE 3.0 outperformed state-of-the-art models on various NLP tasks. To explore the effect of scaling up ERNIE 3.0, we train ERNIE 3.0 Titan, a model with up to 260 billion parameters, on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable text. To reduce computation overhead and carbon emissions, we propose an online distillation framework for ERNIE 3.0 Titan, in which the teacher model teaches students and trains itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model to date. Empirical results show that ERNIE 3.0 Titan outperforms state-of-the-art models on 68 NLP datasets.
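The online distillation idea, a teacher that teaches the student while still training itself, can be sketched generically in PyTorch; the KL-based student loss and toy linear models below are assumptions, not ERNIE 3.0 Titan's exact formulation.

```python
# A minimal sketch of online distillation: the student distills from a
# teacher that is still training, so no separate teacher run is needed.
import torch
import torch.nn.functional as F

def online_distillation_step(teacher, student, batch, opt_t, opt_s, T=2.0):
    logits_t = teacher(batch["input"])          # teacher keeps training ...
    loss_t = F.cross_entropy(logits_t, batch["label"])
    opt_t.zero_grad(); loss_t.backward(); opt_t.step()

    logits_s = student(batch["input"])          # ... while the student distills
    soft_targets = F.softmax(logits_t.detach() / T, dim=-1)
    loss_s = F.kl_div(F.log_softmax(logits_s / T, dim=-1),
                      soft_targets, reduction="batchmean") * T * T
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    return loss_t.item(), loss_s.item()

teacher, student = torch.nn.Linear(8, 3), torch.nn.Linear(8, 3)
opt_t = torch.optim.Adam(teacher.parameters())
opt_s = torch.optim.Adam(student.parameters())
batch = {"input": torch.randn(4, 8), "label": torch.randint(0, 3, (4,))}
print(online_distillation_step(teacher, student, batch, opt_t, opt_s))
```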
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF (better few-shot fine-tuning of language models; alternatively, language models' best friends forever), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning. Our implementation is publicly available at https://github.com/princeton-nlp/LM-BFF.
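A minimal sketch of LM-BFF-style prompt-based inference follows: demonstrations are concatenated into the context, the input is wrapped in a template ending in a mask token, and label words are compared at the mask position. The sentiment template and label words are common choices assumed here for illustration; LM-BFF additionally searches for templates and label words automatically.

```python
# A minimal sketch of prompt-based classification with demonstrations in the
# style of LM-BFF, scoring label words at the mask position of an MLM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

template = "{sent} It was {mask}."
demos = [("A delightful, moving picture.", " great"),
         ("Tedious and overlong.", " terrible")]
label_words = {" great": "positive", " terrible": "negative"}

def classify(sentence: str) -> str:
    context = " ".join(template.format(sent=s, mask=w.strip())
                       for s, w in demos)
    text = context + " " + template.format(sent=sentence, mask=tok.mask_token)
    inputs = tok(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    logits = model(**inputs).logits[0, mask_pos]
    ids = {w: tok(w, add_special_tokens=False).input_ids[0]
           for w in label_words}
    best = max(ids, key=lambda w: logits[ids[w]].item())
    return label_words[best]

print(classify("A gorgeous, witty film."))
```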
Self-rationalization models that predict task labels and generate free-text elaborations for their predictions could enable more intuitive interaction with NLP systems. However, these models are currently trained with a large amount of human-written free-text explanations for each task, which hinders their broader usage. We propose to study a more realistic setting of self-rationalization using a few training examples. We present FEB, a standardized collection of four existing English-language datasets and associated metrics. We identify the right prompting approach by extensively exploring natural language prompts on FEB. Then, by using this prompt and scaling the model size, we demonstrate that progress on few-shot self-rationalization is possible. We show there is still ample room for improvement on this task: the average plausibility of generated explanations, as assessed by human annotators, is at most 51%, while the plausibility of human explanations is 76%. We hope that FEB, together with our proposed approach, will spur the community to take on the few-shot self-rationalization challenge.
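A few-shot self-rationalization prompt can be as simple as asking for "label because explanation"; the wording below is an illustrative assumption rather than FEB's selected template.

```python
# A minimal sketch of a few-shot self-rationalization prompt: the model is
# asked to produce both a label and a free-text explanation.
demos = [
    {"premise": "A man plays guitar on stage.",
     "hypothesis": "A person is performing music.",
     "label": "entailment",
     "explanation": "Playing guitar on stage is a musical performance."},
]

def build_prompt(premise: str, hypothesis: str) -> str:
    lines = []
    for d in demos:
        lines.append(f"Premise: {d['premise']}\nHypothesis: {d['hypothesis']}\n"
                     f"Answer: {d['label']} because {d['explanation']}")
    lines.append(f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:")
    return "\n\n".join(lines)

# The completion after "Answer:" is parsed as "<label> because <explanation>".
print(build_prompt("A dog sleeps on a couch.", "An animal is resting."))
```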
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning, that is, fine-tuning language models on a collection of tasks described via instructions, substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components of instruction tuning's success.
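Assembling an instruction-tuning mixture can be sketched as: verbalize each task with its own instruction template, cap each task's share, and shuffle. The task names, templates, and cap below are illustrative assumptions, not FLAN's released configuration.

```python
# A minimal sketch of building an instruction-tuning mixture from several
# tasks, each verbalized with a natural language instruction template.
import random

TASKS = {
    "rte": ("Does \"{premise}\" imply \"{hypothesis}\"? OPTIONS: yes, no",
            [{"premise": "It rained all night.",
              "hypothesis": "The ground is wet.", "target": "yes"}]),
    "boolq": ("{passage}\nQuestion: {question}? OPTIONS: yes, no",
              [{"passage": "Owls are mostly nocturnal hunters.",
                "question": "do owls hunt at night", "target": "yes"}]),
}

def build_mixture(max_per_task: int = 1000):
    """Verbalize each task, capping its share so large sets don't dominate."""
    mixture = []
    for template, examples in TASKS.values():
        for ex in examples[:max_per_task]:
            mixture.append((template.format(**ex), ex["target"]))
    random.shuffle(mixture)
    return mixture

for source, target in build_mixture():
    print(source, "->", target, "\n")
```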
Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we suggest working on few-shot retrieval, where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-based Query Generation for Retriever (Promptagator), which leverages a large language model (LLM) as a few-shot query generator and creates task-specific retrievers based on the generated data. Powered by the LLM's generalization ability, Promptagator makes it possible to create task-specific end-to-end retrievers based solely on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, prompting the LLM with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO, such as ColBERT v2, by more than 1.2 nDCG on average across 11 retrieval suites. Further training standard-size re-rankers on the same generated data yields another 5.0-point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.
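The core loop, prompting an LLM with a handful of (document, query) pairs to synthesize training data for a task-specific retriever, can be sketched as follows; `generate` is a stand-in for any LLM completion call, and the prompt wording is an assumption.

```python
# A minimal sketch of few-shot query generation for retrieval: a few
# (document, query) pairs are formatted into a prompt, an LLM generates new
# queries for unlabeled documents, and the synthetic pairs become training
# data for a task-specific dual encoder.
FEW_SHOT_PAIRS = [
    ("Aspirin reduces fever and relieves mild pain.",
     "what does aspirin do"),
    ("The Nile flows north through eleven countries.",
     "which direction does the nile flow"),
]

def build_generation_prompt(document: str) -> str:
    parts = [f"Document: {d}\nQuery: {q}" for d, q in FEW_SHOT_PAIRS]
    parts.append(f"Document: {document}\nQuery:")
    return "\n\n".join(parts)

def synthesize_training_data(documents, generate):
    """Pair each unlabeled document with an LLM-generated query."""
    return [(doc, generate(build_generation_prompt(doc))) for doc in documents]

# The resulting pairs can be filtered for round-trip consistency and then
# used to train a dense retriever for this particular task.
fake_llm = lambda prompt: "example generated query"
print(synthesize_training_data(["Honey never spoils."], fake_llm))
```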
Task generalization has been a long-standing challenge in natural language processing (NLP). Recent research attempts to improve the task generalization ability of pre-trained language models by mapping NLP tasks into human-readable prompted forms. However, these approaches require laborious and inflexible prompts, and different prompts on the same downstream task may yield unstable performance. We propose Unified Schema Prompt, a flexible and extensible prompting method that automatically customizes learnable prompts for each task according to the task's input schema. It models the shared knowledge between tasks while preserving the characteristics of different task schemas, thereby enhancing task generalization ability. The schema prompt takes the explicit data structure of each task to formulate prompts, so little human effort is involved. To test the task generalization ability of schema prompting, we conduct schema prompt-based multi-task pre-training on a wide variety of general NLP tasks. The framework achieves strong zero-shot and few-shot generalization performance on 16 unseen downstream tasks from 8 task types (e.g., QA, NLI, etc.). Furthermore, comprehensive analyses demonstrate the effectiveness of each component in the schema prompt, its flexibility in task compositionality, and its ability to improve performance in a full-data fine-tuning setting.
Question Answering (QA) is a longstanding challenge in natural language processing. Existing QA works mostly focus on specific question types, knowledge domains, or reasoning skills. The specialty in QA research hinders systems from modeling commonalities between tasks and generalization for wider applications. To address this issue, we present ProQA, a unified QA paradigm that solves various tasks through a single model. ProQA takes a unified structural prompt as the bridge and improves the QA-centric ability by structural prompt-based pre-training. Through a structurally designed prompt-based input schema, ProQA concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task. Furthermore, ProQA is pre-trained with structural prompt-formatted large-scale synthesized corpus, which empowers the model with the commonly-required QA ability. Experimental results on 11 QA benchmarks demonstrate that ProQA consistently boosts performance across full-data fine-tuning, few-shot learning, and zero-shot testing scenarios. Furthermore, ProQA exhibits strong ability in both continual learning and transfer learning by taking the advantages of the structural prompt.
Large language models pre-trained with self-supervised learning have demonstrated impressive zero-shot generalization capabilities on a wide spectrum of tasks. In this work, we present WeLM: a well-read pre-trained language model for Chinese that is able to seamlessly perform different types of tasks with zero or few-shot demonstrations. WeLM is trained with 10B parameters by "reading" a curated high-quality corpus covering a wide range of topics. We show that WeLM is equipped with broad knowledge of various domains and languages. On 18 monolingual (Chinese) tasks, WeLM can significantly outperform existing pre-trained models of similar size and match the performance of models up to 25 times larger. WeLM also exhibits strong capabilities in multilingual and code-switched understanding, outperforming existing multilingual models pre-trained on 30 languages. Furthermore, we collected human-written prompts for a large set of supervised Chinese datasets and fine-tuned WeLM with multi-prompted training. The resulting model can attain strong generalization on unseen types of tasks and outperform the unsupervised WeLM in zero-shot learning. Finally, we demonstrate that WeLM has basic skills at explaining and calibrating its own decisions, which can be a promising direction for future research. Our model can be applied via https://welm.weixin.qq.com/docs/api/.
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
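The text-to-text interface is easy to demonstrate with the released checkpoints: every task is "text in, text out", selected by an input prefix. The sketch below uses the public t5-small checkpoint and two documented task prefixes.

```python
# A minimal sketch of the text-to-text interface: the same weights handle
# translation and acceptability classification, distinguished only by the
# input prefix. Running this downloads the public t5-small checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def run(text: str) -> str:
    ids = tok(text, return_tensors="pt").input_ids
    return tok.decode(model.generate(ids, max_new_tokens=40)[0],
                      skip_special_tokens=True)

print(run("translate English to German: The house is wonderful."))
print(run("cola sentence: The books is on the table."))
```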
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
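Few-shot prompting needs no model access beyond text completion; the sketch below builds a GPT-3-style prompt (the cheese and sea-otter demonstrations echo the paper's translation example), with `complete` standing in for any autoregressive LM call.

```python
# A minimal sketch of few-shot in-context learning: the task is specified
# purely through text, with demonstrations prepended to the query and no
# gradient updates of any kind.
def few_shot_prompt(instruction, demos, query):
    lines = [instruction]
    lines += [f"Q: {q}\nA: {a}" for q, a in demos]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
# complete = <your LM of choice>; the model is expected to continue with the
# French translation, having inferred the task from the demonstrations alone.
print(prompt)
```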
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
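The "one additional output layer" recipe maps directly onto modern libraries; the sketch below fine-tunes a public BERT checkpoint with a freshly initialized classification head (the toy batch and hyperparameters are illustrative).

```python
# A minimal sketch of BERT fine-tuning: a pretrained encoder plus a fresh
# classification head, trained end to end with cross-entropy.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # the head is randomly initialized

batch = tok(["a delightful film", "a tedious mess"],
            return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy over the head
loss.backward()
optimizer.step()
print(float(loss))
```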
Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Using machine translation as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate that its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models such as GPT-3 and XGLM (Lin et al., 2021), despite mT5 having approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. Our results demonstrate, for the first time, that prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
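SAP's core idea, calling a span-infilling model repeatedly and appending each generated chunk before the next call, can be sketched with mT5's sentinel tokens; the chunk size and stopping rule below are assumptions, not the paper's exact procedure.

```python
# A minimal sketch of sequential autoregressive prompting: a span-infilling
# encoder-decoder is queried repeatedly, and each newly generated chunk is
# appended to the prompt before the next call, yielding a left-to-right
# generation loop out of a bidirectional model.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/mt5-small")

def sap_generate(prompt: str, steps: int = 5, chunk_tokens: int = 4) -> str:
    completion = ""
    for _ in range(steps):
        # Ask the infilling model to fill the sentinel at the end of the text.
        text = prompt + completion + " <extra_id_0>"
        ids = tok(text, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=chunk_tokens)
        filled = tok.decode(out[0], skip_special_tokens=True)
        if not filled.strip():
            break
        completion += " " + filled.strip()
    return completion.strip()

print(sap_generate("Translate English to French. English: cheese. French:"))
```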
In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling."
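Prompt tuning is compact enough to sketch end to end: a small matrix of soft prompt vectors is prepended to the frozen model's input embeddings and is the only tensor that receives gradients. The sketch below uses t5-small; the prompt length and learning rate are illustrative assumptions.

```python
# A minimal sketch of prompt tuning with a frozen T5: only the soft prompt
# is trained, and the target is ordinary label text.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
for p in model.parameters():
    p.requires_grad = False  # the entire LM stays frozen

PROMPT_LEN = 20
soft_prompt = torch.nn.Parameter(
    torch.randn(PROMPT_LEN, model.config.d_model) * 0.5)
opt = torch.optim.Adam([soft_prompt], lr=0.3)  # only the prompt is updated

def step(text: str, target: str) -> float:
    x = model.shared(tok(text, return_tensors="pt").input_ids)  # (1, T, D)
    x = torch.cat([soft_prompt.unsqueeze(0), x], dim=1)  # prepend the prompt
    labels = tok(target, return_tensors="pt").input_ids
    loss = model(inputs_embeds=x, labels=labels).loss
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

print(step("a delightful film", "positive"))
```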
Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort, available at https://github.com/bigscience-workshop/biomedical.
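Programmatic access through the harmonized schemas might look like the hedged sketch below; the dataset identifier and config name are assumptions for illustration, so consult the repository for the actual names.

```python
# A hedged sketch of programmatic access to a harmonized dataset. BigBIO
# datasets are distributed as Hugging Face `datasets` loaders; the dataset
# and config names below are assumptions, not verified identifiers.
from datasets import load_dataset

# Harmonized "bigbio" schemas expose the same fields across datasets of the
# same task type, which is what makes multi-task mixing straightforward.
ds = load_dataset("bigbio/scitail", name="scitail_bigbio_te", split="train")
print(ds[0])  # e.g. {"premise": ..., "hypothesis": ..., "label": ...}
```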
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
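Since the Flan-T5 checkpoints are public, trying instruction following takes only a few lines; the sketch below uses google/flan-t5-small to keep the download light.

```python
# A minimal sketch of using a publicly released instruction-finetuned
# checkpoint: Flan-T5 follows zero-shot instructions out of the box.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

prompt = ("Answer the following yes/no question by reasoning step by step. "
          "Can a dog drive a car?")
ids = tok(prompt, return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
```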