任务概括是自然语言处理(NLP)的漫长挑战。最近的研究试图通过将NLP任务映射到人类可读的提示形式中来提高预训练语言模型的任务概括能力。但是,这些方法需要费力且不灵活的提示,并且在同一下游任务上的不同提示可能会获得不稳定的性能。我们提出了统一的架构提示,这是一种灵活且可扩展的提示方法,该方法会根据任务输入架构自动自动自定义每个任务的可学习提示。它在任务之间建模共享知识,同时保持不同任务架构的特征,从而增强任务概括能力。架构提示采用每个任务的明确数据结构,以制定提示,因此涉及几乎没有人类的努力。为了测试模式提示的任务概括能力,我们对各种一般NLP任务进行基于模式提示的多任务预训练。该框架在从8种任务类型(例如QA,NLI等)的16个看不见的下游任务上实现了强劲的零射击和很少的概括性能。此外,全面的分析证明了每个组件在架构提示中的有效性,其在任务组成性方面的灵活性以及在全DATA微调设置下提高性能的能力。
translated by 谷歌翻译
Question Answering (QA) is a longstanding challenge in natural language processing. Existing QA works mostly focus on specific question types, knowledge domains, or reasoning skills. The specialty in QA research hinders systems from modeling commonalities between tasks and generalization for wider applications. To address this issue, we present ProQA, a unified QA paradigm that solves various tasks through a single model. ProQA takes a unified structural prompt as the bridge and improves the QA-centric ability by structural prompt-based pre-training. Through a structurally designed prompt-based input schema, ProQA concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task. Furthermore, ProQA is pre-trained with structural prompt-formatted large-scale synthesized corpus, which empowers the model with the commonly-required QA ability. Experimental results on 11 QA benchmarks demonstrate that ProQA consistently boosts performance on both full data fine-tuning, few-shot learning, and zero-shot testing scenarios. Furthermore, ProQA exhibits strong ability in both continual learning and transfer learning by taking the advantages of the structural prompt.
translated by 谷歌翻译
本文探讨了提高语言模型的零次学习能力的简单方法。我们表明,指令调整 - 通过对说明书中所述的任务集合微调语言模型 - 大幅提升零射门上看不见任务中的表现。我们采取预训练的语言模型和指令调整它通过自然语言指令模板语言表达了60NLP任务137B参数。我们评估这种指令调整模型,我们称之为FLAN,在看不见的任务类型。FLAN显着改善其未修饰的对应的性能和超过25的20个任务,我们评估零射门175BGPT-3。FLAN甚至GPT-3通过在安利,RTE,BoolQ,AI2-ARC,OpenbookQA和StoryCloze大比分胜过几拍。消融研究显示任务和模型的规模,这个数字是指令调整取得成功的关键组成部分。
translated by 谷歌翻译
最近已被证明大型语言模型在各种任务集中获得合理的零射普通化(Brown等,2020)。它已经假设这是语言模型的隐式多任务学习的结果,在语言模型中的预押(Radford等,2019)。可以通过明确的多任务学习直接引起零拍常规化?为了以缩放测试这个问题,我们开发一个系统,以便轻松地将任何自然语言任务映射到人类可读的提示表单中。我们转换一组大量的监督数据集,每个数据集都有多个提示,具有不同的措辞。这些提示的数据集允许基准测试模型执行完全看不见的任务的能力。我们介绍了一个普拉克尔编码器 - 解码器模型(Raffel等,2020; Lester等,2021),覆盖各种任务。该模型在多个标准数据集中达到强大的零点性能,通常优于其尺寸的型号超过16倍。此外,我们的方法对来自Big-替补基准测试的任务子集具有强烈性能,优于其尺寸的6倍。所有提示和培训的型号都可以在https://github.com/ bigscience-workshop / protectsource / httpsource / https://huggingface.co/bigscience/t0pp。
translated by 谷歌翻译
This paper presents ReasonFormer, a unified reasoning framework for mirroring the modular and compositional reasoning process of humans in complex decision-making. Inspired by dual-process theory in cognitive science, the representation module (automatic thinking) and reasoning modules (controlled thinking) are decoupled to capture different levels of cognition. Upon the top of the representation module, the pre-trained reasoning modules are modular and professional in specific and fundamental reasoning skills (e.g., logic, simple QA, etc). To mimic the controlled compositional thinking process, different reasoning modules are dynamically activated and composed in both parallel and cascaded manners to control what reasoning skills are activated and how deep the reasoning process will be reached to solve the current problems. The unified reasoning framework solves multiple tasks with a single model, and is trained and inferred in an end-to-end manner. Evaluated on 11 datasets requiring different reasoning skills and complexity, ReasonFormer demonstrates substantial performance boosts, revealing the compositional reasoning ability. Few-shot experiments exhibit better generalization ability by learning to compose pre-trained skills for new tasks with limited data, and decoupling the representation module and the reasoning modules. Further analysis shows the modularity of reasoning modules as different tasks activate distinct reasoning skills at different reasoning depths.
translated by 谷歌翻译
预训练的语言模型(PLM)在自然语言生成(NLG)任务中取得了显着的成功。到目前为止,大多数PLM都使用大型一般语料库以无监督的方式进行了预培训。同时,与无监督的模型相比,预先训练的模型越来越多地显示出较低的数据表现出色。受监督预训练的成功的激励,我们提出了自然语言生成的多任务监督预训练(MVP)。为了预先培训文本生成模型MVP,我们从七个生成任务中收集了45个数据集的标记预训练语料库。对于每个任务,我们进一步预先训练特定的软提示,以刺激执行特定任务的模型能力。广泛的实验证明了我们在许多NLG任务中有监督的预训练的有效性,并且我们的一般方法在17个数据集中的12个中实现了最先进的性能。
translated by 谷歌翻译
文本匹配是信息检索和自然语言处理的基本技术。文本匹配任务共享确定两个给定文本之间关系的相同范例。这些关系因任务而异,例如〜在文档检索中相关性,释义识别中的语义一致性和所回答的可回答判断。但是,文本匹配的基本信号保留在有限范围中,即〜精确匹配,语义匹配和推理匹配。理想情况下,良好的文本匹配模型可以学会捕获和汇总这些信号,以实现不同的匹配任务以实现竞争性能,而最近的最新文本匹配模型,例如〜预训练的语言模型(PLM)很难概括。这是因为在特定于任务的数据集上的端到端监督学习使模型过分强调了数据示例偏置和特定于任务的信号,而不是基本的匹配信号。为了克服这个问题,我们采用了专业化的将军培训策略,并将其称为比赛推出。在专业阶段,对不同匹配任务的描述映射到一些提示令牌。在概括阶段,匹配模型通过接受各种匹配任务的培训来探索基本匹配信号。高不同的匹配任务避免了模型拟合特定任务的数据偏差,因此该模型可以专注于学习基本匹配信号。同时,在第一步中获得的提示令牌有助于模型区分不同的特定任务匹配信号。公共数据集上的实验结果表明,匹配点可以提高PLM在文本匹配中的多任务概括能力,并产生更好的内域多任务,外域多任务和新任务适应性性能由以前的微调范式训练的特定于任务模型。
translated by 谷歌翻译
There has been great progress in unifying various table-to-text tasks using a single encoder-decoder model trained via multi-task learning (Xie et al., 2022). However, existing methods typically encode task information with a simple dataset name as a prefix to the encoder. This not only limits the effectiveness of multi-task learning, but also hinders the model's ability to generalize to new domains or tasks that were not seen during training, which is crucial for real-world applications. In this paper, we propose compositional task configurations, a set of prompts prepended to the encoder to improve cross-task generalization of unified models. We design the task configurations to explicitly specify the task type, as well as its input and output types. We show that this not only allows the model to better learn shared knowledge across different tasks at training, but also allows us to control the model by composing new configurations that apply novel input-output combinations in a zero-shot manner. We demonstrate via experiments over ten table-to-text tasks that our method outperforms the UnifiedSKG baseline by noticeable margins in both in-domain and zero-shot settings, with average improvements of +0.5 and +12.6 from using a T5-large backbone, respectively.
translated by 谷歌翻译
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
translated by 谷歌翻译
本文着重于几次NLP任务的文本数据增强。现有的数据增强算法要么使用一个小型培训集来生成新的合成数据,要么利用与任务无关的启发式规则(例如,同义词替代)或微调通用预训练的语言模型(例如GPT2)。因此,这些方法具有特定于任务的知识,并且仅限于在简单任务中为弱基线产生低质量的合成数据。为了解决这个问题,我们提出了知识混合数据增强模型(KNOWDA):使用知识混合培训(KOMT)在不同的NLP任务的混合物上预测的编码器LM。 KOMT是一种培训程序,将各种异质NLP任务的输入示例重新定义为统一的文本到文本格式,并采用不同粒度的目标,以学习生成部分或完整的样本。在KOMT的帮助下,Knowda可以隐含地将所需的特定于任务的知识从任务的混合中隐含地结合在一起,并通过一些给定的实例迅速掌握目标任务的固有综合定律。据我们所知,我们是首次尝试将任务数量扩展到多任务共同培训以进行数据扩展。广泛的实验表明,i)Knowda成功地通过少量基准的基准成功地提高了Albert和Deberta的表现,表现优于先前的最新数据增强基线; ii)KNOWDA还可以改善少数弹药任务的模型性能,这是KOMT中未包含的固定任务类型。
translated by 谷歌翻译
In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling." * Work done as a Google AI Resident.
translated by 谷歌翻译
几乎没有射击的内在学习(ICL)使预训练的语言模型能够通过为输入的一部分提供少量的培训示例来执行以前的任务,而无需任何基于梯度的培训。 ICL会产生大量的计算,内存和存储成本,因为它每次进行预测时都涉及处理所有培训示例。参数有效的微调(PEFT)(例如,适配器模块,提示调谐,稀疏更新方法等)提供了替代范式,其中训练了一组少量参数以启用模型来执行新任务。在本文中,我们严格地比较了几个ICL和PEFT,并证明后者提供了更好的准确性,并大大降低了计算成本。在此过程中,我们引入了一种称为(IA)$^3 $的新PEFT方法,该方法通过学习的向量来扩展激活,从而获得更强的性能,同时仅引入相对少量的新参数。我们还提出了一个基于称为T-FEW的T0模型的简单食谱,可以将其应用于新任务,而无需特定于任务的调整或修改。我们通过将T-FEW应用于木筏基准,首次实现超人性能,并以6%的绝对性能优于最先进的方法来验证T-FEW对完全看不见的任务的有效性。我们实验中使用的所有代码均可公开使用。
translated by 谷歌翻译
预训练模型已在许多代码智能任务中有效。这些模型在大规模未标记的语料库中进行了预训练,然后在下游任务中进行了微调。但是,由于预训练和下游任务的输入是不同的形式,因此很难充分探索预训练模型的知识。此外,微调的性能强烈依赖于下游数据的量,而实际上,具有稀缺数据的场景很常见。自然语言处理(NLP)领域的最新研究表明,迅速调整,一种调整的新范式,减轻上述问题并在各种NLP任务中实现了有希望的结果。在迅速调整中,在调整过程中插入的提示提供了特定于任务的知识,这对于具有相对较少数据的任务特别有益。在本文中,我们凭经验评估了代码智能任务中迅速调整的用法和效果。我们对流行的预训练模型Codebert和codet5进行及时调整,并尝试三个代码智能任务,包括缺陷预测,代码摘要和代码翻译。我们的实验结果表明,在所有三个任务中,迅速调整始终优于微调。此外,及时调整在低资源场景中显示出很大的潜力,例如,对于代码摘要,平均将微调的BLEU分数提高了26%以上。我们的结果表明,我们可以调整代码智能任务的迅速调整,以实现更好的性能,尤其是在缺乏特定于任务的数据时,我们可以调整及时调整。
translated by 谷歌翻译
This work introduces a new multi-task, parameter-efficient language model (LM) tuning method that learns to transfer knowledge across different tasks via a mixture of soft prompts-small prefix embedding vectors pre-trained for different tasks. Our method, called ATTEMPT (ATTEntional Mixtures of Prompt Tuning), obtains source prompts as encodings of large-scale source tasks into a small number of parameters and trains an attention module to interpolate the source prompts and a newly initialized target prompt for every instance in the target task. During training, only the target task prompt and the attention weights, which are shared between tasks in multi-task training, are updated, while the original LM and source prompts are intact. ATTEMPT is highly parameter-efficient (e.g., updates 2,300 times fewer parameters than full fine-tuning) while achieving high task performance using knowledge from high-resource tasks. Moreover, it is modular using pre-trained soft prompts, and can flexibly add or remove source prompts for effective knowledge transfer. Our experimental results across 21 diverse NLP datasets show that ATTEMPT significantly outperforms prompt tuning and outperforms or matches fully fine-tuned or other parameter-efficient tuning approaches that use over ten times more parameters. Finally, ATTEMPT outperforms previous work in few-shot learning settings.
translated by 谷歌翻译
预先接受的语言模型实现了最先进的导致各种自然语言处理(NLP)任务。 GPT-3表明,缩放预先训练的语言模型可以进一步利用它们的巨大潜力。最近提出了一个名为Ernie 3.0的统一框架,以预先培训大型知识增强型号,并培训了具有10亿参数的模型。 Ernie 3.0在各种NLP任务上表现出最先进的模型。为了探讨缩放的表现,我们培养了百卢比的3.0泰坦参数型号,在PaddlePaddle平台上有高达260亿参数的泰坦。此外,我们设计了一种自我监督的对抗性损失和可控语言建模损失,以使ERNIE 3.0 TITAN产生可信和可控的文本。为了减少计算开销和碳排放,我们向Ernie 3.0泰坦提出了一个在线蒸馏框架,教师模型将同时教授学生和培训。埃塞尼3.0泰坦是迄今为止最大的中国密集预训练模型。经验结果表明,Ernie 3.0泰坦在68个NLP数据集中优于最先进的模型。
translated by 谷歌翻译
With increasing scale, large language models demonstrate both quantitative improvement and new qualitative capabilities, especially as zero-shot learners, like GPT-3. However, these results rely heavily on delicate prompt design and large computation. In this work, we explore whether the strong zero-shot ability could be achieved at a smaller model scale without any external supervised data. To achieve this goal, we revisit masked language modeling and present a geometry-guided self-supervised learning method (Go-tuningfor short) by taking a small number of task-aware self-supervised data to update language models further. Experiments show that Go-tuning can enable T5-small (80M) competitive zero-shot results compared with large language models, such as T5-XL (3B). We also apply Go-tuning on multi-task settings and develop a multi-task model, mgo-T5 (250M). It can reach the average performance of OPT (175B) on 9 datasets.
translated by 谷歌翻译
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask-inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from where memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 heldout datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate effectiveness of SPT.
translated by 谷歌翻译
最近的自然语言理解进展(NLU)已经被驱动,部分是由胶水,超级格,小队等的基准。事实上,许多NLU模型现在在许多任务中匹配或超过“人类水平”性能这些基准。然而,大多数这些基准测试都提供模型访问相对大量的标记数据进行培训。因此,该模型提供了比人类所需的更多数据,以实现强大的性能。这有动机侧重于侧重于改善NLU模型的少量学习性能。然而,缺乏少量射门的标准化评估基准,导致不同纸张中的不同实验设置。为了帮助加速这一工作的工作,我们介绍了线索(受限制的语言理解评估标准),这是评估NLU模型的几次拍摄学习功能的基准。我们证明,虽然最近的模型在获得大量标记数据时达到人类性能,但对于大多数任务,少量拍摄设置中的性能存在巨大差距。我们还展示了几个拍摄设置中替代模型家族和适应技术之间的差异。最后,我们讨论了在设计实验设置时讨论了评估真实少量学习绩效的实验设置,并提出了统一的标准化方法,以获得少量学习评估。我们的目标是鼓励对NLU模型的研究,可以概括为具有少数示例的新任务。线索的代码和数据可以在https://github.com/microsoft/clues提供。
translated by 谷歌翻译
培训和评估语言模型越来越多地要求构建元数据 - 多样化的策划数据收集,并具有清晰的出处。自然语言提示最近通过将现有的,有监督的数据集转换为多种新颖的预处理任务,突出了元数据策划的好处,从而改善了零击的概括。尽管将这些以数据为中心的方法转化为生物医学语言建模的通用域文本成功,但由于标记的生物医学数据集在流行的数据中心中的代表性大大不足,因此仍然具有挑战性。为了应对这一挑战,我们介绍了BigBio一个由126个以上的生物医学NLP数据集的社区库,目前涵盖12个任务类别和10多种语言。 BigBio通过对数据集及其元数据进行程序化访问来促进可再现的元数据策划,并与当前的平台兼容,以及时工程和端到端的几个/零射击语言模型评估。我们讨论了我们的任务架构协调,数据审核,贡献指南的过程,并概述了两个说明性用例:生物医学提示和大规模,多任务学习的零射门评估。 BigBio是一项持续的社区努力,可在https://github.com/bigscience-workshop/biomedical上获得。
translated by 谷歌翻译
预测任务标签和为其预测生成自由文本阐述的自律化模型可以实现与NLP系统更直观的交互。然而,这些模型目前正在接受大量人为的自由文本解释,每个任务都会阻碍更广泛的使用。我们建议使用少数培训例子研究更现实的自律化建立。我们出示2月 - 一个标准化的四个现有英语数据集和相关指标。我们通过2月份广泛探索自然语言提示来确定正确的提示方法。然后,通过使用此提示并缩放模型大小,我们证明了几次拍摄自合合理化的进展。我们展示了这项任务的完善房间仍然有充足的改进空间:人类注册人评估的生成解释的平均合理性最多为51%,而人类解释的合理性是76%。我们希望2月份与我们的拟议方法一起促使社区承担几次拍摄的自我合理化挑战。
translated by 谷歌翻译