There has been great progress in unifying various table-to-text tasks using a single encoder-decoder model trained via multi-task learning (Xie et al., 2022). However, existing methods typically encode task information with a simple dataset name as a prefix to the encoder. This not only limits the effectiveness of multi-task learning, but also hinders the model's ability to generalize to new domains or tasks that were not seen during training, which is crucial for real-world applications. In this paper, we propose compositional task configurations, a set of prompts prepended to the encoder to improve cross-task generalization of unified models. We design the task configurations to explicitly specify the task type, as well as its input and output types. We show that this not only allows the model to better learn shared knowledge across different tasks at training, but also allows us to control the model by composing new configurations that apply novel input-output combinations in a zero-shot manner. We demonstrate via experiments over ten table-to-text tasks that our method outperforms the UnifiedSKG baseline by noticeable margins in both in-domain and zero-shot settings, with average improvements of +0.5 and +12.6 from using a T5-large backbone, respectively.
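As a rough illustration of the idea in the abstract above, the sketch below composes a task configuration from a task type, input types, and an output type, and renders it as an encoder prefix. The class name `TaskConfig`, the bracketed field markers, and the example tasks are all hypothetical; the abstract does not specify the actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical task configuration: task type plus input and output types."""
    task: str
    inputs: tuple[str, ...]
    output: str

    def to_prefix(self) -> str:
        # Render the configuration as a text prefix for the encoder input.
        # The bracketed markers are illustrative, not the paper's format.
        return f"[task] {self.task} [input] {', '.join(self.inputs)} [output] {self.output}"


def build_encoder_input(config: TaskConfig, fields: dict[str, str]) -> str:
    # Concatenate the prefix with the input fields named by the configuration.
    missing = [name for name in config.inputs if name not in fields]
    if missing:
        raise KeyError(f"missing input fields: {missing}")
    body = " ".join(f"{name}: {fields[name]}" for name in config.inputs)
    return f"{config.to_prefix()} | {body}"


# Two configurations assumed to have been seen during training.
table_qa = TaskConfig("question answering", ("question", "table"), "answer")
table_summary = TaskConfig("summarization", ("table",), "summary")

# A novel combination composed at inference time (the zero-shot case):
# the inputs of the QA task paired with the summarization task and output.
query_summary = TaskConfig(table_summary.task, table_qa.inputs, table_summary.output)
```

Because each part of the configuration is a separate field, a new input-output combination can be expressed without introducing a new dataset name.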
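Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as in tasks like question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without needing as many parameters, but it is unclear whether they work in few-shot settings. In this work, we present Atlas, a carefully designed and pre-trained retrieval-augmented language model that can learn knowledge-intensive tasks from very few training examples. We evaluate it on a wide range of tasks, including MMLU, KILT, and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.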
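Task generalization has been a long-standing challenge in natural language processing (NLP). Recent research attempts to improve the task generalization ability of pre-trained language models by mapping NLP tasks into human-readable prompted forms. However, these approaches require laborious and inflexible manual collection of prompts, and different prompts on the same downstream task may yield unstable performance. We propose Unified Schema Prompt, a flexible and extensible prompting method that automatically customizes the learnable prompts for each task according to the task input schema. It models the knowledge shared between tasks while keeping the characteristics of different task schemas, and thus enhances task generalization ability. The schema prompt uses the explicit data structure of each task to formulate prompts, so little human effort is involved. To test the task generalization ability of the schema prompt, we conduct schema-prompt-based multi-task pre-training on a wide variety of general NLP tasks. The framework achieves strong zero-shot and few-shot generalization performance on 16 unseen downstream tasks from 8 task types (e.g., QA, NLI). Furthermore, comprehensive analyses demonstrate the effectiveness of each component of the schema prompt, its flexibility in task compositionality, and its ability to improve performance under a full-data fine-tuning setting.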
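This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning, that is, fine-tuning language models on a collection of tasks described via instructions, substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pre-trained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves on the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components of the success of instruction tuning.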
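Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with less labeled data show superior performance compared to unsupervised models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. To pre-train the text generation model MVP, we collect a labeled pre-training corpus of 45 datasets from seven generation tasks. For each task, we further pre-train specific soft prompts to stimulate the model's capacity to perform that task. Extensive experiments demonstrate the effectiveness of our supervised pre-training on a number of NLG tasks, and our general method achieves state-of-the-art performance on 12 of 17 datasets.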
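The current state-of-the-art generative models for open-domain question answering (ODQA) focus on generating direct answers from unstructured textual information. However, a large amount of the world's knowledge is stored in structured databases and needs to be accessed using query languages such as SQL. Furthermore, query languages can answer questions that require complex reasoning, as well as offering full explainability. In this paper, we propose a hybrid framework that takes both textual and tabular evidence as input and generates either direct answers or SQL queries, depending on which form can better answer the question. The generated SQL queries can then be executed on the associated databases to obtain the final answers. To the best of our knowledge, this is the first paper to apply Text2SQL to ODQA tasks. Empirically, we demonstrate that on several ODQA datasets the hybrid methods consistently outperform baseline models that take only homogeneous input, by a large margin. Specifically, we achieve state-of-the-art performance on the OpenSQuAD dataset using a T5-base model. In a detailed analysis, we demonstrate that being able to generate structured SQL queries consistently brings gains, especially for questions that require complex reasoning.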
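The last step described above, where a generated string is either a direct answer or an SQL query to be executed on the associated database, can be sketched roughly as follows. The `sql:` prefix convention, the function name `resolve_output`, and the toy `city` table are assumptions made for illustration; the abstract does not say how the two output forms are distinguished.

```python
import sqlite3


def resolve_output(generated: str, conn: sqlite3.Connection) -> str:
    """Return the final answer for a generated string.

    Assumed convention (hypothetical): outputs beginning with "sql:" are
    SQL queries to execute; anything else is already a direct answer.
    """
    prefix = "sql:"
    if generated.lower().startswith(prefix):
        query = generated[len(prefix):].strip()
        rows = conn.execute(query).fetchall()
        # Flatten the result cells into one comma-separated answer string.
        return ", ".join(str(cell) for row in rows for cell in row)
    return generated.strip()


# Toy in-memory table standing in for the associated database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Oslo", 700000), ("Bergen", 290000)],
)
conn.commit()

direct = resolve_output("Oslo", conn)
via_sql = resolve_output(
    "sql: SELECT name FROM city ORDER BY population DESC LIMIT 1", conn
)
```

In a real system the generated query would come from an untrusted model output, so it would normally be run against a read-only connection.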
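Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query, they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to strongly increase the efficiency of the state-of-the-art retrieval-augmented FiD model while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (which uses the concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers to improve the top-ranked provenance precision. Our experiments on a diverse set of seven knowledge-intensive tasks (KILT) show that FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining reasonable efficiency.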
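Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we propose working on few-shot dense retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLMs) as few-shot query generators and creates task-specific retrievers based on the generated data. Powered by the generalization ability of LLMs, Promptagator makes it possible to create task-specific end-to-end retrievers based solely on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO, such as ColBERT v2, by more than 1.2 nDCG on average across 11 retrieval sets. Further training standard-size re-rankers on the same generated data yields another 5.0-point nDCG improvement. Our studies show that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.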
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Fine-tuned language models use greedy decoding to answer reading comprehension questions with relative success. However, this approach does not ensure that the answer is a span in the given passage, nor does it guarantee that it is the most probable one. Does greedy decoding actually perform worse than an algorithm that does adhere to these properties? To study the performance and optimality of greedy decoding, we present exact-extract, a decoding algorithm that efficiently finds the most probable answer span in the context. We compare the performance of T5 with both decoding algorithms on zero-shot and few-shot extractive question answering. When no training examples are available, exact-extract significantly outperforms greedy decoding. However, greedy decoding quickly converges towards the performance of exact-extract with the introduction of a few training examples, becoming more extractive and increasingly likely to generate the most probable span as the training set grows. We also show that self-supervised training can bias the model towards extractive behavior, increasing performance in the zero-shot setting without resorting to annotated examples. Overall, our results suggest that pretrained language models are so good at adapting to extractive question answering that it is often enough to fine-tune on a small training set for the greedy algorithm to emulate the optimal decoding strategy.
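To make the objective of the abstract above concrete, the sketch below finds the most probable answer span by brute-force enumeration of every contiguous span in the context. This is not the efficient exact-extract algorithm itself, only the search problem it solves. The scorer `toy_span_logprob`, the table `TOKEN_LOGP`, and the constants are hypothetical stand-ins for the sequence log-likelihood a model such as T5 would assign.

```python
import math
from typing import Callable, Sequence


def most_probable_span(
    tokens: Sequence[str],
    span_logprob: Callable[[Sequence[str]], float],
    max_len: int,
) -> tuple[list[str], float]:
    """Score every contiguous span of up to max_len tokens and return the best."""
    best_span: list[str] = []
    best_score = -math.inf
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            span = list(tokens[start:end])
            score = span_logprob(span)
            if score > best_score:
                best_span, best_score = span, score
    return best_span, best_score


# Hypothetical per-token log-probabilities; a real scorer would call the model.
TOKEN_LOGP = {"oslo": -0.1}
DEFAULT_LOGP = -5.0
EOS_LOGP = -0.2  # cost of ending the answer after the span


def toy_span_logprob(span: Sequence[str]) -> float:
    return sum(TOKEN_LOGP.get(tok, DEFAULT_LOGP) for tok in span) + EOS_LOGP


context = "the capital of norway is oslo".split()
answer, score = most_probable_span(context, toy_span_logprob, max_len=3)
```

Greedy decoding, by contrast, commits to the locally best token at each step, so its output is neither guaranteed to be a span of the context nor the highest-scoring one.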
We study the problem of retrieval with instructions, where users of a retrieval system explicitly describe their intent along with their queries. We aim to develop a general-purpose task-aware retrieval system using multi-task instruction tuning, which can follow human-written instructions to find the best documents for a given query. We introduce the first large-scale collection of approximately 40 retrieval datasets with instructions, BERRI, and present TART, a multi-task retrieval system trained on BERRI with instructions. TART shows strong capabilities to adapt to a new retrieval task via instructions and advances the state of the art on two zero-shot retrieval benchmarks, BEIR and LOTTE, outperforming models up to three times larger. We further introduce a new evaluation setup, X^2-Retrieval to better reflect real-world scenarios, where diverse domains and tasks are pooled and a system needs to find documents aligning users' intents. In this setup, TART significantly outperforms competitive baselines, further demonstrating the effectiveness of guiding retrieval with instructions.
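This paper focuses on text data augmentation for few-shot NLP tasks. Existing data augmentation algorithms either leverage task-independent heuristic rules (e.g., synonym replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) on a small training set to produce new synthetic data. Consequently, these methods have little task-specific knowledge and are limited to yielding low-quality synthetic data for weak baselines on simple tasks. To address this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pre-trained on a mixture of diverse NLP tasks using Knowledge Mixture Training (KoMT). KoMT is a training procedure that reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format and employs denoising objectives at different granularities to learn to generate partial or complete samples. With the aid of KoMT, KnowDA can implicitly combine the required task-specific knowledge from the learned mixture of tasks and quickly grasp the inherent synthesis law of the target task from a few given instances. To the best of our knowledge, this is the first attempt to scale up the number of tasks in multi-task co-training for data augmentation. Extensive experiments show that (i) KnowDA successfully improves the performance of Albert and Deberta by a large margin on a few-shot benchmark, outperforming previous state-of-the-art data augmentation baselines; and (ii) KnowDA can also improve model performance on few-shot tasks of a held-out task type not included in KoMT.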
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper https://github.com/asahi417/lm-question-generation, which are also available as a demo https://autoqg.net/.
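Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Although there is a large convergence in terms of architecture, most pretrained models are still typically developed for specific tasks or modalities. In this work, we propose using language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceives diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to fine-tuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with fine-tuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on fine-tuning, zero-shot generalization, and few-shot learning.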
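The Query Focused Text Summarization (QFTS) task aims at building systems that generate a summary of the text document(s) based on a given query. A key challenge in addressing this task is the lack of large labeled datasets for training the summarization model. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we use such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective at generating abstractive summaries for the QFTS task while setting new state-of-the-art results across a set of automatic and human evaluation metrics.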
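Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks, such as open question answering and fact verification. These models are trained to generate the final output given the retrieved passages, which can be irrelevant to the original query, leading to learning spurious cues or answer memorization. This work introduces a method to incorporate the evidentiality of passages (whether a passage contains correct evidence to support the output) into training the generator. We introduce a multi-task learning framework that jointly generates the final output and predicts the evidentiality of each passage, leveraging a new task-agnostic method to obtain silver evidentiality labels for supervision. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart of the same model size and advances the state of the art on FaVIQ-Ambig. We attribute these improvements to both the auxiliary multi-task learning and the silver evidentiality mining techniques.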
Open-Domain Question Answering (ODQA) requires models to answer factoid questions with no context given. The common approach to this task is to train models on a large-scale annotated dataset to retrieve related documents and generate answers based on these documents. In this paper, we show that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus, and we propose a Self-Prompting framework for LLMs to perform ODQA, eliminating the need for training data and an external knowledge corpus. Concretely, we first generate multiple pseudo QA pairs with background passages and one-sentence explanations for these QAs by prompting LLMs step by step, and then leverage the generated QA pairs for in-context learning. Experimental results show that our method surpasses previous state-of-the-art methods by an average of +8.8 EM on three widely used ODQA datasets, and even achieves performance comparable to several retrieval-augmented fine-tuned models.
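Self-rationalization models that predict task labels and generate free-text elaborations for their predictions could enable more intuitive interaction with NLP systems. However, these models are currently trained with a large amount of human-written free-text explanations for each task, which hinders their broader usage. We propose to study a more realistic setting of self-rationalization using few training examples. We present FEB, a standardized collection of four existing English-language datasets and associated metrics. We identify the right prompting approach by extensively exploring natural language prompts on FEB. Then, by using this prompt and scaling the model size, we demonstrate that progress on few-shot self-rationalization is possible. We show that there is still ample room for improvement on this task: the average plausibility of generated explanations, as assessed by human annotators, is at most 51%, while the plausibility of human explanations is 76%. We hope that FEB, together with our proposed approach, will spur the community to take on the few-shot self-rationalization challenge.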
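Dense retrievers for open-domain question answering (ODQA) have been shown to achieve impressive performance by training on large datasets of question-passage pairs. We investigate whether dense retrieval can be learned in a self-supervised fashion and applied effectively without any annotations. We observe that existing pretrained models for retrieval struggle in this scenario, and we propose a new pretraining scheme designed for retrieval: recurring span retrieval. We use recurring spans across passages in a document to create pseudo examples for contrastive learning. The resulting model, Spider, performs surprisingly well without any examples on a wide range of ODQA datasets, and is competitive with BM25, a strong sparse baseline. In addition, Spider often outperforms strong baselines such as DPR when they are evaluated on questions from datasets other than those they were trained on. Our hybrid retriever, which combines Spider with BM25, improves over its components across all datasets, and is often competitive with in-domain DPR models, which are trained on tens of thousands of examples.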
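The pseudo-example construction described above can be sketched as follows: spans (here, fixed-length n-grams) that recur in two or more passages of the same document are used to pair those passages for contrastive learning. The function names, the fixed n-gram length, and the toy document are assumptions for illustration; the abstract does not specify how spans are selected or how the query side is formed.

```python
from collections import defaultdict
from itertools import combinations


def recurring_spans(
    passages: list[list[str]], n: int = 2
) -> dict[tuple[str, ...], list[int]]:
    """Map each n-gram to the sorted indices of the passages containing it,
    keeping only n-grams that occur in at least two passages."""
    where: dict[tuple[str, ...], set[int]] = defaultdict(set)
    for idx, tokens in enumerate(passages):
        for i in range(len(tokens) - n + 1):
            where[tuple(tokens[i:i + n])].add(idx)
    return {span: sorted(ids) for span, ids in where.items() if len(ids) >= 2}


def pseudo_examples(
    passages: list[list[str]], n: int = 2
) -> list[tuple[int, int, tuple[str, ...]]]:
    """Build (query_passage, positive_passage, shared_span) triples."""
    examples = []
    for span, ids in recurring_spans(passages, n).items():
        for query_idx, positive_idx in combinations(ids, 2):
            examples.append((query_idx, positive_idx, span))
    return examples


# Toy document split into three tokenized passages.
document = [
    "the nobel prize was first awarded in 1901".split(),
    "alfred nobel left his fortune to fund the prizes".split(),
    "winners of the nobel prize receive a medal".split(),
]
examples = pseudo_examples(document, n=2)
```

In this toy document, passages 0 and 2 share the bigrams "the nobel" and "nobel prize", so they are paired; passage 1 shares no bigram with the others and would serve only as a negative.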
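Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach to knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a clustering-based prompting method that selects distinct prompts, so that the generated documents cover different perspectives, leading to better recall of acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks: open-domain QA, fact checking, and dialogue systems. Notably, GenRead achieves exact match scores of 71.6 and 54.4 on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Lastly, we demonstrate that model performance can be further improved by combining retrieval and generation.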