Previous work has shown a scaling law between the size of language models (LMs) and their zero-shot performance on different downstream NLP tasks. In this work, we show that this phenomenon does not hold when large LMs are evaluated on tasks with negated prompts; instead, they exhibit an inverse scaling law. We evaluate nine different tasks with negated prompts on (1) pretrained LMs (OPT & GPT-3) of varying sizes (125M-175B), (2) LMs further pretrained to generalize to novel prompts (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated prompts. All LM types perform worse on negated prompts and show a huge gap to human performance when comparing the average scores on the original and negated prompts. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches for building LMs that actually follow the given instructions. We provide the code and datasets for exploring negated prompts at https://github.com/joeljang/negated-prompts-for-llms.
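To make the evaluation setup concrete, the toy sketch below builds an original/negated prompt pair for a single question; the templates and the example item are illustrative assumptions, not the paper's actual data.

```python
# A toy illustration of the negated-prompt setup: the same question is posed
# under an original instruction and its negation, and a model that follows
# instructions should flip its answer. Templates and the example item are
# illustrative assumptions.
def build_prompt(instruction, question, options):
    opts = "\n".join(f"- {o}" for o in options)
    return f"{instruction}\nQuestion: {question}\nOptions:\n{opts}\nAnswer:"

original = "Pick the correct answer to the question."
negated = "Pick the incorrect answer to the question."

question = "What is the capital of France?"
options = ["Paris", "Berlin"]

print(build_prompt(original, question, options))  # expected answer: Paris
print(build_prompt(negated, question, options))   # expected answer: Berlin
# Inverse scaling: larger LMs improve on the original prompt yet keep
# answering "Paris" under the negated one, so negated-prompt accuracy drops.
```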
Meta-training, which fine-tunes the language model (LM) on various downstream tasks by maximizing the likelihood of the target label given the task instruction and input instance, has improved the zero-shot task generalization performance. However, meta-trained LMs still struggle to generalize to challenging tasks containing novel labels unseen during meta-training. In this paper, we propose Flipped Learning, an alternative method of meta-training which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as Flipped, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 11B-sized Flipped outperforms zero-shot T0-11B and even a 16-times-larger 3-shot GPT-3 (175B) by 8.4 and 9.7 percentage points on average, respectively. Flipped gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped comes from improved generalization to novel labels. We release our code at https://github.com/seonghyeonye/Flipped-Learning.
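As a concrete reading of the inference rule, here is a minimal sketch of Flipped-style label selection. The paper uses T0-style encoder-decoder LMs; the small causal LM (gpt2), prompt format, and scoring code below are illustrative assumptions.

```python
# A minimal sketch of Flipped-style inference: choose the label option under
# which a language model assigns the highest likelihood to the *task
# instruction*, i.e. argmax over labels of log P(instruction | input, label).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def instruction_log_likelihood(input_text, label, instruction):
    """Approximate log P(instruction | input, label) under a causal LM."""
    context = f"{input_text} {label}"
    full_ids = tokenizer(context + " " + instruction, return_tensors="pt").input_ids
    n_context = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_scores = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Sum only over the instruction tokens (positions after the context).
    return token_scores[:, n_context - 1:].sum().item()

def flipped_predict(input_text, options, instruction):
    return max(options,
               key=lambda o: instruction_log_likelihood(input_text, o, instruction))

print(flipped_predict("The movie was a delight from start to finish.",
                      ["positive", "negative"],
                      "Classify the sentiment of the review."))
```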
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning, finetuning a language model on a collection of tasks described via instructions, substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and the scale of the model are key components to the success of instruction tuning.
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All prompts and trained models are available at https://github.com/bigscience-workshop/promptsource and https://huggingface.co/bigscience/t0pp.
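The mapping from one supervised example to multiple prompted forms can be illustrated with a toy NLI example; the templates below are made up for illustration, while the real collection lives in promptsource.

```python
# A toy illustration of mapping one supervised NLI example into several
# differently-worded prompts. Templates are illustrative assumptions.
example = {"premise": "A man is playing a guitar.",
           "hypothesis": "A person is making music.",
           "label": "yes"}

templates = [
    lambda ex: (f"Premise: {ex['premise']}\nHypothesis: {ex['hypothesis']}\n"
                "Does the premise entail the hypothesis? Yes or no?",
                ex["label"]),
    lambda ex: (f"Suppose \"{ex['premise']}\" Can we infer that "
                f"\"{ex['hypothesis']}\"? Yes or no?", ex["label"]),
    lambda ex: (f"{ex['premise']} Based on this, is it true that "
                f"\"{ex['hypothesis']}\"? Yes or no?", ex["label"]),
]

for template in templates:
    prompt, target = template(example)
    print(prompt, "->", target)
```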
Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills they have learnt during pre-training and reasoning outside of their training context, or are they simply memorizing their training corpus at finer granularity and have learnt to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability by comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills; it spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. With extensive empirical analysis we find that language models learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage than during pretraining. We also find that when language models are finetuned they tend to overfit to the prompt template, which hurts model robustness and causes generalization problems.
Generating a chain of thought (CoT) can increase large language model (LLM) performance on a wide range of tasks. Zero-shot CoT evaluations, however, have been conducted primarily on logical tasks (e.g. arithmetic, commonsense QA). In this paper, we perform a controlled evaluation of zero-shot CoT across two sensitive domains: harmful questions and stereotype benchmarks. We find that using zero-shot CoT reasoning in a prompt can significantly increase a model's likelihood to produce undesirable output. Without future advances in alignment or explicit mitigation instructions, zero-shot CoT should be avoided on tasks where models can make inferences about marginalized groups or harmful topics.
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 heldout datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate the effectiveness of SPT.
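A rough sketch of the memory-bank retrieval idea, under assumed shapes and a cosine-similarity retrieval rule (the paper's exact architecture differs in detail):

```python
# Embed the discrete task prompt, retrieve the closest entries from a memory
# bank of trainable prompt vectors, and prepend them to the input. Shapes,
# the similarity rule, and top-k are illustrative assumptions.
import torch

n_memories, d = 32, 768
memory_keys = torch.randn(n_memories, d)         # one key per memory slot
memory_prompts = torch.randn(n_memories, 8, d)   # 8 prompt vectors per slot

def retrieve_memory_prompts(discrete_prompt_embedding, top_k=2):
    sims = torch.nn.functional.cosine_similarity(
        memory_keys, discrete_prompt_embedding.unsqueeze(0), dim=-1)
    top = sims.topk(top_k).indices
    # Concatenate the retrieved prompt vectors for prepending to the input.
    return memory_prompts[top].reshape(-1, d)

prompt_emb = torch.randn(d)  # e.g. a pooled embedding of the instruction
soft_prefix = retrieve_memory_prompts(prompt_emb)
print(soft_prefix.shape)     # torch.Size([16, 768])
```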
Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.
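The adaptive-retrieval idea reduces to a simple decision rule: consult the retriever only for questions about low-popularity entities. A minimal sketch, with a hypothetical popularity threshold and stand-in helper callables:

```python
# Call the (expensive) retriever only for tail entities and let the
# parametric LM answer popular ones alone. The threshold value, the
# popularity source (e.g. Wikipedia page views), and the helper callables
# are assumptions, not the paper's exact settings.
POPULARITY_THRESHOLD = 10_000  # hypothetical monthly page-view cutoff

def answer(question, entity_popularity, lm_answer, retrieve_and_answer):
    if entity_popularity >= POPULARITY_THRESHOLD:
        # Head entity: parametric memory is usually sufficient and cheaper.
        return lm_answer(question)
    # Tail entity: augment the prompt with retrieved passages.
    return retrieve_and_answer(question)

# Usage with stand-in callables:
print(answer("Who wrote Hamlet?", 250_000,
             lambda q: "William Shakespeare (parametric)",
             lambda q: "William Shakespeare (retrieval-augmented)"))
```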
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis and negotiation. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce the research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons behind the emergence of such reasoning abilities and highlight future research directions.
Language models (LMs) have proven to be useful in various downstream applications, such as summarization, translation, question answering, and text classification. Because of the vast quantity of information they can store, LMs are becoming an increasingly important tool in artificial intelligence. In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large language model originally proposed by OpenAI in 2020, to perform the task of Knowledge Base Construction (KBC). ProP implements a multi-step approach that combines a variety of prompting techniques to achieve this. Our results show that manual prompt curation is essential, that the LM must be encouraged to give answer sets of variable length (in particular including empty answer sets), that true/false questions are a useful device for increasing the precision of the LM's generated suggestions, that the size of the LM is a crucial factor, and that a dictionary of entity aliases improves the LM's score. Our evaluation study indicates that these proposed techniques can substantially enhance the quality of the final predictions: ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points. Our implementation is available at https://github.com/hemile/iswc-challenge.
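Two of the reported ingredients, variable-length (possibly empty) answer sets and true/false verification questions, can be sketched as prompt builders; the wording below is made up, not the paper's curated prompts.

```python
# An illustrative two-step ProP-style flow for a (subject, relation) query:
# (1) prompt for a possibly-empty answer set, (2) verify each candidate with
# a true/false question. Prompt wording is an assumption.
def generation_prompt(subject, relation):
    return (f"List every correct object for the pair ({subject}, {relation}), "
            "one per line.\nIf there are none, answer exactly: NONE.\nAnswers:")

def verification_prompt(subject, relation, candidate):
    return (f"Is the statement \"{subject} {relation} {candidate}\" true? "
            "Answer True or False.")

print(generation_prompt("Netherlands", "shares-border-with"))
print(verification_prompt("Netherlands", "shares-border-with", "Germany"))
```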
Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
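A sketch of the elicitation step: three seed instruction examples are concatenated and a fourth slot is left for the model to complete. The seed texts and formatting are illustrative assumptions; the paper's exact prompt and decoding setup differ.

```python
# Concatenate seed instruction examples and leave a final slot for the model
# to fill. Seeds and formatting are made up for illustration.
seed_examples = [
    "Instruction: Translate the following sentence into French.\n"
    "Input: Good morning, everyone.",
    "Instruction: Summarize the paragraph in one sentence.\n"
    "Input: The committee met twice this week to discuss the budget.",
    "Instruction: List three synonyms for the given word.\n"
    "Input: happy",
]

def build_elicitation_prompt(seeds):
    parts = [f"Example {i + 1}\n{seed}" for i, seed in enumerate(seeds)]
    parts.append(f"Example {len(seeds) + 1}\n")  # the model completes this slot
    return "\n\n".join(parts)

prompt = build_elicitation_prompt(seed_examples)
# `prompt` is then sent to a language model; sampling many completions of the
# final slot yields new instruction-input pairs, which a second rephrasing
# prompt later expands.
print(prompt)
```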
Pretrained large language models (LLMs) are widely used in many subfields of natural language processing (NLP) and are generally known as excellent few-shot learners with task-specific exemplars. Notably, chain-of-thought (CoT) prompting, a technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' few-shot learning ability, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that, using the same single prompt template, our Zero-shot-CoT significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks, including arithmetic (MultiArith, GSM8K, AQuA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples; for instance, it increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with the 175B-parameter InstructGPT model, with similar improvements for another off-the-shelf large model, the 540B-parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting that high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
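The recipe described in the abstract is a two-stage prompt: first elicit a reasoning chain with "Let's think step by step.", then extract the final answer with a second prompt. A minimal sketch, where `generate` is a stand-in for any text-completion call (an assumption, not a specific API):

```python
# A minimal sketch of Zero-shot-CoT's two-stage prompting.
def zero_shot_cot(question, generate):
    # Stage 1: reasoning extraction.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: answer extraction, conditioned on the generated chain.
    answer_prompt = (f"{reasoning_prompt}{reasoning}\n"
                     "Therefore, the answer (arabic numerals) is")
    return generate(answer_prompt).strip()
```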
Task agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with a population of more than 400 million, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size between 300 million and 13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models, focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.
Large Language Models are affected by the phenomena of memorizing and forgetting their training data. But how do these vary by model size? We work towards this question by investigating how model size affects a model's ability to discriminate a word's meaning in a given context. We introduce a dataset called DeltaWords, which evaluates a model's ability to follow instructions to select a sentence that replaces the target word with its antonym. We show a weak inverse scaling trend, where task accuracy degrades as model size increases, under extremely few-shot prompting regimes. We also show that increasing the number of examples tends to disproportionately benefit larger models over smaller ones.
Pragmatics is an essential part of communication, but it remains unclear what mechanisms underlie human pragmatic communication and whether NLP systems capture pragmatic language understanding. To investigate both these questions, we perform a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials. We ask whether models (1) select pragmatic interpretations of speaker utterances, (2) make similar error patterns as humans, and (3) use similar linguistic cues as humans to solve the tasks. We find that the largest models achieve high accuracy and match human error patterns: within incorrect responses, models favor the literal interpretation of an utterance over heuristic-based distractors. We also find evidence that models and humans are sensitive to similar linguistic cues. Our results suggest that even paradigmatic pragmatic phenomena may be solved without explicit representations of other agents' mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing.
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or ``chain-of-thought'' (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
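Since the abstract name-checks Prolog's backward chaining, a compact symbolic sketch may help: the toy prover below answers a grandparent query by resolving goals against facts and rules, the control flow LMLP is described as recovering. The KB and code are illustrative, not the paper's LM-driven implementation.

```python
# A toy backward-chaining prover over a tiny KB. Variables are strings
# starting with an uppercase letter; facts and rule atoms are tuples.
import itertools

FACTS = {("parent", "ann", "bob"), ("parent", "bob", "carl")}
RULES = [  # head holds if every body atom holds
    (("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")]),
]
_fresh = itertools.count()

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):          # follow variable bindings to their value
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):      # extend substitution s so that a and b match
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

def rename(atom, k):     # give rule variables a fresh name per use
    return tuple(f"{t}_{k}" if is_var(t) else t for t in atom)

def prove(goals, s=None):
    s = {} if s is None else s
    if not goals:
        yield s
        return
    goal, rest = goals[0], goals[1:]
    for fact in FACTS:                    # resolve against a known fact
        s2 = unify(goal, fact, s)
        if s2 is not None:
            yield from prove(rest, s2)
    for head, body in RULES:              # or expand a rule backwards
        k = next(_fresh)
        s2 = unify(goal, rename(head, k), s)
        if s2 is not None:
            yield from prove([rename(a, k) for a in body] + rest, s2)

for s in prove([("grandparent", "ann", "Who")]):
    print(walk("Who", s))                 # -> carl
```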
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Table question answering (TQA) is an important but under-explored task. Most existing QA datasets are in unstructured text format, and only a few use tables as context. To the best of our knowledge, no TQA dataset exists in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTABQA, using 22 templates and contexts from a biomedical textbook on differential diagnosis. BioTABQA can not only be used to teach models how to answer questions from tables, but also to evaluate how models generalize to unseen questions, an important scenario for biomedical applications. To enable this generalization evaluation, we divide the templates into 17 for training and 5 for cross-task evaluation. We then develop two baselines using single- and multi-task learning on BioTABQA. Furthermore, we explore instructional learning, a technique that has shown impressive generalization performance. Experimental results show that our instruction-tuned model outperforms the single- and multi-task baselines on average across various evaluation settings and, more importantly, outperforms the baselines by about 5% on cross-task evaluation.