Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is coupled with the extent to which the model is familiar with the language it contains. Over a wide range of tasks, we show that the lower the perplexity of the prompt is, the better the prompt is able to perform the task. As a result, we devise a method for creating prompts: (1) automatically extend a small seed set of manually written prompts by paraphrasing using GPT3 and backtranslation and (2) choose the lowest perplexity prompts to get significant gains in performance.
translated by 谷歌翻译
Large language models can perform new tasks in a zero-shot fashion, given natural language prompts that specify the desired behavior. Such prompts are typically hand engineered, but can also be learned with gradient-based methods from labeled data. However, it is underexplored what factors make the prompts effective, especially when the prompts are natural language. In this paper, we investigate common attributes shared by effective prompts. We first propose a human readable prompt tuning method (F LUENT P ROMPT) based on Langevin dynamics that incorporates a fluency constraint to find a diverse distribution of effective and fluent prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of label words. Based on these findings, we also propose a method for generating prompts using only unlabeled data, outperforming strong baselines by an average of 7.0% accuracy across three tasks.
translated by 谷歌翻译
Task agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties with more than $400$ million population, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size between 300 million-13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models focused at investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models with interested researchers, along with code for experimenting with them
translated by 谷歌翻译
Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as "Obama is a by profession". These prompts are usually manually created, and quite possibly suboptimal; another prompt such as "Obama worked as a " may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github. com/jzbjyb/LPAQA.1 Some models we use in this paper, e.g. BERT (Devlin et al., 2019), are bi-directional, and do not directly define probability distribution over text, which is the underlying definition of an LM. Nonetheless, we call them LMs for simplicity.
translated by 谷歌翻译
GPT-3等大型自回归语言模型是几秒钟的学习者,可以在没有微调的情况下执行各种语言任务。虽然已知这些模型能够共同代表许多不同的语言,但他们的培训数据由英语主导,可能限制了它们的交叉概括。在这项工作中,我们在覆盖多种语言的平衡语料库上培训多语言自回归语言模型,并在广泛的任务中研究他们几乎没有零点的学习能力。我们最大的模型,具有75亿参数,在20多种代表语言中,在几种代表语言中,在几种代表性语言中,在几种代表性语言中,在多语言型号推理中表现出可比大小的GPT-3(在0次设置和0次拍摄设置中的绝对精度改善+ 7.4% 4-拍摄设置中的9.4%)和自然语言推理(每次拍摄和4次设置中的每一个+ 5.4%)。在Flores-101机器翻译基准测试中,我们的模型优于GPT-3在182个翻译方向上有32个培训例子,同时超过45个方向的官方监督基线。我们介绍了模型成功和失败的位置的详细分析,特别是它尤其显示在某些任务中实现交叉语境的内容学习,而仍然存在改善表面的鲁棒性和适应没有a的任务的余地自然冻结形式。最后,我们评估我们在仇恨语音检测中以五种语言的仇恨语音检测的模型,并发现它具有与可比大小的GPT-3模型类似的限制。
translated by 谷歌翻译
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
translated by 谷歌翻译
参数效率的方法能够使用单个冷冻的预训练的大语言模型(LLM)来通过学习特定于任务的软提示来执行许多任务,从而在串联到输入文本时调节模型行为。但是,这些学习的提示与给定的冷冻模型紧密耦合 - 如果模型已更新,则需要获得相应的新提示。在这项工作中,我们提出并调查了几种“提示回收”的方法,其中将在源模型上进行了及时培训以与新目标模型一起使用。我们的方法不依赖于目标模型的有监督的提示,特定于任务的数据或培训更新,这与从头开始的目标模型重新调整提示一样昂贵。我们表明,模型之间的回收是可能的(我们的最佳设置能够成功回收$ 88.9 \%的提示,从而产生一个提示,即表现出色的基线),但是剩下的大量性能净空,需要改进的回收技术。
translated by 谷歌翻译
In-context learning (ICL) enables large language models (LLMs) to perform new tasks by prompting them with a sequence of training examples. However, ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that curating a carefully chosen subset of training data greatly stabilizes ICL performance. We propose two methods to choose training subsets, both of which score training examples individually and then select the highest-scoring ones. CondAcc scores a training example by its average ICL accuracy when combined with random training examples, while Datamodels learns a linear proxy model that estimates how the presence of each training example influences LLM accuracy. On average, CondAcc and Datamodels outperform sampling from the entire training set by 7.7% and 6.3%, respectively, across 5 tasks and two LLMs. Our analysis shows that stable subset examples are no more diverse than average, and are not outliers in terms of sequence length and perplexity.
translated by 谷歌翻译
大型语言模型经常经过数十万个计算天的训练,已经显示出零和少数学习的显着功能。鉴于它们的计算成本,如果没有大量资本,这些模型很难复制。对于通过API可用的少数产品,没有访问完整的模型权重,因此很难学习。我们提供开放训练的预训练变压器(OPT),这是一套仅解码器预训练的变压器,范围从12500万到175b参数,我们旨在与感兴趣的研究人员完全和负责任地分享。我们表明,OPT-175B与GPT-3相当,而仅需要1/7碳足迹才能开发。我们还释放了日志,详细介绍了我们面临的基础架构挑战,以及用于尝试所有发布模型的代码。
translated by 谷歌翻译
通过自我监督的学习预先训练的大型语言模型在各种各样的任务上表现出令人印象深刻的零击功能。在这项工作中,我们介绍了Welm:一种针对中文的精心读取的预训练的语言模型,能够无缝执行不同类型的任务,以零或几次演示。 Welm通过“阅读”涵盖广泛主题的精选高质量语料库来接受10b参数的培训。我们表明,韦尔姆拥有有关各种领域和语言的广泛知识。在18个单语(中文)任务中,WELM可以大大优于现有的预训练模型,尺寸相似,并匹配高达25倍大的模型的性能。韦尔姆还表现出强大的多种语言和代码转换理解的能力,优于预先对30种语言进行预培训的现有多语言模型。此外,我们收集了人工编写的提示,并通过多次培训进行了大量的中文和微调韦尔姆的监督数据集。最终的模型可以实现对看不见的任务类型的强烈概括,并在零射门学习中优于无监督的韦尔姆。最后,我们证明韦尔姆具有解释和校准自己的决策的基本技能,这可能是未来研究的有希望的方向。我们的模型可以从https://welm.weixin.qq.com/docs/api/应用。
translated by 谷歌翻译
本文探讨了提高语言模型的零次学习能力的简单方法。我们表明,指令调整 - 通过对说明书中所述的任务集合微调语言模型 - 大幅提升零射门上看不见任务中的表现。我们采取预训练的语言模型和指令调整它通过自然语言指令模板语言表达了60NLP任务137B参数。我们评估这种指令调整模型,我们称之为FLAN,在看不见的任务类型。FLAN显着改善其未修饰的对应的性能和超过25的20个任务,我们评估零射门175BGPT-3。FLAN甚至GPT-3通过在安利,RTE,BoolQ,AI2-ARC,OpenbookQA和StoryCloze大比分胜过几拍。消融研究显示任务和模型的规模,这个数字是指令调整取得成功的关键组成部分。
translated by 谷歌翻译
最近已被证明大型语言模型在各种任务集中获得合理的零射普通化(Brown等,2020)。它已经假设这是语言模型的隐式多任务学习的结果,在语言模型中的预押(Radford等,2019)。可以通过明确的多任务学习直接引起零拍常规化?为了以缩放测试这个问题,我们开发一个系统,以便轻松地将任何自然语言任务映射到人类可读的提示表单中。我们转换一组大量的监督数据集,每个数据集都有多个提示,具有不同的措辞。这些提示的数据集允许基准测试模型执行完全看不见的任务的能力。我们介绍了一个普拉克尔编码器 - 解码器模型(Raffel等,2020; Lester等,2021),覆盖各种任务。该模型在多个标准数据集中达到强大的零点性能,通常优于其尺寸的型号超过16倍。此外,我们的方法对来自Big-替补基准测试的任务子集具有强烈性能,优于其尺寸的6倍。所有提示和培训的型号都可以在https://github.com/ bigscience-workshop / protectsource / httpsource / https://huggingface.co/bigscience/t0pp。
translated by 谷歌翻译
少量学习时,基于及时的方法很强劲。然而,Perez等人。 (2021年)最近对他们的表现产生了疑问,因为它们难以在“真实”的几次拍摄设置中获得良好的结果,其中提示和超级参数无法在DEV集上调整。鉴于此,我们对PET进行了广泛的研究,该方法将文本指令与基于示例的FENETUNING结合起来。我们表明,如果正确配置,宠物在真正的几次拍摄设置中强烈执行,即,没有开发装置。这对这种强大的表现至关重要是宠物智能处理多个提示的能力。然后,我们通过在RAFT上运行PET来将我们的调查结果置于真实世界的测试中,直接从现实的NLP应用程序采取的任务的基准,没有标记的开发或测试集。宠物在筏上实现了新的艺术状态,并且在11个任务中靠近非专家人类进行了近距离进行。这些结果表明,基于及时的学习者像宠物Excel这样的真正的几次拍摄学习和支持我们的信念,即从指示中学习的信念将在人类少量学习能力的路径上发挥重要作用。
translated by 谷歌翻译
在原始文本中训练的语言模型(LMS)无法直接访问物理世界。 Gordon和Van Durme(2013)指出,LMS因此可能会遭受报告偏见的困扰:文本很少报告常见事实,而是关注情况的异常方面。如果LMS仅接受文本语料库的培训,并天真地记住当地的同时出现统计数据,那么他们自然会学会对物理世界的偏见。虽然先前的研究反复验证了较小尺度的LM(例如Roberta,GPT-2)放大了报告偏差,但在模型扩展时,这种趋势是否继续。我们从较大语言模型(LLM)(例如Palm和GPT-3)中从颜色的角度研究报告偏见。具体而言,我们查询llms对物体的典型颜色,这是一种简单的感知扎根的物理常识。令人惊讶的是,我们发现LLM在确定对象的典型颜色和更紧密地跟踪人类判断方面的表现明显优于较小的LMS,而不是过于适应文本中存储的表面图案。这表明,仅凭语言的大型语言模型就能克服以局部共发生为特征的某些类型的报告偏差。
translated by 谷歌翻译
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF-better few-shot fine-tuning of language models 1 -a suite of simple and complementary techniques for finetuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning. 2 * The first two authors contributed equally. 1 Alternatively, language models' best friends forever. 2 Our implementation is publicly available at https:// github.com/princeton-nlp/LM-BFF.
translated by 谷歌翻译
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al.,2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
translated by 谷歌翻译
Large-scale generative models show an impressive ability to perform a wide range of Natural Language Processing (NLP) tasks using in-context learning, where a few examples are used to describe a task to the model. For Machine Translation (MT), these examples are typically randomly sampled from the development dataset with a similar distribution as the evaluation set. However, it is unclear how the choice of these in-context examples and their ordering impacts the output translation quality. In this work, we aim to understand the properties of good in-context examples for MT in both in-domain and out-of-domain settings. We show that the translation quality and the domain of the in-context examples matter and that 1-shot noisy unrelated example can have a catastrophic impact on output quality. While concatenating multiple random examples reduces the effect of noise, a single good prompt optimized to maximize translation quality on the development dataset can elicit learned information from the pre-trained language model. Adding similar examples based on an n-gram overlap with the test source significantly and consistently improves the translation quality of the outputs, outperforming a strong kNN-MT baseline in 2 out of 4 out-of-domain datasets.
translated by 谷歌翻译
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fillin-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge, however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AUTOPROMPT, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AUTO-PROMPT, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
translated by 谷歌翻译
姿态检测的目标是确定以目标朝向目标的文本中表达的视点。这些观点或上下文通常以许多不同的语言表达,这取决于用户和平台,这可以是本地新闻插座,社交媒体平台,新闻论坛等。然而,姿态检测的大多数研究已经限于使用单一语言和几个有限的目标,在交叉舌姿态检测很少有效。此外,标记数据的非英语来源通常稀缺,并具有额外的挑战。最近,大型多语言语言模型在许多非英语任务上大大提高了性能,尤其是具有有限数量的示例。这突出了模型预培训的重要性及其从少数例子中学习的能力。在本文中,我们展示了对日期交叉姿态检测的最全面的研究:我们在6名语言系列中使用12种语言的12种不同的数据集进行实验,每个都有6个低资源评估设置。对于我们的实验,我们构建了模式开发培训,提出了添加一种新颖的标签编码器来简化言语程序。我们进一步提出了基于情绪的姿态数据进行预培训,这在与几个强的基线相比,在低拍摄环境中显示了大量的6%F1绝对的增长。
translated by 谷歌翻译
petroni等。 (2019)证明,可以通过将它们表达为冻结式提示并将模型的预测准确性解释为下限,作为其编码的事实信息量的较低限制,从预先接收的语言模型中检索世界事实。随后的工作已经尝试通过搜索更好的提示来缩回估计,使用不相交的事实作为培训数据。在这项工作中,我们制作两个互补贡献,以更好地了解这些事实探测技术。首先,我们提出了OptiPrompt,一种新颖的和有效的方法,直接在连续嵌入空间中优化。我们发现这种简单的方法能够预测喇嘛基准中的额外6.4%的事实。其次,我们提出了一个更重要的问题:我们真的可以将这些探测结果解释为下限吗?这些提示搜索方法是否有可能从培训数据中学习?我们发现,有些令人惊讶的是,这些方法使用的培训数据包含了潜在的事实分布的某些规则,以及所有现有的提示方法,包括我们的方法,可以利用它们以获得更好的事实预测。我们开展一系列控制实验来解除“学习”从“学习召回”,提供了更详细的图片,不同的提示可以揭示关于预先接受的语言模型。
translated by 谷歌翻译