With a handful of demonstration examples, large-scale language models show strong capability to perform various tasks by in-context learning from these examples, without any fine-tuning. We demonstrate that in-context learning performance can be highly unstable across samples of examples, indicating the idiosyncrasies of how language models acquire information. We formulate example selection for in-context learning as a sequential decision problem, and propose a reinforcement learning algorithm for identifying generalizable policies to select demonstration examples. For GPT-2, our learned policies demonstrate strong abilities of generalizing to unseen tasks in training, with a $5.8\%$ improvement on average. Examples selected from our learned policies can even achieve a small improvement on GPT-3 Ada. However, the improvement diminishes on larger GPT-3 models, suggesting emerging capabilities of large language models.
translated by 谷歌翻译
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few training examples. It has been a new trend exploring ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies, prompting strategies, and so on. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and improving ICL in future work.
translated by 谷歌翻译
许多最新的自然语言任务方法都建立在大型语言模型的非凡能力上。大型语言模型可以执行内在的学习,他们可以从几个任务演示中学习新任务,而无需任何参数更新。这项工作研究了对新自然语言任务的数据集创建数据集的含义。与最近的文化学习方法背道而驰,我们制定了一个注释效率的两步框架:选择性注释,选择一个示例池,以提前从未标记的数据中从未标记的数据中进行注释,然后及时检索从注释的池中检索任务示例测试时间。基于此框架,我们提出了一种无监督的,基于图的选择性注释方法VOKE-K,以选择各种代表性的示例进行注释。在10个数据集上进行了广泛的实验(涵盖分类,常识性推理,对话和文本/代码生成)表明,我们的选择性注释方法通过很大的利润提高了任务性能。与随机选择示例进行注释相比,Pote-K平均在注释预算下获得了12.9%/11.4%的相对增益。与最先进的监督登录方法相比,它的性能相似,而在10个任务中的注释成本降低了10-100倍。我们在各种情况下进一步分析了框架的有效性:具有不同大小的语言模型,替代选择性注释方法以及有测试数据域移动的情况。我们希望我们的研究将作为数据注释的基础,因为大型语言模型越来越多地应用于新任务。我们的代码可在https://github.com/hkunlp/icl-selactive-annotation上找到。
translated by 谷歌翻译
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF-better few-shot fine-tuning of language models 1 -a suite of simple and complementary techniques for finetuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning. 2 * The first two authors contributed equally. 1 Alternatively, language models' best friends forever. 2 Our implementation is publicly available at https:// github.com/princeton-nlp/LM-BFF.
translated by 谷歌翻译
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
translated by 谷歌翻译
Despite the surprising few-shot performance of in-context learning (ICL), it is still a common practice to randomly sample examples to serve as context. This paper advocates a new principle for ICL: self-adaptive in-context learning. The self-adaption mechanism is introduced to help each sample find an in-context example permutation (i.e., selection and ordering) that can derive the correct prediction, thus maximizing performance. To validate the effectiveness of self-adaptive ICL, we propose a general select-then-rank framework and instantiate it with new selection and ranking algorithms. Upon extensive evaluation on eight different NLP datasets, our self-adaptive ICL method achieves a 40% relative improvement over the common practice setting. Further analysis reveals the enormous potential of self-adaptive ICL that it might be able to close the gap between ICL and finetuning given more advanced algorithms. Our code is released to facilitate future research in this area: https://github.com/Shark-NLP/self-adaptive-ICL
translated by 谷歌翻译
Large language models can perform new tasks in a zero-shot fashion, given natural language prompts that specify the desired behavior. Such prompts are typically hand engineered, but can also be learned with gradient-based methods from labeled data. However, it is underexplored what factors make the prompts effective, especially when the prompts are natural language. In this paper, we investigate common attributes shared by effective prompts. We first propose a human readable prompt tuning method (F LUENT P ROMPT) based on Langevin dynamics that incorporates a fluency constraint to find a diverse distribution of effective and fluent prompts. Our analysis reveals that effective prompts are topically related to the task domain and calibrate the prior probability of label words. Based on these findings, we also propose a method for generating prompts using only unlabeled data, outperforming strong baselines by an average of 7.0% accuracy across three tasks.
translated by 谷歌翻译
Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
translated by 谷歌翻译
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. * equal contribution. † equal advising. Orders randomized.34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
GPT-3等大型语言模型是优秀的几次学习者,允许他们通过自然文本提示来控制。最近的研究报告称,基于及时的直接分类消除了对微调的需求,但缺乏数据和推理可扩展性。本文提出了一种新的数据增强技术,利用大规模语言模型来生成来自真实样本的混合的现实文本样本。我们还建议利用语言模型预测的软标签,从大规模语言模型中有效地蒸馏知识并同时创建文本扰动。我们对各种分类任务进行数据增强实验,并显示我们的方法非常优于现有的文本增强方法。消融研究和定性分析为我们的方法提供了更多的见解。
translated by 谷歌翻译
离线增强学习(RL)可以从静态数据集中学习控制策略,但是像标准RL方法一样,它需要每个过渡的奖励注释。在许多情况下,将大型数据集标记为奖励可能会很高,尤其是如果人类标签必须提供这些奖励,同时收集多样的未标记数据可能相对便宜。我们如何在离线RL中最好地利用这种未标记的数据?一种自然解决方案是从标记的数据中学习奖励函数,并使用它标记未标记的数据。在本文中,我们发现,也许令人惊讶的是,一种简单得多的方法,它简单地将零奖励应用于未标记的数据可以导致理论和实践中的有效数据共享,而无需学习任何奖励模型。虽然这种方法起初可能看起来很奇怪(并且不正确),但我们提供了广泛的理论和经验分析,说明了它如何摆脱奖励偏见,样本复杂性和分配变化,通常会导致良好的结果。我们表征了这种简单策略有效的条件,并进一步表明,使用简单的重新加权方法扩展它可以进一步缓解通过使用不正确的奖励标签引入的偏见。我们的经验评估证实了模拟机器人运动,导航和操纵设置中的这些发现。
translated by 谷歌翻译
In-context learning (ICL) enables large language models (LLMs) to perform new tasks by prompting them with a sequence of training examples. However, ICL is very sensitive to the choice of training examples: randomly sampling examples from a training set leads to high variance in performance. In this paper, we show that curating a carefully chosen subset of training data greatly stabilizes ICL performance. We propose two methods to choose training subsets, both of which score training examples individually and then select the highest-scoring ones. CondAcc scores a training example by its average ICL accuracy when combined with random training examples, while Datamodels learns a linear proxy model that estimates how the presence of each training example influences LLM accuracy. On average, CondAcc and Datamodels outperform sampling from the entire training set by 7.7% and 6.3%, respectively, across 5 tasks and two LLMs. Our analysis shows that stable subset examples are no more diverse than average, and are not outliers in terms of sequence length and perplexity.
translated by 谷歌翻译
人类可以利用先前的经验,并从少数示威活动中学习新颖的任务。与旨在通过更好的算法设计来快速适应的离线元强化学习相反,我们研究了建筑归纳偏见对少量学习能力的影响。我们提出了一个基于及时的决策变压器(提示-DT),该变压器利用了变压器体系结构和及时框架的顺序建模能力,以在离线RL中实现少量适应。我们设计了轨迹提示,其中包含少量演示的片段,并编码特定于任务的信息以指导策略生成。我们在五个Mujoco控制基准中进行的实验表明,提示-DT是一个强大的少数学习者,而没有对看不见的目标任务进行任何额外的填充。提示-DT的表现优于其变体和强大的元线RL基线,只有一个轨迹提示符只包含少量时间段。提示-DT也很健壮,可以提示长度更改并可以推广到分布(OOD)环境。
translated by 谷歌翻译
当从人类行为中推断出奖励功能(无论是演示,比较,物理校正或电子停靠点)时,它已证明对人类进行建模作为做出嘈杂的理性选择,并具有“合理性系数”,以捕获多少噪声或熵我们希望看到人类的行为。无论人类反馈的类型或质量如何,许多现有作品都选择修复此系数。但是,在某些情况下,进行演示可能要比回答比较查询要困难得多。在这种情况下,我们应该期望在示范中看到比比较中更多的噪音或次级临时性,并且应该相应地解释反馈。在这项工作中,我们提倡,将每种反馈类型的实际数据中的理性系数扎根,而不是假设默认值,对奖励学习具有重大的积极影响。我们在模拟反馈以及用户研究的实验中测试了这一点。我们发现,从单一反馈类型中学习时,高估人类理性可能会对奖励准确性和遗憾产生可怕的影响。此外,我们发现合理性层面会影响每种反馈类型的信息性:令人惊讶的是,示威并不总是最有用的信息 - 当人类的行为非常卑鄙时,即使在合理性水平相同的情况下,比较实际上就变得更加有用。 。此外,当机器人确定要要求的反馈类型时,它可以通过准确建模每种类型的理性水平来获得很大的优势。最终,我们的结果强调了关注假定理性级别的重要性,不仅是在从单个反馈类型中学习时,尤其是当代理商从多种反馈类型中学习时,尤其是在学习时。
translated by 谷歌翻译
上下文学习是最近的自然语言理解的范例,其中大型预先接受的语言模型(LM)观察测试实例和一些训练示例作为其输入,并直接对输出进行解码,而不会对其参数进行任何更新。但是,表现已被证明强烈依赖于所选培训示例(称为提示)。在这项工作中,我们提出了一种有效的方法,用于使用注释的数据和LM检索内心学习的提示。给定输入输出对,我们估计给出输入和候选训练示例的输出的概率作为提示,以及基于这种概率的正面或负标记训练示例。然后,我们从该数据中培训一个有效的密集鼠尾,用于检索训练示例作为测试时间的提示。我们在三个序列到序列任务中评估我们的方法,其中语言话语映射到意义表示,并发现它基本上优于前面的工作和电路板的多个基线。
translated by 谷歌翻译
离线政策优化可能会对许多现实世界的决策问题产生重大影响,因为在线学习在许多应用中可能是不可行的。重要性采样及其变体是离线策略评估中一种常用的估计器类型,此类估计器通常不需要关于价值函数或决策过程模型功能类的属性和代表性能力的假设。在本文中,我们确定了一种重要的过度拟合现象,以优化重要性加权收益,在这种情况下,学到的政策可以基本上避免在最初的状态空间的一部分中做出一致的决策。我们提出了一种算法,以避免通过新的每个国家 - 邻居标准化约束过度拟合,并提供对拟议算法的理论理由。我们还显示了以前尝试这种方法的局限性。我们在以医疗风格的模拟器为中测试算法,该模拟器是从真实医院收集的记录数据集和连续的控制任务。这些实验表明,与最先进的批处理学习算法相比,所提出的方法的过度拟合和更好的测试性能。
translated by 谷歌翻译
Meta-training, which fine-tunes the language model (LM) on various downstream tasks by maximizing the likelihood of the target label given the task instruction and input instance, has improved the zero-shot task generalization performance. However, meta-trained LMs still struggle to generalize to challenging tasks containing novel labels unseen during meta-training. In this paper, we propose Flipped Learning, an alternative method of meta-training which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as Flipped, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 11B-sized Flipped outperforms zero-shot T0-11B and even a 16 times larger 3-shot GPT-3 (175B) on average by 8.4% and 9.7% points, respectively. Flipped gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped comes from improved generalization to novel labels. We release our code at https://github.com/seonghyeonye/Flipped-Learning.
translated by 谷歌翻译
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
translated by 谷歌翻译
最近的工作表明,离线增强学习(RL)可以作为序列建模问题(Chen等,2021; Janner等,2021)配制,并通过类似于大规模语言建模的方法解决。但是,RL的任何实际实例化也涉及一个在线组件,在线组件中,通过与环境的任务规定相互作用对被动离线数据集进行了预测的策略。我们建议在线决策变压器(ODT),这是一种基于序列建模的RL算法,该算法将离线预处理与统一框架中的在线填充融为一体。我们的框架将序列级熵正规仪与自回归建模目标结合使用,用于样品效率探索和填充。从经验上讲,我们表明ODT在D4RL基准上的绝对性能中与最先进的表现具有竞争力,但在填充过程中显示出更大的收益。
translated by 谷歌翻译
Although large language models can be prompted for both zero- and few-shot learning, performance drops significantly when no demonstrations are available. In this paper, we introduce Z-ICL, a new zero-shot method that closes the gap by constructing pseudo-demonstrations for a given test input using a raw text corpus. Concretely, pseudo-demonstrations are constructed by (1) finding the nearest neighbors to the test input from the corpus and pairing them with random task labels, and (2) applying a set of techniques to reduce the amount of direct copying the model does from the resulting demonstrations. Evaluation on nine classification datasets shows that Z-ICL outperforms previous zero-shot methods by a significant margin, and is on par with in-context learning with labeled training data in the few-shot setting. Overall, Z-ICL provides a significantly higher estimate of the zero-shot performance levels of a model, and supports future efforts to develop better pseudo-demonstrations that further improve zero-shot results.
translated by 谷歌翻译