Despite recent success in large language model (LLM) reasoning, LLMs still struggle with hierarchical multi-step reasoning such as generating complex programs. In these cases, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, based on hierarchical function descriptions in natural language. Parsel can be used across domains requiring hierarchical reasoning, e.g. code synthesis, theorem proving, and robotic planning. We demonstrate Parsel's capabilities by using it to generate complex programs that cannot currently be automatically implemented from a single description, and by backtranslating Python programs in the APPS dataset. Beyond modeling capabilities, Parsel allows problem-solving with high-level algorithmic designs, benefiting both students and professional programmers.
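Below is a minimal sketch of the hierarchical-decomposition idea behind Parsel: a task is split into natural-language function descriptions, each implemented (by a code LLM in Parsel; by hand here) and checked against small tests before being composed. The decomposition, function names, and tests are illustrative assumptions, not the paper's actual syntax.

```python
# Illustrative sketch: a top-level task decomposed into described subfunctions,
# each implemented and validated independently before composition.

def is_prime(n: int) -> bool:
    """Return True if n is a prime number."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def primes_below(limit: int) -> list[int]:
    """Return all primes strictly less than limit, using is_prime."""
    return [n for n in range(2, limit) if is_prime(n)]

# Per-function validation, mirroring Parsel's use of tests to check each
# generated implementation before the full program is assembled.
assert is_prime(7) and not is_prime(9)
assert primes_below(10) == [2, 3, 5, 7]
```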
Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback training by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.
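A hypothetical example in the spirit of AmbiBench is sketched below: the instruction alone is ambiguous, and the labeled in-context examples are what pin down the intended classification rule. The task, sentences, and labels are invented for illustration, not taken from the benchmark.

```python
# The instruction is ambiguous between "contains an animal" and "is outdoors";
# the labeled examples are what disambiguate the intended task.
instruction = "Label the sentence as True or False based on the salient feature."

examples = [
    ("The dog slept on the couch.", "True"),     # animal, indoors
    ("The hikers reached the summit.", "False"), # no animal, outdoors
]
query = "A cat watched the rain from the window."

prompt = instruction + "\n"
for sentence, label in examples:
    prompt += f"Sentence: {sentence}\nLabel: {label}\n"
prompt += f"Sentence: {query}\nLabel:"
print(prompt)  # given these examples, the intended rule is "contains an animal"
```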
What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.
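The following toy sketch mirrors the linear contrastive setting the abstract analyzes: each input carries two feature blocks, and a stochastic augmentation destroys one block at random, so neither block can suppress the learning of the other. The dimensions, encoder, and augmentation are assumptions made for illustration, not the paper's experimental setup.

```python
import torch
import torch.nn.functional as F

def augment(x: torch.Tensor) -> torch.Tensor:
    """Stochastically destroy one of two feature blocks (a label-destroying view)."""
    x = x.clone()
    half = x.shape[1] // 2
    if torch.rand(()) < 0.5:
        x[:, :half] = torch.randn_like(x[:, :half])  # destroy block A
    else:
        x[:, half:] = torch.randn_like(x[:, half:])  # destroy block B
    return x

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                 # similarities of all pairs
    targets = torch.arange(z1.shape[0])      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

x = torch.randn(32, 8)               # a batch of raw inputs
encoder = torch.nn.Linear(8, 4)      # linear encoder, as in the theoretical setting
loss = info_nce(encoder(augment(x)), encoder(augment(x)))
loss.backward()
```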
Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore these questions by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and of AI models diverge. The Geoclidean framework and dataset are publicly available for download.
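As a rough sketch of what a construction language like Geoclidean might express, the snippet below represents a geometric concept as a small program over Euclidean primitives, so that each execution yields a different visual instance of the same concept. The representation and primitives are hypothetical, not Geoclidean's actual syntax.

```python
import math
import random

def point():
    return (random.uniform(0, 1), random.uniform(0, 1))

def circle(center, through):
    return {"type": "circle", "center": center, "radius": math.dist(center, through)}

def line(p, q):
    return {"type": "line", "from": p, "to": q}

# "Two circles sharing a common point" as a reusable concept: every call
# produces a different instance, which is what a few-shot learner observes.
def concept_instance():
    a, b, shared = point(), point(), point()
    return [circle(a, shared), circle(b, shared)]

print(concept_instance())
```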
General mathematical reasoning is computationally undecidable, but humans routinely solve new problems. Moreover, discoveries developed over centuries are taught to subsequent generations quickly. What structure enables this, and how might that inform automated mathematical reasoning? We posit that central to both puzzles is the structure of procedural abstractions underlying mathematics. We explore this idea in a case study on 5 sections of beginning algebra on the Khan Academy platform. To define a computational foundation, we introduce Peano, a theorem-proving environment where the set of valid actions at any point is finite. We use Peano to formalize introductory algebra problems and axioms, obtaining well-defined search problems. We observe existing reinforcement learning methods for symbolic reasoning to be insufficient to solve harder problems. Adding the ability to induce reusable abstractions ("tactics") from its own solutions allows an agent to make steady progress, solving all problems. Furthermore, these abstractions induce an order to the problems, seen at random during training. The recovered order has significant agreement with the expert-designed Khan Academy curriculum, and second-generation agents trained on the recovered curriculum learn significantly faster. These results illustrate the synergistic role of abstractions and curricula in the cultural transmission of mathematics.
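The search framing described in the abstract can be sketched as follows: each state exposes a finite set of valid actions, and a solver searches for a path to the goal state. The toy rewrite rules below stand in for Peano's algebra axioms purely for illustration and are not the environment's actual action representation.

```python
from collections import deque

# Two toy "axioms" as string rewrites: drop "+ 0" and drop "* 1".
ACTIONS = {
    "add_zero": lambda s: s.replace(" + 0", ""),
    "mul_one":  lambda s: s.replace(" * 1", ""),
}

def solve(start: str, goal: str):
    """Breadth-first search over the finite action set available at each state."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for name, act in ACTIONS.items():
            nxt = act(state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

print(solve("x + 0 * 1", "x"))  # ['add_zero', 'mul_one']
```

A reusable tactic, in this framing, would be a macro-action composed of several primitive rewrites that the agent induces from its own solutions.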
Probabilistic models of language understanding are interpretable and structured; for example, models of metaphor understanding describe inference about latent topics and features. However, these models are manually designed for a specific task. Large language models (LLMs) can perform many tasks through in-context learning, but they lack the explicit structure of probabilistic models. In this paper, we use chain-of-thought prompts to introduce the structure of a probabilistic model into an LLM. These prompts lead the model to infer latent variables and reason about their relationships in order to choose an appropriate paraphrase of a metaphor. The latent variables and relationships are drawn from theories of metaphor understanding in cognitive psychology. We apply these prompts to the two largest versions of GPT-3 and show that they improve paraphrase selection.
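A hypothetical prompt in this style is shown below: the chain-of-thought step makes the latent topic and feature explicit before the paraphrase is chosen. The wording and examples are invented for illustration, not the paper's actual prompts.

```python
# Chain-of-thought template: reason about the latent topic and the feature
# being transferred, then select the paraphrase.
prompt = """Metaphor: "My lawyer is a shark."
Question: Which paraphrase is intended?
(a) My lawyer is aggressive. (b) My lawyer can swim.
Reasoning: The topic is the lawyer. The salient feature of a shark being
applied to the lawyer is aggressiveness, not the ability to swim.
Answer: (a)

Metaphor: "The classroom was a zoo."
Question: Which paraphrase is intended?
(a) The classroom housed animals. (b) The classroom was chaotic.
Reasoning:"""
print(prompt)  # the model is expected to infer topic and feature, then answer (b)
```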
Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating latent variables given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyperparameters, are computationally expensive, and/or only work for restricted classes of programs. Here, we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and their assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, which defines an approximate posterior distribution. By optimizing a single neural network across a range of programs, we amortize the cost of training, yielding a "foundation" posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, both zero-shot and fine-tuned, on a benchmark of Stan programs.
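A toy sketch of this recipe: sample full variable assignments from a simple generative program, mask a random subset, and train a network to reconstruct the masked values, yielding an amortized approximate posterior. The one-latent Gaussian program and the architecture below are assumptions for illustration only, not the paper's models.

```python
import torch
import torch.nn as nn

def run_program(n: int) -> torch.Tensor:
    """A toy probabilistic program: latent mu, observation x = mu + noise."""
    mu = torch.randn(n, 1)                        # latent variable
    x = mu + 0.5 * torch.randn(n, 1)              # observed variable
    return torch.cat([mu, x], dim=1)              # full assignment (mu, x)

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(200):
    vals = run_program(64)
    mask = (torch.rand_like(vals) < 0.5).float()  # 1 = masked
    inputs = torch.cat([vals * (1 - mask), mask], dim=1)
    pred = net(inputs)
    loss = ((pred - vals) ** 2 * mask).mean()     # reconstruct masked values only
    loss.backward(); opt.step(); opt.zero_grad()

# At test time, masking mu while observing x gives an amortized point estimate
# of the posterior mean E[mu | x].
```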
Language use differs dramatically from context to context. To some degree, modern language models like GPT-3 are able to account for such variance by conditioning on a string of previous input text, or prompt. Yet prompting is ineffective when contexts are sparse, out-of-sample, or extra-textual; for instance, accounting for when and where the text was produced or who produced it. In this paper, we introduce the mixed-effects transformer (MET), a novel approach for learning hierarchically-structured prefixes -- lightweight modules prepended to the input -- to account for structured variation. Specifically, we show how the popular class of mixed-effects models may be extended to transformer-based architectures using a regularized prefix-tuning procedure with dropout. We evaluate this approach on several domain-adaptation benchmarks, finding that it efficiently adapts to novel contexts with minimal data while still effectively generalizing to unseen contexts.
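A minimal sketch of the hierarchical-prefix idea: a shared prefix (the fixed effect) plus a per-context prefix (the random effect), regularized toward zero and subject to dropout, is prepended to the token embeddings of an otherwise frozen model. Shapes, initialization, and the regularizer weight are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class HierarchicalPrefix(nn.Module):
    def __init__(self, n_contexts, prefix_len=4, d_model=128, p_drop=0.1):
        super().__init__()
        self.shared = nn.Parameter(torch.zeros(prefix_len, d_model))
        self.per_context = nn.Parameter(torch.zeros(n_contexts, prefix_len, d_model))
        self.drop = nn.Dropout(p_drop)

    def forward(self, token_embeds, context_id):
        # shared prefix plus a dropped-out, context-specific deviation
        prefix = self.shared + self.drop(self.per_context[context_id])
        return torch.cat([prefix.expand(token_embeds.shape[0], -1, -1),
                          token_embeds], dim=1)

    def regularizer(self, weight=1e-3):
        # shrink context-specific deviations toward the shared prefix
        return weight * self.per_context.pow(2).sum()

prefixes = HierarchicalPrefix(n_contexts=10)
embeds = torch.randn(2, 16, 128)          # (batch, seq, d_model) token embeddings
out = prefixes(embeds, context_id=3)      # (2, 20, 128) with the prefix prepended
```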
Contrastive learning has made considerable progress in computer vision, outperforming supervised pretraining on a range of downstream datasets. However, is contrastive learning the better choice in all situations? We demonstrate two cases where it is not. First, under sufficiently small pretraining budgets, supervised pretraining on ImageNet consistently outperforms a comparable contrastive model on eight diverse image classification datasets. This suggests that the common practice of comparing pretraining approaches at hundreds or thousands of epochs may not produce actionable insights for those with limited compute budgets. Second, even with larger pretraining budgets, we identify tasks on which supervised learning prevails, perhaps because the object-centric bias of supervised pretraining makes the model more resilient to common corruptions and spurious foreground-background correlations. These results underscore the need to characterize the trade-offs of different pretraining objectives across a wider range of contexts and training regimes.
Distillation efforts have produced language models that are more compact without severe drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model, i.e. a simpler model with the same causal structure. IIT is fully differentiable, easy to implement, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
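A rough sketch of an interchange intervention objective: run teacher and student on a base and a source input, swap an aligned slice of the intermediate representation from source into base in both models, and push the student's intervened output toward the teacher's. The two-layer models and the choice of intervention site below are illustrative assumptions, not the paper's BERT setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def intervened_output(model, base, source, dims=slice(0, 4)):
    """Run the model on base, but with an aligned slice of the hidden state
    replaced by the corresponding slice computed on source."""
    h_base, h_source = model[0](base), model[0](source)
    h = h_base.clone()
    h[:, dims] = h_source[:, dims]   # interchange the aligned subspace only
    return model[1](h)

teacher = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 4))
student = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))

base, source = torch.randn(32, 8), torch.randn(32, 8)
with torch.no_grad():
    target = intervened_output(teacher, base, source)
iit_loss = F.mse_loss(intervened_output(student, base, source), target)
iit_loss.backward()   # combined in practice with the task and imitation objectives
```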