Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.
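Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.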
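We explore how generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain-of-thought demonstrations are provided as exemplars in the prompt. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even fine-tuned GPT-3 with a verifier.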
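Reasoning is a key pillar of human cognition and intelligence. In the past decade, we have witnessed dramatic gains in natural language processing and unprecedented scaling of large language models. Recent work has characterized the capability of few-shot techniques, such as chain of thought, to emulate human reasoning in large language models. This hallmark feature of few-shot techniques, combined with ever-scaling language models, has opened a vista of possibilities for solving various tasks, such as math word problems, code completion, and commonsense reasoning. Chain-of-thought (CoT) prompting further pushes the performance of models by supplying intermediate steps and urging the model to follow the same process. Despite its compelling performance, the genesis of reasoning capability in these models is less explored. This work takes preliminary steps towards a deeper understanding of reasoning mechanisms in large language models. Our work centers on querying the model while controlling for all but one of the components in a prompt: symbols, patterns, and text. We then analyze the performance divergence across the queries. Our results suggest that the presence of factual patterns in a prompt is not necessary for the success of CoT. Nonetheless, we empirically show that relying solely on patterns is also insufficient for high-quality results. We posit that text imbues patterns with commonsense knowledge and meaning. Our exhaustive empirical analysis provides qualitative examples of the symbiotic relationship between text and patterns. Such systematic understanding of CoT enables us to devise a concise chain of thought, dubbed CCoT, in which text and patterns are pruned to retain only their key roles, while delivering an on-par or higher task solve rate.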
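State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a fine-tuning baseline.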
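The test-time procedure described in the GSM8K abstract (sample many candidate solutions, then keep the one the verifier ranks highest) can be sketched roughly as follows. This is only an illustration under stated assumptions: `select_best` and `toy_verifier` are hypothetical names, and the toy verifier merely re-checks the arithmetic of each step, whereas the paper trains a model as the verifier.

```python
import operator
from typing import Callable, List, Tuple

# Arithmetic operators understood by the toy verifier (an assumption of this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a trained verifier.

    Expects steps of the form "a op b = c" separated by ";" and returns
    the fraction of steps whose arithmetic is correct.
    """
    steps = [s.strip() for s in solution.split(";") if s.strip()]
    if not steps:
        return 0.0
    correct = 0
    for step in steps:
        lhs, rhs = step.split("=")
        a, op, b = lhs.split()
        if abs(OPS[op](float(a), float(b)) - float(rhs)) < 1e-9:
            correct += 1
    return correct / len(steps)


def select_best(candidates: List[str], verifier: Callable[[str], float]) -> Tuple[str, float]:
    """Score every candidate with the verifier and return the top-ranked one."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scored = [(verifier(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Three sampled "solutions" to the same problem; only the second is fully correct.
candidates = [
    "48 / 2 = 26; 48 + 26 = 74",
    "48 / 2 = 24; 48 + 24 = 72",
    "48 / 2 = 24; 48 + 24 = 62",
]
best, score = select_best(candidates, toy_verifier)
print(best, score)
```

In this sketch the quality of the final answer depends entirely on the scoring function, so swapping `toy_verifier` for a learned scorer would change nothing else in the selection logic.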
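Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset that helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To gain more traction on mathematical problem solving, we will likely need new algorithmic advancements from the broader research community.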
Step-by-step reasoning approaches like chain-of-thought (CoT) have proved to be a very effective technique to induce reasoning capabilities in large language models. However, the success of the CoT approach depends primarily on model size, and often billion parameter-scale models are needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distils these reasoning abilities into smaller models. Our approach, Decompositional Distillation, learns a semantic decomposition of the original problem into a sequence of subproblems and uses it to train two models: a) a problem decomposer that learns to decompose the complex reasoning problem into a sequence of simpler subproblems and b) a problem solver that uses the intermediate subproblems to solve the overall problem. On a multi-step math word problem dataset (GSM8K), we boost the performance of GPT-2 variants by up to 35% when distilled with our approach compared to CoT. We show that using our approach, it is possible to train a GPT-2-large model (775M) that can outperform a 10X larger GPT-3 (6B) model trained using CoT reasoning. Finally, we demonstrate that our approach of problem decomposition can also be used as an alternative to CoT prompting, which boosts the GPT-3 performance by 40% compared to CoT prompts.
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. For example, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.
Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
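Many real-world problems require the combined application of multiple reasoning abilities, employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) a collection of 1K real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10K synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.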
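This paper demonstrates that by fine-tuning an autoregressive language model (GPT-Neo) on appropriately structured step-by-step demonstrations, it is possible to teach it to execute a mathematical task that has previously proved difficult for Transformers, namely longhand modulo operations, with a relatively small number of examples. Specifically, we fine-tune GPT-Neo to solve the numbers__div_remainder task from the DeepMind Mathematics Dataset; Saxton et al. (arXiv:1904.01557) reported below 40% accuracy on this task with 2 million training examples. We show that after fine-tuning on 200 appropriately structured demonstrations of solving long division problems and reporting the remainders, the smallest available GPT-Neo model achieves over 80% accuracy. This is achieved by constructing an appropriate dataset for fine-tuning, with no changes to the learning algorithm. These results suggest that fine-tuning autoregressive language models on small sets of well-crafted demonstrations may be a useful paradigm for enabling individuals without training in machine learning to coax such models to perform some kinds of complex multi-step tasks.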
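To make the idea of a "structured step-by-step demonstration" of long division concrete, the sketch below generates one such text for a given dividend and divisor. The exact wording of the demonstrations used in the paper is not given in the abstract, so the line format here ("bring down ...") and the function name `long_division_demo` are assumptions chosen purely for illustration.

```python
def long_division_demo(dividend: int, divisor: int) -> str:
    """Build a hypothetical longhand-division demonstration ending in the remainder.

    Processes the dividend digit by digit, as in schoolbook long division,
    and writes out every intermediate quotient digit and running remainder.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects a non-negative dividend and a positive divisor")
    lines = [f"Problem: remainder of {dividend} divided by {divisor}"]
    remainder = 0
    for digit in str(dividend):
        # Bring down the next digit and divide the running value.
        current = remainder * 10 + int(digit)
        quotient_digit = current // divisor
        remainder = current - quotient_digit * divisor
        lines.append(
            f"bring down {digit}: {current} = {divisor} * {quotient_digit} + {remainder}"
        )
    lines.append(f"Answer: {remainder}")
    return "\n".join(lines)


demo = long_division_demo(157, 12)
print(demo)
```

A fine-tuning set in this style would simply be a few hundred such strings for different dividend and divisor pairs; the final line of each string carries the answer the task asks for.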
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
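Solving math word problems requires deductive reasoning over the quantities in the text. Various recent research efforts have mostly relied on sequence-to-sequence or sequence-to-tree models to generate mathematical expressions without explicitly performing relational reasoning between quantities in the given context. While empirically effective, such approaches typically do not provide explanations for the generated expressions. In this work, we view the task as a complex relation extraction problem and propose a novel approach that presents explainable deductive reasoning steps to iteratively construct target expressions, where each step involves a primitive operation over two quantities defining their relation. Through extensive experiments on four benchmark datasets, we show that the proposed model significantly outperforms existing strong baselines. We further demonstrate that the deductive procedure not only presents more explainable steps but also enables us to make more accurate predictions on questions that require more complex reasoning.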
When a large language model (LLM) performs complex reasoning by chain of thought (CoT), it can be highly sensitive to individual mistakes. Addressing this issue has so far required training verifiers. Humans, after inferring a conclusion, often check it by re-verifying it, which can avoid some mistakes. We propose a new method called self-verification that uses the conclusion of the CoT as a condition to build a new sample and asks the LLM to re-predict the original conditions, which are masked. We calculate an explainable verification score based on the accuracy of the re-prediction. This method can improve accuracy on multiple arithmetic and logical reasoning datasets when using few-shot learning. We demonstrate that LLMs can conduct explainable self-verification of their own conclusions and achieve competitive reasoning performance. Extensive experiments demonstrate that self-verification helps multiple large language models avoid interference from incorrect CoT. Code is available at https://github.com/WENGSYX/Self-Verification
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets, covering a diverse set of tasks that require reasoning skills, and show that ROSCOE can consistently outperform baseline metrics.
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.
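Abstract reasoning is a key ability for an intelligent system. Large language models achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect, and depends on our knowledge and beliefs about the content of the reasoning problem. For example, humans reason much more reliably about logical rules that are grounded in everyday situations than about arbitrary rules concerning abstract attributes. The training experiences of language models similarly endow them with prior expectations that reflect human knowledge and beliefs. We therefore hypothesized that language models would show human-like content effects on abstract reasoning problems. We explored this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task (Wason, 1968). We find that state-of-the-art large language models (with 7 or 70 billion parameters; Hoffmann et al., 2022) reflect many of the same patterns observed in humans across these tasks: like humans, models reason more effectively about believable situations than about unrealistic or abstract ones. Our findings have implications for understanding both these cognitive effects and the factors that contribute to language model performance.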
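Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction, we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next, we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally, we study a "preference model pre-training" stage of training, with the goal of improving sample efficiency when fine-tuning on human preferences.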
Pre-trained language models (LMs) have shown remarkable reasoning performance using explanations (or "chain-of-thought" (CoT)) for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To make progress towards understanding in-context learning, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from knowledge bases (KBs). Then we revisit neuro-symbolic approaches and use Language Models as Logic Programmer (LMLP) that learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm. Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than 25% higher accuracy than CoT on length generalization benchmarks even with fewer parameters.
The recent advent of large language models - large neural networks trained on a simple predictive objective over a massive corpus of natural language - has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training on those problems. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here, we performed a direct comparison between human reasoners and a large language model (GPT-3) on a range of analogical tasks, including a novel text-based matrix reasoning task closely modeled on Raven's Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.