尽管当代的大语言模型(LMS)表现出令人印象深刻的提问功能,但它们的答案通常是单个呼吁模型的产物。这需要不受欢迎的不透明度和损害性能,尤其是在本质上是多步骤的问题上。为了解决这些局限性,我们可以通过一个过程通过因果结构反映了问题的基本逻辑结构的过程来展示如何制作LMS来执行忠实的多步推理。我们的方法是通过将推理步骤链接在一起的,每个步骤都来自调用两个微调的LMS,一个用于选择,一种用于推理,以产生有效的推理跟踪。我们的方法在推理轨迹的空间中进行了光束搜索,以提高推理质量。我们证明了模型对多步逻辑推论和科学提问的有效性,表明它在最终答案的准确性上优于基准,并生成可解释的人类解释的推理痕迹,其有效性可以由用户检查。
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
最近已被证明大型语言模型在各种任务集中获得合理的零射普通化(Brown等,2020)。它已经假设这是语言模型的隐式多任务学习的结果,在语言模型中的预押(Radford等,2019)。可以通过明确的多任务学习直接引起零拍常规化?为了以缩放测试这个问题,我们开发一个系统,以便轻松地将任何自然语言任务映射到人类可读的提示表单中。我们转换一组大量的监督数据集,每个数据集都有多个提示,具有不同的措辞。这些提示的数据集允许基准测试模型执行完全看不见的任务的能力。我们介绍了一个普拉克尔编码器 - 解码器模型(Raffel等,2020; Lester等,2021),覆盖各种任务。该模型在多个标准数据集中达到强大的零点性能,通常优于其尺寸的型号超过16倍。此外,我们的方法对来自Big-替补基准测试的任务子集具有强烈性能,优于其尺寸的6倍。所有提示和培训的型号都可以在https://github.com/ bigscience-workshop / protectsource / httpsource / https://huggingface.co/bigscience/t0pp。
translated by 谷歌翻译
我们提出了一种系统推理的方法,该方法生产了基于事实基础的人类可解释的证明树。我们的解决方案类似于经典的基于序言的推理引擎的风格,在该引擎中,我们通过神经语言建模,指导生成和半磁头密集检索的结合来代替手工制作的规则。这款新颖的推理引擎Nellie动态实例化了可解释的推理规则,这些规则捕获和分数构成(DE)在自然语言陈述上。内莉(Nellie)在科学质量检查数据集上提供竞争性能,需要对多个事实进行结构化解释。
translated by 谷歌翻译
The emergence of large pretrained models has enabled language models to achieve superior performance in common NLP tasks, including language modeling and question answering, compared to previous static word representation methods. Augmenting these models with a retriever to retrieve the related text and documents as supporting information has shown promise in effectively solving NLP problems in a more interpretable way given that the additional knowledge is injected explicitly rather than being captured in the models' parameters. In spite of the recent progress, our analysis on retriever-augmented language models shows that this class of language models still lack reasoning over the retrieved documents. In this paper, we study the strengths and weaknesses of different retriever-augmented language models such as REALM, kNN-LM, FiD, ATLAS, and Flan-T5 in reasoning over the selected documents in different tasks. In particular, we analyze the reasoning failures of each of these models and study how the models' failures in reasoning are rooted in the retriever module as well as the language model.
translated by 谷歌翻译
语言模型在需要自然语言理解的各种任务上取得了非凡的表现。然而,最先进的模型通常在需要定量推理的任务上挣扎,例如在大学一级解决数学,科学和工程问题。为了帮助缩小这一差距,我们介绍了Minerva,Minerva是一种在一般自然语言数据上鉴定的大型语言模型,并进一步培训了技术内容。该模型在不使用外部工具的情况下实现了技术基准测试的最新性能。我们还评估了我们在需要定量推理的物理学,生物学,化学,经济学和其他科学方面的200多个本科生问题上评估我们的模型,并发现该模型可以正确回答其中几乎三分之一。
translated by 谷歌翻译
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new selfsupervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
translated by 谷歌翻译
Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training or prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation and/or computation, without any assurance that their generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more faithful to its task predictions than those generated by competitive baselines.
translated by 谷歌翻译
Large language models have demonstrated outstanding performance on a wide range of tasks such as question answering and code generation. On a high level, given an input, a language model can be used to automatically complete the sequence in a statistically-likely way. Based on this, users prompt these models with language instructions or examples, to implement a variety of downstream tasks. Advanced prompting methods can even imply interaction between the language model, a user, and external tools such as calculators. However, to obtain state-of-the-art performance or adapt language models for specific tasks, complex task- and model-specific programs have to be implemented, which may still require ad-hoc interaction. Based on this, we present the novel idea of Language Model Programming (LMP). LMP generalizes language model prompting from pure text prompts to an intuitive combination of text prompting and scripting. Additionally, LMP allows constraints to be specified over the language model output. This enables easy adaption to many tasks, while abstracting language model internals and providing high-level semantics. To enable LMP, we implement LMQL (short for Language Model Query Language), which leverages the constraints and control flow from an LMP prompt to generate an efficient inference procedure that minimizes the number of expensive calls to the underlying language model. We show that LMQL can capture a wide range of state-of-the-art prompting methods in an intuitive way, especially facilitating interactive flows that are challenging to implement with existing high-level APIs. Our evaluation shows that we retain or increase the accuracy on several downstream tasks, while also significantly reducing the required amount of computation or cost in the case of pay-to-use APIs (13-85% cost savings).
translated by 谷歌翻译
There has been a recent resurgence in the area of explainable artificial intelligence as researchers and practitioners seek to make their algorithms more understandable. Much of this research is focused on explicitly explaining decisions or actions to a human observer, and it should not be controversial to say that looking at how humans explain to each other can serve as a useful starting point for explanation in artificial intelligence. However, it is fair to say that most work in explainable artificial intelligence uses only the researchers' intuition of what constitutes a 'good' explanation. There exists vast and valuable bodies of research in philosophy, psychology, and cognitive science of how people define, generate, select, evaluate, and present explanations, which argues that people employ certain cognitive biases and social expectations towards the explanation process. This paper argues that the field of explainable artificial intelligence should build on this existing research, and reviews relevant papers from philosophy, cognitive psychology/science, and social psychology, which study these topics. It draws out some important findings, and discusses ways that these can be infused with work on explainable artificial intelligence.
translated by 谷歌翻译
抽象推理是智能系统的关键能力。大型语言模型在抽象推理任务上实现了高度的性能,但表现出许多缺陷。但是,人类的抽象推理也是不完美的,并且取决于我们对推理问题内容的知识和信念。例如,人类对在日常情况下基于逻辑规则的逻辑规则比关于抽象属性的任意规则更可靠地理解。语言模型的培训经验类似地赋予了他们先前的期望,这些期望反映了人类的知识和信念。因此,我们假设语言模型会显示出类似人类的内容对抽象推理问题的影响。我们在三个逻辑推理任务中探讨了这一假设:自然语言推论,判断三段论的逻辑有效性和ison选择任务(Wason,1968)。我们发现,最新的大语言模型(具有7或700亿个参数; Hoffman等,2022)反映了这些任务中人类在人类中观察到的许多相同模式 - 像人类一样,模型对可信情况的理由更有效地理由不现实或抽象的。我们的发现对理解这些认知效应以及有助于语言模型表现的因素具有影响。
translated by 谷歌翻译
关于概念及其属性的常识知识(CSK)对AI应用程序(例如强大的聊天机器人)有用。诸如ConceptNet,Tuplekb和其他人之类的先前作品汇编了大型CSK集合,但在其表现力上限制了主题性主体对象(SPO)三倍(SPO)三元组,其中s和p和Onolithic的简单概念是P和O。这些项目都优先考虑精确精度。或召回,但几乎不能调和这些互补目标。本文介绍了一种称为Ascent的方法,以自动建立一个大规模的CSK断言的知识库(KB),具有高级表现力,并且比先前的作品更好,并且具有更好的精度和回忆。通过捕获子组和方面的复合概念,以及通过语义方面的主张来捕获复合概念,超越了三倍。后者对于表达断言和进一步预选赛的时间和空间有效性很重要。 Ascent使用语言模型将开放信息提取与明智的清洁结合在一起。内在评估显示了上升KB的较高规模和质量,QA支持任务的外部评估强调了上升的好处。可以在https://ascent.mpi-inf.mpg.de/上找到Web界面,数据和代码。
translated by 谷歌翻译
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
Information seeking users often pose questions with false presuppositions, especially when asking about unfamiliar topics. Most existing question answering (QA) datasets, in contrast, assume all questions have well defined answers. We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections. Through extensive baseline experiments, we show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct. This is in large part due to difficulty in retrieving relevant evidence passages from a large text corpus. CREPE provides a benchmark to study question answering in the wild, and our analyses provide avenues for future work in better modeling and further studying the task.
translated by 谷歌翻译
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
translated by 谷歌翻译
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
translated by 谷歌翻译
我们介绍了概率世界,这是一个新的全象征性的贝叶斯型号的语义解析和推理模型,作为对更具领域和任务通用NLU和AI的研究计划的第一步。人类创造了他们观察的内部心理模型,这极大地帮助理解和理解大量问题。在PWM中,句子的含义,获得世界的事实,以及推理的中间步骤都以人类可读的形式表达,具有可解释性的设计目标。 PWM是贝叶斯,专为能够概括新域和新任务而设计。我们派生并实现了一种推导算法,通过解析和释放捕获这些句子的语义的潜在世界模型来读取句子,并在两个域名问题答案数据集中评估它:(1)校对器和(2 )我们呼叫虚构的新数据集,旨在更具实际语言的代表,但仍然足够简单,以重新评估推理能力,同时对启发式鲁棒。我们的方法均优于两者的基线,从而将其值证明其作为概念验证。
translated by 谷歌翻译
促使模型表现出令人印象深刻的几次学习能力。在测试时间与单个模型或多个模型的组成一起重复相互作用,进一步扩展了功能。这些组成是概率模型,可以用具有随机变量的图形模型的语言表示,其值是复杂的数据类型,例如字符串。具有控制流和动态结构的情况需要概率编程的技术,这些技术允许以统一语言实施不同的模型结构和推理策略。我们从这个角度正式化了几种现有技术,包括刮擦板 /思想链,验证者,星星,选择 - 推动和工具使用。我们将结果程序称为语言模型级联。
translated by 谷歌翻译
我们调整了大型语言模型,以使用行为克隆来编写自然语言批评(自然语言批判性评论)。关于基于主题的摘要任务,我们的模型所写的批评帮助人类在摘要中发现了本来会错过的漏洞。我们的模型有助于在模型和人类书面摘要中发现自然存在的缺陷,以及人类撰写的摘要中有意误导的摘要中的缺陷。我们研究批评的缩放特性,包括基于主题的汇总和合成任务。较大的模型写出更多有用的批评,在大多数任务上,尽管产生了更困难的输出,但在大多数任务上都更好地进行了自我关注。较大的模型还可以将自己的自我批评纳入反馈,将自己的摘要完善为更好的摘要。最后,我们激励并引入了一个框架,以比较批评能力的产生和歧视能力。我们的测量表明,即使是大型模型也可能仍然具有他们无法或不表达为批评的相关知识。这些结果是使用AI辅助的人类反馈来扩展机器学习系统的监督到人类直接评估的任务的概念证明。我们释放培训数据集以及批评援助实验的样本。
translated by 谷歌翻译
当前的语言模型可以产生高质量的文本。他们只是复制他们之前看到的文本,或者他们学习了普遍的语言抽象吗?要取笑这些可能性,我们介绍了乌鸦,这是一套评估生成文本的新颖性,专注于顺序结构(n-gram)和句法结构。我们将这些分析应用于四种神经语言模型(LSTM,变压器,变换器-XL和GPT-2)。对于本地结构 - 例如,单个依赖性 - 模型生成的文本比来自每个模型的测试集的人类生成文本的基线显着不那么新颖。对于大规模结构 - 例如,总句结构 - 模型生成的文本与人生成的基线一样新颖甚至更新颖,但模型仍然有时复制,在某些情况下,在训练集中重复超过1000字超过1,000字的通道。我们还表现了广泛的手动分析,表明GPT-2的新文本通常在形态学和语法中形成良好,但具有合理的语义问题(例如,是自相矛盾)。
translated by 谷歌翻译