While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
在交互式环境中,现有的基础语言基准要么缺乏现实世界的语言元素,要么由于人类参与数据收集或反馈信号而难以扩展。为了弥合这一差距,我们开发了网络商店 - 一个模拟的电子商务网站环境,拥有11.18亿美元的现实世界中的产品和12,087美元的人群文本说明。给定指定产品需求的文本指令,代理需要导航多种类型的网页并发布各种操作以查找,自定义和购买项目。 WebShop为语言基础提供了一些挑战,包括了解构图说明,查询(重新)表述,理解和对网页中的嘈杂文本进行操作以及执行战略探索。我们为这项任务收集了超过1,600美元的人类示范,并使用强化学习,模仿学习以及预训练的图像和语言模型来训练和评估各种代理商。我们的最佳模型达到了任务成功率$ 29 \%$,它优于基于规则的启发式方法($ 9.6 \%$),但远低于人类专家绩效($ 59 \%$)。我们还分析了代理和人类轨迹,并消融各种模型组件,以提供有关具有更强语言理解和决策能力的未来代理人的见解。最后,我们表明,在Amazon.com上进行评估时,在网络商店进行培训的代理商展示了非平凡的SIM转移转移,这表明网络商店在开发可以在野外运行的实用基于网络的代理商中的潜在价值。
translated by 谷歌翻译
最近的作品表明,如何将大语言模型(LLM)的推理能力应用于自然语言处理以外的领域,例如机器人的计划和互动。这些具体的问题要求代理商了解世界上许多语义方面:可用技能的曲目,这些技能如何影响世界以及对世界的变化如何映射回该语言。在体现环境中规划的LLMS不仅需要考虑要做什么技能,还需要考虑如何以及何时进行操作 - 答案随着时间的推移而变化,以响应代理商自己的选择。在这项工作中,我们调查了在这种体现的环境中使用的LLM在多大程度上可以推论通过自然语言提供的反馈来源,而无需任何其他培训。我们建议,通过利用环境反馈,LLM能够形成内部独白,使他们能够在机器人控制方案中进行更丰富的处理和计划。我们研究了各种反馈来源,例如成功检测,场景描述和人类互动。我们发现,闭环语言反馈显着改善了三个领域的高级指导完成,包括模拟和真实的桌面顶部重新排列任务以及现实世界中厨房环境中的长途移动操作任务。
translated by 谷歌翻译
Recent work has shown that large language models are capable of generating natural language reasoning steps or Chains-of-Thoughts (CoT) to answer a multi-step question when prompted to do so. This is insufficient, however, when the necessary knowledge is not available or up-to-date within a model's parameters. A straightforward approach to address this is to retrieve text from an external knowledge source using the question as a query and prepend it as context to the model's input. This, however, is also insufficient for multi-step QA where \textit{what to retrieve} depends on \textit{what has already been derived}. To address this issue we propose IRCoT, a new approach that interleaves retrieval with CoT for multi-step QA, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Our experiments with GPT3 show substantial improvements in retrieval (up to 22 points) and downstream QA (up to 16 points) over the baselines on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Notably, our method also works well for much smaller models such as T5-Flan-large (0.7B) without any additional training.
translated by 谷歌翻译
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
translated by 谷歌翻译
预处理的大语言模型(LLM)广泛用于自然语言处理(NLP)的许多子场,通常被称为具有特定任务示例的优秀少数学习者。值得注意的是,思想链(COT)提示,这是一种通过分步答案示例引发复杂的多步推理的技术,在算术和符号推理中实现了最新的表演,难以置信的System-2任务不遵循LLMS的标准缩放定律。尽管这些成功通常归因于LLM的几次学习能力,但我们表明,LLM是通过在每个答案之前简单地添加“让我们逐步思考”而成为不错的零射击推理者。实验结果表明,使用相同的单个提示模板,我们的零射击功能明显优于零摄像机LLM在不同的基准推理任务上的零摄像机表现,包括算术(Multiarith,GSM8K,Aqua-Rat,SVAMP,SVAMP),符号推理(最后一个字母,字母,字母,字母,,,,,字母,字母)(最后一个字母),硬币翻转)和其他逻辑推理任务(日期理解,跟踪洗牌对象),而没有任何手工制作的几个示例,例如通过175B参数指令gpt模型将Multiarith的准确性从17.7%提高到78.7%,GSM8K从10.4%提高到40.7%,以及另一种现成的大型模型,540B参数Palm Palm的相似改进。在非常多样化的推理任务中,这个单一提示的多功能性暗示了LLM的尚未开发和研究的基本零拍功能,这表明可以通过简单提示来提取高级,多任务的广泛认知能力。我们希望我们的工作不仅可以作为具有挑战性的推理基准的最小零击基线,而且还强调了仔细探索和分析LLM中隐藏在LLM中的巨大的零拍知识的重要性,然后在制作Finetunning数据集或少数拍摄的典范之前。
translated by 谷歌翻译
我们探索如何产生一系列思想 - 一系列中间推理步骤 - 显着提高了大语言模型执行复杂推理的能力。特别是,我们通过一种称为“思想链”提示的简单方法在足够大的语言模型中自然出现这种推理能力,在此过程中,一些思想示范被作为提示的示例提供了。三种大语模型的实验表明,促使思想链提高了一系列算术,常识和象征性推理任务的性能。经验收益可能会引人注目。例如,仅使用八个思想范围的540B参数语言模型才能在数学单词问题的GSM8K基准上实现最新的精度,甚至超过了带有验证器的Fineted GPT-3。
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
我们微调GPT-3使用基于文本的Web浏览环境来回答长形问题,允许模型搜索和导航Web。通过建立任务,以便通过人类执行,我们能够使用模仿学习培训在任务上的模型,然后通过人体反馈优化答案质量。为了使人为评估事实精度更容易,模型必须在浏览支持答案时收集引用。我们在ELI5上培训并评估我们的模型,Reddit用户提出的问题数据集。我们的最佳模型是通过使用行为克隆进行微调GPT-3获得的,然后对训练训练的奖励模型进行拒绝采样来获得以预测人类偏好。这种模式的答案是人类56%的答案,我们的人类示威者的时间和69%的时间到Reddit的最高投票答复。
translated by 谷歌翻译
Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
translated by 谷歌翻译
Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis, negotiation, etc. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for emerging such reasoning abilities and highlight future research directions.
translated by 谷歌翻译
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
translated by 谷歌翻译
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment following meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
translated by 谷歌翻译
创建可以自然与人类互动的代理是人工智能(AI)研究中的共同目标。但是,评估这些互动是具有挑战性的:收集在线人类代理相互作用缓慢而昂贵,但更快的代理指标通常与交互式评估相关。在本文中,我们评估了这些现有评估指标的优点,并提出了一种新颖的评估方法,称为标准化测试套件(STS)。 STS使用从真实人类交互数据中挖掘出的行为方案。代理商请参阅重播方案上下文,接收指令,然后将控制权控制以脱机完成交互。记录这些代理的延续并将其发送给人类注释者以将其标记为成功或失败,并且根据其成功的连续性比例对代理进行排名。最终的ST是自然主义相互作用的快速,控制,可解释的和代表的。总的来说,STS巩固了我们许多标准评估指标中所需的许多值,从而使我们能够加速研究进展,以生产可以自然与人类互动的代理。可以在https://youtu.be/yr1tnggorgq上找到视频。
translated by 谷歌翻译
本文介绍了寻求信息(是)任务,概念和算法的信息重新分类。拟议的分类系统提供了新的维度,以研究寻求任务和方法的信息。新尺寸包括搜索迭代,搜索目标类型和程序的数量,以实现这些目标。寻求任务的信息沿着这些尺寸呼叫合适的计算解决方案的差异。然后,该文章评论了符合每个新类别的机器学习解决方案。该论文结束了对系统的评估活动进行了审查。
translated by 谷歌翻译
随着人工智能系统变得越来越强大和普遍,人们对机器的道德或缺乏道德的关注变得越来越关注。然而,向机器讲授道德是一项艰巨的任务,因为道德仍然是人类中最激烈的争论问题之一,更不用说AI了。但是,部署到数百万用户的现有AI系统已经在做出充满道德影响的决策,这构成了一个看似不可能的挑战:教学机器的道德意义,而人类继续努力努力。为了探索这一挑战,我们介绍了Delphi,这是一个基于深层神经网络的实验框架,直接训练了描述性道德判断,例如,“帮助朋友”通常是不错的,而“帮助朋友传播假新闻”不是。经验结果提供了对机器伦理的承诺和局限性的新见解。面对新的道德情况,德尔菲(Delphi)表现出强大的概括能力,而现成的神经网络模型表现出明显差的判断,包括不公正的偏见,证实了对明确教学机器的道德意义的必要性。然而,德尔菲并不完美,表现出对普遍性偏见和不一致的敏感性。尽管如此,我们还是展示了不完美的Delphi的积极用例,包括在其他不完美的AI系统中将其用作组件模型。重要的是,我们根据著名的道德理论来解释Delphi的运营化,这使我们提出了重要的未来研究问题。
translated by 谷歌翻译
鉴于大型语言模型的广泛能力,应该有可能朝着一般的文本的助手工作,这些助手与人类价值一致,这意味着它是有帮助,诚实的和无害的。在此方向上的初始遗传,我们研究简单的基线技术和评估,例如提示。我们发现,从模型规模增加适度的干预措施的好处,概括为各种对准评估,并不会损害大型模型的性能。接下来,我们调查与对齐,比较仿制,二进制歧视和排名偏好建模相关的几个培训目标的缩放趋势。我们发现排名优先级模型比模仿学习更好地表现得多,并且通常以模型大小更有利地缩放。相比之下,二进制歧视通常与模仿学习非常类似地执行和缩放。最后,我们研究了一种“偏好模型预训练阶段的培训阶段,其目的是在对人偏好的芬明时提高样本效率。
translated by 谷歌翻译
We propose the Detailed Outline Control (DOC) framework for improving long-range plot coherence when automatically generating several-thousand-word-long stories. DOC consists of two complementary components: a detailed outliner and a detailed controller. The detailed outliner creates a more detailed, hierarchically structured outline, shifting creative burden from the main drafting procedure to the planning stage. The detailed controller ensures the more detailed outline is still respected during generation by controlling story passages to align with outline details. In human evaluations of automatically generated stories, DOC substantially outperforms a strong Re3 baseline (Yang et al., 2022) on plot coherence (22.5% absolute gain), outline relevance (28.2%), and interestingness (20.7%). Humans also judged DOC to be much more controllable in an interactive generation setting.
translated by 谷歌翻译
We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
translated by 谷歌翻译