Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
鉴于大型语言模型的广泛能力,应该有可能朝着一般的文本的助手工作,这些助手与人类价值一致,这意味着它是有帮助,诚实的和无害的。在此方向上的初始遗传,我们研究简单的基线技术和评估,例如提示。我们发现,从模型规模增加适度的干预措施的好处,概括为各种对准评估,并不会损害大型模型的性能。接下来,我们调查与对齐,比较仿制,二进制歧视和排名偏好建模相关的几个培训目标的缩放趋势。我们发现排名优先级模型比模仿学习更好地表现得多,并且通常以模型大小更有利地缩放。相比之下,二进制歧视通常与模仿学习非常类似地执行和缩放。最后,我们研究了一种“偏好模型预训练阶段的培训阶段,其目的是在对人偏好的芬明时提高样本效率。
translated by 谷歌翻译
我们提出了Blenderbot 3,这是一个175B参数对话模型,能够通过访问Internet和长期内存进行开放域对话,并接受了大量用户定义的任务的培训。我们同时发布了模型权重和代码,还将模型部署在公共网页上,以与有机用户进行交互。该技术报告描述了该模型的构建方式(建筑,模型和培训计划)以及其部署的细节,包括安全机制。人类评估表明,它优于现有的开放域对话代理,包括其前身(Roller等,2021; Komeili等,2022)。最后,我们使用部署收集的数据详细介绍了持续学习的计划,该数据也将公开发布。因此,该研究计划的目标是使社区能够研究通过互动学习的不断改进的负责任的代理商。
translated by 谷歌翻译
问答系统被认为是流行且经常有效的信息在网络上寻求信息的手段。在这样的系统中,寻求信息者可以通过自然语言提出问题来获得对他们的查询的简短回应。交互式问题回答是一种最近提出且日益流行的解决方案,它位于问答和对话系统的交集。一方面,用户可以以普通语言提出问题,并找到对她的询问的实际回答;另一方面,如果在初始请求中有多个可能的答复,很少或歧义,则系统可以将问题交通会话延长到对话中。通过允许用户提出更多问题,交互式问题回答使用户能够与系统动态互动并获得更精确的结果。这项调查提供了有关当前文献中普遍存在的交互式提问方法的详细概述。它首先要解释提问系统的基本原理,从而定义新的符号和分类法,以将所有已确定的作品结合在统一框架内。然后,根据提出的方法,评估方法和数据集/应用程序域来介绍和检查有关交互式问题解答系统的审查已发表的工作。我们还描述了围绕社区提出的特定任务和问题的趋势,从而阐明了学者的未来利益。 GitHub页面的综合综合了本文献研究中涵盖的所有主要主题,我们的工作得到了进一步的支持。 https://sisinflab.github.io/interactive-question-answering-systems-survey/
translated by 谷歌翻译
随着近期自然语言生成(NLG)模型的各种应用程序的改进,它变得必须具有识别和评估NLG输出是否仅共享关于外部世界的可验证信息的手段。在这项工作中,我们提出了一个归属于识别的来源(AIS)的新评估框架,用于评估自然语言生成模型的输出,当这种输出涉及外部世界时。我们首先定义AIS,并引入两级注释管道,用于允许注释器根据AIS指南适当地评估模型输出。通过人为评估研究,我们在三个代数据集(会话QA域中的两个中和总结一下,概括地验证了这种方法,表明AIS可以作为测量模型生成的语句是否支持基础来源的常见框架。我们释放人类评估研究指南。
translated by 谷歌翻译
大型语言模型,例如OpenAI的法典和DeepMind的字母,可以生成代码来解决以自然语言表达的各种问题。这项技术已经在至少一项广泛使用的编程编辑器扩展程序中进行了商业化:Github Copilot。在本文中,我们探讨了具有大型语言模型(LLM辅助编程)的编程与程序员协助的先前概念化相似,并且与众不同。我们借鉴了公开可用的经验报告,有关LLM辅助编程以及先前的可用性和设计研究。我们发现,尽管LLM辅助编程通过搜索和重用分享了一些编译,配对编程和编程的属性,但技术可能性和实践经验都存在根本差异。因此,应该将LLM辅助编程视为具有自己独特的属性和挑战的新方法。最后,我们借鉴了用户研究的观察结果,在该观察中,非专家最终用户程序员使用LLM辅助工具来求解电子表格中的数据任务。我们讨论可能出现的问题,并在将大型语言模型应用于最终用户编程时,尤其是对于几乎没有编程专业知识的用户。
translated by 谷歌翻译
GPT-3等模型的零和少量提示的最新成功导致了NLP研究的范式转移。在本文中,我们研究了其对文本摘要的影响,重点是新闻摘要的经典基准领域。首先,我们研究了零击GPT-3与在大型摘要数据集中训练的微调模型的比较。我们表明,不仅人类压倒性地更喜欢GPT-3摘要,而且这些摘要也不遭受普通数据集特异性问题(例如事实差的问题)。接下来,我们研究这对评估意味着什么,尤其是黄金标准测试集的作用。我们的实验表明,基于参考和无参考的自动指标,例如最近提出的基于质量检查或基于质量的事实方法无法可靠地评估零击摘要。最后,我们讨论了未来的研究挑战,除了通用摘要之外,特别是基于关键字和方面的摘要,表明了优势微调方法与零拍的提示相比如何。为了支持进一步的研究,我们发布:(a)在4个标准摘要基准中,从微调和零摄像模型中产生的10K生成的摘要,(b)1K人类偏好判断和比较不同系统的普通系统,以进行通用和关键字的不同系统。基于摘要。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
随着人工智能系统变得越来越强大和普遍,人们对机器的道德或缺乏道德的关注变得越来越关注。然而,向机器讲授道德是一项艰巨的任务,因为道德仍然是人类中最激烈的争论问题之一,更不用说AI了。但是,部署到数百万用户的现有AI系统已经在做出充满道德影响的决策,这构成了一个看似不可能的挑战:教学机器的道德意义,而人类继续努力努力。为了探索这一挑战,我们介绍了Delphi,这是一个基于深层神经网络的实验框架,直接训练了描述性道德判断,例如,“帮助朋友”通常是不错的,而“帮助朋友传播假新闻”不是。经验结果提供了对机器伦理的承诺和局限性的新见解。面对新的道德情况,德尔菲(Delphi)表现出强大的概括能力,而现成的神经网络模型表现出明显差的判断,包括不公正的偏见,证实了对明确教学机器的道德意义的必要性。然而,德尔菲并不完美,表现出对普遍性偏见和不一致的敏感性。尽管如此,我们还是展示了不完美的Delphi的积极用例,包括在其他不完美的AI系统中将其用作组件模型。重要的是,我们根据著名的道德理论来解释Delphi的运营化,这使我们提出了重要的未来研究问题。
translated by 谷歌翻译
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
translated by 谷歌翻译
大型语言模型会产生类似人类的文本,这些文本推动了越来越多的应用。但是,最近的文献以及越来越多的现实世界观察表明,这些模型可以产生有毒,有偏见,不真实或其他有害的语言。尽管正在进行评估语言模型危害的工作,但要远见卓识转换出可能出现的危害可能会引起严格的基准。为了促进这种翻译,我们概述了六种表征有害文本的方式,这些方法在设计新基准时值得明确考虑。然后,我们将这些特征用作镜头来识别现有基准中的趋势和差距。最后,我们将它们应用于视角API的案例研究,这是一种毒性分类器,被广泛用于HARS基准。我们的特征提供了一块桥梁,可以在远见和有效评估之间转化。
translated by 谷歌翻译
我们提出了一个文本编辑器,以帮助用户计划,结构并反思其写作过程。它使用自动文本摘要提供了不断更新的段落摘要作为边缘注释。摘要级别范围从全文到选定的(中央)句子,一直到关键字的集合。为了了解用户在写作过程中如何与该系统进行交互,我们进行了两项用户研究(n = 4和n = 8),人们在其中写了有关给定主题和文章的分析文章。作为关键发现,这些摘要使用户对他们的写作有了外部视角,并帮助他们修改了草稿段落的内容和范围。人们进一步使用该工具快速获得文本概述,并制定了整合自动摘要中见解的策略。从更广泛的角度来看,这项工作探索并突出了为作家设计AI工具的价值,其自然语言处理(NLP)功能超出了直接文本生成和更正。
translated by 谷歌翻译
Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models are both a necessity and a challenge. In this paper, we explore how HCI and AI researchers conduct user studies in XAI applications based on a systematic literature review. After identifying and thoroughly analyzing 85 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, fairness, usability, and human-AI team performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.
translated by 谷歌翻译
我们微调GPT-3使用基于文本的Web浏览环境来回答长形问题,允许模型搜索和导航Web。通过建立任务,以便通过人类执行,我们能够使用模仿学习培训在任务上的模型,然后通过人体反馈优化答案质量。为了使人为评估事实精度更容易,模型必须在浏览支持答案时收集引用。我们在ELI5上培训并评估我们的模型,Reddit用户提出的问题数据集。我们的最佳模型是通过使用行为克隆进行微调GPT-3获得的,然后对训练训练的奖励模型进行拒绝采样来获得以预测人类偏好。这种模式的答案是人类56%的答案,我们的人类示威者的时间和69%的时间到Reddit的最高投票答复。
translated by 谷歌翻译
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
translated by 谷歌翻译
Incivility remains a major challenge for online discussion platforms, to such an extent that even conversations between well-intentioned users can often derail into uncivil behavior. Traditionally, platforms have relied on moderators to -- with or without algorithmic assistance -- take corrective actions such as removing comments or banning users. In this work we propose a complementary paradigm that directly empowers users by proactively enhancing their awareness about existing tension in the conversation they are engaging in and actively guides them as they are drafting their replies to avoid further escalation. As a proof of concept for this paradigm, we design an algorithmic tool that provides such proactive information directly to users, and conduct a user study in a popular discussion platform. Through a mixed methods approach combining surveys with a randomized controlled experiment, we uncover qualitative and quantitative insights regarding how the participants utilize and react to this information. Most participants report finding this proactive paradigm valuable, noting that it helps them to identify tension that they may have otherwise missed and prompts them to further reflect on their own replies and to revise them. These effects are corroborated by a comparison of how the participants draft their reply when our tool warns them that their conversation is at risk of derailing into uncivil behavior versus in a control condition where the tool is disabled. These preliminary findings highlight the potential of this user-centered paradigm and point to concrete directions for future implementations.
translated by 谷歌翻译
Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.
translated by 谷歌翻译
深度神经语言模型的最新进展与大规模数据集的能力相结合,加速了自然语言生成系统的发展,这些系统在多种任务和应用程序上下文中产生流利和连贯的文本(在各种成功程度上)。但是,为所需的用户控制这些模型的输出仍然是一个开放的挑战。这不仅对于自定义生成语言的内容和样式至关重要,而且对于他们在现实世界中的安全可靠部署至关重要。我们提出了一项关于受约束神经语言生成的新兴主题的广泛调查,在该主题中,我们通过区分条件和约束(后者是在输出文本上而不是输入的可检验条件),正式定义和分类自然语言生成问题,目前是可检验的)约束文本生成任务,并查看受限文本生成的现有方法和评估指标。我们的目的是强调这个新兴领域的最新进展和趋势,以告知最有希望的方向和局限性,以推动受约束神经语言生成研究的最新作品。
translated by 谷歌翻译