人为决策的合作努力实现超出人类或人工智能表现的团队绩效。但是,许多因素都会影响人类团队的成功,包括用户的领域专业知识,AI系统的心理模型,对建议的信任等等。这项工作检查了用户与三种模拟算法模型的互动,所有这些模型都具有相似的精度,但对其真正的正面和真实负率进行了不同的调整。我们的研究检查了在非平凡的血管标签任务中的用户性能,参与者表明给定的血管是流动还是停滞。我们的结果表明,虽然AI-Assistant的建议可以帮助用户决策,但用户相对于AI的基线性能和AI错误类型的补充调整等因素会显着影响整体团队的整体绩效。新手用户有所改善,但不能达到AI的准确性。高度熟练的用户通常能够识别何时应遵循AI建议,并通常保持或提高其性能。与AI相似的准确性水平的表演者在AI建议方面是最大的变化。此外,我们发现用户对AI的性能亲戚的看法也对给出AI建议时的准确性是否有所提高产生重大影响。这项工作提供了有关与人类协作有关的因素的复杂性的见解,并提供了有关如何开发以人为中心的AI算法来补充用户在决策任务中的建议。
translated by 谷歌翻译
Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models are both a necessity and a challenge. In this paper, we explore how HCI and AI researchers conduct user studies in XAI applications based on a systematic literature review. After identifying and thoroughly analyzing 85 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, fairness, usability, and human-AI team performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.
translated by 谷歌翻译
随着AI系统表现出越来越强烈的预测性能,它们的采用已经在许多域中种植。然而,在刑事司法和医疗保健等高赌场域中,由于安全,道德和法律问题,往往是完全自动化的,但是完全手工方法可能是不准确和耗时的。因此,对研究界的兴趣日益增长,以增加人力决策。除了为此目的开发AI技术之外,人民AI决策的新兴领域必须采用实证方法,以形成对人类如何互动和与AI合作做出决定的基础知识。为了邀请和帮助结构研究努力了解理解和改善人为 - AI决策的研究,我们近期对本课题的实证人体研究的文献。我们总结了在三个重要方面的100多篇论文中的研究设计选择:(1)决定任务,(2)AI模型和AI援助要素,以及(3)评估指标。对于每个方面,我们总结了当前的趋势,讨论了现场当前做法中的差距,并列出了未来研究的建议。我们的调查强调了开发共同框架的需要考虑人类 - AI决策的设计和研究空间,因此研究人员可以在研究设计中进行严格的选择,研究界可以互相构建并产生更广泛的科学知识。我们还希望这项调查将成为HCI和AI社区的桥梁,共同努力,相互塑造人类决策的经验科学和计算技术。
translated by 谷歌翻译
Prior work has identified a resilient phenomenon that threatens the performance of human-AI decision-making teams: overreliance, when people agree with an AI, even when it is incorrect. Surprisingly, overreliance does not reduce when the AI produces explanations for its predictions, compared to only providing predictions. Some have argued that overreliance results from cognitive biases or uncalibrated trust, attributing overreliance to an inevitability of human cognition. By contrast, our paper argues that people strategically choose whether or not to engage with an AI explanation, demonstrating empirically that there are scenarios where AI explanations reduce overreliance. To achieve this, we formalize this strategic choice in a cost-benefit framework, where the costs and benefits of engaging with the task are weighed against the costs and benefits of relying on the AI. We manipulate the costs and benefits in a maze task, where participants collaborate with a simulated AI to find the exit of a maze. Through 5 studies (N = 731), we find that costs such as task difficulty (Study 1), explanation difficulty (Study 2, 3), and benefits such as monetary compensation (Study 4) affect overreliance. Finally, Study 5 adapts the Cognitive Effort Discounting paradigm to quantify the utility of different explanations, providing further support for our framework. Our results suggest that some of the null effects found in literature could be due in part to the explanation not sufficiently reducing the costs of verifying the AI's prediction.
translated by 谷歌翻译
Teaser: How seemingly trivial experiment design choices to simplify the evaluation of human-ML systems can yield misleading results.
translated by 谷歌翻译
最近的工作表明,当AI的预测不可靠时,可以学会推迟人类的选择性预测系统的潜在好处,特别是提高医疗保健等高赌注应用中AI系统的可靠性。然而,大多数事先工作假定当他们解决预测任务时,人类行为仍然保持不变,作为人类艾队团队的一部分而不是自己。我们表明,通过执行实验来规定在选择性预测的背景下量化人AI相互作用的实验并非如此。特别是,我们研究将不同类型信息传送给人类的影响,了解AI系统的决定推迟。使用现实世界的保护数据和选择性预测系统,可以在单独工作的人体或AI系统上提高预期准确性,我们表明,这种消息传递对人类判断的准确性产生了重大影响。我们的结果研究了消息传递策略的两个组成部分:1)人类是否被告知AI系统的预测和2)是否被告知选择性预测系统的决定推迟。通过操纵这些消息传递组件,我们表明,通过通知人类推迟的决定,可以显着提高人类的性能,但不透露对AI的预测。因此,我们表明,考虑在设计选择性预测系统时如何传送到人类的决定是至关重要的,并且必须使用循环框架仔细评估人类-AI团队的复合精度。
translated by 谷歌翻译
尽管Ai在各个领域的超人表现,但人类往往不愿意采用AI系统。许多现代AI技术中缺乏可解释性的缺乏可令人伤害他们的采用,因为用户可能不相信他们不理解的决策过程的系统。我们通过一种新的实验调查这一主张,其中我们使用互动预测任务来分析可解释性和结果反馈对AI信任的影响和AI辅助预测任务的人类绩效。我们发现解释性导致了不强大的信任改进,而结果反馈具有明显更大且更可靠的效果。然而,这两个因素对参与者的任务表现产生了适度的影响。我们的研究结果表明(1)接受重大关注的因素,如可解释性,在越来越多的信任方面可能比其他结果反馈的因素效果,而(2)通过AI系统增强人类绩效可能不是在AI中增加信任的简单问题。 ,随着增加的信任并不总是与性能同样大的改进相关联。这些调查结果邀请了研究界不仅关注产生解释的方法,而且还专注于确保在实践中产生影响和表现的技巧。
translated by 谷歌翻译
解释已被框起来是更好,更公平的人类决策的基本特征。在公平的背景下,这一点尚未得到适当的研究,因为先前的工作主要根据他们对人们的看法的影响进行了评估。但是,我们认为,要促进更公正的决定,它们必须使人类能够辨别正确和错误的AI建议。为了验证我们的概念论点,我们进行了一项实证研究,以研究解释,公平感和依赖行为之间的关系。我们的发现表明,解释会影响人们的公平感,这反过来又影响了依赖。但是,我们观察到,低公平的看法会导致AI建议的更多替代,无论它们是正确还是错。这(i)引起了人们对现有解释对增强分配公平性的有用性的怀疑,并且(ii)为为什么不必将感知作为适当依赖的代理而被混淆的重要案例。
translated by 谷歌翻译
在线众包平台使对算法输出进行评估变得容易,并提出诸如“哪个图像更好,A或B?”之类的问题的调查,在视觉和图形研究论文中的这些“用户研究”的扩散导致了增加匆忙进行的研究充其量是草率且无知的,并且可能有害和误导。我们认为,在计算机视觉和图形论文中的用户研究的设计和报告需要更多关注。为了提高从业者的知识并提高用户研究的可信度和可复制性,我们提供了用户体验研究(UXR),人类计算机互动(HCI)和相关领域的方法论的概述。我们讨论了目前在计算机视觉和图形研究中未利用的基础用户研究方法(例如,需要调查),但可以为研究项目提供宝贵的指导。我们为有兴趣探索其他UXR方法的读者提供了进一步的指导。最后,我们描述了研究界的更广泛的开放问题和建议。我们鼓励作者和审稿人都认识到,并非每项研究贡献都需要用户研究,而且根本没有研究比不小心进行的研究更好。
translated by 谷歌翻译
机器学习的最新进展导致人们对可解释的AI(XAI)的兴趣越来越大,使人类能够深入了解机器学习模型的决策。尽管最近有这种兴趣,但XAI技术的实用性尚未在人机组合中得到特征。重要的是,XAI提供了增强团队情境意识(SA)和共享心理模型发展的希望,这是有效的人机团队的关键特征。快速开发这种心理模型在临时人机团队中尤其重要,因为代理商对他人的决策策略没有先验知识。在本文中,我们提出了两个新颖的人类受试者实验,以量化在人机组合场景中部署XAI技术的好处。首先,我们证明XAI技术可以支持SA($ P <0.05)$。其次,我们研究了通过协作AI政策抽象诱导的不同SA级别如何影响临时人机组合绩效。重要的是,我们发现XAI的好处不是普遍的,因为对人机团队的组成有很大的依赖。新手受益于XAI提供增加的SA($ P <0.05 $),但容易受到认知开销的影响($ P <0.05 $)。另一方面,专家性能随着基于XAI的支持($ p <0.05 $)而降低,这表明关注XAI的成本超过了从提供的其他信息中获得的收益以增强SA所获得的收益。我们的结果表明,研究人员必须通过仔细考虑人机团队组成以及XAI方法如何增强SA来故意在正确的情况下设计和部署正确的XAI技术。
translated by 谷歌翻译
在许多现实世界的背景下,成功的人类合作要求人类有效地将补充信息来源整合到AI信息的决策中。但是,实际上,人类决策者常常缺乏对AI模型与自己有关的信息的了解。关于如何有效沟通不可观察的指南,几乎没有可用的准则:可能影响结果但模型无法使用的功能。在这项工作中,我们进行了一项在线实验,以了解以及如何显式交流潜在相关的不可观念,从而影响人们在做出预测时如何整合模型输出和无法观察到的。我们的发现表明,提示有关不可观察的提示可以改变人类整合模型输出和不可观察的方式,但不一定会改善性能。此外,这些提示的影响可能会根据决策者的先前领域专业知识而有所不同。我们通过讨论对基于AI的决策支持工具的未来研究和设计的影响来结束。
translated by 谷歌翻译
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
translated by 谷歌翻译
人工智能算法越来越多地被公共机构作为决策助手,并承诺克服人类决策者的偏见。同时,他们可能会在人类算法中引入新的偏见。在心理学和公共行政文献上,我们调查了两个关键偏见:即使面对来自其他来源的警告信号(自动化偏见)的警告信号,对算法建议过度依赖,并选择性地采用算法建议时,这与刻板印象相对应(Selective Adherence)。我们通过在荷兰瓦中进行的三项实验研究评估这些研究,讨论了我们发现对公共部门决策在自动化时代的影响。总体而言,我们的研究表明,对已经脆弱和处境不利的公民自动化自动化的潜在负面影响。
translated by 谷歌翻译
机器学习透明度(ML),试图揭示复杂模型的工作机制。透明ML承诺推进人为因素在目标用户中以人为本的人体目标的工程目标。从以人为本的设计视角,透明度不是ML模型的属性,而是一种能力,即算法与用户之间的关系;因此,与用户的迭代原型和评估对于获得提供透明度的充足解决方案至关重要。然而,由于有限的可用性和最终用户,遵循了医疗保健和医学图像分析的人以人为本的设计原则是具有挑战性的。为了调查医学图像分析中透明ML的状态,我们对文献进行了系统审查。我们的评论在医学图像分析应用程序的透明ML的设计和验证方面揭示了多种严重的缺点。我们发现,大多数研究到达迄今为止透明度作为模型本身的属性,类似于任务性能,而不考虑既未开发也不考虑最终用户也不考虑评估。此外,缺乏用户研究以及透明度声明的偶发验证将当代研究透明ML的医学图像分析有可能对用户难以理解的风险,因此临床无关紧要。为了缓解即将到来的研究中的这些缺点,同时承认人以人为中心设计在医疗保健中的挑战,我们介绍了用于医学图像分析中的透明ML系统的系统设计指令。 Intrult指南建议形成的用户研究作为透明模型设计的第一步,以了解用户需求和域要求。在此过程之后,会产生支持设计选择的证据,最终增加了算法提供透明度的可能性。
translated by 谷歌翻译
Deepfakes are computationally-created entities that falsely represent reality. They can take image, video, and audio modalities, and pose a threat to many areas of systems and societies, comprising a topic of interest to various aspects of cybersecurity and cybersafety. In 2020 a workshop consulting AI experts from academia, policing, government, the private sector, and state security agencies ranked deepfakes as the most serious AI threat. These experts noted that since fake material can propagate through many uncontrolled routes, changes in citizen behaviour may be the only effective defence. This study aims to assess human ability to identify image deepfakes of human faces (StyleGAN2:FFHQ) from nondeepfake images (FFHQ), and to assess the effectiveness of simple interventions intended to improve detection accuracy. Using an online survey, 280 participants were randomly allocated to one of four groups: a control group, and 3 assistance interventions. Each participant was shown a sequence of 20 images randomly selected from a pool of 50 deepfake and 50 real images of human faces. Participants were asked if each image was AI-generated or not, to report their confidence, and to describe the reasoning behind each response. Overall detection accuracy was only just above chance and none of the interventions significantly improved this. Participants' confidence in their answers was high and unrelated to accuracy. Assessing the results on a per-image basis reveals participants consistently found certain images harder to label correctly, but reported similarly high confidence regardless of the image. Thus, although participant accuracy was 62% overall, this accuracy across images ranged quite evenly between 85% and 30%, with an accuracy of below 50% for one in every five images. We interpret the findings as suggesting that there is a need for an urgent call to action to address this threat.
translated by 谷歌翻译
Incivility remains a major challenge for online discussion platforms, to such an extent that even conversations between well-intentioned users can often derail into uncivil behavior. Traditionally, platforms have relied on moderators to -- with or without algorithmic assistance -- take corrective actions such as removing comments or banning users. In this work we propose a complementary paradigm that directly empowers users by proactively enhancing their awareness about existing tension in the conversation they are engaging in and actively guides them as they are drafting their replies to avoid further escalation. As a proof of concept for this paradigm, we design an algorithmic tool that provides such proactive information directly to users, and conduct a user study in a popular discussion platform. Through a mixed methods approach combining surveys with a randomized controlled experiment, we uncover qualitative and quantitative insights regarding how the participants utilize and react to this information. Most participants report finding this proactive paradigm valuable, noting that it helps them to identify tension that they may have otherwise missed and prompts them to further reflect on their own replies and to revise them. These effects are corroborated by a comparison of how the participants draft their reply when our tool warns them that their conversation is at risk of derailing into uncivil behavior versus in a control condition where the tool is disabled. These preliminary findings highlight the potential of this user-centered paradigm and point to concrete directions for future implementations.
translated by 谷歌翻译
自我跟踪可以提高人们对他们不健康的行为的认识,为行为改变提供见解。事先工作探索了自动跟踪器如何反映其记录数据,但它仍然不清楚他们从跟踪反馈中学到多少,以及哪些信息更有用。实际上,反馈仍然可以压倒,并简明扼要可以通过增加焦点和减少解释负担来改善学习。为了简化反馈,我们提出了一个自动跟踪反馈显着框架,以定义提供反馈的特定信息,为什么这些细节以及如何呈现它们(手动引出或自动反馈)。我们从移动食品跟踪的实地研究中收集了调查和膳食图像数据,并实施了Salientrack,一种机器学习模型,以预测用户从跟踪事件中学习。使用可解释的AI(XAI)技术,SalientRack识别该事件的哪些特征是最突出的,为什么它们导致正面学习结果,并优先考虑如何根据归属分数呈现反馈。我们展示了用例,并进行了形成性研究,以展示Salientrack的可用性和有用性。我们讨论自动跟踪中可读性的影响,以及如何添加模型解释性扩大了提高反馈体验的机会。
translated by 谷歌翻译
Very few eXplainable AI (XAI) studies consider how users understanding of explanations might change depending on whether they know more or less about the to be explained domain (i.e., whether they differ in their expertise). Yet, expertise is a critical facet of most high stakes, human decision making (e.g., understanding how a trainee doctor differs from an experienced consultant). Accordingly, this paper reports a novel, user study (N=96) on how peoples expertise in a domain affects their understanding of post-hoc explanations by example for a deep-learning, black box classifier. The results show that peoples understanding of explanations for correct and incorrect classifications changes dramatically, on several dimensions (e.g., response times, perceptions of correctness and helpfulness), when the image-based domain considered is familiar (i.e., MNIST) as opposed to unfamiliar (i.e., Kannada MNIST). The wider implications of these new findings for XAI strategies are discussed.
translated by 谷歌翻译
近年来,人们对可解释的AI(XAI)领域的兴趣激增,文献中提出了很多算法。但是,关于如何评估XAI的共识缺乏共识阻碍了该领域的发展。我们强调说,XAI并不是一组整体技术 - 研究人员和从业人员已经开始利用XAI算法来构建服务于不同使用环境的XAI系统,例如模型调试和决策支持。然而,对XAI的算法研究通常不会考虑到这些多样化的下游使用环境,从而对实际用户产生有限的有效性甚至意想不到的后果,以及从业者做出技术选择的困难。我们认为,缩小差距的一种方法是开发评估方法,这些方法在这些用法上下文中说明了不同的用户需求。为了实现这一目标,我们通过考虑XAI评估标准对XAI的原型用法上下文的相对重要性,介绍了情境化XAI评估的观点。为了探索XAI评估标准的上下文依赖性,我们进行了两项调查研究,一项与XAI主题专家,另一项与人群工人进行。我们的结果敦促通过使用使用的评估实践进行负责任的AI研究,并在不同使用环境中对XAI的用户需求有细微的了解。
translated by 谷歌翻译
人工智能(AI)系统越来越多地用于提供建议以促进人类决策。尽管大量工作探讨了如何优化AI系统以产生准确且公平的建议以及如何向人类决策者提供算法建议,但在这项工作中,我们提出了一个不同的基本问题:何时应该提供建议?由于当前不断提供算法建议的局限性的限制,我们提出了以双向方式与人类用户互动的AI系统的设计。我们的AI系统学习使用过去的人类决策为政策提供建议。然后,对于新案例,学识渊博的政策利用人类的意见来确定算法建议将是有用的案例,以及人类最好单独决定的情况。我们通过使用美国刑事司法系统的数据对审前释放决策进行大规模实验来评估我们的方法。在我们的实验中,要求参与者评估被告违反其释放条款的风险,如果释放,并受到不同建议方法的建议。结果表明,与固定的非交互式建议方法相比,我们的交互式辅助方法可以在需要时提供建议,并显着改善人类决策。我们的方法在促进人类学习,保留人类决策者的互补优势以及对建议的更积极反应方面具有额外的优势。
translated by 谷歌翻译