To collaborate with robots, we must be able to understand their decision making. Humans naturally infer other agents' beliefs and desires by reasoning about their observable behavior in a way that resembles inverse reinforcement learning (IRL). Robots can therefore communicate their beliefs and desires by providing demonstrations that are informative for a human learner's IRL. An informative demonstration is one that differs strongly from the learner's expectation of what the robot will do, given their current understanding of the robot's decision making. However, standard IRL does not model the learner's existing expectations and thus cannot perform this counterfactual reasoning. We propose incorporating the learner's current understanding of the robot's decision making into our model of human IRL, so that a robot can select demonstrations that maximize the human's understanding. We also propose a novel measure for estimating how difficult it is for a human to predict instances of a robot's behavior in unseen environments. A user study finds that this test-difficulty measure correlates strongly with human performance and confidence. Interestingly, accounting for human beliefs and counterfactuals when selecting demonstrations decreases human performance on easy tests but increases it on difficult tests, providing insight into how best to use such models.
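As a rough sketch of this selection criterion (not the paper's implementation; the states, Q-values, and rationality parameter below are illustrative), a robot can score candidate demonstration states by the divergence between its actual action distribution and the one the learner expects under their current model:

```python
import numpy as np

def boltzmann(q_values, beta=5.0):
    """Noisy-rational action distribution over Q-values."""
    z = np.exp(beta * (q_values - q_values.max()))
    return z / z.sum()

def informativeness(q_robot, q_believed, beta=5.0):
    """Score a candidate demonstration state by how much the robot's
    actual behavior diverges (KL) from what the learner expects under
    their current model of the robot."""
    p = boltzmann(q_robot, beta)     # what the robot will actually do
    q = boltzmann(q_believed, beta)  # what the learner expects
    return float(np.sum(p * np.log(p / q)))

# Two candidate states: the robot should demonstrate where learner
# expectations differ most from its actual decision making.
candidates = {
    "hallway": (np.array([1.0, 0.0]), np.array([1.0, 0.1])),
    "kitchen": (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
}
scores = {s: informativeness(qr, qb) for s, (qr, qb) in candidates.items()}
best_demo = max(scores, key=scores.get)
```

In the "kitchen" state the learner expects the opposite action, so demonstrating there is maximally counterfactual; in the "hallway" state the learner already predicts the robot well, so a demonstration adds little.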
translated by Google Translate
In this paper, we study the notion of legibility in sequential decision-making tasks under uncertainty. Previous works that extend legibility to scenarios beyond robot motion either focus on deterministic settings or are computationally too expensive. Our proposed approach, called PoL-MDP, is able to handle uncertainty while remaining computationally tractable. We establish the advantages of our approach against state-of-the-art methods in several simulated scenarios of different complexity. We also showcase the use of our legible policies as demonstrations for an inverse reinforcement learning agent, establishing their superiority over demonstrations from the optimal policy. Finally, we assess the legibility of the computed policies through a user study in which people are asked to infer the goal of a mobile robot by observing its actions.
Humans have internal models of robots (like their physical capabilities), the world (like what will happen next), and their tasks (like a preferred goal). However, human internal models are not always perfect: for example, it is easy to underestimate a robot's inertia. Nevertheless, these models change and improve over time as humans gather more experience. Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change. In this work we take a step towards enabling robots to understand the influence they have, leverage it to better assist people, and help human models more quickly align with reality. Our key idea is to model the human's learning as a nonlinear dynamical system which evolves the human's internal model given new observations. We formulate a novel optimization problem to infer the human's learning dynamics from demonstrations that naturally exhibit human learning. We then formalize how robots can influence human learning by embedding the human's learning dynamics model into the robot planning problem. Although our formulations provide concrete problem statements, they are intractable to solve in full generality. We contribute an approximation that sacrifices the complexity of the human internal models we can represent, but enables robots to learn the nonlinear dynamics of these internal models. We evaluate our inference and planning methods in a suite of simulated environments and an in-person user study, where a 7DOF robotic arm teaches participants to be better teleoperators. While influencing human learning remains an open problem, our results demonstrate that this influence is possible and can be helpful in real human-robot interaction.
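A minimal sketch of the core idea (not the paper's formulation): treat the human's internal model as a parameter vector theta and learn the learning dynamics theta_{t+1} = f(theta_t, obs_t) from observed sequences. Here f is assumed linear, and the dynamics are recovered by least squares from a simulated trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)

A_true = np.array([[0.9, 0.1, 0.2],    # ground-truth learning dynamics
                   [0.0, 0.8, 0.1]])   # (unknown to the robot in practice)

def evolve(theta, o):
    """One step of the human's (simulated) internal-model update."""
    return A_true @ np.concatenate([theta, o])

obs = rng.normal(size=(50, 1))          # observations the human receives
thetas = [np.zeros(2)]                  # internal model over time
for o in obs:
    thetas.append(evolve(thetas[-1], o))
traj = np.array(thetas)

# Infer the learning dynamics from the trajectory by least squares.
X = np.hstack([traj[:-1], obs])                 # inputs  [theta_t, obs_t]
Y = traj[1:]                                    # targets theta_{t+1}
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T  # recovered dynamics
```

Once the robot has A_hat, it can roll the human's learning forward under candidate robot behaviors and plan the observations that steer the internal model toward reality; the paper's actual formulation handles nonlinear dynamics.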
Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming the robot has prior information about the human's intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's input to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert the learned reward into a desired robot trajectory. Through simulations and a user study, we demonstrate that our proposed approach learns manipulation tasks from physical human interaction more accurately than existing baselines, particularly when the robot faces new or unexpected objectives. Videos of our user study are available at: https://youtu.be/fsujstyveku
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works choose to fix this coefficient regardless of the type or quality of the human feedback. However, in some settings, giving a demonstration may be much harder than answering a comparison query. In this case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with simulated feedback as well as a user study. We find that overestimating human rationality can have dire consequences for reward accuracy and regret when learning from a single feedback type. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when the human behaves very suboptimally, comparisons actually become more informative, even at the same rationality level. Moreover, when the robot gets to decide which feedback type to ask for, it gains a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of attending to the assumed rationality level, not only when learning from a single feedback type, but especially when the agent learns from multiple feedback types.
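The grounding step can be sketched as maximum-likelihood fitting of a per-feedback-type Boltzmann coefficient (a toy version with simulated choices; the option counts, sample sizes, and true beta values are illustrative):

```python
import numpy as np

def boltzmann_loglik(beta, choices, rewards):
    """Log-likelihood of observed choices under a noisy-rational
    (Boltzmann) model with rationality coefficient beta."""
    ll = 0.0
    for c, r in zip(choices, rewards):
        z = beta * r
        z = z - z.max()                       # numerical stability
        ll += z[c] - np.log(np.exp(z).sum())
    return ll

def fit_beta(choices, rewards, grid=np.linspace(0.01, 20.0, 400)):
    """Ground beta in observed data: maximum likelihood over a grid,
    fitted separately for each feedback type."""
    lls = [boltzmann_loglik(b, choices, rewards) for b in grid]
    return float(grid[int(np.argmax(lls))])

# Simulated feedback over three options per query: comparisons answered
# near-optimally (true beta = 10), demonstrations noisier (true beta = 1).
rng = np.random.default_rng(1)
rewards = [rng.normal(size=3) for _ in range(300)]

def simulate(beta):
    choices = []
    for r in rewards:
        p = np.exp(beta * (r - r.max()))
        p /= p.sum()
        choices.append(int(rng.choice(3, p=p)))
    return choices

beta_comparisons = fit_beta(simulate(10.0), rewards)
beta_demos = fit_beta(simulate(1.0), rewards)
```

The fitted coefficients separate cleanly, so a learner using them would correctly discount the noisier demonstrations instead of applying one default rationality level to both feedback types.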
To assist users according to their individual preferences in assembly tasks, robots typically require user demonstrations in the given task. However, providing demonstrations in actual assembly tasks can be tedious and time-consuming. Our thesis is that we can learn user preferences in assembly tasks from demonstrations in a representative canonical task. Inspired by prior work on human motion economy, we propose representing user preferences as a linear function of abstract, task-agnostic features, such as the movement required and the physical and mental effort demanded of the user. For each user, we learn their preferences from demonstrations in the canonical task and use the learned preferences to anticipate their actions in the actual assembly task, without any user demonstrations in the actual task. We evaluate our proposed method in a model-airplane assembly study and show that preferences can be transferred effectively from canonical to actual assembly tasks, enabling robots to anticipate user actions.
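A minimal sketch of the transfer idea (not the paper's exact method; the feature values and action names are hypothetical): preferences are weights w over task-agnostic features [required motion, physical effort, mental effort], learned from canonical-task choices and reused to rank actions in the actual task:

```python
import numpy as np

def learn_weights(choice_pairs, lr=0.1, epochs=50):
    """Preference perceptron: whenever a demonstrated choice does not
    outscore the rejected alternative, nudge w toward it."""
    w = np.zeros(3)
    for _ in range(epochs):
        for chosen, rejected in choice_pairs:
            if w @ chosen <= w @ rejected:
                w += lr * (chosen - rejected)
    return w

# Canonical-task demonstrations: this user consistently picks the
# lower-effort option (chosen, rejected) per decision point.
canonical_pairs = [
    (np.array([0.2, 0.1, 0.1]), np.array([0.9, 0.5, 0.1])),
    (np.array([0.1, 0.3, 0.2]), np.array([0.4, 0.8, 0.6])),
]
w = learn_weights(canonical_pairs)

# Anticipate the same user's action in the actual assembly task,
# with no demonstrations in that task.
assembly_actions = {
    "assemble_in_place": np.array([0.3, 0.2, 0.4]),
    "carry_to_bench":    np.array([0.8, 0.9, 0.2]),
}
anticipated = max(assembly_actions,
                  key=lambda a: float(w @ assembly_actions[a]))
```

Because the features are task-agnostic, the learned w applies to any task whose actions can be described in the same feature space; that is what makes the canonical-to-actual transfer possible.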
In this paper we examine the problem of determining demonstration sufficiency for AI agents that learn from demonstrations: how can an AI agent self-assess whether it has received enough demonstrations from an expert to ensure a desired level of performance? To address this problem we propose a novel self-assessment approach based on Bayesian inverse reinforcement learning and value-at-risk to enable agents that learn from demonstrations to compute high-confidence bounds on their performance and use these bounds to determine when they have a sufficient number of demonstrations. We propose and evaluate two definitions of sufficiency: (1) normalized expected value difference, which measures regret with respect to the expert's unobserved reward function, and (2) improvement over a baseline policy. We demonstrate how to formulate high-confidence bounds on both of these metrics. We evaluate our approach in simulation and demonstrate the feasibility of developing an AI system that can accurately evaluate whether it has received sufficient training data to guarantee, with high confidence, that it can match an expert's performance or surpass the performance of a baseline policy within some desired safety threshold.
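The sufficiency test can be sketched as a percentile bound over posterior samples (the samples here are simulated stand-ins; in the paper they come from Bayesian IRL, and each sample's normalized expected value difference requires evaluating the agent's policy in the MDP):

```python
import numpy as np

def var_bound(evd_samples, confidence=0.95):
    """Value-at-risk: with probability >= confidence under the posterior,
    the true normalized EVD lies below this bound."""
    return float(np.quantile(evd_samples, confidence))

def demonstrations_sufficient(evd_samples, threshold, confidence=0.95):
    """Declare sufficiency once the high-confidence EVD bound falls
    below the desired performance threshold."""
    return var_bound(evd_samples, confidence) < threshold

rng = np.random.default_rng(0)
# Simulated EVD samples: a wide posterior (few demonstrations) versus a
# concentrated posterior (many demonstrations).
wide_posterior = rng.exponential(scale=0.3, size=1000)
tight_posterior = rng.exponential(scale=0.02, size=1000)
```

With few demonstrations the 95% bound stays above the threshold and the agent keeps asking for more; once the posterior concentrates, the bound drops below the threshold and the agent can stop.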
Human gaze is known to be a strong indicator of underlying human intent and goals during manipulation tasks. This work studies the gaze patterns of human teachers demonstrating tasks to robots and proposes ways in which such patterns can be used to enhance robot learning. Using both kinesthetic teaching and video demonstrations, we identify novel intention-revealing gaze behaviors during teaching. These prove useful for a variety of problems, ranging from reference-frame inference to the segmentation of multi-step tasks. Based on our findings, we propose two proof-of-concept algorithms which show that gaze data can improve subtask classification for multi-step tasks by up to 6%, and reward inference and policy learning for single-step tasks by up to 67%. Our findings provide a foundation for a model of natural human gaze in robot learning-from-demonstration settings and highlight open problems in leveraging human gaze to enhance robot learning.
Human and robot partners increasingly need to work together to perform tasks as a team. Robots designed for such collaboration must reason about how their task-completion strategies interplay with the behavior and skills of their human team members as they coordinate on achieving joint goals. Our goal in this work is to develop a computational framework for robot adaptation to human partners in human-robot team collaborations. We first present an algorithm for autonomously recognizing available task-completion strategies by observing human-human teams performing a collaborative task. By transforming team actions into low dimensional representations using hidden Markov models, we can identify strategies without prior knowledge. Robot policies are learned on each of the identified strategies to construct a Mixture-of-Experts model that adapts to the task strategies of unseen human partners. We evaluate our model on a collaborative cooking task using an Overcooked simulator. Results of an online user study with 125 participants demonstrate that our framework improves the task performance and collaborative fluency of human-agent teams, as compared to state of the art reinforcement learning methods.
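The adaptation step can be illustrated with a toy Mixture-of-Experts gate (the strategy distributions and action names below are made up; the paper identifies strategies with hidden Markov models over human-human team data):

```python
import numpy as np

def gate(recent_actions, strategy_models):
    """Posterior responsibility of each identified strategy given the
    partner's recent actions; used to weight the per-strategy experts."""
    logp = {s: sum(np.log(lik[a]) for a in recent_actions)
            for s, lik in strategy_models.items()}
    m = max(logp.values())
    w = {s: np.exp(v - m) for s, v in logp.items()}
    z = sum(w.values())
    return {s: v / z for s, v in w.items()}

# Two identified strategies as action distributions (0 = chop, 1 = plate).
strategies = {
    "divide_labor": np.array([0.9, 0.1]),
    "follow_lead":  np.array([0.2, 0.8]),
}
weights = gate([0, 0, 1, 0], strategies)
adapted_expert = max(weights, key=weights.get)
```

As the unseen partner acts, the gate shifts responsibility toward the strategy that best explains their behavior, and the corresponding expert policy dominates the robot's action selection.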
Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models are both a necessity and a challenge. In this paper, we explore how HCI and AI researchers conduct user studies in XAI applications based on a systematic literature review. After identifying and thoroughly analyzing 85 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, fairness, usability, and human-AI team performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years, however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations; without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction such as humanoid robots, self-driving vehicles, human computer interaction and computer games to name a few. However, specialized algorithms are needed to effectively and robustly learn models as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games as these domains are the most popular in the literature and provide a wide array of problems and methodologies. 
We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications and highlight current and future research directions.
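The mapping-learning step described above can be shown in its simplest form, behavior cloning with a linear policy on synthetic expert data (illustrative only; real systems use richer observations and function approximators):

```python
import numpy as np

rng = np.random.default_rng(0)
W_expert = np.array([[1.0, -2.0],
                     [0.5,  0.0]])        # expert's (unknown) policy

observations = rng.normal(size=(200, 2))  # recorded sensory observations
actions = observations @ W_expert.T       # expert's demonstrated actions

# Supervised learning: fit a policy mapping observations to actions.
W_clone = np.linalg.lstsq(observations, actions, rcond=None)[0].T

new_obs = np.array([0.3, -0.7])
predicted = W_clone @ new_obs             # cloned policy's action
```

No reward function or explicit programming is involved: the task is specified entirely by the demonstration data, which is the appeal of the imitation paradigm the survey describes.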
Learning-from-demonstration (LfD) approaches have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would substantially burden the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of lifelong learning from demonstration, which allows the agent to continually build upon knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the amount of demonstration required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
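A toy instance of the first family the survey describes, modifying the optimality criterion with a safety factor (here a weighted criterion E[return] - xi * E[cost]; the policies and values are illustrative):

```python
def safe_value(expected_return, expected_cost, xi):
    """Risk-weighted criterion trading return against a safety cost."""
    return expected_return - xi * expected_cost

# Two candidate policies: (expected return, expected safety cost).
policies = {
    "shortcut_near_cliff": (10.0, 4.0),  # higher return, higher risk
    "long_safe_path":      (7.0, 0.5),
}

def best_policy(xi):
    """Optimize the modified criterion for a given safety weight xi."""
    return max(policies, key=lambda p: safe_value(*policies[p], xi))
```

With xi = 0 this reduces to the classic expected-return criterion and picks the risky shortcut; increasing xi flips the choice to the safe path, which is exactly the effect the modified criteria in the survey aim for.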
Humans can leverage physical interaction to teach robot arms. As the human kinesthetically guides the robot through demonstrations, the robot learns the desired task. While prior work has focused on how the robot learns, it is equally important for the human teacher to understand what their robot is learning. Visual displays can communicate this information; however, we hypothesize that visual feedback alone misses out on the physical connection between human and robot. In this paper we introduce a novel class of soft haptic displays that wrap around the robot arm, adding signals without affecting interaction. We first design a pneumatically actuated array that remains flexible in its mounting. We then develop single- and multi-dimensional versions of this wrapped haptic display, and explore human perception of the rendered signals during psychophysics tests and robot learning. We ultimately find that people accurately distinguish single-dimensional feedback with an 11.4% Weber fraction, and identify multi-dimensional feedback with 94.5% accuracy. When physically teaching robot arms, humans leverage the single-dimensional feedback to provide better demonstrations than with visual feedback: our wrapped haptic display decreases teaching time while increasing demonstration quality. This improvement depends on the location and distribution of the wrapped haptic display. Videos of our device and experiments are available here: https://youtu.be/ypcmgeqsjdm
Prior work has identified a resilient phenomenon that threatens the performance of human-AI decision-making teams: overreliance, when people agree with an AI, even when it is incorrect. Surprisingly, overreliance does not reduce when the AI produces explanations for its predictions, compared to only providing predictions. Some have argued that overreliance results from cognitive biases or uncalibrated trust, attributing overreliance to an inevitability of human cognition. By contrast, our paper argues that people strategically choose whether or not to engage with an AI explanation, demonstrating empirically that there are scenarios where AI explanations reduce overreliance. To achieve this, we formalize this strategic choice in a cost-benefit framework, where the costs and benefits of engaging with the task are weighed against the costs and benefits of relying on the AI. We manipulate the costs and benefits in a maze task, where participants collaborate with a simulated AI to find the exit of a maze. Through 5 studies (N = 731), we find that costs such as task difficulty (Study 1), explanation difficulty (Study 2, 3), and benefits such as monetary compensation (Study 4) affect overreliance. Finally, Study 5 adapts the Cognitive Effort Discounting paradigm to quantify the utility of different explanations, providing further support for our framework. Our results suggest that some of the null effects found in literature could be due in part to the explanation not sufficiently reducing the costs of verifying the AI's prediction.
In inverse reinforcement learning (IRL), a learning agent infers a reward function encoding the underlying task using demonstrations from experts. However, many existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). We address two limitations of existing IRL techniques. First, they require an excessive amount of data due to the information asymmetry between the expert and the learner. Second, most of these IRL techniques require solving the computationally intractable forward problem -- computing an optimal policy given a reward function -- in POMDPs. The developed algorithm reduces the information asymmetry while increasing the data efficiency by incorporating task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori in addition to the demonstrations. Further, the algorithm avoids a common source of algorithmic complexity by building on causal entropy as the measure of the likelihood of the demonstrations as opposed to entropy. Nevertheless, the resulting problem is nonconvex due to the so-called forward problem. We solve the intrinsic nonconvexity of the forward problem in a scalable manner through a sequential linear programming scheme that guarantees to converge to a locally optimal policy. In a series of examples, including experiments in a high-fidelity Unity simulator, we demonstrate that even with a limited amount of data and POMDPs with tens of thousands of states, our algorithm learns reward functions and policies that satisfy the task while inducing similar behavior to the expert by leveraging the provided side information.
Assuming that humans are (approximately) rational enables robots to infer reward functions by observing human behavior. But people exhibit a wide array of irrationalities, and our goal in this work is to better understand the effect they can have on reward inference. The challenge in studying this effect is that there are many types of irrationality, with varying degrees of mathematical formalization. We therefore operationalize irrationality in the language of MDPs, by altering the Bellman optimality equation, and use this framework to study how these alterations affect inference. We find that wrongly modeling a systematically irrational human as noisy-rational performs a lot worse than correctly capturing these biases -- so much so that it can be better to skip inference altogether and stick to the prior! More importantly, we show that an irrational human, when correctly modeled, can communicate more information about the reward than a perfectly rational human can. That is, if a robot has the correct model of a human's irrationality, it can draw an even stronger inference than it ever could if the human were rational. Irrationality fundamentally helps rather than hinders reward inference, but it needs to be correctly accounted for.
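The inference being altered can be sketched as Bayesian reward inference under a choice model (the hypotheses and option values below are illustrative). The noisy-rational likelihood here is the baseline the paper perturbs: swapping it for a myopic or otherwise biased model changes what each observation says about the reward:

```python
import numpy as np

def boltzmann_likelihood(choice, values, beta=2.0):
    """Probability the modeled human picks `choice` given option values."""
    p = np.exp(beta * (values - values.max()))
    p /= p.sum()
    return p[choice]

hypotheses = {                               # candidate reward functions,
    "prefers_A": np.array([2.0, 0.0, 0.0]),  # as values of options A, B, C
    "prefers_B": np.array([0.0, 2.0, 0.0]),
}
posterior = {h: 0.5 for h in hypotheses}     # uniform prior

for choice in [0, 0, 1, 0]:                  # observed choices (mostly A)
    for h, values in hypotheses.items():
        posterior[h] *= boltzmann_likelihood(choice, values)
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}
```

If the human's actual choice process deviates from this likelihood, the update above concentrates on the wrong hypothesis, which is the mismodeling failure the paper quantifies; conversely, a correct model of a biased choice process can make each observation more, not less, informative.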
Human decision-making is plagued by many systematic errors. Many of these errors can be avoided by providing decision aids that guide decision-makers to attend to the important information and integrate it according to a rational decision strategy. Designing such decision aids used to be a tedious manual process; advances in cognitive science may make it possible to automate this process in the future. We recently introduced machine learning methods for automatically discovering optimal strategies for human decision-making, along with an automatic method for explaining those strategies to people. Decision aids constructed with this approach were able to improve human decision-making. However, following the descriptions generated by that method is very tedious. We hypothesized that this problem can be overcome by conveying the automatically discovered decision strategy as a series of natural language instructions. Experiment 1 showed that people do indeed understand such procedural instructions more easily than the descriptions generated by our previous method. Encouraged by this finding, we developed an algorithm for translating the output of our previous method into procedural instructions. We applied the improved method to automatically generate decision aids for a naturalistic planning task (planning a road trip) and a naturalistic decision task (choosing a mortgage). Experiment 2 showed that these automatically generated decision aids significantly improved people's performance at planning road trips and choosing mortgages. These findings suggest that AI-powered boosting has potential for improving human decision-making in the real world.
Interactive reinforcement learning proposes the use of external information to speed up the learning process. When interacting with a learner agent, humans may provide either evaluative or informative advice. Prior research has focused on the effect of human advice by including real-time feedback in the interactive reinforcement learning process, specifically aiming to improve the agent's learning speed while minimizing the time demanded of the human. This work instead focuses on answering which of the two approaches, evaluative or informative, humans prefer as a teaching method. Moreover, this work presents an experimental setup for human trials designed to compare the methods people use to provide human-in-the-loop advice. The results obtained show that users giving informative advice provide more accurate advice, are willing to assist the learner agent for longer, and provide more advice per episode. Additionally, self-evaluations from participants using the informative approach indicate that they perceive the agent's ability to follow their advice as higher, and consequently rate the accuracy of their own advice higher, compared with those providing evaluative advice.
Affordance refers to the perception of the possible actions an object allows. Despite its relevance to human-computer interaction, no existing theory explains the mechanisms that underpin affordance formation; that is, how affordances are discovered and adapted through interaction. We propose an integrative theory of affordance formation based on the theory of reinforcement learning in the cognitive sciences. The key assumption is that users learn to associate promising motor actions with experience in the presence of reinforcement signals (success/failure). They also learn to categorize actions (e.g., "rotating" a dial), giving them the ability to name and reason about affordances. When encountering novel widgets, their ability to generalize these actions determines the affordances they perceive. We implement this theory in a virtual robot model, which demonstrates human-like adaptation of affordances in interactive widget tasks. While its predictions align with trends in human data, humans are able to adapt affordances faster, suggesting the existence of additional mechanisms.
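The key assumption can be reduced to a toy reinforcement-learning loop (the action names, success probabilities, and learning parameters are illustrative, not from the paper's model): success/failure signals teach which motor action a widget affords.

```python
import numpy as np

rng = np.random.default_rng(0)
success_prob = {"press": 0.9, "rotate": 0.1, "slide": 0.2}  # true widget
actions = list(success_prob)

q = {a: 0.0 for a in actions}        # learned action values
alpha, eps = 0.2, 0.2                # learning rate, exploration rate

for _ in range(500):
    if rng.random() < eps:
        a = str(rng.choice(actions))                  # explore
    else:
        a = max(q, key=q.get)                         # exploit
    reward = float(rng.random() < success_prob[a])    # success/failure
    q[a] += alpha * (reward - q[a])                   # incremental update

perceived_affordance = max(q, key=q.get)
```

After interaction, the action with the highest learned value is the one the agent perceives the widget as affording; generalizing these learned values to novel widgets is the step where the full theory goes beyond this sketch.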