Humans can leverage physical interaction to teach robot arms. This physical interaction depends on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human's intended task. In contrast, in this paper we introduce an algorithmic formalism that learns from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's input to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback are up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert the learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach learns manipulation tasks from physical human interaction more accurately than existing baselines, particularly when the robot faces new or unexpected objectives. Videos of our user study are available at: https://youtu.be/fsujstyveku
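To make the comparison-based idea above concrete, here is a minimal Python sketch of a Boltzmann-style loss that scores the human's input against nearby alternative trajectories and averages it over an ensemble of reward models. The linear reward, the two-dimensional feature map, and all numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def traj_features(traj):
    # Hypothetical feature map: path length and how far the final state
    # ends up from the origin (a stand-in goal).
    traj = np.asarray(traj)
    path_len = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    return np.array([path_len, np.linalg.norm(traj[-1])])

def feedback_loss(weights, human_input, alternatives, beta=1.0):
    # Softmax (Boltzmann) loss: the human's demonstration / correction /
    # preferred option should score higher than nearby alternatives under
    # the learned linear reward w^T phi(traj).
    trajs = [human_input] + list(alternatives)
    logits = beta * np.array([weights @ traj_features(t) for t in trajs])
    m = logits.max()
    log_p_human = logits[0] - (m + np.log(np.exp(logits - m).sum()))
    return -log_p_human

def ensemble_loss(ensemble, human_input, alternatives):
    # Average the comparison loss over an ensemble of reward models;
    # disagreement between members can later signal uncertainty.
    return float(np.mean([feedback_loss(w, human_input, alternatives)
                          for w in ensemble]))

# Toy usage: three reward models, one demonstration vs. two perturbed alternatives.
rng = np.random.default_rng(0)
demo = np.linspace([0.0, 0.0], [1.0, 0.0], 10)
alts = [demo + rng.normal(0.0, 0.05, demo.shape) for _ in range(2)]
ensemble = [rng.normal(size=2) for _ in range(3)]
print(ensemble_loss(ensemble, demo, alts))
```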
When robots interact with humans in homes, roads, or factories, the human's behavior often changes in response to the robot. Non-stationary humans are challenging for robot learners: actions the robot has learned to coordinate with the original human may fail after the human adapts to the robot. In this paper we introduce an algorithmic formalism that enables robots (i.e., ego agents) to co-adapt alongside dynamic humans (i.e., other agents) using only the robot's low-level states, actions, and rewards. A core challenge is that humans not only react to the robot's behavior, but the way in which humans react inevitably changes both over time and between users. To deal with this challenge, our insight is that -- instead of building an exact model of the human -- robots can learn and reason over high-level representations of the human's policy and policy dynamics. Applying this insight we develop RILI: Robustly Influencing Latent Intent. RILI first embeds low-level robot observations into predictions of the human's latent strategy and strategy dynamics. Next, RILI harnesses these predictions to select actions that influence the adaptive human towards advantageous, high reward behaviors over repeated interactions. We demonstrate that -- given RILI's measured performance with users sampled from an underlying distribution -- we can probabilistically bound RILI's expected performance across new humans sampled from the same distribution. Our simulated experiments compare RILI to state-of-the-art representation and reinforcement learning baselines, and show that RILI better learns to coordinate with imperfect, noisy, and time-varying agents. Finally, we conduct two user studies where RILI co-adapts alongside actual humans in a game of tag and a tower-building task. See videos of our user studies here: https://youtu.be/WYGO5amDXbQ
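As a rough illustration of the RILI pipeline described above -- embed the previous interaction into a latent strategy, predict how that strategy will shift, and condition the robot's policy on the prediction -- here is a hedged PyTorch sketch. The layer sizes, the linear strategy-dynamics head, and the class name RILISketch are assumptions; the paper's actual architecture and training losses are not reproduced here.

```python
import torch
import torch.nn as nn

class RILISketch(nn.Module):
    # Hypothetical layer sizes; only the overall structure mirrors the idea.
    def __init__(self, interaction_dim, state_dim, action_dim, latent_dim=8):
        super().__init__()
        # Encoder: the previous interaction's states/actions/rewards -> latent strategy z
        self.encoder = nn.Sequential(nn.Linear(interaction_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # Strategy dynamics: predict how the partner's strategy shifts next interaction
        self.dynamics = nn.Linear(latent_dim, latent_dim)
        # Policy: current state + predicted strategy -> robot action
        self.policy = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                                    nn.Linear(64, action_dim))

    def forward(self, prev_interaction, state):
        z = self.encoder(prev_interaction)   # inferred latent strategy
        z_next = self.dynamics(z)            # predicted strategy for the next interaction
        return self.policy(torch.cat([state, z_next], dim=-1))

# Toy usage with batch size 1.
model = RILISketch(interaction_dim=30, state_dim=4, action_dim=2)
print(model(torch.zeros(1, 30), torch.zeros(1, 4)).shape)
```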
Humans can leverage physical interaction to teach robot arms. As the human kinesthetically guides the robot through demonstrations, the robot learns the desired task. While prior work has focused on how the robot learns, it is equally important that the human teacher understands what their robot is learning. Visual displays can communicate this information; however, we hypothesize that visual feedback alone misses out on the physical connection between the human and the robot. In this paper we introduce a novel class of soft haptic displays that wrap around the robot arm, adding signals without affecting interaction. We first design a pneumatically actuated array that remains flexible when mounted. We then develop single- and multi-dimensional versions of this wrapped haptic display, and explore how humans perceive the rendered signals during psychophysical tests and robot learning. We ultimately find that people accurately distinguish single-dimensional feedback with an 11.4% Weber fraction, and identify multi-dimensional feedback with 94.5% accuracy. When physically teaching robot arms, humans leverage the single-dimensional feedback to provide better demonstrations than with visual feedback: our wrapped haptic display decreases teaching time while increasing demonstration quality. This improvement depends on the location and distribution of the wrapped haptic display. You can see videos of our device and experiments here: https://youtu.be/ypcmgeqsjdm
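The 11.4% Weber fraction mentioned above comes from psychophysical testing. The following Python sketch shows one common way such a number can be estimated from discrimination data: interpolate the comparison level at which participants answer "stronger" 75% of the time, then divide the resulting just-noticeable difference by the reference intensity. The inflation levels, response rates, and 75% criterion are illustrative assumptions, not the study's data or analysis pipeline.

```python
import numpy as np

def weber_fraction(reference, comparisons, p_stronger, threshold=0.75):
    # Estimate the just-noticeable difference (JND) as the comparison level
    # where the participant reports "stronger" 75% of the time, then divide
    # by the reference intensity. Linear interpolation stands in for a
    # fitted psychometric function.
    jnd_level = np.interp(threshold, p_stronger, comparisons)
    return (jnd_level - reference) / reference

# Toy usage with made-up inflation levels and response rates.
reference = 10.0
comparisons = np.array([10.0, 10.5, 11.0, 11.5, 12.0])
p_stronger = np.array([0.50, 0.60, 0.72, 0.85, 0.95])
print(weber_fraction(reference, comparisons, p_stronger))   # roughly 0.11, i.e. ~11%
```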
Humans have internal models of robots (like their physical capabilities), the world (like what will happen next), and their tasks (like a preferred goal). However, human internal models are not always perfect: for example, it is easy to underestimate a robot's inertia. Nevertheless, these models change and improve over time as humans gather more experience. Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change. In this work we take a step towards enabling robots to understand the influence they have, leverage it to better assist people, and help human models more quickly align with reality. Our key idea is to model the human's learning as a nonlinear dynamical system which evolves the human's internal model given new observations. We formulate a novel optimization problem to infer the human's learning dynamics from demonstrations that naturally exhibit human learning. We then formalize how robots can influence human learning by embedding the human's learning dynamics model into the robot planning problem. Although our formulations provide concrete problem statements, they are intractable to solve in full generality. We contribute an approximation that sacrifices the complexity of the human internal models we can represent, but enables robots to learn the nonlinear dynamics of these internal models. We evaluate our inference and planning methods in a suite of simulated environments and an in-person user study, where a 7DOF robotic arm teaches participants to be better teleoperators. While influencing human learning remains an open problem, our results demonstrate that this influence is possible and can be helpful in real human-robot interaction.
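A toy Python sketch of the two ingredients described above -- a learned model of how the human's internal model evolves, and a planner that picks robot actions to drive that model toward reality -- might look like the following. The tanh update rule, the one-step greedy planner, and the 1-D "inertia" example are stand-in assumptions rather than the paper's formulation.

```python
import numpy as np

def human_learning_step(theta, observation, rate=0.5):
    # Assumed nonlinear learning rule: the human nudges their internal model
    # toward what they just observed (a stand-in for the learned dynamics f).
    return theta + rate * np.tanh(observation - theta)

def pick_teaching_action(theta_hat, theta_true, candidate_actions, simulate_obs):
    # Choose the robot action whose induced observation moves the human's
    # (estimated) internal model closest to reality after one learning step.
    def misalignment(a):
        return np.linalg.norm(human_learning_step(theta_hat, simulate_obs(a)) - theta_true)
    return min(candidate_actions, key=misalignment)

# Toy usage: the human underestimates the robot's "inertia" (1-D internal model).
theta_true = np.array([2.0])
theta_hat = np.array([0.5])
actions = [0.2, 1.0, 3.0]
best = pick_teaching_action(theta_hat, theta_true, actions,
                            simulate_obs=lambda a: np.array([a]))
print(best)
```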
Physical interaction between humans and robots can help robots learn to perform complex tasks. The robot arm gains information by watching how the human physically guides it throughout the task. While prior works focus on how the robot learns, it is equally important that this learning is transparent to the human teacher. Visual displays that show the robot's uncertainty can potentially communicate this information; however, we hypothesize that visual feedback mechanisms miss out on the physical connection between the human and robot. In this work we present a soft haptic display that wraps around and conforms to the surface of a robot arm, adding haptic signals at existing points of contact without significantly affecting the interaction. We demonstrate how soft actuation creates salient haptic signals while still allowing flexibility in device mounting. Using a psychophysics experiment, we show that users can accurately distinguish inflation levels of the wrapped display with an average Weber fraction of 11.4%. When we place the wrapped display around the arm of a robotic manipulator, users are able to interpret and leverage the haptic signals in a sample robot learning task, improving identification of the regions where the robot needs more training and enabling users to provide better demonstrations. See videos of our device and user studies here: https://youtu.be/tx-2tqeb9nw
When robots interact with human partners, these partners often change their behavior in response to the robot. On the one hand this is challenging because the robot must learn to coordinate with a dynamic partner. But on the other hand -- if the robot understands these dynamics -- it can harness its own behavior, influence the human, and guide the team towards effective collaboration. Prior research enables robots to learn to influence other robots or simulated agents. In this paper we extend these learning approaches to now influence humans. What makes humans especially hard to influence is that -- not only do humans react to the robot -- but the way a single user reacts to the robot may change over time, and different humans will respond to the same robot behavior in different ways. We therefore propose a robust approach that learns to influence changing partner dynamics. Our method first trains with a set of partners across repeated interactions, and learns to predict the current partner's behavior given the previous states, actions, and rewards. Next, we rapidly adapt to new partners by sampling trajectories the robot learned with the original partners, and then leveraging these existing behaviors to influence the new partner dynamics. We compare our resulting algorithm to alternatives across simulated environments and a user study in which the robot and participants collaborate to build towers. We find that our approach outperforms the alternatives, even when the partner follows new or unexpected dynamics. Videos of the user study are available here: https://youtu.be/lyswm8an18g
When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task ``features'' -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use -- so that we can learn their preferences and objectives -- we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
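To illustrate the parallel with contrastive learning, here is a small PyTorch sketch of a pairwise contrastive loss in which the similar/dissimilar label comes from a human similarity query instead of a data-augmentation heuristic. The margin value, the linear embedding, and the flattened-trajectory input are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def similarity_query_loss(embed, traj_a, traj_b, human_says_similar, margin=1.0):
    # Contrastive-style loss driven by the human's similarity answer:
    # pull "similar" behaviors together in feature space, push "different"
    # ones at least `margin` apart.
    za, zb = embed(traj_a), embed(traj_b)
    d = F.pairwise_distance(za, zb)
    if human_says_similar:
        return (d ** 2).mean()
    return (F.relu(margin - d) ** 2).mean()

# Toy usage: a linear embedding over flattened trajectories (sizes are made up).
embed = torch.nn.Linear(20, 4)
loss = similarity_query_loss(embed, torch.randn(5, 20), torch.randn(5, 20), True)
loss.backward()
```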
Learning from demonstration (LfD) approaches enable end-users to teach robots new tasks by demonstrating the desired behaviors, democratizing access to robotics. However, current LfD frameworks are not capable of fast adaptation to heterogeneous human demonstrations, nor of large-scale deployment in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing for quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed over lifelong deployment, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three continuous control tasks with an average 57% improvement in policy returns and 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR in a real-robot table tennis task.
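One way to picture the policy-mixture idea is a least-squares fit of nonnegative, normalized mixture weights so that a blend of prototype policies reproduces a new demonstration. The Python sketch below is only that simplified stand-in; FLAIR's actual mixture optimization, task inference, and model expansion are not shown.

```python
import numpy as np

def fit_mixture_weights(prototype_actions, demo_actions):
    # Fit nonnegative, normalized mixture weights so a blend of prototype
    # policies' actions reproduces a new demonstration -- a least-squares
    # stand-in for a policy-mixture optimization.
    A = np.stack([np.ravel(p) for p in prototype_actions], axis=1)   # (T*act_dim, K)
    b = np.ravel(demo_actions)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    w = np.clip(w, 0.0, None)
    return w / (w.sum() + 1e-8)

# Toy usage: two prototype policies' actions along a 5-step demonstration.
proto = [np.tile([1.0, 0.0], (5, 1)), np.tile([0.0, 1.0], (5, 1))]
demo = 0.3 * proto[0] + 0.7 * proto[1]
print(fit_mixture_weights(proto, demo))   # -> approximately [0.3, 0.7]
```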
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works have opted to fix this coefficient regardless of the type or quality of the human feedback. However, in some settings, giving a demonstration may be much harder than answering a comparison query. In these cases, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with simulated feedback as well as in a user study. We find that overestimating human rationality can have dire consequences on reward accuracy and regret when learning from a single feedback type. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when the human behavior is very suboptimal, comparisons actually become more informative, even at the same rationality level. Moreover, when the robot gets to decide which feedback type to ask for, it gains a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when agents learn from multiple feedback types.
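The "rationality coefficient" above is the inverse temperature beta of a Boltzmann (noisy-rational) choice model. A minimal sketch of grounding beta in data, as advocated here, is to fit it by maximum likelihood separately for each feedback type. The scipy-based example below uses made-up reward values and treats every feedback instance as a choice among alternatives, which is a simplification.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(beta, chosen_rewards, alternative_rewards):
    # Negative log-likelihood of the human's choices under a Boltzmann
    # (noisy-rational) model with rationality coefficient beta.
    nll = 0.0
    for r_chosen, r_alts in zip(chosen_rewards, alternative_rewards):
        logits = beta * np.append(r_chosen, r_alts)
        m = logits.max()
        nll -= logits[0] - (m + np.log(np.exp(logits - m).sum()))
    return nll

def fit_rationality(chosen_rewards, alternative_rewards):
    # Ground beta in data for one feedback type instead of assuming a default.
    res = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 100.0), method="bounded",
                          args=(chosen_rewards, alternative_rewards))
    return res.x

# Toy usage: fit beta for one feedback type from (chosen, alternatives) reward pairs;
# each feedback type would get its own fitted beta.
print(fit_rationality([1.0, 1.0, 0.9], [[0.0], [0.2], [1.0]]))
```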
While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years, however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations; without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction such as humanoid robots, self-driving vehicles, human computer interaction and computer games to name a few. However, specialized algorithms are needed to effectively and robustly learn models as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications and highlight current and future research directions.
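As a concrete instance of "learning a mapping between observations and actions" from demonstrations, here is a minimal behavioral-cloning sketch in PyTorch. The network sizes, the mean-squared-error loss, and the random stand-in data are placeholders; this illustrates only the simplest corner of the design space the survey covers.

```python
import torch
import torch.nn as nn

# Minimal behavioral cloning: regress demonstrated actions from observations.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(obs_batch, act_batch):
    loss = nn.functional.mse_loss(policy(obs_batch), act_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-in "demonstrations".
print(bc_update(torch.randn(64, 4), torch.randn(64, 2)))
```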
When humans interact with robots, they are inevitably influenced. Consider an autonomous car driving near a human: the autonomous car's speed and steering will affect how the human drives. Prior works have developed frameworks that enable robots to influence humans towards desired behaviors. But while these approaches are effective in the short term (i.e., over the first few human-robot interactions), here we explore long-term influence (i.e., repeated interactions between the same human and robot). Our key insight is that humans are dynamic: people adapt to the robot, and behaviors that are influential now may fail once the human learns to anticipate the robot's actions. With this insight, we experimentally demonstrate that a prevalent game-theoretic formalism for generating influential robot behavior becomes less effective over repeated interactions. Next, we propose three modifications to Stackelberg games that make the robot's policy both influential and unpredictable. We finally test these modifications in simulations and a user study: our results suggest that robots which deliberately make their behavior harder to anticipate are better able to maintain influence over long-term interactions. See videos here: https://youtu.be/ydo83cgjz2q
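One simple way to make a leader's behavior harder to anticipate is to sample among high-value actions instead of always playing the Stackelberg best response, as in the Python sketch below. This softmax sampling is an illustrative stand-in and is not one of the three specific modifications proposed in the paper.

```python
import numpy as np

def softmax_leader_action(action_values, temperature=1.0, rng=None):
    # Sample among high-value leader actions instead of always playing the
    # argmax: trades a little immediate reward for unpredictability.
    rng = rng or np.random.default_rng()
    logits = np.asarray(action_values, dtype=float) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Toy usage: the best action is chosen most often, but not deterministically.
print([softmax_leader_action([1.0, 0.8, 0.1], temperature=0.2) for _ in range(10)])
```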
Reward learning is a fundamental problem in human-robot interaction that enables robots to align with what their human users want. Many preference-based learning algorithms and active querying techniques have been proposed as solutions to this problem. In this paper, we present APReL, a library of active preference-based reward learning algorithms, which enables researchers and practitioners to experiment with existing techniques and to easily develop their own algorithms for the various modules of the problem. APReL is available at https://github.com/stanford-iliad/aprel.
This paper proposes a method that enables a robot to incrementally learn a control objective function from human directional corrections. Existing methods learn from human magnitude corrections and require the human to carefully choose the magnitudes of the corrections; otherwise they can easily lead to over-correction and inefficient learning. The proposed method only requires human directional corrections -- corrections that only indicate the direction of a control change without indicating its magnitude -- applied at some time instances during the robot's motion. We only assume that the human's corrections, regardless of their magnitudes, point in a direction that improves the robot's current motion relative to an implicit control objective function. The human's valid corrections thus always account for half of the correction space. The proposed method uses the direction of a correction to update the estimate of the objective function based on a cutting-plane technique. We establish theoretical results showing that this process guarantees convergence of the learned objective function to the implicit one. The proposed method is validated through numerical examples, user studies with two human-robot games, and a real-world quadrotor experiment. The results confirm the convergence of the proposed method and show that it is more effective (higher success rate), efficient/effortless (requiring fewer human corrections), and accessible (fewer early wasted trials) than state-of-the-art robot interactive learning schemes.
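The cutting-plane intuition above -- each directional correction rules out the half of weight space that disagrees with it -- can be pictured with a few lines of Python. The linear objective, the set-of-samples representation of the hypothesis space, and the toy numbers are assumptions; the paper's actual algorithm and convergence analysis are more involved.

```python
import numpy as np

def directional_correction_cut(candidates, feature_change):
    # Cutting-plane style update: a directional correction says the human's
    # implicit objective improves along `feature_change`, so discard every
    # candidate weight vector on the wrong side of that half-space.
    candidates = np.asarray(candidates)
    keep = candidates @ feature_change >= 0.0
    return candidates[keep]

# Toy usage: random unit-norm hypotheses; one correction removes roughly half.
rng = np.random.default_rng(1)
thetas = rng.normal(size=(1000, 3))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
print(len(directional_correction_cut(thetas, np.array([1.0, 0.0, 0.0]))))
```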
To collaborate well with robots, we must be able to understand their decision making. Humans naturally infer other agents' beliefs and desires by reasoning about their observable behavior in a way that resembles inverse reinforcement learning (IRL). Robots can therefore convey their beliefs and desires by providing demonstrations that are informative for the human learner's IRL. An informative demonstration is one that differs strongly from what the learner expects the robot to do, given the learner's current understanding of the robot's decision making. However, standard IRL does not model the learner's existing expectations and therefore cannot perform this kind of counterfactual reasoning. We propose to incorporate the learner's current understanding of the robot's decision making into our model of human IRL, so that the robot can select demonstrations that maximize the human's understanding. We also propose a novel measure for estimating how difficult it is for a human to predict instances of the robot's behavior in unseen environments. A user study finds that our measure of test difficulty correlates strongly with human performance and confidence. Interestingly, selecting demonstrations that account for the human's beliefs and counterfactuals reduces human performance on easy tests but improves it on difficult tests, providing insight into how best to utilize such models.
Dexterous manipulation with anthropomorphic robot hands remains a challenging problem in robotics because of the high-dimensional state and action spaces and complex contacts. Nevertheless, skillful closed-loop manipulation is required to enable humanoid robots to operate in unstructured real-world environments. Reinforcement learning (RL) has traditionally imposed enormous interaction data requirements for optimizing such complex control problems. We introduce a new framework that leverages recent advances in GPU-based simulation along with the strength of imitation learning in guiding policy search towards promising behaviors to make RL training feasible in these domains. To this end, we present an immersive virtual reality teleoperation interface designed for interactive human-like manipulation on contact rich tasks and a suite of manipulation environments inspired by tasks of daily living. Finally, we demonstrate the complementary strengths of massively parallel RL and imitation learning, yielding robust and natural behaviors. Videos of trained policies, our source code, and the collected demonstration datasets are available at https://maltemosbach.github.io/interactive_human_like_manipulation/.
For many reinforcement learning (RL) applications, specifying a reward is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL), which uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. Moreover, it achieves similar or better performance with significantly fewer queries by shifting the focus from reducing the reward approximation error to improving the policy induced by the reward model. We support our findings with extensive evaluations in multiple environments and with different query types.
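A rough Python sketch of IDRL-style query selection: draw posterior samples of a (here, linear) reward, compute the return gap between two plausibly optimal policies under each sample, and pick the query whose simulated answer tells us most about that gap. The correlation score below is a crude stand-in for the paper's information-gain criterion, and the linear-reward setup is an assumption.

```python
import numpy as np

def select_query(reward_samples, query_features, policy_a_feats, policy_b_feats):
    # Pick the query whose (simulated) answer is most informative about the
    # return gap between two plausibly optimal policies, using posterior
    # samples of a linear reward. Correlation is a crude proxy for info gain.
    gaps = reward_samples @ (policy_a_feats - policy_b_feats)   # per-sample return gap
    best, best_score = None, -np.inf
    for i, phi in enumerate(query_features):
        answers = reward_samples @ phi                           # per-sample query answer
        score = abs(np.corrcoef(answers, gaps)[0, 1])
        if score > best_score:
            best, best_score = i, score
    return best

# Toy usage: 200 posterior samples of a 3-D linear reward, 4 candidate queries.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 3))
queries = [np.eye(3)[i] for i in range(3)] + [np.ones(3)]
print(select_query(samples, queries, np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))
```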
Socially aware robot navigation, in which a robot must optimize its trajectory to maintain comfortable and compliant spatial interactions with humans in addition to reaching its goal without collisions, is a fundamental yet challenging task for robots navigating among people. While learning-based methods have achieved better performance than earlier model-based approaches, they still have drawbacks: reinforcement learning methods, which rely on optimizing hand-crafted rewards, are unlikely to model social compliance comprehensively and can lead to reward exploitation problems; inverse reinforcement learning methods, which learn a policy from human demonstrations, suffer from expensive and partial samples and require extensive feature engineering to be reasonable. In this paper, we propose FAPL, a feedback-efficient interactive reinforcement learning approach that distills human preference and comfort into a reward model, which serves as a teacher guiding the agent to explore the latent aspects of social compliance. Hybrid experience and off-policy learning are introduced to improve the efficiency of samples and human feedback. Extensive simulation experiments demonstrate the advantages of FAPL. A user study, in which a robot navigates around humans in real-world scenarios, further qualitatively evaluates the benefits of the learned robot behavior.
We introduce an imitation learning-based physical human-robot interaction algorithm capable of predicting appropriate robot responses in complex interactions involving a superposition of multiple interactions. Our proposed algorithm, Blending Bayesian Interaction Primitives (B-BIP), allows us to achieve responsive interactions in complex hugging scenarios, capable of reciprocating and adapting to a hug's motion and timing. We show that this algorithm is a generalization of prior work, for which the original formulation reduces to the particular case of a single interaction, and evaluate our method through both an extensive user study and empirical experiments. Our algorithm yields significantly better quantitative prediction error and more-favorable participant responses with respect to accuracy, responsiveness, and timing, when compared to existing state-of-the-art methods.
In this paper we study the notion of legibility in sequential decision-making tasks under uncertainty. Previous works that extend legibility to scenarios beyond robot motion either focus on deterministic settings or are computationally too expensive. Our proposed approach, dubbed POL-MDP, is able to handle uncertainty while remaining computationally tractable. We establish the advantages of our approach against state-of-the-art methods in several simulated scenarios of different complexity. We also showcase the use of our legible policies as demonstrations for an inverse reinforcement learning agent, establishing their superiority over demonstrations based on the optimal policy. Finally, we assess the legibility of the computed policies through a user study in which people are asked to infer the goal of a mobile robot by observing its actions.
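For intuition about what the user study asks observers to do, here is a classic legibility-style goal-inference model in Python: goals whose optimal behavior better explains the observed prefix of a trajectory receive higher posterior probability. This observer model is borrowed from the motion-legibility literature purely as an illustration; POL-MDP's formulation for stochastic settings is different, and the numbers below are made up.

```python
import numpy as np

def observer_goal_posterior(cost_so_far, cost_to_go, optimal_cost, priors):
    # Goals whose optimal behavior better explains the observed prefix
    # (small detour: cost_so_far + cost_to_go close to optimal_cost)
    # receive higher posterior probability.
    scores = np.array(priors) * np.exp(-(np.array(cost_so_far) + np.array(cost_to_go)
                                         - np.array(optimal_cost)))
    return scores / scores.sum()

# Toy usage with two candidate goals.
print(observer_goal_posterior(cost_so_far=[2.0, 2.0], cost_to_go=[1.0, 3.0],
                              optimal_cost=[3.0, 3.5], priors=[0.5, 0.5]))
```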