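Complex sequential tasks in continuous-control settings often require an agent to successfully traverse a set of "narrow passages" in its state space. Solving such sparse-reward tasks in a sample-efficient manner is a challenge for modern reinforcement learning (RL), both because of the long-horizon nature of the problem and because sufficient positive signal is lacking during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide agent exploration. Hindsight relabelling, on the other hand, requires no additional source of information. However, existing strategies explore based on task-agnostic goal distributions, which can make long-horizon tasks impractical to solve. In this work, we extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex single-arm and dual-arm robotic manipulation tasks against strong, suitable baselines. The method requires fewer demonstrations to solve all tasks and achieves significantly higher overall performance as task complexity increases. Finally, we investigate the robustness of the proposed solution with respect to the quality of the input representations and the number of demonstrations.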
translated by 谷歌翻译
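In recent years, deep reinforcement learning (DRL) has successfully entered complex decision-making applications such as robotics, autonomous driving, and video games. Off-policy algorithms tend to be more sample-efficient than their on-policy counterparts, and can additionally benefit from any off-policy data stored in the replay buffer. Expert demonstrations are a popular source of such data: the agent is exposed to successful states and actions, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make good use of the demonstrations in the buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We carry out a study that evaluates several of these ideas in isolation, to see which of them has the most significant impact. We also present a new method for sparse-reward tasks, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations, to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on manipulation robotics, specifically on a 3D reaching task for a robotic arm in simulation. We show that our method, SACR2, based on reward relabeling, improves performance on this task, even in the absence of demonstrations.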
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
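Mastering robotic manipulation skills through reinforcement learning (RL) typically requires the design of a reward function. Recent advances in the area have shown that using sparse rewards, i.e. rewarding the agent only when the task has been successfully completed, can lead to better policies. However, state-action space exploration is more difficult in this case. Recent RL approaches to learning with sparse rewards have relied on high-quality human demonstrations of the task, but these can be costly, time-consuming, or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robot manipulation task can be seen as a locomotion task from the perspective of the object being manipulated, i.e. the object could learn how to reach a target state on its own. To exploit this idea, we introduce a framework in which an object locomotion policy is first obtained using a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity, and achieves higher success rates and faster learning than alternative algorithms. SLDRs are especially beneficial for tasks such as multi-object stacking and non-rigid object manipulation.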
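In recent years, deep reinforcement learning (DRL) has successfully entered complex decision-making applications such as robotics, autonomous driving, and video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is learning from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method that can leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations, to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a robotic arm with 6 degrees of freedom in simulation. We show that our method, based on reward relabeling, improves the performance of the base algorithms (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, two improvements from previous works, integrated into our approach, allow it to outperform all baselines.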
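The reward-bonus relabeling described in this abstract can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the replay buffer is assumed to be a plain Python list, each transition is assumed to be a dict with a `reward` key, and the function names and the bonus value are hypothetical.

```python
def with_bonus(transitions, bonus):
    """Return copies of the transitions with `bonus` added to each reward."""
    return [{**t, "reward": t["reward"] + bonus} for t in transitions]


def store_demonstrations(buffer, demos, bonus):
    """Expert imitation: every demonstration transition receives the bonus."""
    for demo in demos:
        buffer.extend(with_bonus(demo, bonus))


def store_episode(buffer, episode, success, bonus):
    """Self-imitation: a successful episode is relabeled with the same bonus
    before it enters the buffer; a failed episode is stored unchanged."""
    buffer.extend(with_bonus(episode, bonus) if success else episode)
```

Because the relabeling only touches what is written to the buffer, a sketch like this leaves the off-policy learner itself (SAC or DDPG in the abstract) unchanged.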
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
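Learning robot manipulation through deep reinforcement learning in environments with sparse rewards is a challenging task. In this paper, we address this problem by introducing the notion of imaginary object goals. For a given manipulation task, the object of interest is first trained, through physically realistic simulation, to reach a desired target position on its own, without being manipulated. The object policy is then leveraged to build a predictive model of plausible object trajectories, which provides the robot with a curriculum of incrementally more difficult object goals to reach during training. The proposed algorithm, Follow the Object (FO), has been evaluated on 7 MuJoCo environments requiring an increasing degree of exploration, and achieves higher success rates than alternative algorithms. In particularly challenging learning scenarios, e.g. when the object's initial and target positions are far apart, our approach can still learn a policy whereas competing methods currently fail.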
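Efficient exploration is a key challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as the correlation between actions and directedness. To this end, we introduce state-free priors, which directly model temporal consistency in demonstrated trajectories and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning, which dynamically samples actions from a probabilistic mixture of the policy and the action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse-reward settings.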
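We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments: shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progress from offline videos of demonstrations across multiple expert agents, each performing the same task differently due to embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and the goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find that our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample-efficient than current best methods. Qualitative results, code, and datasets are available at https://x-irl.github.io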
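The reward described in this abstract, the negative distance between the current state and the goal state in the learned embedding space, can be illustrated as follows. This is a minimal sketch under stated assumptions: `toy_embed` is a hypothetical fixed linear map standing in for the learned visual encoder, and the Euclidean norm is assumed as the distance.

```python
import numpy as np

# Hypothetical stand-in for a learned encoder: a fixed linear projection.
_W = np.array([[1.0, 0.0],
               [0.0, 2.0]])


def toy_embed(obs):
    """Map an observation vector to an embedding (illustration only)."""
    return _W @ np.asarray(obs, dtype=float)


def embedding_distance_reward(embed, obs, goal_obs):
    """Reward = negative Euclidean distance to the goal in embedding space."""
    return -float(np.linalg.norm(embed(obs) - embed(goal_obs)))
```

With this shape of reward, an observation whose embedding is closer to the goal embedding always scores higher, and the maximum value of zero is reached when the two embeddings coincide.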
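Learning complex manipulation tasks is a challenging problem because exploration is hard in the presence of obstacles and high-dimensional visual observations. Prior work tackles the exploration problem by integrating motion planning and reinforcement learning. However, the motion-planner-augmented policy requires access to state information, which is often unavailable in real-world settings. To this end, we propose to distill a state-based motion-planner-augmented policy into a visual control policy via (1) visual behavioral cloning, to remove the dependency on the motion planner along with its jittery motion, and (2) vision-based reinforcement learning guided by the smoothed trajectories from the behavioral cloning agent. We evaluate our method on three manipulation tasks in obstructed environments and compare it against various reinforcement learning and imitation learning baselines. The results demonstrate that our framework is highly sample-efficient and outperforms state-of-the-art algorithms. Moreover, coupled with domain randomization, our policy is capable of zero-shot transfer to unseen environment settings with distractors. Code and videos are available at https://clvrai.com/mopa-pd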
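Long-horizon, everyday tasks comprising a sequence of implicit subtasks still pose a major challenge in offline robot control. While a number of prior methods have aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As both paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both methods to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning with a high-level policy, learned through offline reinforcement learning, for chaining the latent behavior priors. Experiments on various simulated and real robot control tasks show that our formulation makes it possible to produce previously unseen combinations of skills that reach temporally extended goals by "stitching" together latent skills through goal chaining, with an order-of-magnitude improvement in performance over state-of-the-art baselines. We even learn a single multi-task visuomotor policy for 25 distinct manipulation tasks in the real world, which outperforms both imitation learning and offline reinforcement learning techniques.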
Deep Reinforcement Learning has been successfully applied to learn robotic control. However, the corresponding algorithms struggle when applied to problems where the agent is only rewarded after achieving a complex task. In this context, using demonstrations can significantly speed up the learning process, but demonstrations can be costly to acquire. In this paper, we propose to leverage a sequential bias to learn control policies for complex robotic tasks using a single demonstration. To do so, our method learns a goal-conditioned policy to control a system between successive low-dimensional goals. This sequential goal-reaching approach raises a problem of compatibility between successive goals: we need to ensure that the state resulting from reaching a goal is compatible with the achievement of the following goals. To tackle this problem, we present a new algorithm called DCIL-II. We show that DCIL-II can solve with unprecedented sample efficiency some challenging simulated tasks such as humanoid locomotion and stand-up as well as fast running with a simulated Cassie robot. Our method leveraging sequentiality is a step towards the resolution of complex robotic tasks under minimal specification effort, a key feature for the next generation of autonomous robots.
Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents challenges in teleoperation or kinesthetic teaching which allow a robot to not only learn a task efficiently but also to autonomously practice. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.
Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. Additionally, this particular formulation allows for the reusability of expert data between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert sample efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and also visualize the differences between the learned models from AIL and LfGP.
Skill-based reinforcement learning (RL) has emerged as a promising strategy to leverage prior knowledge for accelerated robot learning. Skills are typically extracted from expert demonstrations and are embedded into a latent space from which they can be sampled as actions by a high-level RL agent. However, this skill space is expansive, and not all skills are relevant for a given robot state, making exploration difficult. Furthermore, the downstream RL agent is limited to learning structurally similar tasks to those used to construct the skill space. We firstly propose accelerating exploration in the skill space using state-conditioned generative models to directly bias the high-level agent towards only sampling skills relevant to a given state based on prior experience. Next, we propose a low-level residual policy for fine-grained skill adaptation enabling downstream RL agents to adapt to unseen task variations. Finally, we validate our approach across four challenging manipulation tasks that differ from those used to build the skill space, demonstrating our ability to learn across task variations while significantly accelerating exploration, outperforming prior works. Code and videos are available on our project website: https://krishanrana.github.io/reskill.
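Imitation learning holds tremendous promise for learning policies efficiently in complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), in which, given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interaction for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World benchmark show that ROT reaches 90% of expert performance faster on average than prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and one hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.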
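Reward function specification, which requires considerable human effort and iteration, remains a major impediment to learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting in which an agent is provided with a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. To address these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, is more stable than prior work, and also achieves higher asymptotic performance. We further find that, by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results can be found online at https://sites.google.com/view/variational-mail.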
Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.
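The core idea of hindsight relabeling can be sketched as follows: a failed episode is replayed as if the state it actually reached had been the goal, so the sparse binary reward becomes informative. This is a minimal sketch, not the paper's implementation. It assumes transitions stored as dicts, an `achieved_goal` field holding the goal-space state reached after each action, a 0 / -1 reward convention, a distance tolerance `tol`, and relabeling with the final achieved state only; all names are hypothetical.

```python
import numpy as np


def sparse_reward(achieved_goal, goal, tol=0.05):
    """Binary reward (assumed convention): 0.0 on success, -1.0 otherwise."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) <= tol else -1.0


def relabel_with_final_goal(episode):
    """Return a relabeled copy of the episode whose goal is the state
    achieved at the end of the episode, with rewards recomputed."""
    new_goal = episode[-1]["achieved_goal"]
    return [
        {
            "obs": t["obs"],
            "action": t["action"],
            "achieved_goal": t["achieved_goal"],
            "goal": new_goal,
            "reward": sparse_reward(t["achieved_goal"], new_goal),
        }
        for t in episode
    ]
```

Both the original and the relabeled transitions would typically be stored in the replay buffer of the off-policy learner, so the original episode is left unmodified here.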
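We present a novel Learning from Demonstration (LfD) method, Deformable Manipulation from Demonstrations (DMfD), to solve deformable manipulation tasks using states or images as inputs, given expert demonstrations. Our method uses demonstrations in three different ways, and balances the trade-off between exploring the environment online and using guidance from experts to explore high-dimensional spaces effectively. We test DMfD on a set of representative manipulation tasks for a 1-dimensional rope and a 2-dimensional cloth from the SoftGym suite of tasks, each with state and image observations. Our method exceeds baseline performance by up to 12.9% on state-based tasks and by up to 33.44% on image-based tasks, with comparable or better robustness to randomness. Additionally, we create two challenging environments for folding a 2D cloth using image-based observations, and set a performance benchmark for them. We deploy DMfD on a real robot with a minimal loss in normalized performance during real-world execution compared to simulation (about 6%). Source code is available at github.com/uscresl/dmfd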
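Meta-reinforcement learning (meta-RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond those of online meta-RL or standard offline RL. Meta-RL learns an exploration strategy that collects data for adaptation, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed offline dataset, it may behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distribution shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. Because reward labels are not required for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks, and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.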