Long-horizon robot learning tasks with sparse rewards pose a significant challenge for current reinforcement learning algorithms. A key feature that enables humans to learn challenging control tasks is that they often receive expert intervention, allowing them to understand the high-level structure of a task before mastering low-level control actions. We propose a framework for leveraging expert intervention to solve long-horizon reinforcement learning tasks. We consider option templates, which are specifications encoding a potential option that can be trained using reinforcement learning. We formulate expert intervention as allowing the agent to execute option templates before learning an implementation. This enables the agent to use an option before committing costly resources to learning it. We evaluate our approach on three challenging reinforcement learning problems, showing that it outperforms state-of-the-art methods. Videos of trained agents and our code can be found at: https://sites.google.com/view/stickymittens
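To make the option-template idea concrete, here is a minimal sketch of what such a specification could look like: an option with an initiation condition, a success condition, and an expert intervention that can execute it before a learned implementation exists. The names and structure are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an option template: the agent can "use" the option via
# expert intervention before it has committed resources to learning it.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class OptionTemplate:
    """Specification of a potential option: when it can start, when it has
    succeeded, and (optionally) an expert intervention that executes it."""
    name: str
    initiation: Callable[[object], bool]       # can the option start in this state?
    success: Callable[[object], bool]          # has the option achieved its subgoal?
    expert_execute: Optional[Callable[[object], object]] = None  # expert executes the template
    learned_policy: Optional[Callable[[object], object]] = None  # filled in later by RL

    def execute(self, state):
        """Run the learned implementation if available, otherwise fall back to
        the expert intervention so the agent can already exploit the option
        while it learns the task's high-level structure."""
        if self.learned_policy is not None:
            return self.learned_policy(state)
        if self.expert_execute is not None:
            return self.expert_execute(state)
        raise RuntimeError(f"Option '{self.name}' has neither a learned nor an expert implementation.")
```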
Realistic manipulation tasks require a robot to interact with an environment through long sequences of motor actions. Although deep reinforcement learning has recently emerged as a promising paradigm for automating manipulation behaviors, it usually falls short on long-horizon tasks because of the exploration burden. This work introduces Manipulation Primitive-augmented reinforcement Learning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that selects among the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Videos and code are available at https://ut-austin-rpl.github.io/maple
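A rough sketch of the two-headed decision described above, picking a primitive and then grounding it with continuous parameters, is given below; the primitive names, parameter arities, and `env.execute_primitive` interface are assumptions for exposition rather than the MAPLE codebase.

```python
# Illustrative primitive-augmented action space: one discrete choice of primitive,
# plus continuous parameters that instantiate its execution.
import numpy as np

PRIMITIVES = {
    0: ("grasp", 4),   # (primitive name, assumed parameter dimension)
    1: ("push",  5),
    2: ("reach", 3),
}

def select_primitive(obs, task_policy):
    """High-level head: choose which primitive to run in the current state."""
    logits = task_policy(obs)                      # shape: (num_primitives,)
    return int(np.argmax(logits))

def instantiate_primitive(obs, prim_id, param_policy):
    """Low-level head: produce the continuous parameters (e.g. a target pose)
    that ground the chosen primitive in the current scene."""
    name, dim = PRIMITIVES[prim_id]
    params = param_policy(obs, prim_id)[:dim]      # trim to this primitive's arity
    return name, params

def rollout_step(env, obs, task_policy, param_policy):
    prim_id = select_primitive(obs, task_policy)
    name, params = instantiate_primitive(obs, prim_id, param_policy)
    # Each primitive is a temporally extended, closed-loop controller, so one
    # high-level decision advances the environment by many time steps.
    return env.execute_primitive(name, params)
```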
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
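One common way demonstrations are folded into an off-policy actor update of this kind is a separate demonstration replay buffer plus an auxiliary behavior-cloning loss with a Q-filter; the sketch below shows that recipe under assumed loss weights and shapes, and is not the paper's exact implementation.

```python
# Hedged sketch: DDPG-style actor loss augmented with behavior cloning on
# demonstration transitions, gated by a Q-filter so suboptimal demos do not
# cap the final policy's performance.
import torch

def actor_loss(actor, critic, rl_batch, demo_batch, bc_weight=1.0):
    # Standard deterministic policy gradient objective on regular replay data.
    rl_obs = rl_batch["obs"]
    ddpg_loss = -critic(rl_obs, actor(rl_obs)).mean()

    # Behavior cloning on demo transitions, applied only where the critic rates
    # the demonstrator's action at least as highly as the actor's own action.
    demo_obs, demo_act = demo_batch["obs"], demo_batch["act"]
    pred_act = actor(demo_obs)
    better = (critic(demo_obs, demo_act) >= critic(demo_obs, pred_act)).float().squeeze(-1)
    bc_loss = (better * (pred_act - demo_act).pow(2).sum(dim=-1)).mean()

    return ddpg_loss + bc_weight * bc_loss
```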
We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to use them to solve sparse-reward tasks. Unlike current hierarchical RL approaches, which tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions about or little knowledge of the task structure. These options are learned through an intrinsic entropy-minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO is more sample-efficient than regular RL baselines and two state-of-the-art hierarchical RL methods.
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. The primary difficulty arises due to insufficient exploration, resulting in an agent being unable to learn robust value functions. Intrinsically motivated agents can explore new behavior for its own sake rather than to directly solve problems. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning. A top-level value function learns a policy over intrinsic goals, and a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse, delayed feedback: (1) a complex discrete stochastic decision process, and (2) the classic ATARI game 'Montezuma's Revenge'.
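The two temporal scales described above can be made concrete with a simple nested control loop: a meta-controller picks an intrinsic goal, and a controller acts on atomic actions with an intrinsic reward until that goal is reached. The interfaces and reward bookkeeping below are assumptions for exposition, not the h-DQN code.

```python
# Illustrative two-level loop in the spirit of h-DQN.
def run_episode(env, meta_controller, controller, goal_reached, max_steps=1000):
    obs = env.reset()
    total_extrinsic, steps = 0.0, 0
    while steps < max_steps:
        goal = meta_controller.select_goal(obs)        # e.g. an entity/relation goal such as "reach the key"
        start_obs, meta_reward, done = obs, 0.0, False
        while not (done or goal_reached(obs, goal) or steps >= max_steps):
            action = controller.select_action(obs, goal)   # policy over atomic actions, conditioned on the goal
            next_obs, extrinsic_r, done, _ = env.step(action)
            intrinsic_r = 1.0 if goal_reached(next_obs, goal) else 0.0
            controller.store(obs, goal, action, intrinsic_r, next_obs, done)
            obs = next_obs
            meta_reward += extrinsic_r
            total_extrinsic += extrinsic_r
            steps += 1
        # The meta-controller learns from the environment reward accumulated
        # while pursuing the chosen goal, at a coarser temporal scale.
        meta_controller.store(start_obs, goal, meta_reward, obs, done)
        if done:
            break
    return total_extrinsic
```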
Although deep reinforcement learning (DRL) has become popular in many disciplines including robotics, state-of-the-art DRL algorithms still struggle to learn long-horizon, multi-step, and sparse-reward tasks, such as stacking several blocks given only a task-completion reward signal. To improve learning efficiency on such tasks, this paper proposes a DRL exploration technique called A^2, which integrates two components inspired by human experience: abstract demonstrations and adaptive exploration. A^2 first decomposes a complex task into subtasks and then provides the correct order of subtasks to learn. During training, the agent explores the environment adaptively, acting more deterministically on well-learned subtasks and more stochastically on poorly-learned ones. Ablation and comparison experiments are conducted on several grid-world tasks and three robotic manipulation tasks. We demonstrate that A^2 helps popular DRL algorithms (DQN, DDPG, and SAC) learn more efficiently and stably in these environments.
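A small sketch of the adaptive-exploration component is given below: exploration noise is scaled down on subtasks the agent already solves reliably and kept high elsewhere. The success-rate bookkeeping and the noise schedule are assumptions, not the paper's exact scheme.

```python
# Hypothetical per-subtask exploration schedule for a continuous-action policy.
import numpy as np

class AdaptiveExploration:
    def __init__(self, num_subtasks, min_noise=0.05, max_noise=0.5):
        self.success = np.zeros(num_subtasks)   # running success-rate estimate per subtask
        self.min_noise, self.max_noise = min_noise, max_noise

    def update(self, subtask_id, succeeded, momentum=0.99):
        self.success[subtask_id] = (momentum * self.success[subtask_id]
                                    + (1.0 - momentum) * float(succeeded))

    def noise_scale(self, subtask_id):
        # High success -> low noise (near-deterministic); low success -> high noise.
        return self.max_noise - (self.max_noise - self.min_noise) * self.success[subtask_id]

    def act(self, policy, obs, subtask_id):
        action = policy(obs, subtask_id)
        return action + np.random.normal(0.0, self.noise_scale(subtask_id), size=action.shape)
```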
State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. However, these methods all assume that agents perform synchronized, primitive-action execution, so they cannot truly scale to long-horizon, real-world multi-agent/robot tasks that inherently require agents/robots to reason asynchronously about high-level action selection at different times. The macro-action decentralized partially observable Markov decision process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a set of value-based RL approaches for MacDec-POMDPs, in which agents are allowed to perform asynchronous learning and decision-making with macro-action value functions under three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the same three training paradigms, allowing agents to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches on large multi-agent problems and validate the effectiveness of our algorithms in learning high-quality, asynchronous solutions with macro-actions.
Solving complex problems with reinforcement learning requires decomposing the problem into manageable tasks, either explicitly or implicitly, and learning policies to solve those tasks. These policies, in turn, must be controlled by an overarching policy that makes higher-level decisions, and the training algorithm must account for this hierarchical decision structure when learning them. In practice, however, training can generalize poorly, either executing actions for too few time steps or collapsing everything into a single policy. In our work, we introduce an alternative approach that learns such skills sequentially without an overarching hierarchical policy. We propose this method in the context of environments in which a major component of the learning agent's objective is to prolong the episode for as long as possible. We refer to our proposed method as Sequential Option Critic. We demonstrate the utility of our approach on navigation and goal-based tasks in a flexible simulated 3D navigation environment that we developed. We also show that our method outperforms prior approaches such as Soft Actor-Critic and Soft Option-Critic in our environment, as well as in the Gym self-driving car simulator and the Atari River Raid environment.
In this paper, we propose a new method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets and defining subtasks for performing transitions between the partitions. We formulate the partitioning of the state space as an optimization problem that can be solved using gradient descent given a set of sampled trajectories, making our method suitable for high-dimensional problems with large state spaces. We empirically validate the method by showing that it can successfully learn a useful hierarchical representation in a navigation domain. Once learned, the hierarchical representation can be used to solve different tasks in the given domain, thereby generalizing knowledge across tasks.
Communicating useful background knowledge to reinforcement learning (RL) agents is an important way to accelerate learning. We introduce RLang, a domain-specific language (DSL) for communicating domain knowledge to an RL agent. Unlike existing DSLs proposed by the RL community, which ground to a single element of a decision-making formalism (e.g., the reward function or the policy), RLang can specify information about every element of a Markov decision process. We define precise syntax and grounding semantics for RLang, and provide a parser implementation that grounds RLang programs to an algorithm-agnostic partial world model and policy that can be exploited by an RL agent. We provide a series of example RLang programs and demonstrate how different RL methods can exploit the resulting knowledge, including model-free and model-based tabular algorithms, hierarchical approaches, and deep RL algorithms (including both policy gradient and value-based methods).
Dealing with sparse rewards is one of the biggest challenges in Reinforcement Learning (RL). We present a novel technique called Hindsight Experience Replay which allows sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering. It can be combined with an arbitrary off-policy RL algorithm and may be seen as a form of implicit curriculum. We demonstrate our approach on the task of manipulating objects with a robotic arm. In particular, we run experiments on three different tasks: pushing, sliding, and pick-and-place, in each case using only binary rewards indicating whether or not the task is completed. Our ablation studies show that Hindsight Experience Replay is a crucial ingredient which makes training possible in these challenging environments. We show that our policies trained on a physics simulation can be deployed on a physical robot and successfully complete the task. The video presenting our experiments is available at https://goo.gl/SMrQnI.
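The core relabeling step can be summarized in a few lines: replay each transition both with its original goal and with goals that were actually achieved later in the same episode, recomputing the binary reward accordingly. The sketch below uses the common "future" strategy and the usual 0/-1 reward convention; buffer layout and thresholds are assumptions.

```python
# Minimal sketch of hindsight goal relabeling ("future" strategy).
import random
import numpy as np

def her_relabel(episode, compute_reward, k=4):
    """episode: list of (obs, action, reward, next_obs, goal, achieved_goal) tuples,
    where achieved_goal is the goal actually achieved after the transition.
    Returns the original transitions plus k relabeled copies per step."""
    relabeled = []
    for t, (obs, act, rew, next_obs, goal, achieved) in enumerate(episode):
        relabeled.append((obs, act, rew, next_obs, goal))
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)   # sample a later time step
            new_goal = episode[future][5]                  # a goal that was actually achieved
            new_rew = compute_reward(achieved, new_goal)   # recompute reward w.r.t. the new goal
            relabeled.append((obs, act, new_rew, next_obs, new_goal))
    return relabeled

def sparse_reward(achieved_goal, goal, eps=0.05):
    # 0 on success, -1 otherwise: the usual binary convention.
    return 0.0 if np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal)) < eps else -1.0
```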
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and the achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
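The favored configuration, entropy added at the high level, amounts to using a soft target for the high-level critic. The fragment below sketches one plausible form of that target under a SAC-style parameterization of the subgoal policy; the coefficient alpha and the exact target shape are assumptions, not SHIRO's published hyperparameters.

```python
# Sketch of an entropy-regularized (soft) high-level critic target in a
# two-level hierarchy, assuming a reparameterized Gaussian subgoal policy.
import torch

def high_level_target(reward_sum, next_goal_dist, q1_target, q2_target, next_obs,
                      gamma=0.99, alpha=0.1):
    """reward_sum: environment return collected while executing the current subgoal.
    next_goal_dist: distribution over the next subgoal from the high-level policy."""
    next_goal = next_goal_dist.rsample()                              # reparameterized subgoal sample
    log_prob = next_goal_dist.log_prob(next_goal).sum(-1, keepdim=True)
    next_q = torch.min(q1_target(next_obs, next_goal), q2_target(next_obs, next_goal))
    # Soft value: clipped double-Q estimate minus the entropy temperature term.
    return reward_sum + gamma * (next_q - alpha * log_prob)
```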
Learning robot manipulation skills with reinforcement learning (RL) typically requires the design of reward functions. Recent advances in this area have shown that using sparse rewards, i.e., rewarding the agent only when the task is successfully completed, can lead to better policies. However, exploring the state-action space is more difficult in this setting. Recent approaches to RL with sparse rewards rely on high-quality human demonstrations of the task, but these can be costly, time-consuming, or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robot manipulation task can be viewed as a locomotion task from the perspective of the manipulated object, i.e., the object can learn how to reach the goal state on its own. To exploit this idea, we introduce a framework that initially obtains an object motion policy in a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn a robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity and achieves higher success rates and faster learning than alternative algorithms. SLDRs are especially beneficial for tasks such as multi-object stacking and non-rigid object manipulation.
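One plausible form of such an auxiliary reward is a tracking term that rewards the manipulation policy for keeping the object close to where the object-only motion policy would have taken it by the same time step. The exponential tracking-error shape below is an assumption for illustration, not the paper's exact reward.

```python
# Hypothetical dense auxiliary reward derived from a simulated object-motion demonstration.
import numpy as np

def sldr_reward(object_state, demo_trajectory, t, scale=5.0):
    """object_state: current pose of the manipulated object.
    demo_trajectory: sequence of object poses produced by the object-motion
    policy in the physics simulator. Returns a dense reward in (0, 1]."""
    ref = demo_trajectory[min(t, len(demo_trajectory) - 1)]
    return float(np.exp(-scale * np.linalg.norm(object_state - ref)))
```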
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.
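The "explore with existing skills, learn the final policy from raw experience" split can be sketched as two separate loops: skill-driven data collection that fills a replay buffer with primitive transitions, and an ordinary off-policy update on that buffer. The skill interface and learner below are placeholders, not the paper's implementation.

```python
# Illustrative split between skill-based exploration and flat policy learning.
import random

def collect_with_skills(env, skills, replay_buffer, episodes=10, skill_len=20):
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            skill = random.choice(skills)          # a pretrained, temporally extended skill
            for _ in range(skill_len):
                action = skill(obs)
                next_obs, reward, done, _ = env.step(action)
                # Store raw (s, a, r, s') transitions: the final policy is trained on
                # primitive actions, so it is not constrained by the skill set.
                replay_buffer.add(obs, action, reward, next_obs, done)
                obs = next_obs
                if done:
                    break

def train_final_policy(agent, replay_buffer, updates=1000, batch_size=256):
    for _ in range(updates):
        agent.update(replay_buffer.sample(batch_size))   # any off-policy RL update
```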
When reinforcement learning is applied with sparse rewards, agents must spend a long time exploring an unknown environment without any learning signal. Abstraction is one approach that provides the agent with an intrinsic reward for transitioning in a latent space. Prior work focuses on dense continuous latent spaces, or requires the user to provide the representation manually. Our approach is the first to automatically learn a discrete abstraction of the underlying environment. Moreover, our method works on arbitrary input spaces, using an end-to-end trainable regularized successor representation model. For transitions between abstract states, we train a set of temporally extended actions in the form of options, i.e., an action abstraction. Our proposed algorithm, Discrete State-Action Abstraction (DSAA), iteratively alternates between training these options and using them to efficiently explore more of the environment in order to improve the state abstraction. As a result, our model is useful not only for transfer learning but also in the online learning setting. We empirically show that our agent is able to explore the environment and solve tasks more efficiently than baseline reinforcement learning algorithms. Our code is publicly available at https://github.com/amnonattali/dsaa.
Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust in few steps to different tasks. Experimentally, we show that our method is able to learn transferable components which accelerate learning, and that it outperforms prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as other proposed changes.
Multi-goal policy learning for robotic manipulation is challenging. Prior successes have used state-based representations of the objects or provided demonstration data to facilitate learning. In this paper, by hand-coding a high-level discrete representation of the domain, we show that policies for reaching dozens of goals can be learned with Q-learning from pixels. The agent focuses learning on simpler local policies, which are sequenced by planning in the abstract space. We compare our method against standard multi-goal RL baselines, as well as other methods that leverage the discrete representation, on a challenging block-construction domain. We find that our method can build more than a hundred different block structures and demonstrate forward transfer to structures with novel objects. Finally, we deploy the policies learned in simulation on a real robot.
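The plan-then-execute recipe above can be summarized as a short loop: map the current observation into the hand-coded abstract space, plan a sequence of abstract subgoals, and let a goal-conditioned local policy pursue each subgoal in turn. The planner, abstraction mapping, and policy interfaces are assumptions for exposition.

```python
# Illustrative loop: planning over a discrete abstraction, acting with local policies.
def reach_goal(env, obs, abstract_goal, abstraction, planner, local_policy, max_low_steps=50):
    """abstraction(obs) maps pixels to a discrete abstract state; planner returns a
    sequence of abstract subgoals; local_policy is a goal-conditioned policy
    learned from pixels (e.g. with Q-learning)."""
    plan = planner(abstraction(obs), abstract_goal)       # e.g. shortest path over abstract transitions
    for subgoal in plan:
        for _ in range(max_low_steps):
            action = local_policy(obs, subgoal)
            obs, _, done, _ = env.step(action)
            if abstraction(obs) == subgoal:
                break                                     # subgoal reached, move on to the next one
            if done:
                return obs, False
    return obs, abstraction(obs) == abstract_goal
```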
This paper details our submission to Phase 1 of the 2021 Real Robot Challenge, in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach that requires minimal expert knowledge of the robotic system or of robotic grasping. A sparse, goal-based reward is employed together with Hindsight Experience Replay to teach the policy to control the cube and move it to the goal's x and y coordinates. At the same time, a dense, distance-based reward is employed to teach the policy to lift the cube to the goal's z coordinate (the height component). The policy is trained in simulation with domain randomization before being transferred to the real robot for evaluation. Although performance tends to degrade after this transfer, our best policy can successfully lift the real cube along goal trajectories using an effective pincer grasp. Our approach outperforms all other submissions, including those using more traditional robotic control techniques, and is the first pure-learning method to solve this challenge.
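The reward decomposition described above, sparse feedback in the horizontal plane (compatible with hindsight relabeling) plus dense shaping on height, could be written as follows; the thresholds and scales are assumptions rather than the submission's tuned values.

```python
# Sketch of a mixed sparse/dense reward for carrying a cube to a goal pose.
import numpy as np

def xy_sparse_reward(cube_pos, goal_pos, threshold=0.02):
    # 0 if the cube is within `threshold` of the goal in the horizontal plane, else -1.
    return 0.0 if np.linalg.norm(cube_pos[:2] - goal_pos[:2]) < threshold else -1.0

def z_dense_reward(cube_pos, goal_pos, scale=10.0):
    # Dense shaping that grows as the cube's height approaches the goal height.
    return float(np.exp(-scale * abs(cube_pos[2] - goal_pos[2])))

def total_reward(cube_pos, goal_pos):
    return xy_sparse_reward(cube_pos, goal_pos) + z_dense_reward(cube_pos, goal_pos)
```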
We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and compose them into a new action space that supports efficient learning in long-horizon tasks. We first propose novel learning objectives, trajectory-centric diversity and smoothness, that allow an agent to acquire reusable parameterized skills. Our agent can then use these learned skills to construct a temporally extended, parameterized-action Markov decision process, for which we propose a hierarchical actor-critic algorithm designed to efficiently learn a high-level control policy with the learned skills. We empirically demonstrate that the proposed algorithm enables an agent to solve complex long-horizon obstacle-course environments.