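The ability to discover useful behaviors from past experience and transfer them to new tasks is considered a core component of natural embodied intelligence. Inspired by neuroscience, behaviors that switch at bottleneck states have long been sought after, because they induce plans of minimum description length across tasks. Prior approaches either support only online, on-policy bottleneck-state discovery, which limits sample efficiency, or only discrete state-action domains, which restricts applicability. To address this, we introduce Model-Based Offline Options (MO2), an offline hindsight framework that supports sample-efficient discovery of bottleneck options over continuous state-action spaces. Once bottleneck options have been learned offline on the source domains, they are transferred online to improve exploration and value estimation on the transfer domain. Our experiments show that on complex long-horizon continuous control tasks with sparse, delayed rewards, the properties of MO2 are essential and lead to performance exceeding that of recent option-learning methods. Additional ablations further demonstrate the impact on option predictability and credit assignment.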
translated by 谷歌翻译
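We propose a hierarchical reinforcement learning method, Hidio, that learns task-agnostic options in a self-supervised manner while jointly learning to use them to solve sparse-reward tasks. Unlike current hierarchical RL approaches, which tend to formulate goal-reaching low-level tasks or to pre-define ad hoc low-level policies, Hidio encourages the learning of lower-level options that are independent of the task at hand, requiring few assumptions about, and little knowledge of, the task structure. These options are learned through an intrinsic entropy-minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, Hidio is more sample-efficient than regular RL baselines and two state-of-the-art hierarchical RL methods.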
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
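Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by using a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is impractical, much like a human planning every muscle movement. Instead, humans plan efficiently with high-level skills to solve complex tasks. Building on this intuition, we propose a skill-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts skill outcomes rather than predicting every small detail of the intermediate states step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then use the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient learning of long-horizon, sparse-reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency of both model-based RL and skill-based RL. Code and videos are available at \url{https://clvrai.com/skimo}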
Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in a few steps. Experimentally, we show that our method is able to learn transferable components which accelerate learning and that it performs better than existing methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as other proposed changes.
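For robots operating in the real world, it is desirable to learn reusable behaviors that can be effectively transferred and adapted to many tasks and scenarios. We propose an approach that learns abstract motor skills from data using a hierarchical mixture latent variable model. In contrast to existing work, our method exploits a three-level hierarchy of both discrete and continuous latent variables to capture a set of high-level behaviors while allowing for variance in how they are executed. We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviors while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred and fine-tuned on new tasks, on unseen objects, and from state-based to vision-based policies, yielding better sample efficiency and asymptotic performance than existing skill-based and imitation-based methods. We further analyze how and when the skills are most beneficial: they encourage directed exploration that covers large regions of the task-relevant state space, which makes them most effective in challenging sparse-reward settings.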
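The reinforcement learning (RL) research area is very active, with a large number of new contributions, especially in the emerging field of deep RL (DRL). However, a number of scientific and technical challenges still need to be resolved, among them the ability to abstract actions and the difficulty of exploring the environment in sparse-reward settings, both of which can be addressed by intrinsic motivation (IM). We propose to survey these research works through a new taxonomy based on information theory: we computationally revisit the notions of surprise, novelty, and skill learning. This allows us to identify the advantages and disadvantages of the methods and to present current research outlooks. Our analysis suggests that novelty and surprise can help build a hierarchy of transferable skills that further abstracts the environment and makes the exploration process more robust.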
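Efficient exploration is a key challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data to accelerate reinforcement learning on complex tasks. However, if the task at hand deviates too much from the demonstrated task, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as the correlation between actions and directedness. To this end, we introduce state-free priors, which directly model temporal consistency in demonstrated trajectories and are capable of driving exploration in complex tasks even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning that dynamically samples actions from a probabilistic mixture of the policy and the action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous control tasks under sparse-reward settings.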
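As a rough illustration of sampling actions from a probabilistic mixture of a policy and an action prior, the sketch below draws each action from the prior with probability `prior_weight` and from the policy otherwise. The function name, the callable interfaces, and the fixed `prior_weight` are assumptions made for illustration; the abstract does not specify how the mixing weight is adapted over time.

```python
import random
from typing import Callable, Tuple


def sample_mixture_action(
    policy_sample: Callable[[], float],
    prior_sample: Callable[[], float],
    prior_weight: float,
    rng: random.Random,
) -> Tuple[float, str]:
    """Draw one action from a two-component mixture.

    With probability `prior_weight` the action comes from the action prior,
    otherwise from the learned policy. All names here are hypothetical.
    """
    if not 0.0 <= prior_weight <= 1.0:
        raise ValueError("prior_weight must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so weight 1.0 always picks the prior
    # and weight 0.0 always picks the policy.
    if rng.random() < prior_weight:
        return prior_sample(), "prior"
    return policy_sample(), "policy"


if __name__ == "__main__":
    rng = random.Random(0)
    action, source = sample_mixture_action(
        policy_sample=lambda: rng.gauss(0.0, 1.0),  # stand-in for a policy
        prior_sample=lambda: rng.gauss(1.0, 0.1),   # stand-in for an action prior
        prior_weight=0.3,
        rng=rng,
    )
    print(source, action)
```

In an off-policy learner, a schedule that lowers `prior_weight` as training progresses would be one plausible way to make the mixture dynamic, though that choice is not taken from the source.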
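In this paper, we present a novel method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets and defining subtasks for performing transitions between the partitions. We formulate the problem of partitioning the state space as an optimization problem that can be solved using gradient descent given a set of sampled trajectories, which makes our method suitable for high-dimensional problems with large state spaces. We empirically validate the method by showing that it can successfully learn a useful hierarchical representation in a navigation domain. Once learned, the hierarchical representation can be used to solve different tasks in the given domain, thus generalizing knowledge across tasks.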
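We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that, by carefully composing the GHMs of the base policies and without any additional learning, we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability. We can then apply generalized policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods and showing how to train these models stably in deep RL settings.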
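For readers unfamiliar with GPI, a minimal sketch of the standard (not GHM-based) action-selection rule may help: given action-value estimates for several base policies in the current state, GPI acts greedily with respect to their pointwise maximum. The function name and the nested-list layout of `q_values` are assumptions for illustration; the composition of GHMs described in the abstract is not reproduced here.

```python
from typing import Sequence


def gpi_action(q_values: Sequence[Sequence[float]]) -> int:
    """Standard generalized policy improvement in a single state.

    q_values[i][a] is the estimated return of taking action a now and
    following base policy i afterwards. The returned action maximizes
    max_i q_values[i][a]. Ties resolve to the lowest action index.
    """
    if not q_values or not q_values[0]:
        raise ValueError("need at least one policy and one action")
    n_actions = len(q_values[0])
    # Best achievable value per action across all base policies.
    best_per_action = [max(q[a] for q in q_values) for a in range(n_actions)]
    return max(range(n_actions), key=best_per_action.__getitem__)


if __name__ == "__main__":
    q = [
        [1.0, 0.0, 0.2],  # base policy 0
        [0.1, 0.5, 2.0],  # base policy 1
    ]
    print(gpi_action(q))
```

Acting on the maximum over base policies is what guarantees that the resulting policy is at least as good as each base policy when the value estimates are exact.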
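Problems that require both long-horizon planning and continuous control capabilities pose significant challenges to existing reinforcement learning agents. In this paper, we introduce a novel hierarchical reinforcement learning agent that links temporally extended skills for continuous control with a forward model in a symbolic, discrete abstraction of the environment for planning. We name our agent after its symbolic-effect-aware diverse skills. We formulate an objective and a corresponding algorithm that lead to the unsupervised learning of a diverse set of skills through intrinsic motivation, given a known state abstraction. The skills are learned jointly with the symbolic forward model, which captures the effect of skill execution in the state abstraction. After training, we can use the skills as symbolic actions with the forward model for long-horizon planning, and subsequently execute the plan using the learned continuous-action control skills. The proposed algorithm learns skills and forward models that can be used to solve complex tasks requiring both continuous control and long-horizon planning with a high success rate. It compares favorably with other flat and hierarchical reinforcement learning baseline agents and is successfully demonstrated on a real robot.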
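When reinforcement learning is applied with sparse rewards, agents must spend a long time exploring the unknown environment without any learning signal. Abstraction is one approach that provides the agent with an intrinsic reward for transitioning in a latent space. Prior work either focuses on dense continuous latent spaces or requires the user to provide the representation manually. Our approach is the first to automatically learn a discrete abstraction of the underlying environment. Moreover, our method works on arbitrary input spaces, using an end-to-end trainable regularized successor representation model. For transitions between abstract states, we train a set of temporally extended actions in the form of options, i.e., an action abstraction. Our proposed algorithm, Discrete State-Action Abstraction (DSAA), iteratively alternates between training these options and using them to efficiently explore more of the environment in order to improve the state abstraction. As a result, our model is useful not only for transfer learning but also in the online learning setting. We show empirically that our agent is able to explore the environment and solve tasks more efficiently than baseline reinforcement learning algorithms. Our code is publicly available at \url{https://github.com/amnonattali/dsaa}.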
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
Identifying statistical regularities in solutions to some tasks in multi-task reinforcement learning can accelerate the learning of new tasks. Skill learning offers one way of identifying these regularities by decomposing pre-collected experiences into a sequence of skills. A popular approach to skill learning is maximizing the likelihood of the pre-collected experience with latent variable models, where the latent variables represent the skills. However, there are often many solutions that maximize the likelihood equally well, including degenerate solutions. To address this underspecification, we propose a new objective that combines the maximum likelihood objective with a penalty on the description length of the skills. This penalty incentivizes the skills to maximally extract common structures from the experiences. Empirically, our objective learns skills that solve downstream tasks in fewer samples compared to skills learned from only maximizing likelihood. Further, while most prior works in the offline multi-task setting focus on tasks with low-dimensional observations, our objective can scale to challenging tasks with high-dimensional image observations.
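Two common approaches to sequential decision-making are AI planning (AIP) and reinforcement learning (RL). Each has strengths and weaknesses. AIP is interpretable, easy to integrate with symbolic knowledge, and often efficient, but it requires an up-front logical domain specification and is sensitive to noise; RL requires only a specification of rewards and is robust to noise, but it is inefficient and not easily supplied with external knowledge. We propose an integrative approach that combines high-level planning with RL, retaining interpretability, transfer, and efficiency while allowing robust learning of the lower-level plan actions. Our approach defines options in hierarchical reinforcement learning (HRL) from AIP operators by establishing a correspondence between the state transition model of an AI planning problem and the abstract state transition system of a Markov decision process (MDP). Options are learned by adding intrinsic rewards that encourage consistency between the MDP and AIP transition models. We demonstrate the benefit of our integrated approach by comparing the performance of RL and HRL algorithms in the MiniGrid and N-rooms environments, showing the advantage of our method over existing ones.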
Skill-based reinforcement learning (RL) has emerged as a promising strategy to leverage prior knowledge for accelerated robot learning. Skills are typically extracted from expert demonstrations and are embedded into a latent space from which they can be sampled as actions by a high-level RL agent. However, this skill space is expansive, and not all skills are relevant for a given robot state, making exploration difficult. Furthermore, the downstream RL agent is limited to learning tasks that are structurally similar to those used to construct the skill space. We first propose accelerating exploration in the skill space using state-conditioned generative models to directly bias the high-level agent towards sampling only skills relevant to a given state based on prior experience. Next, we propose a low-level residual policy for fine-grained skill adaptation, enabling downstream RL agents to adapt to unseen task variations. Finally, we validate our approach across four challenging manipulation tasks that differ from those used to build the skill space, demonstrating our ability to learn across task variations while significantly accelerating exploration, outperforming prior works. Code and videos are available on our project website: https://krishanrana.github.io/reskill.
A long-standing challenge in artificial intelligence is lifelong learning. In lifelong learning, many tasks are presented in sequence and learners must efficiently transfer knowledge between tasks while avoiding catastrophic forgetting over long lifetimes. On these problems, policy reuse and other multi-policy reinforcement learning techniques can learn many tasks. However, they can generate many temporary or permanent policies, resulting in memory issues. Consequently, there is a need for lifetime-scalable methods that continually refine a policy library of a pre-defined size. This paper presents a first approach to lifetime-scalable policy reuse. To pre-select the number of policies, a notion of task capacity, the maximal number of tasks that a policy can accurately solve, is proposed. To evaluate lifetime policy reuse using this method, two state-of-the-art single-actor base-learners are compared: 1) a value-based reinforcement learner, Deep Q-Network (DQN) or Deep Recurrent Q-Network (DRQN); and 2) an actor-critic reinforcement learner, Proximal Policy Optimisation (PPO) with or without Long Short-Term Memory layer. By selecting the number of policies based on task capacity, D(R)QN achieves near-optimal performance with 6 policies in a 27-task MDP domain and 9 policies in an 18-task POMDP domain; with fewer policies, catastrophic forgetting and negative transfer are observed. Due to slow, monotonic improvement, PPO requires fewer policies, 1 policy for the 27-task domain and 4 policies for the 18-task domain, but it learns the tasks with lower accuracy than D(R)QN. These findings validate lifetime-scalable policy reuse and suggest using D(R)QN for larger and PPO for smaller library sizes.
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
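Goal-conditioned hierarchical reinforcement learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, the high-level action space, i.e., the goal space, is large, and searching in a large goal space poses difficulties for both high-level subgoal generation and low-level policy learning. In this paper, we show that this problem can be effectively alleviated by using an adjacency constraint to restrict the high-level action space from the whole goal space to a $k$-step adjacent region of the current state. Theoretically, we prove that in deterministic Markov decision processes (MDPs) the proposed adjacency constraint preserves the optimal hierarchical policy, while in stochastic MDPs it induces a bounded state-value suboptimality determined by the transition structure of the MDP. We further show that this constraint can be implemented in practice by training an adjacency network that discriminates between adjacent and non-adjacent subgoals. Experimental results on discrete and continuous control tasks, including challenging simulated robot locomotion and manipulation tasks, show that incorporating the adjacency constraint significantly improves the performance of state-of-the-art goal-conditioned HRL methods.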
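The $k$-step adjacent region can be made concrete in a small deterministic, discrete setting: a subgoal is adjacent to the current state if it can be reached in at most $k$ transitions. The sketch below computes that region with breadth-first search over a hypothetical transition table (`transitions` and the function name are illustrative assumptions). It stands in for the ground-truth relation that a learned adjacency network would approximate, and is not the training procedure from the abstract.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set


def k_step_adjacent(
    transitions: Dict[Hashable, Iterable[Hashable]],
    start: Hashable,
    k: int,
) -> Set[Hashable]:
    """Return all states reachable from `start` in at most k transitions.

    `transitions[s]` lists the states reachable from s in one step.
    Breadth-first search visits states in order of shortest distance,
    so each state is added at its minimum step count.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k steps
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


if __name__ == "__main__":
    chain = {0: [1], 1: [2], 2: [3], 3: []}
    # Candidate subgoals for the high level, restricted to 2 steps from state 0.
    print(sorted(k_step_adjacent(chain, start=0, k=2)))
```

A high-level policy would then propose subgoals only from the returned set instead of from the whole goal space.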
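Temporal abstraction in reinforcement learning (RL) offers the promise of improving generalization and knowledge transfer in complex environments by propagating information more efficiently over time. Although option learning was initially formulated in a way that allows many options to be updated simultaneously, using off-policy, intra-option learning (Sutton, Precup & Singh, 1999), many recent hierarchical reinforcement learning approaches update only a single option at a time: the option currently executing. We revisit and extend intra-option learning in the context of deep reinforcement learning, in order to enable updating all options that are consistent with the current primitive action choice, without introducing any additional estimates. Our method can therefore be naturally adopted in most hierarchical RL frameworks. When we combine our approach with the option-critic algorithm for option discovery, we obtain significant improvements in performance and data efficiency across a wide variety of domains.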
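The idea of updating every option consistent with the executed primitive action can be sketched with the classic tabular intra-option Q-learning rule for deterministic option policies. This is an illustration of the original tabular formulation, not of the deep RL extension described in the abstract; the dictionary layout and all names are assumptions.

```python
from typing import Dict, Hashable, List

State = Hashable
Option = Hashable


def intra_option_update(
    q: Dict[State, Dict[Option, float]],
    option_policies: Dict[Option, Dict[State, int]],
    termination: Dict[Option, Dict[State, float]],
    s: State,
    a: int,
    r: float,
    s_next: State,
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> List[Option]:
    """Update Q(s, o) for every option o whose policy picks action a in s.

    Target: r + gamma * ((1 - beta) * Q(s', o) + beta * max_o' Q(s', o')),
    where beta is option o's termination probability in s'.
    Returns the list of options that were updated.
    """
    greedy_next = max(q[s_next].values())
    updated = []
    for option, policy in option_policies.items():
        if policy[s] != a:
            continue  # this option would not have taken the observed action
        beta = termination[option][s_next]
        continuation = (1.0 - beta) * q[s_next][option] + beta * greedy_next
        target = r + gamma * continuation
        q[s][option] += alpha * (target - q[s][option])
        updated.append(option)
    return updated


if __name__ == "__main__":
    q = {0: {"A": 0.0, "B": 0.0}, 1: {"A": 1.0, "B": 2.0}}
    policies = {"A": {0: 0, 1: 0}, "B": {0: 0, 1: 1}}
    betas = {"A": {0: 0.0, 1: 1.0}, "B": {0: 0.0, 1: 0.0}}
    # Both options choose action 0 in state 0, so one transition updates both.
    print(intra_option_update(q, policies, betas, s=0, a=0, r=1.0, s_next=1))
    print(q[0])
```

Because a single transition can update several options at once, experience gathered while executing one option also improves the value estimates of the others.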