在本文中,我们提出了一种新的马尔可夫决策过程学习分层表示的方法。我们的方法通过将状态空间划分为子集,并定义用于在分区之间执行转换的子任务。我们制定将状态空间作为优化问题分区的问题,该优化问题可以使用梯度下降给出一组采样的轨迹来解决,使我们的方法适用于大状态空间的高维问题。我们经验验证方法,通过表示它可以成功地在导航域中成功学习有用的分层表示。一旦了解到,分层表示可以用于解决给定域中的不同任务,从而概括跨任务的知识。
translated by 谷歌翻译
当加强学习以稀疏的奖励应用时,代理必须花费很长时间探索未知环境而没有任何学习信号。抽象是一种为代理提供在潜在空间中过渡的内在奖励的方法。先前的工作着重于密集的连续潜在空间,或要求用户手动提供表示形式。我们的方法是第一个自动学习基础环境的离散抽象的方法。此外,我们的方法使用端到端可训练的正规后继代表模型在任意输入空间上起作用。对于抽象状态之间的过渡,我们以选项的形式训练一组时间扩展的动作,即动作抽象。我们提出的算法,离散的国家行动抽象(DSAA),在训练这些选项之间进行迭代交换,并使用它们有效地探索更多环境以改善状态抽象。结果,我们的模型不仅对转移学习,而且在在线学习环境中有用。我们从经验上表明,与基线加强学习算法相比,我们的代理能够探索环境并更有效地解决任务。我们的代码可在\ url {https://github.com/amnonattali/dsaa}上公开获得。
translated by 谷歌翻译
顺序决策的两种常见方法是AI计划(AIP)和强化学习(RL)。每个都有优点和缺点。 AIP是可解释的,易于与象征知识集成,并且通常是有效的,但需要前期逻辑域的规范,并且对噪声敏感; RL仅需要奖励的规范,并且对噪声是强大的,但效率低下,不容易提供外部知识。我们提出了一种综合方法,将高级计划与RL结合在一起,保留可解释性,转移和效率,同时允许对低级计划行动进行强有力的学习。我们的方法通过在AI计划问题的状态过渡模型与Markov决策过程(MDP)的抽象状态过渡系统(MDP)之间建立对应关系,从而定义了AIP操作员的分层增强学习(HRL)的选项。通过添加内在奖励来鼓励MDP和AIP过渡模型之间的一致性来学习选项。我们通过比较Minigrid和N房间环境中RL和HRL算法的性能来证明我们的综合方法的好处,从而显示了我们方法比现有方法的优势。
translated by 谷歌翻译
本文研究了一种使用背景计划的新方法,用于基于模型的增强学习:混合(近似)动态编程更新和无模型更新,类似于DYNA体系结构。通过学习模型的背景计划通常比无模型替代方案(例如Double DQN)差,尽管前者使用了更多的内存和计算。基本问题是,学到的模型可能是不准确的,并且经常会产生无效的状态,尤其是在迭代许多步骤时。在本文中,我们通过将背景规划限制为一组(抽象)子目标并仅学习本地,子观念模型来避免这种限制。这种目标空间计划(GSP)方法更有效地是在计算上,自然地纳入了时间抽象,以进行更快的长胜压计划,并避免完全学习过渡动态。我们表明,在各种情况下,我们的GSP算法比双DQN基线要快得多。
translated by 谷歌翻译
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. The primary difficulty arises due to insufficient exploration, resulting in an agent being unable to learn robust value functions. Intrinsically motivated agents can explore new behavior for its own sake rather than to directly solve problems. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning. A top-level value function learns a policy over intrinsic goals, and a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse, delayed feedback: (1) a complex discrete stochastic decision process, and (2) the classic ATARI game 'Montezuma's Revenge'.
translated by 谷歌翻译
增强学习(RL)研究领域非常活跃,并具有重要的新贡献;特别是考虑到深RL(DRL)的新兴领域。但是,仍然需要解决许多科学和技术挑战,其中我们可以提及抽象行动的能力或在稀疏回报环境中探索环境的难以通过内在动机(IM)来解决的。我们建议通过基于信息理论的新分类法调查这些研究工作:我们在计算上重新审视了惊喜,新颖性和技能学习的概念。这使我们能够确定方法的优势和缺点,并展示当前的研究前景。我们的分析表明,新颖性和惊喜可以帮助建立可转移技能的层次结构,从而进一步抽象环境并使勘探过程更加健壮。
translated by 谷歌翻译
我们提出了一种层次结构的增强学习方法Hidio,可以以自我监督的方式学习任务不合时宜的选项,同时共同学习利用它们来解决稀疏的奖励任务。与当前倾向于制定目标的低水平任务或预定临时的低级政策不同的层次RL方法不同,Hidio鼓励下级选项学习与手头任务无关,几乎不需要假设或很少的知识任务结构。这些选项是通过基于选项子对象的固有熵最小化目标来学习的。博学的选择是多种多样的,任务不可能的。在稀疏的机器人操作和导航任务的实验中,Hidio比常规RL基准和两种最先进的层次RL方法,其样品效率更高。
translated by 谷歌翻译
Recent advances in reinforcement-learning research have demonstrated impressive results in building algorithms that can out-perform humans in complex tasks. Nevertheless, creating reinforcement-learning systems that can build abstractions of their experience to accelerate learning in new contexts still remains an active area of research. Previous work showed that reward-predictive state abstractions fulfill this goal, but have only be applied to tabular settings. Here, we provide a clustering algorithm that enables the application of such state abstractions to deep learning settings, providing compressed representations of an agent's inputs that preserve the ability to predict sequences of reward. A convergence theorem and simulations show that the resulting reward-predictive deep network maximally compresses the agent's inputs, significantly speeding up learning in high dimensional visual control tasks. Furthermore, we present different generalization experiments and analyze under which conditions a pre-trained reward-predictive representation network can be re-used without re-training to accelerate learning -- a form of systematic out-of-distribution transfer.
translated by 谷歌翻译
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
translated by 谷歌翻译
Hierarchical methods in reinforcement learning have the potential to reduce the amount of decisions that the agent needs to perform when learning new tasks. However, finding a reusable useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust in few steps to different tasks. Experimentally, we show that our method is able to learn transferable components which accelerate learning and performs better than existing prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as other proposed changes.
translated by 谷歌翻译
长期的Horizo​​n机器人学习任务稀疏的奖励对当前的强化学习算法构成了重大挑战。使人类能够学习挑战的控制任务的关键功能是,他们经常获得专家干预,使他们能够在掌握低级控制动作之前了解任务的高级结构。我们为利用专家干预来解决长马增强学习任务的框架。我们考虑\ emph {选项模板},这是编码可以使用强化学习训练的潜在选项的规格。我们将专家干预提出,因为允许代理商在学习实施之前执行选项模板。这使他们能够使用选项,然后才能为学习成本昂贵的资源学习。我们在三个具有挑战性的强化学习问题上评估了我们的方法,这表明它的表现要优于最先进的方法。训练有素的代理商和我们的代码视频可以在以下网址找到:https://sites.google.com/view/stickymittens
translated by 谷歌翻译
我们提出了世界价值函数(WVFS),这是一种面向目标的一般价值函数,它代表了如何不仅要解决给定任务,还代表代理环境中的任何其他目标任务。这是通过将代理装备内部目标空间定义为经历终端过渡的所有世界状态来实现的。然后,代理可以修改标准任务奖励以定义其自己的奖励功能,事实证明,它可以驱动其学习如何实现所有可触及的内部目标,以及在当前任务中的价值。我们在学习和计划的背景下展示了WVF的两个关键好处。特别是,给定有学习的WVF,代理可以通过简单地估计任务的奖励功能来计算新任务中的最佳策略。此外,我们表明WVF还隐式编码环境的过渡动力学,因此可以用于执行计划。实验结果表明,WVF可以比常规价值功能更快地学习,而它们的推断环境动态的能力可用于整合学习和计划方法以进一步提高样本效率。
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
我们介绍了一种改进政策改进的方法,该方法在基于价值的强化学习(RL)的贪婪方法与基于模型的RL的典型计划方法之间进行了插值。新方法建立在几何视野模型(GHM,也称为伽马模型)的概念上,该模型对给定策略的折现状态验证分布进行了建模。我们表明,我们可以通过仔细的基本策略GHM的仔细组成,而无需任何其他学习,可以评估任何非马尔科夫策略,以固定的概率在一组基本马尔可夫策略之间切换。然后,我们可以将广义政策改进(GPI)应用于此类非马尔科夫政策的收集,以获得新的马尔可夫政策,通常将其表现优于其先驱。我们对这种方法提供了彻底的理论分析,开发了转移和标准RL的应用,并在经验上证明了其对标准GPI的有效性,对充满挑战的深度RL连续控制任务。我们还提供了GHM培训方法的分析,证明了关于先前提出的方法的新型收敛结果,并显示了如何在深度RL设置中稳定训练这些模型。
translated by 谷歌翻译
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard offpolicy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
translated by 谷歌翻译
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present a clear ablation study of post-exploration in a general intrinsically motivated goal exploration process (IMGEP) framework, that the Go-Explore paper did not show. We study the isolated potential of post-exploration, by turning it on and off within the same algorithm under both tabular and deep RL settings on both discrete navigation and continuous control tasks. Experiments on a range of MiniGrid and Mujoco environments show that post-exploration indeed helps IMGEP agents reach more diverse states and boosts their performance. In short, our work suggests that RL researchers should consider to use post-exploration in IMGEP when possible since it is effective, method-agnostic and easy to implement.
translated by 谷歌翻译
Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
translated by 谷歌翻译
动物和人工代理商都受益于支持跨任务的快速学习的国家表示,使他们能够有效地遍历其环境以获得奖励状态。在固定政策下衡量预期累积,贴现国家占用的后续代表(SR),可以在否则的马尔可维亚环境中有效地转移到不同的奖励结构,并假设生物行为和神经活动的基础方面。然而,在现实世界中,奖励可能会移动或仅用于消费一次,可能只是将位置或者代理可以简单地旨在尽可能快地到达目标状态,而不会产生人工强加的任务视野的约束。在这种情况下,最具行为相关的代表将携带有关代理人可能首先达到兴趣国的信息的信息,而不是在可能的无限时间跨度访问它们的频率。为了反映此类需求,我们介绍了第一次占用代表(FR),该代表(FR),该代表(FR)衡量预期的时间折扣首次访问状态。我们证明FR有助于探索,选择有效的路径到所需状态,允许代理在某些条件下规划由一系列子板定义的可透明的最佳轨迹,并引起避免威胁刺激的动物类似的行为。
translated by 谷歌翻译