Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from the learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment's rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.
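The eigenpurpose construction lends itself to a short illustration. The sketch below is not the authors' code; it assumes a wall-free 4-connected gridworld with tabular states, computes proto-value functions as eigenvectors of the normalized graph Laplacian, and turns one of them into an intrinsic reward given by the change of the eigenvector's value along a transition.

```python
import numpy as np

def build_grid_adjacency(width, height):
    """Adjacency matrix of a 4-connected gridworld (no walls)."""
    n = width * height
    A = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            s = r * width + c
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < height and nc < width:
                    t = nr * width + nc
                    A[s, t] = A[t, s] = 1.0
    return A

def proto_value_functions(A, k):
    """First k eigenvectors of the normalized Laplacian L = D^{-1/2}(D-A)D^{-1/2}."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, :k]                  # columns are PVFs

def eigenpurpose_reward(pvf, s, s_next):
    """Intrinsic reward of one eigenpurpose: change of the PVF along a transition."""
    return pvf[s_next] - pvf[s]

A = build_grid_adjacency(5, 5)
pvfs = proto_value_functions(A, k=4)
print(eigenpurpose_reward(pvfs[:, 1], s=0, s_next=1))  # reward for moving from state 0 to state 1
```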
A key challenge in reinforcement learning is the ability to generalize knowledge in control problems. While deep learning methods have been successfully combined with model-free reinforcement learning algorithms, how to perform model-based reinforcement learning in the presence of approximation errors remains an open problem. Using successor features, a feature representation that predicts temporal constraints, this paper makes three contributions: First, it shows how learning successor features is equivalent to model-free learning. Then, it shows how successor features encode model reductions that compress the state space by creating partitions of bisimilar states. Using this representation, an agent is guaranteed to accurately predict future reward outcomes, a key property of model-based reinforcement learning algorithms. Lastly, it presents a loss objective and prediction error bounds showing that value functions and reward sequences can be accurately predicted with approximate successor features. For control problems, we illustrate how minimizing this loss objective leads to approximate bisimulations. The results presented in this paper provide a novel understanding of representations that can support both model-free and model-based reinforcement learning.
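A minimal tabular sketch of successor-feature learning as discussed above, assuming one-hot state features (so the successor features reduce to the successor representation) and a fixed policy generating the data; this illustrates the general idea rather than the paper's algorithm.

```python
import numpy as np

def td_successor_features(episodes, n_states, gamma=0.95, alpha=0.1):
    """Tabular successor features psi(s) = E[sum_t gamma^t phi(S_t) | S_0 = s]
    for a fixed policy, with one-hot features phi(s), learned by TD(0)."""
    psi = np.zeros((n_states, n_states))   # each row is psi(s)
    phi = np.eye(n_states)                 # one-hot features
    for episode in episodes:               # episode: list of (s, s_next) pairs
        for s, s_next in episode:
            target = phi[s] + gamma * psi[s_next]
            psi[s] += alpha * (target - psi[s])
    return psi

# With a learned reward weight vector w (phi(s) . w ~ reward), values follow as
#   V(s) = psi(s) . w
```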
This paper introduces an automated skill acquisition framework in reinforcement learning which involves identifying a hierarchical description of the given task in terms of abstract states and extended actions between abstract states. Identifying such structures present in the task provides ways to simplify and speed up reinforcement learning algorithms. These structures also help to generalize such algorithms over multiple tasks without relearning policies from scratch. We use ideas from dynamical systems to find metastable regions in the state space and associate them with abstract states. The spectral clustering algorithm PCCA+ is used to identify suitable abstractions aligned to the underlying structure. Skills are defined in terms of the sequence of actions that lead to transitions between such abstract states. The connectivity information from PCCA+ is used to generate these skills or options. These skills are independent of the learning task and can be efficiently reused across a variety of tasks defined over the same model. This approach works well even without the exact model of the environment by using sample trajectories to construct an approximate estimate. We also present our approach to scaling the skill acquisition framework to complex tasks with large state spaces for which we perform state aggregation using the representation learned from an action conditional video prediction network and use the skill acquisition framework on the aggregated state space.
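The abstraction step can be illustrated on a tabular model. In the sketch below, plain k-means on a spectral embedding of the estimated transition matrix stands in for PCCA+; the function names and the smoothing constant are placeholders, not the paper's implementation.

```python
import numpy as np

def estimate_transition_matrix(trajectories, n_states):
    """Row-normalized count matrix estimated from sample trajectories (lists of states)."""
    counts = np.ones((n_states, n_states)) * 1e-6   # small smoothing
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def spectral_abstract_states(P, n_abstract, n_iters=50, seed=0):
    """Cluster states into metastable abstract states using the top eigenvectors
    of P (simple k-means used as a stand-in for PCCA+)."""
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    embedding = eigvecs[:, order[:n_abstract]].real  # states embedded by slow eigenvectors
    rng = np.random.default_rng(seed)
    centers = embedding[rng.choice(len(P), n_abstract, replace=False)]
    for _ in range(n_iters):                         # plain k-means
        labels = np.argmin(((embedding[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_abstract):
            if np.any(labels == k):
                centers[k] = embedding[labels == k].mean(axis=0)
    return labels                                    # abstract state of each state
```

Skills (options) are then defined between connected abstract states, e.g., by rewarding transitions that enter a neighbouring cluster.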
Designing hierarchical reinforcement learning algorithms that incorporate a notion of safety is not only vital for safety-critical applications, but also gives a better understanding of an artificial agent's decisions. While learning options end-to-end has recently been achieved, we propose a solution for learning safe options. We introduce the notion of controllability of states, based on the temporal-difference errors in the option-critic framework. We then derive the policy gradient theorem with controllability and propose a novel framework called Safe Option-Critic. We demonstrate the effectiveness of our approach in the four-rooms gridworld, cartpole, and three games in the Arcade Learning Environment (ALE): MsPacman, Amidar and Q*Bert. Learning options end-to-end with the proposed notion of safety reduces the variance of returns and boosts performance in environments with inherent variability in the reward structure. More importantly, the proposed algorithm outperforms vanilla options in all environments and outperforms primitive actions in two of the three ALE games.
Linear options
Learning, planning, and representing knowledge in large state spaces at multiple levels of temporal abstraction are key, long-standing challenges for building flexible autonomous agents. The options framework provides a formal mechanism for specifying and learning temporally-extended skills. Although past work has demonstrated the benefit of acting according to options in continuous state spaces, one of the central advantages of temporal abstraction, the ability to plan using a temporally abstract model, remains a challenging problem when the number of environment states is large or infinite. In this work, we develop a knowledge construct, the linear option, which is capable of modeling temporally abstract dynamics in continuous state spaces. We show that planning with a linear expectation model of an option's dynamics converges to a fixed point with low Temporal Difference (TD) error. Next, building on recent work on linear feature selection, we show conditions under which a linear feature set is sufficient for accurately representing the value function of an option policy. We extend this result to show conditions under which multiple options may be repeatedly composed to create new options with accurate linear models. Finally, we demonstrate linear option learning and planning algorithms in a simulated robot environment.
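A rough sketch of planning with linear option expectation models as described above: each option is summarised by a pair (F, b) acting on state features, and value iteration alternates option backups with a least-squares projection back onto the feature space. This is an illustrative stand-in rather than the paper's algorithm, and convergence depends on conditions of the kind the paper establishes.

```python
import numpy as np

def plan_with_linear_options(Phi, option_models, n_sweeps=100):
    """Approximate value iteration with linear option models.

    Phi:           (n_samples, k) matrix of state features at sampled states.
    option_models: list of (F, b) pairs; F is (k, k), b is (k,).
                   F.T @ phi(s) predicts the expected discounted feature vector
                   at option termination, b @ phi(s) the expected option return.
    Returns theta such that V(s) ~ theta @ phi(s).
    """
    k = Phi.shape[1]
    theta = np.zeros(k)
    for _ in range(n_sweeps):
        # Backed-up values at the sampled states under the current theta.
        backups = np.stack([Phi @ b + (Phi @ F) @ theta for F, b in option_models])
        targets = backups.max(axis=0)
        # Re-fit theta by least squares (projection back onto the feature space).
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta
```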
Learning options that allow agents to exhibit temporally extended behaviour has been shown to be useful for increasing exploration, reducing sample complexity, and in various transfer scenarios. Deep Discovery of Options (DDO) is a generative algorithm that learns a hierarchical policy, along with the options, directly from expert trajectories. We perform a qualitative and quantitative analysis of the options inferred by DDO in different domains. To this end, we suggest several value metrics, such as the option termination condition, the hinge value function error, and a KL-divergence based distance metric, to compare different approaches. Analysing the termination conditions of the options and the number of time steps for which options run shows that the options terminate prematurely. We suggest modifications that can be easily incorporated and that alleviate the problems of options being too short and of options collapsing to the same mode.
Exploration in reinforcement learning with sparse rewards remains an open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Most commonly, these signals are added as bonus rewards, which results in a mixture policy that, over extended periods of time, faithfully conducts neither exploration nor task fulfilment. In this paper, we instead learn separate intrinsic and extrinsic task policies and schedule between these different drives to accelerate exploration and stabilise learning. Moreover, we introduce a new type of intrinsic reward, denoted successor feature control (SFC), which is general rather than task-specific. It takes into account statistics over complete trajectories and thus differs from previous methods that use only local information to evaluate intrinsic motivation. We evaluate our proposed scheduled intrinsic drive (SID) agent using purely visual inputs in VizDoom, DeepMind Lab, and OpenAI Gym classic control. The results show greatly improved exploration efficiency from SFC and from the hierarchical use of the intrinsic drives. A video of our experimental results can be found at http://youtu.be/4ZHcBo7006Y.
In the pursuit of increasingly intelligent learning systems, abstraction plays a vital role in enabling sophisticated decisions to be made in complex environments. The options framework provides a formalism for such abstraction over sequences of decisions. However, most models require that options be given a priori, presumably specified by hand, which is neither efficient nor scalable. Indeed, it is preferable to learn options directly from interaction with the environment. Despite several efforts, this remains a difficult problem. In this work we develop a novel policy gradient method for the automatic learning of policies with options. This algorithm uses inference methods to simultaneously improve all of the options available to an agent, and thus can be employed in an off-policy manner, without observing option labels. The differentiable inference procedure employed yields options that can be easily interpreted. Empirical results confirm these attributes, and indicate that our algorithm has improved sample efficiency relative to the state-of-the-art in learning options end-to-end.
We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels, allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits: in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.
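A minimal numpy illustration of the Manager/Worker interface: the Worker's intrinsic reward is taken to be the cosine similarity between the latent-state change over the Manager's horizon and the goals the Manager emitted, averaged over recent goals. The encoder and both networks are omitted, and the function is a simplified reading of the paper's formulation rather than its implementation.

```python
import numpy as np

def worker_intrinsic_reward(states, goals, horizon=10):
    """Cosine similarity between the latent-state change over the Manager's
    horizon and the goal directions the Manager emitted at those times.

    states: (T, d) latent states; goals: (T, d) Manager goals (one per step).
    """
    T = len(states)
    rewards = np.zeros(T)
    for t in range(1, T):
        r, count = 0.0, 0
        for i in range(1, min(horizon, t) + 1):      # average over recent goals
            delta = states[t] - states[t - i]
            g = goals[t - i]
            denom = np.linalg.norm(delta) * np.linalg.norm(g)
            if denom > 1e-8:
                r += float(delta @ g) / denom
                count += 1
        rewards[t] = r / max(count, 1)
    return rewards
```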
Common approaches to reinforcement learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse, delayed reward feedback. Hierarchical reinforcement learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals, together with learning the corresponding skill policies that achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment, but the need to learn such a model introduces its own problems of scale. Once subgoals are identified, skills can be learned through intrinsic motivation, introducing an internal reward signal that marks the attainment of a subgoal. In this paper, we present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the agent's recent experiences. When combined with an intrinsic motivation learning mechanism, this method learns subgoals and skills together, from the agent's experience of the environment. Thus, we offer an original HRL approach that does not require acquiring a model of the environment and is suitable for large-scale applications. We demonstrate the efficiency of our method on two RL problems with sparse, delayed feedback: a variant of the rooms environment and the ATARI 2600 game Montezuma's Revenge.
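A toy sketch of the subgoal-discovery idea: online (incremental) k-means over a small memory of notable recent experiences proposes subgoal centroids. The filter used here, keeping states where nonzero reward was received, is a deliberate simplification; the class and parameter names are illustrative only.

```python
import numpy as np
from collections import deque

class IncrementalSubgoalDiscovery:
    """Online k-means over a small memory of recent 'interesting' experiences
    (here: states where nonzero reward was received) to propose subgoal centroids."""

    def __init__(self, n_subgoals, state_dim, memory_size=200, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(n_subgoals, state_dim))
        self.memory = deque(maxlen=memory_size)
        self.lr = lr

    def observe(self, state, reward):
        if reward != 0:                      # simplified filter for notable experiences
            self.memory.append(np.asarray(state, dtype=float))
            self._update(self.memory[-1])

    def _update(self, x):
        k = np.argmin(((self.centroids - x) ** 2).sum(axis=1))
        self.centroids[k] += self.lr * (x - self.centroids[k])   # move nearest centroid

    def subgoals(self):
        return self.centroids.copy()
```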
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options: closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined over an MDP constitutes a semi-Markov decision process (SMDP), and the theory of SMDPs provides the foundation for the theory of options. However, the most interesting issues concern the interplay between the underlying MDP and the SMDP and are thus beyond SMDP theory. We present results for three such cases: (1) we show that the results of planning with options can be used during execution to interrupt options and thereby perform even better than planned, (2) we introduce new intra-option methods that are able to learn about an option from fragments of its execution, and (3) we propose a notion of subgoal that can be used to improve the options themselves. All of these results have precursors in the existing literature; the contribution of this paper is to establish them in a simpler and more general setting with fewer changes to the existing reinforcement learning framework. In particular, we show that these results can be obtained without committing to (or ruling out) any particular approach to state abstraction, hierarchy, function approximation, or the macro-utility problem.
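The option construct and the SMDP view translate directly into code. The sketch below is a generic illustration, not tied to the paper's experiments: an option is represented by its initiation set, policy, and termination condition, and the standard SMDP Q-learning backup is applied after an option completes.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]                  # states where the option may start
    policy: Callable[[int], int]              # state -> primitive action
    termination: Callable[[int], float]       # state -> probability of terminating

def smdp_q_update(Q, s, o, cumulative_reward, s_next, k, gamma=0.99, alpha=0.1):
    """SMDP Q-learning backup after an option ran for k steps and accrued
    the (already discounted) cumulative reward."""
    target = cumulative_reward + gamma ** k * Q[s_next].max()
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# Q is a (n_states, n_options) table; after executing option o from state s for
# k steps with discounted return R ending in s_next, call:
#   smdp_q_update(Q, s, o, R, s_next, k)
```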
Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.
Learning goal-directed behaviour in environments with sparse feedback is a major challenge for reinforcement learning algorithms. The primary difficulty arises from insufficient exploration, which leaves the agent unable to learn robust value functions. Intrinsically motivated agents can explore new behaviour for its own sake rather than to directly solve problems. Such intrinsic behaviour can eventually help the agent solve the tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework that integrates hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning. A top-level value function learns a policy over intrinsic goals, while a lower-level function learns a policy over atomic actions to satisfy the given goal. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse, delayed feedback: (1) a complex discrete stochastic decision process, and (2) the classic ATARI game "Montezuma's Revenge".
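A tabular stand-in for the two-level scheme described above: a meta-controller learns Q-values over goals from extrinsic reward, while a controller learns goal-conditioned Q-values over primitive actions from a binary intrinsic reward for reaching the chosen goal. The deep networks, replay memories, and target networks of h-DQN are intentionally omitted.

```python
import numpy as np

class HierarchicalQ:
    """Tabular stand-in for h-DQN: a meta-controller picks a goal, the controller
    acts with intrinsic reward 1 when the goal is reached."""

    def __init__(self, n_states, n_actions, n_goals, gamma=0.99, alpha=0.1):
        self.meta_q = np.zeros((n_states, n_goals))              # Q2(s, g), extrinsic
        self.ctrl_q = np.zeros((n_goals, n_states, n_actions))   # Q1(s, a; g), intrinsic
        self.gamma, self.alpha = gamma, alpha

    def update_controller(self, goal, s, a, s_next, goal_reached):
        r_int = 1.0 if goal_reached else 0.0
        bootstrap = 0.0 if goal_reached else self.gamma * self.ctrl_q[goal, s_next].max()
        target = r_int + bootstrap
        self.ctrl_q[goal, s, a] += self.alpha * (target - self.ctrl_q[goal, s, a])

    def update_meta(self, s0, goal, extrinsic_return, s_end, gamma_pow):
        # extrinsic_return: discounted extrinsic reward accrued while pursuing the goal;
        # gamma_pow: gamma raised to the number of elapsed steps.
        target = extrinsic_return + gamma_pow * self.meta_q[s_end].max()
        self.meta_q[s0, goal] += self.alpha * (target - self.meta_q[s0, goal])
```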
In this paper we consider the problem of how a reinforcement learning agent tasked with solving a set of related Markov decision processes can use knowledge acquired early in its lifetime to improve its ability to more rapidly solve novel, but related, tasks. One way of exploiting this experience is by identifying recurrent patterns in trajectories obtained from well-performing policies. We propose a three-step framework in which an agent 1) generates a set of candidate open-loop macros by compressing trajectories drawn from near-optimal policies; 2) evaluates the value of each macro; and 3) selects a maximally diverse subset of macros that spans the space of policies typically required for solving the set of related tasks. Our experiments show that extending the original primitive action-set of the agent with the identified macros allows it to more rapidly learn an optimal policy in unseen, but similar MDPs.
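A loose sketch of steps 1 and 3 of the framework: frequent action n-grams extracted from near-optimal trajectories serve as candidate open-loop macros, and a greedy filter keeps a pairwise-dissimilar subset. The scoring and diversity criteria here are placeholders for the paper's value-based evaluation and selection.

```python
from collections import Counter

def candidate_macros(trajectories, min_len=2, max_len=4, top_k=20):
    """Most frequent action subsequences (open-loop macros) in the trajectories."""
    counts = Counter()
    for actions in trajectories:                      # each trajectory: list of actions
        for n in range(min_len, max_len + 1):
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1
    return [m for m, _ in counts.most_common(top_k)]

def diverse_subset(macros, budget):
    """Greedily pick macros that do not contain (or appear inside) an already chosen macro."""
    def overlap(a, b):
        shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
        return any(longer[i:i + len(shorter)] == shorter
                   for i in range(len(longer) - len(shorter) + 1))
    chosen = []
    for m in macros:
        if len(chosen) >= budget:
            break
        if not any(overlap(m, c) for c in chosen):
            chosen.append(m)
    return chosen
```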
We present a new method for automatically creating useful temporal abstractions in reinforcement learning. We argue that states that allow the agent to transition to a different region of the state space are useful sub-goals, and propose a method for identifying them using the concept of relative novelty. When such a state is identified, a temporally-extended activity (e.g., an option) is generated that takes the agent efficiently to this state. We illustrate the utility of the method in a number of tasks.
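The relative-novelty idea can be sketched as follows, assuming count-based novelty and a fixed window on each side of a state within a trajectory; the novelty measure and the use of a mean score are simplifications of the paper's definitions.

```python
import numpy as np
from collections import defaultdict

def relative_novelty_scores(trajectories, window=7):
    """Score each state by novelty(following states) / novelty(preceding states),
    where novelty(s) = 1 / sqrt(visit_count(s))."""
    visits = defaultdict(int)
    for traj in trajectories:
        for s in traj:
            visits[s] += 1
    novelty = lambda s: 1.0 / np.sqrt(visits[s])
    scores = defaultdict(list)
    for traj in trajectories:
        for t in range(window, len(traj) - window):
            before = np.mean([novelty(s) for s in traj[t - window:t]])
            after = np.mean([novelty(s) for s in traj[t + 1:t + 1 + window]])
            scores[traj[t]].append(after / max(before, 1e-8))
    return {s: float(np.mean(v)) for s, v in scores.items()}

# States with consistently high scores are candidate subgoals around which
# options can be generated.
```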
We present an exploration method that incorporates lookahead search over basic learned skills and their dynamics, and use it for reinforcement learning (RL) of manipulation policies. Our skills are disentangled multi-goal policies, similar to options or macro-actions, learned in simpler environments using existing multi-goal RL formulations. Coarse skill dynamics, i.e., the state transitions caused by (complete) skill executions, are learned and unrolled forward during lookahead search. Policy search benefits from temporal abstraction during exploration, while the policy itself operates over low-level primitive actions, so the resulting policies do not suffer from the suboptimality and inflexibility caused by coarse skill chaining. We show that the proposed exploration strategy learns complex manipulation policies substantially faster than current state-of-the-art RL methods, and converges to better policies than methods that use options or parameterised skills as building blocks of the policy itself rather than for guiding exploration.
A key question in reinforcement learning is which representations allow an agent to effectively reuse knowledge across different tasks. Recently, successor representations have been shown to have empirical benefits for transferring knowledge between tasks with shared transition dynamics. This paper presents model features: a feature representation that clusters behaviourally equivalent states and is equivalent to a model reduction. Furthermore, we present a successor feature model, which shows that learning successor features is equivalent to learning a model reduction. We develop a novel optimization objective and provide bounds showing that minimizing this objective results in increasingly improved approximations of a model reduction. Further, we provide transfer experiments on randomly generated MDPs that vary in their transition and reward functions but approximately preserve behavioural equivalence between states. These results demonstrate that model features are suitable for transfer between tasks with varying transition and reward functions.
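A simple illustration of the link between successor features and model reduction discussed above: states whose successor features are close are aggregated into the same abstract state. The greedy tolerance-based clustering is only a stand-in for the paper's construction, and the successor features psi could come from a TD learner such as the sketch given earlier.

```python
import numpy as np

def aggregate_by_successor_features(psi, tolerance=0.1):
    """Greedy clustering of states whose successor features are within `tolerance`
    (Euclidean distance). Nearby rows of psi indicate approximately behaviourally
    equivalent states, so the clusters act as an approximate model reduction."""
    n = len(psi)
    labels = -np.ones(n, dtype=int)
    representatives = []
    for s in range(n):
        for k, rep in enumerate(representatives):
            if np.linalg.norm(psi[s] - psi[rep]) <= tolerance:
                labels[s] = k
                break
        else:
            representatives.append(s)
            labels[s] = len(representatives) - 1
    return labels
```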
How should a reinforcement learning agent act if its sole purpose is to efficiently learn an optimal policy for later use? In other words, how should it explore, to be able to exploit later? We formulate this problem as a Markov Decision Process by explicitly modeling the internal state of the agent and propose a principled heuristic for its solution. We present experimental results in a number of domains, also exploring the algorithm's use for learning a policy for a skill given its reward function, an important but neglected component of skill discovery.
Value functions are a core component of reinforcement learning systems. The main idea is to construct a single function approximator V(s; θ) that estimates the long-term reward from any state s, using parameters θ. In this paper we introduce universal value function approximators (UVFAs) V(s, g; θ) that generalise not just over states s but also over goals g. We develop an efficient technique for supervised learning of UVFAs, by factoring observed values into separate embedding vectors for state and goal, and then learning a mapping from s and g to these factored embedding vectors. We show how this technique may be incorporated into a reinforcement learning algorithm that updates the UVFA solely from observed rewards. Finally, we demonstrate that a UVFA can successfully generalise to previously unseen goals.
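The two-stage procedure described above can be sketched on a toy, fully observed value table: factor V(s, g) into low-rank state and goal embeddings (here by truncated SVD) so that V(s, g) is approximated by the inner product of the two embeddings. The second stage, learning mappings from raw states and goals to these embeddings, is omitted, and the random table below is purely illustrative.

```python
import numpy as np

def factor_value_table(V, rank):
    """Factor an observed value table V[s, g] into state embeddings phi and goal
    embeddings psi with V[s, g] ~ phi[s] @ psi[g] (rank-k truncated SVD)."""
    U, S, Vt = np.linalg.svd(V, full_matrices=False)
    phi = U[:, :rank] * np.sqrt(S[:rank])             # (n_states, rank)
    psi = (np.sqrt(S[:rank])[:, None] * Vt[:rank]).T  # (n_goals, rank)
    return phi, psi

V = np.random.rand(20, 5)           # toy table of values for 20 states and 5 goals
phi, psi = factor_value_table(V, rank=3)
approx = phi @ psi.T                # generalising low-rank estimate of V(s, g)
```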
Reinforcement learning addresses the problem of learning to select actions in order to maximize an agent's performance in unknown environments. To scale reinforcement learning to complex real-world tasks, agents must be able to discover hierarchical structures within their learning and control systems. This paper presents a method by which a reinforcement learning agent can discover subgoals with certain structural properties. By discovering subgoals and including policies to subgoals as actions in its action set, the agent is able to explore more effectively and accelerate learning in other tasks in the same or similar environments where the same subgoals are useful. The agent discovers the subgoals by searching a learned policy model for states that exhibit certain structural properties. This approach is illustrated using gridworld tasks.