A key question in reinforcement learning is with which representations an agent can effectively reuse knowledge across different tasks. Recently, the successor representation has been shown to offer empirical benefits for transferring knowledge between tasks that share transition dynamics. This paper introduces model features: a feature representation that clusters behaviorally equivalent states and that is equivalent to a model reduction. Furthermore, we propose a successor-feature model showing that learning successor features is equivalent to learning a model reduction. We develop a new optimization objective and provide bounds showing that minimizing this objective yields increasingly accurate approximations of a model reduction. In addition, we present transfer experiments on randomly generated MDPs that vary in their transition and reward functions but approximately preserve behavioral equivalence between states. These results indicate that model features are suitable for transfer between tasks with differing transition and reward functions.
Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario, and regularization in high-dimensional feature spaces. By presenting the first extensive, systematic comparative evaluation of TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.
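As a rough illustration of the kind of method this survey compares (a sketch, not code from the paper; names and hyperparameters are placeholders), the following shows semi-gradient TD(0) with linear value-function approximation:

```python
import numpy as np

def td0_linear(transitions, num_features, alpha=0.05, gamma=0.99):
    """Semi-gradient TD(0) with linear value-function approximation.

    `transitions` is an iterable of (phi_s, reward, phi_next, done) tuples,
    where phi_s and phi_next are feature vectors of the visited states.
    Returns the learned weight vector w such that V(s) ~= w . phi(s).
    """
    w = np.zeros(num_features)
    for phi_s, reward, phi_next, done in transitions:
        v_next = 0.0 if done else w @ phi_next
        td_error = reward + gamma * v_next - w @ phi_s   # one-step TD error
        w += alpha * td_error * phi_s                    # semi-gradient step
    return w
```

LSTD, by contrast, solves the corresponding least-squares system A w = b in one batch computation, with A accumulating outer products of the features and their discounted differences and b accumulating reward-weighted features, rather than relying on stochastic updates.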
Linear options
Learning, planning, and representing knowledge in large state spaces at multiple levels of temporal abstraction are key, long-standing challenges for building flexible autonomous agents. The options framework provides a formal mechanism for specifying and learning temporally-extended skills. Although past work has demonstrated the benefit of acting according to options in continuous state spaces, one of the central advantages of temporal abstraction, the ability to plan using a temporally abstract model, remains a challenging problem when the number of environment states is large or infinite. In this work, we develop a knowledge construct, the linear option, which is capable of modeling temporally abstract dynamics in continuous state spaces. We show that planning with a linear expectation model of an option's dynamics converges to a fixed point with low Temporal Difference (TD) error. Next, building on recent work on linear feature selection, we show conditions under which a linear feature set is sufficient for accurately representing the value function of an option policy. We extend this result to show conditions under which multiple options may be repeatedly composed to create new options with accurate linear models. Finally, we demonstrate linear option learning and planning algorithms in a simulated robot environment.
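As a loose sketch of the idea (our notation, not the paper's algorithm), a linear expectation model of an option can be taken as a pair (F, b), with F applied to the feature vector approximating the expected discount-weighted features at option termination and b giving the expected cumulative reward along the way; evaluating a fixed option then reduces to a linear fixed-point equation, and Dyna-style planning can back values up through several such models:

```python
import numpy as np

def linear_model_fixed_point(F, b):
    """Value weights theta solving theta = b + F.T @ theta for a linear
    expectation model (F, b) of a single option, so that V(s) ~= theta @ phi(s)
    at the fixed point."""
    n = F.shape[0]
    return np.linalg.solve(np.eye(n) - F.T, b)

def planning_sweep(theta, option_models, alpha=0.1, sampled_features=()):
    """One Dyna-style planning sweep: for each sampled feature vector,
    back up the value through every option model (F, b) and move theta
    toward the greedy backup."""
    for phi in sampled_features:
        backups = [b @ phi + theta @ (F @ phi) for F, b in option_models]
        delta = max(backups) - theta @ phi
        theta = theta + alpha * delta * phi
    return theta
```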
Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario in which the reward function changes but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features," a value-function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement," a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly with the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method provides guarantees on the transferred policy even before any learning has taken place. We derive two theorems that put our approach on firm theoretical ground and present experiments showing that it successfully promotes transfer in practice, significantly outperforming alternative methods on a sequence of navigation tasks and in the control of a simulated robotic arm.
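A minimal sketch of the two ideas (illustrative only; array shapes and names are our own): with successor features for a library of policies and reward weights w for the current task, each policy's action values satisfy Q_i(s, a) ~= psi_i(s, a) . w, and generalized policy improvement acts greedily with respect to the maximum over the library:

```python
import numpy as np

def gpi_action(psi, w):
    """Generalized policy improvement over a library of successor features.

    psi: array of shape (num_policies, num_actions, feature_dim), where
         psi[i, a] approximates the expected discounted sum of features
         when taking action a in the current state and following policy i.
    w:   reward weights of the current task, so Q_i(s, a) ~= psi[i, a] . w.
    Returns the GPI action argmax_a max_i Q_i(s, a).
    """
    q_values = psi @ w                     # shape (num_policies, num_actions)
    return int(np.argmax(q_values.max(axis=0)))
```

Because the same successor features can be recombined with the reward weights of any new task, only w needs to be identified when the reward function changes.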
When using reinforcement learning (RL) algorithms, it is common, given large state spaces, to introduce some form of approximation architecture for the value function (VF). However, the exact form of this architecture can have a significant impact on an agent's performance, and determining a suitable approximation architecture can often be a highly complex task. Accordingly, there is current interest in allowing RL algorithms to adaptively generate (i.e., learn) approximation architectures. One relatively unexplored method of adapting the approximation architecture involves using feedback about how frequently the agent has visited certain states to guide which regions of the state space are approximated in greater detail. In this paper we: (a) informally discuss the potential advantages offered by such methods; (b) introduce a new algorithm based on this approach, which adapts a state-aggregation approximation architecture on-line and is designed for use in conjunction with SARSA; (c) provide theoretical results, in a policy-evaluation setting, regarding this particular algorithm's complexity, convergence, and potential to reduce VF error; and finally (d) experimentally test the extent to which the algorithm can improve performance on a number of different test problems. Taken together, our results suggest that our algorithm (and potentially such methods more generally) can provide a versatile and computationally lightweight way of significantly boosting RL performance under suitable conditions in practical applications.
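The following schematic (not the paper's algorithm; the aggregation map, split rule, and threshold are placeholders) illustrates the general approach: run SARSA over aggregate cells while counting visits, and flag heavily visited cells as candidates for finer approximation:

```python
import numpy as np
from collections import defaultdict

class AggregatedSarsa:
    """SARSA over a state-aggregation architecture, with visit counts kept
    per aggregate cell so that frequently visited regions can later be
    refined.  `aggregate` maps a raw state to a cell id; the splitting rule
    itself is left abstract here."""

    def __init__(self, aggregate, num_actions, alpha=0.1, gamma=0.99,
                 split_threshold=500):
        self.aggregate = aggregate
        self.q = defaultdict(lambda: np.zeros(num_actions))
        self.visits = defaultdict(int)
        self.alpha, self.gamma = alpha, gamma
        self.split_threshold = split_threshold

    def update(self, s, a, r, s_next, a_next, done):
        cell, next_cell = self.aggregate(s), self.aggregate(s_next)
        self.visits[cell] += 1
        target = r if done else r + self.gamma * self.q[next_cell][a_next]
        self.q[cell][a] += self.alpha * (target - self.q[cell][a])

    def cells_to_refine(self):
        """Cells visited often enough that finer approximation is warranted."""
        return [c for c, n in self.visits.items() if n >= self.split_threshold]
```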
Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the environment and the learning objective. This lack of modularity can complicate generalization of the task specification, as well as obfuscate connections between different task settings, such as episodic and continuing. In this work, we introduce the RL task formalism, which provides a unification through simple constructs, including a generalization to transition-based discounting. Through a series of examples, we demonstrate the generality and utility of this formalism. Finally, we extend standard learning constructs, including Bellman operators, and generalize some seminal theoretical results, including approximation error bounds. Overall, we provide a well-understood and sound formalism on which to build theoretical results and simplify algorithm use and development.
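Under one natural reading of transition-based discounting (our notation, not necessarily the paper's), the discount becomes a function of the transition, so the return and the policy-evaluation Bellman operator take the form below; episodic tasks are recovered by setting the discount to zero on terminating transitions:

```latex
G_t = R_{t+1} + \gamma(S_t, A_t, S_{t+1})\, G_{t+1},
\qquad
(T^{\pi} v)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
\left[ r(s, a, s') + \gamma(s, a, s')\, v(s') \right].
```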
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
We study the problem of exploration in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to the latent states inductively through a sequence of regression and clustering steps, where previously decoded latent states provide labels for later regression problems, and how to use it to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method improves exponentially over $Q$-learning, even when $Q$-learning is given access to the latent states.
An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely through real-world interaction. The latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting", in which aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning settings.
We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems, such as robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic, ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches.
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path of the system under an arbitrary policy is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q-function using nearest-neighbor regression. As our main contribution, we provide a rigorous finite-sample analysis of the rate of convergence. In particular, for an MDP with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with "covering time" $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\tilde{O}\big(L/(\varepsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under a purely random policy scales as $\tilde{O}\big(1/\varepsilon^d\big)$, so the sample complexity scales as $\tilde{O}\big(1/\varepsilon^{d+3}\big)$. Indeed, we establish a lower bound showing that a dependence of $\tilde{\Omega}\big(1/\varepsilon^{d+2}\big)$ is necessary.
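A much-simplified sketch of the nearest-neighbor idea (not the paper's exact procedure, which processes the sample path in batches over a covering of the state space; anchor states and step size here are placeholders): store Q-values at a finite set of anchor states and update the nearest anchor with a Q-learning target read off by nearest-neighbor lookup:

```python
import numpy as np

def nearest(anchors, s):
    """Index of the anchor state closest to s (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(anchors - np.asarray(s), axis=1)))

def nnql_sketch(path, anchors, num_actions, gamma=0.9, alpha=0.2):
    """Simplified nearest-neighbor Q-learning on a single sample path.

    `path` is a sequence of (state, action, reward, next_state) tuples and
    `anchors` an (m, d) array of reference states over which the Q estimate
    is stored; Q at an arbitrary state is read off from its nearest anchor.
    """
    q = np.zeros((len(anchors), num_actions))
    for s, a, r, s_next in path:
        i, j = nearest(anchors, s), nearest(anchors, s_next)
        target = r + gamma * q[j].max()
        q[i, a] += alpha * (target - q[i, a])   # asynchronous Q-learning step
    return q
```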
Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment , (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.
For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies to be learned. In particular, it has an intuitive, monotonic relationship with a simple counting measure of complexity, and a similar relationship can be observed empirically with a more general and data-dependent Rademacher complexity measure. Each complexity measure gives rise to a bound on the planning loss predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically.
In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
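In the notation suggested by the abstract (the risk bound omega and the weight xi are our own placeholder symbols), the constrained problem and the weighted heuristic surrogate can be written as:

```latex
\max_{\pi} \; V^{\pi}(s_0)
\quad \text{subject to} \quad
\rho^{\pi}(s_0) \le \omega,
\qquad \text{heuristically replaced by} \qquad
\max_{\pi} \; V^{\pi}(s_0) - \xi \, \rho^{\pi}(s_0),
```

where rho is the probability of ever entering an error state under the policy and the weight xi is adapted during learning until a feasible policy with high value is found.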
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
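As a concrete example of the two exploration regimes discussed (a sketch, not the paper's specific strategies), the rule below picks actions epsilon-greedily, with epsilon either held constant (persistent exploration) or decayed with the visit count of the current state (decaying, GLIE-style exploration):

```python
import random

def epsilon_greedy_action(q_values, state_visits, persistent_eps=None):
    """Pick an action epsilon-greedily.

    With `persistent_eps` set, exploration stays at that constant rate;
    otherwise epsilon decays as 1 / visit-count of the state, a schedule
    under which every action keeps being tried but the policy becomes
    greedy in the limit.
    """
    eps = persistent_eps if persistent_eps is not None else 1.0 / max(1, state_visits)
    if random.random() < eps:
        return random.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)     # exploit
```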
Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings.
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.
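A compact sketch of the LSPI loop as described (illustrative only; the feature map, regularization constant, and stopping rule are our own choices): LSTD-Q solves a least-squares system for the Q-function weights of the current greedy policy, and the outer loop repeats this evaluation with the same samples:

```python
import numpy as np

def lstdq(samples, phi, policy, num_features, gamma=0.99):
    """Solve A w = b for the state-action value weights of `policy`,
    where samples are (s, a, r, s_next, done) tuples, phi(s, a) is the
    feature map, and policy(s) gives the action the policy would take."""
    A = np.zeros((num_features, num_features))
    b = np.zeros(num_features)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        f_next = np.zeros(num_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(num_features), b)  # small ridge term

def lspi(samples, phi, actions, num_features, gamma=0.99, iterations=20):
    """Least-squares policy iteration: repeatedly evaluate the greedy policy
    implied by the current weights with LSTD-Q, reusing the same samples."""
    w = np.zeros(num_features)
    for _ in range(iterations):
        greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, greedy, num_features, gamma)
        if np.linalg.norm(w_new - w) < 1e-6:
            break
        w = w_new
    return w
```

Because the samples enter only through the accumulated A and b, the same batch of randomly collected experience can be reused at every iteration, which is the off-policy sample reuse the abstract emphasizes.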
This chapter surveys recent lines of work that use Bayesian techniques for reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior distribution over unknown parameters and learning is achieved by computing a posterior distribution based on the data observed. Hence, Bayesian reinforcement learning distinguishes itself from other forms of reinforcement learning by explicitly maintaining a distribution over various quantities such as the parameters of the model, the value function, the policy or its gradient. This yields several benefits: a) domain knowledge can be naturally encoded in the prior distribution to speed up learning; b) the exploration/exploitation tradeoff can be naturally optimized; and c) notions of risk can be naturally taken into account to obtain robust policies.