Model-based reinforcement learning (RL) is considered a promising approach to reducing the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been quite limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamics model and sample trajectories, and then jointly maximizes the lower bound over the policy and the model. The framework extends the optimism-in-the-face-of-uncertainty principle to nonlinear dynamics models in a way that requires no explicit uncertainty quantification. Instantiating our framework with simplifications gives a variant of the model-based RL algorithm stochastic lower bound optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance on a range of continuous control benchmark tasks when only one million or fewer samples are permitted.
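The abstract describes the meta-algorithm only at a high level. As a rough illustration, here is a minimal sketch, under toy assumptions (1-D linear dynamics, a linear policy, random-search policy improvement, and no lower-bound or discrepancy term), of the generic model-based loop it builds on: collect data with the current policy, fit a dynamics model, then improve the policy against model rollouts. Every name and constant below is hypothetical; this is not the authors' SLBO implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # unknown 1-D dynamics that the agent can only sample from
    return 0.8 * s + 0.4 * a + 0.01 * rng.standard_normal()

def reward(s, a):
    return -(s ** 2 + 0.1 * a ** 2)        # drive the state toward the origin

def rollout_return(step_fn, k, s0=1.0, horizon=30):
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = k * s                          # linear policy a = k * s
        ret += reward(s, a)
        s = step_fn(s, a)
    return ret

k, data = 0.0, []
for _ in range(10):
    # 1) collect real trajectories with the current policy (plus exploration noise)
    s = 1.0
    for _ in range(30):
        a = k * s + 0.1 * rng.standard_normal()
        s_next = true_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    # 2) fit a linear dynamics model s' ~ A s + B a by least squares
    X = np.array([[si, ai] for si, ai, _ in data])
    y = np.array([sn for _, _, sn in data])
    (A, B), *_ = np.linalg.lstsq(X, y, rcond=None)
    model_step = lambda s, a, A=A, B=B: A * s + B * a
    # 3) improve the policy against the learned model (random search for brevity)
    candidates = k + 0.2 * rng.standard_normal(64)
    k = max(candidates, key=lambda kk: rollout_return(model_step, kk))

print("learned gain:", round(float(k), 3),
      "return on the real system:", round(rollout_return(true_step, k), 2))
```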
Sample efficiency is critical for solving real-world reinforcement learning problems, where interactions between the agent and the environment can be costly. Imitation learning from expert advice has proven to be an effective strategy for reducing the number of interactions required to train a policy. Online imitation learning, which interleaves policy evaluation and policy optimization, is a particularly effective technique with provable performance guarantees. In this work, we seek to further accelerate the convergence rate of online imitation learning, thereby making it more sample efficient. We propose two model-based algorithms inspired by Follow-the-Leader (FTL) with prediction: MoBIL-VI, based on solving variational inequalities, and MoBIL-Prox, based on stochastic first-order updates. Both methods leverage a model to predict future gradients and thereby speed up policy learning. When the model oracle is learned online, these algorithms provably accelerate the best known convergence rates. Our algorithms can be viewed as a generalization of stochastic Mirror-Prox (Juditsky et al., 2011) and admit a simple constructive FTL-style performance analysis.
Reinforcement learning is a promising approach to learning robot controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems, especially those in robot control tasks, are almost deterministic: they can be modeled as deterministic systems with small stochastic perturbations. We show that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. We interpret these results in the context of counterfactual estimation. Finally, we evaluate our insights empirically on an inverted pendulum experiment.
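As a concrete illustration of the estimator being discussed, the sketch below implements a finite-difference estimate of the policy gradient on a hypothetical, nearly deterministic scalar system (small additive noise, linear policy). The system, step sizes, and horizon are toy assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta, noise_scale=0.01, horizon=50):
    """Return of the linear policy a = theta * s on a nearly deterministic system."""
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        a = theta * s
        ret += -(s ** 2 + 0.1 * a ** 2)
        s = 0.9 * s + 0.5 * a + noise_scale * rng.standard_normal()
    return ret

def finite_difference_gradient(theta, eps=1e-2, n_rollouts=8):
    # central differences of noisy returns, averaged over a few rollout pairs
    diffs = [(episode_return(theta + eps) - episode_return(theta - eps)) / (2 * eps)
             for _ in range(n_rollouts)]
    return float(np.mean(diffs))

theta = 0.0
for _ in range(200):
    theta += 1e-2 * finite_difference_gradient(theta)   # gradient ascent on the return
print("learned gain:", round(theta, 3))
```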
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zeroth-order evaluations that is an explicit polynomial in the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as between one-point and two-point reward feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates for stochastic zeroth-order optimization algorithms when applied to a certain class of non-convex problems.
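The following is a minimal sketch of the two-point zeroth-order scheme the abstract analyzes, applied to a small hypothetical discrete-time LQR instance with a deterministic initial state: perturb the gain matrix along a random direction, evaluate the cost on both sides, and form the standard two-point gradient surrogate. The system matrices, smoothing radius, and step size are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), 0.1 * np.eye(1)

def lqr_cost(K, horizon=50):
    x, cost = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return float(cost)

def two_point_gradient(K, delta=0.1):
    # sample a random direction, evaluate the cost on both sides of it,
    # and form the two-point zeroth-order gradient surrogate
    U = rng.standard_normal(K.shape)
    U /= np.linalg.norm(U)
    return K.size * (lqr_cost(K + delta * U) - lqr_cost(K - delta * U)) / (2 * delta) * U

K = np.zeros((1, 2))
print("initial cost:", lqr_cost(K))
for _ in range(3000):
    K -= 1e-3 * two_point_gradient(K)    # ad hoc step size for this toy problem
print("final cost:", round(lqr_cost(K), 3), "gain:", np.round(K, 2))
```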
Direct policy gradient methods for reinforcement learning and continuous control problems are popular for a number of reasons: 1) they are simple to implement without explicit knowledge of an underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; 3) they inherently allow for richly parameterized policies. Notably, even in the most basic continuous control problem, the linear quadratic regulator, these methods must solve a non-convex optimization problem, and little is understood about their efficiency from either a computational or a statistical perspective. In contrast, system identification and model-based planning in optimal control theory have a much more solid theoretical footing, where much is known about their computational and statistical properties. This work bridges this gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomial in the relevant problem-dependent quantities) with regard to both sample and computational complexity.
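For the LQR setting the abstract refers to, the policy-gradient objective admits a closed form: in the LQR policy-gradient literature, for a linear policy $u = -Kx$ the gradient can be written as $\nabla C(K) = 2\big[(R + B^\top P_K B)K - B^\top P_K A\big]\Sigma_K$, where $P_K$ and $\Sigma_K$ solve discrete Lyapunov equations for the closed loop. The sketch below runs exact gradient descent with this formula on a toy system; the matrices and step size are hypothetical, and this illustrates the non-convex objective being studied rather than the paper's sample-based algorithm.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), 0.1 * np.eye(1)
Sigma0 = np.eye(2)                          # covariance of the initial state

def cost_and_grad(K):
    Acl = A - B @ K                         # closed-loop dynamics under u = -K x
    # cost-to-go matrix:  Acl^T P Acl - P + (Q + K^T R K) = 0
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # state covariance:   Acl S Acl^T - S + Sigma0 = 0
    S = solve_discrete_lyapunov(Acl, Sigma0)
    cost = np.trace(P @ Sigma0)
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ S
    return cost, grad

K = np.zeros((1, 2))                        # any stabilizing initial gain
print("initial cost:", round(cost_and_grad(K)[0], 3))
for _ in range(5000):
    _, grad = cost_and_grad(K)
    K = K - 2e-4 * grad                     # ad hoc step size for this toy problem
print("final cost:", round(cost_and_grad(K)[0], 3), "gain:", np.round(K, 2))
```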
We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new "PicCoLOed" algorithm optimizes the policy by recursively repeating two steps: in the prediction step, the learner uses a model to predict the unseen future gradient and then applies this prediction to update the policy; in the correction step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and therefore does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms so as to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that PicCoLO can improve the convergence rate of several first-order model-free algorithms.
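The prediction and correction steps can be illustrated on a toy online convex problem: take a step along a model-predicted gradient, observe the true gradient from the environment, then apply the prediction error as a correction. The sketch below does exactly that with a hypothetical imperfect gradient model; it is a caricature of the pattern, not the PicCoLO reduction itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])              # unknown optimum of the toy objective

def true_grad(x):
    # noisy gradient revealed by the "environment" after acting
    return x - x_star + 0.1 * rng.standard_normal(2)

def model_grad(x):
    # imperfect learned model of the gradient, used for prediction
    return 0.8 * (x - x_star)

x, eta = np.zeros(2), 0.1
for _ in range(500):
    g_hat = model_grad(x)
    x = x - eta * g_hat                     # prediction step: act on the model
    g = true_grad(x)                        # run in the environment, observe the real gradient
    x = x - eta * (g - g_hat)               # correction step: fix the prediction error
print("estimate:", np.round(x, 2), "target:", x_star)
```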
In this paper, we propose a novel reinforcement learning algorithm consisting in a stochastic variance-reduced version of the policy gradient for Markov decision processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to the policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG and test them on continuous MDPs.
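The SVRG structure that SVRPG adapts is easy to show in isolation. The sketch below applies it to a toy least-squares problem: keep a snapshot point, compute its full gradient once per epoch, and use the variance-reduced direction $g_i(w) - g_i(\tilde w) + \nabla F(\tilde w)$ in the inner loop. In SVRPG the component gradients become importance-weighted per-trajectory policy-gradient estimates; the problem and constants below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]      # exact minimizer, for reference

def grad_i(w, i):                                  # gradient of one component loss
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):                                  # gradient of the average loss
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(5)
for epoch in range(20):
    w_snap, mu = w.copy(), full_grad(w)            # snapshot and its full gradient
    for _ in range(len(y)):
        i = rng.integers(len(y))
        # variance-reduced direction: unbiased, and its variance shrinks
        # as w approaches the snapshot point
        v = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= 0.01 * v
print("distance to the minimizer:", np.linalg.norm(w - w_star))
```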
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Final sections of the paper show how all of our main results extend to the study of TD learning with eligibility traces, known as TD($\lambda$), and to Q-learning applied in high-dimensional optimal stopping problems.
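Below is a minimal sketch of the algorithm being analyzed, TD(0) with linear function approximation, on a small synthetic Markov reward process (identity features, so the approximation is exact and the estimate can be checked against the closed-form value function). The chain, step size, and run length are illustrative choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.02
P = np.full((n_states, n_states), 1.0 / n_states)   # uniform transition matrix
r = np.arange(n_states, dtype=float)                 # reward depends only on the state
Phi = np.eye(n_states)                               # identity features (exact case)

theta, s = np.zeros(n_states), 0
for _ in range(50_000):
    s_next = rng.choice(n_states, p=P[s])
    td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]               # semi-gradient TD(0) step
    s = s_next

v_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)   # closed-form values
print("TD estimate: ", np.round(theta, 1))
print("exact values:", np.round(v_exact, 1))
```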
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
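For the one-shot zero-sum case in which the abstract grounds its update rules, the regret-minimization viewpoint can be illustrated with regret matching in self-play on rock-paper-scissors: when both players minimize regret, their time-averaged strategies approach the Nash equilibrium. The sketch below is a generic illustration of that foundation, not the paper's actor-critic update rules.

```python
import numpy as np

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # row player's payoff matrix

def strategy(regret):
    # regret matching: play actions in proportion to their positive regret
    pos = np.maximum(regret, 0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1 / 3)

regret1, regret2 = np.zeros(3), np.zeros(3)
avg1, avg2 = np.zeros(3), np.zeros(3)
for _ in range(20_000):
    s1, s2 = strategy(regret1), strategy(regret2)
    avg1, avg2 = avg1 + s1, avg2 + s2
    u1 = A @ s2                    # row player's expected payoff per action
    u2 = -A.T @ s1                 # column player's expected payoff per action
    regret1 += u1 - s1 @ u1        # regret of each action vs. the mixed strategy
    regret2 += u2 - s2 @ u2

# the averaged strategies approach the uniform Nash equilibrium (1/3, 1/3, 1/3)
print(np.round(avg1 / avg1.sum(), 3), np.round(avg2 / avg2.sum(), 3))
```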
We propose Bayesian Deep Q-Networks (BDQN), a Thompson sampling approach for Deep Reinforcement Learning (DRL) in Markov decision processes (MDP). BDQN is an efficient exploration-exploitation algorithm which combines Thompson sampling with deep Q-networks (DQN) and directly incorporates uncertainty over the Q-value in the last layer of the DQN, on the feature representation layer. This allows us to efficiently carry out Thompson sampling through Gaussian sampling and Bayesian Linear Regression (BLR), which has fast closed-form updates. We apply our method to a wide range of Atari games and compare BDQN to a powerful baseline: the double deep Q-network (DDQN). Since BDQN carries out more efficient exploration, it is able to reach higher rewards substantially faster: in less than 5M±1M interactions for almost half of the games to reach DDQN scores. We also establish theoretical guarantees for the special case when the feature representation is d-dimensional and fixed. We provide the Bayesian regret of posterior sampling RL (PSRL) and frequentist regret of the optimism in the face of uncertainty (OFU) for episodic MDPs.
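A minimal sketch of the last-layer machinery described above: a Gaussian posterior over the linear Q-value weights of each action, updated with conjugate Bayesian linear regression, and actions chosen greedily with respect to a posterior sample (Thompson sampling). The features, targets, and noise model below are hypothetical stand-ins for the DQN representation and bootstrapped targets, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 8, 4                     # hypothetical feature dimension and action count
sigma2, prior_var = 0.1, 1.0            # observation-noise and prior variances

Lambda = [np.eye(d) / prior_var for _ in range(n_actions)]   # posterior precisions
b = [np.zeros(d) for _ in range(n_actions)]                  # precision-weighted means

def posterior_sample(a):
    cov = np.linalg.inv(Lambda[a])
    return rng.multivariate_normal(cov @ b[a], cov)

def blr_update(a, phi, target):
    # conjugate Bayesian linear regression update for action a's Q-value weights
    Lambda[a] += np.outer(phi, phi) / sigma2
    b[a] += phi * target / sigma2

w_true = rng.standard_normal((n_actions, d))                 # hypothetical true weights
for _ in range(2000):
    phi = rng.standard_normal(d)                             # stand-in for DQN features
    sampled = np.array([posterior_sample(a) for a in range(n_actions)])
    a = int(np.argmax(sampled @ phi))                        # Thompson-sampling action
    target = w_true[a] @ phi + np.sqrt(sigma2) * rng.standard_normal()
    blr_update(a, phi, target)

means = np.array([np.linalg.solve(Lambda[a], b[a]) for a in range(n_actions)])
print("max posterior-mean error:", round(float(np.abs(means - w_true).max()), 3))
```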
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy: that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
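The entropy-augmented objective shows up most concretely in the critic target. The sketch below computes the soft TD target $r + \gamma\,(\min(Q_1', Q_2') - \alpha \log \pi(a'|s'))$ on a hypothetical batch; the clipped double-Q term follows the usual soft actor-critic implementation rather than anything stated in this abstract, and all numbers are made up.

```python
import numpy as np

gamma, alpha = 0.99, 0.2                    # discount and temperature (illustrative)
r = np.array([1.0, 0.0, -0.5])              # batch of rewards
done = np.array([0.0, 0.0, 1.0])            # episode-termination flags
q1_next = np.array([10.0, 8.0, 3.0])        # target critic 1 at (s', a'), a' ~ pi(.|s')
q2_next = np.array([9.5, 8.4, 2.9])         # target critic 2 (clipped double-Q)
logp_next = np.array([-1.2, -0.7, -2.0])    # log pi(a'|s')

# soft state value: minimum critic minus the entropy-temperature-weighted log-prob
soft_value_next = np.minimum(q1_next, q2_next) - alpha * logp_next
td_target = r + gamma * (1.0 - done) * soft_value_next
print(td_target)
```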
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and more prevalent in modern deep RL than model-based approaches. However, empirical work suggests that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen, 2011; Schulman et al., 2015]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and it remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. This sample efficiency matches the optimal regret achievable by any model-based approach, up to a single $\sqrt{H}$ factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator".
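Below is a minimal sketch of the optimistic tabular update analyzed here: learning rate $(H+1)/(H+t)$ and an exploration bonus of order $\sqrt{H^3 \log(SAT/\delta)/t}$ added to the backed-up target, with values truncated at $H$. The constant $c$ and the example call are illustrative; the paper's exact constants differ.

```python
import numpy as np

H, S, A, T, c, delta = 5, 10, 4, 10_000, 1.0, 0.1
Q = np.full((H, S, A), float(H))        # optimistic initialization
N = np.zeros((H, S, A), dtype=int)      # visit counts

def ucb_q_update(h, s, a, r, s_next):
    """Update Q[h, s, a] after observing (s, a, r, s') at step h of an episode."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1) / (H + t)                               # the paper's step size
    bonus = c * np.sqrt(H ** 3 * np.log(S * A * T / delta) / t)
    v_next = 0.0 if h == H - 1 else min(float(H), Q[h + 1, s_next].max())
    Q[h, s, a] += alpha * (r + v_next + bonus - Q[h, s, a])

ucb_q_update(h=0, s=3, a=1, r=1.0, s_next=7)                # example call
print(Q[0, 3, 1])
```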
Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD, a policy gradient extension of the Imitation Learning (IL) approach of Ross & Bagnell (2014), can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
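The "particularly appealing form" is the chain rule $\nabla_\theta J = \mathbb{E}_s\big[\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)\big|_{a=\mu_\theta(s)}\big]$. The sketch below evaluates it for a linear deterministic policy and a known concave quadratic Q, where both factors are available in closed form; this is a toy illustration of the gradient, not the off-policy actor-critic algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
ds, da = 3, 2
theta = np.zeros((ds, da))                        # linear deterministic policy a = theta^T s
W = rng.standard_normal((da, ds))                 # Q(s, a) = -||a - W s||^2 (known, concave in a)

def grad_a_Q(s, a):
    return -2.0 * (a - W @ s)

for _ in range(2000):
    s = rng.standard_normal(ds)                   # sample a state
    a = theta.T @ s                               # deterministic action
    # chain rule: d a / d theta[i, j] = s[i] e_j, so the update is an outer product
    theta += 1e-2 * np.outer(s, grad_a_Q(s, a))

print("distance to the optimal policy:", np.linalg.norm(theta - W.T))
```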
We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDP) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through episodes; in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.
Model-free approaches for reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics. They are appealing because they are general purpose and easy to implement; however, they also come with fewer theoretical guarantees than model-based RL. In this work, we present a new model-free algorithm for controlling linear quadratic (LQ) systems, and show that its regret scales as $O(T^{\xi + 2/3})$ for any small $\xi > 0$ if the time horizon satisfies $T > C^{1/\xi}$ for a constant $C$. The algorithm is based on a reduction of the control of Markov decision processes to an expert prediction problem. In practice, it corresponds to a variant of policy iteration with forced exploration, where the policy in each phase is greedy with respect to the average of all previous value functions. This is the first model-free algorithm for adaptive control of LQ systems that provably achieves sublinear regret with polynomial computational cost. Empirically, our algorithm dramatically outperforms standard policy iteration, but performs worse than a model-based approach.
Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.
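The variance effect under discussion is easy to reproduce numerically. The sketch below compares the score-function gradient estimator with no baseline against one with a state-dependent baseline $b(s) = \mathbb{E}[r \mid s]$ on a toy contextual bandit (not the paper's benchmarks); both estimators have the same mean, but the baseline removes the variance contributed by the reward offset.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                     # parameter of a Bernoulli policy

def sample_grad(use_baseline):
    s = rng.standard_normal()                   # state
    p = 1.0 / (1.0 + np.exp(-theta * s))        # pi(a=1 | s)
    a = float(rng.random() < p)                 # sampled action
    r = (2.0 * a - 1.0) * s + 5.0 + rng.standard_normal()   # reward with a large offset
    b = (2.0 * p - 1.0) * s + 5.0 if use_baseline else 0.0  # b(s) = E[r | s]
    score = (a - p) * s                         # d log pi(a|s) / d theta
    return (r - b) * score

for use_baseline in (False, True):
    g = np.array([sample_grad(use_baseline) for _ in range(50_000)])
    label = "state-dependent baseline" if use_baseline else "no baseline"
    print(label, "-> mean", round(float(g.mean()), 3), "variance", round(float(g.var()), 2))
```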
We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to the latent states inductively through a sequence of regression and clustering steps, where previously decoded latent states provide labels for later regression problems, and how to use it to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over $Q$-learning, even when $Q$-learning is given access to the latent states.
This manuscript surveys reinforcement learning from the perspective of optimization and control, with a focus on continuous control applications. It presents the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and most studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the phenomena observed in LQR persist. In particular, both theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. The survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments, and of how tools from reinforcement learning and control might be combined to approach these challenges.