Model-based reinforcement learning (RL) is considered a promising approach for reducing the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamics model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model. The framework extends the optimism-in-the-face-of-uncertainty principle to non-linear dynamics models in a way that requires no explicit uncertainty quantification. Instantiating our framework with simplifications gives a variant of model-based RL algorithms, Stochastic Lower Bound Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance on a range of continuous control benchmark tasks when only one million or fewer samples are permitted.
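As a hedged illustration of the kind of lower bound the meta-algorithm maximizes (our schematic; the precise discrepancy term is defined in the paper and not reproduced here), one can picture

$$ V^{\pi, M^\star} \;\ge\; V^{\pi, \widehat{M}} - D_{\pi}\big(\widehat{M}\big), $$

where $V^{\pi, M}$ is the expected reward of policy $\pi$ under dynamics $M$, $M^\star$ is the true dynamics, $\widehat{M}$ the estimated model, and $D_{\pi}$ a discrepancy estimated from sample trajectories; the meta-algorithm then jointly maximizes the right-hand side over $(\pi, \widehat{M})$.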
Sample efficiency is critical when applying reinforcement learning to real-world problems, because interactions between the agent and the environment can be costly. Imitation learning from expert advice has proven to be an effective strategy for reducing the number of interactions required to train a policy. Online imitation learning, which interleaves policy evaluation and policy optimization, is a particularly effective technique with provable performance guarantees. In this work, we seek to further accelerate the convergence rate of online imitation learning, thereby making it more sample efficient. We propose two model-based algorithms inspired by Follow-the-Leader (FTL) with prediction: MoBIL-VI, based on solving variational inequalities, and MoBIL-Prox, based on stochastic first-order updates. Both methods leverage a model to predict future gradients in order to speed up policy learning. When the model oracle is learned online, these algorithms provably accelerate the best known convergence rate. Our algorithms can be viewed as a generalization of stochastic Mirror-Prox (Juditsky et al., 2011) and admit a simple constructive FTL-style analysis of performance.
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zeroth-order evaluations that is an explicit polynomial in the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates for stochastic zeroth-order optimization algorithms when applied to a certain class of non-convex problems.
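As a minimal sketch of the two-point zeroth-order estimator underlying such derivative-free methods (a generic textbook form, not the paper's exact algorithm; the `rollout_cost` function, smoothing radius, and step size are placeholders):

```python
import numpy as np

def two_point_gradient(rollout_cost, K, radius=0.05):
    """Two-point zeroth-order estimate of grad_K J(K) for a linear policy u = -K x.

    rollout_cost: maps a policy matrix K to an (estimated) scalar cost J(K).
    radius:       smoothing radius of the random perturbation.
    """
    U = np.random.randn(*K.shape)
    U /= np.linalg.norm(U)                       # uniform direction on the unit sphere
    d = K.size
    delta = rollout_cost(K + radius * U) - rollout_cost(K - radius * U)
    return (d / (2.0 * radius)) * delta * U      # unbiased for the smoothed cost

def zeroth_order_descent(rollout_cost, K0, step=1e-3, iters=1000):
    """Plain gradient descent on the policy matrix using only cost evaluations."""
    K = K0.copy()
    for _ in range(iters):
        K -= step * two_point_gradient(rollout_cost, K)
    return K
```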
Reinforcement learning is a promising approach to learning robot controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems (especially those arising in robot control tasks) are \emph{almost deterministic}; that is, they can be modeled as deterministic systems with small random perturbations. We prove that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. We interpret these results in the context of counterfactual estimation. Finally, we empirically evaluate our insights on an inverted pendulum experiment.
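For concreteness, the two estimator families being compared can be written in their standard forms (our notation, not the paper's) as

$$\hat g^{\mathrm{FD}}_i = \frac{J(\theta + \delta e_i) - J(\theta - \delta e_i)}{2\delta}, \qquad \hat g^{\mathrm{LR}} = \frac{1}{N}\sum_{n=1}^{N} \Big(\sum_{t} \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)\Big) R(\tau^n),$$

where $J(\theta)$ is the expected return, $e_i$ a coordinate direction, and $R(\tau^n)$ the return of sampled trajectory $\tau^n$; the claim is that in almost-deterministic systems the finite-difference estimator can have much lower variance than the likelihood-ratio estimator.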
We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new "PicCoLOed" algorithm optimizes a policy by recursively repeating two steps: in the Prediction Step, the learner uses a model to predict the unseen future gradient and then applies the predicted gradient to update the policy; in the Correction Step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and therefore does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that PicCoLO can improve the convergence rate of several first-order model-free algorithms.
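Ignoring the mirror-descent machinery, a minimal Euclidean sketch of the two steps (our simplification, with step size $\eta_n$, model-predicted gradient $\hat g_n$, and true gradient $g_n$) is

$$\hat\theta_n = \theta_n - \eta_n\, \hat g_n \quad (\text{Prediction}), \qquad \theta_{n+1} = \hat\theta_n - \eta_n\, (g_n - \hat g_n) \quad (\text{Correction}),$$

so that only the error of the predicted gradient, rather than the model's prediction itself, can accumulate as bias.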
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Final sections of the paper show how all of our main results extend to the study of TD learning with eligibility traces, known as TD($\lambda$), and to Q-learning applied in high-dimensional optimal stopping problems.
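A minimal sketch of the algorithm being analyzed, TD(0) with linear function approximation (generic textbook form; the environment interface, feature map, and step size here are placeholders):

```python
import numpy as np

def td0_linear(env, policy, featurize, dim, alpha=0.05, gamma=0.99, episodes=500):
    """TD(0) with linear value approximation V(s) ~ theta . phi(s)."""
    theta = np.zeros(dim)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)        # assumed (state, reward, done) interface
            phi, phi_next = featurize(s), featurize(s_next)
            target = r + (0.0 if done else gamma * np.dot(theta, phi_next))
            td_error = target - np.dot(theta, phi)
            theta += alpha * td_error * phi       # stochastic semi-gradient step
            s = s_next
    return theta
```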
In this paper, we propose a novel reinforcement learning algorithm consisting of a stochastic variance-reduced version of policy gradient for Markov decision processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for: I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG and test them on continuous MDPs.
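A hedged sketch of the SVRG-style correction referred to here (schematic; see the paper for the exact estimator): given a snapshot policy $\tilde\theta$ with a large-batch gradient estimate $\tilde\mu \approx \nabla J(\tilde\theta)$, the inner updates draw trajectories $\tau \sim \pi_{\theta_t}$ and use importance weights $\omega(\tau) = p(\tau \mid \tilde\theta)/p(\tau \mid \theta_t)$ to form

$$ v_t \;=\; \hat\nabla J(\theta_t) \;+\; \tilde\mu \;-\; \frac{1}{B}\sum_{\tau} \omega(\tau)\, g(\tau \mid \tilde\theta), \qquad \theta_{t+1} = \theta_t + \alpha\, v_t,$$

where $g(\tau \mid \tilde\theta)$ is the per-trajectory policy gradient at the snapshot; the importance weights keep the correction term unbiased even though the trajectories were sampled under $\theta_t$.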
Direct policy gradient methods for reinforcement learning and continuous control problems are popular for a variety of reasons: 1) they are simple to implement without explicit knowledge of the underlying model; 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest; 3) they inherently allow for richly parameterized policies. Notably, even for the most basic continuous control problem (the linear quadratic regulator), these methods must solve a non-convex optimization problem, and little is known about their efficiency from either a computational or a statistical perspective. In contrast, system identification and model-based planning in optimal control theory have a much more solid theoretical footing, where much is known about their computational and statistical properties. This work bridges this gap, showing that (model-free) policy gradient methods globally converge to the optimal solution and are efficient (polynomial in the relevant problem-dependent quantities) with regard to their sample and computational complexities.
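The linear quadratic regulator instance referred to here, with a linear policy parameterization, is (in standard form)

$$ x_{t+1} = A x_t + B u_t, \qquad u_t = -K x_t, \qquad C(K) = \mathbb{E}\Big[\sum_{t} x_t^\top Q x_t + u_t^\top R u_t\Big],$$

and the non-convexity in question is that of the cost $C(K)$ as a function of the policy matrix $K$.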
When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We study the problem of batch policy learning under multiple constraints and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
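One natural way to read such a meta-algorithm (our hedged sketch; the paper's exact formulation may differ) is as a Lagrangian game between a policy player and a multiplier player,

$$ \max_{\pi}\; \min_{\lambda \ge 0}\; \mathcal{L}(\pi, \lambda) = R(\pi) - \lambda^\top \big(G(\pi) - \tau\big),$$

where $R(\pi)$ is the primary objective and $G(\pi)$ the vector of constraint costs with thresholds $\tau$; the policy player can best-respond with any batch RL subroutine while the $\lambda$ player runs an online learning algorithm, and OPE is used to evaluate $R$ and $G$ from the batch data.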
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
We propose Bayesian Deep Q-Networks (BDQN), a Thompson sampling approach for Deep Reinforcement Learning (DRL) in Markov decision processes (MDP). BDQN is an efficient exploration-exploitation algorithm which combines Thompson sampling with deep Q-networks (DQN) and directly incorporates uncertainty over the Q-value in the last layer of the DQN, on the feature representation layer. This allows us to efficiently carry out Thompson sampling through Gaussian sampling and Bayesian Linear Regression (BLR), which has fast closed-form updates. We apply our method to a wide range of Atari games and compare BDQN to a powerful baseline: the double deep Q-network (DDQN). Since BDQN carries out more efficient exploration, it is able to reach higher rewards substantially faster: in less than 5M±1M interactions for almost half of the games to reach DDQN scores. We also establish theoretical guarantees for the special case when the feature representation is d-dimensional and fixed. We provide the Bayesian regret of posterior sampling RL (PSRL) and frequentist regret of the optimism in the face of uncertainty (OFU) for episodic MDPs.
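A hedged sketch of the last-layer Bayesian linear regression step (generic Gaussian BLR with Thompson sampling; the function names, priors, and per-action bookkeeping are our placeholders, not BDQN's exact settings):

```python
import numpy as np

def blr_posterior(Phi, y, noise_var=1.0, prior_var=10.0):
    """Gaussian posterior over last-layer weights w for regression targets y ~ Phi @ w."""
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def thompson_action(features_per_action, posteriors, rng):
    """Sample one weight vector per action from its posterior and act greedily."""
    q_samples = []
    for phi, (mean, cov) in zip(features_per_action, posteriors):
        w = rng.multivariate_normal(mean, cov)   # one posterior sample per action
        q_samples.append(phi @ w)
    return int(np.argmax(q_samples))
```

Here `Phi` stacks the DQN feature representations of past states for one action and `y` the corresponding Q-learning targets; the closed-form posterior update is what makes the exploration cheap.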
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
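The maximum entropy objective underlying this framework is, in its standard form,

$$ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big],$$

where the temperature $\alpha$ trades off reward against policy entropy and thus controls how random the resulting behavior is.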
We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to inductively estimate a mapping from observations to latent states through a sequence of regression and clustering steps, in which previously decoded latent states provide labels for later regression problems, and how to use this mapping to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over $Q$-learning, even when $Q$-learning is given access to the latent states.
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
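The "particularly appealing form" referred to is the deterministic policy gradient theorem, which in its standard statement reads

$$ \nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\big[\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \,\big],$$

avoiding an expectation over the action space, which is what makes the estimator cheaper than its stochastic counterpart.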
Model-free approaches to reinforcement learning (RL) and continuous control find policies based only on past states and rewards, without fitting a model of the system dynamics. They are appealing because they are general purpose and easy to implement; however, they also come with fewer theoretical guarantees than model-based RL. In this work, we present a new model-free algorithm for controlling linear quadratic (LQ) systems, and show that its regret scales as $O(T^{\xi+2/3})$ for any small $\xi>0$, provided that the time horizon satisfies $T>C^{1/\xi}$ for a constant $C$. The algorithm is based on a reduction of the control of Markov decision processes to an expert prediction problem. In practice, it corresponds to a variant of policy iteration with forced exploration, in which the policy in each phase is greedy with respect to the average of all previous value functions. This is the first model-free algorithm for adaptive control of LQ systems that provably achieves sublinear regret with polynomial computational cost. Empirically, our algorithm dramatically outperforms standard policy iteration, but performs worse than a model-based approach.
We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through episodes; in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.
Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD, a policy gradient extension of the Imitation Learning (IL) approach of Ross & Bagnell (2014), can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.
We present four new reinforcement learning algorithms based on actor-critic, function approximation, and natural gradient ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. We present empirical results verifying the convergence of our algorithms.
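Schematically, the actor update combining these ingredients takes the form of natural gradient ascent (standard form, not the paper's exact notation),

$$ \theta_{t+1} = \theta_t + \alpha_t\, G(\theta_t)^{-1}\, \hat\nabla_\theta J(\theta_t),$$

where $G(\theta)$ is the Fisher information matrix of the policy and $\hat\nabla_\theta J$ is a policy gradient estimate built from the critic's temporal difference errors.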
Policy gradient algorithms are among the best candidates for applying reinforcement learning to real-world control tasks, such as those arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation in which the danger is encoded in the reward signal and the learner is constrained to never worsen its performance. By studying actor-only policy gradients from a stochastic optimization perspective, we establish improvement guarantees for a broad class of parametric policies, generalizing existing results for Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimator. Through a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.
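To illustrate why the step size and batch size are the two levers (a generic sketch under smoothness assumptions, not the paper's bound): if $J$ is $L$-smooth and the batch size is large enough that the gradient estimate $\hat g$ satisfies $\|\hat g - \nabla J(\theta)\| \le \epsilon$ with high probability, then the ascent step $\theta' = \theta + \alpha \hat g$ obeys

$$ J(\theta') - J(\theta) \;\ge\; \alpha\big(\|\nabla J(\theta)\|^2 - \epsilon\,\|\nabla J(\theta)\|\big) - \frac{L}{2}\,\alpha^2\,\|\hat g\|^2,$$

so choosing $\alpha$ small enough relative to $L$, and the batch size large enough to control $\epsilon$, keeps the right-hand side non-negative with high probability.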