On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. However, finding appropriate hyper-parameters that control this trade-off is challenging in practice. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can deviate from each other and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is highly effective in reducing the sample complexity of state-of-the-art algorithms.
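For reference, the effective sample size mentioned above is commonly estimated from the importance weights $w_i = \pi_{\text{target}}(a_i \mid s_i)/\pi_{\text{behavior}}(a_i \mid s_i)$ of the collected samples; the normalized form below is a standard estimator and is given here only as an illustration of the quantity P3O controls:

```latex
\mathrm{ESS} \;=\; \frac{\big(\sum_{i=1}^{N} w_i\big)^{2}}{N \sum_{i=1}^{N} w_i^{2}} \;\in\; \Big[\tfrac{1}{N},\, 1\Big]
```

An ESS near 1 indicates the two policies are close enough for off-policy samples to be reused safely, while a small ESS signals that the target policy has drifted far from the behavior policy.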
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision-making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize the expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
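For reference, the maximum-entropy objective the abstract refers to is usually written as the expected return augmented by the policy entropy, weighted by a temperature coefficient $\alpha$ (standard notation, not quoted from the paper):

```latex
J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\Big[ r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```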
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is their high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods, on OpenAI Gym's MuJoCo continuous control environments.
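As a sketch of the control-variate idea (schematic notation, assuming a Gaussian policy with mean $\mu_\theta(s)$ and an off-policy critic $Q_w$; not the paper's exact statement), the critic is linearized around the policy mean and subtracted inside the likelihood-ratio term, while its analytic expectation is added back as a deterministic-gradient term:

```latex
\bar{Q}_w(s, a) \;=\; Q_w\!\big(s, \mu_\theta(s)\big) \;+\; \nabla_a Q_w(s, a)\big|_{a=\mu_\theta(s)}^{\top}\!\big(a - \mu_\theta(s)\big),
\qquad
\nabla_\theta J(\theta) \;\approx\;
\mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(\hat{A}(s,a) - \bar{A}_w(s,a)\big)\Big]
\;+\; \mathbb{E}\Big[\nabla_\theta \mu_\theta(s)^{\top}\, \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\Big]
```

Here $\hat{A}$ denotes a Monte Carlo advantage estimate and $\bar{A}_w(s,a)$ the centered linear term of the expansion; the second expectation is the analytic correction that keeps the estimator unbiased.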
Policy optimization is an effective reinforcement learning approach to solving continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel, model-free policy search algorithm, POIS, applicable in both control-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline using a batch of trajectories. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
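Schematically (standard importance-sampling notation; the exact bound and constants are derived in the paper), the offline surrogate weighs the per-trajectory IS estimate of the return against a penalty that grows with the divergence between the target policy $\pi_{\theta'}$ and the behavioral policy $\pi_\theta$:

```latex
\hat{J}_{\mathrm{IS}}(\theta') \;=\; \frac{1}{N}\sum_{i=1}^{N} w_i\, R(\tau_i),
\qquad
w_i \;=\; \prod_{t} \frac{\pi_{\theta'}(a^{i}_{t} \mid s^{i}_{t})}{\pi_{\theta}(a^{i}_{t} \mid s^{i}_{t})},
\qquad
\mathcal{L}(\theta') \;=\; \hat{J}_{\mathrm{IS}}(\theta') \;-\; \lambda \sqrt{\frac{\hat{d}_2\!\big(\pi_{\theta'}\,\|\,\pi_{\theta}\big)}{N}}
```

where $\hat{d}_2$ stands for a Rényi-type divergence estimate and $\lambda$ encodes the chosen confidence level.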
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
The success of popular algorithms for deep reinforcement learning, such as policy gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or a rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with the Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.
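A schematic reading of the divergence-minimization view (standard notation; the paper derives the exact objective): with $\rho_\pi$ the policy's state-action visitation distribution and $\rho_E$ a distribution induced by high-return trajectories stored in the replay buffer, policy optimization is cast as

```latex
\min_{\pi}\; D_{\mathrm{JS}}\!\big(\rho_{\pi}(s, a)\,\big\|\,\rho_{E}(s, a)\big)
```

whose variational form introduces a discriminator whose output serves as the learned shaped reward mentioned above.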
Efficient exploration is an unsolved problem in reinforcement learning. We introduce Model-Based Active eXploration (MAX), an algorithm that actively explores the environment. It minimizes the data required to model the environment comprehensively by planning to observe novel events, rather than merely reacting to novelties encountered by chance. The non-stationarity induced by traditional exploration-bonus techniques is avoided by constructing fresh exploration policies only at the time of acting. In semi-random toy environments, where directed exploration is critical to making progress, our algorithm is substantially more efficient than strong baselines.
We present an off-policy actor-critic algorithm for reinforcement learning (RL) that combines ideas from gradient-free optimization via stochastic search with learned action-value functions. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement by estimating a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to the existing literature on black-box optimization and "RL as inference", and it can be seen as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et al., 2018a], or as an extension of the Trust-Region Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks from the parkour suite [Heess et al., 2017], the DeepMind control suite [Tassa et al., 2018] and OpenAI Gym [Brockman et al., 2016], with diverse properties, a limited amount of computation and a single set of hyperparameters, demonstrates the effectiveness of our method and state-of-the-art results. Videos summarizing the results can be found at goo.gl/HtvJKR.
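A schematic rendering of steps ii) and iii) (simplified notation, omitting the trust-region constraints such methods typically add): sampled actions are reweighted by an exponentiated value estimate to form a local non-parametric policy, and the parametric policy is then fit to it by weighted maximum likelihood:

```latex
q(a \mid s) \;\propto\; \pi_{\theta_{\mathrm{old}}}(a \mid s)\, \exp\!\Big(\frac{Q(s, a)}{\eta}\Big),
\qquad
\theta_{\mathrm{new}} \;=\; \arg\max_{\theta}\; \mathbb{E}_{s}\, \mathbb{E}_{a \sim q(\cdot \mid s)}\big[\log \pi_{\theta}(a \mid s)\big]
```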
Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods allows combining the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than traditional RL with action space noise and evolutionary strategies individually.
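A minimal sketch of the contrast being drawn, for a toy linear policy using only NumPy (illustrative, not the paper's implementation):

```python
# Minimal sketch (not the paper's code): action-space noise vs. parameter-space noise
# for a linear policy, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 4, 2
W = rng.normal(size=(act_dim, obs_dim))   # policy parameters
obs = rng.normal(size=obs_dim)

# Action-space exploration: perturb the action that the current policy outputs.
action_noise_scale = 0.1
a_action_noise = W @ obs + action_noise_scale * rng.normal(size=act_dim)

# Parameter-space exploration: perturb the parameters (e.g., once per episode) and
# act greedily with the perturbed policy, which keeps exploration temporally consistent.
param_noise_scale = 0.05
W_perturbed = W + param_noise_scale * rng.normal(size=W.shape)
a_param_noise = W_perturbed @ obs

print("action-noise action:   ", a_action_noise)
print("parameter-noise action:", a_param_noise)
```

Because the parameter perturbation is held fixed while acting, the resulting exploration is correlated over time rather than independent at every step.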
Policy gradient methods have been successfully applied to many complex reinforcement learning problems. However, policy gradient methods suffer from high variance, slow convergence, and inefficient exploration. In this work, we introduce a maximum entropy policy optimization framework which explicitly encourages parameter exploration, and show that this framework can be reduced to a Bayesian inference problem. We then propose a novel Stein variational policy gradient method (SVPG) which combines existing policy gradient methods and a repulsive functional to generate a set of diverse but well-behaved policies. SVPG is robust to initialization and can easily be implemented in a parallel manner. On continuous control problems, we find that implementing SVPG on top of REINFORCE and advantage actor-critic algorithms improves both average return and data efficiency.
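For reference, the Stein variational update on a population of $n$ policy parameters $\{\theta_i\}$ is usually written as below (flat prior assumed; $k$ is a kernel and $\alpha$ a temperature). The first term pulls particles toward high expected return $J$, while the kernel-gradient term repels them from one another, which is the repulsive functional mentioned above:

```latex
\theta_i \;\leftarrow\; \theta_i + \epsilon\, \hat{\phi}(\theta_i),
\qquad
\hat{\phi}(\theta_i) \;=\; \frac{1}{n}\sum_{j=1}^{n}\Big[ k(\theta_j, \theta_i)\, \nabla_{\theta_j}\tfrac{1}{\alpha} J(\theta_j) \;+\; \nabla_{\theta_j} k(\theta_j, \theta_i) \Big]
```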
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian-smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience collected from the environment. Moreover, the gradient of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is further amenable to proximal optimization by augmenting the objective with a penalty on the KL-divergence from a previous policy. We find that the ability to learn both the mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
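Schematically (standard notation, assuming a Gaussian policy $\pi_\theta = \mathcal{N}(\mu_\theta(s), \Sigma_\theta(s))$; the precise statements are in the paper), the smoothed value and the gradient/Hessian relations read:

```latex
\tilde{Q}^{\pi}(s, a) \;=\; \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a,\, \Sigma_\theta(s))}\big[ Q^{\pi}(s, \tilde{a}) \big],
\qquad
\frac{\partial J}{\partial \mu} \;\propto\; \nabla_a \tilde{Q}^{\pi}(s, a)\Big|_{a=\mu_\theta(s)},
\qquad
\frac{\partial J}{\partial \Sigma} \;\propto\; \tfrac{1}{2}\, \nabla^2_a\, \tilde{Q}^{\pi}(s, a)\Big|_{a=\mu_\theta(s)}
```

up to the chain rule through the parameterizations of $\mu_\theta$ and $\Sigma_\theta$.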
We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms established methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings, while achieving similar or better final performance.
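A commonly cited reading of the coordinate-ascent scheme (simplified notation, not quoted from the paper) alternates an E-step that improves a non-parametric policy $q$ under a KL constraint and an M-step that fits the parametric policy to it:

```latex
\text{E-step:}\quad \max_{q}\; \mathbb{E}_{s}\,\mathbb{E}_{a \sim q(\cdot \mid s)}\big[Q(s, a)\big]
\quad \text{s.t.}\quad \mathbb{E}_{s}\Big[\mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\big)\Big] \le \epsilon,
\qquad
\text{M-step:}\quad \max_{\theta}\; \mathbb{E}_{s}\,\mathbb{E}_{a \sim q(\cdot \mid s)}\big[\log \pi_{\theta}(a \mid s)\big]
```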
In reinforcement learning algorithms based on importance sampling (IS), such as proximal policy optimization (PPO), the IS weights are typically clipped to avoid large variance in learning. However, policy updates from clipped statistics induce large bias in tasks with high action dimensions, and the clipping bias makes it difficult to reuse old samples with large IS weights. In this paper, we consider PPO, a representative on-policy algorithm, and propose an improvement based on dimension-wise IS weight clipping, which clips the IS weight of each action dimension separately to avoid large bias and adaptively controls the IS weight to bound the policy update from the current policy. This new technique enables efficient learning in tasks with high action dimensions and the reuse of old samples in off-policy learning to improve sample efficiency. Numerical results show that the proposed new algorithm outperforms PPO and other RL algorithms in various Open AI Gym tasks.
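An illustrative NumPy sketch of the contrast (assumed factorized Gaussian policy; the exact combination rule used by the authors may differ):

```python
# Illustrative sketch (not the authors' implementation): standard PPO clipping uses a
# single joint-likelihood ratio per sample, while dimension-wise clipping computes and
# clips one ratio per action dimension of a factorized Gaussian policy.
import numpy as np

rng = np.random.default_rng(0)
act_dim, eps = 6, 0.2

# Per-dimension log-probabilities of one sampled action under old/new policies.
logp_old = rng.normal(-1.0, 0.3, size=act_dim)
logp_new = logp_old + rng.normal(0.0, 0.3, size=act_dim)
advantage = 1.5

# Standard PPO: one joint ratio, so a single outlier dimension dominates the clipping.
joint_ratio = np.exp(np.sum(logp_new - logp_old))
ppo_obj = np.minimum(joint_ratio * advantage,
                     np.clip(joint_ratio, 1 - eps, 1 + eps) * advantage)

# Dimension-wise clipping: clip each dimension's ratio separately, then combine
# (here by averaging the per-dimension clipped objectives).
dim_ratios = np.exp(logp_new - logp_old)
dim_obj = np.minimum(dim_ratios * advantage,
                     np.clip(dim_ratios, 1 - eps, 1 + eps) * advantage).mean()

print("joint-ratio PPO objective:       ", ppo_obj)
print("dimension-wise clipped objective:", dim_obj)
```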
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
In this paper, we propose a novel reinforcement learning algorithm that incorporates a stochastic variance-reduced version of the policy gradient for Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for: I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG and test them on continuous MDPs.
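Schematically (notation simplified from the usual SVRG template, not quoted from the paper), the variance-reduced gradient combines a mini-batch gradient at the current iterate with an importance-weighted correction toward the full gradient computed at a snapshot $\tilde{\theta}$:

```latex
v_t \;=\; \frac{1}{B}\sum_{\tau \in \mathcal{B}} \Big[ g(\tau \mid \theta_t) \;-\; \omega(\tau)\, g(\tau \mid \tilde{\theta}) \Big] \;+\; \hat{\nabla} J(\tilde{\theta}),
\qquad
\omega(\tau) \;=\; \frac{p(\tau \mid \tilde{\theta})}{p(\tau \mid \theta_t)},
\qquad
\theta_{t+1} \;=\; \theta_t + \alpha\, v_t
```

The importance weights $\omega(\tau)$ are what keep the correction term unbiased even though the trajectories $\mathcal{B}$ are sampled from the current policy rather than the snapshot policy.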
Model-free reinforcement learning (RL) methods have succeeded on a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and, to date, has succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy tends to exploit regions where insufficient data is available for the model, leading to instability in training. To overcome this issue, we propose to use an ensemble of models to maintain model uncertainty and regularize the learning process. We further show that the use of likelihood-ratio derivatives yields much more stable learning than backpropagation through time. Altogether, our approach, Model-Ensemble Trust-Region Policy Optimization (ME-TRPO), significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark tasks.
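A minimal sketch of how an ensemble of learned models can be used to generate imagined rollouts (toy linear models and a placeholder policy are assumptions for illustration, not the paper's code):

```python
# Minimal sketch: imagined rollouts under a model ensemble, resampling which model
# predicts each step so the policy cannot exploit any single model's errors.
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim, n_models, horizon = 3, 2, 5, 10

# Stand-in "learned" linear dynamics models; in practice these are neural networks
# trained on real transitions.
models = [(rng.normal(scale=0.3, size=(state_dim, state_dim)),
           rng.normal(scale=0.3, size=(state_dim, act_dim))) for _ in range(n_models)]

def policy(state):
    return np.tanh(state[:act_dim])          # placeholder policy

state = rng.normal(size=state_dim)
trajectory = [state]
for _ in range(horizon):
    A, B = models[rng.integers(n_models)]    # pick a random ensemble member per step
    state = A @ state + B @ policy(state)
    trajectory.append(state)

print("imagined rollout length:", len(trajectory))
```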
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on double Q-learning, limiting overestimation by taking the minimum value between a pair of critics. We draw the connection between target networks and overestimation bias, and suggest delayed policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
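For reference, the clipped double-Q target at the core of this approach is commonly written as below, with primed symbols denoting target networks (this sketch omits the target-policy smoothing noise used in the full algorithm):

```latex
y \;=\; r \;+\; \gamma\, \min_{i \in \{1, 2\}} Q_{\theta'_i}\!\big(s',\, \pi_{\phi'}(s')\big)
```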
In the pursuit of increasingly intelligent learning systems, abstraction plays a vital role in enabling sophisticated decisions to be made in complex environments. The options framework provides a formalism for such abstraction over sequences of decisions. However, most models require that options be given a priori, presumably specified by hand, which is neither efficient nor scalable. Indeed, it is preferable to learn options directly from interaction with the environment. Despite several efforts, this remains a difficult problem. In this work we develop a novel policy gradient method for the automatic learning of policies with options. This algorithm uses inference methods to simultaneously improve all of the options available to an agent, and thus can be employed in an off-policy manner, without observing option labels. The differentiable inference procedure employed yields options that can be easily interpreted. Empirical results confirm these attributes, and indicate that our algorithm has improved sample efficiency relative to the state of the art in learning options end-to-end.
Randomized value functions offer a promising approach to the challenge of efficient exploration in complex environments with high-dimensional state and action spaces. Unlike traditional point-estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine them with traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high-dimensional environments.
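A minimal sketch of approximate Thompson sampling with a sampled value function (a toy linear Q-function with a factorized Gaussian posterior is assumed here; the paper instead places multiplicative normalizing flows over DQN/DDPG parameters):

```python
# Minimal sketch: sample one set of value-function weights from an approximate
# posterior and act greedily under that sample (approximate Thompson sampling).
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_actions = 8, 4

# Approximate posterior over Q-function weights: one mean/std per action.
w_mean = rng.normal(size=(n_actions, feat_dim))
w_std = 0.1 * np.ones((n_actions, feat_dim))

def act(features):
    # Draw one weight sample per decision and pick the greedy action under it.
    w_sample = w_mean + w_std * rng.normal(size=w_mean.shape)
    return int(np.argmax(w_sample @ features))

print("chosen action:", act(rng.normal(size=feat_dim)))
```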