我们提出了一个与参数化函数近似器无关的分析策略更新规则。更新规则适用于单调改进保证的一般随机策略。在使用信任区域方法中收紧策略搜索的新的理论结果之后,更新规则源自使用变化阶段的闭合表单信任区域解决方案。提供了策略更新规则和值函数方法之间连接的解释。基于更新规则的递归形式,自然导出了脱助策略算法,单调改进保证仍然存在。此外,当一次代理执行更新时,更新规则立即扩展到多代理系统。
translated by 谷歌翻译
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
translated by 谷歌翻译
通过信任区域政策优化(TRPO)和近端策略优化(PPO)的存在,深入的强化学习取得了很大的成功,以提高其可扩展性和效率。但是,两种算法的悲观情绪,其中包括在信托区域受到限制或严格排除所有可疑梯度,已被证明可以抑制探索和损害代理的性能。为了解决这些问题,我们提出了一个转移的马尔可夫决策过程(MDP),或者更确切地说,随着熵的增强,以鼓励探索并增强逃脱次级的能力。我们的方法是可扩展的,可以适应奖励成型或自举。通过进行收敛分析,我们发现控制温度系数至关重要。但是,如果适当地调整它,即使在其他算法上,我们也可以实现出色的性能,因为它很简单而有效。我们的实验测试在Mujoco基准任务上增强了TRPO和PPO,这表明该代理商对更高的奖励区域表示振奋,并且在探索和剥削之间取得了平衡。我们验证方法在两个网格世界环境上的探索加成。
translated by 谷歌翻译
For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (
translated by 谷歌翻译
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an offpolicy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
translated by 谷歌翻译
深度加强学习(DRL)的框架为连续决策提供了强大而广泛适用的数学形式化。本文提出了一种新的DRL框架,称为\ emph {$ f $-diveliventcence加强学习(frl)}。在FRL中,通过最大限度地减少学习政策和采样策略之间的$ F $同时执行策略评估和政策改进阶段,这与旨在最大化预期累计奖励的传统DRL算法不同。理论上,我们证明最小化此类$ F $ - 可以使学习政策会聚到最佳政策。此外,我们将FRL框架中的培训代理程序转换为通过Fenchel Concugate的特定$ F $函数转换为鞍点优化问题,这构成了政策评估和政策改进的新方法。通过数学证据和经验评估,我们证明FRL框架有两个优点:(1)政策评估和政策改进过程同时进行,(2)高估价值函数的问题自然而缓解。为了评估FRL框架的有效性,我们对Atari 2600的视频游戏进行实验,并显示在FRL框架中培训的代理匹配或超越基线DRL算法。
translated by 谷歌翻译
The study of decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recently empirical studies show that independent PPO (IPPO) can obtain good performance, close to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, decentralized actor-critic with convergence guarantee is still open. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of joint policy can be guaranteed by each agent \textit{independently} optimizing the surrogate. In practice, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments. The results show DPO outperforms IPPO in most tasks, which can be the evidence for our theoretical results.
translated by 谷歌翻译
现实世界的顺序决策需要数据驱动的算法,这些算法在整个培训中为性能提供实际保证,同时还可以有效利用数据。无模型的深入强化学习代表了此类数据驱动决策的框架,但是现有算法通常只关注其中一个目标,同时牺牲了相对于另一个目标。政策算法确保整个培训的政策改进,但遭受了较高的样本复杂性,而政策算法则可以通过样本重用,但缺乏理论保证来有效利用数据。为了平衡这些竞争目标,我们开发了一系列广义政策改进算法,这些算法结合了政策改进的政策保证和理论支持的样本重用的效率。我们通过对DeepMind Control Suite的各种连续控制任务进行广泛的实验分析来证明这种新算法的好处。
translated by 谷歌翻译
Reinforcement learning (RL) gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require large number of samples to perform well since agents eschew uncertainty in the inferred state for the decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplies reward-maximising (exploitative) behaviour in RL, with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partial observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
translated by 谷歌翻译
政策梯度(PG)算法是备受期待的强化学习对现实世界控制任务(例如机器人技术)的最佳候选人之一。但是,每当必须在物理系统上执行学习过程本身或涉及任何形式的人类计算机相互作用时,这些方法的反复试验性质就会提出安全问题。在本文中,我们解决了一种特定的安全公式,其中目标和危险都以标量奖励信号进行编码,并且学习代理被限制为从不恶化其性能,以衡量为预期的奖励总和。通过从随机优化的角度研究仅行为者的政策梯度,我们为广泛的参数政策建立了改进保证,从而将现有结果推广到高斯政策上。这与策略梯度估计器的差异的新型上限一起,使我们能够识别出具有很高概率的单调改进的元参数计划。两个关键的元参数是参数更新的步长和梯度估计的批处理大小。通过对这些元参数的联合自适应选择,我们获得了具有单调改进保证的政策梯度算法。
translated by 谷歌翻译
我们提供了一种新的单调改进保证,以优化合作多代理增强学习(MARL)中的分散政策,即使过渡动态是非平稳的。这项新分析提供了对两种最新的MARL参与者批评方法的强劲表现的理论理解,即独立的近端策略优化(IPPO)和多代理PPO(MAPPO)(MAPPO),它们都依赖于独立比率,即计算概率,每个代理商的政策分别比率。我们表明,尽管独立比率引起的非平稳性,但由于对所有分散政策的信任区域约束,仍会产生单调的改进保证。我们还可以根据培训中的代理数量来界定独立比率,从而以原则性的方式有效地执行这种信任区域约束,从而为近端剪辑提供了理论基础。此外,我们表明,当IPPO和Mappo中优化的替代目标在批评者收敛到固定点时实质上是等效的。最后,我们的经验结果支持以下假设:IPPO和MAPPO的强劲表现是通过削减集中式培训来执行这种信任区域约束的直接结果,而该执行的超参数的良好值对此对此具有高度敏感性正如我们的理论分析所预测的那样。
translated by 谷歌翻译
安全的加强学习(RL)旨在学习在将其部署到关键安全应用程序中之前满足某些约束的政策。以前的原始双重风格方法遭受了不稳定性问题的困扰,并且缺乏最佳保证。本文从概率推断的角度克服了问题。我们在政策学习过程中介绍了一种新颖的期望最大化方法来自然纳入约束:1)在凸优化(E-step)后,可以以封闭形式计算可证明的最佳非参数变异分布; 2)基于最佳变异分布(M-step),在信任区域内改进了策略参数。提出的算法将安全的RL问题分解为凸优化阶段和监督学习阶段,从而产生了更稳定的培训性能。对连续机器人任务进行的广泛实验表明,所提出的方法比基线获得了更好的约束满意度和更好的样品效率。该代码可在https://github.com/liuzuxin/cvpo-safe-rl上找到。
translated by 谷歌翻译
资产分配(或投资组合管理)是确定如何最佳将有限预算的资金分配给一系列金融工具/资产(例如股票)的任务。这项研究调查了使用无模型的深RL代理应用于投资组合管理的增强学习(RL)的性能。我们培训了几个RL代理商的现实股票价格,以学习如何执行资产分配。我们比较了这些RL剂与某些基线剂的性能。我们还比较了RL代理,以了解哪些类别的代理表现更好。从我们的分析中,RL代理可以执行投资组合管理的任务,因为它们的表现明显优于基线代理(随机分配和均匀分配)。四个RL代理(A2C,SAC,PPO和TRPO)总体上优于最佳基线MPT。这显示了RL代理商发现更有利可图的交易策略的能力。此外,基于价值和基于策略的RL代理之间没有显着的性能差异。演员批评者的表现比其他类型的药物更好。同样,在政策代理商方面的表现要好,因为它们在政策评估方面更好,样品效率在投资组合管理中并不是一个重大问题。这项研究表明,RL代理可以大大改善资产分配,因为它们的表现优于强基础。基于我们的分析,在政策上,参与者批评的RL药物显示出最大的希望。
translated by 谷歌翻译
我们呈现协调的近端策略优化(COPPO),该算法将原始近端策略优化(PPO)扩展到多功能代理设置。关键的想法在于多个代理之间的策略更新过程中的步骤大小的协调适应。当优化理论上接地的联合目标时,我们证明了政策改进的单调性,并基于一组近似推导了简化的优化目标。然后,我们解释了Coppo中的这种目标可以在代理商之间实现动态信用分配,从而减轻了代理政策的同时更新期间的高方差问题。最后,我们证明COPPO优于几种强大的基线,并且在典型的多代理设置下,包括最新的多代理PPO方法(即MAPPO),包括合作矩阵游戏和星际争霸II微管理任务。
translated by 谷歌翻译
尽管政策梯度方法的普及日益越来越大,但它们尚未广泛用于样品稀缺应用,例如机器人。通过充分利用可用信息,可以提高样本效率。作为强化学习中的关键部件,奖励功能通常仔细设计以引导代理商。因此,奖励功能通常是已知的,允许访问不仅可以访问标量奖励信号,而且允许奖励梯度。为了从奖励梯度中受益,之前的作品需要了解环境动态,这很难获得。在这项工作中,我们开发\ Textit {奖励政策梯度}估计器,这是一种新的方法,可以在不学习模型的情况下整合奖励梯度。绕过模型动态允许我们的估算器实现更好的偏差差异,这导致更高的样本效率,如经验分析所示。我们的方法还提高了在不同的Mujoco控制任务上的近端策略优化的性能。
translated by 谷歌翻译
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
translated by 谷歌翻译
在现实世界中的决策情况(例如金融,机器人技术,自动驾驶等)中,控制风险通常比最大程度地提高预期奖励更为重要。风险措施的最自然选择是差异,而它会惩罚上升波动率作为下行部分。取而代之的是,(下行)半变量捕获了随机变量在其平均值下的负偏差,更适合于规避风险的提议。本文旨在优化加强学习W.R.T.中的平均持续性(MSV)标准。稳定的奖励。由于半变量是时间的,并且不满足标准的贝尔曼方程,因此传统的动态编程方法直接不适合MSV问题。为了应对这一挑战,我们求助于扰动分析(PA)理论,并建立MSV的性能差异公式。我们揭示MSV问题可以通过迭代解决与策略有关的奖励功能的一系列RL问题来解决。此外,我们根据政策梯度理论和信任区域方法提出了两种派利算法。最后,我们进行了不同的实验,从简单的匪徒问题到穆约科的连续控制任务,这些实验证明了我们提出的方法的有效性。
translated by 谷歌翻译
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.Preprint. Under review.
translated by 谷歌翻译
熵正则化是增强学习(RL)的流行方法。尽管它具有许多优势,但它改变了原始马尔可夫决策过程(MDP)的RL目标。尽管已经提出了差异正则化来解决这个问题,但不能微不足道地应用于合作的多代理增强学习(MARL)。在本文中,我们研究了合作MAL中的差异正则化,并提出了一种新型的非政策合作MARL框架,差异性的多代理参与者 - 参与者(DMAC)。从理论上讲,我们得出了DMAC的更新规则,该规则自然存在,并保证了原始MDP和Divergence regullatized MDP的单调政策改进和收敛。我们还给出了原始MDP中融合策略和最佳策略之间的差异。 DMAC是一个灵活的框架,可以与许多现有的MARL算法结合使用。从经验上讲,我们在教学随机游戏和Starcraft Multi-Agent挑战中评估了DMAC,并表明DMAC显着提高了现有的MARL算法的性能。
translated by 谷歌翻译
为了在许多因素动态影响输出轨迹的复杂随机系统上学习,希望有效利用从以前迭代中收集的历史样本中的信息来加速策略优化。经典的经验重播使代理商可以通过重复使用历史观察来记住。但是,处理所有观察结果的统一重复使用策略均忽略了不同样本的相对重要性。为了克服这一限制,我们提出了一个基于一般差异的经验重播(VRER)框架,该框架可以选择性地重复使用最相关的样本以改善策略梯度估计。这种选择性机制可以自适应地对过去的样品增加重量,这些样本更可能由当前目标分布产生。我们的理论和实证研究表明,提议的VRER可以加速学习最佳政策,并增强最先进的政策优化方法的性能。
translated by 谷歌翻译