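Deep reinforcement learning has achieved great success thanks to Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), which improve its scalability and efficiency. However, the pessimism of both algorithms, which either constrain updates to a trust region or strictly exclude all suspicious gradients, has been shown to suppress exploration and harm agent performance. To address these issues, we propose a shifted Markov decision process (MDP), or rather, one with entropy augmentation, to encourage exploration and strengthen the ability to escape suboptima. Our method is extensible and adapts to either reward shaping or bootstrapping. Through a convergence analysis, we find that controlling the temperature coefficient is crucial. If it is tuned appropriately, however, the method achieves strong performance even when applied to other algorithms, since it is simple yet effective. Our experiments test augmented TRPO and PPO on MuJoCo benchmark tasks and indicate that the agent is driven toward higher-reward regions and strikes a balance between exploration and exploitation. We verify the exploration bonus of our method on two grid-world environments.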
translated by 谷歌翻译
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
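As a rough illustration of the maximum entropy objective described in the abstract above, the following Python sketch evaluates an entropy-augmented value for a single state with a small discrete action set. The discrete setting, the function names `soft_value` and `boltzmann_policy`, and the temperature parameter `alpha` are assumptions made for illustration; the abstract targets continuous control and gives no formulas.

```python
import math


def soft_value(q_values, probs, alpha):
    """Expected Q-value plus alpha times the policy entropy, for one state.

    Hypothetical helper: q_values[i] is Q(s, a_i), probs[i] is pi(a_i | s),
    and alpha is the (assumed) entropy temperature.
    """
    return sum(
        p * (q - alpha * math.log(p))
        for q, p in zip(q_values, probs)
        if p > 0.0  # actions with zero probability contribute nothing
    )


def boltzmann_policy(q_values, alpha):
    """Softmax of Q / alpha, the maximizer of soft_value over distributions."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]
    alpha = 0.5
    greedy = [0.0, 1.0, 0.0]
    soft = boltzmann_policy(q, alpha)
    # The stochastic policy trades a little expected reward for entropy.
    print("greedy objective:", soft_value(q, greedy, alpha))
    print("soft objective:  ", soft_value(q, soft, alpha))
```

In this toy setting the Boltzmann policy scores at least as high as the greedy one under the entropy-augmented objective, which is one way to read "succeed at the task while acting as randomly as possible".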
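One of the major difficulties of reinforcement learning is learning from {\em off-policy} samples, which are collected by a policy (the behavior policy) different from the one the algorithm evaluates (the target policy). Off-policy learning needs to correct the distribution of the samples from the behavior policy toward that of the target policy. Unfortunately, importance sampling has an inherent high-variance problem, which leads to poor gradient estimates in policy gradient methods. We focus on an off-policy actor-critic architecture and propose a new method, called Preconditioned Proximal Policy Optimization (P3O), which controls the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This preconditioning uses the sigmoid function in a special way, such that when there is no policy change the gradient is maximal, and hence the policy gradient drives a large parameter update for efficient exploration of the parameter space}. This is a novel exploration method that has not been studied before, given that existing exploration methods are based on the novelty of states and actions. We compare with several best-performing algorithms on both discrete and continuous tasks, and the results show that {\em PPO is insufficient in off-policyness} and that our P3O is {\em more off-policy} than PPO according to the "off-policyness" measured by the DEON metric; P3O also explores a larger policy space than PPO. The results further show that P3O maximizes the CPI objective better than PPO during training.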
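We present an analytical policy update rule that is independent of parameterized function approximators. The update rule applies to general stochastic policies and comes with a monotonic improvement guarantee. It is derived from a closed-form trust-region solution obtained with the calculus of variations, following a new theoretical result that tightens existing bounds for policy search using trust-region methods. We provide an explanation that connects the policy update rule with value-function methods. Based on a recursive form of the update rule, an off-policy algorithm is derived naturally, and the monotonic improvement guarantee still holds. Furthermore, the update rule extends immediately to multi-agent systems when updates are performed by one agent at a time.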
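Soft Actor-Critic (SAC) is one of the state-of-the-art off-policy reinforcement learning (RL) algorithms within the maximum entropy RL framework. SAC has been shown to perform very well on a range of continuous control tasks, with good stability and robustness. SAC learns a stochastic Gaussian policy that maximizes a trade-off between the expected reward and the policy entropy. To update the policy, SAC minimizes the KL divergence between the current policy density and the soft value function density. The reparameterization trick is then used to obtain an approximate gradient of this divergence. In this paper, we propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO), which uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC. The initial idea is to use CEM to iteratively sample the distribution closest to the soft value function density and to use the resulting distribution as a target for updating the policy network. To reduce the computational complexity, we also introduce a decoupled policy structure that splits the Gaussian policy into one policy that learns the mean and another that learns the deviation, so that only the mean policy is trained by CEM. We show that this decoupled policy structure does converge to an optimal solution, and we also demonstrate experimentally that SAC-CEPO achieves competitive performance against the original SAC.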
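The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. This paper presents a novel DRL framework, termed \emph{$f$-Divergence Reinforcement Learning (FRL)}. In FRL, the policy evaluation and policy improvement phases are performed simultaneously by minimizing the $f$-divergence between the learning policy and the sampling policy, which differs from conventional DRL algorithms that aim to maximize the expected cumulative reward. We prove theoretically that minimizing this $f$-divergence makes the learning policy converge to the optimal policy. In addition, we convert the process of training agents in the FRL framework into a saddle-point optimization problem with a specific $f$ function through the Fenchel conjugate, which yields new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, we demonstrate that the FRL framework has two advantages: (1) the policy evaluation and policy improvement processes are performed simultaneously, and (2) the problem of overestimating the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games and show that agents trained in the FRL framework match or surpass the baseline DRL algorithms.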
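In real-world decision-making situations (e.g., finance, robotics, autonomous driving), controlling risk is often more important than maximizing the expected reward. The most natural choice of risk measure is the variance, but it penalizes upside volatility as much as the downside part. Instead, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is better suited to risk-averse purposes. This paper aims to optimize the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. steady rewards. Since the semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods are not directly applicable to MSV problems. To tackle this challenge, we resort to perturbation analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct diverse experiments, from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods.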
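In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the soft actor-critic (SAC) algorithm, which implements maximum entropy RL in model-free, sample-based learning. Whereas maximum entropy RL guides learning so that policies reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and to maximize the entropy of these low-entropy states, in order to promote better exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on the disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields a drastic performance improvement over current state-of-the-art RL algorithms.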
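Despite the increasing popularity of policy gradient methods, they are not yet widely used in sample-scarce applications such as robotics. Sample efficiency can be improved by making the best use of the available information. As a key component of reinforcement learning, the reward function is usually designed carefully to guide the agent. Hence, the reward function is usually known, which gives access not only to scalar reward signals but also to reward gradients. To benefit from reward gradients, previous works require knowledge of the environment dynamics, which is hard to obtain. In this work, we develop the \textit{Reward Policy Gradient} estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.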
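Asset allocation (or portfolio management) is the task of determining how to optimally allocate the funds of a finite budget across a range of financial instruments/assets such as stocks. This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents. We also compared the RL agents among themselves to understand which classes of agents perform better. From our analysis, RL agents can perform the task of portfolio management, since they significantly outperformed the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall. This shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than other types of agents. Also, on-policy agents performed better than off-policy agents, because they are better at policy evaluation and sample efficiency is not a significant problem in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy, actor-critic RL agents showed the most promise.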
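Most reinforcement learning algorithms optimize the discounted criterion, which is beneficial for accelerating convergence and reducing the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as financial problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory covering the discounted and average criteria, and derive a novel performance bound within the trust region using perturbation analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach. Our work provides a unified trust region approach that includes both the discounted and average criteria, which may complement the framework of reinforcement learning beyond the discounted objective.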
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions. Preprint. Under review.
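To make the idea of "a Bellman error objective plus a simple Q-value regularizer" in the abstract above more concrete, here is a minimal Python sketch for a single transition with discrete actions. The specific regularizer used here (log-sum-exp of the Q-values minus the Q-value of the dataset action) is an assumption based on one commonly described variant, and the names `cql_loss`, `td_target`, and `alpha` are hypothetical; the abstract does not give the exact form.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cql_loss(q_values, action, td_target, alpha=1.0):
    """Per-sample loss: squared Bellman error plus a conservative penalty.

    q_values:  Q(s, a) for every discrete action a (list of floats)
    action:    index of the action stored in the offline dataset
    td_target: bootstrapped target, e.g. r + gamma * max_a' Q_target(s', a')
    alpha:     weight of the regularizer (assumed hyperparameter)
    """
    q_taken = q_values[action]
    bellman_error = 0.5 * (q_taken - td_target) ** 2
    # Pushes down Q-values over all actions and pushes up the dataset action.
    conservative_penalty = logsumexp(q_values) - q_taken
    return bellman_error + alpha * conservative_penalty


if __name__ == "__main__":
    # A large Q-value on an action that is absent from the data is penalized.
    print(cql_loss([1.0, 5.0], action=0, td_target=1.0))
    print(cql_loss([1.0, -5.0], action=0, td_target=1.0))
```

The penalty is always non-negative and grows when actions outside the dataset receive high Q-values, which is the intuition behind learning a lower bound rather than an overestimate.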
The study of decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies show that independent PPO (IPPO) can obtain good performance, close to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, a decentralized actor-critic method with a convergence guarantee remains an open problem. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and a convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of the joint policy can be guaranteed by each agent \textit{independently} optimizing the surrogate. In practice, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, and fully and partially observable environments. The results show that DPO outperforms IPPO in most tasks, which provides evidence for our theoretical results.
Reinforcement learning (RL) has gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require a large number of samples to perform well since agents eschew uncertainty in the inferred state for the decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplements reward-maximising (exploitative) behaviour in RL with information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
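Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective of the original Markov decision process (MDP). Divergence regularization has been proposed to resolve this problem, but it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Theoretically, we derive the update rule of DMAC, which is naturally off-policy and guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also give a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. DMAC is a flexible framework and can be combined with many existing MARL algorithms. Empirically, we evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge, and show that DMAC substantially improves the performance of existing MARL algorithms.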
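We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative multi-agent reinforcement learning (MARL), which holds even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, namely Independent Proximal Policy Optimization (IPPO) and Multi-Agent PPO (MAPPO), both of which rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing a trust region constraint over all decentralized policies. We also show that this trust region constraint can be effectively enforced in a principled way by bounding the independent ratios based on the number of agents in training, which provides a theoretical foundation for proximal ratio clipping. Moreover, we show that the surrogate objectives optimized in IPPO and MAPPO are essentially equivalent when their critics converge to a fixed point. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and that good values of the hyperparameters for this enforcement are highly sensitive to the number of agents, as predicted by our theoretical analysis.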
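Policy optimization is a fundamental principle for designing reinforcement learning algorithms. One example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been widely used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has not yet been justified by theoretical proof. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem with hinge loss and offers a generalized version of the PPO-Clip objective. (ii) Based on this viewpoint, we propose a two-step policy improvement scheme, which facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization with the help of entropic mirror descent and a regression-based policy update scheme. Moreover, our theoretical results provide the first characterization of the effect of the clipping mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the reinterpretation of PPO-Clip and the generalized objective with various classifiers on various RL benchmark tasks.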
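For readers unfamiliar with the clipped surrogate objective mentioned in the abstract above, the following Python sketch computes it for a single sample. It uses the form commonly given in the PPO literature, min(r * A, clip(r, 1 - eps, 1 + eps) * A); the abstract does not write the formula out, so treat the function name `ppo_clip_objective` and the parameter `eps` as illustrative assumptions. The hinge-loss reinterpretation proposed in the paper is not reproduced here.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    ratio:     pi_new(a | s) / pi_old(a | s), the probability ratio
    advantage: estimated advantage of the sampled action
    eps:       clipping range (assumed hyperparameter)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum keeps the objective a pessimistic bound: the policy
    # gains nothing from moving the ratio further outside the clip range in
    # the direction that the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 1.5):
        for advantage in (1.0, -1.0):
            value = ppo_clip_objective(ratio, advantage)
            print(f"ratio={ratio}, advantage={advantage}, objective={value}")
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + eps, and with a negative advantage it stops improving once the ratio drops below 1 - eps, which is the clipping mechanism whose effect on convergence the paper sets out to characterize.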
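Safe reinforcement learning (RL) studies problems in which an intelligent agent must not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys a rigorous safety guarantee. Central to the development of CUP are the newly proposed surrogate functions along with the performance bound. Compared to previous safe RL methods, CUP has the following benefits: 1) CUP generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) CUP unifies performance bounds, providing a better understanding and interpretability of some existing algorithms; 3) CUP provides a non-convex implementation using only first-order optimizers, which does not require any strong approximation of the convexity of the objectives. To validate our CUP method, we compared CUP against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction. We have released the source code of CUP at https://github.com/rl-boxes/safe-rl/tree/main/cup.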
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.