Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
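For reference, the maximum entropy objective that this abstract describes in words is usually written as follows. This is the standard form from the maximum entropy RL literature, not a formula quoted from the abstract itself; the temperature $\alpha$ and the state-action marginal $\rho_\pi$ are notation assumed here.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ \log \pi(a \mid s) \right]
```

Setting $\alpha = 0$ recovers the conventional expected-return objective; larger $\alpha$ weights the entropy term more heavily and so favors more random policies.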
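Soft Actor-Critic (SAC) is one of the state-of-the-art off-policy reinforcement learning (RL) algorithms within the maximum entropy RL framework. SAC has been shown to perform very well on a range of continuous control tasks, with good stability and robustness. SAC learns a stochastic Gaussian policy that maximizes a trade-off between the expected reward and the policy entropy. To update the policy, SAC minimizes the KL divergence between the current policy density and the soft value function density. The reparameterization trick is then used to obtain an approximate gradient of this divergence. In this paper, we propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO), which uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC. The initial idea is to use CEM to iteratively sample the distribution closest to the soft value function density and to use the resulting distribution as a target for updating the policy network. To reduce computational complexity, we also introduce a decoupled policy structure that splits the Gaussian policy into one policy that learns the mean and another that learns the deviation, so that only the mean policy is trained by CEM. We show that this decoupled policy structure does converge to an optimum, and we also demonstrate experimentally that SAC-CEPO achieves performance competitive with the original SAC.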
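The cross-entropy method mentioned above can be illustrated with a minimal one-dimensional sketch. It is a generic CEM loop, not the SAC-CEPO procedure: the `score` function stands in for whatever quantity is being maximized (in the paper's setting, presumably a soft value of an action), and the function name, sample counts, and the toy quadratic objective are all assumptions made for illustration.

```python
import random
import statistics


def cross_entropy_method(score, mean=0.0, std=2.0, n_samples=50,
                         n_elite=10, n_iters=30, seed=0):
    """Maximise `score` over a scalar by repeatedly refitting a Gaussian.

    Each iteration samples candidates from N(mean, std), keeps the
    highest-scoring `n_elite` of them, and refits mean and std to that
    elite set. All parameter names and defaults are illustrative.
    """
    rng = random.Random(seed)
    for _ in range(n_iters):
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        mean = statistics.fmean(elite)
        # Small floor keeps the sampling distribution from collapsing to zero width.
        std = statistics.pstdev(elite) + 1e-6
    return mean


# Toy objective (hypothetical): a quadratic with its maximum at a = 1.5.
best_action = cross_entropy_method(lambda a: -(a - 1.5) ** 2)
```

CEM needs only the ability to evaluate `score`, not its gradient, which is one reason it can serve as an alternative to gradient-based policy updates.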
We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
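To make the Boltzmann form of the policy concrete, here is a small sketch for a finite action set. The abstract targets continuous actions, where this normalization is intractable and a sampling network is used instead, so the discrete setting, the function names, and the `temperature` parameter are simplifying assumptions.

```python
import math


def boltzmann_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q / temperature)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def soft_value(q_values, temperature=1.0):
    """Soft state value: temperature * log sum_a exp(Q(s, a) / temperature)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


probs = boltzmann_policy([1.0, 2.0, 3.0])
value = soft_value([1.0, 2.0, 3.0])
```

As the temperature approaches zero, the policy concentrates on the greedy action and the soft value approaches the ordinary maximum over Q-values.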
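Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments, in order to improve the system's utility, decrease the overall cost, and increase the probability of mission success. Deep reinforcement learning (DRL) helps organize agents' behaviors and actions based on their state and represents complex strategies (compositions of actions). This paper proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining that separates an intricate policy into several simple sub-policies and organizes their relationships as Bayesian strategy networks (BSN). We integrate this approach into a state-of-the-art DRL method, soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare the proposed BSAC method with SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the standard continuous control benchmarks (Hopper-v2, Walker2d-v2, and Humanoid-v2) in MuJoCo with the OpenAI Gym environment. The results demonstrate the promising potential of the BSAC method to significantly improve training efficiency. The open-source code for BSAC can be accessed at https://github.com/herolab-uga/bsac.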
Reinforcement learning (RL) has gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require a large number of samples to perform well, since agents eschew uncertainty in the inferred state for decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplements the reward-maximising (exploitative) behaviour of RL with information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
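The idea of taking the minimum over a pair of critics can be sketched as a one-step target computation. This shows only that single step under simplifying assumptions (scalar inputs, no target-policy smoothing, no delayed updates); the function and argument names are hypothetical.

```python
def clipped_double_q_target(reward, done, next_q1, next_q2, gamma=0.99):
    """One-step TD target that bootstraps from the smaller of two critic estimates.

    `next_q1` and `next_q2` are the two target critics' values for the next
    state-action pair; using their minimum limits overestimation.
    """
    min_next_q = min(next_q1, next_q2)
    not_done = 0.0 if done else 1.0  # no bootstrapping past a terminal state
    return reward + gamma * not_done * min_next_q


target = clipped_double_q_target(reward=1.0, done=False,
                                 next_q1=10.0, next_q2=8.0, gamma=0.5)
```

Both critics would then be regressed toward the same `target`, so an optimistic error in one critic cannot inflate the other's training signal.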
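In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the soft actor-critic (SAC) algorithm, which implements maximum entropy RL in model-free, sample-based learning. Whereas maximum entropy RL guides policy learning toward states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and to maximize the entropy of these low-entropy states in order to promote better exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on the disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvements over current state-of-the-art RL algorithms.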
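Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world domains. We address this problem by proposing a novel model-free algorithm, Realistic Actor-Critic (RAC), which aims to resolve the trade-off between value underestimation and overestimation by learning a family of policies associated with various confidence bounds of the Q-function. We construct uncertainty punished Q-learning (UPQ), which uses the uncertainty from an ensemble of multiple critics to control the estimation bias of the Q-function, making Q-functions shift smoothly from lower to higher confidence bounds. Guided by these critics, RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn many optimistic and pessimistic policies with the same neural network. Optimistic policies generate effective exploratory behavior, while pessimistic policies reduce the risk of value overestimation to ensure stable updates of the policies and Q-functions. The proposed method can be incorporated into any off-policy actor-critic RL algorithm. Our method achieves 10x sample efficiency and a 25% performance improvement compared to SAC on the most challenging Humanoid environment, obtaining an episode reward of 11107 at $10^6$ time steps. All source code is available at https://github.com/ihuhuhu/rac.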
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
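A rough sketch of how a Q-value regularizer can be added to a Bellman error, for a single transition with a finite action set. The log-sum-exp form used here is one commonly described CQL variant; the abstract does not specify the regularizer, so the exact form, the `alpha` weight, and all function names are assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_penalty(q_values, dataset_action):
    """Push down Q over all actions (via log-sum-exp), push up Q of the logged action."""
    return logsumexp(q_values) - q_values[dataset_action]


def conservative_q_loss(q_values, dataset_action, td_target, alpha=1.0):
    """Squared Bellman error plus alpha times the conservative penalty."""
    bellman_error = 0.5 * (q_values[dataset_action] - td_target) ** 2
    return bellman_error + alpha * conservative_penalty(q_values, dataset_action)


# Q-values for three actions in one state; action 0 is the one in the dataset.
loss = conservative_q_loss([1.0, 4.0, 2.0], dataset_action=0, td_target=1.0)
```

The penalty is never negative and grows when actions absent from the dataset have high Q-values, which is the overestimation the abstract attributes to distributional shift.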
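The framework of deep reinforcement learning (DRL) provides a powerful and widely applicable mathematical formalization for sequential decision-making. This paper presents a novel DRL framework, termed $f$-Divergence Reinforcement Learning (FRL). In FRL, the policy evaluation and policy improvement phases are performed simultaneously by minimizing the $f$-divergence between the learning policy and the sampling policy, which is distinct from conventional DRL algorithms that aim to maximize the expected cumulative reward. We theoretically prove that minimizing such an $f$-divergence can make the learning policy converge to the optimal policy. In addition, we convert the process of training agents in the FRL framework into a saddle-point optimization problem with a specific $f$ function through the Fenchel conjugate, which forms new methods for policy evaluation and policy improvement. Through mathematical proofs and empirical evaluation, we demonstrate that the FRL framework has two advantages: (1) the policy evaluation and policy improvement processes are performed simultaneously, and (2) the problem of overestimating the value function is naturally alleviated. To evaluate the effectiveness of the FRL framework, we conduct experiments on Atari 2600 video games and show that agents trained in the FRL framework match or surpass the baseline DRL algorithms.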
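For readers unfamiliar with $f$-divergences, the sketch below computes the textbook definition $D_f(P \| Q) = \sum_x q(x)\, f\big(p(x)/q(x)\big)$ for discrete distributions. It illustrates only the divergence family, not the FRL training procedure; the function names and the two example generators are chosen here for illustration.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)


def kl_generator(t):
    """f(t) = t log t yields the KL divergence KL(P || Q)."""
    return t * math.log(t) if t > 0 else 0.0


def tv_generator(t):
    """f(t) = |t - 1| / 2 yields the total variation distance."""
    return 0.5 * abs(t - 1.0)


p = [0.5, 0.5]
q = [0.9, 0.1]
kl = f_divergence(p, q, kl_generator)
tv = f_divergence(p, q, tv_generator)
```

Any convex `f` with `f(1) == 0` gives a valid divergence that is zero when the two distributions coincide, which is why a single framework can cover several choices of distance between policies.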
While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we view reinforcement learning as inferring policies that achieve desired outcomes, rather than as a problem of maximizing rewards. To solve this inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to hand-craft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors.
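Asset allocation (or portfolio management) is the task of determining how to optimally allocate the funds of a finite budget across a range of financial instruments or assets such as stocks. This study investigated the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents. We also compared the RL agents among themselves to understand which classes of agents performed better. From our analysis, RL agents can perform the task of portfolio management, since they significantly outperformed two of the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall. This shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than other types of agents. Also, on-policy agents performed better than off-policy agents, because they are better at policy evaluation and sample efficiency is not a significant problem in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperform strong baselines. Based on our analysis, on-policy, actor-critic RL agents showed the most promise.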
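Deep reinforcement learning has achieved great success with trust region policy optimization (TRPO) and proximal policy optimization (PPO), owing to their scalability and efficiency. However, the pessimism of both algorithms, which either constrain updates to a trust region or strictly exclude all suspicious gradients, has been shown to suppress exploration and harm the agent's performance. To address these issues, we propose a shifted Markov decision process (MDP), or more precisely an MDP with entropy augmentation, to encourage exploration and strengthen the ability to escape from suboptimal solutions. Our method is extensible and adapts to either reward shaping or bootstrapping. From a convergence analysis, we find that controlling the temperature coefficient is crucial. If it is tuned appropriately, however, the method achieves remarkable performance, even when applied to other algorithms, since it is simple yet effective. Our experiments test the augmented TRPO and PPO on MuJoCo benchmark tasks, indicating that the agent is encouraged toward higher-reward regions and achieves a balance between exploration and exploitation. We verify the exploration bonus of our method on two grid-world environments.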
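Effective reinforcement learning requires a proper balance of exploration and exploitation, defined by the dispersion of the action distribution. However, this balance depends on the task, the current stage of the learning process, and the current environment state. Existing methods that designate the dispersion of the action distribution require problem-dependent hyperparameters. In this paper, we propose to designate the dispersion of the action distribution automatically using the following principle: the distribution should have sufficient dispersion to enable the evaluation of future policies. To that end, the dispersion should be tuned to ensure sufficiently high probabilities (densities) of the actions in the replay buffer and of the modes of the distributions that generated them, but the dispersion should not be any higher. This way, a policy can be effectively evaluated based on the actions in the buffer, while the exploratory randomness in actions decreases as the policy converges. The above principle is verified on the challenging benchmarks Ant, HalfCheetah, Hopper, and Walker2D, with good results. Our method makes the action standard deviations converge to values similar to those resulting from trial-and-error optimization.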
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces both the speed of learning and the achieved reward. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level policy. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablation study to analyze the effects of entropy on the hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
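Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) approaches that use previously gathered data can improve sampling efficiency. However, off-policy learning becomes challenging when the discrepancy between the distribution of the policy of interest and the distributions of the policies that collected the data increases. Although well-studied importance sampling and off-policy policy gradient techniques have been proposed to compensate for this discrepancy, they usually require a collection of long trajectories, which increases computational complexity and induces additional problems such as vanishing or exploding gradients. Moreover, their generalization to continuous action domains is strictly limited because they require action probabilities, which is unsuitable for deterministic policies. To overcome these limitations, we introduce an alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), to mitigate the potential drawbacks introduced by previously collected data. Through a novel discrepancy measure computed from the agent's most recent action decisions on the states of a randomly sampled batch of transitions, the approach does not require actual or estimated action probabilities for any policy and offers an adequate one-step importance sampling. Theoretical results show that the introduced approach can achieve a contraction mapping with a unique fixed point, which allows "safe" off-policy learning. Our empirical results suggest that AC-Off-POC consistently improves on the state of the art and attains higher returns in fewer steps than competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.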
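Maximum entropy reinforcement learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant trade-off coefficient (temperature), contrary to the intuition that the temperature should be high early in training, to avoid overfitting to noisy value estimates, and should decrease later in training, as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. In this paper, we present a simple state-based temperature scheduling approach and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results.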
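The policy gradient theorem (Sutton et al., 2000) prescribes the use of the cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.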
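A widely studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with probability proportional to their temporal-difference (TD) error. Although PER has been shown to be one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms when used with actor-critic algorithms in continuous control. We show theoretically that actor networks cannot be trained effectively with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also addresses stability issues and recent findings about the causes of PER's poor empirical performance. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both the actor and critic networks. An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms competing approaches and obtains state-of-the-art results compared with standard off-policy actor-critic algorithms.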
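As background for the sampling rule the abstract refers to, here is a minimal sketch of standard proportional prioritization, in which a transition's sampling probability grows with the magnitude of its TD error. It illustrates baseline PER only, not the sampling framework this paper introduces; the exponent `alpha`, the small constant `eps`, and the function names are assumptions following common practice.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.6, seed=0):
    """Draw replay-buffer indices (with replacement) according to the priorities."""
    rng = random.Random(seed)
    probs = per_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


td_errors = [0.1, 2.0, 0.5, 0.0]
probs = per_probabilities(td_errors)
batch = sample_indices(td_errors, batch_size=8)
```

With `alpha=0` the rule reduces to uniform sampling; larger values focus the batch more heavily on high-error transitions, which is the regime the abstract argues is problematic for training the actor.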
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
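The "expected gradient of the action-value function" described above is commonly written as follows. The notation is assumed here rather than taken from the abstract: $\mu_\theta$ is the deterministic policy, $Q^{\mu}$ its action-value function, and $\rho^{\mu}$ the discounted state distribution it induces.

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}
    \left[ \nabla_\theta \mu_\theta(s) \;
           \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right]
```

Because the expectation is taken over states only, and not over actions as in the stochastic policy gradient, fewer samples may be needed to estimate it, which is consistent with the efficiency claim in the abstract.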