钢筋学习的最新进展证明了其在超级人类水平上解决硬质孕代环境互动任务的能力。然而,由于大多数RL最先进的算法的样本低效率,即,需要大量培训集,因此在实际和现实世界任务中的应用目前有限。例如,在Dota 2中击败人类参与者的Openai五种算法已经训练了数千年的游戏时间。存在解决样本低效问题的几种方法,可以通过更好地探索环境来提供更有效的使用或旨在获得更相关和多样化的经验。然而,为了我们的知识,没有用于基于模型的算法的这种方法,其在求解具有高维状态空间的硬控制任务方面的高采样效率。这项工作连接了探索技术和基于模型的加强学习。我们设计了一种新颖的探索方法,考虑了基于模型的方法的特征。我们还通过实验证明我们的方法显着提高了基于模型的算法梦想家的性能。
translated by 谷歌翻译
资产分配(或投资组合管理)是确定如何最佳将有限预算的资金分配给一系列金融工具/资产(例如股票)的任务。这项研究调查了使用无模型的深RL代理应用于投资组合管理的增强学习(RL)的性能。我们培训了几个RL代理商的现实股票价格,以学习如何执行资产分配。我们比较了这些RL剂与某些基线剂的性能。我们还比较了RL代理,以了解哪些类别的代理表现更好。从我们的分析中,RL代理可以执行投资组合管理的任务,因为它们的表现明显优于基线代理(随机分配和均匀分配)。四个RL代理(A2C,SAC,PPO和TRPO)总体上优于最佳基线MPT。这显示了RL代理商发现更有利可图的交易策略的能力。此外,基于价值和基于策略的RL代理之间没有显着的性能差异。演员批评者的表现比其他类型的药物更好。同样,在政策代理商方面的表现要好,因为它们在政策评估方面更好,样品效率在投资组合管理中并不是一个重大问题。这项研究表明,RL代理可以大大改善资产分配,因为它们的表现优于强基础。基于我们的分析,在政策上,参与者批评的RL药物显示出最大的希望。
translated by 谷歌翻译
Reinforcement learning (RL) gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require large number of samples to perform well since agents eschew uncertainty in the inferred state for the decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplies reward-maximising (exploitative) behaviour in RL, with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partial observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an offpolicy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
translated by 谷歌翻译
经验重放机制允许代理多次使用经验。在以前的作品中,过渡的抽样概率根据其重要性进行调整。重新分配采样概率在每次迭代后的重传缓冲器的每个过渡是非常低效的。因此,经验重播优先算法重新计算时,相应的过渡进行采样,以获得计算效率转变的意义。然而,过渡的重要性水平动态变化的政策和代理人的价值函数被更新。此外,经验回放存储转换由可显著从代理的最新货币政策偏离剂的以前的政策产生。从代理引线的最新货币政策更关闭策略更新,这是有害的代理高偏差。在本文中,我们开发了一种新的算法,通过KL散度批次优先化体验重播(KLPER),其优先批次转换的,而不是直接优先每个过渡。此外,为了减少更新的截止policyness,我们的算法选择一个批次中的某一批次的数量和力量的通过很有可能是代理的最新货币政策所产生的一批学习代理。我们结合与深确定性政策渐变和Twin算法延迟深确定性政策渐变,并评估它在不同的连续控制任务。 KLPER提供培训期间的抽样效率,最终表现和政策的稳定性方面有前途的深确定性的连续控制算法的改进。
translated by 谷歌翻译
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
translated by 谷歌翻译
软演员 - 评论家(SAC)是最先进的偏离策略强化学习(RL)算法之一,其在基于最大熵的RL框架内。 SAC被证明在具有良好稳定性和稳健性的持续控制任务的列表中表现得非常好。 SAC了解一个随机高斯政策,可以最大限度地提高预期奖励和政策熵之间的权衡。要更新策略,SAC可最大限度地减少当前策略密度与软值函数密度之间的kl分歧。然后用于获得这种分歧的近似梯度的回报。在本文中,我们提出了跨熵策略优化(SAC-CEPO)的软演员 - 评论家,它使用跨熵方法(CEM)来优化SAC的政策网络。初始思想是使用CEM来迭代地对软价函数密度的最接近的分布进行采样,并使用结果分布作为更新策略网络的目标。为了降低计算复杂性,我们还介绍了一个解耦的策略结构,该策略结构将高斯策略解耦为一个策略,了解了学习均值的均值和另一个策略,以便只有CEM训练平均政策。我们表明,这种解耦的政策结构确实会聚到最佳,我们还通过实验证明SAC-CEPO实现对原始囊的竞争性能。
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
In order to avoid conventional controlling methods which created obstacles due to the complexity of systems and intense demand on data density, developing modern and more efficient control methods are required. In this way, reinforcement learning off-policy and model-free algorithms help to avoid working with complex models. In terms of speed and accuracy, they become prominent methods because the algorithms use their past experience to learn the optimal policies. In this study, three reinforcement learning algorithms; DDPG, TD3 and SAC have been used to train Fetch robotic manipulator for four different tasks in MuJoCo simulation environment. All of these algorithms are off-policy and able to achieve their desired target by optimizing both policy and value functions. In the current study, the efficiency and the speed of these three algorithms are analyzed in a controlled environment.
translated by 谷歌翻译
采用合理的策略是具有挑战性的,但对于智能代理商的智能代理人至关重要,其资源有限,在危险,非结构化和动态环境中工作,以改善系统实用性,降低整体成本并增加任务成功概率。深度强化学习(DRL)帮助组织代理的行为和基于其状态的行为,并代表复杂的策略(行动的组成)。本文提出了一种基于贝叶斯链条的新型分层策略分解方法,将复杂的政策分为几个简单的子手段,并将其作为贝叶斯战略网络(BSN)组织。我们将这种方法整合到最先进的DRL方法中,软演奏者 - 批评者(SAC),并通过组织几个子主管作为联合政策来构建相应的贝叶斯软演奏者(BSAC)模型。我们将建议的BSAC方法与标准连续控制基准(Hopper-V2,Walker2D-V2和Humanoid-V2)在SAC和其他最先进的方法(例如TD3,DDPG和PPO)中进行比较 - Mujoco与Openai健身房环境。结果表明,BSAC方法的有希望的潜力可显着提高训练效率。可以从https://github.com/herolab-uga/bsac访问BSAC的开源代码。
translated by 谷歌翻译
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
translated by 谷歌翻译
深层确定性的非政策算法的类别有效地用于解决具有挑战性的连续控制问题。但是,当前的方法使用随机噪声作为一种常见的探索方法,该方法具有多个弱点,例如需要对给定任务进行手动调整以及在训练过程中没有探索性校准。我们通过提出一种新颖的指导探索方法来应对这些挑战,该方法使用差异方向控制器来结合可扩展的探索性动作校正。提供探索性方向的蒙特卡洛评论家合奏作为控制器。提出的方法通过动态改变勘探来改善传统探索方案。然后,我们提出了一种新颖的算法,利用拟议的定向控制器进行政策和评论家修改。所提出的算法在DMControl Suite的各种问题上都优于现代增强算法的现代增强算法。
translated by 谷歌翻译
近年来近年来,加固学习方法已经发展了一系列政策梯度方法,主要用于建模随机政策的高斯分布。然而,高斯分布具有无限的支持,而现实世界应用通常具有有限的动作空间。如果它提供有限支持,则该解剖会导致可以消除的估计偏差,因为它提出了有限的支持。在这项工作中,我们调查如何在Openai健身房的两个连续控制任务中训练该测试策略在训练时执行该测试策略。对于这两个任务来说,测试政策在代理人的最终预期奖励方面优于高斯政策,也显示出更多的稳定性和更快的培训过程融合。对于具有高维图像输入的卡路里环境,在高斯政策中,代理的成功率提高了63%。
translated by 谷歌翻译
强化学习(RL)通过与环境相互作用的试验过程解决顺序决策问题。尽管RL在玩复杂的视频游戏方面取得了巨大的成功,但在现实世界中,犯错误总是不希望的。为了提高样本效率并从而降低错误,据信基于模型的增强学习(MBRL)是一个有前途的方向,它建立了环境模型,在该模型中可以进行反复试验,而无需实际成本。在这项调查中,我们对MBRL进行了审查,重点是Deep RL的最新进展。对于非壮观环境,学到的环境模型与真实环境之间始终存在概括性错误。因此,非常重要的是分析环境模型中的政策培训与实际环境中的差异,这反过来又指导了更好的模型学习,模型使用和政策培训的算法设计。此外,我们还讨论了其他形式的RL,包括离线RL,目标条件RL,多代理RL和Meta-RL的最新进展。此外,我们讨论了MBRL在现实世界任务中的适用性和优势。最后,我们通过讨论MBRL未来发展的前景来结束这项调查。我们认为,MBRL在被忽略的现实应用程序中具有巨大的潜力和优势,我们希望这项调查能够吸引更多关于MBRL的研究。
translated by 谷歌翻译
有效的探索仍然是强化学习中有挑战性的问题,特别是对于来自环境的外在奖励稀疏甚至完全忽视的任务。基于内在动机的重要进展显示了在简单环境中的有希望的结果,但通常会在具有多式联运和随机动力学的环境中陷入困境。在这项工作中,我们提出了一种基于条件变分推理的变分动力模型来模拟多模和随机性。通过在当前状态,动作和潜在变量的条件下产生下一个状态预测,我们考虑作为条件生成过程的环境状态动作转换,这提供了更好地了解动态并在勘探中引发更好的性能。我们派生了环境过渡的负面日志可能性的上限,并使用这样一个上限作为勘探的内在奖励,这使得代理通过自我监督的探索来学习技能,而无需观察外在奖励。我们在基于图像的仿真任务和真正的机器人操纵任务中评估所提出的方法。我们的方法优于若干基于最先进的环境模型的勘探方法。
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
一种被称为优先体验重播(PER)的广泛研究的深钢筋学习(RL)技术使代理可以从与其时间差异(TD)误差成正比的过渡中学习。尽管已经表明,PER是离散作用域中深度RL方法总体性能的最关键组成部分之一,但许多经验研究表明,在连续控制中,它的表现非常低于参与者 - 批评算法。从理论上讲,我们表明,无法有效地通过具有较大TD错误的过渡对演员网络进行训练。结果,在Q网络下计算的近似策略梯度与在最佳Q功能下计算的实际梯度不同。在此激励的基础上,我们引入了一种新颖的经验重播抽样框架,用于演员批评方法,该框架还认为稳定性和最新发现的问题是Per的经验表现不佳。引入的算法提出了对演员和评论家网络的有效和高效培训的改进的新分支。一系列广泛的实验验证了我们的理论主张,并证明了引入的方法显着优于竞争方法,并获得了与标准的非政策参与者 - 批评算法相比,获得最先进的结果。
translated by 谷歌翻译
尽管学习环境内部模型的强化学习(RL)方法具有比没有模型的对应物更有效的样本效率,但学会从高维传感器中建模原始观察结果可能具有挑战性。先前的工作通过通过辅助目标(例如重建或价值预测)学习观察值的低维表示来解决这一挑战。但是,这些辅助目标与RL目标之间的一致性通常不清楚。在这项工作中,我们提出了一个单一的目标,该目标共同优化了潜在空间模型和政策,以实现高回报,同时保持自洽。这个目标是预期收益的下限。与基于模型的RL在策略探索或模型保证方面的先前范围不同,我们的界限直接依靠整体RL目标。我们证明,所得算法匹配或改善了最佳基于模型和无模型的RL方法的样品效率。尽管这种有效的样品方法通常在计算上是要求的,但我们的方法在较小的壁式锁定时间降低了50 \%。
translated by 谷歌翻译
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
translated by 谷歌翻译