We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.
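A minimal sketch of the actor/learner decoupling described above, under the assumption of hypothetical `env`, `q_net`, `replay`, and `update_fn` interfaces (illustrative names, not the paper's code): actors compute an initial priority locally before pushing to the shared replay, and the learner writes refreshed priorities back after each update.

```python
import numpy as np

def actor_loop(env, q_net, replay, n_steps, epsilon=0.01, gamma=0.99):
    """Actor: interact with a private environment copy using a (periodically synced) copy of
    the shared network, compute an initial priority locally, and push to shared replay."""
    s = env.reset()
    for _ in range(n_steps):
        q = q_net(s)                                              # local inference only
        a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(q))
        s2, r, done = env.step(a)
        td_error = r + gamma * (0.0 if done else float(np.max(q_net(s2)))) - q[a]
        replay.add((s, a, r, s2, done), priority=abs(td_error))   # data arrives already prioritized
        s = env.reset() if done else s2

def learner_step(q_net, replay, update_fn, batch_size=512):
    """Learner: sample a prioritized batch, take one gradient step, write back fresh priorities."""
    idx, batch, weights = replay.sample(batch_size)               # indices, transitions, IS weights
    new_td_errors = update_fn(q_net, batch, weights)              # gradient step itself not shown
    replay.update_priorities(idx, np.abs(new_td_errors))
```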
In this work, we present a new agent architecture, called Reactor, which combines multiple algorithmic and architectural contributions to produce agents with higher sample efficiency than Prioritized Dueling DQN (Wang et al., 2016) and Categorical DQN (Bellemare et al., 2017), while giving better run-time performance than A3C (Mnih et al., 2016). Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting. The same approach can be used to convert several classes of multi-step policy evaluation algorithms designed for expected-value evaluation into distributional ones. Next, we introduce the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline. Our final algorithmic contribution is a new prioritized replay algorithm for sequences, which exploits the temporal locality of neighbouring observations for more efficient replay prioritization. Using the Atari 2600 benchmark, we show that each of these innovations contributes to both sample efficiency and final agent performance. Finally, we demonstrate that Reactor reaches state-of-the-art performance after 200 million frames and less than a day of training.
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from the replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state of the art, outperforming DQN with uniform replay on 41 out of 49 games.
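The proportional variant of this idea can be sketched as follows: priorities are |TD error|^α, sampling probabilities are normalized priorities, and importance-sampling weights (N·P(i))^(−β) correct the bias. This is a simplified, list-based sketch (the paper uses a sum-tree for efficient sampling); class and parameter names are illustrative.

```python
import numpy as np

class ProportionalReplay:
    """Minimal proportional prioritized replay (list-based, for clarity rather than speed)."""
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                          # P(i) = p_i / sum_k p_k
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()                     # normalize IS weights for stability
        return idx, [self.data[i] for i in idx], weights

    def update(self, idx, td_errors):
        for i, d in zip(idx, td_errors):             # refresh priorities after a learning step
            self.priorities[i] = (abs(d) + self.eps) ** self.alpha
```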
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state of the art on the Atari 2600 domain.
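A minimal PyTorch sketch of the dueling aggregation, combining the two streams as Q(s,a) = V(s) + A(s,a) − mean_a' A(s,a'); layer sizes and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # state-value stream
        self.advantage = nn.Linear(feat_dim, n_actions)  # advantage stream

    def forward(self, features):
        v = self.value(features)                         # shape (B, 1)
        a = self.advantage(features)                     # shape (B, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)       # mean-subtraction for identifiability
```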
We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and $\epsilon$-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub to super-human performance.
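A sketch of a noisy linear layer in PyTorch, assuming independent Gaussian noise per parameter (the paper also describes a factorised-noise variant; the initialization constant here is illustrative): weights are μ + σ⊙ε, with μ and σ learned by gradient descent and ε resampled noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian perturbations on weights and biases."""
    def __init__(self, in_f, out_f, sigma0=0.017):
        super().__init__()
        bound = 1.0 / in_f ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_f, in_f).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_f, in_f), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_f))
        self.sigma_b = nn.Parameter(torch.full((out_f,), sigma0))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)   # resampled noise induces a stochastic policy
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```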
Deep reinforcement learning (RL) has achieved several high profile successesin difficult decision-making problems. However, these algorithms typicallyrequire a huge amount of data before they reach reasonable performance. Infact, their performance during learning can be extremely poor. This may beacceptable for a simulator, but it severely limits the applicability of deep RLto many real-world tasks, where the agent must learn in the real environment.In this paper we study a setting where the agent may access data from previouscontrol of the system. We present an algorithm, Deep Q-learning fromDemonstrations (DQfD), that leverages small sets of demonstration data tomassively accelerate the learning process even from relatively small amounts ofdemonstration data and is able to automatically assess the necessary ratio ofdemonstration data while learning thanks to a prioritized replay mechanism.DQfD works by combining temporal difference updates with supervisedclassification of the demonstrator's actions. We show that DQfD has betterinitial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN)as it starts with better scores on the first million steps on 41 of 42 gamesand on average it takes PDD DQN 83 million steps to catch up to DQfD'sperformance. DQfD learns to out-perform the best demonstration given in 14 of42 games. In addition, DQfD leverages human demonstrations to achievestate-of-the-art results for 11 games. Finally, we show that DQfD performsbetter than three related algorithms for incorporating demonstration data intoDQN.
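The supervised component can be illustrated with a large-margin classification loss on demonstration transitions, max_a[Q(s,a) + l(a_E,a)] − Q(s,a_E), which pushes demonstrated actions above all others by a margin. A PyTorch sketch; the margin value and mixing weights are illustrative assumptions, not the paper's exact settings.

```python
import torch

def large_margin_loss(q_values, demo_actions, margin=0.8):
    """Supervised loss on demonstration data: max_a [Q(s,a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) = margin for a != a_E and 0 for the demonstrated action."""
    l = torch.full_like(q_values, margin)
    l.scatter_(1, demo_actions.unsqueeze(1), 0.0)     # zero margin at the demonstrated action
    augmented_max = (q_values + l).max(dim=1).values
    q_demo = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - q_demo).mean()

# Sketch of the combined objective (weights lambda_* are illustrative):
# loss = td_loss + lambda_n * n_step_td_loss + lambda_e * large_margin_loss(q, a_demo) + lambda_l2 * l2
```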
Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classic exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent results that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.
We present Ephemeral Value Adjustments (EVA): a means of allowing deep reinforcement learning agents to rapidly adapt to experience in their replay buffer. EVA shifts the value predicted by a neural network with an estimate of the value function obtained by planning over experience tuples from the replay buffer near the current state. EVA combines a number of recent ideas for incorporating episodic-memory-like structures into reinforcement learning agents: slot-based storage, content-based retrieval, and memory-based planning. We show that EVA performs well on demonstration tasks and Atari games.
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using visual input.
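A sketch of the per-worker update in the actor-critic variant: n-step discounted returns, a policy-gradient term weighted by the advantage, a value-regression term, and an entropy bonus. This is a simplified single-function illustration with assumed tensor shapes, not the paper's implementation.

```python
import torch

def a3c_loss(log_probs, values, entropies, rewards, bootstrap_value,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """One worker's n-step actor-critic loss. `rewards` is a list of floats,
    `bootstrap_value` is the critic's estimate for the last state (float, 0 if terminal),
    `log_probs`/`values`/`entropies` are tensors of shape (n,)."""
    returns, R = [], bootstrap_value
    for r in reversed(rewards):                        # n-step discounted returns
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).sum()
    value_loss = advantages.pow(2).sum()
    entropy_bonus = entropies.sum()                    # discourages premature convergence
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus

# In A3C, several asynchronous workers compute gradients of this loss on their own
# environment copies and apply them to shared parameters without locking.
```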
The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
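The sticky-actions form of stochasticity can be illustrated with a small wrapper: with some probability the previously executed action is repeated instead of the agent's new choice. A sketch assuming a hypothetical `env` interface; the stickiness value shown is the commonly used setting, assumed here.

```python
import random

class StickyActionEnv:
    """Wrapper implementing sticky actions: with probability `stickiness`, the previously
    executed action is repeated instead of the action the agent just chose."""
    def __init__(self, env, stickiness=0.25):
        self.env = env
        self.stickiness = stickiness
        self.prev_action = 0

    def reset(self):
        self.prev_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.stickiness:
            action = self.prev_action        # the agent's new action is ignored this step
        self.prev_action = action            # remember the action actually executed
        return self.env.step(action)
```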
We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience. We used our architecture to implement the Deep Q-Network algorithm (DQN). Our distributed algorithm was applied to 49 Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters. Our performance surpassed non-distributed DQN in 41 of the 49 games and also reduced the wall-time required to achieve these results by an order of magnitude on most games.
Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through the use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; this can lead to exponentially faster learning. We demonstrate these benefits in complex stochastic MDPs and in the large-scale Arcade Learning Environment. Bootstrapped DQN substantially improves learning times and performance across most Atari games.
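The core mechanism can be sketched as K Q-value heads on a shared torso, with one head sampled at the start of each episode and followed greedily. A PyTorch sketch with illustrative names and a hypothetical `env` interface.

```python
import random
import torch
import torch.nn as nn

class BootstrappedHeads(nn.Module):
    """K Q-value heads sharing a feature torso."""
    def __init__(self, torso, feat_dim, n_actions, n_heads=10):
        super().__init__()
        self.torso = torso
        self.heads = nn.ModuleList(nn.Linear(feat_dim, n_actions) for _ in range(n_heads))

    def forward(self, obs, head_idx):
        return self.heads[head_idx](self.torso(obs))

def run_episode(env, model):
    head = random.randrange(len(model.heads))   # one head is sampled for the whole episode
    obs, done = env.reset(), False
    while not done:
        with torch.no_grad():
            action = int(model(obs, head).argmax())  # act greedily w.r.t. the sampled head
        obs, reward, done = env.step(action)
```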
The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, and each new task requires training a brand-new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to be solved. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This leads to state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learns a single trained policy — with a single set of weights — that exceeds median human performance. To our knowledge, this is the first time a single agent surpasses human-level performance on this multi-task domain. The same approach also demonstrates state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
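One way to equalize task contributions, used in this line of work, is PopArt-style adaptive normalization: per-task statistics normalize the value targets, and the final linear layer is rescaled so its unnormalized outputs are preserved when the statistics change. The sketch below is an assumption about that mechanism (names, step size, and shapes are illustrative), not code from the paper.

```python
import numpy as np

class PopArtHead:
    """Per-task normalization of value targets with output-preserving rescaling of the last layer."""
    def __init__(self, feat_dim, n_tasks, step=1e-3):
        self.w = np.zeros((n_tasks, feat_dim))
        self.b = np.zeros(n_tasks)
        self.mu = np.zeros(n_tasks)
        self.sigma = np.ones(n_tasks)
        self.step = step

    def update_stats(self, task, target):
        old_mu, old_sigma = self.mu[task], self.sigma[task]
        self.mu[task] += self.step * (target - self.mu[task])
        second_moment = old_sigma ** 2 + old_mu ** 2
        second_moment += self.step * (target ** 2 - second_moment)
        self.sigma[task] = np.sqrt(max(second_moment - self.mu[task] ** 2, 1e-4))
        # Rescale weights/bias so the unnormalized output is unchanged by the new statistics.
        self.w[task] *= old_sigma / self.sigma[task]
        self.b[task] = (old_sigma * self.b[task] + old_mu - self.mu[task]) / self.sigma[task]

    def normalized_value(self, task, features):
        return self.w[task] @ features + self.b[task]

    def value(self, task, features):
        return self.sigma[task] * self.normalized_value(task, features) + self.mu[task]
```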
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
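The adaptation amounts to decoupling action selection from evaluation in the target: the online network picks the greedy action, the target network evaluates it. A PyTorch sketch with assumed network interfaces.

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network selects the action, the target network evaluates it,
    reducing the overestimation caused by max over noisy estimates."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```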
In this work, we build on recent advances in distributional reinforcement learning to give a generally applicable, flexible, and state-of-the-art distributional variant of DQN. We achieve this by using quantile regression to approximate the full quantile function for the state-action return distribution. By reparameterizing a distribution over the sample space, this yields an implicitly defined return distribution and gives rise to a large class of risk-sensitive policies. We demonstrate improved performance on the 57 Atari 2600 games in the ALE, and use our algorithm's implicitly defined distributions to study the effects of risk-sensitive policies in Atari games.
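The quantile-regression fit is typically trained with a quantile Huber loss: for TD error δ between a target sample and a predicted quantile at fraction τ, the penalty is |τ − 1{δ<0}|·Huber_κ(δ)/κ. A generic PyTorch sketch of that loss term (not the full network described in the abstract); shapes and names are assumptions.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_samples, taus, kappa=1.0):
    """pred_quantiles: (B, N) estimates at fractions taus (B, N);
    target_samples: (B, M) samples of the target return distribution."""
    delta = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)   # (B, N, M)
    huber = torch.where(delta.abs() <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (delta.abs() - 0.5 * kappa))
    weight = (taus.unsqueeze(2) - (delta.detach() < 0).float()).abs()   # |tau - 1{delta < 0}|
    return (weight * huber / kappa).mean()
```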
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge, with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models — deep Q-networks trained with experience replay — analysing how the components of this system play a role in the emergence of the deadly triad and in the agent's performance.
Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance. Averaged-DQN is a simple extension to the DQN algorithm, based on averaging previously learned Q-values estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. To understand the effect of the algorithm, we examine the source of value function estimation errors and provide an analytical comparison within a simplified model. We further present experiments on the Arcade Learning Environment benchmark that demonstrate significantly improved stability and performance due to the proposed extension.
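A sketch of the averaging idea: the target is built from the mean of the K most recently learned Q estimates rather than a single target network, which reduces the variance of the target values. PyTorch, with assumed network interfaces.

```python
import torch

def averaged_q(next_states, recent_nets):
    """Average the Q-value estimates of the K most recently learned networks."""
    with torch.no_grad():
        stacked = torch.stack([net(next_states) for net in recent_nets])  # (K, B, n_actions)
        return stacked.mean(dim=0)

def averaged_dqn_targets(rewards, next_states, dones, recent_nets, gamma=0.99):
    q_avg = averaged_q(next_states, recent_nets)
    return rewards + gamma * (1.0 - dones) * q_avg.max(dim=1).values
```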
Randomized value functions offer a promising approach to the challenge of efficient exploration in complex environments with high-dimensional state and action spaces. Unlike traditional point-estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behaviour policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine them with traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN and DDPG with multiplicative normalizing flows in order to track a rich approximate posterior distribution over the parameters of the value function. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high-dimensional environments.
In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10×10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.
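The two activations are simple to state: silu(x) = x·σ(x), and dsilu(x) is its derivative, σ(x)(1 + x(1 − σ(x))). A minimal NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    """Sigmoid-weighted linear unit: the input multiplied by its sigmoid."""
    return x * sigmoid(x)

def dsilu(x):
    """Derivative of the SiLU, used as an activation in its own right:
    sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))
```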