深度Q-Network(DQN)标志着强化学习的主要里程碑,首次展示了人类水平控制政策,可以通过奖励最大化直接从原始视觉输入学习。即使是介绍多年后,DQN与研究界仍然高度相关,因为其在继承方法中采用了许多创新。然而,尽管在临时上的硬件进步,但DQN的原始ATari 2600实验仍然昂贵,以便全面复制。这对无法负担最先进的硬件或缺乏大规模云计算资源的研究人员构成了巨大的障碍。为了便于改进对深度加强学习研究的访问,我们介绍了一种DQN实现,它利用了一种新颖的并发和同步执行框架,旨在最大限度地利用异构CPU-GPU桌面系统。只需一个Nvidia GeForce GTX 1080 GPU,我们的实施将200亿帧atari实验的培训时间从25小时到仅需9小时。本文介绍的想法应普遍适用于大量违规的深度增强学习方法。
translated by 谷歌翻译
返回缓存是最近的策略,可以使用多步骤估算器(例如{\ lambda} -Return)实现高效的小纤维培训,用于深入加强学习。通过预先计算顺序批量的返回估计,然后将结果存储在辅助数据结构中以供稍后采样,可以大大减少每次估计的平均计算。尽管如此,可以提高返回缓存的效率,特别是关于其大的内存使用和重复数据副本。我们提出了一种新的数据结构,虚拟重播缓存(VRC),以解决这些缺点。学习播放Atari 2600游戏时,VRC几乎消除了DQN({\ Lambda})的缓存内存占用占据占据功能,略微降低了硬件上的总培训时间。
translated by 谷歌翻译
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
translated by 谷歌翻译
培训代理商的训练人群在加强学习方面表现出了巨大的希望,可以稳定训练,改善探索和渐近性能以及产生各种解决方案。但是,从业人员通常不考虑基于人群的培训,因为它被认为是过速的(依次实施),或者在计算上昂贵(如果代理在独立加速器上并行训练)。在这项工作中,我们比较了实施和重新审视以前的研究,以表明对汇编和矢量化的明智使用允许与培训单个代理相比,在单台机器上进行基于人群的培训。我们还表明,当提供一些加速器时,我们的协议扩展到诸如高参数调谐等应用的大型人口大小。我们希望这项工作和公众发布我们的代码将鼓励从业者更频繁地使用基于人群的学习来进行研究和应用。
translated by 谷歌翻译
在这项工作中,我们提出并评估了一种新的增强学习方法,紧凑体验重放(编者),它使用基于相似转换集的复发的预测目标值的时间差异学习,以及基于两个转换的经验重放的新方法记忆。我们的目标是减少在长期累计累计奖励的经纪人培训所需的经验。它与强化学习的相关性与少量观察结果有关,即它需要实现类似于文献中的相关方法获得的结果,这通常需要数百万视频框架来培训ATARI 2600游戏。我们举报了在八个挑战街机学习环境(ALE)挑战游戏中,为仅10万帧的培训试验和大约25,000次迭代的培训试验中报告了培训试验。我们还在与基线的同一游戏中具有相同的实验协议的DQN代理呈现结果。为了验证从较少数量的观察结果近似于良好的政策,我们还将其结果与从啤酒的基准上呈现的数百万帧中获得的结果进行比较。
translated by 谷歌翻译
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
自成立以来,建立在广泛任务中表现出色的普通代理的任务一直是强化学习的重要目标。这个问题一直是对Alarge工作体系的研究的主题,并且经常通过观察Atari 57基准中包含的广泛范围环境的分数来衡量的性能。 Agent57是所有57场比赛中第一个超过人类基准的代理商,但这是以数据效率差的代价,需要实现近800亿帧的经验。以Agent57为起点,我们采用了各种各样的形式,以降低超过人类基线所需的经验200倍。在减少数据制度和Propose有效的解决方案时,我们遇到了一系列不稳定性和瓶颈,以构建更强大,更有效的代理。我们还使用诸如Muesli和Muzero之类的高性能方法证明了竞争性的性能。 TOOUR方法的四个关键组成部分是(1)近似信任区域方法,该方法可以从TheOnline网络中稳定引导,(2)损失和优先级的归一化方案,在学习具有广泛量表的一组值函数时,可以提高鲁棒性, (3)改进的体系结构采用了NFNET的技术技术来利用更深的网络而无需标准化层,并且(4)政策蒸馏方法可使瞬时贪婪的策略加班。
translated by 谷歌翻译
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as -greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
translated by 谷歌翻译
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
translated by 谷歌翻译
横跨街机学习环境,彩虹实现了对人类和现代RL算法的竞争程度。然而,获得这种性能水平需要大量的数据和硬件资源,在该区域进行研究计算地昂贵并且在实际应用中使用通常是不可行的。本文的贡献是三倍:我们(1)提出了一种改进的彩虹版本,寻求大大减少彩虹的数据,培训时间和计算要求,同时保持其竞争性能; (2)我们通过实验通过对街机学习环境的实验来证明我们的方法的有效性,以及(3)我们进行了许多消融研究,以研究个体提出的修改的效果。我们改进的Rainbow版本达到了靠近经典彩虹的中位数的人为规范化分数,而使用20倍的数据,只需要7.5小时的单个GPU培训时间。我们还提供了我们的全部实施,包括预先训练的型号。
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
translated by 谷歌翻译
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
translated by 谷歌翻译
在探索中,由于当前的低效率而引起的强化学习领域,具有较大动作空间的学习控制政策是一个具有挑战性的问题。在这项工作中,我们介绍了深入的强化学习(DRL)算法呼叫多动作网络(MAN)学习,以应对大型离散动作空间的挑战。我们建议将动作空间分为两个组件,从而为每个子行动创建一个值神经网络。然后,人使用时间差异学习来同步训练网络,这比训练直接动作输出的单个网络要简单。为了评估所提出的方法,我们在块堆叠任务上测试了人,然后扩展了人类从Atari Arcade学习环境中使用18个动作空间的12个游戏。我们的结果表明,人的学习速度比深Q学习和双重Q学习更快,这意味着我们的方法比当前可用于大型动作空间的方法更好地执行同步时间差异算法。
translated by 谷歌翻译
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new stateof-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
translated by 谷歌翻译
大多数强化学习算法都利用了经验重播缓冲液,以反复对代理商过去观察到的样本进行训练。这样可以防止灾难性的遗忘,但是仅仅对每个样本都分配了同等的重要性是一种天真的策略。在本文中,我们提出了一种根据样本可以从样本中学到多少样本确定样本优先级的方法。我们将样本的学习能力定义为随着时间的推移,与该样品相关的训练损失的稳定减少。我们开发了一种算法,以优先考虑具有较高学习能力的样本,同时将优先级较低,为那些难以学习的样本,通常是由噪声或随机性引起的。我们从经验上表明,我们的方法比随机抽样更强大,而且比仅在训练损失方面优先排序更好,即时间差损失,这是在香草优先的经验重播中使用的。
translated by 谷歌翻译
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
translated by 谷歌翻译
With the breakthrough of AlphaGo, deep reinforcement learning becomes a recognized technique for solving sequential decision-making problems. Despite its reputation, data inefficiency caused by its trial and error learning mechanism makes deep reinforcement learning hard to be practical in a wide range of areas. Plenty of methods have been developed for sample efficient deep reinforcement learning, such as environment modeling, experience transfer, and distributed modifications, amongst which, distributed deep reinforcement learning has shown its potential in various applications, such as human-computer gaming, and intelligent transportation. In this paper, we conclude the state of this exciting field, by comparing the classical distributed deep reinforcement learning methods, and studying important components to achieve efficient distributed learning, covering single player single agent distributed deep reinforcement learning to the most complex multiple players multiple agents distributed deep reinforcement learning. Furthermore, we review recently released toolboxes that help to realize distributed deep reinforcement learning without many modifications of their non-distributed versions. By analyzing their strengths and weaknesses, a multi-player multi-agent distributed deep reinforcement learning toolbox is developed and released, which is further validated on Wargame, a complex environment, showing usability of the proposed toolbox for multiple players and multiple agents distributed deep reinforcement learning under complex games. Finally, we try to point out challenges and future trends, hoping this brief review can provide a guide or a spark for researchers who are interested in distributed deep reinforcement learning.
translated by 谷歌翻译