Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new stateof-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
translated by 谷歌翻译
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
translated by 谷歌翻译
我们确定和研究政策流失的现象,即基于价值的强化学习中贪婪政策的快速变化。政策流失以惊人的快速步伐运作,改变了少数学习更新(在Atari上的DQN等典型的深层RL设置中)中大量州的贪婪行动。我们从经验上表征了现象,验证它不限于特定算法或环境特性。许多消融有助于削弱关于为什么流失仅与深度学习有关的少数相关的合理解释。最后,我们假设政策流失是一种有益但被忽视的隐性探索形式,它以新鲜的方式铸造了$ \ epsilon $ greedy探索,即$ \ epsilon $ - noise的作用比预期的要小得多。
translated by 谷歌翻译
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
translated by 谷歌翻译
自成立以来,建立在广泛任务中表现出色的普通代理的任务一直是强化学习的重要目标。这个问题一直是对Alarge工作体系的研究的主题,并且经常通过观察Atari 57基准中包含的广泛范围环境的分数来衡量的性能。 Agent57是所有57场比赛中第一个超过人类基准的代理商,但这是以数据效率差的代价,需要实现近800亿帧的经验。以Agent57为起点,我们采用了各种各样的形式,以降低超过人类基线所需的经验200倍。在减少数据制度和Propose有效的解决方案时,我们遇到了一系列不稳定性和瓶颈,以构建更强大,更有效的代理。我们还使用诸如Muesli和Muzero之类的高性能方法证明了竞争性的性能。 TOOUR方法的四个关键组成部分是(1)近似信任区域方法,该方法可以从TheOnline网络中稳定引导,(2)损失和优先级的归一化方案,在学习具有广泛量表的一组值函数时,可以提高鲁棒性, (3)改进的体系结构采用了NFNET的技术技术来利用更深的网络而无需标准化层,并且(4)政策蒸馏方法可使瞬时贪婪的策略加班。
translated by 谷歌翻译
在这项工作中,我们提出并评估了一种新的增强学习方法,紧凑体验重放(编者),它使用基于相似转换集的复发的预测目标值的时间差异学习,以及基于两个转换的经验重放的新方法记忆。我们的目标是减少在长期累计累计奖励的经纪人培训所需的经验。它与强化学习的相关性与少量观察结果有关,即它需要实现类似于文献中的相关方法获得的结果,这通常需要数百万视频框架来培训ATARI 2600游戏。我们举报了在八个挑战街机学习环境(ALE)挑战游戏中,为仅10万帧的培训试验和大约25,000次迭代的培训试验中报告了培训试验。我们还在与基线的同一游戏中具有相同的实验协议的DQN代理呈现结果。为了验证从较少数量的观察结果近似于良好的政策,我们还将其结果与从啤酒的基准上呈现的数百万帧中获得的结果进行比较。
translated by 谷歌翻译
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as -greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
translated by 谷歌翻译
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
translated by 谷歌翻译
优先考虑体验重放是一种强化学习技术,可以通过允许代理商更频繁地重播过去的经验来加速学习。这种有用性被量化为从重播经验的预期增益,并且通常近似为在相应的经验期间观察到的预测误差(TD误差)。但是,预测误差只是一个可能的优先级度量。神经科学的最新作品表明,在生物生物中,通过增益和需求优先考虑重播。需要期限衡量每种经验对目前情况的预期相关性,更重要的是,该术语目前尚未考虑在Q-Network(DQN)等算法中考虑。因此,在本文中,我们提出了一种新方法,以确定重播经验的优先考虑增益和需求。我们通过考虑所需术语,量词,作为继承人表示,进入不同强化学习算法的采样过程来测试我们的方法。我们所提出的算法表现出基准中的性能显着增加,包括Dyna-Q迷宫和一系列Atari Games。
translated by 谷歌翻译
大多数强化学习算法都利用了经验重播缓冲液,以反复对代理商过去观察到的样本进行训练。这样可以防止灾难性的遗忘,但是仅仅对每个样本都分配了同等的重要性是一种天真的策略。在本文中,我们提出了一种根据样本可以从样本中学到多少样本确定样本优先级的方法。我们将样本的学习能力定义为随着时间的推移,与该样品相关的训练损失的稳定减少。我们开发了一种算法,以优先考虑具有较高学习能力的样本,同时将优先级较低,为那些难以学习的样本,通常是由噪声或随机性引起的。我们从经验上表明,我们的方法比随机抽样更强大,而且比仅在训练损失方面优先排序更好,即时间差损失,这是在香草优先的经验重播中使用的。
translated by 谷歌翻译
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
translated by 谷歌翻译
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
translated by 谷歌翻译
一种被称为优先体验重播(PER)的广泛研究的深钢筋学习(RL)技术使代理可以从与其时间差异(TD)误差成正比的过渡中学习。尽管已经表明,PER是离散作用域中深度RL方法总体性能的最关键组成部分之一,但许多经验研究表明,在连续控制中,它的表现非常低于参与者 - 批评算法。从理论上讲,我们表明,无法有效地通过具有较大TD错误的过渡对演员网络进行训练。结果,在Q网络下计算的近似策略梯度与在最佳Q功能下计算的实际梯度不同。在此激励的基础上,我们引入了一种新颖的经验重播抽样框架,用于演员批评方法,该框架还认为稳定性和最新发现的问题是Per的经验表现不佳。引入的算法提出了对演员和评论家网络的有效和高效培训的改进的新分支。一系列广泛的实验验证了我们的理论主张,并证明了引入的方法显着优于竞争方法,并获得了与标准的非政策参与者 - 批评算法相比,获得最先进的结果。
translated by 谷歌翻译
在时间差异增强学习算法中,价值估计的差异会导致最大目标值的不稳定性和高估。已经提出了许多算法来减少高估,包括最近的几种集合方法,但是,没有通过解决估计方差作为高估的根本原因来表现出样品效率学习的成功。在本文中,我们提出了一种简单的集合方法,将目标值估计为集合均值。尽管它很简单,但卑鄙的(还是在Atari学习环境基准测试的实验中显示出明显的样本效率)。重要的是,我们发现大小5的合奏充分降低了估计方差以消除滞后目标网络,从而消除了它作为偏见的来源并进一步获得样本效率。我们以直观和经验的方式为曲线的设计选择证明了合理性,包括独立经验抽样的必要性。在一组26个基准ATARI环境中,曲线均优于所有经过测试的基线,包括最佳的基线,日出,在16/26环境中的100K交互步骤,平均为68​​%。在21/26的环境中,曲线还优于500k步骤的Rainbow DQN,平均为49%,并使用200K($ \ pm $ 100k)的交互步骤实现平均人级绩效。我们的实施可从https://github.com/indylab/meanq获得。
translated by 谷歌翻译
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
Atari games have been a long-standing benchmark in the reinforcement learning (RL) community for the past decade. This benchmark was proposed to test general competency of RL algorithms. Previous work has achieved good average performance by doing outstandingly well on many games of the set, but very poorly in several of the most challenging games. We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. To achieve this result, we train a neural network which parameterizes a family of policies ranging from very exploratory to purely exploitative. We propose an adaptive mechanism to choose which policy to prioritize throughout the training process. Additionally, we utilize a novel parameterization of the architecture that allows for more consistent and stable learning.
translated by 谷歌翻译
深度神经网络的强大学习能力使强化学习者能够直接从连续环境中学习有效的控制政策。从理论上讲,为了实现稳定的性能,神经网络假设I.I.D.不幸的是,在训练数据在时间上相关且非平稳的一般强化学习范式中,输入不存在。这个问题可能导致“灾难性干扰”和性能崩溃的现象。在本文中,我们提出智商,即干涉意识深度Q学习,以减轻单任务深度加固学习中的灾难性干扰。具体来说,我们求助于在线聚类,以实现在线上下文部门,以及一个多头网络和一个知识蒸馏正规化术语,用于保留学习上下文的政策。与现有方法相比,智商基于深Q网络,始终如一地提高稳定性和性能,并通过对经典控制和ATARI任务进行了广泛的实验。该代码可在以下网址公开获取:https://github.com/sweety-dm/interference-aware-ware-deep-q-learning。
translated by 谷歌翻译
In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near toplevel human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection with simple annealing can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10×10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.
translated by 谷歌翻译