自成立以来,建立在广泛任务中表现出色的普通代理的任务一直是强化学习的重要目标。这个问题一直是对Alarge工作体系的研究的主题,并且经常通过观察Atari 57基准中包含的广泛范围环境的分数来衡量的性能。 Agent57是所有57场比赛中第一个超过人类基准的代理商,但这是以数据效率差的代价,需要实现近800亿帧的经验。以Agent57为起点,我们采用了各种各样的形式,以降低超过人类基线所需的经验200倍。在减少数据制度和Propose有效的解决方案时,我们遇到了一系列不稳定性和瓶颈,以构建更强大,更有效的代理。我们还使用诸如Muesli和Muzero之类的高性能方法证明了竞争性的性能。 TOOUR方法的四个关键组成部分是(1)近似信任区域方法,该方法可以从TheOnline网络中稳定引导,(2)损失和优先级的归一化方案,在学习具有广泛量表的一组值函数时,可以提高鲁棒性, (3)改进的体系结构采用了NFNET的技术技术来利用更深的网络而无需标准化层,并且(4)政策蒸馏方法可使瞬时贪婪的策略加班。
translated by 谷歌翻译
Atari games have been a long-standing benchmark in the reinforcement learning (RL) community for the past decade. This benchmark was proposed to test general competency of RL algorithms. Previous work has achieved good average performance by doing outstandingly well on many games of the set, but very poorly in several of the most challenging games. We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. To achieve this result, we train a neural network which parameterizes a family of policies ranging from very exploratory to purely exploitative. We propose an adaptive mechanism to choose which policy to prioritize throughout the training process. Additionally, we utilize a novel parameterization of the architecture that allows for more consistent and stable learning.
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
translated by 谷歌翻译
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new stateof-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
translated by 谷歌翻译
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
translated by 谷歌翻译
在这项工作中,我们提出并评估了一种新的增强学习方法,紧凑体验重放(编者),它使用基于相似转换集的复发的预测目标值的时间差异学习,以及基于两个转换的经验重放的新方法记忆。我们的目标是减少在长期累计累计奖励的经纪人培训所需的经验。它与强化学习的相关性与少量观察结果有关,即它需要实现类似于文献中的相关方法获得的结果,这通常需要数百万视频框架来培训ATARI 2600游戏。我们举报了在八个挑战街机学习环境(ALE)挑战游戏中,为仅10万帧的培训试验和大约25,000次迭代的培训试验中报告了培训试验。我们还在与基线的同一游戏中具有相同的实验协议的DQN代理呈现结果。为了验证从较少数量的观察结果近似于良好的政策,我们还将其结果与从啤酒的基准上呈现的数百万帧中获得的结果进行比较。
translated by 谷歌翻译
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
translated by 谷歌翻译
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
translated by 谷歌翻译
我们确定和研究政策流失的现象,即基于价值的强化学习中贪婪政策的快速变化。政策流失以惊人的快速步伐运作,改变了少数学习更新(在Atari上的DQN等典型的深层RL设置中)中大量州的贪婪行动。我们从经验上表征了现象,验证它不限于特定算法或环境特性。许多消融有助于削弱关于为什么流失仅与深度学习有关的少数相关的合理解释。最后,我们假设政策流失是一种有益但被忽视的隐性探索形式,它以新鲜的方式铸造了$ \ epsilon $ greedy探索,即$ \ epsilon $ - noise的作用比预期的要小得多。
translated by 谷歌翻译
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as -greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
在时间差异增强学习算法中,价值估计的差异会导致最大目标值的不稳定性和高估。已经提出了许多算法来减少高估,包括最近的几种集合方法,但是,没有通过解决估计方差作为高估的根本原因来表现出样品效率学习的成功。在本文中,我们提出了一种简单的集合方法,将目标值估计为集合均值。尽管它很简单,但卑鄙的(还是在Atari学习环境基准测试的实验中显示出明显的样本效率)。重要的是,我们发现大小5的合奏充分降低了估计方差以消除滞后目标网络,从而消除了它作为偏见的来源并进一步获得样本效率。我们以直观和经验的方式为曲线的设计选择证明了合理性,包括独立经验抽样的必要性。在一组26个基准ATARI环境中,曲线均优于所有经过测试的基线,包括最佳的基线,日出,在16/26环境中的100K交互步骤,平均为68​​%。在21/26的环境中,曲线还优于500k步骤的Rainbow DQN,平均为49%,并使用200K($ \ pm $ 100k)的交互步骤实现平均人级绩效。我们的实施可从https://github.com/indylab/meanq获得。
translated by 谷歌翻译
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in singlemachine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach. The source code is publicly available at github.com/deepmind/scalable agent.
translated by 谷歌翻译
资产分配(或投资组合管理)是确定如何最佳将有限预算的资金分配给一系列金融工具/资产(例如股票)的任务。这项研究调查了使用无模型的深RL代理应用于投资组合管理的增强学习(RL)的性能。我们培训了几个RL代理商的现实股票价格,以学习如何执行资产分配。我们比较了这些RL剂与某些基线剂的性能。我们还比较了RL代理,以了解哪些类别的代理表现更好。从我们的分析中,RL代理可以执行投资组合管理的任务,因为它们的表现明显优于基线代理(随机分配和均匀分配)。四个RL代理(A2C,SAC,PPO和TRPO)总体上优于最佳基线MPT。这显示了RL代理商发现更有利可图的交易策略的能力。此外,基于价值和基于策略的RL代理之间没有显着的性能差异。演员批评者的表现比其他类型的药物更好。同样,在政策代理商方面的表现要好,因为它们在政策评估方面更好,样品效率在投资组合管理中并不是一个重大问题。这项研究表明,RL代理可以大大改善资产分配,因为它们的表现优于强基础。基于我们的分析,在政策上,参与者批评的RL药物显示出最大的希望。
translated by 谷歌翻译
大多数强化学习算法都利用了经验重播缓冲液,以反复对代理商过去观察到的样本进行训练。这样可以防止灾难性的遗忘,但是仅仅对每个样本都分配了同等的重要性是一种天真的策略。在本文中,我们提出了一种根据样本可以从样本中学到多少样本确定样本优先级的方法。我们将样本的学习能力定义为随着时间的推移,与该样品相关的训练损失的稳定减少。我们开发了一种算法,以优先考虑具有较高学习能力的样本,同时将优先级较低,为那些难以学习的样本,通常是由噪声或随机性引起的。我们从经验上表明,我们的方法比随机抽样更强大,而且比仅在训练损失方面优先排序更好,即时间差损失,这是在香草优先的经验重播中使用的。
translated by 谷歌翻译
通过回顾一封来自情节记忆的过去的经验,可以通过回忆过去的经验来实现钢筋学习的样本效率。我们提出了一种新的基于模型的轨迹的集体记忆,解决了集体控制的当前限制。我们的记忆估计轨迹值,指导代理人朝着良好的政策。基于内存构建,我们通过动态混合控制统一模型的基于动态和习惯学习来构建互补学习模型,进入单个架构。实验表明,我们的模型可以比各种环境中的其他强力加强学习代理更快,更好地学习,包括随机和非马尔可夫环境。
translated by 谷歌翻译
本文探讨了在深度参与者批评的增强学习模型中同时学习价值功能和政策的问题。我们发现,由于这两个任务之间的噪声水平差异差异,共同学习这些功能的共同实践是亚最佳选择。取而代之的是,我们表明独立学习这些任务,但是由于蒸馏阶段有限,可以显着提高性能。此外,我们发现可以使用较低的\ textIt {方差}返回估计值来降低策略梯度噪声水平。鉴于,值学习噪声水平降低了较低的\ textit {bias}估计值。这些见解共同为近端策略优化的扩展提供了信息,我们称为\ textit {dual Network Archituction}(DNA),这极大地超过了其前身。DNA还超过了受欢迎的彩虹DQN算法在测试的五个环境中的四个环境中的性能,即使在更困难的随机控制设置下也是如此。
translated by 谷歌翻译
横跨街机学习环境,彩虹实现了对人类和现代RL算法的竞争程度。然而,获得这种性能水平需要大量的数据和硬件资源,在该区域进行研究计算地昂贵并且在实际应用中使用通常是不可行的。本文的贡献是三倍:我们(1)提出了一种改进的彩虹版本,寻求大大减少彩虹的数据,培训时间和计算要求,同时保持其竞争性能; (2)我们通过实验通过对街机学习环境的实验来证明我们的方法的有效性,以及(3)我们进行了许多消融研究,以研究个体提出的修改的效果。我们改进的Rainbow版本达到了靠近经典彩虹的中位数的人为规范化分数,而使用20倍的数据,只需要7.5小时的单个GPU培训时间。我们还提供了我们的全部实施,包括预先训练的型号。
translated by 谷歌翻译
深度神经网络的强大学习能力使强化学习者能够直接从连续环境中学习有效的控制政策。从理论上讲,为了实现稳定的性能,神经网络假设I.I.D.不幸的是,在训练数据在时间上相关且非平稳的一般强化学习范式中,输入不存在。这个问题可能导致“灾难性干扰”和性能崩溃的现象。在本文中,我们提出智商,即干涉意识深度Q学习,以减轻单任务深度加固学习中的灾难性干扰。具体来说,我们求助于在线聚类,以实现在线上下文部门,以及一个多头网络和一个知识蒸馏正规化术语,用于保留学习上下文的政策。与现有方法相比,智商基于深Q网络,始终如一地提高稳定性和性能,并通过对经典控制和ATARI任务进行了广泛的实验。该代码可在以下网址公开获取:https://github.com/sweety-dm/interference-aware-ware-deep-q-learning。
translated by 谷歌翻译