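Learning control policies over large action spaces is a challenging problem in reinforcement learning because of current inefficiencies in exploration. In this work, we introduce a deep reinforcement learning (DRL) algorithm called Multi-Action Networks (MAN) learning that addresses the challenge of large discrete action spaces. We propose separating the action space into two components and creating a value neural network for each sub-action. MAN then uses temporal-difference learning to train the networks synchronously, which is simpler than training a single network that outputs the full action directly. To evaluate the proposed method, we test MAN on a block-stacking task and then extend MAN to 12 games from the Atari Arcade Learning Environment with an 18-action space. Our results indicate that MAN learns faster than both Deep Q-Learning and Double Q-Learning, implying that our method is a better-performing synchronous temporal-difference algorithm than those currently available for large action spaces.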
translated by 谷歌翻译
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
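The two-estimator factoring described above can be illustrated with a small sketch. The abstract does not say how the value and advantage streams are combined; the code below assumes the mean-subtracted aggregation, and the function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Assumed aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    Subtracting the mean keeps V and A identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Three actions whose advantages already average to zero.
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
```

In a real agent the two inputs would come from two heads of one shared network; here they are plain numbers so the arithmetic is easy to check.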
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
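In this work, we propose and evaluate a new reinforcement learning method, COMPact Experience Replay (COMPER), which uses temporal-difference learning with predicted target values based on recurrence over sets of similar transitions, together with a new approach to experience replay based on two transition memories. Our objective is to reduce the number of experiences required to train an agent with respect to the total reward accumulated in the long run. Its relevance to reinforcement learning lies in the small number of observations it needs to achieve results similar to those obtained by relevant methods in the literature, which generally demand millions of video frames to train an agent on the Atari 2600 games. We report detailed results from training trials of COMPER with just 100,000 frames and about 25,000 iterations on eight challenging games of the Arcade Learning Environment (ALE). We also present results for a DQN agent under the same experimental protocol on the same set of games as the baseline. To verify the performance of COMPER in approximating a good policy from a smaller number of observations, we also compare its results with those obtained from millions of frames reported on the ALE benchmark.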
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
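As a rough illustration of enforcing Bellman consistency on a random convex combination of several Q-value estimates, the sketch below mixes K Q-heads with random weights that sum to one and computes one TD error for the mixture. The helper names, the way the weights are drawn, and the plain max-based target are assumptions made for illustration, not details taken from the abstract.

```python
import random


def random_convex_weights(k, rng):
    """Draw k positive weights that sum to 1 (a random convex combination)."""
    raw = [1.0 - rng.random() for _ in range(k)]  # each value lies in (0, 1]
    total = sum(raw)
    return [r / total for r in raw]


def mix_q(q_heads, weights):
    """Mix per-head Q-values; q_heads[k][a] is head k's estimate for action a."""
    n_actions = len(q_heads[0])
    return [
        sum(w * head[a] for w, head in zip(weights, q_heads))
        for a in range(n_actions)
    ]


def mixture_td_error(q_heads, next_q_heads, action, reward, gamma, weights):
    """TD error of the mixed Q-function for one transition (s, a, r, s')."""
    q_sa = mix_q(q_heads, weights)[action]
    target = reward + gamma * max(mix_q(next_q_heads, weights))
    return target - q_sa


rng = random.Random(0)
weights = random_convex_weights(4, rng)  # fresh weights would be drawn per update
```

The same weights are applied to the current and next-state estimates, so the consistency condition is imposed on one mixed Q-function at a time.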
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
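The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has been the subject of a large body of work, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human benchmark on all 57 games, but this came at the cost of poor data efficiency, requiring nearly 80 billion frames of experience to achieve. Taking Agent57 as a starting point, we employ a diverse set of strategies to achieve a 200-fold reduction in the experience needed to outperform the human baseline. We investigate a range of instabilities and bottlenecks that we encountered while reducing the data regime, and propose effective solutions to build a more robust and efficient agent. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. The four key components of our approach are (1) an approximate trust region method that enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities that improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method that smooths out the instantaneous greedy policy over time.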
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection with simple annealing can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10×10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.
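The two activation functions can be written out directly from their description: the SiLU is the input multiplied by its sigmoid, and the dSiLU is the derivative of that product. The sketch below uses only the standard library; the function names and the numerically stable form of the sigmoid are illustrative choices.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def silu(z):
    """Sigmoid-weighted linear unit: z * sigmoid(z)."""
    return z * sigmoid(z)


def dsilu(z):
    """Derivative of the SiLU: sigmoid(z) * (1 + z * (1 - sigmoid(z)))."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))
```

The derivative follows from the product rule, using the fact that the sigmoid's own derivative is sigmoid(z) * (1 - sigmoid(z)).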
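Reinforcement learning algorithms based on Q-learning are driving deep reinforcement learning (DRL) research towards solving complex problems and achieving super-human performance on many of them. Nevertheless, Q-learning is known to be positively biased, since it learns by using the maximum over noisy estimates of expected values. Systematic overestimation of the action values, coupled with the inherently high variance of DRL methods, can lead to incrementally accumulating errors that cause learning algorithms to diverge. Ideally, we would like DRL agents to take into account their own uncertainty about the optimality of each action and to exploit it to make more informed estimates of the expected return. In this regard, Weighted Q-Learning (WQL) effectively reduces bias and shows remarkable results in stochastic environments. WQL uses a weighted sum of the estimated action values, where the weights correspond to the probability of each action value being the maximum. However, the computation of these probabilities is only practical in the tabular setting. In this work, we provide methodological advances to benefit from the WQL properties in DRL by using neural networks trained with dropout as an effective approximation of deep Gaussian processes. In particular, we adopt the Concrete Dropout variant to obtain calibrated estimates of epistemic uncertainty in DRL. The estimator is then obtained by taking several stochastic forward passes through the action-value network and computing the weights in a Monte Carlo fashion. Such weights are Bayesian estimates of the probability of each action value being the maximum w.r.t. a posterior probability distribution estimated by dropout. We show how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on representative benchmarks.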
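A minimal sketch of the Monte Carlo weighting idea: given several sampled action-value vectors, the weight of each action is the fraction of samples in which it is the maximum, and the estimator is the weighted sum of the action values. The samples are passed in directly here, whereas the method draws them from stochastic forward passes through a dropout network. Using the per-action sample mean as the estimated action value, and the name `weighted_q_estimate`, are assumptions of this sketch.

```python
def weighted_q_estimate(samples):
    """Estimate a weighted action value from sampled Q-vectors.

    samples[t][a] is the Q-value of action a in the t-th stochastic sample.
    Returns (estimate, weights), where weights[a] approximates the
    probability that action a has the largest value.
    """
    n_samples = len(samples)
    n_actions = len(samples[0])

    # Count how often each action is the argmax across samples.
    counts = [0] * n_actions
    for q in samples:
        best = max(range(n_actions), key=lambda a: q[a])
        counts[best] += 1
    weights = [c / n_samples for c in counts]

    # Assumed point estimate of each action value: the sample mean.
    means = [sum(q[a] for q in samples) / n_samples for a in range(n_actions)]

    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


estimate, weights = weighted_q_estimate(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
)
```

In this toy input the first action wins in three of four samples, so it receives weight 0.75, and the resulting estimate lies below the maximum of the per-action means.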
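The Deep Q-learning Network (DQN) is a successful way of combining reinforcement learning with deep neural networks, and it has led to widespread application of reinforcement learning. One challenging problem when applying DQN or other reinforcement learning algorithms to real-world problems is data collection. Therefore, how to improve data efficiency is one of the most important problems in reinforcement learning research. In this paper, we propose a framework that uses the Max-Mean loss in the Deep Q-Network (M$^2$DQN). Instead of sampling one batch of experiences in the training step, we sample several batches from the experience replay and update the parameters so that the maximum TD-error of these batches is minimized. The proposed method can be combined with most existing techniques for the DQN algorithm by replacing the loss function. We verify the effectiveness of this framework with one of the most widely used techniques, Double DQN (DDQN), in several gym games. The results show that our method leads to a substantial improvement in both learning speed and performance.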
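The max-mean idea can be sketched as follows: compute a mean loss for each of several sampled batches, then take the largest of those means as the quantity to minimize. The squared TD error as the per-sample loss and the name `max_mean_loss` are assumptions. The TD errors are supplied directly rather than computed from a network.

```python
def max_mean_loss(td_error_batches):
    """Return the largest per-batch mean squared TD error.

    td_error_batches is a list of batches; each batch is a list of TD errors
    for the transitions sampled into that batch.
    """
    batch_means = [
        sum(err * err for err in batch) / len(batch)
        for batch in td_error_batches
    ]
    return max(batch_means)


# Two batches: the second has the larger mean squared error.
loss = max_mean_loss([[1.0, -1.0], [2.0, 0.0]])
```

With a differentiable framework, the gradient of this loss would flow only through the batch that currently has the largest mean, which is what focuses each update on the worst-fitting batch.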
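In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods; however, none has shown success in sample-efficient learning by addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 reduces estimation variance enough to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify the design choices in MeanQ intuitively and empirically, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K ($\pm$100K) interaction steps. Our implementation is available at https://github.com/indylab/meanq.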
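One way to read "target values as ensemble means" is sketched below: average the ensemble members' next-state Q-values per action, then bootstrap from the maximum of that average. The exact form of the target, the handling of terminal states, and the name `ensemble_mean_target` are assumptions made for illustration and are not taken from the abstract.

```python
def ensemble_mean_target(reward, gamma, next_q_ensemble, done=False):
    """TD target that bootstraps from the ensemble-mean Q-values.

    next_q_ensemble[k][a] is member k's Q-value for action a in the next state.
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    n_members = len(next_q_ensemble)
    n_actions = len(next_q_ensemble[0])
    mean_q = [
        sum(member[a] for member in next_q_ensemble) / n_members
        for a in range(n_actions)
    ]
    return reward + gamma * max(mean_q)


# Two members that disagree about which action is best.
target = ensemble_mean_target(1.0, 0.5, [[1.0, 3.0], [3.0, 1.0]])
```

In this toy case each member alone would bootstrap from a maximum of 3.0, while the averaged estimate gives 2.0 for both actions, which shows how averaging before the max dampens the effect of noisy individual estimates.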
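The large body of practical work addressed by the Deep Q-Network (DQN) algorithm indicates that the stochastic policy, despite its simplicity, is the most frequently used exploration approach. However, most existing stochastic exploration approaches either explore new actions regardless of Q-values or inevitably introduce bias into the learning process in order to couple the sampling with Q-values. In this paper, we propose a novel preference-guided ε-greedy exploration algorithm that can efficiently learn an action distribution in line with the landscape of Q-values for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely the Q-branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided ε-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q-values. Consequently, preference-guided ε-greedy exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still have a chance to be explored, thus encouraging exploration. We assess the proposed method with four well-known DQN variants in nine different environments. Extensive results confirm the superiority of our proposed method in terms of performance and convergence speed. Index Terms: preference-guided exploration, stochastic policy, data efficiency, deep reinforcement learning, deep Q-learning.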
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
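Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work, we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at massive scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, and various new agents implemented as part of Acme.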