Deep reinforcement learning agents achieve state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also learns to maximise many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation on the extrinsic reward, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, where it achieves a mean speedup in learning of 10x and averages 87% expert human performance.
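As an illustration of the auxiliary-task idea described above, the following is a minimal sketch (PyTorch; layer sizes, head names and loss weights are illustrative assumptions, not the paper's implementation) of a shared encoder trained with a main actor-critic loss plus extra pseudo-reward heads:

# Minimal sketch (not the paper's implementation): a shared encoder whose
# representation is shaped by the main actor-critic loss plus auxiliary
# pseudo-reward losses. All sizes and weights are illustrative.
import torch
import torch.nn as nn

class AuxiliaryAgent(nn.Module):
    def __init__(self, obs_dim=64, n_actions=4, n_aux_tasks=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy = nn.Linear(128, n_actions)   # main task head
        self.value = nn.Linear(128, 1)
        # one pseudo-reward value head per auxiliary task, sharing the encoder
        self.aux_heads = nn.ModuleList([nn.Linear(128, 1) for _ in range(n_aux_tasks)])

    def losses(self, obs, action, main_return, aux_returns):
        h = self.encoder(obs)
        logp = torch.log_softmax(self.policy(h), dim=-1)
        v = self.value(h).squeeze(-1)
        advantage = (main_return - v).detach()
        policy_loss = -(logp.gather(-1, action.unsqueeze(-1)).squeeze(-1) * advantage).mean()
        value_loss = (main_return - v).pow(2).mean()
        # each auxiliary task regresses its own pseudo-return through the shared encoder
        aux_loss = sum((head(h).squeeze(-1) - r).pow(2).mean()
                       for head, r in zip(self.aux_heads, aux_returns))
        return policy_loss + 0.5 * value_loss + 0.1 * aux_loss  # weights are illustrative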
In this article, we review recent Deep Learning advances in the context of how they have been applied to play different types of video games such as first-person shooters, arcade games, and real-time strategy games. We analyze the unique requirements that different game genres pose to a deep learning system and highlight important open challenges in the context of applying these machine learning methods to video games, such as general game playing, dealing with extremely large decision spaces and sparse rewards.
Reinforcement learning (RL) is a branch of machine learning for solving a variety of sequential decision-making problems without prior supervision. Thanks to recent advances in deep learning, newly proposed deep-RL algorithms have been able to perform remarkably well in complex, high-dimensional environments. However, even after success in many domains, one of the major challenges for these methods is the large amount of interaction with the environment required for efficient decision-making. Drawing inspiration from the brain, this problem can be addressed by incorporating instance-based learning that biases decisions towards recorded high-value experiences. This paper reviews and surveys various recent reinforcement learning methods that incorporate external memory to solve decision-making problems. We outline the different approaches, along with their advantages and disadvantages, their applications, and the standard experimental settings used for memory-based models. This review aims to be a useful resource that provides key insights into the latest advances in the area and helps guide its further development.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
Learning to navigate in complex environments with dynamic elements is an important milestone in developing AI agents. In this work we formulate the navigation question as a reinforcement learning problem and show that data efficiency and task performance can be dramatically improved by relying on additional auxiliary tasks leveraging multimodal sensory inputs. In particular we consider jointly learning the goal-driven reinforcement learning problem with auxiliary depth prediction and loop closure classification tasks. This approach can learn to navigate from raw sensory input in complicated 3D mazes, approaching human-level performance even under conditions where the goal location changes frequently. We provide detailed analysis of the agent behaviour, its ability to localise, and its network activity dynamics, showing that the agent implicitly learns key navigation abilities.
In current state-of-the-art commercial first-person shooter games, computer-controlled bots (also known as non-player characters) can often be easily distinguished from human-controlled players. Tell-tale signs such as failed navigation, "sixth-sense" knowledge of a human player's whereabouts, and deterministic, scripted behaviour are some of the reasons. We suggest, however, that one of the biggest indicators of non-human behaviour in these games can be found in the bot's weapon-shooting capability. Consistently perfect accuracy and the ability to "lock on" to an opponent in the field of view from any distance are abilities of bots that are absent from human play. Traditionally, bots are handicapped to some degree by time delays or random perturbations of their aim, which do not adapt or improve their technique over time. We hypothesise that letting bots learn the skill of shooting by trial and error, much like human players do, will lead to greater variation in gameplay and produce less predictable non-player characters. This paper describes a reinforcement learning shooting mechanism for adjusting shooting over time, based on dynamic feedback about the amount of damage dealt to opponents.
A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that, in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space, from which the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually makes use of the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns cues that correlate with potential subgoals. In new environments, the model can identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
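A minimal sketch of how such decision states could be scored in practice (notation and network sizes are my own assumptions, not the paper's): compare the goal-conditioned action distribution against a goal-independent default policy; states where the two diverge are the ones where the policy actually uses the goal.

# Minimal sketch (assumptions mine): score states by how much the goal-conditioned
# policy relies on the goal, measured as the KL divergence between the
# goal-conditioned action distribution and a goal-independent default policy.
# States with a high score are candidate decision states.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=32, goal_dim=32, n_actions=4):
        super().__init__()
        self.goal_policy = nn.Sequential(nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
                                         nn.Linear(64, n_actions))
        self.default_policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                            nn.Linear(64, n_actions))

    def decision_score(self, obs, goal):
        logp_goal = torch.log_softmax(self.goal_policy(torch.cat([obs, goal], dim=-1)), dim=-1)
        logp_default = torch.log_softmax(self.default_policy(obs), dim=-1)
        # KL(pi(.|s,g) || pi0(.|s)); an information-bottleneck term would penalise this quantity
        return (logp_goal.exp() * (logp_goal - logp_default)).sum(dim=-1)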
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
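A minimal sketch of the kind of model described, assuming a small convolutional Q-network over stacked 84x84 frames and the one-step Q-learning target; shapes and hyperparameters are illustrative, not the paper's:

# Minimal sketch (illustrative, not the paper's code): a convolutional Q-network
# over raw pixel frames, trained with the one-step Q-learning target
# y = r + gamma * max_a' Q(s', a').
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),        # one Q-value per action
        )

    def forward(self, frames):                # frames: (batch, in_frames, 84, 84)
        return self.net(frames)

def q_learning_loss(q_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch             # tensors sampled from replay memory
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)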
Over the past twenty years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter dopamine 'stamps in' associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. In the present work, we draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. Here, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research. Exhilarating advances have recently been made toward understanding the mechanisms involved in reward-driven learning. This progress has been enabled in part by the importation of ideas from the field of reinforcement learning 1 (RL). Most centrally, this input has led to an RL-based theory of dopaminergic function. Here, phasic dopamine (DA) release is interpreted as conveying a reward prediction error (RPE) signal 2-4, an index of surprise which figures centrally in temporal-difference RL algorithms 1. Under the theory, the RPE drives synaptic plasticity in the striatum, translating experienced action-reward associations into optimized behavioral policies 4, 5. Over the past two decades, evidence has steadily mounted for this proposal, establishing it as the standard model of reward-driven learning. However, even as this standard model has solidified, a collection of problematic observations has accumulated. One quandary arises from research on prefrontal cortex (PFC). A growing body of evidence suggests that PFC implements mechanisms for reward-based learning, performing computations that strikingly resemble those ascribed to DA-based RL. It has long been established that sectors of the PFC represent the expected values of actions, objects and states 6-8. More recently, it has emerged that PFC also encodes the recent history of actions and rewards 9-15. The set of variables encoded, along with observations concerning the temporal profile of neural activation in the PFC, has led to the conclusion that "PFC neurons dynamically [encode] conversions from reward and choice history to object value, and from object value to object choice" 10. In short, neural activity in PFC appears to reflect a set of operations that together constitute a self-contained RL algorithm. Placing PFC beside DA, we obtain a picture containing two full-fledged RL systems, one utilizing activity-based representations and the other synaptic learning. What is the relationship between these systems? If both support RL, are their functions simply redundant? One suggestion has been that DA and PFC subserve different for
Learning how to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We propose an unsupervised learning algorithm that trains an agent to reach perceptually specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal-achievement reward function that measures how similar a state is to the goal state. This dual optimisation leads to a cooperative game, producing a learned reward function that reflects similarity in the controllable aspects of the environment rather than distance in observation space. We demonstrate that our agent learns to achieve goals in an unsupervised manner in three domains: Atari, the DeepMind Control Suite and DeepMind Lab.
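A minimal sketch of one plausible form of a learned goal-achievement reward (an assumption for illustration, not the paper's exact formulation): cosine similarity between learned embeddings of the reached state and the goal state, trained jointly with a goal-conditioned policy (the policy is omitted here).

# Minimal sketch (assumptions mine): a learned goal-achievement reward defined as
# cosine similarity between embeddings of the reached state and the goal state.
import torch
import torch.nn as nn

class GoalReward(nn.Module):
    def __init__(self, obs_dim=64, embed_dim=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                   nn.Linear(128, embed_dim))

    def forward(self, achieved_obs, goal_obs):
        e_s = nn.functional.normalize(self.embed(achieved_obs), dim=-1)
        e_g = nn.functional.normalize(self.embed(goal_obs), dim=-1)
        return (e_s * e_g).sum(dim=-1)   # in [-1, 1]; higher means closer to the goal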
A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. Features extracted by the world model are fed to compact and simple policies trained by evolution, achieving state-of-the-art results in various environments. We also train our agent entirely inside an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of the paper: http://worldmodels.github.io
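A structural sketch of the vision/model/controller split described above (module choices, sizes and names are illustrative assumptions, not the paper's architecture):

# Minimal sketch (structure only): a world-model pipeline with a vision encoder (V),
# a recurrent dynamics model (M) and a tiny controller (C) acting on the compressed
# state [z, h]. prev_action is assumed to be a one-hot vector.
import torch
import torch.nn as nn

class WorldModelAgent(nn.Module):
    def __init__(self, obs_dim=64, z_dim=32, h_dim=256, n_actions=3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, z_dim)                # V: compress observation
        self.rnn = nn.GRUCell(z_dim + n_actions, h_dim)         # M: model temporal dynamics
        self.controller = nn.Linear(z_dim + h_dim, n_actions)   # C: simple policy

    def step(self, obs, prev_action, h):
        z = self.encoder(obs)
        action_logits = self.controller(torch.cat([z, h], dim=-1))
        h_next = self.rnn(torch.cat([z, prev_action], dim=-1), h)
        return action_logits, h_next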
The reinforcement learning community has made great progress in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained on one task at a time, and each new task requires training an entirely new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks. A general issue in multi-task learning is that a balance must be found between the competing needs of multiple tasks for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to be solved. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This leads to state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learns a single trained policy, with a single set of weights, that exceeds median human performance. To our knowledge, this is the first time a single agent has surpassed human-level performance on this multi-task domain. The same approach also demonstrates state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
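A minimal sketch of one way to adapt each task's contribution, in the spirit of adaptive target normalisation (PopArt-style); the mechanism shown, the constants and the output-preserving correction are simplified assumptions, not the paper's exact method:

# Minimal sketch (simplified): per-task normalisation of value targets so that tasks
# with very different reward scales contribute comparably to the updates. When the
# statistics change, the last layer is rescaled so unnormalised outputs are preserved.
import torch
import torch.nn as nn

class NormalizedValueHead(nn.Module):
    def __init__(self, features, n_tasks, beta=3e-4):
        super().__init__()
        self.linear = nn.Linear(features, n_tasks)   # one normalised value per task
        self.register_buffer("mu", torch.zeros(n_tasks))
        self.register_buffer("sigma", torch.ones(n_tasks))
        self.beta = beta

    def forward(self, h):
        return self.linear(h) * self.sigma + self.mu   # unnormalised values

    @torch.no_grad()
    def update_stats(self, targets, task):             # targets: (batch,), task: int
        mu_new = (1 - self.beta) * self.mu[task] + self.beta * targets.mean()
        nu = (1 - self.beta) * (self.sigma[task] ** 2 + self.mu[task] ** 2) + \
             self.beta * (targets ** 2).mean()
        sigma_new = (nu - mu_new ** 2).clamp_min(1e-4).sqrt()
        # rescale the affected output so its unnormalised prediction is unchanged
        self.linear.weight[task] *= self.sigma[task] / sigma_new
        self.linear.bias[task] = (self.sigma[task] * self.linear.bias[task]
                                  + self.mu[task] - mu_new) / sigma_new
        self.mu[task], self.sigma[task] = mu_new, sigma_new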
Advances in deep reinforcement learning have enabled autonomous agents to perform well on Atari games, often outperforming humans, using only raw pixels to make decisions. However, most of these games take place in 2D environments that are fully observable to the agent. In this paper, we present the first architecture to tackle 3D environments in first-person shooter games, which involve partially observable states. Typically, deep reinforcement learning methods only utilise visual input for training. We present a method to train these models so that they exploit game-feature information during the training phase, such as the presence of enemies or items. Our model is trained to learn these features while simultaneously minimising a Q-learning objective, which significantly improves the training speed and performance of our agent. Our architecture is also modular, allowing different models to be trained independently for different phases of the game. We show that the proposed architecture substantially outperforms the game's built-in AI agents as well as humans in deathmatch scenarios.
Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts-even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).
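A generic sketch of a beta-VAE-style objective, one common way to learn a disentangled representation; this is an assumed illustration, not necessarily DARLA's exact vision module:

# Generic sketch of a beta-VAE-style loss: reconstruction plus a KL term scaled by
# beta > 1, which encourages a disentangled latent representation. Sizes illustrative.
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    def __init__(self, obs_dim=64, z_dim=10, beta=4.0):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim))   # mean and log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))
        self.beta = beta

    def loss(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        recon = nn.functional.mse_loss(self.dec(z), x)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + self.beta * kl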
When observing the actions of others, humans make inferences about why they acted as they did, and what this implies about their view of the world. Humans also use the fact that, when being observed, their actions will be interpreted in this way, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian Action Decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief conditioned on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep networks, that can be sampled for a given public state. It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies that map private information into environment actions. The Bayesian update is also closely related to the theory-of-mind reasoning that humans perform when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy-gradient methods. We then evaluate BAD on the challenging cooperative partial-information card game Hanabi, where in the two-player setting it surpasses all previously published learning and hand-coded approaches.
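A minimal sketch of the Bayesian decoding step described above, in a discrete toy setting (the function name and example convention are assumptions): observing an action updates the public belief over an agent's private state by Bayes' rule, given the shared partial policy.

# Minimal sketch: given a public belief over an agent's private state and a known
# (partial) policy mapping private states to action probabilities, observing an
# action updates the belief by Bayes' rule.
import numpy as np

def decode_action(belief, policy, observed_action):
    """belief: (n_private,) prior over private states.
    policy: (n_private, n_actions) P(action | private state) under the shared convention.
    Returns the posterior P(private state | observed action)."""
    likelihood = policy[:, observed_action]
    posterior = belief * likelihood
    total = posterior.sum()
    if total == 0.0:                       # action impossible under the convention
        return np.full_like(belief, 1.0 / belief.size)
    return posterior / total

# Example: two equally likely private cards; the convention plays action 1 mostly with card 1.
prior = np.array([0.5, 0.5])
convention = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
print(decode_action(prior, convention, observed_action=1))   # belief shifts toward card 1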
The theory of reinforcement learning provides a normative account 1, deeply rooted in psychological 2 and neuroscientific 3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems 4,5, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms 3. While reinforcement learning agents have achieved some successes in a variety of domains 6-8, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks 9-11 to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games 12. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks. We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks-a central goal of general artificial intelligence 13 that has eluded previous efforts 8,14,15. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network 16 known as deep neural networks. Notably, recent advances in deep neural networks 9-11, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network 17, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields-inspired by Hubel and Wiesel's seminal work on feedforward proce
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
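A minimal sketch of one MCTS iteration with UCT selection (illustrative; real implementations add many enhancements). The domain is supplied through three functions: legal_actions(state), step(state, action) returning the next state, and evaluate(state) returning a payoff used for rollouts.

# Minimal sketch of one MCTS iteration: selection by the UCB1 rule, expansion of one
# untried child, a random rollout, and backpropagation of the payoff.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # action -> child Node
        self.visits, self.total = 0, 0.0

def uct_child(node, c=1.4):
    return max(node.children.values(),
               key=lambda ch: ch.total / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts_iteration(root, legal_actions, step, evaluate, rollout_depth=20):
    node = root
    # 1. Selection: descend while the current node is fully expanded.
    while node.children and not [a for a in legal_actions(node.state) if a not in node.children]:
        node = uct_child(node)
    # 2. Expansion: add one untried child, if any.
    untried = [a for a in legal_actions(node.state) if a not in node.children]
    if untried:
        action = random.choice(untried)
        node.children[action] = Node(step(node.state, action), parent=node)
        node = node.children[action]
    # 3. Simulation: random rollout from the new node, then evaluate the reached state.
    state = node.state
    for _ in range(rollout_depth):
        actions = legal_actions(state)
        if not actions:
            break
        state = step(state, random.choice(actions))
    payoff = evaluate(state)
    # 4. Backpropagation: update statistics along the path back to the root.
    while node is not None:
        node.visits += 1
        node.total += payoff
        node = node.parent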
We introduce TextWorld, a sandbox learning environment for training and evaluating RL agents on text-based games. TextWorld is a Python library that handles interactive playthroughs of text games, as well as backend functions such as state tracking and reward assignment. It comes with a curated selection of games whose features and challenges we analyse. More importantly, it enables users to hand-craft games or to generate new ones automatically. Its generative mechanisms give precise control over the difficulty, scope and language of constructed games, and can be used to relax challenges inherent to commercial text games, such as partial observability and sparse rewards. By generating sets of varied but similar games, TextWorld can also be used to study generalisation and transfer learning. We cast text-based games into the reinforcement learning formalism, use our framework to develop a set of benchmark games, and evaluate several baseline agents on this set and on the curated list.
The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
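A minimal sketch of the sticky-actions idea mentioned above (wrapper style, parameter name and the default probability are illustrative assumptions): with some probability the environment repeats the previously executed action instead of the one just issued, injecting stochasticity into otherwise deterministic games.

# Minimal sketch of a sticky-actions wrapper around an Atari-style environment with
# reset() and step(action) methods. The repeat probability shown is illustrative.
import random

class StickyActionEnv:
    def __init__(self, env, stickiness=0.25):
        self.env = env
        self.stickiness = stickiness
        self.prev_action = 0

    def reset(self):
        self.prev_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.stickiness:
            action = self.prev_action      # repeat the last executed action
        self.prev_action = action
        return self.env.step(action)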