Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
In this paper, we propose a novel reinforcement learning algorithm that incorporates a stochastic variance-reduced version of the policy gradient for Markov decision processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven highly successful in supervised learning. However, their adaptation to the policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full-gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG and test them on continuous MDPs.
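As a rough illustration of the variance-reduction idea described above, the sketch below combines a mini-batch gradient at the current policy, an importance-weighted gradient of the snapshot policy on the same trajectories, and the stored full-batch snapshot gradient. All helper callables (sample_trajs, grad_log_prob, traj_return, traj_is_weight) and the step size are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def svrpg_update(theta, theta_snap, mu_snap, sample_trajs, grad_log_prob,
                 traj_return, traj_is_weight, step_size=1e-2):
    """One inner SVRPG-style step (sketch under assumed helper interfaces).

    theta, theta_snap : current and snapshot policy parameters
    mu_snap           : full-batch REINFORCE gradient taken at the snapshot
    sample_trajs      : callable returning a mini-batch of trajectories from theta
    grad_log_prob     : callable (traj, params) -> sum_t grad log pi(a_t|s_t)
    traj_return       : callable (traj) -> discounted return of the trajectory
    traj_is_weight    : callable (traj, num, den) -> p(traj|num) / p(traj|den)
    """
    trajs = sample_trajs(theta)
    g_cur = np.zeros_like(theta)
    g_snap_corr = np.zeros_like(theta)
    for tau in trajs:
        ret = traj_return(tau)
        # REINFORCE-style gradient term under the current policy
        g_cur += grad_log_prob(tau, theta) * ret
        # snapshot gradient on the same trajectories, importance-weighted so the
        # correction term stays unbiased although tau was sampled from theta
        w = traj_is_weight(tau, theta_snap, theta)
        g_snap_corr += w * grad_log_prob(tau, theta_snap) * ret
    n = len(trajs)
    v = g_cur / n - g_snap_corr / n + mu_snap   # variance-reduced direction
    return theta + step_size * v                # gradient ascent on expected return
```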
Many real-world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work, we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL-regularized expected reward objective, which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. Crucially, however, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and improve learning.
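A minimal sketch of the kind of KL-regularized per-step reward described above, assuming discrete actions and a hand-picked penalty coefficient alpha; the restriction on the default policy's inputs is only indicated in the comments.

```python
import numpy as np

def kl_regularized_reward(reward, pi_probs, pi0_probs, alpha=0.1):
    """Per-step KL-regularized reward (sketch, assumed coefficient alpha).

    pi_probs  : action probabilities of the agent's policy pi(.|x_t)
    pi0_probs : action probabilities of the learned default policy pi0(.|x^D_t),
                which by construction only sees a restricted part of the state
    """
    kl = np.sum(pi_probs * (np.log(pi_probs) - np.log(pi0_probs)))
    return reward - alpha * kl

# The default policy itself is trained to reduce the same KL term from its
# restricted inputs (a distillation of the agent's behaviour), so both
# components optimise one joint objective.
```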
The reinforcement learning community has made great progress in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, and each new task requires training an entirely new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to be solved. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This leads to state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learns a single trained policy, with a single set of weights, that exceeds median human performance. To our knowledge, this is the first time a single agent has surpassed human-level performance on this multi-task domain. The same approach also demonstrates state-of-the-art performance on a set of 30 tasks from the 3D reinforcement learning platform DeepMind Lab.
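The sketch below illustrates one way such per-task rescaling can be done, in the style of PopArt normalization: running statistics of each task's returns are tracked, and the output layer is rescaled so predictions are preserved whenever the statistics change. The class name, the beta step size, and the variance floor are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

class PopArtHead:
    """Per-task adaptive normalisation of value targets (sketch of the idea).

    Keeps running mean/std of each task's returns and rescales the final linear
    layer so unnormalised value predictions are preserved when the statistics
    change; no task then dominates the updates through the sheer magnitude of
    its rewards."""

    def __init__(self, n_tasks, n_features, beta=3e-4):
        self.w = np.random.randn(n_tasks, n_features) * 0.01
        self.b = np.zeros(n_tasks)
        self.mu = np.zeros(n_tasks)
        self.nu = np.ones(n_tasks)      # running second moment
        self.beta = beta

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-4))

    def update_stats(self, task, target):
        mu_old, sigma_old = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        sigma_new = self.sigma()[task]
        # preserve the unnormalised output: rescale weights and bias
        self.w[task] *= sigma_old / sigma_new
        self.b[task] = (sigma_old * self.b[task] + mu_old - self.mu[task]) / sigma_new

    def normalized_value(self, task, features):
        return self.w[task] @ features + self.b[task]
```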
We study how the behavior of deep policy gradient algorithms reflects the conceptual framework that motivated their development. We propose a fine-grained analysis of state-of-the-art methods based on key aspects of this framework: gradient estimation, value prediction, optimization landscapes, and trust region enforcement. We find that, from this perspective, the behavior of deep policy gradient algorithms often deviates from what their motivating framework predicts. Our analysis suggests first steps towards consolidating the foundations of these algorithms, and in particular indicates that we may need to move beyond the current benchmark-centric evaluation methodology.
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL, and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those of the single-agent setting. We present an actor-critic algorithm that trains decentralized policies in a multi-agent setting, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every time step. Compared to recent approaches, this attention mechanism enables more effective and scalable learning in complex multi-agent environments. Our approach is applicable not only to cooperative settings with shared rewards, but also to individualized-reward settings, including adversarial settings, and it makes no assumptions about the agents' action spaces. It is therefore flexible enough to be applied to most multi-agent learning problems.
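A minimal sketch of the attention step such a centralized critic might use, assuming learned query/key/value projection matrices and scaled dot-product weights; the function and parameter names are illustrative only.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_critic_input(agent_i, encodings, W_q, W_k, W_v):
    """Attention over the other agents' (observation, action) encodings (sketch).

    encodings     : list of per-agent embedding vectors e_j
    W_q, W_k, W_v : query / key / value projection matrices (assumed learned)
    Returns the attended summary that agent i's centralised critic consumes
    together with its own encoding.
    """
    query = W_q @ encodings[agent_i]
    others = [j for j in range(len(encodings)) if j != agent_i]
    scores = np.array([query @ (W_k @ encodings[j]) for j in others])
    alpha = softmax(scores / np.sqrt(len(query)))      # scaled dot-product weights
    values = [np.tanh(W_v @ encodings[j]) for j in others]
    return sum(a * v for a, v in zip(alpha, values))
```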
Most deep reinforcement learning algorithms are data-inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multi-task learning with shared neural network parameters, where efficiency can be improved through transfer across related tasks. In practice, however, this is usually not observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data-efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of the shared model. We propose a new approach for joint training, which we call Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a "distilled" policy that captures common behavior across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is in turn trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer in complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable, properties that are critical in deep reinforcement learning.
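The sketch below shows the flavor of the per-step regularized reward in such a joint objective; the coefficients alpha and beta are placeholder values, and the distillation of the shared policy is only summarized in the closing comment.

```python
def distral_step_reward(task_reward, logpi_task, logpi_shared, alpha=0.5, beta=5.0):
    """Per-step reward in a Distral-style joint objective (sketch, assumed coefficients).

    Each task policy pi_i is rewarded for the task itself, for staying close to
    the shared distilled policy pi_0 (the alpha/beta term), and for keeping some
    entropy of its own (the -1/beta term).
    """
    return task_reward + (alpha / beta) * logpi_shared - (1.0 / beta) * logpi_task

# The shared policy pi_0 is trained separately by distillation: a supervised
# (cross-entropy) fit to the actions taken by all task policies, which makes it
# the centroid of the task policies under the same joint objective.
```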
Some real-world domains are best characterized as a single task, but for others this perspective is limiting. Instead, some tasks continually grow in complexity, together with the agent's competence. In continual learning, also referred to as lifelong learning, there are no explicit task boundaries or curricula. As learning agents become increasingly powerful, continual learning remains one of the frontiers holding back rapid progress. To test continual learning capabilities, we consider a challenging 3D domain with an explicit sequence of tasks and sparse rewards. We propose a novel agent architecture called Unicorn, which demonstrates strong continual learning abilities and outperforms several baseline agents on the proposed domain. The agent achieves this by jointly representing and learning multiple policies efficiently, using a parallel off-policy learning setup.
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision-making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which require meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
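A minimal sketch of the entropy-augmented targets and policy objective this kind of maximum-entropy actor-critic uses; the twin-critic minimum and the fixed temperature alpha are common implementation choices assumed here, not necessarily the exact formulation in the paper.

```python
import numpy as np

def soft_q_target(reward, done, next_q1, next_q2, next_logp, gamma=0.99, alpha=0.2):
    """Entropy-augmented (soft) Bellman target used to train the critics (sketch).

    next_q1, next_q2 : target-critic values at the next state for an action
                       sampled from the current policy
    next_logp        : log-probability of that sampled action
    alpha            : temperature trading off reward against entropy (assumed value)
    """
    soft_value = np.minimum(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * (1.0 - done) * soft_value

def actor_loss(q_of_sampled_action, logp_sampled_action, alpha=0.2):
    """Policy objective: maximise Q plus entropy, i.e. minimise alpha*logp - Q."""
    return np.mean(alpha * logp_sampled_action - q_of_sampled_action)
```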
Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes but the dynamics of the environment remain the same. Our approach rests on two key ideas: "successor features", a value-function representation that decouples the dynamics of the environment from the rewards, and "generalized policy improvement", a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, these two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that set our approach on firm theoretical ground and present experiments showing that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.
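As a small illustration, the sketch below shows how successor features turn action values for a new task into dot products with that task's reward weights, and how generalized policy improvement then maximizes over the stored policies; the array shapes and names are assumptions for the example.

```python
import numpy as np

def gpi_action(successor_features, w_new):
    """Generalised policy improvement with successor features (sketch).

    successor_features : array of shape [n_policies, n_actions, d] holding
                         psi^{pi_i}(s, a) for the current state s
    w_new              : reward weights of the new task, so that
                         Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w_new
    Returns the action that is best according to the best previous policy.
    """
    q_values = successor_features @ w_new          # [n_policies, n_actions]
    return int(np.argmax(q_values.max(axis=0)))    # GPI: max over policies, then actions
```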
In many sequential decision-making tasks, it is challenging to design a reward function that helps an RL agent efficiently learn behavior that the agent designer considers good. Many different formulations of this reward-design problem, or approximate variants thereof, have been proposed in the literature. In this paper, we build on the Optimal Rewards Framework of Singh et al., which defines the optimal intrinsic reward function as one that, when used by an RL agent, produces behavior that optimizes the task-specifying (extrinsic) reward function. Prior work in this framework has shown how to learn good intrinsic reward functions for lookahead-search-based planners. Whether it is possible to learn intrinsic reward functions for learning agents remained an open question. In this paper, we derive a novel algorithm for learning intrinsic rewards for policy-gradient-based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide an additional intrinsic reward to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for MuJoCo domains) against baseline agents that use the same policy learners but only the extrinsic reward. Our results show improved performance on most, but not all, of the domains.
The idea of reusing or transferring information from previously learned tasks (source tasks) for the learning of new tasks (target tasks) has the potential to significantly improve the sample efficiency of a reinforcement learning agent. In this work, we describe a novel approach for reusing previously acquired knowledge by using it to guide the agent's exploration while it learns new tasks. To do so, we employ a variant of the growing self-organizing map algorithm, trained with a similarity measure defined directly in the space of vectorized representations of value functions. In addition to enabling transfer across tasks, the resulting map is simultaneously used to efficiently store previously acquired task knowledge in an adaptive and scalable manner. We empirically validate our approach in a simulated navigation environment and also demonstrate its utility through a simple experiment on a mobile micro-robotics platform. In addition, we demonstrate the scalability of this approach and analytically examine its relation to the proposed network growth mechanism. Further, we briefly discuss some possible improvements and extensions to this approach, as well as its relevance to real-world scenarios in the context of continual learning.
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work. In reinforcement learning (RL) (Sutton and Barto, 1998) problems, learning agents take sequential actions with the goal of maximizing a reward signal, which may be time-delayed. For example, an agent could learn to play a game by being told whether it wins or loses, but is never given the "correct" action at any given point in time. The RL framework has gained popularity as learning methods have been developed that are capable of handling increasingly complex problems. However, when RL agents begin learning tabula rasa, mastering difficult tasks is often slow or infeasible, and thus a significant amount of current RL research focuses on improving the speed of learning by exploiting domain expertise with varying amounts of human-provided knowledge. Common approaches include deconstructing the task into a hierarchy of subtasks (cf., Dietterich, 2000); learning with higher-level, temporally abstract, actions (e.g., options, Sutton et al. 1999) rather than simple one-step actions; and efficiently abstracting over the state space (e.g., via function approximation) so that the agent may generalize its experience more efficiently. The insight behind transfer learning (TL) is that generalization may occur not only within tasks, but also across tasks. This insight is not new; transfer has long been studied in the psychological literature (cf., Thorndike and Woodworth, 1901; Skinner, 1953). More relevant are a number of ...
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also the method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines.
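A minimal sketch of the Kronecker-factored approximation underlying this kind of natural-gradient update, for a single fully-connected layer; the damping constant and the omission of the trust-region step-size rescaling are simplifications for illustration.

```python
import numpy as np

def kfac_layer_direction(grad_W, acts, pre_act_grads, damping=1e-3):
    """Kronecker-factored approximate natural-gradient direction for one
    fully-connected layer (sketch of the K-FAC idea).

    grad_W        : ordinary gradient of the objective w.r.t. W  (out x in)
    acts          : batch of layer inputs a                      (batch x in)
    pre_act_grads : batch of gradients w.r.t. pre-activations    (batch x out)

    The Fisher block is approximated by a Kronecker product of the two small
    covariances below, so its inverse can be applied cheaply as S^-1 G A^-1.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])                     # input covariance
    S = pre_act_grads.T @ pre_act_grads / n + damping * np.eye(pre_act_grads.shape[1])
    return np.linalg.solve(S, grad_W) @ np.linalg.inv(A)                        # S^-1 G A^-1
```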
In this work, we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilization. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents while using less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
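The sketch below computes V-trace style value targets with truncated importance ratios via the usual backward recursion; episode-termination handling and per-step discounts are omitted for brevity, and the default truncation levels are assumptions.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for off-policy correction (sketch).

    rewards, values : arrays of length T along a trajectory
    bootstrap_value : V(x_T), value estimate at the final state
    rhos            : importance ratios pi(a_t|x_t) / mu(a_t|x_t) of the
                      learner policy pi over the behaviour policy mu
    rho_bar, c_bar  : truncation levels for the importance ratios
    """
    T = len(rewards)
    clipped_rho = np.minimum(rho_bar, rhos)
    clipped_c = np.minimum(c_bar, rhos)
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rho * (rewards + gamma * next_values - values)

    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):                 # backward recursion
        acc = deltas[t] + gamma * clipped_c[t] * acc
        vs[t] = values[t] + acc
    return vs
```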
By learning directly from raw input images using deep neural networks as function approximators, deep reinforcement learning (deep RL) has achieved superior performance on complex sequential tasks. However, learning directly from raw images is data-inefficient. In addition to learning a policy, the agent must also learn a feature representation of complex states. Deep RL therefore typically suffers from slow learning and often requires a massive amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data are expensive. In this work, we improve data efficiency in deep RL by addressing one of these two learning objectives, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations, and validate our approach using the asynchronous advantage actor-critic (A3C) algorithm in the Atari domain. Our results show significantly improved learning speed, even when the provided demonstrations are noisy and of low quality.
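As a small illustration of the supervised pre-training step described above, the sketch below computes a cross-entropy loss of the policy head on demonstration actions; the array shapes and the closing note on which parts initialize the A3C agent are assumptions for the example.

```python
import numpy as np

def demo_pretraining_loss(action_logits, demo_actions):
    """Supervised (cross-entropy) loss for pre-training the policy network on
    human demonstration frames before actor-critic fine-tuning (sketch).

    action_logits : network outputs for a batch of demonstration frames (batch x n_actions)
    demo_actions  : integer actions taken by the (possibly non-expert) demonstrator
    """
    logits = action_logits - action_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(demo_actions)), demo_actions])

# After pre-training, the learned feature extractor (and optionally the policy
# head) initialises the A3C agent, which is then trained as usual on the
# environment.
```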
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy-search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
We study the problem of off-policy policy optimization in Markov decision processes and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have typically ignored the mismatch between the distribution of states visited under the behavior policy used to collect data and the state distribution under the target policy. Here, we build on recent progress in estimating the ratio of stationary state distributions of Markov chains for off-policy policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We provide an illustrative example of why this is important, theoretical convergence guarantees for our method, and empirical simulations that highlight the benefits of correcting this distribution mismatch.
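A minimal sketch of a gradient estimate that applies both the per-action importance ratio and an estimated stationary state-distribution ratio, which is the correction such a method introduces; how the state-distribution ratios are estimated is left abstract here.

```python
import numpy as np

def corrected_offpolicy_gradient(grad_log_pi, q_values, action_ratios, state_ratios):
    """Off-policy policy-gradient estimate with a state-distribution correction (sketch).

    grad_log_pi   : array [N, d] of grad log pi(a_i|s_i) for behaviour-policy samples
    q_values      : estimated action values Q^pi(s_i, a_i)
    action_ratios : per-sample importance ratios pi(a_i|s_i) / mu(a_i|s_i)
    state_ratios  : estimated stationary-distribution ratios d_pi(s_i) / d_mu(s_i),
                    the quantity that naive off-policy policy gradients ignore
    """
    weights = (state_ratios * action_ratios * q_values)[:, None]
    return np.mean(weights * grad_log_pi, axis=0)
```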