The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained on one task at a time, and each new task requires training a brand-new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential decision-making tasks. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to be solved: such tasks appear more salient to the learning process, for instance because of the density or magnitude of their in-task rewards. This causes algorithms to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar influence on the learning dynamics. This leads to state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learns a single trained policy, with a single set of weights, that exceeds median human performance. To our knowledge, this is the first time a single agent has surpassed human-level performance on this multi-task domain. The same approach also demonstrates state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
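The abstract describes adapting each task's contribution to the agent's updates without naming the mechanism; a common way to realise this is PopArt-style adaptive normalisation of per-task value targets. The sketch below is a minimal numpy illustration under that assumption; class and parameter names are mine, not the paper's.

```python
import numpy as np

class PopArtNormalizer:
    """Per-task adaptive rescaling of value targets (illustrative sketch).

    Keeps running estimates of the mean and scale of each task's returns and
    rescales a linear value head so its unnormalized outputs are preserved
    when the statistics change.
    """

    def __init__(self, num_tasks, feature_dim, beta=3e-4):
        self.mu = np.zeros(num_tasks)      # running mean of returns, per task
        self.nu = np.ones(num_tasks)       # running second moment, per task
        self.beta = beta                   # step size for the statistics
        # Linear value head: one row of weights and one bias per task.
        self.w = np.random.randn(num_tasks, feature_dim) * 0.01
        self.b = np.zeros(num_tasks)

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-6))

    def update_stats(self, task, target):
        """Update statistics for `task` and rescale the head to preserve outputs."""
        old_mu, old_sigma = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        new_sigma = self.sigma()[task]
        # Preserve unnormalized predictions: sigma * f(x) + mu stays constant.
        self.w[task] *= old_sigma / new_sigma
        self.b[task] = (old_sigma * self.b[task] + old_mu - self.mu[task]) / new_sigma

    def normalized_target(self, task, target):
        """Target in the normalized space actually used for the value loss."""
        return (target - self.mu[task]) / self.sigma()[task]
```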
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors, from scratch, in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment, enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach. A video of the rich set of learned behaviours can be found at https://youtu.be/mPKyvocNeM.
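As a rough illustration of the scheduling idea, the sketch below keeps a running estimate of the main-task return obtained after executing each auxiliary intention and samples the next intention with a softmax over those estimates. This is a simplified stand-in for the learned scheduler described above, not the paper's exact scheme; all names and hyperparameters are illustrative.

```python
import numpy as np

class IntentionScheduler:
    """Minimal sketch of scheduling among intentions (auxiliary tasks)."""

    def __init__(self, num_intentions, temperature=1.0, lr=0.1):
        self.q = np.zeros(num_intentions)   # estimated main-task return per intention
        self.temperature = temperature
        self.lr = lr

    def sample_intention(self):
        """Softmax sampling: intentions that helped the main task are picked more often."""
        logits = self.q / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(len(self.q), p=probs))

    def update(self, intention, main_task_return):
        """Move the estimate toward the main-task return observed after running it."""
        self.q[intention] += self.lr * (main_task_return - self.q[intention])
```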
The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two concepts, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we extend the SF and GPI framework in two ways. One of the basic assumptions underlying the original formulation of SFs and GPI is that rewards for all tasks of interest can be computed as linear combinations of a fixed set of features. We relax this constraint and show that the theoretical guarantees supporting the framework can be extended to any set of tasks that only differ in the reward function. Our second contribution is to show that the reward functions themselves can be used as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand. This makes it possible to combine SFs and GPI with deep learning in a more stable way. We empirically verify this claim in a complex 3D environment where observations are images from a first-person perspective. We show that the transfer promoted by SFs and GPI leads to very good policies on unseen tasks almost instantaneously. We also describe how to learn policies specialised to the new tasks in a way that allows them to be added to the agent's set of skills, and thus be reused in the future.
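For reference, here is a compact statement of the linear-reward assumption being relaxed above, written in the standard successor-features notation (this is the usual formulation from the literature, not reproduced verbatim from this abstract):

```latex
\begin{align}
  r_w(s,a,s') &= \phi(s,a,s')^{\top} w, \\
  Q^{\pi}_{w}(s,a) &= \mathbb{E}^{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}\, \phi(s_t,a_t,s_{t+1}) \,\middle|\, s_0 = s,\, a_0 = a\right]^{\top} w
                    \;=\; \psi^{\pi}(s,a)^{\top} w, \\
  \phi(s,a,s') &= \big(r_1(s,a,s'),\, \dots,\, r_n(s,a,s')\big)
                  \quad \text{(the extension: training-task rewards used as features).}
\end{align}
```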
In this paper, we explore the utilization of natural language to drive transfer for reinforcement learning (RL). Despite the widespread application of deep RL techniques, learning generalized policy representations that work across domains remains a challenging problem. We demonstrate that textual descriptions of environments provide a compact intermediate channel to facilitate effective policy transfer. Specifically, by learning to ground the meaning of text to the dynamics of the environment such as transitions and rewards, an autonomous agent can effectively bootstrap policy learning on a new domain given its description. We employ a model-based RL approach consisting of a differentiable planning module, a model-free component and a factorized state representation to effectively use entity descriptions. Our model outperforms prior work on both transfer and multi-task scenarios in a variety of different environments. For instance, we achieve up to 14% and 11.5% absolute improvement over previously existing models in terms of average and initial rewards, respectively.
Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal-achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a cooperative game, giving rise to a learned reward function that reflects similarity in controllable aspects of the environment rather than distance in observation space. We demonstrate that our agent learns to achieve goals in an unsupervised manner in three domains: Atari, the DeepMind Control Suite, and DeepMind Lab.
In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent, IMPALA (Importance Weighted Actor-Learner Architecture), that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data and, crucially, exhibits positive transfer between tasks as a result of its multi-task approach.
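To make the V-trace correction concrete, here is a small numpy sketch of how value targets for one trajectory are typically computed from clipped importance ratios; function and argument names are mine and the defaults are illustrative.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, behaviour_logp, target_logp,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory (sketch, numpy only).

    rewards, behaviour_logp, target_logp: arrays of length T.
    values: array of length T with V(x_0)..V(x_{T-1}); bootstrap_value: V(x_T).
    rho_bar / c_bar are the clipping thresholds for the importance ratios.
    """
    T = len(rewards)
    ratios = np.exp(target_logp - behaviour_logp)        # pi(a|x) / mu(a|x)
    rhos = np.minimum(rho_bar, ratios)
    cs = np.minimum(c_bar, ratios)
    values_ext = np.append(values, bootstrap_value)      # length T + 1
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values_ext[t] + acc
    return vs
```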
The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common forms is the universal value function approximator (UVFA). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed universal successor features approximators (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA and its generalisation properties, and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate in a first-person-perspective three-dimensional environment.
Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation on extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, with a mean speedup in learning of 10x and averaging 87% expert human performance on Labyrinth.
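A minimal PyTorch-style sketch of the shared-representation idea: one trunk feeds the main policy and value heads plus extra value heads for pseudo-reward (auxiliary) tasks. Layer sizes and the particular choice of heads are placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class SharedTrunkWithAuxHeads(nn.Module):
    """One shared representation, several auxiliary pseudo-reward heads (sketch)."""

    def __init__(self, obs_dim, num_actions, num_aux_tasks, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, num_actions)        # main-task policy head
        self.value = nn.Linear(hidden, 1)                   # main-task value head
        self.aux_values = nn.Linear(hidden, num_aux_tasks)  # one value per pseudo-reward

    def forward(self, obs):
        h = self.trunk(obs)
        # Auxiliary losses on aux_values shape the trunk even when extrinsic reward is absent.
        return self.policy(h), self.value(h), self.aux_values(h)
```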
We present a differentiable framework capable of learning a wide variety of compositions of simple policies that we call skills. By recursively composing skills with themselves, we can create hierarchies that display complex behavior. Skill networks are trained to generate skill-state embeddings that are provided as inputs to a trainable composition function, which in turn outputs a policy for the overall task. Our experiments on an environment consisting of multiple collect and evade tasks show that this architecture is able to quickly build complex skills from simpler ones. Furthermore, the learned composition function displays some transfer to unseen combinations of skills, allowing for zero-shot generalizations.
In many sequential decision-making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behaviour that the agent designer considers good. Many different formulations of this reward-design problem, or close variants of it, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al., which defines an optimal intrinsic reward function as one that, when used by an RL agent, leads to behaviour that optimises the task-specifying (extrinsic) reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead-search-based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remained an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient-based learning agents. We compare the performance of augmented agents that use our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) against baseline agents that use the same policy learners but with only the extrinsic rewards. Our results show improved performance on most, but not all, of the domains.
All-goals updating exploits the off-policy nature of Q-learning to update, from every transition in the world, towards all possible goals the agent might be rewarded for; it was introduced into reinforcement learning (RL) by Kaelbling (1993). In prior work this has mainly been explored in small-state RL problems that allow tabular representations and in which all possible goals can be explicitly enumerated and learned about separately. In this paper, we empirically explore 3 different extensions of updating many (rather than all) goals in the setting of RL with deep neural networks (DeepRL for short). First, in a direct adaptation of Kaelbling's approach, we explore whether many-goals updating can be used to achieve mastery in non-tabular, visual-observation domains. Second, we explore whether many-goals updating can be used to pre-train a network so that it subsequently learns a single main task of interest faster and better. Third, we explore whether many-goals updating can be used to provide auxiliary-task updates so that a network learns a single main task of interest faster and better. We provide comparisons against baselines for each of the 3 extensions.
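A toy tabular sketch of the many-goals idea: a single transition is replayed against many candidate goals, each with its own goal-conditioned reward. The goal_reward_fn helper and all names here are hypothetical, just to make the update explicit.

```python
def many_goals_update(q_table, transition, goals, goal_reward_fn,
                      alpha=0.1, gamma=0.99):
    """Update goal-conditioned Q-values for many goals from one transition (sketch).

    q_table: dict mapping (state, action, goal) -> value.
    transition: (state, action, next_state, actions) where `actions` lists the
        actions available in next_state.
    goal_reward_fn(next_state, goal): returns (reward, done) under that goal;
        a stand-in for however goal achievement is defined.
    """
    state, action, next_state, actions = transition
    for goal in goals:
        reward, done = goal_reward_fn(next_state, goal)
        best_next = 0.0 if done else max(
            q_table.get((next_state, a, goal), 0.0) for a in actions)
        old = q_table.get((state, action, goal), 0.0)
        # Standard off-policy Q-learning target, one per goal.
        q_table[(state, action, goal)] = old + alpha * (reward + gamma * best_next - old)
```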
Humans are experts at high-fidelity imitation, often closely mimicking a demonstration in a single attempt. Humans use this ability to quickly solve a task instance and to bootstrap the learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently. MetaMimic relies on the principle of storing all experiences in a memory and replaying them to learn massive deep neural network policies by off-policy RL. To the best of our knowledge, this paper introduces the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on challenging manipulation tasks. The results also show that both types of policy can be learned from vision, despite the task rewards being sparse and without access to the demonstrator's actions.
We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels -- allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits -- in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.
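As an illustration of how the Worker can be tied to the Manager's goals, the sketch below computes a cosine-similarity intrinsic reward between recent movement in a latent state space and the goals set a few steps earlier, which is the kind of signal a feudal Worker can be trained on. Variable names and the horizon value are mine.

```python
import numpy as np

def worker_intrinsic_reward(latent_states, goals, t, horizon=10):
    """Feudal-style intrinsic reward for the Worker at time t (illustrative sketch).

    latent_states: sequence of latent state vectors s_0, ..., s_t (numpy arrays).
    goals: sequence of Manager goal vectors g_0, ..., g_t (numpy arrays).
    Rewards the Worker for moving in latent space along recently set goal
    directions: mean cosine similarity between (s_t - s_{t-i}) and g_{t-i}.
    """
    sims = []
    for i in range(1, min(horizon, t) + 1):
        direction = latent_states[t] - latent_states[t - i]
        goal = goals[t - i]
        denom = np.linalg.norm(direction) * np.linalg.norm(goal) + 1e-8
        sims.append(float(direction @ goal) / denom)
    return sum(sims) / len(sims) if sims else 0.0
```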
Transfer in reinforcement learning refers to the notion that generalisation should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the dynamics of the environment remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics of the environment from the rewards, and "generalised policy improvement", a generalisation of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed approach also provides guarantees for the transferred policy even before any learning has taken place. We derive two theorems that set our approach on firm theoretical ground, and present experiments showing that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.
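A minimal numpy sketch of generalised policy improvement at action-selection time: given the successor features of several previously learned policies evaluated at the current state, and the reward weights of the new task, act greedily with respect to the maximum over policies. Shapes and names are illustrative.

```python
import numpy as np

def gpi_action(successor_features, task_w):
    """Generalised policy improvement over a library of policies (sketch).

    successor_features: array of shape (num_policies, num_actions, feature_dim)
        holding psi^{pi_i}(s, a) for the current state s.
    task_w: reward weights of the new task, shape (feature_dim,).
    Returns the action maximising max_i psi^{pi_i}(s, a) . w.
    """
    q_values = successor_features @ task_w           # (num_policies, num_actions)
    return int(np.argmax(q_values.max(axis=0)))      # max over policies, then actions
```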
We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness to model misspecification compared to several baselines.
Exploration in sparse-reward reinforcement learning remains a difficult open challenge. Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration. Most commonly, these signals are added as bonus rewards, which results in a mixture policy that does not faithfully pursue either exploration or task fulfillment for extended periods of time. In this paper, we learn separate intrinsic and extrinsic task policies and schedule between these different drives to accelerate exploration and stabilize learning. Moreover, we introduce a new type of intrinsic reward, denoted successor feature control (SFC), which is general rather than task-specific. It takes into account statistics over complete trajectories and thus differs from previous methods that only use local information to evaluate intrinsic motivation. We evaluate our proposed scheduled intrinsic drive (SID) agent using purely visual inputs on VizDoom, DeepMind Lab, and OpenAI Gym classic control. The results show greatly improved exploration efficiency from SFC and from the hierarchical usage of the intrinsic drives. A video of our experimental results can be found at http://youtu.be/4ZHcBo7006Y.
Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting failed experiences into successful ones through relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We propose augmenting experience via teacher's advice (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representations and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representations fails to learn. We also show that with language goal representations the agent can generalize to unseen instructions, and even to instructions with unseen lexicons. We further demonstrate that using hindsight advice is crucial for solving challenging tasks, and that even a small amount of advice is sufficient for the agent to achieve good performance.
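To make the hindsight relabelling step concrete, here is a small sketch in which a failed episode is stored twice: once with its original instruction and zero reward, and once relabelled with a teacher-provided language description of what was actually achieved, which turns it into a successful episode. The describe_achieved helper and the episode layout are hypothetical, not the paper's API.

```python
def hindsight_relabel(episode, original_goal, describe_achieved, replay_buffer):
    """Hindsight relabelling with a language goal (illustrative sketch).

    episode: list of (state, action, next_state) tuples from one failed rollout.
    original_goal: the natural-language instruction the agent was given.
    describe_achieved: stand-in for the 'teacher' that returns a language
        description of what the final state actually accomplished.
    replay_buffer: any container with .append(), e.g. a list.
    """
    # Store the original episode: the instructed goal was not reached, reward stays 0.
    for state, action, next_state in episode:
        replay_buffer.append((state, action, 0.0, next_state, original_goal))

    # Relabel with the goal that was actually achieved, making the episode a success.
    achieved_goal = describe_achieved(episode[-1][2])
    for i, (state, action, next_state) in enumerate(episode):
        reward = 1.0 if i == len(episode) - 1 else 0.0
        replay_buffer.append((state, action, reward, next_state, achieved_goal))
```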
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work. In reinforcement learning (RL) (Sutton and Barto, 1998) problems, learning agents take sequential actions with the goal of maximizing a reward signal, which may be time-delayed. For example, an agent could learn to play a game by being told whether it wins or loses, but is never given the "correct" action at any given point in time. The RL framework has gained popularity as learning methods have been developed that are capable of handling increasingly complex problems. However, when RL agents begin learning tabula rasa, mastering difficult tasks is often slow or infeasible, and thus a significant amount of current RL research focuses on improving the speed of learning by exploiting domain expertise with varying amounts of human-provided knowledge. Common approaches include deconstructing the task into a hierarchy of subtasks (cf., Dietterich, 2000); learning with higher-level, temporally abstract, actions (e.g., options, Sutton et al. 1999) rather than simple one-step actions; and efficiently abstracting over the state space (e.g., via function approximation) so that the agent may generalize its experience more efficiently. The insight behind transfer learning (TL) is that generalization may occur not only within tasks, but also across tasks. This insight is not new; transfer has long been studied in the psychological literature (cf., Thorndike and Woodworth, 1901; Skinner, 1953).