We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimization (MPO), based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms established methods with respect to sample efficiency, premature convergence, and robustness to hyperparameter settings, while achieving similar or better final performance.
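As a hedged illustration of the coordinate-ascent structure mentioned above, MPO-style updates are commonly written as an E-step that reweights the current policy by exponentiated action values, followed by an M-step that fits the parametric policy to the reweighted distribution; the temperature η, the constraint level ε, and the exact placement of the constraint below are assumptions of this sketch rather than details stated in the abstract.

```latex
% E-step: non-parametric improved policy, closed form under a relative-entropy constraint
q(a \mid s) \;\propto\; \pi_{\theta_{\mathrm{old}}}(a \mid s)\,
    \exp\!\big(Q_{\theta_{\mathrm{old}}}(s, a) / \eta\big)

% M-step: fit the parametric policy to q by weighted maximum likelihood,
% again regularized toward the previous policy
\theta_{\mathrm{new}} \;=\; \arg\max_{\theta}\;
    \mathbb{E}_{s}\,\mathbb{E}_{a \sim q(\cdot \mid s)}\big[\log \pi_{\theta}(a \mid s)\big]
    \quad \text{s.t.}\quad
    \mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big)\big] \;\le\; \epsilon
```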
A central challenge in reinforcement learning is discovering effective policies for tasks in which rewards are sparsely distributed. We hypothesize that, in the absence of a useful reward signal, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space, from which the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned model with an information bottleneck, we can identify decision states by examining where the model actually makes use of the goal state. We find that this simple mechanism identifies decision states effectively, even in partially observed settings; in effect, the model learns cues that correlate with potential subgoals. In new environments, the model can identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and into new regions of the state space.
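A sketch of what such an information-bottleneck objective can look like, under the assumption that the bottleneck is expressed directly as a penalty on how much behavior depends on the goal (the method may route this dependence through a latent variable; β, the default policy π_0, and the exact bound are illustrative):

```latex
% Trade off return against the policy's reliance on the goal G
J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t r_t\Big] \;-\; \beta\, I(A; G \mid S)

% The mutual-information term can be upper-bounded per state by a KL term;
% states where this KL is large are candidate decision states
I(A; G \mid S) \;\le\; \mathbb{E}_{s, g}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid s, g)\,\big\|\,\pi_0(\cdot \mid s)\big)\Big]
```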
Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is usually not observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes across tasks, which can easily lead to one task dominating the learning of the shared model. We propose a new joint training approach, which we call Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a 'distilled' policy that captures common behaviour. Each worker is trained to solve its own task while being constrained to stay close to the shared policy, and the shared policy is in turn trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer in complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable, properties that are critical in deep reinforcement learning.
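One common way to write such a joint objective over the task policies π_i and the shared distilled policy π_0 is sketched below; the coefficients α and β and the discounting of the regularization terms are assumptions made for the sake of a concrete example.

```latex
% Each task policy maximizes its own return, is pulled toward the distilled policy
% (log-likelihood / KL term), and keeps some entropy of its own
J(\pi_0, \{\pi_i\}) \;=\; \sum_i \mathbb{E}_{\pi_i}\!\left[
    \sum_t \gamma^t \Big( r_i(s_t, a_t)
    \;+\; \tfrac{\alpha}{\beta}\,\log \pi_0(a_t \mid s_t)
    \;-\; \tfrac{1}{\beta}\,\log \pi_i(a_t \mid s_t) \Big) \right]
```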
We present a method for reinforcement learning of closely related skills that are parameterized via a skill embedding space. We learn such skills by taking advantage of latent variables and exploiting a connection between reinforcement learning and variational inference. The main contribution of our work is an entropy-regularized policy gradient formulation for hierarchical policies, and an associated, data-efficient and robust off-policy gradient algorithm based on stochastic value gradients. We demonstrate the effectiveness of our method on several simulated robotic manipulation tasks. We find that our method allows for discovery of multiple solutions and is capable of learning the minimum number of distinct skills that are necessary to solve a given set of tasks. In addition, our results indicate that the proposed technique can interpolate and/or sequence previously learned skills in order to accomplish more complex tasks, even in the presence of sparse rewards.
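A minimal sketch of what acting with a latent skill embedding can look like, assuming the embedding network outputs the mean and log-std of a Gaussian over skills; the module names and this parameterization are illustrative assumptions, not the paper's exact architecture.

```python
import torch

def act_with_skill(policy_net, embedding_net, state, task_id):
    """Sample a latent skill embedding for the current task and condition the
    policy on it (all names and the Gaussian parameterization are assumptions)."""
    mean, log_std = embedding_net(task_id)               # task id -> distribution over skills
    z = mean + log_std.exp() * torch.randn_like(mean)    # reparameterized skill sample
    return policy_net(torch.cat([state, z], dim=-1))     # skill-conditioned action
```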
Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We propose an unsupervised learning algorithm that trains an agent to reach perceptually specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal-achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a cooperative game, giving rise to a learned reward function that reflects similarity in terms of the controllable aspects of the environment rather than distance in observation space. We demonstrate that our agent learns to achieve goals in an unsupervised manner in three domains: Atari, the DeepMind Control Suite, and DeepMind Lab.
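As a hedged sketch of the idea that the learned reward measures similarity between a reached state and the goal state, one could compute it from learned embeddings; the cosine form and the `embed_net` module below are assumptions for illustration, not the paper's exact reward.

```python
import torch.nn.functional as F

def goal_achievement_reward(embed_net, achieved_obs, goal_obs):
    """Reward for the goal-conditioned policy: similarity between learned embeddings
    of the achieved observation and the goal observation, so the reward can reflect
    controllable aspects of the environment rather than raw pixel distance."""
    return F.cosine_similarity(embed_net(achieved_obs), embed_net(goal_obs), dim=-1)
```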
We present an off-policy actor-critic algorithm for reinforcement learning (RL) that combines ideas from gradient-free optimization via stochastic search with learned action-value functions. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement by estimating a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm exploits connections to the existing literature on black-box optimization and 'RL as inference', and can be seen as an extension of the Maximum a Posteriori Policy Optimization (MPO) algorithm [Abdolmaleki et al., 2018a], or as an extension of the trust-region Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997] to a policy iteration scheme. Our comparison on 31 continuous control tasks from the Parkour suite [Heess et al., 2017], the DeepMind Control Suite [Tassa et al., 2018], and OpenAI Gym [Brockman et al., 2016], with diverse properties, a limited amount of compute, and a single set of hyperparameters, demonstrates the effectiveness of our method and state-of-the-art results. Videos summarizing the results can be found at goo.gl/HtvJKR.
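A minimal sketch of one iteration of the three-step scheme above, written with injected callables so the structure is explicit; the exponential weighting, the sample count, and all names are illustrative assumptions rather than the paper's exact variant.

```python
import numpy as np

def improve_policy_step(states, sample_actions, q_value, fit_policy,
                        n_samples=20, temperature=1.0):
    """One iteration of the three-step scheme (a sketch, not the exact algorithm).

    sample_actions(state, n) -> n candidate actions from the current policy
    q_value(state, action)   -> scalar from the learned action-value function
    fit_policy(states, actions, weights) -> new parametric policy (weighted max. likelihood)
    """
    all_states, all_actions, all_weights = [], [], []
    for s in states:
        acts = sample_actions(s, n_samples)                  # candidates from current policy
        qs = np.array([q_value(s, a) for a in acts])         # step (i): policy evaluation
        w = np.exp((qs - qs.max()) / temperature)            # step (ii): local non-parametric
        w /= w.sum()                                         #   improvement by re-weighting
        all_states.extend([s] * n_samples)
        all_actions.extend(acts)
        all_weights.extend(w)
    return fit_policy(all_states, all_actions, all_weights)  # step (iii): generalization
```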
We propose a method for learning expressive energy-based policies for continuous states and actions, which has previously been feasible only in tabular domains. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
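A sketch of the Boltzmann policy mentioned above and the soft value function that normalizes it; α denotes the temperature, and the notation follows common usage for this family of methods rather than being copied from the paper.

```latex
\pi(a \mid s) \;\propto\; \exp\!\big(Q_{\mathrm{soft}}(s, a) / \alpha\big),
\qquad
V_{\mathrm{soft}}(s) \;=\; \alpha \log \int_{\mathcal{A}}
    \exp\!\big(Q_{\mathrm{soft}}(s, a') / \alpha\big)\, \mathrm{d}a'
```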
This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple improvements such as the use of N-step returns and prioritized experience replay. Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance.
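For concreteness, an N-step return target of the kind mentioned above can be computed as below; the function and argument names are illustrative.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """N-step return target: discounted sum of the next N rewards plus a discounted
    bootstrap from the critic at the state N steps ahead.
    `rewards` is [r_t, ..., r_{t+N-1}]; the scalar form is a simplification."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

In D4PG itself the critic outputs a return distribution, so the bootstrap above would be a distribution projected back onto the critic's support rather than a scalar; that projection step is omitted in this sketch.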
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL; and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods makes it possible to combine the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than traditional RL with action space noise and evolutionary strategies individually.
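A minimal sketch of parameter-space noise in contrast to action-space noise: the policy's weights are perturbed once and the perturbed copy is used to act, giving temporally consistent exploration within an episode. The fixed standard deviation below is a simplification; the method described above adapts the noise scale.

```python
import copy
import torch

def perturb_parameters(policy_net, stddev=0.1):
    """Return a copy of the policy whose weights are perturbed with Gaussian noise,
    rather than adding noise to the actions it outputs. The perturbed copy is then
    used to collect experience. (Fixed stddev is an assumption of this sketch.)"""
    noisy_net = copy.deepcopy(policy_net)
    with torch.no_grad():
        for param in noisy_net.parameters():
            param.add_(torch.randn_like(param) * stddev)
    return noisy_net
```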
Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose DIAYN ('Diversity is All You Need'), a method for learning useful skills without a reward function. Our proposed method learns skills by maximizing an information-theoretic objective using a maximum entropy policy. On a variety of simulated robotic tasks, we show that this simple objective leads to the unsupervised emergence of diverse skills, such as walking and jumping. In a number of reinforcement learning benchmark environments, our method is able to learn a skill that solves the benchmark task despite never receiving the true task reward. We show that pretrained skills can provide a good parameter initialization for downstream tasks and can be composed hierarchically to solve complex, sparse-reward tasks. Our results suggest that unsupervised skill discovery can serve as an effective pretraining mechanism for overcoming the exploration and data-efficiency challenges of reinforcement learning.
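A hedged sketch of the information-theoretic objective and the resulting per-step pseudo-reward for this kind of skill discovery, where q_φ is a learned skill discriminator and p(z) the fixed skill prior; treat the exact decomposition as an assumption of this sketch.

```latex
% Objective over skills z, states s and actions a
\mathcal{F}(\theta) \;=\; I(S; Z) \;+\; \mathcal{H}(A \mid S) \;-\; I(A; Z \mid S)

% Practical per-step pseudo-reward for the skill-conditioned policy
r_z(s) \;=\; \log q_\phi(z \mid s) \;-\; \log p(z)
```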
We propose a Plan Online and Learn Offline (POLO) framework for the setting where an agent, equipped with an internal model, must continually act and learn in the world. Our work builds on the synergy between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function and can stabilize and accelerate value function learning. Conversely, we also study how an approximate value function can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we demonstrate how trajectory optimization can be used to perform temporally coordinated exploration together with estimating the uncertainty of the value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables the solution of complex control tasks, such as humanoid locomotion and dexterous hand manipulation, with the equivalent of a few minutes of experience.
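A minimal sketch of the planning/learning interplay described above, using a random-shooting planner whose trajectories are truncated by a learned terminal value; the shooting planner, the hard-coded discount, and all names are illustrative assumptions rather than the framework's actual optimizer.

```python
import numpy as np

def plan_with_terminal_value(state, dynamics, reward, value_fn,
                             horizon=10, n_candidates=256, action_dim=4, rng=None):
    """Pick the first action of the best sampled action sequence, scoring each
    sequence by model rollout reward plus a learned terminal value.
    dynamics(s, a) -> next state, reward(s, a) -> scalar, value_fn(s) -> scalar."""
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s, total, discount = state, 0.0, 1.0
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        for a in actions:
            total += discount * reward(s, a)
            s = dynamics(s, a)
            discount *= 0.99
        total += discount * value_fn(s)      # terminal value lets the horizon stay short
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```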
We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors, from scratch, in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment, enabling it to excel at sparse-reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach. A video of the rich set of learned behaviours can be found at https://youtu.be/mPKyvocNe M.
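A hedged sketch of what a learned scheduler over auxiliary and main tasks can look like: an epsilon-greedy choice over a scheduler value that scores how useful executing a given task now is for eventual main-task return. The function names and the epsilon-greedy rule are assumptions for illustration, not the paper's exact scheduler.

```python
import random

def pick_intention(q_schedule, history, tasks, epsilon=0.1):
    """Pick which (auxiliary or main) task the agent executes for the next period.
    q_schedule(history, task) -> scalar score of scheduling `task` given the
    history of previously executed tasks (all names are assumptions)."""
    if random.random() < epsilon:
        return random.choice(tasks)
    return max(tasks, key=lambda task: q_schedule(history, task))
```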
The success of popular algorithms for deep reinforcement learning, such as policy-gradients and Q-learning, relies heavily on the availability of an informative reward signal at each timestep of the sequential decision-making process. When rewards are only sparsely available during an episode, or a rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-timestep rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this work, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned from experience replays. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning, and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.
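A hedged sketch of the divergence-minimization view described above, where ρ_π and ρ_E denote the state-action visitation distributions of the current policy and of the good past trajectories kept in the replay buffer (how ρ_E is constructed is not specified in this abstract):

```latex
% Policy optimization as divergence minimization between visitation distributions
\min_{\pi}\; D_{\mathrm{JS}}\big(\rho_\pi(s, a)\,\big\|\,\rho_E(s, a)\big)
```

As in adversarial imitation learning, an objective of this shape is typically optimized by training a discriminator between the two distributions, whose output then provides the learned, shaped per-step reward for a standard policy-gradient update.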
Online, off-policy reinforcement learning algorithms are able to use an experience memory to remember and replay past experiences. In previous work, this approach was used to stabilize training by breaking the temporal correlation of updates and avoiding the rapid forgetting of possibly rare experiences. In this work, we propose a conceptually simple framework that uses an experience memory to help exploration by prioritizing the states from which the agent starts acting in the environment; importantly, it is also compatible with on-policy algorithms. Given the ability to restart the agent in states corresponding to past observations, we achieve this by (i) enabling the agent to restart in states belonging to its past experience (e.g., near a goal), and (ii) promoting faster coverage of the state space by starting from a more diverse set of states. While case (i), which relies on a good priority measure to identify important past transitions, is expected to be more helpful for exploration in certain problems (e.g., sparse-reward tasks), we hypothesize that case (ii) will generally be beneficial even without any prioritization. We show that our method improves learning performance for both off-policy and on-policy deep reinforcement learning methods, with the most significant improvement observed in a very sparse-reward task.
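A minimal sketch of the restart mechanism described above; the restart probability, the optional priority weighting, and the `None` convention for a default environment reset are assumptions for illustration.

```python
import random

def sample_restart_state(replay_memory, priorities=None, p_restart=0.5):
    """Choose where the next episode starts: with probability p_restart, restart
    from a state stored in the experience memory (optionally sampled according to
    a priority score), otherwise fall back to the environment's default reset."""
    if replay_memory and random.random() < p_restart:
        if priorities is not None:
            return random.choices(replay_memory, weights=priorities, k=1)[0]
        return random.choice(replay_memory)
    return None  # caller interprets None as "reset the environment normally"
```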
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those of the single-agent setting. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. Compared to recent approaches, this attention mechanism enables more effective and scalable learning in complex multi-agent environments. Our approach is applicable not only to cooperative settings with shared rewards, but also to settings with individualized rewards, including adversarial settings, and it makes no assumptions about the agents' action spaces. It is thus flexible enough to be applied to most multi-agent learning problems.
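A minimal sketch of the attention step such a centralized critic can use: agent i's critic forms a query and attends over encodings of the other agents with scaled dot-product attention. The shapes and names below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def attend_over_agents(query_i, keys_others, values_others):
    """Scaled dot-product attention of agent i's critic over the other agents.
    query_i: (d,); keys_others, values_others: (n_agents - 1, d)."""
    d = query_i.shape[-1]
    scores = keys_others @ query_i / d ** 0.5   # relevance of each other agent
    weights = F.softmax(scores, dim=0)          # attention weights sum to one
    return weights @ values_others              # weighted summary fed into the critic
```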
In this work, we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is handling the increased amount of data and the extended training time. We develop a new distributed agent, IMPALA (Importance Weighted Actor-Learner Architecture), that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilization. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in the Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA achieves better performance than previous agents with less data and, crucially, exhibits positive transfer between tasks as a result of its multi-task approach.
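For reference, a V-trace value target of the kind referred to above typically takes the following form, with behaviour policy μ (the actors) and learner policy π; the truncation levels ρ̄ and c̄ are hyperparameters, and an optional λ factor on the c_i terms is omitted in this sketch.

```latex
v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s}
    \Big(\textstyle\prod_{i=s}^{t-1} c_i\Big)\, \delta_t V,
\qquad
\delta_t V \;=\; \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)

% truncated importance weights
\rho_t \;=\; \min\!\Big(\bar{\rho},\, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big),
\qquad
c_i \;=\; \min\!\Big(\bar{c},\, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big)
```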
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision-making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
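The maximum entropy objective referred to above is commonly written as follows, where ρ_π is the state-action visitation distribution of π and α is the temperature weighting the entropy bonus (the notation is illustrative):

```latex
J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \Big[ r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```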
We explore methods for option discovery based on variational inference and make two algorithmic contributions. First, we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from this connection. In VALOR, the policy encodes contexts drawn from a noise distribution into trajectories, and a decoder recovers the contexts from the complete trajectories. Second, we propose a curriculum learning approach in which the number of contexts seen by the agent is increased whenever the agent's performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the approach and the applicability of learned options to downstream tasks.
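A hedged sketch of the autoencoding view described above: a context c is drawn from a noise distribution p(c), the policy "encodes" it into a trajectory τ, and a decoder D must recover it. How the decoder's log-probability is distributed over the trajectory as a reward is a detail not given in this abstract.

```latex
\max_{\pi,\, D}\;\; \mathbb{E}_{c \sim p(c),\; \tau \sim \pi(\cdot \mid c)}
    \big[ \log P_D(c \mid \tau) \big]
```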
This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. The algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard-exploration Atari games and is competitive with state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.
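For concreteness, self-imitation losses of this kind are usually written over samples (s, a, R) drawn from the replay buffer, where R is the discounted return actually obtained and (x)_+ = max(x, 0), so that only transitions whose past return exceeds the current value estimate are imitated; the exact coefficients are assumptions of this sketch.

```latex
\mathcal{L}^{\mathrm{sil}}_{\mathrm{policy}} \;=\;
    \mathbb{E}_{(s, a, R)}\big[ -\log \pi_\theta(a \mid s)\,\big(R - V_\theta(s)\big)_{+} \big],
\qquad
\mathcal{L}^{\mathrm{sil}}_{\mathrm{value}} \;=\;
    \tfrac{1}{2}\,\mathbb{E}_{(s, R)}\big[ \big(R - V_\theta(s)\big)_{+}^{2} \big]
```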