We present the temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction, allowing each layer to focus on a different timescale. Our design draws on the architecture of the human brain, which executes actions at different timescales depending on the environment's demands. Such distributed control is widespread across biological systems because it increases survivability and accuracy in both certain and uncertain environments. We demonstrate that TLA offers many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency, and distributed control. We present two algorithms for training TLA: (a) closed-loop control, where the fast controller is trained on top of a pre-trained slow controller, enabling better exploration for the fast controller, which decides whether to "act-or-not" at each timestep; and (b) partially open-loop control, where the slow controller is trained on top of a pre-trained fast controller and either picks a temporally extended action or defers the next n actions to the fast controller. We evaluate our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.
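As a minimal illustration of the closed-loop variant, the sketch below holds a slow action for several timesteps while a fast layer makes an "act-or-not" decision at every step and overrides the held action only when it chooses to act. The toy dynamics, replanning period, and hand-coded gating rule are assumptions for illustration, not the paper's trained controllers.

```python
import numpy as np

# Toy 1-D point-mass environment (a stand-in for a continuous-control task).
def step(state, action, dt=0.05):
    pos, vel = state
    vel = vel + dt * float(action)
    pos = pos + dt * vel
    return np.array([pos, vel])

def slow_policy(state):
    # Coarse, low-frequency controller (placeholder for a pre-trained slow layer).
    return float(np.clip(-0.5 * state[0], -1.0, 1.0))

def fast_policy(state):
    # Fast layer: a binary "act-or-not" gate plus its own corrective action.
    gate = abs(state[0]) > 0.8                      # intervene only when the error is large (toy rule)
    action = float(np.clip(-1.5 * state[0] - 0.5 * state[1], -1.0, 1.0))
    return gate, action

state = np.array([1.0, 0.0])
held = slow_policy(state)
interventions = 0
for t in range(200):
    if t % 8 == 0:                                  # slow layer replans at a lower rate
        held = slow_policy(state)
    gate, fast_action = fast_policy(state)
    action = fast_action if gate else held          # act-or-not gating at every timestep
    interventions += int(gate)
    state = step(state, action)

print(f"final position {state[0]:+.3f}, fast-layer interventions: {interventions}")
```

Counting the fast layer's interventions is one simple way such an architecture keeps its temporal behavior inspectable: most timesteps reuse the cheap held action, and the fast layer only fires when needed.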
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
With the development of deep representation learning, reinforcement learning (RL) has become a powerful learning framework, now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks in which (D)RL methods have been employed, while addressing key computational challenges in the real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents, as well as methods to validate, test, and robustify existing solutions in RL, are also discussed.
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
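For reference, the return-conditioned form of Hindsight Credit Assignment (written here as we recall it from the original HCA paper; the notation is schematic and should be checked against that paper) expresses the advantage of an action through a hindsight distribution h(a | x, z), the probability that action a was taken at state x given that return z was subsequently observed:

$$
A^{\pi}(x,a) \;=\; \mathbb{E}_{Z \sim P^{\pi}(\cdot \mid x, a)}\!\left[\left(1 - \frac{\pi(a \mid x)}{h(a \mid x, Z)}\right) Z\right],
\qquad
h(a \mid x, z) \;=\; \frac{\pi(a \mid x)\, P^{\pi}(z \mid x, a)}{P^{\pi}(z \mid x)} .
$$

Because the hindsight distribution is defined for every action, the same quantity can be used to evaluate actions the agent did not select, which is the counterfactual aspect the abstract highlights.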
The ability to recover from unexpected external perturbations is a fundamental motor skill in bipedal locomotion. An effective response includes not only the ability to recover balance and maintain stability, but also the ability to fall in a safe manner when balance recovery is not feasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that aid humans in walking, designing controllers that provide this kind of stability and safety can prevent damage to the robot or prevent injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, nonlinear, and underactuated system with contacts. Despite progress in model-based and optimization approaches, requirements such as extensive domain knowledge, large computation times, and limited robustness to dynamic variations keep this an open problem. In this thesis, to address these issues, we develop learning-based algorithms capable of synthesizing push-recovery control policies for two distinct types of robots: humanoid robots and assistive robotic devices that aid bipedal locomotion. Our work can be divided into two closely related directions: 1) learning safe falling and fall-prevention strategies for humanoid robots, and 2) learning fall-prevention strategies for humans using a robotic assistive device. To achieve this, we introduce a set of deep reinforcement learning (DRL) algorithms to learn control policies that improve safety while using these robots.
State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. However, these methods all assume that agents perform synchronized primitive-action execution, so they are not truly scalable to long-horizon real-world multi-agent/robot tasks, which inherently require agents/robots to reason asynchronously about high-level action selection at different times. The macro-action decentralized partially observable Markov decision process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a set of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action value functions under three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on this work, we formulate a set of macro-action-based policy gradient algorithms under the same three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms in learning high-quality and asynchronous solutions with macro-actions.
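The key departure from synchronized primitive actions is that each agent only re-decides when its own macro-action terminates. The sketch below shows that asynchronous execution pattern with placeholder macro-actions (a primitive repeated for a random duration); the class names and the random high-level choice are illustrative assumptions, not the thesis's learned macro-action value functions or policies.

```python
import random

class Agent:
    def __init__(self, idx, primitives):
        self.idx = idx
        self.primitives = primitives
        self.current = None                      # (primitive, remaining steps)

    def needs_new_macro(self):
        return self.current is None or self.current[1] == 0

    def choose_macro(self, macro_obs):
        # Placeholder high-level choice; a MacDec-POMDP learner would condition a
        # macro-action value function or policy on macro-observations instead.
        primitive = random.choice(self.primitives)
        duration = random.randint(2, 5)
        self.current = (primitive, duration)

    def act(self):
        primitive, remaining = self.current
        self.current = (primitive, remaining - 1)
        return primitive

agents = [Agent(i, ["move", "wait", "pick"]) for i in range(3)]
for t in range(10):
    for agent in agents:
        if agent.needs_new_macro():              # agents re-decide at different times
            agent.choose_macro(macro_obs=t)
    joint_action = [agent.act() for agent in agents]
    print(t, joint_action)
```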
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. The primary difficulty arises due to insufficient exploration, resulting in an agent being unable to learn robust value functions. Intrinsically motivated agents can explore new behavior for its own sake rather than to directly solve problems. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning. A top-level value function learns a policy over intrinsic goals, and a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse, delayed feedback: (1) a complex discrete stochastic decision process, and (2) the classic ATARI game 'Montezuma's Revenge'.
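The sketch below illustrates the two-level control flow described above on a small chain world: a meta-controller chooses an intrinsic goal, and a controller takes primitive actions toward it under an intrinsic reward, while the meta-controller is trained on the accumulated extrinsic reward. Tabular Q-learning stands in for the paper's deep Q-networks, and the environment, goal set, and rewards are illustrative assumptions rather than the paper's experiments.

```python
import random
from collections import defaultdict

ACTIONS, GOALS = [-1, +1], [3, 7]              # primitive actions and candidate goal states
q_meta = defaultdict(float)                    # Q2(state, goal)
q_ctrl = defaultdict(float)                    # Q1((state, goal), action)
alpha, gamma, eps = 0.1, 0.95, 0.2

def e_greedy(q, key, choices):
    if random.random() < eps:
        return random.choice(choices)
    return max(choices, key=lambda c: q[(key, c)])

for episode in range(200):
    state, done, steps = 0, False, 0
    while not done and steps < 400:
        goal = e_greedy(q_meta, state, GOALS)               # meta-controller picks a goal
        start, extrinsic = state, 0.0
        while not done and state != goal and steps < 400:   # controller pursues the goal
            steps += 1
            action = e_greedy(q_ctrl, (state, goal), ACTIONS)
            next_state = min(10, max(0, state + action))
            done = next_state == 10                          # sparse extrinsic reward at state 10
            ext_r = 1.0 if done else 0.0
            int_r = 1.0 if next_state == goal else 0.0       # intrinsic reward: goal reached
            best = max(q_ctrl[((next_state, goal), a)] for a in ACTIONS)
            td = int_r + gamma * best - q_ctrl[((state, goal), action)]
            q_ctrl[((state, goal), action)] += alpha * td
            extrinsic += ext_r
            state = next_state
        best_next = max(q_meta[(state, g)] for g in GOALS)
        q_meta[(start, goal)] += alpha * (extrinsic + gamma * best_next - q_meta[(start, goal)])
```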
In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals. The physical duration between one decision and the next becomes a critical hyperparameter. When this duration is too short, the agent needs to make many decisions to achieve its goal, aggravating the problem's difficulty. But when this duration is too long, the agent becomes incapable of controlling the system. Physical systems, however, do not need a constant control frequency. For learning agents, it is desirable to operate with low frequency when possible and high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), where the agent chooses options as sub-policies of variable durations. Such options are time-continuous and can interact with the system at any desired frequency providing a smooth change of actions. The empirical analysis shows that our algorithm is competitive w.r.t. other time-abstraction techniques, such as classic option learning and action repetition, and practically overcomes the difficult choice of the decision frequency.
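A minimal sketch of the idea of variable-duration, time-continuous options follows: each "option" here is a target action plus a duration, the executed action changes smoothly toward the target until the duration elapses, and only then is a new high-level decision made. The sampling rule and toy dynamics are assumptions in the spirit of CTCO, not the paper's learned option policies.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.02                                        # physical control period (seconds)

def sample_option(obs):
    target = float(np.clip(-obs + 0.2 * rng.standard_normal(), -1.0, 1.0))
    duration = float(rng.uniform(0.1, 0.5))      # the agent also chooses how long to commit
    return target, duration

obs, action = 0.9, 0.0
start_action, target, duration, t_in_option = action, 0.0, 0.0, 0.0
for _ in range(300):
    if t_in_option >= duration:                  # option finished: new high-level decision
        target, duration = sample_option(obs)
        start_action, t_in_option = action, 0.0
    frac = min(1.0, t_in_option / duration)
    action = (1.0 - frac) * start_action + frac * target   # smooth change of action
    obs += dt * action                           # toy scalar dynamics
    t_in_option += dt

print(f"final observation {obs:+.3f}")
```

Because durations vary per decision, the effective decision frequency adapts to the situation rather than being fixed by a hand-tuned control period.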
Long-horizon robot learning tasks with sparse rewards pose a significant challenge for current reinforcement learning algorithms. A key feature enabling humans to learn challenging control tasks is that they often receive expert intervention, which enables them to understand the high-level structure of the task before mastering low-level control actions. We propose a framework for leveraging expert intervention to solve long-horizon reinforcement learning tasks. We consider option templates, which are specifications encoding potential options that can be trained using reinforcement learning. We formulate expert intervention as allowing the agent to execute option templates before learning an implementation. This enables the agent to use an option before committing the expensive resources needed to learn it. We evaluate our approach on three challenging reinforcement learning problems, showing that it outperforms state-of-the-art approaches. Videos of trained agents and our code can be found at: https://sites.google.com/view/stickymittens
Hierarchical methods in reinforcement learning have the potential to reduce the number of decisions that the agent needs to make when learning new tasks. However, finding reusable, useful temporal abstractions that facilitate fast learning remains a challenging problem. Recently, several deep learning approaches were proposed to learn such temporal abstractions in the form of options in an end-to-end manner. In this work, we point out several shortcomings of these methods and discuss their potential negative consequences. Subsequently, we formulate the desiderata for reusable options and use these to frame the problem of learning options as a gradient-based meta-learning problem. This allows us to formulate an objective that explicitly incentivizes options which allow a higher-level decision maker to adjust to different tasks in few steps. Experimentally, we show that our method is able to learn transferable components which accelerate learning, and that it performs better than prior methods developed for this setting. Additionally, we perform ablations to quantify the impact of using gradient-based meta-learning as well as the other proposed changes.
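To make the "adjust in few steps" idea concrete, the sketch below runs a first-order, Reptile-style meta-learning loop on a toy 1-D problem: shared parameters are nudged toward the result of a few task-specific gradient steps so that future adaptation is fast. This is a deliberately simplified stand-in, not the paper's exact meta-objective over option parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                        # shared, reusable parameter (option-like)
inner_lr, outer_lr, inner_steps = 0.1, 0.05, 3

def task_loss_grad(param, task_target):
    # Gradient of the per-task loss (param - target)^2.
    return 2.0 * (param - task_target)

for meta_iter in range(500):
    task_target = rng.normal(loc=1.0, scale=0.3)   # sample a task
    phi = theta
    for _ in range(inner_steps):                   # few steps of task-specific adaptation
        phi -= inner_lr * task_loss_grad(phi, task_target)
    theta += outer_lr * (phi - theta)              # Reptile-style first-order meta-update

print(f"meta-learned initialization: {theta:.3f} (tasks are centered at 1.0)")
```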
Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used across various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up to larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at large scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstration algorithms, as well as various new agents implemented as part of Acme.
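The sketch below shows the actor / learner / environment-loop decomposition that frameworks of this kind are organized around: actors generate experience, learners consume it, and a loop ties them to the environment so the pieces can be swapped or distributed. The class names and interfaces here are illustrative only and are not the real dm-acme API; consult the library itself for its actual components.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)
    def add(self, transition):
        self.data.append(transition)
    def sample(self, batch_size=32):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

class Actor:
    """Selects actions and records experience; many copies could run in parallel."""
    def __init__(self, buffer):
        self.buffer = buffer
    def select_action(self, observation):
        return random.choice([0, 1])               # placeholder policy
    def observe(self, transition):
        self.buffer.add(transition)

class Learner:
    """Consumes batches of experience and updates parameters (placeholder update)."""
    def __init__(self, buffer):
        self.buffer = buffer
        self.steps = 0
    def step(self):
        batch = self.buffer.sample()
        self.steps += 1                            # a real learner would apply SGD here

def environment_loop(actor, learner, episodes=5, horizon=20):
    for _ in range(episodes):
        obs = 0.0
        for _ in range(horizon):
            action = actor.select_action(obs)
            next_obs, reward = obs + 0.1 * action, float(action)   # toy environment
            actor.observe((obs, action, reward, next_obs))
            learner.step()
            obs = next_obs

buffer = ReplayBuffer()
environment_loop(Actor(buffer), Learner(buffer))
```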
Reinforcement learning (RL) and brain-computer interfaces (BCI) are two fields that have been growing over the past decade. Until recently, these fields operated independently of one another. With rising interest in human-in-the-loop (HITL) applications, RL algorithms have been adapted to account for human guidance, giving rise to the subfield of interactive reinforcement learning (IRL). Adjacently, BCI applications have long been interested in extracting intrinsic feedback from neural activity during human-computer interaction. These two ideas have set RL and BCI on a collision course: by integrating BCI into the IRL framework, intrinsic feedback can be used to help train an agent. This intersection has been denoted intrinsic IRL. To further promote a deeper integration of BCI and IRL, we provide a review of intrinsic IRL, with an emphasis on its parent field of feedback-driven IRL, along with discussions of validity, challenges, and future research directions.
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and the achieved reward. In this work, we present an off-policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high level. The novelty of this work is the theoretical motivation for adding entropy to the RL objective in the HRL setting. We empirically show that entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on the hierarchy, in which adding entropy to the high level emerged as the most desirable configuration. Furthermore, a higher temperature in the low level leads to Q-value overestimation and increases the stochasticity of the environment that the high level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
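Schematically, adding entropy to the high level means the high-level policy optimizes the standard maximum-entropy objective (α is a temperature); the paper's precise condition on the KL divergence between consecutive low-level updates is not reproduced here:

$$
J\!\left(\pi^{\text{hi}}\right) \;=\; \mathbb{E}\!\left[\sum_{t} \gamma^{t}\Big( r_t \;+\; \alpha\, \mathcal{H}\!\big(\pi^{\text{hi}}(\,\cdot \mid s_t)\big)\Big)\right].
$$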
Solving complex problems with reinforcement learning requires breaking the problem down into manageable tasks, either explicitly or implicitly, and learning policies to solve those tasks. These policies, in turn, must be controlled by an overarching policy that makes higher-level decisions. This requires the training algorithm to account for this hierarchical decision structure when learning the policies. In practice, however, training can result in poor generalization, with actions being executed for too few time steps, or with everything collapsing into a single policy. In our work, we introduce an alternative approach to learning such skills sequentially without using an overarching hierarchical policy. We propose this approach in the context of environments in which a major component of the learning agent's objective is to prolong the episode for as long as possible. We refer to our proposed method as the Sequential Option Critic. We demonstrate the utility of our method on navigation and goal-based tasks in a flexible simulated 3D navigation environment that we developed. We also show that our method outperforms prior approaches such as Soft Actor-Critic and Soft Option-Critic in our environment, as well as in the Gym self-driving car simulator and the Atari River Raid environment.
Deep reinforcement learning (DRL) is a promising approach to solving complex control tasks by learning policies through interaction with the environment. However, training DRL policies requires large amounts of training experience, making it impractical to learn policies directly on physical systems. Sim-to-real approaches leverage simulation to pretrain DRL policies and then deploy them in the real world. Unfortunately, the direct real-world deployment of pretrained policies usually suffers from performance degradation due to mismatched dynamics, known as the reality gap. Recent sim-to-real methods, such as domain randomization and domain adaptation, focus on improving the robustness of the pretrained agents. Nevertheless, policies trained in simulation often need to be tuned with real-world data to reach optimal performance, which is challenging due to the high cost of real-world samples. This work proposes a distributed cloud-edge architecture to train DRL agents in the real world in real time. In this architecture, inference and training are assigned to the edge and the cloud, separating the real-time control loop from the computationally expensive training loop. To overcome the reality gap, our architecture exploits a sim-to-real transfer strategy that continues training simulation-pretrained agents on the physical system. We demonstrate its applicability on a physical inverted-pendulum control system and analyze the key parameters. Real-world experiments show that our architecture enables pretrained DRL agents to consistently and efficiently adapt to unseen dynamics.
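The sketch below shows the basic split: a fixed-period control loop performs inference with the latest policy snapshot, while a separate training loop consumes experience asynchronously and pushes updated parameters back. Queues stand in for the network transport between edge and cloud; the names, toy dynamics, and placeholder "training" update are illustrative assumptions, not the paper's system.

```python
import queue
import threading
import time

experience_q = queue.Queue()
params_q = queue.Queue()

def edge_control_loop(steps=50, dt=0.01):
    params, obs = 0.5, 1.0                        # start from "pretrained" parameters
    for _ in range(steps):
        try:
            params = params_q.get_nowait()        # pick up new weights when available
        except queue.Empty:
            pass                                  # otherwise keep controlling with the old ones
        action = -params * obs                    # real-time inference on the edge
        obs += dt * action
        experience_q.put((obs, action))
        time.sleep(dt)                            # fixed control period

def cloud_training_loop():
    params = 0.5
    while True:
        sample = experience_q.get()
        if sample is None:                        # shutdown signal
            break
        params = min(2.0, params + 0.05)          # placeholder for a DRL update
        params_q.put(params)

trainer = threading.Thread(target=cloud_training_loop)
trainer.start()
edge_control_loop()
experience_q.put(None)
trainer.join()
print("edge loop and cloud trainer finished")
```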
StarCraft II (SC2) poses a grand challenge for reinforcement learning (RL); the main difficulties include the huge state space, varied action space, and long time horizon. In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We study a hierarchical RL approach involving extracted macro-actions and a hierarchy of neural networks. We investigate a curriculum transfer training procedure and train the agent on a single machine with 4 GPUs and 48 CPU threads. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating built-in AI (level 7). In an extended version of this paper, we improve the architecture to train agents against the cheating-level AIs and achieve win rates of 96%, 97%, and 94% against the level-8, level-9, and level-10 AIs, respectively. Our code is at https://github.com/liuruoze/hiernet-sc2. To provide a baseline for our work as well as for the research and open-source community, we reproduce a scaled-down version of Mini-AlphaStar (MAS). The latest version of MAS is 1.07, which can be trained on the raw action space with 564 actions. It is designed to be trainable on a single common machine by making the hyper-parameters adjustable. We then compare our work with MAS using the same resources and show that our method is more effective. The code of Mini-AlphaStar is at https://github.com/liuruoze/mini-alphastar. We hope our study sheds light on future research on efficient reinforcement learning for SC2 and other large-scale games.
Here, we report a case study implementing reinforcement learning (RL) to automate operations in the scanning transmission electron microscopy (STEM) workflow. To this end, we design a virtual, prototypical RL environment to test and develop a network for autonomously aligning the electron beam without prior knowledge. Using this simulator, we evaluate the impact of environment design and algorithm hyperparameters on alignment accuracy and learning convergence, showing robust convergence across a wide hyperparameter space. In addition, we deploy the successful model on the microscope to validate the approach and demonstrate the value of designing appropriate virtual environments. Consistent with the simulated results, the on-microscope RL model converges to the target alignment after minimal training. Overall, the results show that by leveraging RL, microscope operations can be automated without extensive algorithm design, marking another step toward augmenting electron microscopy with machine learning methods.
Asset allocation (or portfolio management) is the task of determining how to optimally allocate a finite budget of funds across a range of financial instruments/assets, such as stocks. This study investigates the performance of reinforcement learning (RL) applied to portfolio management using model-free deep RL agents. We trained several RL agents on real-world stock prices to learn how to perform asset allocation. We compared the performance of these RL agents against some baseline agents, and also compared the RL agents with one another to understand which classes of agents performed better. From our analysis, RL agents can perform the task of portfolio management, as they significantly outperformed the baseline agents (random allocation and uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed the best baseline, MPT, overall. This shows the ability of RL agents to uncover more profitable trading strategies. Furthermore, there were no significant performance differences between value-based and policy-based RL agents. Actor-critic agents performed better than the other types of agents. Likewise, on-policy agents performed better because they are better at policy evaluation, and sample efficiency is not a significant issue in portfolio management. This study shows that RL agents can substantially improve asset allocation, since they outperformed strong baselines. Based on our analysis, on-policy, actor-critic RL agents showed the most promise.
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
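In symbols, the critic target described above takes the minimum over a pair of target critics Q_{θ'_1}, Q_{θ'_2} evaluated at the target actor π_{φ'}, and each critic regresses toward that shared target (written schematically; batch averaging and exploration details are omitted):

$$
y \;=\; r + \gamma \min_{i \in \{1,2\}} Q_{\theta_i'}\!\big(s',\, \pi_{\phi'}(s')\big),
\qquad
\mathcal{L}(\theta_i) \;=\; \big(y - Q_{\theta_i}(s,a)\big)^2 .
$$

The actor parameters φ are then updated less frequently than the critics (the delayed policy updates mentioned above), and the primed target networks track the online networks through slow Polyak averaging.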
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
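The core update in this deterministic-policy actor-critic setting is the deterministic policy gradient, paired with a TD-trained critic and slowly updated target networks (written schematically; replay sampling and exploration noise are omitted):

$$
\nabla_{\phi} J \;\approx\; \mathbb{E}_{s}\!\left[\, \nabla_{a} Q_{\theta}(s,a)\big|_{a=\pi_{\phi}(s)} \; \nabla_{\phi}\pi_{\phi}(s) \right],
\qquad
\theta' \leftarrow \tau \theta + (1-\tau)\,\theta', \quad \phi' \leftarrow \tau \phi + (1-\tau)\,\phi'.
$$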