One of the major challenges in Deep Reinforcement Learning for control is the need for extensive training to learn the policy. Motivated by this, we present the design of the Control-Tutored Deep Q-Networks (CT-DQN) algorithm, a Deep Reinforcement Learning algorithm that leverages a control tutor, i.e., an exogenous control law, to reduce learning time. The tutor can be designed using an approximate model of the system, without any assumption about the knowledge of the system's dynamics. There is no expectation that it will be able to achieve the control objective if used stand-alone. During learning, the tutor occasionally suggests an action, thus partially guiding exploration. We validate our approach on three scenarios from OpenAI Gym: the inverted pendulum, lunar lander, and car racing. We demonstrate that CT-DQN is able to achieve better or equivalent data efficiency with respect to the classic function approximation solutions.
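We present an architecture in which a feedback controller derived from an approximate model of the environment assists the learning process to improve its data efficiency. This architecture, which we term Control-Tutored Q-learning (CTQL), is presented in two alternative flavours. The former is based on defining the reward function so that a Boolean condition can be used to determine when the control tutor policy is adopted, while the latter, termed probabilistic CTQL (pCTQL), is instead based on executing calls to the tutor with a certain probability during learning. Both approaches are validated, and thoroughly benchmarked against Q-learning, by considering the stabilization of an inverted pendulum as defined in OpenAI Gym as a representative problem.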
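The probabilistic variant described above can be illustrated with a minimal Python sketch. The abstract only says that the tutor is called with a certain probability during learning; the function name `select_action`, the parameters `tutor_prob` and `epsilon`, and the fallback to ε-greedy exploration are assumptions made for illustration, not the authors' implementation.

```python
import random


def select_action(q_values, tutor_action, tutor_prob, epsilon, rng=random):
    """Return an action index for one learning step.

    Hypothetical mixing rule (an assumption, not taken from the paper):
      - with probability tutor_prob, follow the control tutor's suggestion;
      - with probability epsilon, pick a uniformly random action;
      - otherwise, act greedily with respect to the current Q-values.
    Assumes tutor_prob + epsilon <= 1.
    """
    n_actions = len(q_values)
    u = rng.random()  # uniform sample in [0, 1)
    if u < tutor_prob:
        return tutor_action
    if u < tutor_prob + epsilon:
        return rng.randrange(n_actions)
    # Greedy action: index of the largest Q-value.
    return max(range(n_actions), key=lambda a: q_values[a])
```

Setting `tutor_prob` to zero reduces this sketch to plain ε-greedy Q-learning action selection, which is the kind of baseline the abstract benchmarks against.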
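Deep Reinforcement Learning (DRL) is regarded as a potential method for car-following control and has mostly been studied to support a single following vehicle. However, it is more challenging to learn a stable and efficient car-following policy when there are multiple following vehicles in a platoon, especially with unpredictable leading-vehicle behavior. In this context, we adopt an integrated DRL and Dynamic Programming (DP) approach to learn autonomous platoon control policies, which embeds the Deep Deterministic Policy Gradient (DDPG) algorithm into a finite-horizon value iteration framework. Although the DP framework can improve the stability and performance of DDPG, it has the limitations of lower sampling and training efficiency. In this paper, we propose an algorithm, namely Finite-Horizon-DDPG with Sweeping through reduced state space using Stationary approximation (FH-DDPG-SS), which uses three key ideas to overcome these limitations: transferring network weights backward in time, stationary policy approximation for earlier time steps, and sweeping through a reduced state space. To verify the effectiveness of FH-DDPG-SS, simulations using real driving data are performed, in which the performance of FH-DDPG-SS is compared with that of the benchmark algorithms. Finally, platoon safety and string stability for FH-DDPG-SS are demonstrated.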
In this paper, we build on advances introduced by the Deep Q-Networks (DQN) approach to extend the multi-objective tabular Reinforcement Learning (RL) algorithm W-learning to large state spaces. The W-learning algorithm can naturally resolve the competition between multiple single policies in multi-objective environments. However, the tabular version does not scale well to environments with large state spaces. To address this issue, we replace the underlying Q-tables with DQNs and propose the addition of W-Networks as a replacement for the tabular weight (W) representations. We evaluate the resulting Deep W-Networks (DWN) approach on two widely accepted multi-objective RL benchmarks: deep sea treasure and multi-objective mountain car. We show that DWN resolves the competition between multiple policies while outperforming a DQN baseline. Additionally, we demonstrate that the proposed algorithm can find the Pareto front in both tested environments.
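Intelligent traffic lights in smart cities can optimally reduce traffic congestion. In this study, we employ reinforcement learning to train the control agent of a traffic light on a simulator of urban mobility. In contrast to existing works, a policy-based deep reinforcement learning method, Proximal Policy Optimization (PPO), is used in addition to value-based methods such as Deep Q Network (DQN) and Double DQN (DDQN). First, the optimal policy obtained from PPO is compared with those from DQN and DDQN. It is found that the policy from PPO performs better than the others. Next, instead of fixed-interval traffic light phases, we adopt light phases with variable time intervals, which results in a better policy for passing the traffic flow. Then, the effects of environment and action disturbances are studied to demonstrate that the learning-based controller is robust. Finally, we consider unbalanced traffic flow and find that an intelligent traffic light can perform moderately well in unbalanced traffic scenarios, although it learns the optimal policy from balanced traffic scenarios only.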
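While there have been significant advances in sensing, perception, and localization for automated driving, the wide spectrum of traffic and road-structure scenarios and the long-tail distribution of human driver behavior mean that it remains an open challenge for an intelligent vehicle to always know how to make and execute the best decision on the road given the available sensing/perception/localization information. In this chapter, we discuss how artificial intelligence, and more specifically reinforcement learning, can take advantage of operational knowledge and safety reflexes to make strategic and tactical decisions. We discuss some challenging problems related to the robustness of reinforcement learning solutions and their implications for the practical design of driving strategies for autonomous vehicles. We focus on automated driving on highways and on the integration of reinforcement learning, vehicle motion control, and control barrier functions, leading to a robust AI driving strategy that can learn and adapt safely.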
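Learning control policies with large action spaces is a challenging problem in the field of reinforcement learning because of current inefficiencies in exploration. In this work, we introduce a Deep Reinforcement Learning (DRL) algorithm called Multi-Action Networks (MAN) Learning that addresses the challenge of large discrete action spaces. We propose separating the action space into two components, creating a value neural network for each sub-action. MAN then uses temporal-difference learning to train the networks synchronously, which is simpler than training a single network that outputs the full action directly. To evaluate the proposed method, we test MAN on a block-stacking task and then extend MAN to handle 12 games from the Atari Arcade Learning Environment that use an 18-action space. Our results indicate that MAN learns faster than both Deep Q-Learning and Double Deep Q-Learning, implying that our method is a better-performing synchronous temporal-difference algorithm than those currently available for large action spaces.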
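The idea of separating a discrete action space into two components can be illustrated with a small Python sketch of the index bookkeeping. The abstract does not say how the sub-actions are defined or how the two networks are combined, so the factorisation by `divmod`, the function names, and the independent per-component greedy choice are all hypothetical.

```python
def split_action(flat_action, n_second):
    """Map a flat action index to a (first, second) sub-action pair.

    Assumes (hypothetically) that the flat space has n_first * n_second
    actions laid out row by row.
    """
    return divmod(flat_action, n_second)


def join_action(first, second, n_second):
    """Inverse of split_action: rebuild the flat action index."""
    return first * n_second + second


def greedy_joint_action(values_first, values_second):
    """Choose each sub-action greedily from its own value estimates.

    values_first and values_second stand in for the outputs of the two
    value networks; selecting them independently is an assumption.
    """
    first = max(range(len(values_first)), key=values_first.__getitem__)
    second = max(range(len(values_second)), key=values_second.__getitem__)
    return join_action(first, second, len(values_second))
```

For example, an 18-action space could be treated as 3 x 6 sub-actions, so each value network would need only 3 or 6 outputs instead of 18; whether the paper uses this particular split is not stated.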
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related to but are not classical RL algorithms. The role of simulators in training agents and methods to validate, test, and robustify existing RL solutions are also discussed.
Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting, the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We can obtain the conformations of best known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL is based on a deep Q-network (DQN). We find that a DQN based on long short-term memory (LSTM) architecture greatly enhances the RL learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need of manual heuristics. Experimentally we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding.
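The objective in the two-dimensional HP model can be made concrete with a short Python sketch that counts H-H contacts for a given conformation on a square lattice. It uses the standard definition of a contact (two H residues on neighbouring lattice sites that are not consecutive in the chain); the function name `hh_contacts` and the coordinate-list representation are assumptions for illustration and are not taken from the paper. The sketch checks self-avoidance but does not check that consecutive residues occupy neighbouring sites.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP chain placed on a 2D square lattice.

    sequence: string of 'H' and 'P', one character per residue.
    coords:   list of (x, y) integer lattice sites, one per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    site_to_index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = site_to_index.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts
```

A search method such as the DQN described above would then aim to maximise this count, or equivalently minimise an energy equal to its negative.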
Machine learning frameworks such as Genetic Programming (GP) and Reinforcement Learning (RL) are gaining popularity in flow control. This work presents a comparative analysis of the two, benchmarking some of their most representative algorithms against global optimization techniques such as Bayesian Optimization (BO) and Lipschitz global optimization (LIPO). First, we review the general framework of the model-free control problem, bringing together all methods as black-box optimization problems. Then, we test the control algorithms on three test cases. These are (1) the stabilization of a nonlinear dynamical system featuring frequency cross-talk, (2) the wave cancellation from a Burgers' flow and (3) the drag reduction in a cylinder wake flow. We present a comprehensive comparison to illustrate their differences in exploration versus exploitation and their balance between 'model capacity' in the control law definition versus 'required complexity'. We believe that such a comparison paves the way toward the hybridization of the various methods, and we offer some perspective on their future development in the literature on flow control problems.
A long-standing challenge in artificial intelligence is lifelong learning. In lifelong learning, many tasks are presented in sequence and learners must efficiently transfer knowledge between tasks while avoiding catastrophic forgetting over long lifetimes. On these problems, policy reuse and other multi-policy reinforcement learning techniques can learn many tasks. However, they can generate many temporary or permanent policies, resulting in memory issues. Consequently, there is a need for lifetime-scalable methods that continually refine a policy library of a pre-defined size. This paper presents a first approach to lifetime-scalable policy reuse. To pre-select the number of policies, a notion of task capacity, the maximal number of tasks that a policy can accurately solve, is proposed. To evaluate lifetime policy reuse using this method, two state-of-the-art single-actor base-learners are compared: 1) a value-based reinforcement learner, Deep Q-Network (DQN) or Deep Recurrent Q-Network (DRQN); and 2) an actor-critic reinforcement learner, Proximal Policy Optimisation (PPO) with or without Long Short-Term Memory layer. By selecting the number of policies based on task capacity, D(R)QN achieves near-optimal performance with 6 policies in a 27-task MDP domain and 9 policies in an 18-task POMDP domain; with fewer policies, catastrophic forgetting and negative transfer are observed. Due to slow, monotonic improvement, PPO requires fewer policies, 1 policy for the 27-task domain and 4 policies for the 18-task domain, but it learns the tasks with lower accuracy than D(R)QN. These findings validate lifetime-scalable policy reuse and suggest using D(R)QN for larger and PPO for smaller library sizes.
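Non-cooperative and cooperative games with a very large number of players have many applications but generally remain intractable as the number of players increases. Introduced by Lasry and Lions, and by Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising for solving complex problems. By combining MFGs and RL, we hope to solve games at a very large scale, both in terms of population size and environment complexity. In this survey, we review the recent literature on learning Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Finally, we present numerical illustrations on a benchmark problem and conclude with some perspectives.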
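This paper proposes a reinforcement learning (RL)-based eco-driving framework for electric connected vehicles (CVs) to improve vehicle energy efficiency at signalized intersections. Safe operation of the vehicle agent is ensured by integrating a model-based car-following policy, a lane-changing policy, and the RL policy. Subsequently, a Markov Decision Process (MDP) is formulated that enables the vehicle to perform longitudinal control and lateral decisions, jointly optimizing the car-following and lane-changing behaviors of the CVs in the vicinity of intersections. Then, the hybrid action space is parameterized as a hierarchical structure, so that the agents are trained with two-dimensional motion patterns in a dynamic traffic environment. Finally, the proposed methods are evaluated in the SUMO software from both a single-vehicle perspective and a flow-based perspective. The results show that our strategy can significantly reduce energy consumption by learning appropriate action schemes without disrupting other human-driven vehicles (HDVs).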
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
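Deep reinforcement learning (DRL) is employed to develop autonomously optimized and custom-designed heat-treatment processes that are both microstructure-sensitive and energy-efficient. Unlike conventional supervised machine learning, DRL does not rely on static neural network training from data alone; instead, a learning agent autonomously develops optimal solutions based on reward and penalty elements, with reduced or no supervision. In our approach, a temperature-dependent Allen-Cahn model for phase transformation is used as the environment for the DRL agent, serving as the model world in which it gains experience and takes autonomous decisions. The agent of the DRL algorithm controls the temperature of the system, which acts as a model furnace for the heat treatment of alloys. Microstructure goals are defined for the agent based on the desired microstructure of the phases. After training, the agent can generate temperature-time profiles for a variety of initial microstructure states to reach the final desired microstructure state. The agent's performance and the physical meaning of the generated heat-treatment profiles are investigated in detail. In particular, the agent is capable of controlling the temperature to reach the desired microstructure starting from a variety of initial conditions. This ability of the agent to handle a variety of conditions paves the way for using such an approach for recycling-oriented heat-treatment process design, where the initial composition can vary from batch to batch because of impurity intrusion, and also for the design of energy-efficient heat treatments. To test this hypothesis, an agent without a penalty on energy consumption is compared with one that considers energy costs. The energy cost penalty is imposed as an additional criterion on the agent for finding the optimal temperature-time profile.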
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
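The contrast with per-step dithering can be sketched in Python. The abstract does not describe the mechanism, so this sketch assumes the commonly described scheme in which several value heads are kept, one head is sampled at the start of each episode, and the agent acts greedily with respect to that head for the whole episode. The function names and the list-of-lists stand-in for network outputs are hypothetical.

```python
import random


def sample_head(n_heads, rng=random):
    """Pick one value head uniformly at random, once per episode (assumed)."""
    return rng.randrange(n_heads)


def act(head_q_values, head):
    """Act greedily with respect to the Q-values of the chosen head.

    head_q_values[k][a] stands in for the k-th head's estimate of the
    value of action a in the current state.
    """
    q = head_q_values[head]
    return max(range(len(q)), key=q.__getitem__)
```

Because the same head is followed for an entire episode, exploratory behaviour stays consistent over many steps, whereas ε-greedy re-randomises its choice at every step.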
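This paper explores the use of reinforcement learning (RL) models for autonomous racing. In contrast to passenger cars, where safety is the top priority, a racing car aims to minimize lap time. We frame the problem as a reinforcement learning task with a multidimensional input consisting of the vehicle telemetry and a continuous action space. To find out which RL methods better solve the problem and whether the obtained models generalize to driving on unknown tracks, we put 10 variants of deep deterministic policy gradient (DDPG) to race in two experiments: i) studying how RL methods learn to drive a racing car and ii) studying how the learning scenario influences the ability of the models to generalize. Our studies show that models trained with RL are not only able to drive faster than the baseline open-source handcrafted bots but also generalize to unknown tracks.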
With the development of experimental quantum technology, quantum control has attracted increasing attention due to the realization of controllable artificial quantum systems. However, because quantum-mechanical systems are often too difficult to deal with analytically, heuristic strategies and numerical algorithms that search for proper control protocols are adopted, and deep learning, especially deep reinforcement learning (RL), is a promising generic candidate solution for the control problems. Although there have been a few successful applications of deep RL to quantum control problems, most of the existing RL algorithms suffer from instabilities and unsatisfactory reproducibility, and require a large amount of fine-tuning and a large computational budget, both of which limit their applicability. To resolve the issue of instabilities, in this dissertation, we investigate the non-convergence issue of Q-learning. Then, we investigate the weaknesses of existing convergent approaches that have been proposed, and we develop a new convergent Q-learning algorithm, which we call the convergent deep Q network (C-DQN) algorithm, as an alternative to the conventional deep Q network (DQN) algorithm. We prove the convergence of C-DQN and apply it to the Atari 2600 benchmark. We show that when DQN fails, C-DQN still learns successfully. Then, we apply the algorithm to the measurement-feedback cooling problems of a quantum quartic oscillator and a trapped quantum rigid body. We establish the physical models and analyse their properties, and we show that although both C-DQN and DQN can learn to cool the systems, C-DQN tends to behave more stably, and when DQN suffers from instabilities, C-DQN can achieve better performance. As the performance of DQN can have a large variance and lack consistency, C-DQN can be a better choice for research on complicated control problems.