This paper explores the use of reinforcement learning (RL) models for autonomous racing. In contrast to passenger cars, for which safety is the top priority, a racing car aims to minimize lap time. We frame the problem as a reinforcement learning task with a multidimensional input consisting of vehicle telemetry and a continuous action space. To find out which RL method solves the problem best and whether the obtained models generalize to unknown tracks, we put 10 variants of deep deterministic policy gradient (DDPG) through two experiments: i) studying how the RL methods learn to drive a racing car and ii) studying how the learning scenario influences the generalization capability of the models. Our study shows that models trained with RL are not only able to drive faster than a baseline open-source handcrafted bot but also generalize to unknown tracks.
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents, as well as methods to validate, test, and robustify existing solutions in RL, is also discussed.
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies "end-to-end": directly from raw pixel inputs.
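The update rule behind DDPG fits in a few lines. Below is a minimal sketch (in PyTorch) of one training step, with `Actor`/`Critic` modules, target networks, and a replay-buffer batch assumed rather than taken from the paper's code; exploration noise (e.g. Ornstein-Uhlenbeck) is omitted.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one DDPG update step. Actor/critic modules, their target
# copies, optimizers, and the replay batch are assumed inputs.
def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # tensors sampled from a replay buffer

    # Critic: regress Q(s, a) onto the bootstrapped target computed with the
    # target actor and target critic.
    with torch.no_grad():
        q_targ = r + gamma * (1 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), q_targ)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's value of the actor's own (deterministic) action.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak-average the target networks for stability.
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
```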
Racing demands each vehicle to be driven at the edge of its physical limits, where any safety violation could lead to catastrophic failure. In this work, we study the problem of safe reinforcement learning (RL) for autonomous racing, using the vehicle's ego-camera view and speed as input. Given the nature of the task, the autonomous agent needs to be able to 1) identify and avoid unsafe scenarios under complex vehicle dynamics and 2) make sub-second decisions in a fast-changing environment. To satisfy these criteria, we propose to incorporate Hamilton-Jacobi (HJ) reachability theory, a safety verification method for general nonlinear systems, into the constrained Markov decision process (CMDP) framework. HJ reachability not only provides a control-theoretic approach to learning about safety, but also enables low-latency safety verification. Although HJ reachability traditionally does not scale to high-dimensional systems, we demonstrate that, with neural approximation, the HJ safety value can be learned directly from visual context, the highest-dimensional problem studied via this method to date. We evaluate our method on several benchmark tasks, including Safety Gym and Learn-to-Race (L2R), a recently released high-fidelity autonomous racing environment. Our approach incurs significantly fewer constraint violations than other constrained RL baselines in Safety Gym, and achieves a new state-of-the-art result on the L2R benchmark task. We provide additional visualizations of agent behavior at the following anonymous paper website: https://sites.google.com/view/safeautomouracing/home
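A common way to make HJ reachability learnable with RL machinery is the discounted safety Bellman backup, in which a safety margin l(s) (negative when a constraint is violated) induces a safety value that can be fitted like a Q-function. The sketch below shows only that backup; the function names, the neural-critic setup, and the use of the learned value as a low-latency action filter are assumptions for illustration, not the authors' implementation.

```python
import torch

def hj_safety_target(l_s, q_next_max, gamma=0.9999):
    """Discounted safety Bellman backup for an HJ-style safety critic.

    l_s:        safety margin at the current state (l(s) < 0 means violation).
    q_next_max: max over actions of the target safety value at the next state.
    As gamma approaches 1, this approaches the undiscounted reachability value.
    """
    return (1 - gamma) * l_s + gamma * torch.minimum(l_s, q_next_max)
```

At deployment, one typical use of such a learned value is as a fast safety filter: if the task policy's action scores below a threshold under the safety critic, a safety-maximizing fallback action is executed instead.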
In the field of autonomous driving, fusing human knowledge into deep reinforcement learning (DRL) is usually based on human demonstrations recorded in simulated environments, which limits the applicability and feasibility in real-world traffic. We propose a two-stage DRL method that learns from real human driving and achieves performance superior to that of a pure DRL agent. Training of the DRL agent is done within the framework of CARLA with the Robot Operating System (ROS). For evaluation, we designed different real-world driving scenarios in which the proposed two-stage DRL agent can be compared with a pure DRL agent. After extracting "good" behavior from human drivers, such as anticipation at signalized intersections, the agent becomes more efficient and drives more safely, which makes this autonomous agent more adaptable to human-robot interaction (HRI) traffic.
We present an approach for safe trajectory planning, where a strategic task related to autonomous racing is learned sample-efficient within a simulation environment. A high-level policy, represented as a neural network, outputs a reward specification that is used within the cost function of a parametric nonlinear model predictive controller (NMPC). By including constraints and vehicle kinematics in the NLP, we are able to guarantee safe and feasible trajectories related to the used model. Compared to classical reinforcement learning (RL), our approach restricts the exploration to safe trajectories, starts with a good prior performance and yields full trajectories that can be passed to a tracking lowest-level controller. We do not address the lowest-level controller in this work and assume perfect tracking of feasible trajectories. We show the superior performance of our algorithm on simulated racing tasks that include high-level decision making. The vehicle learns to efficiently overtake slower vehicles and to avoid getting overtaken by blocking faster vehicles.
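To make the hierarchy concrete, here is a toy sketch of the idea: a high-level policy outputs a cost specification, and a parametric trajectory optimizer (standing in for the NMPC) turns it into a constrained, feasible input sequence for a simple model. The dynamics, cost terms, and names are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy version of the hierarchy: the "reward specification" (target speed and a
# smoothness weight) comes from a high-level policy; the optimizer returns a
# bounded acceleration sequence for a 1D double integrator.
def plan(v0, spec, horizon=20, dt=0.1, a_max=3.0):
    v_ref, w_smooth = spec  # output of the (assumed) high-level policy

    def cost(a):
        v = v0 + dt * np.cumsum(a)                 # rollout of the simple model
        return np.sum((v - v_ref) ** 2) + w_smooth * np.sum(np.diff(a) ** 2)

    res = minimize(cost, np.zeros(horizon),
                   bounds=[(-a_max, a_max)] * horizon)  # input constraints
    return res.x  # acceleration sequence handed to the low-level tracker

accels = plan(v0=10.0, spec=(15.0, 0.5))
```

Because exploration only ever happens in the space of cost specifications, every rollout the learner sees is a trajectory the constrained optimizer has already certified as feasible, which is the safety argument made above.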
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
Deep reinforcement learning (DRL) is a promising approach for learning robot control policies solely from demonstrations and experience. To cover the robot's full dynamic behavior, DRL training is an active exploration process that is typically carried out in simulation environments. Although such simulated training is cheap and fast, applying DRL algorithms to real-world settings is difficult. If agents are trained until they perform safely in simulation, transferring them to physical systems is hard because of the sim-to-real gap caused by the differences between the simulated dynamics and the physical robot. In this paper, we propose a method to train DRL agents online on a physical vehicle for autonomous driving, using a model-based safety supervisor. Our solution uses the supervisory system to check whether the action selected by the agent is safe or unsafe and ensures that a safe action is always taken on the vehicle. In this way, we can bypass the sim-to-real problem while training the DRL algorithm safely, quickly, and efficiently. We present various real-world experiments in which a small physical vehicle is trained online to drive autonomously with no prior simulation training. The evaluation results show that our method improves the sample efficiency of training agents while avoiding crashes, and that the trained agents exhibit better driving performance than agents trained in simulation.
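The supervisory mechanism can be pictured as a thin filter between the learner and the vehicle. The sketch below shows that structure; the kinematic model, the safety check, and the fallback action are placeholders and not the paper's specific design.

```python
class SafetySupervisor:
    """Sketch of a model-based safety supervisor: the agent's proposed action is
    rolled out through a simple predictive model, and if the predicted state
    fails a safety check, the action is replaced by a safe fallback. Model,
    check, and fallback are illustrative assumptions."""

    def __init__(self, model, is_safe, fallback_action):
        self.model = model              # (state, action) -> predicted next state
        self.is_safe = is_safe          # predicted state -> bool
        self.fallback = fallback_action # e.g. brake / steer back to the track

    def filter(self, state, action):
        if self.is_safe(self.model(state, action)):
            return action, False         # action executed as chosen
        return self.fallback(state), True  # override; flag for the learner

# In an online training loop, the executed (possibly overridden) action is what
# gets stored in the replay buffer, so learning continues while crashes are avoided.
```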
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
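One way to see the counterfactual aspect in concrete terms: a hindsight distribution h(a | s, z) is trained to predict which action was taken given the state and the eventually observed return z. Combined with the policy π(a | s), a single observed return yields advantage estimates for every action, not just the one that was selected. The sketch below shows a return-conditional estimator of that kind; the shapes and names are assumptions, and this is not necessarily the exact estimator used in the thesis.

```python
import torch

def hca_advantages(pi_probs, h_probs, z):
    """Return-conditional hindsight advantage estimates for *all* actions.

    pi_probs: policy probabilities pi(a | s) for each action, shape (A,)
    h_probs:  learned hindsight probabilities h(a | s, z), shape (A,)
    z:        the return actually observed after visiting state s

    Since P(z | s, a) = P(z | s) * h(a | s, z) / pi(a | s), the ratio h / pi
    reweights the single observed return into an estimate of Q(s, a) - V(s)
    for every action, including ones the agent never took.
    """
    return (h_probs / pi_probs - 1.0) * z
```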
There is an ever-increasing demand for the use of unmanned aerial vehicles (UAVs, i.e., drones) in diverse applications such as package delivery, traffic monitoring, search-and-rescue operations, and military combat engagements. In all of these applications, the UAV is used to navigate the environment autonomously, without human interaction, to perform specific tasks and avoid obstacles. Autonomous UAV navigation is commonly accomplished using reinforcement learning (RL), in which the agent acts as an expert in the domain, navigating the environment while avoiding obstacles. Understanding the navigation environment and the limitations of algorithms plays a crucial role in choosing an appropriate RL algorithm to solve the navigation problem effectively. Consequently, this study first identifies UAV navigation tasks and discusses navigation frameworks and simulation software. Next, RL algorithms are classified and discussed based on the environment, algorithm characteristics, capabilities, and applications to different UAV navigation problems, which will help practitioners and researchers select appropriate RL algorithms for their UAV navigation use cases. Moreover, the identified gaps and opportunities will drive future UAV navigation research.
Reinforcement learning (RL) requires skillful definition and remarkable computational efforts to solve optimization and control problems, which could impair its prospect. Introducing human guidance into reinforcement learning is a promising way to improve learning performance. In this paper, a comprehensive human guidance-based reinforcement learning framework is established. A novel prioritized experience replay mechanism that adapts to human guidance in the reinforcement learning process is proposed to boost the efficiency and performance of the reinforcement learning algorithm. To relieve the heavy workload on human participants, a behavior model is established based on an incremental online learning method to mimic human actions. We design two challenging autonomous driving tasks for evaluating the proposed algorithm. Experiments are conducted to assess the training and testing performance and the learning mechanism of the proposed algorithm. Comparative results against the state-of-the-art methods suggest the advantages of our algorithm in terms of learning efficiency, performance, and robustness.
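The exact prioritization rule is the paper's contribution and is not reproduced here, but the general shape of a human-guidance-aware prioritized buffer can be sketched: priorities follow TD error, with a bonus for transitions in which the human intervened. The bonus term and all names below are assumptions for illustration.

```python
import numpy as np

class HumanGuidedPER:
    """Sketch of a prioritized replay buffer that favors human-guided samples:
    TD-error-based priorities plus a constant bonus for transitions where a
    human intervened (an assumed, simplified rule)."""

    def __init__(self, capacity, alpha=0.6, human_bonus=1.0):
        self.data = []
        self.prio = np.zeros(capacity, dtype=np.float64)
        self.capacity, self.alpha, self.bonus, self.pos = capacity, alpha, human_bonus, 0

    def add(self, transition, td_error, from_human):
        p = (abs(td_error) + 1e-6 + (self.bonus if from_human else 0.0)) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.prio[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.prio[:len(self.data)]
        idx = np.random.choice(len(self.data), batch_size, p=p / p.sum())
        return [self.data[i] for i in idx], idx
```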
Event-triggered model predictive control (eMPC) is a popular optimal control method that aims to alleviate the computation and/or communication burden of MPC. However, it generally requires a priori knowledge of the closed-loop system behavior along with the communication characteristics to design the event-triggering policy. This paper attempts to address this challenge by proposing an efficient eMPC framework and successfully implementing it on an autonomous vehicle path-following task. In this framework, a model-free reinforcement learning (RL) agent is first used to learn the optimal event-triggering policy without requiring complete knowledge of the system dynamics and communication. Moreover, techniques including a prioritized experience replay (PER) buffer and long short-term memory (LSTM) are employed to foster exploration and improve training efficiency. In this paper, we use the proposed framework with three deep RL algorithms, namely double Q-learning (DDQN), proximal policy optimization (PPO), and soft actor-critic (SAC), to solve this problem. Experimental results show that all three RL-based eMPC variants (deep-RL-eMPC) achieve better evaluation performance than the conventional threshold-based and the previous linear Q-based approaches on the autonomous path-following task. In particular, PPO-eMPC with LSTM and DDQN-eMPC with PER and LSTM achieve a superior balance between closed-loop control performance and event-triggering frequency. The associated code is open-sourced and available at: https://github.com/dangfengying/rl-based-event-triggered-mpc
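The control loop being learned can be pictured as follows: at every step the learned policy decides whether to re-solve the MPC problem or keep shifting through the previously computed input sequence, and its reward trades tracking quality against trigger frequency. The sketch below is only structural; `mpc_solve`, the environment interface, and the reward weights are placeholders, not the paper's implementation.

```python
# Structural sketch of an RL-learned event-triggering policy wrapped around MPC.
def run_episode(env, trigger_policy, mpc_solve, horizon, w_err=1.0, w_trig=0.1):
    state, plan, k = env.reset(), None, 0          # k indexes into the old plan
    done, total_reward = False, 0.0
    while not done:
        trigger = plan is None or k >= horizon or trigger_policy(state, k)
        if trigger:
            plan, k = mpc_solve(state, horizon), 0  # fresh optimal input sequence
        action = plan[k]
        state, tracking_error, done = env.step(action)
        k += 1
        # Reward used to train the trigger policy: good tracking, few triggers.
        total_reward += -w_err * tracking_error - w_trig * float(trigger)
    return total_reward
```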
Reinforcement learning (RL) has been shown to reach super-human performance on a variety of tasks. However, unlike supervised machine learning, learning policies that generalize to a wide range of situations remains one of the most challenging problems for real-world applications. Autonomous driving (AD) provides a multifaceted experimental field, since correct behavior must be learned across many different road layouts and a large distribution of possible traffic situations, including individual driver personalities and hard-to-predict traffic events. In this paper, we propose a challenging benchmark for AD based on a configurable, flexible, and performant code base. Our benchmark uses a catalog of randomized scenario generators, including multiple mechanisms for road-layout and traffic variation, different numerical and visual observation types, different action spaces, and different vehicle models, and it also allows use under static scenario definitions. Beyond purely algorithmic insights, our application-oriented benchmark also enables a better understanding of the impact of design decisions, such as the action and observation spaces, on the generalizability of policies. Our benchmark aims to encourage researchers to propose solutions that can successfully generalize across scenarios, a task on which current RL methods fail. The code of the benchmark is available at https://github.com/seawee1/driver-dojo.
While deep reinforcement learning (RL) has been increasingly applied in recent years, this study investigates the feasibility of RL-based vehicle-following under complex vehicle dynamics and strong environmental disturbances. As a use case, we developed an inland-waterway vessel-following model based on realistic vessel dynamics, which considers environmental influences such as varying stream velocities and river profiles. We extracted natural vessel behavior from anonymized AIS data to formulate a reward function that reflects a realistic driving style alongside comfortable and safe navigation. Aiming at high generalization capability, we propose an RL training environment that uses stochastic processes to model the leader trajectory and the river dynamics. To validate the trained model, we defined different scenarios that have not been seen during training, including a realistic vessel-following setting on the Middle Rhine. Our model demonstrated safe and comfortable driving in all scenarios, proving excellent generalization abilities. Furthermore, traffic oscillations could be effectively dampened by deploying the trained model on a sequence of following vessels.
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
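The structural idea of asynchronous actor-learners is easy to sketch: several workers interact with their own environment copies and apply gradient updates to a single shared parameter vector without locking. The sketch below shows only this skeleton; `make_env` and `compute_gradient` (e.g. an n-step actor-critic gradient) are placeholders, and the real algorithm's details (entropy bonus, shared RMSProp statistics) are omitted.

```python
import threading
import numpy as np

# Skeleton of asynchronous actor-learners sharing one parameter vector.
def worker(shared_params, make_env, compute_gradient, steps, lr=1e-3):
    env = make_env()
    state = env.reset()
    for _ in range(steps):
        grad, state = compute_gradient(shared_params, env, state)
        shared_params -= lr * grad          # asynchronous, lock-free update

def train_async(param_dim, make_env, compute_gradient, n_workers=8, steps=10000):
    shared_params = np.zeros(param_dim)
    threads = [threading.Thread(target=worker,
                                args=(shared_params, make_env, compute_gradient, steps))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_params
```

The parallel rollouts decorrelate the data each learner sees, which is the stabilizing effect the abstract refers to and what removes the need for a replay buffer.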
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order-of-magnitude speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
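One common way to fold demonstrations into DDPG-style training, in the spirit described above, is to mix demonstration transitions into the minibatches and add an auxiliary behavior-cloning term that pulls the actor toward demonstrated actions, masked so it only applies where the critic rates the demonstration at least as good (a "Q-filter"). The sketch below shows that actor objective; it is illustrative and not necessarily the paper's exact recipe.

```python
import torch

def actor_loss_with_demos(actor, critic, agent_s, demo_s, demo_a, bc_weight=1.0):
    # Standard DDPG actor objective on agent experience.
    rl_loss = -critic(agent_s, actor(agent_s)).mean()

    # Behavior cloning on demonstration pairs, masked by the Q-filter.
    pred_a = actor(demo_s)
    with torch.no_grad():
        mask = (critic(demo_s, demo_a) > critic(demo_s, pred_a)).float()
    bc_loss = (mask * ((pred_a - demo_a) ** 2).sum(dim=-1, keepdim=True)).mean()

    return rl_loss + bc_weight * bc_loss
```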
Solving complex problems with reinforcement learning requires breaking the problem down into manageable tasks, either explicitly or implicitly, and learning policies to solve these tasks. These policies, in turn, have to be controlled by an overarching policy that takes higher-level decisions. This requires the training algorithm to take this hierarchical decision structure into account when learning the policies. In practice, however, training can result in poor generalization, with actions being executed for too few time steps or everything collapsing into a single policy. In our work, we introduce an alternative approach to learn such skills sequentially without using an overarching hierarchical policy. We propose this method in the context of environments in which a major component of the learning agent's objective is to prolong the episode for as long as possible. We refer to our proposed method as Sequential Option Critic. We demonstrate the utility of our approach on navigation and goal-based tasks in a flexible simulated 3D navigation environment that we have developed. We also show that our method outperforms prior approaches such as Soft Actor-Critic and Soft Option Critic in our environment, as well as in the Gym self-driving car simulator and the Atari River Raid environment.
Although deep reinforcement learning (RL) has seen many recent successes, its methods are still data-inefficient, which makes many problems prohibitively expensive to solve in terms of data. We aim to remedy this by leveraging the rich supervisory signal in unlabeled data for learning state representations. This work introduces three different representation learning algorithms that have access to different subsets of the data sources a traditional RL algorithm uses: (i) GRICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GRICA does this by minimizing the mutual information between each feature and the other features, and it requires only an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to the current state, it also needs the previous state and the action connecting them. This method learns a state representation by predicting the next state of the environment from the current state and action, and the predictor is used together with a graph-search algorithm. (iii) The third method learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used to preprocess inputs to deep RL, while the reward predictor is used for reward shaping. This method requires only state-reward pairs from the environment for learning the representation. We find that each method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.
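A latent forward-prediction representation of the kind described in the second method can be sketched compactly: an encoder maps states to features, and a predictor is trained to map the features of the previous state plus the action to the features of the current state. The architecture sizes and names below are arbitrary assumptions, not the thesis's networks.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Sketch of a latent forward-prediction representation learner."""

    def __init__(self, state_dim, action_dim, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(feat_dim + action_dim, 64), nn.ReLU(),
                                       nn.Linear(64, feat_dim))

    def loss(self, s_prev, action, s_curr):
        z_prev, z_curr = self.encoder(s_prev), self.encoder(s_curr)
        z_pred = self.predictor(torch.cat([z_prev, action], dim=-1))
        return ((z_pred - z_curr.detach()) ** 2).mean()  # stop-gradient on target
```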
Optimally setting the multiple hyperparameters of a machine learning algorithm is key to making the most of the available data. To this end, several approaches have been proposed, such as evolutionary strategies, random search, Bayesian optimization, and heuristic rules of thumb. In reinforcement learning (RL), the information content of the data collected by the learning agent while interacting with its environment depends heavily on the settings of many hyperparameters. Users of RL algorithms therefore have to rely on search-based optimization methods, such as grid search or the Nelder-Mead simplex algorithm, which are very inefficient for most RL tasks, slow down the learning curve significantly, and leave to the user the burden of purposefully biasing data collection. In this work, to make RL algorithms more user-independent, a novel approach for the autonomous setting of hyperparameters using Bayesian optimization is proposed. Data from past episodes and different hyperparameter values are used at a meta-learning level by performing behavioral cloning, which helps improve the effectiveness of a reinforcement learning variant of acquisition-function maximization. In addition, by tightly integrating Bayesian optimization into the design of the reinforcement learning agent, the number of state transitions needed to converge to the optimal policy for a given task is also reduced. Computational experiments show promising results compared with other manual-tuning and optimization-based approaches, highlighting the benefits of changing the algorithm hyperparameters to increase the information content of the generated data.
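For reference, the standard Bayesian-optimization skeleton that such an approach builds on looks as follows: a Gaussian process is fitted to (hyperparameter, return) pairs and the next candidate is chosen by expected improvement. The paper's behavior-cloning meta-level and RL-based acquisition maximization are not reproduced here; `evaluate` and the search bounds are assumptions (e.g. a learning rate on a log scale).

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Generic GP + expected-improvement loop for one RL hyperparameter.
def bayes_opt(evaluate, bounds=(-5.0, -2.0), n_init=3, n_iter=10):
    X = np.random.uniform(*bounds, size=(n_init, 1))
    y = np.array([evaluate(10 ** x[0]) for x in X])   # short training runs
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = np.linspace(*bounds, 200).reshape(-1, 1)
        mu, sigma = gp.predict(cand, return_std=True)
        z = (mu - y.max()) / (sigma + 1e-9)
        ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, evaluate(10 ** x_next[0]))
    return 10 ** X[np.argmax(y), 0]   # best hyperparameter value found
```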
Autonomous driving has the potential to revolutionize mobility and is hence an active area of research. In practice, the behavior of autonomous vehicles must be acceptable, i.e., efficient, safe, and interpretable. While vanilla reinforcement learning (RL) finds performant behavioral strategies, they are often unsafe and uninterpretable. Safety is introduced through safe RL approaches, but these still remain uninterpretable, as the learned behavior jointly optimizes safety and performance without modeling them separately. Interpretable machine learning is rarely applied to RL. This paper proposes SafeDQN, which makes the behavior of autonomous vehicles safe and interpretable while still being efficient. SafeDQN offers an understandable, semantic trade-off between the expected risk and the utility of actions while being algorithmically transparent. We show that SafeDQN finds interpretable and safe driving policies for a variety of scenarios and demonstrate how state-of-the-art saliency techniques can help assess both risk and utility.
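The general idea of an explicit risk/utility trade-off can be illustrated with a two-head value network: a shared encoder feeds one head estimating expected utility and one estimating expected risk, and action selection combines them with an inspectable coefficient. This sketch is illustrative of that idea only and is not the paper's exact SafeDQN formulation.

```python
import torch
import torch.nn as nn

class TwoHeadQNetwork(nn.Module):
    """Shared encoder with separate utility and risk value heads."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_utility = nn.Linear(hidden, n_actions)   # expected return head
        self.q_risk = nn.Linear(hidden, n_actions)      # expected risk head

    def forward(self, obs):
        h = self.body(obs)
        return self.q_utility(h), self.q_risk(h)

def select_action(net, obs, risk_weight=2.0):
    q_u, q_r = net(obs)
    # The trade-off is explicit and inspectable: utility minus weighted risk.
    return torch.argmax(q_u - risk_weight * q_r, dim=-1)
```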