Learning a risk-aware policy is essential but rather challenging in unstructured robotic tasks. Safe reinforcement learning methods open up new possibilities to tackle this problem. However, the conservative policy updates make it intractable to achieve sufficient exploration and desirable performance in complex, sample-expensive environments. In this paper, we propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent. Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control. Concretely, the baseline agent is responsible for maximizing rewards under standard RL settings. Thus, it is compatible with off-the-shelf training techniques of unconstrained optimization, exploration and exploitation. On the other hand, the safe agent mimics the baseline agent for policy improvement and learns to fulfill safety constraints via off-policy RL tuning. In contrast to training from scratch, safe policy correction requires significantly fewer interactions to obtain a near-optimal policy. The dual policies can be optimized synchronously via a shared replay buffer, or leveraging the pre-trained model or the non-learning-based controller as a fixed baseline agent. Experimental results show that our approach can learn feasible skills without prior knowledge as well as deriving risk-averse counterparts from pre-trained unsafe policies. The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks with respect to both safety constraint satisfaction and sample efficiency.
translated by 谷歌翻译
Safety comes first in many real-world applications involving autonomous agents. Despite a large number of reinforcement learning (RL) methods focusing on safety-critical tasks, there is still a lack of high-quality evaluation of those algorithms that adheres to safety constraints at each decision step under complex and unknown dynamics. In this paper, we revisit prior work in this scope from the perspective of state-wise safe RL and categorize them as projection-based, recovery-based, and optimization-based approaches, respectively. Furthermore, we propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. This novel technique explicitly enforces hard constraints via the deep unrolling architecture and enjoys structural advantages in navigating the trade-off between reward improvement and constraint satisfaction. To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit, a toolkit that provides off-the-shelf interfaces and evaluation utilities for safety-critical tasks. We then perform a comparative study of the involved algorithms on six benchmarks ranging from robotic control to autonomous driving. The empirical results provide an insight into their applicability and robustness in learning zero-cost-return policies without task-dependent handcrafting. The project page is available at https://sites.google.com/view/saferlkit.
translated by 谷歌翻译
安全加强学习(RL)在对风险敏感的任务上取得了重大成功,并在自主驾驶方面也表现出了希望(AD)。考虑到这个社区的独特性,对于安全广告而言,仍然缺乏高效且可再现的基线。在本文中,我们将SAFERL-KIT释放到基准的安全RL方法,以实现倾向的任务。具体而言,SAFERL-KIT包含了针对零构成的侵略任务的几种最新算法,包括安全层,恢复RL,非政策Lagrangian方法和可行的Actor-Critic。除了现有方法外,我们还提出了一种名为精确惩罚优化(EPO)的新型一阶方法,并充分证明了其在安全AD中的能力。 SAFERL-KIT中的所有算法均在政策设置下实现(i),从而提高了样本效率并可以更好地利用过去的日志; (ii)具有统一的学习框架,为研究人员提供了现成的接口,以将其特定领域的知识纳入基本的安全RL方法中。最后,我们对上述算法进行了比较评估,并阐明了它们的安全自动驾驶功效。源代码可在\ href {https://github.com/zlr20/saferl_kit} {this https url}中获得。
translated by 谷歌翻译
安全的加强学习(RL)旨在学习在将其部署到关键安全应用程序中之前满足某些约束的政策。以前的原始双重风格方法遭受了不稳定性问题的困扰,并且缺乏最佳保证。本文从概率推断的角度克服了问题。我们在政策学习过程中介绍了一种新颖的期望最大化方法来自然纳入约束:1)在凸优化(E-step)后,可以以封闭形式计算可证明的最佳非参数变异分布; 2)基于最佳变异分布(M-step),在信任区域内改进了策略参数。提出的算法将安全的RL问题分解为凸优化阶段和监督学习阶段,从而产生了更稳定的培训性能。对连续机器人任务进行的广泛实验表明,所提出的方法比基线获得了更好的约束满意度和更好的样品效率。该代码可在https://github.com/liuzuxin/cvpo-safe-rl上找到。
translated by 谷歌翻译
安全的强化学习旨在学习最佳政策,同时满足安全限制,这在现实世界中至关重要。但是,当前的算法仍在为有效的政策更新而努力,并具有严格的约束满意度。在本文中,我们提出了受惩罚的近端政策优化(P3O),该政策优化(P3O)通过单一的最小化等效不受约束的问题来解决麻烦的受约束政策迭代。具体而言,P3O利用了简单的罚款功能来消除成本限制,并通过剪裁的替代目标消除了信任区域的约束。从理论上讲,我们用有限的惩罚因素证明了所提出的方法的精确性,并在对样品轨迹进行评估时提供了最坏情况分析,以实现近似误差。此外,我们将P3O扩展到更具挑战性的多构造和多代理方案,这些方案在以前的工作中所研究的情况较少。广泛的实验表明,在一组受限的机车任务上,P3O优于奖励改进和约束满意度的最先进算法。
translated by 谷歌翻译
安全已成为对现实世界系统应用深度加固学习的主要挑战之一。目前,诸如人类监督等外部知识的纳入唯一可以防止代理人访问灾难性状态的手段。在本文中,我们提出了一种基于安全模型的强化学习的新框架MBHI,可确保状态级安全,可以有效地避免“本地”和“非本地”灾难。监督学习者的合并在MBHI培训,以模仿人类阻止决策。类似于人类决策过程,MBHI将在执行对环境的动作之前在动态模型中推出一个想象的轨迹,并估算其安全性。当想象力遇到灾难时,MBHI将阻止当前的动作并使用高效的MPC方法来输出安全策略。我们在几个安全任务中评估了我们的方法,结果表明,与基线相比,MBHI在样品效率和灾难数方面取得了更好的性能。
translated by 谷歌翻译
In this work, we focus on the problem of safe policy transfer in reinforcement learning: we seek to leverage existing policies when learning a new task with specified constraints. This problem is important for safety-critical applications where interactions are costly and unconstrained policies can lead to undesirable or dangerous outcomes, e.g., with physical robots that interact with humans. We propose a Constrained Markov Decision Process (CMDP) formulation that simultaneously enables the transfer of policies and adherence to safety constraints. Our formulation cleanly separates task goals from safety considerations and permits the specification of a wide variety of constraints. Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation. We devise a dual optimization algorithm that estimates the optimal dual variable of a target task, thus enabling safe transfer of policies derived from successor features learned on source tasks. Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
translated by 谷歌翻译
For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (
translated by 谷歌翻译
值得信赖的强化学习算法应有能力解决挑战性的现实问题,包括{Robustly}处理不确定性,满足{安全}的限制以避免灾难性的失败,以及在部署过程中{prencepentiming}以避免灾难性的失败}。这项研究旨在概述这些可信赖的强化学习的主要观点,即考虑其在鲁棒性,安全性和概括性上的内在脆弱性。特别是,我们给出严格的表述,对相应的方法进行分类,并讨论每个观点的基准。此外,我们提供了一个前景部分,以刺激有希望的未来方向,并简要讨论考虑人类反馈的外部漏洞。我们希望这项调查可以在统一的框架中将单独的研究汇合在一起,并促进强化学习的可信度。
translated by 谷歌翻译
深度加固学习(DRL)使机器人能够结束结束地执行一些智能任务。然而,长地平线稀疏奖励机器人机械手任务仍存在许多挑战。一方面,稀疏奖励设置会导致探索效率低下。另一方面,使用物理机器人的探索是高成本和不安全的。在本文中,我们提出了一种学习使用本文中名为基础控制器的一个或多个现有传统控制器的长地平线稀疏奖励任务。基于深度确定性的政策梯度(DDPG),我们的算法将现有基础控制器融入勘探,价值学习和策略更新的阶段。此外,我们介绍了合成不同基础控制器以整合它们的优点的直接方式。通过从堆叠块到杯子的实验,证明学习的国家或基于图像的策略稳定优于基础控制器。与以前的示范中的学习作品相比,我们的方法通过数量级提高了样品效率,提高了性能。总体而言,我们的方法具有利用现有的工业机器人操纵系统来构建更灵活和智能控制器的可能性。
translated by 谷歌翻译
在本文中,我们研究了加强学习问题的安全政策的学习。这是,我们的目标是控制我们不知道过渡概率的马尔可夫决策过程(MDP),但我们通过经验访问样品轨迹。我们将安全性定义为在操作时间内具有高概率的期望安全集中的代理。因此,我们考虑受限制的MDP,其中限制是概率。由于没有直接的方式来优化关于加强学习框架中的概率约束的政策,因此我们提出了对问题的遍历松弛。拟议的放松的优点是三倍。 (i)安全保障在集界任务的情况下保持,并且它们保持在一个给定的时间范围内,以继续进行任务。 (ii)如果政策的参数化足够丰富,则约束优化问题尽管其非凸起具有任意小的二元间隙。 (iii)可以使用标准策略梯度结果和随机近似工具容易地计算与安全学习问题相关的拉格朗日的梯度。利用这些优势,我们建立了原始双算法能够找到安全和最佳的政策。我们在连续域中的导航任务中测试所提出的方法。数值结果表明,我们的算法能够将策略动态调整到环境和所需的安全水平。
translated by 谷歌翻译
本文解决了当参与需求响应(DR)时优化电动汽车(EV)的充电/排放时间表的问题。由于电动汽车的剩余能量,到达和出发时间以及未来的电价中存在不确定性,因此很难做出充电决定以最大程度地减少充电成本,同时保证电动汽车的电池最先进(SOC)在内某些范围。为了解决这一难题,本文将EV充电调度问题制定为Markov决策过程(CMDP)。通过协同结合增强的Lagrangian方法和软演员评论家算法,本文提出了一种新型安全的非政策钢筋学习方法(RL)方法来解决CMDP。通过Lagrangian值函数以策略梯度方式更新Actor网络。采用双重危机网络来同步估计动作值函数,以避免高估偏差。所提出的算法不需要强烈的凸度保证,可以保证被检查的问题,并且是有效的样本。现实世界中电价的全面数值实验表明,我们提出的算法可以实现高解决方案最佳性和约束依从性。
translated by 谷歌翻译
在强化学习(RL)的试验和错误机制中,我们期望学习安全的政策时出现臭名昭着的矛盾:如何学习没有足够数据和关于危险区域的先前模型的安全政策?现有方法主要使用危险行动的后期惩罚,这意味着代理人不会受到惩罚,直到体验危险。这一事实导致代理商也无法在收敛之后学习零违规政策。否则,它不会收到任何惩罚并失去有关危险的知识。在本文中,我们提出了安全设置的演员 - 评论家(SSAC)算法,它使用面向安全的能量函数或安全索引限制了策略更新。安全索引旨在迅速增加,以便潜在的危险行动,这使我们能够在动作空间上找到安全设置,或控制安全集。因此,我们可以在服用它们之前识别危险行为,并在收敛后进一步获得零限制违规政策。我们声称我们可以以类似于学习价值函数的无模型方式学习能量函数。通过使用作为约束目标的能量函数转变,我们制定了受约束的RL问题。我们证明我们基于拉格朗日的解决方案确保学习的政策将收敛到某些假设下的约束优化。在复杂的模拟环境和硬件循环(HIL)实验中评估了所提出的算法,具有来自自动车辆的真实控制器。实验结果表明,所有环境中的融合政策达到了零限制违规和基于模型的基线的相当性能。
translated by 谷歌翻译
在过去的十年中,强化学习成功地解决了复杂的控制任务和决策问题,例如Go棋盘游戏。然而,在将这些算法部署到现实世界情景方面的成功案例很少。原因之一是在处理和避免不安全状态时缺乏保证,这是关键控制工程系统的基本要求。在本文中,我们介绍了指导性的安全射击(GUS),这是一种基于模型的RL方法,可以学会以最小的侵犯安全限制来控制系统。该模型以迭代批次方式在系统操作过程中收集的数据中学习,然后用于计划在每个时间步骤执行的最佳动作。我们提出了三个不同的安全计划者,一个基于简单的随机拍摄策略,两个基于MAP-ELITE,一种更高级的发散搜索算法。实验表明,这些计划者可以帮助学习代理避免在最大程度地探索状态空间的同时避免不安全的情况,这是学习系统准确模型的必要方面。此外,与无模型方法相比,学习模型可以减少与现实系统的交互作用的数量,同时仍达到高奖励,这是处理工程系统时的基本要求。
translated by 谷歌翻译
除了最大化奖励目标之外,现实世界中的强化学习(RL)代理商必须满足安全限制。基于模型的RL算法占据了减少不安全的现实世界行动的承诺:它们可以合成使用来自学习模型的模拟样本遵守所有约束的策略。但是,即使对于预测满足所有约束的操作,甚至可能导致真实的结构违规。我们提出了保守和自适应惩罚(CAP),一种基于模型的安全RL框架,其通过捕获模型不确定性并自适应利用它来平衡奖励和成本目标来占潜在的建模错误。首先,CAP利用基于不确定性的惩罚来膨胀预测成本。从理论上讲,我们展示了满足这种保守成本约束的政策,也可以保证在真正的环境中是可行的。我们进一步表明,这保证了在RL培训期间所有中间解决方案的安全性。此外,在使用环境中使用真正的成本反馈,帽子在培训期间自适应地调整这种惩罚。我们在基于状态和基于图像的环境中,评估了基于模型的安全RL的保守和自适应惩罚方法。我们的结果表明了样品效率的大量收益,同时产生比现有安全RL算法更少的违规行为。代码可用:https://github.com/redrew/cap
translated by 谷歌翻译
基于能量功能的安全证书可以为复杂机器人系统的安全控制任务提供可证明的安全保证。但是,所有有关基于学习的能量功能合成的最新研究仅考虑可行性,这可能会导致过度保存并导致效率较低的控制器。在这项工作中,我们提出了幅度的正规化技术,以通过降低能量功能内部的保守性,同时保持有希望的可证明的安全保证,以提高安全控制器的效率。具体而言,我们通过能量函数的幅度来量化保守性,并通过在合成损失中增加幅度的正则化项来降低保守性。我们提出了使用加固学习(RL)进行合成的SAFEMR算法来统一安全控制器和能量功能的学习过程。实验结果表明,所提出的方法确实会降低能量功能的保守性,并在控制器效率方面优于基准,同时确保安全性。
translated by 谷歌翻译
Reinforcement learning often suffer from the sparse reward issue in real-world robotics problems. Learning from demonstration (LfD) is an effective way to eliminate this problem, which leverages collected expert data to aid online learning. Prior works often assume that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task. In this paper, we consider the case where the target task is mismatched from but similar with that of the expert. Such setting can be challenging and we found existing LfD methods can not effectively guide learning in mismatched new tasks with sparse rewards. We propose conservative reward shaping from demonstration (CRSfD), which shapes the sparse rewards using estimated expert value function. To accelerate learning processes, CRSfD guides the agent to conservatively explore around demonstrations. Experimental results of robot manipulation tasks show that our approach outperforms baseline LfD methods when transferring demonstrations collected in a single task to other different but similar tasks.
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on offpolicy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.
translated by 谷歌翻译
由于在存在障碍物和高维视觉观测的情况下,由于在存在障碍和高维视觉观测的情况下,学习复杂的操纵任务是一个具有挑战性的问题。事先工作通过整合运动规划和强化学习来解决勘探问题。但是,运动计划程序增强策略需要访问状态信息,该信息通常在现实世界中不可用。为此,我们建议通过(1)视觉行为克隆以通过(1)视觉行为克隆来将基于国家的运动计划者增强策略,以删除运动计划员依赖以及其抖动运动,以及(2)基于视觉的增强学习来自行为克隆代理的平滑轨迹的指导。我们在阻塞环境中的三个操作任务中评估我们的方法,并将其与各种加固学习和模仿学习基线进行比较。结果表明,我们的框架是高度采样的和优于最先进的算法。此外,与域随机化相结合,我们的政策能够用零击转移到未经分散的人的未经环境环境。 https://clvrai.com/mopa-pd提供的代码和视频
translated by 谷歌翻译