Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
translated by 谷歌翻译
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
translated by 谷歌翻译
由于数据量增加,金融业的快速变化已经彻底改变了数据处理和数据分析的技术,并带来了新的理论和计算挑战。与古典随机控制理论和解决财务决策问题的其他分析方法相比,解决模型假设的财务决策问题,强化学习(RL)的新发展能够充分利用具有更少模型假设的大量财务数据并改善复杂的金融环境中的决策。该调查纸目的旨在审查最近的资金途径的发展和使用RL方法。我们介绍了马尔可夫决策过程,这是许多常用的RL方法的设置。然后引入各种算法,重点介绍不需要任何模型假设的基于价值和基于策略的方法。连接是用神经网络进行的,以扩展框架以包含深的RL算法。我们的调查通过讨论了这些RL算法在金融中各种决策问题中的应用,包括最佳执行,投资组合优化,期权定价和对冲,市场制作,智能订单路由和Robo-Awaring。
translated by 谷歌翻译
值得信赖的强化学习算法应有能力解决挑战性的现实问题,包括{Robustly}处理不确定性,满足{安全}的限制以避免灾难性的失败,以及在部署过程中{prencepentiming}以避免灾难性的失败}。这项研究旨在概述这些可信赖的强化学习的主要观点,即考虑其在鲁棒性,安全性和概括性上的内在脆弱性。特别是,我们给出严格的表述,对相应的方法进行分类,并讨论每个观点的基准。此外,我们提供了一个前景部分,以刺激有希望的未来方向,并简要讨论考虑人类反馈的外部漏洞。我们希望这项调查可以在统一的框架中将单独的研究汇合在一起,并促进强化学习的可信度。
translated by 谷歌翻译
This paper surveys the eld of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the eld and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but di ers considerably in the details and in the use of the word \reinforcement." The paper discusses central issues of reinforcement learning, including trading o exploration and exploitation, establishing the foundations of the eld via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
深度强化学习(DRL)和深度多机构的强化学习(MARL)在包括游戏AI,自动驾驶汽车,机器人技术等各种领域取得了巨大的成功。但是,众所周知,DRL和Deep MARL代理的样本效率低下,即使对于相对简单的问题设置,通常也需要数百万个相互作用,从而阻止了在实地场景中的广泛应用和部署。背后的一个瓶颈挑战是众所周知的探索问题,即如何有效地探索环境和收集信息丰富的经验,从而使政策学习受益于最佳研究。在稀疏的奖励,吵闹的干扰,长距离和非平稳的共同学习者的复杂环境中,这个问题变得更加具有挑战性。在本文中,我们对单格和多代理RL的现有勘探方法进行了全面的调查。我们通过确定有效探索的几个关键挑战开始调查。除了上述两个主要分支外,我们还包括其他具有不同思想和技术的著名探索方法。除了算法分析外,我们还对一组常用基准的DRL进行了全面和统一的经验比较。根据我们的算法和实证研究,我们终于总结了DRL和Deep Marl中探索的公开问题,并指出了一些未来的方向。
translated by 谷歌翻译
在过去的十年中,多智能经纪人强化学习(Marl)已经有了重大进展,但仍存在许多挑战,例如高样本复杂性和慢趋同稳定的政策,在广泛的部署之前需要克服,这是可能的。然而,在实践中,许多现实世界的环境已经部署了用于生成策略的次优或启发式方法。一个有趣的问题是如何最好地使用这些方法作为顾问,以帮助改善多代理领域的加强学习。在本文中,我们提供了一个原则的框架,用于将动作建议纳入多代理设置中的在线次优顾问。我们描述了在非传记通用随机游戏环境中提供多种智能强化代理(海军上将)的问题,并提出了两种新的基于Q学习的算法:海军上将决策(海军DM)和海军上将 - 顾问评估(Admiral-AE) ,这使我们能够通过适当地纳入顾问(Admiral-DM)的建议来改善学习,并评估顾问(Admiral-AE)的有效性。我们从理论上分析了算法,并在一般加上随机游戏中提供了关于他们学习的定点保证。此外,广泛的实验说明了这些算法:可以在各种环境中使用,具有对其他相关基线的有利相比的性能,可以扩展到大状态行动空间,并且对来自顾问的不良建议具有稳健性。
translated by 谷歌翻译
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years, however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations; without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction such as humanoid robots, self-driving vehicles, human computer interaction and computer games to name a few. However, specialized algorithms are needed to effectively and robustly learn models as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications and highlight current and future research directions.
translated by 谷歌翻译
在钢筋学习(RL)中,代理必须探索最初未知的环境,以便学习期望的行为。当RL代理部署在现实世界环境中时,安全性是主要关注的。受约束的马尔可夫决策过程(CMDPS)可以提供长期的安全约束;但是,该代理人可能会违反探索其环境的制约因素。本文提出了一种称为显式探索,漏洞探索或转义($ e ^ {4} $)的基于模型的RL算法,它将显式探索或利用($ e ^ {3} $)算法扩展到强大的CMDP设置。 $ e ^ 4 $明确地分离开发,探索和逃脱CMDP,允许针对已知状态的政策改进的有针对性的政策,发现未知状态,以及安全返回到已知状态。 $ e ^ 4 $强制优化了从一组CMDP模型的最坏情况CMDP上的这些策略,该模型符合部署环境的经验观察。理论结果表明,在整个学习过程中满足安全限制的情况下,在多项式时间中找到近最优的约束政策。我们讨论了稳健约束的离线优化算法,以及如何基于经验推理和先验知识来结合未知状态过渡动态的不确定性。
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
强化学习(RL)和脑电脑接口(BCI)是过去十年一直在增长的两个领域。直到最近,这些字段彼此独立操作。随着对循环(HITL)应用的兴趣升高,RL算法已经适用于人类指导,从而产生互动强化学习(IRL)的子领域。相邻的,BCI应用一直很感兴趣在人机交互期间从神经活动中提取内在反馈。这两个想法通过将BCI集成到IRL框架中,将RL和BCI设置在碰撞过程中,通过将内在反馈可用于帮助培训代理商来帮助框架。这种交叉点被称为内在的IRL。为了进一步帮助,促进BCI和IRL的更深层次,我们对内在IRILL的审查有着重点在于其母体领域的反馈驱动的IRL,同时还提供有关有效性,挑战和未来研究方向的讨论。
translated by 谷歌翻译
蒙特卡洛树搜索(MCT)是设计游戏机器人或解决顺序决策问题的强大方法。该方法依赖于平衡探索和开发的智能树搜索。MCT以模拟的形式进行随机抽样,并存储动作的统计数据,以在每个随后的迭代中做出更有教育的选择。然而,该方法已成为组合游戏的最新技术,但是,在更复杂的游戏(例如那些具有较高的分支因素或实时系列的游戏)以及各种实用领域(例如,运输,日程安排或安全性)有效的MCT应用程序通常需要其与问题有关的修改或与其他技术集成。这种特定领域的修改和混合方法是本调查的主要重点。最后一项主要的MCT调查已于2012年发布。自发布以来出现的贡献特别感兴趣。
translated by 谷歌翻译
过去半年来,从控制和强化学习社区的真实机器人部署的安全学习方法的贡献数量急剧上升。本文提供了一种简洁的但整体审查,对利用机器学习实现的最新进展,以实现在不确定因素下的安全决策,重点是统一控制理论和加固学习研究中使用的语言和框架。我们的评论包括:基于学习的控制方法,通过学习不确定的动态,加强学习方法,鼓励安全或坚固性的加固学习方法,以及可以正式证明学习控制政策安全的方法。随着基于数据和学习的机器人控制方法继续获得牵引力,研究人员必须了解何时以及如何最好地利用它们在安全势在必行的现实情景中,例如在靠近人类的情况下操作时。我们突出了一些开放的挑战,即将在未来几年推动机器人学习领域,并强调需要逼真的物理基准的基准,以便于控制和加固学习方法之间的公平比较。
translated by 谷歌翻译
我们介绍了有关风险分析与自治系统控制之间的联系的历史概述。我们提供两个主要贡献。我们的第一个贡献是提出三个重叠的范式,以对庞大的文献进行分类:最严重的案例,风险中性和风险避免风险的范式。我们考虑对自治系统依赖手头应用的风险进行适当的评估。相比之下,仅使用预期,差异或概率来评估风险是典型的。我们的第二个贡献是统一风险和自治系统的概念。我们通过连接量化和优化从学术领域的系统行为引起的风险的方法来实现这一目标。该调查是高度多学科的。我们包括来自强化学习,随机和健壮的控制理论,运营研究和正式验证的研究。我们描述了基于模型的方法和无模型方法,重点是前者。最后,我们重点介绍了富有成果的领域,以供进一步研究。一个关键方向是将基于风险的模型和无模型的方法融合在一起,以增强系统的实时自适应能力,以改善人类和环境福利。
translated by 谷歌翻译
数字化和远程连接扩大了攻击面,使网络系统更脆弱。由于攻击者变得越来越复杂和资源丰富,仅仅依赖传统网络保护,如入侵检测,防火墙和加密,不足以保护网络系统。网络弹性提供了一种新的安全范式,可以使用弹性机制来补充保护不足。一种网络弹性机制(CRM)适应了已知的或零日威胁和实际威胁和不确定性,并对他们进行战略性地响应,以便在成功攻击时保持网络系统的关键功能。反馈架构在启用CRM的在线感应,推理和致动过程中发挥关键作用。强化学习(RL)是一个重要的工具,对网络弹性的反馈架构构成。它允许CRM提供有限或没有事先知识和攻击者的有限攻击的顺序响应。在这项工作中,我们审查了Cyber​​恢复力的RL的文献,并讨论了对三种主要类型的漏洞,即姿势有关,与信息相关的脆弱性的网络恢复力。我们介绍了三个CRM的应用领域:移动目标防御,防守网络欺骗和辅助人类安全技术。 RL算法也有漏洞。我们解释了RL的三个漏洞和目前的攻击模型,其中攻击者针对环境与代理商之间交换的信息:奖励,国家观察和行动命令。我们展示攻击者可以通过最低攻击努力来欺骗RL代理商学习邪恶的政策。最后,我们讨论了RL为基于RL的CRM的网络安全和恢复力和新兴应用的未来挑战。
translated by 谷歌翻译
体验重播\ CITEP {Lin1993ReInforcement,Mnih2015human}是一种广泛使用的技术,可以实现有效利用数据和R1算法中的性能提高。在经验重放中,过去的转换存储在内存缓冲区中并在学习期间重新使用。在以前的作品中提出了从重播缓冲区中提出了用于从重放缓冲区的采样方案的各种建议,试图最佳选择这些经验,这些经历将有最大贡献的融合到最佳政策。在这里,我们对重播采样方案提供一些条件,该方案将确保收敛,重点是表格设置中的众所周知的Q学习算法。在为收敛建立充足的条件后,我们向建议以偏见方式重播的经验略有不同的用法作为改变所产生的策略的属性的方法。我们启动了对体验重放的严格研究作为控制和修改生成策略的属性的工具。特别是,我们表明使用适当的偏置采样方案可以允许我们实现\ emph {Safe}策略。我们认为,使用体验重放作为偏置机制,允许以可取的方式控制所产生的政策是许多应用程序具有有希望的潜力的想法。
translated by 谷歌翻译
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
translated by 谷歌翻译
Batch reinforcement learning is a subfield of dynamic programming-based reinforcement learning. Originally defined as the task of learning the best possible policy from a fixed set of a priori-known transition samples, the (batch) algorithms developed in this field can be easily adapted to the classical online case, where the agent interacts with the environment while learning. Due to the efficient use of collected data and the stability of the learning process, this research area has attracted a lot of attention recently. In this chapter, we introduce the basic principles and the theory behind batch reinforcement learning, describe the most important algorithms, exemplarily discuss ongoing research within this field, and briefly survey real-world applications of batch reinforcement learning.
translated by 谷歌翻译
学习涉及时变和不断发展的系统动态的控制政策通常对主流强化学习算法构成了巨大的挑战。在大多数标准方法中,通常认为动作是一组刚性的,固定的选择,这些选择以预定义的方式顺序应用于状态空间。因此,在不诉诸于重大学习过程的情况下,学识渊博的政策缺乏适应动作集和动作的“行为”结果的能力。此外,标准行动表示和动作引起的状态过渡机制固有地限制了如何将强化学习应用于复杂的现实世界应用中,这主要是由于所得大的状态空间的棘手性以及缺乏概括的学术知识对国家空间未知部分的政策。本文提出了一个贝叶斯味的广义增强学习框架,首先建立参数动作模型的概念,以更好地应对不确定性和流体动作行为,然后将增强领域的概念作为物理启发的结构引入通过“极化体验颗粒颗粒建立) “维持在学习代理的工作记忆中。这些粒子有效地编码了以自组织方式随时间演变的动态学习体验。在强化领域之上,我们将进一步概括策略学习过程,以通过将过去的记忆视为具有隐式图结构来结合高级决策概念,在该结构中,过去的内存实例(或粒子)与决策之间的相似性相互联系。定义,因此,可以应用“关联记忆”原则来增强学习代理的世界模型。
translated by 谷歌翻译