Reinforcement learning is a machine learning approach based on behavioral psychology. It is focused on learning agents that can acquire knowledge and learn to carry out new tasks by interacting with the environment. However, a problem occurs when reinforcement learning is used in critical contexts where the users of the system need to have more information and reliability for the actions executed by an agent. In this regard, explainable reinforcement learning seeks to provide to an agent in training with methods in order to explain its behavior in such a way that users with no experience in machine learning could understand the agent's behavior. One of these is the memory-based explainable reinforcement learning method that is used to compute probabilities of success for each state-action pair using an episodic memory. In this work, we propose to make use of the memory-based explainable reinforcement learning method in a hierarchical environment composed of sub-tasks that need to be first addressed to solve a more complex task. The end goal is to verify if it is possible to provide to the agent the ability to explain its actions in the global task as well as in the sub-tasks. The results obtained showed that it is possible to use the memory-based method in hierarchical environments with high-level tasks and compute the probabilities of success to be used as a basis for explaining the agent's behavior.
translated by 谷歌翻译
可解释的人工智能是一个研究领域,试图为自动智能系统提供更透明度。已经使用了解释性,尤其是在强化学习和机器人场景中,以更好地了解机器人决策过程。然而,以前的工作已广泛专注于提供技术解释,而不是非专家最终用户可以更好地理解的技术解释。在这项工作中,我们利用从成功的可能性中构建的类似人类的解释来完成自主机器人执行动作后显示的目标。这些解释旨在由没有或很少有人工智能方法经验的人来理解。本文提出了一项用户试验,以研究这些解释的重点是成功实现其目标的概率的解释是否构成了非专家最终用户的合适解释。获得的结果表明,与Q值产生的技术解释相比,非专业参与者的评分机器人的解释侧重于更高的成功概率和差异的差异,并且也赞成反事实解释而不是独立解释。
translated by 谷歌翻译
深度强化学习(RL)涉及使用深神经网络(DNN)来做出顺序决策,以最大程度地提高奖励。对于许多任务,由深度RL政策产生的一系列动作顺序对于人类来说可能是漫长而难以理解的。人类解释的一个关键组成部分是选择性,仅叙述关键决定和原因。使深层RL代理具有这种能力,将使他们的产生政策从人的角度更容易理解,并产生一套简洁的指示,以帮助学习未来的代理商。为此,我们使用具有情节内存系统的深度RL代理来识别和叙述策略执行期间的关键决策。我们表明,这些决策形成了一个简短的可读解释,也可以用来以算法独立的方式加快对天真的深度RL代理的学习。
translated by 谷歌翻译
In recent years, unmanned aerial vehicle (UAV) related technology has expanded knowledge in the area, bringing to light new problems and challenges that require solutions. Furthermore, because the technology allows processes usually carried out by people to be automated, it is in great demand in industrial sectors. The automation of these vehicles has been addressed in the literature, applying different machine learning strategies. Reinforcement learning (RL) is an automation framework that is frequently used to train autonomous agents. RL is a machine learning paradigm wherein an agent interacts with an environment to solve a given task. However, learning autonomously can be time consuming, computationally expensive, and may not be practical in highly-complex scenarios. Interactive reinforcement learning allows an external trainer to provide advice to an agent while it is learning a task. In this study, we set out to teach an RL agent to control a drone using reward-shaping and policy-shaping techniques simultaneously. Two simulated scenarios were proposed for the training; one without obstacles and one with obstacles. We also studied the influence of each technique. The results show that an agent trained simultaneously with both techniques obtains a lower reward than an agent trained using only a policy-based approach. Nevertheless, the agent achieves lower execution times and less dispersion during training.
translated by 谷歌翻译
Many works in explainable AI have focused on explaining black-box classification models. Explaining deep reinforcement learning (RL) policies in a manner that could be understood by domain users has received much less attention. In this paper, we propose a novel perspective to understanding RL policies based on identifying important states from automatically learned meta-states. The key conceptual difference between our approach and many previous ones is that we form meta-states based on locality governed by the expert policy dynamics rather than based on similarity of actions, and that we do not assume any particular knowledge of the underlying topology of the state space. Theoretically, we show that our algorithm to find meta-states converges and the objective that selects important states from each meta-state is submodular leading to efficient high quality greedy selection. Experiments on four domains (four rooms, door-key, minipacman, and pong) and a carefully conducted user study illustrate that our perspective leads to better understanding of the policy. We conjecture that this is a result of our meta-states being more intuitive in that the corresponding important states are strong indicators of tractable intermediate goals that are easier for humans to interpret and follow.
translated by 谷歌翻译
自2015年首次介绍以来,深入增强学习(DRL)方案的使用已大大增加。尽管在许多不同的应用中使用了使用,但他们仍然存在缺乏可解释性的问题。面包缺乏对研究人员和公众使用DRL解决方案的使用。为了解决这个问题,已经出现了可解释的人工智能(XAI)领域。这是各种不同的方法,它们希望打开DRL黑框,范围从使用可解释的符号决策树到诸如Shapley值之类的数值方法。这篇评论研究了使用哪些方法以及使用了哪些应用程序。这样做是为了确定哪些模型最适合每个应用程序,或者是否未充分利用方法。
translated by 谷歌翻译
Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative approach. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair within a learned MDP, the shield computes exact probabilities on how likely it is that executing the action results in violating the specification from the current state within the next $k$ steps. After the shield is constructed, the shield is used during runtime and blocks any actions that induce a too large risk from the agent. The shielded agent continues to explore the environment and collects new data on the environment. Iteratively, we use the collected data to learn new MDPs with higher accuracy, resulting in turn in shields able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
translated by 谷歌翻译
深度加强学习(DEEPRL)方法已广泛用于机器人学,以了解环境,自主获取行为。深度互动强化学习(Deepirl)包括来自外部培训师或专家的互动反馈,提供建议,帮助学习者选择采取行动以加快学习过程。但是,目前的研究仅限于仅为特工现任提供可操作建议的互动。另外,在单个使用之后,代理丢弃该信息,该用途在为Revisit以相同状态引起重复过程。在本文中,我们提出了广泛的建议(BPA),这是一种广泛的持久的咨询方法,可以保留并重新使用加工信息。它不仅可以帮助培训师提供与类似状态相关的更一般性建议,而不是仅仅是当前状态,而且还允许代理加快学习过程。我们在两个连续机器人场景中测试提出的方法,即购物车极衡任务和模拟机器人导航任务。所得结果表明,使用BPA的代理的性能在于与深层方法相比保持培训师所需的相互作用的数量。
translated by 谷歌翻译
Reinforcement Learning (RL) is a popular machine learning paradigm where intelligent agents interact with the environment to fulfill a long-term goal. Driven by the resurgence of deep learning, Deep RL (DRL) has witnessed great success over a wide spectrum of complex control tasks. Despite the encouraging results achieved, the deep neural network-based backbone is widely deemed as a black box that impedes practitioners to trust and employ trained agents in realistic scenarios where high security and reliability are essential. To alleviate this issue, a large volume of literature devoted to shedding light on the inner workings of the intelligent agents has been proposed, by constructing intrinsic interpretability or post-hoc explainability. In this survey, we provide a comprehensive review of existing works on eXplainable RL (XRL) and introduce a new taxonomy where prior works are clearly categorized into model-explaining, reward-explaining, state-explaining, and task-explaining methods. We also review and highlight RL methods that conversely leverage human knowledge to promote learning efficiency and performance of agents while this kind of method is often ignored in XRL field. Some challenges and opportunities in XRL are discussed. This survey intends to provide a high-level summarization of XRL and to motivate future research on more effective XRL solutions. Corresponding open source codes are collected and categorized at https://github.com/Plankson/awesome-explainable-reinforcement-learning.
translated by 谷歌翻译
在这项工作中,我们提出了一种初步调查一种名为DYNA-T的新算法。在钢筋学习(RL)中,规划代理有自己的环境表示作为模型。要发现与环境互动的最佳政策,代理商会收集试验和错误时尚的经验。经验可用于学习更好的模型或直接改进价值函数和政策。通常是分离的,Dyna-Q是一种混合方法,在每次迭代,利用真实体验更新模型以及值函数,同时使用模拟数据从其模型中的应用程序进行行动。然而,规划过程是计算昂贵的并且强烈取决于国家行动空间的维度。我们建议在模拟体验上构建一个上置信树(UCT),并在在线学习过程中搜索要选择的最佳动作。我们证明了我们提出的方法对来自Open AI的三个测试平台环境的一系列初步测试的有效性。与Dyna-Q相比,Dyna-T通过选择更强大的动作选择策略来优于随机环境中的最先进的RL代理。
translated by 谷歌翻译
近年来,可解释的人工智能(XAI)研究因对用户社区对AI的更高透明度和信任的需求而获得了突出性。这尤其重要,因为AI在金融,医学等敏感领域采用,在这种敏感领域,对社会,道德和安全的影响是巨大的。经过彻底的系统评估,XAI的工作主要集中于机器学习(ML)进行分类,决策或行动。据我们所知,没有任何据报道提供可解释的加固学习(XRL)方法来交易金融股票的方法。在本文中,我们提议在流行的深层增强学习体系结构,深Q网络(DQN)上采用Shapley添加说明(SHAP),以解释代理商在给定实例中在金融股票交易中的行动。为了证明我们方法的有效性,我们在两个流行的数据集(即Sensex和DJIA)上对其进行了测试,并报告了结果。
translated by 谷歌翻译
强化学习(RL)在很大程度上依赖于探索以从环境中学习并最大程度地获得观察到的奖励。因此,必须设计一个奖励功能,以确保从收到的经验中获得最佳学习。以前的工作将自动机和基于逻辑的奖励成型与环境假设相结合,以提供自动机制,以根据任务综合奖励功能。但是,关于如何将基于逻辑的奖励塑造扩大到多代理增强学习(MARL)的工作有限。如果任务需要合作,则环境将需要考虑联合状态,以跟踪其他代理,从而遭受对代理数量的维度的诅咒。该项目探讨了如何针对不同场景和任务设计基于逻辑的奖励成型。我们提出了一种针对半偏心逻辑基于逻辑的MARL奖励成型的新方法,该方法在代理数量中是可扩展的,并在多种情况下对其进行了评估。
translated by 谷歌翻译
未用性的自治车辆(无人机)在过去的美国军事活动中对侦察和监督任务进行了重大贡献。随着无人机的普遍性增加,柜台上还有改进,使他们难以在感兴趣的领域成功获得宝贵的智能。因此,现代无人机可以在最大化他们的生存机会的同时实现他们的任务已经重要。在这项工作中,我们专门研究从指定开始到目标的识别短路的问题,同时收集所有奖励,避免随机移动到网格上的对手。我们还可以在军事环境中提供框架的可能应用,即自动伤员疏散。我们展示了三种方法来解决这个问题的比较:即我们实施一个深度Q学习模型,一个$ \ varepsilon $ -greedy表格Q学习模型,以及在线优化框架。我们的计算实验,使用具有随机对手的简单网格世界环境设计,展示这些方法如何工作,并在性能,准确性和计算时间方面进行比较。
translated by 谷歌翻译
了解强化学习(RL)代理的新兴行为可能很困难,因为这种代理通常使用高度复杂的决策程序在复杂的环境中进行训练。这引起了RL中解释性的多种方法,旨在调和可能在主体行为与观察者预期的行为之间产生的差异。最近的方法取决于域知识,这可能并非总是可用的,分析代理商的策略,或者是对基础环境的特定要素的分析,通常被建模为马尔可夫决策过程(MDP)。我们的主要主张是,即使基本的MDP尚不完全了解(例如,尚未准确地了解过渡概率),也没有由代理商维护(即,在使用无模型方法时),但仍可以利用它为自动生成解释。为此,我们建议使用以前在文献中使用的正式MDP抽象和转换来加快寻找最佳策略的搜索,以自动产生解释。由于这种转换通常基于环境的符号表示,因此它们可能代表了预期和实际代理行为之间差距的有意义的解释。我们正式定义了这个问题,建议一类可用于解释新兴行为的转换,并提出了有效搜索解释的方法。我们演示了一组标准基准测试的方法。
translated by 谷歌翻译
This paper surveys the eld of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the eld and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but di ers considerably in the details and in the use of the word \reinforcement." The paper discusses central issues of reinforcement learning, including trading o exploration and exploitation, establishing the foundations of the eld via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
translated by 谷歌翻译
Hierarchical decomposition of control is unavoidable in large dynamical systems. In reinforcement learning (RL), it is usually solved with subgoals defined at higher policy levels and achieved at lower policy levels. Reaching these goals can take a substantial amount of time, during which it is not verified whether they are still worth pursuing. However, due to the randomness of the environment, these goals may become obsolete. In this paper, we address this gap in the state-of-the-art approaches and propose a method in which the validity of higher-level actions (thus lower-level goals) is constantly verified at the higher level. If the actions, i.e. lower level goals, become inadequate, they are replaced by more appropriate ones. This way we combine the advantages of hierarchical RL, which is fast training, and flat RL, which is immediate reactivity. We study our approach experimentally on seven benchmark environments.
translated by 谷歌翻译
在这项工作中,我们提出并评估了一种新的增强学习方法,紧凑体验重放(编者),它使用基于相似转换集的复发的预测目标值的时间差异学习,以及基于两个转换的经验重放的新方法记忆。我们的目标是减少在长期累计累计奖励的经纪人培训所需的经验。它与强化学习的相关性与少量观察结果有关,即它需要实现类似于文献中的相关方法获得的结果,这通常需要数百万视频框架来培训ATARI 2600游戏。我们举报了在八个挑战街机学习环境(ALE)挑战游戏中,为仅10万帧的培训试验和大约25,000次迭代的培训试验中报告了培训试验。我们还在与基线的同一游戏中具有相同的实验协议的DQN代理呈现结果。为了验证从较少数量的观察结果近似于良好的政策,我们还将其结果与从啤酒的基准上呈现的数百万帧中获得的结果进行比较。
translated by 谷歌翻译
最近的自主代理和机器人的应用,如自动驾驶汽车,情景的培训师,勘探机器人和服务机器人带来了关注与当前生成人工智能(AI)系统相关的至关重要的信任相关挑战。尽管取得了巨大的成功,基于连接主义深度学习神经网络方法的神经网络方法缺乏解释他们对他人的决策和行动的能力。没有符号解释能力,它们是黑色盒子,这使得他们的决定或行动不透明,这使得难以信任它们在安全关键的应用中。最近对AI系统解释性的立场目睹了可解释的人工智能(XAI)的几种方法;然而,大多数研究都专注于应用于计算科学中的数据驱动的XAI系统。解决越来越普遍的目标驱动器和机器人的研究仍然缺失。本文评论了可解释的目标驱动智能代理和机器人的方法,重点是解释和沟通代理人感知功能的技术(示例,感官和愿景)和认知推理(例如,信仰,欲望,意图,计划和目标)循环中的人类。审查强调了强调透明度,可辨与和持续学习以获得解释性的关键策略。最后,本文提出了解释性的要求,并提出了用于实现有效目标驱动可解释的代理和机器人的路线图。
translated by 谷歌翻译
Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
translated by 谷歌翻译
马尔可夫决策过程通常用于不确定性下的顺序决策。然而,对于许多方面,从受约束或安全规范到任务和奖励结构中的各种时间(非Markovian)依赖性,需要扩展。为此,近年来,兴趣已经发展成为强化学习和时间逻辑的组合,即灵活的行为学习方法的组合,具有稳健的验证和保证。在本文中,我们描述了最近引入的常规决策过程的实验调查,该过程支持非马洛维亚奖励功能以及过渡职能。特别是,我们为常规决策过程,与在线,增量学习有关的算法扩展,对无模型和基于模型的解决方案算法的实证评估,以及以常规但非马尔维亚,网格世界的应用程序的算法扩展。
translated by 谷歌翻译