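Communicating useful background knowledge to reinforcement learning (RL) agents is an important method for accelerating learning. We introduce RLang, a domain-specific language (DSL) for communicating domain knowledge to an RL agent. Unlike other existing DSLs proposed by the RL community, which ground to single elements of a decision-making formalism (e.g., the reward function or the policy function), RLang can specify information about every element of a Markov decision process. We define precise syntax and grounding semantics for RLang, and provide a parser implementation that grounds RLang programs to an algorithm-agnostic partial world model and policy that can be exploited by an RL agent. We provide a series of example RLang programs and demonstrate how different RL methods can exploit the resulting knowledge, including model-free and model-based tabular algorithms, hierarchical approaches, and deep RL algorithms (including both policy gradient and value-based methods).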
translated by Google Translate
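Two common approaches to sequential decision-making are AI planning (AIP) and reinforcement learning (RL). Each has strengths and weaknesses. AIP is interpretable, easy to integrate with symbolic knowledge, and often efficient, but it requires an up-front logical domain specification and is sensitive to noise; RL requires only the specification of rewards and is robust to noise, but it is inefficient and not easily supplied with external knowledge. We propose an integrative approach that combines high-level planning with RL, retaining interpretability, transfer, and efficiency while allowing robust learning of the lower-level plan actions. Our approach defines options in hierarchical reinforcement learning (HRL) from AIP operators by establishing a correspondence between the state transition model of the AI planning problem and the abstract state transition system of a Markov decision process (MDP). Options are learned by adding intrinsic rewards that encourage consistency between the MDP and AIP transition models. We demonstrate the benefit of our integrated approach by comparing the performance of RL and HRL algorithms in both MiniGrid and N-rooms environments, showing the advantage of our method over existing ones.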
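Long-horizon robot learning tasks with sparse rewards pose a significant challenge for current reinforcement learning algorithms. A key feature that enables humans to learn challenging control tasks is that they often receive expert intervention, which lets them understand the high-level structure of a task before mastering low-level control actions. We propose a framework for leveraging expert intervention to solve long-horizon reinforcement learning tasks. We consider \emph{option templates}, which are specifications encoding a potential option that can be trained using reinforcement learning. We formulate expert intervention as allowing the agent to execute option templates before learning an implementation. This enables the agent to use an option before committing costly resources to learning it. We evaluate our approach on three challenging reinforcement learning problems, showing that it outperforms state-of-the-art approaches. Videos of trained agents and our code can be found at: https://sites.google.com/view/stickymittens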
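In this paper we present a novel method for learning hierarchical representations of Markov decision processes. Our method works by partitioning the state space into subsets and defining subtasks for performing transitions between the partitions. We formulate the problem of partitioning the state space as an optimization problem that can be solved using gradient descent given a set of sampled trajectories, making our method suitable for high-dimensional problems with large state spaces. We empirically validate the method by showing that it can successfully learn a useful hierarchical representation in a navigation domain. Once learned, the hierarchical representation can be used to solve different tasks in the given domain, thus generalizing knowledge across tasks.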
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
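We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that, by a careful composition of the GHMs of the base policies and without any additional learning, we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result for previously proposed methods and showing how to train these models stably in deep RL settings.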
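As a rough illustration of the standard GPI step that this abstract builds on (not the GHM composition itself), the sketch below picks, in a given state, the action whose best value across a set of base policies is highest. The tabular layout, the function name `gpi_action`, and the toy numbers are assumptions made for this example only.

```python
from typing import Dict, List, Sequence

# Assumed layout: one table per base policy, mapping a state to its
# list of action values Q_i(s, a). All names here are illustrative.
QTable = Dict[str, List[float]]


def gpi_action(q_tables: Sequence[QTable], state: str) -> int:
    """Standard GPI: return argmax over a of max over i of Q_i(state, a)."""
    n_actions = len(q_tables[0][state])
    # For each action, keep the most optimistic value over the base policies.
    best_per_action = [
        max(q[state][a] for q in q_tables) for a in range(n_actions)
    ]
    # Act greedily with respect to that upper envelope.
    return max(range(n_actions), key=best_per_action.__getitem__)


# Toy usage with two hypothetical base policies and three actions.
q_policy_1 = {"s": [1.0, 0.0, 0.2]}
q_policy_2 = {"s": [0.1, 0.5, 2.0]}
chosen = gpi_action([q_policy_1, q_policy_2], "s")
```

In this toy case the per-action maxima are 1.0, 0.5 and 2.0, so the third action (index 2) is selected, even though the first policy on its own would prefer index 0.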
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
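Markov decision processes are typically used for sequential decision-making under uncertainty. For many aspects, however, ranging from constrained or safe specifications to various kinds of temporal (non-Markovian) dependencies in task and reward structures, extensions are needed. To that end, interest has grown in recent years in combinations of reinforcement learning and temporal logic, that is, combinations of flexible behavior-learning methods with robust verification and guarantees. In this paper we describe an experimental investigation of the recently introduced regular decision processes, which support non-Markovian reward functions as well as non-Markovian transition functions. In particular, we provide, for regular decision processes, algorithmic extensions relating to online, incremental learning, an empirical evaluation of model-free and model-based solution algorithms, and applications in regular but non-Markovian grid worlds.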
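This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated for many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.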
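The inverse reinforcement learning (\textit{IRL}) problem has evolved rapidly in the past few years, with important applications in domains such as robotics, cognition, and health. In this work, we explore the inefficacy of current IRL methods in learning an agent's reward function from expert trajectories depicting long-horizon, complex sequential tasks. We hypothesize that imbuing IRL models with structural motifs that capture the underlying tasks can enable and enhance their performance. We then propose a novel IRL method, SMIRL, that first learns the (approximate) structure of a task as a finite-state automaton (FSA) and then uses the structural motif to solve the IRL problem. We test our model on both discrete grid-world and high-dimensional continuous-domain environments. We empirically show that our proposed approach successfully learns all four complex tasks, on which two foundational IRL baselines fail. Our model also outperforms the baselines in sample efficiency on a simple toy task. We further show promising test results in a modified continuous domain on tasks with compositional reward functions.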
Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative approach. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair within a learned MDP, the shield computes exact probabilities on how likely it is that executing the action results in violating the specification from the current state within the next $k$ steps. After the shield is constructed, the shield is used during runtime and blocks any actions that induce a too large risk from the agent. The shielded agent continues to explore the environment and collects new data on the environment. Iteratively, we use the collected data to learn new MDPs with higher accuracy, resulting in turn in shields able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
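Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at massive scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, and various new agents introduced as part of Acme.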
We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We annotate all statements with executable Python programs representing their meaning to enable exact reward computation in every possible world state. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.
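Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state "task automata" from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP, by treating it as a partially observable MDP and using off-the-shelf algorithms for hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well suited for use in transfer learning. Finally, we provide experimental results that illustrate our algorithm's performance in different environments and tasks, as well as its ability to incorporate prior domain knowledge to facilitate more efficient learning.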
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
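Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stakes domains such as autonomous driving or medical applications. In such contexts, a learned policy needs, for instance, to be interpretable, so that it can be inspected before any deployment (e.g., for safety and verifiability reasons). This survey provides an overview of various approaches to achieving higher interpretability in reinforcement learning (RL). To that end, we distinguish interpretability (as a property of a model) from explainability (as a post-hoc operation, with the intervention of a proxy) and discuss them in the context of RL, with an emphasis on the former notion. In particular, we argue that interpretable RL may embrace different facets: interpretable inputs, interpretable (transition/reward) models, and interpretable decision-making. Based on this scheme, we summarize and analyze recent work related to interpretable RL, with an emphasis on papers published in the past 10 years. We also briefly discuss some related research areas and point to some potentially promising research directions.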
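Reinforcement learning (RL) is a central problem in artificial intelligence. This problem consists of defining artificial agents that can learn optimal behaviour by interacting with an environment, where the optimal behaviour is defined with respect to a reward signal that the agent seeks to maximize. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be learned efficiently via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem in which the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.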
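To make the notion of a reward machine more concrete, here is a minimal sketch of a hand-specified one: a finite automaton whose transitions are triggered by high-level event labels and emit rewards. It does not show the learning procedure described in the abstract. The class name, the "coffee then office" task, and the label strings are all hypothetical choices for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RewardMachine:
    """Illustrative reward machine: (rm_state, label) -> (next_rm_state, reward)."""
    initial: str
    delta: Dict[Tuple[str, str], Tuple[str, float]]
    state: str = field(init=False)

    def __post_init__(self) -> None:
        self.state = self.initial

    def step(self, label: str) -> float:
        """Advance on an observed event label and return the emitted reward.

        Labels with no outgoing transition leave the state unchanged
        and yield zero reward (an assumption of this sketch).
        """
        next_state, reward = self.delta.get(
            (self.state, label), (self.state, 0.0)
        )
        self.state = next_state
        return reward


# Hypothetical task: pick up coffee, then deliver it to the office.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("u2", 1.0),
    },
)
rewards = [rm.step(label) for label in ["office", "coffee", "office"]]
```

Visiting the office before picking up coffee earns nothing here, because the reward depends on the machine's state and not only on the current observation; this is the kind of non-Markovian structure an RM makes explicit.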
Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
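The shielding rule described above (compute, for each action, the maximal probability of not violating the specification within the next $k$ steps, then block actions below a threshold) can be sketched on a toy MDP as follows. This is a plain recursive computation under simplifying assumptions, not the paper's implementation: the safety specification is reduced to "never enter an unsafe state", and the MDP, state names, and function names are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumed toy encoding: mdp[state][action] -> list of (probability, next_state).
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def max_safe_prob(mdp: MDP, unsafe: Set[str], state: str, k: int) -> float:
    """Maximal probability of avoiding unsafe states for the next k steps."""
    if state in unsafe:
        return 0.0
    if k == 0 or not mdp.get(state):
        return 1.0
    return max(action_safety(mdp, unsafe, state, a, k) for a in mdp[state])


def action_safety(mdp: MDP, unsafe: Set[str], state: str,
                  action: str, k: int) -> float:
    """Safety of taking `action` now and acting as safely as possible after."""
    return sum(p * max_safe_prob(mdp, unsafe, nxt, k - 1)
               for p, nxt in mdp[state][action])


def shield(mdp: MDP, unsafe: Set[str], state: str,
           k: int, threshold: float) -> List[str]:
    """Return the actions whose k-step safety probability meets the threshold."""
    return [a for a in mdp[state]
            if action_safety(mdp, unsafe, state, a, k) >= threshold]


# Hypothetical slippery corridor: moving right risks a crash.
toy_mdp: MDP = {
    "s0": {"left": [(1.0, "s0")], "right": [(0.8, "s1"), (0.2, "crash")]},
    "s1": {"left": [(1.0, "s0")], "right": [(0.5, "s1"), (0.5, "crash")]},
    "crash": {},
}
allowed = shield(toy_mdp, {"crash"}, "s0", k=2, threshold=0.9)
```

With a horizon of two steps and a threshold of 0.9, only `left` is allowed from `s0` in this toy model, since `right` stays safe with probability 0.8 at best. An online shield in the sense of the abstract would run this kind of analysis at runtime for states reachable in the near future rather than for every state in advance.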
Linear temporal logic (LTL) is a widely used task specification language with a compositional grammar that naturally induces temporally extended behaviours across tasks, including conditionals and alternative realizations. An important problem in RL with LTL tasks is to learn task-conditioned policies that can generalize zero-shot to new LTL instructions not observed during training. However, because symbolic observation is often lossy and LTL tasks can have long time horizons, previous works can suffer from issues such as sample inefficiency in training and infeasibility or sub-optimality of the solutions found. To tackle these issues, this paper proposes a novel multi-task RL algorithm with improved learning efficiency and optimality. To achieve global optimality of task completion, we propose to learn options dependent on the future subgoals via a novel off-policy approach. To propagate the rewards of satisfying future subgoals back more efficiently, we propose to train a multi-step value function conditioned on the subgoal sequence, which is updated with Monte Carlo estimates of multi-step discounted returns. In experiments on three different domains, we evaluate the LTL generalization capability of the agent trained by the proposed method, showing its advantage over previous representative methods.
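For readers unfamiliar with the quantity mentioned above, a Monte Carlo estimate of a multi-step discounted return is just the discounted sum of the rewards observed along one sampled trajectory. The sketch below shows only that generic computation; the function name and the sample rewards are assumptions for this example, and it does not reproduce the subgoal-conditioned value function of the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum over t of gamma**t * rewards[t], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Each step back in time discounts everything that came after it.
        g = r + gamma * g
    return g


# Hypothetical trajectory: reward arrives only when the last subgoal is met.
sample_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

Here the single reward arrives two steps in the future, so it is discounted twice and the return is 0.25. Using such sampled returns as regression targets is what lets reward from distant subgoals reach earlier states without many one-step bootstrapped updates.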
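Reinforcement learning (RL) relies heavily on exploration to learn from its environment and maximize observed rewards. It is therefore essential to design a reward function that guarantees optimal learning from the received experience. Previous work has combined automata and logic-based reward shaping with environment assumptions to provide an automatic mechanism for synthesizing the reward function based on the task. However, there is limited work on how to extend logic-based reward shaping to multi-agent reinforcement learning (MARL). If the task requires cooperation, the environment needs to consider the joint state in order to keep track of other agents, and thus suffers from the curse of dimensionality with respect to the number of agents. This project explores how logic-based reward shaping for MARL can be designed for different scenarios and tasks. We present a novel method for semi-centralized logic-based MARL reward shaping that is scalable in the number of agents, and we evaluate it in multiple scenarios.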