In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal, and subject to this constraint also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorithms that can be used to solve such problems, and prove that they converge to policies that are lexicographically optimal. We evaluate the scalability and performance of these algorithms empirically, demonstrating their practical applicability. As a more specific application, we show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms.
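The lexicographic ordering described above can be illustrated with a small action-selection sketch. It is not one of the paper's algorithms: it only shows greedy selection over already-estimated action values, and the function name `lexicographic_greedy` and the `slack` tolerance are assumptions introduced for illustration.

```python
from typing import List, Sequence


def lexicographic_greedy(q_values: Sequence[Sequence[float]], slack: float = 1e-9) -> int:
    """Pick an action that is greedy for objective 0, then objective 1, and so on.

    q_values[i][a] is the estimated value of action a under objective i,
    with objectives listed in priority order. `slack` (hypothetical) is the
    tolerance within which two values count as equally good.
    """
    candidates: List[int] = list(range(len(q_values[0])))
    for q in q_values:
        best = max(q[a] for a in candidates)
        # Keep only the actions that are (near-)optimal for this objective.
        candidates = [a for a in candidates if q[a] >= best - slack]
    # Any remaining action is lexicographically optimal; return the first.
    return candidates[0]


if __name__ == "__main__":
    q_first = [1.0, 1.0, 0.5]   # actions 0 and 1 tie on the first objective
    q_second = [0.0, 2.0, 5.0]  # action 2 is best here but was already excluded
    print(lexicographic_greedy([q_first, q_second]))
```

Action 2 has the highest second-objective value but is filtered out because it is suboptimal for the first objective, which is the defining property of the lexicographic ordering.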
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
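As a concrete instance of one of the behavioural models named above, Boltzmann rationality is commonly written as a softmax over optimal action values, $\pi(a \mid s) \propto \exp(\beta\, Q^*(s, a))$. The sketch below assumes the action values for a single state are already given; the function name `boltzmann_policy` and the parameter name `beta` are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Sequence


def boltzmann_policy(q_values: Sequence[float], beta: float = 1.0) -> List[float]:
    """Return action probabilities proportional to exp(beta * Q(s, a)).

    beta is the inverse temperature: beta = 0 gives a uniform policy and
    large beta approaches the greedy (optimal) policy.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = [1.0, 2.0, 0.0]
    for beta in (0.0, 1.0, 10.0):
        print(beta, boltzmann_policy(q, beta))
```

Comparing the output for different values of `beta` shows how the same reward can induce different demonstrator policies, which is the kind of model dependence the misspecification analysis is concerned with.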
Automated synthesis of provably correct controllers for cyber-physical systems is crucial for deploying these systems in safety-critical scenarios. However, their hybrid features and stochastic or unknown behaviours make this synthesis problem challenging. In this paper, we propose a method for synthesizing controllers for Markov jump linear systems (MJLSs), a particular class of cyber-physical systems, that certifiably satisfy a requirement expressed as a specification in probabilistic computation tree logic (PCTL). An MJLS consists of a finite set of linear dynamics (modes) with unknown additive disturbances, where jumps between these modes are governed by a Markov decision process (MDP). We consider both the case where the transition function of this MDP is given by probability intervals and the case where it is completely unknown. Our approach is based on generating a finite-state abstraction which captures both the discrete and the continuous behaviour of the original system. We formalise this abstraction as an interval Markov decision process (iMDP): intervals of transition probabilities are computed using sampling techniques from the so-called "scenario approach", resulting in a probabilistically sound approximation of the MJLS. This iMDP abstracts both the jump dynamics between modes and the continuous dynamics within the modes. To demonstrate the efficacy of our technique, we apply our method to multiple realistic benchmark problems, in particular temperature control and aerial vehicle delivery problems.
Capturing uncertainty in models of complex dynamical systems is crucial to designing safe controllers. Stochastic noise causes aleatoric uncertainty, whereas imprecise knowledge of model parameters leads to epistemic uncertainty. Several approaches use formal abstractions to synthesize policies that satisfy temporal specifications related to safety and reachability. However, the underlying models exclusively capture aleatoric but not epistemic uncertainty, and thus require that model parameters are known precisely. Our contribution to overcoming this restriction is a novel abstraction-based controller synthesis method for continuous-state models with stochastic noise and uncertain parameters. By sampling techniques and robust analysis, we capture both aleatoric and epistemic uncertainty, with a user-specified confidence level, in the transition probability intervals of a so-called interval Markov decision process (iMDP). We synthesize an optimal policy on this iMDP, which translates (with the specified confidence level) to a feedback controller for the continuous model with the same performance guarantees. Our experimental benchmarks confirm that accounting for epistemic uncertainty leads to controllers that are more robust against variations in parameter values.
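To make the role of the transition probability intervals concrete, the sketch below computes the worst-case expected value of a successor-state value function over all distributions consistent with given intervals. This is the standard inner step of robust value iteration on interval MDPs, not code from the paper; the function name `worst_case_expectation` and the example numbers are assumptions for illustration.

```python
from typing import Sequence


def worst_case_expectation(
    lower: Sequence[float], upper: Sequence[float], values: Sequence[float]
) -> float:
    """Minimise sum_s p[s] * values[s] subject to lower <= p <= upper, sum(p) = 1.

    Greedy solution: start every successor at its lower bound, then assign
    the remaining probability mass to the lowest-value successors first.
    """
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals admit no valid probability distribution")
    probs = list(lower)
    budget = 1.0 - sum(lower)
    # Visit successors from lowest to highest value (the adversary's preference).
    for s in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[s] - lower[s], budget)
        probs[s] += extra
        budget -= extra
    return sum(p * v for p, v in zip(probs, values))


if __name__ == "__main__":
    lower = [0.2, 0.3, 0.1]
    upper = [0.6, 0.7, 0.5]
    values = [1.0, 0.0, 2.0]
    print(worst_case_expectation(lower, upper, values))
```

A robust policy then picks, in each state, the action whose worst-case expectation is largest, so the resulting guarantee holds for every transition function inside the intervals.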
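LCRL is a software tool that implements model-free reinforcement learning (RL) algorithms over unknown Markov decision processes (MDPs), synthesising policies that satisfy a given linear temporal specification with maximal probability. LCRL uses partially deterministic finite-state machines known as Limit Deterministic Buchi Automata (LDBA) to express the given linear temporal specification. The reward function for the RL algorithm is shaped on the fly, based on the structure of the LDBA. Theoretical guarantees, under appropriate assumptions, ensure that the RL algorithm converges to an optimal policy that maximises the satisfaction probability. We present case studies to demonstrate the applicability, ease of use, scalability, and performance of LCRL. Owing to the LDBA-guided exploration and the model-free architecture of LCRL, we observe robust performance, which also scales well compared to standard RL approaches (whenever these are applicable to LTL specifications). Full instructions on how to run all the case studies in this paper are provided on the GitHub page that accompanies the LCRL distribution, www.github.com/grockious/lcrl.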
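Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state "task automata" from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP, by treating it as a partially observable MDP and using off-the-shelf algorithms for hidden Markov models. Second, we propose a novel method for extracting the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks without misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well suited for use in transfer learning. Finally, we provide experimental results to illustrate our algorithm's performance in different environments and tasks, and its ability to incorporate prior domain knowledge to facilitate more efficient learning.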
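Heating and cooling systems in buildings account for 31\% of global energy use, much of which is regulated by rule-based controllers (RBCs) that neither maximise energy efficiency nor minimise emissions by interacting optimally with the grid. Control via reinforcement learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require access to building-specific simulators or data that cannot be expected for every building in the world. In response, we show that it is possible to obtain emission-reducing policies without such knowledge, a paradigm we call zero-shot building control. We combine ideas from system identification and model-based RL to create PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show that a short period of active exploration is all that is required to build a performant model. In experiments across three varied building energy simulations, we show that PEARL outperforms existing RBCs and popular RL baselines in all cases, reducing building emissions by up to 31\% while maintaining thermal comfort. Our source code is available online via https://enjeener.io/projects/pearl.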
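Controllers for autonomous systems that operate in safety-critical settings must account for stochastic disturbances. Such disturbances are often modelled as process noise, and common assumptions are that the underlying distributions are known and/or Gaussian. In practice, however, these assumptions may be unrealistic and can lead to poor approximations of the true noise distribution. We present a novel planning method that does not rely on any explicit representation of the noise distributions. In particular, we address the problem of computing a controller that provides probabilistic guarantees on safely reaching a target. First, we abstract the continuous system into a discrete-state model that captures noise by probabilistic transitions between states. As a key contribution, we adapt tools from the scenario approach to compute bounds on these transition probabilities, based on a finite number of samples of the noise. We capture these bounds in the transition probability intervals of a so-called interval Markov decision process (iMDP). This iMDP is robust against uncertainty in the transition probabilities, and the tightness of the probability intervals can be controlled through the number of samples. We use state-of-the-art verification techniques to provide guarantees on the iMDP, and compute a controller for which these guarantees carry over to the autonomous system. We demonstrate the practical applicability of our method even when the iMDP has millions of states or transitions.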
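This paper investigates the motion planning of autonomous dynamical systems modelled by Markov decision processes (MDPs) with unknown transition probabilities over continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks over an infinite horizon, which can be converted into a limit deterministic generalized B\"uchi automaton (LDGBA) with several accepting sets. The novelty is the design of an embedded product MDP (EP-MDP) between the LDGBA and the MDP, which incorporates a synchronous tracking-frontier function to record the unvisited accepting sets of the automaton and to facilitate the satisfaction of the accepting conditions. The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) depend only on the EP-MDP states and can overcome the problem of sparse rewards. Rigorous analysis shows that any RL method that optimizes the expected discounted return is guaranteed to find an optimal policy whose traces maximize the satisfaction probability. A modular deep deterministic policy gradient (DDPG) is then developed to generate such policies over continuous state and action spaces. The performance of our framework is evaluated on a range of OpenAI Gym environments.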
Computational units in artificial neural networks follow a simplified model of biological neurons. In the biological model, the output signal of a neuron runs down the axon, splits following the many branches at its end, and passes identically to all the downward neurons of the network. Each downward neuron uses its copy of this signal as one of many dendritic inputs, integrates them all, and fires an output if the result exceeds some threshold. In the artificial neural network, this translates to the fact that the nonlinear filtering of the signal is performed in the upward neuron, meaning that in practice the same activation is shared between all the downward neurons that use that signal as their input. Dendrites thus play a passive role. We propose a slightly more complex model for the biological neuron, where dendrites play an active role: the activation in the output of the upward neuron becomes optional, and instead the signals going through each dendrite undergo independent nonlinear filtering before the linear combination. We implement this new model as a ReLU computational unit and discuss its biological plausibility. We compare this new computational unit with the standard one and describe it from a geometrical point of view. We provide a Keras implementation of this unit for fully connected and convolutional layers and estimate the resulting change in FLOPs and number of weights. We then use these layers in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, obtaining performance improvements over standard ResNets of up to 1.73%. Finally, we prove a universal representation theorem for continuous functions on compact sets and show that this new unit has more representational power than its standard counterpart.
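The contrast between a shared activation and per-dendrite filtering can be sketched in a few lines. The abstract does not give the exact form of the proposed unit, so the per-connection bias used below to make each dendrite's ReLU independent is an assumption, as are the function names; the paper's Keras layers may differ.

```python
from typing import Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def standard_unit(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Standard unit: linear combination first, then one activation on the output."""
    return relu(sum(w * x for w, x in zip(weights, inputs)) + bias)


def dendritic_unit(
    inputs: Sequence[float],
    weights: Sequence[float],
    biases: Sequence[float],
    output_activation: bool = False,
) -> float:
    """Hypothetical dendritic unit: each input passes through its own ReLU
    (shifted by an assumed per-connection bias) before the linear combination.
    The output activation is optional, as described in the abstract."""
    total = sum(w * relu(x + b) for x, w, b in zip(inputs, weights, biases))
    return relu(total) if output_activation else total


if __name__ == "__main__":
    x = [1.0, -2.0]
    w = [2.0, 3.0]
    print(standard_unit(x, w, bias=0.0))
    print(dendritic_unit(x, w, biases=[0.0, 3.0]))
```

Under this assumption, two downward neurons receiving the same upstream signal can filter it differently because each connection carries its own bias, whereas in the standard unit they all see the same activated value.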