Learning-to-defer is a framework to automatically defer decision-making to a human expert when ML-based decisions are deemed unreliable. Existing learning-to-defer frameworks are not designed for sequential settings. That is, they defer at every instance independently, based on immediate predictions, while ignoring the potential long-term impact of these interventions. As a result, existing frameworks are myopic. Further, they do not defer adaptively, which is crucial when human interventions are costly. In this work, we propose Sequential Learning-to-Defer (SLTD), a framework for learning-to-defer to a domain expert in sequential decision-making settings. Contrary to existing literature, we pose the problem of learning-to-defer as model-based reinforcement learning (RL) to i) account for long-term consequences of ML-based actions using RL and ii) adaptively defer based on the dynamics (model-based). Our proposed framework determines whether to defer (at each time step) by quantifying whether a deferral now will improve the value compared to delaying deferral to the next time step. To quantify the improvement, we account for potential future deferrals. As a result, we learn a pre-emptive deferral policy (i.e. a policy that defers early if using the ML-based policy could worsen long-term outcomes). Our deferral policy is adaptive to the non-stationarity in the dynamics. We demonstrate that adaptive deferral via SLTD provides an improved trade-off between long-term outcomes and deferral frequency on synthetic, semi-synthetic, and real-world data with non-stationary dynamics. Finally, we interpret the deferral decision by decomposing the propagated (long-term) uncertainty around the outcome, to justify the deferral decision.
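The core deferral rule the abstract describes, deferring now only if it improves value over waiting one more step (where waiting still leaves the option to defer later), can be sketched as follows. This is a minimal illustration; the value estimates, deferral cost, and `margin` threshold are hypothetical placeholders, not the paper's actual quantities:

```python
def should_defer(value_defer_now, value_wait_then_act, defer_cost=0.0, margin=0.0):
    """Pre-emptive deferral rule (sketch): hand control to the expert at this
    step only when doing so beats continuing with the ML policy for one more
    step -- where 'one more step' already accounts for potential future
    deferrals. Both value estimates are assumed to come from a learned
    dynamics model."""
    return (value_defer_now - defer_cost) > (value_wait_then_act + margin)
```

Making the comparison depend on model-based value estimates is what lets the rule adapt to non-stationary dynamics: as the model's predictions shift, so does the deferral decision.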
Off-policy evaluation (OPE) methods are key tools for evaluating policies in high-stakes domains such as healthcare, where direct deployment is often infeasible, unethical, or expensive. When the deployment environment is expected to change (i.e., under dataset shift), it is important for OPE methods to evaluate policies robustly to such changes. Existing approaches consider robustness against a large class of shifts that can arbitrarily alter any observable property of the environment. This often results in highly pessimistic estimates of utility, thereby invalidating policies that might actually be useful in deployment. In this work, we address the above problem by investigating how domain knowledge can help provide more realistic estimates of a policy's utility. We leverage human input on which aspects of the environment may plausibly change, and adapt OPE methods to consider shifts only on these aspects. Specifically, we propose a novel framework, Robust OPE (ROPE), which considers shifts on a subset of covariates in the data based on user input, and estimates the worst-case utility under these shifts. We then develop computationally efficient algorithms for OPE that are robust to the aforementioned shifts for contextual bandits and Markov decision processes, and also theoretically analyze the sample complexity of these algorithms. Extensive experiments on synthetic and real-world datasets from the medical domain demonstrate that our approach not only captures realistic dataset shifts accurately, but also leads to less pessimistic policy evaluations.
Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine together offline RL techniques, to avoid distributional shift, with risk-sensitive RL algorithms, to achieve risk-sensitivity. In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
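One standard way to be averse to both uncertainty sources is a conditional value-at-risk (CVaR) objective over sampled returns; the sketch below assumes a CVaR-style measure (the abstract does not name the specific risk measure used), applied to return samples that mix dynamics-ensemble members (epistemic) and stochastic rollouts (aleatoric):

```python
import numpy as np

def cvar(return_samples, alpha=0.2):
    """Conditional value-at-risk: the mean of the worst alpha-fraction of
    sampled returns. Optimizing this instead of the mean penalizes actions
    whose outcomes are bad in the left tail, regardless of whether the
    spread comes from model disagreement or environment stochasticity."""
    s = np.sort(np.asarray(return_samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(s))))
    return float(s[:k].mean())
```

Under this objective, states outside the dataset's support naturally look bad: the ensemble disagrees there, widening the return distribution's left tail.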
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
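The return-conditioned form of Hindsight Credit Assignment (from Harutyunyan et al., which this thesis builds on) estimates an action's advantage by comparing its probability under the policy with its probability in hindsight of the observed return. A minimal sketch, assuming the hindsight distribution has already been learned:

```python
def hca_advantage(pi_a, hindsight_probs, returns):
    """Return-conditioned HCA sketch: A(x, a) ~ E_Z[(1 - pi(a|x) / h(a|x, Z)) * Z].
    pi_a is the policy's probability of taking the action in state x;
    hindsight_probs[i] is the probability of that action given that the
    i-th return returns[i] was observed from x."""
    terms = [(1.0 - pi_a / h) * z for h, z in zip(hindsight_probs, returns)]
    return sum(terms) / len(terms)
```

Note that the estimator can evaluate actions the agent never took, since `pi_a` and the hindsight probability are defined for every action, which is the counterfactual-credit property the abstract highlights.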
Machine learning has successfully framed many sequential decision problems as either supervised prediction, or optimal decision-policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail, as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible "dead-ends" of a state space. We focus on the condition of patients in the intensive care unit, where a "medical dead-end" indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate "treatment security" as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery, and confirmation. Our empirical results show that dead-ends exist in real clinical data of septic patients, and further reveal gaps between secure treatments and those that were administered.
Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in off-policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value-function or decision-process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance-weighted return, in which the learned policy can essentially avoid making aligned decisions for part of the initial-state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification for the proposed algorithm. We also show the limitations of previous attempts at this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals, and continuous control tasks. These experiments show that the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.
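The abstract's per-state-neighborhood constraint is beyond a short sketch, but the importance-sampling estimator whose overfitting it analyzes is compact. Below is a sketch of the trajectory-level estimator, with the self-normalized (weighted IS) variant included for contrast; all names are illustrative:

```python
import numpy as np

def importance_sampled_value(returns, ratios, self_normalize=True):
    """Off-policy value estimate from logged trajectories. ratios[i] is the
    product of per-step probability ratios pi_e/pi_b along trajectory i.
    Self-normalization divides by the summed ratios rather than n, which
    tempers -- though does not eliminate -- the failure mode the abstract
    describes, where optimization drives the ratios of low-return
    trajectories toward zero."""
    ratios = np.asarray(ratios, dtype=float)
    returns = np.asarray(returns, dtype=float)
    denom = ratios.sum() if self_normalize else len(ratios)
    return float((ratios * returns).sum() / denom)
```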
Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning.
We develop a reinforcement learning (RL) framework for improving an existing behavior policy via sparse, user-interpretable changes. Our goal is to make minimal changes while gaining as much benefit as possible. We define a minimal change as having a sparse, global contrastive explanation between the original and the proposed policy. We improve the current policy subject to the constraint that the global contrastive explanation is short. We demonstrate our framework with a discrete MDP and a continuous 2D-navigation domain.
Biopharmaceutical manufacturing is a rapidly growing industry with impact in virtually all branches of medicine. With complex bioprocess dynamics involving many interdependent factors, biomanufacturing processes require close monitoring and control, and data is very limited due to the high cost of experiments and the novelty of personalized biodrugs. We develop a novel model-based reinforcement learning framework that can achieve human-level control in low-data environments. The model uses a dynamic Bayesian network to capture the causal interdependencies between factors and to predict how the effects of different inputs propagate through the pathways of the bioprocess mechanisms. This enables the design of process control policies that are both interpretable and robust against model risk. We present a computationally efficient, provably convergent stochastic gradient method for optimizing such policies. Validation is conducted on a realistic application with multi-dimensional, continuous state variables.
Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.
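The conversion the abstract describes can be illustrated by folding the prolonged effect of past doses into a single state feature. This is a sketch only: the exponential-decay model and the `retention` factor are hypothetical pharmacology-style assumptions, not necessarily the paper's exact construction:

```python
def effective_action_level(dose_history, retention=0.9):
    """Collapse a history of administered doses into one 'effective level'
    via exponential decay: each step, the previous level is attenuated by
    the retention factor and the new dose is added. Appending this scalar
    to the observation restores the Markov property with respect to
    prolonged action effects, turning the PAE-POMDP into an MDP."""
    level = 0.0
    for dose in dose_history:
        level = retention * level + dose
    return level
```

Because the level can be updated incrementally in O(1) per step, no recurrent network over the full history is needed, which is the time- and memory-efficiency advantage over recurrent baselines that the abstract reports.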
We consider the problem of creating assistants that can help agents, typically humans, solve novel sequential decision problems, assuming the agent is not able to specify the reward function explicitly to the assistant. Instead of aiming to automate and replace the agent, as in current approaches, we give the assistant an advisory role and keep the agent as the main decision-maker. The difficulty is that we must account for potential biases, induced by the agent's limitations or constraints, which may cause it to seemingly irrationally reject advice. To do this, we introduce a novel formulation of assistance that models these biases, allowing the assistant to infer and adapt to them. We then introduce a new method for planning the assistant's advice which can scale to large decision problems. Finally, we show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. (34th Conference on Neural Information Processing Systems, NeurIPS 2020.)
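The reward modification at the heart of MOPO is compact enough to sketch. The ensemble-disagreement function below is one common uncertainty heuristic, used here as an illustrative stand-in for the paper's model-error estimator:

```python
import numpy as np

def mopo_reward(model_reward, uncertainty, lam=1.0):
    """Penalized reward r~(s, a) = r(s, a) - lambda * u(s, a): the policy is
    trained inside the learned model on rewards discounted by how unsure the
    dynamics model is at (s, a), which keeps it near the data support."""
    return model_reward - lam * uncertainty

def ensemble_disagreement(next_state_predictions):
    """A simple uncertainty proxy (an assumption, not the paper's exact
    estimator): the norm of the per-dimension standard deviation across an
    ensemble's next-state predictions."""
    preds = np.asarray(next_state_predictions, dtype=float)  # (n_models, dim)
    return float(np.linalg.norm(preds.std(axis=0)))
```

The penalty weight `lam` controls the gain-versus-risk trade-off of leaving the batch data's support that the abstract characterizes: larger values confine the policy to well-modeled regions, smaller values permit more generalization.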
The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis, and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems, which heavily rely on model assumptions, new developments in reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decision-making in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the applications of these RL algorithms in a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
Structural Health Monitoring (SHM) describes a process for inferring quantifiable metrics of structural condition, which can serve as input to support decisions on the operation and maintenance of infrastructure assets. Given the long lifespan of critical structures, this problem can be cast as a sequential decision making problem over prescribed horizons. Partially Observable Markov Decision Processes (POMDPs) offer a formal framework to solve the underlying optimal planning task. However, two issues can undermine the POMDP solutions. Firstly, the need for a model that can adequately describe the evolution of the structural condition under deterioration or corrective actions and, secondly, the non-trivial task of recovery of the observation process parameters from available monitoring data. Despite these potential challenges, the adopted POMDP models do not typically account for uncertainty on model parameters, leading to solutions which can be unrealistically confident. In this work, we address both key issues. We present a framework to estimate POMDP transition and observation model parameters directly from available data, via Markov Chain Monte Carlo (MCMC) sampling of a Hidden Markov Model (HMM) conditioned on actions. The MCMC inference estimates distributions of the involved model parameters. We then form and solve the POMDP problem by exploiting the inferred distributions, to derive solutions that are robust to model uncertainty. We successfully apply our approach on maintenance planning for railway track assets on the basis of a "fractal value" indicator, which is computed from actual railway monitoring data.
Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes such as healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantation has progressed over the years, a pertinent question is: how has the actual organ allocation policy been evolving? To give an answer, we desire a policy-learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and non-parametric representations of the agent's behavior. Finally, using both real and simulated data for liver transplantation, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy.
Clinical decision support tools rooted in machine learning and optimization can provide significant value to healthcare providers, including through better management of intensive care units. In particular, the patient-discharge task is important, as it poses a delicate trade-off between reducing a patient's length of stay (and the associated hospitalization costs) and the risk of readmission or even death following the discharge decision. This work introduces an end-to-end general framework for capturing this trade-off to recommend optimal discharge-timing decisions given a patient's electronic health records. A data-driven approach is used to derive a parsimonious, discrete state-space representation that captures a patient's physiological condition. Based on this model and a given cost function, an infinite-horizon discounted Markov decision process is formulated and solved numerically to compute an optimal discharge policy, whose value is assessed using off-policy evaluation strategies. Extensive numerical experiments are performed to validate the proposed framework using real-life intensive care unit patient data.
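The numerical solution step, an infinite-horizon discounted MDP solved for an optimal discharge policy, can be sketched with standard value iteration. The state and action spaces here are toy placeholders for the derived physiological states and the stay/discharge actions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Solve a finite discounted MDP. P has shape (A, S, S), with P[a, s, s']
    the transition probability; R has shape (A, S), with R[a, s] the expected
    reward. Returns the optimal value function and a greedy policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)            # Bellman backup, shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```

Since discounting makes the Bellman operator a contraction, the loop converges geometrically from any initialization.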
A trustworthy reinforcement learning algorithm should be competent at solving challenging real-world problems, including robustly handling uncertainties, satisfying safety constraints to avoid catastrophic failures, and generalizing to unseen scenarios during deployment. This study aims to overview these main perspectives of trustworthy reinforcement learning, considering its intrinsic vulnerabilities in robustness, safety, and generalizability. In particular, we give rigorous formulations, categorize the corresponding methods, and discuss benchmarks for each perspective. Moreover, we provide an outlook section to spur promising future directions, and briefly discuss extrinsic vulnerabilities in settings with human feedback. We hope this survey can bring together these separate threads of research in a unified framework and promote the trustworthiness of reinforcement learning.
By relying on too many experiments to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too costly. We propose a batch RL algorithm in which an effective policy is learned using only a fixed offline dataset, instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that are insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy constraint to reduce this divergence, and a value constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
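A hedged sketch of the two penalties the abstract proposes: the choice of divergence measure, the optimism estimate, and the trade-off weights `alpha`/`beta` are illustrative placeholders, and the paper's exact constraints may differ:

```python
def penalized_policy_objective(q_value, divergence_from_behavior,
                               optimism_estimate, alpha=1.0, beta=1.0):
    """Batch-RL objective sketch: maximize the estimated action value while
    (i) penalizing divergence from the data-generating (behavior) policy,
    and (ii) penalizing how far the value estimate exceeds a conservative
    one -- the two penalties described in the abstract."""
    return q_value - alpha * divergence_from_behavior - beta * optimism_estimate
```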
In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behavior. When RL agents are deployed in real-world environments, safety is a primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints while exploring its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, and safe return to known states. $E^4$ robustly optimizes these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that a near-optimal constrained policy is found in polynomial time while satisfying the safety constraints throughout the learning process. We discuss robust constrained offline optimization algorithms, as well as how to incorporate uncertainty in the transition dynamics of unknown states based on empirical inference and prior knowledge.
Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but on all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.