Offline reinforcement learning (learning a policy from a batch of data) is known to be hard: without making strong assumptions, it is easy to construct counterexamples on which existing algorithms fail. In this work, we instead consider a property of certain real-world problems where offline reinforcement learning should be effective: actions have only limited impact on part of the state. We formalize and introduce this Action Impact Regularity (AIR) property. We further propose an algorithm that assumes and exploits the AIR property, and bound the suboptimality of the output policy when the MDP satisfies AIR. Finally, we demonstrate that our algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in two simulated environments where the regularity holds.
Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks, and yet they perform well in many others. In fact, in practice they are often selected as the top choice due to their simplicity. But for which tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called the myopic exploration gap, denoted by $\alpha$, that captures a structural property of the MDP, the exploration policy, and the given value function class. We show that the sample complexity of myopic exploration scales quadratically with the inverse of this quantity, $1/\alpha^2$. We further demonstrate through concrete examples that the myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.
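For concreteness, here is a minimal sketch of two of the myopic exploration policies the abstract names, in a tabular setting; the function names and interfaces are illustrative assumptions, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon take a uniformly random action
    # (myopic, history-independent exploration); otherwise act greedily on Q.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature):
    # Sample an action with probability proportional to exp(Q / temperature).
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())  # numerically stabilized softmax
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```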
Among the factors hindering the application of reinforcement learning (RL) to real-world problems, two are critical: limited data and the mismatch between the testing environment and the training environment. In this paper, we attempt to address these issues simultaneously through the problem of distributionally robust offline RL. In particular, we learn an RL agent from historical data obtained from a source environment and optimize it to perform well in a perturbed one. Moreover, we consider linear function approximation so that the algorithm can be applied to large-scale problems. We prove that our algorithm can achieve a suboptimality of $O(1/\sqrt{K})$ depending on the linear function dimension $d$, which appears to be the first result with a sample complexity guarantee in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm over non-robust ones.
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
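As a rough illustration of that regularizer, the sketch below adds a conservative penalty (the logsumexp variant for discrete actions) to a standard DQN-style loss in PyTorch. The network and batch interfaces (`q_net`, `target_q_net`, the batch layout) are assumptions made for the sake of the example, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    # batch: states (B, ...), actions (B,) long, rewards (B,),
    # next states (B, ...), done flags (B,) float.
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                   # (B, |A|)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) on dataset actions
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    bellman = F.mse_loss(q_sa, target)                 # standard Bellman error term
    # Conservative term: push Q down on all actions (via logsumexp) while
    # pushing it back up on the actions that actually appear in the dataset.
    conservative = (torch.logsumexp(q_all, dim=1) - q_sa).mean()
    return bellman + alpha * conservative
```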
Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity ("why do we need them?") and the naturalness ("when do they hold?") of such assumptions have largely eluded the literature. In this paper, we revisit these assumptions and provide theoretical results towards answering the above questions, and make steps towards a deeper understanding of value-function approximation.
How to select among the policies and value functions produced by different training algorithms in offline reinforcement learning (RL), which is crucial for hyperparameter tuning, is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for Q-function-based OPE with theoretical guarantees as a side product.
The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that rely heavily on model assumptions, new developments in reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many commonly used RL approaches. Various algorithms are then introduced, with a focus on value-based and policy-based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms to a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
The practicality of reinforcement learning algorithms has been limited by poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3/\epsilon^2\right)$ over worst-case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs that exhibit low-rank structure, where the latent features are unknown. We argue that a naive combination of value iteration and low-rank matrix estimation results in estimation error that grows exponentially in the horizon. We then provide a new algorithm, along with statistical guarantees, that efficiently exploits low-rank structure given access to a generative model, achieving a sample complexity of $\tilde{O}\left(d^5(|S|+|A|)\,\mathrm{poly}(H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
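To make the "naive combination" concrete, a tabular sketch is below: each value-iteration backup is followed by a rank-$d$ SVD truncation of the $|S| \times |A|$ Q matrix. This illustrates the construction the abstract warns about (whose error compounds over the horizon), not the paper's proposed algorithm; all names are illustrative.

```python
import numpy as np

def truncate_rank(Q, d):
    # Best rank-d approximation of the |S| x |A| matrix Q (Eckart-Young).
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d, :]

def naive_low_rank_vi(P, R, d, H, gamma=1.0):
    # P: (S, A, S) transition tensor, R: (S, A) rewards, horizon H.
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(H):
        V = Q.max(axis=1)                           # greedy value, shape (S,)
        Q = truncate_rank(R + gamma * (P @ V), d)   # backup, then rank-d projection
    return Q
```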
Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
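One way to picture the optimism-plus-pessimism idea is the acquisition rule sketched below: query the state where the gap between an optimistic and a pessimistic value estimate is widest, since that is where the best policy is least resolved. This is an illustrative reading of the active-exploration principle, not the paper's exact selection rule, and all names are hypothetical.

```python
import numpy as np

def select_query_state(candidate_states, v_ucb, v_lcb):
    # v_ucb / v_lcb: optimistic and pessimistic value estimates per state.
    # Active exploration: probe where the value of the best policy is least resolved.
    gaps = np.asarray(v_ucb) - np.asarray(v_lcb)
    return candidate_states[int(np.argmax(gaps))]
```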
This paper studies systematic exploration for reinforcement learning with rich observations and function approximation. We introduce a new model called contextual decision processes, that unifies and generalizes most prior settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally small for many well-studied reinforcement learning settings. Our second contribution is a new reinforcement learning algorithm that engages in systematic exploration to learn contextual decision processes with low Bellman rank. Our algorithm provably learns near-optimal behavior with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for reinforcement learning with function approximation.
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage, or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) an "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a simulator of the environment); 2) an "objective-agnostic" sample-collection exploration strategy responsible for generating the prescribed samples as quickly as possible. Building on recent methods for exploration in the stochastic shortest path problem, we first provide an algorithm that, given as input the number of samples $b(s,a)$ needed in each state-action pair, requires $\tilde{O}(BD + D^{3/2}S^2A)$ time steps to collect the $B = \sum_{s,a} b(s,a)$ desired samples, in any unknown communicating MDP with $S$ states, $A$ actions, and diameter $D$. Then we show how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements to tackle a variety of settings (e.g., model estimation, sparse reward discovery, goal-free and cost-free exploration in communicating MDPs), for which we obtain improved or novel sample complexity guarantees.
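Schematically, the decoupling described above has the interface sketched below: an objective-specific module hands over per-pair sample requirements $b(s,a)$, and an objective-agnostic collector fulfils them. The `reach_and_sample` helper stands in for the paper's shortest-path-based navigation and is purely hypothetical.

```python
def collect_prescribed_samples(env, b, reach_and_sample):
    # b: {(s, a): number of samples still required}, prescribed by the
    # objective-specific module. The collector is objective-agnostic: it only
    # tries to fulfil the prescription as quickly as possible.
    data = []
    remaining = dict(b)
    while any(count > 0 for count in remaining.values()):
        s, a = max(remaining, key=remaining.get)    # most under-sampled pair
        transition = reach_and_sample(env, s, a)    # hypothetical navigation helper
        data.append(transition)
        remaining[(s, a)] -= 1
    return data
```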
The goal of robust reinforcement learning (RL) is to learn a policy that is robust against model parameter uncertainties. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible model in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
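For intuition about the inner minimization inside the robust Bellman operator, the tabular sketch below uses an R-contamination uncertainty set, $\{(1-\rho)P(\cdot \mid s,a) + \rho q : q \text{ any distribution}\}$, for which the worst case has a closed form (the adversary places its $\rho$ mass on the worst next state). This is a standard illustration of a robust backup, not RFQI's actual construction.

```python
import numpy as np

def robust_backup(P, R, V, rho, gamma=0.99):
    # Robust Bellman backup under an R-contamination set.
    # P: (S, A, S) nominal transitions, R: (S, A) rewards, V: (S,) values.
    # Inner minimum in closed form: mix the nominal expectation with min V.
    worst = (1.0 - rho) * (P @ V) + rho * V.min()
    return R + gamma * worst  # (S, A) robust Q backup
```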
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and advance the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, sometimes also with careful choices of the function classes and initial state distributions. We hope our insights further advocate the study of the LP framework, as well as the induced primal-dual minimax optimization, in offline RL.
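For reference, the LP reformulation in question is the classical occupancy-measure program, stated here in the discounted tabular case as a reminder (the paper works with function approximation on top of it):

$$ \max_{\mu \ge 0} \;\; \sum_{s,a} \mu(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} \mu(s,a) \;=\; (1-\gamma)\,\rho_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, \mu(s',a') \;\; \forall s, $$

where $\mu$ is the normalized discounted occupancy measure and $\rho_0$ the initial state distribution; any feasible optimal $\mu$ induces an optimal policy via $\pi(a \mid s) \propto \mu(s,a)$.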
In this work, we focus our attention on the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a theoretical and empirical analysis of why different properties of the data distribution can contribute to regulating sources of algorithmic instability. First, we revisit theoretical bounds on the performance of approximate dynamic programming algorithms. Second, we provide a novel four-state MDP that highlights the impact of the data distribution on the performance of Q-learning algorithms with function approximation, in both online and offline settings. Finally, we experimentally assess the impact of data distribution properties on the performance of an offline deep Q-network algorithm. Our results show that: (i) the data distribution needs to possess certain properties in order to robustly learn in an offline setting, namely low distance to the distributions induced by optimal policies of the MDP and high coverage over the state-action space; and (ii) high-entropy data distributions can contribute to mitigating sources of algorithmic instability.
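The two properties in finding (i), together with the entropy in finding (ii), can be made concrete with simple diagnostics; the sketch below computes dataset coverage, entropy, and total-variation distance to an optimal policy's occupancy, with all inputs illustrative rather than taken from the paper.

```python
import numpy as np

def distribution_diagnostics(visit_counts, d_opt):
    # visit_counts: (S, A) state-action counts from the dataset;
    # d_opt: (S, A) occupancy measure of an optimal policy (assumed given).
    d_data = visit_counts / visit_counts.sum()
    coverage = (d_data > 0).mean()                  # fraction of (s, a) pairs seen
    p = d_data[d_data > 0]
    entropy = -(p * np.log(p)).sum()                # entropy of the data distribution
    tv_dist = 0.5 * np.abs(d_data - d_opt).sum()    # distance to optimal occupancy
    return coverage, entropy, tv_dist
```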
In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. Safety is a primary concern when RL agents are deployed in real-world environments. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^4$), which extends the Explicit Explore or Exploit ($E^3$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, and safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models that are consistent with the empirical observations of the deployment environment. Theoretical results show that a near-optimal constraint-satisfying policy can be found in polynomial time while safety constraints are satisfied throughout the learning process. We also discuss robust constrained offline optimisation algorithms and how to incorporate uncertainty in the transition dynamics of unknown states based on empirical inference and prior knowledge.
We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme based on a simple constrained optimization problem, which adaptively targets goal states that are neither too difficult nor too easy to reach according to the agent's current knowledge. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for every goal state reachable within $L$ steps from a reference state $s_0$ in a reward-free Markov decision process. In the standard tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other intact. As an important open problem: can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancies) are modeled using a density-ratio function against the offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
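As a reminder of the object being analyzed, the unregularized Lagrangian of the MDP's linear program, with the dual occupancy represented through a density ratio $w$ against the data distribution $D$, takes the familiar saddle-point form (stated here from standard LP duality, with $\rho_0$ the initial distribution; the paper's regularized variant differs in its details):

$$ \min_{V} \; \max_{w \ge 0} \;\; (1-\gamma)\, \mathbb{E}_{s \sim \rho_0}\!\left[V(s)\right] \;+\; \mathbb{E}_{(s,a,r,s') \sim D}\!\left[\, w(s,a)\,\big( r + \gamma V(s') - V(s) \big) \right]. $$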
Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in partially observable Markov decision processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g., deterministic transitions) or assume access to an oracle for solving a hard planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
This paper presents a systematic study of gap-dependent sample complexity in offline reinforcement learning. Prior work showed that when the density ratio between an optimal policy and the behavior policy is upper bounded (the optimal-policy coverage assumption), the agent can achieve an $O\left(\frac{1}{\epsilon^2}\right)$ rate, which is also minimax optimal. We show that, under the optimal-policy coverage assumption, the rate can be improved to $O\left(\frac{1}{\epsilon}\right)$ when there is a positive suboptimality gap in the optimal $Q$-function. Furthermore, we show that when the visitation probabilities of the behavior policy are uniformly lower bounded on states where an optimal policy's visitation probabilities are positive (the uniform optimal-policy coverage assumption), the sample complexity of identifying an optimal policy is independent of $\frac{1}{\epsilon}$. Lastly, we present nearly matching lower bounds to complement our gap-dependent upper bounds.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task.
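The reward modification itself is one line: $\tilde{r}(s,a) = \hat{r}(s,a) - \lambda\, u(s,a)$. The sketch below applies it with ensemble disagreement as the uncertainty proxy, a common practical choice and an assumption here, not necessarily the exact estimator used in the paper.

```python
import numpy as np

def penalized_reward(reward_pred, next_state_preds, lam):
    # MOPO-style penalized reward: r~(s, a) = r^(s, a) - lambda * u(s, a),
    # where u is an (assumed) uncertainty estimate from an ensemble of E models.
    # next_state_preds: (E, B, state_dim) next-state means from the ensemble.
    u = np.linalg.norm(next_state_preds.std(axis=0), axis=-1)  # (B,) disagreement
    return reward_pred - lam * u
```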