Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to learning purely from static datasets, without interacting with the underlying environment during learning. A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy. To avoid the detrimental impact of this distribution mismatch, we regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process. Further, we train a dynamics model both to implement this regularization and to better estimate the stationary distribution of the current policy, reducing the error induced by the distribution mismatch. On a wide range of continuous-control offline RL datasets, our method shows competitive performance, which validates our algorithm. The code is publicly available.
We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: a policy actor competes against an adversarially trained value critic, which finds data-consistent scenarios where the actor is inferior to the data-collecting behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by the data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. On the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
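A minimal sketch of the two-player structure described above, assuming a generic actor/critic interface (`q_net`, `policy.sample`, `policy.rsample` are hypothetical names): the critic is trained to stay Bellman-consistent on the dataset while scoring the actor's actions below the dataset actions (relative pessimism), and the actor then maximizes the resulting critic. This is an illustration of the idea, not the authors' reference implementation.

```python
# Hedged sketch of an adversarially trained critic with relative pessimism.
import torch

def adversarial_critic_loss(q_net, target_q_net, policy, batch, beta=1.0, gamma=0.99):
    s, a, r, s2, done = batch  # dataset transitions
    with torch.no_grad():
        a2 = policy.sample(s2)                               # assumed policy API
        target = r + gamma * (1.0 - done) * target_q_net(s2, a2)
    bellman = ((q_net(s, a) - target) ** 2).mean()           # data consistency

    # Relative pessimism: the critic tries to score the actor's actions
    # below the dataset actions on states drawn from the data.
    pessimism = q_net(s, policy.sample(s)).mean() - q_net(s, a).mean()
    return beta * bellman + pessimism

def actor_loss(q_net, policy, states):
    # The actor maximizes the adversarially trained critic.
    return -q_net(states, policy.rsample(states)).mean()
```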
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy requires the value function to stay conservative so that out-of-distribution (OOD) actions are not severely overestimated. However, existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic, suppressing the generalization of the value function and hindering performance improvement. This paper explores mild but sufficient conservatism for offline learning that does not harm generalization. We propose Mildly Conservative Q-learning (MCQ), in which OOD actions are actively trained by assigning them appropriate pseudo Q-values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy, and that no erroneous overestimation occurs for OOD actions. Experimental results on the D4RL benchmark demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms the baselines.
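A hedged sketch of the mild-conservatism idea described above: instead of pushing OOD Q-values down indiscriminately, they are regressed toward a pseudo target built from in-support actions. The behavior-model and policy interfaces (`behavior_model.sample`, `policy.sample`) are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch: train OOD actions toward an in-support pseudo Q target.
import torch

def ood_pseudo_target_loss(q_net, behavior_model, policy, states, num_support=10):
    # Actions proposed by the current policy are treated as potentially OOD.
    ood_actions = policy.sample(states)

    # Pseudo target: best Q-value among actions sampled from a learned behavior
    # model, i.e. a value achievable while staying in the dataset's support.
    with torch.no_grad():
        support_actions = [behavior_model.sample(states) for _ in range(num_support)]
        support_q = torch.stack([q_net(states, a) for a in support_actions], dim=0)
        pseudo_target = support_q.max(dim=0).values

    # Regress the OOD Q-values toward the in-support pseudo target.
    return ((q_net(states, ood_actions) - pseudo_target) ** 2).mean()
```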
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
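The abstract above describes augmenting the standard Bellman error with a simple Q-value regularizer. Below is a minimal PyTorch-style sketch of that idea under assumed interfaces (`q_net`, `policy.sample` are hypothetical names); the policy-sampled term stands in for the log-sum-exp over actions used in one CQL variant, so this is an illustration rather than the authors' exact objective.

```python
# Hedged sketch of a CQL-style conservative penalty added to the Bellman error.
import torch

def cql_critic_loss(q_net, target_q_net, policy, batch, alpha=1.0, gamma=0.99):
    s, a, r, s2, done = batch  # state, action, reward, next state, done flag

    # Standard Bellman error term.
    with torch.no_grad():
        a2 = policy.sample(s2)                                # assumed policy API
        target = r + gamma * (1.0 - done) * target_q_net(s2, a2)
    bellman_error = ((q_net(s, a) - target) ** 2).mean()

    # Conservative term: push down Q on actions drawn from the current policy
    # (a stand-in for the log-sum-exp over actions) and push up Q on actions
    # that actually appear in the dataset.
    a_pi = policy.sample(s)
    conservative_gap = q_net(s, a_pi).mean() - q_net(s, a).mean()

    return bellman_error + alpha * conservative_gap
```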
Reinforcement learning (RL) solves sequential decision-making problems through a trial-and-error process of interacting with the environment. While RL has achieved great success in playing complex video games, making mistakes is always undesirable in the real world. To improve sample efficiency and thereby reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction, as it builds environment models in which trial-and-error can take place without incurring real costs. In this survey, we review MBRL with a focus on recent advances in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore crucial to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides better algorithm design for model learning, model usage, and policy training. In addition, we discuss recent progress on other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. We further discuss the applicability and advantages of MBRL in real-world tasks. Finally, we conclude this survey by discussing the prospects for the future development of MBRL. We believe that MBRL has great potential and advantages in real-world applications that have so far been overlooked, and we hope this survey can attract more research attention to MBRL.
Offline reinforcement learning (RL) enables effective learning from previously collected data without exploration, which shows great promise in real-world applications where exploration is expensive or even infeasible. The discount factor $\gamma$ plays a vital role in improving sample efficiency and estimation accuracy in online RL, but the role of the discount factor in offline RL has not been well explored. This paper examines two distinct effects of $\gamma$ in offline RL through theoretical analysis, namely a regularization effect and a pessimism effect. On the one hand, $\gamma$ acts as a regularizer that trades off optimality against sample efficiency on top of existing offline techniques. On the other hand, a lower guidance $\gamma$ can also be seen as a form of pessimism, in which we optimize the policy's performance under the worst possible models. We empirically verify the above theoretical observations on tabular MDPs and standard D4RL tasks. The results show that the discount factor plays a crucial role in the performance of offline RL algorithms, both under small data regimes on top of existing offline methods and in large data regimes without other forms of conservatism.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task.
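A hedged sketch of the uncertainty-penalized reward idea described above, assuming an ensemble of learned dynamics models whose disagreement serves as the uncertainty estimate; the specific uncertainty measure, the `model.predict` interface, and the averaging choices are illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch: subtract a dynamics-uncertainty estimate from the model reward.
import numpy as np

def penalized_model_step(ensemble, s, a, lam=1.0):
    # Each ensemble member predicts (next_state, reward); disagreement across
    # members serves as a proxy for epistemic uncertainty u(s, a).
    preds = [model.predict(s, a) for model in ensemble]      # assumed model API
    next_states = np.stack([p[0] for p in preds])
    rewards = np.stack([p[1] for p in preds])

    uncertainty = np.linalg.norm(next_states.std(axis=0))    # simple disagreement proxy
    r_tilde = rewards.mean() - lam * uncertainty             # penalized reward for training
    return next_states.mean(axis=0), r_tilde
```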
The goal of robust reinforcement learning (RL) is to learn a policy that is robust to uncertainty in model parameters. Parameter uncertainty commonly arises in many real-world RL applications due to simulator modeling errors, changes in real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn a policy that maximizes the value against the worst possible model within an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, in optimization over the models, and in unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
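The abstract above notes that Fenchel duality avoids the double-sampling issue when computing the gradient of the variance regularizer. Below is a hedged illustration of the kind of dual rewriting involved (not necessarily the paper's exact derivation): the squared expectation inside the variance is replaced by an optimization over an auxiliary scalar, so the remaining expectation is linear in the sampling distribution and its gradient can be estimated from single samples.

```latex
% Hedged illustration of a Fenchel-dual rewriting of the variance of a return-like
% quantity R, avoiding the squared expectation that requires two independent samples.
\[
  \operatorname{Var}(R)
  \;=\; \mathbb{E}[R^2] - \big(\mathbb{E}[R]\big)^2
  \;=\; \min_{\nu \in \mathbb{R}} \; \mathbb{E}\big[(R - \nu)^2\big],
\]
% since $(\mathbb{E}[R])^2 = \max_{\nu}\big(2\nu\,\mathbb{E}[R] - \nu^2\big)$ by the
% Fenchel dual of the square function; the inner expectation is now linear in the
% sampling distribution, so no double sampling is needed for its gradient.
```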
Relying on too many trials to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too costly to afford. We propose a batch RL algorithm in which an effective policy is learned using only a fixed offline dataset, instead of online interaction with the environment. The limited data in batch RL produces inherent uncertainty in the value estimates of states and actions that are insufficiently represented in the training data. This leads to particularly severe extrapolation errors when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue with two straightforward penalties: a policy constraint that reduces this divergence and a value constraint that discourages overly optimistic estimates. Across a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably with state-of-the-art methods, regardless of how the offline data were collected.
Solving real-world complex tasks with reinforcement learning (RL) policies is difficult without high-fidelity simulation environments. In most cases, we only have imperfect simulators with simplified dynamics, which inevitably leads to sim-to-real gaps in RL policy learning. The recently emerging field of offline RL offers another possibility: learning policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms require impractically large offline datasets with sufficient coverage of the state-action space for training. This raises a new question: is it possible to combine learning from the limited real data in offline RL with the unrestricted exploration of an imperfect simulator in online RL, to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q-function on simulated state-action pairs with large dynamics gaps, while also allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which we hope will shed light on future RL algorithm design for solving practical real-world tasks.
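A hedged sketch of a dynamics-aware value penalty in the spirit of the scheme described above: Q-values on simulated state-action pairs are pushed down in proportion to an estimated dynamics gap, while real offline transitions are handled by ordinary Bellman backups elsewhere in the critic loss. The `dynamics_gap` estimator and batch interfaces are illustrative assumptions.

```python
# Hedged sketch: weight a value penalty on simulated data by the dynamics gap.
import torch

def dynamics_aware_penalty(q_net, sim_batch, dynamics_gap):
    s_sim, a_sim = sim_batch
    # dynamics_gap(s, a) is assumed to return a nonnegative weight that is large
    # where the simulator disagrees with real-world dynamics.
    weights = dynamics_gap(s_sim, a_sim).detach()
    weights = weights / (weights.sum() + 1e-8)               # normalize to a distribution
    # Added to the critic loss: minimizing it pushes down Q on unreliable simulated pairs.
    return (weights * q_net(s_sim, a_sim)).sum()
```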
Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine together offline RL techniques, to avoid distributional shift, with risk-sensitive RL algorithms, to achieve risk-sensitivity. In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
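The abstract above bounds a Gaussian-mixture behavior constraint using the LogSumExp lower bound and Jensen's inequality. Below is a hedged illustration of those two bounds as they apply to a mixture log-density; the exact way the paper combines them with the first-order Taylor approximation may differ.

```latex
% Hedged illustration: two lower bounds on the log-density of a Gaussian-mixture
% behavior policy, $\log \pi_\beta(a \mid s) = \log \sum_k w_k\,\mathcal{N}(a;\mu_k,\Sigma_k)$,
% written with $x_k = \log \mathcal{N}(a;\mu_k,\Sigma_k)$ and weights $w_k$ summing to 1.
\[
  \log \sum_k w_k e^{x_k} \;\ge\; \max_k \big(\log w_k + x_k\big)
  \qquad \text{(LogSumExp lower bound)},
\]
\[
  \log \sum_k w_k e^{x_k} \;\ge\; \sum_k w_k x_k
  \qquad \text{(Jensen's inequality)}.
\]
% Either bound replaces the intractable mixture log-density with a tractable
% surrogate inside the behavior constraint.
```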
Most prior approaches to offline reinforcement learning (RL) take an iterative actor-critic approach involving off-policy evaluation. In this paper, we show that simply performing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in off-policy evaluation, amplified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and the behavior policy.
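A hedged sketch of the one-step recipe described above: first estimate the behavior policy's Q-function with on-policy (SARSA-style) evaluation on the dataset, then take a single regularized policy-improvement step against that estimate. Dataset, network, and optimizer interfaces, as well as the cross-entropy-style stand-in for the behavior regularizer, are assumptions for illustration.

```python
# Hedged sketch: behavior-policy evaluation followed by one regularized improvement step.
import torch

def fit_behavior_q(q_net, dataset, optimizer, gamma=0.99, steps=100_000):
    for _ in range(steps):
        s, a, r, s2, a2, done = dataset.sample()             # next action a2 comes from the data
        with torch.no_grad():
            target = r + gamma * (1.0 - done) * q_net(s2, a2)
        loss = ((q_net(s, a) - target) ** 2).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()

def one_step_improvement(policy, q_net, behavior_model, dataset, optimizer, tau=1.0, steps=10_000):
    for _ in range(steps):
        s, *_ = dataset.sample()
        a_pi = policy.rsample(s)                              # reparameterized action sample
        # Maximize the behavior Q-estimate while staying close to the behavior policy
        # (a cross-entropy-style stand-in for the constraint/regularization term).
        loss = -q_net(s, a_pi).mean() - tau * behavior_model.log_prob(s, a_pi).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```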
Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.
Existing offline reinforcement learning (RL) methods face several major challenges, particularly the distribution shift between the learned policy and the behavior policy. Offline meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline meta-RL can be outperformed by single-task offline RL methods on tasks with good dataset quality, which indicates that a delicate balance has to be struck between "exploring" out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by this empirical analysis, we explore model-based offline meta-RL with regularized policy optimization (MerPO), which learns a meta-model for efficient task-structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-regularized model-based actor-critic method (RAC) for task-specific policy learning, as a key building block of MerPO; the intrinsic trade-off therein is achieved by striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learned policy is guaranteed to improve over both the behavior policy and the meta-policy, thereby ensuring performance improvement on new tasks via offline meta-RL. Experiments confirm the superior performance of MerPO over existing offline meta-RL methods.
Reinforcement learning (RL) has been shown to be effective in domains where the agent can learn a policy by actively interacting with its operating environment. However, if we change the RL scheme to an offline setting, where the agent can only update its policy from a static dataset, one of the major issues in offline reinforcement learning emerges: distribution shift. We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm that actively guides the agent back to familiar regions by manipulating the value function. We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent from the training dataset, so that the learned pessimistic value function lower-bounds the true value anywhere within the state space. We evaluate the PessORL algorithm on various benchmark tasks, where we show that our method gains better performance by explicitly handling OOD states, compared to methods that merely consider OOD actions.
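A hedged sketch of state-level pessimism in the spirit of the description above: values at states that a density model deems out-of-distribution are pushed down, so the learned value function stays low outside the data region. The density model, threshold, and sampler of candidate states are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch: penalize high values at states unlikely under the dataset.
import torch

def ood_state_value_penalty(v_net, density_model, candidate_states, threshold=-5.0):
    log_prob = density_model.log_prob(candidate_states)       # assumed density-model API
    ood_mask = (log_prob < threshold).float()                 # 1 for states unlikely under the data
    # Minimizing this term lowers the value assigned to out-of-distribution states.
    return (ood_mask * v_net(candidate_states).squeeze(-1)).mean()
```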
We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile regression-based offline imitation learning (IL) algorithm derived via state-occupancy matching. We show that the SMODICE objective admits a simple optimization procedure through an application of Fenchel duality, and an analytic solution in tabular MDPs. Without requiring access to expert actions, SMODICE can be effectively applied to three offline IL settings: (i) imitation from observations (IfO), (ii) IfO with dynamics- or morphologically mismatched experts, and (iii) example-based reinforcement learning, which we show can be formulated as a state-occupancy matching problem. We extensively evaluate SMODICE on gridworld environments as well as high-dimensional offline benchmarks. Our results demonstrate that SMODICE is effective for all three problem settings and significantly outperforms the prior state of the art.
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
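A hedged sketch of the batch-constrained action selection described above: candidate actions are proposed by a generative model trained on the dataset (so they stay close to observed behavior), optionally perturbed within a small learned range, and the Q-network only chooses among those candidates. The `action_vae`, `perturbation_net`, and `q_net` interfaces are illustrative assumptions.

```python
# Hedged sketch: greedy action selection restricted to data-like candidates.
import torch

def batch_constrained_action(state, action_vae, perturbation_net, q_net, num_candidates=10):
    states = state.unsqueeze(0).repeat(num_candidates, 1)     # replicate the single state

    # Candidates sampled from a generative model of the dataset's actions.
    candidates = action_vae.decode(states)                    # assumed VAE-style API

    # A small learned perturbation keeps candidates near the data while adding flexibility.
    candidates = candidates + perturbation_net(states, candidates)

    # Act greedily, but only over the constrained candidate set.
    q_values = q_net(states, candidates)
    return candidates[q_values.argmax()]
```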
Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this task due to function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness, which can sometimes lead to substantially suboptimal solutions. In this paper, we propose Diffusion-QL, which leverages a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization. In our approach, we learn an action-value function and add a term that maximizes action values to the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions close to the behavior policy. We show that both the expressiveness of diffusion-model-based policies and the coupling of behavior cloning and policy improvement under the diffusion model contribute to the outstanding performance of Diffusion-QL. We illustrate our method and prior work on a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks for offline RL.
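A hedged sketch of the combined objective described above: a diffusion-model behavior-cloning loss plus a term that pushes actions generated by the diffusion policy toward high Q-values. The diffusion-policy interface (a noise-prediction loss and a differentiable sampler) is an assumption for illustration, not the authors' reference code.

```python
# Hedged sketch: diffusion behavior cloning coupled with Q-value maximization.
import torch

def diffusion_ql_policy_loss(diffusion_policy, q_net, batch, eta=1.0):
    s, a = batch  # states and dataset actions

    # Behavior cloning: standard denoising (noise-prediction) loss on dataset actions.
    bc_loss = diffusion_policy.denoising_loss(s, a)           # assumed diffusion API

    # Policy improvement: actions sampled from the diffusion policy should score
    # highly under the learned Q-function (gradients flow through the sampler).
    a_gen = diffusion_policy.sample(s)
    q_loss = -q_net(s, a_gen).mean()

    return bc_loss + eta * q_loss
```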