在没有高保真模拟环境的情况下,学习有效的加强学习(RL)政策可以解决现实世界中的复杂任务。在大多数情况下,我们只有具有简化动力学的不完善的模拟器,这不可避免地导致RL策略学习中的SIM到巨大差距。最近出现的离线RL领域为直接从预先收集的历史数据中学习政策提供了另一种可能性。但是,为了达到合理的性能,现有的离线RL算法需要不切实际的离线数据,并具有足够的州行动空间覆盖范围进行培训。这提出了一个新问题:是否有可能通过在线RL中的不完美模拟器中的离线RL中的有限数据中的学习结合到无限制的探索,以解决两种方法的缺点?在这项研究中,我们提出了动态感知的混合离线和对线增强学习(H2O)框架,以为这个问题提供肯定的答案。 H2O引入了动态感知的政策评估方案,该方案可以自适应地惩罚Q函数在模拟的状态行动对上具有较大的动态差距,同时也允许从固定的现实世界数据集中学习。通过广泛的模拟和现实世界任务以及理论分析,我们证明了H2O与其他跨域在线和离线RL算法相对于其他跨域的表现。 H2O提供了全新的脱机脱机RL范式,该范式可能会阐明未来的RL算法设计,以解决实用的现实世界任务。
translated by 谷歌翻译
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.Preprint. Under review.
translated by 谷歌翻译
强化学习(RL)已在域中展示有效,在域名可以通过与其操作环境进行积极互动来学习政策。但是,如果我们将RL方案更改为脱机设置,代理商只能通过静态数据集更新其策略,其中脱机强化学习中的一个主要问题出现,即分配转移。我们提出了一种悲观的离线强化学习(PESSORL)算法,以主动引导代理通过操纵价值函数来恢复熟悉的区域。我们专注于由分销外(OOD)状态引起的问题,并且故意惩罚训练数据集中不存在的状态的高值,以便学习的悲观值函数下限界限状态空间内的任何位置。我们在各种基准任务中评估Pessorl算法,在那里我们表明我们的方法通过明确处理OOD状态,与这些方法仅考虑ood行动时,我们的方法通过明确处理OOD状态。
translated by 谷歌翻译
离线增强学习(RL)定义了从静态记录数据集学习的任务,而无需与环境不断交互。学识渊博的政策与行为政策之间的分配变化使得价值函数必须保持保守,以使分布(OOD)的动作不会被严重高估。但是,现有的方法,对看不见的行为进行惩罚或与行为政策进行正规化,太悲观了,这抑制了价值功能的概括并阻碍了性能的提高。本文探讨了温和但足够的保守主义,可以在线学习,同时不损害概括。我们提出了轻度保守的Q学习(MCQ),其中通过分配了适当的伪Q值来积极训练OOD。从理论上讲,我们表明MCQ诱导了至少与行为策略的行为,并且对OOD行动不会发生错误的高估。 D4RL基准测试的实验结果表明,与先前的工作相比,MCQ取得了出色的性能。此外,MCQ在从离线转移到在线时显示出卓越的概括能力,并明显胜过基准。
translated by 谷歌翻译
依赖于太多的实验来学习良好的行动,目前的强化学习(RL)算法在现实世界的环境中具有有限的适用性,这可能太昂贵,无法探索探索。我们提出了一种批量RL算法,其中仅使用固定的脱机数据集来学习有效策略,而不是与环境的在线交互。批量RL中的有限数据产生了在培训数据中不充分表示的状态/行动的价值估计中的固有不确定性。当我们的候选政策从生成数据的候选政策发散时,这导致特别严重的外推。我们建议通过两个直接的惩罚来减轻这个问题:减少这种分歧的政策限制和减少过于乐观估计的价值约束。在全面的32个连续动作批量RL基准测试中,我们的方法对最先进的方法进行了比较,无论如何收集离线数据如何。
translated by 谷歌翻译
离线模仿学习(IL)是从没有奖励标签的专家演示中解决决策问题的强大方法。由于协变量转移,现有的离线IL方法在有限的专家数据下遭受严重的性能变性。但是,包括学习的动力学模型可以潜在地改善专家数据的状态行动空间覆盖范围,但是,它也面临着诸如模型近似/概括/概括性错误和推出数据的次级优势之类的挑战性问题。在本文中,我们提出了基于歧视者指导的基于模型的离线模仿学习(DMIL)框架,该框架引入了一个歧视者,以同时区分模型推出数据的动力学正确性和次优性与真实专家示范。 DMIL采用了一种新颖的合作对抗学习策略,该策略使用歧视者指导和融合了政策和动态模型的学习过程,从而改善了模型性能和鲁棒性。当演示包含大量次优数据时,我们的框架也可以扩展到案例。实验结果表明,与小型数据集下的最新离线IL方法相比,DMIL及其扩展具有出色的性能和鲁棒性。
translated by 谷歌翻译
强化学习(RL)通过与环境相互作用的试验过程解决顺序决策问题。尽管RL在玩复杂的视频游戏方面取得了巨大的成功,但在现实世界中,犯错误总是不希望的。为了提高样本效率并从而降低错误,据信基于模型的增强学习(MBRL)是一个有前途的方向,它建立了环境模型,在该模型中可以进行反复试验,而无需实际成本。在这项调查中,我们对MBRL进行了审查,重点是Deep RL的最新进展。对于非壮观环境,学到的环境模型与真实环境之间始终存在概括性错误。因此,非常重要的是分析环境模型中的政策培训与实际环境中的差异,这反过来又指导了更好的模型学习,模型使用和政策培训的算法设计。此外,我们还讨论了其他形式的RL,包括离线RL,目标条件RL,多代理RL和Meta-RL的最新进展。此外,我们讨论了MBRL在现实世界任务中的适用性和优势。最后,我们通过讨论MBRL未来发展的前景来结束这项调查。我们认为,MBRL在被忽略的现实应用程序中具有巨大的潜力和优势,我们希望这项调查能够吸引更多关于MBRL的研究。
translated by 谷歌翻译
现有的离线增强学习(RL)方法面临一些主要挑战,尤其是学识渊博的政策与行为政策之间的分配转变。离线Meta-RL正在成为应对这些挑战的一种有前途的方法,旨在从一系列任务中学习信息丰富的元基础。然而,如我们的实证研究所示,离线元RL在具有良好数据集质量的任务上的单个任务RL方法可能胜过,这表明必须在“探索”不合时宜的情况下进行精细的平衡。通过遵循元元素和“利用”离线数据集的分配状态行为,保持靠近行为策略。通过这种经验分析的激励,我们探索了基于模型的离线元RL,并具有正则政策优化(MERPO),该策略优化(MERPO)学习了一种用于有效任务结构推理的元模型,并提供了提供信息的元元素,以安全地探索过分分布状态 - 行为。特别是,我们使用保守的政策评估和正规政策改进,设计了一种新的基于元指数的基于元指数的基于元模型的参与者批判性(RAC),作为MERPO的关键构建块作为Merpo的关键构建块;而其中的内在权衡是通过在两个正规机构之间达到正确的平衡来实现的,一个是基于行为政策,另一个基于元政策。从理论上讲,我们学识渊博的政策可以保证对行为政策和元政策都有保证的改进,从而确保通过离线元RL对新任务的绩效提高。实验证实了Merpo优于现有的离线META-RL方法的出色性能。
translated by 谷歌翻译
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. * equal contribution. † equal advising. Orders randomized.34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
translated by 谷歌翻译
离线增强学习(RL)将经典RL算法的范式扩展到纯粹从静态数据集中学习,而无需在学习过程中与基础环境进行交互。离线RL的一个关键挑战是政策培训的不稳定,这是由于离线数据的分布与学习政策的未结束的固定状态分配之间的不匹配引起的。为了避免分配不匹配的有害影响,我们将当前政策的未静置固定分配正规化在政策优化过程中的离线数据。此外,我们训练动力学模型既实施此正规化,又可以更好地估计当前策略的固定分布,从而减少了分布不匹配引起的错误。在各种连续控制的离线RL数据集中,我们的方法表示竞争性能,从而验证了我们的算法。该代码公开可用。
translated by 谷歌翻译
元强化学习(RL)方法可以使用比标准RL少的数据级的元培训策略,但元培训本身既昂贵又耗时。如果我们可以在离线数据上进行元训练,那么我们可以重复使用相同的静态数据集,该数据集将一次标记为不同任务的奖励,以在元测试时间适应各种新任务的元训练策略。尽管此功能将使Meta-RL成为现实使用的实用工具,但离线META-RL提出了除在线META-RL或标准离线RL设置之外的其他挑战。 Meta-RL学习了一种探索策略,该策略收集了用于适应的数据,并元培训策略迅速适应了新任务的数据。由于该策略是在固定的离线数据集上进行了元训练的,因此当适应学识渊博的勘探策略收集的数据时,它可能表现得不可预测,这与离线数据有系统地不同,从而导致分布变化。我们提出了一种混合脱机元元素算法,该算法使用带有奖励的脱机数据来进行自适应策略,然后收集其他无监督的在线数据,而无需任何奖励标签来桥接这一分配变化。通过不需要在线收集的奖励标签,此数据可以便宜得多。我们将我们的方法比较了在模拟机器人的运动和操纵任务上进行离线元rl的先前工作,并发现使用其他无监督的在线数据收集可以显着提高元训练政策的自适应能力,从而匹配完全在线的表现。在一系列具有挑战性的域上,需要对新任务进行概括。
translated by 谷歌翻译
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an offpolicy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
translated by 谷歌翻译
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
translated by 谷歌翻译
我们根据相对悲观主义的概念,在数据覆盖不足的情况下提出了经过对抗训练的演员评论家(ATAC),这是一种新的无模型算法(RL)。 ATAC被设计为两人Stackelberg游戏:政策演员与受对抗训练的价值评论家竞争,后者发现参与者不如数据收集行为策略的数据一致方案。我们证明,当演员在两人游戏中不后悔时,运行ATAC会产生一项政策,证明1)在控制悲观程度的各种超级参数上都超过了行为政策,而2)与最佳竞争。 policy covered by data with appropriately chosen hyperparameters.与现有作品相比,尤其是我们的框架提供了一般函数近似的理论保证,也提供了可扩展到复杂环境和大型数据集的深度RL实现。在D4RL基准测试中,ATAC在一系列连续的控制任务上始终优于最先进的离线RL算法。
translated by 谷歌翻译
Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
translated by 谷歌翻译
Pessimism is of great importance in offline reinforcement learning (RL). One broad category of offline RL algorithms fulfills pessimism by explicit or implicit behavior regularization. However, most of them only consider policy divergence as behavior regularization, ignoring the effect of how the offline state distribution differs with that of the learning policy, which may lead to under-pessimism for some states and over-pessimism for others. Taking account of this problem, we propose a principled algorithmic framework for offline RL, called \emph{State-Aware Proximal Pessimism} (SA-PP). The key idea of SA-PP is leveraging discounted stationary state distribution ratios between the learning policy and the offline dataset to modulate the degree of behavior regularization in a state-wise manner, so that pessimism can be implemented in a more appropriate way. We first provide theoretical justifications on the superiority of SA-PP over previous algorithms, demonstrating that SA-PP produces a lower suboptimality upper bound in a broad range of settings. Furthermore, we propose a new algorithm named \emph{State-Aware Conservative Q-Learning} (SA-CQL), by building SA-PP upon representative CQL algorithm with the help of DualDICE for estimating discounted stationary state distribution ratios. Extensive experiments on standard offline RL benchmark show that SA-CQL outperforms the popular baselines on a large portion of benchmarks and attains the highest average return.
translated by 谷歌翻译
离线增强学习(RL)提供了一个有希望的方向,可以利用大量离线数据来实现复杂的决策任务。由于分配转移问题,当前的离线RL算法通常被设计为在价值估计和行动选择方面是保守的。但是,这种保守主义在现实情况下遇到观察偏差时,例如传感器错误和对抗性攻击时会损害学习政策的鲁棒性。为了权衡鲁棒性和保守主义,我们通过一种新颖的保守平滑技术提出了强大的离线增强学习(RORL)。在RORL中,我们明确地介绍了数据集附近国家的策略和价值函数的正则化,以及对这些OOD状态的其他保守价值估计。从理论上讲,我们表明RORL比线性MDP中的最新理论结果更紧密地构成。我们证明RORL可以在一般离线RL基准上实现最新性能,并且对对抗性观察的扰动非常强大。
translated by 谷歌翻译
离线增强学习(RL)可以从静态数据集中学习控制策略,但是像标准RL方法一样,它需要每个过渡的奖励注释。在许多情况下,将大型数据集标记为奖励可能会很高,尤其是如果人类标签必须提供这些奖励,同时收集多样的未标记数据可能相对便宜。我们如何在离线RL中最好地利用这种未标记的数据?一种自然解决方案是从标记的数据中学习奖励函数,并使用它标记未标记的数据。在本文中,我们发现,也许令人惊讶的是,一种简单得多的方法,它简单地将零奖励应用于未标记的数据可以导致理论和实践中的有效数据共享,而无需学习任何奖励模型。虽然这种方法起初可能看起来很奇怪(并且不正确),但我们提供了广泛的理论和经验分析,说明了它如何摆脱奖励偏见,样本复杂性和分配变化,通常会导致良好的结果。我们表征了这种简单策略有效的条件,并进一步表明,使用简单的重新加权方法扩展它可以进一步缓解通过使用不正确的奖励标签引入的偏见。我们的经验评估证实了模拟机器人运动,导航和操纵设置中的这些发现。
translated by 谷歌翻译
离线强化学习在利用大型预采用的数据集进行政策学习方面表现出了巨大的希望,使代理商可以放弃经常廉价的在线数据收集。但是,迄今为止,离线强化学习的探索相对较小,并且缺乏对剩余挑战所在的何处的了解。在本文中,我们试图建立简单的基线以在视觉域中连续控制。我们表明,对两个基于最先进的在线增强学习算法,Dreamerv2和DRQ-V2进行了简单的修改,足以超越事先工作并建立竞争性的基准。我们在现有的离线数据集中对这些算法进行了严格的评估,以及从视觉观察结果中进行离线强化学习的新测试台,更好地代表现实世界中离线增强学习问题中存在的数据分布,并开放我们的代码和数据以促进此方面的进度重要领域。最后,我们介绍并分析了来自视觉观察的离线RL所独有的几个关键Desiderata,包括视觉分散注意力和动态视觉上可识别的变化。
translated by 谷歌翻译