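Offline RL algorithms must account for the fact that the dataset they are given may leave many aspects of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance can be sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL, in a Bayesian sense, involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but on all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies on offline RL benchmarks.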
translated by 谷歌翻译
Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This drawback can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine offline RL techniques, to avoid distributional shift, with risk-sensitive RL algorithms, to achieve risk-sensitivity. In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
Deep reinforcement learning algorithms require large amounts of experience to learn an individual task. While in principle meta-reinforcement learning (meta-RL) algorithms enable agents to learn new skills from small amounts of experience, several major challenges preclude their practicality. Current methods rely heavily on on-policy experience, limiting their sample efficiency. They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse reward problems. In this paper, we address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control. In our approach, we perform online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience. This probabilistic interpretation enables posterior sampling for structured and efficient exploration. We demonstrate how to integrate these task variables with off-policy RL algorithms to achieve both meta-training and adaptation efficiency. Our method outperforms prior algorithms in sample efficiency by 20-100X as well as in asymptotic performance on several meta-RL benchmarks.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
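The idea of "rewards artificially penalized by the uncertainty of the dynamics" in the MOPO abstract can be illustrated with a minimal Python sketch. The abstract does not say how uncertainty is estimated, so this sketch assumes disagreement across an ensemble of dynamics models as a stand-in. The names `penalized_reward`, `ensemble_predictions`, and `lam` are hypothetical and are not taken from the paper.

```python
import statistics


def penalized_reward(reward, ensemble_predictions, lam=1.0):
    """Return reward minus lam times an uncertainty estimate.

    ensemble_predictions: one predicted next state (a list of floats)
    per dynamics model. Uncertainty is ASSUMED here to be the largest
    per-dimension population standard deviation across the ensemble;
    this is an illustrative choice, not the paper's estimator.
    """
    # Group the predictions by state dimension: one tuple per dimension.
    per_dimension = zip(*ensemble_predictions)
    uncertainty = max(statistics.pstdev(values) for values in per_dimension)
    return reward - lam * uncertainty


# Models that agree leave the reward untouched; models that disagree lower it.
agree = penalized_reward(1.0, [[0.0, 1.0], [0.0, 1.0]], lam=0.5)
disagree = penalized_reward(1.0, [[0.0, 1.0], [2.0, 1.0]], lam=0.5)
print(agree, disagree)
```

A planner or policy optimizer trained on this penalized reward is discouraged from visiting state-action pairs where the learned models disagree, which is one way to read the trade-off between gain and risk that the abstract mentions.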
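Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes ACME, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of ACME is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work we describe the major design decisions made within ACME and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of ACME, namely that it can be used to implement large, distributed RL algorithms that run at massive scale while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, and various new agents implemented as part of ACME.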
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions. Preprint. Under review.
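The statement that CQL "augments the standard Bellman error objective with a simple Q-value regularizer" can be sketched for a single transition with discrete actions. The abstract does not give the regularizer's form. The log-sum-exp gap used below is one commonly used variant and is an assumption here, as are the names `cql_loss`, `q_values`, `td_target`, and `alpha`.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cql_loss(q_values, action, td_target, alpha=1.0):
    """Per-transition loss: squared Bellman error plus a conservative term.

    q_values: Q(s, a) for every discrete action a in state s.
    action: index of the action actually taken in the dataset.
    td_target: bootstrapped target r + gamma * max_a' Q_target(s', a'),
        assumed to be computed elsewhere.
    The regularizer pushes down Q-values across all actions (via
    log-sum-exp) and pushes up the Q-value of the dataset action.
    """
    q_data = q_values[action]
    bellman_error = 0.5 * (q_data - td_target) ** 2
    conservative_gap = logsumexp(q_values) - q_data  # always >= 0
    return bellman_error + alpha * conservative_gap


# With equal Q-values and zero Bellman error, only the regularizer remains.
print(cql_loss([0.0, 0.0], action=0, td_target=0.0))
```

Because the log-sum-exp of the Q-values is never smaller than any single Q-value, the added term is non-negative and grows when actions outside the dataset receive high values, which is the overestimation the abstract describes.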
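Meta-reinforcement learning (meta-RL) has proven to be a successful framework for leveraging experience from prior tasks to rapidly learn new related tasks. However, current meta-RL approaches struggle to learn in sparse reward environments. Although existing meta-RL algorithms can learn strategies for adapting to new sparse reward tasks, the actual adaptation strategies are learned using hand-shaped reward functions, or require simple environments where random exploration is sufficient to encounter sparse reward. In this paper, we present a formulation of hindsight relabeling for meta-RL, which relabels experience during meta-training to enable learning to explore entirely using sparse reward. We demonstrate the effectiveness of our approach on a suite of challenging sparse reward goal-reaching environments that previously required dense reward during meta-training to solve. Our approach solves these environments using the true sparse reward function, with performance comparable to training with a proxy dense reward function.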
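Reliant on too many experiments to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from the one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy constraint to reduce this divergence and a value constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.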
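Deep reinforcement learning (DRL) and deep multi-agent reinforcement learning (MARL) have achieved significant success across a wide range of domains, including game AI, autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample-inefficient: millions of interactions are usually needed even for relatively simple problem settings, which prevents wide application and deployment in real-world scenarios. One bottleneck behind this is the well-known exploration problem, that is, how to efficiently explore the environment and collect informative experiences that benefit policy learning towards the optimal policy. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey of existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of DRL exploration methods on a set of commonly used benchmarks. Based on our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.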
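Reinforcement learning (RL) has been shown to be effective in domains where the agent can learn policies by actively interacting with its operating environment. However, if we change the RL scheme to an offline setting where the agent can only update its policy via static datasets, one of the major issues in offline reinforcement learning emerges, namely distributional shift. We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm to actively lead the agent back to regions it is familiar with by manipulating the value function. We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent from the training dataset, so that the learned pessimistic value function lower-bounds the true value anywhere within the state space. We evaluate the PessORL algorithm on various benchmark tasks, where we show that our method achieves better performance by explicitly handling OOD states, compared to methods that only consider OOD actions.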
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
Reinforcement learning (RL) gained considerable attention by creating decision-making agents that maximize rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature, where agents do not receive the true and complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies applied RL to POMDPs by recalling previous decisions and observations or inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical for environments with high-dimensional continuous state and action spaces. Moreover, so-called inference-based RL approaches require a large number of samples to perform well since agents eschew uncertainty in the inferred state for the decision-making. Active inference is a framework that is naturally formulated in POMDPs and directs agents to select decisions by minimising expected free energy (EFE). This supplies reward-maximising (exploitative) behaviour in RL, with an information-seeking (exploratory) behaviour. Despite this exploratory behaviour of active inference, its usage is limited to discrete state and action spaces due to the computational difficulty of the EFE. We propose a unified principle for joint information-seeking and reward maximization that clarifies a theoretical connection between active inference and RL, unifies active inference and RL, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
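Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adaptation, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. Because it does not require reward labels, this online data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.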
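In many practical applications of RL, it is expensive to observe state transitions from the environment. For example, in the problem of plasma control for nuclear fusion, computing the next state for a given state-action pair requires querying an expensive transition function, which can lead to many hours of computer simulation or dollars of scientific research. Such expensive data collection prohibits the application of standard RL algorithms, which usually require a large number of observations to learn. In this work, we address the problem of efficiently learning a policy while making a minimal number of state-action queries to the transition function. In particular, we leverage ideas from Bayesian optimal experimental design to guide the selection of state-action queries for efficient learning. We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process. At each iteration, our algorithm maximizes this acquisition function to choose the most informative state-action pair to query, thus yielding a data-efficient RL approach. We experiment with a variety of simulated continuous control problems and show that our approach learns an optimal policy with up to $5$ to $1{,}000\times$ less data than model-based RL baselines and $10^3$ to $10^5\times$ less data than model-free RL baselines. We also provide several ablated comparisons, which point to substantial improvements arising from the principled method of obtaining data.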
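Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially observable or stochastic settings, optimal policies may depend on the ground-truth distribution of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret method that optimizes the ground-truth utility function even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution while promoting robustness across the full range of environment settings.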
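For many reinforcement learning (RL) applications, specifying a reward is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL), which uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. Moreover, it achieves similar or better performance by shifting the focus from reducing the reward approximation error to improving the policy induced by the reward model. We support our findings with extensive evaluations in multiple environments and with different query types.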
While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we view reinforcement learning as inferring policies that achieve desired outcomes, rather than as a problem of maximizing rewards. To solve this inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to hand-craft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors.
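We propose a novel framework for multi-task reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-Control (MDL-C), the agent learns the common structure among the tasks it faces and then distills it into a simpler representation, which facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks. Empirically, this framework is used to modify existing policy optimization approaches and improves their multi-task performance in both discrete and high-dimensional continuous control problems.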
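Reinforcement learning (RL) algorithms assume that users specify tasks by manually writing down a reward function. However, this process can be laborious and demands considerable technical expertise. Can we devise RL algorithms that instead enable users to specify tasks simply by providing examples of successful outcomes? In this paper, we derive a control algorithm that maximizes the future probability of these successful outcome examples. Prior work has approached similar problems with a two-stage process, first learning a reward function and then optimizing this reward function using another RL algorithm. In contrast, our method directly learns a value function from transitions and successful outcomes, without learning this intermediate reward function. Our method therefore requires fewer hyperparameters to tune and fewer lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.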