在现实世界中,通过弱势政策影响环境可能是昂贵的或非常危险的,因此妨碍了现实世界的加强学习应用。离线强化学习(RL)可以从给定数据集中学习策略,而不与环境进行交互。但是,数据集是脱机RL算法的唯一信息源,并确定学习策略的性能。我们仍然缺乏关于数据集特征如何影响不同离线RL算法的研究。因此,我们对数据集特性如何实现离散动作环境的离线RL算法的性能的全面实证分析。数据集的特点是两个度量:(1)通过轨迹质量(TQ)测量的平均数据集返回和(2)由状态 - 动作覆盖(SACO)测量的覆盖范围。我们发现,禁止政策深度Q网家族的变体需要具有高SACO的数据集来表现良好。将学习策略朝向给定数据集的算法对具有高TQ或SACO的数据集进行了良好。对于具有高TQ的数据集,行为克隆优先级或类似于最好的离线RL算法。
translated by 谷歌翻译
在这项工作中,我们将注意力集中在数据分布与基于Q学基于Q学基于函数近似之间的相互作用的研究。我们提供了一个理论和实证分析,以及为什么数据分布的不同性质可以有助于调节算法不稳定性的来源。首先,我们重新审视近似动态编程算法性能的理论界限。其次,我们提供了一种新型的四态MDP,突出了在线和离线设置中具有功能近似的Q学习算法的数据分布的影响。最后,我们通过实验评估数据分布属性在离线深度Q网算法的性能中的影响。我们的结果表明:(i)数据分布需要拥有某些属性,以便在离线设置中鲁棒地学习,即距离MDP的最佳策略和高覆盖范围内的分布在状态 - 动作空间上的低距离; (ii)高熵数据分布可以有助于减轻算法不稳定性的来源。
translated by 谷歌翻译
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard offpolicy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
translated by 谷歌翻译
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.Preprint. Under review.
translated by 谷歌翻译
离线强化学习在利用大型预采用的数据集进行政策学习方面表现出了巨大的希望,使代理商可以放弃经常廉价的在线数据收集。但是,迄今为止,离线强化学习的探索相对较小,并且缺乏对剩余挑战所在的何处的了解。在本文中,我们试图建立简单的基线以在视觉域中连续控制。我们表明,对两个基于最先进的在线增强学习算法,Dreamerv2和DRQ-V2进行了简单的修改,足以超越事先工作并建立竞争性的基准。我们在现有的离线数据集中对这些算法进行了严格的评估,以及从视觉观察结果中进行离线强化学习的新测试台,更好地代表现实世界中离线增强学习问题中存在的数据分布,并开放我们的代码和数据以促进此方面的进度重要领域。最后,我们介绍并分析了来自视觉观察的离线RL所独有的几个关键Desiderata,包括视觉分散注意力和动态视觉上可识别的变化。
translated by 谷歌翻译
依赖于太多的实验来学习良好的行动,目前的强化学习(RL)算法在现实世界的环境中具有有限的适用性,这可能太昂贵,无法探索探索。我们提出了一种批量RL算法,其中仅使用固定的脱机数据集来学习有效策略,而不是与环境的在线交互。批量RL中的有限数据产生了在培训数据中不充分表示的状态/行动的价值估计中的固有不确定性。当我们的候选政策从生成数据的候选政策发散时,这导致特别严重的外推。我们建议通过两个直接的惩罚来减轻这个问题:减少这种分歧的政策限制和减少过于乐观估计的价值约束。在全面的32个连续动作批量RL基准测试中,我们的方法对最先进的方法进行了比较,无论如何收集离线数据如何。
translated by 谷歌翻译
脱机强化学习 - 从一批数据中学习策略 - 是难以努力的:如果没有制造强烈的假设,它很容易构建实体算法失败的校长。在这项工作中,我们考虑了某些现实世界问题的财产,其中离线强化学习应该有效:行动仅对一部分产生有限的行动。我们正规化并介绍此动作影响规律(AIR)财产。我们进一步提出了一种算法,该算法假定和利用AIR属性,并在MDP满足空气时绑定输出策略的子优相。最后,我们展示了我们的算法在定期保留的两个模拟环境中跨越不同的数据收集策略占据了现有的离线强度学习算法。
translated by 谷歌翻译
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. * equal contribution. † equal advising. Orders randomized.34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
translated by 谷歌翻译
离线强化学习(RL)任务要求代理从预先收集的数据集中学习,没有与环境进行进一步的交互。尽管有可能超越行为政策,但基于RL的方法通常是不切实际的,因为培训不稳定并引导外推错误,这始终需要通过在线评估进行仔细的超参数调整。相比之下,离线模仿学习(IL)没有这样的问题,因为它直接在不估计值函数的情况下直接了解策略。然而,IL通常限制在行为政策的能力,并且倾向于从政策混合收集的数据集中学习平庸行为。在本文中,我们的目标是利用IL但缓解这种缺点。观察行为克隆能够使用较少的数据模仿邻近的策略,我们提出\ Textit {课程脱机仿制学习(线圈)},它利用具有更高回报的自适应邻近策略的体验挑选策略,并提高了当前策略沿课程阶段。在连续控制基准测试中,我们将线圈与基于仿制的和基于RL的方法进行比较,表明它不仅避免了在混合数据集上学习平庸行为,而且甚至与最先进的离线RL方法竞争。
translated by 谷歌翻译
Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
translated by 谷歌翻译
Offline reinforcement learning (RL) is suitable for safety-critical domains where online exploration is too costly or dangerous. In safety-critical settings, decision-making should take into consideration the risk of catastrophic outcomes. In other words, decision-making should be risk-sensitive. Previous works on risk in offline RL combine together offline RL techniques, to avoid distributional shift, with risk-sensitive RL algorithms, to achieve risk-sensitivity. In this work, we propose risk-sensitivity as a mechanism to jointly address both of these issues. Our model-based approach is risk-averse to both epistemic and aleatoric uncertainty. Risk-aversion to epistemic uncertainty prevents distributional shift, as areas not covered by the dataset have high epistemic uncertainty. Risk-aversion to aleatoric uncertainty discourages actions that may result in poor outcomes due to environment stochasticity. Our experiments show that our algorithm achieves competitive performance on deterministic benchmarks, and outperforms existing approaches for risk-sensitive objectives in stochastic domains.
translated by 谷歌翻译
在许多顺序决策问题(例如,机器人控制,游戏播放,顺序预测),人类或专家数据可用包含有关任务的有用信息。然而,来自少量专家数据的模仿学习(IL)可能在具有复杂动态的高维环境中具有挑战性。行为克隆是一种简单的方法,由于其简单的实现和稳定的收敛而被广泛使用,但不利用涉及环境动态的任何信息。由于对奖励和政策近似器或偏差,高方差梯度估计器,难以在实践中难以在实践中努力训练的许多现有方法。我们介绍了一种用于动态感知IL的方法,它通过学习单个Q函数来避免对抗训练,隐含地代表奖励和策略。在标准基准测试中,隐式学习的奖励显示与地面真实奖励的高正面相关性,说明我们的方法也可以用于逆钢筋学习(IRL)。我们的方法,逆软Q学习(IQ-Learn)获得了最先进的结果,在离线和在线模仿学习设置中,显着优于现有的现有方法,这些方法都在所需的环境交互和高维空间中的可扩展性中,通常超过3倍。
translated by 谷歌翻译
在过去的十年中,多智能经纪人强化学习(Marl)已经有了重大进展,但仍存在许多挑战,例如高样本复杂性和慢趋同稳定的政策,在广泛的部署之前需要克服,这是可能的。然而,在实践中,许多现实世界的环境已经部署了用于生成策略的次优或启发式方法。一个有趣的问题是如何最好地使用这些方法作为顾问,以帮助改善多代理领域的加强学习。在本文中,我们提供了一个原则的框架,用于将动作建议纳入多代理设置中的在线次优顾问。我们描述了在非传记通用随机游戏环境中提供多种智能强化代理(海军上将)的问题,并提出了两种新的基于Q学习的算法:海军上将决策(海军DM)和海军上将 - 顾问评估(Admiral-AE) ,这使我们能够通过适当地纳入顾问(Admiral-DM)的建议来改善学习,并评估顾问(Admiral-AE)的有效性。我们从理论上分析了算法,并在一般加上随机游戏中提供了关于他们学习的定点保证。此外,广泛的实验说明了这些算法:可以在各种环境中使用,具有对其他相关基线的有利相比的性能,可以扩展到大状态行动空间,并且对来自顾问的不良建议具有稳健性。
translated by 谷歌翻译
离线增强学习(RL)将经典RL算法的范式扩展到纯粹从静态数据集中学习,而无需在学习过程中与基础环境进行交互。离线RL的一个关键挑战是政策培训的不稳定,这是由于离线数据的分布与学习政策的未结束的固定状态分配之间的不匹配引起的。为了避免分配不匹配的有害影响,我们将当前政策的未静置固定分配正规化在政策优化过程中的离线数据。此外,我们训练动力学模型既实施此正规化,又可以更好地估计当前策略的固定分布,从而减少了分布不匹配引起的错误。在各种连续控制的离线RL数据集中,我们的方法表示竞争性能,从而验证了我们的算法。该代码公开可用。
translated by 谷歌翻译
由于数据量增加,金融业的快速变化已经彻底改变了数据处理和数据分析的技术,并带来了新的理论和计算挑战。与古典随机控制理论和解决财务决策问题的其他分析方法相比,解决模型假设的财务决策问题,强化学习(RL)的新发展能够充分利用具有更少模型假设的大量财务数据并改善复杂的金融环境中的决策。该调查纸目的旨在审查最近的资金途径的发展和使用RL方法。我们介绍了马尔可夫决策过程,这是许多常用的RL方法的设置。然后引入各种算法,重点介绍不需要任何模型假设的基于价值和基于策略的方法。连接是用神经网络进行的,以扩展框架以包含深的RL算法。我们的调查通过讨论了这些RL算法在金融中各种决策问题中的应用,包括最佳执行,投资组合优化,期权定价和对冲,市场制作,智能订单路由和Robo-Awaring。
translated by 谷歌翻译
强化学习(RL)通过与环境相互作用的试验过程解决顺序决策问题。尽管RL在玩复杂的视频游戏方面取得了巨大的成功,但在现实世界中,犯错误总是不希望的。为了提高样本效率并从而降低错误,据信基于模型的增强学习(MBRL)是一个有前途的方向,它建立了环境模型,在该模型中可以进行反复试验,而无需实际成本。在这项调查中,我们对MBRL进行了审查,重点是Deep RL的最新进展。对于非壮观环境,学到的环境模型与真实环境之间始终存在概括性错误。因此,非常重要的是分析环境模型中的政策培训与实际环境中的差异,这反过来又指导了更好的模型学习,模型使用和政策培训的算法设计。此外,我们还讨论了其他形式的RL,包括离线RL,目标条件RL,多代理RL和Meta-RL的最新进展。此外,我们讨论了MBRL在现实世界任务中的适用性和优势。最后,我们通过讨论MBRL未来发展的前景来结束这项调查。我们认为,MBRL在被忽略的现实应用程序中具有巨大的潜力和优势,我们希望这项调查能够吸引更多关于MBRL的研究。
translated by 谷歌翻译
与政策策略梯度技术相比,使用先前收集的数据的无模型的无模型深钢筋学习(RL)方法可以提高采样效率。但是,当利益政策的分布与收集数据的政策之间的差异时,非政策学习变得具有挑战性。尽管提出了良好的重要性抽样和范围的政策梯度技术来补偿这种差异,但它们通常需要一系列长轨迹,以增加计算复杂性并引起其他问题,例如消失或爆炸梯度。此外,由于需要行动概率,它们对连续动作领域的概括严格受到限制,这不适合确定性政策。为了克服这些局限性,我们引入了一种替代的非上政策校正算法,用于连续作用空间,参与者 - 批判性非政策校正(AC-OFF-POC),以减轻先前收集的数据引入的潜在缺陷。通过由代理商对随机采样批次过渡的状态的最新动作决策计算出的新颖差异度量,该方法不需要任何策略的实际或估计的行动概率,并提供足够的一步重要性抽样。理论结果表明,引入的方法可以使用固定的独特点获得收缩映射,从而可以进行“安全”的非政策学习。我们的经验结果表明,AC-Off-POC始终通过有效地安排学习率和Q学习和政策优化的学习率,以比竞争方法更少的步骤改善最新的回报。
translated by 谷歌翻译