We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical application of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask how performance depends on the number of samples used for learning from offline data. Our goal is to emphasize that studying the sample complexity of offline RL is important and serves as a useful metric for the practicality of existing offline algorithms. We propose an evaluation methodology for the sample-complexity analysis of offline RL.
Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and proved offline reinforcement learning to be an essential toolkit in learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a suboptimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample-efficient than a naive way of combining expert data with data collected from a suboptimal agent. We augmented an existing offline reinforcement learning algorithm, Conservative Q-Learning, with our approach and performed experiments on data collected from MuJoCo and OffWorld Gym learning environments.
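A minimal sketch of how such an uncertainty-triggered injection could be wired up, assuming disagreement across an ensemble of Q-networks as the uncertainty proxy; the buffer interface, the names (`q_ensemble`, `demo_buffer`, `tau`), and the threshold rule are illustrative assumptions, not the authors' implementation.

```python
import torch

def sample_training_batch(q_ensemble, offline_buffer, demo_buffer,
                          batch_size=256, tau=0.5):
    """Mix demonstration data into a batch when ensemble disagreement is high.

    q_ensemble: list of Q-networks mapping (state, action) -> value
    offline_buffer / demo_buffer: objects with .sample(n) -> (s, a, r, s2, done)
    tau: disagreement threshold that triggers demonstration injection
    """
    s, a, r, s2, done = offline_buffer.sample(batch_size)

    # Epistemic-uncertainty proxy: std of the Q-estimates across the ensemble.
    with torch.no_grad():
        qs = torch.stack([q(s, a) for q in q_ensemble], dim=0)
        disagreement = qs.std(dim=0).mean()

    if disagreement > tau:
        # Replace half of the batch with human demonstration transitions.
        k = batch_size // 2
        ds, da, dr, ds2, ddone = demo_buffer.sample(k)
        s = torch.cat([s[:k], ds]);   a = torch.cat([a[:k], da])
        r = torch.cat([r[:k], dr]);   s2 = torch.cat([s2[:k], ds2])
        done = torch.cat([done[:k], ddone])

    return s, a, r, s2, done
```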
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, which are particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We utilize Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continuous improvement with online data collection.
Offline reinforcement learning is used to train policies in situations where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking actions. We term this the resource-constrained setting. This leads to situations where the offline dataset (available for training) can contain fully processed features (built using powerful language models, image models, complex sensors, etc.) that are not available when actions are actually taken online. This disconnect leads to an interesting and unexplored problem in offline RL: can a richly processed offline dataset be used to train a policy that has access to fewer features in the online environment? In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained on the full offline dataset and policies trained with limited features. We address this performance gap with a policy transfer algorithm that first trains a teacher agent using the offline dataset in which all features are available, and then transfers this knowledge to a student agent that only uses the resource-constrained features. To better capture the challenges of this setting, we propose a data collection procedure: Resource Constrained Datasets for RL (RC-D4RL). We evaluate the transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline without transfer. The code for the experiments is available at https://github.com/jayanthrr/rc-offlinerl.
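The teacher-student transfer step described above could, under assumptions, be implemented as a simple policy distillation in which the student only sees the resource-constrained feature subset; the feature masking by index and the squared-error objective below are placeholders rather than the paper's actual transfer algorithm.

```python
import torch

def distill_student(teacher, student, full_obs, constrained_idx,
                    optimizer, epochs=10, batch_size=256):
    """Transfer a teacher policy (trained on fully featured offline observations)
    to a student policy that only receives the resource-constrained features.

    full_obs:        [N, D_full] observations from the offline dataset
    constrained_idx: indices of the features still available online
    """
    for _ in range(epochs):
        perm = torch.randperm(full_obs.shape[0])
        for i in range(0, full_obs.shape[0], batch_size):
            obs = full_obs[perm[i:i + batch_size]]
            with torch.no_grad():
                target_action = teacher(obs)                 # teacher sees everything
            pred_action = student(obs[:, constrained_idx])   # student sees a subset
            loss = (pred_action - target_action).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```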
Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that hold the promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes in the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating their connecting new action. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).
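A rough sketch of the stitching test described above, assuming a probabilistic forward dynamics model for the plausibility check, an inverse-dynamics network to generate the connecting action, and a learned state-value function for the improvement check; all three components and the acceptance thresholds are assumptions for illustration.

```python
import torch

def try_stitch(s_i, s_j, inverse_model, dynamics, value_fn,
               log_prob_min=-5.0):
    """Propose a new transition s_i -> s_j with a generated connecting action.

    inverse_model(s, s') -> action proposed to move from s to s'
    dynamics(s, a)       -> torch.distributions object over next states
    value_fn(s)          -> scalar state-value estimate
    Returns (action, accepted): the stitch is kept only if the transition is
    plausible under the environment model and reaches a higher-value state.
    """
    with torch.no_grad():
        a_new = inverse_model(s_i, s_j)
        # Plausibility check under the probabilistic environment model.
        plausible = dynamics(s_i, a_new).log_prob(s_j).sum() > log_prob_min
        # Only accept stitches that improve the state-value estimate.
        improves = value_fn(s_j) > value_fn(s_i)
    return a_new, bool(plausible and improves)
```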
Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset without any further interaction with the environment. Despite the potential to surpass the behavior policies, RL-based methods are generally impractical due to training instability and bootstrapped extrapolation errors, which always require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues, since it learns the policy directly without estimating a value function. However, IL is usually limited by the capability of the behavior policy and tends to learn mediocre behavior from datasets collected by a mixture of policies. In this paper, we aim to take advantage of IL while mitigating this drawback. Observing that behavior cloning is able to imitate neighboring policies with less data, we propose \textit{Curriculum Offline Imitation Learning (COIL)}, which utilizes an experience-picking strategy to imitate adaptive neighboring policies with higher returns and improves the current policy along curriculum stages. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task.
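The core of the modification is the penalized reward r̃(s, a) = r̂(s, a) − λ·u(s, a). The sketch below assumes an ensemble of learned dynamics models and uses their disagreement as the uncertainty estimate u(s, a), which is one common practical choice; the coefficient λ and the exact uncertainty estimator are assumptions rather than the paper's prescribed quantities.

```python
import torch

def penalized_reward(models, s, a, lam=1.0):
    """Uncertainty-penalized reward: r_tilde(s, a) = r_hat(s, a) - lam * u(s, a).

    models: list of learned dynamics models, each mapping (s, a) to
            (next_state_prediction, reward_prediction).
    lam:    penalty coefficient trading off return against model uncertainty.
    """
    with torch.no_grad():
        preds = [m(s, a) for m in models]
        next_states = torch.stack([p[0] for p in preds], dim=0)   # [E, B, D]
        rewards = torch.stack([p[1] for p in preds], dim=0)       # [E, B]

        # Uncertainty proxy: maximum per-dimension std of the predicted
        # next state across the ensemble.
        u = next_states.std(dim=0).max(dim=-1).values             # [B]
        r_hat = rewards.mean(dim=0)                                # [B]

    return r_hat - lam * u
```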
Deep reinforcement learning (RL) is a powerful framework for solving complex real-world problems. The large neural networks used in this framework are traditionally associated with better generalization capabilities, but their increased size comes with the drawbacks of lengthy training, substantial hardware requirements, and longer inference times. One way to tackle this problem is to prune neural networks, leaving only the necessary parameters. Current state-of-the-art pruning techniques for imposing sparsity perform well in applications where the data distribution is fixed; however, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach for offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments with different levels of network sparsity and evaluate the validity of pruning-at-initialization techniques in continuous control tasks. Our results show that, with 95% of the network weights pruned, offline-RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge, no prior work has used pruning in RL while retaining performance at such high levels of sparsity. Moreover, pruning at initialization can be easily integrated into any existing offline-RL algorithm without changing the learning objective.
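As an illustration of the general recipe (prune once before offline-RL training and leave the learning objective untouched), the sketch below applies plain global magnitude pruning with `torch.nn.utils.prune`; the paper's pruning-at-initialization criterion, which uses the fixed dataset, may differ, so treat this as a stand-in.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_policy_network(net: nn.Module, sparsity: float = 0.95) -> nn.Module:
    """Globally mask `sparsity` fraction of weights before offline-RL training.

    The downstream offline-RL learning objective is untouched; pruned weights
    are simply masked to zero for the rest of training.
    """
    params_to_prune = [
        (module, "weight") for module in net.modules()
        if isinstance(module, nn.Linear)
    ]
    prune.global_unstructured(
        params_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=sparsity,
    )
    return net

# Example: a small actor network with 95% of its weights masked.
actor = nn.Sequential(nn.Linear(17, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 6), nn.Tanh())
actor = prune_policy_network(actor, sparsity=0.95)
```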
Recent work has shown that supervised learning alone, without temporal-difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a practical guide for practitioners doing reinforcement learning via supervised learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.
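A minimal outcome-conditioned model in the spirit of RvS learning, assuming a Gaussian policy with fixed variance so that maximizing likelihood reduces to a squared-error loss; the network width and the choice of conditioning variable (a goal or a reward-to-go scalar) are illustrative.

```python
import torch
import torch.nn as nn

class RvSPolicy(nn.Module):
    """Feedforward MLP with two hidden layers, conditioned on an outcome
    (goal or reward-to-go) concatenated to the state."""

    def __init__(self, state_dim, cond_dim, action_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, cond):
        return self.net(torch.cat([state, cond], dim=-1))

def rvs_loss(policy, state, cond, action):
    # Maximum likelihood for a fixed-variance Gaussian reduces to MSE on the mean.
    return ((policy(state, cond) - action) ** 2).mean()
```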
Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, the process of training a control policy requires lengthy exploration, resulting in low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift can easily lead to performance degradation during online fine-tuning. To solve the above problems, we propose a learning-from-demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we prevent performance degradation due to large estimation errors in the Q-function by using an ensemble of Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. In four MuJoCo continuous control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, exploration of offline reinforcement learning in this setting has so far been comparatively limited, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish competitive baselines. We rigorously evaluate these algorithms on existing offline datasets as well as a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and we open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.
Offline reinforcement learning (RL) defines the task of learning from a fixed batch of data. Due to errors in value estimation arising from out-of-distribution actions, most offline RL algorithms take the approach of constraining or regularizing the policy with the actions contained in the dataset. Built on top of pre-existing RL algorithms, such modifications make an RL algorithm work offline at the cost of additional complexity: offline RL algorithms introduce new hyperparameters and often leverage auxiliary components such as generative models, while adjusting the underlying RL algorithm. In this paper, we aim to make a deep RL algorithm work offline with minimal changes. We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data. The resulting algorithm is a simple baseline that is easy to implement and tune, while more than halving the overall run time by removing the additional computational overhead of previous methods.
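A sketch of the kind of minimal change described, assuming a TD3-style deterministic actor: the policy update gains a behavior-cloning term, the Q term is rescaled so the two terms stay on a comparable scale, and dataset states are normalized; the coefficient value and names are illustrative.

```python
import torch

def policy_loss_bc_regularized(actor, critic, states, actions, alpha=2.5):
    """Behavior-cloning-regularized policy update.

    loss = -lambda * Q(s, pi(s)) + (pi(s) - a)^2,
    with lambda = alpha / mean(|Q(s, pi(s))|) so the two terms stay on a
    comparable scale regardless of the task's reward magnitude.
    """
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()
    return (-lam * q + (pi - actions).pow(2).sum(dim=-1)).mean()

def normalize_states(states, eps=1e-3):
    """Dataset-level state normalization (the 'normalizing the data' part)."""
    mean = states.mean(0, keepdim=True)
    std = states.std(0, keepdim=True) + eps
    return (states - mean) / std, mean, std
```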
Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
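One way to realize the constrained action selection described above is to penalize a kernel MMD between actions sampled from the learned policy and actions sampled from (a model of) the behavior policy; the fixed penalty weight below stands in for a Lagrangian treatment, and the kernel bandwidth and sample counts are assumptions.

```python
import torch

def gaussian_mmd(x, y, bandwidth=20.0):
    """Squared MMD between two batched action samples of shape [B, N, A]."""
    def k(a, b):
        d = ((a.unsqueeze(2) - b.unsqueeze(1)) ** 2).sum(-1)  # [B, N, N]
        return torch.exp(-d / (2 * bandwidth))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def constrained_actor_loss(actor, critic, behavior_actions, states,
                           n_samples=4, mmd_weight=100.0):
    """Maximize Q while keeping policy actions within the support of the data.

    behavior_actions: [B, n_samples, A] actions drawn from a fitted behavior
    model (or from the dataset) for the same states.
    """
    B = states.shape[0]
    pi_actions = actor(states.repeat_interleave(n_samples, dim=0))
    pi_actions = pi_actions.view(B, n_samples, -1)
    q = critic(states, pi_actions[:, 0])          # one sample used for the Q term
    mmd = gaussian_mmd(pi_actions, behavior_actions)
    return (-q).mean() + mmd_weight * mmd
```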
Although behavior cloning (BC) theoretically suffers from compounding errors, its scalability and simplicity still make it an attractive imitation learning algorithm. In contrast, imitation approaches based on adversarial training typically do not share this problem, but they require interaction with the environment. Meanwhile, most imitation learning methods only utilize optimal datasets, which can be significantly more expensive to obtain than suboptimal ones. A question that arises is: can we make principled use of suboptimal datasets that would otherwise sit idle? We propose a scalable model-based offline imitation learning algorithmic framework that leverages datasets collected by both suboptimal and optimal policies, and we show that its worst-case suboptimality becomes linear in the time horizon with respect to the expert samples. We empirically validate our theoretical results and show that the proposed method \textit{always} outperforms BC in the low-data regime on simulated continuous control domains.
Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper, we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on most of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches stems from the high variance inherent in off-policy evaluation, amplified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and the behavior policy.
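A sketch of the one-step recipe under assumptions: first evaluate the behavior policy with a SARSA-style target that bootstraps only from dataset actions, then take a single constrained improvement step, shown here as exponentiated-advantage-weighted regression onto dataset actions (one of several possible improvement operators).

```python
import torch

def sarsa_q_loss(q_net, target_q, batch, gamma=0.99):
    """On-policy evaluation of the behavior policy: bootstrap from the
    *dataset* next action, never from a max over actions."""
    s, a, r, s2, a2, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q(s2, a2)
    return (q_net(s, a) - target).pow(2).mean()

def one_step_policy_loss(actor, q_beta, v_beta, s, a, temperature=1.0):
    """Single step of constrained policy improvement: advantage-weighted
    regression onto the dataset actions."""
    with torch.no_grad():
        adv = q_beta(s, a) - v_beta(s)
        w = torch.clamp(torch.exp(adv / temperature), max=100.0)
    return (w * (actor(s) - a).pow(2).sum(dim=-1)).mean()
```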
We present state advantage weighting for offline reinforcement learning (RL). In contrast to action advantage $A(s,a)$ that we commonly adopt in QSA learning, we leverage state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling the action from values. We expect the agent can get to the high-reward state and the action is determined by how the agent can get to that corresponding state. Experiments on D4RL datasets show that our proposed method can achieve remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
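A loose sketch of the QSS idea as described in the abstract, under assumptions: values are defined over state transitions, the state advantage is A(s, s') = Q(s, s') − V(s), and the action is recovered by a separate inverse-dynamics model; the specific losses and weighting below are illustrative, not the authors' algorithm.

```python
import torch

def state_advantage(q_ss, v_s, s, s_next):
    """A(s, s') = Q(s, s') - V(s): how much better reaching s' is than average."""
    return q_ss(s, s_next) - v_s(s)

def weighted_inverse_dynamics_loss(inv_model, q_ss, v_s, s, a, s_next,
                                   temperature=1.0):
    """Clone dataset actions, up-weighting transitions that reach
    high-advantage successor states; the action is decoupled from the value
    and only enters through the inverse-dynamics model."""
    with torch.no_grad():
        w = torch.exp(state_advantage(q_ss, v_s, s, s_next) / temperature)
        w = torch.clamp(w, max=100.0)
    return (w * (inv_model(s, s_next) - a).pow(2).sum(dim=-1)).mean()
```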
Offline reinforcement learning has attracted great interest as a way of addressing the application challenges of conventional reinforcement learning; it uses previously collected datasets to train agents without any further interaction. To address the overestimation of OOD (out-of-distribution) actions, conservative estimation assigns a lower value to all inputs. Previous conservative estimation methods usually struggle to avoid the impact of OOD actions on Q-value estimates, and they typically sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, Double Conservative Estimates (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V function to avoid the errors caused by out-of-distribution actions while implicitly deriving a conservative estimate. In addition, our algorithm uses a controllable penalty term that changes the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD actions and in-distribution actions. Our experiments show that the two conservative estimation methods each influence the estimation of all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL.
Relying on too many trials to learn good actions, current reinforcement learning (RL) algorithms have limited applicability in real-world settings, where exploration can be too costly. We propose a batch RL algorithm in which an effective policy is learned using only a fixed offline dataset, rather than online interactions with the environment. The limited data in batch RL produces inherent uncertainty in the value estimates of states/actions that are insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from the policy that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces this divergence, and a value constraint that discourages overly optimistic estimates. Across a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
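A generic sketch of the two penalties under assumptions: a policy constraint that keeps the learned actions close to the dataset actions, and a value constraint that pushes down Q-values on policy-proposed actions relative to dataset actions; the exact penalty forms and coefficients in the paper may differ.

```python
import torch

def penalized_actor_loss(actor, critic, s, a_data, policy_coef=1.0):
    """Policy constraint: maximize Q while staying close to the behavior data."""
    pi = actor(s)
    return (-critic(s, pi) + policy_coef * (pi - a_data).pow(2).sum(-1)).mean()

def penalized_critic_loss(critic, target_critic, actor, batch,
                          gamma=0.99, value_coef=1.0):
    """Value constraint: a conservative regularizer that discourages optimistic
    values on policy actions relative to dataset actions."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_critic(s2, actor(s2))
    bellman = (critic(s, a) - target).pow(2).mean()
    pessimism = (critic(s, actor(s)) - critic(s, a)).mean()
    return bellman + value_coef * pessimism
```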
Imitation learning (IL) is a framework that learns to imitate expert behavior from demonstrations. Recently, IL has shown promising results on high-dimensional and control tasks. However, IL typically suffers from sample inefficiency in terms of environment interactions, which severely restricts its application to simulated domains. In industrial applications, the learner often has a high interaction cost: the more interactions with the environment, the more damage to both the environment and the learner itself. In this paper, we strive to improve sample efficiency by introducing a novel scheme based on inverse reinforcement learning. Our method, which we call \textit{Model Reward Function Based Imitation Learning} (MRFIL), uses an ensemble dynamics model as the reward function, trained on expert demonstrations. The key idea is to give the agent an incentive to match the demonstrations over a long horizon by providing a positive reward when it encounters states in line with the expert demonstration distribution. In addition, we show a convergence guarantee for the new objective function. Experimental results show that our algorithm reaches competitive performance while significantly reducing environment interactions compared with IL methods.
The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, as many practical scenarios involve interaction among multiple agents, solving offline RL in the more practical multi-agent setting remains an open problem. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect offline RL algorithms to transfer directly to the multi-agent setting as well. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. Towards mitigating this degradation, we identify a key issue: the value-function landscape can be non-concave, and policy-gradient improvement is prone to local optima. Since the suboptimal policy of any single agent can lead to uncoordinated global failure, the problem is exacerbated with multiple agents. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines first-order policy gradients and zeroth-order optimization methods so that the actors better optimize the conservative value functions. Despite its simplicity, OMAR significantly outperforms strong baselines, achieving state-of-the-art performance on multi-agent continuous control benchmarks.