Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and have established offline reinforcement learning as an essential toolkit for learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal, non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation collects operational data. Although offline learning techniques can learn from data generated by a sub-optimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and to guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample-efficient than a naive combination of expert data with data collected from a sub-optimal agent. We augment an existing offline reinforcement learning algorithm, Conservative Q-Learning (CQL), with our approach and perform experiments on data collected from the MuJoCo and OffWorld Gym learning environments.
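The triggering mechanism described above can be illustrated with a minimal sketch, assuming the uncertainty signal is the disagreement of an ensemble of Q-estimates and that demonstration transitions are mixed into the training batch once that disagreement crosses a threshold; the threshold, mixing ratio, and function names are hypothetical and not the paper's exact procedure.

```python
# Hedged sketch of uncertainty-triggered demonstration injection (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def ensemble_uncertainty(q_ensemble: np.ndarray) -> float:
    """Mean per-sample standard deviation across an ensemble of Q-estimates,
    shape (n_members, batch_size)."""
    return float(q_ensemble.std(axis=0).mean())

def build_training_batch(offline_batch, demo_batch, q_ensemble,
                         threshold=0.5, demo_fraction=0.25):
    """Mix human demonstration transitions into the batch only when the
    critic ensemble is uncertain about the offline data."""
    if ensemble_uncertainty(q_ensemble) < threshold:
        return offline_batch                      # ensemble agrees: train as usual
    n_demo = max(1, int(len(offline_batch) * demo_fraction))
    picked = rng.choice(len(demo_batch), size=n_demo, replace=False)
    return np.concatenate([offline_batch[:-n_demo], demo_batch[picked]])

# Toy usage: 5 critics, a batch of 64 transitions represented here by indices.
q_ens = rng.normal(size=(5, 64))
batch = build_training_batch(np.arange(64), np.arange(1000, 1064), q_ens)
print(batch.shape)
```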
Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, the process of training a control policy requires lengthy exploration, resulting in low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift can easily lead to performance degradation during online fine-tuning. To address these problems, we propose a learning-from-demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we prevent performance degradation caused by large estimation errors in the Q-function by using ensemble Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. On four MuJoCo continuous control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
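A minimal sketch of the ensemble-Q idea mentioned above, assuming a "mean minus standard deviation" aggregation over the ensemble; the paper's exact aggregation rule may differ, so the function below is illustrative only.

```python
# Hedged sketch: damp Q-estimation error with an ensemble of critics (illustrative aggregation).
import numpy as np

def conservative_q(q_values: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """q_values: shape (n_ensemble, batch). Returns a pessimistic estimate
    that is pulled down wherever the ensemble members disagree."""
    return q_values.mean(axis=0) - beta * q_values.std(axis=0)

q_ens = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=(5, 8))
print(conservative_q(q_ens))   # lower where the critics disagree most
```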
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and of improving upon custom heuristic policies. Offline reinforcement learning offers the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, which are particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that their performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We utilize Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continuous improvement with online data collection.
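The behaviour-cloning pre-training step can be sketched as follows, assuming logged (state, action) pairs from a heuristic scheduler and a linear-softmax policy fitted by gradient descent; the shapes, learning rate, and data below are synthetic placeholders rather than the authors' setup.

```python
# Hedged sketch of behaviour-cloning pre-training from logged scheduler decisions.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(512, 6))            # logged scheduler observations (placeholder)
actions = rng.integers(0, 4, size=512)        # logged heuristic decisions (placeholder)

W = np.zeros((6, 4))
for _ in range(200):                          # plain gradient descent on the NLL
    logits = states @ W
    probs = np.exp(logits - logits.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    probs[np.arange(512), actions] -= 1.0     # gradient of cross-entropy w.r.t. logits
    W -= 0.1 * states.T @ probs / 512

bc_policy = lambda s: int(np.argmax(s @ W))   # warm start for subsequent online RL
print(bc_policy(states[0]))
```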
How to make imitation learning work more generally when demonstrations are relatively limited has been a persistent problem in reinforcement learning (RL). Poor demonstrations lead to narrow and biased data distributions, non-Markovian human expert demonstrations make it difficult for the agent to learn, and over-reliance on sub-optimal trajectories can make it hard for the agent to improve its performance. To address these problems, we propose a new algorithm named TD3FG that transitions smoothly from learning from experts to learning from experience. Our algorithm achieves good performance in the MuJoCo environment with limited and sub-optimal demonstrations. We use behavior cloning to train a network as a reference action generator and use it in terms of both the loss function and the exploration noise. This innovation helps the agent extract prior knowledge from the demonstrations while reducing the adverse effects of their poor, non-Markovian properties. Compared with the BC + fine-tuning and DDPGfD methods, it achieves better performance, especially when the demonstrations are relatively limited. We call our method TD3FG, meaning TD3 from a Generator.
Learning robot tasks in the real world remains highly challenging, and effective practical solutions have yet to be found. Traditional approaches used in this domain are imitation learning and reinforcement learning, but both have limitations when applied to real robots. Combining reinforcement learning with pre-collected demonstrations is a promising approach that can help in learning control policies for robot tasks. In this paper, we propose an algorithm that uses novel techniques to leverage offline expert data during offline and online training, achieving faster convergence and improved performance. The proposed algorithm (AWET) weights the critic loss with a novel agent-advantage weight to improve over the expert data. In addition, AWET uses an automatic early-termination technique to stop and discard policy rollouts that diverge from the expert trajectories, to prevent drifting away from the expert data. In an ablation study, AWET shows improved and promising performance compared with state-of-the-art baselines on four standard robotic tasks.
Offline reinforcement learning has attracted significant interest for addressing the practical challenges of applying traditional reinforcement learning. Offline reinforcement learning uses previously collected datasets to train agents without any interaction. To address the overestimation of out-of-distribution (OOD) actions, conservative estimation assigns lower values to all inputs. Previous conservative estimation methods usually have difficulty avoiding the influence of OOD actions on the Q-value estimates. In addition, these algorithms usually have to sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, Double Conservative Estimation (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors from out-of-distribution actions while implicitly deriving a conservative estimate. In addition, our algorithm uses a controllable penalty term that changes the degree of conservatism during training. We theoretically show how this method affects the estimation of OOD actions and in-distribution actions. Our experiments show that the two conservative estimation methods respectively affect the estimation of all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL.
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task.
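The uncertainty-penalized reward can be sketched as below, assuming the uncertainty estimate is the disagreement among an ensemble of learned dynamics models; the paper's exact estimator and penalty coefficient are not reproduced here.

```python
# Hedged sketch of an uncertainty-penalized model reward (illustrative estimator and lambda).
import numpy as np

def penalized_reward(reward: float, next_state_preds: np.ndarray,
                     lam: float = 1.0) -> float:
    """next_state_preds: shape (n_models, state_dim), ensemble predictions of the
    next state for one (s, a). Larger disagreement -> larger penalty."""
    uncertainty = np.linalg.norm(next_state_preds.std(axis=0))
    return reward - lam * uncertainty

preds = np.random.default_rng(2).normal(size=(7, 3))   # 7 models, 3-dim state
print(penalized_reward(1.0, preds))
```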
Imitation learning (IL) is a framework that learns to imitate expert behavior from demonstrations. Recently, IL has shown promising results on high-dimensional and control tasks. However, IL typically suffers from low sample efficiency in terms of environment interaction, which severely restricts its application to simulated domains. In industrial applications, agents usually have high interaction costs: the more interaction with the environment, the more damage to the environment and to the agent itself. In this paper, we strive to improve sample efficiency by introducing a novel scheme from inverse reinforcement learning. Our method, which we call \textit{Model Reward Function Based Imitation Learning} (MRFIL), uses an ensemble dynamics model as the reward function, trained on expert demonstrations. The key idea is to give the agent an incentive to match the demonstrations over a long horizon by providing a positive reward when it encounters states that conform to the expert demonstration distribution. In addition, we present a convergence guarantee for the new objective function. Experimental results show that our algorithm achieves competitive performance and significantly reduces environment interaction compared with IL methods.
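A minimal sketch of using an expert-trained dynamics ensemble as the reward signal, assuming the agreement test and threshold shown below; these are illustrative stand-ins, not the paper's exact formulation.

```python
# Hedged sketch: reward is positive when the expert-trained dynamics ensemble predicts
# the visited transition well, i.e. when the transition looks like expert data.
import numpy as np

def model_reward(next_state: np.ndarray, ensemble_preds: np.ndarray,
                 threshold: float = 0.5) -> float:
    """ensemble_preds: shape (n_models, state_dim), each model's prediction of
    the next state for the visited (s, a)."""
    err = np.linalg.norm(ensemble_preds - next_state, axis=1).mean()
    return 1.0 if err < threshold else 0.0

rng = np.random.default_rng(3)
s_next = rng.normal(size=4)
preds = s_next + 0.05 * rng.normal(size=(5, 4))   # ensemble agrees with s_next
print(model_reward(s_next, preds))                # -> 1.0
```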
In recent years, deep reinforcement learning (DRL) has successfully been applied to complex decision-making applications such as robotics, autonomous driving, and video games. Off-policy algorithms tend to be more sample-efficient than their on-policy counterparts, and they can benefit from any off-policy data stored in a replay buffer. Expert demonstrations are a popular source of such data: the agent is exposed to successful states and actions, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make full use of the demonstrations in the buffer, such as pre-training on demonstrations only or minimizing additional cost functions. We carry out a study to evaluate several of these ideas in isolation, to understand which ones have the greatest impact. We also present a new method for sparse-reward tasks, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behavior. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on robotic manipulation, specifically a 3D reaching task for a robotic arm in simulation. We show that our method, SACR2, based on reward relabeling, improves performance on this task even in the absence of demonstrations.
Reinforcement learning (RL) solves sequential decision-making problems through a trial-and-error process of interaction with the environment. Although RL has achieved great success in playing complex video games, making mistakes is always undesirable in the real world. To improve sample efficiency and thereby reduce mistakes, model-based reinforcement learning (MBRL) is believed to be a promising direction: it builds environment models in which trial-and-error can take place without incurring real costs. In this survey, we review MBRL with a focus on recent advances in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore important to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides the algorithmic design of better model learning, model usage, and policy training. In addition, we discuss recent progress in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Moreover, we discuss the applicability and advantages of MBRL in real-world tasks. Finally, we conclude the survey by discussing future prospects for the development of MBRL. We believe that MBRL has great potential and advantages in real-world applications that have so far been overlooked, and we hope this survey can attract more research on MBRL.
Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real-world applications. This paper studies offline RL using the DQN Replay Dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN Replay Dataset surpasses strong RL baselines. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. Overall, the results here present an optimistic view that robust RL algorithms used on sufficiently large and diverse offline datasets can lead to high quality policies. To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io.
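The random-convex-combination objective can be sketched as below, assuming a discrete-action critic with K heads, a squared TD error in place of the usual Huber loss, and the same mixture reused for the bootstrap target; shapes and the target computation are illustrative simplifications.

```python
# Hedged sketch of a REM-style training signal: Bellman consistency on a random
# convex combination of K Q-heads (simplified, illustrative shapes).
import numpy as np

rng = np.random.default_rng(4)

def rem_td_error(q_heads, q_heads_next, actions, rewards, dones, gamma=0.99):
    """q_heads, q_heads_next: (K, batch, n_actions); returns the mean squared
    TD error of the randomly mixed Q-function."""
    K = q_heads.shape[0]
    alpha = rng.random(K)
    alpha /= alpha.sum()                                  # a point on the simplex
    q_mix = np.tensordot(alpha, q_heads, axes=1)          # (batch, n_actions)
    q_mix_next = np.tensordot(alpha, q_heads_next, axes=1)
    chosen = q_mix[np.arange(len(actions)), actions]
    target = rewards + gamma * (1.0 - dones) * q_mix_next.max(axis=1)
    return np.mean((chosen - target) ** 2)

q = rng.normal(size=(4, 32, 6)); qn = rng.normal(size=(4, 32, 6))
a = rng.integers(0, 6, 32); r = rng.random(32); d = np.zeros(32)
print(rem_td_error(q, qn, a, r, d))
```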
Reinforcement learning (RL) has been demonstrated to be effective in domains where the agent can learn policies by actively interacting with its operating environment. However, if we change the RL scheme to an offline setting, where the agent can only update its policy from a static dataset, a major issue in offline reinforcement learning arises: distributional shift. We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm that actively guides the agent back to familiar regions by manipulating the value function. We focus on the problems caused by out-of-distribution (OOD) states and deliberately penalize high values at states absent from the training dataset, so that the learned pessimistic value function lower-bounds the true value anywhere within the state space. We evaluate the PessORL algorithm on various benchmark tasks, where we show that our method gains better performance by explicitly handling OOD states, compared with methods that only consider OOD actions.
Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
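The batch-constrained idea can be sketched as follows, with a nearest-neighbour candidate set standing in for the learned generative model and perturbation network used in practice; the placeholder critic and data are purely illustrative assumptions.

```python
# Hedged sketch of batch-constrained action selection: only consider actions the
# dataset supports near the current state, then pick the best of those by Q.
import numpy as np

rng = np.random.default_rng(5)
data_states = rng.normal(size=(1000, 3))    # logged states (placeholder)
data_actions = rng.normal(size=(1000, 2))   # logged actions (placeholder)

def q_value(s, a):                          # placeholder critic for the sketch
    return -np.linalg.norm(a) + s.sum()

def batch_constrained_action(state, k=10):
    """Choose among the actions taken in the k most similar logged states."""
    idx = np.argsort(np.linalg.norm(data_states - state, axis=1))[:k]
    candidates = data_actions[idx]
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

print(batch_constrained_action(np.zeros(3)))
```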
In recent years, deep reinforcement learning (DRL) has successfully been applied to complex decision-making applications such as robotics, autonomous driving, and video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is learning from expert demonstrations. In the past, multiple ideas have been proposed to make good use of demonstrations added to the replay buffer, such as pre-training on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage both demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behavior. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on robotic manipulation, specifically three tasks for a 6-degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, two improvements integrated into our method, taken from previous works, allow it to outperform all baselines.
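The reward-bonus relabeling described above can be sketched as below; the bonus value and transition layout are illustrative assumptions.

```python
# Hedged sketch of reward-bonus relabeling: demonstration transitions and the
# transitions of successful episodes enter the replay buffer with a bonus.
import numpy as np

BONUS = 1.0
replay_buffer = []   # list of (state, action, reward, next_state, done) tuples

def add_demonstrations(demo_transitions):
    for (s, a, r, s2, d) in demo_transitions:
        replay_buffer.append((s, a, r + BONUS, s2, d))

def add_episode(episode, success: bool):
    for (s, a, r, s2, d) in episode:
        replay_buffer.append((s, a, r + BONUS if success else r, s2, d))

# Toy usage with a single placeholder transition.
demo = [(np.zeros(2), 0, 0.0, np.ones(2), False)]
add_demonstrations(demo)
add_episode(demo, success=True)
print(len(replay_buffer), replay_buffer[0][2])
```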
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
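For a discrete-action critic, the conservative regularizer can be sketched as below, assuming the common log-sum-exp form added to the TD loss; shapes and the alpha weight are illustrative.

```python
# Hedged sketch of a CQL-style regularizer: push down Q over all actions while
# pushing up Q on actions actually present in the dataset (discrete actions).
import numpy as np

def cql_regularizer(q_values: np.ndarray, dataset_actions: np.ndarray,
                    alpha: float = 1.0) -> float:
    """q_values: (batch, n_actions); dataset_actions: (batch,) action indices."""
    logsumexp = np.log(np.exp(q_values).sum(axis=1))           # log-sum-exp over actions
    data_q = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return alpha * float((logsumexp - data_q).mean())

rng = np.random.default_rng(6)
q = rng.normal(size=(32, 4)); a = rng.integers(0, 4, 32)
print(cql_regularizer(q, a))     # added to the usual Bellman error during training
```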
Reinforcement learning often suffers from the sparse-reward issue in real-world robotics problems. Learning from demonstration (LfD) is an effective way to mitigate this problem by leveraging collected expert data to aid online learning. Prior works often assume that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task. In this paper, we consider the case where the target task is mismatched with, but similar to, that of the expert. Such a setting can be challenging, and we find that existing LfD methods cannot effectively guide learning in mismatched new tasks with sparse rewards. We propose conservative reward shaping from demonstration (CRSfD), which shapes the sparse rewards using an estimated expert value function. To accelerate the learning process, CRSfD guides the agent to conservatively explore around the demonstrations. Experimental results on robot manipulation tasks show that our approach outperforms baseline LfD methods when transferring demonstrations collected in a single task to other different but similar tasks.
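One standard way to realise "shaping a sparse reward with an estimated expert value function" is potential-based shaping, sketched below; the regression model and the potential-based form are assumptions for illustration, not necessarily the paper's exact shaping term.

```python
# Hedged sketch: shape a sparse reward with a value function estimated from
# expert demonstrations, using a potential-based shaping term.
import numpy as np

rng = np.random.default_rng(7)
demo_states = rng.normal(size=(200, 3))
demo_returns = rng.random(200)                 # Monte-Carlo returns from demos (placeholder)

# Crude expert value estimate: linear regression of returns on states.
V_w, *_ = np.linalg.lstsq(demo_states, demo_returns, rcond=None)
V_expert = lambda s: float(s @ V_w)

def shaped_reward(r, s, s_next, gamma=0.99):
    return r + gamma * V_expert(s_next) - V_expert(s)

print(shaped_reward(0.0, demo_states[0], demo_states[1]))
```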
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order-of-magnitude speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
Meta-reinforcement learning (meta-RL) methods can meta-train policies using orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we could meta-train on offline data, then we could reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond the online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adaptation, and it also meta-trains a policy that quickly adapts to data from a new task. Since this policy is meta-trained on a fixed offline dataset, it may behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distribution shift. We propose a hybrid offline meta-RL algorithm that uses offline data with rewards to meta-train an adaptive policy and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks, and find that using additional unsupervised online data collection significantly improves the adaptive capability of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in the real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related to but are not classical RL algorithms. The role of simulators in training agents, as well as methods to validate, test, and robustify existing solutions in RL, are discussed.
We present a novel learning from demonstration (LfD) method, Deformable Manipulation from Demonstrations (DMfD), to solve deformable manipulation tasks using states or images as inputs, given expert demonstrations. Our method uses demonstrations in three different ways and balances the trade-off between exploring the environment online and using guidance from the experts, so as to explore high-dimensional spaces effectively. We test DMfD on a set of representative manipulation tasks for a 1D rope and a 2D cloth from the SoftGym suite, each with state and image observations. Our method exceeds baseline performance by up to 12.9% on state-based tasks and by up to 33.44% on image-based tasks, with comparable or better robustness to randomness. In addition, we create two challenging environments for folding a 2D cloth using image-based observations and set a performance benchmark for them. We deploy DMfD on a real robot with minimal loss of normalized performance during real-world execution compared to simulation (about 6%). The source code is available at github.com/uscresl/dmfd.