In the context of imitation learning, providing expert trajectories is often expensive and time-consuming. The goal must therefore be to create algorithms that require as little expert data as possible. In this paper, we propose an algorithm that imitates the expert's high-level strategy instead of imitating the expert only at the action level, which we hypothesize requires less expert data and makes training more stable. As a prior, we assume that the high-level strategy is to reach an unknown goal-state region, which we believe is a valid prior for many domains in reinforcement learning. The goal-state region is unknown, but since the expert has demonstrated how to reach it, the agent tries to reach states similar to the expert's. Building on the idea of temporal coherence, our algorithm trains a neural network to predict whether two states are similar, in the sense that they may occur close in time. During inference, the agent compares its current state with the expert states from a case base to obtain a similarity. Results show that our approach can still learn a near-optimal policy in settings with very little expert data, where algorithms that try to imitate the expert at the action level can no longer do so.
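As a rough illustration of the temporal-coherence idea in the abstract above, the sketch below trains a small network to classify whether two states are likely to occur within a short time window of each other, and then scores the agent's current state against a case base of expert states. The architecture, pair-sampling scheme, and all names are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (assumed architecture, not the paper's code): train a network to
# predict whether two states occur within a small temporal window of each other,
# then use it to score similarity against stored expert states.
import torch
import torch.nn as nn

class SimilarityNet(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: "these two states are temporally close"
        )

    def forward(self, s_a, s_b):
        return self.net(torch.cat([s_a, s_b], dim=-1)).squeeze(-1)

def train_step(model, optimizer, trajectory, window=5):
    """One training step on a single trajectory (tensor of shape [T, state_dim]).

    Positive pairs: states within `window` steps of each other.
    Negative pairs: states sampled uniformly (mostly far apart in time).
    Assumes T > window.
    """
    T = trajectory.shape[0]
    anchor_idx = torch.randint(0, T - window, (64,))
    pos_idx = anchor_idx + torch.randint(1, window, (64,))
    neg_idx = torch.randint(0, T, (64,))

    logits = torch.cat([
        model(trajectory[anchor_idx], trajectory[pos_idx]),
        model(trajectory[anchor_idx], trajectory[neg_idx]),
    ])
    labels = torch.cat([torch.ones(64), torch.zeros(64)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def similarity_to_expert(model, current_state, expert_case_base):
    """Score the agent's current state against every stored expert state."""
    repeated = current_state.expand(expert_case_base.shape[0], -1)
    with torch.no_grad():
        scores = torch.sigmoid(model(repeated, expert_case_base))
    return scores.max()  # highest similarity to any expert state
```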
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully-observable policy for exploration and parts of the domain that are fully observable while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
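The abstract above describes balancing two objectives during offline training: acting like the fully observable "state expert" and maximizing return under partial observability. The sketch below combines a SAC-style actor term with an imitation term toward the expert's action; the interfaces, the deterministic-expert assumption, and the fixed trade-off weight are illustrative assumptions, not the paper's exact objective.

```python
# A hedged sketch of balancing reward maximisation with imitation of a fully
# observable "state expert" during training in a simulator. All interfaces
# (policy.sample, q_function, state_expert) are assumed for illustration.
import torch
import torch.nn.functional as F

def actor_loss(policy, q_function, state_expert, obs_history, full_state,
               alpha=0.2, beta=1.0):
    """Combine the usual SAC actor objective with an imitation term.

    obs_history : partial observations available to the learning agent
    full_state  : privileged simulator state, available only during training
    beta        : trade-off between return maximisation and expert imitation
    """
    # Standard SAC term: maximise Q minus entropy penalty, using partial observations.
    action, log_prob = policy.sample(obs_history)
    sac_term = (alpha * log_prob - q_function(obs_history, action)).mean()

    # Imitation term: stay close to what the fully observable expert would do
    # in the underlying state (assumed deterministic expert for simplicity).
    with torch.no_grad():
        expert_action = state_expert(full_state)
    imitation_term = F.mse_loss(action, expert_action)

    return sac_term + beta * imitation_term
```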
Reinforcement learning, and more recently deep reinforcement learning, are popular approaches for solving sequential decision-making problems modeled as Markov decision processes. RL modeling of a problem and the choice of algorithms and hyperparameters require careful consideration, since different configurations can yield entirely different performance. These considerations are mainly the task of RL experts; however, RL is steadily becoming popular in other fields whose researchers and system designers are not RL experts. Moreover, many modeling decisions, such as defining the state and action spaces, the batch size and frequency of batch updates, and the number of timesteps, are typically made manually. For these reasons, automating the different components of the RL framework is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which the different components of RL, including MDP modeling, algorithm selection, and hyperparameter optimization, are modeled and defined automatically. In this article, we survey the literature and present work that can be used in automated RL. Furthermore, we discuss the challenges, open questions, and research directions in AutoRL.
The imitation learning research community has recently made significant progress toward enabling artificial agents to imitate behaviors from video demonstrations alone. However, due to the high-dimensional nature of video observations, current state-of-the-art methods developed for this problem exhibit high sample complexity. To address this issue, we introduce here a new algorithm called Visual Generative Adversarial Imitation from Observation using a State Observer (VGAIfO-SO). At its core, VGAIfO-SO seeks to address sample inefficiency using a novel, self-supervised state observer that provides estimates of lower-dimensional proprioceptive state representations from high-dimensional images. Our experiments in several continuous-control environments show that VGAIfO-SO is more sample efficient at learning from video-only demonstrations than other IfO algorithms, and can sometimes even achieve performance close to that of the Generative Adversarial Imitation from Observation (GAIfO) algorithm, which has privileged access to the demonstrator's proprioceptive state information.
How to make imitation learning generalize better when demonstrations are relatively limited has been a persistent problem in reinforcement learning (RL). Poor demonstrations lead to narrow and biased data distributions, non-Markovian human expert demonstrations make it difficult for the agent to learn, and over-reliance on suboptimal trajectories can make it hard for the agent to improve its performance. To address these problems, we propose a new algorithm, named TD3fG, that smoothly transitions from learning from experts to learning from experience. Our algorithm achieves good performance in the MuJoCo environments with limited and suboptimal demonstrations. We use behavior cloning to train a network as a reference action generator and utilize it in terms of both the loss function and the exploration noise. This innovation helps the agent extract prior knowledge from the demonstrations while reducing the adverse effects of their poor Markovian properties. Our method performs better than the BC + fine-tuning and DDPGfD approaches, especially when the demonstrations are relatively limited. We call our method TD3fG, meaning TD3 from a Generator.
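To make the "reference action generator" idea above concrete, the sketch below shows one way a behaviour-cloned generator could enter both the actor loss and the exploration noise, with its influence fading over training. The interfaces, the mean-squared penalty, and the linear decay schedule are illustrative assumptions, not the TD3fG implementation.

```python
# A minimal sketch: a behaviour-cloned generator supplies reference actions that
# (a) add a decaying regularisation term to the actor loss and (b) bias exploration
# early in training. Interfaces and the decay schedule are assumptions.
import torch
import torch.nn.functional as F

def actor_loss_with_generator(actor, critic, generator, states, step,
                              bc_weight_init=1.0, decay_steps=100_000):
    """TD3-style actor loss plus a decaying penalty toward the generator's action."""
    w = bc_weight_init * max(0.0, 1.0 - step / decay_steps)  # fade out the prior
    actions = actor(states)
    td3_term = -critic(states, actions).mean()
    with torch.no_grad():
        reference = generator(states)          # behaviour-cloned reference actions
    return td3_term + w * F.mse_loss(actions, reference)

def guided_exploration_action(actor, generator, state, step,
                              noise_std=0.1, decay_steps=100_000):
    """Mix the generator's suggestion into the exploration action, fading it out."""
    w = max(0.0, 1.0 - step / decay_steps)
    with torch.no_grad():
        a = (1 - w) * actor(state) + w * generator(state)
    return a + noise_std * torch.randn_like(a)
```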
Deep Reinforcement Learning has been successfully applied to learn robotic control. However, the corresponding algorithms struggle when applied to problems where the agent is only rewarded after achieving a complex task. In this context, using demonstrations can significantly speed up the learning process, but demonstrations can be costly to acquire. In this paper, we propose to leverage a sequential bias to learn control policies for complex robotic tasks using a single demonstration. To do so, our method learns a goal-conditioned policy to control a system between successive low-dimensional goals. This sequential goal-reaching approach raises a problem of compatibility between successive goals: we need to ensure that the state resulting from reaching a goal is compatible with the achievement of the following goals. To tackle this problem, we present a new algorithm called DCIL-II. We show that DCIL-II can solve with unprecedented sample efficiency some challenging simulated tasks such as humanoid locomotion and stand-up as well as fast running with a simulated Cassie robot. Our method leveraging sequentiality is a step towards the resolution of complex robotic tasks under minimal specification effort, a key feature for the next generation of autonomous robots.
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
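The abstract describes combining temporal-difference updates with supervised classification of the demonstrator's actions. The sketch below uses the usual large-margin formulation of that supervised term on demonstration transitions; the coefficients are placeholders, and the full method's n-step and L2 regularisation terms are omitted, so treat this as an assumed simplification rather than the paper's exact loss.

```python
# A hedged sketch: a 1-step TD loss plus a large-margin supervised term that pushes
# the demonstrator's action to be the argmax of Q on demonstration transitions.
import torch

def dqfd_loss(q_net, target_q_net, batch, margin=0.8, lambda_sup=1.0, gamma=0.99):
    s, a, r, s_next, done, is_demo = batch  # is_demo marks demonstration transitions

    # 1-step temporal-difference term (double-DQN style target).
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * (1 - done) * target_q_net(s_next).gather(1, next_a).squeeze(1)
    td_loss = torch.nn.functional.smooth_l1_loss(q_sa, target)

    # Supervised large-margin term, applied on demonstration transitions only:
    # max_a' [Q(s, a') + margin * 1{a' != a_demo}] - Q(s, a_demo)
    q_all = q_net(s)
    margins = torch.full_like(q_all, margin)
    margins.scatter_(1, a.unsqueeze(1), 0.0)   # no margin for the demonstrated action
    sup = (q_all + margins).max(dim=1).values - q_sa
    sup_loss = (sup * is_demo.float()).mean()

    return td_loss + lambda_sup * sup_loss
```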
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
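One ingredient commonly used to fold demonstrations into a DDPG + HER learner, consistent with the description above, is an auxiliary behaviour-cloning loss on a separate demonstration buffer, applied only where the demonstrator's action scores at least as well as the policy's under the critic (a "Q-filter"). The sketch below shows that ingredient under assumed goal-conditioned interfaces; it is not claimed to be the paper's exact recipe.

```python
# A hedged sketch: DDPG actor loss on replay data plus a Q-filtered behaviour-cloning
# term on a demonstration buffer. Goal-conditioned actor/critic interfaces are assumed.
import torch

def ddpg_actor_loss_with_demos(actor, critic, rl_batch, demo_batch, bc_weight=1.0):
    s_rl, g_rl = rl_batch                 # states and (hindsight-relabelled) goals
    s_demo, g_demo, a_demo = demo_batch   # demonstration states, goals, actions

    # Standard DDPG actor term on regular replay data.
    rl_term = -critic(s_rl, g_rl, actor(s_rl, g_rl)).mean()

    # Behaviour cloning toward demonstrator actions, filtered by the critic:
    # only imitate where Q(s, a_demo) >= Q(s, pi(s)).
    pi_a = actor(s_demo, g_demo)
    with torch.no_grad():
        keep = (critic(s_demo, g_demo, a_demo) >= critic(s_demo, g_demo, pi_a)).float()
    bc_term = (keep * ((pi_a - a_demo) ** 2).sum(dim=-1, keepdim=True)).mean()

    return rl_term + bc_weight * bc_term
```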
Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. The meta-policy, when adapted over only a small (or even a single) number of steps, is able to perform near-optimally on a new, related task. However, a major challenge in adopting this approach for real-world problems is that such problems are often associated with sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms, named Enhanced Meta-RL using Demonstrations (EMRLD), that exploit this information, even when it is sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that exhibits monotone performance improvement. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse-reward environments, including that of a mobile robot.
Meta-reinforcement learning (meta-RL) has proven to be a successful framework for leveraging experience from prior tasks in order to rapidly learn new related tasks; however, current meta-RL approaches struggle to learn in sparse-reward environments. Although existing meta-RL algorithms can learn strategies for adapting to new sparse-reward tasks, the actual adaptation strategies are learned using hand-shaped reward functions, or require simple environments where random exploration is sufficient to encounter the sparse reward. In this paper, we present a hindsight-relabelling formulation for meta-RL, which relabels experience during meta-training to enable learning entirely with sparse rewards. We demonstrate the effectiveness of our approach on a suite of challenging sparse-reward goal-reaching environments that previously required dense rewards during meta-training to solve. Our approach solves these environments using the true sparse reward function, with performance comparable to training with a proxy dense reward function.
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic Online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase to learn a reasonable policy with online training. We utilize Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and well suited to continuous improvement with online data collection.
The inverse reinforcement learning (IRL) problem has seen rapid development in the past few years, with important applications in domains such as robotics, cognition, and health. In this work, we examine the inefficiency of current IRL methods in learning an agent's reward function from expert trajectories that describe long-horizon, complex sequential tasks. We hypothesize that imbuing IRL models with structural motifs that capture the underlying tasks can enable and enhance their performance. Accordingly, we propose a novel IRL method, SMIRL, which first learns the (approximate) structure of a task as a finite-state automaton (FSA) and then uses the structural motif to solve the IRL problem. We test our model on discrete gridworld and high-dimensional continuous-domain environments. We empirically show that our proposed approach successfully learns all four complex tasks, where two popular IRL baselines fail. Our model also outperforms the baselines in sample efficiency on a simpler toy task. We further show promising test results in a modified continuous domain on tasks with compositional reward functions.
Reinforcement learning (RL) agents can learn to solve complex sequential decision-making tasks by interacting with their environment. However, sample efficiency remains a major challenge. In the field of multi-goal RL, where an agent is required to reach multiple goals in order to solve complex tasks, improving sample efficiency can be especially challenging. On the other hand, humans and other biological agents learn such tasks in a more strategic way, following a curriculum of increasing difficulty that allows gradual and efficient learning progress. In this work, we propose a method for automatic goal generation using a dynamical distance function (DDF) in a self-supervised fashion. The DDF is a function that predicts the dynamical distance between any two states within a Markov decision process (MDP). With it, we generate a curriculum of goals at the appropriate difficulty level to facilitate efficient learning throughout training. We evaluate this approach on several goal-conditioned robotic manipulation and navigation tasks, and show improvements in sample efficiency over a baseline method that only uses random goal sampling.
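The sketch below illustrates the two ingredients described above: fitting a dynamical distance function by regressing the number of steps separating two states on the agent's own trajectories, and using it to pick goals whose predicted distance from the start state matches a difficulty target that grows with training progress. The regression target, the difficulty schedule, and the candidate-goal selection are illustrative assumptions, not the paper's exact procedure.

```python
# A rough sketch of a self-supervised dynamical distance function (DDF) and a
# distance-based goal curriculum. Assumed design choices, not the paper's code.
import torch
import torch.nn as nn

class DDF(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1)).squeeze(-1)

def ddf_loss(ddf, trajectory):
    """Self-supervised regression: the label for a pair (s_i, s_j), j > i, is j - i."""
    T = trajectory.shape[0]
    i = torch.randint(0, T - 1, (128,))
    j = i + 1 + (torch.rand(128) * (T - 1 - i)).long()   # a later index in [i+1, T-1]
    pred = ddf(trajectory[i], trajectory[j])
    return nn.functional.mse_loss(pred, (j - i).float())

def sample_goal(ddf, start_state, candidate_goals, progress, max_distance=50.0):
    """Pick the candidate goal whose predicted distance matches the current difficulty."""
    target = progress * max_distance                      # difficulty grows with progress
    repeated = start_state.expand(candidate_goals.shape[0], -1)
    with torch.no_grad():
        d = ddf(repeated, candidate_goals)
    return candidate_goals[(d - target).abs().argmin()]
```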
Reward function specification, which requires considerable human effort and iteration, remains a major impediment to learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. To address these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results, including videos, can be found online at https://sites.google.com/view/variational-mail.
Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. Additionally, this particular formulation allows for the reusability of expert data between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert sample efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and also visualize the differences between the learned models from AIL and LfGP.
Complex sequential tasks in continuous-control settings often require agents to successfully traverse a set of "narrow passages" in their state space. Solving such tasks with a sparse reward in a sample-efficient manner poses a challenge to modern reinforcement learning (RL) due to the associated long horizon of the problem and the lack of sufficient positive signal during learning. Various tools have been applied to address this challenge. When available, large sets of demonstrations can guide agent exploration. Hindsight relabelling, on the other hand, does not require an additional source of information. However, existing exploration strategies are based on task-agnostic goal distributions, which can render the solution of long-horizon tasks impractical. In this work, we extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations. We evaluate the approach on four complex, single- and dual-arm robotic manipulation tasks against strong, suitable baselines. The method requires fewer demonstrations to solve all tasks and achieves significantly higher overall performance as task complexity increases. Finally, we investigate the robustness of the proposed solution to the quality of input representations and the number of demonstrations.
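As a loose illustration of biasing hindsight relabelling toward a demonstration-implied distribution, the sketch below mixes standard future-state relabelling with goals drawn from states visited in successful demonstrations. The mixing ratio, the uniform choice over demonstration states, and the sparse-reward check are assumptions made for illustration, not the paper's mechanism.

```python
# A hedged sketch: hindsight goal relabelling with a fraction of goals drawn from
# demonstration states instead of purely task-agnostic achieved goals.
import random

def relabel_episode(episode, demo_states, achieved_goal_fn, demo_fraction=0.5):
    """Return relabelled (state, action, goal, reward) tuples for one episode.

    episode          : list of (state, action, next_state) transitions
    demo_states      : flat list of states appearing in successful demonstrations
    achieved_goal_fn : maps a state to the goal it achieves
    """
    relabelled = []
    for t, (s, a, s_next) in enumerate(episode):
        if random.random() < demo_fraction:
            # Task-specific relabelling: pull the goal from the demonstration distribution.
            new_goal = achieved_goal_fn(random.choice(demo_states))
        else:
            # Standard hindsight relabelling: a goal achieved later in this episode.
            future = episode[random.randint(t, len(episode) - 1)][2]
            new_goal = achieved_goal_fn(future)
        reward = 0.0 if achieved_goal_fn(s_next) == new_goal else -1.0  # sparse reward
        relabelled.append((s, a, new_goal, reward))
    return relabelled
```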
In recent years, deep reinforcement learning (DRL) has achieved successes in complex decision-making applications such as robotics, autonomous driving, and video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One branch of such data-driven approaches is learning from expert demonstrations. In the past, multiple ideas have been proposed to make good use of demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing an additional cost function. We propose a new method, able to leverage both demonstrations and episodes collected online, in any sparse-reward environment, with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6-degrees-of-freedom robot arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, two improvements from previous works, integrated into our method, allow our approach to outperform all baselines.
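A minimal sketch of the reward-relabelling idea described above is given below: demonstration transitions receive a fixed bonus, and episodes that end in success have the same bonus applied to their transitions before they enter the replay buffer. The bonus value and the replay-buffer interface are illustrative assumptions.

```python
# A hedged sketch of reward relabelling for expert imitation and self-imitation.
# The replay_buffer.add(...) interface and bonus value are assumed for illustration.
def add_demonstrations(replay_buffer, demo_transitions, bonus=1.0):
    for (s, a, r, s_next, done) in demo_transitions:
        replay_buffer.add(s, a, r + bonus, s_next, done)   # expert-imitation bonus

def add_episode(replay_buffer, episode_transitions, success, bonus=1.0):
    extra = bonus if success else 0.0                      # self-imitation bonus
    for (s, a, r, s_next, done) in episode_transitions:
        replay_buffer.add(s, a, r + extra, s_next, done)
```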
Goal-conditioned reinforcement learning (GCRL), related to a set of complex RL problems, trains an agent to achieve different goals under particular scenarios. Compared to standard RL solutions, which learn a policy based solely on states or observations, GCRL additionally requires the agent to make decisions according to different goals. In this survey, we provide a comprehensive overview of the challenges and algorithms for GCRL. Firstly, we answer the basic questions studied in this field. Then, we explain how goals are represented and present how existing solutions are designed from different perspectives. Finally, we conclude and discuss potential future prospects that recent research focuses on.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.