This paper addresses the problem of inverse reinforcement learning (IRL): inferring an agent's reward function from observations of its behavior. IRL can provide a generalizable and compact representation for apprenticeship learning, and enable accurately inferring the preferences of a human in order to assist them. However, effective IRL is challenging, because many reward functions can be compatible with the observed behavior. We focus on how prior reinforcement learning (RL) experience can be leveraged to make learning these preferences faster and more efficient. We propose the IRL algorithm BASIS (Behavior Acquisition through Successor-feature Intention inference from Samples), which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain. When exposed to just a few expert demonstrations optimizing a novel goal, the agent uses its basis to quickly and effectively infer the reward function. Our experiments show that our approach is highly effective at inferring and optimizing demonstrated reward functions, accurately inferring rewards from fewer than 100 trajectories.
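To make the role of successor features concrete, here is a minimal sketch (not BASIS itself) of how a linear-reward assumption, r(s, a) = φ(s, a) · w, turns reward inference into estimating a weight vector w from feature expectations. The feature-matching heuristic, function names, and toy data below are illustrative assumptions.

```python
import numpy as np

# Sketch (not the authors' exact method): with successor features, rewards are
# assumed linear in known features, r(s, a) = phi(s, a) . w, so
# Q^pi(s, a) = psi^pi(s, a) . w, where psi are discounted feature sums.
# Inferring a demonstrator's reward then reduces to estimating w.

def discounted_feature_sum(features, gamma=0.99):
    """Accumulate gamma-discounted features over one trajectory."""
    return sum((gamma ** t) * f for t, f in enumerate(features))

def infer_reward_weights(demo_feature_trajs, baseline_feature_trajs, gamma=0.99):
    """Toy inference of w: weights (unit norm) under which the demonstrations'
    discounted feature expectations score higher than a baseline's.
    This difference-of-feature-expectations rule is a classic
    apprenticeship-learning heuristic, used here only for illustration."""
    mu_demo = np.mean([discounted_feature_sum(t, gamma) for t in demo_feature_trajs], axis=0)
    mu_base = np.mean([discounted_feature_sum(t, gamma) for t in baseline_feature_trajs], axis=0)
    w = mu_demo - mu_base
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Usage with random 4-dimensional features, purely for illustration.
rng = np.random.default_rng(0)
demos = [rng.normal(size=(20, 4)) + 1.0 for _ in range(5)]   # expert-like trajectories
others = [rng.normal(size=(20, 4)) for _ in range(5)]        # baseline trajectories
print("estimated reward weights:", infer_reward_weights(demos, others))
```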
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework, now capable of learning complex policies in high-dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in the real-world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, and inverse reinforcement learning, which are related to but are not classical RL algorithms. The role of simulators in training agents, as well as methods to validate, test, and robustify existing solutions in RL, is discussed.
Recent progress in state-only imitation learning extends the applicability of imitation learning to real-world settings by relieving the need to observe expert actions. However, existing solutions only learn to extract a state-to-action mapping policy from the data, without considering how the expert plans toward the goal. This hinders the ability to leverage demonstrations and limits the flexibility of the policy. In this paper, we introduce Decoupled Policy Optimization (DEPO), which explicitly decouples the policy into a high-level state planner and an inverse dynamics model. With embedded decoupled policy gradient and generative adversarial training, DEPO enables knowledge transfer to different action spaces or state transition dynamics, and can generalize the planner to state regions without demonstrations. Our in-depth experimental analysis shows the effectiveness of DEPO in learning a generalizable goal-state planner while achieving the best imitation performance. We demonstrate DEPO's appeal for cross-task transfer via pre-training, as well as its potential for co-training with diverse skills.
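As a rough illustration of the decoupling the abstract describes, the sketch below factors a policy into a high-level state planner and an inverse dynamics model and composes them at action-selection time. The module names, architectures, and sizes are assumptions for illustration, not DEPO's actual implementation.

```python
import torch
import torch.nn as nn

class StatePlanner(nn.Module):
    """High-level planner: proposes a target next state given the current state."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))
    def forward(self, s):
        return self.net(s)

class InverseDynamics(nn.Module):
    """Low-level model: recovers the action that realizes the transition s -> s'."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))
    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def act(planner, inv_dyn, s):
    """Decoupled policy: plan a target state, then infer the action that reaches it."""
    s_target = planner(s)
    return inv_dyn(s, s_target)

planner, inv_dyn = StatePlanner(8), InverseDynamics(8, 2)
action = act(planner, inv_dyn, torch.zeros(1, 8))
```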
Methods for learning from demonstration (LfD) have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would place a substantial burden on the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of continual learning from demonstration, which enables the agent to continually build on knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the amount of demonstrations required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.
Traffic simulators are important components in the operation and planning of transportation systems. Conventional traffic simulators typically employ calibrated physical car-following models to describe vehicle behaviors and their interactions with the traffic environment. However, there is no universal physical model that can accurately predict patterns of vehicle behavior in different scenarios. Given the non-stationary nature of traffic dynamics, a fixed physical model tends to be less effective in complex environments. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem and propose a parameter-sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate the trajectories of real-world vehicles while recovering a reward function that reveals the vehicles' true objectives and is invariant across different dynamics. Extensive experiments on both synthetic and real-world datasets demonstrate the superior performance of our approach compared to state-of-the-art methods, as well as its robustness to shifting traffic dynamics.
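For context, adversarial IRL models of this kind are typically built around an AIRL-style discriminator, D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a|s)), whose recovered reward log D − log(1 − D) simplifies to f(s, a) − log π(a|s); parameter sharing would mean all vehicles use the same f and π networks. The snippet below is a generic sketch of these two identities, not this paper's specific model.

```python
import torch

def airl_discriminator(f_value, log_pi):
    """AIRL-style discriminator D = exp(f) / (exp(f) + pi(a|s)), written in the
    algebraically identical logistic form. f_value: learned reward/advantage
    estimate f(s, a); log_pi: log pi(a|s) of the current policy."""
    return torch.sigmoid(f_value - log_pi)

def airl_reward(f_value, log_pi):
    """Recovered reward log D - log(1 - D), which simplifies to f(s,a) - log pi(a|s)."""
    return f_value - log_pi
```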
Reward function specification, which requires considerable human effort and iteration, remains a major impediment to learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. To address these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that, by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results can be found online at \url{https://sites.google.com/view/variational-mail}.
Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its superior expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL), and transformer-based models have manifested their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL by transformer (transformer-based RL, or TRL), in order to explore its development trajectory and future trend. We group existing developments into two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
In many sequential decision-making problems (e.g., robotic control, game playing, sequential prediction), human or expert data is available and contains useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is a simple method that is widely used due to its straightforward implementation and stable convergence, but it does not exploit any information about the environment's dynamics. Many existing methods that do are difficult to train in practice, due to adversarial optimization over reward and policy approximators, or due to biased, high-variance gradient estimators. We introduce a method for dynamics-aware IL that avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating that our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn), obtains state-of-the-art results in both offline and online imitation learning settings, significantly outperforming existing methods in the number of required environment interactions and in scalability to high-dimensional spaces, often by more than 3x.
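The phrase "a single Q-function implicitly representing both reward and policy" rests on the soft (entropy-regularized) Bellman identities; the sketch below shows how a policy and a reward can be read off a learned Q for discrete actions. It illustrates only these identities, not the full IQ-Learn training objective.

```python
import torch
import torch.nn.functional as F

def soft_value(q, alpha=1.0):
    """Soft state value: V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    return alpha * torch.logsumexp(q / alpha, dim=-1)

def implicit_policy(q, alpha=1.0):
    """Soft-optimal policy implied by Q: pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    return F.softmax(q / alpha, dim=-1)

def implicit_reward(q_sa, v_next, gamma=0.99):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * V(s')."""
    return q_sa - gamma * v_next

# Example with a random Q-table over 3 actions for a batch of 5 states.
q = torch.randn(5, 3)
pi = implicit_policy(q)                                   # policy read off Q
r = implicit_reward(q[:, 0], soft_value(torch.randn(5, 3)))  # reward read off Q
```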
While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
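Preference-based reward learning of this kind is commonly built on a Bradley-Terry model, where the probability that one segment is preferred over another is a logistic function of their summed predicted rewards. The sketch below shows that cross-entropy loss; the tiny per-step reward model is an assumption for illustration, and the multi-task pre-training stage is omitted.

```python
import torch

def preference_loss(reward_model, seg1, seg2, prefs):
    """Bradley-Terry preference loss used in preference-based RL.
    seg1, seg2: (batch, T, obs_dim) trajectory segments.
    prefs: (batch,) with 1.0 if seg1 is preferred, 0.0 if seg2 is preferred."""
    r1 = reward_model(seg1).sum(dim=1).squeeze(-1)    # summed reward of segment 1
    r2 = reward_model(seg2).sum(dim=1).squeeze(-1)    # summed reward of segment 2
    logits = r1 - r2                                  # P(seg1 > seg2) = sigmoid(r1 - r2)
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, prefs)

# Illustrative usage with a tiny per-step reward model and random segments.
reward_model = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                                   torch.nn.Linear(32, 1))
loss = preference_loss(reward_model,
                       torch.randn(8, 25, 4), torch.randn(8, 25, 4),
                       torch.randint(0, 2, (8,)).float())
loss.backward()
```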
In this paper, hypernetworks are trained to generate behaviors across a range of unseen task conditions, via a novel TD-based training objective and data from a set of near-optimal RL solutions for training tasks. This work relates to meta RL, contextual RL, and transfer learning, with a particular focus on zero-shot performance at test time, enabled by knowledge of the task parameters (also known as context). Our technical approach is based upon viewing each RL algorithm as a mapping from the MDP specifics to the near-optimal value function and policy, and seeks to approximate this mapping with a hypernetwork that can generate near-optimal value functions and policies given the parameters of the MDP. We show that, under certain conditions, this mapping can be considered as a supervised learning problem. We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from DeepMind Control Suite. Our method demonstrates significant improvements over baselines from multitask and meta RL approaches.
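A minimal sketch of the hypernetwork idea: a network that maps task (context) parameters to the weights of a small policy network, so a new task's context alone yields a zero-shot policy. The architecture, sizes, and the linear policy head below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PolicyHypernetwork(nn.Module):
    """Maps MDP context parameters to the weights of a small linear policy.
    Purely illustrative; the paper's architecture and training targets differ."""
    def __init__(self, context_dim, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        n_params = obs_dim * act_dim + act_dim            # weight matrix + bias
        self.hyper = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_params))

    def forward(self, context, obs):
        params = self.hyper(context)                      # (batch, n_params)
        W = params[:, : self.obs_dim * self.act_dim].view(-1, self.act_dim, self.obs_dim)
        b = params[:, self.obs_dim * self.act_dim:]       # (batch, act_dim)
        return torch.einsum("bao,bo->ba", W, obs) + b     # output of the generated policy

# Zero-shot use: a new task's context alone produces a policy for that task.
net = PolicyHypernetwork(context_dim=3, obs_dim=10, act_dim=4)
actions = net(torch.randn(2, 3), torch.randn(2, 10))
```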
Imitation learning (IL) is a framework for learning to imitate expert behavior from demonstrations. Recently, IL has shown promising results on high-dimensional and control tasks. However, IL typically suffers from sample inefficiency in terms of environment interactions, which severely restricts its application to simulated domains. In industrial applications, the learner often faces high interaction costs: the more it interacts with the environment, the more damage is done to the environment and to the learner itself. In this paper, we strive to improve sample efficiency by introducing a novel scheme for inverse reinforcement learning. Our method, which we call \textit{model reward function based imitation learning} (MRFIL), uses an ensemble dynamics model as the reward function, trained with expert demonstrations. The key idea is to provide the agent with rewards that encourage matching the demonstrations over long horizons, by giving a positive reward when it encounters states consistent with the expert demonstration distribution. In addition, we present a convergence guarantee for the new objective function. Experimental results show that our algorithm achieves competitive performance and significantly reduces environment interactions compared with prevailing IL methods.
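A hedged sketch of the ensemble-model-as-reward idea: dynamics models trained only on expert transitions explain in-distribution transitions well, so low prediction error on an agent transition can be rewarded. The thresholded form, names, and constants below are assumptions for illustration, not MRFIL's exact reward.

```python
import torch

def ensemble_reward(models, s, a, s_next, eps=0.1):
    """Toy reward from an ensemble of dynamics models trained on expert data:
    +1 when the observed transition is well explained by the ensemble (i.e., the
    agent is likely inside the expert distribution), 0 otherwise. The threshold
    eps and the mean-prediction error are illustrative choices."""
    with torch.no_grad():
        preds = torch.stack([m(torch.cat([s, a], dim=-1)) for m in models])
        err = (preds.mean(dim=0) - s_next).pow(2).mean(dim=-1)   # per-sample error
    return (err < eps).float()
```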
Imitation learning holds tremendous promise for efficiently learning policies for complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where, given a set of expert demonstrations, the agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interaction for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that trajectory-matching rewards can be adaptively combined with behavioral cloning, even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, OpenAI Robotics, and Meta-World benchmarks show that, on average, imitation reaches 90% of expert performance substantially faster than with prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
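As a rough picture of optimal-transport trajectory matching, the sketch below computes an entropic (Sinkhorn) coupling between agent and expert feature sequences and uses the negative transported cost as a per-timestep reward; ROT's adaptive behavior-cloning regularization is not shown. The regularization constant, iteration count, and feature dimensions are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.5, n_iters=100):
    """Entropic OT coupling between two uniform empirical distributions."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m     # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                  # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan diag(u) K diag(v)

def ot_rewards(agent_feats, expert_feats, reg=0.5):
    """Per-timestep imitation reward: negative transported cost to the expert
    trajectory (a simplified version of OT-based trajectory matching)."""
    cost = np.linalg.norm(agent_feats[:, None] - expert_feats[None, :], axis=-1)
    plan = sinkhorn(cost, reg)
    return -(plan * cost).sum(axis=1)         # one reward per agent timestep

rewards = ot_rewards(np.random.randn(50, 8), np.random.randn(60, 8))
```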
The past few years have seen rapid progress in combining reinforcement learning (RL) with deep learning. Various breakthroughs ranging from games to robotics have spurred the interest in designing sophisticated RL algorithms and systems. However, the prevailing workflow in RL is to learn tabula rasa, which may incur computational inefficiency. This precludes continuous deployment of RL algorithms and potentially excludes researchers without large-scale computing resources. In many other areas of machine learning, the pretraining paradigm has been shown to be effective in acquiring transferable knowledge, which can be utilized for a variety of downstream tasks. Recently, we saw a surge of interest in pretraining for deep RL with promising results. However, much of the research has been based on different experimental settings. Due to the nature of RL, pretraining in this field is faced with unique challenges and hence requires new design principles. In this survey, we seek to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.
Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to identify driving preferences and produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. In this paper, we show how imitation learning combined with reinforcement learning using simple rewards can substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we use a combination of imitation and reinforcement learning to train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision risk. To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data.
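A generic sketch of the combination described above: an RL term driven by simple rewards plus a behavior-cloning term that anchors the policy to logged human actions. The loss forms, weighting, batch keys, and helper names are assumptions for illustration, not the paper's training setup.

```python
import torch

def combined_policy_loss(policy, batch, rl_loss_fn, bc_weight=0.5):
    """Combined imitation + RL objective (illustrative): an RL loss computed from
    simple rewards plus a behavior-cloning term toward logged human actions."""
    rl_loss = rl_loss_fn(policy, batch)                          # e.g. an actor-critic loss
    pred_actions = policy(batch["obs"])
    bc_loss = torch.nn.functional.mse_loss(pred_actions, batch["human_actions"])
    return rl_loss + bc_weight * bc_loss

# Illustrative usage with a dummy policy and a placeholder RL loss.
policy = torch.nn.Linear(10, 2)
batch = {"obs": torch.randn(32, 10), "human_actions": torch.randn(32, 2)}
dummy_rl_loss = lambda pol, b: pol(b["obs"]).pow(2).mean()       # stand-in for a real RL loss
loss = combined_policy_loss(policy, batch, dummy_rl_loss)
```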
We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, Continual Meta Policy Search (CoMPS), that removes this limitation by meta-training in an incremental fashion, on each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL, and using the experience from RL to perform completely offline meta-learning in preparation for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.
Reinforcement learning (RL) algorithms hold the promise of enabling autonomous skill acquisition for robotic systems. However, in practice, real-world robotic RL typically requires time-consuming data collection and frequent human intervention to reset the environment. Moreover, robot policies learned with RL often fail when deployed in settings beyond those they were trained in. In this work, we study how these challenges can be tackled by effective utilization of diverse offline datasets collected from previously seen tasks. When facing a new task, our system adapts previously learned skills to quickly learn both to perform the new task and to return the environment to an initial state, effectively performing its own environment resets. Our empirical results demonstrate that incorporating prior data into robotic reinforcement learning enables autonomous learning, substantially improves the sample efficiency of learning, and enables better generalization.
Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl
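Of the three ingredients, oversampling of demonstration data is the simplest to picture: every training batch draws a fixed share of transitions from the demonstration buffer no matter how large the online buffer grows. The fraction and buffer representation below are illustrative assumptions, not the paper's exact values.

```python
import random

def sample_mixed_batch(demo_buffer, online_buffer, batch_size=256, demo_fraction=0.25):
    """Oversample demonstrations: each batch draws a fixed share from the demo
    buffer regardless of how large the online buffer grows (illustrative values)."""
    n_demo = int(batch_size * demo_fraction)
    batch = random.choices(demo_buffer, k=n_demo)                 # demo transitions
    batch += random.choices(online_buffer, k=batch_size - n_demo) # online transitions
    random.shuffle(batch)
    return batch

# Toy usage with placeholder transition records.
batch = sample_mixed_batch(list(range(10)), list(range(100, 1000)))
```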
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small sets of demonstration data to massively accelerate the learning process, and that can automatically assess the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN), as it starts with better scores on the first million steps on 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
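The supervised component combined with the TD updates is a large-margin classification loss that forces the demonstrator's action to have the highest (margin-augmented) Q-value. Below is a sketch of that term alone, with the margin value chosen for illustration.

```python
import torch

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """DQfD-style supervised term: J_E = max_a [Q(s,a) + l(a, a_E)] - Q(s, a_E),
    where l is `margin` for non-expert actions and 0 for the expert action.
    q_values: (batch, n_actions); expert_actions: (batch,) integer actions."""
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)         # zero margin at a_E
    augmented_max = (q_values + margins).max(dim=1).values
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - q_expert).mean()

# In DQfD this term is added to 1-step and n-step TD losses plus L2 regularization.
loss = large_margin_loss(torch.randn(32, 6), torch.randint(0, 6, (32,)))
```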
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years; however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations, without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction, such as humanoid robots, self-driving vehicles, human-computer interaction and computer games, to name a few. However, specialized algorithms are needed to effectively and robustly learn models, as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games, as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications, and highlight current and future research directions.
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies but often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, which are particularly pertinent to real-world applications of RL. Although the data-driven RL methods generate good results, we show that the performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We utilize Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continuous improvement with online data collection.
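The Behaviour Cloning baseline mentioned above amounts to supervised learning of the logged scheduler's decisions; the sketch below pre-trains a discrete scheduling policy on logged (state, action) pairs before any online fine-tuning. The network sizes, data shapes, and function names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def behaviour_clone(policy, logged_states, logged_actions, epochs=10, lr=1e-3):
    """Pre-train a discrete scheduling policy on logged (state, action) pairs;
    online RL fine-tuning would start from these weights (illustrative sketch)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = policy(logged_states)          # predicted action logits
        loss = loss_fn(logits, logged_actions)  # match the logged decisions
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy

# Toy usage: 16-dimensional job/cluster state, 8 discrete scheduling actions.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
behaviour_clone(policy, torch.randn(1024, 16), torch.randint(0, 8, (1024,)))
```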