Smart Paper Notes

Fast Lifelong Adaptive Inverse Reinforcement Learning from Demonstrations

Letian Chen , Sravan Jayanthi , Rohan Paleja , Daniel Martin , Viacheslav Zakharov , Matthew Gombolay

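Categories: Machine Learning | Robotics

2022-09-24

Learning from Demonstration (LfD) approaches let end users teach robots new tasks by demonstrating the desired behavior, democratizing access to robotics. However, current LfD frameworks can neither adapt quickly to heterogeneous human demonstrations nor be deployed at scale in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed in lifelong deployments, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks on three continuous control tasks, with an average 57% improvement in policy returns and an average of 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR on a real-robot table tennis task.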
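The policy mixtures mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how a mixture is computed, so the weighted average of prototype actions below is an assumption made purely for illustration; `mixture_action` and its parameters are hypothetical names, not part of FLAIR's code.

```python
def mixture_action(prototype_actions, weights):
    """Blend the actions proposed by several prototype policies.

    prototype_actions: list of action vectors, one per prototype policy.
    weights: non-negative mixture weights, one per prototype policy.
    Assumption: the mixture is a normalized weighted average of actions.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("mixture weights must have a positive sum")
    normalized = [w / total for w in weights]
    dim = len(prototype_actions[0])
    # Weighted sum over prototypes, computed separately per action dimension.
    return [
        sum(w * action[d] for w, action in zip(normalized, prototype_actions))
        for d in range(dim)
    ]
```

Under this assumption, adapting to a new demonstration would mean fitting only the small weight vector while the prototype policies stay fixed.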


Few-Shot Preference Learning for Human-in-the-Loop RL

Joey Hejna , Dorsa Sadigh

Categories: Robotics | Artificial Intelligence | Machine Learning

2022-12-06

While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to minimize the amount of data required for learning reward functions, we take an opposite approach: expanding the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20×, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
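Learning a reward function from pairwise human preferences can be sketched as follows. The abstract does not name a preference model, so the Bradley-Terry form used here (a common choice in preference-based RL) is an assumption, and the function names are hypothetical.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    rewards_a, rewards_b: predicted per-step rewards for each segment.
    Assumption: Bradley-Terry model on summed rewards, i.e.
    exp(R_a) / (exp(R_a) + exp(R_b)), written as a logistic function.
    """
    return_a, return_b = sum(rewards_a), sum(rewards_b)
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss for one query; label is 1.0 if the human chose A, else 0.0."""
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Minimizing this loss over the reward model's parameters pushes the predicted return of the preferred segment above that of the rejected one; a pre-trained model would start from parameters already fitted on prior tasks.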


Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction

Shaunak A. Mehta , Dylan P. Losey

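Categories: Robotics

2022-07-07

Humans can leverage physical interaction to teach robot arms. This physical interaction takes different forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches either focus on learning from a single modality or combine multiple interaction types by assuming that the robot has prior information about the human's intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study, we demonstrate that our proposed approach learns manipulation tasks from physical human interaction more accurately than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/fsujstyveku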


Learning Control Policies for Fall prevention and safety in bipedal locomotion

Visak Kumar

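Categories: Robotics | Artificial Intelligence

2022-01-04

The ability to recover from an unexpected external perturbation is a fundamental motor skill in bipedal locomotion. An effective response includes the ability not just to recover balance and maintain stability, but also to fall in a safe manner when balance recovery is physically infeasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that aid humans in walking, designing controllers that provide this stability and safety can prevent damage to robots or prevent injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, non-linear, and under-actuated system with contacts. Despite prior advancements using model-based and optimization methods, challenges such as the need for extensive domain knowledge, relatively large computation time, and limited robustness to changes in dynamics still make this an open problem. In this thesis, to address these issues, we develop learning-based algorithms capable of synthesizing push recovery control policies for two kinds of robots: humanoid robots and assistive robotic devices that assist in bipedal locomotion. Our work branches into two closely related directions: 1) learning safe falling and fall prevention strategies for humanoid robots, and 2) learning fall prevention strategies for humans using a robotic assistive device. To achieve this, we introduce a set of Deep Reinforcement Learning (DRL) algorithms to learn control policies that improve safety while using these robots.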


The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types

Gaurav R. Ghosal , Matthew Zurek , Daniel S. Brown , Anca D. Dragan

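Categories: Machine Learning | Artificial Intelligence

2022-08-23

When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works have opted to fix this coefficient regardless of the type or quality of human feedback. However, in some settings, giving a demonstration may be much more difficult than answering a comparison query. In this case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with both simulated feedback and a user study. We find that when learning from a single feedback type, overestimating human rationality can have dire effects on reward accuracy and regret. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative. When the human acts very suboptimally, comparisons actually become more informative, even when the rationality level is the same for both. Moreover, when the robot gets to decide which feedback type to ask for, it gains a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when agents actively learn from multiple feedback types.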
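The role of the rationality coefficient can be made concrete with a short sketch of a noisy-rational (Boltzmann) choice model. The abstract does not write out the model, so the softmax form and the name `beta` for the coefficient are assumptions following common usage; `boltzmann_choice_probs` is a hypothetical helper.

```python
import math


def boltzmann_choice_probs(rewards, beta):
    """Choice probabilities of a noisy-rational human.

    rewards: reward of each available option under the true reward function.
    beta: rationality coefficient; 0 gives uniformly random choices,
          larger values concentrate probability on the best option.
    Assumption: P(option i) is proportional to exp(beta * reward_i).
    """
    scaled = [beta * r for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `boltzmann_choice_probs([1.0, 0.0], 0.0)` treats both options as equally likely, while a larger `beta` shifts probability toward the first option. Assuming a `beta` that is too high therefore reads noisy feedback as stronger evidence than it is.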


Co-Imitation: Learning Design and Behaviour by Imitation

Chang Rajani , Karol Arndt , David Blanco-Mulero , Kevin Sebastian Luck , Ville Kyrki

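Categories: Machine Learning | Artificial Intelligence | Robotics

2022-09-02

The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both the body and the behaviour of a system to a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve system performance. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end, we propose a co-imitation methodology for adapting behaviour and morphology by matching the state distributions of the demonstrator. Specifically, we focus on the challenging scenario with mismatched state and action spaces between the two agents. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging, and kicking skills onto a simulated humanoid.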


Imitation learning: A survey of learning methods

Categories:

Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years; however, the field has recently been gaining attention due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations, without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction, such as humanoid robots, self-driving vehicles, human-computer interaction, and computer games, to name a few. However, specialized algorithms are needed to learn models effectively and robustly, as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field and highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games, as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications, and highlight current and future research directions.


Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience

Marwa Abdulhai , Natasha Jaques , Sergey Levine

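Categories: Machine Learning

2022-08-09

This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior. IRL can provide a generalizable and compact representation for apprenticeship learning, and enables accurately inferring the preferences of a human in order to assist them. However, effective IRL is challenging because many reward functions can be compatible with an observed behavior. We focus on how prior reinforcement learning (RL) experience can be leveraged to make learning these preferences faster and more efficient. We propose the IRL algorithm BASIS (Behavior Acquisition through Successor-feature Intention inference from Samples), which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain. When exposed to just a few expert demonstrations optimizing a novel goal, the agent uses its basis to quickly and effectively infer the reward function. Our experiments reveal that our method is highly effective at inferring and optimizing demonstrated reward functions, accurately inferring reward functions from fewer than 100 trajectories.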


Neural Approaches to Co-Optimization in Robotics

Charles Schaff

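Categories: Robotics

2022-09-01

Robots and intelligent systems that sense or interact with the world are increasingly being used to automate a wide array of tasks. The ability of these systems to complete these tasks depends on a broad range of technologies, such as the mechanical and electrical parts that make up the physical body of the robot and its sensors, perception algorithms to perceive the environment, and planning and control algorithms to produce meaningful actions. Therefore, it is often necessary to consider the interactions between these components when designing an embodied system. This thesis explores work on the task-driven co-optimization of robotic systems in an end-to-end manner, simultaneously optimizing the physical components of the system together with inference or control algorithms directly for task performance. We start by considering the problem of optimizing a beacon-based localization system directly for localization accuracy. Designing such a system involves placing beacons throughout the environment and inferring location from sensor readings. In our work, we develop a deep learning approach that optimizes both beacon placement and location inference directly for localization accuracy. We then turn our attention to the related problem of task-driven optimization of robots and their controllers. In our work, we start by proposing a data-efficient algorithm based on multi-task reinforcement learning. Our approach efficiently optimizes both physical design and control parameters directly for task performance by leveraging a design-conditioned controller capable of generalizing over the space of physical designs. We then follow this up with an extension that allows optimization over discrete morphological parameters, such as the number and configuration of limbs. Finally, we conclude by exploring the fabrication and deployment of optimized soft robots.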


Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

Chelsea Finn , Sergey Levine , Pieter Abbeel

Categories:

2016-03-01

Reinforcement learning can acquire complex behaviors from high-level specifications. However, defining a cost function that can be optimized effectively and encodes the correct task is challenging in practice. We explore how inverse optimal control (IOC) can be used to learn behaviors from demonstrations, with applications to torque control of high-dimensional robotic systems. Our method addresses two key challenges in inverse optimal control: first, the need for informative features and effective regularization to impose structure on the cost, and second, the difficulty of learning the cost function under unknown dynamics for high-dimensional continuous systems. To address the former challenge, we present an algorithm capable of learning arbitrary nonlinear cost functions, such as neural networks, without meticulous feature engineering. To address the latter challenge, we formulate an efficient sample-based approximation for MaxEnt IOC. We evaluate our method on a series of simulated tasks and real-world robotic manipulation problems, demonstrating substantial improvement over prior methods both in terms of task complexity and sample efficiency.


Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors

Niklas Freymuth , Nicolas Schreiber , Philipp Becker , Aleksandar Taranovic , Gerhard Neumann

Categories: Robotics | Machine Learning

2022-10-17

Humans intuitively solve tasks in versatile ways, varying their behavior in terms of trajectory-based planning and for individual steps. Thus, they can easily generalize and adapt to new and changing environments. Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting, making it difficult for them to imitate human behavior in case of versatile demonstrations. Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility. To facilitate generalization to novel task configurations, we do not directly match the agent's and expert's trajectory distributions but rather work with concise geometric descriptors which generalize well to unseen task configurations. We empirically validate our method on various robot tasks using versatile human demonstrations and compare to imitation learning algorithms in a state-action setting as well as a trajectory-based setting. We find that the geometric descriptors greatly help in generalizing to new task configurations and that combining them with our distribution-matching objective is crucial for representing and reproducing versatile behavior.


Goal-Aware Generative Adversarial Imitation Learning from Imperfect Demonstration for Robotic Cloth Manipulation

Yoshihisa Tsurumine , Takamitsu Matsubara

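Categories: Robotics

2022-09-21

Generative Adversarial Imitation Learning (GAIL) can learn policies from demonstrations without explicitly defining a reward function. GAIL has the potential to learn policies with high-dimensional observations, such as images, as input. By applying GAIL to a real robot, it may be possible to obtain robot policies for daily activities like washing, folding clothes, cooking, and cleaning. However, human demonstration data are often imperfect due to mistakes, which degrades the performance of the resulting policies. We address this issue by focusing on the following features: 1) many robotic tasks are goal-reaching tasks, and 2) labeling such goal states in demonstration data is relatively easy. With these in mind, this paper proposes Goal-Aware Generative Adversarial Imitation Learning (GA-GAIL), which trains a policy by introducing a second discriminator that distinguishes the goal state, in parallel with the first discriminator that indicates the demonstration data. This extends the standard GAIL framework to learn desirable policies more robustly, even from imperfect demonstrations, through a goal-state discriminator that promotes reaching the goal state. Furthermore, GA-GAIL employs the Entropy-maximizing Deep P-Network (EDPN) as a generator, which considers both smoothness and causal entropy in the policy update, to achieve stable policy learning from the two discriminators. Our proposed method was successfully applied to two real-robot cloth-manipulation tasks: turning a handkerchief over and folding clothes. We confirmed that it learns cloth-manipulation policies without task-specific reward function design. Videos of the real experiments are available at https://youtu.be/h_nii2ooure.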


Visual Adversarial Imitation Learning using Variational Models

Rafael Rafailov , Tianhe Yu , Aravind Rajeswaran , Chelsea Finn

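Categories: Machine Learning | Artificial Intelligence | Robotics

2021-07-16

Reward function specification, which requires considerable human effort and iteration, remains a major impediment to learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided with a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. To address these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability than prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results, including videos, can be found online at https://sites.google.com/view/variational-mail.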


CoMPS: Continual Meta Policy Search

Glen Berseth , Zhiwei Zhang , Grace Zhang , Chelsea Finn , Sergey Levine

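Categories: Machine Learning | Artificial Intelligence | Robotics

2021-12-08

We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks. However, they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training in an incremental fashion, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task using RL, and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.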


Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations

Chenhao Li , Marin Vlastelica , Sebastian Blaes , Jonas Frey , Felix Grimminger , Georg Martius

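Categories: Robotics | Artificial Intelligence | Machine Learning

2022-06-23

Learning agile skills is one of the main challenges in robotics. To this end, reinforcement learning approaches have achieved impressive results. These methods require explicit task information, in the form of either a reward function or an expert that can be queried in simulation to provide a target control output, which limits their applicability. In this work, we propose a generative adversarial method for inferring reward functions from partial and potentially physically incompatible demonstrations, enabling successful skill acquisition where reference or expert demonstrations are not easily accessible. Moreover, we show that by using a Wasserstein GAN formulation and transitions from demonstrations with rough and partial information as input, we are able to extract policies that are robust and capable of imitating the demonstrated behaviors. Finally, the obtained skills, such as a backflip, are tested on an agile quadruped robot called Solo 8 and faithfully replicate hand-held human demonstrations.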


Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation

Josiah Wong , Albert Tung , Andrey Kurenkov , Ajay Mandlekar , Li Fei-Fei , Silvio Savarese , Roberto Martín-Martín

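Categories: Robotics | Artificial Intelligence | Machine Learning

2021-12-09

In mobile manipulation (MM), robots can both navigate within and interact with their environment, and are thus able to complete many more tasks than robots capable only of navigation or manipulation. In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks. Much prior work has shown that IL can train visuo-motor policies for either manipulation or navigation domains, but few works have applied IL to the MM domain. Doing so is challenging for two reasons: on the data side, current interfaces make collecting high-quality human demonstrations difficult, and on the learning side, policies trained on limited data can suffer from covariate shift when deployed. To address these problems, we first propose Mobile Manipulation RoboTurk (MoMaRT), a novel teleoperation framework allowing simultaneous navigation and manipulation of mobile manipulators, and collect a first-of-its-kind large-scale dataset in a realistic simulated kitchen setting. We then propose a learned error detection system to address covariate shift by detecting when an agent is in a potential failure state. We train performant IL policies and error detectors from this data, and achieve over 45% task success rate and 85% error detection success rate across multiple multi-stage tasks when trained on expert data. Codebase, datasets, visualizations, and more are available at https://sites.google.com/view/il-for-mm/home.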


DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

Yuzhe Qin , Yueh-Hua Wu , Shaowei Liu , Hanwen Jiang , Ruihan Yang , Yang Fu , Xiaolong Wang

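Categories: Machine Learning | Computer Vision | Robotics

2021-08-12

While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline, DexMV (Dexterous Manipulation from Videos), for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand, and (ii) a computer vision system to record large-scale demonstrations of a human hand performing the same tasks. In our novel pipeline, we extract 3D hand and object poses from videos, and propose a novel demonstration translation method to convert human motion into robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations can indeed improve robot learning by a large margin and solve complex tasks that reinforcement learning alone cannot solve. Project page with video: https://yzqin.github.io/dexmv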


Accelerating Interactive Human-like Manipulation Learning with GPU-based Simulation and High-quality Demonstrations

Malte Mosbach , Kara Moraw , Sven Behnke

Categories: Robotics | Artificial Intelligence | Machine Learning

2022-12-05

Dexterous manipulation with anthropomorphic robot hands remains a challenging problem in robotics because of the high-dimensional state and action spaces and complex contacts. Nevertheless, skillful closed-loop manipulation is required to enable humanoid robots to operate in unstructured real-world environments. Reinforcement learning (RL) has traditionally imposed enormous interaction data requirements for optimizing such complex control problems. We introduce a new framework that leverages recent advances in GPU-based simulation along with the strength of imitation learning in guiding policy search towards promising behaviors to make RL training feasible in these domains. To this end, we present an immersive virtual reality teleoperation interface designed for interactive human-like manipulation on contact-rich tasks and a suite of manipulation environments inspired by tasks of daily living. Finally, we demonstrate the complementary strengths of massively parallel RL and imitation learning, yielding robust and natural behaviors. Videos of trained policies, our source code, and the collected demonstration datasets are available at https://maltemosbach.github.io/interactive_human_like_manipulation/.


Visual Reinforcement Learning with Imagined Goals

Ashvin Nair , Vitchyr Pong , Murtaza Dalal , Shikhar Bahl , Steven Lin , Sergey Levine

Categories:

2018-07-12

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
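The goal relabeling and reward computation described in this abstract can be sketched as follows. The abstract gives no details of either step, so two assumptions are made here: stored transitions are relabeled with a goal taken from a later state of the same trajectory, and the reward is the negative Euclidean distance between latent encodings. All function names are hypothetical, and the latent vectors stand in for the output of a learned encoder.

```python
import math
import random


def latent_distance_reward(z_state, z_goal):
    """Reward for goal reaching: negative Euclidean distance in latent space (assumed form)."""
    return -math.sqrt(sum((s - g) ** 2 for s, g in zip(z_state, z_goal)))


def relabel_with_future_goals(latent_trajectory, rng):
    """Relabel each transition with a goal sampled from later in the same trajectory.

    latent_trajectory: list of latent state vectors z_0 ... z_T.
    rng: a random.Random instance, passed in for reproducibility.
    Returns a list of (z_t, z_next, z_goal, reward) tuples.
    """
    relabeled = []
    last = len(latent_trajectory)
    for t in range(last - 1):
        z_next = latent_trajectory[t + 1]
        # Any state reached at step t+1 or later serves as a goal that was actually achieved.
        z_goal = latent_trajectory[rng.randrange(t + 1, last)]
        reward = latent_distance_reward(z_next, z_goal)
        relabeled.append((latent_trajectory[t], z_next, z_goal, reward))
    return relabeled
```

Because every relabeled goal was actually reached, even unsuccessful practice episodes yield transitions with informative rewards, which is one way such a scheme could improve sample efficiency for an off-policy learner.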


Improving Policy Optimization with Generalist-Specialist Learning

Zhiwei Jia , Xuanlin Li , Zhan Ling , Shuang Liu , Yiran Wu , Hao Su

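Categories: Machine Learning

2022-06-26

Generalization in deep reinforcement learning over unseen environment variations usually requires policy learning over a large set of diverse training variations. We empirically observe that an agent trained on many variations (a generalist) tends to learn faster at the beginning, yet its performance plateaus at a suboptimal level for a long time. In contrast, an agent trained on only a few variations (a specialist) can often achieve high returns under a limited computational budget. To get the best of both worlds, we propose a novel generalist-specialist training framework. Specifically, we first train a generalist on all environment variations. When it fails to improve, we launch a large population of specialists with weights cloned from the generalist, each trained to master a selected small subset of variations. We finally resume training the generalist with auxiliary rewards induced by demonstrations from all specialists. In particular, we investigate the timing for starting specialist training and compare strategies for learning generalists with assistance from specialists. We show that this framework pushes the envelope of policy learning on several challenging and popular benchmarks, including Procgen, Meta-World, and ManiSkill.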
