Smart Paper Notes

Learning Representations that Enable Generalization in Assistive Tasks

Jerry Zhi-Yang He , Aditi Raghunathan , Daniel S. Brown , Zackory Erickson , Anca D. Dragan

Categories: Machine Learning | Artificial Intelligence | Robotics

2022-12-05

Recent work in sim2real has successfully enabled robots to act in physical environments by training in simulation with a diverse "population" of environments (i.e. domain randomization). In this work, we focus on enabling generalization in assistive tasks: tasks in which the robot is acting to assist a user (e.g. helping someone with motor impairments with bathing or with scratching an itch). Such tasks are particularly interesting relative to prior sim2real successes because the environment now contains a human who is also acting. This complicates the problem because the diversity of human users (instead of merely physical environment parameters) is more difficult to capture in a population, thus increasing the likelihood of encountering out-of-distribution (OOD) human policies at test time. We advocate that generalization to such OOD policies benefits from (1) learning a good latent representation for human policies that test-time humans can accurately be mapped to, and (2) making that representation adaptable with test-time interaction data, instead of relying on it to perfectly capture the space of human policies based on the simulated population only. We study how to best learn such a representation by evaluating on purposefully constructed OOD test policies. We find that sim2real methods that encode environment (or population) parameters and work well in tasks that robots do in isolation, do not work well in assistance. In assistance, it seems crucial to train the representation based on the history of interaction directly, because that is what the robot will have access to at test time. Further, training these representations to then predict human actions not only gives them better structure, but also enables them to be fine-tuned at test-time, when the robot observes the partner act. https://adaptive-caregiver.github.io.


Visual Reinforcement Learning with Imagined Goals

Ashvin Nair , Vitchyr Pong , Murtaza Dalal , Shikhar Bahl , Steven Lin , Sergey Levine

Categories:

2018-07-12

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
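As a rough illustration of the retroactive goal relabeling and latent goal-reaching reward mentioned in the abstract above, the sketch below relabels each transition of a stored trajectory with a goal taken from a later latent state of the same trajectory, then recomputes the reward as the negative Euclidean distance in latent space. The function names, the tuple layout, and the future-state sampling rule are assumptions made for this example; the abstract does not specify them, and the encoder that would produce the latent vectors is omitted.

```python
import math
import random


def latent_reward(z, z_goal):
    """Negative Euclidean distance between a latent state and a latent goal."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_goal)))


def relabel_with_future_goals(trajectory, rng):
    """Relabel transitions with goals drawn from later states (assumed scheme).

    trajectory: list of latent states z_0 .. z_T (tuples of floats).
    rng: a random.Random instance, passed in so the result is reproducible.
    Returns a list of (z_t, z_next, z_goal, reward) tuples.
    """
    relabeled = []
    for t in range(len(trajectory) - 1):
        # Pick a goal from strictly later in the same trajectory.
        goal_index = rng.randrange(t + 1, len(trajectory))
        z_goal = trajectory[goal_index]
        # The reward is recomputed for the new goal, using the next state.
        reward = latent_reward(trajectory[t + 1], z_goal)
        relabeled.append((trajectory[t], trajectory[t + 1], z_goal, reward))
    return relabeled


if __name__ == "__main__":
    toy_trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    for item in relabel_with_future_goals(toy_trajectory, random.Random(0)):
        print(item)
```

Because the relabeled goals are states the agent actually reached, every stored transition yields a useful learning signal for an off-policy learner, which is one plausible reading of why relabeling helps sample efficiency.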


Learning Control Policies for Fall prevention and safety in bipedal locomotion

Visak Kumar

Categories: Robotics | Artificial Intelligence

2022-01-04

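The ability to recover from an unexpected external perturbation is a fundamental motor skill in bipedal locomotion. An effective response includes not only the ability to recover balance and maintain stability, but also the ability to fall safely when balance recovery is physically infeasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that help humans walk, designing controllers that provide this stability and safety can prevent damage to the robot or avoid injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, non-linear, and under-actuated system with contacts. Despite prior advances using model-based and optimization methods, challenges such as the need for extensive domain knowledge, relatively large computation time, and limited robustness to changes in dynamics still make this an open problem. In this thesis, to address these issues, we develop learning-based algorithms capable of synthesizing push-recovery control policies for two different kinds of robots: humanoid robots and assistive robotic devices that aid bipedal locomotion. Our work branches into two closely related directions: 1) learning safe-falling and fall-prevention strategies for humanoid robots, and 2) learning fall-prevention strategies for humans using a robotic assistive device. To achieve this, we introduce a set of deep reinforcement learning (DRL) algorithms to learn control policies that improve safety while using these robots.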


Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation

Josiah Wong , Albert Tung , Andrey Kurenkov , Ajay Mandlekar , Li Fei-Fei , Silvio Savarese , Roberto Martín-Martín

Categories: Robotics | Artificial Intelligence | Machine Learning

2021-12-09

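In mobile manipulation (MM), robots can both navigate within and interact with their environment, and are thus able to complete many more tasks than robots capable only of navigation or manipulation. In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks. Much prior work has shown that IL can train visuo-motor policies for either manipulation or navigation domains, but few works have applied IL to the MM domain. Doing so is challenging for two reasons: on the data side, current interfaces make it difficult to collect high-quality human demonstrations, and on the learning side, policies trained on limited data can suffer from covariate shift when deployed. To address these problems, we first propose Mobile Manipulation RoboTurk (MoMaRT), a novel teleoperation framework that allows simultaneous navigation and manipulation of mobile manipulators, and collect a first-of-its-kind large-scale dataset in a realistic simulated kitchen setting. We then propose a learned error detection system that addresses covariate shift by detecting when an agent is in a potential failure state. We train performant IL policies and error detectors from this data, and achieve over 45% task success rate and 85% error detection success rate across multiple multi-stage tasks when trained on expert data. Codebase, datasets, visualization, and more are available at https://sites.google.com/view/il-for-mm/home.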


PLATO: Predicting Latent Affordances Through Object-Centric Play

Suneel Belkhale , Dorsa Sadigh

Categories: Robotics

2022-03-10

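Constructing a diverse repertoire of manipulation skills in a scalable fashion remains an unsolved challenge in robotics. One way to address this challenge is unstructured human play, in which humans operate freely in an environment to reach unspecified goals. Play is a simple and cheap method for collecting diverse user demonstrations with broad state and goal coverage over an environment. Because of this diverse coverage, existing approaches for learning from play are more robust to online policy deviations from the offline data distribution. However, these methods often struggle to learn under scene variation and on challenging manipulation primitives, in part because they improperly associate complex behaviors with the scene changes they induce. Our insight is that an object-centric view of play data can help link human behaviors to the resulting changes in the environment, and thus improve multi-task policy learning. In this work, we construct a latent space to model object affordances (properties of an object that define its uses) in the environment, and then learn a policy to achieve the desired affordances. By modeling and predicting the desired affordance across variable-horizon tasks, our method, Predicting Latent Affordances Through Object-Centric Play (PLATO), outperforms existing methods on complex manipulation tasks in 2D and 3D object manipulation simulations and in real-world environments, for diverse types of interactions. Videos can be found on our website: https://tinyurl.com/4U23HWFV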


Learning Latent Representations to Co-Adapt to Humans

Sagar Parekh , Dylan P. Losey

Categories: Robotics | Artificial Intelligence | Machine Learning

2022-12-19

When robots interact with humans in homes, roads, or factories the human's behavior often changes in response to the robot. Non-stationary humans are challenging for robot learners: actions the robot has learned to coordinate with the original human may fail after the human adapts to the robot. In this paper we introduce an algorithmic formalism that enables robots (i.e., ego agents) to co-adapt alongside dynamic humans (i.e., other agents) using only the robot's low-level states, actions, and rewards. A core challenge is that humans not only react to the robot's behavior, but the way in which humans react inevitably changes both over time and between users. To deal with this challenge, our insight is that -- instead of building an exact model of the human -- robots can learn and reason over high-level representations of the human's policy and policy dynamics. Applying this insight we develop RILI: Robustly Influencing Latent Intent. RILI first embeds low-level robot observations into predictions of the human's latent strategy and strategy dynamics. Next, RILI harnesses these predictions to select actions that influence the adaptive human towards advantageous, high reward behaviors over repeated interactions. We demonstrate that -- given RILI's measured performance with users sampled from an underlying distribution -- we can probabilistically bound RILI's expected performance across new humans sampled from the same distribution. Our simulated experiments compare RILI to state-of-the-art representation and reinforcement learning baselines, and show that RILI better learns to coordinate with imperfect, noisy, and time-varying agents. Finally, we conduct two user studies where RILI co-adapts alongside actual humans in a game of tag and a tower-building task. See videos of our user studies here: https://youtu.be/WYGO5amDXbQ


Offline Meta-Reinforcement Learning with Online Self-Supervision

Vitchyr H. Pong , Ashvin Nair , Laura Smith , Catherine Huang , Sergey Levine

Categories: Machine Learning | Artificial Intelligence | Robotics

2021-07-08

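Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond those of online meta-RL or standard offline RL. Meta-RL learns an exploration strategy that collects data for adaptation, and also meta-trains a policy that quickly adapts to data from a new task. Because this policy was meta-trained on a fixed offline dataset, it may behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distribution shift. We propose a hybrid offline meta-RL algorithm that uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. Because it does not require reward labels, this online data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks, and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.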


Neural Approaches to Co-Optimization in Robotics

Charles Schaff

Categories: Robotics

2022-09-01

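Robots and intelligent systems that sense or interact with the world are increasingly being used to automate a wide array of tasks. The ability of these systems to complete such tasks depends on a broad range of technologies: the mechanical and electrical parts that make up the robot's physical body and its sensors, perception algorithms that perceive the environment, and planning and control algorithms that produce meaningful actions. It is therefore often necessary to consider the interactions between these components when designing an embodied system. This thesis explores the task-driven co-optimization of robotic systems in an end-to-end manner, simultaneously optimizing the physical components of the system and its inference or control algorithms directly for task performance. We start by considering the problem of optimizing a beacon-based localization system directly for localization accuracy. Designing such a system involves placing beacons throughout the environment and inferring location from sensor readings. In our work, we develop a deep learning approach that optimizes both beacon placement and location inference directly for localization accuracy. We then turn to the related problem of task-driven optimization of robots and their controllers. Here we first propose a data-efficient algorithm based on multi-task reinforcement learning. Our approach efficiently optimizes both physical design and control parameters directly for task performance by leveraging a design-conditioned controller capable of generalizing over the space of physical designs. We then follow this up with an extension that allows optimization over discrete morphological parameters, such as the number and configuration of limbs. Finally, we conclude by exploring the fabrication and deployment of optimized soft robots.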


Learning Latent Dynamics for Planning from Pixels

Danijar Hafner , Timothy Lillicrap , Ian Fischer , Ruben Villegas , David Ha , Honglak Lee , James Davidson

Categories:

2018-11-12

Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this using a latent dynamics model with both deterministic and stochastic transition components. Moreover, we propose a multi-step variational inference objective that we name latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to and sometimes higher than strong model-free algorithms.


On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Zhao Mandi , Pieter Abbeel , Stephen James

Categories: Machine Learning | Artificial Intelligence | Computer Vision | Robotics

2022-06-07

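Intelligent agents should be able to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular way to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have so far been restricted to simple environments with narrow task distributions. Meanwhile, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches in reinforcement learning as well, since they typically come at the cost of high complexity. We therefore investigate meta-RL approaches on a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs as well as, or better than, meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate evaluating future meta-RL methods on more challenging tasks and including multi-task pretraining with fine-tuning as a simple yet strong baseline.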


Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Nitish Srivastava , Walter Talbott , Martin Bertran Lopez , Shuangfei Zhai , Josh Susskind

Categories: Machine Learning | Artificial Intelligence | Robotics

2021-12-02

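Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent's latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, as well as unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model that contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods, which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.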


Latent Plans for Task-Agnostic Offline Reinforcement Learning

Erick Rosete-Beas , Oier Mees , Gabriel Kalweit , Joschka Boedecker , Wolfram Burgard

Categories: Robotics | Artificial Intelligence | Computer Vision | Machine Learning

2022-09-19

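Long-horizon everyday tasks that comprise a sequence of implicit subtasks still pose a major challenge in offline robot control. Although many prior methods have aimed to address this setting with variants of imitation learning and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. Because the two paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning with a high-level policy learned through offline reinforcement learning for skill-chaining the latent behavior priors. Experiments on various simulated and real robot control tasks show that our formulation enables previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining, with an order-of-magnitude improvement in performance over state-of-the-art baselines. We even learn one multi-task visuomotor policy for 25 distinct manipulation tasks in the real world, which outperforms both imitation learning and offline reinforcement learning techniques.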


Learn from Human Teams: a Probabilistic Solution to Real-Time Collaborative Robot Handling with Dynamic Gesture Commands

Rui Chen , Alvin Shek , Changliu Liu

Categories: Robotics

2021-12-11

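We study real-time collaborative robot (cobot) handling, where the cobot maneuvers a workpiece under human commands. This is useful when it is risky for humans to handle the workpiece directly. However, it is hard to make the cobot both easy to command and flexible in the operations it can perform. In this work, we propose a Real-Time Collaborative Robot Handling (RTCoHand) framework that allows the cobot to be controlled via user-customized dynamic gestures. This is hard because of variations among users, human motion uncertainty, and noisy human input. We model the task as a probabilistic generative process, referred to as the Conditional Collaborative Handling Process (CCHP), and learn from human-human collaboration. We thoroughly evaluate the adaptability and robustness of CCHP and apply our approach to a real-time cobot handling task with a Kinova Gen3 robot arm. We achieve seamless human-robot collaboration with both experienced and new users. Compared to classical controllers, RTCoHand allows significantly more complex maneuvers and a lower cognitive burden on the user. It also eliminates the need for trial and error, which makes it advantageous in safety-critical tasks.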


Skill-based Model-based Reinforcement Learning

Lucy Xiaoyang Shi , Joseph J. Lim , Youngwoon Lee

Categories: Machine Learning | Artificial Intelligence | Robotics

2022-07-15

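Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action for long-horizon tasks is not practical, akin to a human planning out every muscle movement. Instead, humans plan efficiently with high-level skills to solve complex tasks. Following this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts skill outcomes rather than predicting every small detail of the intermediate states step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then use the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse-reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency of both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo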


XIRL: Cross-embodiment Inverse Reinforcement Learning

Kevin Zakka , Andy Zeng , Pete Florence , Jonathan Tompson , Jeannette Bohg , Debidatta Dwibedi

Categories: Robotics | Artificial Intelligence | Computer Vision | Machine Learning

2021-06-07

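We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment demonstration videos that are robust to these differences. Specifically, we present a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL) that leverages temporal cycle-consistency constraints to learn deep visual embeddings that capture task progression from offline videos of demonstrations across multiple expert agents, each performing the same task differently because of embodiment differences. Prior to our work, producing rewards from self-supervised embeddings typically required alignment with a reference trajectory, which may be difficult to acquire under stark embodiment differences. We show empirically that if the embeddings are aware of task progress, simply taking the negative distance between the current state and the goal state in the learned embedding space is useful as a reward for training policies with reinforcement learning. We find that our learned reward function not only works for embodiments seen during training, but also generalizes to entirely new embodiments. Additionally, when transferring real-world human demonstrations to a simulated robot, we find that XIRL is more sample-efficient than current best methods. Qualitative results, code, and datasets are available at https://x-irl.github.io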
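The reward described in the abstract above, the negative distance between the current state and the goal state in a learned embedding space, can be sketched in a few lines. This is a minimal illustration only: the embeddings are plain lists of floats standing in for the output of a trained visual encoder, and the function name and the `scale` parameter are assumptions for this example, not part of the abstract.

```python
import math


def embedding_distance_reward(state_emb, goal_emb, scale=1.0):
    """Reward = -scale * Euclidean distance between two embedding vectors.

    state_emb, goal_emb: equal-length sequences of floats (hypothetical
    encoder outputs). scale is an illustrative knob, not from the abstract.
    """
    if len(state_emb) != len(goal_emb):
        raise ValueError("embeddings must have the same dimensionality")
    distance = math.sqrt(sum((s - g) ** 2 for s, g in zip(state_emb, goal_emb)))
    return -scale * distance


if __name__ == "__main__":
    goal = [3.0, 4.0]
    # The reward rises toward zero as the state embedding approaches the goal.
    for state in ([0.0, 0.0], [1.5, 2.0], [3.0, 4.0]):
        print(state, embedding_distance_reward(state, goal))
```

Such a reward is only as informative as the embedding: it tracks task progress only if distances in the embedding space do, which is the property the abstract attributes to the temporal cycle-consistency training.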


Learning to Synthesize Programs as Interpretable and Generalizable Policies

Dweep Trivedi , Jesse Zhang , Shao-Hua Sun , Joseph J. Lim

Categories: Machine Learning | Artificial Intelligence

2021-08-31

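Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced by DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works have explored learning programmatic policies that are more interpretable and better structured for generalization. Yet these works either employ limited policy representations (e.g. decision trees, state machines, or predefined program templates) or require stronger supervision (e.g. input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure for solving a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning from scratch to compose programs that induce the desired agent behavior, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner, and then search over the learned program embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs, but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme and analyze various methods for learning the program embedding.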


SIRL: Similarity-based Implicit Representation Learning

Andreea Bobu , Yi Liu , Rohin Shah , Daniel S. Brown , Anca D. Dragan

Categories: Robotics | Artificial Intelligence | Machine Learning

2023-01-02

When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task "features" -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use, so we can learn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.


Few-Shot Preference Learning for Human-in-the-Loop RL

Joey Hejna , Dorsa Sadigh

Categories: Robotics | Artificial Intelligence | Machine Learning

2022-12-06

While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to minimize the amount of data required for learning reward functions, we take an opposite approach: expanding the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20×, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.


Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning

Haoqi Yuan , Zongqing Lu

Categories: Machine Learning

2022-06-21

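We study offline meta-reinforcement learning, a practical reinforcement learning paradigm that learns from offline data to adapt to new tasks. The distribution of offline data is determined jointly by the behavior policy and the task. Existing offline meta-reinforcement learning algorithms cannot distinguish these two factors, which makes task representations unstable under changes in the behavior policy. To address this problem, we propose a contrastive learning framework for task representations that are robust to the distribution mismatch of behavior policies between training and testing. We design a bi-level encoder structure, use mutual information maximization to formalize task representation learning, derive a contrastive learning objective, and introduce several approaches to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods, especially in generalization to out-of-distribution behavior policies. The code is available at https://github.com/pku-ai-ged/corro.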
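For readers unfamiliar with contrastive objectives of the kind this abstract refers to, the sketch below implements a generic InfoNCE-style loss for a single anchor, one positive, and a set of negatives. It is a textbook form given here for orientation only and is not the specific objective derived in the paper; the function name, the dot-product similarity, and the `temperature` default are assumptions for this example.

```python
import math


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's).

    anchor, positive: equal-length sequences of floats.
    negatives: list of vectors of the same length.
    Returns -log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) ),
    where s = dot(anchor, other) / temperature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    aligned = info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    misaligned = info_nce_loss(anchor, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned, misaligned)  # the aligned positive gives the smaller loss
```

In a task-representation setting, the anchor and positive would be encodings of data from the same task and the negatives encodings from other tasks; how negatives are generated is exactly the design question the abstract highlights.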


Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices

Evan Zheran Liu , Aditi Raghunathan , Percy Liang , Chelsea Finn

Categories: Machine Learning | Artificial Intelligence | Machine Learning (Statistics)

2020-08-06

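The goal of meta-reinforcement learning (meta-RL) is to build agents that can quickly learn new tasks by leveraging prior experience on related tasks. Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task. In principle, optimal exploration and exploitation can be learned end-to-end by simply maximizing task performance. However, such meta-RL approaches struggle with local optima because of a chicken-and-egg problem: learning to explore requires good exploitation to gauge the utility of exploration, but learning to exploit requires information gathered through exploration. Optimizing separate objectives for exploration and exploitation can avoid this problem, but prior meta-RL exploration objectives yield suboptimal policies that gather information irrelevant to the task. We alleviate both concerns by constructing an exploitation objective that automatically identifies task-relevant information and an exploration objective that recovers only this information. This avoids local optima in end-to-end training without sacrificing optimal exploration. Empirically, DREAM substantially outperforms existing approaches on complex meta-RL problems, such as sparse-reward 3D visual navigation. Videos of DREAM: https://ezliu.github.io/dream/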
