Several self-supervised representation learning methods have been proposed for reinforcement learning (RL) with rich observations. For real-world applications of RL, recovering the underlying latent state is crucial, particularly when sensory inputs contain irrelevant and exogenous information. In this work, we study how information bottlenecks can be used to construct latent states efficiently in the presence of task-irrelevant information. We propose architectures that use variational and discrete information bottlenecks, which we call RepDIB, to learn structured, factorized representations. Exploiting the expressiveness brought by factorized representations, we introduce a simple yet effective bottleneck that can be integrated with any existing self-supervised objective for RL. We demonstrate this across several online and offline RL benchmarks, along with a real robot arm task, where we find that compressed representations with RepDIB can lead to strong performance improvements, as the learned bottlenecks help predict only the relevant state while ignoring irrelevant information.
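As a rough illustration of what a discrete, factorized bottleneck can look like, the sketch below splits a continuous embedding into groups and snaps each group to the nearest entry of its own codebook. This is a generic vector-quantization example under assumed shapes, not the RepDIB implementation; the function name `factorized_quantize` and the codebook layout are hypothetical.

```python
import numpy as np


def factorized_quantize(z, codebooks):
    """Quantize each factor of z to its nearest codebook entry.

    z:         array of shape (G * d,), a continuous embedding.
    codebooks: array of shape (G, K, d), one codebook of K codes per factor
               (hypothetical layout chosen for this sketch).
    Returns the quantized embedding of shape (G * d,) and the chosen
    code indices of shape (G,).
    """
    G, K, d = codebooks.shape
    factors = z.reshape(G, d)
    # Squared distance from each factor to every code in its own codebook.
    dists = ((factors[:, None, :] - codebooks) ** 2).sum(axis=-1)  # (G, K)
    idx = dists.argmin(axis=1)
    quantized = codebooks[np.arange(G), idx]  # (G, d)
    return quantized.reshape(-1), idx


if __name__ == "__main__":
    codebooks = np.array([[[0.0, 0.0], [1.0, 1.0]],
                          [[0.0, 0.0], [2.0, 2.0]]])
    z = np.array([0.9, 1.2, 0.1, -0.2])
    q, idx = factorized_quantize(z, codebooks)
    print(q, idx)
```

With G factors and K codes per factor, the bottleneck can express K^G distinct discrete states while storing only G * K code vectors, which is one way to read the "expressiveness" of factorized representations mentioned above.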
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains, including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation, and thus efficient data collection, without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
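The ability to discover useful behaviors from past experience and transfer them to new tasks is considered a core component of natural embodied intelligence. Inspired by neuroscience, discovering behaviors that switch at bottleneck states has long been pursued as a way to induce plans of minimum description length across tasks. Prior approaches either supported only online, on-policy bottleneck state discovery, which limits sample efficiency, or only discrete state-action domains, which limits applicability. To address this, we introduce Model-Based Offline Options (MO2), an offline hindsight framework that supports sample-efficient discovery of bottleneck options over continuous state-action spaces. Once bottleneck options have been learned offline on the source domains, they are transferred online to improve exploration and value estimation on the transfer domain. Our experiments show that on complex, long-horizon continuous control tasks with sparse, delayed rewards, the properties of MO2 are essential and lead to performance exceeding recent option learning methods. Additional ablations further demonstrate the impact on option predictability and credit assignment.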
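Learning to act optimally against any mixture over a diverse set of strategies is of important practical interest in competitive games. In this paper, we propose a simplex-based method that satisfies two desiderata simultaneously: i) learning a set of strategically diverse basis policies, represented by a single conditional network; ii) using the same network, learning best responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies effectively incorporate prior information about their opponents, enabling near-optimal returns against arbitrary mixture policies in a game with tractable best responses. We verify that such policies perform well under uncertainty and offer insights into using this flexibility at test time. Finally, we provide evidence that learning best responses to any mixture policy is an effective auxiliary task for strategic exploration, which by itself can lead to more performant populations.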
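To make "best response to a mixture over the simplex" concrete, the sketch below computes an exact best response in a small matrix game where it is tractable. It is an illustrative example only: rock-paper-scissors, the pure strategies standing in for basis policies, and the helper name `best_response` are assumptions of this sketch, not part of the method described in the abstract.

```python
import numpy as np

# Row player's payoff in rock-paper-scissors; rows and columns are
# ordered (rock, paper, scissors). Used here as a stand-in game.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])


def best_response(payoff, opponent_mixture):
    """Return (action index, expected payoff) of a pure best response.

    opponent_mixture is a point on the probability simplex over the
    opponent's basis strategies (here, the columns of payoff).
    """
    sigma = np.asarray(opponent_mixture, dtype=float)
    if sigma.min() < 0 or not np.isclose(sigma.sum(), 1.0):
        raise ValueError("opponent_mixture must lie on the simplex")
    # Expected payoff of each of our actions against the mixture.
    expected = payoff @ sigma
    action = int(expected.argmax())
    return action, float(expected[action])


if __name__ == "__main__":
    # Opponent plays rock 60%, paper 30%, scissors 10% of the time.
    print(best_response(PAYOFF, [0.6, 0.3, 0.1]))
```

A conditional policy in the sense of the abstract would take the mixture as an input and approximate this mapping with a neural network; the closed-form version above only serves as a reference point in games small enough to solve directly.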
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
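Robots will experience non-stationary environment dynamics throughout their lifetime: the robot dynamics may change due to wear and tear, or the surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn quickly in a new environment. We identify two challenges for reinforcement learning (RL) in such a lifelong learning setting. First, existing off-policy algorithms struggle with the trade-off between maintaining good performance in the old environments and learning efficiently in the new environment, despite keeping all the data in the replay buffer. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into an online interaction phase and an offline distillation phase. Second, we find that training with imbalanced off-policy data from multiple environments across the lifetime produces a significant performance drop. We identify that this performance drop is caused by the combination of imbalanced quality and size among the datasets, which exacerbates the extrapolation error of the Q-function. During the distillation phase, we apply a simple fix by keeping the policy closer to the behavior policy that generated the data. In the experiments, we demonstrate these two challenges and the proposed solutions on a simulated bipedal robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.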
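For robots operating in the real world, it is desirable to learn reusable behaviors that can be effectively transferred and adapted to many tasks and scenarios. We propose an approach for learning abstract motor skills from data using a hierarchical mixture latent variable model. In contrast to existing work, our method exploits a three-level hierarchy of both discrete and continuous latent variables to capture a set of high-level behaviors while allowing for variance in how they are executed. We demonstrate in manipulation domains that the method can effectively cluster offline data into distinct, executable behaviors while retaining the flexibility of a continuous latent variable model. The resulting skills can be transferred and fine-tuned on new tasks, on unseen objects, and from state-based to vision-based policies, yielding better sample efficiency and asymptotic performance than existing skill-based and imitation-based methods. We further analyze how and when the skills are most beneficial: they encourage directed exploration that covers large regions of the task-relevant state space, making them most effective in challenging sparse-reward settings.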
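Reinforcement learning (RL) can in principle let robots automatically adapt to new tasks, but current RL methods require a large number of trials to do so. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which uses past tasks to learn to adapt, with a specific focus on industrial insertion tasks. Fast adaptation is crucial because a large number of on-robot trials can damage hardware. Efficient adaptation is also feasible, because experience from different insertion applications can largely be shared among them. In this setting, we address two specific challenges in applying meta-learning. First, conventional meta-RL algorithms require lengthy online meta-training. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through finetuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks, reaching a success rate of 100% while using only a fraction of the samples needed to learn the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/offline-metarl-insertion.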
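Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e., disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they have low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information so that it does not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.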
We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects (counting, locating, and classifying the elements of a scene) without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.