Reinforcement learning (RL) algorithms hold the promise of autonomous skill acquisition for robotic systems. In practice, however, real-world robotic RL typically requires time-consuming data collection and frequent human intervention to reset the environment. Moreover, robot policies learned with RL often fail when deployed in settings that differ from those in which they were trained. In this work, we study how these challenges can be addressed by effectively utilizing diverse offline datasets collected from previously seen tasks. When faced with a new task, our system adapts previously learned skills to quickly learn both to perform the new task and to return the environment to its initial state, effectively performing its own environment resets. Our empirical results show that incorporating prior data into robotic reinforcement learning enables autonomous learning, substantially improves the sample efficiency of learning, and enables better generalization.
Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents challenges in teleoperation or kinesthetic teaching, allowing a robot not only to learn a task efficiently but also to practice it autonomously. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.
Robotic skills can be learned via imitation learning (IL) using user-provided demonstrations, or via reinforcement learning (RL) using large amounts of autonomously collected experience. The two methods have complementary strengths and weaknesses: RL can reach a high level of performance, but requires exploration, which can be very time-consuming and unsafe; IL does not require exploration, but only learns skills that are as good as the provided demonstrations. Can a single method combine the strengths of both approaches? A number of prior methods have aimed to address this question, proposing a variety of techniques that integrate elements of IL and RL. However, scaling such methods up to complex robotic skills, integrating diverse offline data, and generalizing to real-world scenarios still present major challenges. In this paper, our aim is to test the scalability of prior IL + RL algorithms and, through detailed and systematic experimentation, to devise a system that combines existing components in the most effective and scalable way. To that end, we present a series of experiments aimed at understanding the implications of each design decision, so as to develop a combined approach that can utilize demonstrations and heterogeneous prior data to attain the best performance on a range of real-world and realistic simulated robotic problems. Our complete method, which we call AW-Opt, combines elements of advantage-weighted regression [1, 2] and QT-Opt [3], providing a unified approach for integrating demonstrations and offline data for robotic manipulation. Please see https://awopt.github.io for more details.
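To make the advantage-weighted regression ingredient mentioned above concrete, here is a minimal sketch of a generic AWR-style policy loss: dataset actions are imitated with weights that grow exponentially with their estimated advantage, so demonstrations and good prior data dominate the update. The network interfaces (`policy.log_prob`, the Q and value functions) are placeholders for illustration, not AW-Opt's actual implementation.

```python
import torch

def awr_policy_loss(policy, q_fn, v_fn, states, actions, beta=1.0, max_weight=20.0):
    """Generic advantage-weighted regression loss (a sketch, not AW-Opt itself):
    imitate dataset actions, up-weighted by exp(advantage / beta)."""
    with torch.no_grad():
        advantage = q_fn(states, actions) - v_fn(states)        # how much better than average
        weights = torch.clamp(torch.exp(advantage / beta), max=max_weight)
    log_prob = policy.log_prob(states, actions)                 # assumed interface
    return -(weights * log_prob).mean()
```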
Reinforcement learning (RL) can, in principle, let robots automatically adapt to new tasks, but current RL methods require a large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which leverages past tasks to learn to adapt, with a specific focus on industrial insertion tasks. Fast adaptation is crucial, because a prohibitively large number of on-robot trials risks damaging hardware. Moreover, effective adaptation is feasible in this setting, since experience from different insertion applications can largely be leveraged across tasks. We address two specific challenges when applying meta-learning here. First, conventional meta-RL algorithms require lengthy online meta-training. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, the contextual meta-learner adapts immediately; if it is too different, it gradually adapts through finetuning. We show that our approach can quickly adapt to a variety of different insertion tasks, with a success rate of 100%, using only a fraction of the samples needed to learn the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/offline-metarl-insertion.
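The adapt-or-finetune decision described above can be summarized with the following hedged sketch; every interface here (`infer_context`, `context_confidence`, `condition_on`, `finetune_online`) is an illustrative placeholder rather than the paper's actual API.

```python
def adapt_to_new_insertion_task(meta_policy, new_task_trials, env, confidence_threshold=0.8):
    """Sketch of the adapt-or-finetune recipe (all interfaces assumed): infer a task context
    from a few trials; if it looks close to the prior data, deploy the contextual policy
    immediately, otherwise fall back to gradual online finetuning."""
    context = meta_policy.infer_context(new_task_trials)
    if meta_policy.context_confidence(context) >= confidence_threshold:
        return meta_policy.condition_on(context)         # fast, gradient-free adaptation
    return meta_policy.finetune_online(env, context)     # gradual adaptation for novel tasks
```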
Reinforcement learning (RL) provides a naturalistic framing for learning through trial and error, which is appealing both because of its simplicity and effectiveness and because of its resemblance to how humans and animals acquire skills through experience. However, real-world embodied learning, such as that performed by humans and animals, is situated in a continual, non-episodic world, whereas common benchmark tasks in RL are episodic, with the environment reset between trials to provide the agent with multiple attempts. This discrepancy presents a major challenge when taking RL algorithms developed for episodic simulated environments and running them on real-world platforms such as robots. In this paper, we aim to address this discrepancy by laying out a framework for Autonomous Reinforcement Learning (ARL): reinforcement learning in which the agent not only learns through its own experience, but also contends with the lack of human supervision to reset between trials. We introduce a simulated benchmark, EARL, around this framework, containing a set of diverse and challenging simulated tasks that reflect the hurdles introduced to learning when only minimal reliance on extrinsic intervention can be assumed. We show that standard episodic RL approaches and existing methods struggle as interventions are minimized, underscoring the need for new reinforcement learning algorithms with a greater focus on autonomy.
We study how robots can autonomously learn skills that require a combination of navigation and grasping. While reinforcement learning in principle provides for automated robotic skill learning, in practice reinforcement learning in the real world is challenging and often requires extensive instrumentation and supervision. Our aim is to devise a robotic reinforcement learning system for learning navigation and manipulation together, autonomously and without human intervention, enabling continual learning under realistic assumptions. Our proposed system, ReLMM, can learn continuously on a real-world platform without any environment instrumentation, without human intervention, and without access to privileged information such as maps, object positions, or a global view of the environment. Our method employs a modularized policy with components for manipulation and navigation, where uncertainty in the manipulation policy drives exploration for the navigation controller, and the manipulation module provides rewards for navigation. We evaluate our method on a room-cleanup task, in which the robot must navigate to and pick up items scattered on the floor. After a grasp curriculum training phase, ReLMM can learn navigation and grasping together fully automatically, in around 40 hours of autonomous real-world training.
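A toy sketch of the coupling described above: the grasping module supplies the navigation reward, and its predicted uncertainty supplies an exploration signal for the navigation controller. The `success_probability` interface is an assumption made for illustration, not ReLMM's actual code.

```python
def navigation_reward(grasp_succeeded: bool) -> float:
    # The manipulation module supplies the navigation reward: 1 for a successful grasp.
    return 1.0 if grasp_succeeded else 0.0

def navigation_exploration_bonus(grasp_policy, observation) -> float:
    # Manipulation uncertainty drives navigation exploration: the bonus peaks where
    # the grasp model is least certain (predicted success probability near 0.5).
    p = grasp_policy.success_probability(observation)  # assumed interface
    return p * (1.0 - p)
```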
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
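One concrete way to realize the self-supervised "practice" phase described above is to embed observations with a learned encoder (e.g., a VAE), sample imagined goals from its latent prior, and reward the policy by its proximity to the goal in latent space. The snippet below is a hedged sketch of that idea with placeholder interfaces, not the authors' implementation.

```python
import numpy as np

def sample_imagined_goal(latent_dim: int) -> np.ndarray:
    # Sample a goal from the latent prior (a unit Gaussian for a standard VAE).
    return np.random.randn(latent_dim)

def goal_reaching_reward(encoder, observation_image, goal_latent) -> float:
    # Reward is the negative distance between the encoded observation and the goal
    # in latent space: closer to the imagined goal means higher reward.
    z = encoder(observation_image)  # assumed: maps an image to a latent vector
    return -float(np.linalg.norm(z - goal_latent))
```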
Meta-reinforcement learning (meta-RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we could meta-train on offline data, we could reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond those of online meta-RL or standard offline RL. Meta-RL learns an exploration strategy that collects data for adaptation, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy is meta-trained on a fixed offline dataset, it may behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distribution shift. We propose a hybrid offline meta-RL algorithm that uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. Because no reward labels are required for the online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that the additional unsupervised online data collection dramatically improves the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.
Recent advances in unsupervised representation learning have significantly improved the sample efficiency of training reinforcement learning policies in simulated environments. However, similar gains have not yet been seen for real-robot reinforcement learning. In this work, we focus on enabling data-efficient real-robot learning from pixels. We present Contrastive Pre-training and Data Augmentation for Efficient Robotic Learning (CoDER), a method that utilizes data augmentation and unsupervised learning to achieve sample-efficient training of real-robot arm policies from sparse rewards. While contrastive pre-training, data augmentation, demonstrations, and reinforcement learning are individually insufficient for efficient learning, our main contribution is to show that the combination of these disparate techniques results in a simple yet data-efficient method. We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels, such as reaching, picking, moving, pulling a large object, flipping a switch, and opening a drawer, in just 30 minutes of real-world training time. We include videos and code on the project website: https://sites.google.com/view/efficient-robotic-manipulation/home
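As a rough illustration of the contrastive pre-training ingredient, the snippet below shows a standard InfoNCE objective computed over two augmented views of the same image batch. It is a generic sketch under those assumptions, not the paper's exact pre-training objective.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_features, positive_features, temperature=0.1):
    """Standard InfoNCE contrastive loss: row i of `anchor_features` and row i of
    `positive_features` come from two augmented views of the same image; every
    other row in the batch serves as a negative."""
    anchors = F.normalize(anchor_features, dim=1)
    positives = F.normalize(positive_features, dim=1)
    logits = anchors @ positives.t() / temperature        # pairwise similarities
    labels = torch.arange(anchors.shape[0], device=anchors.device)
    return F.cross_entropy(logits, labels)
```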
Skill-based reinforcement learning (RL) has emerged as a promising strategy to leverage prior knowledge for accelerated robot learning. Skills are typically extracted from expert demonstrations and are embedded into a latent space from which they can be sampled as actions by a high-level RL agent. However, this skill space is expansive, and not all skills are relevant for a given robot state, making exploration difficult. Furthermore, the downstream RL agent is limited to learning tasks structurally similar to those used to construct the skill space. We first propose accelerating exploration in the skill space using state-conditioned generative models to directly bias the high-level agent towards only sampling skills relevant to a given state based on prior experience. Next, we propose a low-level residual policy for fine-grained skill adaptation enabling downstream RL agents to adapt to unseen task variations. Finally, we validate our approach across four challenging manipulation tasks that differ from those used to build the skill space, demonstrating our ability to learn across task variations while significantly accelerating exploration, outperforming prior works. Code and videos are available on our project website: https://krishanrana.github.io/reskill.
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. To accomplish this, humans need simple and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: they are inconvenient for humans to provide, they can over-specify the desired behavior, leading to a sparse reward signal, or they can under-specify task information in the case of non-goal-reaching tasks. Natural language provides a convenient and flexible alternative for task specification, but comes with the challenge of grounding language in the robot's observation space. To learn this grounding scalably, we propose to leverage offline robot datasets (including highly sub-optimal, autonomously collected data) with crowd-sourced natural language labels. With this data, we learn a simple classifier that predicts whether a change in state completes a language instruction. This provides a language-conditioned reward function that can then be used for offline multi-task RL. In our experiments, we find that on language-conditioned manipulation tasks, our approach outperforms both goal-image specification and language-conditioned imitation techniques by more than 25%, and is able to perform visuomotor tasks from natural language, such as "open the right drawer" and "move the stapler", on a Franka Emika Panda robot.
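The classifier-as-reward idea described above can be sketched as follows: the predicted probability that the change from the initial to the current state completes the instruction is used directly as the reward for offline multi-task RL. The classifier interface here is a placeholder, not the paper's actual model.

```python
import torch

def language_conditioned_reward(classifier, initial_image, current_image, instruction):
    """Hedged sketch: use the learned completion classifier as a reward signal, i.e. the
    predicted probability that the state change completes the language instruction."""
    with torch.no_grad():
        completion_prob = classifier(initial_image, current_image, instruction)  # assumed interface
    return float(completion_prob)
```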
Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new tasks quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow task distributions. Moreover, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches in reinforcement learning, which typically come at the cost of high complexity. We therefore investigate meta-RL approaches on a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on entirely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs as well as, or better than, meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate evaluating future meta-RL methods on more challenging tasks, and including multi-task pretraining with fine-tuning as a simple yet strong baseline.
Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multitask learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 7 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.
Effective exploration is a crucial challenge in deep reinforcement learning. Several methods, such as behavioral priors, are able to leverage offline data to efficiently accelerate reinforcement learning on complex tasks. However, if the task at hand deviates excessively from the demonstrated tasks, the effectiveness of such methods is limited. In our work, we propose to learn features from offline data that are shared by a more diverse range of tasks, such as the correlation between actions and directedness. We therefore introduce state-free priors, which directly model temporal consistency in demonstrated trajectories and are capable of driving exploration in complex tasks, even when trained on data collected on simpler tasks. Furthermore, we introduce a novel integration scheme for action priors in off-policy reinforcement learning, by dynamically sampling actions from a probabilistic mixture of the policy and the action prior. We compare our approach against strong baselines and provide empirical evidence that it can accelerate reinforcement learning in long-horizon continuous-control tasks under sparse reward settings.
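The mixture-based action selection mentioned above can be sketched as below: with some (possibly adaptive) probability, the agent acts from the learned prior over temporally consistent action sequences instead of the current policy. The sampling interfaces are assumptions for illustration, not the paper's implementation.

```python
import random

def sample_exploration_action(policy, action_prior, observation, action_history, prior_prob):
    """Hedged sketch of mixture-based exploration: with probability `prior_prob`, act from
    the state-free prior (which conditions only on recent actions, not the state);
    otherwise act from the current policy."""
    if random.random() < prior_prob:
        return action_prior.sample(action_history)  # assumed interface
    return policy.sample(observation)               # assumed interface
```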
While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries, implausible for any human to answer, or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
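For context, preference-based reward learning typically trains the reward model with a Bradley-Terry style objective over pairs of trajectory segments labeled by human choice. The sketch below shows that generic objective, not the paper's specific meta-learned variant; the reward-model interface is assumed.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, prefer_a: float):
    """Bradley-Terry style preference objective (a generic sketch): `prefer_a` is 1.0 if
    the human preferred segment_a over segment_b, and 0.0 otherwise."""
    return_a = reward_model(segment_a).sum()   # predicted return of segment a
    return_b = reward_model(segment_b).sum()
    logit = return_a - return_b
    target = torch.tensor(prefer_a, dtype=logit.dtype, device=logit.device)
    return F.binary_cross_entropy_with_logits(logit, target)
```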
Reinforcement learning can train policies that effectively perform complex tasks. However, for long-horizon tasks, the performance of these methods degrades with horizon, often necessitating reasoning over and composing lower-level skills. Hierarchical reinforcement learning aims to enable this by providing a bank of low-level skills as action abstractions. Hierarchies can further improve on this by abstracting the state space as well. We posit that a suitable state abstraction should depend on the capabilities of the available lower-level policies. We propose Value Function Spaces: a simple approach that produces such a representation by using the value functions corresponding to each lower-level skill. These value functions capture the affordances of the scene, thus forming a representation that compactly abstracts task-relevant information and robustly ignores distractors. Empirical evaluations on maze-solving and robotic manipulation tasks demonstrate that our approach improves long-horizon performance and enables better zero-shot generalization than alternative model-free and model-based methods.
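A minimal sketch of the representation described above: the current observation is summarized by the vector of values that each low-level skill's value function assigns to it, so the state is abstracted in terms of what the available skills can accomplish. Interfaces are placeholders, not the paper's code.

```python
import torch

def value_function_space(skill_value_functions, observation):
    """Build a skill-centric state representation: one coordinate per available skill,
    given by that skill's value estimate for the current observation."""
    with torch.no_grad():
        values = [v(observation) for v in skill_value_functions]  # assumed: each returns a scalar
    return torch.stack(values, dim=-1)
```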
Exploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.
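A hedged illustration of one common way demonstrations are folded into an off-policy learner of this kind: keep the demonstrations in their own buffer and mix a fixed fraction of demonstration transitions into every training batch, so the sparse-reward signal from successful trajectories is never drowned out by unsuccessful exploration. The buffer interfaces below are placeholders, and the paper's full recipe additionally uses a behavior-cloning auxiliary loss.

```python
def sample_training_batch(replay_buffer, demo_buffer, batch_size, demo_fraction=0.1):
    """Mix a fixed fraction of demonstration transitions into each batch (sketch only;
    `sample` is assumed to return a list of transitions)."""
    n_demo = int(batch_size * demo_fraction)
    return replay_buffer.sample(batch_size - n_demo) + demo_buffer.sample(n_demo)
```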
While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems. This typically involves introducing hand-engineered policy representations and human-supplied demonstrations. Deep reinforcement learning alleviates this limitation by training general-purpose neural network policies, but applications of direct deep reinforcement learning algorithms have so far been restricted to simulated settings and relatively simple tasks, due to their apparent high sample complexity. In this paper, we demonstrate that a recent deep reinforcement learning algorithm based on off-policy training of deep Q-functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train on real physical robots. We demonstrate that the training times can be further reduced by parallelizing the algorithm across multiple robots which pool their policy updates asynchronously. Our experimental evaluation shows that our method can learn a variety of 3D manipulation skills in simulation and a complex door opening skill on real robots without any prior demonstrations or manually designed representations.
Developing robots that are capable of many skills and generalization to unseen scenarios requires progress on two fronts: efficient collection of large and diverse datasets, and training of high-capacity policies on the collected data. While large datasets have propelled progress in other fields like computer vision and natural language processing, collecting data of comparable scale is particularly challenging for physical systems like robotics. In this work, we propose a framework to bridge this gap and better scale up robot learning, under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage, and the significant improvement of training efficiency by using pretrained out-of-domain visual representations at the compression stage. Experimentally, we demonstrate that 1) on a real robot setup, CACTI enables efficient training of a single policy capable of 10 manipulation tasks involving kitchen objects, and robust to varying layouts of distractor objects; 2) in a simulated kitchen environment, CACTI trains a single policy on 18 semantic tasks across up to 50 layout variations per task. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.