Reinforcement learning can enable robots to navigate to distant goals while optimizing user-specified reward functions, including preferences for following lanes, staying on paved paths, or avoiding freshly mowed grass. However, online learning from trial-and-error for real-world robots is logistically challenging, and methods that instead can utilize existing datasets of robotic navigation data could be significantly more scalable and enable broader generalization. In this paper, we present ReViND, the first offline RL system for robotic navigation that can leverage previously collected data to optimize user-specified reward functions in the real-world. We evaluate our system for off-road navigation without any additional data collection or fine-tuning, and show that it can navigate to distant goals using only offline training from this dataset, and exhibit behaviors that qualitatively differ based on the user-specified reward function.
translated by 谷歌翻译
Navigation is one of the most heavily studied problems in robotics, and is conventionally approached as a geometric mapping and planning problem. However, real-world navigation presents a complex set of physical challenges that defies simple geometric abstractions. Machine learning offers a promising way to go beyond geometry and conventional planning, allowing for navigational systems that make decisions based on actual prior experience. Such systems can reason about traversability in ways that go beyond geometry, accounting for the physical outcomes of their actions and exploiting patterns in real-world environments. They can also improve as more data is collected, potentially providing a powerful network effect. In this article, we present a general toolkit for experiential learning of robotic navigation skills that unifies several recent approaches, describe the underlying design principles, summarize experimental results from several of our recent papers, and discuss open problems and directions for future work.
translated by 谷歌翻译
机器人导航的目标条件政策可以在大型未注释的数据集上进行培训,从而为现实世界中的设置提供了良好的概括。但是,尤其是在指定目标需要图像的基于视觉的设置中,这是一个不自然的界面。语言为与机器人的通信提供了一种更方便的方式,但是现代方法通常需要以语言描述注释的轨迹的形式进行昂贵的监督。我们提出了一个用于机器人导航的系统,该系统享受着未注释的大型轨迹数据集培训的好处,同时仍为用户提供高级接口。我们没有在数据集之后使用标记的指令,而是表明可以完全从预先训练的导航模型(VING),图像语言关联(剪辑)和语言建模(GPT-3)中构建这样的系统,而无需任何微调或语言宣布的机器人数据。我们将LM-NAV实例化在现实世界中的移动机器人上,并通过自然语言指令通过复杂的室外环境演示长途导航。有关我们的实验的视频,代码发布和在浏览器中运行的交互式COLAB笔记本,请查看我们的项目页面https://sites.google.com/view/lmnav
translated by 谷歌翻译
在现实世界中经营通常需要代理商来了解复杂的环境,并应用这种理解以实现一系列目标。这个问题被称为目标有条件的强化学习(GCRL),对长地平线的目标变得特别具有挑战性。目前的方法通过使用基于图形的规划算法增强目标条件的策略来解决这个问题。然而,他们努力缩放到大型高维状态空间,并采用用于有效地收集训练数据的探索机制。在这项工作中,我们介绍了继任者功能标志性(SFL),这是一种探索大型高维环境的框架,以获得熟练的政策熟练的策略。 SFL利用继承特性(SF)来捕获转换动态的能力,通过估计状态新颖性来驱动探索,并通过将状态空间作为基于非参数标志的图形来实现高级规划。我们进一步利用SF直接计算地标遍历的目标条件调节策略,我们用于在探索状态空间边缘执行计划“前沿”地标。我们在我们的Minigrid和VizDoom进行了实验,即SFL可以高效地探索大型高维状态空间和优于长地平线GCRL任务的最先进的基线。
translated by 谷歌翻译
这项工作研究了图像目标导航问题,需要通过真正拥挤的环境引导具有嘈杂传感器和控制的机器人。最近的富有成效的方法依赖于深度加强学习,并学习模拟环境中的导航政策,这些环境比真实环境更简单。直接将这些训练有素的策略转移到真正的环境可能非常具有挑战性甚至危险。我们用由四个解耦模块组成的分层导航方法来解决这个问题。第一模块在机器人导航期间维护障碍物映射。第二个将定期预测实时地图上的长期目标。第三个计划碰撞命令集以导航到长期目标,而最终模块将机器人正确靠近目标图像。四个模块是单独开发的,以适应真实拥挤的情景中的图像目标导航。此外,分层分解对导航目标规划,碰撞避免和导航结束预测的学习进行了解耦,这在导航训练期间减少了搜索空间,并有助于改善以前看不见的真实场景的概括。我们通过移动机器人评估模拟器和现实世界中的方法。结果表明,我们的方法优于多种导航基线,可以在这些方案中成功实现导航任务。
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
本文解决了逆增强学习(IRL)的问题 - 从观察其行为中推断出代理的奖励功能。 IRL可以为学徒学习提供可概括和紧凑的代表,并能够准确推断人的偏好以帮助他们。 %并提供更准确的预测。但是,有效的IRL具有挑战性,因为许多奖励功能可以与观察到的行为兼容。我们专注于如何利用先前的强化学习(RL)经验,以使学习这些偏好更快,更高效。我们提出了IRL算法基础(通过样本中的连续功能意图推断行为获取行为),该算法利用多任务RL预培训和后继功能,使代理商可以为跨越可能的目标建立强大的基础,从而跨越可能的目标。给定的域。当仅接触一些专家演示以优化新颖目标时,代理商会使用其基础快速有效地推断奖励功能。我们的实验表明,我们的方法非常有效地推断和优化显示出奖励功能,从而准确地从少于100个轨迹中推断出奖励功能。
translated by 谷歌翻译
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques. * Equal contribution. Order was determined by coin flip.
translated by 谷歌翻译
Learning long-horizon tasks such as navigation has presented difficult challenges for successfully applying reinforcement learning. However, from another perspective, under a known environment model, methods such as sampling-based planning can robustly find collision-free paths in environments without learning. In this work, we propose Control Transformer which models return-conditioned sequences from low-level policies guided by a sampling-based Probabilistic Roadmap (PRM) planner. Once trained, we demonstrate that our framework can solve long-horizon navigation tasks using only local information. We evaluate our approach on partially-observed maze navigation with MuJoCo robots, including Ant, Point, and Humanoid, and show that Control Transformer can successfully navigate large mazes and generalize to new, unknown environments. Additionally, we apply our method to a differential drive robot (Turtlebot3) and show zero-shot sim2real transfer under noisy observations.
translated by 谷歌翻译
我们介绍了一个目标驱动的导航系统,以改善室内场景中的Fapless视觉导航。我们的方法在每次步骤中都将机器人和目标的多视图观察为输入,以提供将机器人移动到目标的一系列动作,而不依赖于运行时在运行时。通过优化包含三个关键设计的组合目标来了解该系统。首先,我们建议代理人在做出行动决定之前构建下一次观察。这是通过从专家演示中学习变分生成模块来实现的。然后,我们提出预测预先预测静态碰撞,作为辅助任务,以改善导航期间的安全性。此外,为了减轻终止动作预测的训练数据不平衡问题,我们还介绍了一个目标检查模块来区分与终止动作的增强导航策略。这三种建议的设计都有助于提高培训数据效率,静态冲突避免和导航泛化性能,从而产生了一种新颖的目标驱动的FLASES导航系统。通过对Turtlebot的实验,我们提供了证据表明我们的模型可以集成到机器人系统中并在现实世界中导航。视频和型号可以在补充材料中找到。
translated by 谷歌翻译
为了基于深度加强学习(RL)来增强目标驱动的视觉导航的交叉目标和跨场景,我们将信息理论正则化术语引入RL目标。正则化最大化导航动作与代理的视觉观察变换之间的互信息,从而促进更明智的导航决策。这样,代理通过学习变分生成模型来模拟动作观察动态。基于该模型,代理生成(想象)从其当前观察和导航目标的下一次观察。这样,代理学会了解导航操作与其观察变化之间的因果关系,这允许代理通过比较当前和想象的下一个观察来预测导航的下一个动作。 AI2-Thor框架上的交叉目标和跨场景评估表明,我们的方法在某些最先进的模型上获得了平均成功率的10美元。我们进一步评估了我们的模型在两个现实世界中:来自离散的活动视觉数据集(AVD)和带有TurtleBot的连续现实世界环境中的看不见的室内场景导航。我们证明我们的导航模型能够成功实现导航任务这些情景。视频和型号可以在补充材料中找到。
translated by 谷歌翻译
离线强化学习(RL)为从离线数据提供学习决策的框架,因此构成了现实世界应用程序作为自动驾驶的有希望的方法。自动驾驶车辆(SDV)学习策略,这甚至可能甚至优于次优数据集中的行为。特别是在安全关键应用中,作为自动化驾驶,解释性和可转换性是成功的关键。这激发了使用基于模型的离线RL方法,该方法利用规划。然而,目前的最先进的方法往往忽视了多种子体系统随机行为引起的溶液不确定性的影响。这项工作提出了一种新的基于不确定感知模型的离线强化学习利用规划(伞)的新方法,其解决了以可解释的基于学习的方式共同的预测,规划和控制问题。训练有素的动作调节的随机动力学模型捕获了交通场景的独特不同的未来演化。分析为我们在挑战自动化驾驶模拟中的效力和基于现实世界的公共数据集的方法提供了经验证据。
translated by 谷歌翻译
强化学习(RL)算法有望为机器人系统实现自主技能获取。但是,实际上,现实世界中的机器人RL通常需要耗时的数据收集和频繁的人类干预来重置环境。此外,当部署超出知识的设置超出其学习的设置时,使用RL学到的机器人政策通常会失败。在这项工作中,我们研究了如何通过从先前看到的任务中收集的各种离线数据集的有效利用来应对这些挑战。当面对一项新任务时,我们的系统会适应以前学习的技能,以快速学习执行新任务并将环境返回到初始状态,从而有效地执行自己的环境重置。我们的经验结果表明,将先前的数据纳入机器人增强学习中可以实现自主学习,从而大大提高了学习的样本效率,并可以更好地概括。
translated by 谷歌翻译
Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents challenges in teleoperation or kinesthetic teaching which allow a robot to not only learn a task efficiently but also to autonomously practice. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.
translated by 谷歌翻译
长摩根和包括一系列隐性子任务的日常任务仍然在离线机器人控制中构成了重大挑战。尽管许多先前的方法旨在通过模仿和离线增强学习的变体来解决这种设置,但学习的行为通常是狭窄的,并且经常努力实现可配置的长匹配目标。由于这两个范式都具有互补的优势和劣势,因此我们提出了一种新型的层次结构方法,结合了两种方法的优势,以从高维相机观察中学习任务无关的长胜压策略。具体而言,我们结合了一项低级政策,该政策通过模仿学习和从离线强化学习中学到的高级政策学习潜在的技能,以促进潜在的行为先验。各种模拟和真实机器人控制任务的实验表明,我们的配方使以前看不见的技能组合能够通过“缝制”潜在技能通过目标链条,并在绩效上提高绩效的顺序,从而实现潜在的目标。艺术基线。我们甚至还学习了一个多任务视觉运动策略,用于现实世界中25个不同的操纵任务,这既优于模仿学习和离线强化学习技术。
translated by 谷歌翻译
通过学习占用和公制地图来解决开放世界越野导航任务的几何方法,提供良好的泛化,但在违反他们的假设(例如,高草)的户外环境中可能是脆弱的。基于学习的方法可以直接从原始观察中学习无碰撞行为,但难以与标准的基于几何的管道集成。这创造了一个不幸的冲突 - 要么使用学习,要么丢失很好的几何导航组件,要么不使用它,或者不使用它,支持广泛的手动调整几何的成本图。在这项工作中,我们通过以一种方式设计学习和非学习的组件来拒绝这种二分法,使得它们可以以自我监督的方式有效地组合。这两个组件都有助于规划标准:学习组件作为奖励有助于预测的可遍历,而几何组件会有助于障碍成本信息。我们实例化并相对评价我们的系统在分销和分发的外部环境中,表明这种方法继承了来自学习和几何成分的互补收益,并显着优于其中任何一个。我们的结果视频在https://sites.google.com/view/hybrid -imitative-planning托管
translated by 谷歌翻译
通过模仿学习(IL)使用用户提供的演示,或者通过使用大量的自主收集的体验来学习机器人技能。方法具有互补的经验和缺点:RL可以达到高度的性能,但需要缺陷,但是需要缺乏要求,但是需要达到高水平的性能,但需要达到高度的性能这可能非常耗时和不安全; IL不要求Xploration,但只学习与所提供的示范一样好的技能。一种方法将两种方法的优势结合在一起?一系列的方法旨在解决这个问题,提出了整合IL和RL的元素的各种技术。然而,扩大了这种方法,这些方法复杂的机器人技能,整合了不同的离线数据,概括到现实世界的情景仍然存在重大挑战。在本文中,USAIM是测试先前IL + RL算法的可扩展性,并设计了一种系统的详细实验实验,这些实验结合了现有的组件,其具有效果有效和可扩展的方式。为此,我们展示了一系列关于了解每个设计决定的影响的一系列实验,以便开发可以利用示范和异构的先前数据在一系列现实世界和现实的模拟问题上获得最佳表现的批准方法。我们通过致电Wap-opt的完整方法将优势加权回归[1,2]和QT-opt [3]结合在一起,提供了一个UnifiedAgveach,用于集成机器人操作的演示和离线数据。请参阅HTTPS: //awopt.github.io有关更多详细信息。
translated by 谷歌翻译
模仿学习在有效地学习政策方面对复杂的决策问题有着巨大的希望。当前的最新算法经常使用逆增强学习(IRL),在给定一组专家演示的情况下,代理会替代奖励功能和相关的最佳策略。但是,这种IRL方法通常需要在复杂控制问题上进行实质性的在线互动。在这项工作中,我们提出了正规化的最佳运输(ROT),这是一种新的模仿学习算法,基于最佳基于最佳运输轨迹匹配的最新进展。我们的主要技术见解是,即使只有少量演示,即使只有少量演示,也可以自适应地将轨迹匹配的奖励与行为克隆相结合。我们对横跨DeepMind Control Suite,OpenAI Robotics和Meta-World基准的20个视觉控制任务进行的实验表明,与先前最新的方法相比,平均仿真达到了90%的专家绩效的速度,达到了90%的专家性能。 。在现实世界的机器人操作中,只有一次演示和一个小时的在线培训,ROT在14个任务中的平均成功率为90.1%。
translated by 谷歌翻译
通常,基于学习的拓扑导航方法产生了本地政策,同时通过拓扑图保留了空间的一些松散连通性。然而,拓扑图中的伪造或缺失的边缘通常会导致导航故障。在这项工作中,我们提出了一种基于抽样的图形构建方法,与基线方法相比,导致较为稀疏的图形却具有更高的导航性能。我们还提出了图形维护策略,以消除伪边缘并根据需要扩展图形,从而改善终身导航性能。与从固定培训环境中学习的控制器不同,我们表明我们的模型只能使用来自部署代理的现实世界环境中的少量收集的轨迹图像进行微调。我们在现实世界环境进行了微调后证明了成功的导航,并且通过应用我们的终身图形维护策略,随着时间的推移,随着时间的推移表现出显着的导航改进。
translated by 谷歌翻译
While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
translated by 谷歌翻译