Visual navigation of mobile robots is classically tackled through SLAM combined with optimal planning, and more recently through end-to-end training of policies implemented as deep networks. While the former approaches are usually limited to waypoint planning, they have proven their efficiency even in real physical environments; the latter solutions are most frequently employed in simulation, but have been shown to learn complex visual reasoning involving intricate semantic rules. Navigation by real robots in physical environments is still an open problem: end-to-end training approaches have been thoroughly tested in simulation only, with experiments involving real robots limited to rare performance evaluations in simplified laboratory conditions. In this work we present an in-depth study of the performance and reasoning capacities of real physical agents, trained in simulation and deployed to two different physical environments. Beyond benchmarking, we provide insights into the generalization capabilities of agents trained under different conditions, and we visualize sensor usage and the importance of the different types of signals. We show that, for the PointGoal task, an agent pre-trained on a wide variety of tasks and fine-tuned on a simulated version of the target environment can reach competitive performance without modelling any sim2real transfer, i.e. by deploying the trained agent directly from simulation onto a real physical robot.
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
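For reference, the navigation challenges surveyed here typically report Success weighted by Path Length (SPL), the standard metric of Anderson et al.:

$$\mathrm{SPL} \;=\; \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)},$$

where $N$ is the number of episodes, $S_i$ is a binary success indicator, $\ell_i$ is the shortest-path distance from start to goal, and $p_i$ is the length of the path the agent actually took.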
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast: when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms: defining tasks (e.g. navigation, instruction following, question answering), and configuring, training, and benchmarking embodied agents. These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works [20,16] and find evidence for the opposite conclusion: that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
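As a rough illustration of the Habitat-API workflow described above, a minimal episode loop looks roughly like the following sketch; exact config paths and API details vary across Habitat versions:

```python
# Minimal Habitat-API episode loop (sketch; config path and action
# handling may differ between habitat-api / habitat-lab releases).
import habitat

# Load a PointGoal navigation task configuration.
config = habitat.get_config("configs/tasks/pointnav.yaml")
env = habitat.Env(config=config)

observations = env.reset()  # dict of sensor readings (e.g. "rgb", "depth")
while not env.episode_over:
    # A trained policy would map observations to an action here;
    # we sample randomly for illustration.
    observations = env.step(env.action_space.sample())
env.close()
```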
There has been an emerging paradigm shift from the era of "Internet AI" to "Embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos, or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to that of humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support a variety of embodied AI research tasks. This growing interest in embodied AI benefits the greater pursuit of artificial general intelligence (AGI), but there has been no contemporary and comprehensive survey of the field. This paper aims to provide an encyclopedic survey of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators against seven features we propose, this paper aims to understand the simulators in terms of their use in embodied AI research and their limitations. The paper then surveys the three main research tasks in embodied AI - visual exploration, visual navigation, and embodied question answering (QA) - covering the state-of-the-art approaches, evaluation metrics, and datasets. Finally, with the new insights gleaned from surveying the field, the paper offers simulator-for-task selection advice and recommendations for future directions of the field.
This work studies the problem of image-goal navigation, which entails guiding a robot with noisy sensors and controls through real crowded environments. Recent fruitful approaches rely on deep reinforcement learning and learn navigation policies in simulation environments that are much simpler than the real world. Directly transferring these trained policies to real environments can be extremely challenging or even dangerous. We tackle this problem with a hierarchical navigation method composed of four decoupled modules. The first module maintains an obstacle map during robot navigation. The second periodically predicts a long-term goal on the real-time map. The third plans collision-free command sets for navigating to long-term goals, while the final module stops the robot properly near the goal image. The four modules are developed separately to suit image-goal navigation in real crowded scenarios. In addition, the hierarchical decomposition decouples the learning of navigation goal planning, collision avoidance, and navigation ending prediction, which reduces the search space during navigation training and helps improve generalization to previously unseen real scenes. We evaluate the method both in a simulator and in the real world with a mobile robot. The results show that our method outperforms several navigation baselines and can successfully accomplish navigation tasks in these scenarios.
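A hypothetical control loop over four such decoupled modules might be organized as follows; all module names and interfaces are illustrative, not the authors' code:

```python
# Illustrative skeleton of a four-module hierarchical image-goal
# navigation loop; every interface below is a hypothetical stand-in.
def navigate(robot, goal_image, mapper, goal_predictor, local_planner,
             stopper, replan_every=25, max_steps=1000):
    long_term_goal = None
    for step in range(max_steps):
        obs = robot.observe()                      # RGB-D + odometry
        obstacle_map = mapper.update(obs)          # module 1: maintain map
        if step % replan_every == 0:               # module 2: long-term goal
            long_term_goal = goal_predictor.predict(obstacle_map, goal_image)
        if stopper.near_goal(obs, goal_image):     # module 4: stop near goal
            return True
        command = local_planner.plan(obstacle_map, long_term_goal)  # module 3
        robot.execute(command)
    return False
```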
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
We introduce a goal-driven navigation system to improve mapless visual navigation in indoor scenes. At each time step, our method takes multi-view observations of the robot and the goal as input and outputs a sequence of actions that move the robot towards the goal, without relying on a map at runtime. The system is learned by optimizing a combined objective built around three key designs. First, we propose that the agent conceive the next observation before making an action decision; this is achieved by learning a variational generative module from expert demonstrations. Second, we propose predicting static collisions in advance, as an auxiliary task to improve safety during navigation. Third, to alleviate the training-data imbalance problem of termination-action prediction, we introduce a target checking module, as an alternative to augmenting the navigation policy with a termination action. These three designs together improve training-data efficiency, static collision avoidance, and navigation generalization, yielding a novel goal-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.
If we want to train robots in simulation before deploying them in reality, it seems natural, almost self-evident, to presume that reducing the sim2real gap involves creating simulators of ever-increasing fidelity (since reality is what it is). We challenge this assumption and present the opposite hypothesis: sim2real transfer of robots may be improved with lower (not higher) fidelity simulation. We conduct a systematic large-scale evaluation of this hypothesis with 3 different robots (A1, AlienGo, Spot) in the real world and in 2 different simulators (Habitat and iGibson). Our results show that, contrary to expectation, increasing fidelity does not help learning: performance is poor due to slow simulation speed (which prevents large-scale learning) and overfitting to inaccuracies in simulated physics. Instead, building simple models of the robot's motion from real-world data improves both learning and generalization.
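The "simple model of robot motion" could, for instance, be as plain as a least-squares map from commanded to realized base displacement, fit on logged real-world trajectories. A minimal sketch under that assumption (the data and the model form are illustrative; the paper's exact model may differ):

```python
# Sketch: fit a linear model of realized displacement as a function of
# the commanded displacement, from logged (command, outcome) pairs.
import numpy as np

# Hypothetical logs: each row is a commanded (dx, dy, dtheta).
commands = np.array([[0.25, 0.0, 0.0], [0.25, 0.0, 0.0], [0.0, 0.0, 0.3]])
# Corresponding measured displacements from onboard tracking.
outcomes = np.array([[0.22, 0.01, 0.02], [0.23, -0.01, 0.01], [0.01, 0.0, 0.26]])

# Least-squares fit: outcomes ≈ commands @ A
A, *_ = np.linalg.lstsq(commands, outcomes, rcond=None)

def predict_displacement(cmd):
    """Predict the realized (dx, dy, dtheta) for a commanded motion."""
    return np.asarray(cmd) @ A
```

A policy trained against this fitted model in simulation then experiences robot dynamics closer to reality than the simulator's default physics.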
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack: data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g., cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850× real-time) on an 8-GPU node, representing a 100× speed-up over prior work; and (iii) the Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that tests a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare large-scale reinforcement learning (RL) and classical sense-plan-act (SPA) pipelines on long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) hierarchies with independent skills suffer from "hand-off problems"; and (3) SPA pipelines are more brittle than RL policies.
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
The robotics community has started to rely heavily on increasingly realistic 3D simulators for large-scale training of robots on massive amounts of data. But once robots are deployed in the real world, the simulation gap, as well as real-world changes (e.g., lighting, object displacements), leads to errors. In this paper, we introduce Sim2RealViz, a visual analytics tool that helps experts understand and reduce this gap for robot ego-pose estimation tasks, i.e., estimating a robot's position using trained models. Sim2RealViz displays details of a given model along with the performance of its instances in both simulation and the real world. Experts can identify environment differences that affect model predictions at a given location and explore ways to fix them through direct interaction with the model's hypotheses. We detail the design of the tool, and present case studies on exploiting and addressing a regression-to-the-mean bias, and on how models are perturbed by the disappearance of landmarks such as bikes.
Deep reinforcement learning has achieved great success in laser-based collision avoidance, because a laser senses accurate depth information without much redundant data, which preserves the robustness of an algorithm when it is transferred from simulation to the real world. However, high-cost laser devices are not only hard to deploy across a large fleet of robots, but also show poor robustness towards complex obstacles, including irregular ones such as tables, chairs, and shelves, as well as complex ground surfaces and special materials. In this paper, we propose a novel monocular-camera-based framework for avoiding complex obstacles. In particular, we innovatively transform captured RGB images into pseudo-laser measurements for efficient deep reinforcement learning. Compared with a traditional laser measurement captured at a fixed height, which contains only one-dimensional distances to nearby obstacles, our proposed pseudo-laser measurement fuses the depth and semantic information of the captured RGB image, which makes our method effective against complex obstacles. We also design a feature-extraction guidance module to weight the input pseudo-laser measurement, so that the agent attends more sensibly to the current state, which helps improve the accuracy and efficiency of the obstacle-avoidance policy.
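One plausible reading of the pseudo-laser construction is to collapse each column of an (estimated) depth image into a single range value, optionally masked by semantic labels. A minimal sketch under that assumption, not the authors' exact fusion:

```python
import numpy as np

def pseudo_laser_from_depth(depth, obstacle_mask=None):
    """Collapse an HxW depth image into a W-dimensional pseudo-laser scan.

    depth: HxW array of metric depths; obstacle_mask: optional HxW boolean
    array (e.g. from semantic segmentation) marking obstacle pixels.
    Illustrative approximation only.
    """
    d = depth.astype(float).copy()
    if obstacle_mask is not None:
        d[~obstacle_mask] = np.inf   # ignore non-obstacle pixels
    return d.min(axis=0)             # nearest obstacle per image column

# Example: a 4x6 depth image with a close obstacle in columns 2-3.
depth = np.full((4, 6), 5.0)
depth[1:3, 2:4] = 0.8
print(pseudo_laser_from_depth(depth))  # [5.  5.  0.8 0.8 5.  5. ]
```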
Can an autonomous agent navigate in a new environment without building an explicit map? For the task of PointGoal navigation ("go to $\Delta x$, $\Delta y$"), the answer under an idealized setting (no RGB-D or actuation noise, perfect GPS+Compass) is a clear "yes": map-less neural models composed of task-agnostic components (CNNs and RNNs), trained with large-scale reinforcement learning, achieve 100% success on a standard dataset (Gibson). However, for PointNav in a realistic setting (RGB-D and actuation noise, no GPS+Compass), this is an open question, and one we address in this paper. The strongest published result for this task is 71.7% success. First, we identify the main cause of the performance drop: the absence of GPS+Compass. An agent with perfect GPS+Compass, facing RGB-D sensing and actuation noise, achieves 99.8% success (Gibson-v2 val). This suggests (to paraphrase a meme) that robust visual odometry is all we need for realistic PointNav; if we can achieve that, we can ignore the sensing and actuation noise. With that as our operating hypothesis, we scale dataset and model size, and develop human-annotation-free data-augmentation techniques to train models for visual odometry. We advance the state of the art on the Habitat Realistic PointNav challenge from 71% to 94% success (+23, 31% relative) and from 53% to 74% SPL (+21, 40% relative). While our approach does not saturate or "solve" this dataset, this strong improvement, combined with promising zero-shot sim2real transfer (to a LoCoBot), provides evidence consistent with the hypothesis that explicit mapping may not be necessary for navigation, even in a realistic setting.
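Under this hypothesis, a learned visual-odometry model that outputs per-step egomotion can stand in for GPS+Compass by simple dead reckoning. A sketch of the pose integration (the VO model supplying `delta` is assumed, not shown):

```python
import math

def integrate_egomotion(pose, delta):
    """Dead-reckon the agent pose from one visual-odometry estimate.

    pose:  (x, y, heading) in the episode frame.
    delta: (dx, dy, dtheta) egomotion in the agent's previous body frame,
           e.g. predicted by a learned VO model from consecutive RGB-D frames.
    """
    x, y, th = pose
    dx, dy, dth = delta
    # Rotate the body-frame translation into the episode frame.
    x += dx * math.cos(th) - dy * math.sin(th)
    y += dx * math.sin(th) + dy * math.cos(th)
    return (x, y, th + dth)

# The integrated pose replaces the GPS+Compass reading that a
# PointNav policy would otherwise consume.
pose = (0.0, 0.0, 0.0)
pose = integrate_egomotion(pose, (0.25, 0.0, 0.1))
```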
In this work, we present a memory-augmented approach for image-goal navigation. Earlier attempts, including RL-based and SLAM-based methods, have either shown poor generalization performance or relied heavily on pose/depth sensors. Our method is based on an attention-based end-to-end model that leverages an episodic memory to learn to navigate. First, we train a state-embedding network in a self-supervised fashion, and then embed previously visited states into the agent's memory. Our navigation policy takes advantage of this information through an attention mechanism. We validate our approach with extensive evaluations and show that our model establishes a new state of the art on the challenging Gibson dataset. Furthermore, in stark contrast with related work, we achieve this performance from RGB input alone, without access to additional information such as position or depth.
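The attention read over episodic memory can be pictured as standard dot-product attention between the current state embedding and the stored embeddings. A generic sketch, not the paper's exact architecture:

```python
import numpy as np

def attention_read(query, memory):
    """Dot-product attention over an episodic memory of state embeddings.

    query:  (d,) embedding of the current observation.
    memory: (n, d) embeddings of previously visited states.
    Returns a (d,) context vector summarizing relevant past states.
    """
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ memory

rng = np.random.default_rng(0)
memory = rng.normal(size=(16, 32))            # 16 remembered states
context = attention_read(memory[3], memory)   # read with one state as query
```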
Underwater navigation presents several challenges, including unstructured unknown environments, lack of reliable localization systems (e.g., GPS), and poor visibility. Furthermore, good-quality obstacle detection sensors for underwater robots are scant and costly; and many sensors like RGB-D cameras and LiDAR only work in-air. To enable reliable mapless underwater navigation despite these challenges, we propose a low-cost end-to-end navigation system, based on a monocular camera and a fixed single-beam echo-sounder, that efficiently navigates an underwater robot to waypoints while avoiding nearby obstacles. Our proposed method is based on Proximal Policy Optimization (PPO), which takes as input current relative goal information, estimated depth images, echo-sounder readings, and previous executed actions, and outputs 3D robot actions in a normalized scale. End-to-end training was done in simulation, where we adopted domain randomization (varying underwater conditions and visibility) to learn a robust policy against noise and changes in visibility conditions. The experiments in simulation and real-world demonstrated that our proposed method is successful and resilient in navigating a low-cost underwater robot in unknown underwater environments. The implementation is made publicly available at https://github.com/dartmouthrobotics/deeprl-uw-robot-navigation.
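The abstract spells out the policy's inputs and outputs; as a rough sketch of how they might be assembled (shapes and helpers are assumptions, not the released implementation at the repository above):

```python
import numpy as np

def build_observation(rel_goal, depth_image, echo_range, prev_actions):
    """Flatten the described policy inputs into one vector (sketch).

    rel_goal:     (3,) goal position relative to the robot.
    depth_image:  HxW estimated depth (e.g. from monocular depth estimation).
    echo_range:   scalar single-beam echo-sounder reading.
    prev_actions: (k, 3) recently executed actions.
    """
    return np.concatenate([
        np.asarray(rel_goal, dtype=np.float32),
        np.asarray(depth_image, dtype=np.float32).ravel(),
        np.array([echo_range], dtype=np.float32),
        np.asarray(prev_actions, dtype=np.float32).ravel(),
    ])

def to_robot_action(normalized):
    """Clip the PPO output to the normalized [-1, 1] 3D action range."""
    return np.clip(normalized, -1.0, 1.0)
```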
We consider the problem of navigating a mobile robot towards a target in an unknown environment that is endowed with visual sensors, where neither the robot nor the sensors have access to global positioning information and only first-person-view images are used. While prior work in sensor networks relies on explicit mapping and planning techniques, often aided by external positioning systems, we propose a vision-based learning approach that leverages graph neural networks (GNNs) to encode and communicate relevant viewpoint information to the mobile robot. During navigation, the robot is guided by a model, trained via imitation learning to approximate optimal motion primitives, which predicts the effective cost-to-go (to the target). In our experiments, we first demonstrate generalizability to previously unseen environments with various sensor layouts. Simulation results show that by leveraging communication between the sensors and the robot, we achieve an 18.1% increase in success rate while decreasing the mean path detour by 29.3% and its variability by 48.4%. This is accomplished without requiring a global map, localization data, or pre-calibration of the sensor network. Second, we perform a zero-shot transfer of our model from simulation to the real world. To this end, we train a "translator" model that converts between {real, simulated} images so that the navigation policy (trained entirely in simulation) can be used directly on the real robot without additional fine-tuning. Physical experiments demonstrate our effectiveness in various cluttered environments.
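One round of GNN message passing among camera nodes and the robot node could look like the following; the aggregation rule and weights are illustrative assumptions, not the paper's layer:

```python
import numpy as np

def gnn_round(node_feats, adjacency, w_self, w_msg):
    """One illustrative message-passing round over the sensor network.

    node_feats: (n, d) per-node visual embeddings (cameras + robot).
    adjacency:  (n, n) 0/1 communication graph.
    w_self, w_msg: (d, d) stand-ins for learned weight matrices.
    """
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    messages = (adjacency @ node_feats) / deg      # mean over neighbors
    return np.tanh(node_feats @ w_self + messages @ w_msg)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))            # 4 cameras + 1 robot
adj = np.ones((5, 5)) - np.eye(5)          # fully connected comms graph
w_self = rng.normal(size=(8, 8)) * 0.1
w_msg = rng.normal(size=(8, 8)) * 0.1
updated = gnn_round(feats, adj, w_self, w_msg)
```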
Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for embodied agents to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Socially-Aware Tasks (referred to as Risk and Social Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors. To this end, our tasks exploit the notion of immediate and future dangers of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This is done to capture fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called Encounter. We validate our approach on the Gibson4+ and Habitat-Matterport3D datasets.
Recent work on audio-visual navigation targets a single static sound source in noise-free audio environments and struggles to generalize to unheard sounds. We introduce a novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment, in the presence of distracting people and noisy sounds. We propose an end-to-end reinforcement learning approach that relies on a multi-modal architecture, fusing spatial audio-visual information from binaural audio signals and spatial occupancy maps, to encode the features needed to learn a robust navigation policy for this complex new task setting. We show that our approach outperforms the current state of the art, with better generalization to unheard sounds and better robustness to noisy scenarios, on the two challenging 3D-scanned real-world datasets Replica and Matterport3D, for both static and dynamic audio-visual navigation benchmarks. Our benchmark will be made available at http://dav-nav.cs.uni-freiburg.de.
To enhance the cross-target and cross-scene generalization of target-driven visual navigation based on deep reinforcement learning (RL), we introduce an information-theoretic regularization term into the RL objective. The regularization maximizes the mutual information between navigation actions and the transformation of the agent's visual observations, thus promoting more informed navigation decisions. In this way, the agent models the action-observation dynamics by learning a variational generative model. Based on this model, the agent generates (imagines) the next observation from its current observation and the navigation target. The agent thereby learns the causality between navigation actions and the changes in its observations, which allows it to predict the next navigation action by comparing the current and the imagined next observations. Cross-target and cross-scene evaluations on the AI2-THOR framework show that our method attains at least a 10% improvement in average success rate over state-of-the-art models. We further evaluate our model in two real-world settings: navigation in unseen indoor scenes from the discrete Active Vision Dataset (AVD), and in continuous real-world environments with a TurtleBot. We demonstrate that our navigation model can successfully accomplish navigation tasks in these scenarios. Videos and models can be found in the supplementary material.
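Schematically, such an information-regularized objective augments the expected return with a weighted mutual-information term (our paraphrase of the idea; the paper's exact variational bound may differ):

$$\max_{\theta}\;\; \mathbb{E}_{\pi_\theta}\!\Big[\sum_t r_t\Big] \;+\; \lambda\, I\big(a_t;\, o_{t+1} \mid o_t,\, g\big),$$

where $g$ is the navigation target and $I$ is lower-bounded in practice by a learned variational model $q_\phi(o_{t+1} \mid o_t, a_t, g)$ that "imagines" the next observation.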
Two less-addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to apply to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects, so we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), and (4) is end-to-end trainable and does not need feature engineering, feature matching between frames, or 3D reconstruction of the environment. The supplementary video can be accessed at the following link: https://youtu.be/SmBxMDiOrvs.
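As a rough illustration of such a goal-conditioned policy (an illustrative PyTorch sketch, not the authors' exact architecture), an actor-critic can fuse state and goal embeddings before separate policy and value heads:

```python
import torch
import torch.nn as nn

class GoalConditionedActorCritic(nn.Module):
    """Actor-critic whose policy is a function of both state and goal."""

    def __init__(self, feat_dim=2048, hidden=512, num_actions=4):
        super().__init__()
        # Fuse concatenated state/goal features into a shared embedding.
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, num_actions)  # actor head
        self.value = nn.Linear(hidden, 1)             # critic head

    def forward(self, state_feat, goal_feat):
        h = self.fuse(torch.cat([state_feat, goal_feat], dim=-1))
        return self.policy(h), self.value(h)

# Feature dims are assumptions; features would come from a shared CNN.
model = GoalConditionedActorCritic()
logits, value = model(torch.randn(1, 2048), torch.randn(1, 2048))
```

Conditioning on the goal embedding, rather than training one network per target, is what lets a single policy generalize across targets.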