We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast -when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-toend development of embodied AI algorithms -defining tasks (e.g. navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents.These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works [20,16] and find evidence for the opposite conclusion -that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
从“Internet AI”的时代到“体现AI”的时代,AI算法和代理商出现了一个新兴范式转变,其中不再从主要来自Internet策划的图像,视频或文本的数据集。相反,他们通过与与人类类似的Enocentric感知来通过与其环境的互动学习。因此,对体现AI模拟器的需求存在大幅增长,以支持各种体现的AI研究任务。这种越来越多的体现AI兴趣是有利于对人工综合情报(AGI)的更大追求,但对这一领域并无一直存在当代和全面的调查。本文旨在向体现AI领域提供百科全书的调查,从其模拟器到其研究。通过使用我们提出的七种功能评估九个当前体现的AI模拟器,旨在了解模拟器,以其在体现AI研究和其局限性中使用。最后,本文调查了体现AI - 视觉探索,视觉导航和体现问题的三个主要研究任务(QA),涵盖了最先进的方法,评估指标和数据集。最后,随着通过测量该领域的新见解,本文将为仿真器 - 任务选择和建议提供关于该领域的未来方向的建议。
translated by 谷歌翻译
我们介绍了栖息地2.0(H2.0),这是一个模拟平台,用于培训交互式3D环境和复杂物理的场景中的虚拟机器人。我们为体现的AI堆栈 - 数据,仿真和基准任务做出了全面的贡献。具体来说,我们提出:(i)复制:一个由艺术家的,带注释的,可重新配置的3D公寓(匹配真实空间)与铰接对象(例如可以打开/关闭的橱柜和抽屉); (ii)H2.0:一个高性能物理学的3D模拟器,其速度超过8-GPU节点上的每秒25,000个模拟步骤(实时850x实时),代表先前工作的100倍加速;和(iii)家庭助理基准(HAB):一套辅助机器人(整理房屋,准备杂货,设置餐桌)的一套常见任务,以测试一系列移动操作功能。这些大规模的工程贡献使我们能够系统地比较长期结构化任务中的大规模加固学习(RL)和经典的感官平面操作(SPA)管道,并重点是对新对象,容器和布局的概括。 。我们发现(1)与层次结构相比,(1)平面RL政策在HAB上挣扎; (2)具有独立技能的层次结构遭受“交接问题”的困扰,(3)水疗管道比RL政策更脆。
translated by 谷歌翻译
移动机器人的视觉导航经典通过SLAM加上最佳规划,最近通过实现作为深网络的端到端培训。虽然前者通常仅限于航点计划,但即使在真实的物理环境中已经证明了它们的效率,后一种解决方案最常用于模拟中,但已被证明能够学习更复杂的视觉推理,涉及复杂的语义规则。通过实际机器人在物理环境中导航仍然是一个开放问题。端到端的培训方法仅在模拟中进行了彻底测试,实验涉及实际机器人的实际机器人在简化的实验室条件下限制为罕见的性能评估。在这项工作中,我们对真实物理代理的性能和推理能力进行了深入研究,在模拟中培训并部署到两个不同的物理环境。除了基准测试之外,我们提供了对不同条件下不同代理商培训的泛化能力的见解。我们可视化传感器使用以及不同类型信号的重要性。我们展示了,对于Pointgoal Task,一个代理在各种任务上进行预先培训,并在目标环境的模拟版本上进行微调,可以达到竞争性能,而无需建模任何SIM2重传,即通过直接从仿真部署培训的代理即可一个真正的物理机器人。
translated by 谷歌翻译
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
translated by 谷歌翻译
自主代理可以在新环境中导航而不构建明确的地图吗?对于PointGoal Navigation的任务(“转到$ \ delta x $,$ \ delta y $'),在理想化的设置(否RGB -D和驱动噪声,完美的GPS+Compass)下,答案是一个明确的“是” - 由任务无形组件(CNNS和RNN)组成的无地图神经模型接受了大规模增强学习训练,在标准数据集(Gibson)上取得了100%的成功。但是,对于PointNav在现实环境中(RGB-D和致动噪声,没有GPS+Compass),这是一个悬而未决的问题。我们在本文中解决了一个。该任务的最强成绩是成功的71.7%。首先,我们确定了性能下降的主要原因:GPS+指南针的缺失。带有RGB-D传感和致动噪声的完美GPS+指南针的代理商取得了99.8%的成功(Gibson-V2 Val)。这表明(解释模因)强大的视觉探子仪是我们对逼真的PointNav所需的全部。如果我们能够实现这一目标,我们可以忽略感应和致动噪声。作为我们的操作假设,我们扩展了数据集和模型大小,并开发了无人批准的数据启发技术来训练模型以进行视觉探测。我们在栖息地现实的PointNAV挑战方面的最新状态从71%降低到94%的成功(+23,31%相对)和53%至74%的SPL(+21,40%相对)。虽然我们的方法不饱和或“解决”该数据集,但这种强大的改进与有希望的零射击SIM2REAL转移(到Locobot)相结合提供了与假设一致的证据,即即使在现实环境中,显式映射也不是必需的。 。
translated by 谷歌翻译
最近的视听导航工作是无噪音音频环境中的单一静态声音,并努力推广到闻名声音。我们介绍了一种新型动态视听导航基准测试,其中一个体现的AI代理必须在存在分散的人和嘈杂的声音存在下在未映射的环境中捕获移动声源。我们提出了一种依赖于多模态架构的端到端增强学习方法,该方法依赖于融合来自双耳音频信号和空间占用映射的空间视听信息,以编码为我们的新的稳健导航策略进行编码所需的功能复杂的任务设置。我们展示了我们的方法优于当前的最先进状态,以更好地推广到闻名声音以及对嘈杂的3D扫描现实世界数据集副本和TASTPORT3D上的嘈杂情景更好地对嘈杂的情景进行了更好的稳健性,以实现静态和动态的视听导航基准。我们的小型基准将在http://dav-nav.cs.uni-freiburg.de提供。
translated by 谷歌翻译
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
translated by 谷歌翻译
这项工作研究了图像目标导航问题,需要通过真正拥挤的环境引导具有嘈杂传感器和控制的机器人。最近的富有成效的方法依赖于深度加强学习,并学习模拟环境中的导航政策,这些环境比真实环境更简单。直接将这些训练有素的策略转移到真正的环境可能非常具有挑战性甚至危险。我们用由四个解耦模块组成的分层导航方法来解决这个问题。第一模块在机器人导航期间维护障碍物映射。第二个将定期预测实时地图上的长期目标。第三个计划碰撞命令集以导航到长期目标,而最终模块将机器人正确靠近目标图像。四个模块是单独开发的,以适应真实拥挤的情景中的图像目标导航。此外,分层分解对导航目标规划,碰撞避免和导航结束预测的学习进行了解耦,这在导航训练期间减少了搜索空间,并有助于改善以前看不见的真实场景的概括。我们通过移动机器人评估模拟器和现实世界中的方法。结果表明,我们的方法优于多种导航基线,可以在这些方案中成功实现导航任务。
translated by 谷歌翻译
我们提出了Midgard,这是一个用于室外非结构化环境中自动机器人导航的开源模拟平台。 Midgard旨在实现在影照相3D环境中对自主代理(例如,无人接地车)进行培训,并通过培训场景中的可变性来支持基于学习的代理的概括技巧。 Midgard的主要功能包括可配置,可扩展和难度驱动的程序景观生成管道,并具有基于虚幻引擎的快速和影像现实主义场景。此外,Midgard还对OpenAi Gym进行了内置支持,OpenAi Gym是一个用于功能扩展的编程接口(例如,集成新型的传感器,自定义曝光内部模拟变量)和各种模拟代理传感器(例如RGB,DEPTH和实例/实例/语义细分)。我们评估了Midgard的功能,作为使用一组最先进的强化学习算法的机器人导航的基准测试工具。结果表明,Midgard作为模拟和训练环境的适用性,以及我们程序生成方法在控制场景难度方面的有效性,这直接反映了准确度量指标。 Midgard构建,源代码和文档可在https://midgardsim.org/上找到。
translated by 谷歌翻译
对象看起来和声音的方式提供了对其物理特性的互补反射。在许多设置中,视觉和试听的线索都异步到达,但必须集成,就像我们听到一个物体掉落在地板上,然后必须找到它时。在本文中,我们介绍了一个设置,用于研究3D虚拟环境中的多模式对象定位。一个物体在房间的某个地方掉落。配备了摄像头和麦克风的具体机器人剂必须通过将音频和视觉信号与知识的基础物理学结合来确定已删除的对象以及位置。为了研究此问题,我们生成了一个大规模数据集 - 倒下的对象数据集 - 其中包括64个房间中30个物理对象类别的8000个实例。该数据集使用Threedworld平台,该平台可以模拟基于物理的影响声音和在影片设置中对象之间的复杂物理交互。作为解决这一挑战的第一步,我们基于模仿学习,强化学习和模块化计划,开发了一组具体的代理基线,并对这项新任务的挑战进行了深入的分析。
translated by 谷歌翻译
大量数据集和高容量模型推动了计算机视觉和自然语言理解方面的许多最新进步。这项工作提出了一个平台,可以在体现的AI中实现类似的成功案例。我们提出了Procthor,这是一个程序生成体现的AI环境的框架。 Procthor使我们能够采样多种,交互式,可自定义和性能的虚拟环境的任意大型数据集,以训练和评估在导航,互动和操纵任务中的体现代理。我们通过10,000个生成的房屋和简单的神经模型的样本来证明procthor的能力和潜力。仅在Procthor上仅使用RGB图像训练的模型,没有明确的映射,并且没有人类任务监督在6个体现的AI基准中产生最先进的结果,用于导航,重排和手臂操纵,包括目前正在运行的Habitat 2022,AI2-- Thor重新安排2022,以及机器人挑战。我们还通过对procthor进行预训练,在下游基准测试上没有进行微调,通常会击败以前的最先进的系统,从而访问下游训练数据。
translated by 谷歌翻译
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a naturallanguage navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visuallygrounded navigation instructions, we present the Matter-port3D Simulator -a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -the Room-to-Room (R2R) dataset 1 .1 https://bringmeaspoon.org Instruction: Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.
translated by 谷歌翻译
Learning how to navigate among humans in an occluded and spatially constrained indoor environment, is a key ability required to embodied agent to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Socially-Aware Tasks (referred as to Risk and Social Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors. To this end, our tasks exploit the notion of immediate and future dangers of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This is done to capture fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called Encounter. We validate our approach on Gibson4+ and Habitat-Matterport3D datasets.
translated by 谷歌翻译
我们介绍了ThreedWorld(TDW),是交互式多模态物理模拟的平台。 TDW能够模拟高保真感官数据和富裕的3D环境中的移动代理和对象之间的物理交互。独特的属性包括:实时近光 - 真实图像渲染;对象和环境库,以及他们定制的例程;有效构建新环境课程的生成程序;高保真音频渲染;各种材料类型的现实物理相互作用,包括布料,液体和可变形物体;可定制的代理体现AI代理商;并支持与VR设备的人类交互。 TDW的API使多个代理能够在模拟中进行交互,并返回一系列表示世界状态的传感器和物理数据。我们在计算机视觉,机器学习和认知科学中的新兴的研究方向上提供了通过TDW的初始实验,包括多模态物理场景理解,物理动态预测,多代理交互,像孩子一样学习的模型,并注意研究人类和神经网络。
translated by 谷歌翻译
Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as general-purpose, training-only, supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is generally applicable, simple to implement, and encourages representations that encode objects' semantics, relationships, and history. Using the SGC loss, we attain significant gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment.
translated by 谷歌翻译
Progress in continual reinforcement learning has been limited due to several barriers to entry: missing code, high compute requirements, and a lack of suitable benchmarks. In this work, we present CORA, a platform for Continual Reinforcement Learning Agents that provides benchmarks, baselines, and metrics in a single code package. The benchmarks we provide are designed to evaluate different aspects of the continual RL challenge, such as catastrophic forgetting, plasticity, ability to generalize, and sample-efficient learning. Three of the benchmarks utilize video game environments (Atari, Procgen, NetHack). The fourth benchmark, CHORES, consists of four different task sequences in a visually realistic home simulator, drawn from a diverse set of task and scene parameters. To compare continual RL methods on these benchmarks, we prepare three metrics in CORA: Continual Evaluation, Isolated Forgetting, and Zero-Shot Forward Transfer. Finally, CORA includes a set of performant, open-source baselines of existing algorithms for researchers to use and expand on. We release CORA and hope that the continual RL community can benefit from our contributions, to accelerate the development of new continual RL algorithms.
translated by 谷歌翻译
我们介绍了互动室(Thor),这是一个视觉AI研究的框架,可在http://ai2thor.allenai.org上找到。AI2-这是由几乎逼真的3D室内场景组成的,在该场景中,AI代理可以在场景中导航并与对象进行交互以执行任务。AI2-这可以在许多不同的领域进行研究,包括但不限于深入强化学习,模仿学习,通过互动,计划,视觉问答答案,无监督的表示学习,对象检测和细分以及认知模型。AI2的目的是促进构建视觉上智能模型,并将研究推向该领域。
translated by 谷歌翻译
We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions. We use CARLA to study the performance of three approaches to autonomous driving: a classic modular pipeline, an endto-end model trained via imitation learning, and an end-to-end model trained via reinforcement learning. The approaches are evaluated in controlled scenarios of increasing difficulty, and their performance is examined via metrics provided by CARLA, illustrating the platform's utility for autonomous driving research.
translated by 谷歌翻译