Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Because this task takes place in 3D space, a 3D-aware agent can advance its ObjectNav capability by learning from fine-grained spatial information. However, leveraging a 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely a corner-guided exploration policy and a category-aware identification policy, operate simultaneously, using online fused 3D points as observation. Through extensive experiments, we show that this framework can dramatically improve ObjectNav performance by learning from a 3D scene representation. Our framework achieves the best performance among all modular methods on the Matterport3D and Gibson datasets, while requiring up to 30x less computational cost for training.
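State-of-the-art approaches to object goal navigation rely on reinforcement learning and typically require significant computational resources and training time. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of "where to look?" for an object and "how to navigate to (x, y)?". Our key insight is that "where to look?" can be treated purely as a perception problem and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform object goal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state of the art for object goal navigation while reducing the training compute cost by up to 1,600x. Code and pre-trained models are available: https://vision.cs.utexas.edu/projects/poni/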
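As a rough illustration of how two predicted potential maps could be used to decide where to look, the sketch below scores candidate frontier cells with a weighted sum and picks the best one. The weighted-sum rule, the `alpha` weight, and the names `pick_frontier`, `potential_a`, and `potential_b` are assumptions made for this example; the abstract does not specify how the two potentials are combined.

```python
def pick_frontier(frontiers, potential_a, potential_b, alpha=0.5):
    """Pick the frontier cell with the highest combined potential.

    frontiers: list of (row, col) cells on the map boundary.
    potential_a, potential_b: 2D lists holding the two predicted potentials.
    alpha: hypothetical mixing weight (an assumption, not from the abstract).
    """
    def score(cell):
        r, c = cell
        return alpha * potential_a[r][c] + (1.0 - alpha) * potential_b[r][c]

    return max(frontiers, key=score)


# Toy 1x2 map with two candidate frontier cells.
pot_a = [[0.2, 0.8]]
pot_b = [[0.9, 0.1]]
cells = [(0, 0), (0, 1)]

balanced_choice = pick_frontier(cells, pot_a, pot_b, alpha=0.5)
a_only_choice = pick_frontier(cells, pot_a, pot_b, alpha=1.0)
```

With equal weights the first cell wins because its second potential is high; with `alpha=1.0` only the first potential counts and the second cell is chosen.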
Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts. In this work, we present a straightforward method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps. Our method differs from previous prediction-based navigation methods, such as frontier potential prediction or egocentric map completion, by directly predicting unseen targets while leveraging the global context from all previously explored areas. Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data. Once trained, the model can be incorporated into a modular pipeline for ObjectNav without the need for any reinforcement learning. We validate the effectiveness of our method on the HM3D and MP3D ObjectNav datasets. We find that it achieves the state-of-the-art on both datasets, despite not using any additional data for training.
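This work studies the problem of image-goal navigation, which requires guiding a robot with noisy sensors and controls through real crowded environments. Recent fruitful approaches rely on deep reinforcement learning and learn navigation policies in simulated environments that are much simpler than real ones. Directly transferring these trained policies to real environments can be extremely challenging or even dangerous. We tackle this problem with a hierarchical navigation method composed of four decoupled modules. The first module maintains an obstacle map during robot navigation. The second periodically predicts a long-term goal on the real-time map. The third plans collision-free command sets for navigating to the long-term goal, while the final module stops the robot properly near the goal image. The four modules are developed separately to suit image-goal navigation in real crowded scenarios. In addition, the hierarchical decomposition decouples the learning of navigation goal planning, collision avoidance, and navigation-ending prediction, which reduces the search space during navigation training and helps improve generalization to previously unseen real scenes. We evaluate the method with a mobile robot in both a simulator and the real world. The results show that our method outperforms several navigation baselines and can successfully complete navigation tasks in these scenarios.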
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
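The object goal navigation (ObjectNav) task is to navigate an agent to an instance of an object category in unseen environments without a pre-built map. In this paper, we solve this task by predicting the distance to the target using semantically related objects as cues. Based on the estimated distance to the target object, our method directly chooses optimal mid-term goals that are more likely to have a shorter path to the target. Specifically, based on learned knowledge, our model takes a bird's-eye-view semantic map as input and estimates the path length from frontier map cells to the target object. With the estimated distance map, the agent can simultaneously explore the environment and navigate to the target object based on a simple human-designed strategy. Empirical results in visually realistic simulation environments show that the proposed method outperforms a wide range of baselines on success rate and efficiency. Real-robot experiments also demonstrate that our method generalizes well to the real world. Video: https://www.youtube.com/watch?v=r79pwvgfks4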
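The mid-term goal selection described above can be sketched as choosing the frontier cell with the smallest predicted path length to the target. This is a minimal sketch, not the authors' implementation: the function name `select_midterm_goal`, the list-of-lists distance map, and the toy values are all hypothetical, and the learned distance predictor is replaced by a hand-written map.

```python
def select_midterm_goal(frontier_cells, predicted_distance):
    """Return the frontier cell with the smallest predicted distance to the target.

    frontier_cells: list of (row, col) cells on the explored/unexplored boundary.
    predicted_distance: 2D list of estimated path lengths to the target object.
    Returns None when there is no frontier left to explore.
    """
    if not frontier_cells:
        return None
    return min(frontier_cells, key=lambda cell: predicted_distance[cell[0]][cell[1]])


# Hand-written stand-in for the model's estimated distance map (hypothetical values).
distance_map = [
    [9.0, 7.5, 6.0],
    [8.0, 4.2, 5.1],
    [7.0, 3.3, 2.8],
]
frontiers = [(0, 2), (1, 1), (2, 0)]

goal = select_midterm_goal(frontiers, distance_map)
no_goal = select_midterm_goal([], distance_map)
```

Here the cell `(1, 1)` is selected because its predicted distance (4.2) is lower than that of the other two frontier cells (6.0 and 7.0).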
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
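To enhance the cross-target and cross-scene generalization of target-driven visual navigation based on deep reinforcement learning (RL), we introduce an information-theoretic regularization term into the RL objective. The regularization maximizes the mutual information between navigation actions and the visual observation transforms of an agent, thus promoting more informed navigation decisions. In this way, the agent models the action-observation dynamics by learning a variational generative model. Based on the model, the agent generates (imagines) the next observation from its current observation and navigation target. The agent thereby learns the causal relationship between navigation actions and the changes in its observations, which allows it to predict the next navigation action by comparing the current and the imagined next observations. Cross-target and cross-scene evaluations on the AI2-THOR framework show that our method attains a 10% improvement in average success rate over some state-of-the-art models. We further evaluate our model in two real-world settings: navigation in unseen indoor scenes from the discrete Active Vision Dataset (AVD) and in continuous real-world environments with a TurtleBot. We demonstrate that our navigation model can successfully complete navigation tasks in these scenarios. Videos and models can be found in the supplementary material.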
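In this work, we present a memory-augmented approach for image-goal navigation. Earlier attempts, including RL-based and SLAM-based approaches, either show poor generalization performance or rely heavily on pose/depth sensors for stability. Our method is based on an attention-based end-to-end model that leverages an episodic memory to learn to navigate. First, we train a state-embedding network in a self-supervised fashion, and then use it to embed previously visited states into the agent's memory. Our navigation policy takes advantage of this information through an attention mechanism. We validate our approach with extensive evaluations and show that our model establishes a new state of the art on the challenging Gibson dataset. Furthermore, in stark contrast to related work, we achieve this performance from RGB input alone, without access to additional information such as position or depth.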
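There has been an emerging paradigm shift from the era of "internet AI" to the era of "embodied AI", in which AI algorithms and agents no longer learn from datasets of images, videos, or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to that of humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI benefits the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey of the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators against our proposed seven features, this paper aims to understand the simulators in terms of their use in embodied AI research and their limitations. The paper then surveys the three main research tasks in embodied AI: visual exploration, visual navigation, and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics, and datasets. Finally, with the new insights gained from surveying the field, the paper provides suggestions for simulator-for-task selection and recommendations for future directions of the field.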
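Object goal navigation requires a robot to find and navigate to an instance of a target object class in a previously unseen environment. Our framework incrementally builds a semantic map of the environment over time, and then repeatedly selects a long-term goal ("where to go") based on the semantic map to locate the target object instance. Long-term goal selection is formulated as a vision-based deep reinforcement learning problem. Specifically, an encoder network is trained to extract high-level features from a semantic map and select a long-term goal. In addition, we incorporate data augmentation and Q-function regularization to make the long-term goal selection more effective. We report experimental results using the photo-realistic Gibson benchmark dataset in the AI Habitat 3D simulation environment to demonstrate performance improvements on standard measures compared with a state-of-the-art data-driven baseline.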
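We present a target-driven navigation system to improve mapless visual navigation in indoor scenes. Our method takes a multi-view observation of a robot and a target as inputs at each time step to provide a sequence of actions that move the robot to the target without relying on odometry or GPS at runtime. The system is learned by optimizing a combined objective encompassing three key designs. First, we propose that the agent conceives the next observation before making an action decision. This is achieved by learning a variational generative module from expert demonstrations. Second, we propose predicting static collisions in advance, as an auxiliary task to improve safety during navigation. Third, to alleviate the training data imbalance problem of termination action prediction, we introduce a target checking module as an alternative to augmenting the navigation policy with a termination action. The three proposed designs all contribute to improved training data efficiency, static collision avoidance, and navigation generalization performance, resulting in a novel target-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.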
In recent years, several learning approaches to point goal navigation in previously unseen environments have been proposed. They vary in their representations of the environment, problem decomposition, and experimental evaluation. In this work, we compare state-of-the-art deep reinforcement learning approaches with a Partially Observable Markov Decision Process (POMDP) formulation of the point goal navigation problem. We adapt the POMDP sub-goal framework proposed by [1] and modify the component that estimates frontier properties by using partial semantic maps of indoor scenes built from semantic segmentation of images. In addition to the well-known completeness of the model-based approach, we demonstrate that it is robust and efficient in that it leverages informative, learned properties of the frontiers, compared to an optimistic frontier-based planner. We also demonstrate its data efficiency compared to end-to-end deep reinforcement learning approaches. We compare our results against an optimistic planner, ANS, and DD-PPO on the Matterport3D dataset using the Habitat simulator. We show comparable, though slightly worse, performance than the SOTA DD-PPO approach, yet with far less data.
A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object navigation challenge do not test on real robots and rely on expensive semantically labeled 3D meshes. In this work, our aim is an agent that builds self-supervised models of the world via exploration, the same as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can leverage location consistency as a supervision signal - collecting images from different views/angles and applying contrastive learning to fine-tune a semantic segmentation model. In our experiments, we observe that our framework performs better than other self-supervised baselines and competitively with supervised baselines, in both simulation and when deployed in real houses.
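In this paper, we explore how to build upon the data and models of internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It uses perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and use 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and to label the agent's observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves the object detection and instance segmentation performance of a pretrained perception model just by moving around in training environments, and the improved perception model can be used to improve object goal navigation.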
Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as general-purpose, training-only, supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is generally applicable, simple to implement, and encourages representations that encode objects' semantics, relationships, and history. Using the SGC loss, we attain significant gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment.
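The way an object looks and sounds provides complementary reflections of its physical properties. In many settings, cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and a microphone, must determine what object has been dropped, and where, by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset, the Fallen Objects dataset, which includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform, which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting. As a first step toward addressing this challenge, we develop a set of embodied agent baselines based on imitation learning, reinforcement learning, and modular planning, and perform an in-depth analysis of the challenges of this new task.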
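We propose to improve the cross-target and cross-scene generalization of visual navigation by learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the agent's current observations and the target view. Our generative model is learned by optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on the current observations and the target view, leading to model-based, target-driven navigation. Second, the latent space is modeled with a mixture of Gaussians conditioned on the current observation and the next best action. Our use of a mixture-of-posteriors prior effectively alleviates the problem of an over-regularized latent space, thus significantly boosting model generalization to new targets and novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms state-of-the-art models in terms of success rate, data efficiency, and generalization.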
Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks -- all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task.
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast: when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-to-end development of embodied AI algorithms: defining tasks (e.g. navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents. These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works [20,16] and find evidence for the opposite conclusion, namely that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.