Physically rearranging objects is an important capability for embodied agents. Visual Room Rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, a voxel-based semantic map, and a semantic search policy to efficiently find the objects that need to be rearranged. On the AI2 Rearrangement Challenge, our method improves over current state-of-the-art end-to-end reinforcement learning methods that learn visual rearrangement policies, raising the rate of correct rearrangement from 0.53% to 16.56% while using only 2.7% as many samples from the environment.
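As a rough illustration of the map-comparison step described above (the function names and map encoding are assumptions for this sketch, not the authors' code), one can represent each configuration as a voxel grid of semantic labels and diff the two grids to list the object classes whose occupied voxels changed:

```python
import numpy as np

def objects_to_rearrange(goal_map: np.ndarray, current_map: np.ndarray) -> set:
    """Compare two voxel-based semantic maps (H x W x D arrays of integer
    class ids, 0 = free space) and return the class ids whose voxel
    footprints differ, i.e. candidate objects that still need rearranging."""
    assert goal_map.shape == current_map.shape
    changed = set()
    for cls in np.union1d(np.unique(goal_map), np.unique(current_map)):
        if cls == 0:  # skip free space
            continue
        if not np.array_equal(goal_map == cls, current_map == cls):
            changed.add(int(cls))
    return changed

# Toy example: a 4x4x1 map where the object with class id 3 moved by one cell.
goal = np.zeros((4, 4, 1), dtype=int); goal[1, 1, 0] = 3
curr = np.zeros((4, 4, 1), dtype=int); curr[2, 1, 0] = 3
print(objects_to_rearrange(goal, curr))  # {3}
```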
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts. In this work, we present a straightforward method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps. Our method differs from previous prediction-based navigation methods, such as frontier potential prediction or egocentric map completion, by directly predicting unseen targets while leveraging the global context from all previously explored areas. Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data. Once trained, the model can be incorporated into a modular pipeline for ObjectNav without the need for any reinforcement learning. We validate the effectiveness of our method on the HM3D and MP3D ObjectNav datasets. We find that it achieves the state-of-the-art on both datasets, despite not using any additional data for training.
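A minimal sketch of the kind of supervised map-to-map predictor this abstract describes (the layer sizes, channel count, and training data here are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class UnseenObjectPredictor(nn.Module):
    """Predicts a per-cell probability that the unobserved target object
    occupies each cell, given an incomplete top-down semantic map with C channels."""
    def __init__(self, num_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # per-cell logit for the target location
        )

    def forward(self, partial_map: torch.Tensor) -> torch.Tensor:
        return self.net(partial_map)

# One supervised training step on passively collected (partial map, target mask) pairs.
model = UnseenObjectPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
partial_map = torch.randn(8, 16, 64, 64)                  # batch of incomplete maps
target_mask = (torch.rand(8, 1, 64, 64) > 0.98).float()   # ground-truth target cells
loss = nn.functional.binary_cross_entropy_with_logits(model(partial_map), target_mask)
opt.zero_grad(); loss.backward(); opt.step()
```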
Natural language provides an accessible and expressive interface for specifying long-horizon tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions that abstract over specific robot actions through several layers of abstraction. We propose that the key to bridging this gap between language and robot actions over long execution horizons is persistent representations. We present a persistent spatial semantic representation method and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-horizon tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.
In this paper, we explore how to build upon the data and models of internet images and use them to adapt to robot vision without requiring any additional labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and to label the agent's observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves the object detection and instance segmentation performance of a pretrained perception model simply by moving around in the training environments, and the improved perception model can in turn be used to improve object goal navigation.
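The 3D-consistency labelling idea can be sketched as a simple voting scheme (a toy illustration under assumed data structures, not the SEAL implementation): per-frame predictions are accumulated into the voxels they back-project to, and each voxel's majority label is then propagated back to every frame that saw it:

```python
from collections import Counter, defaultdict

def propagate_labels(observations):
    """observations: list of dicts {pixel_id: (voxel, predicted_class)} mapping
    each labelled pixel of a frame to the 3D voxel it back-projects to.
    Returns per-frame pseudo-labels made consistent across views."""
    votes = defaultdict(Counter)
    for frame in observations:
        for voxel, cls in frame.values():
            votes[voxel][cls] += 1
    consensus = {v: c.most_common(1)[0][0] for v, c in votes.items()}
    return [{pix: consensus[voxel] for pix, (voxel, _) in frame.items()}
            for frame in observations]

# Three views of the same voxel disagree; the majority label wins in every frame.
frames = [{0: ((1, 2, 0), "chair")}, {0: ((1, 2, 0), "chair")}, {0: ((1, 2, 0), "sofa")}]
print(propagate_labels(frames))  # all three frames relabelled as "chair"
```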
Recent approaches to object goal navigation rely on reinforcement learning and typically require significant computational resources and learning time. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of "where to look?" for an object and "how to navigate to (x, y)?". Our key insight is that "where to look?" can be treated purely as a perception problem and learned without any environment interaction. To address this, we propose a network that predicts two complementary potential functions on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network with supervised learning on a passive dataset of top-down semantic maps and integrate it into a modular framework to perform object goal navigation. Experiments on Gibson and Matterport3D show that our method achieves the state of the art for object goal navigation while reducing the computational cost of training by up to 1,600x. Code and pretrained models are available: https://vision.cs.utexas.edu/projects/poni/
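A hedged sketch of how two predicted potential functions could be combined over frontier cells to pick a long-term goal (the weighting scheme and names below are assumptions, not the released PONI code):

```python
import numpy as np

def select_long_term_goal(area_potential, object_potential, frontier_mask, alpha=0.5):
    """Combine an area-coverage potential and an object potential (both H x W,
    higher = better) and return the frontier cell maximizing the mixture."""
    combined = alpha * area_potential + (1.0 - alpha) * object_potential
    combined = np.where(frontier_mask, combined, -np.inf)  # restrict to frontier cells
    return np.unravel_index(np.argmax(combined), combined.shape)

H = W = 8
frontier = np.zeros((H, W), dtype=bool); frontier[0, :] = True
goal = select_long_term_goal(np.random.rand(H, W), np.random.rand(H, W), frontier)
print(goal)  # (row, col) of the chosen frontier cell
```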
Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as general-purpose, training-only, supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is generally applicable, simple to implement, and encourages representations that encode objects' semantics, relationships, and history. Using the SGC loss, we attain significant gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment.
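A small sketch of the contrastive alignment idea (a generic InfoNCE between agent-state embeddings and scene-graph embeddings; the encoders, batching, and temperature are placeholders rather than the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def scene_graph_contrastive_loss(agent_emb, graph_emb, temperature=0.07):
    """agent_emb, graph_emb: (B, D) embeddings for matched agent states and
    scene graphs. Each agent state is pulled toward its own scene graph and
    pushed away from the other graphs in the batch."""
    a = F.normalize(agent_emb, dim=-1)
    g = F.normalize(graph_emb, dim=-1)
    logits = a @ g.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = scene_graph_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```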
We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense priors over object placement and room arrangement. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. The commonsense priors are encoded in three modules: i) visual-semantic detectors that detect out-of-place objects, ii) an associative neural graph memory of objects and spatial relations that proposes plausible semantic receptacles and surfaces for object repositioning, and iii) a visual search network that guides the agent's exploration to efficiently localize the receptacle of interest in the current scene for repositioning the object. We test TIDEE on tidying up disorganized scenes in the AI2THOR simulation environment. TIDEE performs the task directly from pixel and raw depth input, without ever having observed the same room beforehand, relying only on priors learned from a separate set of training houses. Human evaluations of the resulting room reorganizations show that TIDEE outperforms ablated versions of the model that do not use one or more of the commonsense priors. On a related room rearrangement benchmark in which the agent may view the goal state before rearranging, a simplified version of our model outperforms the best prior method by a large margin. Code and data are available on the project website: https://tidee-agent.github.io/
The way an object looks and sounds provides complementary reflections of its physical properties. In many settings, visual and auditory cues arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting for studying multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent equipped with a camera and microphone must determine what object has been dropped, and where, by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we generated a large-scale dataset, the Fallen Objects dataset, which includes 8,000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform, which can simulate physics-based impact sounds and complex physical interactions between objects in photorealistic settings. As a first step toward addressing this challenge, we develop a set of embodied agent baselines based on imitation learning, reinforcement learning, and modular planning, and we perform an in-depth analysis of the challenges of this new task.
This work studies the problem of image-goal navigation, which requires guiding a robot with noisy sensors and controls through real crowded environments. Recent fruitful approaches rely on deep reinforcement learning and learn navigation policies in simulation environments that are far simpler than real environments. Directly transferring these trained policies to real environments can be extremely challenging, or even dangerous. We address this problem with a hierarchical navigation method composed of four decoupled modules. The first module maintains an obstacle map during robot navigation. The second periodically predicts a long-term goal on the real-time map. The third plans a collision-free command set for navigating to the long-term goal, while the final module stops the robot properly near the goal image. The four modules are developed separately to suit image-goal navigation in real crowded scenarios. In addition, the hierarchical decomposition decouples the learning of goal planning, collision avoidance, and navigation-ending prediction, which reduces the search space during navigation training and helps improve generalization to previously unseen real scenes. We evaluate the method with a mobile robot in both a simulator and the real world. The results show that our method outperforms several navigation baselines and can successfully complete navigation tasks in these scenarios.
Building a conversational embodied agent that executes real-life tasks has been a long-standing yet challenging research goal, as it requires effective human-agent communication, multimodal understanding, long-horizon sequential decision making, and more. Traditional symbolic methods suffer from scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity and are often hard to interpret. To benefit from the best of both worlds, we propose a neuro-symbolic commonsense reasoning framework (JARVIS) for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then, a symbolic reasoning module generates sub-goal plans and actions based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC); for example, our method improves the unseen success rate on EDH from 6.1% to 15.8%. Moreover, we systematically analyze the essential factors that affect task performance and demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranked first in the Alexa Prize SimBot Public Benchmark Challenge.
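A hedged sketch of the prompting step for sub-goal planning (the prompt wording and the `llm_complete` helper are hypothetical stand-ins, not the JARVIS implementation):

```python
def plan_subgoals(dialog_history, llm_complete):
    """Prompt a large language model to turn a task dialog into an ordered
    list of symbolic sub-goals. `llm_complete` is any text-completion
    callable, e.g. a wrapper around an LLM API (hypothetical here)."""
    prompt = (
        "You are a household robot. Given the dialog below, list the sub-goals\n"
        "needed to complete the task, one per line, as (action, object) pairs.\n\n"
        f"Dialog:\n{dialog_history}\n\nSub-goals:\n"
    )
    completion = llm_complete(prompt)
    return [line.strip() for line in completion.splitlines() if line.strip()]

# Toy run with a canned "LLM" so the sketch is self-contained.
fake_llm = lambda _: "(find, mug)\n(pick_up, mug)\n(place, coffee_machine)"
print(plan_subgoals("Commander: please make coffee.", fake_llm))
```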
There has been an emerging paradigm shift in AI algorithms and agents from the era of "internet AI" to the era of "embodied AI", in which agents no longer learn from datasets of images, videos, or text curated primarily from the internet. Instead, they learn through interaction with their environments from an egocentric perception similar to that of humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support a variety of embodied AI research tasks. This growing interest in embodied AI benefits the greater pursuit of artificial general intelligence (AGI), but there has been no contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey of the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with seven features we propose, this paper aims to understand the simulators in terms of their use in embodied AI research and their limitations. The paper then surveys the three main research tasks in embodied AI: visual exploration, visual navigation, and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics, and datasets. Finally, with the new insights gathered from surveying the field, the paper offers suggestions on simulator-to-task selection and recommendations for future directions of the field.
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.
The visual indoor navigation (VIN) task has drawn increasing attention from the data-driven machine learning community, especially given the recently reported successes of learning-based methods. Due to the innate complexity of this task, researchers have approached the problem from a variety of angles, the full scope of which has not yet been captured in an overarching report. This survey first summarizes representative work on learning-based approaches to the VIN task, and then identifies and discusses the issues impeding VIN performance, motivating future research in these key areas that are worth exploring for the community.
We present a new AI task, Embodied Question Answering (EmbodiedQA), where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ('orange'). This challenging task requires a range of AI skills: active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
Contrastive Language Image Pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for embodied AI tasks. We build remarkably simple baselines, named EmbCLIP, with no task-specific architectures, no inductive biases (such as the use of semantic maps), no auxiliary tasks during training, and no depth maps, yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a large margin of 20 points (success rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubles the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav Challenge, which employed auxiliary tasks, depth maps, and human demonstrations, as well as those of the 2019 Habitat PointNav Challenge. We evaluate the ability of CLIP's visual representations to capture semantic information about input observations -- primitives that are useful for navigation-heavy embodied tasks -- and find that CLIP's representations encode these primitives more effectively than ImageNet-pretrained backbones. Finally, we extend one of our baselines, producing an agent capable of zero-shot object navigation that can navigate to objects that were not used as targets during training.
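A minimal sketch of the recipe as described: egocentric frames go through a frozen CLIP visual encoder and a small policy head acts on the features. The policy head and the six-action space are placeholder assumptions, and the sketch assumes OpenAI's open-source `clip` package is installed:

```python
import torch
import torch.nn as nn
import clip  # OpenAI's open-source CLIP package (assumed installed)

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone, preprocess = clip.load("RN50", device=device)
for p in backbone.parameters():            # keep the CLIP visual backbone frozen
    p.requires_grad_(False)

policy_head = nn.Sequential(               # placeholder navigation policy head
    nn.Linear(backbone.visual.output_dim, 512), nn.ReLU(),
    nn.Linear(512, 6),                      # e.g. 6 discrete navigation actions
).to(device)

frame = torch.randn(1, 3, 224, 224, device=device)  # a preprocessed egocentric frame
with torch.no_grad():
    features = backbone.encode_image(frame).float()
action_logits = policy_head(features)
print(action_logits.shape)  # torch.Size([1, 6])
```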
In this work, we present a memory-augmented approach for image-goal navigation. Earlier attempts, including RL-based and SLAM-based methods, have either shown poor generalization performance or relied heavily on pose/depth sensors. Our method is based on an attention-based end-to-end model that leverages an episodic memory to learn to navigate. First, we train a state-embedding network in a self-supervised fashion and then use it to embed previously visited states into the agent's memory. Our navigation policy takes advantage of this information through an attention mechanism. We validate our approach with extensive evaluations and show that our model establishes a new state of the art on the challenging Gibson dataset. Furthermore, in sharp contrast to related work, we achieve this impressive performance from RGB input alone, without access to additional information such as position or depth.
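A compact sketch of an attention read over an episodic memory of state embeddings (the dimensions and the way the context is fed to the policy are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def read_memory(current_state, memory):
    """current_state: (D,) embedding of the current observation.
    memory: (N, D) embeddings of previously visited states.
    Returns an attention-weighted summary of the memory."""
    scores = memory @ current_state / current_state.shape[0] ** 0.5  # (N,)
    weights = F.softmax(scores, dim=0)
    return weights @ memory                                          # (D,)

memory = torch.randn(50, 128)        # state embeddings appended as the agent moves
state = torch.randn(128)             # embedding of the current observation
context = read_memory(state, memory)
policy_input = torch.cat([state, context])  # fed to the navigation policy
print(policy_input.shape)  # torch.Size([256])
```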
A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object navigation challenge do not test on real robots and rely on expensive semantically labeled 3D meshes. In this work, our aim is an agent that builds self-supervised models of the world via exploration, the same as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can leverage location consistency as a supervision signal - collecting images from different views/angles and applying contrastive learning to fine-tune a semantic segmentation model. In our experiments, we observe that our framework performs better than other self-supervised baselines and competitively with supervised baselines, in both simulation and when deployed in real houses.
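A toy sketch of the location-consistency signal: image crops whose detections project to the same 3D location are grouped into positive pairs for contrastive fine-tuning of the segmentation model (the data structures and pairing logic are assumptions, not the paper's pipeline):

```python
import itertools
from collections import defaultdict

def location_consistent_pairs(observations):
    """observations: list of (location_id, crop_id) tuples, where location_id
    identifies the 3D position a detected object projects to and crop_id
    identifies the image crop. Crops sharing a location become positive pairs
    for contrastive fine-tuning of the segmentation model."""
    by_location = defaultdict(list)
    for loc, crop in observations:
        by_location[loc].append(crop)
    pairs = []
    for crops in by_location.values():
        pairs.extend(itertools.combinations(crops, 2))
    return pairs

obs = [("shelf_A", "img0_crop3"), ("shelf_A", "img7_crop1"), ("table_B", "img2_crop0")]
print(location_consistent_pairs(obs))  # [('img0_crop3', 'img7_crop1')]
```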
The object goal navigation (ObjectNav) task is to navigate an agent to an object category in an unseen environment without a pre-built map. In this paper, we address this task by predicting the distance to the target using semantically related objects as cues. Based on the estimated distance to the target object, our method directly selects optimal mid-term goals that are more likely to lie on a short path to the target. Specifically, based on learned knowledge, our model takes a bird's-eye-view semantic map as input and estimates the path length from boundary map cells to the target object. With the estimated distance map, the agent can simultaneously explore the environment and navigate to the target object using a simple human-designed strategy. Empirical results in visually realistic simulation environments show that the proposed method outperforms a wide range of baselines in terms of success rate and efficiency. Real-robot experiments also demonstrate that our method generalizes well to the real world. Video: https://www.youtube.com/watch?v=r79pwvgfks4
For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 21 CoW baselines across Habitat, RoboTHOR, and Pasture. In total, we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration -- and no additional training -- matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.
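A rough sketch of the CoW-style control flow described here: an open-vocabulary model scores the current view against the language goal, and the agent keeps exploring frontiers until the score clears a confidence threshold (the scoring function, threshold, and action names are placeholder assumptions):

```python
def cow_step(frame, goal_text, score_fn, explore_fn, approach_fn, threshold=0.3):
    """One decision step of a CLIP-on-Wheels-style agent.
    score_fn(frame, goal_text) -> float: open-vocabulary relevance of the view
    (e.g. CLIP image-text similarity); explore_fn/approach_fn return actions."""
    if score_fn(frame, goal_text) >= threshold:
        return approach_fn(frame)   # object likely visible: head toward it
    return explore_fn(frame)        # otherwise continue classical frontier exploration

# Toy run with stubbed components so the sketch is self-contained.
action = cow_step(
    frame="egocentric_rgb",
    goal_text="a red mug on the counter",
    score_fn=lambda f, t: 0.42,
    explore_fn=lambda f: "MOVE_TO_NEAREST_FRONTIER",
    approach_fn=lambda f: "MOVE_TOWARD_DETECTION",
)
print(action)  # MOVE_TOWARD_DETECTION
```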