Embodied Instruction Following (EIF) studies how mobile manipulator robots should be controlled to accomplish long-horizon tasks specified by natural language instructions. While most research on EIF is conducted in simulators, the ultimate goal of the field is to deploy agents in real life. As such, it is important to minimize the data cost required for training an agent to ease the transition from sim to real. However, many studies focus only on performance and overlook the data cost -- modules that require separate training on extra data are often introduced without consideration of deployability. In this work, we propose FILM++, which extends the existing work FILM with modifications that do not require extra data. While all data-driven modules are kept constant, FILM++ more than doubles FILM's performance. Furthermore, we propose Prompter, which replaces FILM++'s semantic search module with language model prompting. Unlike FILM++'s implementation, which requires training on extra sets of data, our prompting-based implementation requires no training while achieving better or at least comparable performance. Prompter achieves 42.64% and 45.72% on the ALFRED benchmark with high-level instructions only and with step-by-step instructions, respectively, outperforming the previous state of the art by 6.57% and 10.31%.
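As a rough illustration of the prompting-based semantic search idea described above, the sketch below scores candidate receptacles for a target object with an off-the-shelf masked language model. The prompt template, model choice, and candidate list are illustrative assumptions, not Prompter's exact implementation.

```python
# Illustrative sketch (not Prompter's actual code): rank candidate receptacles
# for a target object by prompting a masked language model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def rank_receptacles(target, receptacles):
    # Hypothetical prompt template; single-token receptacle names only, for simplicity.
    prompt = f"You are likely to find a {target} in the {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]  # (1, vocab_size)
    scores = {r: logits[0, tokenizer.convert_tokens_to_ids(r)].item() for r in receptacles}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_receptacles("mug", ["cabinet", "fridge", "sofa", "toilet"]))
```

The ranked receptacles can then be visited in order by a deterministic search policy, which is what lets a prompting-based module stand in for a component that would otherwise need extra training data.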
This study focuses on embodied agents that can follow natural language instructions to complete complex tasks in a visually-perceived environment. Existing methods rely on a large amount of (instruction, gold trajectory) pairs to learn a good policy. The high data cost and poor sample efficiency prevent the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models (LLMs) such as GPT-3 to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding to generate plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method can achieve very competitive few-shot performance, even outperforming several recent baselines that are trained using the full training data, despite using less than 0.5% of paired training data. Existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door for developing versatile and sample-efficient embodied agents that can quickly learn many tasks.
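A minimal sketch of how such few-shot, grounded planning could look: a handful of in-context (instruction, plan) examples and the objects currently detected in the scene are assembled into a prompt, and the LLM's completion is parsed into a high-level plan. The prompt format, example plans, and helper names are illustrative assumptions, not LLM-Planner's exact design.

```python
# Illustrative few-shot plan prompting with simple environmental grounding.
# `llm` is any text-completion callable (e.g., a GPT-3-style API wrapper).
from typing import Callable, List

EXAMPLES = [
    ("Put a clean mug in the coffee maker.",
     "find mug, pick up mug, go to sink, clean mug, go to coffee maker, put mug in coffee maker"),
    ("Throw the apple in the trash.",
     "find apple, pick up apple, go to garbage can, put apple in garbage can"),
]

def build_prompt(instruction: str, visible_objects: List[str]) -> str:
    shots = "\n".join(f"Task: {t}\nPlan: {p}" for t, p in EXAMPLES)
    grounding = ", ".join(visible_objects) or "nothing yet"
    return (f"{shots}\n"
            f"Objects seen so far: {grounding}\n"
            f"Task: {instruction}\nPlan:")

def plan(instruction: str, visible_objects: List[str], llm: Callable[[str], str]) -> List[str]:
    completion = llm(build_prompt(instruction, visible_objects))
    return [step.strip() for step in completion.split(",") if step.strip()]
```

Because the observed-object list is refreshed at each call, the same prompt-building function can also be reused for dynamic re-planning as the agent explores.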
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with nonreversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker" and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended action sequences, an agent can easily ignore some instructions or get stuck in the middle of a long instruction and eventually fail the task. To address this challenge, we propose a milestone-based task tracker (M-Track) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systematically checks the agent's progress on its current milestone and determines when to proceed to the next one. On the challenging ALFRED dataset, our M-Track leads to notable 33% and 52% relative improvements in unseen success rate over two competitive base models.
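A toy sketch of the milestone idea, assuming a purely rule-based builder and checker; the real M-Track components are learned, so the keyword rules and state flags below are only illustrative.

```python
# Toy sketch of milestone tracking: tag each instruction step as a navigation or
# interaction milestone, then only advance once the current one is satisfied.
INTERACTION_VERBS = ("pick", "put", "open", "close", "slice", "toggle", "wash")

def build_milestones(steps):
    return [("interact" if any(v in s.lower() for v in INTERACTION_VERBS) else "navigate", s)
            for s in steps]

def advance(milestones, idx, state):
    """Return the next milestone index given a simple state check."""
    kind, _text = milestones[idx]
    done = state["holding_changed"] if kind == "interact" else state["reached_target"]
    return idx + 1 if done else idx

milestones = build_milestones(["Walk to the counter.", "Pick up the knife.", "Go to the table."])
idx = advance(milestones, 0, {"reached_target": True, "holding_changed": False})  # -> 1
```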
In this paper, we introduce a map-language navigation task in which an agent executes natural language instructions and moves to the target position based only on a given 3D semantic map. To tackle the task, we design an instruction-aware Path Proposal and Discrimination model (IPPD). Our method leverages the map information to provide instruction-aware path proposals, i.e., it selects all potential instruction-consistent candidate paths to reduce the solution space. Next, to represent the map observations along a path for better modality alignment, a novel path feature encoding scheme tailored for semantic maps is proposed. An attention-based, language-driven discriminator is designed to evaluate the candidate paths and determine the best path as the final result. Compared with single-step greedy decision making, our method naturally avoids error accumulation. Compared with single-step imitation learning methods, IPPD achieves a performance gain of more than 17% in navigation success and 0.18 on the path-matching metric nDTW in challenging unseen environments.
Building a conversational embodied agent that executes real-life tasks has been a long-standing yet challenging research goal, as it requires effective human-agent communication, multimodal understanding, long-horizon sequential decision making, and more. Traditional symbolic methods suffer from scaling and generalization issues, while end-to-end deep learning models are hampered by data scarcity and high task complexity and are often hard to interpret. To benefit from the best of both worlds, we propose a neuro-symbolic commonsense reasoning framework (JARVIS) for modular, generalizable, and interpretable conversational embodied agents. It first acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Symbolic modules then reason over these representations for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TFD), and Two-Agent Task Completion (TATC); for example, our method raises the unseen success rate on EDH from 6.1% to 15.8%. In addition, we systematically analyze the essential factors that affect task performance and demonstrate the superiority of our method in the few-shot setting. Our JARVIS model ranked first in the Alexa Prize SimBot Public Benchmark Challenge.
Robots operating in human spaces must be able to interact with people in natural language, both understanding and executing instructions and using dialogue to resolve ambiguity and recover from mistakes. To this end, we introduce TEACh, a dataset of over 3,000 human-human interactive dialogues aimed at completing household tasks in simulation. A Commander with access to oracle information about the task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and obtaining additional information from the Commander. We propose three benchmarks that use TEACh to study embodied intelligence challenges, and we evaluate the abilities of initial models in dialogue understanding, language grounding, and task execution.
Natural language provides an accessible and expressive interface for specifying long-horizon tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions that abstract over specific robot actions through several layers of abstraction. We propose that the key to bridging this gap between language and robot actions over long execution horizons is persistent representations. We propose a persistent spatial semantic representation method and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-horizon tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.
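As a rough sketch of what a persistent spatial semantic representation could look like, the snippet below accumulates per-frame object detections into a fixed, world-aligned voxel grid that is never reset between time steps. The grid size, resolution, and update rule are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative persistent semantic map: a world-aligned voxel grid of class
# confidences that is updated, never reset, as the agent moves.
import numpy as np

class SemanticVoxelMap:
    def __init__(self, num_classes, size=(200, 200, 20), resolution=0.25):
        self.resolution = resolution
        self.grid = np.zeros((*size, num_classes), dtype=np.float32)

    def update(self, points_xyz, class_ids, confidences):
        """points_xyz: (N, 3) world coordinates; class_ids, confidences: (N,) arrays."""
        idx = np.floor(points_xyz / self.resolution).astype(int)
        valid = np.all((idx >= 0) & (idx < self.grid.shape[:3]), axis=1)
        for (x, y, z), c, p in zip(idx[valid], class_ids[valid], confidences[valid]):
            # Keep the max confidence seen so far, so evidence persists over time.
            self.grid[x, y, z, c] = max(self.grid[x, y, z, c], p)

    def where_is(self, class_id, threshold=0.5):
        """Return world-frame coordinates of voxels believed to contain the class."""
        return np.argwhere(self.grid[..., class_id] > threshold) * self.resolution
```

A higher-level planner can then query such a map (e.g., `where_is`) when selecting subgoals, without re-observing parts of the scene it has already seen.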
Recent works have shown how the reasoning capabilities of large language models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied settings need to consider not just what skills to use, but also how and when to use them -- answers that change over time in response to the agent's own choices. In this work, we investigate the extent to which LLMs used in such embodied settings can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion across three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real-world kitchen environment.
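A minimal sketch of this closed-loop pattern: textual feedback (success detection, scene descriptions, human responses) is appended to a running prompt and the LLM re-plans after every step. The prompt format and feedback callables are illustrative assumptions, not the paper's exact interface.

```python
# Illustrative closed-loop planning with language feedback appended to the prompt.
def inner_monologue(goal, llm, execute, detect_success, describe_scene, max_steps=10):
    transcript = f"Human: {goal}\n"
    for _ in range(max_steps):
        transcript += f"Scene: {describe_scene()}\n"
        action = llm(transcript + "Robot action:").strip()
        if action.lower() == "done":
            break
        execute(action)
        outcome = "Success." if detect_success(action) else "Failure, try something else."
        # Feed the outcome back so the next plan can react to it.
        transcript += f"Robot action: {action}\nFeedback: {outcome}\n"
    return transcript
```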
There has been an emerging paradigm shift in AI algorithms and agents from the era of "internet AI" to the era of "embodied AI", in which learning no longer comes from datasets of images, videos, or text curated primarily from the internet. Instead, agents learn through interaction with their environments via egocentric perception, similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support a variety of embodied AI research tasks. This growing interest in embodied AI benefits the greater pursuit of artificial general intelligence (AGI), yet there has been no contemporary and comprehensive survey of the field. This paper aims to provide an encyclopedic survey of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators against seven features we propose, this paper aims to understand the simulators in their provision for embodied AI research and their limitations. The paper then surveys the three main research tasks in embodied AI -- visual exploration, visual navigation, and embodied question answering (QA) -- covering state-of-the-art approaches, evaluation metrics, and datasets. Finally, with the new insights gained from surveying the field, the paper offers suggestions on simulator-for-task selection and recommendations on future directions for the field.
In Vision-and-Language Navigation (VLN), an embodied agent is required to follow natural language instructions in realistic 3D environments. A major bottleneck of existing VLN methods is the lack of sufficient training data, leading to unsatisfactory generalization to unseen environments. While VLN data is usually collected manually, this approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to pseudo 3D object labels via cross-view consistency. We then fine-tune a pretrained language model using the pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. The resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally show that HM3D-AutoVLN significantly improves the generalization ability of the resulting VLN models. On the SPL metric, our approach improves over the state of the art by 7.1% and 8.1% on the unseen validation splits of the REVERIE and SOON datasets, respectively.
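A toy sketch of the cross-view consistency step for pseudo 3D labels: 2D detections from several viewpoints are associated with 3D points, and a label is kept only when enough views agree. The voting rule and data layout are illustrative assumptions.

```python
# Toy cross-view consistency voting for pseudo 3D object labels.
from collections import Counter, defaultdict

def pseudo_3d_labels(view_detections, min_views=3):
    """view_detections: list (one entry per view) of {point_id: predicted_label}."""
    votes = defaultdict(Counter)
    for detections in view_detections:
        for point_id, label in detections.items():
            votes[point_id][label] += 1
    labels = {}
    for point_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count >= min_views:  # keep only labels that are consistent across views
            labels[point_id] = label
    return labels
```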
Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts. In this work, we present a straightforward method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps. Our method differs from previous prediction-based navigation methods, such as frontier potential prediction or egocentric map completion, by directly predicting unseen targets while leveraging the global context from all previously explored areas. Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data. Once trained, the model can be incorporated into a modular pipeline for ObjectNav without the need for any reinforcement learning. We validate the effectiveness of our method on the HM3D and MP3D ObjectNav datasets. We find that it achieves the state-of-the-art on both datasets, despite not using any additional data for training.
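The flavor of such a prediction model might look like the sketch below: a small fully convolutional network takes the incomplete multi-channel semantic map and outputs per-cell scores for where each object class might be, and the highest-scoring cell for the target class becomes a navigation goal. The architecture and channel layout are our own assumptions, not the paper's exact model.

```python
# Illustrative map-to-map predictor for unobserved object locations.
import torch
import torch.nn as nn

class ObjectLocationPredictor(nn.Module):
    def __init__(self, num_map_channels, num_object_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_map_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_object_classes, 1),
        )

    def forward(self, semantic_map):      # (B, C, H, W) partial semantic map
        return self.net(semantic_map)     # (B, K, H, W) per-class location logits

def goal_cell(model, semantic_map, target_class):
    """Pick the most likely map cell for the target object as a navigation goal."""
    with torch.no_grad():
        heat = model(semantic_map)[0, target_class]
    return divmod(int(heat.argmax()), heat.shape[-1])  # (row, col)
```

Such a predictor can be trained with a standard supervised loss against completed maps, and dropped into a modular pipeline in place of frontier-based exploration heuristics.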
Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient way to communicate with robots, but contemporary methods typically require expensive supervision in the form of trajectories annotated with language descriptions. We present a system for robotic navigation that enjoys the benefits of training on unannotated, large trajectory datasets while still providing a high-level interface to the user. Instead of using a labeled instruction-following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex outdoor environments from natural language instructions. For videos of our experiments, code release, and an interactive Colab notebook that runs in your browser, please check out our project page https://sites.google.com/view/lmnav
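A rough sketch of the middle step of such a pipeline, grounding landmark phrases to nodes of a topological graph with CLIP; the model choice and graph representation are illustrative, and the GPT-3 landmark extraction and ViNG planning steps are left abstract.

```python
# Illustrative landmark grounding with CLIP: score each landmark phrase against
# the image observed at each node of a topological graph.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground_landmarks(landmarks, node_images):
    """Return, for each landmark phrase, the index of the best-matching graph node."""
    inputs = processor(text=landmarks, images=node_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text   # (num_landmarks, num_nodes)
    return sims.argmax(dim=1).tolist()

# e.g. ground_landmarks(["a stop sign", "a picnic table"],
#                       [Image.open(p) for p in ["node0.jpg", "node1.jpg"]])
```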
Language-guided embodied AI benchmarks that require an agent to navigate an environment and manipulate objects typically allow only one-way communication: a human user gives the agent a natural language command, and the agent can only passively follow the command. We introduce DialFRED, a dialogue-enabled embodied instruction following benchmark based on the ALFRED benchmark. DialFRED allows an agent to actively ask questions of the human user; the agent uses the additional information in the user's responses to better complete its task. We release a human-annotated dataset with 53K task-relevant questions and answers, together with an oracle that can answer the questions. To tackle DialFRED, we propose a questioner-performer framework in which the questioner is pre-trained with the human-annotated data and fine-tuned with reinforcement learning. We make DialFRED publicly available and encourage researchers to propose and evaluate their solutions for building dialogue-enabled embodied agents.
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding through pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and videos can be found at https://say-can.github.io/
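The core scoring rule can be sketched as combining the language model's estimate that a skill is a useful next step with a value function's estimate that the skill will succeed from the current state; the helper names below are illustrative placeholders, not the paper's actual interfaces.

```python
# Illustrative SayCan-style skill selection: combine language-model usefulness
# scores with value-function feasibility scores and pick the best skill.
def select_skill(instruction, history, skills, llm_score, value_fn, state):
    """
    llm_score(instruction, history, skill) -> p(skill is the right next step)
    value_fn(state, skill)                 -> p(skill succeeds from this state)
    """
    combined = {
        skill: llm_score(instruction, history, skill) * value_fn(state, skill)
        for skill in skills
    }
    return max(combined, key=combined.get)
```

Repeating this selection after every executed skill, with the chosen skill appended to `history`, yields a long-horizon plan that stays within what the robot can actually do in its current situation.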
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by the corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) for ambiguous descriptions ("faster") depending on context (i.e., behavioral common sense). This paper presents Code as Policies: a robot-centric formulation of language-model-generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers) as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated on multiple real robot platforms. Central to our approach is prompting hierarchical code generation (recursively defining undefined functions), which can write more complex code and also improves the state of the art, solving 39.8% of the problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
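The prompting pattern can be sketched as follows: a few (comment command, policy code) examples, the new command as a comment, and the LLM's completion executed against a small robot API. The example API, prompt, and helper names are illustrative assumptions, not the paper's actual prompts or robot interface.

```python
# Illustrative few-shot "code as policies" prompting. `llm` completes text; the
# generated snippet is executed against a minimal, hypothetical robot API.
FEW_SHOT_PROMPT = '''
# move the block 10 cm to the right
pos = robot.get_object_position("block")
robot.move_object("block", (pos[0] + 0.10, pos[1]))

# put the apple on the plate
plate = robot.get_object_position("plate")
robot.move_object("apple", plate)
'''

def generate_policy_code(command, llm):
    prompt = FEW_SHOT_PROMPT + f"\n# {command}\n"
    return llm(prompt)

def run_command(command, llm, robot):
    code = generate_policy_code(command, llm)
    # NOTE: executing generated code is unsafe outside a sandbox; shown only for clarity.
    exec(code, {"robot": robot})
```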
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator -- a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -- the Room-to-Room (R2R) dataset (https://bringmeaspoon.org). (Example instruction: "Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.")
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.