We present a new AI task -Embodied Question Answering (EmbodiedQA) -where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ('orange'). This challenging task requires a range of AI skills -active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
translated by 谷歌翻译
从“Internet AI”的时代到“体现AI”的时代,AI算法和代理商出现了一个新兴范式转变,其中不再从主要来自Internet策划的图像,视频或文本的数据集。相反,他们通过与与人类类似的Enocentric感知来通过与其环境的互动学习。因此,对体现AI模拟器的需求存在大幅增长,以支持各种体现的AI研究任务。这种越来越多的体现AI兴趣是有利于对人工综合情报(AGI)的更大追求,但对这一领域并无一直存在当代和全面的调查。本文旨在向体现AI领域提供百科全书的调查,从其模拟器到其研究。通过使用我们提出的七种功能评估九个当前体现的AI模拟器,旨在了解模拟器,以其在体现AI研究和其局限性中使用。最后,本文调查了体现AI - 视觉探索,视觉导航和体现问题的三个主要研究任务(QA),涵盖了最先进的方法,评估指标和数据集。最后,随着通过测量该领域的新见解,本文将为仿真器 - 任务选择和建议提供关于该领域的未来方向的建议。
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a naturallanguage navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visuallygrounded navigation instructions, we present the Matter-port3D Simulator -a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -the Room-to-Room (R2R) dataset 1 .1 https://bringmeaspoon.org Instruction: Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.
translated by 谷歌翻译
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with nonreversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing visionand-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
translated by 谷歌翻译
语言指导的体现了AI基准,要求代理导航环境并操纵对象通常允许单向通信:人类用户向代理提供了自然语言命令,而代理只能被动地遵循命令。我们介绍了基于Alfred基准测试的基准测试后的拨号式拨号。Dialfred允许代理商积极向人类用户提出问题;代理使用用户响应中的其他信息来更好地完成其任务。我们发布了一个具有53K任务的问题和答案的人类注销数据集,以及一个可以回答问题的甲骨文。为了解决Dialfred,我们提出了一个提问者绩效框架,其中发问者通过人类通知的数据进行了预训练,并通过增强学习进行了微调。我们将拨号拨入公开,并鼓励研究人员提出和评估他们的解决方案,以构建支持对话的体现代理。
translated by 谷歌翻译
视觉室内导航(VIN)任务已从数据驱动的机器学习社区中引起了人们的关注,尤其是在最近报告的基于学习方法的成功中。由于这项任务的先天复杂性,研究人员尝试从各种不同角度解决问题,其全部范围尚未在总体报告中捕获。这项调查首先总结了VIN任务的基于学习的方法的代表性工作,然后确定并讨论了阻碍VIN绩效的问题,并激发了值得探索社区的这些关键领域的未来研究。
translated by 谷歌翻译
我们研究了开发自主代理的问题,这些自主代理可以遵循人类的指示来推断和执行一系列行动以完成基础任务。近年来取得了重大进展,尤其是对于短范围的任务。但是,当涉及具有扩展动作序列的长匹马任务时,代理可以轻松忽略某些指令或陷入长长指令中间,并最终使任务失败。为了应对这一挑战,我们提出了一个基于模型的里程碑的任务跟踪器(M-Track),以指导代理商并监视其进度。具体而言,我们提出了一个里程碑构建器,该建筑商通过导航和交互里程碑标记指令,代理商需要逐步完成,以及一个系统地检查代理商当前里程碑的进度并确定何时继续进行下一个的里程碑检查器。在具有挑战性的Alfred数据集上,我们的M轨道在两个竞争基本模型中,未见成功率的相对成功率显着提高了33%和52%。
translated by 谷歌翻译
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
translated by 谷歌翻译
第一人称视频在其持续环境的背景下突出了摄影师的活动。但是,当前的视频理解方法是从短视频剪辑中的视觉特征的原因,这些视频片段与基础物理空间分离,只捕获直接看到的东西。我们提出了一种方法,该方法通过学习摄影师(潜在看不见的)本地环境来促进以人为中心的环境的了解来链接以自我为中心的视频和摄像机随着时间的推移而张开。我们使用来自模拟的3D环境中的代理商的视频进行训练,在该环境中,环境完全可以观察到,并在看不见的环境的房屋旅行的真实视频中对其进行测试。我们表明,通过将视频接地在其物理环境中,我们的模型超过了传统的场景分类模型,可以预测摄影师所处的哪个房间(其中帧级信息不足),并且可以利用这种基础来定位与环境相对应的视频瞬间 - 中心查询,优于先验方法。项目页面:http://vision.cs.utexas.edu/projects/ego-scene-context/
translated by 谷歌翻译
在人类空间中运营的机器人必须能够与人的自然语言互动,既有理解和执行指示,也可以使用对话来解决歧义并从错误中恢复。为此,我们介绍了教学,一个超过3,000人的互动对话的数据集,以完成模拟中的家庭任务。一个有关任务的Oracle信息的指挥官以自然语言与追随者通信。追随者通过环境导航并与环境进行互动,以完成从“咖啡”到“准备早餐”的复杂性不同的任务,提出问题并从指挥官获取其他信息。我们提出三个基准使用教学研究体现了智能挑战,我们评估了对话理解,语言接地和任务执行中的初始模型的能力。
translated by 谷歌翻译
最近的作品表明,如何将大语言模型(LLM)的推理能力应用于自然语言处理以外的领域,例如机器人的计划和互动。这些具体的问题要求代理商了解世界上许多语义方面:可用技能的曲目,这些技能如何影响世界以及对世界的变化如何映射回该语言。在体现环境中规划的LLMS不仅需要考虑要做什么技能,还需要考虑如何以及何时进行操作 - 答案随着时间的推移而变化,以响应代理商自己的选择。在这项工作中,我们调查了在这种体现的环境中使用的LLM在多大程度上可以推论通过自然语言提供的反馈来源,而无需任何其他培训。我们建议,通过利用环境反馈,LLM能够形成内部独白,使他们能够在机器人控制方案中进行更丰富的处理和计划。我们研究了各种反馈来源,例如成功检测,场景描述和人类互动。我们发现,闭环语言反馈显着改善了三个领域的高级指导完成,包括模拟和真实的桌面顶部重新排列任务以及现实世界中厨房环境中的长途移动操作任务。
translated by 谷歌翻译
在工厂或房屋等环境中协助我们的机器人必须学会使用对象作为执行任务的工具,例如使用托盘携带对象。我们考虑了学习常识性知识何时可能有用的问题,以及如何与其他工具一起使用其使用以完成由人类指示的高级任务。具体而言,我们引入了一种新型的神经模型,称为Tooltango,该模型首先预测要使用的下一个工具,然后使用此信息来预测下一项动作。我们表明,该联合模型可以告知学习精细的策略,从而使机器人可以顺序使用特定工具,并在使模型更加准确的情况下增加了重要价值。 Tooltango使用图神经网络编码世界状态,包括对象和它们之间的符号关系,并使用人类教师的演示进行了培训,这些演示是指导物理模拟器中的虚拟机器人的演示。该模型学会了使用目标和动作历史的知识来参加场景,最终将符号动作解码为执行。至关重要的是,我们解决了缺少一些已知工具的看不见的环境的概括,但是存在其他看不见的工具。我们表明,通过通过从知识库中得出的预训练的嵌入来增强环境的表示,该模型可以有效地将其推广到新的环境中。实验结果表明,在预测具有看不见对象的新型环境中模拟移动操纵器的成功符号计划时,至少48.8-58.1%的绝对改善对基准的绝对改善。这项工作朝着使机器人能够快速合成复杂任务的强大计划的方向,尤其是在新颖的环境中
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
这项工作研究了图像目标导航问题,需要通过真正拥挤的环境引导具有嘈杂传感器和控制的机器人。最近的富有成效的方法依赖于深度加强学习,并学习模拟环境中的导航政策,这些环境比真实环境更简单。直接将这些训练有素的策略转移到真正的环境可能非常具有挑战性甚至危险。我们用由四个解耦模块组成的分层导航方法来解决这个问题。第一模块在机器人导航期间维护障碍物映射。第二个将定期预测实时地图上的长期目标。第三个计划碰撞命令集以导航到长期目标,而最终模块将机器人正确靠近目标图像。四个模块是单独开发的,以适应真实拥挤的情景中的图像目标导航。此外,分层分解对导航目标规划,碰撞避免和导航结束预测的学习进行了解耦,这在导航训练期间减少了搜索空间,并有助于改善以前看不见的真实场景的概括。我们通过移动机器人评估模拟器和现实世界中的方法。结果表明,我们的方法优于多种导航基线,可以在这些方案中成功实现导航任务。
translated by 谷歌翻译
We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation. Specifically, Habitat consists of: (i) Habitat-Sim: a flexible, high-performance 3D simulator with configurable agents, sensors, and generic 3D dataset handling. Habitat-Sim is fast -when rendering a scene from Matterport3D, it achieves several thousand frames per second (fps) running single-threaded, and can reach over 10,000 fps multi-process on a single GPU. (ii) Habitat-API: a modular high-level library for end-toend development of embodied AI algorithms -defining tasks (e.g. navigation, instruction following, question answering), configuring, training, and benchmarking embodied agents.These large-scale engineering contributions enable us to answer scientific questions requiring experiments that were till now impracticable or 'merely' impractical. Specifically, in the context of point-goal navigation: (1) we revisit the comparison between learning and SLAM approaches from two recent works [20,16] and find evidence for the opposite conclusion -that learning outperforms SLAM if scaled to an order of magnitude more experience than previous investigations, and (2) we conduct the first cross-dataset generalization experiments {train, test} × {Matterport3D, Gibson} for multiple sensors {blind, RGB, RGBD, D} and find that only agents with depth (D) sensors generalize across datasets. We hope that our open-source platform and these findings will advance research in embodied AI.
translated by 谷歌翻译
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. The integration of vision and language has sparked a lot of attention as a result of this. The tasks have been created in such a way that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and an extensive review of the state of the arts approaches, key models design principles and discuss existing datasets, methods, their problem formulation and evaluation measures for VQA and Visual reasoning tasks to understand vision and language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
translated by 谷歌翻译
解决时间扩展的任务是大多数增强学习(RL)算法的挑战[ARXIV:1906.07343]。我们研究了RL代理商学会提出自然语言问题的能力,以了解其环境并在新颖,时间扩展的环境中实现更大的概括性能。我们通过赋予该代理商的能力向全知的甲骨文提出“是,不”问题来做到这一点。这使代理商可以获得有关手头任务的指导,同时限制了对新信息的访问。为了在时间扩展的任务的背景下研究这种自然语言问题的出现,我们首先在迷你网格环境中训练代理商。然后,我们将受过训练的代理转移到另一个更艰难的环境中。与无法提出问题的基线代理相比,我们观察到概括性能的显着提高。通过将其对自然语言在其环境中的理解,代理可以推理其环境的动态,以至于在新型环境中部署时可以提出新的,相关的问题。
translated by 谷歌翻译
Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts. In this work, we present a straightforward method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps. Our method differs from previous prediction-based navigation methods, such as frontier potential prediction or egocentric map completion, by directly predicting unseen targets while leveraging the global context from all previously explored areas. Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data. Once trained, the model can be incorporated into a modular pipeline for ObjectNav without the need for any reinforcement learning. We validate the effectiveness of our method on the HM3D and MP3D ObjectNav datasets. We find that it achieves the state-of-the-art on both datasets, despite not using any additional data for training.
translated by 谷歌翻译
AR眼镜/机器人等智能助手的长期目标是帮助用户以负担得起的现实世界情景,例如“我如何运行微波炉1分钟?”。但是,仍然没有明确的任务定义和合适的基准。在本文中,我们定义了一项名为“负担中心问题驱动的任务完成”的新任务,AI助手应从教学视频和脚本中学习,以指导用户逐步指导用户。为了支持该任务,我们构建了AssistQ,这是一个新的数据集,其中包括531个问答样本,该样本来自100个新电影的第一人称视频。每个问题都应通过从视觉细节(例如按钮的位置)和纹理细节(例如,按/转弯之类的操作)推断出多步导完成。为了解决这一独特的任务,我们开发了一个问题对行为(Q2A)模型,该模型极大地超过了几种基线方法,同时仍然有大量改进的空间。我们希望我们的任务和数据集能够推进Egentric AI助手的发展。我们的项目页面可在以下网址找到:https://showlab.github.io/assistq
translated by 谷歌翻译