我们研究了在室内路线上捕获的360度图像中的自动生成导航指令。现有的发电机遭受较差的视觉接地,导致它们依赖语言前沿和幻觉对象。我们的Marky-MT5系统通过专注于视觉地标来解决这一点;它包括第一阶段地标检测器和第二级发生器 - 多峰,多语言,多任务编码器 - 解码器。要培训它,我们在房间顶部(RXR)数据集的顶部引导地标注释。使用文本解析器,来自RXR的姿势迹线的弱监督,以及在1.8B图像上培训的多语言图像文本编码器,我们识别1.1M英语,印地语和泰卢语的地标描述并将其接地为Panoramas的特定区域。在房间到室内,人类途径在Marky-MT5的指示之后获得了71%的成功率(SR),只害羞他们的75%SR在人类指令之后 - 以及与其他发电机的SR高于SRS。对RXR更长的评估,不同的路径上的三种语言获得61-64%的SRS。在新颖环境中生成这种高质量的导航指令是迈向对话导航工具的一步,可以促进对指令跟随代理的大规模培训。
translated by 谷歌翻译
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
translated by 谷歌翻译
在视觉和语言导航(VLN)中,按照自然语言指令在现实的3D环境中需要具体的代理。现有VLN方法的一个主要瓶颈是缺乏足够的培训数据,从而导致对看不见的环境的概括不令人满意。虽然通常会手动收集VLN数据,但这种方法很昂贵,并且可以防止可扩展性。在这项工作中,我们通过建议从HM3D自动创建900个未标记的3D建筑物的大规模VLN数据集来解决数据稀缺问题。我们为每个建筑物生成一个导航图,并通过交叉视图一致性从2D传输对象预测,从2D传输伪3D对象标签。然后,我们使用伪对象标签来微调一个预处理的语言模型,作为减轻教学生成中跨模式差距的提示。在导航环境和说明方面,我们生成的HM3D-AUTOVLN数据集是比现有VLN数据集大的数量级。我们通过实验表明,HM3D-AUTOVLN显着提高了所得VLN模型的概括能力。在SPL指标上,我们的方法分别在Reverie和DataSet的看不见的验证分裂分别对艺术的状态提高了7.1%和8.1%。
translated by 谷歌翻译
视觉和语言导航(VLN)任务要求代理根据语言说明浏览环境。在本文中,我们旨在解决此任务中的两个关键挑战:利用多语言指令改进教学路径接地,并在培训期间看不见的新环境中导航。为了应对这些挑战,我们提出了明确的:跨语性和环境不可屈服的表示。首先,我们的经纪人在室内室内数据集中学习了三种语言(英语,印地语和泰卢固语)的共享且视觉上的跨语言表示。我们的语言表示学习是由视觉信息对齐的文本对指导的。其次,我们的代理商通过从不同环境中最大化语义对齐的图像对(对象匹配的约束)之间的相似性来学习环境不足的视觉表示。我们的环境不可知的视觉表示可以减轻低级视觉信息引起的环境偏见。从经验上讲,在房间 - 室内数据集中,我们表明,当通过跨语性语言表示和环境 - 非局部视觉表示形式概括地看不见的环境时,我们的多语言代理在所有指标上都比强大的基线模型进行了巨大改进。此外,我们表明我们学到的语言和视觉表示可以成功地转移到房间和合作的视觉和二元式导航任务上,并提出详细的定性和定量的概括和基础分析。我们的代码可从https://github.com/jialuli-luka/clear获得
translated by 谷歌翻译
机器人导航的目标条件政策可以在大型未注释的数据集上进行培训,从而为现实世界中的设置提供了良好的概括。但是,尤其是在指定目标需要图像的基于视觉的设置中,这是一个不自然的界面。语言为与机器人的通信提供了一种更方便的方式,但是现代方法通常需要以语言描述注释的轨迹的形式进行昂贵的监督。我们提出了一个用于机器人导航的系统,该系统享受着未注释的大型轨迹数据集培训的好处,同时仍为用户提供高级接口。我们没有在数据集之后使用标记的指令,而是表明可以完全从预先训练的导航模型(VING),图像语言关联(剪辑)和语言建模(GPT-3)中构建这样的系统,而无需任何微调或语言宣布的机器人数据。我们将LM-NAV实例化在现实世界中的移动机器人上,并通过自然语言指令通过复杂的室外环境演示长途导航。有关我们的实验的视频,代码发布和在浏览器中运行的交互式COLAB笔记本,请查看我们的项目页面https://sites.google.com/view/lmnav
translated by 谷歌翻译
了解空间和视觉信息对于遵循自然语言说明的导航代理至关重要。当前的基于变压器的VLN代理纠缠了方向和视觉信息,这限制了每个信息源的学习中的增益。在本文中,我们设计了具有明确取向和视觉模块的神经药物。这些模块学会了将空间信息和地标在视觉环境中的说明中提及。为了加强代理的空间推理和视觉感知,我们设计了特定的预训练任务,以进食并更好地利用我们最终导航模型中的相应模块。我们在Room2Room(R2R)和Room4Room(R4R)数据集上评估我们的方法,并在两个基准测试中实现最新结果。
translated by 谷歌翻译
在本文中,我们介绍了地图语言导航任务,代理在其中执行自然语言指令,并仅基于给定的3D语义图移至目标位置。为了解决任务,我们设计了指导感的路径建议和歧视模型(IPPD)。我们的方法利用MAP信息来提供指导感知的路径建议,即,它选择所有潜在的指令一致的候选路径以减少解决方案空间。接下来,为表示沿路径的地图观测值以获得更好的模态对准,提出了针对语义图定制的新型路径特征编码方案。基于注意力的语言驱动的歧视者旨在评估候选路径,并确定最佳路径作为最终结果。与单步贪婪决策方法相比,我们的方法自然可以避免误差积累。与单步仿制学习方法相比,IPPD在导航成功方面的性能增长超过17%,而在有挑战性的看不见的环境中,在路径匹配测量NDTW上的性能增长了0.18。
translated by 谷歌翻译
事实证明,演讲者的追随者模型在视觉和语言导航中有效,在该导航中,扬声器模型用于合成新的说明,以增强追随者导航模型的培训数据。但是,在以前的许多方法中,生成的指令未直接训练以优化追随者的性能。在本文中,我们介绍\ textsc {foam},a \ textsc {fo} llower- \ textsc {a} ware speaker \ textsc {m} odel,它不断更新给定关注的反馈,以便生成的指令可以是更多的指令。适合当前追随者的学习状态。具体而言,我们使用BI级优化框架优化了扬声器,并通过评估标记数据的跟随器来获得其训练信号。房间对房间和房间的室内数据集中的实验结果表明,我们的方法可以超越跨设置的强大基线模型。分析还表明,我们生成的指示的质量比基线更高。
translated by 谷歌翻译
感知,规划,估算和控制的当代方法允许机器人在不确定,非结构化环境中的远程代理中稳健运行。此进度现在创造了机器人不仅在隔离,而且在我们的复杂环境中运行的机器人。意识到这个机会需要一种高效且灵活的媒介,人类可以与协作机器人沟通。自然语言提供了一种这样的媒体,通过对自然语言理解的统计方法的重大进展,现在能够解释各种自由形式命令。然而,大多数当代方法需要机器人环境的详细,现有的空间语义地图,这些环境模拟了话语可能引用的可能引用的空间。因此,当机器人部署在新的,先前未知或部分观察到的环境中时,这些方法发生故障,特别是当环境的心理模型在人类运营商和机器人之间不同时。本文提供了一种新的学习框架的全面描述,允许现场和服务机器人解释并正确执行先验未知,非结构化环境中的自然语言指令。对于我们的方法而不是我们的语言作为“传感器” - 在话语中隐含的“传感器” - 推断的空间,拓扑和语义信息,然后利用这些信息来学习在潜在环境模型上的分布。我们将此分布纳入概率,语言接地模型中,并在机器人的动作空间的象征性表示中推断出分布。我们使用模仿学习来确定对环境和行为分布的原因的信仰空间政策。我们通过各种导航和移动操纵实验评估我们的框架。
translated by 谷歌翻译
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a naturallanguage navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visuallygrounded navigation instructions, we present the Matter-port3D Simulator -a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -the Room-to-Room (R2R) dataset 1 .1 https://bringmeaspoon.org Instruction: Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall.
translated by 谷歌翻译
在人类空间中运营的机器人必须能够与人的自然语言互动,既有理解和执行指示,也可以使用对话来解决歧义并从错误中恢复。为此,我们介绍了教学,一个超过3,000人的互动对话的数据集,以完成模拟中的家庭任务。一个有关任务的Oracle信息的指挥官以自然语言与追随者通信。追随者通过环境导航并与环境进行互动,以完成从“咖啡”到“准备早餐”的复杂性不同的任务,提出问题并从指挥官获取其他信息。我们提出三个基准使用教学研究体现了智能挑战,我们评估了对话理解,语言接地和任务执行中的初始模型的能力。
translated by 谷歌翻译
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
translated by 谷歌翻译
通用视觉(GPV)系统是旨在解决各种视觉任务的模型,而无需进行架构更改。如今,GPV主要从大型完全监督的数据集中学习技能和概念。通过获取数据以迅速学习每个技能的每个概念,将GPV扩展到数万个概念都变得令人望而却步。这项工作提出了一种有效且廉价的替代方法:从监督数据集中学习技能,从Web图像搜索中学习概念,并利用GPV的关键特征:跨技能传递视觉知识的能力。我们使用跨越10K+视觉概念的1M+图像的数据集来演示3个基准上的两个现有GPV(GPV-1和VL-T5)的Webly Supumented概念扩展:5个基于可可的数据集(80个主要概念),这是一个新的策划系列,这是一个新的策划系列。基于OpenImages和VisualGenome存储库(〜500个概念)以及Web衍生的数据集(10K+概念)的5个数据集。我们还提出了一种新的体系结构GPV-2,该架构支持各种任务 - 从分类和本地化等视觉任务到Qu Viewer+语言任务,例如QA和字幕,再到更多的利基市场,例如人类对象互动检测。 GPV-2从Web数据中受益匪浅,并且在这些基准测试中胜过GPV-1和VL-T5。我们的数据,代码和Web演示可在https://prior.allenai.org/projects/gpv2上获得。
translated by 谷歌翻译
我们研究了开发自主代理的问题,这些自主代理可以遵循人类的指示来推断和执行一系列行动以完成基础任务。近年来取得了重大进展,尤其是对于短范围的任务。但是,当涉及具有扩展动作序列的长匹马任务时,代理可以轻松忽略某些指令或陷入长长指令中间,并最终使任务失败。为了应对这一挑战,我们提出了一个基于模型的里程碑的任务跟踪器(M-Track),以指导代理商并监视其进度。具体而言,我们提出了一个里程碑构建器,该建筑商通过导航和交互里程碑标记指令,代理商需要逐步完成,以及一个系统地检查代理商当前里程碑的进度并确定何时继续进行下一个的里程碑检查器。在具有挑战性的Alfred数据集上,我们的M轨道在两个竞争基本模型中,未见成功率的相对成功率显着提高了33%和52%。
translated by 谷歌翻译
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
translated by 谷歌翻译
我们提出了一种可扩展的方法,用于学习开放世界对象目标导航(ObjectNAV) - 要求虚拟机器人(代理)在未探索的环境中找到对象的任何实例(例如,“查找接收器”)。我们的方法完全是零拍的 - 即,它不需要任何形式的objectNav奖励或演示。取而代之的是,我们训练图像目标导航(ImagenAv)任务,在该任务中,代理在其中找到了捕获图片(即目标图像)的位置。具体而言,我们将目标图像编码为多模式的语义嵌入空间,以在未注释的3D环境(例如HM3D)中以大规模训练语义目标导航(Senanticnav)代理。训练后,可以指示Semanticnav代理查找以自由形式的自然语言描述的对象(例如,“接收器”,“浴室水槽”等),通过将语言目标投射到相同的多模式,语义嵌入空间中。结果,我们的方法启用了开放世界的ObjectNAV。我们在三个ObjectNAV数据集(Gibson,HM3D和MP3D)上广泛评估了我们的代理商,并观察到成功的4.2%-20.0%的绝对改进。作为参考,这些收益与2020年至2021年Objectnav挑战赛竞争对手之间成功的5%改善相似或更好。在开放世界的环境中,我们发现我们的代理商可以概括为明确提到的房间(例如,“找到厨房水槽”)的复合说明,并且何时可以推断目标室(例如,”找到水槽和炉子”)。
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
In this work, we study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment and localize a remote object described by a concise high-level natural language instruction. When facing such a situation, a human tends to imagine what the destination may look like and to explore the environment based on prior knowledge of the environmental layout, such as the fact that a bathroom is more likely to be found near a bedroom than a kitchen. We have designed an autonomous agent called Layout-aware Dreamer (LAD), including two novel modules, that is, the Layout Learner and the Goal Dreamer to mimic this cognitive decision process. The Layout Learner learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation, which effectively introduces layout common sense of room-to-room transitions to our agent. To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand. Our agent achieves new state-of-the-art performance on the public leaderboard of the REVERIE dataset in challenging unseen test environments with improvement in navigation success (SR) by 4.02% and remote grounding success (RGS) by 3.43% compared to the previous state-of-the-art. The code is released at https://github.com/zehao-wang/LAD
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the \emph{Tasty Videos Dataset V2}, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
translated by 谷歌翻译