我们提出了一个名为“ Visual配方流”的新的多模式数据集,使我们能够学习每个烹饪动作的结果。数据集由对象状态变化和配方文本的工作流程组成。状态变化表示为图像对,而工作流则表示为食谱流图(R-FG)。图像对接地在R-FG中,该R-FG提供了交叉模式关系。使用我们的数据集,可以尝试从多模式常识推理和程序文本生成来尝试一系列应用程序。
translated by 谷歌翻译
“行动”在人类与世界互动并使他们实现理想的目标方面起着至关重要的作用。结果,对人类的最常识(CS)知识围绕着行动。尽管“关于行动与变革的推理”(RAC)在知识代表社区中得到了广泛的研究,但它最近引起了NLP和计算机视觉研究人员的兴趣。本文调查了现有的任务,基准数据集,各种技术和模型,以及它们在视觉和语言领域中RAC中进步的各自绩效。最后,我们总结了我们的关键要点,讨论该研究领域面临的目前挑战,并概述了未来研究的潜在方向。
translated by 谷歌翻译
我们介绍了遮阳板,一个新的像素注释的新数据集和一个基准套件,用于在以自我为中心的视频中分割手和活动对象。遮阳板注释Epic-kitchens的视频,其中带有当前视频分割数据集中未遇到的新挑战。具体而言,我们需要确保像素级注释作为对象经历变革性相互作用的短期和长期一致性,例如洋葱被剥皮,切成丁和煮熟 - 我们旨在获得果皮,洋葱块,斩波板,刀,锅以及表演手的准确像素级注释。遮阳板引入了一条注释管道,以零件为ai驱动,以进行可伸缩性和质量。总共,我们公开发布257个对象类的272K手册语义面具,990万个插值密集口罩,67K手动关系,涵盖36小时的179个未修剪视频。除了注释外,我们还引入了视频对象细分,互动理解和长期推理方面的三个挑战。有关数据,代码和排行榜:http://epic-kitchens.github.io/visor
translated by 谷歌翻译
了解多媒体内容中描述或显示的事件彼此相关是开发可用于真实世界媒体的强大人工智能系统的关键组成部分。尽管许多研究专门用于文本,图像和视频域中的事件理解,但没有一个研究探索事件跨域中经历的复杂关系。例如,新闻文章可能会描述“抗议”事件,而视频显示“逮捕”事件。认识到视觉“逮捕”事件是更广泛的“抗议”事件的一个子事件,这是一个具有挑战性但重要的问题,但前面的工作尚未探讨。在本文中,我们提出了多模式事件关系关系的新任务,以识别这种跨模式事件关系。我们贡献了一个大规模数据集,该数据集由100K视频新文章对组成,以及密集注释的数据的基准。我们还提出了一种弱监督的多模式方法,该方法将来自外部知识库(KB)的常识性知识整合在一起,以预测丰富的多模式事件层次结构。实验表明,我们的模型在我们提出的基准上优于许多竞争基线。我们还对模型的性能进行了详细的分析,并建议未来研究的方向。
translated by 谷歌翻译
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the \emph{Tasty Videos Dataset V2}, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
translated by 谷歌翻译
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it's still not explored for vision and multimodal tasks. In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset that consists of 47 diverse multimodal tasks covering 11 broad categories. Each task is designed at least with 5,000 instances (input-out pairs) from existing open-source datasets and 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to improve its performance, we explore multiple transfer learning strategies to leverage the large-scale Natural Instructions dataset. Experimental results demonstrate its strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from text-only instructions. We also design a new evaluation metric: Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that the model is less sensitive to the varying instructions after finetuning on a diverse set of tasks and instructions for each task.
translated by 谷歌翻译
这项工作提出了一个新的对话数据集,即cookdial,该数据集促进了对任务知识了解的面向任务的对话系统的研究。该语料库包含260个以人类对任务为导向的对话框,其中代理给出了配方文档,指导用户烹饪菜肴。 Cookdial中的对话框展示了两个独特的功能:(i)对话流与支持文档之间的程序对齐; (ii)复杂的代理决策涉及分割长句子,解释硬说明并在对话框上下文中解决核心。此外,我们在假定的面向任务的对话框系统中确定了三个具有挑战性的(子)任务:(1)用户问题理解,(2)代理操作框架预测和(3)代理响应生成。对于这些任务中的每一个,我们都会开发一个神经基线模型,我们在cookdial数据集上进行了评估。我们公开发布烹饪数据集,包括对话框和食谱文档的丰富注释,以刺激对特定于域的文档接地对话框系统的进一步研究。
translated by 谷歌翻译
为了使AI安全地在医院,学校和工作场所等现实世界中安全部署,它必须能够坚定地理解物理世界。这种推理的基础是物理常识:了解可用对象的物理特性和提供的能力,如何被操纵以及它们如何与其他对象进行交互。物理常识性推理从根本上是一项多感官任务,因为物理特性是通过多种模式表现出来的,其中两个是视觉和声学。我们的论文通过贡献PACS来朝着现实世界中的物理常识推理:第一个用于物理常识属性注释的视听基准。 PACS包含13,400对答案对,涉及1,377个独特的物理常识性问题和1,526个视频。我们的数据集提供了新的机会来通过将音频作为此多模式问题的核心组成部分来推进物理推理的研究领域。使用PACS,我们在我们的新挑战性任务上评估了多种最先进的模型。尽管某些模型显示出令人鼓舞的结果(精度为70%),但它们都没有人类的绩效(精度为95%)。我们通过证明多模式推理的重要性并为未来的研究提供了可能的途径来结束本文。
translated by 谷歌翻译
它仍然是一个管道梦想,电话和AR眼镜的AI助手可以帮助我们的日常生活来解决我们的问题,如“如何调整这款手表日期?”和“如何设置加热持续时间?(指向烤箱的同时)”。传统任务中使用的查询(即视频问题应答,视频检索,时刻定位)通常是有关的,并基于纯文本。相比之下,我们提出了一项名为Cometdancy的问题驱动视频段检索(AQVSR)的新任务。我们每个问题都是一个图像框文本查询,专注于我们日常生活中的物品,并期望从教学视频转录程序段的语料库中检索相关的答案段。为了支持对此AQVSR任务的研究,我们构建一个名为AssionSR的新数据集。我们设计新颖的准则来创造高质量样本。此数据集包含有关1K视频片段的1.4K多模态问题,来自各种日用物品的教学视频。为了解决AQVSR,我们开发了一个称为双重多模式编码器(DME)的简单但有效的模型,显着优于几种基线方法,同时仍然有大型未来改善空间。此外,我们提供了详细的消融分析。我们的代码和数据可以在https://github.com/stanlei52/aqvsr中获得。
translated by 谷歌翻译
可穿戴摄像机可以从用户的角度获取图像和视频。可以处理这些数据以了解人类的行为。尽管人类的行为分析已在第三人称视野中进行了彻底的研究,但仍在以自我为中心的环境中,尤其是在工业场景中进行了研究。为了鼓励在该领域的研究,我们介绍了Meccano,这是一个以自我为中心视频的多式模式数据集来研究类似工业的环境中的人类行为理解。多模式的特征是凝视信号,深度图和RGB视频同时使用自定义耳机获得。该数据集已在从第一人称视角的人类行为理解的背景下明确标记为基本任务,例如识别和预测人类对象的相互作用。使用MECCANO数据集,我们探索了五个不同的任务,包括1)动作识别,2)活动对象检测和识别,3)以自我为中心的人类对象互动检测,4)动作预期和5)下一步活动对象检测。我们提出了一个旨在研究人类行为的基准,该基准在被考虑的类似工业的情况下,表明所研究的任务和所考虑的方案对于最先进的算法具有挑战性。为了支持该领域的研究,我们在https://iplab.dmi.unict.it/meccano/上公开发布数据集。
translated by 谷歌翻译
To effectively train accurate Relation Extraction models, sufficient and properly labeled data is required. Adequately labeled data is difficult to obtain and annotating such data is a tricky undertaking. Previous works have shown that either accuracy has to be sacrificed or the task is extremely time-consuming, if done accurately. We are proposing an approach in order to produce high-quality datasets for the task of Relation Extraction quickly. Neural models, trained to do Relation Extraction on the created datasets, achieve very good results and generalize well to other datasets. In our study, we were able to annotate 10,022 sentences for 19 relations in a reasonable amount of time, and trained a commonly used baseline model for each relation.
translated by 谷歌翻译
This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available and the established workflow can be applied to any language with existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.
translated by 谷歌翻译
Recent advances in language modeling have enabled new conversational systems. In particular, it is often desirable for people to make choices among specified options when using such systems. We address the problem of reference resolution, when people use natural expressions to choose between real world entities. For example, given the choice `Should we make a Simnel cake or a Pandan cake?' a natural response from a non-expert may be indirect: `let's make the green one'. Reference resolution has been little studied with natural expressions, thus robustly understanding such language has large potential for improving naturalness in dialog, recommendation, and search systems. We create AltEntities (Alternative Entities), a new public dataset of entity pairs and utterances, and develop models for the disambiguation problem. Consisting of 42K indirect referring expressions across three domains, it enables for the first time the study of how large language models can be adapted to this task. We find they achieve 82%-87% accuracy in realistic settings, which while reasonable also invites further advances.
translated by 谷歌翻译
在人类空间中运营的机器人必须能够与人的自然语言互动,既有理解和执行指示,也可以使用对话来解决歧义并从错误中恢复。为此,我们介绍了教学,一个超过3,000人的互动对话的数据集,以完成模拟中的家庭任务。一个有关任务的Oracle信息的指挥官以自然语言与追随者通信。追随者通过环境导航并与环境进行互动,以完成从“咖啡”到“准备早餐”的复杂性不同的任务,提出问题并从指挥官获取其他信息。我们提出三个基准使用教学研究体现了智能挑战,我们评估了对话理解,语言接地和任务执行中的初始模型的能力。
translated by 谷歌翻译
我们研究了在室内路线上捕获的360度图像中的自动生成导航指令。现有的发电机遭受较差的视觉接地,导致它们依赖语言前沿和幻觉对象。我们的Marky-MT5系统通过专注于视觉地标来解决这一点;它包括第一阶段地标检测器和第二级发生器 - 多峰,多语言,多任务编码器 - 解码器。要培训它,我们在房间顶部(RXR)数据集的顶部引导地标注释。使用文本解析器,来自RXR的姿势迹线的弱监督,以及在1.8B图像上培训的多语言图像文本编码器,我们识别1.1M英语,印地语和泰卢语的地标描述并将其接地为Panoramas的特定区域。在房间到室内,人类途径在Marky-MT5的指示之后获得了71%的成功率(SR),只害羞他们的75%SR在人类指令之后 - 以及与其他发电机的SR高于SRS。对RXR更长的评估,不同的路径上的三种语言获得61-64%的SRS。在新颖环境中生成这种高质量的导航指令是迈向对话导航工具的一步,可以促进对指令跟随代理的大规模培训。
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy). We hope our dataset will encourage the development of new analogy-making models. Website: https://vasr-dataset.github.io/
translated by 谷歌翻译
For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLinks. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of TimeBankDense corpus, which shows comparable agreement with a significant increase in density. We contribute TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, baseline results, as well as quantitative and qualitative analysis of inter-annotator agreement.
translated by 谷歌翻译
时间轴提供了最有效的方法之一,可以看到一段时间内发生的重要历史事实,从而呈现出从文本形式阅读等效信息的见解。通过利用生成的对抗性学习进行重要的句子分类,并通过吸收基于知识的标签来改善事件核心分辨率的性能,我们从多个(历史)文本文档中引入了两个分阶段的事件时间表生成的系统。我们在两个手动注释的历史文本文档上演示了我们的结果。我们的结果对历史学家,推进历史研究以及理解一个国家的社会政治格局的研究对历史学家来说非常有帮助。
translated by 谷歌翻译
体现的代理需要能够在自然语言中互动理解任务描述,并提出适当的后续问题以获取必要的信息,以有效地成功完成各种用户的任务。在这项工作中,我们提出了一组对话框,用于建模此类对话框,并注释教学数据集,其中包括3,000多个位置,以任务为导向的对话(总计包含39.5k个话语),并具有对话框ACT。 Teach-da是对Dialog ACT的第一个大型数据集注释,用于具体任务完成。此外,我们在培训模型中证明了该注释的数据集在标记给定话语的对话框行为中的使用,预测给定对话框历史记录的下一个响应的对话框行为,并使用对话框行为指导代理商的非第二语言行为。特别是,我们对对话记录任务的教学执行执行的实验,该模型预测在体现任务完成环境中要执行的低级操作的顺序,证明对话框行为可以将最终任务成功提高2分,以提高最终任务成功率到没有对话行为的系统。
translated by 谷歌翻译