Figure 1: We consider localizing moments in video with natural language and demonstrate that incorporating local and global video features is important for this task. To train and evaluate our model, we collect the Distinct Describable Moments (DiDeMo) dataset, which consists of over 40,000 pairs of localized video moments and corresponding natural language.
It is still a pipe dream that AI assistants on our phones and AR glasses could assist our daily life in addressing questions like "How do I adjust the date on this watch?" and "How do I set the heating duration? (while pointing at an oven)". The queries used in conventional tasks (i.e., video question answering, video retrieval, moment localization) are usually factoid and based on pure text. In contrast, we present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR). Each of our questions is an image-box-text query that focuses on items in our daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support research on this AQVSR task, we construct a new dataset called AssistSR. We design novel guidelines to create high-quality samples. The dataset contains 1.4k multimodal questions over 1k video segments drawn from instructional videos about a variety of daily-used items. To address AQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still leaving large room for future improvement. We also provide detailed ablation analyses. Our code and data are available at https://github.com/stanlei52/aqvsr.
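The abstract names Dual Multimodal Encoders but not their internals; below is a minimal PyTorch sketch of a generic dual-encoder retrieval setup trained with an in-batch symmetric contrastive loss. All module names, dimensions, and the loss are illustrative assumptions, not the authors' released DME code.

```python
# Minimal sketch of a dual-encoder retrieval setup (illustrative, not the
# authors' DME code): a question encoder and a segment encoder map into a
# shared space, trained with an in-batch symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderSketch(nn.Module):
    def __init__(self, q_dim=512, s_dim=768, joint_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, joint_dim)  # image-box-text query branch
        self.s_proj = nn.Linear(s_dim, joint_dim)  # video-transcript segment branch

    def forward(self, q_feat, seg_feat):
        # q_feat: (B, q_dim) fused query features; seg_feat: (B, s_dim) pooled segments
        q = F.normalize(self.q_proj(q_feat), dim=-1)
        s = F.normalize(self.s_proj(seg_feat), dim=-1)
        return q @ s.t()  # (B, B) cosine similarities; diagonal = matched pairs

model = DualEncoderSketch()
sims = model(torch.randn(4, 512), torch.randn(4, 768))
labels = torch.arange(4)
loss = (F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels)) / 2
```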
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural-language modalities, which allows effective training by directly contrasting the representations of the two modalities. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K, and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state-of-the-art performance on recall-at-one and pointing-hand accuracy.
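As a rough illustration of the described training signal, here is a hedged sketch of alternating intra-/inter-modal attention followed by a contrastive loss over pooled clip and narration embeddings; the layer sizes, the shared attention modules, and the temperature are assumptions, not the paper's architecture.

```python
# Hedged sketch of alternating intra-/inter-modal attention with a contrastive
# loss; sizes, the shared attention modules, and the temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)   # shared for brevity
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

def encode(video_tokens, text_tokens):
    # intra-modal attention within each stream
    v, _ = self_attn(video_tokens, video_tokens, video_tokens)
    t, _ = self_attn(text_tokens, text_tokens, text_tokens)
    # inter-modal attention: each stream attends to the other
    v, _ = cross_attn(v, t, t)
    t, _ = cross_attn(t, v, v)
    return v.mean(dim=1), t.mean(dim=1)  # pooled clip / narration embeddings

v, t = encode(torch.randn(8, 20, d), torch.randn(8, 12, d))
sims = F.normalize(v, dim=-1) @ F.normalize(t, dim=-1).t() / 0.07
loss = F.cross_entropy(sims, torch.arange(8))  # contrast paired clips and narrations
```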
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image.
This paper introduces a video dataset of spatiotemporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available [1].
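The abstract does not spell out the training objective; a common choice for such joint text-video embeddings is a bidirectional max-margin ranking loss, sketched below under that assumption (the margin and dimensions are illustrative, not the paper's values).

```python
# Sketch of a bidirectional max-margin ranking loss over paired clip/narration
# embeddings; the margin and dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F

def ranking_loss(video_emb, text_emb, margin=0.1):
    # video_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs
    sims = video_emb @ text_emb.t()                 # (B, B); diagonal = positives
    pos = sims.diag()
    off = ~torch.eye(len(sims), dtype=torch.bool)   # mask out the positives
    loss_v2t = (margin + sims - pos[:, None]).clamp(min=0)[off].mean()
    loss_t2v = (margin + sims - pos[None, :]).clamp(min=0)[off].mean()
    return loss_v2t + loss_t2v

v = F.normalize(torch.randn(16, 512), dim=-1)
t = F.normalize(torch.randn(16, 512), dim=-1)
loss = ranking_loss(v, t)
```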
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.
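As an assumption-laden illustration of the context-aware captioning idea, the sketch below conditions each event's decoder on mean-pooled features of the past and future events; the fusion scheme, the LSTM decoder, and feeding the same context at every step are simplifications, not the paper's module.

```python
# Assumption-laden sketch of context-aware event captioning: each event's
# decoder is conditioned on pooled features of past and future events. The
# fusion and the fixed-step decoder are simplifications, not the paper's module.
import torch
import torch.nn as nn

class ContextCaptioner(nn.Module):
    def __init__(self, feat_dim=500, hidden=512, vocab=10000):
        super().__init__()
        self.fuse = nn.Linear(3 * feat_dim, hidden)       # event + past + future
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, event_feats, idx, steps=10):
        # event_feats: (N, feat_dim), one feature per detected event proposal
        zero = torch.zeros_like(event_feats[0])
        past = event_feats[:idx].mean(0) if idx > 0 else zero
        future = event_feats[idx + 1:].mean(0) if idx + 1 < len(event_feats) else zero
        ctx = self.fuse(torch.cat([event_feats[idx], past, future])).relu()
        h, _ = self.decoder(ctx.expand(1, steps, -1))      # same context each step
        return self.out(h)                                 # (1, steps, vocab) logits

logits = ContextCaptioner()(torch.randn(5, 500), idx=2)
```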
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
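In its simplified form, the alignment objective amounts to letting each word pick its best-matching region and summing the scores over the sentence; a minimal sketch under that reading follows (the clamp at zero and the dimensions are assumptions of this illustration).

```python
# Minimal sketch of the alignment score: every word embedding picks its
# best-matching region and the (thresholded) scores are summed per sentence;
# the clamp and dimensions are illustrative.
import torch

def image_sentence_score(region_embs, word_embs):
    # region_embs: (R, D) region embeddings; word_embs: (W, D) word embeddings,
    # both mapped into the same multimodal space
    sims = word_embs @ region_embs.t()                # (W, R) inner products
    return sims.max(dim=1).values.clamp(min=0).sum()  # each word -> best region

score = image_sentence_score(torch.randn(19, 300), torch.randn(8, 300))
```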
Recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has gone into assessing the fitness of these datasets for video-language grounding tasks. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses instead on crawling and aligning the available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and exhibits a significant reduction in the currently diagnosed biases of video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point-scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Finally, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr
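Moment-DETR is described as a transformer encoder-decoder doing direct set prediction; below is a minimal sketch of that pattern, with learned moment queries decoding normalized (center, width) spans and a saliency head over the encoded video tokens. Layer counts, heads, and dimensions are assumptions, not the released model.

```python
# Minimal sketch of the DETR-style set prediction pattern described above:
# learned moment queries decode normalized (center, width) spans, plus a
# saliency head over video tokens. Layer counts and sizes are assumptions.
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    def __init__(self, d=256, n_queries=10):
        super().__init__()
        self.transformer = nn.Transformer(d, nhead=8, num_encoder_layers=2,
                                          num_decoder_layers=2, batch_first=True)
        self.moment_queries = nn.Parameter(torch.randn(n_queries, d))
        self.span_head = nn.Linear(d, 2)       # normalized (center, width)
        self.saliency_head = nn.Linear(d, 1)   # per-clip saliency score

    def forward(self, vid_tokens, txt_tokens):
        src = torch.cat([vid_tokens, txt_tokens], dim=1)    # (B, Lv+Lt, d)
        memory = self.transformer.encoder(src)
        tgt = self.moment_queries.expand(len(src), -1, -1)  # (B, Q, d)
        hs = self.transformer.decoder(tgt, memory)
        spans = self.span_head(hs).sigmoid()                # (B, Q, 2)
        saliency = self.saliency_head(memory[:, :vid_tokens.size(1)]).squeeze(-1)
        return spans, saliency

spans, sal = MomentDETRSketch()(torch.randn(2, 75, 256), torch.randn(2, 12, 256))
```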
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which brings a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked, where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks and 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Alongside the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/visor
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.
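As a hedged sketch of the "unified audiovisual transformer block" idea, the toy block below lets sparsely sampled frame tokens and cheap audio tokens exchange information through cross-attention; the actual ECLIPSE block design may differ from this illustration.

```python
# Hedged sketch of a unified audio-visual fusion block: sparse frame tokens and
# cheap audio tokens exchange information via cross-attention. The actual
# ECLIPSE block design may differ from this toy version.
import torch
import torch.nn as nn

class AVFusionBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(d, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_v, self.norm_a = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, v, a):
        # v: (B, Tv, d) sparse frame tokens; a: (B, Ta, d) audio tokens
        v = self.norm_v(v + self.v_from_a(v, a, a)[0])  # video enriched by audio
        a = self.norm_a(a + self.a_from_v(a, v, v)[0])  # audio enriched by video
        return v, a

v, a = AVFusionBlock()(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```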
Text-based video segmentation aims to segment the target actor in a video sequence by specifying the actor and its performed action with a textual query. Previous methods fail to align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry. Semantic asymmetry means that the two modalities contain different amounts of semantic information during the multimodal fusion process. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn the actor-related and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube contains the desired actor and action, and is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the temporal consistency of predictions. The whole model allows joint learning of the actor-action matching and segmentation, and achieves state-of-the-art performance on the A2D Sentences and J-HMDB Sentences datasets for both single-frame and full-video segmentation.
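A minimal sketch of the two-module matching idea: actor- and action-specific projections of candidate tubes are scored against the corresponding text projections and the scores combined. The heads, pooling, and feature dimensions are illustrative assumptions, not the paper's network.

```python
# Sketch of the two-module matching idea: actor- and action-specific
# projections of candidate tubes are scored against the corresponding text
# projections and combined; heads and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularMatcher(nn.Module):
    def __init__(self, vdim=512, tdim=300, d=256):
        super().__init__()
        self.actor_v, self.action_v = nn.Linear(vdim, d), nn.Linear(vdim, d)
        self.actor_t, self.action_t = nn.Linear(tdim, d), nn.Linear(tdim, d)

    def forward(self, tube_feats, text_feat):
        # tube_feats: (N, vdim) candidate spatio-temporal tubes; text_feat: (tdim,)
        s_actor = F.normalize(self.actor_v(tube_feats), dim=-1) \
                  @ F.normalize(self.actor_t(text_feat), dim=-1)
        s_action = F.normalize(self.action_v(tube_feats), dim=-1) \
                   @ F.normalize(self.action_t(text_feat), dim=-1)
        return s_actor + s_action  # (N,) combined actor+action matching scores

scores = ModularMatcher()(torch.randn(6, 512), torch.randn(300))
best_tube = scores.argmax()  # this tube would go to an FCN for mask prediction
```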
Cognitive science has shown that humans perceive videos in terms of events separated by state changes of the dominant subjects. State changes trigger new events and are among the most useful pieces of information within the large amount of redundancy perceived. However, previous research has focused on the overall understanding of segments without evaluating the fine-grained state changes inside. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170K boundaries associated with captions describing state changes in the generic events of 12K videos. On this new dataset, we propose three tasks that support the development of a more fine-grained, robust and human-like understanding of videos through state changes. We evaluate many representative baselines on our dataset, where we also design a new TPD (Temporal-based Pairwise Difference) modeling method for visual differences and achieve significant performance improvements. Moreover, the results show that current methods still face formidable challenges in utilizing different granularities, representing visual differences, and accurately localizing state changes. Further analysis shows that our dataset can drive the development of more powerful methods for understanding state changes and thus improve video-level comprehension. The dataset is available at https://github.com/yuxuan-w/geb-plus
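The abstract names TPD (Temporal-based Pairwise Difference) modeling without details; one plausible reading, sketched below, represents the change at a boundary by pooled pairwise differences between pre- and post-boundary frame features. This is an assumption-laden illustration, not the authors' implementation.

```python
# One plausible reading of pairwise-difference modeling, sketched as pooled
# differences between pre- and post-boundary frame features; this is an
# illustration under assumptions, not the authors' TPD implementation.
import torch

def pairwise_difference(frame_feats, boundary_idx):
    # frame_feats: (T, D) per-frame features; boundary_idx strictly inside (0, T)
    before = frame_feats[:boundary_idx]               # frames preceding the boundary
    after = frame_feats[boundary_idx:]                # frames following it
    diffs = after.unsqueeze(1) - before.unsqueeze(0)  # (Ta, Tb, D) all pairs
    return diffs.mean(dim=(0, 1))                     # (D,) pooled change descriptor

change = pairwise_difference(torch.randn(30, 512), boundary_idx=15)
```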
First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is directly seen. We present an approach that links egocentric video and the camera's pose over time by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings, in order to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments, where the environment is fully observable, and test them on real videos of house tours in unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room the camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: http://vision.cs.utexas.edu/projects/ego-scene-context/
Figure 1: We describe an efficient approach to learn visual representations from misaligned and noisy narrations automatically extracted from instructional videos. Our video representations are learnt from scratch without relying on any manually annotated visual dataset, yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level. We propose a self-supervised approach that learns this correspondence directly from data, without any need for human annotations. In order to capture the high-level concepts needed to solve the task, we propose to model the long-term temporal context of both the video and the music signals, using Transformer networks for each modality. Experiments show that this approach strongly outperforms alternatives that do not exploit the temporal context. The combination of our contributions improves retrieval accuracy by up to 10x over the prior state of the art. This strong improvement allows us to introduce a wide range of analyses and applications. For instance, we can condition music retrieval on visually defined attributes.
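As a sketch of the per-modality long-context modeling, the snippet below encodes chunk-level video and music features with separate Transformer encoders and contrasts the pooled outputs; the architecture sizes, mean pooling, and InfoNCE objective are assumptions of this illustration, not the paper's exact setup.

```python
# Sketch of per-modality long-context encoders: separate Transformers summarize
# chunk-level video and music features, and the pooled outputs are contrasted;
# sizes, pooling, and the InfoNCE objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512
video_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=4)
music_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=4)

vid_chunks = torch.randn(8, 32, d)  # 32 chunk-level visual features per video
mus_chunks = torch.randn(8, 32, d)  # 32 chunk-level audio features per track
v = F.normalize(video_encoder(vid_chunks).mean(dim=1), dim=-1)
m = F.normalize(music_encoder(mus_chunks).mean(dim=1), dim=-1)
loss = F.cross_entropy(v @ m.t() / 0.07, torch.arange(8))  # match video to its track
```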