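The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made to assess the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.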
translated by 谷歌翻译
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available [1].
Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short events and long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.
For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a zero-shot simple approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised by over 74%. Further, we also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics; and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead models to exploit spurious cues rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text") which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
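The task of temporal grounding aims to locate a video moment in an untrimmed video, given a sentence query. This paper investigates, for the first time, some superficial biases that are specific to the temporal grounding task, and proposes a novel targeted solution. Most alarmingly, we observe that existing temporal grounding models rely heavily on certain biases in the visual modality (e.g., a high preference for frequent concepts or for certain temporal intervals). This leads to inferior performance when generalizing the model to a cross-scenario test setting. To this end, we propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases and to force it to ground the query sentence based on the true inter-modal relationship. Debias-TLL trains two models simultaneously. By design, a large discrepancy between these two models' predictions on a sample indicates a higher probability that the sample is biased. Harnessing this informative discrepancy, we devise a data re-weighting scheme for mitigating the data biases. We evaluate the proposed model on cross-scenario temporal grounding, where the train and test data are heterogeneously sourced. Experiments show a large-margin superiority of the proposed method over state-of-the-art competitors.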
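Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr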
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as gains on long-tail object queries, and the ability to perform zero-shot and few-shot NLQ.
Figure 1: We consider localizing moments in video with natural language and demonstrate that incorporating local and global video features is important for this task. To train and evaluate our model, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 40,000 pairs of localized video moments and corresponding natural language. (Example query shown in the figure: "The little girl jumps back up after falling.")
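Video moment retrieval aims to find the start and end timestamps of a moment (part of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly since the annotator needs to watch the whole moment. Weakly supervised methods rely only on the paired video and query, but their performance is relatively poor. In this paper, we look closer into the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only one single random frame, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue this is beneficial because, compared to weak supervision, a trivial cost is added yet more potential in performance is provided. Under the glance annotation setting, we propose a method based on contrastive learning named Video moment retrieval via Glance Annotation (ViGA). ViGA cuts the input video into clips and contrasts between clips and queries, in which glance-guided Gaussian-distributed weights are assigned to all clips. Our extensive experiments indicate that ViGA achieves better results than the state-of-the-art weakly supervised methods by a large margin, and is even comparable to fully supervised methods in some cases.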
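The glance-guided weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the unnormalized Gaussian below, the function name `glance_gaussian_weights`, and the `sigma` parameter are all assumptions made for illustration, not ViGA's actual implementation.

```python
import math


def glance_gaussian_weights(num_clips, glance_idx, sigma):
    """Assign each clip a weight that decays with its distance from the glance clip.

    Hypothetical helper: an unnormalized Gaussian centered on the clip that
    contains the annotated glance frame. `sigma` (in clips) is an assumed
    hyperparameter controlling how quickly the weight falls off.
    """
    if not 0 <= glance_idx < num_clips:
        raise ValueError("glance_idx must index one of the clips")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - glance_idx) ** 2) / (2 * sigma ** 2))
        for i in range(num_clips)
    ]


# Seven clips, glance falls in clip 3: weights peak there and decay symmetrically.
weights = glance_gaussian_weights(num_clips=7, glance_idx=3, sigma=1.0)
for i, w in enumerate(weights):
    print(f"clip {i}: weight {w:.3f}")
```

In a contrastive setup, weights like these could scale how strongly each clip counts as a positive for the query, so clips near the glance contribute most.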
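Video temporal grounding (VTG) aims to localize temporal moments in an untrimmed video according to a natural language (NL) description. Since real-world applications provide a never-ending video stream, this raises the need for temporal grounding in long-form videos, which leads to two major challenges: (1) the long video length makes it difficult to process the entire video without decreasing the sample rate and leads to a high computational burden; (2) accurate multi-modal alignment becomes more challenging as the number of moment candidates increases. To address these challenges, we propose CONE, an efficient window-centric coarse-to-fine alignment framework, which flexibly handles long-form video inputs with higher inference speed and enhances temporal grounding via our novel coarse-to-fine multi-modal alignment. Specifically, we slice the long video into candidate windows via a sliding window approach. Centering on windows, CONE (1) learns the inter-window (coarse-grained) semantic variance through contrastive learning and filters the candidate windows relevant to the NL query, and (2) conducts intra-window (fine-grained) ranking of candidate moments using the powerful multi-modal alignment ability of a contrastive vision-text pre-trained model. Extensive experiments on two large-scale VTG benchmarks for long videos consistently show a substantial performance gain (from 3.13% to 6.87% on MAD and from 10.46% to 13.46% on Ego4D-NLQ), and CONE achieves SOTA results on both datasets. Analysis reveals the effectiveness of the components and higher efficiency in long video grounding, as our system improves the inference speed by 2x on Ego4D-NLQ and 15x on MAD while keeping the SOTA performance of CONE.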
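The window slicing and coarse filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the frame-index representation, and the relevance scores are hypothetical, and in the actual system the scores would come from a learned model rather than a hand-written list.

```python
def slice_into_windows(num_frames, window_len, stride):
    """Return (start, end) frame index pairs, end exclusive, covering the video.

    Hypothetical helper: the last window is clipped to the video length.
    A stride larger than window_len would leave gaps between windows.
    """
    if num_frames <= 0 or window_len <= 0 or stride <= 0:
        raise ValueError("num_frames, window_len and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + window_len, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def top_k_windows(windows, scores, k):
    """Keep the k windows with the highest (assumed) query-relevance scores."""
    ranked = sorted(zip(scores, windows), key=lambda pair: pair[0], reverse=True)
    return [window for _, window in ranked[:k]]


windows = slice_into_windows(num_frames=10, window_len=4, stride=2)
print(windows)  # [(0, 4), (2, 6), (4, 8), (6, 10)]

# Placeholder scores standing in for a coarse window-to-query relevance model.
print(top_k_windows(windows, scores=[0.1, 0.9, 0.3, 0.7], k=2))  # [(2, 6), (6, 10)]
```

Only the surviving windows would then be passed to the more expensive fine-grained ranking stage, which is where the inference speedup comes from.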
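We propose LocFormer, a Transformer-based model for video grounding that operates with a constant memory footprint regardless of the video length, i.e., the number of frames. LocFormer is designed for tasks where it is necessary to process the entire long video, and at its core lie two main contributions. First, our model incorporates a new sampling technique that splits the input feature sequence into a fixed number of sections and selects a single feature per section using a stochastic approach, which allows us to obtain a feature sample set that is representative of the video content for the task at hand while keeping the memory footprint constant. Second, we propose a modular design that separates functionality, enabling us to learn an inductive bias by supervising the self-attention heads, while also effectively leveraging pre-trained text and video encoders. We test our proposals on relevant benchmark datasets for video grounding, showing not only that LocFormer can achieve excellent results, including state-of-the-art performance on YouCookII, but also that our sampling technique is more effective than competing counterparts and that it consistently improves the performance of prior work by a large margin on average, ultimately leading to new state-of-the-art performance on Charades-STA.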
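The sampling technique described above (split the feature sequence into a fixed number of sections, then pick one feature per section at random) can be sketched in a few lines. The abstract only says the selection is stochastic, so the uniform choice within each section, the function name, and the handling of sequences shorter than the section count are assumptions for illustration.

```python
import random


def sample_one_per_section(features, num_sections, rng=None):
    """Pick one item from each of `num_sections` contiguous sections.

    Hypothetical helper: the output length is fixed by `num_sections`
    (or by len(features) if the sequence is shorter), independent of how
    long the input video is. Selection within a section is uniform here,
    which is an assumption.
    """
    if rng is None:
        rng = random.Random()
    n = len(features)
    if n == 0 or num_sections <= 0:
        return []
    k = min(num_sections, n)  # cannot have more non-empty sections than items
    picked = []
    for s in range(k):
        start = s * n // k
        end = (s + 1) * n // k  # exclusive; always > start because k <= n
        picked.append(features[rng.randrange(start, end)])
    return picked


# Integers stand in for per-frame feature vectors of a 1000-frame video.
frames = list(range(1000))
sample = sample_one_per_section(frames, num_sections=8, rng=random.Random(0))
print(len(sample))  # 8
```

Because one item is drawn per section, the sample keeps temporal order and covers the whole video, while the downstream model always sees the same number of features.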