Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with it's unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.
translated by 谷歌翻译
密集的视频字幕旨在使用视频的时间位置生成多个相关标题。以前的方法遵循复杂的“本地化 - 然后描述”方案,这些方案严重依赖于众多手工制作的组件。在本文中,通过将密集的标题产生作为设置预测任务,我们提出了一种具有并行解码(PDVC)的端到端密集视频字幕的简单且有效的框架。实际上,通过在变压器解码器顶部堆叠新提出的事件计数器,PDVC在对视频内容的整体理解下,将视频精确地将视频分成多个事件部分,这有效地提高了预测标题的相干性和可读性。与现有技术相比,PDVC具有多种吸引力优势:(1)不依赖于启发式非最大抑制或复发事件序列选择网络以除去冗余,PDVC直接产生具有适当尺寸的事件集; (2)与采用两级方案相比,我们并行地将事件查询的增强型表达送入本地化头和标题头,使这两个子任务深入相互关联,通过优化相互促进; (3)没有贝尔和吹口哨,对ActivityNet标题和YouScook2的广泛实验表明,PDVC能够产生高质量的标题结果,当其本地化准确性与它们相提并如此时,最先进的两级方法。代码可在https://github.com/ttengwang/pdvc提供。
translated by 谷歌翻译
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the \emph{Tasty Videos Dataset V2}, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
translated by 谷歌翻译
密集的视频字幕(DVC)旨在生成多句子描述,以阐明视频中的多个事件,这是具有挑战性,需要的视觉一致性,疑惑一致性和语言多样性。现有方法主要生成各个视频段的标题,缺乏适应全局视觉上下文和快速发展的视觉内容和文本描述之间的渐进对齐,这导致冗余和拼接描述。在本文中,我们介绍了信息流的概念,以模拟跨视频序列和标题的渐进信息。通过设计跨模型信息流对准机制,捕获和对齐的视觉和文本信息流,其在事件/主题演化上以更丰富的上下文和动态赋予标题处理。基于跨模型信息流对准模块,我们进一步提出了DVCFlow框架,它由全球本地视觉编码器组成,用于捕获每个视频段的全局功能和本地特征,以及用于产生标题的预先培训的标题生成器。对流行的ActivityNet标题和Youcookii数据集的广泛实验表明,我们的方法显着优于竞争基础,并根据主题和客观测试产生更多人类文本。
translated by 谷歌翻译
The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation-to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.
translated by 谷歌翻译
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content.In this paper we present MSR-VTT (standing for "MSR-Video to Text") which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Networkbased approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
translated by 谷歌翻译
近期和越来越越来越多的视频 - 语言研究的兴趣已经推动了大规模数据集的开发,可实现数据密集型机器学习技术。相比之下,在评估这些数据集的适应性时,已经进行了有限的努力进行视频 - 语言接地任务。最近的作品已经开始发现这些数据集中的重大限制,这表明最先进的技术通常会过度地覆盖到隐藏的数据集偏差。在这项工作中,我们呈现MAD(电影音频描述),这是一种新颖的基准,从扩充现有视频数据集的范式,其中包含文本注释,并专注于爬行和对齐主流电影的可用音频描述。 MAD包含超过384,000个自然语言句子,该句子接地为超过1,200小时的视频,并且在视频 - 语言接地数据集中展示目前诊断的偏差显着减少。疯狂的收集策略使新颖且更具挑战性的视频 - 语言接地版本,其中短时间时刻(通常秒长)必须在多样化的长型视频中准确地接地,可以持续长达三个小时。
translated by 谷歌翻译
我们介绍了一种视听方法,用于远程文本到视频检索。与以前专为简短视频检索设计的方法(例如,持续时间为5-15秒)不同,我们的方法旨在检索捕获复杂人类动作的长时间视频。仅标准视频方法的一个挑战是与从这样的长视频中处理数百个密集提取的帧相关的大量计算成本。为了解决这个问题,我们建议用紧凑的音频提示替换视频的部分,这些线索简洁地汇总了动态音频事件,并且处理便宜。我们的方法称为Eclipse(带有声音编码的有效剪辑),通过添加一个统一的视听变压器块,将流行的剪辑模型调整为视听视频设置,该块从视频和音频流中捕获互补的提示。除了比仅长期视频的方法快2.92倍和2.34倍的内存效率外,我们的方法还可以在几个不同的远程视频数据集上,例如ActivityNet,QVHighighlights,Youcoook2,Youcoook2,Youcook2,Youcook2,Youcook2,Youcook2,Youcook2,Youcook2, Didemo和Charades。
translated by 谷歌翻译
为了为视频产生适当的标题,推理需要确定相关的概念并注意它们之间的空间关系以及剪辑中的时间发展。我们的端到端编码器视频字幕框架结合了两个基于变压器的体系结构,这是一种用于单个关节时空视频分析的改编变压器,以及用于高级文本生成的基于自我注意力的解码器。此外,我们引入了一种自适应框架选择方案,以减少所需的传入帧数,同时在训练两个变压器时保持相关内容。此外,我们通过汇总每个样本的所有基础真理标题来估计与视频字幕相关的语义概念。我们的方法在MSVD以及大规模的MSR-VTT和VATEX基准数据集上实现了最新的结果,并考虑了多个自然语言产生(NLG)指标。对多样性得分的其他评估突出了我们生成的标题结构的表现力和多样性。
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译
Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities' properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models with masked transformer (MT) and parallel decoding (PVDC) to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of the event localization in the video as well as the performance of the event caption generation.
translated by 谷歌翻译
The little girl jumps back up after falling. Figure 1: We consider localizing moments in video with natural language and demonstrate that incorporating local and global video features is important for this task. To train and evaluate our model, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 40,000 pairs of localized video moments and corresponding natural language.
translated by 谷歌翻译
认知科学表明,人类会以所见主体的变化分离的事件来感知视频。状态变化触发新事件,是大量冗余信息中最有用的事件之一。但是,先前的研究重点是对细分市场的总体理解,而无需评估内部的细粒度变化。在本文中,我们介绍了一个名为Kinetic-GEB+的新数据集。该数据集由与标题相关的170K边界组成,这些字幕描述了12K视频中通用事件中的状态更改。在这个新数据集中,我们提出了三个任务,支持通过状态变化开发对视频的更细粒度,健壮和类似人类的理解。我们在数据集中评估了许多代表性基线,在该基础上,我们还设计了一种新的TPD(基于时间的成对差异)建模方法,以进行视觉差异并实现显着的性能改进。此外,结果表明,在利用不同粒度,视觉差异的表示以及状态变化的准确定位方面,当前方法仍然存在着巨大的挑战。进一步的分析表明,我们的数据集可以推动开发更强大的方法来了解状态变化,从而提高视频级别的理解。该数据集可从https://github.com/yuxuan-w/geb-plus获得
translated by 谷歌翻译
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
translated by 谷歌翻译
当今,分会一代成为在线视频的实用技术。本章断点使用户能够快速找到所需的零件并获得总结注释。但是,没有公共方法和数据集用于此任务。为了促进该方向的研究,我们介绍了一个名为Chapter-gen的新数据集,该数据集由大约10K用户生成的视频和带注释的章节信息组成。我们的数据收集过程是快速,可扩展的,不需要任何其他手动注释。在此数据集之外,我们设计了一个有效的基线,专门针对视频章节生成任务。捕获视频的两个方面,包括视觉动态和叙述文本。它分别将本地和全球视频功能分别用于本地化和标题生成。为了有效地解析长时间的视频,Skip滑动窗口机构旨在定位潜在的章节。并且开发了交叉注意的多模式融合模块,以汇总标题生成的本地功能。我们的实验表明,所提出的框架比现有方法取得了优越的结果,这表明即使在微调后也无法直接传输类似任务的方法设计。代码和数据集可在https://github.com/czt117/mvcg上找到。
translated by 谷歌翻译
我们提出了Locommer,一种基于变压器的视频接地模型,其在恒定的存储空间中运行,无论视频长度如何,即帧数。 Locommer专为任务而设计,在那里需要处理整个长视频,并在其核心贴上两个主要贡献。首先,我们的模型包含一种新的采样技术,将输入要素序列分成固定数量的部分,并使用随机方法选择每个部分的单个特征,这允许我们获得代表视频内容的特征样本集在手中的任务,同时保持内存占用空间。其次,我们提出了一种模块化设计,将功能分开,使我们能够通过监督自我关注头来学习归纳偏差,同时还有效利用预先接受训练的文本和视频编码器。我们在相关的基准数据集中测试我们的建议,以进行视频接地,表明该表现形式不仅可以实现优异的结果,包括在YouCookii上的最先进的性能,也可以比竞争对手更有效,并且它一直有效在平均工作的情况下,最新工作的表现,均值较大,最终导致Chardes-STA的新的最先进的性能。
translated by 谷歌翻译
Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets.To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
translated by 谷歌翻译
我们介绍一种基于复杂事件(例如,分钟)可以分解成更简单的事件(例如,几秒钟)的前提的方法来学习无监督的语义视觉信息,并且这些简单事件在多个复杂事件中共享。我们将一个长视频分成短帧序列,以利用三维卷积神经网络提取它们的潜在表示。群集方法用于对产生视觉码本的组表示(即,长视频由集群标签给出的整数序列表示)。通过对码本条目编码共生概率矩阵来学习密集的表示。我们展示了该表示如何利用浓密视频标题任务的性能,只有视觉功能。由于这种方法,我们能够更换双模变压器(BMT)方法中的音频信号,并产生具有可比性的时间提案。此外,与Vanilla变压器方法中的我们的描述符连接视觉信号,与仅探索视觉功能的方法相比,在标题中实现最先进的性能,以及具有多模态方法的竞争性能。我们的代码可在https://github.com/valterlej/dvcusi获得。
translated by 谷歌翻译
视频接地旨在通过给定语言查询,本地化未经监控的视频中的相应视频时刻。现有方法通常以间接方式解决此任务,通过将其作为提案和匹配或融合和检测问题。解决这些替代问题通常需要在培训和手工制作的近重复结果中进行复杂的标签分配。同时,现有的作品通常专注于具有单句的稀疏视频接地,作为输入可能导致由于其不清晰的描述而产生模糊的本地化。在本文中,我们通过将段落作为输入同时定位多个时刻来解决密集视频接地的新问题。从视频接地的视角是语言条件回归,我们通过重新拟合变压器 - 相似的架构(PRVG)来提出端到端的并行解码范式。我们的PRVG中的关键设计是使用语言作为查询,并基于语言调制的可视表示直接回归矩界限。由于其简单设计,我们的PRVG框架可以应用于不同的测试方案(稀疏或密集的接地),并允许无需任何后处理技术的有效推理。此外,我们设计了强大的提案级注意力损失,以指导PRVG的培训,这不变于时刻持续时间,并有助于模型收敛。我们对ActivityNet标题和炸玉米饼的两个视频接地基准进行实验,展示了我们的PRVG可以显着优于以前的方法。我们还进行深入的研究,以研究并行回归范例对视频接地的有效性。
translated by 谷歌翻译
视频瞬间检索旨在找到给定自然语言查询描述的片刻的开始和结束时间戳(视频的一部分)。全面监督的方法需要完整的时间边界注释才能获得有希望的结果,这是昂贵的,因为注释者需要关注整个时刻。弱监督的方法仅依赖于配对的视频和查询,但性能相对较差。在本文中,我们仔细研究了注释过程,并提出了一种称为“ Glance注释”的新范式。该范式需要一个只有一个随机框架的时间戳,我们将其称为“目光”,在完全监督的对应物的时间边界内。我们认为这是有益的,因为与弱监督相比,添加了琐碎的成本,还提供了更大的潜力。在一眼注释设置下,我们提出了一种基于对比度学习的一眼注释(VIGA),称为视频力矩检索的方法。 Viga将输入视频切成片段,并在剪辑和查询之间形成对比,其中一眼指导的高斯分布重量被分配给所有夹子。我们的广泛实验表明,VIGA通过很大的边距较小的弱监督方法获得了更好的结果,甚至可以在某些情况下与完全监督的方法相媲美。
translated by 谷歌翻译