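We introduce a method to learn unsupervised semantic visual information, based on the premise that complex events (e.g., minutes long) can be decomposed into simpler events (e.g., a few seconds long) and that these simple events are shared across several complex events. We split a long video into short frame sequences and extract their latent representations with three-dimensional convolutional neural networks. A clustering method groups the representations to produce a visual codebook (i.e., a long video is represented by a sequence of integers given by the cluster labels). A dense representation is then learned by encoding the co-occurrence probability matrix of the codebook entries. We demonstrate how this representation can improve the performance of dense video captioning when only visual features are available. With this approach, we are able to replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals with comparable performance. Furthermore, concatenating the visual signal with our descriptor in a vanilla Transformer method achieves state-of-the-art captioning performance compared with methods that explore only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.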
translated by Google Translate
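The codebook step in the abstract above can be illustrated with a minimal sketch. It assumes the cluster labels have already been computed; the toy label sequences, the function name `cooccurrence_probabilities`, and the context window size are hypothetical choices for illustration, not the authors' implementation. The later step of learning a dense embedding from this matrix is omitted.

```python
def cooccurrence_probabilities(label_sequences, vocab_size, window=2):
    """Row-normalised co-occurrence matrix for codebook entries.

    label_sequences: list of videos, each a list of integer cluster labels.
    window: number of neighbours on each side counted as context (assumed).
    """
    counts = [[0.0] * vocab_size for _ in range(vocab_size)]
    for seq in label_sequences:
        for i, center in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[center][seq[j]] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        # Rows for labels that never occur stay all-zero.
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Two toy "videos", each already reduced to a sequence of cluster labels.
videos = [[0, 1, 2, 1, 0], [2, 1, 0]]
P = cooccurrence_probabilities(videos, vocab_size=3, window=1)
```

Each row `P[i]` estimates how often codebook entry `i` is seen next to every other entry, which is the kind of statistic a co-occurrence-based embedding method would then compress into a dense vector.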
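Recently, several approaches have explored the detection and classification of objects in videos to perform zero-shot action recognition (ZSAR) with remarkable results. In these methods, class-object relationships are used to associate visual patterns with semantic side information, because these relationships also tend to appear in text; word vector methods therefore reflect them in their latent representations. Inspired by these methods, and by the ability of video captioning to describe events not only with a set of objects but also with contextual information, we propose a method in which video captioning models, called observers, provide different and complementary descriptive sentences. We demonstrate that representing videos with descriptive sentences instead of deep features is viable in ZSAR and naturally alleviates the domain adaptation problem: we reach state-of-the-art (SOTA) performance on the UCF101 dataset and competitive performance on HMDB51 without using their training sets. We also show that word vectors are unsuitable for building the semantic embedding space of our descriptions. We therefore propose to represent the classes with sentences extracted from documents retrieved by Internet search engines, without any human evaluation of the quality of the descriptions. Finally, we build a shared semantic space using BERT-based embedders pre-trained on multiple text datasets, and we show that this pre-training is essential for bridging the semantic gap. Projection onto this space is straightforward for both types of information, visual and semantic, because both are sentences, which enables classification with the nearest-neighbor rule in the shared space. Our code is available at https://github.com/valterlej/zsarcap.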
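The nearest-neighbor rule mentioned above can be sketched as follows. This is only an illustration: the three-dimensional vectors are hand-written stand-ins for sentence embeddings (in the paper they would come from a BERT-based embedder), and the names `cosine` and `nearest_class` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nearest_class(video_vec, class_vecs):
    """Return the class whose embedding is most similar to the video's."""
    return max(class_vecs, key=lambda name: cosine(video_vec, class_vecs[name]))


# Toy embeddings: one per class description, one for the video's captions.
class_vecs = {"archery": [0.9, 0.1, 0.0], "swimming": [0.0, 0.2, 0.9]}
video_vec = [0.8, 0.3, 0.1]
predicted = nearest_class(video_vec, class_vecs)
```

Because both the video (through its generated captions) and the classes (through retrieved sentences) live in the same sentence-embedding space, no learned projection between modalities is needed in this sketch.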
Given an untrimmed video and a natural language query, video sentence grounding aims to localize the target temporal moment in the video. Existing methods mainly tackle this task by matching and aligning the semantics of the descriptive sentence and video segments at a single temporal resolution, while neglecting the temporal consistency of video content across different resolutions. In this work, we propose a novel multi-resolution temporal video sentence grounding network, MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is an encoder-decoder network, and the output features of its decoder are used in conjunction with Transformers to predict the final start and end timestamps. Notably, our MRT module is hot-pluggable, which means it can be seamlessly incorporated into any anchor-free model. In addition, we use a hybrid loss to supervise cross-modal features in the MRT module for more accurate grounding at three scales: frame level, clip level, and sequence level. Extensive experiments on three prevalent datasets show the effectiveness of MRTNet.
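Dense video captioning aims to generate multiple associated captions together with their temporal locations in the video. Previous methods follow a sophisticated "localize-then-describe" scheme that relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), by formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a Transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of appropriate size; (2) in contrast to the two-stage scheme, we feed the enhanced representations of event queries into the localization head and the caption head in parallel, making the two sub-tasks deeply interrelated and mutually promoted through optimization; (3) without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results, surpassing state-of-the-art two-stage methods when its localization accuracy is on par with theirs. Code is available at https://github.com/ttengwang/pdvc.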
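To make advantage (1) concrete, here is a minimal sketch of how a predicted event count could replace non-maximum suppression when choosing the final event set. It assumes each decoded query carries a confidence score and that the counter outputs an integer; the field names, the toy proposals, and `select_events` are hypothetical and not taken from the PDVC code.

```python
def select_events(proposals, predicted_count):
    """Keep the `predicted_count` most confident proposals, in temporal order."""
    ranked = sorted(proposals, key=lambda p: p["score"], reverse=True)
    kept = ranked[:predicted_count]
    return sorted(kept, key=lambda p: p["start"])


# Toy decoder output: one proposal per event query (times in seconds).
proposals = [
    {"start": 40.0, "end": 60.0, "score": 0.9},
    {"start": 5.0, "end": 20.0, "score": 0.8},
    {"start": 6.0, "end": 21.0, "score": 0.3},
    {"start": 70.0, "end": 80.0, "score": 0.1},
]
events = select_events(proposals, predicted_count=2)
```

No overlap threshold is tuned here: the size of the output set comes entirely from the counter's prediction, which is the design choice the abstract contrasts with heuristic suppression.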
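Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. The task has received increasing attention in recent years with the release of freely available datasets. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, all of which have greatly advanced the field. In this paper, we present a comprehensive review of published contributions in automated audio captioning, from the variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.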
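Dense video captioning (DVC) aims to generate multi-sentence descriptions that elucidate the multiple events in a video, which is challenging and demands visual consistency, discourse coherence, and linguistic diversity. Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context and progressive alignment between the fast-evolving visual content and the textual descriptions, which results in redundant and spliced descriptions. In this paper, we introduce the concept of information flow to model the progressive change of information across the video sequence and the captions. By designing a Cross-modal Information Flow Alignment mechanism, the visual and textual information flows are captured and aligned, which endows the captioning process with richer context and dynamics of event/topic evolution. Based on the Cross-modal Information Flow Alignment module, we further propose the DVCFlow framework, which consists of a Global-local Visual Encoder that captures both global and local features for each video segment, and a pre-trained Caption Generator that produces the captions. Extensive experiments on the popular ActivityNet Captions and YouCookII datasets demonstrate that our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.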
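When describing spatio-temporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent encoder-decoder models attend to encoder features mainly through linear interaction with the decoder. However, the growing model complexity for visual data encourages more explicit feature interaction for fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil richer visual representations, either by concatenation or by using a linear layer. Although the feature sets for a video overlap semantically to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions, a fundamental component of expressing one event from several meaningful perspectives, is currently missing in the temporal (i.e., video captioning) domain. To this end, we propose the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discounted fashion. The learned attributes of each feature stack contribute to our proposed diversity encoding module, followed by a decoding query stage, to facilitate end-to-end diverse and natural captions without any explicit supervision on attributes. We evaluate VSLAN on the MSVD and MSR-VTT datasets in terms of syntax and diversity. The CIDEr score of VSLAN outperforms current off-the-shelf methods by $4.5\%$ and $4.8\%$ on MSVD and MSR-VTT, respectively. On the same datasets, VSLAN achieves competitive results in caption diversity metrics.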
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learnt embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods on accuracy and diversity.
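Connecting vision and language plays an essential role in generative intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Since 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims to provide a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and to highlight future directions for a research area where computer vision and natural language processing can find an optimal synergy.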
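The task of dense video captioning (DVC) aims to generate captions with timestamps for multiple events in one video. Semantic information plays an important role in both the localization and the description stages of DVC. We present a semantic-assisted dense video captioning model based on the encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to fully represent the input video. In the decoding stage, we design a classification head, in parallel with the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset under DVC evaluation metrics and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of the PIC 4th Challenge.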
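Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can help visually impaired people understand the scenes of, for example, a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT is lacking. Moreover, there is no comprehensive study of different strategies and recommendations for video description generation, including exploiting the accompanying audio with fully self-attentive networks. We therefore explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration that transfers to unseen datasets and helps describe short video clips in natural language. This configuration improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points compared with a vanilla Transformer network and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. In addition, FPE helps increase the CIDEr score by 8.6%.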
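One way to picture the synchronization idea behind FPE is sketched below. The abstract does not give the formulation, so this is an assumption: audio and video feature sequences cover the same clip but have different lengths, and the audio indices are rescaled to fractional positions on the video's index range before the standard sinusoidal encoding is evaluated. The function names and sequence lengths are hypothetical.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Standard sinusoidal encoding, evaluated at a real-valued position."""
    enc = []
    for k in range(0, d_model, 2):
        angle = position / (10000 ** (k / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]


def fractional_positions(n_items, n_reference):
    """Map n_items indices onto the index range of a reference sequence."""
    scale = n_reference / n_items
    return [i * scale for i in range(n_items)]


# Assumed setup: 4 video features and 8 audio features for the same clip.
video_pos = fractional_positions(4, 4)  # integer positions 0, 1, 2, 3
audio_pos = fractional_positions(8, 4)  # fractional positions 0, 0.5, 1, ...
video_enc = [sinusoidal_encoding(p, 8) for p in video_pos]
audio_enc = [sinusoidal_encoding(p, 8) for p in audio_pos]
```

Under this assumption, audio and video features that describe the same instant receive the same positional code, even though their sequence indices differ.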
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way to learn representations without annotations and has shown promise in both the image and video domains. Compared with the image domain, learning video representations is more challenging because of the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we review existing approaches to self-supervised learning with a focus on the video domain. We group these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets and downstream evaluation tasks, offer insights into the limitations of existing works, and discuss potential future directions in this area.
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce a TaNgled Transformer block (TNT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state-of-the-art, demonstrating its superiority in video-text representation learning. * This work was done when Linchao Zhu visited Baidu Research. Yi Yang is the corresponding author.
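To generate proper captions for videos, inference needs to identify the relevant concepts and attend to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework combines two Transformer-based architectures: an adapted Transformer for a single joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme that reduces the number of required input frames while retaining the relevant content when training both Transformers. Additionally, we estimate the semantic concepts relevant to video captioning by aggregating all ground-truth captions of each sample. Our approach achieves state-of-the-art results on MSVD as well as on the large-scale MSR-VTT and VATEX benchmark datasets across multiple natural language generation (NLG) metrics. Additional evaluations of diversity scores highlight the expressiveness and structural diversity of our generated captions.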
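The concept-estimation step described above (aggregating all ground-truth captions of a sample) could look roughly like the following sketch. The abstract does not specify how captions are aggregated, so word-frequency counting, the stop-word list, the toy captions, and the function name `estimate_concepts` are all assumptions made for illustration.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "someone"}


def estimate_concepts(captions, top_k=3):
    """Return the top_k most frequent non-stop-words across all captions."""
    counts = Counter()
    for caption in captions:
        for word in caption.lower().split():
            if word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_k)]


# Toy ground-truth captions for a single video sample.
captions = [
    "a man is playing a guitar",
    "a man plays guitar on stage",
    "someone is playing the guitar",
]
concepts = estimate_concepts(captions, top_k=3)
```

Words that several annotators agree on rise to the top, so the resulting list can serve as a weak per-sample concept label without extra annotation.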
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of Transformer models in computer vision.
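Dense video captioning aims to generate corresponding text descriptions for a series of events in an untrimmed video, and can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works focus on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial because of the large differences in their task-specific solutions. In addition, previous event detection methods usually ignore the temporal dependencies between events, leading to event redundancy or inconsistency problems. In this paper, we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework that naturally enhances the inter-task association between event detection and captioning. Because the model predicts each event with the previous events as context, the inter-dependency between events is fully exploited, and our model can therefore detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms state-of-the-art methods and can be further boosted when pre-trained on large-scale video-text data. Code is available at https://github.com/qiqang/uedvc.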
Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL divergence to train a video captioning model that is resilient to token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent for generating content-complete and grammatically sound sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR scores, respectively, on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at https://github.com/d-rothen/bmhrl.
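This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately locate the specific video segment in an untrimmed video that corresponds to a given sentence query. Traditional TSG methods mainly follow a top-down or bottom-up framework and are not end-to-end; they rely heavily on time-consuming post-processing to refine the grounding results. Recently, some Transformer-based approaches have been proposed to efficiently model the fine-grained semantic alignment between video and query. Although these methods achieve significant performance to some extent, they treat the frames of the video and the words of the query equally as Transformer input for correlation, failing to capture their different levels of granularity with distinct semantics. To address this issue, we propose a novel Hierarchical Local-Global Transformer (HLGT) that leverages this hierarchy information and models the interactions between different levels of granularity and between different modalities, in order to learn more fine-grained multi-modal representations. Specifically, we first split the video and the query into individual clips and phrases and learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal Transformer. Then, a global-local Transformer is introduced to learn the interactions between local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage semantic alignment between them. Finally, we design a new cross-modal parallel Transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that our proposed HLGT achieves new state-of-the-art performance.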
Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities' properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models, one with a masked transformer (MT) and one with parallel decoding (PDVC), to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of the event localization in the video as well as the performance of the event caption generation.
Video understanding is a growing field and a subject of intense research, which includes many interesting tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, because of the long and complicated temporal structure of unconstrained videos. Different from existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.