视频字幕的规范方法决定了用于从离线提取的密集视频特征学习的标题生成模型。这些特征提取器通常在以固定帧速率采样的视频帧上操作,并且通常在图像/视频理解任务上培训,而不适用于视频标题数据。在这项工作中,我们展示了Swinbert,一种用于视频字幕的基于端到端的变换器的模型,它将视频帧贴片直接作为输入,并输出自然语言描述。我们的方法代替利用多个2D / 3D特征提取器,该方法采用视频变压器来编码可适应可变长度的视频输入,而无需专用设计,可以针对不同的帧速率进行专用设计。基于该模型架构,我们表明视频标题可以从更密集地采样的视频帧中受益匪浅,而不是以前的成功,用于视频和语言理解任务的稀疏采样视频帧(例如,视频问题应答)。此外,为了避免连续视频帧中固有的冗余,我们建议通过更好的远程视频序列建模来自适应地学习稀疏的注意掩模并优化任务特定性能改进。通过对5个视频字幕数据集的广泛实验,我们展示了Swinbert通过较大的余量来实现对以前的方法的整体性能改进。此外,学习的稀疏注意力掩模将限制推向新的技术,可以在不同的视频长度和不同的数据集之间传输。
translated by 谷歌翻译
视频语言(VIDL)建模的巨大挑战在于从图像/视频理解模型和下游Vidl数据中提取的固定视频表示之间的断开。最近的研究试图通过端到端培训来减轻这种断开连接。为了使其进行计算可行,先前的作品倾向于“想象”视频输入,即,将一些稀疏的采样帧馈送到2D CNN中,然后是简单的均值汇集或连接以获得整体视频表示。虽然实现了有希望的结果,但这种简单的方法可能会失去对于执行下游VIDL任务至关重要的时间信息。在这项工作中,我们呈现紫罗兰色,全新的视频语言变压器,采用视频变压器,明确地模拟视频输入的时间动态。此外,与以前的研究不同,发现视频输入上的预训练任务(例如,屏蔽帧建模)不是非常有效的,我们设计了一个新的预训练任务,屏蔽了视觉令牌建模(MVM),以获得更好的视频建模。具体地,原始视频帧修补程序将“令牌化”转换为离散的视觉令牌,目标是基于蒙面的贴片恢复原始的视觉令牌。综合分析展示了通过视频变压器和MVM显式时间建模的有效性。因此,紫罗兰在5个视频问题的回答任务和4个文本到视频检索任务中实现了新的最先进的性能。
translated by 谷歌翻译
蒙版的视觉建模(MVM)最近已被证明对视觉预训练有效。虽然在视频输入(例如,蒙版框架建模)上进行了类似的重建目标,在视频语言(VIDL)预训练中探索了类似的重建目标,但先前研究中的预提取的视频功能在预训练期间无法通过MVM进行完善,因此无法通过MVM进行完善为下游性能不满意。在这项工作中,我们系统地检查了MVM在VIDL学习的背景下的潜力。具体而言,我们的研究基于完全端到端的视频变压器(Violet),该视频变压器(Violet)减轻了固定视频表示与MVM培训之间的断开连接。总共探索了MVM的八个不同的重建目标,从低级像素值和定向梯度到高级深度图,光流,离散的视觉令牌和潜在的视觉特征。我们进行全面的实验,并就导致有效MVM培训的因素提供见解。从经验上讲,我们展示了通过MVM目标预先训练的紫罗兰色,可以在13个VIDL基准测试中取得显着改进,从视频问题回答,视频字幕到文本到视频检索等等。
translated by 谷歌翻译
近年来,统一的视觉语言框架已经大大提高,其中大多数采用编码器架构将图像文本任务统一为序列到序列的生成。但是,现有的视频语言(VIDL)模型仍需要在每个任务的模型体系结构和培训目标中进行特定于任务的设计。在这项工作中,我们探索了一个统一的VIDL框架薰衣草,其中蒙版语言建模(MLM)用作所有前训练和下游任务的常见接口。这样的统一导致了简化的模型体系结构,在多模式编码器之上,只需要一个轻巧的MLM头,而不是具有更多参数的解码器。令人惊讶的是,实验结果表明,这个统一的框架在14个VIDL基准测试中实现了竞争性能,涵盖了视频问答,文本到视频检索和视频字幕。广泛的分析进一步证明了薰衣草比现有VIDL方法的优势:(i)在多任务列出时仅使用一组参数值支持所有下游任务; (ii)对各种下游任务的几乎没有概括; (iii)在视频问题回答任务上启用零射门评估。代码可从https://github.com/microsoft/lavender获得。
translated by 谷歌翻译
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixed features sub-optimal for downstream tasks. Moreover, due to the high computational overload of dense video features, it is often difficult (or infeasible) to plug feature extractors directly into existing approaches for easy finetuning. To provide a remedy to this dilemma, we propose a generic framework CLIPBERT that enables affordable endto-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIP-BERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle. Videos in the datasets are from considerably different domains and lengths, ranging from 3-second genericdomain GIF videos to 180-second YouTube human activity videos, showing the generalization ability of our approach. Comprehensive ablation studies and thorough analyses are provided to dissect what factors lead to this success. Our code is publicly available. 1 * Equal contribution.
translated by 谷歌翻译
我们介绍了一种视听方法,用于远程文本到视频检索。与以前专为简短视频检索设计的方法(例如,持续时间为5-15秒)不同,我们的方法旨在检索捕获复杂人类动作的长时间视频。仅标准视频方法的一个挑战是与从这样的长视频中处理数百个密集提取的帧相关的大量计算成本。为了解决这个问题,我们建议用紧凑的音频提示替换视频的部分,这些线索简洁地汇总了动态音频事件,并且处理便宜。我们的方法称为Eclipse(带有声音编码的有效剪辑),通过添加一个统一的视听变压器块,将流行的剪辑模型调整为视听视频设置,该块从视频和音频流中捕获互补的提示。除了比仅长期视频的方法快2.92倍和2.34倍的内存效率外,我们的方法还可以在几个不同的远程视频数据集上,例如ActivityNet,QVHighighlights,Youcoook2,Youcoook2,Youcook2,Youcook2,Youcook2,Youcook2,Youcook2,Youcook2, Didemo和Charades。
translated by 谷歌翻译
Video-language pre-training is crucial for learning powerful multi-modal representation. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundation component in SMAUG is masked autoencoders. Different from prior works which only mask textual inputs, our masking strategy considers both visual and textual modalities, providing a better cross-modal alignment and saving more pre-training costs. On top of that, we introduce a space-time token sparsification module, which leverages context information to further select only "important" spatial regions and temporal frames for pre-training. Coupling all these designs allows our method to enjoy both competitive performances on text-to-video retrieval and video question answering tasks, and much less pre-training costs by 1.9X or more. For example, our SMAUG only needs about 50 NVIDIA A6000 GPU hours for pre-training to attain competitive performances on these two video-language tasks across six popular benchmarks.
translated by 谷歌翻译
为了为视频产生适当的标题,推理需要确定相关的概念并注意它们之间的空间关系以及剪辑中的时间发展。我们的端到端编码器视频字幕框架结合了两个基于变压器的体系结构,这是一种用于单个关节时空视频分析的改编变压器,以及用于高级文本生成的基于自我注意力的解码器。此外,我们引入了一种自适应框架选择方案,以减少所需的传入帧数,同时在训练两个变压器时保持相关内容。此外,我们通过汇总每个样本的所有基础真理标题来估计与视频字幕相关的语义概念。我们的方法在MSVD以及大规模的MSR-VTT和VATEX基准数据集上实现了最新的结果,并考虑了多个自然语言产生(NLG)指标。对多样性得分的其他评估突出了我们生成的标题结构的表现力和多样性。
translated by 谷歌翻译
This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa that reuses a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to ``flattened frame embeddings'', yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as inputs and generates \(N\) token embeddings per frame for totally \(T\) video frames. We flatten \(N \times T\) token embeddings as a long sequence of frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
translated by 谷歌翻译
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves comparable or better than state-of-the-art results on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
translated by 谷歌翻译
密集的视频字幕旨在为未修剪视频中的一系列事件生成相应的文本描述,这些事件可以分为两个子任务,即事件检测和事件字幕。与以前分别解决这两个子任务的作品不同,最近的作品着重于增强两个子任务之间的任务间关联。但是,由于其特定于任务的解决方案的巨大差异,设计用于事件检测和字幕的任务间相互作用并不是微不足道的。此外,以前的事件检测方法通常会忽略事件之间的时间依赖性,从而导致事件冗余或不一致问题。在本文中,我们将事件检测定义为序列生成任务,并提出一个统一的预训练和微调框架,以自然增强事件检测和字幕之间的任务间关联。由于该模型将每个事件预测为以前的事件为上下文,因此事件之间的相互依赖性被充分利用,因此我们的模型可以检测到视频中更多样化和一致的事件。 ActivityNet数据集上的实验表明,我们的模型优于最新方法,并且在对大型视频文本数据进行预训练时,可以进一步提高。代码可在\ url {https://github.com/qiqang/uedvc}上获得。
translated by 谷歌翻译
文本视频检索是一项具有巨大实际价值的任务,并受到了越来越多的关注,其中学习时空视频表示是研究热点之一。最先进的视频检索模型中的视频编码通常会直接采用预训练的视觉主链,其网络结构固定,因此无法进一步改进它们以产生细粒度的空间时间表视频表示。在本文中,我们提出了令牌移位和选择网络(TS2-NET),这是一种新型的令牌移动和选择变压器体系结构,该架构会动态调整令牌序列,并从输入视频样本中选择时间和空间维度中的信息令牌。令牌移位模块在时间上暂时移动整个代币特征,来回跨相邻帧,以保留完整的令牌表示并捕获微妙的动作。然后,令牌选择模块选择对局部空间语义贡献最大的令牌。基于彻底的实验,拟议的TS2-NET在主要文本视频检索基准上实现了最先进的性能,包括有关MSRVTT,VATEX,LSMDC,LSMDC,ActivityNetnet和DideMo的新记录。
translated by 谷歌翻译
本文介绍了Omnivl,这是一种新的基础模型,旨在使用一种通用体系结构来支持图像语言和视频语言任务。它为图像和视频输入采用了统一的基于变压器的视觉编码器,因此可以执行联合图像语言和视频语言预处理。我们首次证明了这样的范式受益于图像和视频任务,而不是传统的单向传输(例如,使用图像语言来帮助视频语言)。为此,我们提出了对图像语言和视频语言的脱钩关节预处理,以有效地将视觉模型分解为空间和时间维度,并在图像和视频任务上获得性能提升。此外,我们引入了一种新颖的统一视觉对比度(UNIVLC)损失,以利用图像文本,视频文本,图像标签(例如,图像分类),视频标签(例如,视频动作识别)在一起受到监督和吵闹的监督预处理数据都尽可能多地利用。无需额外的任务适配器,Omnivl可以同时支持仅视觉任务(例如,图像分类,视频操作识别),跨模式对齐任务(例如,图像/视频 - 文本检索)和多模式理解和生成任务(例如,图像/视频问答,字幕)。我们在各种下游任务上评估Omnivl,并以相似的模型大小和数据量表获得最新的或竞争结果。
translated by 谷歌翻译
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporal. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
translated by 谷歌翻译
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
translated by 谷歌翻译
最近,通过引入大规模的数据集和强大的变压器网络,视频预培训表明尤其是检索的巨大成功。然而,现有的视频语言变压器模型没有明确细粒度的语义对齐。在这项工作中,我们呈现了对象感知的变换器,以对象为中心的方法,该对象方法扩展了视频语言变压器来合并对象表示。关键的想法是利用边界框和对象标签来指导培训过程。我们在四个广泛使用的基准测试中评估了我们的三个标准子任务的模型。我们还提供了深入的分析和详细消融关于所提出的方法。我们在考虑的所有任务和数据集中表现出清晰的性能,展示将对象表示的模型中的型号集成到视频架构中。代码将以\ URL {https://github.com/fingerrec/oa -transformer}释放。
translated by 谷歌翻译
Vision-and语言(VL)预培训已被证明对各种VL下游任务非常有效。虽然最近的工作表明,基于完全变换器的VL模型可以比以前的基于区域特征的方法更有效,但它们在下游任务上的性能通常显着降低。在本文中,我们呈现仪表〜(\ textbf {m} ultimodal \ textbf {e} nd-to-text \ textbf {t} ransform \ textbf {er}),我们通过它系统地调查如何设计和预先列车基于完全变换器的VL模型以端到端的方式。具体而言,我们将模型设计沿多个尺寸分析:视觉编码器(例如,剪辑 - vit,Swin变压器),文本编码器(例如,Roberta,Deberta),多模式融合(例如,合并注意力与共同关注),架构设计(例如,仅编码器与编码器 - 解码器)和预训练目标(例如,屏蔽图像建模)。我们对广泛的VL任务进行全面实验,并提供有关如何在保持快速推理速度的同时培训表演VL变压器的见解。值得注意的是,仪表〜使用仅使用4M图像进行预培训的VQAV2 TEST-STD设置的精度为77.64 \%,超过最先进的区域特征的VINVL模型+1.04 \%,以及优于以前最好的完全变换器的ALBEF模型+1.6 \%。
translated by 谷歌翻译
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adpation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
视频文本预培训旨在通过对齐视觉和文本信息之间的语义来对齐大规模视频文本对学习可转换表示。最先进的方法以端到端的方式从原始像素提取视觉特征。然而,这些方法直接在帧级运行,从而忽略了视频中对象的时空结构,这在文本描述中具有名词的强烈协同作用。在这项工作中,我们提出了一个简单而有效的模块,即用于视频文本表示学习,即RegionLearner,它可以考虑在大规模视频文本对预训练中的对象结构。给定视频,我们的模块(1)首先将可视特征量化为语义集群,然后(2)生成被动掩码并使用它们聚合属于同一语义区域的功能,最后(3)模拟不同聚合区域之间的交互。与使用现成的对象探测器相比,我们所提出的模块不需要明确的监督,并且更加计算效率。我们在公共WebVID2M和CC3M数据集上预先列车。对四个下游视频文本检索基准测试的广泛评估清楚地展示了我们的地区learner的有效性。代码将在https://github.com/ruiyan1995/region_learner上获得。
translated by 谷歌翻译