We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
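The following is a minimal PyTorch sketch of the divided space-time attention block described in the abstract above; it is an assumed illustration rather than the released TimeSformer code (the class name `DividedSpaceTimeBlock` is hypothetical, and the classification token is omitted for brevity). Each block first lets every spatial location attend across frames (temporal attention), then lets the patches within each frame attend to one another (spatial attention), followed by an MLP.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention followed by spatial attention (hedged sketch)."""
    def __init__(self, dim=768, heads=12, frames=8, patches=196):
        super().__init__()
        self.frames, self.patches = frames, patches
        self.norm_t, self.norm_s, self.norm_m = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, frames * patches, dim)
        b, _, d = x.shape
        t, p = self.frames, self.patches
        # Temporal attention: each spatial location attends across the T frames.
        xt = x.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt_n = self.norm_t(xt)
        xt = self.attn_t(xt_n, xt_n, xt_n)[0]
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3).reshape(b, t * p, d)
        # Spatial attention: the patches of each frame attend to one another.
        xs = x.reshape(b * t, p, d)
        xs_n = self.norm_s(xs)
        x = x + self.attn_s(xs_n, xs_n, xs_n)[0].reshape(b, t * p, d)
        return x + self.mlp(self.norm_m(x))

# Example: an 8-frame clip with 14x14 patches of dimension 768.
tokens = torch.randn(2, 8 * 196, 768)
print(DividedSpaceTimeBlock()(tokens).shape)  # torch.Size([2, 1568, 768])
```

Compared with joint space-time attention over all T*P tokens at once, this factorization keeps each attention call over only T or P tokens, which is what makes longer clips tractable.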
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10 seconds in length). Because of this, it is challenging to apply such models to long movie understanding tasks, which typically require sophisticated long-range temporal reasoning. The recently introduced video transformers partially address this issue by using long-range temporal self-attention. However, due to the quadratic cost of self-attention, such models are often expensive and impractical to use. Instead, we propose ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer. Our model uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. By progressively reducing the spatiotemporal feature resolution and channel dimension at each decoder layer, ViS4mer learns complex long-range spatiotemporal dependencies in a video. Furthermore, ViS4mer is 2.63x faster and requires 8x less GPU memory than the corresponding pure self-attention-based model. Additionally, ViS4mer achieves state-of-the-art results on 6 out of 9 long-form movie video classification tasks of the Long Video Understanding (LVU) benchmark. Furthermore, we show that our approach successfully generalizes to other domains, achieving competitive results on the Breakfast and COIN procedural activity datasets. The code is publicly available at: https://github.com/md-mohaiminul/vis4mer.
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies and achieve state-of-the-art results on multiple video classification benchmarks, including Kinetics 400 and 600, Epic Kitchens and Something-Something, where prior methods based on deep 3D convolutional networks are outperformed. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit
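As one hedged illustration of the factorisation the abstract refers to, the sketch below implements a "factorised encoder"-style variant in PyTorch: a spatial transformer encodes the patch tokens of each frame independently, and a temporal transformer then models the sequence of per-frame representations. The class name `FactorisedEncoder` and the layer counts are assumptions, not the released ViViT configuration.

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, spatial_layers=4, temporal_layers=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.spatial = nn.TransformerEncoder(make_layer(), spatial_layers)
        self.temporal = nn.TransformerEncoder(make_layer(), temporal_layers)

    def forward(self, tokens):  # tokens: (batch, frames, patches, dim)
        b, t, p, d = tokens.shape
        x = self.spatial(tokens.reshape(b * t, p, d))   # spatial encoding within each frame
        frame_tokens = x.mean(dim=1).reshape(b, t, d)    # one pooled token per frame
        return self.temporal(frame_tokens).mean(dim=1)   # temporal encoding across frames

clip = torch.randn(2, 8, 196, 768)                       # 8 frames, 14x14 patches
print(FactorisedEncoder()(clip).shape)                   # torch.Size([2, 768])
```

Because the spatial and temporal attention operate on sequences of length P and T respectively, this variant avoids the quadratic cost of attending over all T*P tokens jointly.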
Although transformers have shown great potential for video recognition tasks thanks to their strong ability to capture long-range dependencies, they often suffer from the high computational cost induced by self-attention over the large number of 3D tokens in a video. In this paper, we propose a new transformer architecture, termed DualFormer, which can efficiently and effectively perform space-time attention for video recognition. Specifically, our DualFormer stratifies the full space-time attention into a dual cascaded level: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query tokens and a coarse-grained global pyramid context. Unlike existing methods that apply space-time factorization or restrict the attention computation to local windows to improve efficiency, our local-global stratification strategy can capture both short-term and long-range spatiotemporal dependencies well, while greatly reducing the number of keys and values in the attention computation to boost efficiency. Experimental results show the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer sets a new state-of-the-art top-1 accuracy of 82.9%/85.2% on Kinetics-400/600 with around 1000G inference FLOPs, which is at least 3.2x fewer than existing methods with similar performance.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
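A minimal PyTorch sketch of the factorization described above, assuming a straightforward decomposition rather than the paper's exact layer specification: a full t x d x d 3D convolution is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution, with a nonlinearity in between.

```python
import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    def __init__(self, in_ch, out_ch, mid_ch=None, spatial_k=3, temporal_k=3):
        super().__init__()
        # The paper chooses mid_ch so the parameter count matches a full 3D conv;
        # here we simply default it to out_ch.
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, spatial_k, spatial_k),
                                 padding=(0, spatial_k // 2, spatial_k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(temporal_k, 1, 1),
                                  padding=(temporal_k // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

video = torch.randn(2, 3, 16, 112, 112)
print(R2Plus1DConv(3, 64)(video).shape)  # torch.Size([2, 64, 16, 112, 112])
```

The extra nonlinearity between the spatial and temporal convolutions is one reason the factorized block can be more expressive than a single 3D convolution with a comparable parameter budget.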
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
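The sketch below shows, under assumed hyperparameters, how an image becomes the sequence of patch tokens the abstract describes: a strided convolution is equivalent to cutting the image into non-overlapping patches and linearly projecting each one, after which a class token and learned position embeddings are added.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images):  # images: (batch, channels, height, width)
        x = self.proj(images).flatten(2).transpose(1, 2)    # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # ready for a Transformer encoder

print(PatchEmbedding()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```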
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
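A hedged sketch of the caching idea described above (not the released MeMViT code, which additionally compresses the cached tokens): keys and values from previously processed clips are detached from the graph and kept in a short queue, and the current clip's queries attend over both the cached and the current tokens.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim=256, heads=4, max_memory=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.max_memory = max_memory
        self.memory = []  # cached (detached) token tensors from earlier clips

    def forward(self, x):  # x: (batch, tokens, dim) for the current clip
        context = torch.cat(self.memory + [x], dim=1)   # cached clips + current clip
        out, _ = self.attn(x, context, context)
        self.memory.append(x.detach())                   # cache without gradients
        self.memory = self.memory[-self.max_memory:]     # keep only the newest clips
        return out

layer = MemoryAttention()
for step in range(4):                                    # four consecutive clips of one video
    out = layer(torch.randn(1, 64, 256))
    print(step, out.shape)                               # temporal support grows with the cache
```

Because the cached tokens are detached, each training step only back-propagates through the current clip, which is what keeps the extra compute marginal.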
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
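As a hedged, heavily simplified sketch of the multiscale idea (the implementation in the linked SlowFast repository pools the queries as well and does so per attention head), the block below pools the key/value tokens of a 2D token grid before attention, so that deeper stages can trade spatial resolution for channel capacity.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    def __init__(self, dim=96, heads=1, kv_stride=2):
        super().__init__()
        self.pool_kv = nn.MaxPool2d(kernel_size=kv_stride, stride=kv_stride)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, grid):  # x: (batch, H*W, dim); grid: (H, W)
        b, n, d = x.shape
        h, w = grid
        kv = self.pool_kv(x.transpose(1, 2).reshape(b, d, h, w))  # pool the token grid
        kv = kv.flatten(2).transpose(1, 2)                        # shorter key/value sequence
        out, _ = self.attn(x, kv, kv)                             # queries keep full length here
        return out

x = torch.randn(2, 56 * 56, 96)
print(PoolingAttention()(x, (56, 56)).shape)  # torch.Size([2, 3136, 96])
```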
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with the weights of a pretrained image model, and then training it end-to-end on videos. This enables the video network to benefit from pretrained image models. However, it requires substantial computation and memory resources for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to sub-optimal results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large-scale open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn query tokens to dynamically collect frame-level spatial features from the CLIP image encoder. In addition, we adopt a local temporal module in each decoder layer to discover temporal cues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
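A minimal sketch of the training setup described above, under clearly assumed stand-ins: a generic frozen per-frame backbone takes the place of the CLIP image encoder, and a small transformer decoder with learned query tokens pools the frame features into a clip representation for classification. Class names and sizes here are illustrative, not the EVL implementation.

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoHead(nn.Module):
    def __init__(self, backbone, feat_dim=512, num_queries=4, num_classes=400):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():          # the image backbone stays frozen
            p.requires_grad = False
        self.queries = nn.Parameter(torch.zeros(1, num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, dim_feedforward=feat_dim * 4,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):  # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        with torch.no_grad():                          # no gradients through the image model
            feats = self.backbone(frames.flatten(0, 1)).reshape(b, t, -1)
        q = self.queries.expand(b, -1, -1)
        pooled = self.decoder(q, feats)                # queries cross-attend to frame features
        return self.classifier(pooled.mean(dim=1))

# Toy frozen "backbone": a per-frame projection standing in for a CLIP image encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
logits = FrozenBackboneVideoHead(backbone)(torch.randn(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 400])
```

Only the queries, the decoder and the classifier receive gradients, which is what keeps training cheap relative to finetuning the whole backbone on video.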
Transformer-based methods have recently achieved great advances on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens, because of the greatly increased number of patches and the quadratic complexity of self-attention computation. How to efficiently and effectively model the 3D self-attention of video data has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of the patches with a specific mosaic pattern along the temporal dimension, thus converting a vanilla spatial self-attention operation into a spatiotemporal one with almost no additional cost. As a result, we can compute 3D self-attention with nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module that can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves competitive performance with the state-of-the-art on Something-Something v1 and v2, Diving-48 and Kinetics-400, while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/martinxm/tps.
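A hedged sketch of the shifting operation (the paper uses specific mosaic patterns; the pattern below is a simplified stand-in): a fraction of each frame's patch tokens is replaced by the tokens at the same spatial positions in the neighbouring frames, after which an ordinary spatial self-attention layer also mixes temporal information.

```python
import torch

def temporal_patch_shift(x, shift_ratio=0.25):
    """x: (batch, time, patches, dim). Shift some patches along the time axis."""
    b, t, p, d = x.shape
    n = int(p * shift_ratio)
    out = x.clone()
    out[:, :, :n] = torch.roll(x[:, :, :n], shifts=1, dims=1)             # tokens from the previous frame
    out[:, :, n:2 * n] = torch.roll(x[:, :, n:2 * n], shifts=-1, dims=1)  # tokens from the next frame
    return out  # feed into a plain spatial attention block, then shift back if desired

tokens = torch.randn(2, 8, 196, 768)
print(temporal_patch_shift(tokens).shape)  # torch.Size([2, 8, 196, 768])
```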
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short, fine-grained motions to events occurring over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they do not explicitly model different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders representing different views of the input video, with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.
In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms, where the former outperforms the latter in terms of accuracy/compute. Without bells and whistles, MViT achieves state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.
Vision transformers are emerging as a powerful tool to solve computer vision problems. Recent techniques have also proven the efficacy of transformers beyond the image domain for solving numerous video-related tasks. Among those, human action recognition is receiving special attention from the research community due to its widespread applications. This article provides the first comprehensive survey of vision transformer techniques for action recognition. We analyze and summarize the existing and emerging literature in this direction while highlighting the popular trends in adapting transformers for action recognition. Due to their specialized application, we collectively refer to these methods as ``action transformers''. Our literature review provides a suitable taxonomy for action transformers based on their architecture, modality, and intended objective. Within the context of action transformers, we explore techniques for encoding spatiotemporal data, dimensionality reduction, frame patch and spatiotemporal cube construction, and various representation methods. We also investigate the optimization of spatiotemporal attention in transformer layers to handle longer sequences, typically by reducing the number of tokens in a single attention operation. Moreover, we investigate different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition. This survey also summarizes the progress made in terms of evaluation metric scores on important benchmarks for action transformers. Finally, it provides a discussion on the challenges, outlook, and future avenues of this research direction.
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
We present a simple approach that turns a ViT encoder into an efficient video model which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to perform training and inference on both. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results, and the code will be open-sourced.
Learning spatiotemporal representations from video is a challenging task, due to the large local redundancy and complex global dependency between video frames. The recent advances in this research have mainly been driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy within a small 3D neighborhood, it lacks the capability to capture global dependency because of its limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency via the self-attention mechanism, but they are limited in reducing local redundancy, since blind similarity comparison is performed among all tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency by learning local and global token affinity in shallow and deep layers, respectively. We conduct extensive experiments on popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1 & V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring fewer GFLOPs than other state-of-the-art methods. On Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.9% and 71.2% top-1 accuracy, respectively. Code is available at https://github.com/sense-x/uniformer.
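A minimal sketch of the local/global split the abstract describes, with assumed module names and sizes rather than the released UniFormer code: shallow blocks aggregate local context with a cheap depthwise 3D convolution (suppressing local redundancy), while deep blocks use global spatiotemporal self-attention to capture long-range dependency.

```python
import torch
import torch.nn as nn

class LocalBlock(nn.Module):          # shallow layers: local relation aggregation
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pwconv = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, x):             # x: (batch, dim, time, height, width)
        return x + self.pwconv(self.dwconv(x))

class GlobalBlock(nn.Module):         # deep layers: global spatiotemporal attention
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, d, t, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (batch, t*h*w, dim)
        seq_n = self.norm(seq)
        seq = seq + self.attn(seq_n, seq_n, seq_n)[0]
        return seq.transpose(1, 2).reshape(b, d, t, h, w)

video_feat = torch.randn(2, 64, 8, 14, 14)
out = GlobalBlock(64)(LocalBlock(64)(video_feat))
print(out.shape)  # torch.Size([2, 64, 8, 14, 14])
```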
The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves comparable or better than state-of-the-art results on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, to work particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work that used HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% top-1 accuracy with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
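A hedged, toy-scale sketch of the training objective described above, assuming scikit-image's HOG implementation for the regression targets (the paper computes HOG on the original video frames; all names and sizes here are illustrative): a fraction of the patch tokens is replaced with a learnable mask token, an encoder processes the corrupted sequence, and a linear head regresses the HOG descriptor of each masked patch.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog

def hog_targets(patches):
    """patches: (num_patches, 16, 16) grayscale crops -> (num_patches, 36) HOG descriptors."""
    feats = [hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(1, 1))
             for p in patches]
    return torch.tensor(np.stack(feats), dtype=torch.float32)

class MaskedFeaturePredictor(nn.Module):
    def __init__(self, dim=256, hog_dim=36, depth=4, heads=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, hog_dim)

    def forward(self, tokens, mask):  # tokens: (batch, n, dim); mask: (batch, n) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.head(self.encoder(x))
        return pred[mask]             # predictions only at masked positions

# Toy example: 8 patch tokens per sample, the first 3 of each sample masked.
tokens = torch.randn(2, 8, 256)
mask = torch.zeros(2, 8, dtype=torch.bool)
mask[:, :3] = True
patches = np.random.rand(int(mask.sum()), 16, 16)          # the image crops at masked positions
loss = nn.functional.mse_loss(MaskedFeaturePredictor()(tokens, mask), hog_targets(patches))
print(loss.item())
```

Because the target is a fixed descriptor rather than the output of another network, pre-training needs no extra model weights, which matches the efficiency argument in the abstract.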