本文介绍了一种基于纯变压器的方法,称为视频动作识别的多模态视频变压器(MM-VIT)。与仅利用解码的RGB帧的其他方案不同,MM-VIT专门在压缩视频域中进行操作,并利用所有容易获得的模式,即I帧,运动向量,残差和音频波形。为了处理从多种方式提取的大量时空令牌,我们开发了几种可扩展的模型变体,它们将自我关注分解在空间,时间和模态尺寸上。此外,为了进一步探索丰富的模态互动及其效果,我们开发并比较了可以无缝集成到变压器构建块中的三种不同的交叉模态注意力机制。关于三个公共行动识别基准的广泛实验(UCF-101,某事-V2,Kinetics-600)证明了MM-VIT以效率和准确性的最先进的视频变压器,并且表现更好或同样地表现出对于具有计算重型光学流的最先进的CNN对应物。
translated by 谷歌翻译
视觉变压器正在成为解决计算机视觉问题的强大工具。最近的技术还证明了超出图像域之外的变压器来解决许多与视频相关的任务的功效。其中,由于其广泛的应用,人类的行动识别是从研究界受到特别关注。本文提供了对动作识别的视觉变压器技术的首次全面调查。我们朝着这个方向分析并总结了现有文献和新兴文献,同时突出了适应变形金刚以进行动作识别的流行趋势。由于其专业应用,我们将这些方法统称为``动作变压器''。我们的文献综述根据其架构,方式和预期目标为动作变压器提供了适当的分类法。在动作变压器的背景下,我们探讨了编码时空数据,降低维度降低,框架贴片和时空立方体构造以及各种表示方法的技术。我们还研究了变压器层中时空注意的优化,以处理更长的序列,通常通过减少单个注意操作中的令牌数量。此外,我们还研究了不同的网络学习策略,例如自我监督和零局学习,以及它们对基于变压器的行动识别的相关损失。这项调查还总结了在具有动作变压器重要基准的评估度量评分方面取得的进步。最后,它提供了有关该研究方向的挑战,前景和未来途径的讨论。
translated by 谷歌翻译
我们呈现了基于纯变压器的视频分类模型,在图像分类中最近的近期成功进行了借鉴。我们的模型从输入视频中提取了时空令牌,然后由一系列变压器层编码。为了处理视频中遇到的令牌的长序列,我们提出了我们模型的几种有效的变体,它们将输入的空间和时间维构建。虽然已知基于变换器的模型只有在可用的大型训练数据集时才有效,但我们展示了我们如何在训练期间有效地规范模型,并利用预先训练的图像模型能够在相对小的数据集上训练。我们进行彻底的消融研究,并在包括动力学400和600,史诗厨房,东西的多个视频分类基准上实现最先进的结果,其中 - 基于深度3D卷积网络的现有方法表现出优先的方法。为了促进进一步的研究,我们在https://github.com/google-research/scenic/tree/main/scenic/projects/vivit发布代码
translated by 谷歌翻译
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
压缩视频动作识别最近引起了人们的注意,因为它通过用稀疏采样的RGB帧和压缩运动提示(例如运动向量和残差)替换原始视频来大大降低存储和计算成本。但是,这项任务严重遭受了粗糙和嘈杂的动力学以及异质RGB和运动方式的融合不足。为了解决上面的两个问题,本文提出了一个新颖的框架,即具有运动增强的细心跨模式相互作用网络(MEACI-NET)。它遵循两流体系结构,即一个用于RGB模式,另一个用于运动模态。特别是,该运动流采用带有denoising模块的多尺度块来增强表示表示。然后,通过引入选择性运动补充(SMC)和跨模式增强(CMA)模块来加强两条流之间的相互作用,其中SMC与时空上的局部局部运动相互补充,CMA和CMA进一步将两种模态与两种模态相结合。选择性功能增强。对UCF-101,HMDB-51和Kinetics-400基准的广泛实验证明了MEACI-NET的有效性和效率。
translated by 谷歌翻译
An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and Penn Action datasets, and showed results better than or similar to pre-trained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, the possibility of achieving an explainable potential for action recognition is presented.
translated by 谷歌翻译
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced with the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey we analyze main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled as input-level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
translated by 谷歌翻译
视频识别是由端到端学习范式主导的 - 首先初始化具有预审预周化图像模型的视频识别模型,然后对视频进行端到端培训。这使视频网络能够受益于验证的图像模型。但是,这需要大量的计算和内存资源,以便在视频上进行填充以及直接使用预审计的图像功能的替代方案,而无需填充图像骨架会导致结果不足。幸运的是,在对比视力语言预训练(剪辑)方面的最新进展为视觉识别任务的新途径铺平了道路。这些模型在大型开放式图像文本对数据上进行了预测,以丰富的语义学习强大的视觉表示。在本文中,我们介绍了有效的视频学习(EVL) - 一种有效的框架,用于直接训练具有冷冻剪辑功能的高质量视频识别模型。具体来说,我们采用轻型变压器解码器并学习查询令牌,从剪辑图像编码器中动态收集帧级空间特征。此外,我们在每个解码器层中采用局部时间模块,以发现相邻帧及其注意力图的时间线索。我们表明,尽管有效地使用冷冻的骨干训练,但我们的模型在各种视频识别数据集上学习了高质量的视频表示。代码可在https://github.com/opengvlab/feld-video-rencognition上找到。
translated by 谷歌翻译
视频理解需要在多种时空分辨率下推理 - 从短的细粒度动作到更长的持续时间。虽然变压器架构最近提出了最先进的,但它们没有明确建模不同的时空分辨率。为此,我们为视频识别(MTV)提供了多视图变压器。我们的模型由单独的编码器组成,表示输入视频的不同视图,以横向连接,以跨视图熔断信息。我们对我们的模型提供了彻底的消融研究,并表明MTV在一系列模型尺寸范围内的准确性和计算成本方面始终如一地表现优于单视对应力。此外,我们在五个标准数据集上实现最先进的结果,并通过大规模预制来进一步提高。我们将释放代码和备用检查点。
translated by 谷歌翻译
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of framelevel patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/ facebookresearch/TimeSformer.
translated by 谷歌翻译
利用在大规模图像文本对中预先训练的视觉和语言模型(VLM)成为开放式视觉识别的有希望的范式。在这项工作中,我们通过利用视频中自然存在的运动和音频来扩展这种范式。我们提出\ textbf {mov},这是\ textbf {m} ult-imodal \ textbf {o} pen- \ textbf {v} ocabulary视频分类的简单而有效的方法。在MOV中,我们直接使用具有最小修改的预训练VLM的视觉编码器来编码视频,光流和音频频谱图。我们设计一种跨模式融合机制来汇总免费的多模式信息。 Kinetics-700和VGGSOUND的实验表明,引入流量或音频模态会带来预先训练的VLM和现有方法的大量性能增长。具体而言,MOV极大地提高了基础类别的准确性,而在新颖的课程上则更好地概括了。 MOV在UCF和HMDB零摄像视频分类基准上实现了最新结果,从而极大地超过了基于VLMS的传统零摄像方法和最新方法。代码和模型将发布。
translated by 谷歌翻译
动作检测的任务旨在在每个动作实例中同时推论动作类别和终点的本地化。尽管Vision Transformers推动了视频理解的最新进展,但由于在长时间的视频剪辑中,设计有效的架构以进行动作检测是不平凡的。为此,我们提出了一个有效的层次时空时空金字塔变压器(STPT)进行动作检测,这是基于以下事实:变压器中早期的自我注意力层仍然集中在局部模式上。具体而言,我们建议在早期阶段使用本地窗口注意来编码丰富的局部时空时空表示,同时应用全局注意模块以捕获后期的长期时空依赖性。通过这种方式,我们的STPT可以用冗余的大大减少来编码区域和依赖性,从而在准确性和效率之间进行有希望的权衡。例如,仅使用RGB输入,提议的STPT在Thumos14上获得了53.6%的地图,超过10%的I3D+AFSD RGB模型超过10%,并且对使用其他流量的额外流动功能的表现较少,该流量具有31%的GFLOPS ,它是一个有效,有效的端到端变压器框架,用于操作检测。
translated by 谷歌翻译
基于变压器的方法最近在基于2D图像的视力任务上取得了巨大进步。但是,对于基于3D视频的任务,例如动作识别,直接将时空变压器应用于视频数据将带来沉重的计算和记忆负担,因为斑块的数量大大增加以及自我注意计算的二次复杂性。如何对视频数据的3D自我注意力进行有效地建模,这对于变压器来说是一个巨大的挑战。在本文中,我们提出了一种时间贴片移动(TPS)方法,用于在变压器中有效的3D自发明建模,以进行基于视频的动作识别。 TPS在时间尺寸中以特定的镶嵌图模式移动斑块的一部分,从而将香草的空间自我发项操作转换为时空的一部分,几乎没有额外的成本。结果,我们可以使用几乎相同的计算和记忆成本来计算3D自我注意力。 TPS是一个插件模块,可以插入现有的2D变压器模型中,以增强时空特征学习。提出的方法可以通过最先进的V1和V1,潜水-48和Kinetics400实现竞争性能,同时在计算和内存成本方面效率更高。 TPS的源代码可在https://github.com/martinxm/tps上找到。
translated by 谷歌翻译
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blcoks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
translated by 谷歌翻译
我们使用无卷积的变压器架构提出了一种从未标记数据学习多式式表示的框架。具体而言,我们的视频音频文本变压器(Vatt)将原始信号作为输入提取,提取丰富的多式化表示,以使各种下游任务受益。我们使用多模式对比损失从头划线训练Vatt端到端,并通过视频动作识别,音频事件分类,图像分类和文本到视频检索的下游任务评估其性能。此外,我们通过共享三种方式之间的重量来研究模型 - 无话的单骨架变压器。我们表明,无卷积VATT优于下游任务中的最先进的Convnet架构。特别是,Vatt的视觉变压器在动力学-400上实现82.1%的高精度82.1%,在动力学-600,72.7%的动力学-700上的72.7%,以及时间的时间,新的记录,在避免受监督的预训练时,新的记录。通过从头划伤训练相同的变压器,转移到图像分类导致图像分类导致78.7%的ImageNet精度为64.7%,尽管视频和图像之间的域间差距,我们的模型概括了我们的模型。 Vatt的音雅音频变压器还通过在没有任何监督的预训练的情况下在Audioset上实现39.4%的地图来设置基于波形的音频事件识别的新记录。 Vatt的源代码是公开的。
translated by 谷歌翻译
We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.
translated by 谷歌翻译
卷积是现代神经网络最重要的特征变革,导致深度学习的进步。最近的变压器网络的出现,取代具有自我关注块的卷积层,揭示了静止卷积粒的限制,并将门打开到动态特征变换的时代。然而,现有的动态变换包括自我关注,全部限制了视频理解,其中空间和时间的对应关系,即运动信息,对于有效表示至关重要。在这项工作中,我们引入了一个关系功能转换,称为关系自我关注(RSA),通过动态生成关系内核和聚合关系上下文来利用视频中丰富的时空关系结构。我们的实验和消融研究表明,RSA网络基本上表现出卷积和自我关注的同行,在标准的运动中心基准上实现了用于视频动作识别的标准主导的基准,例如用于V1&V2,潜水48和Filegym。
translated by 谷歌翻译
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short-and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of subconvolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
translated by 谷歌翻译
虽然变形金机对视频识别任务的巨大潜力具有较强的捕获远程依赖性的强大能力,但它们经常遭受通过对视频中大量3D令牌的自我关注操作引起的高计算成本。在本文中,我们提出了一种新的变压器架构,称为双重格式,可以有效且有效地对视频识别进行时空关注。具体而言,我们的Dualformer将完全时空注意力分层到双级级联级别,即首先在附近的3D令牌之间学习细粒度的本地时空交互,然后捕获查询令牌之间的粗粒度全局依赖关系。粗粒度全球金字塔背景。不同于在本地窗口内应用时空分解或限制关注计算以提高效率的现有方法,我们本地 - 全球分层策略可以很好地捕获短期和远程时空依赖项,同时大大减少了钥匙和值的数量在注意计算提高效率。实验结果表明,对抗现有方法的五个视频基准的经济优势。特别是,Dualformer在动态-400/600上设置了新的最先进的82.9%/ 85.2%,大约1000g推理拖鞋,比具有相似性能的现有方法至少3.2倍。
translated by 谷歌翻译