We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pinpointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast.
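As a rough illustration of the two-pathway idea, the sketch below picks frame indices for a Slow and a Fast pathway from the same clip. The function name `sample_pathway_indices`, the Slow stride `tau=16`, and the frame-rate ratio `alpha=8` are assumptions chosen for illustration; the abstract does not state them.

```python
def sample_pathway_indices(num_frames, tau=16, alpha=8):
    """Return (slow, fast) frame indices for one clip.

    tau   -- temporal stride of the Slow pathway (assumed value)
    alpha -- how many times denser the Fast pathway samples (assumed value)
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    slow = list(range(0, num_frames, tau))           # low frame rate
    fast = list(range(0, num_frames, tau // alpha))  # high frame rate
    return slow, fast


slow, fast = sample_pathway_indices(64)
print(len(slow), len(fast))
```

With these assumed settings the Fast pathway sees `alpha` times as many frames as the Slow pathway, which is why it would need far fewer channels to stay lightweight.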
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https://github.com/facebookresearch/SlowFast.
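The stepwise expansion described above can be sketched as a greedy loop that tries growing each axis in turn and keeps the single best one. This is only a toy under stated assumptions: `stepwise_expand`, the doubling factors, and the `toy_score` function are hypothetical stand-ins (in practice each candidate would be trained and evaluated), and the backward contraction step is not shown.

```python
import math


def stepwise_expand(config, factors, score, cost, target_cost):
    """Greedy forward expansion: grow exactly one axis per step."""
    config = dict(config)
    history = []
    while cost(config) < target_cost:
        candidates = []
        for axis, factor in factors.items():
            trial = dict(config)
            trial[axis] = trial[axis] * factor  # expand this axis only
            candidates.append((score(trial), axis, trial))
        # Keep the expansion with the best score at this step.
        _, best_axis, config = max(candidates, key=lambda c: c[0])
        history.append(best_axis)
    return config, history


# Hypothetical stand-ins for "train and evaluate" and for multiply-adds.
WEIGHTS = {"time": 1.0, "space": 0.9, "width": 0.5, "depth": 0.4}


def toy_score(cfg):
    return sum(WEIGHTS[a] * math.sqrt(math.log2(v)) for a, v in cfg.items())


def toy_cost(cfg):
    return math.prod(cfg.values())


start = {"time": 1, "space": 1, "width": 1, "depth": 1}
factors = {axis: 2 for axis in start}
final, history = stepwise_expand(start, factors, toy_score, toy_cost, target_cost=8)
print(history, final)
```

Because the toy score has diminishing returns per axis, the loop spreads its budget across several axes instead of growing one axis repeatedly.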
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
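A minimal sketch of the channel-resolution schedule described above, assuming that each stage halves the spatial resolution and doubles the channel width. The function name `multiscale_stages`, the four stages, and the starting values 56 and 96 are illustrative assumptions, not figures from the abstract.

```python
def multiscale_stages(resolution, channels, num_stages=4):
    """List (spatial resolution, channel width) for each stage."""
    stages = []
    for _ in range(num_stages):
        stages.append((resolution, channels))
        resolution //= 2  # coarser spatial grid in the next stage
        channels *= 2     # larger channel capacity in the next stage
    return stages


for res, ch in multiscale_stages(56, 96):
    print(f"{res}x{res} tokens, {ch} channels")
```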
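In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without bells and whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.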
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
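To make "a weighted sum of the features at all positions" concrete, here is a small pure-Python sketch. The weighting (a softmax over dot-product similarities) is an assumption made for this example; the abstract does not fix a particular similarity function, and the learned projections of a real block are omitted.

```python
import math


def non_local(features):
    """features: list of N feature vectors (lists of floats), one per position."""
    outputs = []
    for x_i in features:
        # Similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Softmax normalisation, shifted by the max for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Response at i: weighted sum of the features at all positions.
        y_i = [
            sum(w * x_j[d] for w, x_j in zip(weights, features))
            for d in range(len(x_i))
        ]
        outputs.append(y_i)
    return outputs


out = non_local([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(out)
```

Every output depends on every input position, which is what distinguishes this from a convolution with a local window.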
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
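The factorization can be illustrated by counting weights: a full t×d×d 3D convolution versus a 1×d×d spatial convolution followed by a t×1×1 temporal one. The helper names below are hypothetical, biases are ignored, and the rule for choosing the intermediate width so that the two parameter counts match is recalled from the R(2+1)D paper rather than stated in this abstract, so treat it as an assumption.

```python
def conv3d_params(t, d, c_in, c_out):
    """Weights in a full t x d x d 3D convolution (no bias)."""
    return t * d * d * c_in * c_out


def matched_mid_channels(t, d, c_in, c_out):
    """Intermediate width that keeps the factorized block near the 3D budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def conv2plus1d_params(t, d, c_in, c_out, mid):
    """Weights in a spatial (1 x d x d) conv followed by a temporal (t x 1 x 1) conv."""
    spatial = d * d * c_in * mid
    temporal = t * mid * c_out
    return spatial + temporal


mid = matched_mid_channels(3, 3, 64, 64)
print(conv3d_params(3, 3, 64, 64), mid, conv2plus1d_params(3, 3, 64, 64, mid))
```

For this configuration the two blocks have the same number of weights, so any accuracy difference would come from the factorized structure and the extra nonlinearity between the two convolutions, not from added capacity.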
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank (supportive information extracted over the entire span of a video) to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is available online at https://github.com/facebookresearch/video-long-term-feature-banks.
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution into a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame can complete multiple temporal aggregations with its neighborhoods. The final equivalent receptive field of the temporal dimension is accordingly enlarged, which makes it capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
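The motion-excitation step (temporal differences used to excite motion-sensitive channels) can be sketched roughly as follows. This is a heavy simplification under stated assumptions: features are already pooled to one value per channel per frame, the learned convolutions of a real module are omitted, and the gate `2 * sigmoid(|diff|) - 1` and the name `motion_excitation` are choices made for this example only.

```python
import math


def motion_excitation(frames):
    """frames: list of T per-frame channel vectors (spatially pooled floats)."""
    num_frames = len(frames)
    out = []
    for t in range(num_frames):
        # Feature-level difference to the next frame; zero for the last frame.
        nxt = frames[t + 1] if t + 1 < num_frames else frames[t]
        diff = [b - a for a, b in zip(frames[t], nxt)]
        # Gate in [0, 1): zero for static channels, larger for changing ones.
        gate = [2.0 / (1.0 + math.exp(-abs(d))) - 1.0 for d in diff]
        # Residual excitation: static channels pass through unchanged.
        out.append([x + x * g for x, g in zip(frames[t], gate)])
    return out


excited = motion_excitation([[1.0, 1.0], [1.0, 3.0]])
print(excited)
```

In this toy input the first channel does not change between frames and is left as-is, while the second channel changes and is amplified.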
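In video data, busy motion details from moving regions are conveyed within a specific frequency bandwidth in the frequency domain. Meanwhile, the remaining frequencies of video data encode quiet information with substantial redundancy, which causes low processing efficiency in existing video models that take raw RGB frames as input. In this paper, we consider allocating more computation to the important busy information and less computation to the quiet information. We design a trainable Motion Band-Pass Module (MBPM) for separating busy information from quiet information in raw video data. By embedding the MBPM into a two-pathway CNN architecture, we define a Busy-Quiet Net (BQN). The efficiency of BQN comes from avoiding redundancy in the feature space processed by the two pathways: one operates on low-resolution quiet features, while the other processes busy features. The proposed BQN slightly outperforms recent video processing models on the Something-Something V1, Kinetics400, UCF101 and HMDB51 datasets.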
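Most existing deep neural networks are static, which means they can only perform inference at a fixed complexity. But the resource budget can vary substantially. Even on a single device, the affordable budget can change with different scenarios, and repeatedly training networks for each required budget would be very expensive. Therefore, in this work, we propose a general method called MutualNet to train a single network that can run under a diverse set of resource constraints. Our method trains a cohort of model configurations with various network widths and input resolutions. This mutual learning scheme not only allows the model to run at different width-resolution configurations but also transfers the unique knowledge among these configurations, helping the model to learn stronger representations. MutualNet is a general training methodology that can be applied to various network structures (e.g., 2D networks: MobileNets, ResNet; 3D networks: SlowFast, X3D) and various tasks (e.g., image classification, object detection, segmentation, and action recognition), and is demonstrated to achieve consistent improvements on a variety of datasets. Since we only train the model once, it also greatly reduces the training cost compared to independently training several models. Surprisingly, MutualNet can also be used to significantly boost the performance of a single network if dynamic resource constraints are not a concern. In summary, MutualNet is a unified method for both static and adaptive, 2D and 3D networks. Code and pre-trained models are available at https://github.com/tayang1122/mutualnet.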
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist, including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level "semantic" features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
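The online, memory-caching strategy can be sketched as a loop that processes one clip at a time and keeps a bounded cache from earlier iterations. This is a toy under stated assumptions: `process_online` and `memory_len` are hypothetical names, raw per-frame items stand in for whatever the real model caches, and the "model" here only reports how much context each step can see.

```python
from collections import deque


def process_online(clips, memory_len=2):
    """Process clips in order; return the context size visible at each step."""
    memory = deque(maxlen=memory_len)  # cached features from earlier clips
    context_sizes = []
    for clip in clips:
        # Current clip plus everything still held in the cache.
        context = [f for cached in memory for f in cached] + list(clip)
        context_sizes.append(len(context))
        memory.append(list(clip))  # cache this clip for later iterations
    return context_sizes


clips = [[0] * 4 for _ in range(4)]  # four clips of four frames each
print(process_online(clips))
```

Each step still ingests only one new clip, yet the visible context grows until the cache is full, which is the sense in which temporal support is extended at marginal cost.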
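We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, and 75.0% on SSv2, and also reports results on Kinetics-700 and an mAP-based benchmark. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.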
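The mask-then-predict recipe can be sketched in a few lines: choose a random subset of input tokens to hide, then score predictions only at the hidden positions. The names `random_mask` and `masked_l2_loss`, the 40% mask ratio, and the scalar placeholder targets (standing in for per-patch feature descriptors such as HOG) are assumptions for this example.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a masked token."""
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_l2_loss(predictions, targets, mask):
    """Mean squared error computed over masked positions only."""
    errors = [(p - t) ** 2 for p, t, m in zip(predictions, targets, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
mask = random_mask(num_tokens=10, mask_ratio=0.4, rng=rng)
targets = [float(i) for i in range(10)]  # placeholder target features
# Predictions that are right where masked and arbitrary elsewhere.
predictions = [t if m else 0.0 for t, m in zip(targets, mask)]
print(sum(mask), masked_l2_loss(predictions, targets, mask))
```

Errors at visible positions do not contribute, so the model is only rewarded for inferring what it could not see.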
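We present MaCLR, a novel method to explicitly perform cross-modal self-supervised video representation learning from visual and motion modalities. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, MaCLR enriches the standard contrastive learning objective for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We show that the representation learned with our MaCLR method focuses more on foreground motion regions and thus generalizes better to downstream tasks. To demonstrate this, we evaluate MaCLR on five datasets for both action recognition and action detection, and demonstrate state-of-the-art self-supervised performance on all datasets. Furthermore, we show that the MaCLR representation can be as effective as representations learned with full supervision on UCF101 and HMDB51 action recognition, and can even outperform the supervised representation for action recognition on VidSitu and SSv2, and for action detection on AVA.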
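Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, interestingly, by taking a closer look at these CNNs learned from scratch, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled in learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs without increasing parameters or computation on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve +0.6% top-1 accuracy with SlowFast on Kinetics-400 over the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with 40 GPUs. The code and models are available at https://github.com/ucsc-vlaa/image-pretraining-for-video.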
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results. Our code and models are available at http://www.robots.ox.ac.uk/~vgg/software/two_stream_action
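Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state of the art, they have not explicitly modeled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders that represent different views of the input video, with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.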
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
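A minimal sketch of sparse temporal sampling: the video is split into equal-length segments and one frame index is drawn from each, so a handful of samples spans the whole video. The function name `sparse_sample`, the centre-frame fallback when no random generator is passed, and the example sizes are assumptions made for illustration.

```python
import random


def sparse_sample(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal segments."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive bound
        if rng is None:
            indices.append((start + end - 1) // 2)     # deterministic centre frame
        else:
            indices.append(rng.randrange(start, end))  # random frame in the segment
    return indices


print(sparse_sample(90, 3))
print(sparse_sample(90, 3, rng=random.Random(0)))
```

Passing a random generator gives a training-style draw, while omitting it gives a fixed centre-frame draw that is convenient for evaluation.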
Temporal action detection (TAD) is extensively studied in the video understanding community by generally following the object detection pipeline in images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-know baseline given the current status of complex design and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of design. As a result, this simple BasicTAD yields an astounding, real-time, RGB-only baseline very close to the state-of-the-art methods with two-stream inputs. In addition, we further improve BasicTAD by preserving more temporal and spatial information in the network representation (termed PlusTAD). Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction. Meanwhile, we also perform in-depth visualization and error analysis on our proposed method and try to provide more insights on the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.