Spatial convolutions are widely used in numerous deep video models. They fundamentally assume spatio-temporal invariance, i.e., shared weights for every location across different frames. This work presents Temporally-Adaptive Convolutions (TAdaConv) for video understanding, showing that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modelling complex temporal dynamics in videos. Specifically, TAdaConv empowers spatial convolutions with temporal modelling ability by calibrating the convolution weights of each frame according to its local and global temporal context. Compared with previous temporal modelling operations, TAdaConv is more efficient because it operates on the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolution. Moreover, the kernel calibration also brings increased model capacity. We construct TAda2D networks by replacing the spatial convolutions in ResNet with TAdaConv, which leads to on-par or better performance compared with state-of-the-art approaches on multiple video action recognition and localization benchmarks. We also show that, as a readily plug-in operation with negligible computation overhead, TAdaConv can effectively improve many existing video models by a convincing margin. Code and models are available at https://github.com/alibaba-mmai-research/pytorch-video-understanding.
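As a rough illustration of the calibration idea above, the sketch below keeps one shared spatial kernel and rescales it per frame with a factor predicted from a temporal convolution over frame descriptors. The module name, the calibration branch and all hyper-parameters are illustrative assumptions, not the released TAdaConv implementation.

```python
# Minimal sketch: shared 2D kernel, per-frame multiplicative calibration from temporal context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAdaConv2dSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.pad = k // 2
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Temporal 1D conv over frame descriptors -> per-frame, per-output-channel factor.
        self.calibrate = nn.Sequential(
            nn.Conv1d(in_ch, in_ch // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(in_ch // 4, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        desc = x.mean(dim=(3, 4))              # (B, C, T) global spatial pooling
        alpha = 1.0 + self.calibrate(desc)     # (B, out_ch, T), residual calibration
        outs = []
        for t in range(T):                     # per-frame calibrated convolution
            w = self.weight[None] * alpha[:, :, t, None, None, None]   # (B, O, I, k, k)
            w = w.reshape(-1, C, *self.weight.shape[2:])
            frame = x[:, :, t].reshape(1, B * C, H, W)
            y = F.conv2d(frame, w, padding=self.pad, groups=B)
            outs.append(y.reshape(B, -1, H, W) + self.bias[None, :, None, None])
        return torch.stack(outs, dim=2)        # (B, out_ch, T, H, W)

# Example: y = TAdaConv2dSketch(16, 32)(torch.randn(2, 16, 8, 56, 56))
```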
Efficiently modelling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ convolution operators and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and thus struggle with events of various scales. On the other hand, the dense interaction modelling paradigm achieves only sub-optimal performance because action-irrelevant parts bring additional noise to the final prediction. In this paper, we propose a unified action recognition framework that addresses the dynamic nature of video content with the following designs. First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to model interactions only among a few selected foreground objects with a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN), because both key designs adapt to the input video content. To exploit the short-term motion within local segments, we further propose a novel and efficient Latent Motion Code (LMC) module, which further improves the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something, Kinetics and Diving48, verify that our model achieves state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/ean-pytorch.
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of subconvolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
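A rough sketch of the motion-excitation idea in this abstract is shown below: feature-level differences between neighbouring frames are pooled and turned into a channel-wise attention that excites motion-sensitive channels. Layer names and the exact excitation form are placeholders, not the published ME module.

```python
# Sketch: temporal feature differences -> channel attention -> residual excitation.
import torch
import torch.nn as nn

class MotionExcitationSketch(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        r = channels // reduction
        self.squeeze = nn.Conv2d(channels, r, 1)
        self.transform = nn.Conv2d(r, r, 3, padding=1, groups=r)
        self.expand = nn.Conv2d(r, channels, 1)

    def forward(self, x):                         # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        f = self.squeeze(x.transpose(1, 2).reshape(B * T, C, H, W))
        f = f.reshape(B, T, -1, H, W)
        # Temporal difference: transformed frame t+1 minus frame t (zero-padded at the end).
        nxt = self.transform(f[:, 1:].reshape(B * (T - 1), -1, H, W))
        diff = nxt.reshape(B, T - 1, -1, H, W) - f[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(f[:, :1])], dim=1)       # (B, T, r, H, W)
        pooled = diff.mean(dim=(3, 4), keepdim=True).reshape(B * T, -1, 1, 1)
        attn = torch.sigmoid(self.expand(pooled)).reshape(B, T, C, 1, 1).transpose(1, 2)
        return x + x * (attn - 0.5)                                       # residual excitation

# Example: y = MotionExcitationSketch(64)(torch.randn(2, 64, 8, 28, 28))
```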
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast.
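To make the two-pathway design concrete, here is a toy sketch: the Slow pathway sees temporally strided frames with many channels, the Fast pathway sees all frames with few channels, and their pooled features are fused for classification. The backbone here is a trivial stand-in, not the paper's ResNet-based pathways with lateral connections.

```python
# Toy two-pathway model: low frame rate / many channels vs. high frame rate / few channels.
import torch
import torch.nn as nn

class SlowFastSketch(nn.Module):
    def __init__(self, num_classes=400, alpha=8, beta=8, width=64):
        super().__init__()
        self.alpha = alpha                              # temporal stride of the Slow pathway
        self.slow = nn.Conv3d(3, width, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, width // beta, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        self.head = nn.Linear(width + width // beta, num_classes)

    def forward(self, x):                               # x: (B, 3, T, H, W) at the Fast frame rate
        slow_feat = self.slow(x[:, :, ::self.alpha])    # subsample frames for the Slow pathway
        fast_feat = self.fast(x)                        # full frame rate, few channels
        pooled = torch.cat([slow_feat.mean(dim=(2, 3, 4)),
                            fast_feat.mean(dim=(2, 3, 4))], dim=1)
        return self.head(pooled)

# Example: logits = SlowFastSketch()(torch.randn(2, 3, 32, 112, 112))
```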
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast.
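The stage transition described above (pool the spatio-temporal token grid, expand the channel dimension) can be illustrated with the toy module below. It only shows the channel-resolution pyramid mechanism; the pooling-attention internals of MViT are not reproduced, and all names are placeholders.

```python
# Toy stage transition: expand channels, pool the (T, H, W) token grid.
import torch
import torch.nn as nn

class StageTransitionSketch(nn.Module):
    def __init__(self, dim_in, dim_out, pool_stride=(1, 2, 2)):
        super().__init__()
        self.expand = nn.Linear(dim_in, dim_out)                 # channel expansion
        self.pool = nn.MaxPool3d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x, grid):                                  # x: (B, T*H*W, C), grid = (T, H, W)
        B, N, C = x.shape
        T, H, W = grid
        x = self.expand(x).transpose(1, 2).reshape(B, -1, T, H, W)
        x = self.pool(x)                                         # reduce spatio-temporal resolution
        new_grid = tuple(x.shape[2:])
        return x.flatten(2).transpose(1, 2), new_grid            # back to token layout

# Example: tokens, grid = StageTransitionSketch(96, 192)(torch.randn(2, 8 * 14 * 14, 96), (8, 14, 14))
```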
Most existing deep neural networks are static, which means they can only perform inference at a fixed complexity. However, the resource budget can vary substantially across devices. Even on a single device, the affordable budget can change with different scenarios, and repeatedly training networks for every required budget would be prohibitively expensive. Therefore, in this work we propose a general method called MutualNet to train a single network that can run under diverse resource constraints. Our method trains a cohort of model configurations with various network widths and input resolutions. This mutual learning scheme not only allows the model to run at different width-resolution configurations, but also transfers the unique knowledge among these configurations, helping the model learn stronger representations. MutualNet is a general training methodology that can be applied to various network structures (e.g., 2D networks: MobileNets, ResNet; 3D networks: SlowFast, X3D) and various tasks (e.g., image classification, object detection, segmentation and action recognition), and it is demonstrated to achieve consistent improvements on a variety of datasets. Since we train only one model, it also greatly reduces the training cost compared with independently training several models. Surprisingly, MutualNet can also be used to significantly boost the performance of a single network if dynamic resource constraints are not a concern. In summary, MutualNet is a unified method for both static and adaptive, 2D and 3D networks. Code and pre-trained models are available at https://github.com/taoyang1122/MutualNet.
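A simplified sketch of one training iteration under this mutual-learning scheme is given below, for the 2D image-classification case: a full-width, full-resolution forward pass supervises a few randomly sampled width/resolution sub-configurations through a distillation loss. The `slimmable_net(inputs, width_mult=...)` interface, the sampled configurations and the loss weighting are assumptions for illustration, not the released MutualNet code.

```python
# Sketch of a mutual-learning step over width/resolution configurations.
import random
import torch
import torch.nn.functional as F

def mutualnet_step(slimmable_net, optimizer, images, labels,
                   widths=(0.25, 0.5, 0.75, 1.0), resolutions=(112, 160, 224)):
    optimizer.zero_grad()
    # Full configuration provides the ground-truth loss and the soft teacher targets.
    full_logits = slimmable_net(images, width_mult=max(widths))
    loss = F.cross_entropy(full_logits, labels)
    soft_targets = full_logits.detach().softmax(dim=-1)
    # A few random sub-configurations are trained to match the full network.
    for width in random.sample(widths[:-1], k=2):
        res = random.choice(resolutions)
        small = F.interpolate(images, size=(res, res), mode='bilinear', align_corners=False)
        sub_logits = slimmable_net(small, width_mult=width)
        loss = loss + F.kl_div(sub_logits.log_softmax(dim=-1), soft_targets, reduction='batchmean')
    loss.backward()
    optimizer.step()
    return loss.item()
```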
Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated views. However, most existing methods lack a mechanism to prevent representation learning from being biased towards static information in the video. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes specific frequency components from the video so that learned representation captures essential features more from the remaining information for various downstream tasks. Specifically, FreqAug pushes the model to focus more on dynamic features rather than static features in the video via dropping spatial or temporal low-frequency components. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations. Transferring the improved representation to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines.
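A minimal sketch of such a frequency-domain augmentation follows: with some probability, low spatial (or temporal) frequencies of a clip are zeroed out in the FFT domain so the remaining signal emphasizes dynamic content. The cutoff and probability values are illustrative, not the paper's settings.

```python
# Sketch: drop a centered low-frequency band in the spatial or temporal FFT of a clip.
import torch

def freq_aug(clip, drop_prob=0.5, cutoff=0.1, temporal=False):
    """clip: (C, T, H, W) float tensor; returns an augmented clip of the same shape."""
    if torch.rand(()) > drop_prob:
        return clip
    dim = 1 if temporal else (2, 3)                      # temporal axis or spatial axes
    spec = torch.fft.fftshift(torch.fft.fftn(clip, dim=dim), dim=dim)
    if temporal:
        T = clip.shape[1]
        freqs = torch.abs(torch.arange(T) - T // 2) / max(T // 2, 1)
        mask = (freqs > cutoff).float().view(1, T, 1, 1)
    else:
        H, W = clip.shape[2], clip.shape[3]
        fy = (torch.abs(torch.arange(H) - H // 2) / max(H // 2, 1)).view(H, 1)
        fx = (torch.abs(torch.arange(W) - W // 2) / max(W // 2, 1)).view(1, W)
        mask = ((fy ** 2 + fx ** 2).sqrt() > cutoff).float().view(1, 1, H, W)
    spec = torch.fft.ifftshift(spec * mask, dim=dim)
    return torch.fft.ifftn(spec, dim=dim).real

# Example: aug = freq_aug(torch.randn(3, 16, 112, 112))
```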
Convolution has arguably been the most important feature transform in modern neural networks, leading the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to an era of dynamic feature transforms. The existing dynamic transforms, including self-attention, are however all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich spatio-temporal relational structures in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving state-of-the-art results on standard motion-centric benchmarks for video action recognition such as Something-Something V1 & V2, Diving48 and FineGym.
Temporal action detection (TAD) is extensively studied in the video understanding community by generally following the object detection pipeline in images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-know baseline given the current status of complex design and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of design. As a result, this simple BasicTAD yields an astounding and real-time RGB-only baseline very close to the state-of-the-art methods with two-stream inputs. In addition, we further improve the BasicTAD by preserving more temporal and spatial information in network representation (termed PlusTAD). Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction. Meanwhile, we also perform in-depth visualization and error analysis on our proposed method and try to provide more insights on the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
Deploying convolutional neural networks (CNNs) on mobile devices is difficult because of the limited memory and computation resources. We aim to design efficient neural networks for heterogeneous devices, including CPUs and GPUs, by exploiting the redundancy in feature maps, which has rarely been investigated in neural architecture design. For CPU-like devices, we propose a novel CPU-efficient Ghost (C-Ghost) module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations with cheap cost to generate many ghost feature maps that can fully reveal the information underlying the intrinsic features. The proposed C-Ghost module can serve as a plug-and-play component to upgrade existing convolutional neural networks. C-Ghost bottlenecks are designed to stack C-Ghost modules, and the lightweight C-GhostNet can then be easily established. We further consider efficient networks for GPU devices. Without involving too many GPU-inefficient operations (e.g., depth-wise convolution) in a building stage, we propose to exploit stage-wise feature redundancy to formulate a GPU-efficient Ghost (G-Ghost) stage structure. The features in a stage are split into two parts: the first part is processed by the original block with fewer output channels to generate intrinsic features, and the other part is generated with cheap operations by exploiting the stage-wise redundancy. Experiments on benchmarks demonstrate the effectiveness of the proposed C-Ghost module and G-Ghost stage. C-GhostNet and G-GhostNet achieve the optimal trade-off between accuracy and latency for CPUs and GPUs, respectively. Code is available at https://github.com/huawei-noah/cv-backbones.
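A small sketch of the C-Ghost idea summarized above: a regular convolution produces a few "intrinsic" feature maps, and cheap depth-wise operations generate additional "ghost" maps that are concatenated with them. Hyper-parameters are illustrative, not the released GhostNet configuration.

```python
# Sketch: intrinsic maps from a 1x1 conv, ghost maps from a cheap depth-wise conv.
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        intrinsic = out_ch // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, kernel_size=1, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        # Cheap operation: depth-wise conv over the intrinsic maps yields the ghost maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, out_ch - intrinsic, dw_kernel,
                      padding=dw_kernel // 2, groups=intrinsic, bias=False),
            nn.BatchNorm2d(out_ch - intrinsic), nn.ReLU(inplace=True))

    def forward(self, x):
        intrinsic = self.primary(x)
        ghosts = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghosts], dim=1)

# Example: y = GhostModuleSketch(16, 32)(torch.randn(1, 16, 56, 56))
```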
In video data, the busy motion details of moving regions are conveyed within a specific frequency bandwidth in the frequency domain. Meanwhile, the remaining frequencies of the video encode quiet information with substantial redundancy, which leads to low processing efficiency in existing video models that take raw RGB frames as input. In this paper, we consider allocating more computation to the processing of the important busy information and less computation to the quiet information. We design a trainable Motion Band-Pass Module (MBPM) to separate busy information from quiet information in raw video data. By embedding the MBPM into a two-pathway CNN architecture, we define a Busy-Quiet Net (BQN). The efficiency of BQN comes from avoiding redundancy in the feature space processed by the two pathways: one operates on low-resolution quiet features, while the other processes busy features. The proposed BQN outperforms many recent video processing models on the Something-Something V1, Kinetics400, UCF101 and HMDB51 datasets.
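One possible reading of this band-pass separation is sketched below: a depth-wise temporal convolution, initialized as a Laplacian-like band-pass kernel but trainable, extracts "busy" motion content, while the residual "quiet" content is spatially downsampled for cheaper processing. This is an assumption-laden illustration, not the paper's MBPM.

```python
# Sketch: trainable temporal band-pass splits a clip into busy and quiet pathways.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionBandPassSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bandpass = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), groups=channels, bias=False)
        with torch.no_grad():                      # Laplacian-like temporal init [-1, 2, -1]
            self.bandpass.weight.zero_()
            self.bandpass.weight[:, :, 0] = -1.0
            self.bandpass.weight[:, :, 1] = 2.0
            self.bandpass.weight[:, :, 2] = -1.0

    def forward(self, x):                          # x: (B, C, T, H, W)
        busy = self.bandpass(x)                    # high temporal-frequency motion details
        quiet = x - busy                           # remaining low-frequency content
        quiet = F.avg_pool3d(quiet, kernel_size=(1, 2, 2))   # cheaper, low-resolution pathway
        return busy, quiet

# Example: busy, quiet = MotionBandPassSketch(3)(torch.randn(2, 3, 16, 112, 112))
```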
For video recognition tasks, a global representation summarizing the whole content of a video snippet plays an important role in the final performance. However, existing video architectures usually generate it with a simple global average pooling (GAP) method, which has limited ability to capture the complex dynamics of videos. For image recognition tasks, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, such plain covariance pooling, as used in image recognition, is an orderless representation and cannot model the spatio-temporal structure inherent in videos. Therefore, this paper proposes a Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, our TCP first develops a temporal attention module to adaptively calibrate the spatio-temporal features for the subsequent covariance pooling, approximately producing attentive covariance representations. Then, temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both the intra-frame correlations and the inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit the geometry of the covariance representations. Note that our TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition. Extensive experiments on six benchmarks (e.g., Kinetics, Something-Something and Charades) with various video architectures show that our TCPNet is clearly superior to its counterparts while having strong generalization ability. The source code is publicly available.
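A bare-bones sketch of temporal-attentive covariance pooling as outlined above: per-frame channel covariance matrices are computed, reweighted by a learned temporal attention, and averaged over time. The paper's matrix power normalization is approximated here by a simple signed square root, and all module names are placeholders.

```python
# Sketch: temporal attention over per-frame channel covariance matrices.
import torch
import torch.nn as nn

class TemporalCovPoolSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.temporal_attn = nn.Sequential(nn.Linear(channels, channels // 4),
                                           nn.ReLU(inplace=True),
                                           nn.Linear(channels // 4, 1))

    def forward(self, x):                              # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        feats = x.permute(0, 2, 1, 3, 4).reshape(B, T, C, H * W)
        attn = torch.softmax(self.temporal_attn(feats.mean(dim=-1)), dim=1)  # (B, T, 1)
        centered = feats - feats.mean(dim=-1, keepdim=True)
        cov = centered @ centered.transpose(-1, -2) / (H * W - 1)            # (B, T, C, C)
        pooled = (attn.unsqueeze(-1) * cov).sum(dim=1)                       # temporal pooling
        pooled = torch.sign(pooled) * torch.sqrt(pooled.abs() + 1e-6)        # crude normalization
        return pooled.flatten(1)                                             # (B, C*C) representation

# Example: rep = TemporalCovPoolSketch(64)(torch.randn(2, 64, 8, 14, 14))
```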
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
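A short sketch of the (2+1)D factorization discussed above: a full 3D convolution is replaced by a 2D spatial convolution followed by a 1D temporal convolution, with a nonlinearity in between. The intermediate width is left as a free choice here, whereas the paper picks it so that the parameter count matches the 3D counterpart.

```python
# Sketch: 3D conv factorized into spatial (1,k,k) and temporal (k,1,1) convolutions.
import torch
import torch.nn as nn

class R2Plus1DConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, mid_ch=None, k=3):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):                     # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: y = R2Plus1DConvSketch(3, 64)(torch.randn(2, 3, 16, 112, 112))
```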
Computer vision tasks can benefit from estimating the salient object regions and the interactions between those regions. Identifying object regions usually involves pretrained models that perform object detection, object segmentation and/or object pose estimation. However, this is infeasible in practice for the following reasons: 1) the object categories in the pretrained model's training dataset may not cover all the object categories needed for general computer vision tasks; 2) the domain gap between the pretrained model's training dataset and the target task's dataset may affect performance; 3) the bias and variance present in the pretrained model may leak into the target task, leading to an inadvertently biased target model. To overcome these drawbacks, we propose to leverage the common rationale that a sequence of video frames captures a set of common objects and the interactions between them, so that a notion of co-segmentation among video frame features can equip the model with the ability to automatically focus on salient regions and improve the performance of the underlying task in an end-to-end manner. In this regard, we propose a generic module called the Co-Segmentation Activation Module (COSAM), which can be plugged into any CNN to promote co-segmentation-based attention among a sequence of video frame features. We demonstrate the application of COSAM on three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification, and show that COSAM is able to capture the salient regions in video frames, leading to notable performance improvements along with interpretable attention maps.
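A loose sketch of co-segmentation-style attention follows: a descriptor summarizing what is common across the frames of a clip is correlated with each spatial location to form per-frame saliency maps that gate the features. This is one plausible reading of the idea, not the published COSAM module.

```python
# Sketch: cross-frame common descriptor -> per-frame spatial saliency -> gated features.
import torch
import torch.nn as nn

class CoSegAttentionSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, T, C, H, W) frame features
        B, T, C, H, W = x.shape
        common = x.mean(dim=(1, 3, 4))                 # (B, C): content shared across frames
        feats = self.project(x.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)
        # Correlate every spatial location with the common descriptor.
        corr = (feats * common[:, None, :, None, None]).sum(dim=2, keepdim=True) / C ** 0.5
        attn = torch.sigmoid(corr)                     # (B, T, 1, H, W) saliency maps
        return x * attn

# Example: y = CoSegAttentionSketch(256)(torch.randn(2, 8, 256, 14, 14))
```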
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8× and 5.5× fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https://github.com/facebookresearch/SlowFast.
Since their introduction in 2020, Vision Transformers (ViT) have been steadily breaking records on many vision tasks and are often described as an "all-you-need" replacement for ConvNets; however, they are not friendly to embedded devices. Moreover, recent studies have shown that standard ConvNets, if redesigned and trained appropriately, can compete with ViT in terms of accuracy and scalability. In this paper, we adopt a modernized ConvNet structure to design a new backbone for action recognition. In particular, our main target is to serve industrial product deployment, such as FPGA boards that only support standard operations. Therefore, our network simply consists of 2D convolutions, without using any 3D convolutions, long-range attention plug-ins or Transformer blocks. While being trained with far fewer epochs (5x-10x), our backbone surpasses methods based on (2+1)D and 3D convolutions and achieves comparable results to ViT on two benchmark datasets.
Learning a single static convolutional kernel in each convolutional layer is the common training paradigm of modern convolutional neural networks (CNNs). Instead, recent research on dynamic convolution shows that learning a linear combination of n convolutional kernels weighted by input-dependent attention can significantly improve the accuracy of lightweight CNNs while maintaining efficient inference. However, we observe that existing works endow convolutional kernels with the dynamic property along only one dimension of the kernel space (the number of convolutional kernels), while the other three dimensions (the spatial size, the input channel number and the output channel number of each kernel) are overlooked. Inspired by this, we present Omni-dimensional Dynamic Convolution (ODConv), a more general yet elegant dynamic convolution design, to advance this line of research. ODConv leverages a novel multi-dimensional attention mechanism with a parallel strategy to learn complementary attentions for convolutional kernels along all four dimensions of the kernel space at any convolutional layer. As a drop-in replacement for regular convolutions, ODConv can be plugged into many CNN architectures. Extensive experiments on the ImageNet and MS-COCO datasets show that ODConv brings solid accuracy gains for various prevailing CNN backbones, both lightweight and large, e.g., 3.77%-5.71% and 1.86%-3.72% absolute top-1 improvements for the MobileNetV2 and ResNet families on the ImageNet dataset, respectively. Intriguingly, thanks to its improved feature learning ability, ODConv with even a single kernel can compete with or outperform existing dynamic convolution counterparts with multiple kernels, substantially reducing the extra parameters. Furthermore, ODConv is also superior to other attention modules for modulating output features or convolutional weights.
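A condensed sketch of the omni-dimensional idea is given below: a squeezed input descriptor produces four attentions (over the kernel's spatial positions, input channels, output channels, and the n candidate kernels), which jointly modulate the aggregated kernel before a standard convolution. Sizes and the attention head are simplified placeholders, not the released ODConv code.

```python
# Sketch: four complementary attentions modulate all dimensions of the kernel space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConvSketch(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, n_kernels=4, reduction=4):
        super().__init__()
        self.k, self.n = k, n_kernels
        self.kernels = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        self.fc = nn.Linear(in_ch, hidden)
        self.attn_spatial = nn.Linear(hidden, k * k)
        self.attn_in = nn.Linear(hidden, in_ch)
        self.attn_out = nn.Linear(hidden, out_ch)
        self.attn_kernel = nn.Linear(hidden, n_kernels)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        z = F.relu(self.fc(x.mean(dim=(2, 3))))            # squeezed descriptor, (B, hidden)
        a_s = torch.sigmoid(self.attn_spatial(z)).view(B, 1, 1, 1, self.k, self.k)
        a_i = torch.sigmoid(self.attn_in(z)).view(B, 1, 1, C, 1, 1)
        a_o = torch.sigmoid(self.attn_out(z)).view(B, 1, -1, 1, 1, 1)
        a_k = torch.softmax(self.attn_kernel(z), dim=-1).view(B, self.n, 1, 1, 1, 1)
        # Modulate all four kernel dimensions, then sum over the n candidate kernels.
        w = (self.kernels[None] * a_s * a_i * a_o * a_k).sum(dim=1)   # (B, O, C, k, k)
        w = w.reshape(-1, C, self.k, self.k)
        y = F.conv2d(x.reshape(1, B * C, H, W), w, padding=self.k // 2, groups=B)
        return y.reshape(B, -1, H, W)

# Example: y = ODConvSketch(16, 32)(torch.randn(2, 16, 28, 28))
```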
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
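A compact sketch of the channel-wise spatio-temporal modelling idea (CSTM) outlined above: a channel-wise (depth-wise) temporal 1D convolution mixes information across frames per channel, followed by an ordinary spatial convolution. This illustrates the mechanism only; it is not the released STM block, and the CMM branch is omitted.

```python
# Sketch: channel-wise temporal 1D conv + per-frame spatial conv, with a residual connection.
import torch
import torch.nn as nn

class ChannelwiseSTSketch(nn.Module):
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size=t_kernel,
                                  padding=t_kernel // 2, groups=channels, bias=False)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):                               # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        # Channel-wise temporal convolution: fold the spatial grid into the batch axis.
        t_in = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, C, T)
        t_out = self.temporal(t_in).reshape(B, H, W, C, T).permute(0, 3, 4, 1, 2)
        # Spatial convolution applied frame by frame (fold time into the batch axis).
        s_in = t_out.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        s_out = self.spatial(s_in).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)
        return x + s_out                                # residual connection

# Example: y = ChannelwiseSTSketch(64)(torch.randn(2, 64, 8, 28, 28))
```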
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level "semantic" features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
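A brief sketch of the feature-gating design mentioned above, under the common reading of it as self-gating: a global spatio-temporal descriptor produces per-channel sigmoid gates that reweight the feature map. This is an illustrative re-implementation, not the paper's exact gating layer.

```python
# Sketch: cheap per-channel self-gating from a global spatio-temporal descriptor.
import torch
import torch.nn as nn

class FeatureGatingSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                               # x: (B, C, T, H, W)
        context = x.mean(dim=(2, 3, 4))                 # global spatio-temporal pooling
        gate = torch.sigmoid(self.fc(context))          # per-channel gate in [0, 1]
        return x * gate[:, :, None, None, None]

# Example: y = FeatureGatingSketch(64)(torch.randn(2, 64, 8, 14, 14))
```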
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of framelevel patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
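A skeletal sketch of the "divided attention" scheme described above: given patch tokens laid out as (batch, time, space, dim), temporal self-attention attends across frames at the same spatial location, then spatial self-attention attends within each frame. Classification-token handling, positional embeddings and the MLP sub-block are omitted, and names are placeholders.

```python
# Sketch: one divided space-time attention block over (B, T, N, D) patch tokens.
import torch
import torch.nn as nn

class DividedSpaceTimeBlockSketch(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, N, D) patch tokens
        B, T, N, D = x.shape
        # Temporal attention: sequences of length T, one per spatial location.
        t = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.attn_t(t, t, t, need_weights=False)[0]
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: sequences of length N, one per frame.
        s = self.norm_s(x).reshape(B * T, N, D)
        s = self.attn_s(s, s, s, need_weights=False)[0]
        return x + s.reshape(B, T, N, D)

# Example: y = DividedSpaceTimeBlockSketch(192)(torch.randn(2, 8, 196, 192))
```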