Videos are complex due to large variations of motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computational resources. This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition systems on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that integrates three objective tasks, i.e., appearance reconstruction, video classification, and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. With these, by jointly learning the frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the required spatio-temporal information useful for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. When compared to baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves comparable performance with state-of-the-art methods at a lower computational cost.
Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.
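The saving promised by the factorization described above can be checked with simple parameter arithmetic. A minimal sketch, assuming equal input and output channel counts C and ignoring bias terms:

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of one 3D convolution with a kt x kh x kw kernel (bias ignored)."""
    return c_in * c_out * kt * kh * kw

def full_3x3x3(c):
    # a full 3x3x3 spatio-temporal convolution
    return conv3d_params(c, c, 3, 3, 3)

def factorized_p3d(c):
    # a 1x3x3 spatial convolution plus a 3x1x1 temporal convolution
    return conv3d_params(c, c, 1, 3, 3) + conv3d_params(c, c, 3, 1, 1)

# For 64 channels: 27*64^2 = 110592 weights for the full kernel versus
# (9+3)*64^2 = 49152 for the factorized pair, i.e. 2.25x fewer weights.
savings = full_3x3x3(64) / factorized_p3d(64)
```

This counts weights of a single block only; the overall cost of a P3D-style network also depends on how the blocks are placed in the ResNet.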
Motion, as the most distinct phenomenon in a video to involve the changes over time, is unique to the development of video representation learning. In this paper, we ask the question: how important is the motion particularly for self-supervised video representation learning. To this end, we compose a duet of exploiting motion for data augmentation and feature learning in the regime of contrastive learning. Specifically, we present a Motion-focused Contrastive Learning (MCL) method that regards such duet as the foundation. On one hand, MCL capitalizes on optical flow of each frame in a video to temporally and spatially sample the tubelets (i.e., sequences of associated frame patches across time) as data augmentations. On the other hand, MCL further aligns gradient maps of the convolutional layers to optical flow maps from spatial, temporal and spatio-temporal perspectives, in order to ground motion information in feature learning. Extensive experiments conducted on the R(2+1)D backbone demonstrate the effectiveness of our MCL. On UCF101, the linear classifier trained on the representations learnt by MCL achieves 81.91% top-1 accuracy, outperforming ImageNet supervised pre-training by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear protocol. Code is available at https://github.com/yihengzhang-cv/mcl-motion-focused-contrastive-learning.
In this paper, we propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which can extract the essential movement information from a long sequence of video frames and map it into a set of a few images, named Squeezed Images. By embedding the Temporal Squeeze pooling as a layer into off-the-shelf Convolutional Neural Networks (CNN), we design a new video classification model, named Temporal Squeeze Network (TeSNet). The resulting Squeezed Images contain the essential movement information from the video frames, optimized for the video classification task. We evaluate our architecture on two video classification benchmarks and compare with the state-of-the-art results.
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.
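The sparse temporal sampling strategy above is easy to sketch: split the video into K equal segments and draw one frame index at random from each. This is a simplified version; the actual TSN samples short snippets and also supports stacked optical flow as input.

```python
import random

def sparse_sample(num_frames, num_segments, seed=None):
    """Pick one frame index uniformly at random from each of
    num_segments equal-length segments of the video."""
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        indices.append(rng.randint(start, end))
    return indices

# e.g. a 300-frame video with 3 segments yields one index from each of
# the ranges [0, 99], [100, 199], [200, 299], covering the whole video
# while feeding the network only 3 frames.
```

Because the K samples jointly cover the full duration, video-level supervision can be applied to the consensus of the K snippet predictions.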
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
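One concrete way the two streams above are combined is late fusion of their class scores; the paper evaluates both averaging and an SVM trained on stacked softmax outputs. A minimal averaging sketch (the weighting parameter is an assumption for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average of the per-class softmax scores of the
    spatial (appearance) stream and the temporal (optical-flow) stream."""
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    return [w_spatial * a + (1.0 - w_spatial) * b for a, b in zip(p_s, p_t)]
```

The fused scores remain a valid distribution, and a confident temporal stream can overturn the spatial stream's prediction, which is the point of modeling motion separately.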
Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spatial and temporal convolutions separately, or computing self-attention along the temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. SIFA then measures the similarity between query and keys as stand-alone attention to compute the weighted average of the values for temporal aggregation. We further plug the SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. The source code is available at https://github.com/fuchenustc/sifa.
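A heavily simplified, non-deformable sketch of the inter-frame attention idea: each position in the current frame queries a fixed local window at the same position in the next frame. The offset re-scaling and learned query/key/value projections of the real SIFA block are omitted here.

```python
import numpy as np

def inter_frame_attention(feat_t, feat_t1, window=3):
    """Each spatial position in frame t is a query; the keys/values are the
    window x window neighbourhood at the same position in frame t+1
    (zero-padded at the borders). feat_t, feat_t1: (H, W, C) arrays."""
    H, W, C = feat_t.shape
    r = window // 2
    pad = np.pad(feat_t1, ((r, r), (r, r), (0, 0)))
    out = np.empty((H, W, C))
    for i in range(H):
        for j in range(W):
            q = feat_t[i, j]                               # query vector (C,)
            keys = pad[i:i + window, j:j + window].reshape(-1, C)
            logits = keys @ q / np.sqrt(C)                 # scaled dot-product
            w = np.exp(logits - logits.max())
            w /= w.sum()                                   # softmax over the window
            out[i, j] = w @ keys                           # attention-weighted values
    return out
```

SIFA's contribution is precisely that this window is not fixed: the sampling locations are deformed according to the inter-frame difference, so that attention follows the moving content.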
It is not trivial to optimally learn a 3D Convolutional Neural Network (3D ConvNet) due to the high complexity and various options of the training scheme. The most common hand-tuning process starts by learning 3D ConvNets on short video clips and then learning long-term temporal dependency on lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such a process comes along with several heuristic settings motivates the study to seek an optimal "path" to automate the entire training. In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path. Furthermore, we devise a new 3D ConvNet with a unique design of dual-head classifier to improve spatial and temporal discrimination. Extensive experiments on seven public video recognition benchmarks demonstrate the advantage of our proposal. With the optimization planning, our 3D ConvNets achieve superior results when compared to state-of-the-art recognition methods. More remarkably, we obtain top-1 accuracies of 80.5% and 82.7% on the Kinetics-400 and Kinetics-600 datasets, respectively. Source code is available at https://github.com/zhaofanqiu/optimization-planning-for-3d-convnets.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
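A detail worth noting about this factorization: R(2+1)D chooses the number of intermediate channels M_i so that the (2D spatial + 1D temporal) pair roughly matches the parameter budget of the full t × d × d 3D convolution it replaces. A sketch following the formula given in the paper, with biases ignored:

```python
import math

def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count M_i chosen so the (2+1)D pair has
    roughly the same number of weights as one full t x d x d 3D conv."""
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

def params_3d(n_in, n_out, t=3, d=3):
    # weights of the full 3D convolution
    return n_in * n_out * t * d * d

def params_2plus1d(n_in, n_out, t=3, d=3):
    # 1 x d x d spatial conv into M channels, then t x 1 x 1 temporal conv
    m = mid_channels(n_in, n_out, t, d)
    return n_in * m * d * d + m * n_out * t
```

With 64 input and output channels this gives M = 144 and an exactly matched budget, so the accuracy gains come from the extra nonlinearity and easier optimization rather than from extra capacity.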
Compressed video action recognition has recently drawn attention, since it substantially reduces storage and computational cost by replacing raw videos with sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, this task severely suffers from coarse and noisy dynamics and the insufficient fusion of the heterogeneous RGB and motion modalities. To tackle the two issues above, this paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows a two-stream architecture, i.e., one for the RGB modality and the other for the motion modality. Particularly, the motion stream adopts a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by introducing the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules, where SMC complements the RGB modality with spatio-temporally attentive local motion features and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
The content of a video is multifaceted, consisting of objects, scenes, interactions and actions. Existing datasets are mostly labeled with only one aspect for model training, resulting in video representations that are biased toward only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, and whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that can reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space and jointly optimizes it from two perspectives. One is to capitalize on the intra-facet supervision between each video and its own label descriptions, and the second is to predict the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning a 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The 3D CNN pre-learnt with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
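A toy sketch of the motion-excitation idea described above: temporal feature differences produce a per-channel gate that re-weights the input features. The real ME module uses channel-reducing convolutions and learned transformations, all omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_excitation(feats):
    """feats: (T, H, W, C) spatiotemporal features. Channel-wise motion
    magnitude from temporal feature differences gates the original features,
    amplifying motion-sensitive channels relative to static ones."""
    diff = feats[1:] - feats[:-1]                  # (T-1, H, W, C) temporal differences
    # average |difference| over time and space -> one motion score per channel
    motion = np.abs(diff).mean(axis=(0, 1, 2))     # (C,)
    attn = sigmoid(motion - motion.mean())         # per-channel gate in (0, 1)
    return feats * attn                            # excite motion-sensitive channels
```

On a perfectly static clip all differences vanish, so every channel receives the same neutral gate of 0.5, i.e. no channel is singled out as motion-sensitive.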
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
In video data, the busy motion details from moving regions are conveyed within a specific frequency bandwidth in the frequency domain. Meanwhile, the rest of the frequencies of video data encode quiet information with substantial redundancy, which causes low processing efficiency in existing video models that take raw RGB frames as input. In this paper, we consider allocating intenser computation for the processing of the important busy information and less computation for that of the quiet information. We design a trainable Motion Band-Pass Module (MBPM) for separating busy information from quiet information in raw video data. By embedding the MBPM into a two-pathway CNN architecture, we define a Busy-Quiet Net (BQN). The efficiency of BQN is determined by avoiding redundancy in the feature space processed by the two pathways: one operates on quiet features at low resolution, while the other processes busy features. The proposed BQN outperforms many recent video processing models on the Something-Something V1, Kinetics-400, UCF101 and HMDB51 datasets.
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
Action prediction aims to infer the forthcoming human action from a partially observed video, which is challenging due to the limited information from early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose an Adversarial Memory Network (AMemNet) to generate the "full video" feature conditioned on a partial-video query from two new aspects. First, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos into value memories with a gating mechanism and query attention. Second, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full-video features upon adversarial training. The final prediction result of AMemNet is given by late fusion over the RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.
Spatio-temporal convolution often fails to learn motion dynamics in videos, and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in videos, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling and its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.
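The self-similarity representation itself is straightforward to sketch: each position's feature is compared, by cosine similarity, with every neighbour in a small spatio-temporal window. The learned feature-extraction stages that SELFY applies on top of this tensor are omitted.

```python
import numpy as np

def local_self_similarity(feats, radius=1):
    """STSS-style representation: for every (t, i, j), the cosine similarity
    between its feature and each spatio-temporal neighbour within `radius`
    in time and space (zero-padded at the borders).
    feats: (T, H, W, C); returns (T, H, W, (2*radius+1)**3)."""
    T, H, W, C = feats.shape
    r = radius
    # L2-normalize features so dot products become cosine similarities
    norm = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    pad = np.pad(norm, ((r, r), (r, r), (r, r), (0, 0)))
    k = 2 * r + 1
    out = np.empty((T, H, W, k ** 3))
    for t in range(T):
        for i in range(H):
            for j in range(W):
                nb = pad[t:t + k, i:i + k, j:j + k].reshape(-1, C)
                out[t, i, j] = nb @ norm[t, i, j]
    return out
```

Note that the appearance itself is discarded: only relational values survive, which is what makes the representation robust to appearance changes while staying sensitive to motion.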
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. What makes it challenging are the complex poses, understanding different viewpoints, and the environmental scenarios where the action is taking place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention. The proposed SWTA is comprised of two parts. First, temporal segment network that sparsely samples a given set of frames. Second, weighted temporal attention, which incorporates a fusion of attention maps derived from optical flow, with raw RGB images. This is followed by a basenet network, which comprises a convolutional neural network (CNN) module along with fully connected layers that provide us with activity recognition. The SWTA network can be used as a plug-in module to the existing deep CNN architectures, for optimizing them to learn temporal information by eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model has received an accuracy of 72.76%, 92.56%, and 78.86% on the respective datasets thereby surpassing the previous state-of-the-art performances by a margin of 25.26%, 18.56%, and 2.94%, respectively.
This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: an Optical Flow Encoder (OFE) and a Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information of the generated video. The DVG contains motion and content streams that learn from the motion vector and the single landscape image. In addition, it contains an encoder to learn shared content features and a decoder to construct video frames with the corresponding motion. Specifically, the motion stream introduces multiple Adaptive Instance Normalization (AdaIN) layers to integrate multi-level motion information for controlling the object motion. In the testing stage, videos with the same content but various motion information can be generated based on only one input image. Furthermore, we propose a high-resolution scenic time-lapse video dataset, named Quick-Sky-Time, to evaluate different approaches, which can be viewed as a new benchmark for high-quality scenic image and video generation tasks. We further conduct experiments on the Sky Time-lapse, Beach, and Quick-Sky-Time datasets. The results demonstrate the superiority of our approach over state-of-the-art methods in generating high-quality and various dynamic videos.
Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).