We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning. In this task, the machine is asked to identify the unrelated or odd element from a set of otherwise related elements. We apply this technique to self-supervised video representation learning, where we sample subsequences from videos and ask the network to learn to predict the odd video subsequence. The odd video subsequence is sampled such that it has the wrong temporal order of frames, while the even ones have the correct temporal order. Therefore, no manual annotation is required to generate an odd-one-out question. Our learning machine is implemented as a multi-stream convolutional neural network, which is learned end-to-end. Using odd-one-out networks, we learn temporal representations for videos that generalize to other related tasks such as action recognition. On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods. Similarly, on the HMDB51 dataset we outperform self-supervised state-of-the-art methods by 12.7% on the action classification task.
translated by 谷歌翻译
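As a rough illustration of how an odd-one-out question can be generated without labels, the sketch below samples several subsequences of frame indices and scrambles exactly one of them. The function name, the use of consecutive frames, and the default sizes are assumptions made for this example and are not taken from the paper.

```python
import random


def make_odd_one_out_question(num_frames, num_clips=6, clip_len=6, rng=None):
    """Build one odd-one-out question over frame indices (hypothetical helper).

    Returns (clips, odd_index): every clip is a list of frame indices in
    correct temporal order, except clips[odd_index], whose order is wrong.
    """
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2 so an order can be wrong")
    if num_clips < 2:
        raise ValueError("need at least two clips to have an odd one")
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    rng = rng or random.Random()

    # Assumption: each subsequence is a run of consecutive frames.
    clips = []
    for _ in range(num_clips):
        start = rng.randrange(num_frames - clip_len + 1)
        clips.append(list(range(start, start + clip_len)))

    # Pick the odd clip and shuffle it until its order is really wrong.
    odd_index = rng.randrange(num_clips)
    odd_clip = clips[odd_index][:]
    while odd_clip == sorted(odd_clip):
        rng.shuffle(odd_clip)
    clips[odd_index] = odd_clip
    return clips, odd_index
```

The returned `odd_index` serves as the classification target, so the label comes for free from the sampling procedure.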
We present an unsupervised representation learning approach using videos without semantic labels. We leverage the temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to predict the correct order. As sorting shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representation. We validate the effectiveness of the learned representation using our method as pre-training on high-level recognition problems. The experimental results show that our method compares favorably against state-of-the-art methods on action recognition, image classification and object detection tasks.
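A minimal sketch of how a sorting sample could be built: the frames are shuffled by a randomly chosen permutation, the permutation index becomes the class label, and all frame pairs are enumerated for pairwise feature extraction. Treating every permutation as its own class and the helper names are assumptions for this example.

```python
import itertools
import random


def make_sorting_sample(frame_ids, rng=None):
    """Shuffle frames and return (shuffled, label, pairs) (hypothetical helper).

    label indexes the permutation that was applied; pairs lists every pair of
    positions in the shuffled tuple, as used for pairwise feature extraction.
    """
    rng = rng or random.Random()
    perms = list(itertools.permutations(range(len(frame_ids))))
    label = rng.randrange(len(perms))
    shuffled = [frame_ids[i] for i in perms[label]]
    pairs = list(itertools.combinations(range(len(shuffled)), 2))
    return shuffled, label, pairs


def restore_order(shuffled, label):
    """Undo the permutation named by label, recovering chronological order."""
    perm = list(itertools.permutations(range(len(shuffled))))[label]
    restored = [None] * len(shuffled)
    for position, source in enumerate(perm):
        restored[source] = shuffled[position]
    return restored
```

With four frames this gives 24 permutation classes and 6 frame pairs; a network that predicts `label` correctly has in effect sorted the sequence.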
We propose a self-supervised spatiotemporal learning technique which leverages the chronological order of videos. Our method can learn the spatiotemporal representation of the video by predicting the order of shuffled clips from the video. The category of the video is not required, which gives our technique the potential to take advantage of infinite unannotated videos. There exist related works which use frames, while compared to frames, clips are more consistent with the video dynamics. Clips can help to reduce the uncertainty of orders and are more appropriate to learn a video representation. The 3D convolutional neural networks are utilized to extract features for clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest neighbor retrieval experiments. We also use the learned networks as the pre-trained models and finetune them on the action recognition task. Three types of 3D convolutional neural networks are tested in experiments, and we gain large improvements compared to existing self-supervised methods.
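The nearest neighbor retrieval evaluation mentioned above can be sketched as a top-k accuracy: a query counts as a hit if any of its k most similar gallery items shares its class. The use of cosine similarity and all function names are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topk_retrieval_accuracy(query_feats, query_labels,
                            gallery_feats, gallery_labels, k=1):
    """Fraction of queries with a same-class item among the k nearest."""
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        ranked = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine_similarity(feat, gallery_feats[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_feats)
```

Because no classifier is trained on top of the features, this measure reflects the quality of the learned representation itself.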
Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports-1M dataset (73.1% vs. 60.9%) and on the UCF-101 dataset, both with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).
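One simple instance of temporal feature pooling is an element-wise maximum over per-frame CNN features, which yields a fixed-size video descriptor regardless of video length. The sketch below assumes features are plain lists of floats and that max pooling is the chosen operator; both are illustrative assumptions.

```python
def temporal_max_pool(frame_features):
    """Pool per-frame feature vectors into one video-level vector.

    frame_features is a list of equal-length vectors, one per frame; the
    result keeps, for each feature dimension, its maximum over time.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    return [max(values) for values in zip(*frame_features)]
```

For example, `temporal_max_pool([[1, 5], [3, 2], [0, 4]])` pools three 2-dimensional frame features into one 2-dimensional video feature.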
In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive, or better than approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
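The verification task can be sketched as sampling frame triples with a binary label. In this hypothetical helper, five frames a < b < c < d < e are drawn; a positive triple is (b, c, d), and a negative triple replaces the middle frame with one from outside the b..d window. The sampling scheme and names are assumptions for illustration.

```python
import random


def make_verification_tuple(num_frames, positive, rng=None):
    """Return (triple, label) for temporal order verification.

    label is 1 when the triple is in correct temporal order and 0 otherwise.
    """
    if num_frames < 5:
        raise ValueError("need at least five frames")
    rng = rng or random.Random()
    a, b, c, d, e = sorted(rng.sample(range(num_frames), 5))
    if positive:
        return (b, c, d), 1
    # Negative: the middle frame lies outside the (b, d) window.
    middle = rng.choice([a, e])
    return (b, middle, d), 0


def is_temporally_ordered(triple):
    """True if the frame indices are monotonic in either direction."""
    frames = list(triple)
    return frames == sorted(frames) or frames == sorted(frames, reverse=True)
```

A network trained on such triples only ever sees the binary label, so no semantic annotation is involved.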
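Human action recognition is an important application domain in computer vision. Its primary aim is to accurately describe human actions and their interactions from a previously unseen data sequence acquired by sensors. The ability to recognize, understand and predict complex human actions enables the construction of many important applications such as intelligent surveillance systems, human-computer interfaces, health care, security and military applications. In recent years, the computer vision community has paid particular attention to deep learning. This paper presents an overview of the current state of the art in action recognition based on video analysis with deep learning techniques. We present the most important deep learning models for recognizing human actions and analyze them to show the current progress of deep learning algorithms applied to human action recognition, highlighting their advantages and disadvantages. Based on a quantitative analysis of the recognition accuracies reported in the literature, our study identifies the state-of-the-art deep architectures in action recognition and then outlines current trends and open problems for future work in this field.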
Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminology of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes with a set of promising future directions for self-supervised visual feature learning.
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial to utilize a CNN for learning spatio-temporal video representations. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, developing a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but places each at a different position in the ResNet, following the philosophy that enhancing structural diversity while going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on the Sports-1M video classification dataset over 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of the video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performance over several state-of-the-art techniques.
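To make the saving from the factorization concrete, the sketch below counts weights for a full 3 × 3 × 3 convolution versus a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution. It assumes equal input and output channel counts and no bias terms; the function names are hypothetical.

```python
def conv3d_weight_count(c_in, c_out, k_t, k_h, k_w):
    """Number of weights in a 3D convolution layer (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def full_3d_weights(channels):
    """Weights of one 3 x 3 x 3 convolution with `channels` in and out."""
    return conv3d_weight_count(channels, channels, 3, 3, 3)


def factorized_weights(channels):
    """Weights of a 1 x 3 x 3 spatial plus a 3 x 1 x 1 temporal convolution."""
    spatial = conv3d_weight_count(channels, channels, 1, 3, 3)
    temporal = conv3d_weight_count(channels, channels, 3, 1, 1)
    return spatial + temporal
```

Under these assumptions the full kernel needs 27 weights per channel pair and the factorized pair needs 9 + 3 = 12, a reduction by a factor of 2.25 that is independent of the channel count.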
Figure 1: Seeing these ordered frames from videos, can you tell whether each video is playing forward or backward? (answer below). Depending on the video, solving the task may require (a) low-level understanding (e.g. physics), (b) high-level reasoning (e.g. semantics), or (c) familiarity with very subtle effects or with (d) camera conventions. In this work, we learn and exploit several types of knowledge to predict the arrow of time automatically with neural network models trained on large-scale video datasets.
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 (69.4%) and UCF101 (94.2%) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
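Sparse temporal sampling with video-level supervision can be sketched as follows: the video is split into equal segments, one frame index is drawn from each, and the per-snippet class scores are combined into a single video-level prediction. The default of three segments, averaging as the consensus function, and the helper names are assumptions for this example.

```python
import random


def sample_segment_indices(num_frames, num_segments=3, rng=None):
    """Draw one frame index from each of num_segments equal-length segments."""
    if num_frames < num_segments:
        raise ValueError("video has fewer frames than segments")
    rng = rng or random.Random()
    bounds = [num_frames * i // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    count = len(snippet_scores)
    return [sum(column) / count for column in zip(*snippet_scores)]
```

Because the loss is applied to the consensus rather than to each snippet, the whole video contributes to every gradient step while only a handful of frames are processed.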
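Motion, as the most distinct phenomenon in a video that involves change over time, has been unique and critical to the development of video representation learning. In this paper, we ask the question: how important is motion, particularly for self-supervised video representation learning? To this end, we compose a duet of exploiting motion for data augmentation and for feature learning in the regime of contrastive learning. Specifically, we present a Motion-focused Contrastive Learning (MCL) method that regards such a duet as its foundation. On one hand, MCL capitalizes on the optical flow of each frame in a video to temporally and spatially sample tubelets (i.e., sequences of associated frame patches across time) as data augmentations. On the other hand, MCL further aligns the gradient maps of the convolutional layers to optical flow maps from spatial, temporal and spatio-temporal perspectives, in order to ground motion information in feature learning. Extensive experiments conducted on the R(2+1)D backbone demonstrate the effectiveness of our MCL. On UCF101, a linear classifier trained on the representations learned by MCL achieves 81.91% top-1 accuracy, outperforming the pre-training baseline by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear protocol. Code is available at https://github.com/yihengzhang-cv/mcl-motion-focused-contrastive-learning.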
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
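One simple way to combine the two streams is late fusion: each network produces class scores, the scores are turned into probabilities, and the two distributions are mixed. The weighted averaging below is an illustrative assumption, as are the function names; it is not presented as the fusion scheme of the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Mix class probabilities from the spatial and temporal streams."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [
        (1.0 - temporal_weight) * ps + temporal_weight * pt
        for ps, pt in zip(p_spatial, p_temporal)
    ]
```

The predicted class is the index of the largest fused probability; setting `temporal_weight` to 0 or 1 recovers a single-stream prediction.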
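Video self-supervised learning is a challenging task that requires significant expressive power from the model to leverage rich spatial-temporal knowledge and to generate effective supervisory signals from large amounts of unlabeled video. However, existing methods fail to increase the temporal diversity of unlabeled videos and neglect to model multi-scale temporal dependencies elaborately and explicitly. To overcome these limitations, we exploit the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on frequency-domain analysis with the discrete cosine transform. To explicitly model the multi-scale temporal dependencies of unlabeled videos, our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module, which leverages the relational knowledge among video snippets to learn a global context representation and adaptively recalibrate the channel-wise features. Experimental results demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.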
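Unintentional actions are rare occurrences that are difficult to define precisely and that are highly dependent on the temporal context of the action. In this work, we explore such actions and seek to identify the points in videos where actions transition from intentional to unintentional. We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions. To enhance the representations via self-supervised training, we propose temporal transformations called Temporal Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The multi-stage approach models temporal information at the level of both individual frames and full clips. These enhanced representations show strong performance on unintentional action recognition tasks. We provide an extensive ablation study of our framework and report results that significantly improve over the state of the art.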
There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
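A contrastive loss for synchronization can be sketched in its standard margin form: in-sync audio-video pairs are pulled together in embedding space, and out-of-sync pairs are pushed apart until they are at least a margin away. The margin value, the Euclidean distance, and the function name are assumptions for this example.

```python
import math


def contrastive_sync_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for one audio-video pair.

    In-sync pairs are penalized by their squared distance; out-of-sync pairs
    are penalized only while they are closer than `margin`.
    """
    distance = math.sqrt(sum((v - a) ** 2 for v, a in zip(video_emb, audio_emb)))
    if in_sync:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2
```

Under this form, the choice of negatives matters: a negative that is already farther than the margin contributes zero loss and therefore no gradient.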
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis of how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
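Kernel inflation can be sketched on plain nested lists: a 2D filter is repeated along a new time axis and rescaled by the number of repeats, so that a video made of one repeated frame produces the same response as the original 2D filter on that frame. The helper names and the list-based representation are assumptions for this example.

```python
def inflate_kernel(kernel_2d, time_steps):
    """Repeat a 2D kernel time_steps times along time, scaled by 1/time_steps."""
    return [
        [[weight / time_steps for weight in row] for row in kernel_2d]
        for _ in range(time_steps)
    ]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with an image patch of the same size."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip (list of patches over time)."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))
```

For instance, inflating a 2 × 2 kernel to four time steps and applying it to four copies of the same patch gives the same value as `response_2d` on a single copy, which is what allows image-pretrained weights to be reused as an initialization.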
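In this paper, we propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which can extract the essential movement information from a long sequence of video frames and map it into a small set of images, named Squeezed Images. By embedding Temporal Squeeze pooling as a layer into off-the-shelf Convolutional Neural Networks (CNN), we design a new video classification model, named Temporal Squeeze Network (TeSNet). The resulting Squeezed Images contain the essential movement information from the video frames, corresponding to the optimization of the video classification task. We evaluate our architecture on two video classification benchmarks and compare the results with the state of the art.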
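Videos are complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step alternative that first condenses the video sequence into an informative "frame" and then exploits an off-the-shelf image recognition system on the synthetic frame. A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame. This paper presents a novel Informative Frame Synthesis (IFS) architecture that incorporates three objective tasks, i.e., appearance reconstruction, video categorization and motion estimation, and two regularizers, i.e., adversarial learning and color consistency. Each task equips the synthetic frame with one ability, while each regularizer enhances its visual quality. With these, by jointly learning the frame synthesis in an end-to-end manner, the generated frame is expected to encapsulate the spatio-temporal information required for video analysis. Extensive experiments are conducted on the large-scale Kinetics dataset. Compared with baseline methods that map a video sequence to a single image, IFS shows superior performance. More remarkably, IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks, and achieves performance comparable to state-of-the-art methods at a lower computational cost.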