In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive with, or better than, approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
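As a concrete illustration of the verification task just described, here is a minimal PyTorch sketch assuming pre-extracted frame tensors; the tiny encoder, tensor sizes, and triplet sampling are illustrative stand-ins rather than the paper's actual architecture or sampling pipeline.

```python
# Minimal sketch: binary classification of in-order vs. shuffled frame triplets.
import torch
import torch.nn as nn

class OrderVerificationNet(nn.Module):
    """Is a frame triplet in correct temporal order? (toy stand-in network)"""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Shared per-frame encoder (placeholder for the paper's CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # Fuse the three frame features and predict in-order vs. shuffled.
        self.classifier = nn.Linear(3 * feat_dim, 2)

    def forward(self, f1, f2, f3):
        z = torch.cat([self.encoder(f) for f in (f1, f2, f3)], dim=1)
        return self.classifier(z)

net = OrderVerificationNet()
frames = torch.randn(8, 5, 3, 64, 64)                 # batch of 5-frame snippets
pos = net(frames[:, 0], frames[:, 2], frames[:, 4])   # correct temporal order
neg = net(frames[:, 0], frames[:, 4], frames[:, 2])   # shuffled order
logits = torch.cat([pos, neg])
labels = torch.cat([torch.ones(8), torch.zeros(8)]).long()
loss = nn.functional.cross_entropy(logits, labels)
```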
We present an unsupervised representation learning approach using videos without semantic labels. We leverage temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to predict the correct order. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations. We validate the effectiveness of the learned representation by using our method as pre-training on high-level recognition problems. The experimental results show that our method compares favorably against state-of-the-art methods on action recognition, image classification, and object detection tasks.
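The pairwise-feature sorting design can be sketched as below for 4-frame tuples, where the 12 output classes correspond to the 4!/2 orders that remain after merging each permutation with its reverse; the linear encoder and dimensions are placeholders, not the paper's network.

```python
# Hedged sketch: features from all frame pairs are aggregated to classify
# which of 12 possible orders produced the shuffled 4-frame tuple.
import itertools
import torch
import torch.nn as nn

class SortNet(nn.Module):
    def __init__(self, feat_dim=64, n_frames=4, n_orders=12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.pair = nn.Linear(2 * feat_dim, feat_dim)        # pairwise feature extraction
        n_pairs = n_frames * (n_frames - 1) // 2
        self.head = nn.Linear(n_pairs * feat_dim, n_orders)  # order prediction

    def forward(self, frames):                               # frames: (B, 4, 3, 32, 32)
        feats = [self.encoder(frames[:, i]) for i in range(frames.size(1))]
        pairs = [torch.relu(self.pair(torch.cat([feats[i], feats[j]], dim=1)))
                 for i, j in itertools.combinations(range(len(feats)), 2)]
        return self.head(torch.cat(pairs, dim=1))

logits = SortNet()(torch.randn(2, 4, 3, 32, 32))             # (2, 12) order logits
```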
We propose a self-supervised spatiotemporal learning technique which leverages the chronological order of videos. Our method learns a spatiotemporal video representation by predicting the order of shuffled clips from a video. The category of the video is not required, which gives our technique the potential to take advantage of virtually unlimited unannotated videos. Related works use frames; compared to frames, however, clips are more consistent with video dynamics, help reduce the uncertainty of orders, and are more appropriate for learning a video representation. 3D convolutional neural networks are used to extract features for the clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest-neighbor retrieval experiments. We also use the learned networks as pre-trained models and finetune them on the action recognition task. Three types of 3D convolutional neural networks are tested in our experiments, and we obtain large improvements over existing self-supervised methods.
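A minimal sketch of the clip-order task follows, assuming fixed-length clip tensors; the single-layer Conv3d encoder is a toy stand-in for the three 3D backbone families the abstract mentions.

```python
# Illustrative sketch: a shared 3D ConvNet encodes each shuffled clip and a
# classifier predicts which permutation was applied (3 clips -> 3! = 6 orders).
import torch
import torch.nn as nn

class ClipOrderNet(nn.Module):
    def __init__(self, n_clips=3, n_orders=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(n_clips * 8, n_orders)

    def forward(self, clips):                 # clips: (B, n_clips, 3, T, H, W)
        feats = [self.encoder(clips[:, i]) for i in range(clips.size(1))]
        return self.head(torch.cat(feats, dim=1))

logits = ClipOrderNet()(torch.randn(2, 3, 3, 8, 32, 32))   # (2, 6) order logits
```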
We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning. In this task, the machine is asked to identify the unrelated or odd element from a set of otherwise related elements. We apply this technique to self-supervised video representation learning, where we sample subsequences from videos and ask the network to learn to predict the odd video subsequence. The odd video subsequence is sampled so that it has the wrong temporal order of frames, while the even ones have the correct temporal order; therefore, generating an odd-one-out question requires no manual annotation. Our learning machine is implemented as a multi-stream convolutional neural network, which is trained end-to-end. Using odd-one-out networks, we learn temporal representations for videos that generalize to other related tasks such as action recognition. On action classification, our method obtains 60.3% on the UCF101 dataset using only UCF101 data for training, which is approximately 10% better than current state-of-the-art self-supervised learning methods. Similarly, on the HMDB51 dataset we outperform self-supervised state-of-the-art methods by 12.7% on the action classification task.
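Below is an illustrative sketch of how an odd-one-out question can be posed to a multi-stream network: a shared encoder processes each candidate subsequence and a softmax head predicts the index of the shuffled one. The encoder and shapes are assumptions for demonstration, not the paper's design.

```python
# Minimal sketch of the odd-one-out task over 4 candidate 6-frame subsequences.
import torch
import torch.nn as nn

class OddOneOutNet(nn.Module):
    def __init__(self, n_streams=4, feat_dim=32):
        super().__init__()
        # Shared encoder applied to each candidate subsequence (stacked frames).
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(6 * 3 * 32 * 32, feat_dim), nn.ReLU())
        self.head = nn.Linear(n_streams * feat_dim, n_streams)  # which stream is odd?

    def forward(self, streams):                 # (B, n_streams, 6, 3, 32, 32)
        feats = [self.encoder(streams[:, i]) for i in range(streams.size(1))]
        return self.head(torch.cat(feats, dim=1))

net = OddOneOutNet()
streams = torch.randn(2, 4, 6, 3, 32, 32)       # 4 candidate subsequences per video
target = torch.randint(0, 4, (2,))              # index of the shuffled one
loss = nn.functional.cross_entropy(net(streams), target)
```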
We propose MaCLR, a novel method to explicitly perform cross-modal self-supervised video representation learning from the visual and motion modalities. Compared with previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, MaCLR enriches the standard contrastive learning objective for RGB video clips with a cross-modal learning objective between a motion pathway and a visual pathway. We show that the representations learned with our MaCLR method focus more on foreground motion regions and thus generalize better to downstream tasks. To demonstrate this, we evaluate MaCLR on five datasets for action recognition and action detection, and demonstrate state-of-the-art self-supervised performance on all of them. Furthermore, we show that MaCLR representations can be as effective as representations learned with full supervision for action recognition on UCF101 and HMDB51, and even outperform supervised representations for action recognition on VidSitu and SSv2, and for action detection on AVA.
Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNNs. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision: two patches connected by a track should have similar visual representations in deep feature space, since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding-box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves 54.4% mAP. We also show that our unsupervised network can perform competitively on other tasks such as surface-normal estimation.
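A hedged sketch of the tracking-as-supervision idea: the first and last patches of a track form the (anchor, positive) pair, a patch from another video serves as the negative, and a triplet ranking loss orders their distances. The toy encoder below stands in for the paper's Siamese-triplet CNN.

```python
# Minimal sketch: triplet ranking loss over tracked patch pairs.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
)
triplet = nn.TripletMarginLoss(margin=0.5)

first_patch = torch.randn(8, 3, 64, 64)    # patch at track start (anchor)
tracked_patch = torch.randn(8, 3, 64, 64)  # same track, later frame (positive)
random_patch = torch.randn(8, 3, 64, 64)   # patch from a different video (negative)

loss = triplet(encoder(first_patch), encoder(tracked_patch), encoder(random_patch))
```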
We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component of our approach is SpeedNet, a novel deep network trained to detect whether a video is playing at its normal rate or has been sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet to generating time-varying, adaptive video speedups, which allow viewers to watch videos faster but with less of the jittery, unnatural motion typical of videos that are sped up uniformly.
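The binary speediness task can be simulated as in the sketch below, where a "sped up" clip is produced by skipping every other frame of the same video; the tiny 3D ConvNet is a placeholder for SpeedNet's actual backbone.

```python
# Minimal sketch: classify normal-rate vs. sped-up clips built from one video.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 2),
)

video = torch.randn(4, 3, 32, 64, 64)     # (B, C, T, H, W), 32 frames
normal = video[:, :, :16]                 # 16 consecutive frames (normal rate)
sped_up = video[:, :, ::2]                # every 2nd frame: 2x speed, 16 frames

logits = classifier(torch.cat([normal, sped_up]))
labels = torch.cat([torch.zeros(4), torch.ones(4)]).long()   # 0 = normal, 1 = sped up
loss = nn.functional.cross_entropy(logits, labels)
```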
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis of how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
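The inflation trick is concrete enough to sketch: a 2D kernel is repeated along a new temporal axis and rescaled by 1/T, so the inflated filter initially reproduces the 2D network's response on a "boring" video of repeated frames. The helper below is a minimal PyTorch rendering of that idea, not the official I3D code.

```python
# Minimal sketch: inflate a 2D convolution into a 3D one.
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, time_k: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_k, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_k // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), averaged over time.
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, time_k, 1, 1) / time_k)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)   # e.g. an ImageNet stem
conv3d = inflate_conv(conv2d)                                   # inflated 3x7x7 stem
out = conv3d(torch.randn(1, 3, 16, 112, 112))                   # (1, 64, 16, 56, 56)
```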
There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for improving the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
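A minimal sketch of the synchronization objective follows: embeddings of in-sync audio/video pairs are pulled together and out-of-sync pairs pushed beyond a margin. The margin loss and random embeddings are illustrative assumptions; the paper's curriculum for choosing hard negatives is not reproduced here.

```python
# Hedged sketch: contrastive margin loss over audio/video clip embeddings.
import torch
import torch.nn.functional as F

def sync_contrastive_loss(v, a_pos, a_neg, margin=1.0):
    """v, a_pos, a_neg: (B, D) video / in-sync audio / out-of-sync audio embeddings."""
    d_pos = F.pairwise_distance(v, a_pos)
    d_neg = F.pairwise_distance(v, a_neg)
    # Pull synchronized pairs together; push unsynchronized ones beyond the margin.
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2)).mean()

v = torch.randn(8, 128)       # video clip embeddings
a_pos = torch.randn(8, 128)   # audio from the same time window
a_neg = torch.randn(8, 128)   # audio shifted in time (negative)
loss = sync_contrastive_loss(v, a_pos, a_neg)
```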
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variation across viewpoints, making recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. Here, we propose an approach for learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Since the input ambiguities of 2D poses caused by projection and occlusion are difficult to represent through a deterministic mapping, we adopt a probabilistic formulation for our embedding space. Experimental results show that, compared with 3D pose estimation models, our embedding model achieves higher accuracy when retrieving similar poses across different camera views. We also show that by training a simple temporal embedding model we achieve superior performance on pose sequence retrieval, while greatly reducing the embedding dimension relative to stacked frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment show that, without any additional training, our embeddings achieve competitive performance relative to other models trained specifically for each task.
Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification. However, in the case of human action videos, where both appearance and motion are significant factors of variation, this gap remains large. One of the key reasons is that sampling pairs of similar video clips, a step required by many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips occur only temporally close within a single video, leading to insufficient examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is improving upon conventional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. SLIC outperforms state-of-the-art video retrieval baselines by +15.4% on UCF101, and by +5.7% when directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2% top-1 accuracy on UCF101 (+0.8%) and improves top-1 accuracy on HMDB51 by +1.6%. After Kinetics pretraining, SLIC is also competitive with the state of the art for action classification.
The lack of large-scale annotated real-world datasets makes transfer learning a necessity for video activity understanding. Our goal is to develop an effective method for few-shot transfer learning for action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain to a different target domain using only a few examples. The visual cues we use include object-object interactions, hand grasps, and motion within regions that are a function of hand locations. We employ a meta-learning-based framework to extract the distinctive and domain-invariant components of the deployed visual cues. This enables transferring action classification models across public datasets captured with different scene and action configurations. We present comparative results of our transfer learning approach and report results superior to state-of-the-art action classification methods for both inter-class and inter-dataset transfer.
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2× filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.
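The temporally consistent spatial augmentation can be sketched with torchvision: crop and flip parameters are sampled once per clip and applied identically to every frame, so the spatial augmentation is strong while temporal consistency across frames is preserved. The parameter ranges below are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch: one set of random crop/flip parameters shared by all frames.
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def consistent_crop_flip(clip, out_size=(56, 56)):
    """clip: (T, C, H, W). Same random crop + flip for every frame."""
    i, j, h, w = T.RandomResizedCrop.get_params(clip[0], scale=(0.3, 1.0), ratio=(0.75, 1.33))
    flip = torch.rand(1).item() < 0.5
    frames = [TF.resized_crop(f, i, j, h, w, out_size) for f in clip]
    if flip:
        frames = [TF.hflip(f) for f in frames]
    return torch.stack(frames)

clip = torch.rand(16, 3, 128, 128)
aug1, aug2 = consistent_crop_flip(clip), consistent_crop_flip(clip)  # two positive views
```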
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not as evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
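A hedged sketch of the sparse sampling and consensus idea: one snippet is drawn from each of K equal segments of the video, a shared 2D backbone scores each snippet, and the video-level prediction is the average of the snippet scores, trained against the video label. The toy backbone and uniform random sampling are stand-ins for TSN's actual components.

```python
# Minimal sketch: sparse temporal sampling + average segment consensus.
import torch
import torch.nn as nn

class TSNLite(nn.Module):
    def __init__(self, n_classes=101, n_segments=3):
        super().__init__()
        self.n_segments = n_segments
        self.backbone = nn.Sequential(            # stand-in for BN-Inception etc.
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes),
        )

    def forward(self, video):                     # video: (B, T, C, H, W)
        T = video.shape[1]
        seg_len = T // self.n_segments
        # Sample one random frame per segment (sparse temporal sampling).
        idx = [k * seg_len + torch.randint(seg_len, (1,)).item() for k in range(self.n_segments)]
        scores = torch.stack([self.backbone(video[:, t]) for t in idx])  # (K, B, classes)
        return scores.mean(dim=0)                 # segment consensus: average

video = torch.randn(2, 30, 3, 64, 64)
logits = TSNLite()(video)                         # (2, 101) video-level scores
```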
We propose a simple yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets; 2) a homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets; and 3) our learned features, namely C3D (Convolutional 3D), with a simple linear classifier, outperform state-of-the-art methods on 4 different benchmarks and are comparable with the current best methods on the other 2 benchmarks. In addition, the features are compact, achieving 52.8% accuracy on the UCF101 dataset with only 10 dimensions, and are also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
Future activity anticipation is a challenging problem in egocentric vision. As the standard paradigm for future activity anticipation, recursive sequence prediction suffers from error accumulation. To address this problem, we propose a simple and effective self-regulated learning framework, which aims to continuously regulate the intermediate representation so as to produce a representation that (a) emphasizes the novel information in the frame at the current time stamp, in contrast to previously observed content, and (b) reflects its correlation with previously observed frames. The former is achieved by minimizing a contrastive loss, and the latter is achieved by a dynamic reweighting mechanism that attends to informative frames in the observed content, using a similarity comparison between the features of the current frame and those of the observed frames. The learned final video representation can be further enhanced by multi-task learning, which performs joint feature learning on the target activity labels and automatically detected action and object class tokens. SRL sharply outperforms existing state-of-the-art methods in most cases on egocentric video datasets and two third-person video datasets. Its effectiveness is also verified by the experimental finding that the action and object concepts supporting the activity semantics can be accurately identified.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of a video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
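The factorization is easy to sketch: a full t × k × k 3D convolution is replaced by a 1 × k × k spatial convolution followed by a t × 1 × 1 temporal one, with a nonlinearity in between. The intermediate width below is an assumption chosen so the factorized block roughly matches the parameter count of a full 3 × 3 × 3 convolution.

```python
# Minimal sketch of a "(2+1)D" block: spatial conv, ReLU, temporal conv.
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    def __init__(self, c_in, c_out, mid, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(c_in, mid, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(mid, c_out, kernel_size=(t, 1, 1), padding=(t // 2, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                    # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

# mid=24 makes the two factorized convs hold about as many weights as one
# full 3x3x3 convolution from 3 to 64 channels (5256 vs. 5184, biases aside).
block = R2Plus1DBlock(3, 64, mid=24)
out = block(torch.randn(1, 3, 8, 56, 56))    # (1, 64, 8, 56, 56)
```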
We introduce a novel self-supervised contrastive learning method to learn representations from unlabeled videos. Existing approaches ignore the specifics of input distortions, e.g., by learning invariance to temporal transformations. Instead, we argue that video representations should preserve video dynamics and reflect temporal manipulations of the input. Therefore, we exploit novel constraints to build representations that are equivariant to temporal transformations and better capture video dynamics. In our method, the relative temporal transformations between augmented clips of a video are encoded in a vector and contrasted with other transformation vectors. To support temporal equivariance learning, we additionally propose the self-supervised classification of two clips of a video as 1. overlapping, 2. ordered, or 3. unordered. Our experiments show that time-equivariant representations achieve state-of-the-art results on video retrieval and action recognition benchmarks on UCF101, HMDB51, and Diving48.
The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss, exploiting the complementary information from different views of the same data source, RGB streams and optical flow, by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.
Video self-supervised learning is a challenging task that requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and ignore elaborately modeling multi-scale temporal dependencies in an explicit way. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models the inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on frequency-domain analysis with the discrete cosine transform. To explicitly model the multi-scale temporal dependencies of unlabeled videos, our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet temporal contrastive graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module, which leverages the relational knowledge among video snippets to learn a global context representation and adaptively recalibrate channel-wise features. Experimental results demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.