Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
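The cross-modal supervision idea in the XDC abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy feature vectors, the hand-rolled k-means with first-k initialisation, and the helper names `kmeans` and `cross_modal_pseudo_labels` are hypothetical and are not taken from the XDC paper or its code. The sketch only shows the label swap (clusters from one modality become training targets for the other); it does not train any encoder.

```python
from typing import Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iters: int = 10) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per point.

    Centroids are initialised from the first k points (an illustrative
    choice that keeps the example deterministic).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cross_modal_pseudo_labels(
    video_feats: List[Sequence[float]],
    audio_feats: List[Sequence[float]],
    k: int,
) -> Dict[str, List[int]]:
    """Cluster each modality and hand its labels to the OTHER modality."""
    audio_clusters = kmeans(audio_feats, k)
    video_clusters = kmeans(video_feats, k)
    return {
        "video_targets": audio_clusters,  # video encoder is supervised by audio clusters
        "audio_targets": video_clusters,  # audio encoder is supervised by video clusters
    }


# Toy features for four clips (hypothetical values, index i = same clip).
video_feats = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
audio_feats = [[1.0], [9.0], [1.2], [8.8]]
targets = cross_modal_pseudo_labels(video_feats, audio_feats, k=2)
print(targets)
```

In a full pipeline these targets would presumably feed a classification loss on each encoder, with clustering and training repeated in alternation; that loop is omitted here.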
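We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard "synchronous" cross-modal relations, CrissCross also learns "asynchronous" cross-modal relations. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations. Our experiments show that strong augmentations for both the audio and visual modalities, combined with relaxed cross-modal temporal synchronicity, optimize performance. To pretrain our proposed framework, we use datasets of different sizes: Kinetics, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross shows state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50). The code and pretrained models will be made publicly available.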
There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
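We present MaCLR, a novel method that explicitly performs cross-modal self-supervised video representation learning from the visual and motion modalities. Compared to previous video representation learning methods, which mostly focus on learning motion cues implicitly from RGB inputs, MaCLR enriches the standard contrastive learning objective for RGB video clips with a cross-modal learning objective between a motion pathway and a visual pathway. We show that the representations learned with MaCLR focus more on foreground motion regions and therefore generalize better to downstream tasks. To demonstrate this, we evaluate MaCLR on five datasets for action recognition and action detection, and show state-of-the-art self-supervised performance on all of them. Furthermore, we show that MaCLR representations can be as effective as representations learned with full supervision for action recognition on UCF101 and HMDB51, and even outperform supervised representations for action recognition on VidSitu and SSv2 and for action detection on AVA.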
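Despite the recent success of video self-supervised learning models, much remains to be understood about their generalization ability. In this paper, we investigate how sensitive video self-supervised learning is to the current conventional benchmarks, and whether methods generalize beyond the canonical evaluation setting. We do this across four different sensitivity factors: domain, samples, actions, and task. Our study, which comprises more than 500 experiments across 7 video datasets, 9 self-supervised methods, and 6 video understanding tasks, reveals that current benchmarks in video self-supervised learning are not good indicators of generalization along these sensitivity factors. Furthermore, we find that self-supervised methods lag considerably behind vanilla supervised pre-training, especially when the domain shift is large and the number of available downstream samples is low. From our analysis, we distill the SEVERE benchmark, a subset of our experiments, and discuss its implications for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.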
Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes by listing a set of promising future directions for self-supervised visual feature learning.
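We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representations to be invariant over time. To address these challenges, we leverage audio signals to identify moments of likely interaction that are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.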
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, Audioset and ESC-50 when compared to previous self-supervised work. Our models are publicly available [1, 2, 3]. * Equal contribution. † Work done during an internship at DeepMind. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Unlike in the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
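Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification. In the case of human action videos, however, where both appearance and motion are significant factors of variation, this gap remains large. One of the key reasons is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips only occur temporally close to each other within a single video, which leads to insufficient examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is that we improve upon traditional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. On UCF101, SLIC outperforms state-of-the-art video retrieval baselines by +15.4%, and by +5.7% when directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2% top-1 accuracy on UCF101 (+0.8%) and also improves top-1 finetuning accuracy on HMDB51 (+1.6%). After pretraining on Kinetics, SLIC is also competitive with the state of the art in action classification.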
Figure 1: We describe an efficient approach to learn visual representations from misaligned and noisy narrations (bottom) automatically extracted from instructional videos (top). Our video representations are learnt from scratch without relying on any manually annotated visual dataset yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2× filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.
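The pull-together, push-apart objective described in the CVRL abstract above can be sketched as follows. This is a minimal illustration that assumes an InfoNCE-style loss with cosine similarity and a temperature of 0.1; the abstract does not specify the exact loss form or hyperparameters, and the function names and toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are embeddings of two augmented clips from
    the same video; positives[j] with j != i come from other videos and
    act as negatives for anchors[i].
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching clip as the correct "class".
        total += log_denominator - logits[i]
    return total / len(a)


# Toy 2-D embeddings for two videos (hypothetical values).
clips_a = [[1.0, 0.0], [0.0, 1.0]]
clips_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(clips_a, clips_b)           # clips paired with their own video
mismatched_loss = info_nce(clips_a, clips_b[::-1])  # clips paired with the wrong video
print(aligned_loss, mismatched_loss)
```

The loss is small when each clip is closest to the other clip from its own video and large when the pairing is wrong, which is the behaviour the abstract describes.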
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
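In this paper, we consider the problem of classifying activities (e.g., cooking different recipes, making different home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately classifying these activities requires not only recognizing the individual steps that make up the task but also capturing their temporal dependencies. This problem differs dramatically from traditional action classification, where models are typically optimized on videos that span only a few seconds and are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models that recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels because of the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging distant supervision from a textual knowledge base (wikiHow) that contains detailed descriptions of the steps needed to carry out a wide variety of complex activities. Our method uses a language model to match automatically transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically labeled steps (without manual supervision) yield representations that achieve superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting, and egocentric video classification.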
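Generating representations of video data is crucial for advancing the field of machine perception. Most current techniques rely on hand-annotated data, which can be difficult to work with, expensive to generate, and hard to scale. In this work, we propose LAVA, a novel learning approach based on contrastive learning that is able to learn joint language, audio, and video representations in a self-supervised manner. We pre-train LAVA on the Kinetics 700 dataset, using transformer encoders to learn representations for each modality. We then demonstrate that LAVA performs competitively with current state-of-the-art self-supervised and weakly-supervised pretraining techniques while using only a fraction of the unlabeled data.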
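Video recordings of speech contain correlated audio and visual information, providing a strong signal for learning speech representations from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representations that benefit both lip reading and automatic speech recognition. On the largest public lip-reading benchmark, LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%), which was trained with a thousand times more transcribed video data (31K hours). When all 433 hours of labeled data from LRS3 are used and combined with self-training, the lip-reading WER is further reduced to 26.9%. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookResearch/av_hubert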
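The figures quoted in the AV-HuBERT abstract above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract; the helper name `relative_reduction` is hypothetical. Computed from the rounded WER values, the relative reduction comes out slightly above the quoted 40%, presumably because the abstract's figure is rounded or derived from unrounded values.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


# Audio-only speech recognition on LRS3: 2.3% WER (previous) vs 1.3% WER.
asr_reduction = relative_reduction(2.3, 1.3)

# Lip reading on LRS3: 33.6% WER (previous) vs 32.5% WER with 30 labeled hours.
lip_reduction = relative_reduction(33.6, 32.5)

# Labeled-data ratio: 31K hours (previous approach) vs 30 hours.
data_ratio = 31_000 / 30

print(f"ASR relative WER reduction: {asr_reduction:.1%}")
print(f"Lip-reading relative WER reduction: {lip_reduction:.1%}")
print(f"Labeled data ratio: {data_ratio:.0f}x")
```

The data ratio of roughly a thousand matches the abstract's "a thousand times more transcribed video data".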
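Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in computer vision, speech, and natural language processing (NLP), and more recently in other types of modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training. Acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals freely obtained from the raw data. Unlike existing reviews of SSRL, which predominantly focus on methods in the CV or NLP fields for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective functions, network architectures, and potential applications, and 4) review existing multimodal techniques in each category and across various modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that use multimodal and/or temporal data.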
State-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the distribution shift from the lower color contrast as well as the limited availability of labeled dark videos. Our goal is to recognize activities in the dark as well as in the day. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes task-irrelevant unlabeled dark videos to train an activity recognizer. Our proposed activity recognizer makes use of audio which is invariant to illumination. However, the usefulness of audio and visual features differs according to the illumination. Thus we propose to make our audio-visual recognizer 'darkness-aware'. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate that our proposals enable effective activity recognition in the dark and can even improve robustness to occlusions.
We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer and also to tackle the domain gap between audio and visual modalities which could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, a modality-agnostic variant of our proposed framework is introduced, which uses the same backbone for both audio and visual modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of handling different data streams. The code is released on the project website.
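We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localization, it is considerably harder because the detector must classify objects by type, enumerate each instance of an object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localize objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localization. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.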