Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by considering all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We validate empirically our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
translated by 谷歌翻译
对比表现学习已被证明是一种有效的自我监督学习方法。大多数成功的方法都是基于噪声对比估计(NCE)范式,并将实例视图的视图视为阳性和其他情况,作为阳性应与其对比的噪声。但是,数据集中的所有实例都是从相同的分布和共享底层语义信息中汲取,这些语义信息不应被视为噪声。我们认为,良好的数据表示包含实例之间的关系或语义相似性。对比学习隐含地学习关系,但认为负面的噪音是对学习关系质量有害的噪音,因此是象征性的质量。为了规避这个问题,我们提出了一种使用称为相似性对比估计(SCE)之间的情况之间的语义相似性的对比学习的新颖性。我们的培训目标可以被视为柔和的对比学习。我们提出了持续分配以基于其语义相似性推动或拉动实例的持续分配。目标相似性分布从弱增强的情况计算并锐化以消除无关的关系。每个弱增强实例都与一个强大的增强实例配对,该实例对比其积极的同时保持目标相似性分布。实验结果表明,我们所提出的SCE在各种数据集中优于其基线MoCov2和RESSL,并对ImageNet线性评估协议上的最先进的算法具有竞争力。
translated by 谷歌翻译
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
translated by 谷歌翻译
自我监督的学习最近在没有人类注释的情况下在表示学习方面取得了巨大的成功。主要方法(即对比度学习)通常基于实例歧视任务,即单个样本被视为独立类别。但是,假定所有样品都是不同的,这与普通视觉数据集中类似样品的自然分组相矛盾,例如同一狗的多个视图。为了弥合差距,本文提出了一种自适应方法,该方法引入了软样本间关系,即自适应软化对比度学习(ASCL)。更具体地说,ASCL将原始实例歧视任务转换为多实体软歧视任务,并自适应地引入样本间关系。作为现有的自我监督学习框架的有效简明的插件模块,ASCL就性能和效率都实现了多个基准的最佳性能。代码可从https://github.com/mrchenfeng/ascl_icpr2022获得。
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected.
translated by 谷歌翻译
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive selfsupervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by Sim-CLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-ofthe-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1
translated by 谷歌翻译
时空表示学习对于视频自我监督的表示至关重要。最近的方法主要使用对比学习和借口任务。然而,这些方法通过在潜在空间中的特征相似性判断所学习表示的中间状态的同时通过潜伏空间中的特征相似性来学习表示,这限制了整体性能。在这项工作中,考虑到采样实例的相似性作为中级状态,我们提出了一种新的借口任务 - 时空 - 时间重叠速率(Stor)预测。它源于观察到,人类能够区分空间和时间在视频中的重叠率。此任务鼓励模型区分两个生成的样本的存储来学习表示。此外,我们采用了联合优化,将借口任务与对比学习相结合,以进一步增强时空表示学习。我们还研究了所提出的计划中每个组分的相互影响。广泛的实验表明,我们的拟议Stor任务可以赞成对比学习和借口任务。联合优化方案可以显着提高视频理解中的时空表示。代码可在https://github.com/katou2/cstp上获得。
translated by 谷歌翻译
通过自学学习的视觉表示是一项极具挑战性的任务,因为网络需要在没有监督提供的主动指导的情况下筛选出相关模式。这是通过大量数据增强,大规模数据集和过量量的计算来实现的。视频自我监督学习(SSL)面临着额外的挑战:视频数据集通常不如图像数据集那么大,计算是一个数量级,并且优化器所必须通过的伪造模式数量乘以几倍。因此,直接从视频数据中学习自我监督的表示可能会导致次优性能。为了解决这个问题,我们建议在视频表示学习框架中利用一个以自我或语言监督为基础的强大模型,并在不依赖视频标记的数据的情况下学习强大的空间和时间信息。为此,我们修改了典型的基于视频的SSL设计和目标,以鼓励视频编码器\ textit {subsume}基于图像模型的语义内容,该模型在通用域上训练。所提出的算法被证明可以更有效地学习(即在较小的时期和较小的批次中),并在单模式SSL方法中对标准下游任务进行了新的最新性能。
translated by 谷歌翻译
自我监督的方法已通过端到端监督学习的图像分类显着缩小了差距。但是,在人类动作视频的情况下,外观和运动都是变化的重要因素,因此该差距仍然很大。这样做的关键原因之一是,采样对类似的视频剪辑,这是许多自我监督的对比学习方法所需的步骤,目前是保守的,以避免误报。一个典型的假设是,类似剪辑仅在单个视频中暂时关闭,从而导致运动相似性的示例不足。为了减轻这种情况,我们提出了SLIC,这是一种基于聚类的自我监督的对比度学习方法,用于人类动作视频。我们的关键贡献是,我们通过使用迭代聚类来分组类似的视频实例来改善传统的视频内积极采样。这使我们的方法能够利用集群分配中的伪标签来取样更艰难的阳性和负面因素。在UCF101上,SLIC的表现优于最先进的视频检索基线 +15.4%,而直接转移到HMDB51时,SLIC检索基线的率高为15.4%, +5.7%。通过用于动作分类的端到端登录,SLIC在UCF101上获得了83.2%的TOP-1准确性(+0.8%),而HMDB51(+1.6%)上的fric fineTuns in top-1 finetuning。在动力学预处理后,SLIC还与最先进的行动分类竞争。
translated by 谷歌翻译
尽管最近通过剩余网络的代表学习中的自我监督方法取得了进展,但它们仍然对ImageNet分类基准进行了高度的监督学习,限制了它们在性能关键设置中的适用性。在MITROVIC等人的现有理论上洞察中建立2021年,我们提出了RELICV2,其结合了明确的不变性损失,在各种适当构造的数据视图上具有对比的目标。 Relicv2在ImageNet上实现了77.1%的前1个分类准确性,使用线性评估使用Reset50架构和80.6%,具有较大的Reset型号,优于宽边缘以前的最先进的自我监督方法。最值得注意的是,RelicV2是使用一系列标准Reset架构始终如一地始终优先于类似的对比较中的监督基线的第一个表示学习方法。最后,我们表明,尽管使用Reset编码器,Relicv2可与最先进的自我监控视觉变压器相媲美。
translated by 谷歌翻译
Self-supervised learning (SSL) is rapidly closing BARLOW TWINS is competitive with state-of-the-art methods for self-supervised learning while being conceptually simpler, naturally avoiding trivial constant (i.e. collapsed) embeddings, and being robust to the training batch size.
translated by 谷歌翻译
我们提出了MACLR,这是一种新颖的方法,可显式执行从视觉和运动方式中学习的跨模式自我监督的视频表示。与以前的视频表示学习方法相比,主要关注学习运动线索的研究方法是隐含的RGB输入,MACLR丰富了RGB视频片段的标准对比度学习目标,具有运动途径和视觉途径之间的跨模式学习目标。我们表明,使用我们的MACLR方法学到的表示形式更多地关注前景运动区域,因此可以更好地推广到下游任务。为了证明这一点,我们在五个数据集上评估了MACLR,以进行动作识别和动作检测,并在所有数据集上展示最先进的自我监督性能。此外,我们表明MACLR表示可以像在UCF101和HMDB51行动识别的全面监督下所学的表示一样有效,甚至超过了对Vidsitu和SSV2的行动识别的监督表示,以及对AVA的动作检测。
translated by 谷歌翻译
对比自我监督的学习已经超越了许多下游任务的监督预测,如分割和物体检测。但是,当前的方法仍然主要应用于像想象成的策划数据集。在本文中,我们首先研究数据集中的偏差如何影响现有方法。我们的研究结果表明,目前的对比方法令人惊讶地工作:(i)对象与场景为中心,(ii)统一与长尾和(iii)一般与域特定的数据集。其次,鉴于这种方法的一般性,我们尝试通过微小的修改来实现进一步的收益。我们展示了学习额外的修正 - 通过使用多尺度裁剪,更强的增强和最近的邻居 - 改善了表示。最后,我们观察Moco在用多作物策略训练时学习空间结构化表示。表示可以用于语义段检索和视频实例分段,而不会FineTuning。此外,结果与专门模型相提并论。我们希望这项工作将成为其他研究人员的有用研究。代码和模型可在https://github.com/wvanganebleke/revisiting-contrastive-ssl上获得。
translated by 谷歌翻译
The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (In-foNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.
translated by 谷歌翻译
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2× filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/.
translated by 谷歌翻译
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies, either at the image or the feature level, improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e. the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing memory requirements, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection, and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.Project page: https://europe.naverlabs.com/mochi 34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to selfsupervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected. 3
translated by 谷歌翻译
对比度学习最近在无监督的视觉表示学习中显示出巨大的潜力。在此轨道中的现有研究主要集中于图像内不变性学习。学习通常使用丰富的图像内变换来构建正对,然后使用对比度损失最大化一致性。相反,相互影响不变性的优点仍然少得多。利用图像间不变性的一个主要障碍是,尚不清楚如何可靠地构建图像间的正对,并进一步从它们中获得有效的监督,因为没有配对注释可用。在这项工作中,我们提出了一项全面的实证研究,以更好地了解从三个主要组成部分的形象间不变性学习的作用:伪标签维护,采样策略和决策边界设计。为了促进这项研究,我们引入了一个统一的通用框架,该框架支持无监督的内部和间形内不变性学习的整合。通过精心设计的比较和分析,揭示了多个有价值的观察结果:1)在线标签收敛速度比离线标签更快; 2)半硬性样品比硬否定样品更可靠和公正; 3)一个不太严格的决策边界更有利于形象间的不变性学习。借助所有获得的食谱,我们的最终模型(即InterCLR)对多个标准基准测试的最先进的内图内不变性学习方法表现出一致的改进。我们希望这项工作将为设计有效的无监督间歇性不变性学习提供有用的经验。代码:https://github.com/open-mmlab/mmselfsup。
translated by 谷歌翻译
我们介绍了一种对比视频表示方法,它使用课程学习在对比度培训中施加动态抽样策略。更具体地说,Concur以易于正面样本(在时间上和语义上相似的剪辑上)开始对比度训练,并且随着训练的进行,它会有效地提高时间跨度,从而有效地采样了硬质阳性(时间为时间和语义上不同)。为了学习更好的上下文感知表示形式,我们还提出了一个辅助任务,以预测积极剪辑之间的时间距离。我们对两个流行的动作识别数据集进行了广泛的实验,即UCF101和HMDB51,我们提出的方法在两项视频动作识别和视频检索的基准任务上实现了最新的性能。我们通过使用R(2+1)D和C3D编码器以及对Kinetics-400和Kinetics-200200数据集的R(2+1)D和C3D编码器以及预训练的影响来探讨编码器骨架和预训练策略的影响。此外,一项详细的消融研究显示了我们提出的方法的每个组成部分的有效性。
translated by 谷歌翻译