Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
translated by 谷歌翻译
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that bridges contrastive learning with clustering. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it encodes semantic structures discovered by clustering into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. We iteratively perform E-step as finding the distribution of prototypes via clustering and M-step as optimizing the network via contrastive learning. We propose ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks with substantial improvement in low-resource transfer learning. Code and pretrained models are available at https://github.com/salesforce/PCL.
translated by 谷歌翻译
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive selfsupervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by Sim-CLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-ofthe-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1
translated by 谷歌翻译
对比度学习最近在无监督的视觉表示学习中显示出巨大的潜力。在此轨道中的现有研究主要集中于图像内不变性学习。学习通常使用丰富的图像内变换来构建正对,然后使用对比度损失最大化一致性。相反,相互影响不变性的优点仍然少得多。利用图像间不变性的一个主要障碍是,尚不清楚如何可靠地构建图像间的正对,并进一步从它们中获得有效的监督,因为没有配对注释可用。在这项工作中,我们提出了一项全面的实证研究,以更好地了解从三个主要组成部分的形象间不变性学习的作用:伪标签维护,采样策略和决策边界设计。为了促进这项研究,我们引入了一个统一的通用框架,该框架支持无监督的内部和间形内不变性学习的整合。通过精心设计的比较和分析,揭示了多个有价值的观察结果:1)在线标签收敛速度比离线标签更快; 2)半硬性样品比硬否定样品更可靠和公正; 3)一个不太严格的决策边界更有利于形象间的不变性学习。借助所有获得的食谱,我们的最终模型(即InterCLR)对多个标准基准测试的最先进的内图内不变性学习方法表现出一致的改进。我们希望这项工作将为设计有效的无监督间歇性不变性学习提供有用的经验。代码:https://github.com/open-mmlab/mmselfsup。
translated by 谷歌翻译
对比性自我监督表示方法学习方法最大程度地提高了正对之间的相似性,同时倾向于最大程度地减少负对之间的相似性。但是,总的来说,负面对之间的相互作用被忽略了,因为它们没有根据其特定差异和相似性而采用的特殊机制来对待负面对。在本文中,我们提出了扩展的动量对比(Xmoco),这是一种基于MOCO家族配置中提出的动量编码单元的遗产,一种自我监督的表示方法。为此,我们引入了交叉一致性正则化损失,并通过该损失将转换一致性扩展到不同图像(负对)。在交叉一致性正则化规则下,我们认为与任何一对图像(正或负)相关的语义表示应在借口转换下保留其交叉相似性。此外,我们通过在批处理上的负面对上实施相似性的均匀分布来进一步规范训练损失。可以轻松地将所提出的正规化添加到现有的自我监督学习算法中。从经验上讲,我们报告了标准Imagenet-1K线性头部分类基准的竞争性能。此外,通过将学习的表示形式转移到常见的下游任务中,我们表明,将Xmoco与普遍使用的增强功能一起使用可以改善此类任务的性能。我们希望本文的发现是研究人员考虑自我监督学习中负面例子的重要相互作用的动机。
translated by 谷歌翻译
我们介绍了代表学习(CARL)的一致分配,通过组合来自自我监督对比学习和深层聚类的思路来学习视觉表现的无监督学习方法。通过从聚类角度来看对比学习,Carl通过学习一组一般原型来学习无监督的表示,该原型用作能量锚来强制执行给定图像的不同视图被分配给相同的原型。与与深层聚类的对比学习的当代工作不同,Carl建议以在线方式学习一组一般原型,使用梯度下降,而无需使用非可微分算法或k手段来解决群集分配问题。卡尔在许多代表性学习基准中超越了竞争对手,包括线性评估,半监督学习和转移学习。
translated by 谷歌翻译
我们考虑在给定的分类任务(例如Imagenet-1k(IN1K))上训练深神网络的问题,以便它在该任务以及其他(未来)转移任务方面擅长。这两个看似矛盾的属性在改善模型的概括的同时保持其在原始任务上的性能之间实现了权衡。接受自我监督学习训练的模型(SSL)倾向于比其受监督的转移学习更好地概括。但是,他们仍然落后于In1k上的监督模型。在本文中,我们提出了一个有监督的学习设置,以利用两全其美的方式。我们使用最近的SSL模型的两个关键组成部分丰富了普通的监督培训框架:多尺度农作物用于数据增强和使用可消耗的投影仪。我们用内存库在即时计算的类原型中代替了班级权重的最后一层。我们表明,这三个改进导致IN1K培训任务和13个转移任务之间的权衡取决于更加有利的权衡。在所有探索的配置中,我们都会挑出两种模型:T-Rex实现了转移学习的新状态,并且超过了In1k上的Dino和Paws等最佳方法,以及与高度优化的RSB--相匹配的T-Rex*在IN1K上的A1模型,同时在转移任务上表现更好。项目页面和预估计的模型:https://europe.naverlabs.com/t-rex
translated by 谷歌翻译
通过对比学习,自我监督学习最近在视觉任务中显示了巨大的潜力,这旨在在数据集中区分每个图像或实例。然而,这种情况级别学习忽略了实例之间的语义关系,有时不希望地从语义上类似的样本中排斥锚,被称为“假否定”。在这项工作中,我们表明,对于具有更多语义概念的大规模数据集来说,虚假否定的不利影响更为重要。为了解决这个问题,我们提出了一种新颖的自我监督的对比学习框架,逐步地检测并明确地去除假阴性样本。具体地,在训练过程之后,考虑到编码器逐渐提高,嵌入空间变得更加语义结构,我们的方法动态地检测增加的高质量假否定。接下来,我们讨论两种策略,以明确地在对比学习期间明确地消除检测到的假阴性。广泛的实验表明,我们的框架在有限的资源设置中的多个基准上表现出其他自我监督的对比学习方法。
translated by 谷歌翻译
Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by considering all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We validate empirically our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected.
translated by 谷歌翻译
许多最近的自我监督学习方法在图像分类和其他任务上表现出了令人印象深刻的表现。已经使用了一种令人困惑的多种技术,并不总是清楚地了解其收益的原因,尤其是在组合使用时。在这里,我们将图像的嵌入视为点粒子,并将模型优化视为该粒子系统上的动态过程。我们的动态模型结合了类似图像的吸引力,避免局部崩溃的局部分散力以及实现颗粒的全球均匀分布的全局分散力。动态透视图突出了使用延迟参数图像嵌入(a la byol)以及同一图像的多个视图的优点。它还使用纯动态的局部分散力(布朗运动),该分散力比其他方法显示出改善的性能,并且不需要其他粒子坐标的知识。该方法称为MSBREG,代表(i)多视质心损失,它施加了吸引力的力来将不同的图像视图嵌入到其质心上,(ii)奇异值损失,将粒子系统推向空间均匀的密度( iii)布朗扩散损失。我们评估MSBREG在ImageNet上的下游分类性能以及转移学习任务,包括细粒度分类,多类对象分类,对象检测和实例分段。此外,我们还表明,将我们的正则化术语应用于其他方法,进一步改善了其性能并通过防止模式崩溃来稳定训练。
translated by 谷歌翻译
对比自我监督的学习已经超越了许多下游任务的监督预测,如分割和物体检测。但是,当前的方法仍然主要应用于像想象成的策划数据集。在本文中,我们首先研究数据集中的偏差如何影响现有方法。我们的研究结果表明,目前的对比方法令人惊讶地工作:(i)对象与场景为中心,(ii)统一与长尾和(iii)一般与域特定的数据集。其次,鉴于这种方法的一般性,我们尝试通过微小的修改来实现进一步的收益。我们展示了学习额外的修正 - 通过使用多尺度裁剪,更强的增强和最近的邻居 - 改善了表示。最后,我们观察Moco在用多作物策略训练时学习空间结构化表示。表示可以用于语义段检索和视频实例分段,而不会FineTuning。此外,结果与专门模型相提并论。我们希望这项工作将成为其他研究人员的有用研究。代码和模型可在https://github.com/wvanganebleke/revisiting-contrastive-ssl上获得。
translated by 谷歌翻译
Self-supervised learning (SSL) is rapidly closing BARLOW TWINS is competitive with state-of-the-art methods for self-supervised learning while being conceptually simpler, naturally avoiding trivial constant (i.e. collapsed) embeddings, and being robust to the training batch size.
translated by 谷歌翻译
尽管最近通过剩余网络的代表学习中的自我监督方法取得了进展,但它们仍然对ImageNet分类基准进行了高度的监督学习,限制了它们在性能关键设置中的适用性。在MITROVIC等人的现有理论上洞察中建立2021年,我们提出了RELICV2,其结合了明确的不变性损失,在各种适当构造的数据视图上具有对比的目标。 Relicv2在ImageNet上实现了77.1%的前1个分类准确性,使用线性评估使用Reset50架构和80.6%,具有较大的Reset型号,优于宽边缘以前的最先进的自我监督方法。最值得注意的是,RelicV2是使用一系列标准Reset架构始终如一地始终优先于类似的对比较中的监督基线的第一个表示学习方法。最后,我们表明,尽管使用Reset编码器,Relicv2可与最先进的自我监控视觉变压器相媲美。
translated by 谷歌翻译
对比表现学习已被证明是一种有效的自我监督学习方法。大多数成功的方法都是基于噪声对比估计(NCE)范式,并将实例视图的视图视为阳性和其他情况,作为阳性应与其对比的噪声。但是,数据集中的所有实例都是从相同的分布和共享底层语义信息中汲取,这些语义信息不应被视为噪声。我们认为,良好的数据表示包含实例之间的关系或语义相似性。对比学习隐含地学习关系,但认为负面的噪音是对学习关系质量有害的噪音,因此是象征性的质量。为了规避这个问题,我们提出了一种使用称为相似性对比估计(SCE)之间的情况之间的语义相似性的对比学习的新颖性。我们的培训目标可以被视为柔和的对比学习。我们提出了持续分配以基于其语义相似性推动或拉动实例的持续分配。目标相似性分布从弱增强的情况计算并锐化以消除无关的关系。每个弱增强实例都与一个强大的增强实例配对,该实例对比其积极的同时保持目标相似性分布。实验结果表明,我们所提出的SCE在各种数据集中优于其基线MoCov2和RESSL,并对ImageNet线性评估协议上的最先进的算法具有竞争力。
translated by 谷歌翻译
我们通过以端到端的方式对大规模未标记的数据集进行分类,呈现扭曲,简单和理论上可解释的自我监督的表示学习方法。我们使用Softmax操作终止的暹罗网络,以产生两个增强图像的双类分布。没有监督,我们强制执行不同增强的班级分布。但是,只需最小化增强之间的分歧将导致折叠解决方案,即,输出所有图像的相同类概率分布。在这种情况下,留下有关输入图像的信息。为了解决这个问题,我们建议最大化输入和课程预测之间的互信息。具体地,我们最小化每个样品的分布的熵,使每个样品的课程预测是对每个样品自信的预测,并最大化平均分布的熵,以使不同样品的预测变得不同。以这种方式,扭曲可以自然地避免没有特定设计的折叠解决方案,例如非对称网络,停止梯度操作或动量编码器。因此,扭曲优于各种任务的最先进的方法。特别是,在半监督学习中,扭曲令人惊讶地表现出令人惊讶的是,使用Reset-50作为骨干的1%ImageNet标签实现61.2%的顶级精度,以前的最佳结果为6.2%。代码和预先训练的模型是给出的:https://github.com/byteDance/twist
translated by 谷歌翻译
自我监督学习的最新进展证明了多种视觉任务的有希望的结果。高性能自我监督方法中的一个重要成分是通过培训模型使用数据增强,以便在嵌入空间附近的相同图像的不同增强视图。然而,常用的增强管道整体地对待图像,忽略图像的部分的语义相关性-e.g。主题与背景 - 这可能导致学习杂散相关性。我们的工作通过调查一类简单但高度有效的“背景增强”来解决这个问题,这鼓励模型专注于语义相关内容,劝阻它们专注于图像背景。通过系统的调查,我们表明背景增强导致在各种任务中跨越一系列最先进的自我监督方法(MOCO-V2,BYOL,SWAV)的性能大量改进。 $ \ SIM $ + 1-2%的ImageNet收益,使得与监督基准的表现有关。此外,我们发现有限标签设置的改进甚至更大(高达4.2%)。背景技术增强还改善了许多分布换档的鲁棒性,包括天然对抗性实例,想象群-9,对抗性攻击,想象成型。我们还在产生了用于背景增强的显着掩模的过程中完全无监督的显着性检测进展。
translated by 谷歌翻译
我们对自我监督,监督或半监督设置的代表学习感兴趣。在应用自我监督学习的平均移位思想的事先工作,通过拉动查询图像来概括拜尔的想法,不仅更接近其其他增强,而且还可以到其他增强的最近邻居(NNS)。我们认为,学习可以从选择远处与查询相关的邻居选择遥远的邻居。因此,我们建议通过约束最近邻居的搜索空间来概括MSF算法。我们显示我们的方法在SSL设置中优于MSF,当约束使用不同的图像时,并且当约束确保NNS具有与查询相同的伪标签时,在半监控设置中优于培训资源的半监控设置中的爪子。
translated by 谷歌翻译
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [29] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
translated by 谷歌翻译