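Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks such as segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances (through multi-scale cropping, stronger augmentations and nearest neighbors) improves the representations. Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. Moreover, the results are on par with specialized models. We hope this work will serve as a useful study for other researchers. Code and models are available at https://github.com/wvanganebleke/revisiting-contrastive-ssl.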
translated by Google Translate
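Recent works in self-supervised learning have demonstrated strong performance on scene-level dense prediction tasks by pretraining with object-centric or region-based correspondence objectives. In this paper, we present Region-to-Object Representation Learning (R2O), which unifies region-based and object-centric pretraining. R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks and then jointly learning representations of the contents within the mask. R2O uses a "region refinement module" to group small image regions, generated using a region-level prior, into larger regions that tend to correspond to objects by clustering region-level features. As training progresses, R2O follows a region-to-object curriculum that encourages learning region-level features early on and gradually progresses to training object-centric representations. Representations learned using R2O lead to state-of-the-art performance in semantic segmentation on PASCAL VOC (+0.7 mIoU) and Cityscapes (+0.4 mIoU), and in instance segmentation on MS COCO (+0.3 mask AP). Further, after pretraining on ImageNet, R2O pretrained models surpass the existing state of the art in unsupervised object segmentation on the Caltech-UCSD Birds 200-2011 dataset (+2.9 mIoU). We provide the code/models from this work at https://github.com/kkallidromitis/r2o.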
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [29] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
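The dictionary look-up view described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`momentum_update`, `info_nce`), the momentum value `m`, the temperature, and the toy 2-D keys are all assumptions chosen for the example.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(key_params, query_params, m=0.999):
    """Moving-average update (assumed form): key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive_key, queue, temperature=0.07):
    """Contrastive loss for one query: the positive key against all queued keys."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in queue]
    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


# Fixed-size dictionary: appending beyond maxlen drops the oldest key.
queue = deque(maxlen=3)
for key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(normalize(key))

q = normalize([1.0, 0.1])
loss_good = info_nce(q, normalize([1.0, 0.0]), queue)   # positive close to the query
loss_bad = info_nce(q, normalize([-1.0, 0.0]), queue)   # positive far from the query
```

A `deque` with `maxlen` is used here only because it gives first-in-first-out replacement for free; the loss is lower when the positive key is closer to the query than the queued keys are.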
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation.
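To make the idea of a pixel-level contrastive loss over corresponding local features concrete, here is a rough sketch. The abstract does not say how correspondences are chosen, so matching each local feature to its most similar feature in the other view is an assumption of this example, as are the function names, the temperature, and the toy features.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def match_local_features(view_a, view_b):
    """For each local feature of view A, index of the most similar feature of view B."""
    return [max(range(len(view_b)), key=lambda j: cosine(f, view_b[j])) for f in view_a]


def dense_contrastive_loss(view_a, view_b, negatives, temperature=0.2):
    """Average per-location InfoNCE: matched feature as positive, `negatives` as the rest."""
    matches = match_local_features(view_a, view_b)
    total = 0.0
    for feature, j in zip(view_a, matches):
        logits = [cosine(feature, view_b[j]) / temperature]
        logits += [cosine(feature, n) / temperature for n in negatives]
        top = max(logits)
        total += top + math.log(sum(math.exp(l - top) for l in logits)) - logits[0]
    return total / len(view_a)


# Toy local features (one vector per spatial location) for two views of one image.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.0, 1.0], [-1.0, 0.1], [1.0, 0.1]]
negatives = [[0.5, 0.5], [0.0, -1.0]]  # hypothetical features from other images

matches = match_local_features(view_a, view_b)
loss = dense_contrastive_loss(view_a, view_b, negatives)
```

The loss has the same form as an image-level contrastive loss, only averaged over spatial locations instead of computed once per image.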
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
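The "swapped" prediction can be illustrated with a small sketch. It assumes the cluster codes for each view are already available (the online clustering step that produces them is not shown), and the function names, temperature, and toy scores are hypothetical.

```python
import math


def softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(code, probabilities):
    """Cross-entropy between a target code and predicted cluster probabilities."""
    return -sum(c * math.log(p) for c, p in zip(code, probabilities) if c > 0)


def swapped_prediction_loss(scores_1, scores_2, code_1, code_2, temperature=0.1):
    """Predict the code of view 2 from view 1's scores, and vice versa."""
    p1 = softmax(scores_1, temperature)
    p2 = softmax(scores_2, temperature)
    return cross_entropy(code_2, p1) + cross_entropy(code_1, p2)


# Similarity of each view's representation to three cluster prototypes (toy values).
scores_1 = [0.9, 0.1, -0.2]
scores_2 = [0.8, 0.0, 0.1]

# Both views assigned to cluster 0: the assignments are consistent.
loss_consistent = swapped_prediction_loss(scores_1, scores_2, [1, 0, 0], [1, 0, 0])
# View 2 assigned to cluster 1 instead: view 1's scores predict it poorly.
loss_inconsistent = swapped_prediction_loss(scores_1, scores_2, [1, 0, 0], [0, 1, 0])
```

No pairwise feature comparison appears anywhere: each view is only compared against the cluster prototypes, which is the point the abstract makes about avoiding explicit pairwise comparisons.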
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the-art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. * Equal contribution; the order of first authors was randomly selected.
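The two ingredients described above, predicting the target representation and updating the target as a slow-moving average, can be sketched as follows. The abstract does not give the loss, so the normalized squared-error form used here is an assumption; the function names and the decay rate `tau` are likewise illustrative.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def prediction_loss(online_prediction, target_projection):
    """Squared distance between unit-normalized vectors, which equals 2 - 2 * cosine."""
    return 2.0 - 2.0 * cosine(online_prediction, target_projection)


def update_target(target_weights, online_weights, tau=0.99):
    """Slow-moving average: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_weights, online_weights)]


# The target drifts toward the online weights; tau=0.5 is used only to make it visible.
target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(3):
    target = update_target(target, online, tau=0.5)

aligned = prediction_loss([1.0, 0.0], [2.0, 0.0])    # same direction
opposed = prediction_loss([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```

Note that no negative pairs enter the loss: it depends only on the online prediction and the target projection of the same image.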
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies, either at the image or the feature level, improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e. the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing memory requirements, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection, and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method. Project page: https://europe.naverlabs.com/mochi 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
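One plausible reading of feature-level hard negative mixing is sketched below: rank the negatives by similarity to the query, keep the hardest few, and synthesize new negatives as normalized convex combinations of random pairs. The function name, `top_k`, `num_synthetic`, and the sampling scheme are assumptions for illustration, not the paper's exact procedure.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mix_hard_negatives(query, negatives, top_k=3, num_synthetic=3, seed=0):
    """Create synthetic negatives by mixing pairs of the hardest existing ones."""
    rng = random.Random(seed)
    # Hardest negatives are those most similar to the query.
    hardest = sorted(negatives, key=lambda n: dot(query, n), reverse=True)[:top_k]
    synthetic = []
    for _ in range(num_synthetic):
        a, b = rng.sample(hardest, 2)
        alpha = rng.random()
        mixed = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
        # Re-normalize so the synthetic point lies on the unit sphere like the others.
        synthetic.append(normalize(mixed))
    return synthetic


query = [1.0, 0.0]
negatives = [normalize(v) for v in (
    [1.0, 0.2], [1.0, -0.3], [0.5, 1.0], [-1.0, 0.0], [0.0, 1.0], [0.8, 0.6],
)]
synthetic = mix_hard_negatives(query, negatives)
```

Because the mixing happens on already-computed embeddings, it adds only a sort and a few vector operations per query, consistent with the "minimal computational overhead" claim.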
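This paper presents Dense Siamese Network (DenseSiam), a simple unsupervised learning framework for dense prediction tasks. It learns visual representations by maximizing the similarity between two views of one image with two types of consistency, i.e., pixel consistency and region consistency. Concretely, DenseSiam first maximizes the pixel-level spatial consistency according to the exact location correspondence in the overlapped area. It also extracts a batch of region embeddings that correspond to some sub-regions in the overlapped area to form the region consistency. In contrast to previous methods that require negative pixel pairs, momentum encoders or heuristic masks, DenseSiam benefits from the simple Siamese network and optimizes the consistency of different granularities. It also shows that simple location correspondence and interacted region embeddings are effective enough to learn the similarity. We apply DenseSiam on ImageNet and obtain competitive improvements on various downstream tasks. We also show that, with only some extra task-specific losses, the simple framework can directly conduct dense prediction tasks. On an existing unsupervised semantic segmentation benchmark, it surpasses state-of-the-art segmentation methods by 2.1 mIoU with 28% of the training cost. Code and models are released at https://github.com/zwwwayne/densesiam.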
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.
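A contrastive loss that needs no memory bank can be computed purely within a batch, as in the sketch below. The abstract does not spell out the loss, so the normalized, temperature-scaled cross-entropy used here is an assumption, as are the function name, the temperature, and the convention that items `2k` and `2k + 1` are the two views of image `k`.

```python
import math


def nt_xent(embeddings, temperature=0.5):
    """In-batch contrastive loss; items 2k and 2k+1 are two views of image k."""
    if len(embeddings) % 2 != 0:
        raise ValueError("expected an even number of embeddings (two views per image)")
    unit = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    n = len(unit)
    total = 0.0
    for i in range(n):
        positive = i ^ 1  # index of the other view of the same image
        # Every other item in the batch is a candidate; the item itself is excluded.
        logits = {
            k: sum(a * b for a, b in zip(unit[i], unit[k])) / temperature
            for k in range(n) if k != i
        }
        top = max(logits.values())
        log_denominator = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / n


# Views of the same image are adjacent and similar: low loss.
aligned = nt_xent([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
# Same vectors, but paired with views of the other image: higher loss.
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
```

Since negatives come only from the current batch, a larger batch directly means more negatives per example, which fits finding (3) above.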
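Self-supervised learning (SSL) holds the promise of leveraging large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that make SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while surpassing supervised pre-training for video segmentation on DAVIS.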
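Self-supervised learning (SSL) methods achieve great success by maximizing the mutual information between two augmented views, where cropping is a popular augmentation technique. Cropped regions are widely used to construct positive pairs, while the regions left after cropping have rarely been explored in existing methods, although together they constitute the same image instance and both contribute to the description of the category. In this paper, we make the first attempt to demonstrate the importance of both regions from a complete perspective and propose a simple yet effective pretext task called Region Contrastive Learning (RegionCL). Specifically, given two different images, we randomly crop a region of the same size from each image (called the paste view) and swap them to compose two new images together with the remaining regions (called the canvas view). Contrastive pairs can then be provided according to the following simple criteria: each view is (1) positive with views augmented from the same original image and (2) negative with views augmented from other images. With minor modifications to popular SSL methods, RegionCL exploits these abundant pairs and helps the model distinguish the region features from both canvas and paste views, thereby learning better visual representations. Experiments on ImageNet, COCO, and Cityscapes show that RegionCL improves MoCo v2, DenseCL, and SimSiam by large margins and achieves state-of-the-art performance on classification, detection, and segmentation tasks. The code will be available at https://github.com/annbless/regioncl.git.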
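The region swap that produces the paste and canvas views can be sketched on toy images stored as nested lists. As a simplification (an assumption of this example), the same crop location is used in both images; the function name and arguments are hypothetical.

```python
def swap_regions(image_a, image_b, top, left, height, width):
    """Swap one equally sized rectangle between two images.

    Returns two composites: each keeps its own canvas (remaining region) and
    receives the other image's crop (paste view). Inputs are not modified.
    """
    composite_a = [row[:] for row in image_a]
    composite_b = [row[:] for row in image_b]
    for r in range(top, top + height):
        for c in range(left, left + width):
            composite_a[r][c] = image_b[r][c]
            composite_b[r][c] = image_a[r][c]
    return composite_a, composite_b


# Toy 4x4 "images" whose pixels just record which image they came from.
image_a = [["a"] * 4 for _ in range(4)]
image_b = [["b"] * 4 for _ in range(4)]

composite_a, composite_b = swap_regions(image_a, image_b, top=1, left=1, height=2, width=2)
```

In `composite_a`, the "b" pixels form the paste view and the "a" pixels the canvas view. Following the criteria above, the paste region is positive with other views of `image_b`, and the canvas region is positive with other views of `image_a`.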
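We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features that represent corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. The LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks, object detection and semantic segmentation, using the COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.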
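The task of unsupervised semantic segmentation aims to cluster pixels into semantically meaningful groups. Specifically, pixels assigned to the same cluster should share high-level semantic properties such as their object or part category. This paper presents MaskDistill: a novel framework for unsupervised semantic segmentation based on three key ideas. First, we advocate a data-driven strategy to generate object masks that serve as a pixel grouping prior for semantic segmentation. This approach omits handcrafted priors, which are often designed for specific scene compositions and limit the applicability of competing frameworks. Second, MaskDistill clusters the object masks to obtain pseudo-ground-truth for training an initial object segmentation model. Third, we leverage this model to filter out low-quality object masks. This strategy mitigates the noise in our pixel grouping prior and results in a clean collection of masks, which we use to train a final segmentation model. By combining these components, we can considerably outperform previous works for unsupervised semantic segmentation on PASCAL (+11% mIoU) and COCO (+4% mask AP50). Interestingly, as opposed to existing approaches, our framework does not latch onto low-level image cues and is not limited to object-centric datasets. The code and models will be made available.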
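Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images like ImageNet and pretraining on complex scenes like COCO. This gap exists largely because commonly used random crop augmentations obtain semantically inconsistent content in crowded scene images of diverse objects. Previous works use preprocessing pipelines to localize salient objects for improved cropping, but an end-to-end solution is still elusive. In this work, we propose a framework that accomplishes this goal via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information and simultaneously improve segmentation throughout pretraining. Experiments show that our representations transfer robustly to downstream tasks in classification, detection and segmentation.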
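We focus on better understanding the critical factors of augmentation-invariant representation learning. We revisit MoCo v2 and BYOL and try to verify the following assumption: different frameworks bring about representations with different characteristics even with the same pretext task. We establish the first benchmark for fair comparisons between MoCo v2 and BYOL, and observe: (i) sophisticated model configurations enable better adaptation to the pre-training dataset; (ii) mismatched optimization strategies between pre-training and fine-tuning hinder the model from achieving competitive transfer performance. Given the fair benchmark, we investigate further and find that asymmetry in the network structure enables contrastive frameworks to work well under the linear evaluation protocol, while it may hurt transfer performance on long-tailed classification tasks. Moreover, negative samples do not make models more sensitive to the choice of data augmentations, nor does the asymmetric network structure. We believe our findings provide useful information for future work.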
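Recent advances in self-supervised contrastive learning yield good image-level representations, which favor classification tasks but usually neglect pixel-level detail, leading to unsatisfactory transfer performance on dense prediction tasks such as semantic segmentation. In this work, we propose a pixel-wise contrastive learning method called CP2 (Copy-Paste Contrastive Pretraining), which facilitates both image- and pixel-level representation learning and is therefore more suitable for downstream dense prediction tasks. In detail, we copy-paste a random crop from an image (the foreground) onto different background images and pretrain a semantic segmentation model with the objective of 1) distinguishing the foreground pixels from the background pixels, and 2) identifying the composed images that share the same foreground. Experiments show the strong performance of CP2 in downstream semantic segmentation: by finetuning CP2 pretrained models on PASCAL VOC 2012, we obtain 78.6% mIoU with a ResNet-50 and 79.5% with a ViT-S.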
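Establishing visual correspondence across images is a challenging and essential task. Recently, an influx of self-supervised methods has been proposed to better learn representations for visual correspondence. However, we find that these methods often fail to leverage semantic information and over-rely on the matching of low-level features. In contrast, human vision is capable of distinguishing between distinct objects as a pretext to tracking. Inspired by this paradigm, we propose to learn semantic-aware fine-grained correspondence. First, we demonstrate that semantic correspondence is implicitly available through a rich set of image-level self-supervised methods. We further design a pixel-level self-supervised learning objective that specifically targets fine-grained correspondence. For downstream tasks, we fuse these two kinds of complementary correspondence representations, demonstrating that they boost performance synergistically. Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks, including video object segmentation, human pose tracking, and human part tracking.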
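Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success relies heavily on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-training on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior for discovering object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves downstream performance when more unlabeled scene images are available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data.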
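Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.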