对比性自我监督表示方法学习方法最大程度地提高了正对之间的相似性,同时倾向于最大程度地减少负对之间的相似性。但是,总的来说,负面对之间的相互作用被忽略了,因为它们没有根据其特定差异和相似性而采用的特殊机制来对待负面对。在本文中,我们提出了扩展的动量对比(Xmoco),这是一种基于MOCO家族配置中提出的动量编码单元的遗产,一种自我监督的表示方法。为此,我们引入了交叉一致性正则化损失,并通过该损失将转换一致性扩展到不同图像(负对)。在交叉一致性正则化规则下,我们认为与任何一对图像(正或负)相关的语义表示应在借口转换下保留其交叉相似性。此外,我们通过在批处理上的负面对上实施相似性的均匀分布来进一步规范训练损失。可以轻松地将所提出的正规化添加到现有的自我监督学习算法中。从经验上讲,我们报告了标准Imagenet-1K线性头部分类基准的竞争性能。此外,通过将学习的表示形式转移到常见的下游任务中,我们表明,将Xmoco与普遍使用的增强功能一起使用可以改善此类任务的性能。我们希望本文的发现是研究人员考虑自我监督学习中负面例子的重要相互作用的动机。
translated by 谷歌翻译
事实证明,无监督的表示学习方法在学习目标数据集的视觉语义方面有效。这些方法背后的主要思想是,同一图像的不同视图代表相同的语义。在本文中,我们进一步引入了一个附加模块,以促进对样品之间空间跨相关性的知识注入。反过来,这导致了类内部信息的提炼,包括特征级别的位置和同类实例之间的相似性。建议的附加组件可以添加到现有方法中,例如SWAV。稍后,我们可以删除用于推理的附加模块,而无需修改学识的权重。通过一系列广泛的经验评估,我们验证我们的方法在检测类激活图,TOP-1分类准确性和下游任务(例如对象检测)的情况下会提高性能,并具有不同的配置设置。
translated by 谷歌翻译
我们通过以端到端的方式对大规模未标记的数据集进行分类,呈现扭曲,简单和理论上可解释的自我监督的表示学习方法。我们使用Softmax操作终止的暹罗网络,以产生两个增强图像的双类分布。没有监督,我们强制执行不同增强的班级分布。但是,只需最小化增强之间的分歧将导致折叠解决方案,即,输出所有图像的相同类概率分布。在这种情况下,留下有关输入图像的信息。为了解决这个问题,我们建议最大化输入和课程预测之间的互信息。具体地,我们最小化每个样品的分布的熵,使每个样品的课程预测是对每个样品自信的预测,并最大化平均分布的熵,以使不同样品的预测变得不同。以这种方式,扭曲可以自然地避免没有特定设计的折叠解决方案,例如非对称网络,停止梯度操作或动量编码器。因此,扭曲优于各种任务的最先进的方法。特别是,在半监督学习中,扭曲令人惊讶地表现出令人惊讶的是,使用Reset-50作为骨干的1%ImageNet标签实现61.2%的顶级精度,以前的最佳结果为6.2%。代码和预先训练的模型是给出的:https://github.com/byteDance/twist
translated by 谷歌翻译
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [29] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
translated by 谷歌翻译
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
translated by 谷歌翻译
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation.
translated by 谷歌翻译
本文介绍了密集的暹罗网络(Denseiam),这是一个简单的无监督学习框架,用于密集的预测任务。它通过以两种类型的一致性(即像素一致性和区域一致性)之间最大化一个图像的两个视图之间的相似性来学习视觉表示。具体地,根据重叠区域中的确切位置对应关系,Denseiam首先最大化像素级的空间一致性。它还提取一批与重叠区域中某些子区域相对应的区域嵌入,以形成区域一致性。与以前需要负像素对,动量编码器或启发式面膜的方法相反,Denseiam受益于简单的暹罗网络,并优化了不同粒度的一致性。它还证明了简单的位置对应关系和相互作用的区域嵌入足以学习相似性。我们将Denseiam应用于ImageNet,并在各种下游任务上获得竞争性改进。我们还表明,只有在一些特定于任务的损失中,简单的框架才能直接执行密集的预测任务。在现有的无监督语义细分基准中,它以2.1 miou的速度超过了最新的细分方法,培训成本为28%。代码和型号在https://github.com/zwwwayne/densesiam上发布。
translated by 谷歌翻译
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies, either at the image or the feature level, improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e. the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing memory requirements, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection, and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.Project page: https://europe.naverlabs.com/mochi 34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected.
translated by 谷歌翻译
我们专注于更好地理解增强不变代表性学习的关键因素。我们重新访问moco v2和byol,并试图证明以下假设的真实性:不同的框架即使具有相同的借口任务也会带来不同特征的表示。我们建立了MoCo V2和BYOL之间公平比较的第一个基准,并观察:(i)复杂的模型配置使得可以更好地适应预训练数据集; (ii)从实现竞争性转移表演中获得的预训练和微调阻碍模型的优化策略不匹配。鉴于公平的基准,我们进行进一步的研究并发现网络结构的不对称性赋予对比框架在线性评估协议下正常工作,同时可能会损害长尾分类任务的转移性能。此外,负样本并不能使模型更明智地选择数据增强,也不会使不对称网络结构结构。我们相信我们的发现为将来的工作提供了有用的信息。
translated by 谷歌翻译
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.
translated by 谷歌翻译
我们提出了一种适用于半全球任务的自学学习(SSL)方法,例如对象检测和语义分割。我们通过在训练过程中最大程度地减少像素级局部对比度(LC)损失,代表了同一图像转换版本的相应图像位置之间的局部一致性。可以将LC-LOSS添加到以最小开销的现有自我监督学习方法中。我们使用可可,Pascal VOC和CityScapes数据集评估了两个下游任务的SSL方法 - 对象检测和语义细分。我们的方法的表现优于现有的最新SSL方法可可对象检测的方法1.9%,Pascal VOC检测1.4%,而CityScapes Sementation则为0.6%。
translated by 谷歌翻译
许多最近的自我监督学习方法在图像分类和其他任务上表现出了令人印象深刻的表现。已经使用了一种令人困惑的多种技术,并不总是清楚地了解其收益的原因,尤其是在组合使用时。在这里,我们将图像的嵌入视为点粒子,并将模型优化视为该粒子系统上的动态过程。我们的动态模型结合了类似图像的吸引力,避免局部崩溃的局部分散力以及实现颗粒的全球均匀分布的全局分散力。动态透视图突出了使用延迟参数图像嵌入(a la byol)以及同一图像的多个视图的优点。它还使用纯动态的局部分散力(布朗运动),该分散力比其他方法显示出改善的性能,并且不需要其他粒子坐标的知识。该方法称为MSBREG,代表(i)多视质心损失,它施加了吸引力的力来将不同的图像视图嵌入到其质心上,(ii)奇异值损失,将粒子系统推向空间均匀的密度( iii)布朗扩散损失。我们评估MSBREG在ImageNet上的下游分类性能以及转移学习任务,包括细粒度分类,多类对象分类,对象检测和实例分段。此外,我们还表明,将我们的正则化术语应用于其他方法,进一步改善了其性能并通过防止模式崩溃来稳定训练。
translated by 谷歌翻译
We present DetCo, a simple yet effective self-supervised approach for object detection. Unsupervised pre-training methods have been recently designed for object detection, but they are usually deficient in image classification, or the opposite. Unlike them, DetCo transfers well on downstream instance-level dense prediction tasks, while maintaining competitive image-level classification accuracy. The advantages are derived from (1) multi-level supervision to intermediate representations, (2) contrastive learning between global image and local patches. These two designs facilitate discriminative and consistent global and local representation at each level of feature pyramid, improving detection and classification, simultaneously.Extensive experiments on VOC, COCO, Cityscapes, and ImageNet demonstrate that DetCo not only outperforms recent methods on a series of 2D and 3D instance-level detection tasks, but also competitive on image classification. For example, on ImageNet classification, DetCo is 6.9% and 5.0% top-1 accuracy better than InsLoc and DenseCL, which are two contemporary works designed for object detection. Moreover, on COCO detection, DetCo is 6.9 AP better than SwAV with Mask R-CNN C4. Notably, DetCo largely boosts up Sparse R-CNN, a recent strong detector, from 45.0 AP to 46.5 AP (+1.5 AP), establishing a new SOTA on COCO. Code is available.
translated by 谷歌翻译
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive selfsupervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by Sim-CLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-ofthe-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1
translated by 谷歌翻译
使用超越欧几里德距离的神经网络,深入的Bregman分歧测量数据点的分歧,并且能够捕获分布的发散。在本文中,我们提出了深深的布利曼对视觉表现的对比学习的分歧,我们的目标是通过基于功能Bregman分歧培训额外的网络来提高自我监督学习中使用的对比损失。与完全基于单点之间的分歧的传统对比学学习方法相比,我们的框架可以捕获分布之间的发散,这提高了学习表示的质量。我们展示了传统的对比损失和我们提出的分歧损失优于基线的结合,并且最先前的自我监督和半监督学习的大多数方法在多个分类和对象检测任务和数据集中。此外,学习的陈述在转移到其他数据集和任务时概括了良好。源代码和我们的型号可用于补充,并将通过纸张释放。
translated by 谷歌翻译
对比度学习(CL)方法有效地学习数据表示,而无需标记监督,在该方法中,编码器通过单VS-MONY SOFTMAX跨透镜损失将每个正样本在多个负样本上对比。通过利用大量未标记的图像数据,在Imagenet上预先训练时,最近的CL方法获得了有希望的结果,这是一个具有均衡图像类的曲制曲线曲线集。但是,当对野外图像进行预训练时,它们往往会产生较差的性能。在本文中,为了进一步提高CL的性能并增强其对未经保育数据集的鲁棒性,我们提出了一种双重的CL策略,该策略将其内部查询的正(负)样本对比,然后才能决定多么强烈地拉动(推)。我们通过对比度吸引力和对比度排斥(CACR)意识到这一策略,这使得查询不仅发挥了更大的力量来吸引更遥远的正样本,而且可以驱除更接近的负面样本。理论分析表明,CACR通过考虑正/阴性样品的分布之间的差异来概括CL的行为,而正/负样品的分布通常与查询独立进行采样,并且它们的真实条件分布给出了查询。我们证明了这种独特的阳性吸引力和阴性排斥机制,这有助于消除在数据集的策划较低时尤其有益于数据及其潜在表示的统一先验分布的需求。对许多标准视觉任务进行的大规模大规模实验表明,CACR不仅在表示学习中的基准数据集上始终优于现有的CL方法,而且在对不平衡图像数据集进行预训练时,还表现出更好的鲁棒性。
translated by 谷歌翻译
尽管增加了大量的增强家庭,但只有几个樱桃采摘的稳健增强政策有利于自我监督的图像代表学习。在本文中,我们提出了一个定向自我监督的学习范式(DSSL),其与显着的增强符号兼容。具体而言,我们在用标准增强的视图轻度增强后调整重增强策略,以产生更难的视图(HV)。 HV通常具有与原始图像较高的偏差而不是轻度增强的标准视图(SV)。与以前的方法不同,同等对称地将所有增强视图对称地最大化它们的相似性,DSSL将相同实例的增强视图视为部分有序集(具有SV $ \ LeftrightArrow $ SV,SV $ \左路$ HV),然后装备一个定向目标函数尊重视图之间的衍生关系。 DSSL可以轻松地用几行代码实现,并且对于流行的自我监督学习框架非常灵活,包括SIMCLR,Simsiam,Byol。对CiFar和Imagenet的广泛实验结果表明,DSSL可以稳定地改善各种基线,其兼容性与更广泛的增强。
translated by 谷歌翻译
In contrastive self-supervised learning, the common way to learn discriminative representation is to pull different augmented "views" of the same image closer while pushing all other images further apart, which has been proven to be effective. However, it is unavoidable to construct undesirable views containing different semantic concepts during the augmentation procedure. It would damage the semantic consistency of representation to pull these augmentations closer in the feature space indiscriminately. In this study, we introduce feature-level augmentation and propose a novel semantics-consistent feature search (SCFS) method to mitigate this negative effect. The main idea of SCFS is to adaptively search semantics-consistent features to enhance the contrast between semantics-consistent regions in different augmentations. Thus, the trained model can learn to focus on meaningful object regions, improving the semantic representation ability. Extensive experiments conducted on different datasets and tasks demonstrate that SCFS effectively improves the performance of self-supervised learning and achieves state-of-the-art performance on different downstream tasks.
translated by 谷歌翻译
对比度学习最近在无监督的视觉表示学习中显示出巨大的潜力。在此轨道中的现有研究主要集中于图像内不变性学习。学习通常使用丰富的图像内变换来构建正对,然后使用对比度损失最大化一致性。相反,相互影响不变性的优点仍然少得多。利用图像间不变性的一个主要障碍是,尚不清楚如何可靠地构建图像间的正对,并进一步从它们中获得有效的监督,因为没有配对注释可用。在这项工作中,我们提出了一项全面的实证研究,以更好地了解从三个主要组成部分的形象间不变性学习的作用:伪标签维护,采样策略和决策边界设计。为了促进这项研究,我们引入了一个统一的通用框架,该框架支持无监督的内部和间形内不变性学习的整合。通过精心设计的比较和分析,揭示了多个有价值的观察结果:1)在线标签收敛速度比离线标签更快; 2)半硬性样品比硬否定样品更可靠和公正; 3)一个不太严格的决策边界更有利于形象间的不变性学习。借助所有获得的食谱,我们的最终模型(即InterCLR)对多个标准基准测试的最先进的内图内不变性学习方法表现出一致的改进。我们希望这项工作将为设计有效的无监督间歇性不变性学习提供有用的经验。代码:https://github.com/open-mmlab/mmselfsup。
translated by 谷歌翻译