几乎所有用于计算机视觉任务的最先进的神经网络都受到(1)在目标数据集上的大规模数据集和(2)FINETUNING上的预培训(1)预培训。该策略有助于减少对目标数据集的依赖,并提高目标任务的收敛速率和泛化。虽然对大型数据集进行预训练非常有用,但其最重要的缺点是高培训成本。要解决此问题,我们提出了有效的过滤方法,以从训练前的数据集中选择相关子集。此外,我们发现,在训练前的图像分辨率降低图像分辨率在成本和性能之间提供了很大的权衡。我们通过在无监督和监督设置中的想象中进行预测,并在各种目标数据集和任务集合中进行预测,通过预先培训来验证我们的技术。我们提出的方法大大降低了预训练成本并提供了强大的性能提升。最后,我们通过在我们的子集上调整可用模型来提高标准ImageNet预培训1-3%,并在从更大的规模数据集中过滤的数据集上进行预训练。
translated by 谷歌翻译
带有像素天标签的注释图像是耗时和昂贵的过程。最近,DataSetGan展示了有希望的替代方案 - 通过利用一小组手动标记的GaN生成的图像来通过生成的对抗网络(GAN)来综合大型标记数据集。在这里,我们将DataSetGan缩放到ImageNet类别的规模。我们从ImageNet上训练的类条件生成模型中拍摄图像样本,并为所有1K类手动注释每个类的5张图像。通过在Biggan之上培训有效的特征分割架构,我们将Bigan转换为标记的DataSet生成器。我们进一步表明,VQGan可以类似地用作数据集生成器,利用已经注释的数据。我们通过在各种设置中标记一组8K实图像并在各种设置中评估分段性能来创建一个新的想象因基准。通过广泛的消融研究,我们展示了利用大型生成的数据集来培训在像素 - 明智的任务上培训不同的监督和自我监督的骨干模型的大增益。此外,我们证明,使用我们的合成数据集进行预培训,以改善在几个下游数据集上的标准Imagenet预培训,例如Pascal-VOC,MS-Coco,Citycapes和Chink X射线以及任务(检测,细分)。我们的基准将公开并维护一个具有挑战性的任务的排行榜。项目页面:https://nv-tlabs.github.io/big-dataseTgan/
translated by 谷歌翻译
大规模数据集的预培训模型,如想象成,是计算机视觉中的标准实践。此范例对于具有小型培训套的任务特别有效,其中高容量模型往往会过度装备。在这项工作中,我们考虑一个自我监督的预训练场景,只能利用目标任务数据。我们考虑数据集,如斯坦福汽车,草图或可可,这是比想象成小的数量的顺序。我们的研究表明,在本文中介绍的Beit或诸如Beit或Variant的去噪对预训练数据的类型和大小比通过比较图像嵌入来训练的流行自我监督方法更加强大。我们获得了竞争性能与ImageNet预训练相比,来自不同域的各种分类数据集。在Coco上,当专注于使用Coco Images进行预训练时,检测和实例分割性能超过了可比设置中的监督Imagenet预训练。
translated by 谷歌翻译
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation.
translated by 谷歌翻译
Self-supervised visual representation learning has seen huge progress recently, but no large scale evaluation has compared the many models now available. We evaluate the transfer performance of 13 top self-supervised models on 40 downstream tasks, including many-shot and few-shot recognition, object detection, and dense prediction. We compare their performance to a supervised baseline and show that on most tasks the best self-supervised models outperform supervision, confirming the recently observed trend in the literature. We find ImageNet Top-1 accuracy to be highly correlated with transfer to many-shot recognition, but increasingly less so for few-shot, object detection and dense prediction. No single self-supervised method dominates overall, suggesting that universal pre-training is still unsolved. Our analysis of features suggests that top self-supervised learners fail to preserve colour information as well as supervised alternatives, but tend to induce better classifier calibration, and less attentive overfitting than supervised learners.
translated by 谷歌翻译
Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies, either at the image or the feature level, improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e. the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing memory requirements, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection, and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.Project page: https://europe.naverlabs.com/mochi 34th Conference on Neural Information Processing Systems (NeurIPS 2020),
translated by 谷歌翻译
主动学习旨在选择最具信息丰富的样本,以利用有限的注释预算。大多数现有的工作通过分别在每个数据集上多次重复耗时的模型训练和批量数据选择,遵循麻烦的管道。通过提出本文提出新的一般和有效的主动学习(GEAL)方法,挑战该地位QUO。利用预先培训的大型数据集预先培训的公开模型,我们的方法可以在不同的数据集中对具有相同模型的单通推断进行数据选择过程。为了捕获图像内的微妙本地信息,我们提出了从预先训练网络的中间特征中容易地提取的知识集群。而不是麻烦的批量选择策略,通过在细粒度知识集群级别执行K中心贪婪来选择所有数据样本。整个过程只需要单通式模型推论而不培训或监督,使我们的方法在时间复杂程度明显优于现有技术,从而长达数百次。广泛的实验越来越展示了我们对物体检测,语义分割,深度估计和图像分类方法的有希望的性能。
translated by 谷歌翻译
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
translated by 谷歌翻译
最近自我监督学习成功的核心组成部分是裁剪数据增强,其选择要在自我监督损失中用作正视图的图像的子区域。底层假设是给定图像的随机裁剪和调整大小的区域与感兴趣对象的信息共享信息,其中学习的表示将捕获。这种假设在诸如想象网的数据集中大多满足,其中存在大,以中心为中心的对象,这很可能存在于完整图像的随机作物中。然而,在诸如OpenImages或Coco的其他数据集中,其更像是真实世界未保健数据的代表,通常存在图像中的多个小对象。在这项工作中,我们表明,基于通常随机裁剪的自我监督学习在此类数据集中表现不佳。我们提出用从对象提案算法获得的作物取代一种或两种随机作物。这鼓励模型学习对象和场景级别语义表示。使用这种方法,我们调用对象感知裁剪,导致对分类和对象检测基准的场景裁剪的显着改进。例如,在OpenImages上,我们的方法可以使用基于Moco-V2的预训练来实现8.8%的提高8.8%地图。我们还显示了对Coco和Pascal-Voc对象检测和分割任务的显着改善,通过最先进的自我监督的学习方法。我们的方法是高效,简单且通用的,可用于最现有的对比和非对比的自我监督的学习框架。
translated by 谷歌翻译
Jitendra Malik once said, "Supervision is the opium of the AI researcher". Most deep learning techniques heavily rely on extreme amounts of human labels to work effectively. In today's world, the rate of data creation greatly surpasses the rate of data annotation. Full reliance on human annotations is just a temporary means to solve current closed problems in AI. In reality, only a tiny fraction of data is annotated. Annotation Efficient Learning (AEL) is a study of algorithms to train models effectively with fewer annotations. To thrive in AEL environments, we need deep learning techniques that rely less on manual annotations (e.g., image, bounding-box, and per-pixel labels), but learn useful information from unlabeled data. In this thesis, we explore five different techniques for handling AEL.
translated by 谷歌翻译
转移学习可以在源任务上重新使用知识来帮助学习目标任务。一种简单的转移学习形式在当前的最先进的计算机视觉模型中是常见的,即预先训练ILSVRC数据集上的图像分类模型,然后在任何目标任务上进行微调。然而,先前对转移学习的系统研究已经有限,并且预计工作的情况并不完全明白。在本文中,我们对跨越不同的图像域进行了广泛的转移学习实验探索(消费者照片,自主驾驶,空中图像,水下,室内场景,合成,特写镜头)和任务类型(语义分割,物体检测,深度估计,关键点检测)。重要的是,这些都是与现代计算机视觉应用相关的复杂的结构化的输出任务类型。总共执行超过2000年的转移学习实验,包括许多来源和目标来自不同的图像域,任务类型或两者。我们系统地分析了这些实验,了解图像域,任务类型和数据集大小对传输学习性能的影响。我们的研究导致了几个见解和具体建议:(1)对于大多数任务,存在一个显着优于ILSVRC'12预培训的来源; (2)图像领域是实现阳性转移的最重要因素; (3)源数据集应该\ \ emph {include}目标数据集的图像域以获得最佳结果; (4)与此同时,当源任务的图像域比目标的图像域时,我们只观察小的负面影响; (5)跨任务类型的转移可能是有益的,但其成功严重依赖于源和目标任务类型。
translated by 谷歌翻译
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pretraining) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-theart results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.
translated by 谷歌翻译
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -from 1 example per class to 1 M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
translated by 谷歌翻译
本文介绍了密集的暹罗网络(Denseiam),这是一个简单的无监督学习框架,用于密集的预测任务。它通过以两种类型的一致性(即像素一致性和区域一致性)之间最大化一个图像的两个视图之间的相似性来学习视觉表示。具体地,根据重叠区域中的确切位置对应关系,Denseiam首先最大化像素级的空间一致性。它还提取一批与重叠区域中某些子区域相对应的区域嵌入,以形成区域一致性。与以前需要负像素对,动量编码器或启发式面膜的方法相反,Denseiam受益于简单的暹罗网络,并优化了不同粒度的一致性。它还证明了简单的位置对应关系和相互作用的区域嵌入足以学习相似性。我们将Denseiam应用于ImageNet,并在各种下游任务上获得竞争性改进。我们还表明,只有在一些特定于任务的损失中,简单的框架才能直接执行密集的预测任务。在现有的无监督语义细分基准中,它以2.1 miou的速度超过了最新的细分方法,培训成本为28%。代码和型号在https://github.com/zwwwayne/densesiam上发布。
translated by 谷歌翻译
Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pretraining not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning. Code is available at https://github.com/zdaxie/PixPro.
translated by 谷歌翻译
Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and thus is difficult to scale. In this work, we first tackle a problem in large scale FGVC. Our method won first place in iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of training data. Next, we study transfer learning via fine-tuning from large scale datasets to small scale, domainspecific FGVC datasets. We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning outperforms Im-ageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
translated by 谷歌翻译
Recently, the self-supervised pre-training paradigm has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance. However, increasing the scale of unlabeled pre-training data in real-world scenarios requires prohibitive computational costs and faces the challenge of uncurated samples. To address these issues, we build a task-specific self-supervised pre-training framework from a data selection perspective based on a simple hypothesis that pre-training on the unlabeled samples with similar distribution to the target task can bring substantial performance gains. Buttressed by the hypothesis, we propose the first yet novel framework for Scalable and Efficient visual Pre-Training (SEPT) by introducing a retrieval pipeline for data selection. SEPT first leverage a self-supervised pre-trained model to extract the features of the entire unlabeled dataset for retrieval pipeline initialization. Then, for a specific target task, SEPT retrievals the most similar samples from the unlabeled dataset based on feature similarity for each target instance for pre-training. Finally, SEPT pre-trains the target model with the selected unlabeled samples in a self-supervised manner for target data finetuning. By decoupling the scale of pre-training and available upstream data for a target task, SEPT achieves high scalability of the upstream dataset and high efficiency of pre-training, resulting in high model architecture flexibility. Results on various downstream tasks demonstrate that SEPT can achieve competitive or even better performance compared with ImageNet pre-training while reducing the size of training samples by one magnitude without resorting to any extra annotations.
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected.
translated by 谷歌翻译
We present DetCo, a simple yet effective self-supervised approach for object detection. Unsupervised pre-training methods have been recently designed for object detection, but they are usually deficient in image classification, or the opposite. Unlike them, DetCo transfers well on downstream instance-level dense prediction tasks, while maintaining competitive image-level classification accuracy. The advantages are derived from (1) multi-level supervision to intermediate representations, (2) contrastive learning between global image and local patches. These two designs facilitate discriminative and consistent global and local representation at each level of feature pyramid, improving detection and classification, simultaneously.Extensive experiments on VOC, COCO, Cityscapes, and ImageNet demonstrate that DetCo not only outperforms recent methods on a series of 2D and 3D instance-level detection tasks, but also competitive on image classification. For example, on ImageNet classification, DetCo is 6.9% and 5.0% top-1 accuracy better than InsLoc and DenseCL, which are two contemporary works designed for object detection. Moreover, on COCO detection, DetCo is 6.9 AP better than SwAV with Mask R-CNN C4. Notably, DetCo largely boosts up Sparse R-CNN, a recent strong detector, from 45.0 AP to 46.5 AP (+1.5 AP), establishing a new SOTA on COCO. Code is available.
translated by 谷歌翻译
自我监督学习(SSL)的承诺是利用大量未标记的数据来解决复杂的任务。尽管简单,图像级学习取得了出色的进步,但最新方法显示出包括图像结构知识的优势。但是,通过引入手工制作的图像分割来定义感兴趣的区域或专门的增强策略,这些方法牺牲了使SSL如此强大的简单性和通用性。取而代之的是,我们提出了一个自我监督的学习范式,该学习范式本身会发现这种图像结构。我们的方法,ODIN,夫妻对象发现和表示网络,以发现有意义的图像分割,而无需任何监督。由此产生的学习范式更简单,更易碎,更一般,并且取得了最先进的转移学习结果,以进行对象检测和实例对可可的细分,以及对Pascal和CityScapes的语义细分,同时超过监督的预先培训,用于戴维斯的视频细分。
translated by 谷歌翻译