以前的工作提出了许多新的损失函数和常规程序,可提高图像分类任务的测试准确性。但是,目前尚不清楚这些损失函数是否了解下游任务的更好表示。本文研究了培训目标的选择如何影响卷积神经网络隐藏表示的可转移性,训练在想象中。我们展示了许多目标在Vanilla Softmax交叉熵上导致想象的精度有统计学意义的改进,但由此产生的固定特征提取器转移到下游任务基本较差,并且当网络完全微调时,损失的选择几乎没有效果新任务。使用居中内核对齐来测量网络隐藏表示之间的相似性,我们发现损失函数之间的差异仅在网络的最后几层中都很明显。我们深入了解倒数第二层的陈述,发现不同的目标和近奇计的组合导致大幅不同的类别分离。具有较高类别分离的表示可以在原始任务上获得更高的准确性,但它们的功能对于下游任务不太有用。我们的结果表明,用于原始任务的学习不变功能与传输任务相关的功能之间存在权衡。
translated by 谷歌翻译
Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from Ima-geNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
translated by 谷歌翻译
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive selfsupervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by Sim-CLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-ofthe-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1
translated by 谷歌翻译
We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks. Our implementation and pretrained models are given on GitHub. 3 * Equal contribution; the order of first authors was randomly selected.
translated by 谷歌翻译
最大化模型准确性的常规配方是(1)具有各种超参数的多个模型,以及(2)选择在固定验证集中表现最佳的单个模型,从而丢弃其余部分。在本文中,我们在微调大型预训练的模型的背景下重新审视了该过程的第二步,其中微调模型通常位于单个低误差盆地中。我们表明,平均多种模型的权重以不同的超参数配置进行了微调通常提高准确性和鲁棒性。与传统的合奏不同,我们可能会平均许多模型,而不会产生任何其他推理或记忆成本 - 我们将结果称为“模型汤”。当微调大型预训练的模型,例如夹子,Align和VIT-G在JFT上预先训练的VIT-G时,我们的汤食谱可为ImageNet上的超参数扫描中的最佳模型提供显着改进。所得的VIT-G模型在Imagenet上达到90.94%的TOP-1准确性,实现了新的最新状态。此外,我们表明,模型汤方法扩展到多个图像分类和自然语言处理任务,改善分发性能,并改善新下游任务的零局部性。最后,我们通过分析将权重平衡和与logit浓度的性能相似与预测的损失和信心的平坦度联系起来,并经过经验验证这种关系。代码可从https://github.com/mlfoundations/model-soups获得。
translated by 谷歌翻译
With the ever-growing model size and the limited availability of labeled training data, transfer learning has become an increasingly popular approach in many science and engineering domains. For classification problems, this work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, so that it leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. The above results not only demystify many widely used heuristics in model pre-training (e.g., data augmentation, projection head, self-supervised learning), but also leads to more efficient and principled fine-tuning method on downstream tasks that we demonstrate through extensive experimental results.
translated by 谷歌翻译
最近的工作据称,利用Softmax跨熵的分类损失不仅可以用于固定设定的分类任务,而且还通过专门为开放式任务开发的优于开销的损失,包括几次射击学习和检索。使用不同的嵌入几何形状研究了软MAX分类器 - 欧几里德,双曲线和球形,并且已经对一个或另一个的优越性进行了索赔,但它们没有得到精心控制的系统。我们对各种固定设定分类和图像检索任务的软MAX损失嵌入几何的实证研究。对于球形损失观察到的一个有趣的财产导致我们提出了一种基于VON MISES-FISHER分配的概率分类器,我们表明它具有最先进的方法竞争,同时生产出完善的盒子校准。我们提供有关亏损之间的权衡以及如何在其中选择的指导。
translated by 谷歌翻译
执行零摄像推理时(即,在特定数据集上不进行微调)时,大型预训练的模型(例如剪辑或ALIGN)在一系列数据分布中提供一致的精度。尽管现有的微调方法显着提高了给定目标分布的准确性,但它们通常会降低分配变化的稳健性。我们通过引入一种简单有效的方法来提高鲁棒性,同时进行微调:结合零拍和微调模型(Wise-ft)的重量。与标准的微调相比,Wise-FT在分配变化下提供了巨大的准确性提高,同时保留了目标分布的高精度。在Imagenet和五个派生的分布变化上,Wise-FT在先前的工作中提高了分布转移的准确性4至6个百分点(PP),同时将Imagenet精度提高1.6pp。Wise-ft的稳健性相似(2至23 pp),明智之前与七个常用的转移学习数据集的标准微调相比,在一组进一步的分配转移的各种集合中,准确性增长率为0.8至3.3 pp。这些改进在微调或推理期间没有任何额外的计算成本。
translated by 谷歌翻译
The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
translated by 谷歌翻译
我们考虑在给定的分类任务(例如Imagenet-1k(IN1K))上训练深神网络的问题,以便它在该任务以及其他(未来)转移任务方面擅长。这两个看似矛盾的属性在改善模型的概括的同时保持其在原始任务上的性能之间实现了权衡。接受自我监督学习训练的模型(SSL)倾向于比其受监督的转移学习更好地概括。但是,他们仍然落后于In1k上的监督模型。在本文中,我们提出了一个有监督的学习设置,以利用两全其美的方式。我们使用最近的SSL模型的两个关键组成部分丰富了普通的监督培训框架:多尺度农作物用于数据增强和使用可消耗的投影仪。我们用内存库在即时计算的类原型中代替了班级权重的最后一层。我们表明,这三个改进导致IN1K培训任务和13个转移任务之间的权衡取决于更加有利的权衡。在所有探索的配置中,我们都会挑出两种模型:T-Rex实现了转移学习的新状态,并且超过了In1k上的Dino和Paws等最佳方法,以及与高度优化的RSB--相匹配的T-Rex*在IN1K上的A1模型,同时在转移任务上表现更好。项目页面和预估计的模型:https://europe.naverlabs.com/t-rex
translated by 谷歌翻译
深度学习的最新进展依赖于大型标签的数据集来培训大容量模型。但是,以时间和成本效益的方式收集大型数据集通常会导致标签噪声。我们提出了一种从嘈杂的标签中学习的方法,该方法利用特征空间中的训练示例之间的相似性,鼓励每个示例的预测与其最近的邻居相似。与使用多个模型或不同阶段的训练算法相比,我们的方法采用了简单,附加的正规化项的形式。它可以被解释为经典的,偏置标签传播算法的归纳版本。我们在数据集上彻底评估我们的方法评估合成(CIFAR-10,CIFAR-100)和现实(迷你网络,网络vision,Clotsing1m,Mini-Imagenet-Red)噪声,并实现竞争性或最先进的精度,在所有人之间。
translated by 谷歌翻译
使用DataSet的真实标签培训而不是随机标签导致更快的优化和更好的泛化。这种差异归因于自然数据集中的输入和标签之间的对齐概念。我们发现,随机或真正标签上的具有不同架构和优化器的培训神经网络在隐藏的表示和训练标签之间强制执行相同的关系,阐明为什么神经网络表示为转移如此成功。我们首先突出显示为什么对齐的特征在经典的合成转移问题中促进转移和展示,即对齐是对相似和不同意任务的正负传输的确定因素。然后我们调查各种神经网络架构,并发现(a)在各种不同的架构和优化器中出现的对齐,并且从深度(b)对准产生的更多对准对于更接近输出的层和(c)现有的性能深度CNN表现出高级别的对准。
translated by 谷歌翻译
Importance-weighted risk minimization is a key ingredient in many machine learning algorithms for causal inference, domain adaptation, class imbalance, and off-policy reinforcement learning. While the effect of importance weighting is wellcharacterized for low-capacity misspecified models, little is known about how it impacts overparameterized, deep neural networks. Inspired by recent theoretical results showing that on (linearly) separable data, deep linear networks optimized by SGD learn weight-agnostic solutions, we ask, for realistic deep networks, for which many practical datasets are separable, what is the effect of importance weighting? We present the surprising finding that while importance weighting impacts deep nets early in training, so long as the nets are able to separate the training data, its effect diminishes over successive epochs. Moreover, while L2 regularization and batch normalization (but not dropout), restore some of the impact of importance weighting, they express the effect via (seemingly) the wrong abstraction: why should practitioners tweak the L2 regularization, and by how much, to produce the correct weighting effect? We experimentally confirm these findings across a range of architectures and datasets.
translated by 谷歌翻译
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
translated by 谷歌翻译
最近证明,接受SGD训练的神经网络优先依赖线性预测的特征,并且可以忽略复杂的,同样可预测的功能。这种简单性偏见可以解释他们缺乏分布(OOD)的鲁棒性。学习任务越复杂,统计工件(即选择偏见,虚假相关性)的可能性就越大比学习的机制更简单。我们证明可以减轻简单性偏差并改善了OOD的概括。我们使用对其输入梯度对齐的惩罚来训练一组类似的模型以不同的方式拟合数据。我们从理论和经验上展示了这会导致学习更复杂的预测模式的学习。 OOD的概括从根本上需要超出I.I.D.示例,例如多个培训环境,反事实示例或其他侧面信息。我们的方法表明,我们可以将此要求推迟到独立的模型选择阶段。我们获得了SOTA的结果,可以在视觉域偏置数据和概括方面进行视觉识别。该方法 - 第一个逃避简单性偏见的方法 - 突出了需要更好地理解和控制深度学习中的归纳偏见。
translated by 谷歌翻译
最近的结果表明,在训练期间重新升级神经网络参数的子集可以改善泛化,特别是对于小型训练集。我们研究不同重新初始化方法在12个基准图像分类数据集中的几种卷积架构中的影响,分析了它们的潜在收益和突出显示限制。我们还介绍了一种新的层状重新初始化算法,优于先前的方法,并建议观察到的改进的泛化的解释。首先,我们表明,无需增加重量的规范,可以在不增加重量的规范的情况下增加训练示例的余量。因此,导致神经网络的边缘的泛化范围的改善。其次,我们证明它在损失表面的平坦局部最小值中稳定。第三,它鼓励学习一般规则,并通过强调神经网络的下层来劝阻记忆。我们的外带消息是使用自下而上的层状重新初始化的小型数据集可以改善卷积神经网络的准确性,其中重新初始层的数量可能因可用计算预算而变化。
translated by 谷歌翻译
Image classification with small datasets has been an active research area in the recent past. However, as research in this scope is still in its infancy, two key ingredients are missing for ensuring reliable and truthful progress: a systematic and extensive overview of the state of the art, and a common benchmark to allow for objective comparisons between published methods. This article addresses both issues. First, we systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered. Second, we propose a common benchmark that allows for an objective comparison of approaches. It consists of five datasets spanning various domains (e.g., natural images, medical imagery, satellite data) and data types (RGB, grayscale, multispectral). We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues. Surprisingly, we find that thorough hyper-parameter tuning on held-out validation data results in a highly competitive baseline and highlights a stunted growth of performance over the years. Indeed, only a single specialized method dating back to 2019 clearly wins our benchmark and outperforms the baseline classifier.
translated by 谷歌翻译