最大化模型准确性的常规配方是(1)具有各种超参数的多个模型,以及(2)选择在固定验证集中表现最佳的单个模型,从而丢弃其余部分。在本文中,我们在微调大型预训练的模型的背景下重新审视了该过程的第二步,其中微调模型通常位于单个低误差盆地中。我们表明,平均多种模型的权重以不同的超参数配置进行了微调通常提高准确性和鲁棒性。与传统的合奏不同,我们可能会平均许多模型,而不会产生任何其他推理或记忆成本 - 我们将结果称为“模型汤”。当微调大型预训练的模型,例如夹子,Align和VIT-G在JFT上预先训练的VIT-G时,我们的汤食谱可为ImageNet上的超参数扫描中的最佳模型提供显着改进。所得的VIT-G模型在Imagenet上达到90.94%的TOP-1准确性,实现了新的最新状态。此外,我们表明,模型汤方法扩展到多个图像分类和自然语言处理任务,改善分发性能,并改善新下游任务的零局部性。最后,我们通过分析将权重平衡和与logit浓度的性能相似与预测的损失和信心的平坦度联系起来,并经过经验验证这种关系。代码可从https://github.com/mlfoundations/model-soups获得。
translated by 谷歌翻译
执行零摄像推理时(即,在特定数据集上不进行微调)时,大型预训练的模型(例如剪辑或ALIGN)在一系列数据分布中提供一致的精度。尽管现有的微调方法显着提高了给定目标分布的准确性,但它们通常会降低分配变化的稳健性。我们通过引入一种简单有效的方法来提高鲁棒性,同时进行微调:结合零拍和微调模型(Wise-ft)的重量。与标准的微调相比,Wise-FT在分配变化下提供了巨大的准确性提高,同时保留了目标分布的高精度。在Imagenet和五个派生的分布变化上,Wise-FT在先前的工作中提高了分布转移的准确性4至6个百分点(PP),同时将Imagenet精度提高1.6pp。Wise-ft的稳健性相似(2至23 pp),明智之前与七个常用的转移学习数据集的标准微调相比,在一组进一步的分配转移的各种集合中,准确性增长率为0.8至3.3 pp。这些改进在微调或推理期间没有任何额外的计算成本。
translated by 谷歌翻译
Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around \textit{task vectors}. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D", combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.
translated by 谷歌翻译
在许多图像分类任务中,诸如夹子之类的开放式摄影模型具有高精度。但是,在某些设置中,他们的零拍摄性能远非最佳。我们研究模型修补程序,目的是提高对特定任务的准确性,而不会在表现已经足够的任务上降低准确性。为了实现这一目标,我们引入了油漆,这是一种修补方法,该方法在微调之前使用模型的权重与要修补的任务进行微调后的权重。在零机夹的性能差的九个任务上,油漆可将精度提高15至60个百分点,同时将ImageNet上的精度保留在零拍模型的一个百分点之内。油漆还允许在多个任务上修补单个模型,并通过模型刻度进行改进。此外,我们确定了广泛转移的案例,即使任务不相交,对一个任务进行修补也会提高其他任务的准确性。最后,我们研究了超出常见基准的应用程序,例如计数或减少印刷攻击对剪辑的影响。我们的发现表明,可以扩展一组任务集,开放式摄影模型可实现高精度,而无需从头开始重新训练它们。
translated by 谷歌翻译
以前的工作提出了许多新的损失函数和常规程序,可提高图像分类任务的测试准确性。但是,目前尚不清楚这些损失函数是否了解下游任务的更好表示。本文研究了培训目标的选择如何影响卷积神经网络隐藏表示的可转移性,训练在想象中。我们展示了许多目标在Vanilla Softmax交叉熵上导致想象的精度有统计学意义的改进,但由此产生的固定特征提取器转移到下游任务基本较差,并且当网络完全微调时,损失的选择几乎没有效果新任务。使用居中内核对齐来测量网络隐藏表示之间的相似性,我们发现损失函数之间的差异仅在网络的最后几层中都很明显。我们深入了解倒数第二层的陈述,发现不同的目标和近奇计的组合导致大幅不同的类别分离。具有较高类别分离的表示可以在原始任务上获得更高的准确性,但它们的功能对于下游任务不太有用。我们的结果表明,用于原始任务的学习不变功能与传输任务相关的功能之间存在权衡。
translated by 谷歌翻译
本文提出了一种对比调整,这是一种简单的方法,采用对比训练来对准图像和文本模型,同时仍然利用他们的预训练。在我们的实证研究中,我们发现,锁定的预训练图像模型与解锁文本模型最佳。我们调用这种对比调整“锁定图像文本调整”(LIT TOONING)的实例,该实例仅教导文本模型,从预先训练的图像模型中读出了良好的表示新任务。亮度调谐模型将零拍摄传输到新视觉任务的能力提高,例如图像分类或检索。建议的亮度调整是广泛适用的;它可以使用三种不同的图像文本数据集可靠地使用多种预训练方法(监督和无监督)和多种架构(Reset,Vision变换器和MLP-MILLER)。利用基于变压器的预训练VIT-G / 14型号,LIT调谐模型在想象网测试集中实现了84.5%的零射频传输精度,并且在充满挑战的分发ObjectNet测试集中实现了81.1%。
translated by 谷歌翻译
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data \& models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
translated by 谷歌翻译
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
translated by 谷歌翻译
视觉变压器(VIT)已被证明可以在广泛的视觉应用中获得高度竞争性的性能,例如图像分类,对象检测和语义图像分割。与卷积神经网络相比,通常发现视觉变压器的较弱的电感偏差会在较小的培训数据集上培训时,会增加对模型正则化或数据增强的依赖(简称为“ AUGREG”)。我们进行了一项系统的实证研究,以便更好地了解培训数据,AUGREG,模型大小和计算预算之间的相互作用。作为这项研究的一个结果,我们发现增加的计算和AUGREG的组合可以产生与在数量级上训练的模型相同的训练数据的模型:我们在公共Imagenet-21K数据集中培训各种尺寸的VIT模型在较大的JFT-300M数据集上匹配或超越其对手的培训。
translated by 谷歌翻译
Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from Ima-geNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
translated by 谷歌翻译
Contrastive Language-Image Pre-trained (CLIP) models have zero-shot ability of classifying an image belonging to "[CLASS]" by using similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Based on exhaustive text cues in "[CONTEXT]", CLIP model is aware of different contexts, e.g. background, style, viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find further fine-tuning of CLIP models improves accuracy but sacrifices the robustness on downstream tasks. We conduct an empirical investigation to show fine-tuning will corrupt the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture the context information. Specifically, we use zero-shot prompt weights to get the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between context distributions induced by original/fine-tuned CLIP models, CAR-FT makes the context-aware ability of CLIP inherited into downstream tasks, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, and meanwhile brings accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and gets 78.5% averaged accuracy on DomainBed benchmark, building the new state-of-the-art.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
Web爬行的数据集已在最近的图像文本模型(例如剪辑(对比语言图像预训练)或火烈鸟)中启用了非凡的概括功能,但是对数据集创建过程知之甚少。在这项工作中,我们介绍了六个可公开可用数据源的测试床 - YFCC,LAION,概念标题,机智,redcaps,shutterstock-,以调查预训练分布如何在剪辑中诱导稳健性。我们发现,预训练数据的性能在分布变化之间有很大的变化,没有单个数据源主导。此外,我们系统地研究了这些数据源之间的相互作用,发现组合多个来源并不一定会产生更好的模型,而是稀释了最佳个体数据源的鲁棒性。我们将经验发现与简单环境中的理论见解相辅相成,其中结合训练数据还会导致稳健性稀释。此外,我们的理论模型为LAION数据集中最近采用的基于夹的数据过滤技术的成功提供了候选解释。总体而言,我们的结果表明,仅仅从Web中收集大量数据并不是建立预训练数据集以进行鲁棒性概括的最有效方法,因此需要进一步研究数据集设计。
translated by 谷歌翻译
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
translated by 谷歌翻译
预训练的视觉模型(例如,剪辑)在许多下游任务中显示出有希望的零弹性概括,并具有正确设计的文本提示。最近的作品不依赖手工设计的提示,而是使用下游任务的培训数据来学习提示。虽然有效,但针对领域数据的培训却降低了模型的概括能力,使其无法看到新领域。在这项工作中,我们提出了测试时间提示调整(TPT),该方法可以通过单个测试样本即时学习自适应提示。对于图像分类,TPT通过使用置信度选择最小化熵来优化提示,以便模型在每个测试样本的不同增强视图上都具有一致的预测。在评估对自然分布变化的概括时,TPT平均将零击的TOP-1精度提高了3.6%,超过了先前需要其他特定于任务的训练数据的迅速调整方法。在评估看不见类别的跨数据集泛化时,TPT与使用其他培训数据的最先进方法相当。项目页面:https://azshue.github.io/tpt。
translated by 谷歌翻译
差异隐私(DP)提供了正式的隐私保证,以防止对手可以访问机器学习模型,从而从提取有关单个培训点的信息。最受欢迎的DP训练方法是差异私有随机梯度下降(DP-SGD),它通过在训练过程中注入噪声来实现这种保护。然而,以前的工作发现,DP-SGD通常会导致标准图像分类基准的性能显着降解。此外,一些作者假设DP-SGD在大型模型上固有地表现不佳,因为保留隐私所需的噪声规范与模型维度成正比。相反,我们证明了过度参数化模型上的DP-SGD可以比以前想象的要好得多。将仔细的超参数调整与简单技术结合起来,以确保信号传播并提高收敛速率,我们获得了新的SOTA,而没有额外数据的CIFAR-10,在81.4%的81.4%下(8,10^{ - 5}) - 使用40 -layer wide-Resnet,比以前的SOTA提高了71.7%。当对预训练的NFNET-F3进行微调时,我们在ImageNet(0.5,8*10^{ - 7})下达到了83.8%的TOP-1精度。此外,我们还在(8,8 \ cdot 10^{ - 7})下达到了86.7%的TOP-1精度,DP仅比当前的非私人SOTA仅4.3%。我们认为,我们的结果是缩小私人图像分类和非私有图像分类之间准确性差距的重要一步。
translated by 谷歌翻译
Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes in parallel each specialized model on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
对比训练有素的语言图像模型,例如剪辑,Align和Basic,已经证明了对多种具有挑战性的自然分配变化的前所未有的鲁棒性。由于这些语言图像模型与以前的培训方法有多种不同,因此一个重要的问题是导致稳定性增长的原因。我们通过系统的实验研究回答这个问题。具体而言,我们研究了鲁棒性增长的五个不同可能的原因:(i)训练集大小,(ii)培训分配,(iii)在培训时进行语言监督,(iv)测试时语言监督,以及(v)对比损失函数。我们的实验表明,更多样化的训练分布是稳健性增长的主要原因,其他因素几乎没有稳健性。除了实验结果之外,我们还引入了Imagenet捕获,这是一种来自Flickr的原始文本注释的Imagenet版本,以实现语言图像训练的进一步受控实验。
translated by 谷歌翻译
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive selfsupervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by Sim-CLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-ofthe-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels. 1
translated by 谷歌翻译