Contrastive Language-Image Pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to "[CLASS]" by using the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Thanks to exhaustive text cues in "[CONTEXT]", the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT). CAR-FT regularizes the model during fine-tuning to capture context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in an image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets the context-aware ability of CLIP be inherited by downstream tasks, and achieves both higher In-Distribution (ID) and Out-Of-Distribution (OOD) accuracy. The experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, setting a new state of the art.
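As a rough illustration of the regularizer described above, the sketch below assumes pre-computed CLIP text embeddings for a small set of context prompts (e.g., "a photo of", "a sketch of"); the function and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def context_kl_regularizer(img_feats_ft, img_feats_zs, context_text_feats, tau=0.01):
    """KL between context distributions of the frozen CLIP and the fine-tuned model.

    img_feats_ft:       (B, D) image features from the model being fine-tuned
    img_feats_zs:       (B, D) image features from the frozen zero-shot CLIP
    context_text_feats: (C, D) text embeddings of C context prompts
                        (e.g. "a photo of", "a sketch of", "a painting of")
    """
    img_ft = F.normalize(img_feats_ft, dim=-1)
    img_zs = F.normalize(img_feats_zs, dim=-1)
    ctx = F.normalize(context_text_feats, dim=-1)

    logits_ft = img_ft @ ctx.t() / tau          # (B, C) context logits, fine-tuned
    logits_zs = img_zs @ ctx.t() / tau          # (B, C) context logits, zero-shot

    # KL(p_zs || p_ft): keep the fine-tuned context distribution close to CLIP's
    return F.kl_div(F.log_softmax(logits_ft, dim=-1),
                    F.log_softmax(logits_zs, dim=-1),
                    log_target=True, reduction="batchmean")

# total loss = task cross-entropy + lambda * context_kl_regularizer(...)
```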
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by 2.3% ID and 2.7% OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of 4.2% OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than 1% both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to 4.6% over standard finetuning and 4.4% over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
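A minimal sketch of this recipe, assuming a generic two-tower CLIP-like interface; the `image_encoder`/`text_encoder` callables and the prompt template are illustrative assumptions, and tokenization is assumed to happen inside the text encoder.

```python
import torch
import torch.nn.functional as F

def flyp_style_loss(image_encoder, text_encoder, images, labels, classnames,
                    logit_scale, template="a photo of a {}"):
    """Contrastive finetuning step: labels -> prompts -> symmetric CLIP-style loss.

    images:      (B, 3, H, W) batch of images
    labels:      (B,) integer class labels
    classnames:  list of class-name strings indexed by label
    logit_scale: learnable temperature (a scalar tensor), as in CLIP
    """
    prompts = [template.format(classnames[y]) for y in labels.tolist()]
    img_emb = F.normalize(image_encoder(images), dim=-1)     # (B, D)
    txt_emb = F.normalize(text_encoder(prompts), dim=-1)     # (B, D)

    logits = logit_scale.exp() * img_emb @ txt_emb.t()       # (B, B)
    targets = torch.arange(len(images), device=logits.device)
    # symmetric image->text and text->image cross-entropy, as in CLIP pretraining
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Note that with duplicated labels in a batch the "negatives" are not all true negatives; the sketch ignores this detail.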
Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization on many downstream tasks when given properly designed text prompts. Rather than relying on hand-crafted prompts, recent works learn prompts from the training data of downstream tasks. While effective, such training on domain-specific data reduces a model's ability to generalize to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that learns adaptive prompts on the fly from a single test sample. For image classification, TPT optimizes the prompt by minimizing entropy with confidence selection, so that the model has consistent predictions across different augmented views of each test sample. When evaluated on generalization to natural distribution shifts, TPT improves zero-shot top-1 accuracy by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. When evaluated on cross-dataset generalization with unseen categories, TPT performs on par with state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/tpt.
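A hedged sketch of one such test-time step, assuming a `model(views, prompt_params)` interface that returns class logits; only the prompt parameters are updated, and all hyperparameters below are placeholders rather than the paper's settings.

```python
import torch

def tpt_step(model, prompt_params, augment, image, n_views=64, keep_frac=0.1, lr=5e-3):
    """One test-time prompt-tuning step for a single test image (sketch).

    `augment` is a stochastic image augmentation (e.g. random resized crops);
    the CLIP weights inside `model` stay frozen.
    """
    optimizer = torch.optim.AdamW([prompt_params], lr=lr)
    views = torch.stack([augment(image) for _ in range(n_views)])   # (N, 3, H, W)

    logits = model(views, prompt_params)                            # (N, num_classes)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)

    # confidence selection: keep the lowest-entropy (most confident) views
    k = max(1, int(keep_frac * n_views))
    keep = entropy.topk(k, largest=False).indices
    avg_probs = probs[keep].mean(dim=0)

    # minimize the entropy of the averaged prediction over the selected views
    loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```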
Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy gains under distribution shift while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a further diverse set of distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
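The weight-space ensemble itself reduces to a per-parameter interpolation; a minimal sketch, assuming the two models share the same architecture:

```python
import copy

def wise_ft(zeroshot_model, finetuned_model, alpha=0.5):
    """Weight-space ensemble of the zero-shot and fine-tuned models (sketch).

    Returns a model with parameters (1 - alpha) * zero-shot + alpha * fine-tuned.
    alpha is typically chosen on held-out data; alpha=0 recovers the zero-shot
    model and alpha=1 the standard fine-tuned model.
    """
    zs_state = zeroshot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    merged = {k: (1 - alpha) * zs_state[k] + alpha * ft_state[k] for k in zs_state}

    ensembled = copy.deepcopy(finetuned_model)
    ensembled.load_state_dict(merged)
    return ensembled
```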
Domain generalization (DG) is a difficult transfer learning problem that aims to learn a generalizable model for unseen domains. Recent massive pre-trained models such as CLIP and GPT-3, i.e., foundation models (FMs), have been shown to be robust to many distribution shifts and should therefore lead to substantial improvements in DG. In this work, we study generic ways of adopting CLIP for DG problems in image classification, where we evaluate both naive zero-shot learning and full DG learning settings. For the latter, we propose AP (Amortized Prompt) as a novel approach to domain inference in the form of prompt generation. On several standard domain generalization benchmark datasets, namely PACS, VLCS, OfficeHome, and TerraIncognita, CLIP provides comparable performance without fine-tuning any parameters, suggesting the applicability and importance of FMs in DG. In addition, we show that combining domain prompt inference with CLIP enables AP to outperform strong baselines by a large margin, raising accuracy from 71.3% to 79.3%. We hope that the simplicity and success of our approach highlights the importance of foundation models and leads to their wider adoption and analysis in the field of domain generalization.
With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes for information beneficial to downstream tasks from the general knowledge stored in the image and text encoders of the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors into the text prompt on the language side; however, tuning the text prompt alone cannot affect the visual features computed by the image encoder, leading to sub-optimal results. In this paper, we propose a dual-modality prompt tuning paradigm that learns text and visual prompts for the text and image encoders simultaneously. Furthermore, to make the visual prompt concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), in which the class-aware visual prompt is generated dynamically by performing cross attention between the language descriptions of template prompts and visual class token embeddings. Our method provides a new paradigm for tuning large pre-trained vision-language models, and extensive experimental results on 8 datasets demonstrate its effectiveness. Our code is available in the supplementary material.
Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. Generally, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, evaluating the transferability of these models remains challenging due to the lack of easy-to-use evaluation toolkits and public benchmarks. To address this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER consists of three components. (i) Datasets. As the downstream evaluation suite, it contains 20 image classification datasets and 35 object detection datasets, each augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample efficiency (zero-shot and few-shot) and parameter efficiency (linear probing and full model fine-tuning). We publicly release ELEVATER at https://computer-vision-in-the-wild.github.io/elevater/
Large pre-trained vision-language models such as CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to downstream tasks via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: a significant amount of time must be spent on word tuning, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while the entire set of pre-trained parameters is kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and is able to obtain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
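A toy sketch of the learnable-context idea, with a simplified token-embedding interface that stands in for CLIP's actual text encoder (an assumption for illustration, not the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOpPromptLearner(nn.Module):
    """Learnable context vectors prepended to class-name token embeddings (sketch).

    `class_token_embs` is a list of tensors, one per class, holding the frozen
    token embeddings of that class name; `text_encoder` maps a sequence of
    token embeddings to a single text feature.  Both are assumptions about the
    surrounding CLIP code.
    """
    def __init__(self, n_ctx, embed_dim, class_token_embs):
        super().__init__()
        # unified context: one shared set of learnable "words"
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.class_token_embs = class_token_embs   # frozen, per-class (L_c, D)

    def forward(self, text_encoder):
        feats = []
        for cls_emb in self.class_token_embs:
            prompt = torch.cat([self.ctx, cls_emb], dim=0)   # "[V]_1 ... [V]_M [CLASS]"
            feats.append(text_encoder(prompt))
        return F.normalize(torch.stack(feats), dim=-1)       # (num_classes, D)

# classification: logits = logit_scale * image_features @ prompt_learner(text_encoder).t()
# only self.ctx receives gradients; all CLIP weights remain frozen.
```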
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
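As a generic illustration of fitting such a power law (not the paper's exact procedure, and with made-up numbers), one can regress log-error on log-compute:

```python
import numpy as np

# hypothetical (compute, zero-shot error) measurements -- illustrative numbers only
compute = np.array([1e9, 4e9, 1.6e10, 6.4e10, 2.56e11])
error   = np.array([0.52, 0.44, 0.37, 0.31, 0.26])

# fit error ~= a * compute**b by ordinary least squares in log-log space
b, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
a = np.exp(log_a)
print(f"error ~ {a:.3f} * C^{b:.3f}")

# extrapolate to a larger compute budget (a rough use of the fitted law)
print("predicted error at C = 1e12:", a * 1e12 ** b)
```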
While large pretrained foundation models (FMs) show remarkable zero-shot classification robustness to dataset-level distribution shifts, their robustness to subpopulation or group shifts is relatively underexplored. We study this problem and find that FMs such as CLIP may not be robust to various group shifts. Across 9 robustness benchmarks, zero-shot classification with their embeddings results in gaps of up to 80.7 percentage points (pp) between average and worst-group accuracy. Unfortunately, existing methods to improve robustness require retraining, which can be prohibitively expensive on large foundation models. We also find that efficient ways to improve model inference (e.g., via adapters, lightweight networks that take FM embeddings as inputs) do not consistently improve, and can sometimes hurt, group robustness compared to zero-shot classification (e.g., raising the accuracy gap to 50.1 pp on CelebA). We therefore develop an adapter training strategy to effectively and efficiently improve FM group robustness. Our motivating observation is that while poor robustness results from groups within the same class being embedded far apart in the foundation model embedding space, standard adapter training may not bring these points closer together. We thus propose contrastive adapting, which trains adapters with contrastive learning to bring sample embeddings close to both their ground-truth class embeddings and other sample embeddings in the same class. Across the 9 benchmarks, our approach consistently improves group robustness, raising worst-group accuracy by 8.5 to 56.0 pp. Our approach is also efficient, doing so without any FM finetuning and with only a fixed set of frozen FM embeddings. On benchmarks such as Waterbirds and CelebA, this leads to worst-group accuracy comparable to state-of-the-art methods that retrain entire models, while training no more than 1% of the model parameters.
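A hedged sketch of the contrastive-adapter objective described above, with a residual MLP adapter over frozen embeddings; the adapter architecture and the equal weighting of the two terms are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Lightweight adapter over frozen foundation-model embeddings (sketch)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return F.normalize(x + self.net(x), dim=-1)   # residual adapter

def contrastive_adapter_loss(adapter, sample_embs, labels, class_embs, tau=0.1):
    """Pull each adapted sample toward its class embedding and toward other
    samples of the same class; push it away from the rest of the batch."""
    z = adapter(sample_embs)                              # (B, D)
    c = F.normalize(class_embs, dim=-1)                   # (K, D) frozen class embeddings

    # sample-to-class term: InfoNCE with class embeddings as candidates
    loss_cls = F.cross_entropy(z @ c.t() / tau, labels)

    # sample-to-sample term: same-class samples in the batch are positives
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    same = (labels[:, None] == labels[None, :]).float()
    same.fill_diagonal_(0)
    log_prob = sim.log_softmax(dim=-1)
    denom = same.sum(dim=-1).clamp_min(1)
    loss_sup = -(same * log_prob).sum(dim=-1) / denom     # supervised-contrastive style
    return loss_cls + loss_sup.mean()
```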
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low-error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory cost; we call the result a "model soup." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to the flatness of the loss and the confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
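A minimal sketch of the uniform soup, assuming all candidate models share an architecture:

```python
import copy
import torch

def uniform_soup(models):
    """Uniform model soup: average the weights of models fine-tuned with
    different hyperparameters (sketch)."""
    state_dicts = [m.state_dict() for m in models]
    souped = {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
              for k in state_dicts[0]}
    soup_model = copy.deepcopy(models[0])
    soup_model.load_state_dict(souped)
    return soup_model

# a "greedy soup" instead adds models one by one (in order of validation accuracy)
# and keeps each model only if it does not hurt held-out accuracy.
```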
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.
Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness. We first identify two key factors during model adaptation -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins when text guidance is available. Overall, our approach significantly improves zero-shot adversarial robustness over CLIP, with an average improvement of over 31 points across ImageNet and 15 zero-shot datasets. We hope this work sheds light on understanding the zero-shot adversarial robustness of large-scale models.
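A rough sketch of a text-guided adversarial training step of this kind, combining PGD perturbations with alignment to frozen class-prompt embeddings; the step sizes, temperature, and the plain cross-entropy form of the alignment loss are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def text_guided_adv_loss(image_encoder, images, labels, text_feats,
                         eps=1.0/255, alpha=0.5/255, steps=3, tau=0.01):
    """Text-guided adversarial training step (sketch).

    text_feats: (K, D) frozen text embeddings of class prompts.  Adversarial
    perturbations are crafted to break the image-text alignment, and the model
    is trained to re-align the perturbed image features with the text features.
    (Pixel-range clamping is omitted for brevity.)
    """
    text_feats = F.normalize(text_feats, dim=-1)
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)

    for _ in range(steps):                       # PGD in an L-inf ball of radius eps
        feats = F.normalize(image_encoder(images + delta), dim=-1)
        loss = F.cross_entropy(feats @ text_feats.t() / tau, labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # training loss: align adversarial image features with the class-prompt texts
    adv_feats = F.normalize(image_encoder(images + delta.detach()), dim=-1)
    return F.cross_entropy(adv_feats @ text_feats.t() / tau, labels)
```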
Open-vocabulary models such as CLIP achieve high accuracy on many image classification tasks. However, in some settings their zero-shot performance is far from optimal. We study model patching, with the goal of improving accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that interpolates between the weights of a model before fine-tuning and its weights after fine-tuning on the task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while keeping accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks and improves with model scale. Furthermore, we identify cases of broad transfer, where patching on one task increases accuracy on other tasks even when the tasks are disjoint. Finally, we investigate applications beyond common benchmarks, such as counting and reducing the impact of typographic attacks on CLIP. Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study, we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers, and MLP-Mixer) using three different image-text datasets. With the Transformer-based pre-trained ViT-g/14 model, the LiT model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
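A minimal sketch of one contrastive-tuning step with the image tower locked; the encoder interfaces and the symmetric loss form are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def lit_step(image_tower, text_tower, optimizer, images, texts, logit_scale):
    """One locked-image-tuning step (sketch): the pre-trained image tower is
    frozen, and only the text tower is trained with the contrastive loss."""
    image_tower.requires_grad_(False)
    image_tower.eval()

    with torch.no_grad():                                   # locked image tower
        img = F.normalize(image_tower(images), dim=-1)      # (B, D)
    txt = F.normalize(text_tower(texts), dim=-1)            # (B, D)

    logits = logit_scale * img @ txt.t()
    targets = torch.arange(len(images), device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```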
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions to complete noisy ones, which uses the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide captioning generation. By collaboratively optimizing noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models of large scale in both model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace the classifier with embedded language representations of the object categories. These language representations are initialized from the text encoder of the vision-language pre-trained model in order to further utilize its well-pretrained language model parameters. The empirical study shows that our method improves both the performance and the training speed of video classification, with a negligibly small change to the model. In particular, our paradigm achieves state-of-the-art accuracy of 87.3% on Kinetics-400.
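A hedged sketch of initializing the classifier from class-name text embeddings; the prompt template and the encoder interface are illustrative, not the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_text_classifier(text_encoder, classnames, template="a video of {}"):
    """Replace a randomly initialized linear head with class-name text embeddings
    from the pre-trained text encoder (sketch)."""
    with torch.no_grad():
        weights = torch.stack([text_encoder(template.format(c)) for c in classnames])
    head = nn.Linear(weights.shape[1], len(classnames), bias=False)
    head.weight.data = F.normalize(weights, dim=-1)   # initialize from language, then fine-tune
    return head

# usage: logits = head(F.normalize(video_features, dim=-1)) / temperature
```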
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains superior few-shot performance on different vision and language tasks compared with previous prompt tuning methods on CLIP. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements across three few-shot scenarios on unseen test sets, respectively.
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.