Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to their task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such zero-shot performance of CLIP-based models does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). We investigate why this is the case, and report an interesting phenomenon of CLIP, which we call the Concept Association Bias (CAB), as a potential cause of the difficulty of applying CLIP to VQA and similar tasks. CAB is especially apparent when two concepts are present in the given image while a text prompt only contains a single concept. In such a case, we find that CLIP tends to treat input as a bag of concepts and attempts to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. For example, when asked for the color of a lemon in an image, CLIP predicts ``purple'' if the image contains a lemon and an eggplant. We demonstrate the Concept Association Bias of CLIP by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. lemon) and an attribute (e.g. its color). On the other hand, when the association between object and attribute is weak, we do not see this phenomenon. Furthermore, we show that CAB is significantly mitigated when we enable CLIP to learn deeper structure across image and text embeddings by adding an additional Transformer on top of CLIP and fine-tuning it on VQA. We find that across such fine-tuned variants of CLIP, the strength of CAB in a model predicts how well it performs on VQA.
translated by 谷歌翻译
诸如剪辑之类的大型预训练的视觉模型在学习表现方面表现出巨大的潜力,这些模型可以在各种下游任务中转移。与主要基于离散标签的传统表示学习不同,视觉语言预训练会使图像和文本在公共特征空间中对齐,这允许通过提示零弹性转移到下游任务,即从分类权重合成。描述兴趣类的自然语言。在这项工作中,我们表明,在实践中部署此类模型的一个重大挑战是及时的工程,它需要域专业知识,并且非常耗时 - 由于措辞的略有变化,需要花费大量时间来进行单词调整可能会对性能产生巨大影响。受到自然语言处理(NLP)迅速学习研究的最新进展的启发,我们提出了上下文优化(COP),这是一种专门用于调整类似剪辑的视觉语言模型的简单方法,用于下游图像识别。具体而言,Coop用可学习的向量建模了提示A的上下文单词,而整个预训练的参数则保持固定。为了处理不同的图像识别任务,我们提供了两个COOP的实现:统一上下文和特定于班级的上下文。通过在11个数据集上进行的大量实验,我们证明Coop只需要一两个镜头才能以相当的利润击败手工制作的提示,并且能够以16张镜头(例如16张照片)获得迅速工程的显着改进增益约为15%(最高达到45%以上)。尽管是一种基于学习的方法,但与使用手工制作的提示相比,Coop与零拍模型相比,取得了出色的域泛化性能。
translated by 谷歌翻译
Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
translated by 谷歌翻译
随着大型预训练的Vison语言模型(如剪辑)的出现,可以通过及时调整来调整可转让表示形式。及时调整试图从存储在预训练的视觉模型的图像和文本编码器中的常识中探索有益信息,以探索下游任务。最近提出的名为“上下文优化”(COP)的方法将一组可学习的向量从语言侧引入文本提示符,而单独调整文本提示符则不会影响图像编码器的计算视觉特征,从而导致了次级优势。在本文中,我们通过学习文本提示并同时为文本和图像编码器提供双重模式提示调整范式。此外,为了使视觉提示更多地集中在目标视觉概念上,我们提出了类感知的视觉及时调整(CAVPT),该调整是通过在模板提示和视觉类别令牌嵌入的语言描述之间进行交叉注意来动态生成的。我们的方法提供了一种新的范式来调整大型预训练的视觉模型,并在8个数据集上进行了广泛的实验结果,证明了该方法的有效性。我们的代码在补充材料中可用。
translated by 谷歌翻译
当前机器学习的大部分基础的大型数据集提出了有关不适当内容的严重问题,例如冒犯,侮辱,威胁或可能引起焦虑。这要求增加数据集文档,例如使用数据表。它们除其他主题外,还鼓励反思数据集的组成。到目前为止,该文档是手动完成的,因此可能是乏味且容易出错的,尤其是对于大型图像数据集。在这里,我们询问了机器是否可以帮助我们反思不适当的内容的“循环”问题,回答了数据表中的问题16。为此,我们建议使用存储在预训练的变压器模型中的信息来协助我们进行文档过程。具体而言,基于社会 - 道德价值数据集的及时调整引导剪辑以识别潜在的不适当的内容,从而减少了人工的劳动。然后,我们根据使用视觉模型生成的字幕来记录使用单词云找到的不适当图像。两个流行的大规模计算机视觉数据集的文档(ImageNet和OpenImages)以这种方式产生,这表明机器确实可以帮助数据集创建者回答有关不适当图像内容的问题16。
translated by 谷歌翻译
Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack in generalization aspect. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a `bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
translated by 谷歌翻译
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. Code is released at https://github.com/guozix/TaI-DPT.
translated by 谷歌翻译
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
translated by 谷歌翻译
本文提出了一种对比调整,这是一种简单的方法,采用对比训练来对准图像和文本模型,同时仍然利用他们的预训练。在我们的实证研究中,我们发现,锁定的预训练图像模型与解锁文本模型最佳。我们调用这种对比调整“锁定图像文本调整”(LIT TOONING)的实例,该实例仅教导文本模型,从预先训练的图像模型中读出了良好的表示新任务。亮度调谐模型将零拍摄传输到新视觉任务的能力提高,例如图像分类或检索。建议的亮度调整是广泛适用的;它可以使用三种不同的图像文本数据集可靠地使用多种预训练方法(监督和无监督)和多种架构(Reset,Vision变换器和MLP-MILLER)。利用基于变压器的预训练VIT-G / 14型号,LIT调谐模型在想象网测试集中实现了84.5%的零射频传输精度,并且在充满挑战的分发ObjectNet测试集中实现了81.1%。
translated by 谷歌翻译
将简单的体系结构与大规模预训练相结合已导致图像分类的大量改进。对于对象检测,预训练和缩放方法的确定性不佳,尤其是在长尾和开放式摄影的环境中,训练数据相对较少。在本文中,我们提出了一个强大的配方,用于将图像文本模型转移到开放式对象检测中。我们使用具有最小修改,对比度文本预训练和端到端检测微调的标准视觉变压器体系结构。我们对该设置的缩放属性的分析表明,增加图像级预训练和模型大小在下游检测任务上产生一致的改进。我们提供适应性策略和正规化,以实现零击文本条件和单次图像条件对象检测的非常强劲的性能。代码和型号可在GitHub上找到。
translated by 谷歌翻译
从自然语言监督中学习视觉表示,最近在许多开创性的作品中表现出了巨大的希望。通常,这些具有语言的视觉模型表现出对各种数据集和任务的强大可传递性。但是,由于缺乏易于使用的评估工具包和公共基准,评估这些模型的可转让性仍然很具有挑战性。为了解决这个问题,我们构建了高级版(评估语言的视觉任务级传输),这是用于评估(预训练)语言增强视觉模型的第一个基准和工具包。升华由三个组成部分组成。 (i)数据集。作为下游评估套件,它由20个图像分类数据集和35个对象检测数据集组成,每个数据集都用外部知识来增强。 (ii)工具包。开发了自动高参数调谐工具包,以促进下游任务的模型评估。 (iii)指标。多种评估指标用于测量样品效率(零射击和少量)和参数效率(线性探测和完整模型微调)。我们在https://computer-vision-in-the-wild.github.io/elevater/上公开发布leverater
translated by 谷歌翻译
预训练的视觉模型(例如,剪辑)在许多下游任务中显示出有希望的零弹性概括,并具有正确设计的文本提示。最近的作品不依赖手工设计的提示,而是使用下游任务的培训数据来学习提示。虽然有效,但针对领域数据的培训却降低了模型的概括能力,使其无法看到新领域。在这项工作中,我们提出了测试时间提示调整(TPT),该方法可以通过单个测试样本即时学习自适应提示。对于图像分类,TPT通过使用置信度选择最小化熵来优化提示,以便模型在每个测试样本的不同增强视图上都具有一致的预测。在评估对自然分布变化的概括时,TPT平均将零击的TOP-1精度提高了3.6%,超过了先前需要其他特定于任务的训练数据的迅速调整方法。在评估看不见类别的跨数据集泛化时,TPT与使用其他培训数据的最先进方法相当。项目页面:https://azshue.github.io/tpt。
translated by 谷歌翻译
自动视觉解对我们多样化和开放的世界需要计算机视觉模型,以概括为特定任务的最小定制,类似于人类视力。计算机视觉基础型号培训,培训多样化,大型数据集,可以适应各种下游任务,对该任务来解决现实世界计算机视觉应用而言至关重要。虽然现有的视觉基础模型如剪辑,对齐和吴道2.0主要集中在映射图像和文本表示到跨模型共享表示,我们介绍了一台新的计算机视觉基础模型,佛罗伦萨,扩大粗糙的表示(现场)到精细(对象),从静态(图像)到动态(视频),以及从RGB到多个模态(标题,深度)。通过从Web级图像文本数据中纳入通用视觉语言表示,我们的佛罗伦萨模型可以很容易地适应各种计算机视觉任务,例如分类,检索,对象检测,VQA,图像标题,视频检索和动作识别。此外,佛罗伦萨在许多类型的转移学习中表现出出色的表现:全面采样的微调,线性探测,几次射击传输和用于新颖图像和物体的零拍摄传输。所有这些属性对于我们的视觉基础模型至关重要,以提供通用视觉任务。佛罗伦萨实现了新的最先进的导致44个代表性基准,例如Imagenet-1K零射击分类,最高1精度为83.74,最高5个精度为97.18,62.4地图上的Coco微调, 80.36在VQA上,动力学-600上的87.8。
translated by 谷歌翻译
本文介绍了用于学习对象级别,语言感知和富含语义的视觉表示的接地语言图像预培训(GLIP)模型。 Glip统一对象检测和短语进行预培训。统一带来了两个好处:1)它允许GLIP从检测和接地数据中学习,以改善两个任务和引导良好的接地模型; 2)GLIP可以通过以自培训方式产生接地盒来利用大规模的图像文本对,使学习的表示是语义丰富的。在我们的实验中,我们在27M的接地数据上预先列车触胶,包括3M人的注释和24M Web爬网的图像文本对。学习的表示表明了强烈的零射击和对各种对象识别任务的可转换性。 1)直接在Coco和LVIS上评估(在训练期间没有在Coco中看到任何图像)时,Plip分别达到49.8 AP和26.9 AP,超过许多监督基线。 2)在COCO上微调后,GLIP在Val和61.5 AP上实现60.8 AP在测试开发上,超过先前的SOTA。 3)当转移到下游对象检测任务时,具有完全监控动态头的1次触发器竞争对手。代码将在https://github.com/microsoft/glip发布。
translated by 谷歌翻译
我们引入了构图软提示(CSP),这是一种参数有效的学习技术,可改善大规模预处理视觉模型(VLMS)的零摄像组成性。 VLM可以在其灵活的文本编码器中代表任意类作为自然语言提示,但在组成零击基准任务上的表现不佳。为了改善VLM,我们提出了一种新颖的软提示形式。我们将构成的属性和对象视为将类定义为词汇的可学习令牌,并在多个及时的构图上调整它们。在推断期间,我们在新组合中重新组装了学习的属性对象词汇。我们表明,CSP在基准数据集上的原始VLM的表现平均为AUC上的10.9个百分点。 CSP还胜过Coop,这是一种调谐前缀上下文的软提示方法,在AUC上平均要点5.8个百分点。我们执行其他实验,以表明CSP对仅属性分类,高阶属性 - 属性对象组成以及预验证属性和微调对象的组合进行了改进。
translated by 谷歌翻译
我们介绍了自回归文本到图像(Parti)模型的途径,该模型生成高保真的影像图像并支持涉及复杂组成和世界知识的内容丰富的合成。 Parti将文本对图像生成视为类似于机器翻译的序列到序列建模问题,图像令牌的序列是目标输出,而不是其他语言的文本令牌。这种策略自然可以利用大型语言模型的先前工作,通过扩展数据和模型尺寸,能力和性能的持续进展。我们的方法很简单:首先,Parti使用基于变压器的图像令牌VIT-VQGAN将图像编码为离散令牌的序列。其次,我们通过将编码器二次变压器模型缩放到20B参数来实现一致的质量改进,其新的最新零弹药FID得分为7.23,而MS-Coco的FIDED得分为3.22。我们对本地化叙述以及党的详细分析(P2),这是1600多个英语提示的新的整体基准,证明了Parti在各种类别和难度方面的有效性。我们还探索并突出了我们的模型的局限性,以定义和体现关注重点领域以进一步改进。有关高分辨率图像,请参见https://parti.research.google/。
translated by 谷歌翻译
自从出现以来,在大型,随机收集的数据上训练的视觉模型在许多领域都有重大影响。但是,由于它们在各个领域表现出色,例如图像文本 - 取回,因此他们的内部工作仍未得到充分了解。当前的工作分析了这些模型的真实零击功能。我们从分析培训语料库的分析开始,评估测试类的程度(以及哪个)实际上是零射击,以及与单个类别的性能如何相关。我们跟进这些模型的基于属性的零击学习能力的分析,以评估这种经典的零击概念从大规模的监督中出现的方式。我们利用最近发布的LAION400M数据语料库以及公开可用的剪辑,OpenClip和Flava的模型,评估了基于属性的CUB和AWA2基准的零摄影功能。我们的分析表明:(i)在预训练期间(很多)观察到大多数流行的零射门基准中的大多数类别; (ii)零射击性能主要来自模型识别类标签的能力,每当它们存在于文本中时,并且只有在不使用类标签时才能观察到基于属性的zeroshot学习的较低的性能能力; (iii)所使用的属性数量可能会对性能产生重大影响,并且很容易导致大幅下降。
translated by 谷歌翻译
作为人类,我们通过我们所有的感官来驾驭世界,使用每个人从每个人纠正其他人。我们介绍了Merlot Reserve,一个模型,该模型是联合随着时间的推移而表示视频的模型 - 通过从音频,字幕和视频帧学习的新培训目标。给出了一个视频,我们用掩模令牌替换文本和音频的片段;该模型通过选择正确的蒙版片段来学习。我们的目标比替代方面更快地学习,并在规模上表现良好:我们预先逼近2000万YouTube视频。经验结果表明,Merlot Reserve学会通过所有组成模式的视频的强烈陈述。在FineTuned时,它在VCR和TVQA上为VCR和TVQA进行了新的最先进,优先于前勤工作分别为5%和7%。消融表明,两个任务都受益于音频预制 - 甚至录像机,围绕图像中心的QA任务(没有声音)。此外,我们的客观使开箱即用的预测,揭示了强大的多式联合致辞理解。在一个完全零拍摄的环境中,我们的模型在四个视频理解任务中获得竞争结果,甚至优于最近提出的定位推理(星)基准的监督方法。我们分析为什么包含音频导致更好的视觉语言表示,这表明未来研究的重要机会。我们通过讨论多式联运预测的道德和社会影响来得出结论。
translated by 谷歌翻译