Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision, providing finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train models to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. The inefficiency can partly be attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Over 7 different dataset/architecture settings x 6 metrics, OTTER outperforms (32) or ties (2) all baselines (34 of 42).
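OTTER's central step, turning the in-batch image-text similarity matrix into a soft matching via entropic optimal transport, can be sketched with standard Sinkhorn iterations. The snippet below is a minimal, hypothetical illustration in that spirit; the uniform marginals, the row normalization, and the blending weight alpha are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn_soft_targets(sim, eps=0.05, n_iters=50, alpha=0.5):
    """Turn a batch similarity matrix into soft image-text match targets.

    sim     : (B, B) image-text similarity matrix (e.g. cosine similarities).
    eps     : entropic regularization strength for Sinkhorn.
    n_iters : number of Sinkhorn normalization steps.
    alpha   : how much of the soft OT plan to blend with the identity labels.
    """
    B = sim.shape[0]
    K = np.exp(sim / eps)                  # Gibbs kernel of the (negated) cost
    u = np.ones(B) / B                     # uniform row marginal
    v = np.ones(B) / B                     # uniform column marginal
    a, b = np.ones(B), np.ones(B)
    for _ in range(n_iters):               # alternating marginal projections
        a = u / (K @ b)
        b = v / (K.T @ a)
    plan = np.diag(a) @ K @ np.diag(b)     # transport plan with ~uniform marginals
    plan = plan / plan.sum(axis=1, keepdims=True)   # each row as a distribution
    hard = np.eye(B)                       # the usual InfoNCE identity targets
    return (1 - alpha) * hard + alpha * plan        # softened targets
```

The softened targets would then replace the one-hot targets in the usual contrastive cross-entropy over similarity logits.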
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions and then pretrains our model to align these region-text pairs in the feature space. When transferring the pretrained model to open-vocabulary object detection tasks, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/regionclip.
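The region-text pretraining signal described above can be illustrated with a small matching sketch, assuming region and concept-caption embeddings are already available from frozen CLIP encoders. The prompt template and the argmax pseudo-labeling rule are illustrative assumptions, not RegionCLIP's exact recipe.

```python
import numpy as np

def build_template_captions(concepts, template="a photo of a {}"):
    # e.g. ["a photo of a dog", "a photo of a traffic light", ...]
    return [template.format(c) for c in concepts]

def assign_region_pseudo_labels(region_emb, concept_emb, temperature=0.01):
    """Match each candidate region to its most similar concept caption.

    region_emb  : (R, D) L2-normalized region features from the image encoder.
    concept_emb : (C, D) L2-normalized embeddings of the template captions.
    Returns the best concept index per region and the soft match scores.
    """
    logits = region_emb @ concept_emb.T / temperature          # (R, C)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                  # softmax per region
    return probs.argmax(axis=1), probs
```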
We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair during the optimization of its contrastive learning objective. We accomplish this by exploiting an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +15.4% mAP absolute gain on Pascal VOC classification and a +22.1% top-1 accuracy gain on ImageNet, while being comparable to or better than other, more complex, text-supervised models. CLIP-Lite also outperforms CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage unbiased visual representations that can be used in downstream tasks.
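The single-negative claim rests on using an information-theoretic lower bound that, unlike InfoNCE, is not capped by the log of the batch size. Below is a sketch of a standard Jensen-Shannon-style mutual-information estimator of that kind (as used in Deep InfoMax-like objectives); treating it as CLIP-Lite's exact objective is an assumption on my part.

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def jsd_mi_lower_bound(pos_score, neg_score):
    """Jensen-Shannon-style MI lower bound from paired critic scores.

    pos_score : (N,) critic scores T(image_i, its own caption).
    neg_score : (N,) critic scores T(image_i, one mismatched caption).
    Maximizing this value tightens the bound; the training loss is its negative.
    """
    return np.mean(-softplus(-pos_score)) - np.mean(softplus(neg_score))
```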
Using natural language as supervision for training visual recognition models holds great promise. Recent works have shown that if such supervision is used in the form of alignment between images and captions in large training datasets, the resulting aligned models perform well on zero-shot classification as a downstream task. In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models. Through extensive and careful experiments, we show that: 1) a simple bag-of-words (BoW) caption can be used as a replacement for most image captions in the dataset. Surprisingly, we observe that this approach improves zero-shot classification performance when combined with word balancing. 2) Using a BoW-pretrained model, we can obtain more training data by generating pseudo-BoW captions for images that do not have captions. Models trained with both real and pseudo-BoW captions achieve even stronger zero-shot performance. On ImageNet-1K zero-shot evaluation, our best model, which uses only 3M image-caption pairs, rivals a CLIP model trained on 15M image-caption pairs (31.5% vs 31.3%).
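The first finding, that bag-of-words captions plus word balancing suffice, amounts to a simple preprocessing step. The sketch below shows one hedged reading of it; the vocabulary filter and the inverse-frequency keep probability are my assumptions about what "word balancing" could mean, not the authors' exact procedure.

```python
import re
import random
from collections import Counter

def to_bow_caption(caption, vocab, word_counts, keep_prob_scale=1000.0):
    """Convert a caption to a bag-of-words caption with crude word balancing.

    caption         : raw caption string.
    vocab           : set of words allowed to appear in BoW captions.
    word_counts     : Counter of word frequencies over the whole corpus.
    keep_prob_scale : frequent words are kept with probability ~ scale / count.
    """
    words = re.findall(r"[a-z]+", caption.lower())
    bow = sorted(set(w for w in words if w in vocab))           # unique in-vocab words
    balanced = [w for w in bow
                if random.random() < min(1.0, keep_prob_scale / word_counts[w])]
    return " ".join(balanced)

corpus = ["A brown dog runs on the beach", "A dog and a person on a beach"]
counts = Counter(w for c in corpus for w in re.findall(r"[a-z]+", c.lower()))
vocab = {w for w, n in counts.items() if len(w) > 2}
print(to_bow_caption(corpus[0], vocab, counts))
```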
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
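The pre-training task is a symmetric contrastive objective over a batch: each image must pick out its own caption among all captions in the batch, and vice versa. A minimal numpy sketch of that loss follows; the fixed temperature and the averaging over both directions follow the common formulation, and details such as the learned temperature and distributed negatives are simplified away.

```python
import numpy as np

def clip_style_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    img_emb, txt_emb : (B, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                      # (B, B)

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)                # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))                 # targets on the diagonal

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```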
Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we explore a stack of simple but effective heuristics and provide comprehensive training guidance that allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline with competitive results, namely ZeroVL, using only 14M publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training and achieve comparable or superior results to state-of-the-art methods, further proving the effectiveness of our method on large-scale data. We hope this work will provide useful data points and experience for future research on multi-modal pre-training. Our code and pre-trained models will be released to facilitate the research community.
Contrastive-style pre-training on weakly correlated image-text pairs has shown great power in learning semantically aligned cross-modal models. A common choice for measuring the distance between the feature representations of an image-text pair is the cosine similarity, which can be viewed mathematically as the negative inner product of features embedded on a sphere. While such a topology benefits from low computational-resource consumption and a properly defined uniformity, it typically suffers from two main drawbacks. First, it is vulnerable to the semantic ambiguity that arises from the noise in weakly correlated image-text pairs. Second, the learning progress is unstable and fragile at the beginning. Although a learnable softmax temperature parameter and a lengthy warm-up scheme are adopted in previous practice to smooth the training progress, an in-depth analysis of these problems is still lacking. In this work, we discuss, from an optimization point of view, the desired properties of the topology and the distance function it is endowed with for the feature representation vectors. We then propose a rather simple solution to the problems above: we map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. In the experimental analysis, we show that we can improve baseline performance by changing only two lines of the training code (e.g., a 4% gain on the zero-shot image-to-text retrieval task).
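A hedged reading of the proposed "two-line change" is to reshape each embedding into several unit-norm columns (a point on an oblique manifold) and score pairs by the summed column-wise inner products instead of a single cosine. The reshape size k and the exact placement of the change are assumptions; the sketch only illustrates the geometry.

```python
import numpy as np

def oblique_embed(x, k=8):
    """Map a (B, D) batch onto an oblique manifold: k unit-norm columns per sample.

    Assumes D is divisible by k.
    """
    B, D = x.shape
    m = x.reshape(B, k, D // k)                             # split into k sub-vectors
    return m / np.linalg.norm(m, axis=2, keepdims=True)     # normalize each sub-vector

def oblique_similarity(img_emb, txt_emb, k=8):
    """Similarity = sum of column-wise inner products (negative of the distance)."""
    a, b = oblique_embed(img_emb, k), oblique_embed(txt_emb, k)
    return np.einsum("bkd,ckd->bc", a, b)                   # (B_img, B_txt) score matrix
```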
Contrastive Language-Image Pretraining (CLIP) has received widespread attention since its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. We introduce Prototypical Contrastive Language-Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between the image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to an unlimited amount of data. Combining the above novel designs, we train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear-probing improvement and a +2.01% ImageNet zero-shot classification improvement. Code is available at https://github.com/megvii-research/protoclip.
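One plausible reading of "prototype-level discrimination" is: cluster one modality into prototypes, give each image the prototype of its paired text as the target, and train with cross-entropy over image-prototype similarities. The sketch below illustrates that reading with precomputed prototypes; the clustering source, the assignment rule, and the temperature are my assumptions rather than ProtoCLIP's exact algorithm.

```python
import numpy as np

def prototype_discrimination_loss(img_emb, txt_emb, prototypes, temperature=0.1):
    """Cross-entropy of image-to-prototype logits against the paired text's prototype.

    img_emb, txt_emb : (B, D) L2-normalized paired embeddings.
    prototypes       : (K, D) L2-normalized prototype vectors (e.g. from k-means
                       over text embeddings).
    """
    targets = np.argmax(txt_emb @ prototypes.T, axis=1)     # paired text's prototype id
    logits = img_emb @ prototypes.T / temperature           # (B, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])
```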
Vision-language pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Some methods propose adopting an off-the-shelf object detector to exploit additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting the model capacity. Inspired by the observation that texts carry incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets with a small extra computational cost.
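The tag supervision IDEA relies on is mined from the caption itself rather than from an object detector. The sketch below shows only that mining step against a small assumed tag vocabulary, producing a multi-hot target that a multi-label loss could consume; the vocabulary and the matching rule are illustrative.

```python
import re
import numpy as np

TAG_VOCAB = ["dog", "cat", "beach", "car", "person", "tree"]   # assumed tag vocabulary
TAG_INDEX = {t: i for i, t in enumerate(TAG_VOCAB)}

def caption_to_multihot(caption):
    """Multi-hot image-tag target mined from the caption text."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    target = np.zeros(len(TAG_VOCAB), dtype=np.float32)
    for w in words & set(TAG_VOCAB):
        target[TAG_INDEX[w]] = 1.0
    return target

print(caption_to_multihot("A dog chases a person along the beach"))
# -> [1. 0. 1. 0. 1. 0.]
```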
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions to complete noisy ones, which uses the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide captioning generation. By collaboratively optimizing noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering.
Although significant progress has been made in few-shot learning, most existing few-shot learning methods require supervised pre-training on a large number of samples from base classes, which limits their generalization ability in real-world applications. Recently, large-scale self-supervised vision-language models (e.g., CLIP) have provided a new paradigm for transferable visual representation learning. However, the pre-trained VLPs may neglect detailed visual information that is difficult to describe with language sentences but important for learning an effective classifier in few-shot classification. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. The implicit knowledge distillation is designed to transfer the fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models, which have rich representations, are pre-trained using the InfoNCE objective and natural language supervision before being fine-tuned on particular tasks. Although CLIP excels at zero-shot transfer learning, it suffers from an explaining-away problem, that is, it concentrates on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure of the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from the co-occurrence of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose using the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments, we compare CLOOB to CLIP after pre-training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
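The InfoLOOB ("leave one out bound") objective differs from InfoNCE chiefly in its denominator: the positive pair is left out, so the objective does not saturate once the positive dominates. A minimal one-directional sketch is below; the symmetric averaging and the Hopfield retrieval step that CLOOB adds on top are omitted, and the temperature is an assumed value.

```python
import numpy as np

def infoloob_one_direction(img_emb, txt_emb, temperature=0.07):
    """InfoLOOB term for image-to-text: the positive is excluded from the denominator."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                      # (B, B) similarity logits
    pos = np.diag(logits)                                   # matched-pair scores
    neg = np.exp(logits).copy()
    np.fill_diagonal(neg, 0.0)                              # leave the positive out
    return np.mean(-pos + np.log(neg.sum(axis=1)))
```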
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets.
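One plausible form of the modified compatibility function is to weight each patch's similarity to the text CLS token by a softmax over the patch-text similarities, so that the most relevant patches dominate the image-level score while the per-patch similarities remain available for segmentation. The weighting below is my assumption rather than PACL's exact definition.

```python
import numpy as np

def patch_aligned_compatibility(patch_tokens, text_cls, tau=0.1):
    """Patch-weighted image-text compatibility score.

    patch_tokens : (P, D) projected, L2-normalized vision patch embeddings.
    text_cls     : (D,)  projected, L2-normalized text CLS embedding.
    """
    sims = patch_tokens @ text_cls                   # (P,) per-patch similarity
    weights = np.exp(sims / tau)
    weights /= weights.sum()                         # attention-like patch weights
    return float(weights @ sims)                     # weighted image-level score
```

At inference, the per-patch similarities themselves can be thresholded per text query to obtain a rough open-vocabulary segmentation mask.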
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers, where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs that predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments in Time), cross-modal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
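The training objective combines a contrastive term on the unimodal embeddings with an autoregressive captioning term (next-token cross-entropy) on the multimodal decoder outputs. The sketch below shows that combination, taking the contrastive term as an already-computed scalar; the loss weights are placeholders rather than CoCa's tuned values.

```python
import numpy as np

def captioning_loss(token_logits, target_tokens):
    """Teacher-forced next-token cross-entropy.

    token_logits  : (T, V) decoder logits at each position.
    target_tokens : (T,)   ground-truth token ids.
    """
    logits = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_tokens)), target_tokens])

def combined_objective(contrastive_loss, token_logits, target_tokens,
                       w_con=1.0, w_cap=2.0):
    # weighted sum of the contrastive and captioning terms (weights assumed)
    return w_con * contrastive_loss + w_cap * captioning_loss(token_logits, target_tokens)
```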
Since their appearance, vision-language models trained on large, randomly collected data have had significant impact in many areas. However, while they excel in various fields such as image-text retrieval, their inner workings are still not fully understood. This work analyzes the true zero-shot capabilities of those models. We start with an analysis of the training corpus, assessing to what extent (and which of) the test classes are actually zero-shot, and how this correlates with the performance of individual classes. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, to assess how well this classical notion of zero-shot learning emerges from large-scale supervision. We leverage the recently released LAION400M data corpus as well as the publicly available models of CLIP, OpenCLIP, and FLAVA, evaluating attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (a lot) during pre-training; (ii) zero-shot performance mainly stems from the models' capability of recognizing class labels whenever they are present in the text, and only a significantly lower performing capability of attribute-based zero-shot learning is observed when class labels are not used; (iii) the number of attributes used can have a significant effect on performance and can easily cause a large drop in performance.
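The first analysis step, checking how many "zero-shot" test classes actually appear in the pre-training captions, is easy to reproduce in spirit. The sketch below counts caption hits per class name over a toy list of alt-texts; whole-word matching and the toy corpus stand in for a LAION-400M-scale scan and are assumptions.

```python
import re
from collections import Counter

def class_occurrence_counts(class_names, captions):
    """Count, per class name, how many captions mention it as a whole word."""
    counts = Counter()
    patterns = {c: re.compile(r"\b" + re.escape(c.lower()) + r"\b") for c in class_names}
    for cap in captions:
        cap = cap.lower()
        for c, pat in patterns.items():
            if pat.search(cap):
                counts[c] += 1
    return counts

captions = ["a painted bunting perched on a branch", "stock photo of a zebra crossing"]
print(class_occurrence_counts(["painted bunting", "zebra"], captions))
# Counter({'painted bunting': 1, 'zebra': 1})
```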
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
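Momentum distillation maintains an exponential-moving-average copy of the model and mixes its softened similarity distribution into the contrastive targets, so that a noisy caption is not forced to be the unique positive. The sketch below shows the EMA update and the blended targets; the momentum and mixing coefficients are typical values I am assuming, not necessarily ALBEF's.

```python
import numpy as np

def ema_update(student_params, teacher_params, momentum=0.995):
    """In-place EMA update of the momentum (teacher) parameters.

    Both arguments are dicts mapping parameter names to numpy arrays.
    """
    for name, p in student_params.items():
        teacher_params[name] = momentum * teacher_params[name] + (1 - momentum) * p

def distilled_targets(teacher_logits, alpha=0.4):
    """Blend one-hot contrastive targets with the momentum model's soft targets.

    teacher_logits : (B, B) image-text similarity logits from the momentum model.
    """
    B = teacher_logits.shape[0]
    t = teacher_logits - teacher_logits.max(axis=1, keepdims=True)
    soft = np.exp(t) / np.exp(t).sum(axis=1, keepdims=True)   # teacher pseudo-targets
    return (1 - alpha) * np.eye(B) + alpha * soft
```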
The fundamental goal of artificial intelligence (AI) is to mimic the core cognitive activities of humans. Despite tremendous success in AI research, most existing methods possess only single-domain cognitive abilities. To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted to various downstream cognitive tasks. To achieve this goal, we propose to pre-train our foundation model by self-supervised learning on semantically related data crawled from the Internet, and show that promising results can be obtained on a wide range of downstream tasks. In particular, using the developed model interpretability tools, we demonstrate that our foundation model now possesses strong imagination ability. We believe our work makes a transformative stride towards AGI, from our common practice of "weak or narrow AI" to "strong or general AI".