Enormous waves of technological innovation over the past several years, marked by advances in AI, are profoundly reshaping industry and society. However, down the road a key challenge awaits us: our capability of meeting rapidly growing scenario-specific demands is severely limited by the cost of acquiring training data. This difficult situation stems from the limitations of the mainstream learning paradigm: for each new scenario we need to train a new model on a large quantity of well-annotated data, usually from scratch. To tackle this fundamental problem, we move beyond it and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the trained model develops strong generalizability. We evaluate our model on 26 well-known datasets covering four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, consistently outperform counterparts trained on the full data, often by a significant margin. This is an important step toward a promising prospect in which such a model with general vision capability can greatly reduce the reliance on data and thereby expedite the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem to support its future development in an open and inclusive manner.
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representation on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual representations from vision-language pre-training are superior to those with vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
Aiming at a holistic understanding of multiple downstream tasks simultaneously, there is a need for features with better transferability. Although many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable-area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baselines, which may be caused by the distinctions between the training objectives and architectural designs in the pretrain-finetune paradigm. To overcome this dilemma, and to avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, in which off-the-shelf pre-trained models can be effectively adapted without additional training overhead. During the adapt stage, we use learnable multi-scale adapters to dynamically adjust the pre-trained model weights under the supervision of multi-task objectives while leaving the pre-trained knowledge untouched. Furthermore, we regard the vision-language pre-trained model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompts and the alignment between visual and textual features.
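For illustration, a minimal sketch of a generic residual bottleneck adapter attached to a frozen backbone feature map. The class name, channel sizes, and placement are illustrative assumptions, not the paper's exact multi-scale LV-Adapter design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: the frozen backbone feature is adjusted by a
    small trainable branch while the pre-trained weights stay untouched."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return feat + self.up(self.act(self.down(feat)))

# Usage: freeze the pre-trained backbone, train only adapters and task heads.
backbone = nn.Conv2d(3, 256, kernel_size=3, padding=1)   # stand-in for a pre-trained stage
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = BottleneckAdapter(channels=256)
x = torch.randn(2, 3, 64, 64)
adapted = adapter(backbone(x))   # (2, 256, 64, 64); gradients flow only through the adapter
```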
What constitutes an object? This has been a long-standing question in computer vision. Toward this goal, numerous learning-based approaches have been developed to score objectness, but they generally do not generalize across new domains and unseen objects. In this paper, we advocate that existing methods lack a top-down supervisory signal governed by human-understandable semantics. To bridge this gap, we explore multi-modal vision transformers (MViTs) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs at localizing generic objects in images. Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query. We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks. Moreover, MViTs offer enhanced interactability through intelligible text queries. Code: https://git.io/j1hpy.
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the train-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
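A hedged sketch of the masked feature-prediction objective described above: a student ViT predicts the frozen CLIP vision features at masked patch positions, here supervised by a negative-cosine loss. The loss choice and the toy shapes are assumptions for illustration, not EVA's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_feature_loss(pred_feats: torch.Tensor,
                        target_feats: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """pred_feats/target_feats: (B, N, D) patch features from the student ViT and the
    frozen CLIP vision tower; mask: (B, N) boolean, True where the patch was masked out.
    The loss (negative cosine similarity) is computed only on masked positions."""
    pred = F.normalize(pred_feats, dim=-1)
    target = F.normalize(target_feats, dim=-1).detach()   # CLIP targets are frozen
    cos = (pred * target).sum(dim=-1)                     # (B, N) per-patch similarity
    return -(cos * mask).sum() / mask.sum().clamp(min=1)

# toy shapes: batch of 2 images, 196 patches, 768-dim features, ~40% masked
pred = torch.randn(2, 196, 768, requires_grad=True)
target = torch.randn(2, 196, 768)
mask = torch.rand(2, 196) < 0.4
loss = masked_feature_loss(pred, target, mask)
loss.backward()
```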
Pre-training on a mixture of multi-task, multi-domain, and multi-modal data remains an open challenge for general perception pre-training. In this paper, we propose GPPF, a General Perception Pre-training Framework that pre-trains a task-level dynamic network, composed of knowledge "legos" in each layer, on labeled multi-task and multi-domain datasets. By examining humans' innate ability to learn in complex environments, we identify and transfer three critical elements to deep networks: (1) simultaneous exposure to diverse cross-task and cross-domain information in each batch; (2) partitioned knowledge storage in separate lego units, driven by knowledge sharing; and (3) sparse activation of a subset of lego units for both pre-training and downstream tasks. Notably, joint training of different vision tasks is non-trivial because of their differences in input shape, loss function, output format, data distribution, and so on. We therefore develop a plug-and-play multi-task training algorithm that supports Single Iteration Multiple Tasks (SIMT) concurrent training. SIMT lays the foundation for pre-training with large-scale multi-task, multi-domain datasets and proves essential for stable training in our GPPF experiments. Encouragingly, exhaustive experiments show that our GPPF-R50 model achieves significant improvements over strong baselines on the 8 pre-training tasks in GPPF-15M and harvests a range of SOTA results on 22 downstream tasks with a similar computation budget. We also verify the generalization ability of GPPF to SOTA vision transformers, with consistent improvements. These solid experimental results fully demonstrate the effective knowledge learning, storage, sharing, and transfer provided by our novel GPPF framework.
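A hedged sketch of what single-iteration-multiple-tasks (SIMT) training could look like: each optimizer step draws one batch per task, routes it through a shared trunk plus a task-specific head, and sums the losses. The round-robin sampling, equal loss weights, and module names are illustrative assumptions; GPPF's sparse lego routing is not reproduced here.

```python
import torch
import torch.nn as nn

# shared trunk plus one lightweight head per task (stand-ins for GPPF's lego units)
trunk = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
heads = nn.ModuleDict({
    "cls_a": nn.Linear(256, 10),
    "cls_b": nn.Linear(256, 5),
})
criteria = {"cls_a": nn.CrossEntropyLoss(), "cls_b": nn.CrossEntropyLoss()}
opt = torch.optim.AdamW(list(trunk.parameters()) + list(heads.parameters()), lr=1e-4)

def fake_batch(num_classes):            # placeholder loaders for the sketch
    return torch.randn(8, 3, 32, 32), torch.randint(0, num_classes, (8,))

for step in range(3):                   # every optimizer step sees every task (SIMT-style)
    opt.zero_grad()
    total = 0.0
    for name, n_cls in [("cls_a", 10), ("cls_b", 5)]:
        x, y = fake_batch(n_cls)
        total = total + criteria[name](heads[name](trunk(x)), y)
    total.backward()
    opt.step()
```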
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and text to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, which expands the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can easily be adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer to novel images and objects. All of these properties are critical for a vision foundation model that serves general-purpose vision tasks. Florence achieves new state-of-the-art results on 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
Despite the superior performance brought by vision-and-language pretraining, it remains unclear whether learning with multi-modal data can help understand each individual modality. In this work, we investigate how language can help with visual representation learning from a probing perspective. Specifically, we compare vision-and-language and vision-only models by probing their visual representations on a broad range of tasks, in order to assess the quality of the learned representations in a fine-grained manner. Interestingly, our probing results suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. With further analysis using detailed metrics, our study suggests that language helps vision models learn better semantics, but not localization. Code is released at https://github.com/Lizw14/visual_probing.
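The probing protocol the study relies on can be sketched as standard linear probing: freeze a pre-trained backbone, train only a linear classifier on its features, and compare probes across backbones. The backbone stub, loader, and hyperparameters below are illustrative, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                 loader, epochs: int = 5, lr: float = 1e-3):
    """Train a linear classifier on frozen features; return the fitted probe."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)          # frozen representation under evaluation
            loss = ce(probe(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe

# toy usage with a stand-in backbone and synthetic data
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
loader = [(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))) for _ in range(4)]
probe = linear_probe(backbone, feat_dim=128, num_classes=10, loader=loader, epochs=1)
```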
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
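The zero-shot transfer described above can be reproduced with the released `clip` package: encode a set of class-name prompts once, then classify any image by cosine similarity. A minimal sketch; the class names, prompt template, and `example.jpg` path are placeholders.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "airplane"]                        # placeholder label set
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))   # per-class zero-shot probabilities
```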
Transfer learning enables the reuse of knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models: pre-training an image classification model on the ILSVRC dataset and then fine-tuning it on any target task. However, previous systematic studies of transfer learning have been limited, and the circumstances in which it is expected to work are not fully understood. In this paper, we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured-output task types relevant to modern computer vision applications. In total, we carry out over 2000 transfer learning experiments, including many in which the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source that significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset for best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is broader than that of the target; (5) transfer across task types can be beneficial, but its success depends heavily on both the source and target task types.
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
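A hedged sketch of the core idea of sharing a learnable prompt across tasks: a single prompt tensor is learned on source tasks and then used to initialize (or is jointly trained as) the prompt for each target task. The shapes and the way the prompt is prepended to token embeddings are illustrative assumptions, not MVLPT's exact implementation.

```python
import torch
import torch.nn as nn

class SharedPrompt(nn.Module):
    """A single learnable prompt (n_tokens x dim) shared or re-used across tasks."""
    def __init__(self, n_tokens: int = 16, dim: int = 512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def prepend(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (B, L, dim) -> (B, n_tokens + L, dim)
        b = token_embeddings.shape[0]
        return torch.cat([self.prompt.unsqueeze(0).expand(b, -1, -1),
                          token_embeddings], dim=1)

shared = SharedPrompt()                              # learned on the source tasks
target_prompt = SharedPrompt()
target_prompt.prompt.data.copy_(shared.prompt.data)  # initialize the target-task prompt

tokens = torch.randn(4, 32, 512)                     # stand-in token embeddings
extended = target_prompt.prepend(tokens)             # fed into a frozen encoder downstream
```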
Recent progress has shown that large-scale pre-training on contrastive image-text pairs can be a promising alternative for learning high-quality visual representations from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new dense prediction framework that implicitly and explicitly leverages the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using contextual information from the image to prompt the language model, we are able to help our model better exploit the pre-trained knowledge. Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/denseclip
Large-scale datasets play a vital role in computer vision. But current datasets are annotated blindly, without differentiation among samples, making data collection inefficient and unscalable. The open question is how to build a mega-scale dataset actively. Although advanced active learning algorithms might be the answer, we experimentally find that they are ineffective in realistic annotation scenarios where out-of-distribution data is extensive. This work therefore proposes a novel active learning framework for realistic dataset annotation. Equipped with this framework, we build a high-quality vision dataset, Bamboo, which consists of 69M image classification annotations with 119K categories and 28M object bounding box annotations with 809 categories. We organize these categories by a hierarchical taxonomy integrated from several knowledge bases. The classification annotations are four times larger than ImageNet22K, and the detection annotations are three times larger than Object365. Compared with ImageNet22K and Objects365, models pre-trained on Bamboo achieve superior performance on various downstream tasks (6.2% gains on classification and 2.1% gains on detection). We believe our active learning framework and Bamboo are essential for future work.
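For context, a generic uncertainty-based acquisition step is sketched below: rank unlabeled images by predictive entropy and send the most uncertain ones for annotation. This is a standard active-learning baseline for illustration only; the abstract does not specify Bamboo's actual selection rule, so none of this code reflects it.

```python
import torch

def select_for_annotation(model, unlabeled_loader, budget: int):
    """Generic uncertainty sampling: score unlabeled images by predictive entropy
    and return the indices of the `budget` most uncertain ones."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = model(x).softmax(dim=-1)
            entropy = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)
            scores.append(entropy)
    scores = torch.cat(scores)
    return scores.topk(min(budget, scores.numel())).indices.tolist()

# toy usage with a stand-in classifier and two unlabeled batches
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
unlabeled = [torch.randn(8, 3, 32, 32) for _ in range(2)]
picked = select_for_annotation(model, unlabeled, budget=5)
```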
This paper revisits the principle of uniform convergence in statistical learning, discusses how it underpins machine learning, and attempts to reach a better understanding of the essential problem that current deep learning algorithms are solving. Using computer vision as an example domain of machine learning, the discussion shows that the recent research trend of pre-training on increasingly large-scale data is largely motivated by reducing the discrepancy between the empirical loss that can actually be optimized and the ultimately desired expected loss. Furthermore, the paper suggests several future research directions in anticipation of the continued growth of data, and argues that more fundamental research on the robustness, interpretability, and reasoning capability of machine learning, achieved by incorporating structure and knowledge, is needed.
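For concreteness, the gap between empirical and expected loss referred to above is usually controlled by a uniform convergence bound. The following is a textbook Rademacher-complexity statement, not a result taken from this paper: for a loss class $\mathcal{F}$ with values in $[0,1]$ and $n$ i.i.d. samples, with probability at least $1-\delta$, every $f \in \mathcal{F}$ satisfies

$$
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell_f(x,y)\big] \;\le\; \frac{1}{n}\sum_{i=1}^{n}\ell_f(x_i,y_i) \;+\; 2\,\mathfrak{R}_n(\mathcal{F}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
$$

where $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of the loss class. Growing the pre-training set shrinks the last two terms, which is the sense in which large-scale pre-training narrows the gap between the optimizable empirical loss and the desired expected loss.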
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets.
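A hedged sketch of one plausible form of a patch-aligned compatibility function: cosine similarities between every vision patch token and the text CLS embedding produce per-patch weights, and the weighted patch aggregate is scored against the text embedding for the contrastive loss. The function name and the softmax pooling choice are illustrative assumptions, not PACL's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_tokens: torch.Tensor,
                                text_cls: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (B, N, D) projected vision patch embeddings;
    text_cls: (B, D) projected text CLS embeddings.
    Returns a (B,) compatibility score per image-text pair."""
    patches = F.normalize(patch_tokens, dim=-1)
    text = F.normalize(text_cls, dim=-1)
    sims = torch.einsum("bnd,bd->bn", patches, text)          # per-patch similarity
    weights = sims.softmax(dim=-1)                            # emphasize matching regions
    pooled = torch.einsum("bn,bnd->bd", weights, patches)     # weighted patch aggregate
    return (F.normalize(pooled, dim=-1) * text).sum(dim=-1)   # image-level score

# at inference, the per-patch similarities `sims` double as a coarse segmentation map
scores = patch_aligned_compatibility(torch.randn(2, 196, 512), torch.randn(2, 512))
```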
This paper presents contrastive tuning, a simple method that employs contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study, we find that locked pre-trained image models combined with unlocked text models work best. We call this instance of contrastive tuning "Locked-image text Tuning" (LiT), which simply teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT tuning is widely applicable: it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers, and MLP-Mixer), using three different image-text datasets. With a transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set and 81.1% on the challenging out-of-distribution ObjectNet test set.
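A minimal sketch of locked-image contrastive tuning as described: the image tower is frozen, and only the text tower (plus a temperature) is trained with the symmetric InfoNCE loss. The encoder stubs and dimensions are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_tower = nn.Linear(2048, 512)     # stand-in for a pre-trained, locked image encoder
text_tower = nn.Linear(768, 512)       # trainable text encoder (stand-in)
for p in image_tower.parameters():
    p.requires_grad_(False)            # "locked image": gradients never touch the image tower

log_temp = nn.Parameter(torch.tensor(2.659))   # learnable log-temperature
opt = torch.optim.Adam(list(text_tower.parameters()) + [log_temp], lr=1e-4)

def lit_step(image_inputs, text_inputs):
    with torch.no_grad():                      # image embeddings could even be precomputed
        img = F.normalize(image_tower(image_inputs), dim=-1)
    txt = F.normalize(text_tower(text_inputs), dim=-1)
    logits = img @ txt.t() * log_temp.exp()    # (B, B) similarity matrix
    labels = torch.arange(img.shape[0])
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = lit_step(torch.randn(8, 2048), torch.randn(8, 768))
```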
Contrastive Language-Image Pre-training (CLIP) has achieved a remarkable breakthrough in open-vocabulary zero-shot image recognition, and many recent studies leverage pre-trained CLIP models for image-level classification and manipulation. In this paper, we further explore the potential of CLIP for pixel-level dense prediction, specifically semantic segmentation. Without annotations or fine-tuning, our method DenseCLIP yields reasonable segmentation results on open concepts across various datasets. By adding pseudo-labeling and self-training, DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins: for example, the scores on unseen classes on Pascal VOC / Pascal Context / COCO Stuff improve from 35.6 / 20.7 / 30.3 to 86.1 / 66.7 / 54.7. We also test the robustness of DenseCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that DenseCLIP can serve as a new, reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation.
Foundation models are not the last chapter of the model production pipeline. Transferring them to thousands of downstream tasks with only a few downstream examples is becoming a trend in the application of foundation models. In this paper, we propose a universal transfer framework, One Transfers All (OTA), which transfers any vision foundation model (VFM) to downstream tasks with few downstream data. We first transfer a VFM to a task-specific model via Image Re-representation Fine-tuning (IRF), and then distill the knowledge from the task-specific model into the deployed model using data produced by downstream-image-guided generation (DIGG). OTA has no dependency on upstream data, the VFM, or the downstream task during transfer. It also provides a way for VFM researchers to release their upstream information for better transfer without leaking data, satisfying privacy requirements. Large-scale experiments validate the effectiveness and superiority of our method in the few-shot data setting. Our code will be released.
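The second stage mentioned above, distilling the task-specific model into the deployed model on generated images, can be approximated by standard knowledge distillation; a hedged sketch with a temperature-scaled KL loss follows. The generation step (DIGG) is replaced here by a placeholder batch, and all names and model stubs are illustrative assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 4.0) -> torch.Tensor:
    """Soft-label KL distillation from a task-specific teacher to the deployed student."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# placeholder for images that would come from a downstream-image-guided generator
generated_images = torch.randn(16, 3, 224, 224)
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10))
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10))

with torch.no_grad():
    teacher_logits = teacher(generated_images)
loss = distillation_loss(student(generated_images), teacher_logits)
loss.backward()
```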
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and figure out that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
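The straightforward baseline the abstract starts from, generating semantic masks by comparing text embeddings with CLIP patch embeddings, can be sketched as per-patch nearest-text assignment. The shapes and the assumption that patch features are already projected into CLIP's joint space are illustrative; this is the naive baseline, not ZegCLIP's final design with the three proposed modifications.

```python
import torch
import torch.nn.functional as F

def patch_text_masks(patch_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     hw: tuple[int, int]) -> torch.Tensor:
    """patch_embeds: (B, N, D) patch embeddings projected into CLIP's joint space;
    text_embeds: (C, D) one embedding per class prompt; hw: patch grid (H, W).
    Returns per-position class indices of shape (B, H, W) at patch resolution."""
    patches = F.normalize(patch_embeds, dim=-1)
    texts = F.normalize(text_embeds, dim=-1)
    logits = torch.einsum("bnd,cd->bnc", patches, texts)   # patch-to-class similarities
    h, w = hw
    return logits.argmax(dim=-1).reshape(-1, h, w)         # hard masks; upsample as needed

masks = patch_text_masks(torch.randn(2, 14 * 14, 512), torch.randn(21, 512), (14, 14))
```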