State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
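The zero-shot transfer step described above can be illustrated with a minimal sketch: class names are written as text prompts, and the image is assigned to the prompt whose embedding is most similar. This is not the released CLIP code. The function names, the toy three-dimensional embeddings, and the fixed temperature of 0.07 are all assumptions made for illustration; a real system would obtain the embeddings from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Pick the prompt whose embedding is closest to the image embedding.

    prompt_embeddings maps a text prompt to its (hypothetical) embedding.
    The temperature value is an assumption, not taken from the abstract.
    """
    prompts = list(prompt_embeddings)
    logits = [cosine(image_embedding, prompt_embeddings[p]) / temperature
              for p in prompts]
    probs = softmax(logits)
    best = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best], dict(zip(prompts, probs))

# Toy embeddings standing in for encoder outputs (hypothetical values).
prompt_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

No labeled training examples for the target classes are used here: adding a new class only requires adding a new prompt and its text embedding.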
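This paper presents contrastive-tuning, a simple method that uses contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that a locked pre-trained image model paired with an unlocked text model works best. We call this instance of contrastive-tuning "Locked-image Text tuning" (LiT-tuning): it only teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformer, and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set and 81.1% on the challenging out-of-distribution ObjectNet test set.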
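As humans, we navigate the world through all our senses, using the input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations of videos through all constituent modalities. When finetuned, it sets a new state of the art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing the ethical and societal implications of multimodal pretraining.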
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
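Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training with Vision Transformers, we thoroughly evaluate representation quality and compare performance against self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy).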
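Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping image and text representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.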
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
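Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components. (i) Datasets. As a downstream evaluation suite, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyperparameter tuning toolkit is developed to facilitate model evaluation on downstream tasks. (iii) Metrics. A variety of evaluation metrics are used to measure sample efficiency (zero-shot and few-shot) and parameter efficiency (linear probing and full model fine-tuning). We publicly release ELEVATER at https://computer-vision-in-the-wild.github.io/elevater/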
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data and models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
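To make the notion of power law scaling concrete, the sketch below fits a curve of the form error = a * compute^b by ordinary least squares in log-log space. It is an illustration only and is not taken from the linked repository: the function name is hypothetical, and the data points are synthetic values generated from a known power law, not measurements from the study.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Taking logs turns the power law into a straight line,
    log y = log a + b * log x, so b is the slope and log a the intercept.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic data (NOT results from the paper): error = 2.0 * compute**-0.1
compute = [1e9, 1e10, 1e11, 1e12]
error = [2.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, error)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A negative exponent b means the error falls as compute grows; on a log-log plot such a relationship appears as a straight line with slope b.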
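Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.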
Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained on a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset with over 50,000 new human-labeled ImageNet-compliant samples that include web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration and supervision strategy enable robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub (https://github.com/penfever/vlhub/).
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
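As a rough illustration of how a dual encoder can be trained with a contrastive loss, the sketch below computes a symmetric image-to-text and text-to-image cross-entropy over a small batch of paired embeddings. It is not the authors' implementation: the function names, the temperature of 0.1, the averaging of the two directions (some formulations sum them instead), and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss for a batch of (image, text) pairs.

    Pair k is the positive for row k and column k of the similarity
    matrix; every other entry in that row or column acts as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: classify the matching text for each image.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch (hypothetical values): aligned pairs vs. deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
print(aligned_loss < swapped_loss)
```

The loss is small when each image is most similar to its own caption and large when captions are mismatched, which is the signal that pulls paired representations together during training.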
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks, e.g., Long Short-Term Memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
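A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in-distribution and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability in both the vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions for the vision and language modalities, respectively. Plex greatly improves the state of the art across reliability tasks and simplifies the traditional protocol, as it improves out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.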
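The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision can result in vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of the two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training dataset meets certain criteria (it is sufficiently large and contains descriptive captions with low variability), image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, where the added supervision through captions is actually detrimental. Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.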
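Large datasets of paired images and text have become increasingly popular for learning generic representations for vision and vision-and-language tasks. Such datasets have been built by querying search engines or collecting HTML alt-text; since web data is noisy, they require complex filtering pipelines to maintain quality. We explore alternate data sources to collect high-quality data with minimal filtering. We introduce RedCaps, a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. We collect data from a manually curated set of subreddits, which give coarse image labels and allow us to steer the dataset composition without labeling individual instances. We show that captioning models trained on RedCaps produce rich and varied captions preferred by humans, and learn visual representations that transfer to many downstream tasks.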
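Multimodal deep learning systems that employ multiple modalities such as text, image, audio, and video show better performance than systems using individual modalities (i.e., unimodal systems). Multimodal machine learning involves multiple aspects: representation, translation, alignment, fusion, and co-learning. In the current state of multimodal machine learning, the assumption is that all modalities are present, aligned, and noiseless during training and testing. However, in real-world tasks it is typically observed that one or more modalities are missing, noisy, lacking annotated data, have unreliable labels, or are scarce in training, testing, or both. This challenge is addressed by a learning paradigm called multimodal co-learning. The modeling of a (resource-poor) modality is aided by exploiting knowledge from another (resource-rich) modality, using the transfer of knowledge between modalities, including their representations and predictive models. Because co-learning is an emerging area, there are no dedicated reviews explicitly focusing on all the challenges it addresses. To that end, in this work we provide a comprehensive survey of the emerging area of multimodal co-learning, which has not yet been explored in its entirety. We review implementations that overcome one or more co-learning challenges without explicitly considering them as co-learning challenges. We present a comprehensive taxonomy of multimodal co-learning based on the challenges addressed by co-learning and the associated implementations. The various techniques employed, including the latest ones, are reviewed along with some of the applications and datasets. Our final goal is to discuss challenges and perspectives, along with important ideas and directions for future work, that we hope will be beneficial for the entire research community focusing on this exciting domain.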
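Taking advantage of recent advances in deep learning, text-to-image generative models are currently attracting the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models enable the production of many different types of high-resolution images, where human imagination is the only limit. However, these models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models and makes the reproduction of their results complicated, if not impossible. In this paper, our goal is first to review the different approaches and techniques used by these models, and then to propose our own implementation of a text-to-image model. Building heavily on DALL-E 2, we introduce several slight modifications to tackle the high computational cost involved. We thus have the opportunity to run experiments to understand what these models are capable of, especially in a low-resource regime. In particular, we provide additional and deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Furthermore, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.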
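Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.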
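This survey draws a broad, panoramic picture of the state of the art (SoTA) of research on generative methods for the analysis of social media data. It fills a void, as existing survey articles are either much narrower in scope or dated. We include two aspects that are currently gaining importance in mining and modeling social media: dynamics and networks. Social dynamics are important for understanding the spread of influence or diseases, the formation of friendships, and so on. Networks, on the other hand, can capture various complex relationships, providing additional insight and identifying important patterns that would otherwise go unnoticed.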