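Automatically discovering failures in vision models under real-world settings remains an open challenge. This work shows how off-the-shelf, large-scale image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to find such failures automatically. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic yet realistic inputs given a ground-truth label. Misclassified inputs are clustered, and a captioning model is used to describe each cluster. Each cluster's description is used in turn to generate more inputs and to assess whether specific clusters induce more failures than expected. We use this pipeline to demonstrate that we can effectively interrogate classifiers trained on ImageNet to find specific failure cases and discover spurious correlations. We also show that the approach can be scaled to generate adversarial datasets targeting specific classifier architectures. This work serves as a proof of concept demonstrating the utility of large-scale generative models for automatically discovering bugs in vision models in an open-ended manner. We also describe a number of limitations and pitfalls related to this approach.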
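The last step of the pipeline described above, checking whether a captioned cluster induces more failures than expected, can be sketched as a simple rate comparison. This is a minimal illustration, not the authors' implementation: the function names, the `margin` parameter, the example captions, and the hard-coded outcomes are all hypothetical, and the generation, classification, clustering, and captioning models are assumed to have run already.

```python
from collections import defaultdict


def cluster_failure_rates(records):
    """Return {cluster caption: failure rate} from (caption, is_misclassified) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for caption, is_misclassified in records:
        totals[caption] += 1
        failures[caption] += int(is_misclassified)
    return {caption: failures[caption] / totals[caption] for caption in totals}


def flag_clusters(records, baseline_rate, margin=0.0):
    """Return (caption, rate) pairs whose rate exceeds baseline_rate + margin, worst first."""
    rates = cluster_failure_rates(records)
    flagged = [(c, r) for c, r in rates.items() if r > baseline_rate + margin]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical outcomes: each record stands for one image generated from a
# cluster caption and whether the classifier under test misclassified it.
records = [
    ("fly on a flower", True), ("fly on a flower", True),
    ("fly on a flower", True), ("fly on a flower", False),
    ("fly on a wall", False), ("fly on a wall", False),
    ("fly on a wall", True), ("fly on a wall", False),
]

# baseline_rate stands for the failure rate observed with the plain class prompt.
print(flag_clusters(records, baseline_rate=0.3))
```

A real evaluation would need enough samples per cluster for the comparison to be statistically meaningful; the fixed threshold here is only a placeholder for such a test.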
translated by 谷歌翻译
Vision models often fail systematically on groups of data that share common semantic characteristics (e.g., rare objects or unusual scenes), but identifying these failure modes is a challenge. We introduce AdaVision, an interactive process for testing vision models which helps users identify and fix coherent failure modes. Given a natural language description of a coherent group, AdaVision retrieves relevant images from LAION-5B with CLIP. The user then labels a small amount of data for model correctness, which is used in successive retrieval rounds to hill-climb towards high-error regions, refining the group definition. Once a group is saturated, AdaVision uses GPT-3 to suggest new group descriptions for the user to explore. We demonstrate the usefulness and generality of AdaVision in user studies, where users find major bugs in state-of-the-art classification, object detection, and image captioning models. These user-discovered groups have failure rates 2-3x higher than those surfaced by automatic error clustering methods. Finally, finetuning on examples found with AdaVision fixes the discovered bugs when evaluated on unseen examples, without degrading in-distribution accuracy, and while also improving performance on out-of-distribution datasets.
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images, which are then transformed into text. An important question is: which annotation best reflects a deep understanding of image content? Similarly, given a text, which image best presents the semantics of the text? In this work, we argue that the best text or caption for a given image is the text that would generate the image most similar to that image. Likewise, the best image for a given text is the image that results in the caption best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
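Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, such as an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io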
Neural image classifiers are known to undergo severe performance degradation when exposed to input that exhibits covariate shift with respect to the training distribution. Successful hand-crafted augmentation pipelines aim either to approximate the expected test-domain conditions or to perturb the features that are specific to the training environment. Developing effective pipelines is typically cumbersome, and yields transformations whose impact on classifier performance is hard to understand and control. In this paper, we show that the ability of recent Text-to-Image (T2I) generators to simulate image interventions via natural-language prompts can be leveraged to train more robust models, offering a more interpretable and controllable alternative to traditional augmentation methods. We find that a variety of prompting mechanisms are effective for producing synthetic training data sufficient to achieve state-of-the-art performance on widely adopted domain-generalization benchmarks and to reduce classifiers' dependency on spurious features. Our work suggests that further progress in T2I generation and a tighter integration with other research fields may represent a significant step towards the development of more robust machine learning systems.
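We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.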
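Taking advantage of recent advances in deep learning, text-to-image generative models are currently attracting broad public attention. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models can produce many different types of high-resolution images, where human imagination is the only limit. However, these models require very large amounts of computational resources to train, and must handle huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models and makes reproducing their results complicated, if not impossible. In this work, we aim first to review the different approaches and techniques used by these models, and then to propose our own implementation of a text-to-image model. Building heavily on DALL-E 2, we introduce several slight modifications to tackle the high computational cost it induces. We thus have the opportunity to run experiments to understand what these models are capable of, especially in a low-resource regime. In particular, we provide deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Moreover, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.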
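Object cut-and-paste has become a promising approach for efficiently generating large sets of labeled training data. It involves compositing foreground object masks onto background images. When the background images are congruent with the objects, they provide helpful context information for training object recognition models. While the approach can easily generate large amounts of labeled data, finding congruent context images for downstream tasks has remained an elusive problem. In this work, we propose a new paradigm for automatic context image generation at scale. At the core of our approach lies the interplay between language descriptions of context and language-driven image generation. A language description of a context is obtained by applying an image captioning method to a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images using the language-based DALL-E image generation framework. The generated images are then composited with objects to provide an augmented training set for a classifier. We demonstrate the advantages of our approach over prior context image generation approaches on four object detection datasets. Furthermore, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.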
Recent large-scale image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a very simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by questioning the need for real images when training models for ImageNet classification. More precisely, provided only with the class names that have been used to build the dataset, we explore the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure how useful they are for training classification models from scratch. We show that with minimal and class-agnostic prompt engineering those ImageNet clones we denote as ImageNet-SD are able to close a large part of the gap between models produced by synthetic images and models trained with real images for the several standard classification benchmarks that we consider in this study. More importantly, we show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.
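We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis. Leveraging recent advances in Contrastive Language-Image Pre-training (CLIP), no text data is required during training. Fast text2StyleGAN is formulated as a conditional variational autoencoder (CVAE) that provides extra control and diversity for the generated images at test time. Our model does not require re-training or fine-tuning of the GANs or CLIP when encountering new text prompts. In contrast to prior work, we do not rely on optimization at test time, which makes our method orders of magnitude faster than prior work. Empirically, on the FFHQ dataset, our method offers faster and more accurate generation of images from natural language descriptions with varying levels of detail compared to prior work.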
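Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset, and to intervene to improve the model's performance on these subpopulations. Code is available at https://github.com/madrylab/failure-directions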
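The idea of using a linear classifier to turn error patterns into a direction in feature space can be sketched as follows. The abstract does not say which linear classifier or feature space is used, so this sketch substitutes a plain logistic regression trained by gradient descent on toy 2-D features; all names, hyperparameters, and data below are hypothetical.

```python
import math


def fit_linear_classifier(features, is_error, lr=0.5, epochs=200):
    """Fit logistic regression that predicts whether the model under test erred."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, is_error):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def failure_direction(w):
    """Normalize the weight vector; it points from correct towards erroneous examples."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def rank_by_direction(features, direction):
    """Return example indices sorted from most to least aligned with the direction."""
    scores = [sum(d * xi for d, xi in zip(direction, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


# Toy features: errors (label 1) cluster at large values of the first coordinate.
features = [[1.0, 0.2], [0.9, -0.1], [0.8, 0.1],
            [-1.0, 0.1], [-0.9, -0.2], [-0.8, 0.0]]
is_error = [1, 1, 1, 0, 0, 0]

w, b = fit_linear_classifier(features, is_error)
direction = failure_direction(w)
ranking = rank_by_direction(features, direction)
print(direction, ranking)
```

Examples ranked highest along the direction are candidates for a hard subpopulation; in a shared image-text embedding space, the same direction could in principle be compared against caption embeddings to describe it.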
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
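The pre-training task of predicting which caption goes with which image is commonly implemented as a symmetric contrastive loss over a batch of paired embeddings. The sketch below assumes that formulation; the abstract itself does not spell out the loss, and the function names, the fixed temperature of 0.07, and the toy embeddings are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy; pair k matches pair k."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    loss_texts = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                     for k in range(n)) / n
    return (loss_images + loss_texts) / 2


# Toy batch of two (image, text) pairs with 2-D embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(images, matched_texts))  # small: each caption matches its image
print(contrastive_loss(images, swapped_texts))  # large: captions are paired with the wrong image
```

At inference, the same similarity scores can be computed between one image and a set of class-name prompts, which is how natural language enables zero-shot classification in this setup.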
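Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.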
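Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities for generating visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or "retro-fit" pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. We then explore full-model finetuning of the pretrained model, as well as prompt-based tuning for parameter-efficient adaptation. We evaluate our approach, StoryDALL-E, on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset, DiDeMoSV, collected from a video-captioning dataset. We also develop StoryGANc, a model based on Generative Adversarial Networks (GAN) for story continuation, and compare it with the StoryDALL-E model to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.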
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
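The core idea of using different specialized models at different synthesis stages can be sketched as a sampling loop that routes each step to a stage-specific expert. This is only a toy illustration: the even split of the schedule, the names, and the scalar "denoisers" are assumptions made for the example, and the abstract does not specify how stages are delimited.

```python
def select_expert(step, num_steps, experts):
    """Pick the expert for this step, assuming stages split the schedule evenly."""
    stage = min(step * len(experts) // num_steps, len(experts) - 1)
    return experts[stage]


def sample(x, num_steps, experts):
    """Run the iterative process, applying exactly one expert per step."""
    trace = []
    for step in range(num_steps):
        name, denoise = select_expert(step, num_steps, experts)
        x = denoise(x)
        trace.append(name)
    return x, trace


# Hypothetical experts: stand-ins for networks specialized for early (text-driven)
# and late (detail-refining) stages. Real experts would be full denoising models.
experts = [
    ("early", lambda x: x * 0.5),
    ("late", lambda x: x * 0.9),
]

result, trace = sample(1.0, num_steps=4, experts=experts)
print(trace)
```

Because only one expert runs at each step, the per-step inference cost matches that of a single model, which is consistent with the abstract's claim of unchanged inference computation.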
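Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique that trades off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.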
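Classifier-free guidance is usually written as an extrapolation from the unconditional noise prediction towards the text-conditional one. The sketch below shows that commonly used formulation on plain lists of floats; the function and argument names are hypothetical, and the abstract itself does not give the formula.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine noise predictions: eps_uncond + s * (eps_cond - eps_uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-element "image".
eps_uncond = [0.0, 1.0]   # prediction with an empty caption
eps_cond = [1.0, 1.0]     # prediction conditioned on the caption

print(classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0))
```

With a scale of 1 the result equals the conditional prediction, and a scale of 0 gives the unconditional one; scales above 1 push samples further towards the caption, typically at the cost of diversity.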
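Recent text-to-image matching models apply contrastive learning to large corpora of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption for a given image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.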
Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
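Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state of the art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.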
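To make "norm-bounded perturbations of size epsilon" concrete, the sketch below projects a perturbed input back onto the $\ell_\infty$ or $\ell_2$ ball of radius epsilon around the clean input, the constraint an attack must satisfy in this threat model. It is an illustration on flat lists of pixel values, not the paper's training code; the function names are hypothetical, and clipping to the valid pixel range is omitted for brevity.

```python
import math


def project_linf(x_adv, x, eps):
    """Clip each coordinate of the perturbation (x_adv - x) to [-eps, eps]."""
    return [xi + max(-eps, min(eps, ai - xi)) for ai, xi in zip(x_adv, x)]


def project_l2(x_adv, x, eps):
    """Rescale the perturbation so that its Euclidean norm is at most eps."""
    delta = [ai - xi for ai, xi in zip(x_adv, x)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(x_adv)
    scale = eps / norm
    return [xi + d * scale for xi, d in zip(x, delta)]


x = [0.5, 0.5]            # clean input, pixel values in [0, 1]
x_adv = [0.6, 0.45]       # candidate adversarial input

linf_projected = project_linf(x_adv, x, eps=8 / 255)
l2_projected = project_l2([3.0, 4.0], [0.0, 0.0], eps=128 / 255)
print(linf_projected, l2_projected)
```

Robust accuracy is then the fraction of test inputs for which the classifier stays correct for every perturbation inside the chosen ball, in practice estimated with strong attacks.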
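The application of zero-shot learning in computer vision has been revolutionized by the use of image-text matching models. The most notable example, CLIP, has been widely used for both zero-shot classification and guiding generative models with a text prompt. However, the zero-shot use of CLIP is unstable with respect to the phrasing of the input text, making it necessary to carefully engineer the prompts used. We find that this instability stems from a selective similarity score, which is based only on a subset of the semantically meaningful input tokens. To mitigate it, we present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input, in addition to employing the CLIP similarity loss used in previous works. When applied to one-shot classification through prompt engineering, our method yields an improvement in the recognition rate, without additional training or fine-tuning. Additionally, we show that CLIP guidance of generative models using our method significantly improves the generated images. Finally, we demonstrate a novel use of text-based image generation with spatial conditioning on object location, by requiring the image explainability heatmap for each object to be confined to a pre-determined bounding box.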