Generating and editing images from open-domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training, by using a multimodal encoder to guide image generation. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches such as DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
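The core mechanism described above admits a compact sketch: a VQGAN latent is optimized so that the decoded image moves closer to the text prompt in CLIP's joint embedding space. The snippet below is a minimal illustration of that idea, not the authors' released code; vqgan_decode, clip_image_enc and clip_text_enc are assumed stand-ins for the pretrained components.

    import torch
    import torch.nn.functional as F

    def clip_guided_generation(vqgan_decode, clip_image_enc, clip_text_enc,
                               prompt_tokens, latent_shape, steps=300, lr=0.1):
        """Optimize a VQGAN latent so the decoded image matches the prompt in CLIP space."""
        z = torch.randn(latent_shape, requires_grad=True)            # continuous latent
        text_feat = F.normalize(clip_text_enc(prompt_tokens), dim=-1).detach()
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            image = vqgan_decode(z)                                  # decode latent to RGB
            img_feat = F.normalize(clip_image_enc(image), dim=-1)
            loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()   # cosine distance
            opt.zero_grad()
            loss.backward()
            opt.step()
        return vqgan_decode(z).detach()

Public implementations typically also apply random crops and other augmentations to the decoded image before the CLIP loss to stabilize this optimization.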
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io
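As a rough illustration of the training loop this abstract implies, a single new embedding vector can be optimized with the usual diffusion denoising loss while every pretrained weight stays frozen. The sketch below is an assumption-laden outline, not the released code; frozen_denoiser, frozen_text_encoder and embed_prompt_with are hypothetical stand-ins, and the noise schedule is purely illustrative.

    import torch

    def add_noise(x0, noise, t, T=1000):
        # illustrative linear noise schedule; a real model would use its own schedule
        alpha_bar = (1.0 - t.float() / T).view(-1, 1, 1, 1)
        return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

    def learn_concept_embedding(frozen_denoiser, frozen_text_encoder, embed_prompt_with,
                                concept_images, embed_dim=768, steps=3000, lr=5e-3):
        new_word = torch.nn.Parameter(torch.randn(embed_dim) * 0.01)   # only trainable tensor
        opt = torch.optim.Adam([new_word], lr=lr)
        for step in range(steps):
            x0 = concept_images[step % len(concept_images)]            # one of the 3-5 images
            t = torch.randint(0, 1000, (x0.shape[0],))
            noise = torch.randn_like(x0)
            x_t = add_noise(x0, noise, t)
            cond = frozen_text_encoder(embed_prompt_with("a photo of *", new_word))
            loss = ((frozen_denoiser(x_t, t, cond) - noise) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return new_word.detach()   # the learned "word" for the new concept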
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: first, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
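The two-stage recipe in this abstract, an image tokenizer plus an encoder-decoder Transformer trained sequence-to-sequence, can be caricatured in a few lines of PyTorch. The class below is only a schematic of that architecture under assumed vocabulary sizes, not Parti itself.

    import torch
    import torch.nn as nn

    class TextToImageTokens(nn.Module):
        def __init__(self, text_vocab, image_vocab, d_model=512):
            super().__init__()
            self.text_emb = nn.Embedding(text_vocab, d_model)
            self.image_emb = nn.Embedding(image_vocab, d_model)
            self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
            self.to_logits = nn.Linear(d_model, image_vocab)

        def forward(self, text_tokens, image_tokens):
            # teacher forcing: predict image token i from the text and image tokens < i
            tgt = self.image_emb(image_tokens[:, :-1])
            mask = self.seq2seq.generate_square_subsequent_mask(tgt.size(1))
            h = self.seq2seq(self.text_emb(text_tokens), tgt, tgt_mask=mask)
            return self.to_logits(h)   # cross-entropy target: image_tokens[:, 1:]

At inference time the image tokens would be sampled autoregressively and passed to the tokenizer's decoder to reconstruct pixels.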
Natural language offers a highly intuitive interface for image editing. In this paper, we introduce the first solution for performing local (region-based) edits in generic natural images, based on a natural language description along with an ROI mask. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit towards a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results. To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent at a progression of noise levels. In addition, we show that adding augmentations to the diffusion process mitigates adversarial results. We compare against several baselines and related methods, both qualitatively and quantitatively, and show that our method outperforms these solutions in terms of overall realism, ability to preserve the background and matching the text. Finally, we show several text-driven editing applications, including adding a new object to an image, removing/replacing/altering existing objects, background replacement, and image extrapolation.
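The key blending operation the abstract describes can be written as a one-line update per denoising step: inside the mask the text-guided diffusion result is kept, while outside it a correspondingly-noised copy of the input image is restored. The helper below is a hedged sketch; denoise_step and add_noise are stand-ins for the DDPM's reverse and forward processes.

    import torch

    def blended_step(x_t, input_image, mask, t, denoise_step, add_noise):
        """mask is 1 inside the user's ROI (edited) and 0 outside (preserved)."""
        x_prev = denoise_step(x_t, t)                 # text-guided content inside the ROI
        background = add_noise(input_image, t - 1)    # background at the matching noise level
        return mask * x_prev + (1.0 - mask) * background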
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
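Classifier-free guidance, the strategy the evaluators preferred here, amounts to extrapolating from the unconditional noise prediction toward the text-conditional one. A minimal sketch, with eps_model and null_cond as assumed stand-ins for the noise-prediction network and the empty-caption embedding:

    import torch

    def classifier_free_guidance(eps_model, x_t, t, text_cond, null_cond, guidance_scale=3.0):
        eps_cond = eps_model(x_t, t, text_cond)     # prediction conditioned on the caption
        eps_uncond = eps_model(x_t, t, null_cond)   # prediction with the caption dropped
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

Larger guidance scales trade diversity for fidelity, which is the trade-off the first sentence of the abstract alludes to.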
Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. Such success inspires many efforts trying to use large-scale pre-trained diffusion models for tackling a challenging problem--real image editing. Works conducted in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Using existing works on fine-tuning the pre-trained diffusion models with a single image causes severe overfitting issues. The information leakage from the pre-trained diffusion models means the edited result cannot keep the same content as the given image while creating new features depicted by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon the classifier-free guidance so that the knowledge from the model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that can effectively help the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. The code is available for research purposes at https://github.com/zhang-zx/SINE.git.
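The abstract does not spell out its model-based guidance formula; one heavily hedged reading is a weighted combination of the pretrained model's classifier-free-guided prediction with the prediction of a model fine-tuned on the single image, so that edits follow the text while content follows the source image. The sketch below is an illustration of that interpretation only, not SINE's published equations; all callables and weights are stand-ins.

    import torch

    def model_based_guidance(eps_pretrained, eps_single_image, x_t, t,
                             edit_cond, null_cond, cfg_scale=7.5, content_weight=0.5):
        e_edit = eps_pretrained(x_t, t, edit_cond)
        e_null = eps_pretrained(x_t, t, null_cond)
        guided = e_null + cfg_scale * (e_edit - e_null)       # steer toward the editing prompt
        e_content = eps_single_image(x_t, t)                  # knowledge from the single image
        return (1.0 - content_weight) * guided + content_weight * e_content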
Large, text-conditioned generative diffusion models have recently gained a lot of attention for their impressive performance in generating high-fidelity images from text alone. However, achieving high-quality results is almost unfeasible in a one-shot fashion. On the contrary, text-guided image generation involves the user making many slight changes to inputs in order to iteratively carve out the envisioned image. However, slight changes to the input prompt often lead to entirely different images being generated, and thus the control of the artist is limited in its granularity. To provide flexibility, we present the Stable Artist, an image editing approach enabling fine-grained control of the image generation process. The main component is semantic guidance (SEGA) which steers the diffusion process along variable numbers of semantic directions. This allows for subtle edits to images, changes in composition and style, as well as optimization of the overall artistic conception. Furthermore, SEGA enables probing of latent spaces to gain insights into the representation of concepts learned by the model, even complex ones such as 'carbon emission'. We demonstrate the Stable Artist on several tasks, showcasing high-quality image editing and composition.
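Semantic guidance of the kind described here can be pictured as adding extra guidance terms, one per edit concept, on top of ordinary classifier-free guidance; SEGA additionally thresholds and masks these terms, which the hedged sketch below omits. All names are stand-ins rather than the Stable Artist code.

    import torch

    def semantic_guidance(eps_model, x_t, t, prompt_cond, null_cond, concepts, cfg_scale=7.5):
        """concepts: list of (concept_cond, strength); strength > 0 adds, < 0 removes."""
        e_null = eps_model(x_t, t, null_cond)
        e_prompt = eps_model(x_t, t, prompt_cond)
        eps = e_null + cfg_scale * (e_prompt - e_null)
        for concept_cond, strength in concepts:
            e_concept = eps_model(x_t, t, concept_cond)
            eps = eps + strength * (e_concept - e_null)   # push along or against the concept
        return eps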
Leveraging recent advances in deep learning, text-to-image generative models currently have the merit of attracting the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models enable the production of many different types of high-resolution images, where human imagination is the only limit. However, these models require enormous computational resources to train, as well as handling massive datasets collected from the internet. In addition, neither the codebase nor the models have been released. This consequently prevents the AI community from experimenting with these cutting-edge models, making the reproduction of their results complicated, if not impossible. In this paper, we aim to first review the different approaches and techniques used by these models, and then propose our own implementation of a text-to-image model. Highly based on DALL-E 2, we introduce several slight modifications to tackle the high computational cost induced. We thus have the opportunity to conduct experiments in order to understand what these models are capable of, especially in a low-resource regime. In particular, we provide much deeper analyses than the authors of DALL-E 2, including ablation studies. Moreover, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method, which can be used in conjunction with other guidance methods, to improve image quality. Finally, the images generated by our model are of reasonably good quality, without having to sustain the significant training costs of state-of-the-art text-to-image models.
Text-to-image generation has seen an explosion of interest since 2021. Today, beautiful and intriguing digital images and artworks can be synthesized from textual inputs ("prompts") to deep generative models. Online communities around text-to-image generation and AI-generated art have quickly emerged. This paper identifies six types of prompt modifiers used by practitioners in the online community, based on a three-month ethnographic study. The novel taxonomy of prompt modifiers provides researchers a conceptual starting point for investigating the practice of text-to-image generation, but may also help practitioners of AI-generated art improve their images. We further outline how prompt modifiers are applied in the practice of "prompt engineering." We discuss research opportunities of this novel creative practice in the field of Human-Computer Interaction (HCI). The paper concludes with a discussion of broader implications of prompt engineering from the perspective of Human-AI Interaction (HAI) in future applications beyond the use cases of text-to-image generation and AI-generated art.
Digital art synthesis is receiving increasing attention in the multimedia community as it engages the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, limiting the expressiveness of the model and the diversity of the generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation method that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify the text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the combined effectiveness of diffusion models and multimodal guidance. Code is available at https://github.com/haha-lisa/mgad-multimodal-guided-artwork-diffusion.
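Because CLIP embeds text and images in one space, a text prompt and a reference image can be turned into a single guidance signal by comparing both against the current generation. The loss below is a hedged sketch of such multimodal guidance, not the MGAD implementation; the encoder callables and weights are stand-ins.

    import torch
    import torch.nn.functional as F

    def multimodal_clip_loss(clip_image_enc, clip_text_enc, generated_image,
                             prompt_tokens, reference_image, w_text=1.0, w_image=1.0):
        gen = F.normalize(clip_image_enc(generated_image), dim=-1)
        txt = F.normalize(clip_text_enc(prompt_tokens), dim=-1)
        ref = F.normalize(clip_image_enc(reference_image), dim=-1)
        text_loss = 1.0 - (gen * txt).sum(-1).mean()
        image_loss = 1.0 - (gen * ref).sum(-1).mean()
        return w_text * text_loss + w_image * image_loss   # its gradient can steer the sampler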
Stone" "Mohawk hairstyle" "Without makeup" "Cute cat" "Lion" "Gothic church" * Equal contribution, ordered alphabetically. Code and video are available on https://github.com/orpatashnik/StyleCLIP
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
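The object-masking trick mentioned above can be illustrated with a short data-preparation helper: instead of masking random boxes, the inpainting mask is drawn from an object detector's proposals so the model must paint whole objects from text. The code is an assumed sketch (the detector callable returning integer pixel boxes is hypothetical), not Imagen Editor's pipeline.

    import torch

    def object_masked_example(image, detector, image_size):
        """Build one inpainting training example whose mask covers a detected object."""
        boxes = detector(image)                                   # list of (x0, y0, x1, y1) ints
        mask = torch.zeros(1, image_size, image_size)
        if boxes:
            x0, y0, x1, y1 = boxes[torch.randint(len(boxes), (1,)).item()]
            mask[:, y0:y1, x0:x1] = 1.0                           # hide one object region
        else:
            mask[:, : image_size // 2, : image_size // 2] = 1.0   # fallback box
        masked_image = image * (1.0 - mask)
        return masked_image, mask                                 # the model fills the mask from text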
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
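The parallel decoding that makes this family of models fast can be sketched as iterative unmasking: in every round the model predicts all masked tokens at once and only the most confident predictions are committed. The snippet below is a simplified illustration with stand-in names (token_model, mask_id) and a naive unmasking schedule, not the Muse implementation.

    import math
    import torch

    def parallel_decode(token_model, text_emb, seq_len, mask_id, rounds=12):
        tokens = torch.full((1, seq_len), mask_id)
        for r in range(rounds):
            still_masked = tokens == mask_id
            remaining = int(still_masked.sum().item())
            if remaining == 0:
                break
            logits = token_model(tokens, text_emb)                # (1, seq_len, vocab)
            conf, pred = logits.softmax(-1).max(-1)
            conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
            keep = math.ceil(remaining / (rounds - r))            # finish within the round budget
            idx = conf.topk(keep, dim=-1).indices
            tokens[0, idx[0]] = pred[0, idx[0]]                   # commit the most confident tokens
        return tokens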
As information exists in various modalities in the real world, effective interaction and fusion among multimodal information plays a key role in the creation and perception of multimodal data in computer vision and deep learning research. With their superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Different from traditional visual guidance, which provides explicit clues, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges characterized by inherent modality gaps, the synthesis of high-resolution images, faithful evaluation metrics, etc. In this survey, we comprehensively contextualize the recent progress of multimodal image synthesis and editing and formulate taxonomies according to data modalities and model architectures. We start with an introduction to the different types of guidance modalities in image synthesis and editing. We then describe multimodal image synthesis and editing approaches with detailed frameworks, including generative adversarial networks (GANs), GAN inversion, Transformers, and other methods such as NeRF and diffusion models. This is followed by a comprehensive description of the benchmark datasets and evaluation metrics widely adopted in multimodal image synthesis and editing, as well as a detailed comparison of different synthesis methods with analysis of their respective advantages and limitations. Finally, we provide insights into the current research challenges and possible future research directions. A project associated with this survey is available at https://github.com/fnzhan/mise
Recently, large-scale text-driven synthesis models have attracted much attention thanks to their remarkable ability to generate highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. Therefore, it is only natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the user to provide a spatial mask to localize the edit, thereby ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
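The cross-attention observation above suggests a simple mechanism: record the word-to-pixel attention maps while sampling with the original prompt and inject them when sampling with the edited prompt, so the layout survives the edit. Below is a hedged sketch of such an attention layer with an injection hook; the names and calling convention are assumptions, not the paper's code.

    import torch
    import torch.nn.functional as F

    def cross_attention(query, key, value, injected_attn=None):
        """query: image features; key/value: text-token features."""
        scale = query.shape[-1] ** -0.5
        attn = F.softmax(query @ key.transpose(-2, -1) * scale, dim=-1)
        if injected_attn is not None:
            attn = injected_attn          # reuse the source prompt's word-to-pixel layout
        return attn @ value, attn

    # Pass 1: sample with the source prompt and store every returned attn map.
    # Pass 2: sample with the edited prompt, feeding the stored maps as injected_attn during
    #         the early denoising steps, then let the new prompt's own attention take over.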
Can a generative model be trained to produce images from a specific domain, guided only by a text prompt, without seeing any image? In other words: can an image generator be trained "blindly"? Leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains without having to collect even a single image. We show that through natural language prompts and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. Notably, many of these modifications are difficult or outright impossible to achieve with existing methods. We conduct an extensive set of experiments and comparisons across a wide range of domains. These demonstrate the effectiveness of our approach and show that our shifted models maintain the latent-space properties that make generative models appealing for downstream tasks.
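A commonly used ingredient for this kind of text-driven domain shift is a directional CLIP loss: the change between the source and adapted images in CLIP space is aligned with the change between the source and target text descriptions. The function below is a hedged sketch with stand-in encoder callables rather than the released training code.

    import torch
    import torch.nn.functional as F

    def directional_clip_loss(clip_image_enc, clip_text_enc,
                              source_image, adapted_image, source_text, target_text):
        di = F.normalize(clip_image_enc(adapted_image) - clip_image_enc(source_image), dim=-1)
        dt = F.normalize(clip_text_enc(target_text) - clip_text_enc(source_text), dim=-1)
        return (1.0 - (di * dt).sum(-1)).mean()   # 1 - cosine similarity of the two directions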
Leveraging the expressiveness of StyleGAN and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes, such as age and gender, in facial images. An intriguing yet challenging question arises: can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive Language-Image Pretraining (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts. Different from in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of the semantic knowledge encapsulated in CLIP, as well as more delicate handling of editing directions to avoid getting stuck in local minima or undesired edits. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward the desired direction from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps the (target text) embedding into the latent space and fuses it with the latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing when driven by target texts with various counterfactual concepts.
Recent diffusion-based AI art platforms are able to create impressive images from simple text descriptions. This makes them powerful tools for concept design in any discipline that requires creativity in visual design tasks. This is also true for early stages of architectural design with multiple stages of ideation, sketching and modelling. In this paper, we investigate how applicable diffusion-based models already are to these tasks. We research the applicability of the platforms Midjourney, DALL-E 2 and StableDiffusion to a series of common use cases in architectural design to determine which are already solvable or might soon be. We also examine how they are already being used by analyzing a data set of 40 million Midjourney queries with NLP methods to extract common usage patterns. With these insights, we derive a workflow for interior and exterior design that combines the strengths of the individual platforms.
Neural image classifiers are known to undergo severe performance degradation when exposed to input that exhibits covariate shift with respect to the training distribution. Successful hand-crafted augmentation pipelines aim either at approximating the expected test domain conditions or at perturbing the features that are specific to the training environment. The development of effective pipelines is typically cumbersome and produces transformations whose impact on classifier performance is hard to understand and control. In this paper, we show that recent Text-to-Image (T2I) generators' ability to simulate image interventions via natural-language prompts can be leveraged to train more robust models, offering a more interpretable and controllable alternative to traditional augmentation methods. We find that a variety of prompting mechanisms are effective for producing synthetic training data sufficient to achieve state-of-the-art performance in widely-adopted domain-generalization benchmarks and reduce classifiers' dependency on spurious features. Our work suggests that further progress in T2I generation and a tighter integration with other research fields may represent a significant step towards the development of more robust machine learning systems.
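As a concrete, hedged sketch of the recipe described above, a text-to-image generator can be prompted with templated interventions (weather, style, time of day) to synthesize additional training data per class; the generate callable and the templates here are illustrative assumptions, not the paper's exact prompts.

    def build_synthetic_training_set(class_names, generate, per_prompt=50):
        templates = [
            "a photo of a {} in heavy fog",
            "a sketch of a {}",
            "a photo of a {} at night",
        ]
        dataset = []
        for name in class_names:
            for template in templates:
                prompt = template.format(name)
                for image in generate(prompt, n=per_prompt):   # T2I model simulates interventions
                    dataset.append((image, name))
        return dataset

The synthetic pairs would then be mixed with the real training set when fitting the classifier.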
The text-to-image model Stable Diffusion has recently become very popular. Only weeks after its open source release, millions are experimenting with image generation. This is due to its ease of use, since all it takes is a brief description of the desired image to "prompt" the generative model. Rarely do the images generated for a new prompt immediately meet the user's expectations. Usually, an iterative refinement of the prompt ("prompt engineering") is necessary for satisfying images. As a new perspective, we recast image prompt engineering as interactive image retrieval - on an "infinite index". Thereby, a prompt corresponds to a query and prompt engineering to query refinement. Selected image-prompt pairs allow direct relevance feedback, as the model can modify an image for the refined prompt. This is a form of one-sided interactive retrieval, where the initiative is on the user side, whereas the server side remains stateless. In light of an extensive literature review, we develop these parallels in detail and apply the findings to a case study of a creative search task on such a model. We note that the uncertainty in searching an infinite index is virtually never-ending. We also discuss future research opportunities related to retrieval models specialized for generative models and interactive generative image retrieval. The application of IR technology, such as query reformulation and relevance feedback, will contribute to improved workflows when using generative models, while the notion of an infinite index raises new challenges in IR research.