While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple, new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms several baselines and concurrent works, regarding both qualitative and quantitative evaluations, while being memory and computationally efficient.
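Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet it is unclear how this freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, such as an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io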
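As a rough illustration of learning a new "word" while the model stays frozen, the toy sketch below optimizes only an embedding vector by gradient descent. The frozen "model" here is a hypothetical fixed linear map and the loss is a plain squared error; both are stand-ins chosen for brevity, not the diffusion objective the abstract refers to.

```python
from typing import List

Matrix = List[List[float]]


def frozen_model(embedding: List[float], weights: Matrix) -> List[float]:
    """Hypothetical frozen model: a fixed linear map applied to the embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def learn_embedding(weights: Matrix, target: List[float],
                    steps: int = 500, lr: float = 0.1) -> List[float]:
    """Optimize only the embedding; the weights are read but never updated."""
    emb = [0.0] * len(weights[0])
    for _ in range(steps):
        out = frozen_model(emb, weights)
        residual = [o - t for o, t in zip(out, target)]
        # Gradient of 0.5 * ||W e - t||^2 with respect to e is W^T (W e - t).
        grad = [sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(len(emb))]
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# Toy data (assumed): the target stands in for features of the concept images.
W = [[1.0, 0.0], [0.0, 2.0]]
target = [1.0, 4.0]
new_word = learn_embedding(W, target)
```

Because only `emb` is updated, the learned vector can be dropped into the vocabulary of the unchanged model, which is the property the abstract relies on.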
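Large text-to-image models have achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models (specializing them to users' needs). Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model (Imagen, although our method is not limited to a specific model) so that it learns to bind a unique identifier to that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model together with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images. We apply our technique to several previously unassailable tasks, including subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering (all while preserving the subject's key features). Project page: https://dreambooth.github.io/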
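One way to read the prior preservation idea is as a two-term objective: a reconstruction term on the subject images plus a weighted term on class samples generated by the model itself. The sketch below assumes a mean squared error for both terms and a hypothetical `prior_weight`; the abstract does not give the exact form, so treat this as an illustration only.

```python
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def prior_preservation_loss(pred_subject: List[float], target_subject: List[float],
                            pred_class: List[float], target_class: List[float],
                            prior_weight: float = 1.0) -> float:
    """Subject term plus a weighted class-prior term (assumed form)."""
    subject_term = mse(pred_subject, target_subject)
    # The class targets are assumed to come from the pretrained model's own samples.
    prior_term = mse(pred_class, target_class)
    return subject_term + prior_weight * prior_term


# Toy values (assumed) standing in for model predictions and targets.
loss = prior_preservation_loss([1.0, 2.0], [1.0, 4.0],
                               [0.0, 0.0], [1.0, 1.0],
                               prior_weight=0.5)
```

The second term is what discourages the fine-tuned model from forgetting what a generic member of the class looks like.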
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is very handy for crafting the image they have in mind. The project page is available at https://deepimagination.cc/eDiff-I/
Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. Such success has inspired many efforts to use large-scale pre-trained diffusion models for a challenging problem: real image editing. Works in this area learn a unique textual token corresponding to several images containing the same object. However, in many circumstances only one image is available, such as the painting Girl with a Pearl Earring. Fine-tuning pre-trained diffusion models on a single image with existing methods causes severe overfitting. Information leakage from the pre-trained diffusion model prevents the edit from preserving the content of the given image while creating the new features described by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon classifier-free guidance, so that the knowledge from the model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that can effectively help the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. The code is available for research purposes at https://github.com/zhang-zx/SINE.git.
Shape can specify key object constraints, yet existing text-to-image diffusion models ignore this cue and synthesize objects that are incorrectly scaled, cut off, or replaced with background content. We propose a training-free method, Shape-Guided Diffusion, which uses a novel Inside-Outside Attention mechanism to constrain the cross-attention (and self-attention) maps such that prompt tokens (and pixels) referring to the inside of the shape cannot attend outside the shape, and vice versa. To demonstrate the efficacy of our method, we propose a new image editing task where the model must replace an object specified by its mask and a text prompt. We curate a new ShapePrompts benchmark based on MS-COCO and achieve SOTA results in shape faithfulness, text alignment, and realism according to both quantitative metrics and human preferences. Our data and code will be made available at https://shape-guided-diffusion.github.io.
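The inside/outside constraint on cross-attention can be pictured as a masking step on the attention map. The sketch below is a simplified reading of that idea, not the authors' implementation: it assumes a row-per-pixel attention matrix, a boolean flag per pixel, and a set of token indices that refer to the object inside the shape, and it renormalizes each row after masking (the renormalization is an assumption).

```python
from typing import List, Set


def inside_outside_mask(attn: List[List[float]],
                        pixel_inside: List[bool],
                        inside_tokens: Set[int]) -> List[List[float]]:
    """Zero attention between inside pixels and outside tokens, and vice versa.

    attn[p][t] is the attention weight from pixel p to prompt token t.
    """
    masked = []
    for row, p_in in zip(attn, pixel_inside):
        # Keep a weight only when pixel and token are on the same side of the shape.
        kept = [w if ((t in inside_tokens) == p_in) else 0.0
                for t, w in enumerate(row)]
        total = sum(kept)
        masked.append([w / total for w in kept] if total > 0 else kept)
    return masked


# Toy example (assumed): two pixels, two tokens; token 1 names the inside object.
attn = [[0.5, 0.5], [0.25, 0.75]]
result = inside_outside_mask(attn, pixel_inside=[True, False], inside_tokens={1})
```

In this toy case the inside pixel ends up attending only to token 1 and the outside pixel only to token 0.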
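Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique that trades off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.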
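Classifier-free guidance is commonly written as an extrapolation from the unconditional noise prediction toward the text-conditional one. The sketch below shows that combination on plain lists; the function name, the toy predictions, and the scale value are illustrative assumptions rather than values from the abstract.

```python
from typing import List


def classifier_free_guidance(eps_uncond: List[float],
                             eps_cond: List[float],
                             guidance_scale: float) -> List[float]:
    """Blend unconditional and text-conditional noise predictions.

    A scale of 1.0 returns the conditional prediction unchanged; larger
    values push further in the direction of the text condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for a 3-element "image".
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=3.0)
```

Raising the scale typically trades sample diversity for closer agreement with the prompt, which is the trade-off the abstract describes.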
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
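Research on text-to-image generation has made significant progress in generating diverse and photorealistic images, driven by diffusion and autoregressive models trained on large-scale image-text data. Although state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as "Chortai (dog)" or "Picarones (food)". To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multimodal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of the high-level semantics and low-level visual details of the mentioned entities, which improves its accuracy in generating the entities' visual appearance. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text and retrieval alignment. Re-Imagen achieves new SOTA FID results on two image generation benchmarks, COCO (FID = 5.25) and WikiImage (FID = 5.82), without fine-tuning. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple visual domains. Human evaluation on EntityDrawBench shows that Re-Imagen performs on par with the best prior models in photorealism, but with significantly better faithfulness, especially on less frequent entities.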
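Taking advantage of recent advances in deep learning, text-to-image generative models are currently attracting the attention of the general public. Two of these models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description of an image. Based on a novel approach to image generation called diffusion models, text-to-image models can produce many different types of high-resolution images, where human imagination is the only limit. However, these models require very large amounts of computational resources to train and rely on huge datasets collected from the internet. In addition, neither the codebase nor the models have been released. This prevents the AI community from experimenting with these cutting-edge models and makes reproducing their results complicated, if not impossible. In this paper, we aim first to review the different approaches and techniques used by these models, and then to propose our own implementation of a text-to-image model. Building closely on DALL-E 2, we introduce several slight modifications to tackle the high computational cost involved. This gives us the opportunity to run experiments to understand what these models are capable of, especially in a low-resource regime. In particular, we provide deeper analyses than those performed by the authors of DALL-E 2, including ablation studies. Besides, diffusion models use so-called guidance methods to help the generation process. We introduce a new guidance method that can be used in conjunction with other guidance methods to improve image quality. Finally, the images generated by our model are of reasonably good quality, without incurring the significant training costs of state-of-the-art text-to-image models.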
Can a text-to-image diffusion model be used as a training objective for adapting a GAN generator to another domain? In this paper, we show that classifier-free guidance can be leveraged as a critic, enabling generators to distill knowledge from large-scale text-to-image diffusion models. Generators can be efficiently shifted into new domains indicated by text prompts without access to ground-truth samples from the target domains. We demonstrate the effectiveness and controllability of our method through extensive experiments. Although not trained to minimize CLIP loss, our model achieves equally high CLIP scores and significantly lower FID than prior work on short prompts, and outperforms the baseline qualitatively and quantitatively on long and complicated prompts. To the best of our knowledge, the proposed method is the first attempt at incorporating large-scale pre-trained diffusion models and distillation sampling for text-driven image generator domain adaptation, and it achieves a quality that was previously out of reach. Moreover, we extend our work to 3D-aware style-based generators and DreamBooth guidance.
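Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable ability to generate highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to describing their intent verbally. Therefore, it is only natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the user to provide a spatial mask to localize the edit, thereby ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications that steer the image synthesis by editing the textual prompt only. These include localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.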
We introduce M-VADER: a diffusion model (DM) for image generation where the output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of image and text, and combinations of multiple images. Previously, a number of successful DM image generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion parameter multimodal decoder combining components from an autoregressive vision-language model MAGMA and biases finetuned for semantic search.
Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model for one domain into models for other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models, due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models, which can synthesize diverse images per text prompt, without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline is able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view consistent images in text-guided target domains without additional data, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations, such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction, to take full advantage of the diversity in text.
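We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: first, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and a fine-tuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvement. See https://parti.research.google/ for high-resolution images.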
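Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) have enabled zero-shot image manipulation guided by text prompts. However, applying them to diverse real images remains difficult because of limited GAN inversion capability. Specifically, these approaches often have difficulty reconstructing images with novel poses, views, and highly variable content compared to the training data, and they may alter object identity or produce unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Based on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to existing baselines.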
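Deep generative models make visual content creation more accessible to novice users by automating the synthesis of diverse, realistic content based on a collected dataset. However, current machine learning approaches miss a key element of the creative process: the ability to synthesize things that go far beyond the data distribution and everyday experience. To begin to address this issue, we enable a user to "warp" a given model by editing just a handful of original model outputs with the desired geometric changes. Our method applies a low-rank update to a single model layer to reconstruct the edited examples. Furthermore, to combat overfitting, we propose a latent space augmentation method based on style mixing. Our method allows a user to create a model that synthesizes endless objects with the defined geometric changes, enabling the creation of a new generative model without the burden of curating a large-scale dataset. We also demonstrate that edited models can be composed to achieve aggregated effects, and we present an interactive interface that enables users to create new models through composition. Empirical measurements on multiple test cases suggest the advantage of our method over recent GAN fine-tuning methods. Finally, we showcase several applications using the edited models, including latent space interpolation and image editing.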
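A low-rank update to one layer can be sketched as adding an outer product to that layer's weight matrix while every other layer is left alone. The example below uses rank one for brevity; the function name, the vectors `u` and `v`, and the toy weight matrix are illustrative assumptions, and how the update is fitted to the edited examples is not shown.

```python
from typing import List

Matrix = List[List[float]]


def rank_one_update(weight: Matrix, u: List[float], v: List[float]) -> Matrix:
    """Return weight + u v^T without modifying the original matrix."""
    if len(u) != len(weight) or any(len(row) != len(v) for row in weight):
        raise ValueError("u and v must match the weight's rows and columns")
    return [[w + ui * vj for w, vj in zip(row, v)]
            for row, ui in zip(weight, u)]


# Toy 2x2 layer (assumed). A rank-one edit of an m x n layer needs only
# m + n numbers instead of m * n.
layer = [[1.0, 0.0], [0.0, 1.0]]
edited = rank_one_update(layer, u=[1.0, 2.0], v=[0.5, 0.0])
```

Restricting the edit to a low-rank change in a single layer keeps the number of trainable values small, which fits the abstract's setting of only a handful of edited examples.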
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically achieving more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, achieve great success in zero-shot text-guided 3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage to serve as the 3D shape prior. We then use it to initialize a neural radiance field, which we optimize with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, namely Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
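Controllable image synthesis models allow the creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based textual guidance as well as both content-based and style-based image guidance in a unified form. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on the FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content example image, and examples with both textual and image guidance.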