While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple, new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms several baselines and concurrent works, regarding both qualitative and quantitative evaluations, while being memory and computationally efficient.
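A minimal sketch of the "tune only a few conditioning parameters" idea described above, assuming a diffusers-style UNet in which cross-attention blocks are named `attn2` with `to_k`/`to_v` projections; the parameter names are illustrative, not the paper's exact API:

```python
# Freeze everything except the key/value projections of the cross-attention
# (text-conditioning) layers, which are the only weights being fine-tuned.
import torch

def select_trainable_params(unet: torch.nn.Module):
    trainable = []
    for name, param in unet.named_parameters():
        if "attn2.to_k" in name or "attn2.to_v" in name:
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)
    return trainable

# optimizer = torch.optim.AdamW(select_trainable_params(unet), lr=1e-5)
```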
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With 1.2%-area edited regions, our method reduces the computation of DDIM by 7.5$\times$ and GauGAN by 18$\times$ while preserving the visual fidelity. With SIGE, we accelerate the speed of DDIM by 3.0$\times$ on RTX 3090 and 6.6$\times$ on Apple M1 Pro CPU, and GauGAN by 4.2$\times$ on RTX 3090 and 14$\times$ on Apple M1 Pro CPU.
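A toy sketch of the caching idea, under simplifying assumptions (stride-1 convolution, "same" zero padding, a fixed tile size): recompute a convolution only on tiles whose inputs changed, and reuse the cached outputs everywhere else. The real SIGE engine additionally gathers/scatters tiles and fuses kernels to turn this reduction into wall-clock speedups.

```python
import torch
import torch.nn.functional as F

def sparse_conv2d(x_edited, x_original, weight, cached_out, tile=16, tol=1e-3):
    # Per-tile change mask computed from the input difference.
    diff = (x_edited - x_original).abs().amax(dim=1, keepdim=True)   # (N,1,H,W)
    changed = F.max_pool2d(diff, tile) > tol                         # (N,1,H/t,W/t)

    out = cached_out.clone()
    pad = weight.shape[-1] // 2
    for n, _, ty, tx in changed.nonzero().tolist():
        y0, x0 = ty * tile, tx * tile
        # Recompute only this tile, reading a halo of `pad` pixels around it.
        ys, xs = max(y0 - pad, 0), max(x0 - pad, 0)
        patch = x_edited[n:n + 1, :, ys:y0 + tile + pad, xs:x0 + tile + pad]
        new = F.conv2d(patch, weight, padding=pad)
        # Crop away the halo and write the tile back into the cached output.
        oy, ox = y0 - ys, x0 - xs
        out[n:n + 1, :, y0:y0 + tile, x0:x0 + tile] = new[:, :, oy:oy + tile, ox:ox + tile]
    return out
```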
Deep generative models make visual content creation more accessible to novice users by automating the synthesis of diverse, realistic content based on a collected dataset. However, current machine learning approaches miss a key element of the creative process: the ability to synthesize things that go far beyond the data distribution and everyday experience. To begin to address this issue, we enable a user to "warp" a given model by editing just a handful of original model outputs with desired geometric changes. Our method applies a low-rank update to a single model layer to reconstruct the edited examples. Furthermore, to combat overfitting, we propose a latent-space augmentation method based on style mixing. Our method allows a user to create a model that synthesizes endless objects with defined geometric changes, enabling the creation of new generative models without curating a large-scale dataset. We also demonstrate that edited models can be composed to achieve aggregated effects, and we present an interactive interface that lets users create new models through composition. Empirical measurements on multiple test cases suggest the advantage of our method over recent GAN fine-tuning methods. Finally, we showcase several applications of the edited models, including latent-space interpolation and image editing.
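A hedged sketch of the core fitting step: apply a low-rank update to the weight of a single chosen generator layer and optimize it to reconstruct the user-edited outputs. The `render_fn` hook (re-rendering the fixed latents with the perturbed layer weight) is a placeholder, not the paper's actual interface.

```python
import torch

def fit_low_rank_edit(layer_weight, render_fn, latents, edited_targets,
                      rank=1, steps=200, lr=0.05):
    out_dim, in_dim = layer_weight.shape
    U = torch.zeros(out_dim, rank, requires_grad=True)
    V = torch.randn(rank, in_dim, requires_grad=True)
    opt = torch.optim.Adam([U, V], lr=lr)
    for _ in range(steps):
        W = layer_weight.detach() + U @ V        # low-rank perturbation only
        imgs = render_fn(latents, W)             # re-render with the edited layer
        loss = torch.nn.functional.l1_loss(imgs, edited_targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return layer_weight.detach() + (U @ V).detach()
```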
Existing GAN inversion and editing methods work well for aligned objects with clean backgrounds, such as portraits and animal faces, but they often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea is to explore inversion with a collection of layers, adapting the inversion process to the difficulty of the image. We learn to predict the "invertibility" of different image segments and project each segment into a latent layer. Easier regions can be inverted into an earlier layer of the generator's latent space, while more challenging regions can be inverted into a later feature space. Experiments show that our method obtains better inversion results on complex categories compared to state-of-the-art methods, while maintaining downstream editability. Please see our project page at https://www.cs.cmu.edu/~saminversion.
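An illustrative sketch (not the paper's implementation) of the per-segment layer-selection logic: each segment is assigned the earliest layer whose predicted invertibility score clears a threshold, and harder segments fall back to later feature spaces. The `invertibility_scores` array stands in for the learned predictor's output and is an assumption.

```python
import numpy as np

def assign_layers(invertibility_scores: np.ndarray, threshold: float = 0.5):
    # invertibility_scores: (num_segments, num_layers) predicted scores.
    num_segments, num_layers = invertibility_scores.shape
    assignment = np.full(num_segments, num_layers - 1)     # default: latest feature space
    for s in range(num_segments):
        for layer in range(num_layers):                    # earliest acceptable layer wins
            if invertibility_scores[s, layer] >= threshold:
                assignment[s] = layer
                break
    return assignment
```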
The advent of large-scale training has produced a cornucopia of powerful visual recognition models. However, generative models have traditionally been trained from scratch in an unsupervised manner. Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training? If so, with so many models to choose from, which one(s) should be selected, and in what manner are they most effective? We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators. Notably, the particular subset of selected models greatly affects performance. We propose an effective selection mechanism that probes the linear separability between real and fake samples in pretrained model embeddings, chooses the most accurate model, and progressively adds it to the discriminator ensemble. Interestingly, our method can improve GAN training in both limited-data and large-scale settings. With only 10k training samples, our FID on LSUN Cat matches StyleGAN2 trained on 1.6M images. On the full datasets, our method improves FID by 1.5x to 2x on the LSUN Cat, Church, and Horse categories.
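A sketch of the selection probe described above, assuming precomputed embeddings per backbone: fit a linear classifier on each pretrained model's embeddings of real versus generated samples and rank backbones by held-out accuracy, so the most linearly separable (most "accurate") backbone is the next one added to the discriminator ensemble.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def separability_score(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    X = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)   # linear separability of real vs. fake embeddings

# scores = {name: separability_score(feats_real[name], feats_fake[name])
#           for name in pretrained_backbones}           # backbone names assumed
# next_model = max(scores, key=scores.get)              # add the best probe's backbone
```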
我们提出了Gan监督的学习,一个学习歧视模型的框架及其GAN生成的培训数据结束结束。我们将框架应用于密集的视觉调整问题。灵感来自经典的凝固方法,我们的甘蓝算法列举了空间变压器来将随机样本从受过协调的数据训练到常见的共同学习的目标模式。我们在八个数据集上显示结果,所有这些都证明了我们的方法成功对齐复杂数据并发现密集的对应。甘蓝显着优于过去自我监督的对应算法,并在几个数据集上与(有时超过)最先进的监督对应算法进行了近几个数据集 - 而不利用任何通信监督或数据增强,尽管仅仅是完全培训在GaN生成的数据上。对于精确的对应,我们通过最先进的受监管方法提高了3倍。我们展示了我们对下游GaN训练的图像数据集的增强现实,图像编辑和自动预处理的应用。
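A minimal sketch of the alignment step: a Spatial Transformer predicts a warp that maps a GAN sample toward a common target mode and is trained with a reconstruction loss against that (jointly learned) target. The affine-only regressor and the feature input are simplified stand-ins for the paper's components.

```python
import torch
import torch.nn.functional as F

class AffineSpatialTransformer(torch.nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Predict a 2x3 affine matrix from an image feature vector,
        # initialized to the identity transform.
        self.regressor = torch.nn.Linear(feat_dim, 6)
        self.regressor.weight.data.zero_()
        self.regressor.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, img, feat):
        theta = self.regressor(feat).view(-1, 2, 3)
        grid = F.affine_grid(theta, img.shape, align_corners=False)
        return F.grid_sample(img, grid, align_corners=False)

# aligned = stn(gan_sample, features)
# loss = F.mse_loss(aligned, learned_target_mode)   # "congeal" toward the target mode
```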
Training supervised image synthesis models requires a critic to compare two images: the result and the ground truth. Yet, designing this basic function remains an open problem. Popular approaches use the L1 (mean absolute error) loss, either in pixel space or in the feature space of a pretrained deep network. However, we observe that these losses tend to produce overly blurry and grey images, and other techniques, such as GANs, are needed to combat these artifacts. In this work, we introduce an information-theoretic approach to measuring the similarity between two images. We argue that a good reconstruction should have high mutual information with the ground truth. This view enables learning a lightweight critic in a contrastive fashion to "calibrate" a feature space, such that the reconstruction of a spatial patch is drawn toward the corresponding ground-truth patch while being pushed away from other patches. We show that, when used as a drop-in replacement for the L1 loss, our formulation immediately boosts the perceptual realism of output images, with or without an additional GAN loss.
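A sketch of the contrastive "calibration" objective, assuming a small critic has already embedded spatially aligned patches of the prediction and the ground truth: an InfoNCE-style loss pulls each predicted patch toward its corresponding ground-truth patch and pushes it away from the other patches.

```python
import torch
import torch.nn.functional as F

def patch_infonce(pred_feats, gt_feats, temperature=0.07):
    # pred_feats, gt_feats: (num_patches, dim) critic embeddings of aligned patches.
    pred = F.normalize(pred_feats, dim=1)
    gt = F.normalize(gt_feats, dim=1)
    logits = pred @ gt.t() / temperature               # similarity of every patch pair
    labels = torch.arange(len(pred), device=pred.device)
    return F.cross_entropy(logits, labels)             # diagonal = positives, rest = negatives
```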
Guided image synthesis enables everyday users to create and edit photo-realistic images with minimal effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) with realism of the synthesized image. Existing GAN-based methods attempt to achieve such a balance using either conditional GANs or GAN inversions, which is challenging and often requires additional training data or loss functions for individual applications. To address these issues, we introduce Stochastic Differential Editing (SDEdit), a new image synthesis and editing method based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guidance of any type, SDEdit first adds noise to the input and then denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversion and can naturally achieve the balance between realism and faithfulness. According to a human perception study, SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, on multiple tasks including stroke-based image synthesis and editing as well as image compositing.
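A hedged sketch of the procedure with a discretized (DDPM-style) noise schedule: perturb the guide image to an intermediate noise level t0, then run the reverse denoising process from t0 down to 0. Here `denoise_step` stands in for one reverse step of a pretrained diffusion model and is an assumption, as is the choice of t0.

```python
import torch

def sdedit(guide_img, denoise_step, alphas_cumprod, t0_frac=0.5):
    T = len(alphas_cumprod)
    t0 = int(t0_frac * T)                                # trades realism vs. faithfulness
    a_bar = alphas_cumprod[t0]
    noise = torch.randn_like(guide_img)
    x = a_bar.sqrt() * guide_img + (1 - a_bar).sqrt() * noise   # perturb the guide
    for t in reversed(range(t0)):
        x = denoise_step(x, t)                           # reverse process from t0 to 0
    return x
```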
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so: maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operate on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each "domain" is only a single image.
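A sketch of this multilayer, patch-based objective, assuming features have already been sampled at matching spatial locations of the input and output images at several encoder layers: each output-patch feature is matched to the input-patch feature at the same location, with the other locations of the same image serving as the negatives.

```python
import torch
import torch.nn.functional as F

def multilayer_patch_nce(out_feats_per_layer, in_feats_per_layer, tau=0.07):
    # Each list element: (num_locations, dim) features sampled at matching positions.
    total = 0.0
    for f_out, f_in in zip(out_feats_per_layer, in_feats_per_layer):
        f_out = F.normalize(f_out, dim=1)
        f_in = F.normalize(f_in, dim=1)
        logits = f_out @ f_in.t() / tau                  # negatives come from the same image
        labels = torch.arange(len(f_out), device=f_out.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(out_feats_per_layer)
```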
The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128×128 and 2-4× reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at https://github.com/mit-han-lab/data-efficient-gans.
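A minimal sketch of the recipe, with a deliberately simplified translation augmentation (a single random shift per batch rather than per sample) and a non-saturating GAN loss as assumptions: the same differentiable transform is applied to both real and generated samples inside both the discriminator and generator losses, so gradients flow through the augmentation back to G.

```python
import torch
import torch.nn.functional as F

def rand_translate(x, ratio=0.125):
    # Differentiable random translation: pad, roll, then crop back to size.
    n, _, h, w = x.shape
    sh, sw = int(h * ratio), int(w * ratio)
    dy = torch.randint(-sh, sh + 1, (1,)).item()
    dx = torch.randint(-sw, sw + 1, (1,)).item()
    shifted = torch.roll(F.pad(x, (sw, sw, sh, sh)), (dy, dx), dims=(2, 3))
    return shifted[:, :, sh:sh + h, sw:sw + w]

def d_loss(D, G, reals, z, augment=rand_translate):
    logits_real = D(augment(reals))                  # augment real samples
    logits_fake = D(augment(G(z)))                   # and fake samples, gradients intact
    return F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()

def g_loss(D, G, z, augment=rand_translate):
    return F.softplus(-D(augment(G(z)))).mean()      # G also trains through the augmented path
```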