While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple, new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms several baselines and concurrent works, regarding both qualitative and quantitative evaluations, while being memory and computationally efficient.
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. When the edited region covers 1.2% of the image area, our method reduces the computation of DDIM by 7.5$\times$ and GauGAN by 18$\times$ while preserving the visual fidelity. With SIGE, we accelerate DDIM by 3.0$\times$ on RTX 3090 and 6.6$\times$ on Apple M1 Pro CPU, and GauGAN by 4.2$\times$ on RTX 3090 and 14$\times$ on Apple M1 Pro CPU.
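The cache-and-reuse idea can be illustrated with a minimal sketch, assuming a single 1D convolution in pure Python rather than the 2D layers and hardware kernels of SIGE. The function names `conv1d_same` and `sparse_update` are hypothetical and are not taken from the paper's code.

```python
def _output_at(x, kernel, i):
    """One output value of a zero-padded 'same' 1D cross-correlation."""
    r = len(kernel) // 2  # kernel length is assumed to be odd
    acc = 0.0
    for k, w in enumerate(kernel):
        j = i + k - r
        if 0 <= j < len(x):
            acc += w * x[j]
    return acc


def conv1d_same(x, kernel):
    """Dense pass: compute every output position."""
    return [_output_at(x, kernel, i) for i in range(len(x))]


def sparse_update(cached_out, edited_x, kernel, edited_idx):
    """Recompute only the outputs whose receptive field touches an edit.

    All other positions are copied from the cached result of the
    original input. Returns the new output and the number of
    positions that were recomputed.
    """
    r = len(kernel) // 2
    n = len(edited_x)
    dirty = set()
    for i in edited_idx:
        dirty.update(range(max(0, i - r), min(n, i + r + 1)))
    out = list(cached_out)
    for i in dirty:
        out[i] = _output_at(edited_x, kernel, i)
    return out, len(dirty)


original = [float(v) for v in range(20)]
kernel = [0.25, 0.5, 0.25]
cached = conv1d_same(original, kernel)   # computed once, then reused

edited = list(original)
edited[7] = 100.0                        # a small, local edit
sparse_out, recomputed = sparse_update(cached, edited, kernel, [7])
```

In this toy case only the positions within the kernel radius of the edit are recomputed, yet the result matches a dense pass over the edited input. In a multi-layer network the affected region grows with the receptive field at each layer, which this sketch does not model.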
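Deep generative models make visual content creation more accessible to novice users by automating the synthesis of diverse, realistic content based on a collected dataset. However, current machine learning approaches miss a key element of the creative process: the ability to synthesize things that go far beyond the data distribution and everyday experience. To begin to address this issue, we enable a user to "warp" a given model by editing just a handful of original model outputs with desired geometric changes. Our method applies a low-rank update to a single model layer to reconstruct the edited examples. Furthermore, to combat overfitting, we propose a latent space augmentation method based on style-mixing. Our method allows a user to create a model that synthesizes endless objects with the defined geometric changes, enabling the creation of a new generative model without the burden of curating a large-scale dataset. We also demonstrate that edited models can be composed to achieve aggregated effects, and we present an interactive interface that lets users create new models through composition. Empirical measurements on multiple test cases suggest the advantage of our method over recent GAN fine-tuning methods. Finally, we showcase several applications using the edited models, including latent space interpolation and image editing.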
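As a rough illustration of what a low-rank update to a single layer looks like, the sketch below assumes a plain linear layer with weight `W` (m by n) and a rank-r correction `U V^T`. It shows only the parameterization; how `U` and `V` are optimized to reconstruct the edited examples is not covered, and all names are hypothetical.

```python
def matvec(W, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def low_rank_forward(W, U, V, x):
    """Compute (W + U V^T) x without forming the updated matrix.

    W is m x n (frozen), U is m x r and V is n x r (the trainable update).
    """
    r = len(V[0])
    base = matvec(W, x)
    # Project x onto the r columns of V, then expand through U.
    coeffs = [sum(V[j][k] * x[j] for j in range(len(x))) for k in range(r)]
    delta = [sum(U[i][k] * coeffs[k] for k in range(r)) for i in range(len(U))]
    return [b + d for b, d in zip(base, delta)]


def merged_weight(W, U, V):
    """Materialize W + U V^T, e.g. to export the edited layer."""
    r = len(V[0])
    return [
        [W[i][j] + sum(U[i][k] * V[j][k] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


W = [[1, 2, 3], [4, 5, 6]]   # original 2 x 3 layer weight
U = [[1], [2]]               # rank-1 update factors
V = [[1], [0], [-1]]
x = [1, 2, 3]
y = low_rank_forward(W, U, V, x)
```

With rank r, the update has r(m + n) trainable values instead of the mn values of the full weight, which is one reason such updates are less prone to overfitting on a handful of examples.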
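We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly end-to-end. We apply our framework to the dense visual alignment problem. Inspired by the classic Congealing method, our GANgealing algorithm trains a Spatial Transformer to map random samples from a GAN trained on unaligned data to a common, jointly learned target mode. We show results on eight datasets, all of which demonstrate that our method successfully aligns complex data and discovers dense correspondences. GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on par with (and sometimes exceeds) state-of-the-art supervised correspondence algorithms on several datasets, without making use of any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data. For precise correspondence, we improve upon state-of-the-art supervised methods by as much as 3$\times$. We show applications of our method for augmented reality, image editing, and automated pre-processing of image datasets for downstream GAN training.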
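Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guidance of any type, SDEdit first adds noise to the input and then denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. According to a human perception study, SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.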
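The add-noise-then-denoise procedure can be sketched on a toy problem. The example below is a minimal sketch under strong assumptions: the "image" is a single scalar, the prior is a 1D Gaussian whose score is known in closed form (standing in for a learned score network), and the noise schedule is a variance-exploding SDE with sigma(t) = t. The function name `sdedit_1d` and its parameters are hypothetical.

```python
import math
import random


def sdedit_1d(guide, t0, prior_mean=0.0, prior_std=1.0, steps=200, rng=random):
    """Perturb `guide` to noise level t0, then run the reverse SDE to t = 0."""
    # Forward perturbation: x(t0) = guide + sigma(t0) * z, with sigma(t) = t.
    x = guide + t0 * rng.gauss(0.0, 1.0)
    dt = t0 / steps
    for i in range(steps):
        t = t0 - i * dt
        # Closed-form score of the noised prior N(prior_mean, prior_std^2 + t^2).
        score = -(x - prior_mean) / (prior_std ** 2 + t ** 2)
        # Euler-Maruyama step of the reverse-time SDE; d(sigma^2)/dt = 2t.
        x = x + 2.0 * t * score * dt + math.sqrt(2.0 * t * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
guide = 3.0  # a "user input" far from the prior mean of 0
faithful = [sdedit_1d(guide, t0=0.1, rng=rng) for _ in range(500)]
realistic = [sdedit_1d(guide, t0=3.0, rng=rng) for _ in range(500)]
mean_faithful = sum(faithful) / len(faithful)
mean_realistic = sum(realistic) / len(realistic)
```

The starting noise level `t0` controls the trade-off described in the abstract: a small `t0` keeps samples close to the guide (faithfulness), while a large `t0` pulls them toward the prior (realism).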
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so: maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operating on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each "domain" is only a single image.
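The patch-level contrastive objective can be sketched as an InfoNCE-style loss for one query patch. This is a minimal sketch that assumes patch features are already extracted as plain vectors; the function name `patch_nce_loss` and the temperature value are illustrative assumptions, and the multilayer feature extraction is omitted.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def patch_nce_loss(query, positive, negatives, temperature=0.07):
    """Cross-entropy of picking the positive among (1 + N) candidates.

    query:     feature of an output patch
    positive:  feature of the input patch at the same location
    negatives: features of other patches from the same input image
    """
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive)) / temperature]
    logits += [_dot(q, _normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


matched = patch_nce_loss([1.0, 0.0], [2.0, 0.0], [[0.0, 1.0], [-3.0, 0.0]])
mismatched = patch_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
uninformative = patch_nce_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
```

The loss is low when the query is closer to its corresponding patch than to the negatives and high otherwise. When all candidates are indistinguishable it equals log(1 + N), the loss of a uniform guess.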
The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128×128 and 2-4× reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at https://github.com/mit-han-lab/data-efficient-gans.
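To illustrate why the augmentation must be differentiable, the sketch below uses a deliberately tiny setup: a scalar "generator" `theta * z`, a linear "discriminator", and an affine brightness/contrast-like augmentation. All of these are assumptions chosen so that the gradient can be written by hand; the function names are hypothetical and this is not the released DiffAugment code.

```python
import math


def augment(x, scale, shift):
    """Affine augmentation T(x); differentiable, with dT/dx = scale."""
    return scale * x + shift


def discriminator(x, w, c):
    return w * x + c


def softplus(v):
    return math.log1p(math.exp(-abs(v))) + max(v, 0.0)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def d_loss(real, fake, w, c, scale, shift):
    """Discriminator loss: T is applied to both the real and the fake sample."""
    d_real = discriminator(augment(real, scale, shift), w, c)
    d_fake = discriminator(augment(fake, scale, shift), w, c)
    return softplus(-d_real) + softplus(d_fake)


def g_loss(theta, z, w, c, scale, shift):
    """Non-saturating generator loss, also evaluated on T(G(z))."""
    fake = theta * z
    return softplus(-discriminator(augment(fake, scale, shift), w, c))


def g_grad(theta, z, w, c, scale, shift):
    """Analytic d g_loss / d theta: the chain rule passes through D, T and G."""
    logit = discriminator(augment(theta * z, scale, shift), w, c)
    return -sigmoid(-logit) * w * scale * z


params = dict(z=2.0, w=1.5, c=-0.2, scale=1.3, shift=0.1)
theta, eps = 0.5, 1e-6
analytic = g_grad(theta, **params)
numeric = (g_loss(theta + eps, **params) - g_loss(theta - eps, **params)) / (2 * eps)
```

Because `augment` is differentiable, the generator gradient carries the factor `scale` from the augmentation, and the finite-difference estimate agrees with the analytic value. A non-differentiable augmentation would block this gradient path to the generator.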
