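Score-based diffusion generative models (SDGMs) have achieved state-of-the-art FID results in unpaired image-to-image translation (I2I). However, we notice that existing methods completely ignore the training data in the source domain, leading to sub-optimal solutions for unpaired I2I. To this end, we propose energy-guided stochastic differential equations (EGSDE), which employ an energy function pretrained on both the source and target domains to guide the inference process of a pretrained SDE for realistic and faithful unpaired I2I. Building upon two feature extractors, we carefully design the energy function so that it encourages the transferred image to preserve domain-independent features and discard domain-specific ones. Further, we provide an alternative interpretation of EGSDE as a product of experts, where each of the three experts (corresponding to the SDE and the two feature extractors) contributes solely to faithfulness or to realism. Empirically, we compare EGSDE to a large family of baselines on three widely adopted unpaired I2I tasks under four metrics. EGSDE not only consistently outperforms existing SDGM-based methods in almost all settings but also achieves state-of-the-art realism results (e.g., an FID of 65.82 on Cat to Dog and an FID of 59.75 on Wild to Dog on AFHQ) without harming faithfulness.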
Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs. The code is publicly available at https://github.com/ChenWu98/cycle-diffusion.
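Common image-to-image translation methods rely on joint training over data from both the source and target domains. This prevents the training process from preserving the privacy of domain data (e.g., in a federated setting), and often means that a new model must be trained for each new pair of domains. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models that circumvents training on domain pairs. Image translation with DDIBs relies on two diffusion models trained independently on each domain, and is a two-step process: DDIBs first obtain latent encodings for source images with the source diffusion model, and then decode these encodings with the target model to construct target images. Both steps are defined via ODEs, so the process is cycle-consistent only up to the discretization errors of the ODE solvers. Theoretically, we interpret DDIBs as a concatenation of source-to-latent and latent-to-target Schrödinger Bridges, a form of entropy-regularized optimal transport, to explain the efficacy of the method. Experimentally, we apply DDIBs to synthetic and high-resolution image datasets to demonstrate their utility in a wide variety of translation tasks and their connection to existing optimal transport methods.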
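The two-step encode/decode procedure described above can be sketched on a one-dimensional toy problem. This is only an illustration under stated assumptions, not the authors' implementation: each "domain" is a Gaussian whose score under variance-exploding noise is known in closed form (standing in for a trained diffusion model), and the names `score`, `pf_ode`, `SOURCE`, `TARGET`, and `SIGMA_MAX` are hypothetical. The probability-flow ODE is integrated with a classical RK4 solver.

```python
def score(x, sigma, mean, std):
    """Score of N(mean, std^2 + sigma^2); a stand-in for a trained score network."""
    return -(x - mean) / (std ** 2 + sigma ** 2)

def pf_ode(x, sigma_from, sigma_to, domain, num_steps=1000):
    """Integrate the probability-flow ODE dx/dsigma = -sigma * score with RK4."""
    mean, std = domain
    h = (sigma_to - sigma_from) / num_steps  # negative when decoding

    def f(value, s):
        return -s * score(value, s, mean, std)

    s = sigma_from
    for _ in range(num_steps):
        k1 = f(x, s)
        k2 = f(x + 0.5 * h * k1, s + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, s + 0.5 * h)
        k4 = f(x + h * k3, s + h)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        s += h
    return x

SOURCE = (-2.0, 1.0)   # (mean, std) of the toy source domain (assumption)
TARGET = (3.0, 0.5)    # (mean, std) of the toy target domain (assumption)
SIGMA_MAX = 20.0       # noise level of the shared latent space (assumption)

x0 = -1.0
latent = pf_ode(x0, 0.0, SIGMA_MAX, SOURCE)          # step 1: encode with the source model
translated = pf_ode(latent, SIGMA_MAX, 0.0, TARGET)  # step 2: decode with the target model

# Translating back should recover x0 up to the solver's discretization error.
latent_back = pf_ode(translated, 0.0, SIGMA_MAX, TARGET)
recon = pf_ode(latent_back, SIGMA_MAX, 0.0, SOURCE)
```

In this toy setting the translated point lands one source standard deviation above the target mean, up to a small offset that shrinks as `SIGMA_MAX` grows, and `recon` matches `x0` closely, which is the cycle-consistency property the abstract refers to.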
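Denoising diffusion models represent a recent emerging topic in computer vision, demonstrating remarkable results in the area of generative modeling. A diffusion model is a deep generative model based on two stages, a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked with recovering the original input data by learning to reverse the diffusion process step by step. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burden, i.e., low speed due to the large number of steps involved during sampling. In this survey, we provide a comprehensive review of articles on denoising diffusion models applied in vision, covering both theoretical and practical contributions in the field. First, we identify and present three generic diffusion modeling frameworks, which are based on denoising diffusion probabilistic models, noise-conditioned score networks, and stochastic differential equations. We further discuss the relations between diffusion models and other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models, and normalizing flows. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research.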
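The forward diffusion stage mentioned above can be illustrated with the standard closed-form sampling rule of denoising diffusion probabilistic models, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below is a minimal pure-Python illustration; the linear schedule endpoints, the step count, and all function names are assumptions chosen for the example rather than values taken from the survey.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one shot."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                               # a tiny stand-in for an image
x_early = forward_diffuse(x0, 10, alpha_bars, rng)   # still close to x0
x_late = forward_diffuse(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Because `alpha_bars` decays towards zero, late timesteps retain almost none of the original signal, which is why the reverse stage has to be learned.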
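Despite significant advances in image-to-image (I2I) translation with generative adversarial networks (GANs), it remains challenging to effectively translate an image into a set of diverse images in multiple target domains using a single generator-discriminator pair. Existing I2I translation methods adopt multiple domain-specific content encoders for different domains, where each domain-specific content encoder is trained only with images from its own domain. We argue, however, that the content (domain-invariant) features should be learned from images across all domains. Consequently, the domain-specific content encoders of existing schemes fail to extract domain-invariant features efficiently. To address this issue, we present a flexible and general SoloGAN model for multimodal I2I translation among multiple domains with unpaired data. In contrast to existing methods, the SoloGAN algorithm uses a single projection discriminator with an additional auxiliary classifier and shares the encoder and generator across all domains. As a result, SoloGAN can be trained effectively with images from all domains, so that the domain-invariant content representation can be extracted efficiently. Qualitative and quantitative results over multiple datasets, against several counterparts and variants of SoloGAN, demonstrate the merits of the method, especially on challenging I2I translation datasets, i.e., datasets that involve extreme shape variations or require complex backgrounds to remain unchanged after translation. Furthermore, we demonstrate the contribution of each component of SoloGAN through ablation studies.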
Diffusion Probabilistic Models (DPMs) have recently been employed for image deblurring. DPMs are trained via a stochastic denoising process that maps Gaussian noise to the high-quality image, conditioned on the concatenated blurry input. Despite their high-quality generated samples, image-conditioned Diffusion Probabilistic Models (icDPMs) rely on synthetic pairwise training data (in-domain), with potentially unclear robustness towards real-world unseen images (out-of-domain). In this work, we investigate the generalization ability of icDPMs in deblurring, and propose a simple but effective guidance to significantly alleviate artifacts and improve the out-of-distribution performance. In particular, we propose to first extract a multiscale domain-generalizable representation from the input image that removes domain-specific information while preserving the underlying image structure. The representation is then added into the feature maps of the conditional diffusion model as an extra guidance that helps improve generalization. To benchmark, we focus on out-of-distribution performance by applying a single-dataset trained model to three external and diverse test sets. The effectiveness of the proposed formulation is demonstrated by improvements over the standard icDPM, as well as state-of-the-art performance on perceptual quality and competitive distortion metrics compared to existing methods.
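Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) have enabled zero-shot image manipulation guided by text prompts. However, their application to diverse real images remains difficult due to the limited capability of GAN inversion. Specifically, these approaches often have difficulty reconstructing images with novel poses, views, and highly variable content compared to the training data; they may also alter object identity or produce unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Building on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to existing baselines.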
Conditional diffusion probabilistic models can model the distribution of natural images and can generate diverse and realistic samples based on given conditions. However, oftentimes their results can be unrealistic with observable color shifts and textures. We believe that this issue results from the divergence between the probabilistic distribution learned by the model and the distribution of natural images. The delicate conditions gradually enlarge the divergence during each sampling timestep. To address this issue, we introduce a new method that brings the predicted samples to the training data manifold using a pretrained unconditional diffusion model. The unconditional model acts as a regularizer and reduces the divergence introduced by the conditional model at each sampling step. We perform comprehensive experiments to demonstrate the effectiveness of our approach on super-resolution, colorization, turbulence removal, and image-deraining tasks. The improvements obtained by our method suggest that the priors can be incorporated as a general plugin for improving conditional diffusion models.
While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations out of their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is capable of coping with unseen and complex degradations more gracefully without complicated loss designs. The key of our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model and then gradually transmit from this intermediate state to the HQ target by recursively applying a pre-trained diffusion model. The transition distribution only relies on a restoration backbone that is trained with $L_2$ loss on some synthetic data, which favorably avoids the cumbersome training process in existing methods. Moreover, the transition distribution can contract the error of the restoration backbone and thus makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Our code and model are available at https://github.com/zsyOAOA/DifFace.
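Controllable image synthesis models allow the creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows language guidance, image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based text guidance as well as both content-based and style-based image guidance in a unified form. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on the FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content example image, and examples with both text and image guidance.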
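Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such a balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with a user guide of any type, SDEdit first adds noise to the input, then denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversion and can naturally achieve a balance between realism and faithfulness. According to a human perception study, SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.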
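The "add noise, then denoise" recipe above can be sketched on a scalar toy problem. This is an illustrative sketch under stated assumptions, not the SDEdit code: the "realistic image" prior is a one-dimensional Gaussian whose score is known exactly (in place of a trained network), the sampler is a simple reverse-diffusion discretization of a variance-exploding SDE, and the names `score`, `sdedit`, `PRIOR_MEAN`, and `PRIOR_STD` are hypothetical.

```python
import math
import random

PRIOR_MEAN, PRIOR_STD = 0.0, 1.0  # toy "realistic image" distribution (assumption)

def score(x, sigma):
    """Exact score of N(PRIOR_MEAN, PRIOR_STD^2 + sigma^2); replaces a trained network."""
    return -(x - PRIOR_MEAN) / (PRIOR_STD ** 2 + sigma ** 2)

def sdedit(guide, sigma_start, rng, num_steps=200, sigma_min=0.01):
    """Perturb the guide to noise level sigma_start, then denoise with a reverse VE-SDE."""
    # Geometric noise schedule from sigma_start down to sigma_min.
    ratio = (sigma_min / sigma_start) ** (1.0 / num_steps)
    sigmas = [sigma_start * ratio ** i for i in range(num_steps + 1)]

    x = guide + sigma_start * rng.gauss(0.0, 1.0)      # step 1: add noise to the guide
    for hi, lo in zip(sigmas[:-1], sigmas[1:]):        # step 2: reverse diffusion
        dvar = hi ** 2 - lo ** 2
        x += dvar * score(x, hi) + math.sqrt(dvar) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
guide = 5.0   # an "unrealistic" input far from the prior mean
runs = 500

# Little noise keeps the output faithful to the guide; a lot of noise favors realism.
mean_weak = sum(sdedit(guide, 0.1, rng) for _ in range(runs)) / runs
mean_strong = sum(sdedit(guide, 10.0, rng) for _ in range(runs)) / runs
```

The starting noise level plays the role of the realism-faithfulness knob: with `sigma_start=0.1` the outputs stay near the guide, while with `sigma_start=10.0` they are drawn back towards the prior.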
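We present rectified flow, a surprisingly simple approach to learning (neural) ordinary differential equation (ODE) models that transport between two empirically observed distributions $\pi_0$ and $\pi_1$, hence providing a unified solution to generative modeling and domain transfer, among various other tasks involving distribution transport. The idea of rectified flow is to learn the ODE to follow, as much as possible, the straight paths connecting points drawn from $\pi_0$ and $\pi_1$. This is achieved by solving a straightforward nonlinear least-squares optimization problem, which can be easily scaled to large models without introducing extra parameters beyond standard supervised learning. Straight paths are special and preferred because they are the shortest paths between two points and can be simulated exactly without time discretization, hence yielding computationally efficient models. We show that the procedure of learning a rectified flow from data, called rectification, turns an arbitrary coupling of $\pi_0$ and $\pi_1$ into a new deterministic coupling with provably non-increasing convex transport costs. In addition, recursively applying rectification allows us to obtain a sequence of flows with increasingly straight paths, which can be simulated accurately with coarse time discretization in the inference phase. In empirical studies, we show that rectified flow performs superbly on image generation, image-to-image translation, and domain adaptation. In particular, on image generation and translation, our method yields nearly straight flows that give high-quality results even with a single Euler discretization step.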
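The least-squares objective described above, fitting a velocity field $v(x, t)$ to the straight-line direction $X_1 - X_0$ at the interpolated point $X_t = tX_1 + (1-t)X_0$, can be sketched in one dimension. The example below is a toy illustration under stated assumptions, not the authors' code: $\pi_0$ and $\pi_1$ are taken to be Gaussians with an independent coupling, and the neural network is replaced by a separate closed-form linear regression at the midpoint of each time bin. The names `fit_line`, `train_rectified_flow`, and `transport` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y ~ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var
    return my - b * mx, b

def train_rectified_flow(x0s, x1s, num_steps):
    """Fit v(x, t_k) = a_k + b_k * x at the midpoint t_k of each time bin."""
    targets = [x1 - x0 for x0, x1 in zip(x0s, x1s)]  # straight-line direction X1 - X0
    params = []
    for k in range(num_steps):
        t = (k + 0.5) / num_steps
        xts = [t * x1 + (1.0 - t) * x0 for x0, x1 in zip(x0s, x1s)]
        params.append(fit_line(xts, targets))
    return params

def transport(z, params):
    """Euler-integrate dz/dt = v(z, t) from t = 0 to t = 1."""
    dt = 1.0 / len(params)
    for a, b in params:
        z += dt * (a + b * z)
    return z

rng = random.Random(0)
n, shift = 2000, 3.0
x0s = [rng.gauss(0.0, 1.0) for _ in range(n)]      # samples from pi_0 (assumption)
x1s = [rng.gauss(shift, 1.0) for _ in range(n)]    # independent samples from pi_1 (assumption)

params = train_rectified_flow(x0s, x1s, num_steps=20)
z1s = [transport(x0, params) for x0 in x0s]

mean_z1 = sum(z1s) / n
mean_dev = sum(abs(z1 - x0 - shift) for z1, x0 in zip(z1s, x0s)) / n
```

In this Gaussian toy case the learned flow moves each starting point by roughly the constant `shift`, so the initially independent pairing of samples is replaced by a deterministic, nearly straight transport, which is the effect the abstract calls rectification.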
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available. To facilitate GAN training, current methods propose to use data-specific augmentation techniques. Despite the effectiveness, it is difficult for these methods to scale to practical applications. In this work, we present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks. We first produce augmented samples using the convex combinations of the real samples. Then, we optimize the augmented samples by minimizing the norms of the data scores, i.e., the gradients of the log-density functions. This procedure enforces the augmented samples close to the data manifold. To estimate the scores, we train a deep estimation network with multi-scale score matching. For different image synthesis tasks, we train the score estimation network using different data. We do not require the tuning of the hyperparameters or modifications to the network architecture. The ScoreMix method effectively increases the diversity of data and reduces the overfitting problem. Moreover, it can be easily incorporated into existing GAN models with minor modifications. Experimental results on numerous tasks demonstrate that GAN models equipped with the ScoreMix method achieve significant improvements.
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. We release our code at https://github.com/openai/guided-diffusion.
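We present a novel method for exemplar-based image translation, called matching interleaved diffusion models (MIDMs). Most existing methods for this task are formulated as a GAN-based matching-then-generation framework. However, in this framework, matching errors induced by the difficulty of semantic matching across domains, e.g., sketch and photo, can easily propagate to the generation step, which in turn leads to degenerate results. Motivated by the recent success of diffusion models in overcoming the shortcomings of GANs, we incorporate diffusion models to overcome these limitations. Specifically, we formulate a diffusion-based matching-and-generation framework that interleaves cross-domain matching and diffusion steps in the latent space by iteratively feeding the intermediate warp into the noising process and denoising it to generate a translated image. In addition, to improve the reliability of the diffusion process, we design a confidence-aware process using cycle consistency so that only confident regions are considered during translation. Experimental results show that our MIDMs generate more plausible images than state-of-the-art methods.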
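Diffusion models have demonstrated impressive image generation performance and have been used in various computer vision tasks. Unfortunately, image generation using diffusion models is very time-consuming, since it requires thousands of sampling steps. To address this problem, we propose a novel pyramidal diffusion model that generates high-resolution images starting from much coarser-resolution images, using a single score function trained with a positional embedding. This enables time-efficient sampling for image generation, and also solves the low-batch-size problem when training with limited resources. Furthermore, we show that the single score function can be used effectively for multi-scale super-resolution problems.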
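Generating images from hand drawings is a crucial and fundamental task in content creation. The translation is difficult because there are infinitely many possibilities and different users usually expect different outcomes. Therefore, we propose a unified framework, based on diffusion models, that supports three-dimensional control over image synthesis from sketches and strokes. Users can decide not only the level of faithfulness to the input strokes and sketches, but also the degree of realism, since user inputs are usually inconsistent with real images. Qualitative and quantitative experiments demonstrate that our framework achieves state-of-the-art performance while providing the flexibility to generate customized images with control over shape, color, and realism. Moreover, our method unlocks applications such as editing real images, generation from partial sketches and strokes, and multi-domain multi-modal synthesis.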
Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to plug-and-play control the generation process for various tasks without fine-tuning the diffusion model. However, the direct use of publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. For that, the existing practice is to fine-tune the guidance models with labeled data corrupted with noises. In this paper, we argue that this practice has limitations in two aspects: (1) performing on inputs with extremely various noises is too hard for a single model; (2) collecting labeled datasets hinders scaling up for various tasks. To tackle the limitations, we propose a novel strategy that leverages multiple experts where each expert is specialized in a particular noise range and guides the reverse process at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We exhaustively conduct ImageNet class conditional generation experiments to show that our method can successfully guide diffusion with small trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide publicly available GLIDE through our framework in a plug-and-play manner.
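One-shot generative domain adaptation aims to transfer a pre-trained generator from one domain to a new domain using only one reference image. However, it remains very challenging for the adapted generator (i) to generate diverse images inherited from the pre-trained generator while (ii) faithfully acquiring the domain-specific attributes and styles of the reference image. In this paper, we present a novel one-shot generative domain adaptation method, DiFa, for diverse generation and faithful adaptation. For global-level adaptation, we leverage the difference between the CLIP embedding of the reference image and the mean embedding of source images to constrain the target generator. For local-level adaptation, we introduce an attentive style loss that aligns each intermediate token of an adapted image with the corresponding token of the reference image. To facilitate diverse generation, selective cross-domain consistency is introduced to select and retain domain-sharing attributes in the editing latent $\mathcal{W}+$ space, so as to inherit the diversity of the pre-trained generator. Extensive experiments show that our method outperforms the state of the art both quantitatively and qualitatively, especially in cases with large domain gaps. Moreover, DiFa can easily be extended to zero-shot generative domain adaptation with appealing results. Code is available at https://github.com/1170300521/difa.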