Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, as the noise strength increases, we are able to diminish the inconsequential components and thus use a lower-dimensional signal to represent the source while losing barely any information. Such a reformulation allows the dimensionality to vary during both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves the FID of a diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
translated by Google Translate
Diffusion models are rising as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but its speed still lags largely behind that of its GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion structure. We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on the CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping-stone to offering real-time and high-fidelity diffusion models. Our code and pre-trained checkpoints will be available at \url{https://github.com/VinAIResearch/WaveDiff.git}.
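Recently, Rissanen et al. (2022) presented a new type of diffusion process for generative modeling based on heat dissipation, or blurring, as an alternative to isotropic Gaussian diffusion. Here, we show that blurring can equivalently be defined through a Gaussian diffusion process with non-isotropic noise. In making this connection, we bridge the gap between inverse heat dissipation and denoising diffusion, and we shed light on the inductive bias that results from this modeling choice. Finally, we propose a generalized class of diffusion models that offers the best of both standard Gaussian denoising diffusion and inverse heat dissipation, which we call Blurring Diffusion Models.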
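Denoising diffusion probabilistic models (DDPMs) have achieved high-quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high-quality samples $10\times$ to $50\times$ faster than DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.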
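The faster reverse process mentioned above can be illustrated with a single deterministic update. This is a minimal sketch, assuming the noise-free (eta = 0) form of the DDIM update and an already-trained noise predictor; the names `ddim_step`, `alpha_bar_t`, and `alpha_bar_prev` are illustrative, and the "predicted" noise here is simply the true noise so that the result can be checked by hand.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic (eta = 0) DDIM update from noise level t to an earlier level."""
    # Estimate the clean signal implied by the current sample and the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the earlier (less noisy) level with the same noise.
    return [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Toy check: with the true noise as the prediction, the step lands exactly on
# the less noisy point of the same trajectory.
x0 = [1.0, -0.5]
eps = [0.3, -1.2]
alpha_bar_t, alpha_bar_prev = 0.5, 0.8  # hypothetical cumulative signal levels
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
```

Because the update does not require the two noise levels to be adjacent, a sampler built from it can visit a short subsequence of timesteps, which is where the reported speed-up would come from.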
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.
Denoising diffusion (score-based) generative models have recently achieved significant accomplishments in generating realistic and diverse data. These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise. Unfortunately, the generation process of current denoising diffusion models is notoriously slow due to the lengthy iterative noise estimations, which rely on cumbersome neural networks. This prevents diffusion models from being widely deployed, especially on edge devices. Previous works accelerate the generation process of diffusion models (DMs) by finding shorter yet effective sampling trajectories. However, they overlook the cost of noise estimation with a heavy network in every iteration. In this work, we accelerate generation from the perspective of compressing the noise estimation network. Due to the difficulty of retraining DMs, we exclude mainstream training-aware compression paradigms and introduce post-training quantization (PTQ) into DM acceleration. However, the output distributions of noise estimation networks change with the time-step, making previous PTQ methods fail in DMs since they are designed for single-time-step scenarios. To devise a DM-specific PTQ method, we explore PTQ on DMs in three aspects: quantized operations, calibration dataset, and calibration metric. We summarize and use several observations derived from all-inclusive investigations to formulate our method, which especially targets the unique multi-time-step structure of DMs. Experimentally, our method can directly quantize full-precision DMs into 8-bit models while maintaining or even improving their performance in a training-free manner. Importantly, our method can serve as a plug-and-play module on other fast-sampling methods, e.g., DDIM.
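To make the calibration issue concrete, here is a minimal sketch of generic uniform 8-bit quantization with min-max calibration. It is not the method of the abstract above; the helper names and the toy activation values per time-step are hypothetical, chosen only to show why a range calibrated on a single time-step can fail on others.

```python
def calibrate_minmax(samples):
    """Derive a scale and zero point that map [min, max] onto the integers 0..255."""
    lo, hi = min(samples), max(samples)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = max(0, min(255, round(-lo / scale)))
    return scale, zero_point

def quantize(x, scale, zero_point):
    """Map a float to an unsigned 8-bit code, clipping values outside the range."""
    return max(0, min(255, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an 8-bit code back to an approximate float value."""
    return (q - zero_point) * scale

# Hypothetical activations whose spread grows with the time-step.
activations = {0: [-0.1, 0.2, 0.05], 500: [-1.0, 0.8, 0.3], 999: [-3.0, 2.5, 0.1]}

# Calibrating on one time-step versus on samples pooled across time-steps.
single = calibrate_minmax(activations[0])
pooled = calibrate_minmax([v for vals in activations.values() for v in vals])

x = 2.5  # a value seen only at a late time-step
err_single = abs(dequantize(quantize(x, *single), *single) - x)
err_pooled = abs(dequantize(quantize(x, *pooled), *pooled) - x)
```

In this toy setting the single-time-step range clips the late-step value, so `err_single` is far larger than `err_pooled`; a calibration set drawn from several time-steps avoids that clipping at the cost of a coarser step size.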
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. We release our code at https://github.com/openai/guided-diffusion.
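Diffusion probabilistic models (DPMs) are a class of powerful deep generative models (DGMs). Despite their success, the iterative generation process over the full set of timesteps is much less efficient than other DGMs such as GANs. Thus, the generation performance on a subset of timesteps is crucial, and it is greatly influenced by the covariance design in DPMs. In this work, we consider diagonal and full covariances to improve the expressive power of DPMs. We derive the optimal result for such covariances, and then correct it when the mean of DPMs is imperfect. Both the optimal and the corrected covariances can be decomposed into terms of conditional expectations over functions of noise. Building on this, we propose to estimate the optimal covariance and its correction by learning these conditional expectations. Our method can be applied to DPMs with both discrete and continuous timesteps. We consider the diagonal covariance in our implementation for computational efficiency. For an efficient practical implementation, we adopt a parameter sharing scheme and a two-stage training process. Empirically, our method outperforms a wide variety of covariance designs on likelihood results, and improves the sample quality, especially on a small number of timesteps.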
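While diffusion models have shown great success in image generation, their noise-inverting generative process does not explicitly consider the structure of images, such as their inherent multi-scale nature. Inspired by diffusion models and the desirability of coarse-to-fine modeling, we propose a new model that generates images by iteratively inverting the heat equation, a PDE that locally erases fine-scale information when run over the 2D plane of the image. In our new methodology, the solution of the forward heat equation is interpreted as a variational approximation in a directed graphical model. We demonstrate promising image quality and point out emergent qualitative properties not seen in diffusion models, such as disentanglement of overall colour and shape in images and aspects of neural network interpretability. Spectral analysis on natural images positions our model as a type of dual to diffusion models and reveals implicit inductive biases in them.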
While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is capable of coping with unseen and complex degradations more gracefully without complicated loss designs. The key to our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model and then gradually move from this intermediate state to the HQ target by recursively applying a pre-trained diffusion model. The transition distribution relies only on a restoration backbone that is trained with $L_2$ loss on some synthetic data, which favorably avoids the cumbersome training process in existing methods. Moreover, the transition distribution can contract the error of the restoration backbone and thus makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Our code and model are available at https://github.com/zsyOAOA/DifFace.
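Diffusion probabilistic models have been shown to generate state-of-the-art results on several competitive image synthesis benchmarks, but they lack a low-dimensional, interpretable latent space and are slow at generation. On the other hand, Variational Autoencoders (VAEs) typically have access to a low-dimensional latent space but exhibit poor sample quality. Despite recent advances, VAEs usually require high-dimensional hierarchies of latent codes to generate high-quality samples. We present DiffuseVAE, a novel generative framework that integrates a VAE within a diffusion model framework, and leverage this to design a novel conditional parameterization for diffusion models. We show that the resulting model can improve upon the unconditional diffusion model in terms of sampling efficiency while also equipping diffusion models with the low-dimensional latent code inferred by the VAE. Furthermore, we show that the proposed model can generate high-resolution samples and exhibits synthesis quality comparable to state-of-the-art models on standard benchmarks. Lastly, we show that the proposed method can be used for controllable image synthesis and also exhibits out-of-the-box capabilities for downstream tasks such as image super-resolution and denoising. For reproducibility, our source code will be publicly available at \url{https://github.com/kpandey008/diffusevae}.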
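Diffusion models have emerged as one of the most promising frameworks for deep generative modeling. In this work, we explore the potential of non-uniform diffusion models. We show that non-uniform diffusion leads to multi-scale diffusion models that have a structure similar to that of multi-scale normalizing flows. We experimentally find that, in the same or less training time, the multi-scale diffusion model achieves a better FID score than the standard uniform diffusion model. More importantly, it generates samples $4.4$ times faster at $128 \times 128$ resolution. The speed-up is expected to be higher at higher resolutions, where more scales are used. Moreover, we show that non-uniform diffusion leads to a novel estimator for the conditional score function that achieves performance on par with the state-of-the-art conditional denoising estimator. Our theoretical and experimental findings are accompanied by an open-source library, MSDiff, which can facilitate further research on non-uniform diffusion models.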
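Denoising diffusion models represent a recent emerging topic in computer vision, demonstrating remarkable results in the area of generative modeling. A diffusion model is a deep generative model based on two stages, a forward diffusion stage and a reverse diffusion stage. In the forward diffusion stage, the input data is gradually perturbed over several steps by adding Gaussian noise. In the reverse stage, a model is tasked with recovering the original input data by learning to gradually reverse the diffusion process, step by step. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burden, i.e., low speed due to the high number of steps involved during sampling. In this survey, we provide a comprehensive review of articles on denoising diffusion models applied in vision, comprising both theoretical and practical contributions in the field. First, we identify and present three generic diffusion modeling frameworks, which are based on denoising diffusion probabilistic models, noise-conditioned score networks, and stochastic differential equations. We further discuss the relations between diffusion models and other deep generative models, including variational auto-encoders, generative adversarial networks, energy-based models, autoregressive models, and normalizing flows. Then, we introduce a multi-perspective categorization of diffusion models applied in computer vision. Finally, we illustrate the current limitations of diffusion models and envision some interesting directions for future research.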
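The forward diffusion stage described above can be sketched in a few lines. This is a minimal illustration, assuming the standard closed-form Gaussian perturbation used by denoising diffusion probabilistic models and a linear noise schedule; the schedule endpoints, the function names, and the toy four-element "image" are assumptions made for the example.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the endpoints are illustrative."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)           # fixed seed so the sketch is repeatable
x0 = [1.0, -1.0, 0.5, 0.0]       # toy stand-in for a flattened image
x_early = forward_diffuse(x0, 0, alpha_bars, rng)    # almost unchanged
x_late = forward_diffuse(x0, 999, alpha_bars, rng)   # almost pure noise
```

The retained signal fraction `alpha_bars[t]` shrinks monotonically with `t`, so late samples carry almost no information about `x0`; the reverse stage is the learned model that undoes this perturbation step by step.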
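Score-based diffusion models are a class of generative models whose dynamics are described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates a large T to ensure that the forward dynamics bring the diffusion sufficiently close to a known and simple noise distribution. However, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off and propose a new method that improves the quality and efficiency of both training and sampling by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive with the state of the art according to standard sample quality metrics and log-likelihood.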
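Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, often with significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm.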
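Diffusion models are a powerful class of generative models that iteratively denoise samples to produce data. While many works have focused on the number of iterations in this sampling procedure, few have focused on the cost of each iteration. We find that adding a simple ViT-style patching transformation can considerably reduce a diffusion model's sampling time and memory usage. We justify our approach both through an analysis of the diffusion model objective and through empirical experiments on LSUN Church, ImageNet 256, and FFHQ 1024. We provide implementations in TensorFlow and PyTorch.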
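One common way to realize a ViT-style patching transformation is to fold each p x p spatial block into the channel axis, so the network sees a smaller spatial grid with more channels. The sketch below assumes that space-to-depth form; the abstract does not spell out the exact operation, and `patchify` and `unpatchify` are illustrative names using nested lists rather than the TensorFlow or PyTorch implementations it mentions.

```python
def patchify(image, p):
    """Fold each p x p block into channels: [H][W][C] -> [H//p][W//p][C*p*p]."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("height and width must be divisible by the patch size")
    out = []
    for i in range(0, h, p):
        row = []
        for j in range(0, w, p):
            patch = []
            for di in range(p):
                for dj in range(p):
                    patch.extend(image[i + di][j + dj])
            row.append(patch)
        out.append(row)
    return out

def unpatchify(patches, p, c):
    """Inverse of patchify: [H//p][W//p][C*p*p] -> [H][W][C]."""
    hp, wp = len(patches), len(patches[0])
    image = [[None] * (wp * p) for _ in range(hp * p)]
    for i in range(hp):
        for j in range(wp):
            flat = patches[i][j]
            for di in range(p):
                for dj in range(p):
                    k = (di * p + dj) * c
                    image[i * p + di][j * p + dj] = flat[k:k + c]
    return image

# Toy 4 x 4 RGB image whose values encode (row, column, channel).
image = [[[float(r * 100 + col * 10 + ch) for ch in range(3)]
          for col in range(4)] for r in range(4)]
patches = patchify(image, 2)          # 2 x 2 grid with 12 channels each
restored = unpatchify(patches, 2, 3)  # back to 4 x 4 x 3
```

The transformation is lossless and invertible, so the denoiser can run on the p-times smaller grid and its output can be unfolded back to full resolution; the saving comes from the reduced spatial size seen by each layer.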
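We show that cascaded diffusion models are capable of generating high-fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher-resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower-resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128, and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.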
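The integration of the Vector Quantised Variational AutoEncoder (VQ-VAE) with autoregressive models as the generation part has yielded high-quality results on image generation. However, autoregressive models strictly follow a progressive scanning order during the sampling phase. As a result, existing VQ-series models can hardly escape the trap of lacking global information. Denoising Diffusion Probabilistic Models (DDPMs) in the continuous domain have shown the ability to capture global context while generating high-quality images. In the discrete state space, some works have demonstrated the potential to perform text generation and low-resolution image generation. We show that, with the help of a content-rich discrete visual codebook from VQ-VAE, a discrete diffusion model can also generate high-fidelity images with global context, which compensates for the deficiency of the classical autoregressive model along pixel space. Meanwhile, the integration of the discrete VAE with the diffusion model resolves both the drawback of conventional autoregressive models being oversized and that of diffusion models demanding excessive time in the sampling process when generating images. We find that the quality of the generated images depends heavily on the discrete visual codebook. Extensive experiments demonstrate that the proposed Vector Quantised Discrete Diffusion Model (VQ-DDM) is able to achieve performance comparable to top-tier methods with low complexity. It also demonstrates outstanding advantages over other vector-quantised models paired with autoregressive generators on image inpainting tasks without additional training.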
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
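By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation makes it possible for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.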