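Vision transformers (ViT) have shown promise in a range of vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with SOTA U-Nets while requiring a comparable amount of parameters and computation, if not less.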
translated by 谷歌翻译
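The long skip connections described in the U-ViT abstract above can be pictured with a small sketch. It is only an illustration of the data flow, under stated assumptions: the blocks are toy functions rather than transformer blocks, and `fuse` is a hypothetical fixed average standing in for whatever learned fusion layer the actual model uses. None of these names come from the paper.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def fuse(skip: Vector, x: Vector) -> Vector:
    """Hypothetical stand-in for a learned fusion layer: a fixed average."""
    return [0.5 * (a + b) for a, b in zip(x, skip)]


def forward_with_long_skips(x: Vector, blocks: List[Block]) -> Vector:
    """Run an odd-length stack: shallow half, middle block, deep half with skips."""
    assert len(blocks) % 2 == 1, "expects an odd number of blocks"
    half = len(blocks) // 2
    skips: List[Vector] = []
    for block in blocks[:half]:        # shallow half: remember each output
        x = block(x)
        skips.append(x)
    x = blocks[half](x)                # middle block, no skip
    for block in blocks[half + 1:]:    # deep half: fuse with the mirrored output
        x = block(fuse(skips.pop(), x))
    return x


# Toy blocks that only add a constant, so the data flow is easy to trace.
toy_blocks: List[Block] = [
    (lambda v, c=c: [e + c for e in v]) for c in (1.0, 2.0, 3.0, 4.0, 5.0)
]
out = forward_with_long_skips([0.0, 0.0], toy_blocks)
```

The stack pairs the first block with the last and the second with the second-to-last, which is the U-Net-style wiring the abstract refers to.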
Score-based diffusion models have captured widespread attention and fueled fast progress on recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has been much neglected before. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of the vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone as an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and the training pipelines for diffusion-based generative models.
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
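To make the "increased number of input tokens" point above concrete, here is a small arithmetic sketch of how the patch size sets the token count. The 32x32 latent grid is an assumption made for illustration (it is not stated in the abstract), as is reading the "/2" suffix in DiT-XL/2 as a patch size of 2. `num_tokens` is a hypothetical helper name.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent grid is cut into square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    side = latent_size // patch_size
    return side * side


# Assumed 32x32 latent grid; halving the patch size quadruples the tokens.
counts = {p: num_tokens(32, p) for p in (8, 4, 2)}
```

Under these assumptions, moving from patch size 8 to 2 raises the sequence length from 16 to 256 tokens, which is one of the ways the abstract says Gflops can be increased without changing depth or width.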
Extensive empirical evidence demonstrates that conditional generative models are easier to train and perform better than unconditional ones by exploiting the labels of data. The same holds for score-based diffusion models. In this paper, we analyze the phenomenon formally and identify that the key to conditional learning is to partition the data properly. Inspired by the analyses, we propose self-conditioned diffusion models (SCDM), which are trained conditioned on indices clustered by the k-means algorithm on the features extracted by a model pre-trained in a self-supervised manner. SCDM significantly improves the unconditional model across various datasets and achieves a record-breaking FID of 3.94 on ImageNet 64x64 without labels. Besides, SCDM achieves a slightly better FID than the corresponding conditional model on CIFAR10.
We present an end-to-end Transformer-based Latent Diffusion model for image synthesis. On the ImageNet class-conditioned generation task, we show that a Transformer-based Latent Diffusion model achieves an FID of 14.1, which is comparable to the FID of 13.1 of a UNet-based architecture. In addition to showing the application of Transformer models to diffusion-based image synthesis, this simplification in architecture allows easy fusion and modeling of text and image data. The multi-head attention mechanism of Transformers enables simplified interaction between the image and text features, which removes the need for the cross-attention mechanism used in UNet-based diffusion models.
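Diffusion models are a powerful class of generative models that iteratively denoise samples to produce data. While many works have focused on the number of iterations in this sampling procedure, few have focused on the cost of each iteration. We find that adding a simple ViT-style patching transformation can considerably reduce a diffusion model's sampling time and memory usage. We justify our approach both through an analysis of the diffusion model objective and through empirical experiments on LSUN Church, ImageNet 256, and FFHQ 1024. We provide implementations in TensorFlow and PyTorch.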
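A ViT-style patching transformation of the kind mentioned above can be sketched as a lossless rearrangement that trades spatial resolution for channels. This is a minimal illustration in plain Python on a single-channel image, not the authors' TensorFlow or PyTorch implementation. The names `patchify` and `unpatchify` are hypothetical.

```python
from typing import List

Image = List[List[float]]            # H x W, single channel
Patched = List[List[List[float]]]    # (H/p) x (W/p) x (p*p)


def patchify(image: Image, p: int) -> Patched:
    """Fold each p x p block of pixels into one grid cell with p*p channels."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("height and width must be divisible by p")
    return [
        [
            [image[i * p + di][j * p + dj] for di in range(p) for dj in range(p)]
            for j in range(w // p)
        ]
        for i in range(h // p)
    ]


def unpatchify(patched: Patched, p: int) -> Image:
    """Inverse of patchify: scatter the p*p channels back to pixel positions."""
    gh, gw = len(patched), len(patched[0])
    image = [[0.0] * (gw * p) for _ in range(gh * p)]
    for i in range(gh):
        for j in range(gw):
            for k, value in enumerate(patched[i][j]):
                image[i * p + k // p][j * p + k % p] = value
    return image


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patched = patchify(image, 2)
restored = unpatchify(patched, 2)
```

With p = 2 the network would see a grid with a quarter as many spatial positions, which is the intuition behind the reduced per-iteration cost. The rearrangement itself discards no information, as the round trip shows.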
Diffusion models are rising as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but its speed still lags far behind its GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion structure. We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping stone toward real-time and high-fidelity diffusion models. Our code and pre-trained checkpoints will be available at https://github.com/VinAIResearch/WaveDiff.git.
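Denoising diffusion probabilistic models (DDPM) and vision transformers (ViT) have demonstrated significant progress in generative and discriminative tasks, respectively, and thus far these models have largely been developed within their own domains. In this paper, we establish a direct connection between DDPM and ViT by integrating the ViT architecture into DDPM, and introduce a new generative model called Generative ViT (GenViT). The modeling flexibility of ViT enables us to further extend GenViT to hybrid discriminative-generative modeling and to introduce a Hybrid ViT (HybViT). Our work is among the first to explore a single ViT for joint image generation and classification. We conduct a series of experiments to analyze the performance of the proposed models and demonstrate that they surpass the prior state of the art in both generative and discriminative tasks. Our code and pre-trained models can be found at https://github.com/sndnyang/diffusion_vit.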
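In this paper, we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to a bounded 2-norm, an improvement of 14 percentage points over the prior certified SOTA using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without any fine-tuning or retraining of model parameters.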
Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALLE-2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model to a diffusion model that requires much fewer sampling steps. For standard diffusion models trained on the pixel-space, our approach is able to generate images visually comparable to that of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to that of the original model while being up to 256 times faster to sample from. For diffusion models trained on the latent-space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2-4 denoising steps.
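The inference cost described above comes from evaluating a conditional and an unconditional prediction at every sampling step and combining them. The sketch below shows one common parameterization of that combination, with guidance weight `w`. It is an illustration under assumptions: `toy_model` is a made-up stand-in for a trained network, and passing `None` as the label to obtain the unconditional prediction is a convention chosen here, not something stated in the abstract.

```python
from typing import Callable, List, Optional

Vector = List[float]
Denoiser = Callable[[Vector, int, Optional[int]], Vector]

calls = {"count": 0}  # tracks how many network evaluations one step costs


def toy_model(x: Vector, t: int, label: Optional[int]) -> Vector:
    """Made-up denoiser: shifts the input by 1 when a label is given."""
    calls["count"] += 1
    shift = 0.0 if label is None else 1.0
    return [xi + shift for xi in x]


def guided_prediction(model: Denoiser, x: Vector, t: int, label: int, w: float) -> Vector:
    """Combine predictions as (1 + w) * conditional - w * unconditional."""
    cond = model(x, t, label)     # first evaluation
    uncond = model(x, t, None)    # second evaluation
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


guided = guided_prediction(toy_model, [0.0, 1.0], t=10, label=3, w=2.0)
```

Each guided step costs two evaluations here. The first distillation stage in the abstract replaces this pair with a single student model trained to match the combined output.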
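Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain that inverts the forward diffusion process. To achieve competitive data generation performance, they require a long diffusion chain, which makes them computationally intensive not only in training but also in generation. To significantly improve computational efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data all the way to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show that our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of both generation performance and the number of required inverse diffusion steps.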
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. We release our code at https://github.com/openai/guided-diffusion.
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.
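For readers unfamiliar with diffusion probabilistic models, the forward (noising) process they invert can be sketched in a few lines. This is a minimal illustration, not the released implementation. It assumes a linear variance schedule with T = 1000 steps from 1e-4 to 0.02 and the closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, neither of which appears in the abstract itself. The function names are hypothetical.

```python
import math
from typing import List


def linear_alpha_bars(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars: List[float] = []
    running = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0: List[float], t: int, alpha_bars: List[float], eps: List[float]) -> List[float]:
    """Noise x0 to step t in one shot, given Gaussian noise eps of the same length."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * e for x, e in zip(x0, eps)]


alpha_bars = linear_alpha_bars()
# With eps fixed to zero, only the shrinking signal term remains.
x_late = q_sample([1.0, -1.0], t=999, alpha_bars=alpha_bars, eps=[0.0, 0.0])
```

By the last step almost none of the original signal remains, so the sample is close to pure Gaussian noise. A trained model runs this process in reverse, one denoising step at a time.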
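Generating temporally coherent, high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables joint training on image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate longer and higher-resolution videos, we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/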
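Diffusion models are powerful generative models that simulate the reverse of diffusion processes using score functions to synthesize data from noise. The sampling process of diffusion models can be interpreted as solving the reverse stochastic differential equation (SDE) or the ordinary differential equation (ODE) of the diffusion process, which often requires up to thousands of discretization steps to generate a single image. This has sparked great interest in developing efficient integration techniques for reverse-S/ODEs. Here, we propose an orthogonal approach to accelerating score-based sampling: Denoising MCMC (DMCMC). DMCMC first uses MCMC to produce samples in the product space of data and variance (or diffusion time). Then, a reverse-S/ODE integrator is used to denoise the MCMC samples. Since MCMC traverses close to the data manifold, the computational cost of producing a clean sample with DMCMC is much lower than that of producing a clean sample from noise. To verify the proposed concept, we show that Denoising Langevin Gibbs (DLG), an instance of DMCMC, successfully accelerates all six reverse-S/ODE integrators considered in this work on CIFAR10 and CelebA-HQ-256 image generation. Notably, combined with the integrators of Karras et al. (2022) and the pre-trained score models of Song et al. (2021b), DLG achieves SOTA results. In the limited number of score function evaluations (NFE) setting on CIFAR10, we report FIDs of $3.86$ and $2.63$, the latter with $\approx 20$ NFE. On CelebA-HQ-256, we obtain $6.99$ FID with $\approx 160$ NFE, which beats the current best record of Kim et al. (2022) among score-based models, $7.16$ FID with $4000$ NFE. Code: https://github.com/1202KBS/DMCMC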
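The integration of the Vector Quantised Variational AutoEncoder (VQ-VAE) with autoregressive models as the generation component has yielded high-quality results in image generation. However, autoregressive models strictly follow a progressive scanning order during the sampling phase. As a result, existing VQ-series models can hardly escape the trap of lacking global information. Denoising Diffusion Probabilistic Models (DDPM) in the continuous domain have shown the ability to capture global context while generating high-quality images. In the discrete state space, some works have demonstrated the potential for text generation and low-resolution image generation. We show that, with the help of a content-rich discrete visual codebook from VQ-VAE, a discrete diffusion model can also generate high-fidelity images with global context, which compensates for the deficiency of the classical autoregressive model along pixel space. Meanwhile, integrating the discrete VAE with the diffusion model resolves both the drawback of conventional autoregressive models being oversized and the drawback of diffusion models requiring excessive sampling time when generating images. We find that the quality of the generated images depends heavily on the discrete visual codebook. Extensive experiments demonstrate that the proposed Vector Quantised Discrete Diffusion Model (VQ-DDM) achieves performance comparable to top-tier methods with low complexity. It also shows outstanding advantages over other vector-quantised models with autoregressive components on image inpainting tasks without additional training.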
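Recently, Rissanen et al. (2022) presented a new type of diffusion process for generative modeling based on heat dissipation, or blurring, as an alternative to isotropic Gaussian diffusion. Here, we show that blurring can equivalently be defined through a Gaussian diffusion process with non-isotropic noise. In making this connection, we bridge the gap between inverse heat dissipation and denoising diffusion, and we shed light on the inductive bias that results from this modeling choice. Finally, we propose a generalized class of diffusion models that offers the best of both standard Gaussian denoising diffusion and inverse heat dissipation, which we call Blurring Diffusion Models.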
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, texts in image captions are categorical and short with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, it achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. It also achieves a CIDEr score +26.8 higher than the auto-regressive baseline (230.3 vs. 203.5) on a caption infilling task. With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and their distribution differs substantially from that of conditional modalities, such as textual descriptors in natural languages, it is hard to learn a probabilistic mapping from the desired conditional modality to the human motion sequences. Besides, the raw motion data from the motion capture system might be redundant in sequences and contain noise; directly modeling the joint distribution over the raw motion sequences and conditional modalities would incur a heavy computational overhead and might result in artifacts introduced by the captured noise. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) and arrive at a representative and low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish the connections between the raw motion sequences and the conditional inputs, we perform a diffusion process on the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) can produce vivid motion sequences conforming to the given conditional inputs and substantially reduce the computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models on raw motion sequences.
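Transformers have become prevalent in computer vision, especially for high-level vision tasks. However, adopting Transformers in the generative adversarial network (GAN) framework remains an open and challenging problem. In this paper, we conduct a comprehensive empirical study to investigate the properties of Transformers in GANs for high-fidelity image synthesis. Our analysis highlights and reaffirms the importance of feature locality in image generation, although the merits of locality are already well known in classification tasks. Perhaps more interestingly, we find that the residual connections in self-attention layers are harmful for learning Transformer-based discriminators and conditional generators. We carefully examine their influence and propose effective ways to mitigate the negative impact. Our study leads to a new alternative design of Transformers in GANs, a convolutional neural network (CNN)-free generator termed STrans-G, which achieves competitive results in both unconditional and conditional image generation. The Transformer-based discriminator, STrans-D, also significantly reduces its gap to CNN-based discriminators.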