The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed the name DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, texts in image captions are categorical and short with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, it achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. It also performs +26.8 higher CIDEr score than the auto-regressive baseline (230.3 v.s.203.5) on a caption infilling task. With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive to the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
translated by 谷歌翻译
我们介绍了文本到图像生成的矢量量化扩散(VQ-扩散)模型。该方法基于矢量量化变分性AutoEncoder(VQ-VAE),其潜像通过最近开发的去噪扩散概率(DDPM)的条件变体为基础。我们发现这种潜在空间方法非常适合于图像到图像生成任务,因为它不仅消除了具有现有方法的单向偏差,还允许我们结合掩模和更换的扩散策略,以避免积累错误,这是现有方法的严重问题。我们的实验表明,与具有类似数量的参数数量的传统自回归(AR)模型相比,VQ扩散产生明显更好的文本到图像生成结果。与以前的基于GAN的文本到图像方法相比,我们的VQ扩散可以通过大边缘处理更复杂的场景并提高合成的图像质量。最后,我们表明我们的方法中的图像生成计算可以通过Reparameter化进行高效。利用传统的AR方法,文本到图像生成时间随输出图像分辨率线性增加,因此即使对于正常尺寸图像也是相当耗时的。 VQ-扩散使我们能够在质量和速度之间实现更好的权衡。我们的实验表明,具有Reparameterization的VQ扩散模型比传统的AR方法快15倍,同时实现更好的图像质量。
translated by 谷歌翻译
We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models have a shared training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve the generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can be used as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information of each token. Second, we investigate several designs of incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score.
translated by 谷歌翻译
Recently, vector quantized autoregressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by equally predicting discrete image tokens from the top left to bottom right in the latent space. Although the simple generative process surprisingly works well, is this the best way to generate the image? For instance, human creation is more inclined to the outline-to-fine of an image, while VQ-AR models themselves do not consider any relative importance of each component. In this paper, we present a progressive denoising model for high-fidelity text-to-image image generation. The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context in a parallel manner and this procedure is recursively applied until an image sequence is completed. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments demonstrate that the progressive model produces significantly better results when compared with the previous VQ-AR method in FID score across a wide variety of categories and aspects. Moreover, the text-to-image generation time of traditional AR increases linearly with the output image resolution and hence is quite time-consuming even for normal-size images. In contrast, our approach allows achieving a better trade-off between generation quality and speed.
translated by 谷歌翻译
The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling the multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even vision-language simultaneous generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
translated by 谷歌翻译
Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}.
translated by 谷歌翻译
产生人类想要的声音效果是一个重要的话题。但是,在这一领域,很少有研究声音发电。在这项研究中,我们调查了以文本提示为条件的声音,并提出了一个新型的文本对生成框架,该框架由文本编码器组成,矢量量化了变异自动编码器(VQ-VAE),解码器和歌手。该框架首先使用解码器将从文本编码器提取的文本特征传递到借助VQ-VAE的MEL光谱图中,然后使用Vocoder将生成的MEL光谱图转换为波形。我们发现,解码器显着影响发电性能。因此,我们专注于在这项研究中设计一个好的解码器。我们从传统的自动回解码器开始,该解码器已被证明是以前的Sound Generation Works中的最先进方法。但是,AR解码器始终按顺序预测MEL-SPECTROGIN图令牌,这引入了单向偏见和错误问题的积累。此外,使用AR解码器,声音生成时间随着声音持续时间线性增加。为了克服AR解码器引入的缺点,我们提出了一个基于离散扩散模型的非自动回形解码器,称为DiffSound。具体而言,DIFFSOUND可以在一个步骤中预测所有MEL光谱图令牌,然后在下一步中完善预测的令牌,因此可以在几个步骤后获得最优于预测的结果。我们的实验表明,与AR解码器相比,我们提出的差异不仅产生更好的文本到单一生成结果,而且还具有更快的生成速度,例如MOS:3.56 \ textit {v.s} 2.786,并且生成速度为五个比AR解码器快的时间。
translated by 谷歌翻译
We introduce M-VADER: a diffusion model (DM) for image generation where the output can be specified using arbitrary combinations of images and text. We show how M-VADER enables the generation of images specified using combinations of image and text, and combinations of multiple images. Previously, a number of successful DM image generation algorithms have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13 billion parameter multimodal decoder combining components from an autoregressive vision-language model MAGMA and biases finetuned for semantic search.
translated by 谷歌翻译
扩散模型(DMS)显示出高质量图像合成的巨大潜力。但是,当涉及到具有复杂场景的图像时,如何正确描述图像全局结构和对象细节仍然是一项具有挑战性的任务。在本文中,我们提出了弗里多(Frido),这是一种特征金字塔扩散模型,该模型执行了图像合成的多尺度粗到1个降解过程。我们的模型将输入图像分解为依赖比例的矢量量化特征,然后是用于产生图像输出的粗到细门。在上述多尺度表示阶段,可以进一步利用文本,场景图或图像布局等其他输入条件。因此,还可以将弗里多应用于条件或跨模式图像合成。我们对各种无条件和有条件的图像生成任务进行了广泛的实验,从文本到图像综合,布局到图像,场景环形图像到标签形象。更具体地说,我们在五个基准测试中获得了最先进的FID分数,即可可和开阔图像的布局到图像,可可和视觉基因组的场景环形图像以及可可的标签对图像图像。 。代码可在https://github.com/davidhalladay/frido上找到。
translated by 谷歌翻译
Diffusion model, a new generative modelling paradigm, has achieved great success in image, audio, and video generation. However, considering the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to natural language, and text diffusion models are less studied. Sequence-to-sequence text generation is one of the essential natural language processing topics. In this work, we apply diffusion models to approach sequence-to-sequence text generation, and explore whether the superiority generation performance of diffusion model can transfer to natural language domain. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. SeqDiffuSeq uses an encoder-decoder Transformers architecture to model denoising function. In order to improve generation quality, SeqDiffuSeq combines the self-conditioning technique and a newly proposed adaptive noise schedule technique. The adaptive noise schedule has the difficulty of denoising evenly distributed across time steps, and considers exclusive noise schedules for tokens at different positional order. Experiment results illustrate the good performance on sequence-to-sequence generation in terms of text quality and inference time.
translated by 谷歌翻译
In this paper, we propose a large-scale language pre-training for text GENeration using dIffusion modEl, which is named GENIE. GENIE is a pre-training sequence-to-sequence text generation model which combines Transformer and diffusion. The diffusion model accepts the latent information from the encoder, which is used to guide the denoising of the current time step. After multiple such denoise iterations, the diffusion model can restore the Gaussian noise to the diverse output text which is controlled by the input text. Moreover, such architecture design also allows us to adopt large scale pre-training on the GENIE. We propose a novel pre-training method named continuous paragraph denoise based on the characteristics of the diffusion model. Extensive experiments on the XSum, CNN/DailyMail, and Gigaword benchmarks shows that GENIE can achieves comparable performance with various strong baselines, especially after pre-training, the generation quality of GENIE is greatly improved. We have also conduct a lot of experiments on the generation diversity and parameter impact of GENIE. The code for GENIE will be made publicly available.
translated by 谷歌翻译
近年来在开发更好的图像标题模型方面取得了巨大进展,但其中大多数依赖于单独的对象探测器来提取区域特征。最近的视觉语言研究通过利用网格表示来实现更灵活的模型训练和更快推理速度的速度来转向探测器趋势。但是,这种发展主要专注于图像理解任务,并且对标题生成任务的研究仍然较少。在本文中,我们涉及一种更好的无需探测器图像标题模型,并提出了一种基于纯视觉变压器的图像标题模型,称为VITCAP,其中使用了网格表示而不提取区域特征。为了提高性能,我们介绍了一种新颖的概念令牌网络(CTN)来预测语义概念,然后将它们纳入端到端的标题。特别地,CTN是基于视觉变换器构建的,并且旨在通过分类任务预测概念令牌,其中包含丰富的语义信息极大地利益标题任务。与以前的探测器的模型相比,Vitcap大大简化了架构,同时在各种具有挑战性的图像标题数据集上实现了竞争性能。特别是,Vitcap分别达到138.1苹果酒分数,即在Nocaps上的Coco-Caption Karpatal-Splity,93.8和108.6苹果酒分数和Google-CC标题数据集上分别达到138.1苹果酒分数。
translated by 谷歌翻译
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
translated by 谷歌翻译
近年来,根据Vision-Language预训练(VLP),我们在图像标题任务中掌握了显着的性能提升。比例被认为是这一进步的重要因素。然而,大多数现有工作仅侧重于预训练的变压器,在大约400万图像上具有中等大小(例如,12或24层)。在本文中,我们呈现柠檬,一个大规模的图像标题器,并为图像标题的VLP的缩放行为提供第一个实证研究。我们使用最先进的VINVL模型作为我们的参考模型,它由图像特征提取器和变压器模型组成,并将变压器上下放大,模型大小范围从13到675万参数。在数据方面,我们通过高达200万图像文本对进行实验,该对基于图像的Alt属性自动从Web自动收集(称为ALT200M)。广泛的分析有助于将性能趋势表征为模型大小和预训练数据尺寸增加。我们还比较不同的培训配方,特别是在大规模嘈杂数据上培训。结果,柠檬在几个主要图像标题基准上实现了新的技术状态,包括Coco标题,Nocaps和概念标题。我们还显示柠檬可以在以零拍摄方式使用时生成带有长尾视觉概念的标题。
translated by 谷歌翻译
Evaluating and comparing text-to-image models is a challenging problem. Significant advances in the field have recently been made, piquing interest of various industrial sectors. As a consequence, a gold standard in the field should cover a variety of tasks and application contexts. In this paper a novel evaluation approach is experimented, on the basis of: (i) a curated data set, made by high-quality royalty-free image-text pairs, divided into ten categories; (ii) a quantitative metric, the CLIP-score, (iii) a human evaluation task to distinguish, for a given text, the real and the generated images. The proposed method has been applied to the most recent models, i.e., DALLE2, Latent Diffusion, Stable Diffusion, GLIDE and Craiyon. Early experimental results show that the accuracy of the human judgement is fully coherent with the CLIP-score. The dataset has been made available to the public.
translated by 谷歌翻译
随着信息中的各种方式存在于现实世界中的各种方式,多式联信息之间的有效互动和融合在计算机视觉和深度学习研究中的多模式数据的创造和感知中起着关键作用。通过卓越的功率,在多式联运信息中建模互动,多式联运图像合成和编辑近年来已成为一个热门研究主题。与传统的视觉指导不同,提供明确的线索,多式联路指南在图像合成和编辑方面提供直观和灵活的手段。另一方面,该领域也面临着具有固有的模态差距的特征的几个挑战,高分辨率图像的合成,忠实的评估度量等。在本调查中,我们全面地阐述了最近多式联运图像综合的进展根据数据模型和模型架构编辑和制定分类。我们从图像合成和编辑中的不同类型的引导方式开始介绍。然后,我们描述了多模式图像综合和编辑方法,其具有详细的框架,包括生成的对抗网络(GAN),GaN反转,变压器和其他方法,例如NERF和扩散模型。其次是在多模式图像合成和编辑中广泛采用的基准数据集和相应的评估度量的综合描述,以及分析各个优点和限制的不同合成方法的详细比较。最后,我们为目前的研究挑战和未来的研究方向提供了深入了解。与本调查相关的项目可在HTTPS://github.com/fnzhan/mise上获得
translated by 谷歌翻译
现有的图像字幕的方法通常从左到右生成句子逐字,并在本地上下文中受到限制,包括给定的图像和历史记录生成的单词。在解码过程中,有许多研究目的是利用全球信息,例如迭代改进。但是,它仍然探讨了如何有效,有效地纳入未来的环境。为了回答这个问题,受到非自动回归图像字幕(NAIC)的启发,可以通过修改后的掩码操作利用两侧关系,我们的目标是将此进步嫁接到常规的自动回归图像字幕(AIC)模型,同时保持推理效率而无需进行推理效率额外的时间成本。具体而言,首先对AIC和NAIC模型结合了共享的视觉编码器,迫使视觉编码器包含足够有效的未来上下文。然后鼓励AIC模型捕获NAIC模型在其不自信的单词上互换的跨层互换的因果动态,该单词遵循教师学生的范式,并通过分配校准训练目标进行了优化。经验证据表明,我们所提出的方法清楚地超过了自动指标和人类评估的最新基线,对MS COCO基准测试。源代码可在以下网址获得:https://github.com/feizc/future-caption。
translated by 谷歌翻译
Score-based diffusion models have captured widespread attention and funded fast progress of recent vision generative tasks. In this paper, we focus on diffusion model backbone which has been much neglected before. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements the performance of vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone as an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and the training pipelines for diffusion-based generative models.
translated by 谷歌翻译
Can continuous diffusion models bring the same performance breakthrough on natural language they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens in a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models - while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
translated by 谷歌翻译
利用深度学习的最新进展,文本到图像生成模型目前具有吸引公众关注的优点。其中两个模型Dall-E 2和Imagen已经证明,可以从图像的简单文本描述中生成高度逼真的图像。基于一种称为扩散模型的新型图像生成方法,文本对图像模型可以生产许多不同类型的高分辨率图像,其中人类想象力是唯一的极限。但是,这些模型需要大量的计算资源来训练,并处理从互联网收集的大量数据集。此外,代码库和模型均未发布。因此,它可以防止AI社区尝试这些尖端模型,从而使其结果复制变得复杂,即使不是不可能。在本文中,我们的目标是首先回顾这些模型使用的不同方法和技术,然后提出我们自己的文本模型模型实施。高度基于DALL-E 2,我们引入了一些轻微的修改,以应对所引起的高计算成本。因此,我们有机会进行实验,以了解这些模型的能力,尤其是在低资源制度中。特别是,我们提供了比Dall-e 2的作者(包括消融研究)更深入的分析。此外,扩散模型使用所谓的指导方法来帮助生成过程。我们引入了一种新的指导方法,该方法可以与其他指导方法一起使用,以提高图像质量。最后,我们的模型产生的图像质量相当好,而不必维持最先进的文本对图像模型的重大培训成本。
translated by 谷歌翻译