In this paper, we propose GENIE, a large-scale language pre-training approach for text GENeration using dIffusion modEl. GENIE is a pre-trained sequence-to-sequence text generation model that combines a Transformer with diffusion. The diffusion model accepts latent information from the encoder, which guides the denoising at the current time step. After multiple such denoising iterations, the diffusion model restores Gaussian noise to diverse output text that is controlled by the input text. Moreover, this architecture allows us to apply large-scale pre-training to GENIE. We propose a novel pre-training method, named continuous paragraph denoise, based on the characteristics of the diffusion model. Extensive experiments on the XSum, CNN/DailyMail, and Gigaword benchmarks show that GENIE achieves performance comparable to various strong baselines; in particular, pre-training greatly improves GENIE's generation quality. We also conduct extensive experiments on the generation diversity and parameter impact of GENIE. The code for GENIE will be made publicly available.
We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models have a shared training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve the generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can be used as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information of each token. Second, we investigate several designs of incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score.
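As a rough illustration of a discrete forward diffusion process with an absorbing state, the sketch below replaces each token by a mask symbol with a probability that grows with the time step. The linear schedule `t / num_steps`, the `[MASK]` string, and the function name `absorb_forward` are assumptions made for this example; the abstract describes a token-dependent noise schedule, which is not reproduced here.

```python
import random

MASK = "[MASK]"  # absorbing state; the token name is an assumption


def absorb_forward(tokens, t, num_steps, rng):
    """Sample a noised sequence x_t from the clean sequence x_0.

    Each token is independently replaced by MASK with probability
    t / num_steps (a simple linear schedule, assumed for illustration).
    """
    if not 0 <= t <= num_steps:
        raise ValueError("t must lie in [0, num_steps]")
    mask_prob = t / num_steps
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = "the cat sat on the mat".split()
    for t in (0, 5, 10):
        print(t, absorb_forward(x0, t, num_steps=10, rng=rng))
```

At `t = 0` the sequence is unchanged and at `t = num_steps` every token is masked; a model such as BERT would then be trained to undo this corruption step by step.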
Diffusion models have achieved state-of-the-art synthesis quality on visual and audio tasks, and recent works adapt them to textual data by diffusing on the embedding space. But the difference between the continuous data space and the embedding space raises challenges to the diffusion model, which have not been carefully explored. In this paper, we conduct systematic studies and analyze the challenges threefold. Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the loss function. Secondly, as the norm of embedding varies between popular and rare words, adding the same noise scale will lead to sub-optimal results. In addition, we find that noises sampled from a standard Gaussian distribution may distract the diffusion process. To solve the above challenges, we propose Difformer, a denoising diffusion probabilistic model based on Transformer, which consists of three techniques including utilizing an anchor loss function, a layer normalization module for embeddings, and a norm factor to the Gaussian noise. All techniques are complementary to each other and critical to boosting the model performance together. Experiments are conducted on benchmark datasets over two seminal text generation tasks including machine translation and text summarization. The results show that Difformer significantly outperforms the embedding diffusion baselines, while achieving competitive results with strong autoregressive baselines.
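To make the second challenge concrete, the sketch below applies the standard Gaussian forward step x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps to an embedding and computes a per-vector signal-to-noise ratio. The helper names `noise_embedding` and `snr`, and the SNR definition used, are assumptions for illustration and are not taken from Difformer itself.

```python
import math
import random


def noise_embedding(x0, alpha_bar, rng):
    """Return x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def snr(x0, alpha_bar):
    """Signal power over expected noise power for one embedding (assumed definition)."""
    signal_power = alpha_bar * sum(v * v for v in x0)
    noise_power = (1.0 - alpha_bar) * len(x0)
    return signal_power / noise_power


if __name__ == "__main__":
    rng = random.Random(0)
    small = [1.0, 2.0]   # hypothetical small-norm embedding
    large = [2.0, 4.0]   # same direction, twice the norm
    print(noise_embedding(small, 0.5, rng))
    print(snr(small, 0.5), snr(large, 0.5))
```

Under this definition, doubling the norm of an embedding quadruples its signal-to-noise ratio at the same noise level, which is one way to see why a single noise scale treats embeddings of different norms unequally.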
Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to autoregressive language generation. We instead view diffusion as a complementary method that can augment the generative capabilities of existing pre-trained language models. We demonstrate that continuous diffusion models can be learned in the latent space of a pre-trained encoder-decoder model, enabling us to sample continuous latent representations that can be decoded into natural language with the pre-trained decoder. We show that our latent diffusion models are more effective at sampling novel text from data distributions than a strong autoregressive baseline and also enable controllable generation.
Diffusion models, a new generative modelling paradigm, have achieved great success in image, audio, and video generation. However, given the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to natural language, and text diffusion models are less studied. Sequence-to-sequence text generation is one of the essential natural language processing topics. In this work, we apply diffusion models to sequence-to-sequence text generation and explore whether the superior generation performance of diffusion models can transfer to the natural language domain. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. SeqDiffuSeq uses an encoder-decoder Transformer architecture to model the denoising function. To improve generation quality, SeqDiffuSeq combines the self-conditioning technique with a newly proposed adaptive noise schedule technique. The adaptive noise schedule distributes the difficulty of denoising evenly across time steps and uses separate noise schedules for tokens at different positions. Experimental results show good performance on sequence-to-sequence generation in terms of text quality and inference time.
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, texts in image captions are categorical and short with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, it achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. It also achieves a CIDEr score +26.8 higher than the auto-regressive baseline (230.3 vs. 203.5) on a caption infilling task. With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
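The two noising approaches named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the spans are passed in explicitly as `(start, length)` pairs rather than sampled, and the `<mask>` string and the function names `infill` and `shuffle_sentences` are hypothetical, not BART's actual implementation.

```python
import random

MASK = "<mask>"  # placeholder mask token; the exact string is an assumption


def infill(tokens, spans):
    """Replace each (start, length) span of tokens with a single MASK token.

    Spans must be non-overlapping and lie inside the sequence.
    """
    out, pos = [], 0
    for start, length in sorted(spans):
        if length < 0 or start < pos or start + length > len(tokens):
            raise ValueError("spans must be in range and non-overlapping")
        out.extend(tokens[pos:start])  # copy the untouched tokens before the span
        out.append(MASK)               # the whole span collapses to one mask
        pos = start + length
    out.extend(tokens[pos:])
    return out


def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order, leaving the input list unchanged."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    tokens = "a b c d e f".split()
    print(infill(tokens, [(1, 2), (4, 1)]))
    print(shuffle_sentences(["S1.", "S2.", "S3."], random.Random(0)))
```

Because a span of any length becomes one mask token, the model reconstructing the text also has to infer how many tokens are missing.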
Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows us to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models, while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
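Generating sound effects that humans want is an important topic. However, few studies in this area address sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of the VQ-VAE, and then the vocoder transforms the generated mel-spectrogram into a waveform. We find that the decoder significantly influences the generation performance. Thus, we focus on designing a good decoder in this study. We begin with the traditional autoregressive (AR) decoder, which has been shown to be a state-of-the-art method in previous sound generation works. However, the AR decoder always predicts the mel-spectrogram tokens one by one in order, which introduces unidirectional bias and the accumulation of errors. Moreover, with the AR decoder, the sound generation time increases linearly with the sound duration. To overcome the shortcomings of AR decoders, we propose a non-autoregressive decoder based on the discrete diffusion model, named Diffsound. Specifically, Diffsound predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best-predicted results can be obtained after several steps. Our experiments show that, compared with the AR decoder, the proposed Diffsound not only produces better text-to-sound generation results but also generates faster, e.g., MOS: 3.56 vs. 2.786, with a generation speed five times faster than the AR decoder.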
Controllable Text Generation (CTG) is an emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints of practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods needs to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the last 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review of the common tasks, main approaches, and evaluation methods in this area. Finally, we discuss the challenges that the field is facing and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
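In this paper, we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied to a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.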
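Due to exposure bias, most existing natural language generation (NLG) models trained with the maximum likelihood objective produce poor text at the inference stage. In this paper, to tackle this problem, we revisit the generate-then-rank framework and propose a joint generator-ranker (JGR) training algorithm for text generation tasks. In JGR, the generator model is trained by maximizing two objectives: the likelihood of the training corpus and the expected reward given by the ranker model. Meanwhile, the ranker model takes input samples from the generator model and learns to distinguish good samples from the generation pool. The generator and ranker models are optimized alternately until convergence. In the empirical study, the proposed JGR model achieves new state-of-the-art performance on five public benchmarks covering three popular generation tasks: summarization, question generation, and response generation. We will make code, data, and models available at https://github.com/microsoft/advnlg.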
Diffusion models have quickly become the go-to paradigm for generative modelling of perceptual signals (such as images and sound) through iterative refinement. Their success hinges on the fact that the underlying physical phenomena are continuous. For inherently discrete and categorical data such as language, various diffusion-inspired alternatives have been proposed. However, the continuous nature of diffusion models conveys many benefits, and in this work we endeavour to preserve it. We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space. We demonstrate its efficacy on several language modelling tasks.
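Contrastive learning models have achieved great success in unsupervised visual representation learning. They maximize the similarity between feature representations of different views of the same image while minimizing the similarity between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document, and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, in which we treat a document, its gold summary, and its model-generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings than its counterpart without contrastive objectives.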
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
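One plausible way to realize the three self-attention masks is sketched below, where entry `[i][j] == 1` means position `i` may attend to position `j`. The function name `attention_mask`, the 0/1 list-of-lists representation, and the exact sequence-to-sequence layout (source positions see only the source; target positions see the source and earlier targets) are assumptions for illustration, not details confirmed by the abstract.

```python
def attention_mask(mode, src_len, tgt_len=0):
    """Build an n x n 0/1 attention mask, with n = src_len + tgt_len."""
    n = src_len + tgt_len
    if mode == "bidirectional":
        # every position sees every other position
        return [[1] * n for _ in range(n)]
    if mode == "unidirectional":
        # left-to-right: position i sees positions 0..i
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if mode == "seq2seq":
        mask = []
        for i in range(n):
            if i < src_len:
                # source tokens attend to the whole source, never to the target
                row = [1 if j < src_len else 0 for j in range(n)]
            else:
                # target tokens attend to the source and to targets up to themselves
                row = [1 if j <= i else 0 for j in range(n)]
            mask.append(row)
        return mask
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for row in attention_mask("seq2seq", src_len=2, tgt_len=2):
        print(row)
```

Since only the mask changes between the three objectives, the same Transformer parameters can be shared across all of them.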
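Sign Language Production (SLP) aims to automatically translate spoken languages into sign sequences. The core process of SLP is to transform sign gloss sequences into their corresponding sign pose sequences (G2P). Most existing G2P models perform this conditional long-range generation in an autoregressive manner, which inevitably leads to an accumulation of errors. To address this issue, we propose a vector quantized diffusion method for conditional pose sequence generation, called PoseVQ-Diffusion, which is an iterative non-autoregressive method. Specifically, we first introduce a vector quantized variational autoencoder (Pose-VQVAE) model to represent a pose sequence as a sequence of latent codes. Then we model the latent discrete space with an extension of the recently developed diffusion architecture. To better leverage the spatial-temporal information, we introduce a novel architecture, namely CodeUnet, to generate higher-quality pose sequences in the discrete space. Moreover, taking advantage of the learned codes, we develop a novel sequential k-nearest-neighbours method to predict the variable lengths of pose sequences for the corresponding gloss sequences. Consequently, compared with autoregressive G2P models, our model has a faster sampling speed and produces significantly better results. Compared with previous non-autoregressive G2P methods, PoseVQ-Diffusion improves the predicted results through iterative refinement, thus achieving state-of-the-art results on the SLP evaluation benchmark.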
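The Query Focused Text Summarization (QFTS) task aims at building systems that generate a summary of the text document(s) based on a given query. A key challenge in addressing this task is the lack of large labeled datasets for training summarization models. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task, while setting new state-of-the-art results across a set of automatic and human evaluation metrics.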
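Celebrity endorsement is one of the most significant strategies in brand communication. Nowadays, more and more companies try to build vivid characteristics for themselves. Therefore, their brand identity communications should conform to certain human-like characteristics and to regulations. However, previous works mostly stop at assumptions instead of proposing a specific way to perform matching between brands and celebrities. In this paper, we propose a brand celebrity matching model (BCM) based on Natural Language Processing (NLP) techniques. Given a brand and a celebrity, we first obtain some descriptive documents about them from the Internet, then summarize these documents, and finally calculate a matching degree between the brand and the celebrity to determine whether they match. According to the experimental results, our proposed model outperforms the best baselines by 0.362 in F1 score and 6.3% in accuracy, which indicates the effectiveness and application value of our model in real-world scenarios. Moreover, to the best of our knowledge, the proposed BCM model is the first work to use NLP to solve endorsement issues, so it can provide some novel research ideas and methodologies for subsequent work.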
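Compared with single-document summarization, abstractive Multi-Document Summarization (MDS) poses challenges for the representation and coverage of its lengthy and linked sources. This study develops a Parallel Hierarchical Transformer (PHT) with attention alignment for MDS. By incorporating word- and paragraph-level multi-head attention, the hierarchical architecture of PHT allows better processing of dependencies at both the token and document levels. To guide decoding towards better coverage of the source documents, an attention-alignment mechanism is then introduced to calibrate beam search with predicted optimal attention distributions. Based on the WikiSum data, a comprehensive evaluation is conducted to test the improvements on MDS brought by the proposed architecture. By better handling inner- and cross-document information, results in both ROUGE and human evaluation suggest that our hierarchical model generates summaries of higher quality at relatively low computational cost.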
Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target tokens sampled from the distribution are incorrect. The target sentence is replaced with misclassified tokens to construct a noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments on language generation benchmarks show that GanLM, with its powerful language understanding capability, outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance.