Traditional training paradigms for extractive and abstractive summarization systems always use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose a contrastive-learning-based re-ranking framework for one-stage summarization, called COLO. By modeling a contrastive objective, we show that the summarization model is able to generate summaries directly according to summary-level scores, without additional modules or parameters. Extensive experiments show that COLO boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1, respectively, while preserving parameter efficiency and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3~8x speed-up during inference, while maintaining comparable results.
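The summary-level contrastive objective described above can be pictured as a pairwise margin loss over candidate summaries ordered by their ROUGE against the reference. The snippet below is a minimal sketch of that idea; the function name, margin value, and normalization are our illustrative assumptions, not COLO's released code.

```python
import torch
import torch.nn.functional as F

def contrastive_rank_loss(cand_scores: torch.Tensor, margin: float = 0.01) -> torch.Tensor:
    """Pairwise margin loss over model scores of candidate summaries.

    `cand_scores` holds the model's summary-level scores for one document's
    candidates, already sorted by descending ROUGE against the reference.
    For each pair (i, j) with i ranked above j, the better candidate's model
    score must exceed the worse one's by a rank-scaled margin.
    """
    loss = cand_scores.new_zeros(())
    n = cand_scores.size(0)
    for i in range(n - 1):
        for j in range(i + 1, n):
            loss = loss + F.relu(margin * (j - i) - (cand_scores[i] - cand_scores[j]))
    return loss / (n * (n - 1) / 2)

# Toy usage: four candidates with model scores (e.g., length-normalized log-probs).
scores = torch.tensor([0.2, 0.5, 0.1, 0.4], requires_grad=True)
print(contrastive_rank_loss(scores))  # penalizes mis-ordered pairs
```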
Existing summarization systems mostly generate summaries relying purely on the content of the source document. However, even for humans, we usually need some references or exemplars to help us fully understand a source document and write a summary in a particular format. How to find high-quality exemplars and incorporate them into summarization systems remains challenging and under-explored. In this paper, we propose a novel retrieval-augmented abstractive summarization framework consisting of a dense retriever and a summarizer. First, several closely related exemplars are retrieved as supplementary input to help the generation model understand the text more comprehensively. The retrieved exemplars can also play a role in guiding the model to capture the writing style of a specific corpus. We validate our method on a wide range of summarization datasets across multiple domains and two backbone models: BERT and BART. Results show that our framework obtains significant improvements of 1.38~4.66 ROUGE-1 over strong pre-trained models, and achieves a new state-of-the-art on BillSum. Human evaluation demonstrates that our retrieval-augmented model can better capture domain-specific writing styles.
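A minimal sketch of the retrieve-then-summarize flow described here: embed the source, fetch the nearest (document, summary) exemplars, and append the retrieved summaries as supplementary input. The `embed` stub, separator token, and helper names are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def embed(texts):
    """Stand-in for a dense encoder (e.g., a sentence-embedding model).
    Hypothetical: returns unit-norm vectors, one per text."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2 ** 32))
    vecs = rng.normal(size=(len(texts), 64))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve_exemplars(source, corpus, k=2):
    """Return the k corpus (document, summary) pairs closest to `source`."""
    q = embed([source])[0]
    keys = embed([doc for doc, _ in corpus])
    top = np.argsort(keys @ q)[::-1][:k]
    return [corpus[i] for i in top]

def build_augmented_input(source, exemplars, sep=" <exemplar> "):
    """Concatenate retrieved exemplar summaries to the source as extra input."""
    return source + sep + sep.join(summ for _, summ in exemplars)

corpus = [("doc about storms", "Storms hit the coast."),
          ("doc about markets", "Stocks rose sharply.")]
src = "A hurricane made landfall overnight ..."
print(build_augmented_input(src, retrieve_exemplars(src, corpus, k=1)))
```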
Recently, contrastive learning has attracted increasing interest in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal, which is crucial for generation tasks that rely on auto-regressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CoNT. CoNT addresses the bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects -- the construction of contrastive examples, the choice of the contrastive loss, and the decoding strategy. We validate CoNT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation, and commonsense generation. Experimental results show that CoNT clearly outperforms the conventional training framework on all ten benchmarks by a convincing margin. In particular, CoNT surpasses the previously most competitive contrastive learning method for text generation by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization. It achieves a new state of the art on summarization, code comment generation (without external data), and data-to-text generation.
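Since CoNT changes the decoding strategy as well as the loss, a useful mental model is a candidate score that mixes sequence likelihood with source-candidate similarity. The sketch below illustrates that kind of scoring under our own simplifying assumptions; the interpolation weight `alpha` and the exact combination are illustrative, not CoNT's published formula.

```python
import torch
import torch.nn.functional as F

def similarity_aware_score(seq_log_prob: torch.Tensor,
                           src_vec: torch.Tensor,
                           cand_vec: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """Score one decoded candidate: interpolate its length-normalized
    log-likelihood with the cosine similarity between source and candidate
    sequence representations."""
    sim = F.cosine_similarity(src_vec, cand_vec, dim=0)
    return alpha * seq_log_prob + (1.0 - alpha) * sim

# Toy usage: pick the best of three beam candidates for one source.
src = torch.randn(16)
cands = [torch.randn(16) for _ in range(3)]
log_probs = torch.tensor([-0.9, -1.1, -1.3])
scores = torch.stack([similarity_aware_score(lp, src, c)
                      for lp, c in zip(log_probs, cands)])
best = int(scores.argmax())
```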
Contrastive learning models have achieved great success in unsupervised visual representation learning, where they maximize the similarities between feature representations of different views of the same image while minimizing the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document, and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary, and its model-generated summaries as different views of the same mean representation, and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared to its counterpart without the contrastive objectives.
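A bare-bones version of the "different views of the same representation" objective: mean-pool the encoder states of the three views and push their pairwise cosine similarities up. This is a sketch under our own assumptions (mean pooling, plain cosine loss), not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def view_similarity_loss(doc_h, gold_h, gen_h):
    """doc_h/gold_h/gen_h: (seq_len, dim) hidden states of the three 'views'
    (document, gold summary, model-generated summary). Mean-pool each view
    and maximize pairwise cosine similarity by minimizing (1 - cos)."""
    d, g, s = (h.mean(dim=0) for h in (doc_h, gold_h, gen_h))
    pairs = [(d, g), (d, s), (g, s)]
    return sum(1 - F.cosine_similarity(a, b, dim=0) for a, b in pairs) / len(pairs)

# Toy usage with random encoder states of different lengths:
loss = view_similarity_loss(torch.randn(30, 16), torch.randn(8, 16), torch.randn(9, 16))
```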
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, recently proposed automatic evaluation metrics such as CTC scores exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging these recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment with several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
Pre-trained language models have been successful in natural language generation (NLG) tasks. While various decoding methods have been employed, they often produce suboptimal results. We first present an empirical analysis of three NLG tasks: summarization, machine translation, and constrained text generation. We find that selecting the best output from the results of multiple decoding methods can significantly improve performance. To further improve reranking for NLG tasks, we propose a novel method, \textsc{PairReranker}, which uses a single encoder and a pairwise loss function to jointly encode a source input and a pair of candidates and compare them. Experiments on three NLG tasks demonstrate the effectiveness and flexibility of \textsc{PairReranker}, showing strong results compared with previous baselines. In addition, \textsc{PairReranker} can generalize to significantly improve GPT-3 (text-davinci-003) results (e.g., 24.55\% on CommonGen and 11.35\% on WMT18 zh-en), even though our rerankers were not trained with any GPT-3 candidates.
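The pairwise setup can be pictured as a single encoder reading the concatenation (source, candidate A, candidate B) with a binary head predicting which candidate is better. The toy below illustrates that shape with a bag-of-embeddings stand-in encoder; the real model presumably uses a pretrained Transformer, and all names here are ours.

```python
import torch
import torch.nn as nn

class PairwiseReranker(nn.Module):
    """Illustrative pairwise re-ranker: one encoder jointly reads the source
    and two candidates; a linear head outputs a logit (>0 means 'A wins')."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # toy encoder stand-in
        self.head = nn.Linear(dim, 1)

    def forward(self, src_ids, cand_a_ids, cand_b_ids):
        joint = torch.cat([src_ids, cand_a_ids, cand_b_ids], dim=-1)
        return self.head(self.emb(joint)).squeeze(-1)

model = PairwiseReranker()
src, a, b = (torch.randint(0, 1000, (1, n)) for n in (50, 12, 14))
logit = model(src, a, b)
# Pairwise loss: label 1.0 if candidate A has the higher metric score.
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.tensor([1.0]))
```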
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings.
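The two-optimizer schedule is easy to picture: a smaller peak learning rate with a longer warmup for the pretrained encoder, and a larger one for the freshly initialized decoder, so the two learn at compatible rates. A minimal sketch follows; the modules are stand-ins and the hyperparameter values are assumptions for illustration.

```python
import torch

# Stand-ins for the pretrained BERT encoder and the randomly initialized decoder.
encoder = torch.nn.Linear(768, 768)
decoder = torch.nn.Linear(768, 768)

def noam_lr(step: int, peak_lr: float, warmup: int) -> float:
    """Warmup-then-inverse-sqrt-decay schedule, applied per optimizer."""
    step = max(step, 1)
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)

opt_enc = torch.optim.Adam(encoder.parameters())  # pretrained: gentler schedule
opt_dec = torch.optim.Adam(decoder.parameters())  # from scratch: faster schedule

for step in range(1, 101):
    # Assumed example values: (peak_lr, warmup) differ per module.
    for opt, peak, warmup in ((opt_enc, 2e-3, 20000), (opt_dec, 1e-1, 10000)):
        for group in opt.param_groups:
            group["lr"] = noam_lr(step, peak, warmup)
    # ... forward/backward here, then opt_enc.step(); opt_dec.step() ...
```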
Due to exposure bias, most existing natural language generation (NLG) models trained with a maximum-likelihood objective produce poor text at the inference stage. In this paper, to address this problem, we revisit the generate-then-rank framework and propose a joint generator-ranker (JGR) training algorithm for text generation tasks. In JGR, the generator model is trained by maximizing two objectives: the likelihood of the training corpus and the expected reward given by the ranker model. Meanwhile, the ranker model takes input samples from the generator model and learns to distinguish good samples from the generation pool. The generator and ranker models are optimized alternately until convergence. In the empirical study, the proposed JGR model achieves new state-of-the-art performance on five public benchmarks covering three popular generation tasks: summarization, question generation, and response generation. We will make the code, data, and models available at https://github.com/microsoft/advnlg.
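The generator-side objective combines the usual NLL with an expected-reward term under the ranker. Below is a minimal, self-contained sketch of that combination as a policy-gradient-style surrogate; the exact weighting and sampling scheme in JGR may differ.

```python
import torch

def generator_loss(nll, sample_log_probs, sample_rewards):
    """Generator objective in a JGR-style loop (illustrative):
    nll:              scalar cross-entropy on the reference.
    sample_log_probs: (k,) log-probabilities of k sampled outputs.
    sample_rewards:   (k,) ranker scores for those samples (treated as
                      constants, hence the detach).
    Minimizing this maximizes likelihood plus expected reward."""
    expected_reward = (sample_rewards.detach() * sample_log_probs).mean()
    return nll - expected_reward

loss = generator_loss(
    nll=torch.tensor(2.3),
    sample_log_probs=torch.tensor([-12.0, -9.5, -11.2], requires_grad=True),
    sample_rewards=torch.tensor([0.31, 0.52, 0.40]),
)
loss.backward()
```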
With the rise of task-specific pre-training objectives, abstractive summarization models like PEGASUS offer appealing zero-shot performance on downstream summarization tasks. However, the performance of such unsupervised models still lags significantly behind their supervised counterparts. Similarly to the supervised setup, we notice a very high variance in quality among summary candidates from these models whereas only one candidate is kept as the summary output. In this paper, we propose to re-rank summary candidates in an unsupervised manner, aiming to close the performance gap between unsupervised and supervised models. Our approach improves the pre-trained unsupervised PEGASUS by 4.37% to 7.27% relative mean ROUGE across four widely-adopted summarization benchmarks, and achieves relative gains of 7.51% (up to 23.73%) averaged over 30 transfer setups.
ROUGE is a standard automatic evaluation metric based on n-grams for sequence-to-sequence tasks, while cross-entropy loss is the essential objective of neural language models, optimizing at the unigram level. We present differentiable n-gram objectives that attempt to alleviate the discrepancy between the training criterion and the evaluation criterion. The objective maximizes the probabilistic weight of matched sub-sequences; the novelty of our work is that the objective weights matched sub-sequences equally and does not cap the number of matched sub-sequences at the ground-truth count of n-grams in the reference sequence. We jointly optimize cross-entropy loss and the proposed objective, obtaining decent ROUGE score improvements on the abstractive summarization datasets CNN/DM and XSum and outperforming alternative n-gram objectives.
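To make "probabilistic weight of matched sub-sequences" concrete, the sketch below scores every (output position, reference n-gram) pair by the product of the token probabilities, weighting all matches equally and without capping by the reference n-gram count. It is a naive O(T·L·n) illustration of the idea, not the paper's exact objective.

```python
import torch

def soft_ngram_weight(probs: torch.Tensor, ref: torch.Tensor, n: int = 2) -> torch.Tensor:
    """probs: (T, V) per-position token distributions from the decoder.
    ref:   (L,) reference token ids.
    Each (position t, reference n-gram) pair contributes the product of the
    n token probabilities at positions t..t+n-1. Maximize alongside
    cross-entropy."""
    T = probs.size(0)
    total = probs.new_zeros(())
    ref_ngrams = [ref[i:i + n] for i in range(ref.size(0) - n + 1)]
    for t in range(T - n + 1):
        for gram in ref_ngrams:
            w = probs.new_ones(())
            for k in range(n):
                w = w * probs[t + k, gram[k]]
            total = total + w
    return total

probs = torch.softmax(torch.randn(6, 100, requires_grad=True), dim=-1)
ref = torch.randint(0, 100, (5,))
(-soft_ngram_weight(probs, ref)).backward()  # differentiable end to end
```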
The rewriting method for text summarization combines extractive and abstractive approaches, using an abstractive model to improve the conciseness and readability of extractive summaries. Existing rewriting systems take each extractive sentence as the only input, which is relatively focused but can lose necessary background knowledge and discourse context. In this paper, we investigate contextualized rewriting, which consumes the entire document and considers the summary context. We formalize contextualized rewriting as a seq2seq problem with group-tag alignments, introducing group tags as a solution to model the alignments and identifying extractive sentences through content-based addressing. Results show that our approach significantly outperforms non-contextualized rewriting systems without requiring reinforcement learning, achieving strong ROUGE improvements on multiple extractors.
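One way to picture the group-tag alignment is to tag every token of a document sentence with that sentence's index when the extractor selected it, so the rewriter knows which spans it is expected to rewrite while still seeing the whole document. The helper below is purely illustrative; the actual model presumably embeds such tags alongside the tokens.

```python
def group_tags_for_document(sentences: list[list[str]], selected: set[int]):
    """Assign tag i+1 to every token of selected sentence i, and tag 0 to
    tokens of unselected sentences (an illustrative simplification)."""
    tokens, tags = [], []
    for i, sent in enumerate(sentences):
        tag = i + 1 if i in selected else 0
        tokens.extend(sent)
        tags.extend([tag] * len(sent))
    return tokens, tags

doc = [["the", "cat", "sat"], ["it", "rained", "hard"], ["dogs", "barked"]]
print(group_tags_for_document(doc, selected={0, 2}))
# (['the', 'cat', 'sat', 'it', 'rained', 'hard', 'dogs', 'barked'],
#  [1, 1, 1, 0, 0, 0, 3, 3])
```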
Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. Automatic summarization systems that can effectively condense long documents into short, concise texts encapsulating the most important information are therefore significant in aiding readers' comprehension. Recently, with the advent of neural architectures, significant research effort has been devoted to advancing automatic text summarization systems, along with a substantial amount of work on the challenges of extending these systems to the long-document domain. In this survey, we provide a comprehensive overview of the research on long-document summarization, together with a systematic evaluation of the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long-document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
Automatic medical question summarization can significantly help systems understand consumer health questions and retrieve correct answers. Seq2seq models based on maximum likelihood estimation (MLE) have been applied to this task, facing two general problems: the model cannot capture the question focus well, and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question-focus-driven contrastive learning framework (QFCL). Specifically, we propose a simple and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both the encoder and the decoder to obtain better sentence-level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, with performance gains of 5.33, 12.85, and 3.81 points over the baseline BART model on the three datasets. Further human judgment and detailed analysis show that our QFCL model learns better sentence representations, is able to distinguish different sentence meanings, and generates high-quality summaries by capturing the question focus.
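A toy version of focus-driven hard negatives: corrupt only the focus terms of a question while leaving the rest intact, so the negative stays lexically close but semantically wrong. Everything here (the term lists and replacements) is a hypothetical simplification of the paper's strategy.

```python
def hard_negative(question: str, focus_terms: list[str], replacements: dict) -> str:
    """Build a hard negative by swapping focus terms for plausible
    distractors, keeping the surrounding wording unchanged."""
    out = question
    for term in focus_terms:
        out = out.replace(term, replacements.get(term, term))
    return out

print(hard_negative(
    "What are the side effects of ibuprofen in children?",
    focus_terms=["side effects", "ibuprofen"],
    replacements={"side effects": "dosages", "ibuprofen": "acetaminophen"},
))
# -> "What are the dosages of acetaminophen in children?"
```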
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to infer oracle extracts for model training. In this work, we identify two flaws in the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries, and prove that this is equivalent to estimating the oracle expectation for each document sentence. Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings.
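One concrete reading of expectation-based labels: given several plausible oracle extracts per document, a sentence's soft label is the fraction of oracles that include it, and training treats these as soft targets instead of a single deterministic 0/1 oracle. The sketch below is our simplification of that idea, not the paper's exact estimator.

```python
import torch

def soft_sentence_labels(num_sentences: int, oracle_extracts: list[set]) -> torch.Tensor:
    """Soft label for sentence i = fraction of oracle extracts containing i."""
    labels = torch.zeros(num_sentences)
    for oracle in oracle_extracts:
        for i in oracle:
            labels[i] += 1.0
    return labels / len(oracle_extracts)

# Three oracle extracts over a 6-sentence document:
labels = soft_sentence_labels(6, [{0, 2}, {0, 3}, {2, 3}])
# -> tensor([0.6667, 0.0000, 0.6667, 0.6667, 0.0000, 0.0000])
logits = torch.randn(6, requires_grad=True)  # per-sentence extractor scores
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
```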
Generating factually consistent summaries is a challenging task for abstractive summarization. Previous works mainly encode factual information or perform post-correction/ranking after decoding. In this paper, we provide a factual-consistency solution from the perspective of contrastive learning, which is a natural extension of previous works. We propose CO2Sum (Contrastive for Consistency), a contrastive learning scheme that can easily be applied to sequence-to-sequence models for factually consistent abstractive summarization, showing that the model can be made fact-aware without modifying its architecture. CO2Sum applies contrastive learning on the encoder, which helps the model attend to the factual information contained in the input article, or on the decoder, which drives the model to generate factually correct output summaries. More importantly, these two schemes are orthogonal and can be combined to further improve faithfulness. Comprehensive experiments on public benchmarks demonstrate that CO2Sum improves the faithfulness of large pre-trained language models and achieves results competitive with other strong factual-consistency summarization baselines.
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve the factuality of multimodal summary generation with respect to automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ.
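Since the metric is described as a simple weighted combination, it can be written in one line. The equal 0.5 weight below is our illustrative assumption, not the paper's reported setting.

```python
def clipbertscore(clip_score: float, bert_score: float, alpha: float = 0.5) -> float:
    """Weighted combination of an image-summary score (CLIPScore) and a
    document-summary score (BERTScore), as described in the abstract above."""
    return alpha * clip_score + (1.0 - alpha) * bert_score

print(clipbertscore(clip_score=0.72, bert_score=0.88))  # -> 0.80
```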
The query-focused text summarization (QFTS) task aims at building systems that generate summaries of text documents based on a given query. A key challenge in addressing this task is the lack of large labeled datasets for training summarization models. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models on a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques to pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task, while setting new state-of-the-art results on a set of automatic and human evaluation metrics.
Text summarization is a user-preference based task, i.e., for one document, users often have different priorities for summary. As a key aspect of customization in summarization, granularity is used to measure the semantic coverage between the summary and source document. However, developing systems that can generate summaries with customizable semantic coverage is still an under-explored topic. In this paper, we propose the first unsupervised multi-granularity summarization framework, GranuSum. We take events as the basic semantic units of the source documents and propose to rank these events by their salience. We also develop a model to summarize input documents with given events as anchors and hints. By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner. Meanwhile, we annotate a new benchmark GranuDUC that contains multiple summaries at different granularities for each document cluster. Experimental results confirm the substantial superiority of GranuSum on multi-granularity summarization over strong baselines. Further, by exploiting the event information, GranuSum also exhibits state-of-the-art performance under the conventional unsupervised abstractive setting. Dataset for this paper can be found at: https://github.com/maszhongming/GranuDUC
As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many natural language processing (NLP) tasks, such as information retrieval, knowledge base construction, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles as a text summarization problem. The dataset consists of over 80k English samples covering 6987 topics. We set up a two-phase summarization method -- description generation (Phase I) and candidate ranking (Phase II) -- as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By applying contrastive learning with diverse inputs from beam search, the metric-based ranking models outperform the direct description generation models by up to 22 ROUGE on both the topic-exclusive and topic-independent splits. Furthermore, the descriptions from Phase II are supported by human evaluation, being preferred in over 45.33% of cases versus 23.66% for Phase I when judged against the gold descriptions. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from the paragraphs, while the gold descriptions do this better. The automatic generation of new descriptions reduces the human effort of creating them and enriches Wikidata-based knowledge graphs. Our paper has a practical impact on Wikipedia and Wikidata, since thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/wikides.
In this paper, we propose to leverage a unique characteristic of dialogues -- the commonsense knowledge shared across participants -- to resolve the difficulties of summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that relies solely on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, adding the task of generating commonsense inferences on top of summarizing the dialogue in a multi-task learning setting. Experimental results show that, with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
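The similarity-based selection step can be sketched independently of the knowledge model: embed the dialogue and each generated inference, and keep the inference closest in cosine space. The embedding function and data below are stand-ins, not SICK's implementation.

```python
import numpy as np

def select_inference(dialogue_vec: np.ndarray, inference_vecs: np.ndarray,
                     inferences: list[str]) -> str:
    """Return the commonsense inference whose (L2-normalized) embedding has
    the highest cosine similarity with the dialogue embedding."""
    best = int(np.argmax(inference_vecs @ dialogue_vec))
    return inferences[best]

# Toy usage with random unit vectors standing in for real embeddings:
rng = np.random.default_rng(0)
d = rng.normal(size=8); d /= np.linalg.norm(d)
I = rng.normal(size=(3, 8)); I /= np.linalg.norm(I, axis=1, keepdims=True)
print(select_inference(d, I, ["A wants advice", "A is hungry", "B feels tired"]))
```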