State-of-the-art abstractive summarization systems often generate hallucinations, i.e., content that is not directly inferable from the source text. Despite being assumed incorrect, we find that much hallucinated content is factual, namely consistent with world knowledge. Such factual hallucinations can be beneficial in a summary by providing useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method uses an entity's prior and posterior probabilities according to a pre-trained and a fine-tuned masked language model, respectively. Empirical results show that our approach substantially outperforms two baselines in both accuracy and F1 score on the factuality classification task and correlates strongly with human judgments. Furthermore, we show that our detector, when used as a reward signal in an offline reinforcement learning (RL) algorithm, significantly improves the factuality of summaries while maintaining the level of abstractiveness.
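To make the prior/posterior signal concrete, here is a minimal sketch in which a single pretrained masked LM stands in for both the pretrained (prior) and fine-tuned (posterior) models, and the entity is assumed to be a single token; the model name, template, and decision rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: prior vs. posterior probability of an entity under a masked LM.
# The single-token-entity simplification and the decision rule are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_prob(context: str, template: str, entity: str) -> float:
    """Probability of `entity` at the [MASK] slot of `template`,
    optionally conditioned on `context` (empty string = unconditional prior)."""
    text = (context + " " if context else "") + template.replace("{ENT}", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    entity_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(entity))[0]  # single-token simplification
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits[0, mask_idx], dim=-1)[entity_id].item()

source = "The company reported record profits in its quarterly filing."
template = "{ENT} reported record quarterly profits."
prior = masked_prob("", template, "Apple")          # plausibility under world knowledge only
posterior = masked_prob(source, template, "Apple")  # plausibility given the source

# Heuristic reading (the threshold is illustrative): an entity whose posterior does not
# exceed its prior is a candidate non-factual hallucination.
print(prior, posterior, "non-factual candidate" if posterior <= prior else "supported/factual")
```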
Current state-of-the-art summarization models are trained with either maximum likelihood estimation (MLE) or reinforcement learning (RL). In this study, we investigate the third training paradigm and argue that inverse reinforcement learning (IRL) may be more suitable for text summarization. IRL focuses on estimating the reward function of an agent, given a set of observations of that agent's behavior. Generally, IRL provides advantages in situations where the reward function is not explicitly known or where it is difficult to define or interact with the environment directly. These situations are exactly what we observe in summarization. Thus, we introduce inverse reinforcement learning into text summarization and define a suite of sub-rewards that are important for summarization optimization. By simultaneously estimating the reward function and optimizing the summarization agent with expert demonstrations, we show that the model trained with IRL produces summaries that closely follow human behavior, in terms of better ROUGE, coverage, novelty, compression ratio and factuality when compared to the baselines trained with MLE and RL.
Abstractive summarization systems that leverage pre-trained language models have achieved superior results on benchmark datasets. However, such models have been shown to be prone to hallucinating facts that are unfaithful to the input context. In this paper, we propose a method to remedy entity-level extrinsic hallucinations with Entity Coverage Control (ECC). We first compute entity coverage precision and prepend a corresponding control code to each training example, which implicitly guides the model to recognize faithful content during training. We further extend our method via intermediate fine-tuning on large but noisy data extracted from Wikipedia to unlock zero-shot summarization. Experimental results on three benchmark datasets, XSum, PubMed, and SAMSum, show that the proposed method yields more faithful and salient abstractive summaries in both supervised fine-tuning and zero-shot settings.
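A rough sketch of how an entity-coverage control code could be computed and prepended to a training example follows; the spaCy pipeline, bucket boundaries, and token names are illustrative assumptions rather than the ECC implementation.

```python
# Illustrative sketch: entity coverage precision and a prepended control code.
# Bucket boundaries and control-code strings are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_coverage_precision(source: str, summary: str) -> float:
    """Fraction of named entities in the summary that also occur in the source."""
    src_ents = {e.text.lower() for e in nlp(source).ents}
    sum_ents = [e.text.lower() for e in nlp(summary).ents]
    if not sum_ents:
        return 1.0
    return sum(1 for e in sum_ents if e in src_ents) / len(sum_ents)

def control_code(precision: float) -> str:
    """Bucket the precision into a coarse control token."""
    if precision >= 0.9:
        return "<cov_high>"
    if precision >= 0.5:
        return "<cov_mid>"
    return "<cov_low>"

def build_training_input(source: str, summary: str) -> str:
    code = control_code(entity_coverage_precision(source, summary))
    return f"{code} {source}"   # the summary stays the target sequence

# At inference time, prepending <cov_high> asks the model for maximally faithful entities.
```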
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Despite the recent success of abstractive summarization on automatic evaluation metrics, generated summaries still present factual inconsistencies with the source document. In this paper, we focus on entity-level factual inconsistency, i.e., reducing mismatched entities between the generated summary and the source document. To this end, we propose a novel entity-based SpanCopy mechanism and explore its extension with a global relevance component. Experimental results on four summarization datasets show that SpanCopy effectively improves entity-level factual consistency, with essentially no change in word-level and entity-level saliency. The code is available at https://github.com/wendy-xiao/entity-based
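As a simplified view of an entity-restricted copy mechanism (not the paper's SpanCopy architecture), one can mix the decoder's generation distribution with a copy distribution whose attention mass is confined to entity spans in the source:

```python
# Illustrative sketch: mix a generation distribution with a copy distribution
# restricted to entity spans in the source. Shapes and gating are illustrative assumptions.
import torch

def entity_span_copy_mixture(gen_probs, copy_attn, src_token_ids, entity_mask, p_copy):
    """
    gen_probs:     (V,)  generation distribution over the vocabulary.
    copy_attn:     (S,)  attention over source positions.
    src_token_ids: (S,)  long tensor of vocabulary ids of source tokens.
    entity_mask:   (S,)  1.0 where the source token lies inside a named-entity span, else 0.0.
    p_copy:        scalar copy gate in [0, 1].
    """
    masked_attn = copy_attn * entity_mask
    masked_attn = masked_attn / masked_attn.sum().clamp(min=1e-8)   # renormalize over entity spans
    copy_probs = torch.zeros_like(gen_probs)
    copy_probs.scatter_add_(0, src_token_ids, masked_attn)          # project attention into vocab space
    return (1 - p_copy) * gen_probs + p_copy * copy_probs
```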
Automatically evaluating the coherence of summaries is of great importance, both for enabling cost-efficient summary evaluation and for improving coherence by selecting high-scoring candidate summaries. Although many different approaches have been proposed for modelling summary coherence, they are typically evaluated with disparate datasets and metrics. This makes it difficult to understand their relative performance and to identify directions towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. In addition, we introduce two novel analysis measures, intra-system correlation and bias matrices, which help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
State-of-the-art summarization models still struggle to be factually consistent with the input text. A model-agnostic way to address this problem is post-editing the generated summaries. However, existing approaches typically fail to remove entity errors if a suitable input entity replacement is not available or may insert erroneous content. In our work, we focus on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. We propose to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed. We show that this model improves factual consistency while maintaining ROUGE, improving entity precision by up to 30% on XSum, and that this model can be applied on top of another post-editor, improving entity precision by up to a total of 38%. We perform an extensive comparison of post-editing approaches that demonstrate trade-offs between factual consistency, informativeness, and grammaticality, and we analyze settings where post-editors show the largest improvements.
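The data-preparation step described above can be pictured as marking summary entities that never appear in the source with special tokens before handing the summary to a compression-style post-editor; the marker strings and the spaCy-based matching in this sketch are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch: mark extrinsic entities (absent from the source) with special tokens
# so a post-editing model can learn to delete them. Marker strings are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")
ERR_OPEN, ERR_CLOSE = "<err>", "</err>"

def mark_extrinsic_entities(source: str, summary: str) -> str:
    src_text = source.lower()
    out, last = [], 0
    for ent in nlp(summary).ents:
        if ent.text.lower() not in src_text:           # extrinsic: entity not found in the source
            out.append(summary[last:ent.start_char])
            out.append(f"{ERR_OPEN} {ent.text} {ERR_CLOSE}")
            last = ent.end_char
    out.append(summary[last:])
    return "".join(out)

source = "The mayor announced a new transit plan on Monday."
summary = "Mayor Jane Smith announced a new transit plan in Boston on Monday."
print(mark_extrinsic_entities(source, summary))
# Depending on the NER output, something like:
# "Mayor <err> Jane Smith </err> announced a new transit plan in <err> Boston </err> on Monday."
```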
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection and find that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level) and inconsistency detection (document-level). We provide a highly effective and lightweight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. On our newly introduced benchmark called SummaC (Summary Consistency), which consists of six large inconsistency detection datasets, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% point improvement over existing work. We make the models and datasets available at https://github.com/tingofurro/summac
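The granularity fix can be sketched as follows: split both texts into sentences, score every document-sentence/summary-sentence pair with an NLI model, and aggregate. The off-the-shelf MNLI checkpoint, naive sentence splitter, and max-then-mean pooling below are simplifications; SummaCConv learns the aggregation with a convolution rather than hard-coding it.

```python
# Illustrative sketch of sentence-level NLI aggregation for inconsistency detection.
# Checkpoint, sentence splitting, and max-then-mean pooling are simplifications.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()
ENTAILMENT = nli.config.label2id.get("ENTAILMENT", 2)

def sent_split(text: str):
    return [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]

def consistency_score(document: str, summary: str) -> float:
    doc_sents, sum_sents = sent_split(document), sent_split(summary)
    per_summary_sentence = []
    for hyp in sum_sents:
        probs = []
        for prem in doc_sents:
            inputs = tok(prem, hyp, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = nli(**inputs).logits
            probs.append(torch.softmax(logits[0], dim=-1)[ENTAILMENT].item())
        per_summary_sentence.append(max(probs))   # best-supporting document sentence
    return sum(per_summary_sentence) / len(per_summary_sentence)
```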
ROUGE is a standard automatic evaluation metric based on n-grams for sequence-to-sequence tasks, while cross-entropy loss is an essential objective of neural language models that optimizes at the unigram level. We present differentiable n-gram objectives that attempt to alleviate the discrepancy between the training criterion and the evaluation criterion. The objective maximizes the probabilistic weight of matched sub-sequences; the novelty of our work is that the objective weights matched sub-sequences equally and does not cap the number of matched sub-sequences at the ground-truth count of n-grams in the reference sequence. We jointly optimize cross-entropy loss and the proposed objective, obtaining decent ROUGE score improvements on the abstractive summarization datasets CNN/DM and XSum and outperforming alternative n-gram objectives.
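As a toy illustration of a differentiable n-gram objective, the snippet below computes the expected mass assigned to reference bigrams from per-position output distributions, without capping matches by reference counts; the shapes, the bigram order, and the way the term is combined with cross-entropy are illustrative assumptions, not the paper's exact objective.

```python
# Toy sketch: a differentiable bigram-matching term computed from per-step token distributions.
# Shapes and the specific formulation are illustrative assumptions.
import torch

def soft_bigram_match(probs: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """
    probs:     (T, V) per-position output distributions (softmaxed logits).
    reference: (L,)   reference token ids.
    Returns the expected number of positions whose (t, t+1) pair matches any
    reference bigram; matches are not capped by reference counts.
    """
    ref_bigrams = {(reference[i].item(), reference[i + 1].item()) for i in range(len(reference) - 1)}
    expected = probs.new_zeros(())
    for t in range(probs.size(0) - 1):
        for a, b in ref_bigrams:
            expected = expected + probs[t, a] * probs[t + 1, b]
    return expected

# Training-time use (sketch): combine with cross-entropy, e.g.
# loss = ce_loss - lambda_ngram * soft_bigram_match(probs, reference)
```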
Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short, concise texts encapsulating the most important information is therefore significant in aiding readers' comprehension. Recently, with the advent of neural architectures, substantial research effort has been devoted to advancing automatic text summarization systems, along with a considerable body of work on the challenges of extending these systems to the long-document domain. In this survey, we provide a comprehensive overview of research on long-document summarization and a systematic evaluation of the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long-document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating a summary text given its document enables, for example, summarization system development and the detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summarization systems, ranking summaries of a particular document, and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotations for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the Podcast Summary Assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC 2020. Compared with existing summary assessment data, this dataset has two unique aspects: (i) long, speech-based inputs (podcast documents); and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus. First, we examine existing assessment methods, including model-free and model-based approaches, and provide benchmark results for this long-input summary assessment dataset. Second, in order to filter reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights into summary assessment and generation tasks. The Podcast Summary Assessment data is available.
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
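Of the two mitigation strategies, data filtering is easy to sketch: score every (document, reference) training pair with a multilingual NLI model and drop pairs whose entailment probability falls below a threshold. The checkpoint name and threshold below are illustrative assumptions.

```python
# Illustrative sketch: filter noisy training pairs using a multilingual NLI entailment score.
# Checkpoint name and threshold are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "joeddav/xlm-roberta-large-xnli"   # assumed multilingual NLI checkpoint
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()
label2id = {k.lower(): v for k, v in nli.config.label2id.items()}
ENTAILMENT = label2id.get("entailment", 2)

def entailment_prob(document: str, summary: str) -> float:
    inputs = tok(document, summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli(**inputs).logits
    return torch.softmax(logits[0], dim=-1)[ENTAILMENT].item()

def filter_pairs(pairs, threshold=0.5):
    """Keep only (document, summary) pairs whose reference the NLI model judges as entailed."""
    return [(doc, summ) for doc, summ in pairs if entailment_prob(doc, summ) >= threshold]
```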
Abstractive summarization is the process of generating a summary given a document as input. Although significant progress has been made, the factual inconsistency between the document and the generated summary still limits its practical applications. Previous work found that the probabilities assigned by the generation model reflect its preferences for the generated summary, including the preference for factual consistency, and the preference for the language or knowledge prior as well. To separate the preference for factual consistency, we propose an unsupervised framework named CoP by controlling the preference of the generation model with the help of prompt. More specifically, the framework performs an extra inference step in which a text prompt is introduced as an additional input. In this way, another preference is described by the generation probability of this extra inference process. The difference between the above two preferences, i.e. the difference between the probabilities, could be used as measurements for detecting factual inconsistencies. Interestingly, we found that with the properly designed prompt, our framework could evaluate specific preferences and serve as measurements for fine-grained categories of inconsistency, such as entity-related inconsistency, coreference-related inconsistency, etc. Moreover, our framework could also be extended to the supervised setting to learn better prompt from the labeled data as well. Experiments show that our framework achieves new SOTA results on three factual inconsistency detection tasks.
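The extra inference step can be pictured as scoring the summary's token log-probabilities under the generation model twice, once with the plain document and once with the document plus a text prompt, and reading the difference as an inconsistency signal; the checkpoint, prompt wording, and interpretation below are illustrative assumptions, not the CoP implementation.

```python
# Illustrative sketch: compare per-token summary probabilities with and without a prompt.
# Checkpoint, prompt wording, and interpretation are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "facebook/bart-large-cnn"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

def token_logprobs(source: str, summary: str) -> torch.Tensor:
    """Per-token log-probabilities of `summary` given `source` under the generation model."""
    enc = tok(source, return_tensors="pt", truncation=True)
    dec = tok(text_target=summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc, labels=dec["input_ids"]).logits
    logprobs = torch.log_softmax(logits[0], dim=-1)
    return logprobs.gather(1, dec["input_ids"][0].unsqueeze(1)).squeeze(1)

def inconsistency_signal(document: str, summary: str, prompt: str = " The summary is:") -> torch.Tensor:
    base = token_logprobs(document, summary)
    prompted = token_logprobs(document + prompt, summary)
    return prompted - base   # large per-token gaps hint at content the document does not support
```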
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
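A simple way to train a metric-aware re-ranker of the kind described is a pairwise margin loss that pushes a scorer to rank metric-preferred candidates higher; the encoder choice and loss formulation below are illustrative assumptions rather than the proposed energy-based model.

```python
# Illustrative sketch: train a scorer to rank candidate summaries by a target metric.
# Encoder choice and the pairwise margin loss are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class SummaryScorer(nn.Module):
    def __init__(self, name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)   # score from the first-token representation

def pairwise_margin_loss(scores: torch.Tensor, metric_values: torch.Tensor, margin: float = 0.1):
    """Candidates with higher metric values should receive higher scores."""
    loss, n = scores.new_zeros(()), 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if metric_values[i] > metric_values[j]:
                loss = loss + torch.clamp(margin - (scores[i] - scores[j]), min=0)
                n += 1
    return loss / max(n, 1)

# At inference time, the highest-scoring candidate among beam/sampled outputs is selected.
```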
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
The quest for health information has swamped the web with consumers' health-related questions. Consumers often use overly descriptive and peripheral information to express their medical condition or other healthcare needs, which adds to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and extract the key information of the original question. To address this issue, we introduce a new dataset, CHQ-Summ, which contains 1507 domain-expert-annotated consumer health questions and corresponding summaries. The dataset is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark multiple state-of-the-art summarization models on the dataset to demonstrate its usefulness.
Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and use in real-world applications. Recent work has shown promising improvements in factuality error identification using text or dependency arc entailments; however, these approaches do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MRs), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content of both document and summary in a canonical form and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, alongside text representations from an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance at identifying content verifiability errors and better captures sub-sentence-level factual inconsistencies.
Factual consistency is an essential quality of text summarization models in practical settings. Existing work on evaluating this dimension can be broadly divided into two lines of research, entailment-based metrics and question answering (QA)-based metrics, and the differing experimental setups of recent papers have led to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment-based and QA-based metrics and show that carefully choosing the components of a QA-based metric is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that yields an average improvement over previous QA-based metrics on the SummaC factual consistency benchmark. Our solution also improves upon the best-performing entailment-based metric and achieves state-of-the-art performance on this benchmark. Furthermore, we find that QA-based and entailment-based metrics offer complementary signals and combine the two into a single learned metric for a further performance boost. Through qualitative and quantitative analyses, we identify question generation and answerability classification as two key components for future work on QA-based metrics.
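The QA-based metric family compared here follows a pipeline: pick answer candidates from the summary, generate questions about them, answer those questions from the source, and compare the answers with an answerability check. The compressed sketch below uses assumed off-the-shelf checkpoints, a highlight-style question-generation input, and a token-F1 comparison; it is a generic illustration, not QAFactEval itself.

```python
# Illustrative sketch of a QA-based factual consistency pipeline (not QAFactEval itself):
# answer candidates from the summary -> questions -> answers from the source -> overlap.
# Checkpoints, the "generate question:" prefix, and token-F1 are assumptions for illustration.
from transformers import pipeline

question_generator = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")  # assumed QG checkpoint
question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(a: str, b: str) -> float:
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    if not a_tokens or not b_tokens:
        return 0.0
    common = sum(min(a_tokens.count(t), b_tokens.count(t)) for t in set(a_tokens))
    if not common:
        return 0.0
    p, r = common / len(a_tokens), common / len(b_tokens)
    return 2 * p * r / (p + r)

def qa_consistency(source: str, summary: str, answer_candidates) -> float:
    scores = []
    for answer in answer_candidates:                       # e.g. noun phrases / entities from the summary
        highlighted = "generate question: " + summary.replace(answer, f"<hl> {answer} <hl>", 1)
        question = question_generator(highlighted)[0]["generated_text"]
        result = question_answerer(question=question, context=source)
        if result["score"] < 0.1:                          # crude answerability check
            scores.append(0.0)
        else:
            scores.append(token_f1(result["answer"], answer))
    return sum(scores) / len(scores) if scores else 1.0
```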
Rewriting methods for text summarization combine extractive and abstractive approaches, using an abstractive model to improve the conciseness and readability of extractive summaries. Existing rewriting systems take each extractive sentence as the sole input, which keeps them focused but can lose necessary background knowledge and discourse context. In this paper, we investigate contextualized rewriting, which consumes the entire document and considers the summary context. We formalize contextualized rewriting as seq2seq with group-tag alignments, introducing group tags as a solution to model the alignments and identifying extractive sentences through content-based addressing. Results show that our approach significantly outperforms non-contextualized rewriting systems without requiring reinforcement learning, achieving strong ROUGE improvements over multiple extractors.
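A rough sketch of the group-tag idea: mark each selected extractive sentence in the full document with a group tag and prefix the corresponding target sentence with the same tag, so the seq2seq rewriter can align summary sentences to document sentences while still reading the whole document. The tag format and naive sentence splitting are illustrative assumptions.

```python
# Illustrative sketch: build group-tagged rewriter inputs/targets from a document and
# its extractive sentence indices. Tag format and sentence splitting are assumptions.
def sent_split(text: str):
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def build_rewriter_example(document: str, extract_ids, draft_sentences):
    """
    extract_ids:     indices of the extractive sentences within the document.
    draft_sentences: the extractive sentences to be rewritten (the target keeps the same tags).
    """
    doc_sents = sent_split(document)
    group_of = {sid: f"<S{k + 1}>" for k, sid in enumerate(extract_ids)}
    tagged_doc = []
    for i, sent in enumerate(doc_sents):
        prefix = group_of.get(i, "")
        tagged_doc.append(f"{prefix} {sent}".strip())
    source = " ".join(tagged_doc)                       # the whole document, with tagged sentences
    target = " ".join(f"<S{k + 1}> {s}" for k, s in enumerate(draft_sentences))
    return source, target
```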
In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model. Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs and automatically determining whether those questions are answered in the gold summary. This QA-based signal is incorporated into a two-stage summarization model that first marks salient NPs in the input document using a classification model and then conditionally generates a summary. Our experiments show that models trained with QA-based supervision generate higher-quality summaries than baseline methods for identifying salient spans on benchmark summarization datasets. Furthermore, we show that the content of the generated summaries can be controlled based on which NPs are marked in the input document. Finally, we propose a method of augmenting the training data so that the gold summaries are more consistent with the marked input spans used during training, and demonstrate how this results in models that learn to better exclude unmarked document content.
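A condensed sketch of the supervision-building step described here: for each noun phrase in the document, generate a wh-question, check whether the gold summary answers it, and tag the NPs whose questions are answered as salient in the model input. The checkpoints, answerability threshold, and tag format are illustrative assumptions, not the paper's pipeline.

```python
# Illustrative sketch: derive QA-based saliency labels for noun phrases and tag them in the input.
# Checkpoints, threshold, and tag strings are assumptions for illustration.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
question_generator = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")  # assumed QG checkpoint
question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

def salient_noun_phrases(document: str, gold_summary: str, threshold: float = 0.3):
    """Noun phrases whose generated wh-question is answerable from the gold summary."""
    salient = []
    for np in nlp(document).noun_chunks:
        highlighted = "generate question: " + document.replace(np.text, f"<hl> {np.text} <hl>", 1)
        question = question_generator(highlighted)[0]["generated_text"]
        result = question_answerer(question=question, context=gold_summary)
        if result["score"] >= threshold:
            salient.append(np.text)
    return salient

def tag_input(document: str, salient):
    """Mark salient NPs so a two-stage summarizer can condition on them."""
    for np in salient:
        document = document.replace(np, f"[SAL] {np} [/SAL]", 1)
    return document
```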