Summary quality assessment metrics fall into two categories: reference-based and reference-free. Reference-based metrics are theoretically more accurate but are limited by the availability and quality of human-written references, both of which are difficult to ensure. This has inspired the development, in the past few years, of reference-free metrics, which are independent of human-written references. However, existing reference-free metrics cannot be both zero-shot and accurate. In this paper, we propose a zero-shot yet accurate reference-free approach in a sneaky way: feeding the documents from which summaries are generated into reference-based metrics as the references. Experimental results show that this zero-shot approach yields the best-performing reference-free metrics on nearly all aspects of several recently released datasets, sometimes even beating reference-free metrics specifically trained for this task. We further investigate which reference-based metrics benefit from such repurposing and whether our additional tweaks help.
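As a rough illustration of the repurposing idea (a minimal sketch, not the paper's exact setup; it assumes the `rouge-score` package and uses ROUGE as a stand-in reference-based metric), a reference-based scorer is simply handed the source document in place of a human-written reference:

```python
from rouge_score import rouge_scorer

# Any reference-based metric could be plugged in here; ROUGE is only a stand-in.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

def doc_as_ref_score(document: str, summary: str) -> dict:
    """Reference-free scoring by repurposing a reference-based metric:
    the source document plays the role of the missing human reference."""
    return scorer.score(target=document, prediction=summary)

document = "The city council approved the new budget on Tuesday after a long debate."
summary = "The council approved the budget on Tuesday."
print(doc_as_ref_score(document, summary)["rouge1"].fmeasure)
```

Because the document is far longer than a typical gold summary, precision- and recall-oriented components of a metric react very differently to the swap, which is presumably why the paper asks which reference-based metrics benefit from the repurposing.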
Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences as the basic units of matching instead of tokens, and use a sentence matching function to match candidate and reference sentences. Candidate sentences are also compared with sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset; a variant of the metric that uses no neural model is also useful during model development phases, where resources can be limited and fast evaluation is required. Finally, extensive analyses show that our proposed metrics work well with longer summaries and are less biased towards specific models.
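To make the sentence-as-unit idea concrete, here is a minimal sketch (not the SMART formulation itself) in which a soft precision/recall is computed over sentences, with plain token-overlap F1 standing in for whatever sentence matching function is plugged in:

```python
def sent_sim(a: str, b: str) -> float:
    """Stand-in sentence similarity: token-overlap F1 (a model-based matcher could be swapped in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if not ta or not tb or overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def sentence_level_f1(candidate_sents, reference_sents):
    """Soft precision/recall over sentences: each sentence is matched to its best counterpart."""
    precision = sum(max(sent_sim(c, r) for r in reference_sents) for c in candidate_sents) / len(candidate_sents)
    recall = sum(max(sent_sim(c, r) for c in candidate_sents) for r in reference_sents) / len(reference_sents)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

cand = ["The council approved the budget.", "The vote passed on Tuesday."]
ref = ["The city council approved the new budget on Tuesday."]
print(sentence_level_f1(cand, ref))
```

Grounding can be checked the same way by matching candidate sentences against source-document sentences instead of reference sentences.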
Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating the summary text given the document enables, for example, summary generation system development and the detection of inappropriate summaries. Summary assessment can be conducted in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing annotated datasets for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the Podcast Summary Assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC 2020. Compared with existing summary assessment data, this dataset has two unique aspects: (i) long inputs, i.e., documents based on spoken podcasts; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus. First, we examine existing assessment methods, including model-free and model-based ones, and provide benchmark results for this long-input summary assessment dataset. Second, in order to filter document-reference pairs for training, we adopt summary assessment for data selection. The experimental results on these two aspects provide interesting insights into the summary assessment and generation tasks. The podcast summary assessment data is available.
Text summarization models are often trained to produce summaries that meet human quality requirements. However, existing evaluation metrics for summary text are only rough proxies for summary quality: they correlate weakly with human ratings and suppress summary diversity. To address these issues, we propose SummScore, a comprehensive cross-encoder-based metric for summary quality evaluation. First, by adopting a source-summary measurement mode and comparing against the semantics of the original text, SummScore gets rid of the suppression of summary diversity. With a cross-encoder pre-trained for text matching, SummScore can effectively capture subtle differences between summary semantics. Second, to improve comprehensiveness and interpretability, SummScore consists of four fine-grained submodels that measure coherence, consistency, fluency, and relevance respectively. We use semi-supervised multi-round training to improve the performance of the model on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in correlation with human ratings along the above four dimensions. We also provide SummScore quality evaluation results for 16 mainstream summarization models for later research.
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
Automatically evaluating the coherence of summaries is of great significance, both for enabling cost-efficient summary evaluation and for improving coherence by selecting high-scoring candidate summaries. Although many different approaches to modeling summary coherence have been proposed, they are typically evaluated with disparate datasets and metrics. This makes it difficult to understand their relative performance and to identify promising directions towards better summary coherence modeling. In this work, we conduct a large-scale investigation of various methods for summary coherence modeling on an even playing field. In addition, we introduce two novel analysis measures, intra-system correlation and bias matrices, which help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as the fine-tuning takes into account that they need to generalize across different summary lengths.
The Query-Focused Text Summarization (QFTS) task aims at building systems that generate summaries of text documents based on a given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models on a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques to pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task while setting new state-of-the-art results on a set of automatic and human evaluation metrics.
Recently proposed BERT-based evaluation metrics perform well on standard evaluation benchmarks but are vulnerable to adversarial attacks, e.g., those relating to factuality errors. We argue that this is (in part) because they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we argue is a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to attacks than recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics but perform below SOTA MT metrics. However, when we combine existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+20% to +30%) and higher-quality metrics as measured on standard benchmarks (+5% to +25%).
Aspect- or query-based summarization has recently attracted more attention, as it can generate differentiated summaries based on users' interests. However, current datasets for aspect- or query-based summarization either focus on specific domains, contain relatively small-scale instances, or include only a few aspect types. Such limitations hinder further exploration in this direction. In this work, we take advantage of crowd-sourced knowledge on Wikipedia.org and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability to support diverse aspect-based summary generation. To overcome the data scarcity problem in specific domains, we also perform zero-shot, few-shot, and fine-tuning experiments on seven downstream datasets. Specifically, the zero-/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates strong aspect- or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.
Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. Automatic summarization systems that can effectively condense long documents into short and concise texts encapsulating the most important information are therefore valuable in aiding reader comprehension. Recently, with the advent of neural architectures, significant research effort has been devoted to advancing automatic text summarization systems, along with a considerable amount of work on the challenges of extending these systems to the long document domain. In this survey, we provide a comprehensive overview of research on long document summarization, together with a systematic evaluation of the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct empirical analyses to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summary evaluation metrics. Based on our overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues are suboptimal because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one is produced by a fine-tuned summarization model, and the other is a collection of dialogue turns that convey important information. We then choose one of these pseudo summaries based on the difference in information distribution across different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
The goal of a summary is to concisely state the most important information in a document. With this principle in mind, we introduce new reference-free summary evaluation metrics that use a pretrained language model to estimate the information content shared between a document and its summary. These metrics are a modernization of the Shannon Game, a method for scoring summary quality proposed decades ago, in which we replace the human annotators with language models. We also view these metrics as an extension of BLANC, a recently proposed approach to summary quality measurement based on the performance of a language model with and without the help of the summary. Using transformer-based language models, we empirically verify that our metrics achieve state-of-the-art correlation with human judgments of the summary quality dimensions of both coherence and relevance, as well as competitive correlation with human judgments of consistency and fluency.
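The abstract does not spell out the scoring formula; a simplified, assumed form of the underlying idea (the language model takes the place of the human guesser in the Shannon Game) measures how much the summary S helps the model predict the document D:

```latex
\mathrm{score}(S, D) \;=\; \log p_{\mathrm{LM}}(D \mid S) \;-\; \log p_{\mathrm{LM}}(D)
```

A summary that shares more information with the document makes the document easier to predict, so the gap is larger; the actual metrics in the paper may normalize or decompose this quantity differently.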
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S) included in the ROUGE summarization evaluation package, together with their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
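As a concrete illustration of the n-gram counting behind ROUGE-N, here is a minimal recall-oriented sketch (the official package additionally supports stemming, stopword removal, and multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N style recall: fraction of reference n-grams also found in the candidate."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat was on the mat"
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched -> 0.6
```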
We hypothesize that existing sentence-level machine translation (MT) metrics become less effective when the human reference contains ambiguities. To verify this hypothesis, we present a very simple method for extending pretrained metrics to incorporate context at the document level. We apply our method to three popular metrics, BERTScore, Prism, and COMET, as well as to the reference-free metric COMET-QE. We evaluate the extended metrics on the WMT 2021 metrics shared task using the provided MQM annotations. Our results show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions when excluding results on low-quality human references. Furthermore, we show that our document-level extension substantially improves accuracy on a discourse phenomena task, outperforming a dedicated baseline by up to 6.1%. Our experimental results support our initial hypothesis and show that a simple extension of the metrics enables them to exploit context to resolve ambiguities in the reference.
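The abstract does not detail the exact context construction; as a minimal sketch of the general recipe, each hypothesis and reference sentence can be prefixed with its preceding document sentences before being passed to an unchanged sentence-level metric (the window size and separator here are illustrative assumptions):

```python
def add_document_context(sentences, window: int = 2, sep: str = " "):
    """Prepend up to `window` preceding sentences to each sentence so that a
    sentence-level metric sees document context when scoring."""
    extended = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]
        extended.append(sep.join(context + [sent]))
    return extended

hyps = ["He went to the bank.", "It was closed."]
refs = ["He walked to the bank.", "It was shut."]
hyps_ctx = add_document_context(hyps)
refs_ctx = add_document_context(refs)
# score(hyps_ctx[i], refs_ctx[i]) with any pretrained sentence-level metric, e.g. BERTScore
```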
The limited size of existing query-focused summarization datasets makes training data-driven summarization models challenging. Meanwhile, manually constructing a query-focused summarization corpus is costly and time-consuming. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WikiRef) of more than 280,000 examples, which can serve as a means of data augmentation. We also develop a BERT-based query-focused summarization model (Q-BERT) to extract sentences from documents as summaries. To better adapt a huge model containing millions of parameters, we identify and fine-tune only a sparse subnetwork, which corresponds to a small fraction of the whole model parameters. Experimental results on three DUC benchmarks show that the model pre-trained on WikiRef already achieves reasonable performance. After fine-tuning on the specific benchmark datasets, the model with data augmentation outperforms strong comparison systems. Moreover, both our proposed Q-BERT model and subnetwork fine-tuning further improve the model performance. The dataset is publicly available at https://aka.ms/wikiref.
Evaluating automatically-generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained on a retrieval task using a dual-encoder retrieval setup, and can then be utilized for evaluating a generated summary given an input document, without gold reference summaries. RISE is especially well suited to new datasets where one may not have reference summaries available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data efficiency and generalizability across languages.
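A minimal sketch of the dual-encoder scoring shape, with an off-the-shelf sentence-transformers encoder standing in for RISE's actually trained retrieval model:

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in dual encoder; RISE trains its own retrieval model, which is not used here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieval_style_score(document: str, summary: str) -> float:
    """Score a summary against its source document as a retrieval problem:
    embed both with a dual encoder and use their similarity, no reference needed."""
    doc_emb, sum_emb = encoder.encode([document, summary], convert_to_tensor=True)
    return util.cos_sim(doc_emb, sum_emb).item()

doc = "The city council approved the new budget on Tuesday after a long debate."
print(retrieval_style_score(doc, "The council approved the budget."))
```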
In Multi-Document Summarization (MDS), the input is a cluster of documents and the output is the cluster summary. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a simple pretraining objective of choosing the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus does not require human-written summaries and can be used for pretraining on datasets consisting solely of document clusters. Through zero-shot and fully supervised experiments on multiple MDS datasets, we show that our model Centrum is better than or comparable to state-of-the-art models. We release our pretrained and fine-tuned models at https://github.com/ratishsp/centrum.
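A minimal sketch of the centroid-selection objective (a hand-rolled ROUGE-1 F1 stands in for whichever ROUGE variant the authors use): the document that agrees most, on average, with the rest of the cluster becomes the pseudo-summary target, so no human-written summary is needed.

```python
from collections import Counter

def rouge1_f1(a: str, b: str) -> float:
    """Unigram-overlap F1 between two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * p * r / (p + r)

def select_centroid(cluster):
    """Index of the document with the highest average ROUGE agreement with the others."""
    def avg_rouge(i):
        return sum(rouge1_f1(cluster[i], d) for j, d in enumerate(cluster) if j != i) / (len(cluster) - 1)
    return max(range(len(cluster)), key=avg_rouge)

cluster = [
    "The storm hit the coast on Friday, causing floods.",
    "Floods followed the storm that hit the coast Friday.",
    "Officials opened shelters after Friday's coastal storm and flooding.",
]
centroid_idx = select_centroid(cluster)
# The centroid document becomes the pseudo-summary target; the remaining documents form the input.
print(centroid_idx, cluster[centroid_idx])
```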
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level) and inconsistency detection (document-level). We provide a highly effective and lightweight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. On our newly introduced benchmark called SummaC (Summary Consistency), consisting of six large inconsistency detection datasets, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5-point improvement compared with existing work. We make our models and datasets available at https://github.com/tingofurro/summac.
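The abstract describes segmenting the document into sentences and aggregating NLI scores over sentence pairs. The sketch below shows a simple max-then-mean aggregation over a precomputed entailment grid (SummaCConv itself learns the aggregation from the distribution of scores; the NLI scoring step is abstracted away here):

```python
import numpy as np

def aggregate_consistency(entail_probs: np.ndarray) -> float:
    """Aggregate a sentence-pair NLI grid into one document-level consistency score.

    entail_probs[i, j] is the entailment probability that document sentence i
    supports summary sentence j (computed by any sentence-level NLI model).
    Each summary sentence keeps its best-supporting document sentence, and the
    kept scores are averaged over summary sentences.
    """
    best_support_per_summary_sentence = entail_probs.max(axis=0)
    return float(best_support_per_summary_sentence.mean())

# 3 document sentences x 2 summary sentences of precomputed entailment probabilities
grid = np.array([
    [0.91, 0.05],
    [0.10, 0.72],
    [0.03, 0.08],
])
print(aggregate_consistency(grid))  # (0.91 + 0.72) / 2 = 0.815
```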
Scoring the factuality of a generated summary involves measuring the degree to which a target text contains factual information using the input document as support. Given the similarities in the problem formulation, previous work has shown that Natural Language Inference models can be effectively repurposed to perform this task. As these models are trained to score entailment at a sentence level, several recent studies have shown that decomposing either the input document or the summary into sentences helps with factuality scoring. But is fine-grained decomposition always a winning strategy? In this paper we systematically compare different granularities of decomposition -- from document to sub-sentence level, and we show that the answer is no. Our results show that incorporating additional context can yield improvement, but that this does not necessarily apply to all datasets. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance, highlighting the need for caution in model and methodology selection for downstream tasks.
Multi-document summarization (MDS) is an effective tool for information aggregation that generates an informative and concise summary from a cluster of topic-related documents. Our survey, the first of its kind, systematically overviews recent deep-learning-based MDS models. We propose a novel taxonomy summarizing the design strategies of neural networks and conduct a comprehensive overview of the state of the art. We highlight the differences between various objective functions, which are rarely discussed in the existing literature. Finally, we propose several future directions pertaining to this new and exciting field.