Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
translated by 谷歌翻译
GPT-3等模型的零和少量提示的最新成功导致了NLP研究的范式转移。在本文中,我们研究了其对文本摘要的影响,重点是新闻摘要的经典基准领域。首先,我们研究了零击GPT-3与在大型摘要数据集中训练的微调模型的比较。我们表明,不仅人类压倒性地更喜欢GPT-3摘要,而且这些摘要也不遭受普通数据集特异性问题(例如事实差的问题)。接下来,我们研究这对评估意味着什么,尤其是黄金标准测试集的作用。我们的实验表明,基于参考和无参考的自动指标,例如最近提出的基于质量检查或基于质量的事实方法无法可靠地评估零击摘要。最后,我们讨论了未来的研究挑战,除了通用摘要之外,特别是基于关键字和方面的摘要,表明了优势微调方法与零拍的提示相比如何。为了支持进一步的研究,我们发布:(a)在4个标准摘要基准中,从微调和零摄像模型中产生的10K生成的摘要,(b)1K人类偏好判断和比较不同系统的普通系统,以进行通用和关键字的不同系统。基于摘要。
translated by 谷歌翻译
诸如学术文章和商业报告之类的长期文件一直是详细说明重要问题和需要额外关注的复杂主题的标准格式。自动汇总系统可以有效地将长文档置于简短而简洁的文本中,以封装最重要的信息,从而在帮助读者的理解中很重要。最近,随着神经体系结构的出现,已经做出了重大的研究工作,以推动自动文本摘要系统,以及有关将这些系统扩展到长期文档领域的挑战的大量研究。在这项调查中,我们提供了有关长期文档摘要的研究的全面概述,以及其研究环境的三个主要组成部分的系统评估:基准数据集,汇总模型和评估指标。对于每个组成部分,我们在长期汇总的背景下组织文献,并进行经验分析,以扩大有关当前研究进度的观点。实证分析包括一项研究基准数据集的内在特征,摘要模型的多维分析以及摘要评估指标的综述。根据总体发现,我们通过提出可能在这个快速增长的领域中提出未来探索的方向来得出结论。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
当前适用于摘要的预训练模型容易出现事实矛盾,这些不一致性歪曲了源文本或介绍无关信息。因此,在我们开发改进的模型时,必须比较摘要的事实一致性。但是,事实一致性的最佳人类评估设置尚未标准化。为了解决这个问题,我们使用基于评分的李克特量表和基于排名的最佳缩放协议对事实一致性进行了评估,对来自CNN每日邮件和XSUM数据集的100篇文章以及四个最新的最新最新的XSUM数据集进行了评估。艺术模型,以确定最可靠的评估框架。我们发现,基于排名的协议提供了整个数据集的摘要质量的更可靠度量,而Likert评分的可靠性取决于目标数据集和评估设计。我们的众包模板和摘要评估将公开获得,以促进对摘要中事实一致性的未来研究。
translated by 谷歌翻译
Despite the recent progress in language generation models, their outputs may not always meet user expectations. In this work, we study whether informational feedback in natural language can be leveraged to improve generation quality and user preference alignment. To this end, we consider factual consistency in summarization, the quality that the summary should only contain information supported by the input documents, for user preference alignment. We collect a high-quality dataset, DeFacto, containing human demonstrations and informational feedback in natural language consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary. Using our dataset, we study two natural language generation tasks: 1) editing a summary using the human feedback, and 2) generating human feedback from the original summary. Using the two tasks, we further evaluate if models can automatically correct factual inconsistencies in generated summaries. We show that the human-edited summaries we collected are more factually consistent, and pre-trained language models can leverage our dataset to improve the factual consistency of original system-generated summaries in our proposed generation tasks. We make the DeFacto dataset publicly available at https://github.com/microsoft/DeFacto.
translated by 谷歌翻译
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text summarization systems. We address these issues by introducing BookSum, a collection of datasets for long-form narrative summarization. Our dataset covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset.
translated by 谷歌翻译
文本生成的广泛使用的评估指标要么与更长的文本效果不错,要么无法评估文本质量的所有方面。在本文中,我们引入了一个名为SMART的新指标,以减轻此类限制。具体而言,我们将句子视为匹配的基本单位,而不是代币,并使用句子匹配函数来匹配匹配候选和参考句子。还将候选句子与源文件中的句子进行了比较,以允许接地(例如,事实)评估。我们的结果表明,我们提出的指标与基于模型的匹配函数的系统级相关性优于萨姆瓦尔摘要元评估数据集上的所有竞争指标指标。后者不使用任何神经模型,这在模型开发阶段很有用,在这些阶段,资源可以受到限制且需要快速评估。最后,我们还进行了广泛的分析,表明我们提出的指标与较长的摘要很好地运行,并且对特定模型的偏见较小。
translated by 谷歌翻译
预训练的语言模型(PLM)在自然语言生成(NLG)任务中取得了显着的成功。到目前为止,大多数PLM都使用大型一般语料库以无监督的方式进行了预培训。同时,与无监督的模型相比,预先训练的模型越来越多地显示出较低的数据表现出色。受监督预训练的成功的激励,我们提出了自然语言生成的多任务监督预训练(MVP)。为了预先培训文本生成模型MVP,我们从七个生成任务中收集了45个数据集的标记预训练语料库。对于每个任务,我们进一步预先训练特定的软提示,以刺激执行特定任务的模型能力。广泛的实验证明了我们在许多NLG任务中有监督的预训练的有效性,并且我们的一般方法在17个数据集中的12个中实现了最先进的性能。
translated by 谷歌翻译
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve factuality of multimodal summary generation w.r.t automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
translated by 谷歌翻译
众包NLP数据集的反复挑战是,在制作示例时,人类作家通常会依靠重复的模式,从而导致缺乏语言多样性。我们介绍了一种基于工人和AI协作的数据集创建的新方法,该方法汇集了语言模型的生成力量和人类的评估力量。从现有的数据集,自然语言推理(NLI)的Multinli开始,我们的方法使用数据集制图自动识别示例来证明具有挑战性的推理模式,并指示GPT-3撰写具有相似模式的新示例。然后,机器生成的示例会自动过滤,并最终由人类人群工人修订和标记。最终的数据集Wanli由107,885个NLI示例组成,并在现有的NLI数据集上呈现出独特的经验优势。值得注意的是,培训有关Wanli的模型,而不是Multinli($ 4 $ $倍)可改善我们考虑的七个外域测试集的性能,包括汉斯(Hans)的11%和对抗性NLI的9%。此外,将Multinli与Wanli结合起来比将其与其他NLI增强集相结合更有效。我们的结果表明,自然语言生成技术的潜力是策划增强质量和多样性的NLP数据集。
translated by 谷歌翻译
社区问题应答(CQA)FORA,如堆栈溢出和雅虎!答案包含丰富的资源,对广泛的基于社区的问题答案。每个问题线程都可以通过不同的角度接收大量答案。答案摘要的一个目标是产生反映答案视角范围的摘要。抽象答案概述的主要障碍是没有数据集,可以提供监督制作这些摘要。最近的作品提出了创建此类数据的启发式,但这些是嘈杂的,并且不会涵盖答案中存在的所有观点。这项工作介绍了4,631个CQA线程的新型数据集,用于答案摘要,由专业语言学家策划。我们的管道收集了答案概述所涉及的所有子特设的注释,包括选择与问题相关的答案句子,根据透视图对这些句子进行分组,总结每个视角,并生成整体摘要。我们在这些子组织上分析和基准最先进的模型,并为多视角数据增强引入了一种新的无监督方法,这进一步提高了根据自动评估的整体摘要性能。最后,我们提出了加强学习奖励,以改善事实一致性和答案覆盖范围和分析改进领域。
translated by 谷歌翻译
我们介绍了MTG,这是一套新的基准套件,用于培训和评估多语言文本生成。它是具有最大人类通知数据(400K)的第一次传播的多语言多路文本生成数据集。它包括五种语言(英语,德语,法语,西班牙语和中文)的四代任务(故事产生,问题生成,标题生成和文本摘要)。Multiway设置可以启用跨语言和任务的模型测试知识传输功能。使用MTG,我们从不同方面训练和分析了几种流行的多语言生成模型。我们的基准套件通过更多的人为宣传的并行数据促进了模型性能增强。它提供了各种一代方案的全面评估。代码和数据可在\ url {https://github.com/zide05/mtg}上获得。
translated by 谷歌翻译
自动评估摘要的连贯性具有重要意义,既可以实现成本效益的摘要评估,又可以通过选择高分候选候选摘要来提高连贯性。尽管已经提出了许多不同的方法来建模摘要相干性,但通常使用不同的数据集和指标对其进行评估。这使得很难理解他们的相对性能,并确定朝着更好的摘要连贯建模的方法。在这项工作中,我们对各种方法进行了大规模研究,以进行均匀的竞争环境建模。此外,我们介绍了两项新的分析措施,即系统内相关性和偏置矩阵,它们有助于确定相干度量的偏见,并为系统级混杂因素提供鲁棒性。尽管当前可用的自动连贯性措施都无法为所有评估指标的系统摘要分配可靠的连贯分数,但对自我监督任务进行了微调的大规模语言模型显示出令人鼓舞的结果,只要微调会考虑在内他们需要在不同的摘要长度上概括。
translated by 谷歌翻译
The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., decrease as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error (from an ontology of 7 types) is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) BUMP enables measuring the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, 3) BUMP enables the measurement of metrics' performance on individual error types and highlights areas of weakness for future work.
translated by 谷歌翻译
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
translated by 谷歌翻译
自动故事生成(ASG)的研究在很大程度上依赖于人类和自动评估。但是,尚无共识在哪些人类评估标准上使用,也没有分析自动标准与它们相关的良好状况。在本文中,我们建议重新评估ASG评估。我们介绍了由社会科学文学精心促进的6种正交和全面的人类标准。我们还提出了汉娜(Hanna),这是一个由10种不同ASG系统制作的1,056个故事的注释数据集。汉娜(Hanna)允许我们定量评估72个自动指标与人类标准的相关性。我们的分析强调了ASG当前指标的弱点,并使我们能够为ASG评估提出实用建议。
translated by 谷歌翻译
查询聚焦的文本摘要(QFTS)任务旨在构建基于给定查询的文本文档摘要的构建系统。解决此任务的关键挑战是缺乏培训摘要模型的大量标记数据。在本文中,我们通过探索一系列域适应技术来解决这一挑战。鉴于最近在广泛的自然语言处理任务中进行预先接受的变压器模型的成功,我们利用此类模型为单文档和多文件方案的QFTS任务产生抽象摘要。对于域适应,我们使用预先训练的变压器的摘要模型应用了各种技术,包括转移学习,弱监督学习和远程监督。六个数据集的广泛实验表明,我们所提出的方法非常有效地为QFTS任务产生抽象摘要,同时在一组自动和人类评估指标上设置新的最先进的结果。
translated by 谷歌翻译
Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users' interests. However, the current dataset for aspect or query-based summarization either focuses on specific domains, contains relatively small-scale instances, or includes only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourcing knowledge on Wikipedia.org and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OAsum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.
translated by 谷歌翻译
Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.
translated by 谷歌翻译