This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique challenges and is yet underexplored for text summarization systems. In this shared task, we introduce four sub-tasks and their corresponding datasets, focusing on summarizing books, movie scripts, primetime television scripts, and daytime soap opera scripts. We detail the process of curating these datasets for the task, as well as the metrics used for the evaluation of the submissions. As part of the CREATIVESUMM workshop at COLING 2022, the shared task attracted 18 submissions in total. We discuss the submissions and the baselines for each sub-task in this paper, along with directions for facilitating future work in the field.
translated by 谷歌翻译
Narrative summarization aims to produce a distilled version of a narrative to describe its most salient events and characters. Summarizing a narrative is challenging as it requires an understanding of event causality and character behaviors. To encourage research in this direction, we propose NarraSum, a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries. Experiments show that there is a large performance gap between humans and the state-of-the-art summarization models on NarraSum. We hope that this dataset will promote future research in summarization, as well as broader studies of natural language understanding and generation. The dataset is available at https://github.com/zhaochaocs/narrasum.
translated by 谷歌翻译
NLP基准在很大程度上主要集中在短篇文本上,例如句子和段落,即使长文本在野外占相当数量的自然语言。我们介绍卷轴,这是一套需要在长文本上推理的任务套件。我们检查现有的长文本数据集,文本自然是长期的,同时优先考虑涉及在输入上扫描信息的任务。滚动包含概述,问题应答和自然语言推理任务,包括多个域,包括文学,科学,业务和娱乐。初始基线(包括啰覆编码器),表明滚动有充足的改进空间。我们以统一的文本到文本格式提供所有数据集,并托管Live Refordboard,以促进模型架构和预用方法的研究。
translated by 谷歌翻译
Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
translated by 谷歌翻译
随着大型语言模型的出现,抽象性摘要的方法取得了长足的进步,从而在应用程序中使用了帮助知识工人处理笨拙的文档收集的潜力。一个这样的环境是民权诉讼交换所(CRLC)(https://clearinghouse.net),其中发布了有关大规模民权诉讼,服务律师,学者和公众的信息。如今,CRLC中的摘要需要对律师和法律专业的学生进行广泛的培训,这些律师和法律专业的学生花费数小时了解多个相关文件,以便产生重要事件和结果的高质量摘要。在这种持续的现实世界摘要工作的激励下,我们引入了Multi-iplesum,这是由正在进行的CRLC写作中绘制的9,280个专家作者的摘要集。鉴于源文档的长度,多文章介绍了一个具有挑战性的多文档摘要任务,通常每个情况超过200页。此外,多胎sum与其多个目标摘要中的其他数据集不同,每个数据集都处于不同的粒度(从一句“极端”摘要到超过五百个单词的多段落叙述)。我们提供了广泛的分析,表明,尽管培训数据(遵守严格的内容和样式准则)中的摘要很高,但最新的摘要模型在此任务上的表现较差。我们发布了多体式的摘要方法,以及促进应用程序的开发,以协助CRLC的任务https://multilexsum.github.io。
translated by 谷歌翻译
The publication rates are skyrocketing across many fields of science, and it is difficult to stay up to date with the latest research. This makes automatically summarizing the latest findings and helping scholars to synthesize related work in a given area an attractive research objective. In this paper we study the problem of citation text generation, where given a set of cited papers and citing context the model should generate a citation text. While citation text generation has been tackled in prior work, existing studies use different datasets and task definitions, which makes it hard to study citation text generation systematically. To address this, we propose CiteBench: a benchmark for citation text generation that unifies the previous datasets and enables standardized evaluation of citation text generation models across task settings and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into task definition and evaluation to guide the future research in citation text generation. We make CiteBench publicly available at https://github.com/UKPLab/citebench.
translated by 谷歌翻译
Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users' interests. However, the current dataset for aspect or query-based summarization either focuses on specific domains, contains relatively small-scale instances, or includes only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourcing knowledge on Wikipedia.org and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OAsum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.
translated by 谷歌翻译
传统上,文本简化被视为单语翻译任务,其中源文本及其简化的对应物之间的句子是对齐的。但是,尤其是对于更长的输入文档,总结文本(或完全删除相关内容)在简化过程中起重要作用,目前在现有数据集中尚未反映出该过程。同时,非英语语言的资源通常很少,并且对于培训新解决方案而言是过分的。为了解决这个问题,我们对可以共同总结和简化长源文档的系统提出了核心要求。我们进一步描述了基于德国Wikipedia和德国儿童词典“ Klexikon”的新数据集的创建,用于简化和摘要,包括近2900个文档。我们发布了一个与文档一致的版本,特别突出了摘要方面,并提供了统计证据,表明此资源也非常适合简化。代码和数据可在GitHub上找到:https://github.com/dennlinger/klexikon
translated by 谷歌翻译
Text summarization is a user-preference based task, i.e., for one document, users often have different priorities for summary. As a key aspect of customization in summarization, granularity is used to measure the semantic coverage between the summary and source document. However, developing systems that can generate summaries with customizable semantic coverage is still an under-explored topic. In this paper, we propose the first unsupervised multi-granularity summarization framework, GranuSum. We take events as the basic semantic units of the source documents and propose to rank these events by their salience. We also develop a model to summarize input documents with given events as anchors and hints. By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner. Meanwhile, we annotate a new benchmark GranuDUC that contains multiple summaries at different granularities for each document cluster. Experimental results confirm the substantial superiority of GranuSum on multi-granularity summarization over strong baselines. Further, by exploiting the event information, GranuSum also exhibits state-of-the-art performance under the conventional unsupervised abstractive setting. Dataset for this paper can be found at: https://github.com/maszhongming/GranuDUC
translated by 谷歌翻译
诸如学术文章和商业报告之类的长期文件一直是详细说明重要问题和需要额外关注的复杂主题的标准格式。自动汇总系统可以有效地将长文档置于简短而简洁的文本中,以封装最重要的信息,从而在帮助读者的理解中很重要。最近,随着神经体系结构的出现,已经做出了重大的研究工作,以推动自动文本摘要系统,以及有关将这些系统扩展到长期文档领域的挑战的大量研究。在这项调查中,我们提供了有关长期文档摘要的研究的全面概述,以及其研究环境的三个主要组成部分的系统评估:基准数据集,汇总模型和评估指标。对于每个组成部分,我们在长期汇总的背景下组织文献,并进行经验分析,以扩大有关当前研究进度的观点。实证分析包括一项研究基准数据集的内在特征,摘要模型的多维分析以及摘要评估指标的综述。根据总体发现,我们通过提出可能在这个快速增长的领域中提出未来探索的方向来得出结论。
translated by 谷歌翻译
在印度法院制度中,长期以来一直是一个问题。有超过4千万的案件。对于法律利益相关者来说,手动总结数百个文件是一项耗时且繁琐的任务。随着机器学习的发展,许多用于文本摘要的最新模型已经出现。独立于域的模型在法律文本方面做得不好,由于缺乏公开可用的数据集,对印度法律制度的这些模型进行微调是有问题的。为了提高独立模型的性能,作者提出了一种在印度背景下使法律文本正常化的方法。作者试验了两个与法律文本摘要的最先进的域独立模型,即Bart和Pegasus。 Bart和Pegasus以提取性和抽象的摘要为方面,以了解文本归一化方法的有效性。汇总文本由域专家在多个参数和使用胭脂指标上评估。它表明,在具有域独立模型的法律文本中,提出的文本归一化方法有效。
translated by 谷歌翻译
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues are suboptimal because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one is produced by a fine-tuned summarization model, and the other is a collection of dialogue turns that convey important information. We then choose one of these pseudo summaries based on the difference in information distribution across different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
translated by 谷歌翻译
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several intersentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves stateof-the-art results across the board in both extractive and abstractive settings. 1
translated by 谷歌翻译
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
translated by 谷歌翻译
多文件摘要(MDS)是信息聚合的有效工具,它从与主题相关文档集群生成信息和简洁的摘要。我们的调查是,首先,系统地概述了最近的基于深度学习的MDS模型。我们提出了一种新的分类学,总结神经网络的设计策略,并进行全面的最先进的概要。我们突出了在现有文献中很少讨论的各种客观函数之间的差异。最后,我们提出了与这个新的和令人兴奋的领域有关的几个方向。
translated by 谷歌翻译
Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.
translated by 谷歌翻译
健康素养被出现为制定适当的健康决策和确保治疗结果的关键因素。然而,医学术语和该领域的专业语言的复杂结构使健康信息尤为难以解释。因此,迫切需要对自动化方法来提高生物医学文献的可访问性,以提高一般人群。这个问题可以作为医疗保健专业人员语言与公众的语言之间的翻译问题。在本文中,我们介绍了自动化生物医学科学评论的制定语言摘要的新任务,建设了一个数据集,以支持自动化方法的开发和评估,以提高生物医学文献的可访问性。我们对解决这项任务的各种挑战进行了分析,包括不仅对关键要点的总结,而且还概述了对背景知识和专业语言的简化的解释。我们试验最先进的摘要模型以及多种数据增强技术,并使用自动指标和人工评估评估其性能。结果表明,与专家专家专门开发的参考摘要相比,使用当代神经架构产生的自动产生的摘要可以实现有希望的质量和可读性(最佳Rouge-L为50.24和Flesch-Kincaid可读性得分为13.30)。我们还讨论了目前尝试的局限性,为未来工作提供了洞察和方向。
translated by 谷歌翻译
We introduce extreme summarization, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question "What is the article about?". We collect a real-world, large scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC). We propose a novel abstractive model which is conditioned on the article's topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures longrange dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans. 1
translated by 谷歌翻译
自动摘要评估对于机器生成和人为生产的摘要都有用。自动评估给定文档的摘要文本启用,例如,摘要生成系统开发和检测不适当的摘要。摘要评估可以以多种模式进行:排名摘要生成系统;对特定文档的排名摘要;并在绝对规模上估算文档 - 苏格尔对的质量。带有注释的现有数据集用于摘要评估,通常基于新闻摘要数据集,例如CNN/DailyMail或XSUM。在这项工作中,我们描述了一个新的数据集,即播客摘要评估语料库,这是由TREC2020的人类专家评估的播客摘要集。与现有的摘要评估数据相比,该数据集具有两个独特的方面:(i)基于语音播客的长输入,文档; (ii)有机会在播客语料库中检测不适当的参考摘要。首先,我们检查了现有的评估方法,包括无模型和基于模型的方法,并为此长输入摘要评估数据集提供基准结果。其次,为了过滤参考参考文献配对以进行培训,我们采用摘要评估进行数据选择。这两个方面的实验结果为摘要评估和发电任务提供了有趣的见解。播客摘要评估数据可用。
translated by 谷歌翻译
The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization
translated by 谷歌翻译