Legal contracts, such as employment or lease agreements, are important documents because they govern the obligations and entitlements of the contracting parties. However, these documents are typically long and written in legalese, so understanding them takes many hours of manual effort. In this paper, we address the task of summarizing legal contracts for each of the contracting parties, to enable faster review and improved understanding. Specifically, we collect a dataset consisting of pairwise importance comparison annotations by legal experts for ~293K sentence pairs from lease agreements. We propose a novel extractive summarization system to automatically produce a summary consisting of the most important obligations, entitlements, and prohibitions in a contract. It consists of two modules: (1) a content categorizer to identify sentences containing each of the categories (i.e., obligation, entitlement, and prohibition) for a party, and (2) an importance ranker to compare the importance among sentences of each category for a party to obtain a ranked list. The final summary is produced by selecting the most important sentences of a category for each of the parties. We demonstrate the effectiveness of our proposed system by comparing it against several text ranking baselines via automatic and human evaluation.
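As a rough illustration of how pairwise importance comparisons can be turned into a ranked list, the sketch below simply counts how many comparisons each sentence wins. This win-counting rule, the function name `rank_by_pairwise_wins`, and the toy sentence IDs are assumptions made for illustration; the importance ranker described in the abstract is a learned module and is not reproduced here.

```python
from collections import defaultdict


def rank_by_pairwise_wins(sentences, comparisons):
    """Rank sentence IDs by the number of pairwise comparisons they win.

    `comparisons` is an iterable of (winner, loser) pairs, i.e. the
    annotator judged `winner` to be more important than `loser`.
    """
    wins = defaultdict(int)
    for winner, _loser in comparisons:
        wins[winner] += 1
    # sorted() is stable, so ties keep their original document order.
    return sorted(sentences, key=lambda s: -wins[s])


# Hypothetical toy data: three sentences and three expert judgments.
sentences = ["s1", "s2", "s3"]
comparisons = [("s2", "s1"), ("s2", "s3"), ("s3", "s1")]
ranked = rank_by_pairwise_wins(sentences, comparisons)
print(ranked)
```

A summary for one party and category would then take the top few entries of `ranked`.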
Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
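In the summarization domain, a key requirement for summaries is to be consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection and find that past work suffered from a mismatch in input granularity between NLI datasets (sentence level) and inconsistency detection (document level). We provide a highly effective and lightweight method called SummaCConv that enables NLI models to be used successfully for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. On our newly introduced benchmark called SummaC (Summary Consistency), which consists of six large inconsistency detection datasets, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5-point improvement over prior work. We make the models and datasets available: https://github.com/tingofurro/summac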
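The idea of scoring at sentence granularity and then aggregating can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_entailment` is a hypothetical word-overlap stand-in for a real NLI model, the sentence splitter is naive, and the max-then-mean aggregation is a simplification chosen for the example rather than the learned aggregation used by SummaCConv.

```python
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (illustrative only).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_entailment(premise, hypothesis):
    # Hypothetical stand-in for an NLI model: the fraction of hypothesis
    # words that also occur in the premise.
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hyp_words = re.findall(r"\w+", hypothesis.lower())
    if not hyp_words:
        return 0.0
    return sum(w in premise_words for w in hyp_words) / len(hyp_words)


def consistency_score(document, summary, scorer=toy_entailment):
    """Score each (document sentence, summary sentence) pair, keep the best
    supporting document sentence per summary sentence, then average."""
    doc_sents = split_sentences(document)
    sum_sents = split_sentences(summary)
    if not doc_sents or not sum_sents:
        return 0.0
    best_support = [max(scorer(d, s) for d in doc_sents) for s in sum_sents]
    return sum(best_support) / len(best_support)


doc = "The cat sat on the mat. It rained all day."
good = consistency_score(doc, "The cat sat on the mat.")
bad = consistency_score(doc, "The dog flew away.")
print(good, bad)
```

In practice `scorer` would be replaced by the entailment probability from a trained NLI model; the pairwise structure stays the same.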
We consider the end-to-end abstract-to-title generation problem, exploring seven recent transformer based models (including ChatGPT) fine-tuned on more than 30k abstract-title pairs from NLP and machine learning venues. As an extension, we also consider the harder problem of generating humorous paper titles. For the latter, we compile the first large-scale humor annotated dataset for scientific papers in the NLP/ML domains, comprising almost 2.5k titles. We evaluate all models using human and automatic metrics. Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse). Generating funny titles is more difficult, however, and our automatic systems clearly underperform relative to humans and often learn dataset artefacts of humor. Finally, ChatGPT, without any fine-tuning, performs on the level of our best fine-tuned system.
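Academic research is an exploratory activity that aims to solve problems that have never been solved before. By its nature, each academic work must therefore include a literature review that distinguishes its novelties from what prior work has already addressed. In natural language processing, this literature review usually appears in the "Related Work" section. Given the rest of a research paper and a list of cited papers, the task of automatic related work generation aims to generate the "Related Work" section automatically. Although this task was proposed about 10 years ago, only recently has it come to be treated as a variant of the scientific multi-document summarization problem. Even today, however, the problems of automatic related work generation and citation text generation have not been standardized. In this survey, we conduct a meta-study that compares the existing literature on related work generation from the perspectives of problem formulation, dataset collection, methodological approach, performance evaluation, and future prospects, to give readers insight into the progress of state-of-the-art research and into how future studies can be conducted. We also survey related research areas that we suggest future work consider integrating.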
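Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short, concise texts capturing the most important information would therefore be valuable in aiding readers' comprehension. Recently, with the advent of neural architectures, significant research effort has gone into advancing automatic text summarization systems, and numerous studies have examined the challenges of extending these systems to the long document domain. In this survey, we provide a comprehensive overview of research on long document summarization and a systematic evaluation of the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature in the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.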
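Timelines provide one of the most effective ways to visualize the important historical facts that occurred over a period of time, offering insights that are harder to obtain from reading the equivalent information in textual form. By leveraging generative adversarial learning for important-sentence classification, and by incorporating knowledge-based tags to improve the performance of event coreference resolution, we introduce a two-stage system for event timeline generation from multiple (historical) text documents. We demonstrate our results on two manually annotated historical text documents. Our results can be extremely helpful to historians, both in advancing historical research and in understanding the socio-political landscape of a country.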
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: selecting important images to focus on during training, and serving as a reward for reinforcement learning to improve the factuality of multimodal summary generation with respect to automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
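A weighted combination of two metric scores can be written in a few lines. The sketch below assumes a convex combination with a single weight `alpha`; the function name, the default weight, and the example scores are hypothetical, and the two input scores are assumed to have been computed elsewhere by CLIPScore (image vs. summary) and BERTScore (document vs. summary).

```python
def combine_scores(clip_score, bert_score, alpha=0.5):
    """Convex combination of an image-summary score and a document-summary
    score. `alpha` weights the image side; its value here is an assumption,
    not a setting taken from the abstract."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_score + (1.0 - alpha) * bert_score


# Hypothetical precomputed scores for one summary.
combined = combine_scores(clip_score=0.8, bert_score=0.6, alpha=0.25)
print(combined)
```

Because the combination needs no training, it can be applied zero-shot to any summary for which both component scores are available.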
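Identifying, classifying, and analyzing arguments in legal discourse has been a prominent area of research since the inception of the argument mining field. However, there is a major discrepancy between the way natural language processing (NLP) researchers model and annotate arguments in court decisions and the way legal experts understand and analyze legal argumentation. While computational approaches typically simplify arguments into generic premises and claims, arguments in legal research usually exhibit a rich typology that is important for gaining insight into the particular case and into the application of law in general. We address this problem and make several substantial contributions to move the field forward. First, we design a new annotation scheme for legal arguments in proceedings of the European Court of Human Rights (ECHR) that is deeply rooted in the theory and practice of legal argumentation research. Second, we compile and annotate a large corpus of 373 court decisions (2.3M tokens and 15K annotated argument spans). Finally, we train an argument mining model that outperforms state-of-the-art models in the legal NLP domain and provide a thorough expert-based evaluation. All datasets and source code are available under open licenses at https://github.com/trusthlt/mining-legal-arguments.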
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
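To make the notion of scoring with fine-grained content units concrete, the sketch below computes the fraction of reference units that an annotator marked as present in a system summary. Treating the score as this simple fraction, and the name `unit_recall`, are assumptions for illustration; the abstract does not spell out the exact scoring formula or any normalization.

```python
def unit_recall(unit_present):
    """Fraction of reference content units judged present in a summary.

    `unit_present` holds one boolean per content unit extracted from the
    reference summary (True = the annotator found it in the system summary).
    """
    if not unit_present:
        return 0.0
    return sum(unit_present) / len(unit_present)


# Hypothetical annotation: 3 of 4 reference units appear in the summary.
score = unit_recall([True, False, True, True])
print(score)
```

Binary judgments on small units are easier for annotators to agree on than a single holistic rating, which is the intuition behind unit-based protocols.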
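With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications that help knowledge workers process unwieldy document collections. One such setting is the Civil Rights Litigation Clearinghouse (CRLC) (https://clearinghouse.net), which posts information about large-scale civil rights lawsuits, serving lawyers, scholars, and the general public. Today, summarization in the CRLC requires extensive training of lawyers and law students, who spend hours understanding multiple relevant documents in order to produce high-quality summaries of key events and outcomes. Motivated by this ongoing real-world summarization effort, we introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing. Multi-LexSum presents a challenging multi-document summarization task given the length of the source documents, which often exceed 200 pages per case. Furthermore, Multi-LexSum differs from other datasets in its multiple target summaries, each at a different granularity (ranging from one-sentence "extreme" summaries to multi-paragraph narratives of over five hundred words). We present extensive analysis demonstrating that, despite the high quality of the summaries in the training data (which adhere to strict content and style guidelines), state-of-the-art summarization models perform poorly on this task. We release Multi-LexSum for further research on summarization methods and to facilitate the development of applications that assist the CRLC's mission at https://multilexsum.github.io.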
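In a typical customer service chat scenario, customers contact a support center to ask for help or raise complaints, and human agents try to resolve the issues. In most cases, at the end of the conversation, agents are asked to write a short summary emphasizing the problem and the proposed solution, usually for the benefit of other agents who may have to deal with the same customer or issue. The goal of this paper is to advance the automation of this task. We introduce the first large-scale, high-quality customer service dialog summarization dataset, with close to 6,500 human-annotated summaries. The data is based on real-world customer support dialogs and includes both extractive and abstractive summaries. We also introduce a new unsupervised extractive summarization method specific to dialogs.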
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
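The re-ranking step itself, choosing among candidate summaries according to a quality metric, can be sketched independently of how the scorer is trained. In the example below, `support_ratio` is a hypothetical word-overlap proxy standing in for a learned metric, and the plain sort is a simplification: the energy-based model described in the abstract learns its ranking function and is not reproduced here.

```python
import re


def words(text):
    return re.findall(r"\w+", text.lower())


def support_ratio(source, summary):
    # Hypothetical quality proxy: share of summary words found in the source.
    source_words = set(words(source))
    tokens = words(summary)
    if not tokens:
        return 0.0
    return sum(t in source_words for t in tokens) / len(tokens)


def rerank(source, candidates, metric=support_ratio):
    """Order candidate summaries from highest to lowest metric score."""
    return sorted(candidates, key=lambda c: metric(source, c), reverse=True)


source = "The company reported record profits in 2020."
candidates = [
    "The company reported losses.",
    "The company reported record profits.",
]
best = rerank(source, candidates)[0]
print(best)
```

Any re-ranker of this kind is only as reliable as the metric it optimizes, which is the caveat the human evaluation above raises for highly abstractive summaries.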
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models, which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings.
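Abstract Meaning Representation (AMR) is a graph-based semantic representation of sentences, composed of collections of concepts linked by semantic relations. AMR-based approaches have found success in a variety of applications, but a challenge to using AMR in tasks that require document-level context is that it represents only individual sentences. Prior work in AMR-based summarization has automatically merged individual sentence graphs into a document graph, but the merging method and its effect on summary content selection have not been evaluated independently. In this paper, we present a novel dataset consisting of human-annotated alignments between the nodes of paired documents and summaries, which can be used to evaluate (1) merge strategies and (2) the performance of content selection methods over the nodes of a merged or unmerged AMR graph. We apply both forms of evaluation to prior work as well as to a new method for node merging, and show that our new method performs significantly better than prior work.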
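In populous countries, the number of pending legal cases has been growing exponentially. There is a need to develop techniques for processing and organizing legal documents. In this paper, we introduce a new corpus for structuring legal documents. In particular, we introduce a corpus of legal judgment documents in English that are segmented into topical and coherent parts. Each of these parts is annotated with a label drawn from a list of predefined rhetorical roles. We develop baseline models for automatically predicting rhetorical roles in a legal document based on the annotated corpus. Furthermore, we show how rhetorical roles can be applied to improve performance on the tasks of summarization and legal judgment prediction. We release the corpus and baseline model code along with the paper.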
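Automatically evaluating the coherence of summaries is of great significance, both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been proposed to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and to identify ways forward towards better summary coherence modeling. In this work, we conduct a large-scale investigation of various methods for summary coherence modeling on an even playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, which help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.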
Convincing people to get vaccinated against COVID-19 is a key societal challenge in the present times. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have towards these vaccines, such as potential side-effects, ineffectiveness, political factors, and so on. Though there are datasets that broadly classify social media posts into Anti-vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled into various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset also provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that this is a very challenging dataset for multi-label explainable classification and tweet summarization, as is evident by the moderate scores achieved by some state-of-the-art models. Our dataset and codes are available at: https://github.com/sohampoddar26/caves-data
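Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for abstractive answer summarization is the absence of a dataset that provides supervision for producing such summaries. Recent works propose heuristics to create such data, but these are noisy and do not cover all perspectives present in the answers. This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists. Our pipeline gathers annotations for all subtasks involved in answer summarization, including selecting answer sentences relevant to the question, grouping these sentences by perspective, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation, which further boosts overall summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage, and analyze areas for improvement.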
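Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means of coping with considerable information repetition. Clusters were leveraged to indicate information saliency and to avoid redundancy. These methods focused on clustering sentences, even though closely related sentences usually also contain non-aligned information. In this work, we revisit the clustering approach, grouping propositions instead for more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions. Our summarization method improves over the previous state-of-the-art MDS methods on the DUC 2004 and TAC 2011 datasets, in both automatic ROUGE scores and human preference.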