Evaluating automatically generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be used to evaluate a generated summary given an input document, without gold reference summaries. RISE is especially well suited to new datasets where reference summaries may not be available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021), and the results show that RISE has higher correlation with human evaluations than many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data-efficiency and generalizability across languages.
translated by Google Translate
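Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences, rather than tokens, as the basic units of matching, and use a sentence matching function to soft-match candidate and reference sentences. Candidate sentences are also compared to sentences in the source documents to allow grounding (e.g., factuality) evaluation. Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. The latter does not use any neural model, which is useful during model development phases where resources can be limited and fast evaluation is required. Finally, we also conduct extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.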
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
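Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating a summary text given its document enables, for example, the development of summary generation systems and the detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotations for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the podcast summary assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared to existing summary assessment data, this dataset has two unique aspects: (i) long-input documents based on speech podcasts; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus. First, we examine existing assessment methods, including model-free and model-based methods, and provide benchmark results for this long-input summary assessment dataset. Second, with the aim of filtering reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights into the summary assessment and generation tasks. The podcast summary assessment data is available.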
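Automatically evaluating the coherence of summaries is of great significance, both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and to identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.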
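The Query Focused Text Summarization (QFTS) task aims at building systems that generate a summary of one or more text documents based on a given query. A key challenge in addressing this task is the lack of large labeled datasets for training the summarization model. In this paper, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task in both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models, including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for the QFTS task, while setting new state-of-the-art results across a set of automatic and human evaluation metrics.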
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
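As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles, framed as a text summarization problem. The dataset consists of 80K English samples on 6987 topics. We set up a two-phase summarization method, description generation (Phase I) and candidate ranking (Phase II), as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By applying contrastive learning with the diverse input from beam search, the metric-based ranking models outperform the direct description generation models by up to 22 ROUGE in both the topic-exclusive split and the topic-independent split. Furthermore, the outcome descriptions of Phase II are supported by human evaluation: they were chosen in over 45.33% of cases against the gold descriptions, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from the paragraphs, while the gold descriptions do so better. The automatic generation of new descriptions reduces the human effort needed to create them and enriches Wikidata-based knowledge graphs. Our paper has a practical impact on Wikipedia and Wikidata, since thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/wikides.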
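Contrastive learning models have achieved great success in unsupervised visual representation learning: they maximize the similarities between feature representations of different views of the same image, while minimizing the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document, and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary, and its model-generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings than its counterpart without contrastive objectives.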
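Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts that encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.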
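It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. Surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), significantly outperform ColBERT (Khattab and Zaharia, 2020) and existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021). Most surprisingly, our ablation study finds that GTR is very data efficient, as it needs only 10% of the MS MARCO supervised data to achieve the best out-of-domain performance. All the GTR models are released at https://tfhub.dev/google/collections/gtr/1.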
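The dot-product bottleneck described in the abstract above can be illustrated with a minimal sketch. It assumes the query and passages have already been embedded into vectors of one fixed size; the function names `dot_score` and `rank_passages` and the toy vectors are hypothetical and are not taken from the GTR release.

```python
from typing import List, Sequence


def dot_score(query_vec: Sequence[float], passage_vec: Sequence[float]) -> float:
    """Relevance score of a dual encoder: a plain dot product of two embeddings."""
    if len(query_vec) != len(passage_vec):
        # The bottleneck size is fixed, so both towers must emit the same dimension.
        raise ValueError("query and passage embeddings must share one dimension")
    return sum(q * p for q, p in zip(query_vec, passage_vec))


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Return passage indices sorted from highest to lowest dot-product score."""
    scores = [dot_score(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional embeddings (made-up values, not real encoder outputs).
query = [1.0, 0.0]
passages = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
ranking = rank_passages(query, passages)
```

Because the score depends only on the two fixed-size vectors, the encoders producing them can grow in size without changing this scoring step, which is the setting the abstract studies.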
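Long document summarization is an important and difficult task in the field of natural language processing. Good performance on long document summarization indicates that a model has a decent understanding of human language. Currently, most research focuses on how to modify the attention mechanism of the transformer to achieve higher ROUGE scores. Studies of data pre-processing and post-processing are relatively few. In this paper, we use two pre-processing methods and a post-processing method, and analyze the effect of these methods on various long document summarization models.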
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
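The NND test procedure described in the abstract above could be sketched roughly as follows. The sketch assumes each test has already been reduced to two model log-likelihoods plus an error label; the function name `run_nnd_tests`, the tuple layout, and the example values are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter
from typing import Iterable, Tuple

# One test: (log-likelihood of the high-quality candidate,
#            log-likelihood of the near-negative candidate,
#            label of the known error in the near-negative candidate).
NNDTest = Tuple[float, float, str]


def run_nnd_tests(tests: Iterable[NNDTest]) -> Tuple[float, Counter]:
    """Return the fraction of tests passed and a count of failures per error type."""
    passed = 0
    total = 0
    failures: Counter = Counter()
    for loglik_good, loglik_near_negative, error_type in tests:
        total += 1
        # A test passes only if the good candidate is strictly more likely.
        if loglik_good > loglik_near_negative:
            passed += 1
        else:
            failures[error_type] += 1
    if total == 0:
        raise ValueError("at least one NND test is required")
    return passed / total, failures


# Made-up log-likelihoods, for illustration only.
example_tests = [
    (-1.0, -2.0, "wrong_entity"),
    (-3.0, -2.5, "negation"),
    (-0.5, -0.7, "wrong_entity"),
    (-4.0, -1.0, "negation"),
]
pass_rate, failures_by_error = run_nnd_tests(example_tests)
```

The per-error failure counts correspond to the "distribution over task-specific errors" mentioned in the abstract; how likelihoods are normalized for candidate length is left open here.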
Research on text summarization for low-resource Indian languages has been limited by the scarcity of relevant datasets. This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our case, the fine-tuned SoTA PEGASUS model worked best for English, the fine-tuned IndicBART model with augmented data worked best for Hindi, and a fine-tuned PEGASUS model combined with a translation mapping-based approach worked best for Gujarati. The generated summaries were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4.
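Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. While recently released datasets, such as QMSum or AQuaMuSe, facilitate research efforts in QFS, the field lacks a comprehensive study of the broad space of applicable modeling methods. In this paper, we conduct a systematic exploration of QFS, considering two general classes of methods: two-stage extractive-abstractive solutions and end-to-end models. Within those categories, we investigate existing methods and present two model extensions that achieve state-of-the-art performance on the QMSum dataset by a margin of up to 3.38 ROUGE-1, 3.72 ROUGE-2, and 3.28 ROUGE-L. Through quantitative experiments, we highlight the trade-offs between different model configurations and explore the transfer abilities between summarization tasks. Code and checkpoints are publicly available: https://github.com/salesforce/query-focused-sum.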
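The development of large and super-large language models, such as GPT-3, T5, Switch Transformer, ERNIE, etc., has significantly improved the performance of text generation. One of the important research directions in this area is the generation of texts with arguments. A solution to this problem can be used in business meetings, political debates, dialogue systems, and the preparation of student essays. One of the main domains for these applications is the economic sphere. The key problem of argument text generation for the Russian language is the lack of annotated argumentation corpora. In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays, and UKP Sentential corpora to fine-tune a RuBERT model. This model is then used to annotate a corpus of economic news with argumentation. The annotated corpus is in turn used to fine-tune the ruGPT-3 model, which generates argument texts. The results show that this approach improves the accuracy of argument generation by more than 20 percentage points (63.2% vs. 42.5%) compared to the original ruGPT-3 model.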
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
State-of-the-art summarization models still struggle to be factually consistent with the input text. A model-agnostic way to address this problem is post-editing the generated summaries. However, existing approaches typically fail to remove entity errors if a suitable input entity replacement is not available or may insert erroneous content. In our work, we focus on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. We propose to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed. We show that this model improves factual consistency while maintaining ROUGE, improving entity precision by up to 30% on XSum, and that this model can be applied on top of another post-editor, improving entity precision by up to a total of 38%. We perform an extensive comparison of post-editing approaches that demonstrate trade-offs between factual consistency, informativeness, and grammaticality, and we analyze settings where post-editors show the largest improvements.
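To make the notions of extrinsic entities and entity precision in the abstract above concrete, here is a minimal sketch. It assumes the summary's entities have already been extracted as strings by some upstream tool, and it uses a case-insensitive substring match against the source as a simplifying stand-in for whatever matching the authors use; the function names and the example text are hypothetical.

```python
from typing import List, Sequence


def extrinsic_entities(summary_entities: Sequence[str], source_text: str) -> List[str]:
    """Return the summary entities that never appear in the source text."""
    source_lower = source_text.lower()
    # Assumption: a case-insensitive substring match counts as "in the source".
    return [e for e in summary_entities if e.lower() not in source_lower]


def entity_precision(summary_entities: Sequence[str], source_text: str) -> float:
    """Fraction of summary entities that are supported by the source text."""
    if not summary_entities:
        # Convention chosen for this sketch only: no entities means nothing unsupported.
        return 1.0
    unsupported = extrinsic_entities(summary_entities, source_text)
    return 1.0 - len(unsupported) / len(summary_entities)


# Made-up example, for illustration only.
source = "Alice met Bob in Paris on Monday."
entities = ["Alice", "Paris", "London", "Bob"]
missing = extrinsic_entities(entities, source)
precision = entity_precision(entities, source)
```

In this toy case "London" is the only extrinsic entity, so removing it from the summary would raise entity precision from 0.75 to 1.0.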
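Despite the success of recent abstractive summarizers on automatic evaluation metrics, the generated summaries still present factual inconsistencies with the source document. In this paper, we focus on entity-level factual inconsistency, i.e., reducing the mismatched entities between the generated summaries and the source documents. We therefore propose a novel entity-based SpanCopy mechanism and explore its extension with a Global Relevance component. Experimental results on four summarization datasets show that SpanCopy can effectively improve entity-level factual consistency with essentially no change in word-level and entity-level saliency. The code is available on GitHub under https://github.com/wendy-xiao/.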
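Recently proposed BERT-based evaluation metrics perform well on standard evaluation benchmarks but are vulnerable to adversarial attacks, e.g., those relating to factuality errors. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics but perform below SOTA MT metrics. However, when we combine existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+20% to +30%) and higher-quality metrics as measured on standard benchmarks (+5% to +25%).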