Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in summarization. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.91 across synthetic tasks with known ground truth and can achieve a two-fold reduction in hallucinations on a real entity hallucination evaluation on the NYT dataset.
translated by 谷歌翻译
State-of-the-art summarization models still struggle to be factually consistent with the input text. A model-agnostic way to address this problem is post-editing the generated summaries. However, existing approaches typically fail to remove entity errors if a suitable input entity replacement is not available or may insert erroneous content. In our work, we focus on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. We propose to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed. We show that this model improves factual consistency while maintaining ROUGE, improving entity precision by up to 30% on XSum, and that this model can be applied on top of another post-editor, improving entity precision by up to a total of 38%. We perform an extensive comparison of post-editing approaches that demonstrate trade-offs between factual consistency, informativeness, and grammaticality, and we analyze settings where post-editors show the largest improvements.
translated by 谷歌翻译
Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.
translated by 谷歌翻译
利用预训练语言模型的抽象摘要系统在基准数据集上取得了卓越的结果。但是,此类模型已被证明更容易幻觉,这些事实对输入背景不忠。在本文中,我们提出了一种通过实体覆盖范围控制(ECC)来补救实体级外部幻觉的方法。我们首先计算实体覆盖范围的精度,并为每个培训示例提供相应的控制代码,该示例隐含地指导该模型在训练阶段识别忠实的内容。我们通过从Wikipedia提取的大但嘈杂的数据中进行中间调整进一步扩展了我们的方法,以解锁零击摘要。我们表明,根据我们对三个基准数据集XSUM,PubMed和Samsum的实验结果,根据我们在监督的微调和零射击设置中,可以在监督微调和零摄像设置中更加忠实和显着的抽象性汇总。
translated by 谷歌翻译
尽管与专家标签相比,众包平台通常用于收集用于培训机器学习模型的数据集,尽管标签不正确。有两种常见的策略来管理这种噪音的影响。第一个涉及汇总冗余注释,但以较少的例子为代价。其次,先前的作品还考虑使用整个注释预算来标记尽可能多的示例,然后应用Denoising算法来隐式清洁数据集。我们找到了一个中间立场,并提出了一种方法,该方法保留了一小部分注释,以明确清理高度可能的错误样本以优化注释过程。特别是,我们分配了标签预算的很大一部分,以形成用于训练模型的初始数据集。然后,该模型用于确定最有可能是不正确的特定示例,我们将剩余预算用于重新标记。在三个模型变化和四个自然语言处理任务上进行的实验表明,当分配相同的有限注释预算时,旨在处理嘈杂标签的标签聚合和高级denoising方法均优于标签聚合或匹配。
translated by 谷歌翻译
机器学习模型的预测失败通常来自训练数据中的缺陷,例如不正确的标签,离群值和选择偏见。但是,这些负责给定失败模式的数据点通常不知道先验,更不用说修复故障的机制了。这项工作借鉴了贝叶斯对持续学习的看法,并为两者开发了一个通用框架,确定了导致目标失败的培训示例,并通过删除有关它们的信息来修复模型。该框架自然允许将最近学习的最新进展解决这一新的模型维修问题,同时将现有的作品集成了影响功能和数据删除作为特定实例。在实验上,提出的方法优于基准,既可以识别有害训练数据,又要以可普遍的方式固定模型失败。
translated by 谷歌翻译
最先进的抽象摘要系统经常生成\ emph {幻觉};即,不直接从源文本中推断的内容。尽管被认为是不正确的,我们发现非常令人难潮的内容是事实,即与世界知识一致。这些事实幻觉通过提供有用的背景信息,可以在摘要中受益。在这项工作中,我们提出了一种新的检测方法,将事实与实体的非事实幻觉分开。我们的方法分别使用实体的先前和后验概率,分别是预训练和芬特的屏蔽语言模型。经验结果表明,我们的方法在精度和F1分数方面大大优于两种基线%,与人类判断强烈相关。百分比对事实分类任务。此外,我们显示我们的探测器,当用作离线增强学习(RL)算法中的奖励信号时,显着提高了摘要的事实性,同时保持抽象水平。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
生成摘要中的事实不一致严重限制了抽象对话摘要的实际应用。尽管通过使用预先训练的模型实现了显着进展,但在人类评估期间发现了大量的幻觉含量。预先接受的模型最常见的是微调文本摘要的跨熵损失,这可能不是最佳策略。在这项工作中,我们为带注释数据提供了事实错误的类型,以突出显示错误的类型并远离对事实的二进制了解。我们进一步提出了一种培训策略,通过新颖的对比微调,改善了摘要的事实一致性和整体素质。基于我们的语言信息的错误类型,我们设计了各个目标的不同模块化目标。具体而言,我们利用硬阴性样本具有误差,以减少事实不一致的产生。为了捕获扬声器之间的关键信息,我们还设计了特定于对话的损失。使用人类评估和自动忠实度量指标,我们表明我们的模型在对话摘要,Samsum语料库中大大降低了各种事实错误。此外,我们的模型可以推广到会议概述,AMI语料库,它产生的分数明显高于两个数据集关于单词 - 重叠度量标准的基线。
translated by 谷歌翻译
我们解决了对追踪预测的影响函数的高效计算,回到培训数据。我们提出并分析了一种新方法来加速基于Arnoldi迭代的反向Hessian计算。通过这种改进,我们实现了我们的知识,首次成功实施了影响功能,该函数的尺寸为全尺寸(语言和愿景)变压器模型,具有数十亿个参数。我们评估我们对图像分类和序列任务的方法,以百万次训练示例。我们的代码将在https://github.com/google-research/jax-influence上获得。
translated by 谷歌翻译
实际一致性是实际设置中文本摘要模型的基本质量。在评估此维度的现有工作可以大致分为两行研究,基于征收的指标和问题应答(QA)的指标。然而,最近作品中提出的不同的实验设置导致对比的结论是哪个范例表现最佳。在这项工作中,我们进行了广泛的征集和基于QA的指标的比较,致力于仔细选择基于QA的度量的组件对于性能至关重要。在那些见解中,我们提出了一个优化的公制,我们称之为QAFacteval,这导致了对夏季事实一致性基准的基于QA的度量标准的平均平均平均改进。我们的解决方案提高了基于最佳的基于范围的公制,并在该基准测试中实现了最先进的性能。此外,我们发现基于QA和基于征求的度量提供了互补信号,并将两者组合成单个学习的度量,以进一步提升。通过定性和定量分析,我们将问题生成和可应答性分类视为基于QA的度量的未来工作的两个关键组成部分。
translated by 谷歌翻译
The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., decrease as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error (from an ontology of 7 types) is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) BUMP enables measuring the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, 3) BUMP enables the measurement of metrics' performance on individual error types and highlights areas of weakness for future work.
translated by 谷歌翻译
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
translated by 谷歌翻译
In many task settings, text classification models are likely to encounter examples from novel classes on which they cannot predict correctly. Selective prediction, in which models abstain on low-confidence examples, provides a possible solution, but existing models are often overly confident on OOD examples. To remedy this overconfidence, we introduce Contrastive Novelty-Augmented Learning (CoNAL), a two-step method that generates OOD examples representative of novel classes, then trains to decrease confidence on them. First, we generate OOD examples by prompting a large language model twice: we prompt it to enumerate relevant novel labels, then generate examples from each novel class matching the task format. Second, we train our classifier with a novel contrastive objective that encourages lower confidence on generated OOD examples than training examples. When trained with CoNAL, classifiers improve in their ability to detect and abstain on OOD examples over prior methods by an average of 2.3% AUAC and 5.5% AUROC across 4 NLP datasets, with no cost to in-distribution accuracy.
translated by 谷歌翻译
A primary objective of news articles is to establish the factual record for an event, frequently achieved by conveying both the details of the specified event (i.e., the 5 Ws; Who, What, Where, When and Why regarding the event) and how people reacted to it (i.e., reported statements). However, existing work on news summarization almost exclusively focuses on the event details. In this work, we propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event. To this end, we create a new multi-document summarization benchmark, SUMREN, comprising 745 summaries of reported statements from various public figures obtained from 633 news articles discussing 132 events. We propose an automatic silver training data generation approach for our task, which helps smaller models like BART achieve GPT-3 level performance on this task. Finally, we introduce a pipeline-based framework for summarizing reported speech, which we empirically show to generate summaries that are more abstractive and factual than baseline query-focused summarization approaches.
translated by 谷歌翻译
影响功能有效地估计了删除单个训练数据点对模型学习参数的影响。尽管影响估计值与线性模型的剩余重新进行了良好的重新对齐,但最近的作品表明,在神经网络中,这种比对通常很差。在这项工作中,我们通过将其分解为五个单独的术语来研究导致这种差异的特定因素。我们研究每个术语对各种架构和数据集的贡献,以及它们如何随网络宽度和培训时间等因素而变化。尽管实际影响函数估计值可能是非线性网络中保留对方的重新培训的差异,但我们表明它们通常是对不同对象的良好近似值,我们称其为近端Bregman响应函数(PBRF)。由于PBRF仍然可以用来回答许多激励影响功能的问题,例如识别有影响力或标记的示例,因此我们的结果表明,影响功能估计的当前算法比以前的错误分析所暗示的更有用的结果。
translated by 谷歌翻译
在摘要域中,摘要的关键要求是与输入文档一致。以前的工作发现,当应用于不一致检测时,自然语言推理(NLI)模型不会竞争地执行。在这项工作中,我们重新访问NLI的使用进行不一致检测,发现过去的工作遭到了NLI数据集(句子级)与不一致检测(文档级别)之间的输入粒度不匹配。我们提供称为SummacConv的高效和轻量级方法,使NLI模型能够通过将文档分段为句子单元并在句子对之间聚合得分来成功地用于此任务。在我们的新推出的基准名为Summac(简介一致性)中由六个大的不一致检测数据集组成,SummacConv以74.4%的均衡精度获得最先进的结果,与现有工作相比,5%的点改进。我们制作可用的模型和数据集:https://github.com/tingofurro/summac
translated by 谷歌翻译
Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.
translated by 谷歌翻译
有针对性的训练集攻击将恶意实例注入训练集中,以导致训练有素的模型错误地标记一个或多个特定的测试实例。这项工作提出了目标识别的任务,该任务决定了特定的测试实例是否是训练集攻击的目标。目标识别可以与对抗性识别相结合,以查找(并删除)攻击实例,从而减轻对其他预测的影响,从而减轻攻击。我们没有专注于单个攻击方法或数据模式,而是基于影响力估计,这量化了每个培训实例对模型预测的贡献。我们表明,现有的影响估计量的不良实际表现通常来自于他们对训练实例和迭代次数的过度依赖。我们重新归一化的影响估计器解决了这一弱点。他们的表现远远超过了原始估计量,可以在对抗和非对抗环境中识别有影响力的训练示例群体,甚至发现多达100%的对抗训练实例,没有清洁数据误报。然后,目标识别简化以检测具有异常影响值的测试实例。我们证明了我们的方法对各种数据域的后门和中毒攻击的有效性,包括文本,视觉和语音,以及针对灰色盒子的自适应攻击者,该攻击者专门优化了逃避我们方法的对抗性实例。我们的源代码可在https://github.com/zaydh/target_indistification中找到。
translated by 谷歌翻译
在这项工作中,我们提出了一种将问题回答(QA)信号纳入摘要模型的方法。我们的方法通过自动生成由NPS回答的WH问题并自动确定在黄金摘要中是否回答这些问题,识别输入文档中的显着名词短语(NPS)。基于QA的信号被纳入了一种双级摘要模型,该模型首先使用分类模型在输入文档中标记突出NPS,然后有条件地生成摘要。我们的实验表明,使用基于QA的监督训练的模型产生了比在基准摘要数据集上识别突出跨度的基线方法的高质量摘要。此外,我们示出可以基于输入文档中标记的NPS来控制所产生的摘要的内容。最后,我们提出了一种增强培训数据的方法,因此黄金摘要与培训期间使用的标记的输入跨度更加一致,并展示了如何在学习更好地排除未标记的文档内容的模型中的结果。
translated by 谷歌翻译