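Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained models, substantial amounts of hallucinated content are found during human evaluation. Pre-trained models are most commonly fine-tuned with cross-entropy loss for text summarization, which may not be an optimal strategy. In this work, we provide a typology of factual errors with annotated data to highlight the types of errors and move away from a binary understanding of factuality. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning. Based on our linguistically informed typology of errors, we design different modular objectives that each target a specific error type. Specifically, we use hard negative samples that contain errors to reduce the generation of factual inconsistencies. To capture the key information exchanged between speakers, we also design a dialogue-specific loss. Using human evaluation and automatic faithfulness metrics, we show that our model significantly reduces all kinds of factual errors on a dialogue summarization dataset, the SAMSum corpus. Moreover, our model generalizes to meeting summarization on the AMI corpus, and it produces significantly higher scores than the baselines on both datasets with respect to word-overlap metrics.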
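As an illustration of what a hard negative sample containing an error might look like, the sketch below swaps two speaker names in a reference summary, which turns a faithful summary into a fluent but factually wrong one. This is an assumption made for illustration only: the abstract does not say how its negatives are built, and the function name `swap_speakers` and the example sentence are hypothetical.

```python
import re


def swap_speakers(summary: str, name_a: str, name_b: str) -> str:
    """Return a copy of `summary` with the two speaker names exchanged.

    Hypothetical helper: one simple way to create a negative sample that
    stays fluent but attributes actions to the wrong speaker.
    """
    # Match either name as a whole word so substrings of other words are untouched.
    pattern = re.compile(
        r"\b(" + re.escape(name_a) + "|" + re.escape(name_b) + r")\b"
    )
    # Replace both names in a single pass so the second swap cannot undo the first.
    return pattern.sub(
        lambda m: name_b if m.group(1) == name_a else name_a, summary
    )


reference = "Amanda will bring Jerry some cookies."
negative = swap_speakers(reference, "Amanda", "Jerry")
```

In a contrastive setup, `reference` would serve as the positive and `negative` as a hard negative for the same dialogue.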
translated by Google Translate
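We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participated in this shared task and submitted system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a large improvement over the baseline models on automatic evaluation metrics such as ROUGE scores, we find a salient gap between model-generated outputs and human-written summaries when both are judged by human evaluation along multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.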
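Output length is critical to dialogue summarization systems. The length of a dialogue summary is determined by multiple factors, including dialogue complexity, summary objective, and personal preferences. In this work, we approach dialogue summary length from three perspectives. First, we analyze the length differences between existing models' outputs and the corresponding human references, and find that summarization models tend to produce more verbose summaries due to their pre-training objectives. Second, we identify salient features for summary length prediction by comparing different model settings. Third, we experiment with a length-aware summarizer and show notable improvement over existing models if summary length can be well incorporated. Analysis and experiments are conducted on the popular DialogSum and SAMSum datasets to validate our findings.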
Abstractive dialogue summarization has received increasing attention recently. Although most current dialogue summarization systems are trained to maximize the likelihood of human-written summaries and have achieved significant results, there is still a large gap in generating summaries that humans judge to be high-quality in aspects such as coherence and faithfulness, partly due to the misalignment introduced by maximizing the likelihood of a single human-written summary. To this end, we propose to incorporate different levels of human feedback into the training process. This enables us to guide the models to capture the behaviors humans care about in summaries. Specifically, we ask humans to highlight the salient information to be included in summaries, which provides the local feedback, and to make overall comparisons among summaries in terms of coherence, accuracy, coverage, conciseness, and overall quality, which provides the global feedback. We then combine both local and global feedback to fine-tune the dialogue summarization policy with Reinforcement Learning. Experiments conducted on multiple datasets demonstrate the effectiveness and generalization of our methods over the state-of-the-art supervised baselines, especially in terms of human judgments.
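In this paper, we propose to leverage a unique characteristic of dialogues, namely that participants share commonsense knowledge, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that relies solely on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ uses commonsense as supervision, where the task of generating commonsense inferences is added to summarizing the dialogue in a multi-task learning setting. Experimental results show that, with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.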
Information overload creates a need for summarizers that extract salient information from text. Currently, there is an overload of dialogue data due to the rise of virtual communication platforms. The rise of Covid-19 has led people to rely on online communication platforms like Zoom, Slack, Microsoft Teams, Discord, etc. to conduct their company meetings. Instead of going through entire meeting transcripts, people can use meeting summarizers to select useful data. Nevertheless, there is a lack of comprehensive surveys in the field of meeting summarizers. In this survey, we aim to cover recent meeting summarization techniques. Our survey offers a general overview of text summarization along with datasets and evaluation metrics for meeting summarization. We also report the performance of each summarizer on a leaderboard. We conclude our survey with the open challenges in this domain and potential research opportunities for future researchers.
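Abstractive dialogue summarization is a challenging task for several reasons. First, most of the important information in a conversation is scattered across utterances through multi-party interactions with different textual styles. Second, dialogues are often informally structured, with different individuals expressing personal perspectives, unlike text summarization tasks that usually target formal documents such as news articles. To address these issues, we focus on the association between utterances from individual speakers and unique syntactic structures. Speakers have unique textual styles that can carry linguistic information, much like a voiceprint. Therefore, we construct a syntax-aware model by leveraging linguistic information (i.e., POS tagging), which alleviates the above issues by inherently distinguishing sentences uttered by individual speakers. We employ multi-task learning of both syntax-aware information and dialogue summarization. To the best of our knowledge, our approach is the first to apply multi-task learning to the dialogue summarization task. Experiments on the SAMSum corpus (a large-scale dialogue summarization corpus) demonstrate that our method improves upon the vanilla model. We further analyze the costs and benefits of our approach relative to baseline models.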
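Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. While the majority of research has focused on written documents, we have observed an increasing interest in the summarization of dialogues and multi-party conversations over the past few years. A system that could reliably transform the audio or transcript of a human conversation into an abridged version that homes in on the most important points of the discussion would be valuable in a wide variety of real-world contexts, from business meetings to medical consultations to customer service calls. This paper focuses on abstractive summarization for multi-party meetings, providing a survey of the challenges, datasets, and systems relevant to this task and a discussion of promising directions for future research.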
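Dialogue is an essential part of human communication and cooperation. Existing research mainly focuses on short dialogue scenarios in a one-on-one fashion. However, multi-person interactions in the real world, such as meetings or interviews, frequently exceed a few thousand words. There is still a lack of corresponding research and powerful tools to understand and process such long dialogues. Therefore, in this work, we present a pre-training framework for long dialogue understanding and summarization. Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training. For a dialogue, it corrupts a window of text with dialogue-inspired noise and guides the model to reconstruct this window based on the content of the remaining conversation. Furthermore, to process longer input, we augment the model with sparse attention, which is combined with conventional attention in a hybrid manner. We conduct extensive experiments on five datasets of long dialogues, covering the tasks of dialogue summarization, abstractive question answering, and topic segmentation. Experimentally, we show that our pre-trained model DialogLM significantly surpasses the state-of-the-art models across datasets and tasks. Source code and all the pre-trained models are available in our GitHub repository (https://github.com/microsoft/DialogLM).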
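The window-based denoising step described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the released DialogLM code: the abstract only says "dialogue-inspired noise", so the two noise operations used here (masking speaker names and shuffling the turns inside the window), the `MASK` token, and the function name `corrupt_window` are all hypothetical choices made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def corrupt_window(turns, window_size, rng):
    """Corrupt one contiguous window of dialogue turns.

    `turns` is a list of (speaker, text) pairs. Returns a tuple
    (noisy_turns, start, target), where `target` is the original window
    that a model would be trained to reconstruct from `noisy_turns`.
    """
    if not 0 < window_size <= len(turns):
        raise ValueError("window_size must be between 1 and len(turns)")

    # Choose where the window starts, then keep a clean copy as the target.
    start = rng.randrange(len(turns) - window_size + 1)
    target = turns[start:start + window_size]

    # Assumed noise 1: hide who is speaking inside the window.
    noisy_window = [(MASK, text) for _, text in target]
    # Assumed noise 2: shuffle the order of the turns inside the window.
    rng.shuffle(noisy_window)

    # Turns outside the window are left untouched as context.
    noisy_turns = turns[:start] + noisy_window + turns[start + window_size:]
    return noisy_turns, start, target


dialogue = [
    ("A", "Shall we start the meeting?"),
    ("B", "Yes, the agenda has three items."),
    ("C", "I can present the budget first."),
    ("A", "Go ahead."),
    ("B", "I will take notes."),
]
noisy, start, target = corrupt_window(dialogue, window_size=2, rng=random.Random(0))
```

A seq2seq model would then take `noisy` as input and be trained to generate `target`, so the only way to recover the window is to use the surrounding conversation.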
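Pre-trained language models have established the state of the art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews, or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work in this field has focused on English. In this work, we present a study using several language-specific pre-trained models, BARThez and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (call center) dialogue corpus, whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents, depending on the situation. Results show that the BARThez models offer the best performance, far above the previous state of the art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.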
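Dialogue summarization has been extensively studied and applied, and prior work has mainly focused on exploring superior model structures to align the input dialogue with the output summary. However, for professional dialogues (e.g., legal debate and medical diagnosis), semantic/statistical alignment can hardly fill the logical/factual gap between the input dialogue discourse and a summary output that draws on external knowledge. In this paper, we mainly investigate the factual inconsistency problem for Dialogue Inspectional Summarization (DIS) under non-pretraining and pretraining settings. We propose an innovative end-to-end dialogue summary generation framework with two auxiliary tasks: Expectant Factual Aspect Regularization (EFAR) and Missing Factual Entity Discrimination (MFED). Comprehensive experiments demonstrate that the proposed model can generate more readable summaries with accurate coverage of factual aspects, and can also inform the user of potentially missing facts detected from the input dialogue for further human intervention.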
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues are suboptimal because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one is produced by a fine-tuned summarization model, and the other is a collection of dialogue turns that convey important information. We then choose one of these pseudo summaries based on the difference in information distribution across different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
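In a typical customer service chat scenario, customers contact a support center to ask for help or raise complaints, and human agents try to resolve the issues. In most cases, at the end of the conversation, agents are asked to write a short summary emphasizing the problem and the proposed solution, usually for the benefit of other agents who may have to deal with the same customer or issue. The goal of this paper is to advance the automation of this task. We introduce the first large-scale, high-quality customer care dialog summarization dataset, with close to 6500 human-annotated summaries. The data is based on real-world customer support dialogs and includes both extractive and abstractive summaries. We also introduce a new unsupervised, extractive summarization method specific to dialogs.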
Controllable summarization allows users to generate customized summaries with specified attributes. However, due to the lack of designated annotations of controlled summaries, existing works have to craft pseudo datasets by adapting generic summarization benchmarks. Furthermore, most research focuses on controlling single attributes individually (e.g., a short summary or a highly abstractive summary) rather than controlling a mix of attributes together (e.g., a short and highly abstractive summary). In this paper, we propose MACSum, the first human-annotated summarization dataset for controlling mixed attributes. It contains source texts from two domains, news articles and dialogues, with human-annotated summaries controlled by five designed attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization based on hard prompt tuning and soft prefix tuning. Results and analysis demonstrate that hard prompt models yield the best performance on all metrics and human evaluations. However, mixed-attribute control is still challenging for summarization tasks. Our dataset and code are available at https://github.com/psunlpgroup/MACSum.
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
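The quest for health information has swamped the web with consumers' health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, which adds to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. To this end, we introduce a new dataset, CHQ-Summ, that contains 1507 domain-expert-annotated consumer health questions and corresponding summaries. The dataset is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.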
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
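Contrastive learning models have achieved great success in unsupervised visual representation learning, where they maximize the similarity between feature representations of different views of the same image while minimizing the similarity between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document, and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, in which we treat a document, its gold summary, and its model-generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings than its counterpart without contrastive objectives.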
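The idea of treating the document, the gold summary, and a generated summary as views whose representations should be similar can be written as a small objective. The sketch below is an assumption-laden illustration rather than the paper's loss: it uses plain Python lists as stand-ins for encoder representations, cosine similarity as the similarity measure, and a hypothetical function name `view_similarity_loss`.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def view_similarity_loss(views):
    """Negative mean pairwise cosine similarity over all views.

    Minimizing this value maximizes the similarity between the views,
    e.g. [document_vec, gold_summary_vec, generated_summary_vec].
    """
    pairs = list(combinations(views, 2))
    if not pairs:
        raise ValueError("need at least two views")
    return -sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 3-dimensional stand-ins for the three representations (made-up numbers).
document_vec = [0.9, 0.1, 0.3]
gold_summary_vec = [0.8, 0.2, 0.4]
generated_summary_vec = [0.1, 0.9, 0.2]
loss = view_similarity_loss([document_vec, gold_summary_vec, generated_summary_vec])
```

In actual training the vectors would come from the model and the loss would be minimized by gradient descent, typically alongside the usual generation loss; this sketch only shows the quantity being optimized.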
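Automatic meeting summarization is becoming increasingly popular these days. The ability to automatically summarize meetings and extract key information could greatly increase the efficiency of our work and life. In this paper, we experiment with different approaches to improve the performance of query-based meeting summarization. We start with HMNet \cite{HMNET}, a hierarchical network that employs both a word-level transformer and a turn-level transformer, as the baseline. We explore the effectiveness of pre-training the model on a large news summarization dataset. We investigate adding the embeddings of queries as part of the input vectors for query-based summarization. Furthermore, we extend the locate-then-summarize approach of QMSum \cite{QMSUM} with an intermediate clustering step. Lastly, we compare the baseline models with BART, a state-of-the-art language model that is effective for summarization. We achieve improved performance by adding query embeddings to the input of the model, by using BART as an alternative language model, and by using clustering methods to extract key information at the utterance level before feeding the text into summarization models.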
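Automatic medical question summarization can significantly help a system understand consumer health questions and retrieve correct answers. Seq2Seq models based on maximum likelihood estimation (MLE) have been applied to this task, but they face two general problems: the model cannot capture the question focus well, and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question focus-driven contrastive learning framework (QFCL). In particular, we propose a simple and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both the encoder and the decoder to obtain better sentence-level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, with performance gains of 5.33, 12.85, and 3.81 points over the baseline BART model on the three datasets, respectively. Further human judgment and detailed analysis show that our QFCL model learns better sentence representations with the ability to distinguish different sentence meanings, and generates high-quality summaries by capturing the question focus.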