Recent advances in large-scale pre-training provide large models with the potential to learn knowledge from the raw text. It is thus natural to ask whether it is possible to leverage these large models as knowledge bases for downstream tasks. In this work, we answer the aforementioned question in unsupervised knowledge-grounded conversation. We explore various methods that best elicit knowledge from large models. Our human study indicates that, though hallucinations exist, large models post the unique advantage of being able to output common sense and summarize facts that cannot be directly retrieved from the search engine. To better exploit such generated knowledge in dialogue generation, we treat the generated knowledge as a noisy knowledge source and propose the posterior-based reweighing as well as the noisy training strategy. Empirical results on two benchmarks show advantages over the state-of-the-art methods.
translated by 谷歌翻译
由于缺乏培训数据和异质知识来源,知识接地的对话系统是挑战的。由于培训数据中涵盖的有限主题,现有系统在不良主题上表现不佳。此外,异构知识源使系统概括到其他任务的系统,因为不同知识表示中的知识来源需要不同的知识编码器。为了解决这些挑战,我们呈现插头,将不同知识来源均匀化为知识接地的对话生成任务的统一知识来源的语言模型。插头在对话生成任务上进行预先培训,调节统一的基本知识表示。它可以通过一些培训示例概括到不同下游知识接地的对话一代任务。两个基准测试的实证评估表明,我们的模型越好跨越不同的知识接地任务。它可以在完全监督的设置下实现具有最先进的方法的可比性,并且显着优于零拍摄和少量拍摄设置中的其他方法。
translated by 谷歌翻译
The quality of knowledge retrieval is crucial in knowledge-intensive conversations. Two common strategies to improve the retrieval quality are finetuning the retriever or generating a self-contained query, while they encounter heavy burdens on expensive computation and elaborate annotations. In this paper, we propose an unsupervised query enhanced approach for knowledge-intensive conversations, namely QKConv. There are three modules in QKConv: a query generator, an off-the-shelf knowledge selector, and a response generator. Without extra supervision, the end-to-end joint training of QKConv explores multiple candidate queries and utilizes corresponding selected knowledge to yield the target response. To evaluate the effectiveness of the proposed method, we conducted comprehensive experiments on conversational question-answering, task-oriented dialogue, and knowledge-grounded conversation. Experimental results demonstrate that QKConv achieves state-of-the-art performance compared to unsupervised methods and competitive performance compared to supervised methods.
translated by 谷歌翻译
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledgegrounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/ google/BEGIN-dataset.
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
大型语言模型可以产生流畅的对话,但往往是幻觉的事实不准确。虽然检索式增强的模型有助于缓解这个问题,但他们仍然面临着推理的艰难挑战,以便同时提供正确的知识和产生对话。在这项工作中,我们提出了一种模块化模型,知识响应(K2R),将知识纳入会话代理商,这将这个问题分解为两个更简单的步骤。 K2R首先生成一个知识序列,给定对话背景作为中间步骤。在此“推理步骤”之后,该模型随后参加自己生成的知识序列,以及对话背景,以产生最终的响应。在详细的实验中,我们发现这种模型在知识接地的对话任务中少幻觉,并且在可解释性和模块化方面具有优势。特别地,它可以用来将QA和对话系统一起融合在一起,以使对话代理能够提供知识渊博的答案,或者QA模型,以在零拍摄设置中给出对话响应。
translated by 谷歌翻译
我们介绍了Godel(接地开放对话语言模型),这是对话框的大型预训练的语言模型。与诸如Dialogpt之类的早期模型相比,Godel利用了一个新的扎根预训练阶段,旨在更好地支持将Godel适应广泛的下游对话框任务,这些任务需要当前对话外部的信息(例如,数据库或文档)到产生良好的回应。针对一系列基准测试的实验,这些基准涵盖了面向任务的对话框,对话质量质量检查和接地的开放式对话框,表明Godel在几次以上的微调设置中优于最先进的预训练的对话模型,就人类和自动评估。我们评估方法的一个新颖特征是引入了一个效用概念,该概念除了其交流特征(内在评估)外,还评估了响应的有用性(外部评估)。我们表明,外部评估提供了改进的通道间一致性和与自动指标的相关性。代码和数据处理脚本公开可用。
translated by 谷歌翻译
生成的开放域对话系统可以从外部知识中受益,但是缺乏外部知识资源和寻找相关知识的困难限制了该技术的发展。为此,我们使用动态服务信息提出了一个知识驱动的对话任务。具体而言,我们使用大量的服务API,可以作为外部知识来源提供高覆盖范围和时空敏感性。对话系统生成查询以请求外部服务以及用户信息,获取相关知识,并基于此知识生成响应。为了实现此方法,我们收集并发布了第一个开放式域中国服务知识对话数据集Dusinc。同时,我们构建了一个基线模型柏拉图 - 线,该模型实现了对话的自动利用。自动评估和人类评估都表明,我们提出的新方法可以显着改善开放域对话的效果,并且与对话预培训模型Plato-2相比,人类评估中的会话级总数提高了59.29%。数据集和基准模型将被开源。
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
In open-domain dialogue intelligent agents should exhibit the use of knowledge, however there are few convincing demonstrations of this to date. The most popular sequence to sequence models typically "generate and hope" generic utterances that can be memorized in the weights of the model when mapping from input utterance(s) to output, rather than employing recalled knowledge as context. Use of knowledge has so far proved difficult, in part because of the lack of a supervised learning benchmark task which exhibits knowledgeable open dialogue with clear grounding. To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.
translated by 谷歌翻译
真实的人类对话数据是复杂,异质和嘈杂的,从该数据中构建开放域对话系统仍然是一项艰巨的任务。实际上,此类对话数据仍然包含大量信息和知识,但是,它们没有得到充分探索。在本文中,我们展示了现有的开放域对话生成方法,这些方法记住上下文响应配对的数据,并使用自动回归或编码模型模型不利于培训数据。与当前的方法不同,使用外部知识,我们探索了一个检索生成培训框架,该培训框架可以通过将它们视为“证据”来利用异质和嘈杂的培训数据。特别是,我们使用Bertscore进行检索,这给出了证据和一代的更好品质。公开可用数据集的实验表明,我们的方法可以帮助模型产生更好的响应,即使此类培训数据通常会留下深刻的印象为低质量数据。这种性能增益与通过扩大训练组更好的改进的绩效增益相当,甚至更好。我们还发现,模型性能与检索到的证据的相关性有正相关。此外,我们的方法在零拍实验上表现良好,这表明我们的方法对现实世界数据可能更强大。
translated by 谷歌翻译
通过社交媒体评论预先培训的许多开放域对话模型都可以产生连贯的答复,但在与真实用户互动时会产生引人入胜的答复。这种现象可能主要是由于注释的人类对话的不足以及与人类偏爱的未对准。在本文中,我们提出了一种新颖而有效的方法,以增强开放域聊天机器人,其中有两种人类反馈(包括明确的演示和隐性偏好),并利用了。通过要求注释者选择或修改模型生成的候选响应,Diamante有效地收集了人类证明的响应并构建了中国聊天数据集。为了增强与人类偏好的一致性,Diamante利用数据收集过程中的隐含偏好,并引入了生成评估联合培训。全面的实验表明,Diamante数据集和联合培训范式可以显着提高中国预训练的对话模型的性能。
translated by 谷歌翻译
预先接受的语言模型(PLM)在神经对话建模中标志着巨大的飞跃。虽然PLMS在大型文本语料库上进行预先培训,但通常在具有特定领域知识和对话风格的稀缺对话数据上进行微调。然而,在大型预先训练模型中充分利用现有知识的同时定制语言模型仍然是一个挑战。在本文中,我们提出了一种预先接受训练的对话建模的新方法,将对话生成问题作为一个快速学习任务。而不是在有限的对话数据上进行微调,我们的方法,DialogPrompt学习针对对话背景优化的连续提示嵌入,从而从大型预训练模型中促进了知识。为了鼓励模型更好地利用提示嵌入,提示编码器被设计为在输入对话框上下文中的条件。流行对话数据集的实验表明,我们的方法显着优于微调基线和通用及时学习方法。此外,人类评估强烈支持对DialialPrompt的优越性在响应生成质量方面。
translated by 谷歌翻译
我们提出了Blenderbot 3,这是一个175B参数对话模型,能够通过访问Internet和长期内存进行开放域对话,并接受了大量用户定义的任务的培训。我们同时发布了模型权重和代码,还将模型部署在公共网页上,以与有机用户进行交互。该技术报告描述了该模型的构建方式(建筑,模型和培训计划)以及其部署的细节,包括安全机制。人类评估表明,它优于现有的开放域对话代理,包括其前身(Roller等,2021; Komeili等,2022)。最后,我们使用部署收集的数据详细介绍了持续学习的计划,该数据也将公开发布。因此,该研究计划的目标是使社区能够研究通过互动学习的不断改进的负责任的代理商。
translated by 谷歌翻译
Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
translated by 谷歌翻译
Pre-trained language models (LMs) store knowledge in their parameters and can generate informative responses when used in conversational systems. However, LMs suffer from the problem of "hallucination:" they may generate plausible-looking statements that are irrelevant or factually incorrect. To address this problem, we propose a contrastive learning scheme, named MixCL. A novel mixed contrastive objective is proposed to explicitly optimize the implicit knowledge elicitation process of LMs, and thus reduce their hallucination in conversations. We also examine negative sampling strategies of retrieved hard negatives and model-generated negatives. We conduct experiments on Wizard-of-Wikipedia, a public, open-domain knowledge-grounded dialogue benchmark, and assess the effectiveness of MixCL. MixCL effectively reduces the hallucination of LMs in conversations and achieves the highest performance among LM-based dialogue agents in terms of relevancy and factuality. We show that MixCL achieves comparable performance to state-of-the-art KB-based approaches while enjoying notable advantages in terms of efficiency and scalability.
translated by 谷歌翻译
在本文中,我们建议利用对话的独特特征,共享参与者的常识性知识,以解决总结它们的困难。我们提出了病态的框架,该框架使用常识推论作为其他背景。与以前仅依赖于输入对话的工作相比,Sick使用外部知识模型来生成丰富的常识推断,并选择具有基于相似性选择方法的最可能的推理。基于生病的,病人++的理解为监督,在总结多任务学习环境中的对话时,添加了产生常识推断的任务。实验结果表明,通过注入常识性知识,我们的框架比现有方法产生更多信息和一致的摘要。
translated by 谷歌翻译
最近的大规模预训练的进步,例如GPT-3允许从给定提示生成看似高质量的文本。然而,这种一代系统经常遭受幻觉的事实问题,并且本身并不是旨在包含有用的外部信息。接地的代表似乎提供了补救措施,但他们的培训通常依赖于提供信息相关文件的很少可用的并行数据。我们提出了一个框架,通过在语言模型信号上共同训练接地的发生器和文档检索来缓解这种数据约束。该模型学会奖励具有生成中最高效用的文档的检索,并用专家混合(MOE)合并来术语术,以产生后续文本。我们证明,发电机和猎犬都可以利用这种联合培训,协同作用,以生产散文和对话一代中的更多信息和相关文本。
translated by 谷歌翻译
在寻求信息的对话中,用户与代理商进行对话,以提出一系列通常可以不足或过度指定的问题。理想的代理商首先将通过搜索其基本知识来源,然后与用户进行适当互动以解决它,从而确定他们处于这种情况。但是,大多数现有研究都无法或人为地纳入此类代理端计划。在这项工作中,我们介绍了Inscit(发音为Insight),这是一种用于与混合互动相互作用的信息寻求对话的数据集。它包含从805个人类对话中进行的4.7k用户代理转弯,代理商对Wikipedia进行搜索,并要求澄清或提供相关信息以解决用户查询。我们定义了两个子任务,即证据通过识别和响应产生,以及一种新的人类评估协议来评估模型绩效。我们根据对话知识识别和开放域问题的最新模型报告了两个强大的基线的结果。这两种模型都显着不足,并且没有产生连贯和信息丰富的反应,这表明未来的研究有足够的改进空间。
translated by 谷歌翻译
医学对话生成是一项重要但具有挑战性的任务。以前的大多数作品都依赖于注意力机制和大规模预处理的语言模型。但是,这些方法通常无法从长时间的对话历史中获取关键信息,从而产生准确和信息丰富的响应,因为医疗实体通常散布在多种话语中以及它们之间的复杂关系。为了减轻此问题,我们提出了一个具有关键信息召回(Medpir)的医疗响应生成模型,该模型建立在两个组件上,即知识吸引的对话图形编码器和召回增强的生成器。知识吸引的对话图编码器通过利用话语中的实体之间的知识关系,并使用图形注意力网络对话图来构建对话图。然后,召回增强的发电机通过在产生实际响应之前生成对话的摘要来增强这些关键信息的使用。两个大型医学对话数据集的实验结果表明,Medpir在BLEU分数和医疗实体F1度量中的表现优于强大的基准。
translated by 谷歌翻译