隐性知识,例如常识,是人类对话的关键。当前的神经反应生成(RG)模型经过训练以直接产生响应,省略了未阐明的隐式知识。在本文中,我们介绍了说话之前的思维(TBS),这是一种首先将隐式常识知识(思考)外部化的生成方法(思考),并使用这些知识来产生响应(speak)。我们期望外部化隐式知识可以更有效地学习,产生更多信息的响应,并实现了更多可解释的模型。我们分析了不同的选择,以收集知识一致的对话,代表隐式知识以及知识和对话之间的过渡。经验结果表明,TBS模型在大多数自动指标上优于端到端和知识增强的RG基准,并通过人类注释者评估,产生更有信息,具体和常识性遵循的响应。 TBS还产生了有意义的知识,并且与85 \%左右的对话有关。
translated by 谷歌翻译
Dialogue systems can leverage large pre-trained language models and knowledge to generate fluent and informative responses. However, these models are still prone to produce hallucinated responses not supported by the input source, which greatly hinders their application. The heterogeneity between external knowledge and dialogue context challenges representation learning and source integration, and further contributes to unfaithfulness. To handle this challenge and generate more faithful responses, this paper presents RHO ($\rho$) utilizing the representations of linked entities and relation predicates from a knowledge graph (KG). We propose (1) local knowledge grounding to combine textual embeddings with the corresponding KG embeddings; and (2) global knowledge grounding to equip RHO with multi-hop reasoning abilities via the attention mechanism. In addition, we devise a response re-ranking technique based on walks over KG sub-graphs for better conversational reasoning. Experimental results on OpenDialKG show that our approach significantly outperforms state-of-the-art methods on both automatic and human evaluation by a large margin, especially in hallucination reduction (17.54% in FeQA).
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
We present SODA: the first publicly available, million-scale high-quality social dialogue dataset. Using SODA, we train COSMO: a generalizable conversation agent outperforming previous best-performing agents on both in- and out-of-domain datasets. In contrast to most existing crowdsourced, small-scale dialogue corpora, we distill 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., 2022). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than prior human-authored datasets - e.g., DailyDialog (Li et al., 2017), BlendedSkillTalk (Smith et al., 2020). In addition, extensive evaluations show that COSMO is significantly more natural and consistent on unseen datasets than best-performing dialogue models - e.g., GODEL (Peng et al., 2022), BlenderBot (Roller et al., 2021), DialoGPT (Zhang et al., 2020). Furthermore, it is sometimes even preferred to the original human-written gold responses. We make our data, models, and code public.
translated by 谷歌翻译
大型语言模型可以产生流畅的对话,但往往是幻觉的事实不准确。虽然检索式增强的模型有助于缓解这个问题,但他们仍然面临着推理的艰难挑战,以便同时提供正确的知识和产生对话。在这项工作中,我们提出了一种模块化模型,知识响应(K2R),将知识纳入会话代理商,这将这个问题分解为两个更简单的步骤。 K2R首先生成一个知识序列,给定对话背景作为中间步骤。在此“推理步骤”之后,该模型随后参加自己生成的知识序列,以及对话背景,以产生最终的响应。在详细的实验中,我们发现这种模型在知识接地的对话任务中少幻觉,并且在可解释性和模块化方面具有优势。特别地,它可以用来将QA和对话系统一起融合在一起,以使对话代理能够提供知识渊博的答案,或者QA模型,以在零拍摄设置中给出对话响应。
translated by 谷歌翻译
在本文中,我们介绍了基于大型预训练的语言模型(PLM)pangu-alpha(Zeng等,2021)的中国预训练的开放域对话生成模型。与其他对大量对话数据进行培训的预训练的对话模型不同,我们旨在通过继承PLM的有价值的语言能力和知识来构建强大的对话模型,并以相对较少的数据和计算成本构建强大的对话模型。为此,我们训练大型PLM Pangu-Alpha的Pangu-bot,该机器人已被证明在各种中国自然语言任务上表现出色。我们研究了pangu-bot产生的响应的不同方面,包括响应质量,知识和安全性。我们表明,Pangu-Bot优于最先进的中国对话系统(CDIALGPT(Wang等,2020),Eva(Zhou等,2021),EVA2.0(Gu等,2022)) W.R.T.以上三个方面。我们还证明,可以轻松地部署pangu-bot,以在没有进一步训练的情况下产生情感反应。在整个经验分析中,我们还指出,Pangu-bot响应质量,知识正确性和安全性仍然远非完美,进一步的探索对于建立可靠且智能的对话系统是必不可少的。我们的型号和代码将在https://github.com/huawei-noah/pretretaining-language-model/tree/master/master/pangu-bot上提供。
translated by 谷歌翻译
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledgegrounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/ google/BEGIN-dataset.
translated by 谷歌翻译
The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent on the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. But, while early effort on rule-based systems found limited success, the emergence of deep learning enabled great advance on this topic. In this thesis, we focus on methods that address the numerous issues that have been imposing the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and experimented with in ways that were inspired by general state-of-the-art AI methodologies. But they also targeted the characteristics that dialogue systems possess.
translated by 谷歌翻译
由于缺乏培训数据和异质知识来源,知识接地的对话系统是挑战的。由于培训数据中涵盖的有限主题,现有系统在不良主题上表现不佳。此外,异构知识源使系统概括到其他任务的系统,因为不同知识表示中的知识来源需要不同的知识编码器。为了解决这些挑战,我们呈现插头,将不同知识来源均匀化为知识接地的对话生成任务的统一知识来源的语言模型。插头在对话生成任务上进行预先培训,调节统一的基本知识表示。它可以通过一些培训示例概括到不同下游知识接地的对话一代任务。两个基准测试的实证评估表明,我们的模型越好跨越不同的知识接地任务。它可以在完全监督的设置下实现具有最先进的方法的可比性,并且显着优于零拍摄和少量拍摄设置中的其他方法。
translated by 谷歌翻译
预训练的语言模型(PTLM)已显示出在自然语言任务上表现良好。许多先前的作品都以通过知识图(KGS)标记的关系链接的实体的形式利用结构性常识来协助PTLM。检索方法使用kg作为单独的静态模块,该模块限制了覆盖范围,因为kgs包含有限的知识。生成方法训练PTLMS kg三倍以提高获得知识的规模。但是,对符号KG实体的培训限制了其在涉及自然语言文本的任务中的适用性,在这些任务中,它们忽略了整体上下文。为了减轻这种情况,我们提出了一个以句子为条件的常识性上下文化器(COSE-CO)作为输入,以使其在生成与输入文本的整体上下文相关的任务中通常可用。为了训练Cose-Co,我们提出了一个新的数据集,其中包括句子和常识知识对。 COSE-CO推断出的知识是多种多样的,并且包含了基础KG中不存在的新实体。我们增强了在多选质量质量检查和开放式常识性推理任务中产生的知识,从而改善了CSQA,ARC,QASC和OBQA数据集的当前最佳方法。我们还展示了其在改善释义生成任务的基线模型方面的适用性。
translated by 谷歌翻译
在本文中,我们建议利用对话的独特特征,共享参与者的常识性知识,以解决总结它们的困难。我们提出了病态的框架,该框架使用常识推论作为其他背景。与以前仅依赖于输入对话的工作相比,Sick使用外部知识模型来生成丰富的常识推断,并选择具有基于相似性选择方法的最可能的推理。基于生病的,病人++的理解为监督,在总结多任务学习环境中的对话时,添加了产生常识推断的任务。实验结果表明,通过注入常识性知识,我们的框架比现有方法产生更多信息和一致的摘要。
translated by 谷歌翻译
对事件序列的预测对于信息检索和自然语言处理中的许多现实世界应用至关重要。在事件序列预测中,未来的活动生成(FEG)是一项具有挑战性的任务,因为它不仅需要流利的文本生成,而且需要常识性推理才能保持整个事件故事的逻辑连贯性。在本文中,我们提出了一个新颖的可解释的FEG框架COEP。它突出并整合了两种类型的事件知识,对直接事件事件关系的顺序知识以及推论知识,这些知识反映了事件之间的中间角色心理学(例如意图,原因,反应),这些心理本质地将故事推向了故事。为了减轻知识遗忘问题,我们为每种类型的知识设计了两个模块,即IM和GM,它们是通过及时调整组合的。首先,IM专注于理解推论知识,以产生常识性解释并为通用汽车提供软提示向量。我们还设计了一种对比歧视器,以提高概括能力。其次,GM通过用IM的指导对直接顺序知识进行建模来生成未来事件。自动和人类评估表明,我们的方法可以产生更连贯,具体和逻辑的未来事件。
translated by 谷歌翻译
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
translated by 谷歌翻译
真实的人类对话数据是复杂,异质和嘈杂的,从该数据中构建开放域对话系统仍然是一项艰巨的任务。实际上,此类对话数据仍然包含大量信息和知识,但是,它们没有得到充分探索。在本文中,我们展示了现有的开放域对话生成方法,这些方法记住上下文响应配对的数据,并使用自动回归或编码模型模型不利于培训数据。与当前的方法不同,使用外部知识,我们探索了一个检索生成培训框架,该培训框架可以通过将它们视为“证据”来利用异质和嘈杂的培训数据。特别是,我们使用Bertscore进行检索,这给出了证据和一代的更好品质。公开可用数据集的实验表明,我们的方法可以帮助模型产生更好的响应,即使此类培训数据通常会留下深刻的印象为低质量数据。这种性能增益与通过扩大训练组更好的改进的绩效增益相当,甚至更好。我们还发现,模型性能与检索到的证据的相关性有正相关。此外,我们的方法在零拍实验上表现良好,这表明我们的方法对现实世界数据可能更强大。
translated by 谷歌翻译
Pre-trained language models (LMs) store knowledge in their parameters and can generate informative responses when used in conversational systems. However, LMs suffer from the problem of "hallucination:" they may generate plausible-looking statements that are irrelevant or factually incorrect. To address this problem, we propose a contrastive learning scheme, named MixCL. A novel mixed contrastive objective is proposed to explicitly optimize the implicit knowledge elicitation process of LMs, and thus reduce their hallucination in conversations. We also examine negative sampling strategies of retrieved hard negatives and model-generated negatives. We conduct experiments on Wizard-of-Wikipedia, a public, open-domain knowledge-grounded dialogue benchmark, and assess the effectiveness of MixCL. MixCL effectively reduces the hallucination of LMs in conversations and achieves the highest performance among LM-based dialogue agents in terms of relevancy and factuality. We show that MixCL achieves comparable performance to state-of-the-art KB-based approaches while enjoying notable advantages in terms of efficiency and scalability.
translated by 谷歌翻译
We propose a novel task, G4C (Goal-driven Guidance Generation in Grounded Communication), for studying goal-driven and grounded natural language interactions. Specifically, we choose Dungeons and Dragons (D&D) -- a role-playing game consisting of multiple player characters and a Dungeon Master (DM) who collaborate to achieve a set of goals that are beneficial to the players -- as a testbed for this task. Here, each of the player characters is a student, with their own personas and abilities, and the DM is the teacher, an arbitrator of the rules of the world and responsible for assisting and guiding the students towards a global goal. We propose a theory-of-mind-inspired methodology for training such a DM with reinforcement learning (RL), where a DM: (1) learns to predict how the players will react to its utterances using a dataset of D&D dialogue transcripts; and (2) uses this prediction as a reward function providing feedback on how effective these utterances are at guiding the players towards a goal. Human and automated evaluations show that a DM trained with RL to generate guidance by incorporating a theory-of-mind of the players significantly improves the players' ability to achieve goals grounded in their shared world.
translated by 谷歌翻译
我们介绍了Sparrow,这是一个寻求信息的对话代理,与提示的语言模型基线相比,训练有素,更有帮助,正确和无害。我们使用从人类反馈中的强化学习来培训我们的模型,以帮助人类评估者判断代理人的行为。首先,为了使我们的代理人更有帮助和无害,我们将良好对话的要求分解为代理人应遵循的自然语言规则,并分别向评估者询问每个规则。我们证明,这种崩溃使我们能够收集对代理行为的更多针对性的人类判断,并允许更有效的规则条件奖励模型。其次,我们的代理商在收集对模型声明的偏好判决时提供了支持事实主张的来源的证据。对于事实问题,麻雀提供的证据支持了78%的时间。比基线比基线更享受麻雀,同时对人类的对抗性探测更具弹性,在探测时只有8%的时间违反了我们的规则。最后,我们进行了广泛的分析,表明尽管我们的模型学会遵守我们的规则,但它可以表现出分布偏见。
translated by 谷歌翻译
对话式AI中的现有研究主要将面向任务的对话框(TOD)和问题答案(QA)视为单独的任务。为了构建可以完成用户任务和支持信息寻求信息的对话代理的目标,构建一个可以访问各种外部知识的系统,构建一个处理TOD和QA的系统非常重要。在这项工作中,我们提出了一项新任务,开放式TOD(OB-TOD),将TOD与QA任务相结合,并将外部知识源扩展到包括明确的知识源(例如Web)和隐式知识源(例如,例如,预训练的语言模型)。我们创建了一个新的数据集ob-multiwoz,在这里,我们在其中丰富了Tod会议,并使用类似QA的信息寻求基于外部知识的经验。我们提出了一个统一的模型Opera(开放式末端到端任务对话框),可以适当地访问明确和隐性的外部知识,以解决定义的任务。实验结果表明,与闭环基线相比,Opera的表现出色,并说明了两种知识类型的价值。
translated by 谷歌翻译
We present a large, tunable neural conversational response generation model, DIALOGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent opendomain dialogue systems.
translated by 谷歌翻译
最近的大规模预训练的进步,例如GPT-3允许从给定提示生成看似高质量的文本。然而,这种一代系统经常遭受幻觉的事实问题,并且本身并不是旨在包含有用的外部信息。接地的代表似乎提供了补救措施,但他们的培训通常依赖于提供信息相关文件的很少可用的并行数据。我们提出了一个框架,通过在语言模型信号上共同训练接地的发生器和文档检索来缓解这种数据约束。该模型学会奖励具有生成中最高效用的文档的检索,并用专家混合(MOE)合并来术语术,以产生后续文本。我们证明,发电机和猎犬都可以利用这种联合培训,协同作用,以生产散文和对话一代中的更多信息和相关文本。
translated by 谷歌翻译