Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call their robustness into question, since they show sharp drops in JGA for conversations containing utterances or dialogue flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitates comparison of DST models on comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art numbers on JGA, since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research on developing task-oriented dialogue models that embody the strengths of various methods.
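A CheckDST-style comparison can be pictured as scoring one model on the original test set and on perturbed copies of it, then reporting the JGA drop. The sketch below is only an illustration: `predict_state`, the dialogue layout, and the single entity-swap perturbation are assumptions standing in for CheckDST's richer augmentations.

```python
import copy

def joint_goal_accuracy(predict_state, dialogues):
    """Fraction of turns whose full predicted slot-value set is correct."""
    turns = [t for d in dialogues for t in d["turns"]]
    hits = sum(predict_state(t["history"]) == t["gold_state"] for t in turns)
    return hits / max(len(turns), 1)

def swap_named_entities(dialogues, entity_map):
    """Replace seen entity names with unseen ones in both utterances and
    gold states, so labels stay consistent with the perturbed text."""
    perturbed = copy.deepcopy(dialogues)
    for d in perturbed:
        for t in d["turns"]:
            for old, new in entity_map.items():
                t["history"] = t["history"].replace(old, new)
                t["gold_state"] = {s: v.replace(old, new)
                                   for s, v in t["gold_state"].items()}
    return perturbed

def checkdst_report(predict_state, test_set, entity_map):
    base = joint_goal_accuracy(predict_state, test_set)
    aug = joint_goal_accuracy(predict_state,
                              swap_named_entities(test_set, entity_map))
    return {"original_jga": base, "unseen_entity_jga": aug, "drop": base - aug}
```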
While interacting with a user, a task-oriented dialogue system has to track the user's needs at each turn according to the conversation history. This process, called dialogue state tracking (DST), is crucial because it directly informs the downstream dialogue policy. DST has received a lot of attention in recent years, with the text-to-text paradigm emerging as the favored approach. In this review paper, we first present the task and its associated datasets. Then, considering the large number of recent publications, we identify highlights and advances of research in 2021-2022. Although neural approaches have enabled significant progress, we argue that some critical aspects of dialogue systems, such as generalizability, are still underexplored. To motivate future studies, we propose several research avenues.
Training dialogue systems often entails dealing with noisy training examples and unexpected user inputs. Despite their prevalence, there is currently no accurate survey of dialogue noise, nor a clear sense of the impact of each noise type on task performance. This paper addresses this gap by first constructing a taxonomy of noise encountered by dialogue systems. In addition, we run a series of experiments to show how different models behave when subjected to varying levels and types of noise. Our results reveal that models are quite robust to label errors commonly tackled by existing denoising algorithms, but that performance suffers from dialogue-specific noise. Driven by these observations, we design a data cleaning algorithm specialized for conversational settings and apply it as a proof-of-concept for targeted dialogue denoising.
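A toy version of such a controlled-noise experiment: corrupt a fraction p of the training labels and retrain at each level. Everything here (the data layout and label set) is illustrative, not the paper's actual setup.

```python
import random

def inject_label_noise(examples, p, label_set, seed=0):
    """Return a copy of `examples` where a fraction p of labels are
    replaced with a different label drawn uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for text, label in examples:
        if rng.random() < p:
            label = rng.choice([l for l in label_set if l != label])
        noisy.append((text, label))
    return noisy

# Usage sketch: sweep noise levels and watch task performance degrade.
# for p in (0.0, 0.1, 0.2, 0.4):
#     train = inject_label_noise(clean_train, p, labels)
#     ...train a model on `train` and evaluate on a clean test set...
```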
Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that, given a large-scale dialogue dataset in one language, we can automatically produce effective semantic parsers for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values, eliminating the costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model that encodes the formal slots and values together with only the last agent and user utterances. We show that this concise representation reduces the compounding effect of translation errors without harming accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On the RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets, showing that erroneous annotations can obscure judgments of model quality. Finally, we create English and German versions of the RiSAWOZ dataset using the recommended methodology. Accuracy on these datasets is within 11% of the original, indicating that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotation.
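The "concise representation" described above feeds the parser the formal dialogue state plus only the last exchange rather than the full history. A hypothetical rendering of that input serialization; the tag names are invented for the sketch, not the paper's exact format.

```python
def build_parser_input(state, last_agent_utt, last_user_utt):
    """Serialize the formal slot-value state and the last agent/user
    utterances into a single model input string."""
    slots = " ; ".join(f"{slot} = {value}" for slot, value in state.items())
    return f"<state> {slots} <agent> {last_agent_utt} <user> {last_user_utt}"

print(build_parser_input(
    {"hotel-area": "north", "hotel-stars": "4"},
    "I found 3 hotels in the north.",
    "Great, book the cheapest one for 2 nights."))
```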
Dialogue State Tracking (DST), a key component of task-oriented conversation systems, represents user intentions by determining the values of pre-defined slots in an ongoing dialogue. Existing approaches use hand-crafted templates and additional slot information to fine-tune and prompt large pre-trained language models and elicit slot values from the dialogue context. Significant manual effort and domain knowledge are required to design effective prompts, limiting the generalizability of these approaches to new domains and tasks. In this work, we propose DiSTRICT, a generalizable in-context tuning approach for DST that retrieves highly relevant training examples for a given dialogue to fine-tune the model without any hand-crafted templates. Experiments with the MultiWOZ benchmark datasets show that DiSTRICT outperforms existing approaches in various zero-shot and few-shot settings using a much smaller model, thereby providing an important advantage for real-world deployments that often have limited resource availability.
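The retrieval step can be approximated with any off-the-shelf text similarity; below, TF-IDF cosine similarity stands in for DiSTRICT's actual retriever, purely as a sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_examples(query_dialogue, training_dialogues, k=3):
    """Return the k training dialogues most similar to the query.
    TF-IDF is a stand-in for the paper's learned retriever."""
    matrix = TfidfVectorizer().fit_transform(training_dialogues + [query_dialogue])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [training_dialogues[i] for i in ranked]
```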
The primary purpose of dialogue state tracking (DST), a key component of end-to-end conversational systems, is to build a model that responds well to real-world situations. Although we often change our minds from time to time during ordinary conversations, current benchmark datasets do not adequately reflect such occurrences; instead they consist of over-simplified conversations in which no one changes their mind during the dialogue. As the main question motivating this study, we ask: "Are current benchmark datasets sufficient to handle casual conversations in which one changes their mind after a certain topic is concluded?" We find that the answer is "No", because simply injecting template-based turn-back utterances significantly degrades DST model performance. Test joint goal accuracy on MultiWOZ drops by more than 5 percentage points when the simplest form of turn-back utterance is injected. Moreover, the performance degradation worsens when facing more complicated turn-back situations. However, we also observe that performance rebounds when turn-backs are appropriately included in the training dataset, implying that the problem lies not with the DST models but with the construction of the benchmark dataset.
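The perturbation described amounts to appending a templated "change of mind" utterance after a topic is concluded and updating the gold state to match. A minimal illustration; the template and data layout are invented for the sketch.

```python
def inject_turnback(turns, slot, new_value,
                    template="Oh wait, actually change {slot} to {value}."):
    """Append a templated turn-back utterance that revises one slot.
    `turns` is a list of {"utterance": ..., "state": {...}} dicts."""
    revised_state = dict(turns[-1]["state"])
    revised_state[slot] = new_value
    return turns + [{
        "utterance": template.format(slot=slot, value=new_value),
        "state": revised_state,
    }]
```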
Even though machine learning has become the major scene in the dialogue research community, the real breakthrough has been blocked by the scale of data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work, apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions, is two-fold: firstly, a detailed description of the data collection procedure along with a summary of data structure and analysis is provided. The proposed data-collection pipeline is entirely based on crowd-sourcing without the need to hire professional annotators; secondly, a set of benchmark results for belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.
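For readers unfamiliar with the annotation, a MultiWOZ-style belief state is a set of domain-slot-value triples accumulated across the dialogue. The snippet below is illustrative of the shape only, not the dataset's literal JSON schema.

```python
# Illustrative shape of a MultiWOZ-style annotated turn (keys are assumed).
turn = {
    "user": "I need a cheap hotel in the north, and a taxi at 7pm.",
    "belief_state": {
        "hotel-pricerange": "cheap",
        "hotel-area": "north",
        "taxi-leaveat": "19:00",
    },
    "dialogue_act": {"Hotel-Inform": [["pricerange", "cheap"],
                                      ["area", "north"]]},
}
```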
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented on measuring and mitigating hallucinated text, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a model for dialogue state tracking capable of zero-shot generalization to new APIs, while remaining competitive in the regular setting.
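In the schema-guided paradigm, slot and intent descriptions arrive as input, so a tracker can score them without service-specific parameters. A toy sketch using word overlap in place of a learned shared encoder; the schema entries are invented.

```python
def score_slot(utterance, slot_description):
    """Toy relevance score: word overlap between utterance and description.
    A real schema-guided model embeds both with a shared encoder."""
    u = set(utterance.lower().split())
    d = set(slot_description.lower().split())
    return len(u & d) / max(len(d), 1)

schema = {  # invented example schema entries
    "restaurant-cuisine": "the cuisine of food served by the restaurant",
    "restaurant-city": "the city where the restaurant is located",
}
utt = "find me a thai food restaurant in the city of oakland"
print(max(schema, key=lambda s: score_slot(utt, schema[s])))
```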
Zero/few-shot transfer to unseen services is a key challenge in task-oriented dialogue research. The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot fashion through schemas, which describe service APIs to the model in natural language. We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X, a benchmark that extends SGD with semantically similar yet stylistically diverse variants of every schema. We observe that two top state tracking models fail to generalize across schema variants, as measured by joint goal accuracy and a novel metric for quantifying schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness.
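The paper's exact schema-sensitivity metric is not reproduced here; a plausible proxy, sketched below under that assumption, is the spread of a model's JGA across the variants of each schema.

```python
from statistics import mean, pstdev

def schema_sensitivity(jga_per_variant):
    """Proxy for schema sensitivity: coefficient of variation of JGA
    across schema variants (0 means fully robust to rewording)."""
    m = mean(jga_per_variant)
    return pstdev(jga_per_variant) / m if m else 0.0

# e.g. one model's JGA on the original schema and five paraphrased variants:
print(schema_sensitivity([0.62, 0.55, 0.49, 0.58, 0.51, 0.54]))
```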
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. The evaluation is often limited to single turns or is very time-intensive. As an alternative, user simulators that mimic user behavior allow us to consider a broad set of user goals when generating human-like conversations for simulated evaluation. Employing existing user simulators to evaluate TDSs is challenging, as user simulators are primarily designed to optimize the dialogue policy of a TDS and have limited evaluation capabilities. Moreover, the evaluation of user simulators is itself an open challenge. In this work, we propose a metaphorical user simulator for end-to-end TDS evaluation, defining a simulator as metaphorical if it simulates the user's analogical thinking in interactions with the system. We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities. Our user simulator constructs a metaphorical user model that assists the simulator in reasoning by referring to prior knowledge when encountering new items. We estimate the quality of the simulator by examining the simulated interactions between the simulator and the variants. Our experiments are conducted on three TDS datasets. The metaphorical user simulator shows better consistency with manual evaluation than an agenda-based simulator and a seq2seq model on all three datasets. Our tester framework demonstrates efficiency, as well as better generalizability and scalability, since it can be applied to dialogues in multiple domains and to multiple tasks, such as conversational recommendation and e-commerce dialogues.
Dialogue state tracking models play an important role in task-oriented dialogue systems. However, most of them model the slot types conditionally independently given the input. We find that this can lead the model to be confused by slot types that share the same data type. To mitigate the issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they alleviate the confusion described above and push the state of the art on the benchmark dataset from 58.7 to 61.3. Our implementation is available at https://github.com/ctinray/trippy-joint.
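The move from independent to joint slot modeling can be pictured as follows: per-slot features pass through a recurrent layer so each slot's logits can condition on the others. This is a toy sketch, with assumed dimensions and wiring, not TripPy-LSTM's actual architecture.

```python
import torch
import torch.nn as nn

class JointSlotHead(nn.Module):
    """Illustrative joint head: an LSTM over per-slot features so
    predictions for same-typed slots can disambiguate each other."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, num_classes)

    def forward(self, slot_feats):          # (batch, num_slots, feat_dim)
        hidden, _ = self.lstm(slot_feats)   # contextualized across slots
        return self.out(hidden)             # (batch, num_slots, num_classes)

head = JointSlotHead(feat_dim=64, num_classes=5)
logits = head(torch.randn(2, 30, 64))  # 30 slots, 5 classes per slot
```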
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent by the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. But, while early efforts on rule-based systems found limited success, the emergence of deep learning enabled great advances on this topic. In this thesis, we focus on methods that address the numerous issues that have maintained the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and experimented with in ways inspired by general state-of-the-art AI methodologies, but they also targeted the characteristics that dialogue systems possess.
Recently, pre-training methods have shown great success in task-oriented dialogue (TOD) systems. However, most existing pre-trained models for TOD focus on either dialogue understanding or dialogue generation, but not both. In this paper, we propose SPACE-3, a novel unified semi-supervised pre-trained conversation model that learns from large-scale dialogue corpora with limited annotations and can be effectively fine-tuned on a wide range of downstream dialogue tasks. Specifically, SPACE-3 consists of four successive components in a single transformer that maintain the task flow in TOD systems: (i) a dialogue encoding module to encode the dialogue history, (ii) a dialogue understanding module to extract semantic vectors from either user queries or system responses, (iii) a dialogue policy module to generate a policy vector containing the high-level semantics of the response, and (iv) a dialogue generation module to produce appropriate responses. We design a dedicated pre-training objective for each component. Concretely, we pre-train the dialogue encoding module with span masked language modeling to learn contextualized dialogue information. To capture structured dialogue semantics, we pre-train the dialogue understanding module via a novel tree-induced semi-supervised contrastive learning objective with the help of extra dialogue annotations. In addition, we pre-train the dialogue policy module for policy optimization by minimizing the L2 distance between its output policy vector and the semantic vector of the response. Finally, the dialogue generation module is pre-trained by language modeling. Results show that SPACE-3 achieves state-of-the-art performance on eight downstream dialogue benchmarks, including intent prediction, dialogue state tracking, and end-to-end dialogue modeling. We also show that SPACE-3 has a stronger few-shot ability than existing models under low-resource settings.
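The policy-module objective described above reduces to an L2 distance between two vectors; below is a minimal PyTorch rendering of that objective, with the vector dimension assumed for the example.

```python
import torch

def policy_loss(policy_vec, response_semantic_vec):
    """L2 distance between the dialogue policy vector and the semantic
    vector of the ground-truth response, averaged over the batch."""
    return torch.norm(policy_vec - response_semantic_vec, p=2, dim=-1).mean()

loss = policy_loss(torch.randn(8, 256), torch.randn(8, 256))
```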
Training and evaluating language models increasingly requires the construction of meta-datasets: diverse, curated collections of data with clear provenance. Natural language prompting has recently improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pre-training tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical.
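Programmatic access of this kind typically goes through the Hugging Face `datasets` API; the dataset and config names below are placeholders, not real identifiers, so consult the repository for the actual ones.

```python
from datasets import load_dataset

# Dataset id and config name are illustrative placeholders; see
# https://github.com/bigscience-workshop/biomedical for real identifiers.
ds = load_dataset("bigbio/example_dataset", name="example_bigbio_kb")
print(ds["train"][0])
```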
In this paper, we propose to leverage a unique characteristic of dialogues, the commonsense knowledge shared among participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added to the dialogue summarization task in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
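The similarity-based selection step can be sketched as: generate candidate inferences with an external commonsense model, then keep the one closest to the dialogue context. TF-IDF cosine similarity stands in for SICK's scorer here, and the candidate inferences are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_inference(context, candidates):
    """Pick the commonsense inference most similar to the dialogue context.
    TF-IDF similarity is a stand-in for the paper's selection method."""
    m = TfidfVectorizer().fit_transform([context] + candidates)
    scores = cosine_similarity(m[0], m[1:]).ravel()
    return candidates[scores.argmax()]

context = "A: I missed the deadline again. B: You should talk to the professor."
candidates = [  # invented outputs of an external knowledge model
    "PersonA feels anxious about the deadline.",
    "PersonB wants to eat lunch.",
]
print(select_inference(context, candidates))
```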
Diverse data formats and ontologies of task-oriented dialogue (TOD) datasets hinder us from developing general dialogue models that perform well on many datasets and studying knowledge transfer between datasets. To address this issue, we present ConvLab-3, a flexible dialogue system toolkit based on a unified TOD data format. In ConvLab-3, different datasets are transformed into one unified format and loaded by models in the same way. As a result, the cost of adapting a new model or dataset is significantly reduced. Compared to the previous releases of ConvLab (Lee et al., 2019b; Zhu et al., 2020b), ConvLab-3 allows developing dialogue systems with much more datasets and enhances the utility of the reinforcement learning (RL) toolkit for dialogue policies. To showcase the use of ConvLab-3 and inspire future work, we present a comprehensive study with various settings. We show the benefit of pre-training on other datasets for few-shot fine-tuning and RL, and encourage evaluating policy with diverse user simulators.
Recently, a category of task-oriented dialogue (TOD) datasets has been collected via Wizard-of-Oz simulated games. However, Wizard-of-Oz data are in fact simulated and thus fundamentally different from real-life conversations, which are more noisy and casual. Recently, the SereTOD challenge was organized, releasing the MobileCS dataset, which consists of real-world dialogue transcripts between real users and customer-service staff from China Mobile. Based on the MobileCS dataset, the SereTOD challenge has two tasks: it evaluates not only the construction of the dialogue system itself, but also information extraction from dialogue transcripts, which is crucial for building the knowledge base for TOD. This paper mainly presents a baseline study of the two tasks on the MobileCS dataset. We describe how the two baselines were built, the problems encountered, and the results. We anticipate that the baselines can facilitate exciting future research on building human-robot dialogue systems for real-life tasks.
This paper presents an ontology-aware pre-trained language model (OPAL) for end-to-end task-oriented dialogue (TOD). Unlike chit-chat dialogue models, task-oriented dialogue models fulfill at least two task-specific modules: a dialogue state tracker (DST) and a response generator (RG). The dialogue state consists of domain-slot-value triples, which are regarded as the user's constraints for searching domain-related databases. Large-scale task-oriented dialogue data with annotated dialogue states are usually inaccessible, which hinders the development of pre-trained language models tailored to task-oriented dialogue. We propose a simple yet effective pre-training method to alleviate this problem, consisting of two pre-training phases. The first phase pre-trains on large-scale contextual text data, where the structured information of the text is extracted by information extraction tools. To bridge the gap between the pre-training method and downstream tasks, we design two pre-training tasks: ontology-like triple recovery and next-text generation, which simulate DST and RG respectively. The second phase fine-tunes the pre-trained model on TOD data. Experimental results show that our proposed method achieves competitive performance on the CamRest676 and MultiWOZ benchmarks, even without using any TOD data during pre-training.
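The first pre-training task, ontology-like triple recovery, can be pictured as masking the value of an extracted triple in the text and training the model to restore it. The data-preparation sketch below is a guess at the mechanics under that reading, not the paper's actual pipeline.

```python
def make_triple_recovery_example(text, triples, mask_token="<mask>"):
    """Mask each extracted (subject, relation, value) triple's value in the
    text; the model learns to recover the masked values (DST-like), while a
    separate next-text-generation task would mirror response generation."""
    masked = text
    targets = []
    for subj, rel, value in triples:
        if value in masked:
            masked = masked.replace(value, mask_token, 1)
            targets.append(f"{subj} | {rel} | {value}")
    return masked, " ; ".join(targets)

src, tgt = make_triple_recovery_example(
    "The Eiffel Tower in Paris is 330 metres tall.",
    [("Eiffel Tower", "location", "Paris"),
     ("Eiffel Tower", "height", "330 metres")])
print(src)  # The Eiffel Tower in <mask> is <mask> tall.
print(tgt)  # Eiffel Tower | location | Paris ; Eiffel Tower | height | 330 metres
```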