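Robust state tracking for task-oriented dialogue systems is currently restricted to a few popular languages. This paper shows that, given a large-scale dialogue dataset in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and to eliminate the costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values together with only the last agent and user utterances. We show that this succinct representation reduces the compounding effect of translation errors without harming accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On the RiSAWOZ, CrossWOZ, CrossWOZ-ZH, and MultiWOZ-ZH datasets, we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets, showing that erroneous annotations can obscure judgments of model quality. Finally, we create RiSAWOZ English and German datasets using the proposed method. On these datasets, accuracy is within 11% of the original, showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotation.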
translated by 谷歌翻译
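While communicating with a user, a task-oriented dialogue system has to track the user's needs at each turn according to the conversation history. This process, called dialogue state tracking (DST), is crucial because it directly informs the downstream dialogue policy. DST has received a lot of interest in recent years, with the text-to-text paradigm emerging as the favored approach. In this review paper, we first present the task and its associated datasets. Then, considering the large number of recent publications, we identify the highlights and advances of research in 2021-2022. Although neural approaches have enabled significant progress, we argue that some critical aspects of dialogue systems, such as generalizability, are still underexplored. To motivate future studies, we propose several research avenues.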
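Although English virtual assistants have achieved exciting performance thanks to abundant training resources, the needs of non-English speakers remain unmet. As of December 2021, Alexa, one of the most popular smart speakers in the world, supports 9 different languages [1], whereas there are thousands of languages in the world, 91 of which are spoken by more than 10 million people according to statistics published in 2019 [2]. However, training a virtual assistant in a language other than English is more difficult, especially for low-resource languages. The lack of high-quality training data limits model performance and leads to poor user satisfaction. We therefore devise an efficient and effective training solution for multilingual task-oriented dialogue systems, using the same dataset generation pipeline and end-to-end dialogue system architecture as BiToD [5], which adopted several key design choices for a minimalistic natural language design in which formal dialogue states are used in place of natural language inputs. This reduces the room for error introduced by weaker natural language models and ensures that the model can correctly extract the essential slot values needed to perform dialogue state tracking (DST). Our goal is to reduce the amount of natural language encoded at each turn, and the key parameter we investigate is the number of turns (H) fed to the model as history. We first explore the turning point at which increasing H begins to yield diminishing returns in overall performance. We then examine whether the examples that a model with small H gets wrong can be categorized in a way that lets the model perform few-shot fine-tuning on them. Finally, we explore the limitations of this approach and whether there are certain types of examples that it cannot resolve.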
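The idea of feeding the model a formal dialogue state plus only the last H turns can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `<state>` and `<history>` markers, the `slot = value` serialization, and the function names are hypothetical and are not taken from BiToD or from the abstract above.

```python
from typing import Dict, List, Tuple

# A turn is (speaker, utterance); speaker is "agent" or "user" (hypothetical convention).
Turn = Tuple[str, str]


def serialize_state(state: Dict[str, str]) -> str:
    """Render a formal dialogue state as 'slot = value' pairs in sorted slot order."""
    return " ; ".join(f"{slot} = {value}" for slot, value in sorted(state.items()))


def build_model_input(state: Dict[str, str], turns: List[Turn], h: int) -> str:
    """Combine the formal state with only the last `h` turns of natural language."""
    if h < 0:
        raise ValueError("h must be non-negative")
    # Keep only the most recent h turns; h = 0 means no natural-language history.
    recent = turns[-h:] if h > 0 else []
    history = " ".join(f"<{speaker}> {text}" for speaker, text in recent)
    return f"<state> {serialize_state(state)} <history> {history}".strip()
```

With a small H the input stays short because older turns are summarized only through the formal state, so the choice of H trades off input length against the context available to the model.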
Even though machine learning has become the major scene in dialogue research community, the real breakthrough has been blocked by the scale of data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions is two-fold: firstly, a detailed description of the data collection procedure along with a summary of data structure and analysis is provided. The proposed data-collection pipeline is entirely based on crowd-sourcing without the need of hiring professional annotators; secondly, a set of benchmark results of belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.
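A commonly observed problem with state-of-the-art natural language technologies, such as those from Amazon (Alexa) and Apple, is that their services do not extend to most citizens of developing countries because of language barriers. Such populations suffer from a lack of available resources in their languages for building NLP products. This paper presents AllWOZ, a multilingual, multi-domain, task-oriented customer service dialog dataset covering eight languages: English, Mandarin, Korean, Vietnamese, Hindi, French, Portuguese, and Thai. In addition, we create a benchmark for the multilingual dataset by applying mT5 with meta-learning.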
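Dialogue state tracking models play an important role in task-oriented dialogue systems. However, most of them model the slot types as conditionally independent given the input. We find that this can cause the model to confuse slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they alleviate the confusion described above and push the state of the art on the dataset from 58.7 to 61.3. Our implementation is available at https://github.com/ctinray/trippy-joint.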
Task-oriented dialogue (TOD) systems have been applied in a range of domains to support human users to achieve specific goals. Systems are typically constructed for a single domain or language and do not generalise well beyond this. Their extension to other languages in particular is restricted by the lack of available training data for many of the world's languages. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++ extends the English-only NLU++ dataset to include manual translations into a range of high, medium and low resource languages (Spanish, Marathi, Turkish and Amharic), in two domains (banking and hotels). MULTI3NLU++ inherits the multi-intent property of NLU++, where an utterance may be labelled with multiple intents, providing a more realistic representation of a user's goals and aligning with the more complex tasks that commercial systems aim to model. We use MULTI3NLU++ to benchmark state-of-the-art multilingual language models as well as Machine Translation and Question Answering systems for the NLU task of intent detection for TOD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting.
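Recently, a class of task-oriented dialogue (TOD) datasets has emerged that is collected through Wizard-of-Oz simulated games. However, Wizard-of-Oz data are in fact simulated data and are therefore fundamentally different from real-life conversations, which are noisier and more casual. Recently, the SereTOD challenge was organized and released the MobileCS dataset, which consists of real-world dialog transcripts between real users and customer service staff from China Mobile. Based on the MobileCS dataset, the SereTOD challenge has two tasks: it not only evaluates the construction of the dialogue system itself, but also examines information extraction from dialog transcripts, which is crucial for building the knowledge base for TOD. This paper mainly presents a baseline study of the two tasks on the MobileCS dataset. We describe how the two baselines are constructed, the problems encountered, and the results. We anticipate that the baselines can facilitate exciting future research on building human-robot dialogue systems for real-life tasks.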
Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states which consist of slot-value pairs. As condensed structural information memorizing all history information, the dialogue state in the last turn is typically adopted as the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and correct wrongly predicted slot values in the last turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised via replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum and improving anti-noise ability.
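The noising step described above (replacing some slot values of the previous state before it is used as input) might look roughly like the sketch below. This is an illustration under stated assumptions, not MoNET's implementation: the uniform per-slot replacement probability `noise_prob`, the `ontology` mapping of slots to candidate values, and the function name are all hypothetical.

```python
import random
from typing import Dict, List


def noise_previous_state(
    state: Dict[str, str],
    ontology: Dict[str, List[str]],
    noise_prob: float,
    rng: random.Random,
) -> Dict[str, str]:
    """Return a copy of `state` in which each slot value is replaced, with
    probability `noise_prob`, by a different value drawn from the ontology."""
    noised = dict(state)  # never mutate the clean training state
    for slot, value in state.items():
        # Only values that differ from the current one count as noise.
        candidates = [v for v in ontology.get(slot, []) if v != value]
        if candidates and rng.random() < noise_prob:
            noised[slot] = rng.choice(candidates)
    return noised
```

During training, the noised state would be paired with the unchanged gold current state, so that the model cannot simply copy its input and has to check the slot values against the dialogue.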
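Dialogue state tracking (DST) is a core sub-module of a dialogue system, which aims to extract the appropriate belief state (domain-slot-value) from system and user utterances. Most previous studies have attempted to improve performance by increasing the size of the pre-trained model or by using additional features such as graph relations. In this study, we propose dialogue state tracking with entity adaptive pre-training (DSTEA), a system in which the key entities in a sentence are trained more intensively by the encoder of the DST model. DSTEA extracts important entities from input dialogues in four ways and then applies selective knowledge masking to train the model effectively. Although DSTEA conducts only pre-training without directly injecting additional knowledge into the DST model, it achieves better performance than the best-known benchmark models on MultiWOZ 2.0, 2.1, and 2.2. The effectiveness of DSTEA is verified through various comparative experiments with regard to entity type and different adaptive settings.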
Training dialogue systems often entails dealing with noisy training examples and unexpected user inputs. Despite their prevalence, there is currently no accurate survey of dialogue noise, nor is there a clear sense of the impact of each noise type on task performance. This paper addresses this gap by first constructing a taxonomy of noise encountered by dialogue systems. In addition, we run a series of experiments to show how different models behave when subjected to varying levels and types of noise. Our results reveal that models are quite robust to label errors commonly tackled by existing denoising algorithms, but that performance suffers from dialogue-specific noise. Driven by these observations, we design a data cleaning algorithm specialized for conversational settings and apply it as a proof-of-concept for targeted dialogue denoising.
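The MultiWOZ 2.0 dataset has greatly stimulated research on task-oriented dialogue systems. However, its state annotations contain substantial noise, which hinders a proper evaluation of model performance. To address this issue, massive efforts have been devoted to correcting the annotations, and three improved versions (i.e., MultiWOZ 2.1-2.3) have been released. Nonetheless, plenty of incorrect and inconsistent annotations remain. This work introduces MultiWOZ 2.4, which refines the annotations in the validation set and test set of MultiWOZ 2.1. The annotations in the training set remain unchanged (the same as in MultiWOZ 2.1) to elicit robust and noise-resilient model training. We benchmark eight state-of-the-art dialogue state tracking models on MultiWOZ 2.4. All of them show much higher performance than on MultiWOZ 2.1.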
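Compared with the CrossWOZ (Chinese) and MultiWOZ (English) datasets, which contain coarse-grained information, there is no dataset that properly handles fine-grained and hierarchical information. In this paper, we release a Cantonese knowledge-driven dialogue dataset for restaurants in Hong Kong (KddRES), which grounds the information in multi-turn conversations to one specific restaurant. Our corpus contains 0.8k conversations drawn from 10 restaurants with various styles in different regions. In addition, we design fine-grained slots and intents to better capture semantic information. Benchmark experiments and statistical analysis of the data show the diversity and rich annotations of our dataset. We believe that the release of KddRES can be a necessary supplement to current dialogue datasets and is more suitable and valuable for small and medium enterprises (SMEs), for example for building a customized dialogue system for each restaurant. The corpus and benchmark models are publicly available.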
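Existing multiparty dialogue datasets for coreference resolution are nascent, and many challenges remain unaddressed. We create a large-scale dataset for this task, Multilingual Multiparty Coref (MMC), based on TV transcripts. Because gold-quality subtitles are available in multiple languages, we propose reusing the annotations to create silver coreference data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC covers multiparty coreference more broadly than previous datasets. On the silver data, we find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.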
Dialogue State Tracking (DST), a key component of task-oriented conversation systems, represents user intentions by determining the values of pre-defined slots in an ongoing dialogue. Existing approaches use hand-crafted templates and additional slot information to fine-tune and prompt large pre-trained language models and elicit slot values from the dialogue context. Significant manual effort and domain knowledge is required to design effective prompts, limiting the generalizability of these approaches to new domains and tasks. In this work, we propose DiSTRICT, a generalizable in-context tuning approach for DST that retrieves highly relevant training examples for a given dialogue to fine-tune the model without any hand-crafted templates. Experiments with the MultiWOZ benchmark datasets show that DiSTRICT outperforms existing approaches in various zero-shot and few-shot settings using a much smaller model, thereby providing an important advantage for real-world deployments that often have limited resource availability.
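To make the retrieval step concrete, the sketch below selects the k training examples most similar to a given dialogue. It deliberately swaps in a simple token-overlap (Jaccard) similarity as a stand-in; the abstract does not specify DiSTRICT's retriever, so the similarity measure, the function names, and the whitespace tokenization are all assumptions made for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_examples(query: str, pool: List[str], k: int) -> List[str]:
    """Return the k training utterances most similar to `query`.

    Similarity is plain Jaccard overlap on lowercased whitespace tokens,
    a stand-in for whatever learned retriever a real system would use.
    """
    query_tokens = set(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda example: jaccard(query_tokens, set(example.lower().split())),
        reverse=True,  # highest similarity first; ties keep pool order
    )
    return ranked[:k]
```

The retrieved examples would then be placed in the model input alongside the current dialogue, which is what removes the need for hand-written prompt templates.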
Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a model for dialogue state tracking capable of zero-shot generalization to new APIs, while remaining competitive in the regular setting.
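Interest in dialog systems has grown substantially in the past decade. By extension, so too has interest in developing and improving intent classification and slot-filling models, two components that are commonly used in task-oriented dialog systems. Moreover, good evaluation benchmarks are important for comparing and analyzing systems that incorporate such models. Unfortunately, much of the literature in the field is limited to the analysis of relatively few benchmark datasets. To promote more robust analyses of task-oriented dialog systems, we have conducted a survey of publicly available datasets for the tasks of intent classification and slot-filling. We catalog the important characteristics of each dataset and discuss the applicability, strengths, and weaknesses of each. Our goal is for this survey to increase the accessibility of these datasets, which we hope will enable their use in future evaluations of intent classification and slot-filling models for task-oriented dialog systems.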
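The primary purpose of dialogue state tracking (DST), a critical component of an end-to-end conversational system, is to build a model that responds well to real-world situations. Although we often change our minds during ordinary conversations, current benchmark datasets do not adequately reflect such occurrences and instead consist of over-simplified conversations in which no one changes their mind. The main question motivating this study is: "Are current benchmark datasets sufficient to handle casual conversations in which a person changes their mind after a certain topic is over?" We find that the answer is "No", because simply injecting template-based turnback utterances significantly degrades DST model performance. When the simplest form of turnback utterance is injected, the test joint goal accuracy on MultiWOZ drops by more than 5 percentage points. Moreover, the degradation worsens in more complicated turnback situations. However, we also observe that performance rebounds when turnbacks are appropriately included in the training dataset, which implies that the problem lies not with the DST models but with the construction of the benchmark datasets.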
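A template-based turnback injection of the kind described above could be sketched as follows. The template wording, the `(speaker, utterance)` turn format, and the function name are hypothetical choices for illustration; the abstract does not give the actual templates used in the study.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)

# Hypothetical template; the study's real templates are not given in the abstract.
DEFAULT_TEMPLATE = "Sorry, I changed my mind. Please change the {slot} to {value}."


def inject_turnback(
    turns: List[Turn],
    state: Dict[str, str],
    slot: str,
    new_value: str,
    template: str = DEFAULT_TEMPLATE,
) -> Tuple[List[Turn], Dict[str, str]]:
    """Append a user turn that revises an already-filled slot and return the
    extended dialogue together with the updated gold state."""
    if slot not in state:
        raise KeyError(f"slot {slot!r} has no earlier value to turn back on")
    utterance = template.format(slot=slot, value=new_value)
    new_turns = turns + [("user", utterance)]  # builds a new list
    new_state = {**state, slot: new_value}  # gold label follows the change
    return new_turns, new_state
```

Applying this to test dialogues yields a perturbed evaluation set, and applying it to training dialogues yields the kind of augmented training data under which the abstract reports that performance rebounds.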
Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation with the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable mostly because of undefined ranges. Our analysis suggests that future MT metrics be designed to produce error labels rather than scores to facilitate extrinsic evaluation.
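Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results in joint goal accuracy (JGA) on dialogue state tracking (DST) benchmarks. However, we question their robustness, because they show sharp drops in JGA on conversations containing utterances or dialog flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitates comparisons of DST models along comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art JGA, since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas models based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Because of their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research on developing task-oriented dialogue models that embody the strengths of the various methods.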
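For reference, joint goal accuracy counts a turn as correct only when the whole predicted state matches the gold state exactly. A minimal sketch, assuming states are represented as flat slot-to-value dictionaries (a representation chosen here for illustration):

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state on every slot."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    # A single wrong, missing, or extra slot makes the whole turn incorrect.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

Because the metric is all-or-nothing per turn, it can hide which kinds of perturbation cause the failures, which is the gap that a collection of targeted robustness metrics is meant to fill.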