We adapt Lee et al.'s (2018) span-based entity coreference model to the task of end-to-end discourse deixis resolution in dialogue, specifically by proposing extensions to their model that exploit task-specific characteristics. The resulting model, dd-utt, achieves state-of-the-art results on the four datasets in the CODI-CRAC 2021 shared task.
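To make the span-ranking decision rule concrete, the toy Python sketch below shows the generic Lee-et-al.-style selection step, with candidate antecedents taken to be preceding utterances (as is natural for discourse deixis). The scoring functions, embeddings, and numbers are illustrative assumptions, not the dd-utt implementation.

```python
# Minimal sketch (not the authors' code) of the span-ranking rule behind
# Lee-et-al.-style coreference models: each anaphoric span is linked to the
# highest-scoring preceding candidate, or to a dummy antecedent (None) when
# nothing scores above zero. Here candidates are utterances, and the scorers
# are toy stand-ins for learned feed-forward networks over embeddings.

def anaphor_score(span_embedding):
    # placeholder for the learned unary "is this span a deictic anaphor" score
    return sum(span_embedding)

def pair_score(span_embedding, utterance_embedding):
    # placeholder for the learned pairwise anaphor-antecedent score
    return sum(a * b for a, b in zip(span_embedding, utterance_embedding))

def resolve(anaphor, candidate_utterances):
    """Return the index of the best antecedent utterance, or None (dummy antecedent)."""
    best_idx, best_score = None, 0.0          # the dummy antecedent scores 0
    for idx, utt in enumerate(candidate_utterances):
        score = anaphor_score(anaphor) + pair_score(anaphor, utt)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# toy usage with 3-dimensional "embeddings"
print(resolve([0.2, 0.1, -0.3], [[0.5, 0.0, 0.1], [-0.4, 0.2, 0.0]]))
# -> 0 (the first utterance is chosen; None would mean "not anaphoric")
```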
Coreference resolution (CR) is one of the most challenging areas of natural language processing. The task seeks to identify all textual references to the same real-world entity. Research in this field is divided into coreference resolution and anaphora resolution. Because of its role in text comprehension and its utility in tasks such as information extraction, document summarization, and machine translation, the field has attracted considerable interest, and the quality of coreference resolution has a significant effect on the quality of these systems. This article reviews the existing corpora and evaluation metrics in this field. It then provides an overview of coreference algorithms, from rule-based methods to the latest deep learning techniques. Finally, coreference resolution and pronoun resolution systems for Persian are investigated.
This paper gives an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were expected to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks served as the main evaluation metric. Five participating teams submitted eight coreference prediction systems; in addition, the organizers provided a competitive Transformer-based baseline system at the start of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the score averaged over all datasets for all languages).
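For reference, the CoNLL score mentioned above is conventionally the unweighted average of the MUC, B-cubed, and CEAF-e F1 scores; a minimal helper:

```python
# The CoNLL score is conventionally the unweighted average of three
# coreference F1 metrics (MUC, B-cubed, CEAF-e); shown here only for clarity.
def conll_score(muc_f1, b3_f1, ceafe_f1):
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

print(conll_score(0.70, 0.65, 0.60))   # -> 0.65
```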
Efficiently discovering the emotional state of a speaker in a multi-party conversation is important for designing human-like conversational agents. During a conversation, a speaker's cognitive state often changes due to certain past utterances, which may result in a flip in their emotional state. Therefore, discovering the reasons (triggers) behind a speaker's emotion flip during a conversation is essential for explaining the emotion labels of individual utterances. In this paper, in addition to addressing the task of emotion recognition in conversation (ERC), we introduce a novel task, Emotion-Flip Reasoning (EFR), which aims to identify the past utterances that have triggered a speaker's emotional state to flip at a certain time. We propose a masked memory network to address the former task and a Transformer-based network for the latter. To this end, we consider MELD, a benchmark emotion recognition dataset for multi-party conversations, for the ERC task and augment it with new ground-truth labels for EFR. An extensive comparison with five state-of-the-art models suggests the superior performance of our models on both tasks. We further present anecdotal evidence and qualitative and quantitative error analyses to support the superiority of our models compared to the baselines.
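A small illustrative sketch of how dialogues, emotion labels, and flip triggers can be represented for ERC/EFR follows; the data structure and example are assumptions for exposition, not the paper's release format.

```python
# Illustrative data layout for ERC/EFR (assumed, simplified). An emotion flip
# occurs when a speaker's emotion at utterance t differs from that speaker's
# emotion at their previous turn; EFR asks which earlier utterances triggered
# the flip (stored here as gold indices).

from dataclasses import dataclass, field
from typing import List

@dataclass
class Utterance:
    speaker: str
    text: str
    emotion: str                                        # ERC label
    triggers: List[int] = field(default_factory=list)   # EFR gold: trigger utterance indices

def find_flips(dialogue: List[Utterance]):
    """Return indices where a speaker's emotion flips relative to their previous turn."""
    last_emotion = {}
    flips = []
    for i, utt in enumerate(dialogue):
        prev = last_emotion.get(utt.speaker)
        if prev is not None and prev != utt.emotion:
            flips.append(i)
        last_emotion[utt.speaker] = utt.emotion
    return flips

dialogue = [
    Utterance("A", "I got the job!", "joy"),
    Utterance("B", "That's great.", "neutral"),
    Utterance("B", "Wait, they cancelled the offer.", "neutral"),
    Utterance("A", "You're kidding me.", "anger", triggers=[2]),
]
print(find_flips(dialogue))   # -> [3]
```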
The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent by the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. While early efforts on rule-based systems found limited success, the emergence of deep learning has enabled great advances on this topic. In this thesis, we focus on methods that address the numerous issues that maintain the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and experimented with in ways inspired by general state-of-the-art AI methodologies, but they also target the characteristics specific to dialogue systems.
Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, which is challenging for machines. The challenges may vary between utterances depending on the role of the speaker in the conversation, especially when relevant expertise is distributed asymmetrically across roles. Moreover, the challenges may also increase as the information implicitly conveyed in the conversation builds up more shared context. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to improve performance at identifying and categorizing task-relevant utterances, and in doing so positively impacts performance on a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations, where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, with improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations of the personal information appearing in each document, including its semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories) and explicitly marks which text spans should be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics tailored to measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models, is available at:
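The kind of privacy-oriented metric such a benchmark calls for can be illustrated with a simple span-coverage recall; this is an assumed simplification for exposition, not necessarily TAB's exact formulation.

```python
# Hedged illustration of a privacy-oriented masking metric (assumed,
# simplified): recall over the gold spans that must be masked, where a gold
# span counts as protected only if it is fully covered by a system mask.

def covered(gold_span, masked_spans):
    g_start, g_end = gold_span
    return any(m_start <= g_start and g_end <= m_end for m_start, m_end in masked_spans)

def masking_recall(gold_spans, masked_spans):
    if not gold_spans:
        return 1.0
    return sum(covered(g, masked_spans) for g in gold_spans) / len(gold_spans)

# gold: character offsets of text that would identify the protected person
gold = [(10, 22), (57, 63)]
system = [(8, 25)]
print(masking_recall(gold, system))   # -> 0.5: one of two identifying spans was masked
```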
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Existing multi-party dialogue datasets for coreference resolution are nascent, and many challenges remain unaddressed. We create a large-scale dataset for this task, Multilingual Multiparty Coref (MMC), based on TV transcripts. Because gold-quality subtitles are available in multiple languages, we propose reusing the annotations to create silver coreference data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.
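A minimal sketch of annotation projection in this spirit follows; the alignment format and minimal-span heuristic are assumptions for illustration, not necessarily the exact procedure used to build MMC.

```python
# Sketch of annotation projection (assumed, simplified): given word alignments
# between an English utterance and its subtitle translation, an English
# mention span is projected to the minimal target span covering all aligned
# target tokens.

def project_span(span, alignment):
    """span: (start, end) token indices on the source side, end exclusive.
    alignment: list of (src_idx, tgt_idx) word-alignment pairs."""
    tgt_indices = [t for s, t in alignment if span[0] <= s < span[1]]
    if not tgt_indices:
        return None                  # unalignable mention -> dropped from silver data
    return (min(tgt_indices), max(tgt_indices) + 1)

alignment = [(0, 1), (1, 0), (2, 2), (3, 3)]
print(project_span((1, 3), alignment))   # -> (0, 3)
```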
The rapid development of aspect-based sentiment analysis (ABSA) within recent decades shows great potential for real-world society. The current ABSA works, however, are mostly limited to the scenario of a single text piece, leaving the study in dialogue contexts unexplored. In this work, we introduce a novel task of conversational aspect-based sentiment quadruple analysis, namely DiaASQ, aiming to detect the sentiment quadruple of target-aspect-opinion-sentiment in a dialogue. DiaASQ bridges the gap between fine-grained sentiment analysis and conversational opinion mining. We manually construct a large-scale, high-quality Chinese dataset and also obtain the English version dataset via manual translation. We deliberately propose a neural model to benchmark the task. It advances in effectively performing end-to-end quadruple prediction and manages to incorporate rich dialogue-specific and discourse feature representations for better cross-utterance quadruple extraction. We finally point out several potential future works to facilitate the follow-up research of this new task. The DiaASQ data is open at https://github.com/unikcc/DiaASQ
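The quadruple representation and an exact-match micro-F1, a common convention for quadruple extraction (assumed here for illustration rather than taken from the official DiaASQ scorer), can be sketched as follows.

```python
# Target-aspect-opinion-sentiment quadruples as plain tuples, scored with
# exact-match micro-F1 (illustrative convention, not the official scorer).

def quad_f1(gold_quads, pred_quads):
    gold, pred = set(gold_quads), set(pred_quads)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [("iPhone 13", "battery", "drains fast", "negative"),
        ("iPhone 13", "screen", "gorgeous", "positive")]
pred = [("iPhone 13", "battery", "drains fast", "negative")]
print(round(quad_f1(gold, pred), 3))   # -> 0.667
```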
Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore very desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns about the general performance of these methods and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a unified evaluation setup, including a new formalization of the annotation error detection task, evaluation protocols, and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use, open-source software package.
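One of the simplest families of annotation error detection methods is model-based flagging: rank instances by how little probability a (cross-validated) model assigns to the gold label. The sketch below shows only that scoring step, with made-up probabilities.

```python
# Sketch of model-based annotation error flagging (illustrative): instances
# whose gold label receives low model probability are flagged as candidate
# annotation errors, ranked from least to most supported.

def flag_candidates(examples, threshold=0.2):
    """examples: list of (instance_id, gold_label, {label: model_probability}).
    Returns instance ids whose gold label falls below the probability threshold,
    most suspicious first."""
    scored = [(probs.get(gold, 0.0), iid) for iid, gold, probs in examples]
    scored.sort()                                    # least support for gold first
    return [iid for p, iid in scored if p < threshold]

examples = [
    ("s1", "positive", {"positive": 0.95, "negative": 0.05}),
    ("s2", "negative", {"positive": 0.90, "negative": 0.10}),   # likely mislabeled
    ("s3", "positive", {"positive": 0.55, "negative": 0.45}),
]
print(flag_candidates(examples))   # -> ['s2']
```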
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output only shares verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to appropriately evaluate model output according to the AIS guidelines. We empirically validate this approach on three generation datasets (two in the conversational QA domain and one in summarization) via human evaluation studies, showing that AIS can serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
As humans, we navigate the world through all our senses, using each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations of videos through all constituent modalities. When finetuned, it sets a new state of the art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing the ethical and societal implications of multimodal pretraining.
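A minimal sketch of a contrastive "pick the correct masked snippet" objective of the kind described above follows; it is illustrative only, not the released MERLOT Reserve code.

```python
# Sketch of a contrastive masked-snippet objective (illustrative): the
# representation at the MASK position is compared against candidate snippet
# representations, and the loss is cross-entropy over their similarities.

import math

def contrastive_loss(mask_vec, candidate_vecs, correct_idx, temperature=0.05):
    sims = [sum(m * c for m, c in zip(mask_vec, cand)) / temperature
            for cand in candidate_vecs]
    log_z = math.log(sum(math.exp(s) for s in sims))
    return -(sims[correct_idx] - log_z)   # negative log softmax probability of the true snippet

mask_vec = [0.4, 0.1, -0.2]
candidates = [[0.5, 0.1, -0.1],   # true audio/text snippet
              [-0.3, 0.2, 0.4],
              [0.0, -0.1, 0.1]]
print(round(contrastive_loss(mask_vec, candidates, correct_idx=0), 3))
```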
Slot filling and intent detection are the backbone of conversational agents such as voice assistants, and are active areas of research. Despite the impressive performance of state-of-the-art techniques on publicly available benchmarks, their ability to generalize to realistic scenarios is yet to be demonstrated. In this work, we present NATURE, a set of simple spoken-language-oriented transformations applied to the evaluation set of a dataset, which introduce human spoken-language variations while preserving the semantics of an utterance. We apply NATURE to common slot filling and intent detection benchmarks and demonstrate that simple perturbations of the standard evaluation set with NATURE can significantly degrade model performance. Through our experiments, we demonstrate that when NATURE operators are applied to the evaluation set of popular benchmarks, model accuracy can drop by up to 40%.
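A toy example of such a spoken-language perturbation operator is sketched below (an assumed form for illustration; the actual NATURE operators differ in detail): a filler word is inserted while the slot and intent labels are left untouched, so any drop in accuracy is attributable purely to surface variation.

```python
# Toy spoken-style perturbation (illustrative): prepend a filler word to an
# utterance while keeping the BIO slot tags aligned; the intent label would be
# unchanged as well.

import random

FILLERS = ["um", "uh", "well", "so"]

def add_filler(tokens, tags, rng=random.Random(0)):
    filler = rng.choice(FILLERS)
    return [filler] + tokens, ["O"] + tags      # fillers carry no slot

tokens = ["play", "jazz", "on", "spotify"]
tags = ["O", "B-genre", "O", "B-service"]
print(add_filler(tokens, tags))
```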
Recently, neural models that extend pretrained language models with fine-tuning have continued to achieve new state-of-the-art results in joint goal accuracy (JGA) on dialogue state tracking (DST) benchmarks. However, we investigate their robustness and find that they show sharp drops in JGA for conversations containing utterances or dialogue flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitates comparison of DST models on comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art JGA, since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language varieties, whereas those based on autoregressive language models generalize to language variations but tend to memorize named entities and are prone to hallucination. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research on developing task-oriented dialogue models that embody the strengths of the various approaches.
Much recent work in task-oriented parsing has focused on finding a middle ground between flat slots and intents, which are inexpressive but easy to annotate, and powerful representations such as the lambda calculus, which are expressive but costly to annotate. This paper continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents. We perform an extensive evaluation of deep-learning techniques for task-oriented parsing on this dataset, including different flavors of seq2seq systems and RNNGs. The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR). We demonstrate empirically that training the parser to directly generate EXR notation not only solves the problem of entity resolution in one fell swoop and overcomes a number of expressive limitations of TOP notation, but also results in significantly greater parsing accuracy.
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies - model-generated text is significantly less novel than our baseline of human-generated text from each model's test set. For larger-scale structure - e.g., overall sentence structure - model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy, in some cases duplicating passages over 1,000 words long from the training set. We also perform an extensive manual analysis, showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).
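The n-gram novelty measurement at the core of this kind of analysis can be sketched compactly (the general idea, not RAVEN's exact implementation):

```python
# Sketch of n-gram novelty: an n-gram in generated text counts as novel if it
# never appears in the training corpus.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_rate(generated_tokens, training_tokens, n):
    train_ngrams = set(ngrams(training_tokens, n))
    gen_ngrams = ngrams(generated_tokens, n)
    if not gen_ngrams:
        return 0.0
    return sum(g not in train_ngrams for g in gen_ngrams) / len(gen_ngrams)

train = "the cat sat on the mat".split()
generated = "the cat sat on a chair".split()
print(novelty_rate(generated, train, n=2))   # -> 0.4: 2 of 5 bigrams are unseen
```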
Figures of speech, such as metaphor and irony, are ubiquitous in literary works and colloquial conversations. This poses great challenges for natural language understanding, because figures of speech often deviate from their surface meanings to express deeper semantic meanings. Previous research has emphasized the literary aspect of figures of speech and seldom provides a comprehensive exploration from the perspective of computational linguistics. In this paper, we first propose the concept of a figurative unit, which is the carrier of a figure. We then select 12 types of figures commonly used in Chinese and build a Chinese corpus for Contextualized Figure Recognition (ConFiguRe). Unlike previous token-level or sentence-level counterparts, ConFiguRe aims to extract a figurative unit from discourse-level context and classify the figurative unit into the correct figure type. On ConFiguRe, three tasks are designed, i.e., figure extraction, figure type classification, and figure recognition, and state-of-the-art techniques are employed to establish the benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models, thus calling for further research. Our dataset and code are publicly available at https://github.com/pku-tangent/configure.
Improving the user experience of a dialogue system often requires intensive developer effort to read conversation logs, run statistical analyses, and intuit the relative importance of system shortcomings. This paper presents a novel approach to automated analysis of conversation logs that learns the relationship between user-system interactions and overall dialogue quality. Unlike prior work on utterance-level quality prediction, our approach learns the impact of each interaction from the overall user rating, without utterance-level annotations, allowing the resulting model conclusions to be derived on the basis of empirical evidence and at low cost. Our model identifies interactions that have a strong correlation with overall dialogue quality in a chatbot setting. Experiments show that the automated analysis from our model agrees with expert judgments, making this work the first to show that such weakly supervised learning of utterance-level quality prediction is highly achievable.
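A hedged sketch of the weak-supervision idea follows, in an assumed simplified form: per-interaction scores come from a linear model, only their dialogue-level mean is supervised against the overall rating, and the learned per-interaction scores then act as impact estimates.

```python
# Sketch of weakly supervised utterance-level quality estimation (assumed,
# simplified): each turn gets a scalar score from a linear model over its
# features, the dialogue-level prediction is the mean of those scores, and
# only the overall rating is supervised. Learned weights indicate which
# interaction types help or hurt perceived quality.

def train(dialogues, ratings, dim, lr=0.1, epochs=500):
    w = [0.0] * dim
    for _ in range(epochs):
        for feats_per_turn, rating in zip(dialogues, ratings):
            scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in feats_per_turn]
            err = sum(scores) / len(scores) - rating        # squared-error gradient
            for x in feats_per_turn:
                for i, xi in enumerate(x):
                    w[i] -= lr * err * xi / len(feats_per_turn)
    return w

# two toy dialogues; per-turn feature vector: [contains_reprompt, user_thanked]
dialogues = [[[1, 0], [0, 1]], [[1, 0], [1, 0]]]
ratings = [4.0, 2.0]
print([round(x, 2) for x in train(dialogues, ratings, dim=2)])
# -> approximately [2.0, 6.0]: thanking gets a much higher learned weight than reprompts
```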