In-game toxic language becomes the hot potato in the gaming industry and community. There have been several online game toxicity analysis frameworks and models proposed. However, it is still challenging to detect toxicity due to the nature of in-game chat, which has extremely short length. In this paper, we describe how the in-game toxic language shared task has been established using the real-world in-game chat data. In addition, we propose and introduce the model/framework for toxic language token tagging (slot filling) from the in-game chat. The data and code will be released.
translated by 谷歌翻译
Intent classification and slot filling are two core tasks in natural language understanding (NLU). The interaction nature of the two tasks makes the joint models often outperform the single designs. One of the promising solutions, called BERT (Bidirectional Encoder Representations from Transformers), achieves the joint optimization of the two tasks. BERT adopts the wordpiece to tokenize each input token into multiple sub-tokens, which causes a mismatch between the tokens and the labels lengths. Previous methods utilize the hidden states corresponding to the first sub-token as input to the classifier, which limits performance improvement since some hidden semantic informations is discarded in the fine-tune process. To address this issue, we propose a novel joint model based on BERT, which explicitly models the multiple sub-tokens features after wordpiece tokenization, thereby generating the context features that contribute to slot filling. Specifically, we encode the hidden states corresponding to multiple sub-tokens into a context vector via the attention mechanism. Then, we feed each context vector into the slot filling encoder, which preserves the integrity of the sentence. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on two public benchmark datasets. The F1 score of the slot filling in particular has been improved from 96.1 to 98.2 (2.1% absolute) on the ATIS dataset.
translated by 谷歌翻译
插槽填充和意图检测是自然语言理解领域的两个基本任务。由于这两项任务之间存在很强的相关性,因此以前的研究努力通过多任务学习或设计功能交互模块来建模它们,以提高每个任务的性能。但是,现有的方法都没有考虑句子的结构信息与两个任务的标签语义之间的相关性。话语的意图和语义成分取决于句子的句法元素。在本文中,我们研究了一个多透明的标签改进网络,该网络利用依赖性结构和标签语义嵌入。考虑到增强句法表示,我们将句子的依赖性结构介绍到我们的模型中。为了捕获句法信息和任务标签之间的语义依赖性,我们将特定于任务的特征与相应的标签嵌入通过注意机制相结合。实验结果表明,我们的模型在两个公共数据集上实现了竞争性能。
translated by 谷歌翻译
自动言论(POS)标记是许多自然语言处理(NLP)任务的预处理步骤,例如名称实体识别(NER),语音处理,信息提取,单词sense sisse disampigation和Machine Translation。它已经在英语和欧洲语言方面取得了令人鼓舞的结果,但是使用印度语言,尤其是在Odia语言中,由于缺乏支持工具,资源和语言形态丰富性,因此尚未得到很好的探索。不幸的是,我们无法为ODIA找到一个开源POS标记,并且仅尝试为ODIA语言开发POS标记器的尝试。这项研究工作的主要贡献是介绍有条件的随机场(CRF)和基于深度学习的方法(CNN和双向长期短期记忆)来开发ODIA的语音部分。我们使用了一个公开访问的语料库,并用印度标准局(BIS)标签设定了数据集。但是,全球的大多数语言都使用了带有通用依赖项(UD)标签集注释的数据集。因此,要保持统一性,odia数据集应使用相同的标签集。因此,我们已经构建了一个从BIS标签集到UD标签集的简单映射。我们对CRF模型进行了各种特征集输入,观察到构造特征集的影响。基于深度学习的模型包括BI-LSTM网络,CNN网络,CRF层,角色序列信息和预训练的单词向量。通过使用卷积神经网络(CNN)和BI-LSTM网络提取角色序列信息。实施了神经序列标记模型的六种不同组合,并研究了其性能指标。已经观察到具有字符序列特征和预训练的单词矢量的BI-LSTM模型取得了显着的最新结果。
translated by 谷歌翻译
Multi-intent detection and slot filling joint models are gaining increasing traction since they are closer to complicated real-world scenarios. However, existing approaches (1) focus on identifying implicit correlations between utterances and one-hot encoded labels in both tasks while ignoring explicit label characteristics; (2) directly incorporate multi-intent information for each token, which could lead to incorrect slot prediction due to the introduction of irrelevant intent. In this paper, we propose a framework termed DGIF, which first leverages the semantic information of labels to give the model additional signals and enriched priors. Then, a multi-grain interactive graph is constructed to model correlations between intents and slots. Specifically, we propose a novel approach to construct the interactive graph based on the injection of label semantics, which can automatically update the graph to better alleviate error propagation. Experimental results show that our framework significantly outperforms existing approaches, obtaining a relative improvement of 13.7% over the previous best model on the MixATIS dataset in overall accuracy.
translated by 谷歌翻译
虽然罕见疾病的特征在于患病率低,但大约3亿人受到罕见疾病的影响。对这些条件的早期和准确诊断是一般从业者的主要挑战,没有足够的知识来识别它们。除此之外,罕见疾病通常会显示各种表现形式,这可能会使诊断更加困难。延迟的诊断可能会对患者的生命产生负面影响。因此,迫切需要增加关于稀有疾病的科学和医学知识。自然语言处理(NLP)和深度学习可以帮助提取有关罕见疾病的相关信息,以促进其诊断和治疗。本文探讨了几种深度学习技术,例如双向长期内存(BILSTM)网络或基于来自变压器(BERT)的双向编码器表示的深层语境化词表示,以识别罕见疾病及其临床表现(症状和症状) Raredis语料库。该毒品含有超过5,000名罕见疾病和近6,000个临床表现。 Biobert,基于BERT和培训的生物医学Corpora培训的域特定语言表示,获得了最佳结果。特别是,该模型获得罕见疾病的F1分数为85.2%,表现优于所有其他模型。
translated by 谷歌翻译
语言理解(SLU)是以任务为导向对话系统的核心组成部分,期望面对人类用户不耐烦的推理较短。现有的工作通过为单转弯任务设计非自动回旋模型来提高推理速度,但在面对对话历史记录时未能适用于多转移SLU。直观的想法是使所有历史言语串联并直接利用非自动进取模型。但是,这种方法严重错过了显着的历史信息,并遭受了不协调的问题。为了克服这些缺点,我们提出了一个新型模型,用于使用层改造的变压器(SHA-LRT),该模型名为“显着历史”,该模型由SHA模块组成,该模块由SHA模块组成,一种层的机制(LRM)和插槽标签生成(SLG)任务。 SHA通过历史悠久的注意机制捕获了从历史言论和结果进行的当前对话的显着历史信息。 LRM预测了Transferer的中间状态的初步SLU结果,并利用它们来指导最终预测,SLG获得了非自动进取编码器的顺序依赖性信息。公共数据集上的实验表明,我们的模型可显着提高多转弯性能(总体上为17.5%),并且加速(接近15倍)最先进的基线的推理过程,并且在单转弯方面有效SLU任务。
translated by 谷歌翻译
口语语言理解已被处理为监督的学习问题,其中每个域都有一组培训数据。但是,每个域的注释数据都是经济昂贵和不可扩展的,因此我们应该充分利用所有域的信息。通过进行多域学习,使用跨域的联合训练的共享参数来解决一个现有方法解决问题。我们建议通过使用域特定和特定于任务的模型参数来改善该方法的参数化,以改善知识学习和传输。5个域的实验表明,我们的模型对多域SLU更有效,并获得最佳效果。此外,当适应具有很少数据的新域时,通过优于12.4 \%来表现出先前最佳模型的可转换性。
translated by 谷歌翻译
Recent joint multiple intent detection and slot filling models employ label embeddings to achieve the semantics-label interactions. However, they treat all labels and label embeddings as uncorrelated individuals, ignoring the dependencies among them. Besides, they conduct the decoding for the two tasks independently, without leveraging the correlations between them. Therefore, in this paper, we first construct a Heterogeneous Label Graph (HLG) containing two kinds of topologies: (1) statistical dependencies based on labels' co-occurrence patterns and hierarchies in slot labels; (2) rich relations among the label nodes. Then we propose a novel model termed ReLa-Net. It can capture beneficial correlations among the labels from HLG. The label correlations are leveraged to enhance semantic-label interactions. Moreover, we also propose the label-aware inter-dependent decoding mechanism to further exploit the label correlations for decoding. Experiment results show that our ReLa-Net significantly outperforms previous models. Remarkably, ReLa-Net surpasses the previous best model by over 20\% in terms of overall accuracy on MixATIS dataset.
translated by 谷歌翻译
大多数现有的插槽填充模型倾向于记住实体的固有模式和培训数据中相应的上下文。但是,这些模型在暴露于口语语言扰动或实践中的变化时会导致系统故障或不良输出。我们提出了一种扰动的语义结构意识转移方法,用于训练扰动插槽填充模型。具体而言,我们介绍了两种基于传销的培训策略,以分别从无监督的语言扰动语料库中分别学习上下文语义结构和单词分布。然后,我们将从上游训练过程学到的语义知识转移到原始样本中,并通过一致性处理过滤生成的数据。这些程序旨在增强老虎机填充模型的鲁棒性。实验结果表明,我们的方法始终优于先前的基本方法,并获得强有力的概括,同时阻止模型记住实体和环境的固有模式。
translated by 谷歌翻译
了解用户的意图并从句子中识别出语义实体,即自然语言理解(NLU),是许多自然语言处理任务的上游任务。主要挑战之一是收集足够数量的注释数据来培训模型。现有有关文本增强的研究并没有充分考虑实体,因此对于NLU任务的表现不佳。为了解决这个问题,我们提出了一种新型的NLP数据增强技术,实体意识数据增强(EADA),该技术应用了树结构,实体意识到语法树(EAST),以表示句子与对实体的注意相结合。我们的EADA技术会自动从少量注释的数据中构造东方,然后生成大量的培训实例,以进行意图检测和插槽填充。四个数据集的实验结果表明,该技术在准确性和泛化能力方面显着优于现有数据增强方法。
translated by 谷歌翻译
Named Entity Recognition and Intent Classification are among the most important subfields of the field of Natural Language Processing. Recent research has lead to the development of faster, more sophisticated and efficient models to tackle the problems posed by those two tasks. In this work we explore the effectiveness of two separate families of Deep Learning networks for those tasks: Bidirectional Long Short-Term networks and Transformer-based networks. The models were trained and tested on the ATIS benchmark dataset for both English and Greek languages. The purpose of this paper is to present a comparative study of the two groups of networks for both languages and showcase the results of our experiments. The models, being the current state-of-the-art, yielded impressive results and achieved high performance.
translated by 谷歌翻译
通常,在自然语言处理领域,识别指定实体是一项实用且具有挑战性的任务。由于混合的性质导致语言复杂性,因此在代码混合文本上命名的实体识别是进一步的挑战。本文介绍了CMNERONE团队在Semeval 2022共享任务11 Multiconer的提交。代码混合的NER任务旨在识别代码混合数据集中的命名实体。我们的工作包括在代码混合数据集上的命名实体识别(NER),来利用多语言数据。我们的加权平均F1得分为0.7044,即比基线大6%。
translated by 谷歌翻译
从非结构化网络文本中提取网络安全实体,例如攻击者和漏洞是安全分析的重要组成部分。但是,智能数据的稀疏性是由较高的频率变化产生的,并且网络安全实体名称的随机性使得当前方法在提取与安全相关的概念和实体方面很难表现良好。为此,我们提出了一种语义增强方法,该方法结合了不同的语言特征,以丰富输入令牌的表示,以通过非结构化文本检测和对网络安全名称进行分类。特别是,我们编码和汇总每个输入令牌的组成特征,形态特征和语音特征的一部分,以提高方法的鲁棒性。不仅如此,令牌从其在网络安全域中最相似的k单词获得了增强的语义信息,在该语料库中,将一个细心的模块借给了一个单词的差异,并从基于大规模的一般田野语料库的上下文线索中权衡了差异。我们已经在网络安全数据集DNRTI和MalwaretextDB上进行了实验,结果证明了该方法的有效性。
translated by 谷歌翻译
随着信息技术的快速发展,在线平台(例如,新闻门户网站和社交媒体)每时每刻都会产生巨大的网络信息。因此,从社会流中提取结构化的事件表现至关重要。通常,现有事件提取研究利用模式匹配,机器学习或深度学习方法来执行事件提取任务。然而,由于汉语的独特特征,中国事件提取的表现并不像英语一样好。在本文中,我们提出了一个综合框架来执行中文事件提取。所提出的方法是一个多通道输入神经框架,它集成了语义特征和语法特征。 BERT架构捕获语义特征。通过分析嵌入嵌入和图形卷积网络(GCN)分别捕获语音(POS)特征和依赖解析(DP)特征的部分。我们还在真实世界数据集中评估我们的模型。实验结果表明,该方法显着优于基准方法。
translated by 谷歌翻译
指定的实体识别(NER)或从临床文本中提取概念是识别文本中的实体并将其插入诸如问题,治疗,测试,临床部门,事件(例如录取和出院)等类别的任务。 NER构成了处理和利用电子健康记录(EHR)的非结构化数据的关键组成部分。尽管识别概念的跨度和类别本身是一项具有挑战性的任务,但这些实体也可能具有诸如否定属性,即否定其含义暗示着指定实体的消费者。几乎没有研究致力于将实体及其合格属性一起确定。这项研究希望通过将NER任务建模为有监督的多标签标记问题,为检测实体及其相应属性做出贡献。在本文中,我们提出了3种架构来实现此多标签实体标签:Bilstm N-CRF,Bilstm-Crf-Smax-TF和Bilstm N-CRF-TF。我们在2010 I2B2/VA和I2B2 2012共享任务数据集上评估了这些方法。我们的不同模型分别在I2B2 2010/VA和I2B2 2012上获得最佳NER F1分数为0. 894和0.808。在I2B2 2010/VA和I2B2 2012数据集上,获得的最高跨度微积的F1极性得分分别为0.832和0.836,获得的最高宏观平均F1极性得分分别为0.924和0.888。对I2B2 2012数据集进行的模态研究显示,基于SPAN的微平均F1和宏观平均F1的高分分别为0.818和0.501。
translated by 谷歌翻译
Lexicon信息和预先训练的型号,如伯特,已被组合以探索由于各自的优势而探索中文序列标签任务。然而,现有方法通过浅和随机初始化的序列层仅熔断词典特征,并且不会将它们集成到伯特的底层中。在本文中,我们提出了用于汉语序列标记的Lexicon增强型BERT(Lebert),其直接通过Lexicon适配器层将外部词典知识集成到BERT层中。与现有方法相比,我们的模型促进了伯特下层的深层词典知识融合。关于十个任务的十个中文数据集的实验,包括命名实体识别,单词分段和言语部分标记,表明Lebert实现了最先进的结果。
translated by 谷歌翻译
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
translated by 谷歌翻译
When a human communicates with a machine using natural language on the web and online, how can it understand the human's intention and semantic context of their talk? This is an important AI task as it enables the machine to construct a sensible answer or perform a useful action for the human. Meaning is represented at the sentence level, identification of which is known as intent detection, and at the word level, a labelling task called slot filling. This dual-level joint task requires innovative thinking about natural language and deep learning network design, and as a result, many approaches and models have been proposed and applied. This tutorial will discuss how the joint task is set up and introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with Deep Learning techniques. We will cover the datasets, experiments and metrics used in the field. We will describe how the machine uses the latest NLP and Deep Learning techniques to address the joint task, including recurrent and attention-based Transformer networks and pre-trained models (e.g. BERT). We will then look in detail at a network that allows the two levels of the task, intent classification and slot filling, to interact to boost performance explicitly. We will do a code demonstration of a Python notebook for this model and attendees will have an opportunity to watch coding demo tasks on this joint NLU to further their understanding.
translated by 谷歌翻译
我们利用预训练的语言模型来解决两种低资源语言的复杂NER任务:中文和西班牙语。我们使用整个单词掩码(WWM)的技术来提高大型和无监督的语料库的掩盖语言建模目标。我们在微调的BERT层之上进行多个神经网络体系结构,将CRF,Bilstms和线性分类器结合在一起。我们所有的模型都优于基线,而我们的最佳性能模型在盲目测试集的评估排行榜上获得了竞争地位。
translated by 谷歌翻译