Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people speaking at the same time in the same room. It has also been used to find linguistic features in distributional representations. In this paper, we use ICA to analyze word embeddings. We find that ICA can recover semantic features of words, and that these features can easily be combined to search for words that satisfy the combination. We show that only some of the independent components represent such features, but those that do are stable with regard to random initialization of the algorithm.
translated by Google Translate
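A growing body of research in natural language processing (NLP) and natural language understanding (NLU) investigates the human-like knowledge learned or encoded in the word embeddings of large language models. This is a step towards understanding which of the knowledge captured by language models resembles human understanding of language and communication. Here, we investigate whether and how the affective meaning of a word (i.e., valence, arousal, dominance) is encoded in word embeddings pre-trained in large neural networks. We use a human-labeled dataset as the ground truth and perform various correlation and classification tests on four types of word embeddings. The embeddings differ in being static or contextualized, and in how much affect-specific information is prioritized during the pre-training and fine-tuning phases. Our analyses show that word embeddings from the vanilla BERT model do not saliently encode the affect information of English words. Only when the BERT model is fine-tuned on emotion-related tasks, or contains extra contextualized information from emotion-rich contexts, can the corresponding embeddings encode more relevant affect information.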
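We provide a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method that outperforms previously described contextualized approaches. This method is used as the basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods often predict high change scores for words that have not undergone any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail, and a linguistic categorization of them is proposed. Our conclusion is that pre-trained contextualized language models are prone to confounding changes in lexicographic senses with changes in contextual variance. This stems naturally from their distributional nature, but differs from the types of issues observed in methods based on static embeddings. Additionally, they often merge the syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.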
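In this paper, we test two different approaches to the unsupervised word sense disambiguation task for Polish. In both methods, we use neural language models to predict words similar to those being disambiguated and, on the basis of these words, we predict the partition of word senses in different ways. In the first method, we cluster selected similar words, while in the second, we cluster vectors representing their subsets. The evaluation was carried out on texts annotated with plWordNet senses and gave a relatively good result (F1 = 0.68 for all ambiguous words). The results are significantly better than those obtained by the neural model-based unsupervised method proposed in \cite{WAW:MYK:17:Sense} and are on par with the supervised method presented there. The proposed method may be a way of solving the word sense disambiguation problem for languages that lack sense-annotated data.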
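State-of-the-art approaches to metaphor detection compare a word's literal (or core) meaning with its contextual meaning, using sequential metaphor classifiers based on neural networks. The signal that represents the literal meaning is often given by (non-contextual) word embeddings. However, metaphorical expressions evolve over time for various reasons, such as cultural and societal impact. Metaphorical expressions are known to co-evolve with language and literal word meanings, and even to drive this evolution to some extent. This raises the question of whether different, possibly time-specific, representations of literal meaning affect the metaphor detection task. To the best of our knowledge, this is the first study to examine the metaphor detection task in a detailed exploratory analysis in which different temporal and static word embeddings are used to account for different representations of literal meaning. Our experimental analysis is based on three popular benchmarks for metaphor detection and on word embeddings extracted from different corpora and temporally aligned with different state-of-the-art approaches. The results suggest that different word embeddings do affect the metaphor detection task, and that some temporal word embeddings slightly outperform static methods on some performance measures. However, the results also suggest that temporal word embeddings may provide representations of a word's core meaning that are too close to its metaphorical meaning, thus confusing the classifier. Overall, the interaction between temporal language evolution and metaphor detection appears tiny in the benchmark datasets used in our experiments. This suggests that future work on the computational analysis of this important linguistic phenomenon should start by creating a new dataset in which this interaction is better represented.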
Linguists distinguish between novel and conventional metaphor, a distinction which the metaphor detection task in NLP does not take into account. Instead, metaphoricity is formulated as a property of a token in a sentence, regardless of metaphor type. In this paper, we investigate the limitations of treating conventional metaphors in this way, and advocate for an alternative which we name 'metaphorical polysemy detection' (MPD). In MPD, only conventional metaphoricity is treated, and it is formulated as a property of word senses in a lexicon. We develop the first MPD model, which learns to identify conventional metaphors in the English WordNet. To train it, we present a novel training procedure that combines metaphor detection with word sense disambiguation (WSD). For evaluation, we manually annotate metaphor in two subsets of WordNet. Our model significantly outperforms a strong baseline based on a state-of-the-art metaphor detection model, attaining an ROC-AUC score of .78 (compared to .65) on one of the sets. Additionally, when paired with a WSD model, our approach outperforms a state-of-the-art metaphor detection model at identifying conventional metaphors in text (.659 F1 compared to .626).
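We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). Unlike prior WSI datasets for Russian, RuDSI is completely data-driven (based on texts from the Russian National Corpus), with no external word senses imposed on annotators. Depending on the parameters of the graph clustering, different derivative datasets can be produced from the raw annotation. We report the performance that several baseline WSI methods obtain on RuDSI and discuss possibilities for improving these scores.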
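User-generated content is full of misspellings. We hypothesize that many misspellings are not just random noise but carry hidden semantics that can be leveraged for language understanding tasks. This paper presents an annotated corpus of misspellings in Thai, together with an analysis of misspelling intention and its possible semantics, to better understand the misspelling patterns observed in the corpus. In addition, we introduce two approaches for incorporating the semantics of misspellings: Misspelling Average Embedding (MAE) and Misspelling Semantic Tokens (MST). Experiments on a sentiment analysis task confirm our overall hypothesis: the additional semantics from misspellings can boost the micro F1 score by 0.4-2%, while blindly normalizing misspellings is harmful and suboptimal.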
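Idiomatic expressions can be problematic for natural language processing applications because their meaning cannot be inferred from their constituent words. The lack of successful methodological approaches and of sufficiently large datasets has prevented the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for this purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform better than existing approaches and are able to detect idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of the developed models and analyze the size of the required dataset.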
News articles both shape and reflect public opinion across the political spectrum. Analyzing them for social bias can thus provide valuable insights, such as prevailing stereotypes in society and the media, which are often adopted by NLP models trained on such data. Recent work has relied on word embedding bias measures, such as WEAT. However, several representation issues of embeddings can harm the measures' accuracy, including low-resource settings and token frequency differences. In this work, we study what kind of embedding algorithm serves best to accurately measure types of social bias known to exist in US online news articles. To cover the whole spectrum of political bias in the US, we collect 500k articles and review psychology literature with respect to expected social bias. We then quantify social bias using WEAT along with embedding algorithms that account for the aforementioned issues. We compare how models trained with the algorithms on news articles represent the expected social bias. Our results suggest that the standard way to quantify bias does not align well with knowledge from psychology. While the proposed algorithms reduce the gap, they still do not fully match the literature.
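As an illustration of the WEAT measure mentioned in the abstract above, the following is a minimal sketch of the commonly used WEAT effect size: the difference between the mean associations of two target word sets with two attribute sets, divided by the standard deviation of the associations over all target words. It is not the authors' code. The two-dimensional vectors are made-up toy data, the function names are hypothetical, and the use of the sample standard deviation is an assumption (some implementations use the population form).

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity of w to set A minus mean similarity to set B."""
    sim_a = [cosine(w, a) for a in attrs_a]
    sim_b = [cosine(w, b) for b in attrs_b]
    return mean(sim_a) - mean(sim_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Effect size: (mean s(x) - mean s(y)) / std of s(w) over all targets.

    Assumption: sample standard deviation (statistics.stdev).
    """
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy 2-d "embeddings" (hypothetical data, for illustration only).
X = [[1.0, 0.1], [0.9, 0.2]]   # target set X, close to attribute set A
Y = [[0.1, 1.0], [0.2, 0.9]]   # target set Y, close to attribute set B
A = [[1.0, 0.0]]               # attribute set A
B = [[0.0, 1.0]]               # attribute set B

effect = weat_effect_size(X, Y, A, B)
swapped = weat_effect_size(Y, X, A, B)
print(f"effect size: {effect:.3f}, with target sets swapped: {swapped:.3f}")
```

A positive value indicates that X is more strongly associated with A (and Y with B) than the reverse; swapping the target sets flips the sign.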
Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GloVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embeddings have significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner. The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee," thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locates similar words.
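To make the idea of defining a word through logical clauses more concrete, here is a small sketch of how a conjunctive clause over context words could be evaluated. It only illustrates the propositional form (a conjunction of words that must be present and words that must be absent). It does not reproduce the Tsetlin Machine autoencoder or its training, and the class name, the second clause, and the vote-counting helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A conjunction of literals: required words AND NOT forbidden words."""
    required: frozenset
    forbidden: frozenset = frozenset()

    def matches(self, context_words):
        context = set(context_words)
        # True only if every required word occurs and no forbidden word occurs.
        return self.required <= context and not (self.forbidden & context)


# Hypothetical clauses "defining" the word "coffee".
# The first uses the example words from the abstract; the second is invented.
coffee_clauses = [
    Clause(required=frozenset({"black", "cup", "hot"})),
    Clause(required=frozenset({"hot", "cup"}), forbidden=frozenset({"tea"})),
]


def clause_votes(clauses, context_words):
    """Count how many clauses are satisfied by the given context."""
    return sum(1 for clause in clauses if clause.matches(context_words))


print(clause_votes(coffee_clauses, ["a", "hot", "black", "cup"]))
print(clause_votes(coffee_clauses, ["hot", "cup", "tea"]))
```

Because each clause is a readable set of words, one can inspect directly which context words support a prediction, which is the interpretability argument made in the abstract.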
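As social robots and other intelligent machines enter the home, artificial emotional intelligence (AEI) is taking center stage to address users' desire for deeper, more meaningful human-machine interaction. To accomplish such effective interaction, next-generation AEI needs comprehensive human emotion models for training. Unlike theories of emotion, which have been the historical focus in psychology, emotion models are a descriptive tool. In practice, the strongest models need robust coverage, which means defining the smallest core set of emotions from which all others can be derived. To achieve the desired coverage, we turn to word embeddings from natural language processing. Using unsupervised clustering techniques, our experiments show that with as few as 15 discrete emotion categories, we can provide maximum coverage across six major languages (Arabic, Chinese, English, French, Spanish, and Russian). In support of our findings, we also examine annotations from two large-scale emotion recognition datasets to assess the validity of existing emotion models compared to human perception at scale. Because robust, comprehensive emotion models are foundational for developing real-world affective computing applications, this work has broad implications for social robotics, human-machine interaction, mental healthcare, and computational psychology.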
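We use paraphrases as a unique source of data to analyze contextualized embeddings, with a particular focus on BERT. Because paraphrases naturally encode consistent word and phrase semantics, they provide a unique lens for investigating the properties of embeddings. Using the alignments of the Paraphrase Database, we study words within paraphrases as well as phrase representations. We find that contextual embeddings handle polysemous words effectively, but in many cases give synonyms surprisingly different representations. We confirm previous findings that BERT is sensitive to word order, but find slightly different patterns from prior work in terms of the level of contextualization across BERT's layers.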
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embedding models became prominent by proving able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short on intrinsic issues of linguistics, such as polysemy and homonymy. Any expert system that makes use of natural language at its core can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), which disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embedding algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that it can produce state-of-the-art results and outperforms several more complex state-of-the-art systems.
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its individual words do. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embedding models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embedding algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration of lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
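Background: In the information extraction and natural language processing domains, accessible datasets are crucial for reproducing and comparing results. Publicly available implementations and tools can serve as benchmarks and facilitate the development of more complex applications. However, in the context of clinical text processing, accessible datasets are scarce, as are existing tools. One of the main reasons is the sensitivity of the data. This problem is even more evident for non-English languages. Methods: To address this situation, we introduce a workbench: a collection of German clinical text processing models. The models are trained on a de-identified corpus of German nephrology reports. Results: The presented models provide promising results on in-domain data. Moreover, we show that our models can also be successfully applied to other biomedical text in German. Our workbench is publicly available, so it can be used out of the box or transferred to related problems.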
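The Word-in-Context (WiC) task has attracted considerable attention in the NLP community, as demonstrated by the popularity of the recent MCL-WiC SemEval shared task. Systems and lexical resources from word sense disambiguation (WSD) are often used for the WiC task and for WiC dataset construction. In this paper, we establish the exact relationship between WiC and WSD, as well as the related task of target sense verification (TSV). Building on a novel hypothesis about the equivalence of sense and meaning distinctions, we demonstrate, by applying tools from theoretical computer science, that these three semantic classification problems can be pairwise reduced to each other and are therefore equivalent. Experimental results involving systems and datasets for both WiC and WSD provide strong empirical evidence that our problem reductions work in practice.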
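The notion of reducing one of these tasks to another can be illustrated with a short sketch. Assuming access to some WSD system that maps a context and a target word to a sense identifier, WiC can be answered by checking whether two contexts receive the same sense, and TSV by checking whether the predicted sense equals the candidate sense. This shows only the direction from WiC and TSV to WSD; the function names and the keyword-based `toy_wsd` stand-in are hypothetical and are not the systems used in the paper.

```python
from typing import Callable

# Hypothetical WSD interface: (context sentence, target word) -> sense identifier.
WSDSystem = Callable[[str, str], str]


def wic_from_wsd(wsd: WSDSystem, word: str, context_a: str, context_b: str) -> bool:
    """WiC via WSD: the word has the same meaning in both contexts
    exactly when the WSD system assigns the same sense to both."""
    return wsd(context_a, word) == wsd(context_b, word)


def tsv_from_wsd(wsd: WSDSystem, word: str, context: str, candidate_sense: str) -> bool:
    """TSV via WSD: verify a candidate sense by comparing it with the predicted one."""
    return wsd(context, word) == candidate_sense


def toy_wsd(context: str, word: str) -> str:
    """Stand-in WSD system based on keyword matching (illustration only)."""
    if word == "bank":
        if "river" in context or "water" in context:
            return "bank%river"
        return "bank%finance"
    return f"{word}%default"


same = wic_from_wsd(toy_wsd, "bank",
                    "He deposited cash at the bank.",
                    "The bank approved her loan.")
different = wic_from_wsd(toy_wsd, "bank",
                         "She sat on the river bank.",
                         "He deposited cash at the bank.")
verified = tsv_from_wsd(toy_wsd, "bank", "She sat on the river bank.", "bank%river")
print(same, different, verified)
```

The quality of the derived WiC and TSV answers depends entirely on the underlying WSD system, which is why the abstract's empirical check of the reductions matters.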
Supervised approaches generally rely on majority-based labels. However, it is hard to achieve high agreement among annotators in subjective tasks such as hate speech detection. Existing neural network models principally regard labels as categorical variables, while ignoring the semantic information in diverse label texts. In this paper, we propose AnnoBERT, a first-of-its-kind architecture that integrates annotator characteristics and label text with a transformer-based model to detect hate speech. It builds a unique representation of each annotator's characteristics via Collaborative Topic Regression (CTR) and integrates label text to enrich textual representations. During training, the model associates annotators with their label choices given a piece of text; during evaluation, when label information is not available, the model predicts the aggregated label given by the participating annotators by utilising the learnt association. The proposed approach displays an advantage in detecting hate speech, especially in the minority class and in edge cases with annotator disagreement. The improvement in overall performance is largest when the dataset is more label-imbalanced, suggesting its practical value in identifying real-world hate speech, as the volume of hate speech in the wild on social media is extremely small compared with normal (non-hate) speech. Through ablation studies, we show the relative contributions of annotator embeddings and label text to the model performance, and test a range of alternative annotator embeddings and label text combinations.
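We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive license, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).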
Machine learning about language can be improved by supplying it with specific knowledge and sources of external information. We present here a new version of the linked open data resource ConceptNet that is particularly well suited to be used with modern NLP techniques such as word embeddings. ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use. When ConceptNet is combined with word embeddings acquired from distributional semantics (such as word2vec), it provides applications with understanding that they would not acquire from distributional semantics alone, nor from narrower resources such as WordNet or DBPedia. We demonstrate this with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.
• A net is used for catching fish.
• "Leaves" is a form of the word "leaf".
• The word cold in English is studený in Czech.
• O alimento é usado para comer [Food is used for eating].