Distributed representations of words encode lexical semantic information, but how is that information encoded in word embeddings? Focusing on the skip-gram with negative-sampling method, we show theoretically and experimentally that the squared norm of a word embedding encodes the information gain defined by the Kullback-Leibler divergence of the co-occurrence distribution of a word to the unigram distribution of the corpus. Furthermore, through experiments on keyword extraction, hypernym prediction, and part-of-speech discrimination, we confirm that the KL divergence and the squared norm of the embedding work as a measure of the informativeness of a word, provided that the bias caused by word frequency is adequately corrected.
translated by 谷歌翻译
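The information gain named in the first abstract, the KL divergence of a word's co-occurrence distribution to the corpus unigram distribution, can be illustrated with a minimal sketch. The function name `kl_to_unigram` and the toy count matrix are hypothetical, and the unigram distribution is assumed here to be estimated from the column totals of the same count matrix. The sketch computes only the KL side of the claimed relationship; it does not train embeddings or apply the frequency-bias correction the abstract mentions.

```python
import math

def kl_to_unigram(cooc):
    """Return KL(p(.|w) || p(.)) for every row word w of a count matrix.

    cooc[i][j] is the (assumed) number of times context word j occurs
    near word i. Every row is assumed to contain at least one count.
    """
    total = sum(sum(row) for row in cooc)
    n_ctx = len(cooc[0])
    # Unigram distribution of context words, from column totals.
    unigram = [sum(row[j] for row in cooc) / total for j in range(n_ctx)]

    result = []
    for row in cooc:
        row_total = sum(row)
        kl = 0.0
        for count, q in zip(row, unigram):
            if count > 0:  # terms with p = 0 contribute nothing
                p = count / row_total
                kl += p * math.log(p / q)
        result.append(kl)
    return result

# Toy example: word 1 only ever co-occurs with context 0, so its
# co-occurrence distribution is further from the unigram distribution.
toy_counts = [[2, 2], [4, 0]]
divergences = kl_to_unigram(toy_counts)
```

In this toy matrix the second word has the larger divergence, which matches the intuition that words with more selective contexts are more informative.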
It is well known that typical word embedding methods such as Word2Vec and GloVe have the property that meaning can be composed by adding up the embeddings (additive compositionality). Several theories have been proposed to explain additive compositionality, but the following issues remain unresolved: (Q1) The assumptions of those theories do not hold for practical word embeddings. (Q2) Ordinary additive compositionality can be seen as an AND operation on word meanings, but it is not well understood how other operations, such as OR and NOT, can be computed from the embeddings. We address these issues with an approach that has frequency-weighted centering at its core. As an answer to (Q1), this paper proposes a post-processing method that bridges the gap between practical word embeddings and the assumptions of the theory of additive compositionality. As an answer to (Q2), it also gives a method for taking the OR or NOT of meanings by linear operations on word embeddings. Moreover, we confirm experimentally that the accuracy of the AND operation, i.e., ordinary additive compositionality, can be improved by our post-processing method (3.5x improvement in top-100 accuracy) and that the OR and NOT operations can be performed correctly.
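One plausible reading of frequency-weighted centering is to subtract the frequency-weighted mean embedding from every word vector. The sketch below assumes that reading; the abstract does not spell out the exact post-processing, so the function name `frequency_weighted_center` and the toy inputs are hypothetical and this should not be taken as the authors' method.

```python
def frequency_weighted_center(embeddings, frequencies):
    """Subtract the frequency-weighted mean vector from each embedding.

    embeddings  : list of equal-length vectors, one per word
    frequencies : raw corpus counts (or probabilities) of the same words
    Assumption: "centering" means removing the mean weighted by p(w).
    """
    total = sum(frequencies)
    dim = len(embeddings[0])
    # Weighted mean: sum_w p(w) * e_w, with p(w) = frequency / total.
    mean = [
        sum(f * vec[d] for f, vec in zip(frequencies, embeddings)) / total
        for d in range(dim)
    ]
    return [[x - m for x, m in zip(vec, mean)] for vec in embeddings]

# Toy example with two 2-dimensional embeddings and counts 3 and 1.
centered = frequency_weighted_center([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

After this step the frequency-weighted average of the centered vectors is the zero vector, which is the property an unweighted mean subtraction would not guarantee when word frequencies are skewed.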
Numerous works use word embedding-based metrics to quantify societal biases and stereotypes in texts. Recent studies have found that word embeddings can capture semantic similarity but may be affected by word frequency. In this work we study the effect of frequency when measuring female vs. male gender bias with word embedding-based bias quantification methods. We find that Skip-gram with negative sampling and GloVe tend to detect male bias in high frequency words, while GloVe tends to return female bias in low frequency words. We show these behaviors still exist when words are randomly shuffled. This proves that the frequency-based effect observed in unshuffled corpora stems from properties of the metric rather than from word associations. The effect is spurious and problematic since bias metrics should depend exclusively on word co-occurrences and not individual word frequencies. Finally, we compare these results with the ones obtained with an alternative metric based on Pointwise Mutual Information. We find that this metric does not show a clear dependence on frequency, even though it is slightly skewed towards male bias across all frequencies.
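Measuring the semantic similarity of different texts has many important applications in digital humanities research, such as information retrieval, document clustering, and text summarization. The performance of different methods depends on the length of the text, the domain, and the language. This study focuses on experimenting with some current approaches on Finnish, a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method that can be used as a framework for benchmarking text similarity approaches.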
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
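We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method that outperforms previously described contextualized approaches. This method is used as the basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words that have not undergone any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail, and a linguistic categorization of them is proposed. Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses with changes in contextual variance, which naturally stems from their distributional nature but differs from the types of issues observed in methods based on static embeddings. In addition, they often merge the syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.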
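The statistical regularities in language corpora encode well-known social biases into word embeddings. Here, we focus on gender to provide a comprehensive analysis of widely used static English word embeddings trained on internet corpora (GloVe 2014, fastText 2017). Using the Single-Category Word Embedding Association Test, we demonstrate the widespread prevalence of gender biases that also show differences in: (a) the frequencies of words associated with men versus women; (b) the parts of speech of gender-associated words; (c) the semantic categories of gender-associated words; and (d) the valence, arousal, and dominance of gender-associated words. First, in terms of word frequency: we find that, of the 1,000 most frequent words in the vocabulary, 77% are more associated with men than with women, which is direct evidence of a masculine default in the everyday language of the English-speaking world. Second, turning to parts of speech: the top male-associated words are typically verbs (e.g., fight, overpower), while the top female-associated words are typically adjectives and adverbs (e.g., giving, emotionally). Gender biases in embeddings thus also permeate parts of speech. Third, for semantic categories: we perform bottom-up cluster analyses of the top 1,000 words associated with each gender. The top male-associated concepts include roles and domains of big tech, engineering, religion, sports, and violence; in contrast, the top female-associated concepts are less focused on roles and include female-specific slurs and sexual content, as well as appearance and kitchen terms. Fourth, using human ratings of valence, arousal, and dominance from a lexicon of ~20,000 words, we find that male-associated words are higher on arousal and dominance, while female-associated words are higher on valence.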
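Word embeddings learn implicit biases from the linguistic regularities captured by word co-occurrence statistics. By extending methods that quantify human-like biases in word embeddings, we introduce ValNorm, a novel intrinsic evaluation task and method for quantifying the valence dimension of affect in human-rated word sets from social psychology. We apply ValNorm to static word embeddings from seven languages (Chinese, English, German, Polish, Portuguese, Spanish, and Turkish) and from historical English text spanning 200 years. ValNorm achieves consistently high accuracy in quantifying the valence of non-discriminatory, non-social-group word sets. Specifically, ValNorm achieves a Pearson correlation of r = 0.88 with human judgment scores of valence for 399 words collected to establish pleasantness norms in English. In contrast, we measure gender stereotypes using the same set of word embeddings and find that social biases vary across languages. Our results indicate that the valence associations of non-discriminatory, non-social-group words represent widely shared associations across seven languages and over 200 years.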
We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS's solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS's factorization.
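The sparse Shifted Positive PMI representation mentioned above can be sketched as follows. The abstract only says the PMI is shifted by a global constant; this sketch assumes that the constant is log k, with k the number of negative samples, and that entries with zero co-occurrence are set to 0. The function name `sppmi` and the toy counts are hypothetical.

```python
import math

def sppmi(counts, k=1):
    """Shifted Positive PMI: max(PMI(w, c) - log k, 0) for each cell.

    counts[i][j] is the (assumed) co-occurrence count of word i with
    context j; k is the assumed number of negative samples.
    """
    total = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    shift = math.log(k)

    out = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n in enumerate(row):
            if n == 0:
                # PMI is -infinity for unseen pairs; the positive part is 0.
                out_row.append(0.0)
            else:
                pmi = math.log(n * total / (row_tot[i] * col_tot[j]))
                out_row.append(max(pmi - shift, 0.0))
        out.append(out_row)
    return out

# Toy example: two words that each co-occur with only one context.
matrix = sppmi([[4, 0], [0, 4]], k=1)
```

Raising k increases the shift, so more cells are clipped to zero and the matrix becomes sparser.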
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embedding models became prominent, having proved able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short on intrinsic issues of linguistics, such as polysemy and homonymy. Any expert system that makes use of natural language at its core can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), which disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embeddings algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that it can produce state-of-the-art results and outperforms several more complex state-of-the-art systems.
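Embedding-based neural topic models can explicitly represent words and topics by embedding them in a homogeneous feature space, and thus show higher interpretability. However, there are no explicit constraints on the training of the embeddings, which leads to a larger optimization space. In addition, a clear description of how the embeddings change and how this affects model performance is still lacking. In this paper, we propose an embedding-regularized neural topic model, which applies specially designed training constraints to the word embeddings and topic embeddings to reduce the optimization space of the parameters. To reveal the changes and roles of the embeddings, we introduce uniformity into embedding-based neural topic models as an evaluation metric of the embedding space. On this basis, we describe how the embeddings change during training through the changes in embedding uniformity. Furthermore, we demonstrate the impact of changes in the embeddings of embedding-based neural topic models through ablation studies. The results of experiments on two mainstream datasets show that our model significantly outperforms baseline models in terms of the harmony between topic quality and document modeling. To the best of our knowledge, this work is the first to exploit uniformity to explore changes in the embeddings of embedding-based neural topic models and their impact on model performance.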
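Word Mover's Distance (WMD) computes the distance between words and models text similarity as the moving cost between the words in two text sequences. However, it does not perform well in sentence similarity evaluation, since it does not incorporate word importance and fails to take the inherent contextual and structural information of a sentence into account. This work proposes an improved WMD method that uses the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), to address these two shortcomings. First, a weighted graph is built from the word co-occurrence statistics extracted from the syntactic parse trees of sentences, and the importance of each word is inferred from graph connectivity. Second, the local syntactic parsing structure of words is considered when computing the distance between words. To demonstrate the effectiveness of the proposed SynWMD, we conduct experiments on 6 semantic textual similarity (STS) datasets and 4 sentence classification datasets. Experimental results show that SynWMD achieves state-of-the-art performance on STS tasks. It also outperforms other WMD-based methods on sentence classification tasks.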
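The scientific world is changing at a rapid pace, with new technology being developed and new trends being set at an increasing frequency. This paper presents a framework for conducting scientific analyses of academic publications, which is crucial for monitoring research trends and identifying potential innovations. The framework adopts and combines various natural language processing techniques, such as word embedding and topic modelling. Word embedding is used to capture the semantic meanings of domain-specific words. We propose two novel scientific publication embeddings, PUB-G and PUB-W, which are capable of learning the semantic meanings of general as well as domain-specific words in various research fields. Thereafter, topic modelling is used to identify clusters of research topics within these larger research fields. We curated a publication dataset consisting of two conferences and two journals from 1995 to 2020 from two research domains. Experimental results show that our PUB-G and PUB-W embeddings are superior to other baseline embeddings in terms of topic coherence.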
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words do individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration of lexical chains and word embeddings representations sustains state-of-the-art results, even against more complex systems.
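There is an ongoing debate in the NLP community about whether modern language models contain linguistic knowledge that can be recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but still perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates the language modeling objective to linguistic information. This framework also provides a metric for measuring the impact of linguistic information on the word prediction task. We reinforce our analytical results with experiments on both synthetic and real NLP tasks in English.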
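Patent data is an important source of knowledge for innovation research, and the technological similarity between pairs of patents is a key enabling indicator for patent analysis. Recently, researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents, in order to better understand innovations, patent landscaping, technology mapping, and patent quality evaluation. To the best of our knowledge, there is no comprehensive survey that builds a big picture of the performance of embedding models for calculating patent similarity indicators. Therefore, in this study, we provide an overview of the accuracy of these algorithms based on patent classification performance. In a detailed discussion, we report the performance of the top 3 algorithms at the section, class, and subclass levels. The results based on the first claim of patents show that PatentSBERTa, Bert-for-patents, and TF-IDF weighted word embeddings have the best accuracy for computing sentence embeddings at the subclass level. According to these first results, the performance of the models varies across classes, which shows that researchers in patent analysis can use the results of this study to choose the most appropriate model for the specific section of patent data they work with.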
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
Measuring the semantic similarity between two sentences is still an important task. The word mover's distance (WMD) computes the similarity via the optimal alignment between the sets of word embeddings. However, WMD does not utilize word order, making it difficult to distinguish sentences with large overlaps of similar words, even if they are semantically very different. Here, we attempt to improve WMD by incorporating the sentence structure represented by BERT's self-attention matrix (SAM). The proposed method is based on the Fused Gromov-Wasserstein distance, which simultaneously considers the similarity of the word embedding and the SAM for calculating the optimal transport between two sentences. Experiments on paraphrase identification and semantic textual similarity show that the proposed method improves WMD and its variants. Our code is available at https://github.com/ymgw55/WSMD.
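Distributed document representation is one of the basic problems in natural language processing. Current distributed document representation methods mainly consider the context information of words or sentences. These methods do not take into account the coherence of the document as a whole, e.g., the relation between a paper's title and abstract, between a headline and description, or between adjacent bodies of text in the document. Coherence shows whether a document is meaningful, both logically and syntactically, especially in scientific documents (papers, patents, etc.). In this paper, we propose a coupled text pair embedding (CTPE) model to learn representations of scientific documents, which maintains the coherence of a document through coupled text pairs formed by segmenting the document. First, we divide the document into two parts (e.g., title and abstract) that form a coupled text pair. Then, we adopt negative sampling to construct uncoupled text pairs whose two parts come from different documents. Finally, we train the model to judge whether a text pair is coupled or uncoupled, and use the resulting embedding of the coupled text pair as the embedding of the document. We perform experiments on three datasets for one information retrieval task and two recommendation tasks. The experimental results verify the effectiveness of the proposed CTPE model.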
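In natural language processing (NLP), the likelihood ratios (LRs) of N-grams are often estimated from frequency information. However, a corpus contains only a fraction of the possible N-grams, and most of them occur infrequently. Hence, we desire an LR estimator for low- and zero-frequency N-grams. One way to achieve this is to decompose the N-grams into discrete values, such as letters and words, and take the product of the LRs of those values. However, because this method deals with a large number of discrete values, the running time and memory usage of the estimation are problematic. Moreover, the use of unnecessary discrete values degrades the estimation accuracy. Therefore, this paper proposes combining the aforementioned method with the feature selection methods used in document classification, and shows that our estimator provides effective and efficient estimation results for low- and zero-frequency N-grams.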