Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity, which states that the rate of semantic change scales with an inverse power law of word frequency; and (ii) the law of innovation, which states that, independent of frequency, words that are more polysemous have higher rates of semantic change.
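To make law (i) concrete: a rate of change that scales as an inverse power of frequency is a straight line in log-log space, so the exponent can be recovered by linear regression. A minimal self-contained sketch on synthetic data (the frequencies, rates, and planted exponent below are invented for illustration and do not come from the paper's corpora):

```python
import numpy as np

# Synthetic per-word data obeying rate = c * freq^(-beta) with beta = 0.5.
rng = np.random.default_rng(2)
freq = 10.0 ** rng.uniform(-6, -3, size=200)               # relative frequencies
rate = 1e-4 * freq ** -0.5 * np.exp(rng.normal(0, 0.1, 200))  # noisy rates

# A power law is linear in log-log space: log rate = log c - beta * log freq,
# so a degree-1 fit recovers the exponent as the negated slope.
slope, intercept = np.polyfit(np.log(freq), np.log(rate), 1)
beta = -slope
print(f"estimated beta = {beta:.2f}")  # recovers the planted exponent
```

The same log-log regression is the standard way such frequency scaling laws are estimated once per-word rates of change have been measured from aligned embeddings.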
In this paper, we provide a theoretical understanding of word embedding and its dimensionality. Motivated by the unitary invariance of word embeddings, we propose the Pairwise Inner Product (PIP) loss, a novel metric of dissimilarity between word embeddings. Using techniques from matrix perturbation theory, we reveal a fundamental bias-variance trade-off in dimensionality selection for word embeddings. This trade-off explains many previously unexplained empirical observations, such as the existence of an optimal dimensionality. New insights and discoveries are also revealed, for example, when and how word embeddings are robust to over-fitting. By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.
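The PIP loss has a simple closed form: for an embedding matrix E whose rows are word vectors, the PIP matrix is E·Eᵀ, and the PIP loss between two embeddings of the same vocabulary is the Frobenius norm of the difference of their PIP matrices. A minimal sketch (the random matrices below stand in for trained embeddings):

```python
import numpy as np

def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    """Pairwise Inner Product (PIP) loss between two embeddings of the
    same vocabulary; the two dimensionalities may differ."""
    # PIP matrix: all pairwise inner products between word vectors.
    pip1 = E1 @ E1.T
    pip2 = E2 @ E2.T
    return float(np.linalg.norm(pip1 - pip2, ord="fro"))

rng = np.random.default_rng(0)
E = rng.standard_normal((100, 50))   # stand-in "oracle" embedding
E_trunc = E[:, :10]                  # lower-dimensional embedding
print(pip_loss(E, E))        # 0.0: identical embeddings
print(pip_loss(E, E_trunc))  # positive: information was discarded
```

The unitary invariance that motivates the metric is visible here: rotating E by any orthogonal matrix leaves all pairwise inner products, and hence the PIP loss, unchanged.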
We present a method to explore semantic change as a function of variation in distributional semantic spaces. In this paper, we apply this approach to automatically identify the areas of semantic change in the lexicon of Ancient Greek between the pre-Christian and Christian era. Distributional Semantic Models are used to identify meaningful clusters and patterns of semantic shift within a set of target words, defined through a purely data-driven approach. The results emphasize the role played by the diffusion of Christianity and by technical languages in determining semantic change in Ancient Greek and show the potentialities of distributional models in diachronic semantics.
We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Books Ngram corpus. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
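The three steps above (per-period embeddings, warping into one coordinate system, a distance-based time series per word) can be sketched with orthogonal Procrustes alignment, a common choice for the warping step. This is an illustrative sketch, not the paper's exact models: the tiny random matrices stand in for trained per-period embeddings, and the change point detection step is omitted.

```python
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return X rotated onto Y: X @ Q for the orthogonal Q minimizing ||XQ - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return X @ (U @ Vt)

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(1)
vocab = ["gay", "broadcast", "cat"]                          # hypothetical tracked words
periods = [rng.standard_normal((3, 20)) for _ in range(4)]   # 4 time slices

# Warp every period's space onto the last period, then build a
# distance-based time series per word relative to its earliest vector.
aligned = [procrustes_align(X, periods[-1]) for X in periods]
series = {w: [cosine_distance(aligned[0][i], A[i]) for A in aligned]
          for i, w in enumerate(vocab)}
print(series["cat"])  # one word's distributional time series
```

A change point detector would then be run over each word's series to flag statistically significant shifts.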
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
In traditional distributional semantic models (DSMs), the multiple senses of a polysemous word are conflated into a single vector-space representation. In this work, we propose a DSM that learns multiple distributional representations of a word based on different topics. First, a separate DSM is trained for each topic, and then each topic-based DSM is aligned to a common vector space. Our unsupervised mapping approach is motivated by the hypothesis that words which preserve their relative distances across different topical semantic subspaces constitute robust semantic anchors that define the mappings between them. The aligned cross-topic representations achieve state-of-the-art results on contextual word similarity tasks. Furthermore, evaluation on downstream NLP tasks shows that multiple topic-based embeddings outperform single-prototype models.
Recently, large amounts of historical texts have been digitized and made accessible to the public. Thanks to this, for the first time, it became possible to analyze evolution of language through the use of automatic approaches. In this paper, we show the results of an exploratory analysis aiming to investigate methods for studying and visualizing changes in word meaning over time. In particular, we propose a framework for exploring semantic change at the lexical level, at the contrastive-pair level, and at the sentiment orientation level. We demonstrate several kinds of NLP approaches that altogether give users deeper understanding of word evolution. We use two diachronic corpora that are currently the largest available historical language corpora. Our results indicate that the task is feasible and satisfactory outcomes can be already achieved by using simple approaches.
We apply techniques from natural language processing, computational linguistics, and machine learning to investigate papers in hep-th and four related sections of the arXiv: hep-ph, hep-lat, gr-qc, and math-ph. All papers in these sections, from the beginning of the arXiv until the end of 2017, are extracted and processed as a corpus, which we use to build the neural network Word2Vec. A comparative study of common n-grams, linear syntactic identities, word clouds, and word similarities is presented. We find notable scientific and sociological differences between these fields. In conjunction with support vector machines, we also show that the syntactic structures of titles in the different subfields of high-energy and mathematical physics are sufficiently distinct that a neural network can perform a binary classification of formal versus phenomenological sections with 87.1% accuracy, and a five-fold classification of all sections with 65.1% accuracy.
We investigate the relationship between lexical spaces and contextually-defined conceptual spaces, offering applications to creative concept discovery. We define a computational method for discovering members of concepts based on semantic spaces: starting with a standard distributional model derived from corpus co-occurrence statistics, we dynamically select characteristic dimensions associated with seed terms, and thus a subspace of terms defining the related concept. This approach performs as well as, and in some cases better than, leading distributional semantic models on a WordNet-based concept discovery task, while also providing a model of concepts as convex regions within a space with interpretable dimensions. In particular, it performs well on more specific, contextualized concepts; to investigate this we therefore move beyond WordNet to a set of human empirical studies, in which we compare output against human responses on a membership task for novel concepts. Finally, a separate panel of judges rate both model output and human responses, showing similar ratings in many cases, and some commonalities and divergences which reveal interesting issues for computational concept discovery.
We introduce a framework for quantifying semantic variation of common words in communities of practice and in sets of topic-related communities. We show that while some meaning shifts are shared across related communities, others are community-specific, and therefore independent of the topic of discussion. We propose such findings as evidence in favor of sociolinguistic theories of socially-driven semantic variation. Results are evaluated using an independent language modeling task. Furthermore, we investigate word-level linguistic features and show that factors such as word salience and diffusion are related to semantic variation.
We analyze semantic changes in loanwords from English that are used in Japanese (Japanese loanwords). Specifically, we create word embeddings of English and Japanese and map the Japanese embeddings into the English space so that we can calculate the similarity of each Japanese word and each English word. We then attempt to find loanwords that are semantically different from their original, see if known meaning changes are correctly captured, and show the possibility of using our methodology in language education.
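The cross-lingual mapping step can be sketched as a least-squares linear map learned from a seed dictionary of translation pairs, in the style of bilingual embedding mappings. This is an illustrative sketch under that assumption, not the paper's exact procedure: the random matrices below stand in for trained Japanese and English embeddings.

```python
import numpy as np

def fit_mapping(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Least-squares linear map W such that src @ W ~ tgt, learned from
    rows that are seed translation pairs."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(3)
d, n = 16, 50
ja = rng.standard_normal((n, d))                   # toy "Japanese" vectors
true_W = rng.standard_normal((d, d))
en = ja @ true_W + rng.normal(0, 0.01, (n, d))     # toy "English" vectors

W = fit_mapping(ja, en)
sims = [cos(ja[i] @ W, en[i]) for i in range(n)]
# A loanword whose mapped vector has low similarity to its English
# source would be a candidate for semantic change.
print(min(sims))
```

On this noise-free toy data all similarities stay near 1; on real embeddings, the words with the lowest mapped similarity are the interesting ones.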
In this paper, we report the results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns document-level representations, while the second learns word-level representations. For document-level representations, we propose Binary Paragraph Vectors: neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field on information retrieval tasks. We also report strong results in a transfer learning setting, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations, we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned with this model can be used in downstream tasks, such as part-of-speech tagging or the identification of semantic relations. In the word sense induction task, Disambiguated Skip-gram outperforms state-of-the-art models on three benchmark datasets. Our model has an elegant probabilistic interpretation. Moreover, unlike previous models of this kind, it is differentiable with respect to all of its parameters and can be trained with backpropagation. In addition to quantitative results, we present a qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualizations of selected word-sense embeddings.
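The fast-retrieval use of binary document codes comes from the fact that, with binary vectors, nearest-neighbour search reduces to cheap bitwise Hamming distances. A minimal sketch of that retrieval step (the random codes below stand in for learned Binary Paragraph Vectors; nothing here reproduces the paper's models):

```python
import numpy as np

rng = np.random.default_rng(4)
codes = rng.integers(0, 2, size=(1000, 128), dtype=np.uint8)   # 1000 document codes
query = codes[42] ^ (rng.random(128) < 0.05).astype(np.uint8)  # noisy copy of doc 42

# Hamming distance = number of differing bits, computed with XOR + sum.
hamming = (codes ^ query).sum(axis=1)
best = int(np.argmin(hamming))
print(best)  # → 42: the near-duplicate document is retrieved
```

Because XOR and popcount are cheap, such a scan scales to large collections far better than dense cosine similarity, which is the motivation for learning binary rather than real-valued codes.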
Research in historical semantics relies on the examination, selection, and interpretation of texts from corpora. Changes in meaning are tracked through the collection and careful inspection of examples that span decades and centuries. This process is inextricably tied to the researcher's expertise and familiarity with the corpus. Consequently, the results tend to be difficult to quantify and put on an objective footing, and "big-picture" changes in the vocabulary other than the specific ones under investigation may be hard to keep track of. In this paper we present a method that uses Latent Semantic Analysis (Landauer, Foltz & Laham, 1998) to automatically track and identify semantic changes across a corpus. This method can take the entire corpus into account when tracing changes in the use of words and phrases, thus potentially allowing researchers to observe the larger context in which these changes occurred, while at the same time considerably reducing the amount of work required. Moreover, because this measure relies on readily observable co-occurrence data, it affords the study of semantic change a measure of objectivity that was previously difficult to attain. In this paper we describe our method and demonstrate its potential by applying it to several well-known examples of semantic change in the history of the English language.
A recent "third wave" of neural network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, work in this area is often referred to as deep learning. Recent years have witnessed an explosive growth of research into NN-based approaches to information retrieval (IR). A significant body of work has now been created. In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research. (Kezban Dilek Onal and Ye Zhang contributed equally; Maarten de Rijke and Matthew Lease contributed equally.)
This paper gives an overview of distributional modelling of word meaning for contemporary lexicography. We also apply it in a case study on automatic semantic shift detection in Slovene tweets. We use word embeddings to compare the semantic behaviour of frequent words from a reference corpus of Slovene with their behaviour on Twitter. Words with the highest model distance between the corpora are considered as semantic shift candidates. They are manually analysed and classified in order to evaluate the proposed approach as well as to gain a better qualitative understanding of the problem. Apart from the noise due to pre-processing errors (45%), the approach yields a lot of valuable candidates, especially the novel senses occurring due to daily events and the ones produced in informal communication settings.
Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of written Arabic spanning about 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still finer periods of development.