Sememes are defined as the minimum semantic units of human languages. Researchers have manually annotated lexical sememes for words and formed linguistic knowledge bases. However, manual construction is time-consuming and labor-intensive, and suffers from significant annotation inconsistency and noise. In this paper, we for the first time explore the automatic prediction of lexical sememes based on the semantic meanings of words encoded by word embeddings. Moreover, we apply matrix factorization to learn the semantic relations between sememes and words. In experiments, we take a real-world sememe knowledge base, HowNet, for training and evaluation, and the results reveal the effectiveness of our method for lexical sememe prediction. Our method will be of great use for annotation verification of existing noisy sememe knowledge bases and annotation suggestion for new words and phrases.
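To make the matrix-factorization idea concrete, here is a minimal sketch (not the paper's exact model) in which pretrained word embeddings are held fixed and sememe embeddings are fit by ridge regression so that the word-sememe annotation matrix factorizes over the embedding space; all names and the toy data are illustrative:

```python
import numpy as np

def fit_sememe_embeddings(W, M, reg=1e-2):
    """Fit sememe embeddings S so that M ~ W @ S.T (ridge regression,
    closed form), with word embeddings W held fixed."""
    d = W.shape[1]
    A = W.T @ W + reg * np.eye(d)          # (d, d)
    S_T = np.linalg.solve(A, W.T @ M)      # (d, n_sememes)
    return S_T.T                           # (n_sememes, d)

def predict_sememes(word_vec, S, top_k=5):
    """Rank candidate sememes for an unannotated word."""
    scores = S @ word_vec
    return np.argsort(-scores)[:top_k]

# toy usage: 4 words, 3 sememes, 8-dimensional embeddings
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                   # pretrained word embeddings
M = (rng.random((4, 3)) > 0.5).astype(float)  # gold word-sememe matrix
S = fit_sememe_embeddings(W, M)
print(predict_sememes(rng.normal(size=8), S, top_k=2))
```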
Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed of several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we show that word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. The key idea is to utilize word sememes to accurately capture the exact meanings of a word within specific contexts. More specifically, we follow the framework of Skip-gram and present three sememe-encoded models to learn representations of sememes, senses, and words, where we apply an attention scheme to detect word senses in various contexts. We conduct experiments on two tasks, word similarity and word analogy, and our models significantly outperform baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm that our models are capable of correctly modeling sememe information.
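A minimal sketch of such an attention scheme over senses, assuming each sense is represented by the average of its sememes' embeddings and attention weights come from a context vector (the concrete parametrization in the paper may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_aware_word_rep(context_vec, sense_sememe_ids, sememe_emb):
    """Each sense is the average of its sememes' embeddings; the
    context attends over senses, and the word representation is the
    attention-weighted mixture of sense vectors."""
    senses = np.stack([sememe_emb[ids].mean(axis=0)
                       for ids in sense_sememe_ids])  # (n_senses, d)
    attention = softmax(senses @ context_vec)         # (n_senses,)
    return attention @ senses                         # (d,)

# toy usage: 10 sememes, a word with 2 senses
rng = np.random.default_rng(1)
sememe_emb = rng.normal(size=(10, 16))
print(context_aware_word_rep(rng.normal(size=16),
                             [[0, 3], [5, 7, 9]], sememe_emb)[:4])
```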
Sememes are defined as the minimum semantic units of human languages. As important knowledge sources, sememe-based linguistic knowledge bases have been widely used in many NLP tasks. However, most languages still do not have sememe-based linguistic knowledge bases. We therefore present the task of cross-lingual lexical sememe prediction, which aims to automatically predict sememes for words in other languages. We propose a novel framework that models correlations between sememes and multilingual words in a low-dimensional semantic space for sememe prediction. Experimental results on real-world datasets show that our proposed model achieves consistent and significant improvements over baseline methods in cross-lingual sememe prediction. The code and data of this paper are available at https://github.com/thunlp/CL-SP.
In this paper, we present OpenHowNet, an open sememe-based lexical knowledge base. Built upon the well-known HowNet, OpenHowNet comprises three components: OpenHowNet core data, which consists of more than 100 thousand senses with sememe annotations; OpenHowNet Web, which briefly introduces OpenHowNet and displays OpenHowNet information online; and the OpenHowNet API, which includes several useful interfaces such as accessing the OpenHowNet core data and drawing the sememe tree structures of senses. In the main text, we first give some background knowledge, including the definition of sememes and the details of HowNet. Then we introduce previous HowNet- and sememe-based research. Last but not least, we describe the components of OpenHowNet and their basic features and functionalities in detail. Finally, we give a brief summary and list some future work.
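A usage sketch of the API component described above; the method names follow the package's documented interface, but exact signatures can vary across OpenHowNet releases, so treat this as illustrative:

```python
import OpenHowNet

OpenHowNet.download()                  # fetch the core data once
hownet_dict = OpenHowNet.HowNetDict()

# retrieve the sememe annotations of a word
print(hownet_dict.get_sememes_by_word("apple"))
```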
Most language modeling methods rely on large-scale data to statistically learn the sequential patterns of words. In this paper, we argue that words are atomic language units but not necessarily atomic semantic units. Inspired by HowNet, we use sememes, the minimum semantic units in human languages, to represent the implicit semantics behind words for language modeling, yielding the Sememe-Driven Language Model (SDLM). More specifically, to predict the next word, SDLM first estimates the sememe distribution given the textual context. Afterwards, it regards each sememe as a distinct semantic expert, and these experts jointly identify the most probable senses and the corresponding word. In this way, SDLM enables language models to work at the fine-grained sememe level of semantics beyond word-level manipulation, and offers more powerful tools for fine-tuning language models and improving their interpretability and robustness. Experiments on language modeling and the downstream application of headline generation demonstrate the significant effectiveness of SDLM. The source code and data used in the experiments can be accessed at https://github.com/thunlp/SDLM-pytorch.
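A simplified sketch of the SDLM pipeline described above, using a mixture of sememe experts for brevity (the published model uses a sparse product of experts); all array names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sdlm_next_word(h, U, expert_scores, sense_to_word, n_words):
    """h: context hidden state (d,); U: (n_sememes, d) maps context to
    sememe logits; expert_scores: (n_sememes, n_senses) each sememe
    expert's score for each sense; sense_to_word: sense index -> word
    index. Sememe experts vote on senses, whose probabilities are
    summed into word probabilities."""
    p_sememe = softmax(U @ h)                    # sememe distribution
    p_sense = softmax(p_sememe @ expert_scores)  # experts vote jointly
    p_word = np.zeros(n_words)
    for s, w in enumerate(sense_to_word):        # aggregate senses per word
        p_word[w] += p_sense[s]
    return p_word
```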
Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their main limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in their broader acceptation) as a method for modeling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, namely unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for such representations, and provides an analysis of four of their important aspects: interpretability, sense granularity, adaptability to different domains, and compositionality.
Previous researchers have paid little attention to creating explicit morpheme embeddings independent of any corpus, although such information plays an important role in expressing the exact meanings of words in parataxis languages such as Chinese. In this paper, after constructing a Chinese lexical ontology based on word formation, we propose a novel approach to implant structured rational knowledge into distributed representations at the morpheme level, naturally avoiding heavy disambiguation in corpora. We design a template to create instances as pseudo-sentences merely from the morpheme knowledge encoded in the lexicon. To exploit the hierarchical information and tackle data sparseness, an instance-proliferation technique based on similarity is applied to expand the collection of pseudo-sentences. The distributed representations of morphemes can then be trained on these pseudo-sentences using word2vec. For evaluation, we validate the paradigmatic and syntagmatic relations of the morpheme embeddings and apply the obtained embeddings to word similarity measurement, achieving significant improvements over classical models of more than 5 Spearman scores or 8 percentage points, which shows a very promising prospect for adopting this new knowledge source.
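A minimal sketch of the pseudo-sentence idea, assuming a toy lexicon and an illustrative template rather than the paper's actual one, with gensim's word2vec doing the training:

```python
from gensim.models import Word2Vec  # gensim >= 4.0 API

# toy lexicon: morpheme -> (hypernym class, related morphemes)
lexicon = {
    "morphA": ("classX", ["morphB", "morphC"]),
    "morphB": ("classX", ["morphA"]),
    "morphC": ("classY", ["morphA"]),
}

# template-generated pseudo-sentences (no corpus involved)
pseudo_sentences = [
    [morpheme, "is-a", hypernym, *related]
    for morpheme, (hypernym, related) in lexicon.items()
]

model = Word2Vec(pseudo_sentences, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=50)
print(model.wv["morphA"][:5])
```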
Context representations are central to various NLP tasks, such as word sense disambiguation, named entity recognition, coreference resolution, and many more. In this work we present a neural model for efficiently learning a generic context embedding function from large corpora, using a bidirectional LSTM. With a very simple application of our context representations, we manage to surpass or nearly reach state-of-the-art results on sentence completion, lexical substitution, and word sense disambiguation tasks, while substantially outperforming the popular context representation of averaged word embeddings. We release our code and pre-trained models, suggesting they could be useful in a wide variety of NLP tasks.
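A sketch of such an encoder under common assumptions (a bidirectional LSTM whose left-to-right state before the target slot and right-to-left state after it are concatenated and projected); hyperparameters and the training objective are omitted:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional-LSTM context encoder: the representation of a
    target slot concatenates the forward state just before it and the
    backward state just after it (target assumed not at a boundary)."""
    def __init__(self, vocab_size, emb_dim=100, hidden=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, token_ids, target_pos):
        out, _ = self.lstm(self.emb(token_ids))   # (B, T, 2H)
        H = out.size(-1) // 2
        left = out[:, target_pos - 1, :H]         # forward direction
        right = out[:, target_pos + 1, H:]        # backward direction
        return self.proj(torch.cat([left, right], dim=-1))
```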
Recent work has shown success in learning word embeddings with neural network language models (NNLM). However, the majority of previous NNLMs represent each word with a single embedding, which fails to capture polysemy. In this paper, we address this problem by representing words with multiple, sense-specific embeddings, which are learned from bilingual parallel data. We evaluate our embeddings using word similarity measurement and show that our approach is significantly better at capturing sense-level word similarities. We further feed our embeddings as features into Chinese named entity recognition and obtain noticeable improvements over single embeddings.
Word embeddings are a key component of high-performing natural language processing (NLP) systems, but it remains a challenge to learn good representations for novel words on the fly, i.e., for words that did not occur in the training data. The general problem setting is that word embeddings are induced on an unlabeled training corpus, and then a model is trained to embed novel words into this induced embedding space. Currently, two approaches for learning embeddings of novel words exist: (i) learning an embedding from the novel word's surface form (e.g., subword n-grams) and (ii) learning an embedding from the contexts in which it occurs. In this paper, we propose an architecture that leverages both sources of information (surface form and context) and show that it substantially improves embedding quality. Our architecture obtains state-of-the-art results on the Definitional Nonce and Contextual Rare Words datasets. As input, we only require an embedding set and an unlabeled corpus to train our architecture to produce embeddings appropriate for the induced embedding space. Thus, our model can easily be integrated into any existing NLP system and enhance its ability to handle novel words.
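One plausible way to combine the two sources is a gated mixture of a form-based and a context-based embedding; the gate parametrization below is an illustrative assumption, not necessarily the paper's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def form_context_embedding(v_form, v_context, w, b):
    """Gated combination: a scalar gate, computed from both inputs,
    decides how much to trust the surface-form embedding versus the
    context-based embedding."""
    alpha = sigmoid(w @ np.concatenate([v_form, v_context]) + b)
    return alpha * v_form + (1.0 - alpha) * v_context
```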
Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces à la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transformation that is efficiently learnable using pretrained word vectors and linear regression. This transform is applicable "on the fly" in the future when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the à la carte method requires fewer examples of words in context to learn high-quality embeddings, and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks.
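The recipe is simple enough to sketch directly: learn a linear map from average context vectors to pretrained embeddings, then apply it to the context average of a new word (variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def learn_transform(context_avgs, pretrained_vecs):
    """Fit A so that A @ (avg context vector) ~ pretrained embedding.
    Both inputs are (n_words, d) matrices over common words."""
    reg = LinearRegression(fit_intercept=False)
    reg.fit(context_avgs, pretrained_vecs)
    return reg.coef_                       # the (d, d) map A

def embed_new_word(context_vectors, A):
    """Works even from a single usage example: average, then map."""
    return A @ np.mean(context_vectors, axis=0)
```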
Tremendous numbers of new words emerge every day, and there is a great need to represent them with semantics that NLP systems can understand. Sememes are defined as the minimum semantic units of human languages, and their combinations can represent the meaning of a word. Manual construction of sememe-based knowledge bases is time-consuming and labor-intensive. Fortunately, online communities are devoted to composing descriptions of words on wiki websites. In this paper, we explore automatically predicting lexical sememes based on the descriptions of words on wiki websites. We view this problem as a weakly ordered multi-label task and propose a Label-Distributed seq2seq model (LD-seq2seq) with a novel soft loss function to solve it. In experiments, we take a real-world sememe knowledge base and the corresponding word descriptions from Baidu Wiki for training and evaluation. The results show that our LD-seq2seq model not only beats all baselines significantly on the test set, but also outperforms amateur human annotators on a random subset of the test set.
Rare word representation has recently attracted much interest, since ineffective handling of infrequent words can lead to inaccurate semantic understanding. However, there is a shortage of reliable benchmarks for evaluating and comparing these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. To fill this evaluation gap, we propose the Cambridge Rare Word dataset (Card-660), an expert-annotated word similarity dataset that provides a highly reliable yet challenging benchmark for rare word representation techniques. Through a set of experiments, we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve performance above 0.43 (Pearson correlation) on the dataset, against a human-level upper bound of 0.90. We release the dataset and annotation materials at http://pilehvar.github.io/card-660/.
We address the problem of generalizing pretrained word embeddings beyond fixed-size vocabularies without using additional contextual information. We propose a subword-level word vector generation model that views words as bags of character $n$-grams. The model is simple, fast to train, and provides good vectors for rare or unseen words. Experiments show that our model achieves state-of-the-art performance on English word similarity tasks and on the joint prediction of part-of-speech tags and morphosyntactic attributes in 23 languages, suggesting that our model captures the relationship between a word's textual representation and its embedding.
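A minimal bag-of-subwords sketch in the spirit of this model: a word vector is composed from the vectors of its character n-grams (the composition here is a plain average; the paper's trained composition may differ):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams with boundary markers, fastText-style."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, ngram_emb, dim):
    """Compose a word vector as the average of its known character
    n-gram vectors; falls back to zeros if none are known."""
    vecs = [ngram_emb[g] for g in char_ngrams(word) if g in ngram_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```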
In this work, we focus on effectively leveraging and integrating information from the concept level and the word level by projecting concepts and words into a lower-dimensional space while retaining the most critical semantics. In the broader context of opinion understanding systems, we investigate the use of the fused embeddings in several core NLP tasks: named entity detection and classification, automatic speech recognition reranking, and targeted sentiment analysis.
In this paper, we propose a general framework to incorporate semantic knowledge into the popular data-driven learning process of word embeddings to improve their quality. Under this framework, we represent semantic knowledge as many ordinal ranking inequalities and formulate the learning of semantic word embeddings (SWE) as a constrained optimization problem, where the data-derived objective function is optimized subject to all ordinal knowledge inequality constraints extracted from available knowledge resources such as Thesaurus and WordNet. We demonstrate that this constrained optimization problem can be efficiently solved by the stochastic gradient descent (SGD) algorithm, even for a large number of inequality constraints. Experimental results on four standard NLP tasks, including word similarity measurement, sentence completion, named entity recognition, and TOEFL synonym selection, all demonstrate that the quality of learned word vectors can be significantly improved when semantic knowledge is incorporated as inequality constraints during the learning process.
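A sketch of how such an ordinal inequality might enter the objective as a hinge penalty, e.g. requiring a word to be more similar to a synonym than to an unrelated word; the margin and the dot-product similarity are illustrative assumptions:

```python
import numpy as np

def ordinal_penalty(W, triples, margin=0.1):
    """Hinge penalty for constraints sim(i, j) > sim(i, k): word i
    should be more similar to j (e.g. a synonym) than to k. A term
    like this is added to the data-driven objective and minimized
    jointly by SGD."""
    total = 0.0
    for i, j, k in triples:
        violation = margin - (W[i] @ W[j] - W[i] @ W[k])
        total += max(0.0, violation)
    return total
```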
Word embedding techniques rely heavily on the abundance of training data for individual words. Given the Zipfian distribution of words in natural language texts, a large number of words either do not appear frequently in the training data or do not appear at all. In this paper, we propose a technique that exploits the knowledge encoded in lexical resources, such as WordNet, to induce embeddings for unseen words. Our approach adapts graph embedding and cross-lingual vector space transformation techniques in order to merge the lexical knowledge encoded in ontologies with that derived from corpus statistics. We show that the approach can provide consistent performance improvements across multiple evaluation benchmarks: in vitro, on multiple rare word similarity datasets, and in vivo, in two downstream text classification tasks.
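A sketch of the vector-space merging step, assuming a standard orthogonal Procrustes alignment between the graph-based and corpus-based spaces learned on shared words (the paper's transformation may be more elaborate):

```python
import numpy as np

def procrustes_map(X_graph, Y_corpus):
    """Orthogonal map Q = argmin ||X Q - Y||_F s.t. Q'Q = I, fit on
    words present in both the graph-based and corpus-based spaces."""
    U, _, Vt = np.linalg.svd(X_graph.T @ Y_corpus)
    return U @ Vt

def induce_embedding(x_graph_unseen, Q):
    """Project an unseen word's graph embedding into corpus space."""
    return x_graph_unseen @ Q
```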
Character-level representations have been widely adopted to alleviate the problem of effectively representing rare or complex words. However, characters themselves are not natural minimal linguistic units for representation or word composition, since using them ignores the linguistic coherence of consecutive characters within a word. This paper presents a general subword-augmented embedding framework for learning and composing computationally derived subword-level representations. We survey a series of unsupervised segmentation methods for subword acquisition and different subword-augmented strategies for text understanding, showing that subword-augmented embeddings significantly improve our baselines on multiple text understanding tasks in both English and Chinese.
This paper introduces the first dataset for evaluating English-Chinese bilingual contextual word similarity, namely BCWS (https://github.com/MiuLab/BCWS). The dataset consists of 2,091 English-Chinese word pairs with the corresponding sentential contexts and their similarity scores annotated by humans. Compared with other similar datasets, our annotated dataset has higher consistency. We establish several baselines for the bilingual embedding task to benchmark the experiments. Modeling cross-lingual sense representations as provided in this dataset has the potential to move artificial intelligence from monolingual understanding towards multilingual understanding.
Integrating text and knowledge into a unified semantic space has attracted significant research interest recently. However, ambiguity in the common space remains a challenge, namely that the same mention phrase usually refers to various entities. In this paper, to deal with the ambiguity of entity mentions, we propose a novel Multi-Prototype Mention Embedding model, which learns multiple sense embeddings for each mention by jointly modeling words from textual contexts and entities derived from a knowledge base. In addition, we design an efficient language-model-based approach to disambiguate each mention to a specific sense. In experiments, both qualitative and quantitative analyses demonstrate the high quality of the word, entity, and multi-prototype mention embeddings. Using entity linking as a case study, we apply our disambiguation method as well as the multi-prototype mention embeddings on the benchmark dataset and achieve state-of-the-art performance.