Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
translated by 谷歌翻译
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
translated by 谷歌翻译
自Mikolov等人的开创性工作以来。 (2013A)和Bojanowski等。 (2017),浅日志双线性语言模型的单词表示已成为许多NLP应用程序。 Mikolov等人。 (2018)介绍了一个位置日志双线性语言模型,具有基于关注的语言模型的特征,并且在内在单词类比任务上达到了最先进的性能。然而,位置模型从未评估了定性标准或外在任务,其速度是不切实际的。我们概述了注意机制与位置模型之间的相似性,并提出了一个受约束的位置模型,它适应Dai等人的稀疏注意机制。 (2018)。我们评估了三个新型定性标准的位置和受约束的位置模型及其对两种和布鲁森的外在语言建模任务(2014)。我们表明,位置和约束位置模型包含有关字令的可解释信息,优于Bojanowski等人的子字模型。 (2017)语言建模。我们还表明,受约束的位置模型优于语言建模的位置模型,并且是快速的两倍。
translated by 谷歌翻译
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
translated by 谷歌翻译
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
translated by 谷歌翻译
我们想要模型的文本单位是什么?从字节到多字表达式,可以在许多粒度下分析和生成文本。直到最近,大多数自然语言处理(NLP)模型通过单词操作,将那些作为离散和原子令牌处理,但从字节对编码(BPE)开始,基于次字的方法在许多领域都变得占主导地位,使得仍然存在小词汇表允许快速推断。是道路字符级模型的结束或字节级处理吗?在这项调查中,我们通过展示和评估基于学习分割的词语和字符以及基于子字的方法的混合方法以及基于学习的分割的杂交方法,连接多行工作。我们得出结论,对于所有应用来说,并且可能永远不会成为所有应用的银子弹奇异解决方案,并且严重思考令牌化对许多应用仍然很重要。
translated by 谷歌翻译
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character ngram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English→German and English→Russian by up to 1.1 and 1.3 BLEU, respectively.
translated by 谷歌翻译
确保适当的标点符号和字母外壳是朝向应用复杂的自然语言处理算法的关键预处理步骤。这对于缺少标点符号和壳体的文本源,例如自动语音识别系统的原始输出。此外,简短的短信和微博的平台提供不可靠且经常错误的标点符号和套管。本调查概述了历史和最先进的技术,用于恢复标点符号和纠正单词套管。此外,突出了当前的挑战和研究方向。
translated by 谷歌翻译
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference -sometimes prohibitively so in the case of very large data sets and large models. Several authors have also charged that NMT systems lack robustness, particularly when input sentences contain rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using residual connections as well as attention connections from the decoder network to the encoder. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. To directly optimize the translation BLEU scores, we consider refining the models by using reinforcement learning, but we found that the improvement in the BLEU scores did not reflect in the human evaluation. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
translated by 谷歌翻译
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
translated by 谷歌翻译
识别跨语言抄袭是挑战性的,特别是对于遥远的语言对和感知翻译。我们介绍了这项任务的新型多语言检索模型跨语言本体论(CL \ nobreakdash-osa)。 CL-OSA表示从开放知识图Wikidata获得的实体向量的文档。反对其他方法,Cl \ nobreakdash-osa不需要计算昂贵的机器翻译,也不需要使用可比较或平行语料库进行预培训。它可靠地歧义同音异义和缩放,以允许其应用于Web级文档集合。我们展示了CL-OSA优于从五个大局部多样化的测试语料中检索候选文档的最先进的方法,包括日语英语等遥控语言对。为了识别在角色级别的跨语言抄袭,CL-OSA主要改善了感觉识别翻译的检测。对于这些挑战性案例,CL-OSA在良好的Plagdet得分方面的表现超过了最佳竞争对手的比例超过两种。我们研究的代码和数据公开可用。
translated by 谷歌翻译
State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures-one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers. 1
translated by 谷歌翻译
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embeddings models became prominent, when they proved themselves able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short in intrinsic issues of linguistics, such as polysemy and homonymy. Any expert system that makes use of natural language in its core, can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), that disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embeddings algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that our approach can produce state-of-the-art results and outperforms several more complex state-of-the-art systems.
translated by 谷歌翻译
令牌化是预用语言模型(PLMS)的基础。用于中文PLMS的现有销量化方法通常将每个角色视为不可分割的令牌。然而,它们忽略了中文写字系统的独特特征,其中附加语言信息在字符级别下方,即在子字符级别。要利用此类信息,我们提出了子字符(Sub Const for Short)标记。具体地,我们首先通过基于其字形或发音将每个汉字转换为短序列来编码输入文本,然后根据具有子字标记化的编码文本构造词汇表。实验结果表明,Sub Colar标记与现有标记均具有两个主要优点:1)它们可以将输入牌销料到更短的序列中,从而提高计算效率。 2)基于发音的Sub Col.Tokenizers可以将中文同音铭器编码为相同的音译序列并产生相同的标记输出,因此对所有同音声音拼写的强大。与此同时,使用Sub Colar标记培训的模型竞争地执行下游任务。我们在https://github.com/thunlp/subchartoken中发布我们的代码,以促进未来的工作。
translated by 谷歌翻译
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.
translated by 谷歌翻译
A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons.In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model. "Y.B. was also with AT&T Research while doing this research.
translated by 谷歌翻译
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show the integration between lexical chains and word embeddings representations sustain state-of-the-art results, even against more complex systems.
translated by 谷歌翻译
本文使用寄存器预测任务进行了39种语言的基于频率语料库相似性的实验。目的是量化(i)不同语料库与同一语言和(ii)单个语音的同质性之间的距离。这两个目标对于衡量基于语料库的语言分析如何从一个数据集推广到另一个数据集都至关重要。问题在于,以前的工作集中在印欧语上,提出了一个问题,即这些措施是否能够在各种语言上提供强大的概括。本文使用寄存器预测任务来评估跨39种语言的竞争措施:他们能够区分代表不同生产环境的语料库?每个实验都将单个语言的三个语料库与所有语言共享的三个数字寄存器进行比较:社交媒体,网页和Wikipedia。结果表明,语料库相似性的衡量标准保留了不同语言家族,写作系统和形态类型的有效性。此外,当对不域外的语料库,应用于低资源语言以及应用于不同的寄存器集时,这些措施仍然坚固。鉴于我们需要在可用于分析的迅速增加的情况下进行概括,因此这些发现很重要。
translated by 谷歌翻译
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixedlength vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
translated by 谷歌翻译
命名实体识别是一项信息提取任务,可作为其他自然语言处理任务的预处理步骤,例如机器翻译,信息检索和问题答案。命名实体识别能够识别专有名称以及开放域文本中的时间和数字表达式。对于诸如阿拉伯语,阿姆哈拉语和希伯来语之类的闪族语言,由于这些语言的结构严重变化,指定的实体识别任务更具挑战性。在本文中,我们提出了一个基于双向长期记忆的Amharic命名实体识别系统,并带有条件随机字段层。我们注释了一种新的Amharic命名实体识别数据集(8,070个句子,具有182,691个令牌),并将合成少数群体过度采样技术应用于我们的数据集,以减轻不平衡的分类问题。我们命名的实体识别系统的F_1得分为93%,这是Amharic命名实体识别的新最新结果。
translated by 谷歌翻译