This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
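As an illustration of how a classifier of this kind is typically used, here is a minimal sketch with the open-source `fasttext` Python package; the file name `train.txt` and the hyperparameters are hypothetical placeholders, not the paper's exact settings.

```python
# Minimal sketch of supervised fastText usage. Assumes the `fasttext` Python
# package and a hypothetical training file train.txt with lines such as
# "__label__sports some tokenized sentence ...".
import fasttext

# Train a supervised classifier; hierarchical softmax keeps training fast
# even with a very large number of output classes.
model = fasttext.train_supervised(
    input="train.txt",   # hypothetical path
    lr=0.1,
    epoch=5,
    wordNgrams=2,        # bag of word bigrams
    loss="hs",           # hierarchical softmax
)

labels, probs = model.predict("this is a sentence to classify", k=3)
print(labels, probs)
```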
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
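A minimal usage sketch of these word-vector architectures via gensim's independent Word2Vec reimplementation (not the authors' original tool); the toy corpus and hyperparameters are illustrative only.

```python
# Sketch of training word vectors with gensim's Word2Vec reimplementation.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# Word-similarity queries of the kind used in the paper's evaluation.
print(model.wv.most_similar("king", topn=3))
```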
We propose a hierarchical attention network for document classification. Our model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the word and sentence level, enabling it to attend differentially to more and less important content when constructing the document representation. Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.
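A hedged PyTorch sketch of the attention pooling applied at each level of such a hierarchy (words within a sentence, sentences within a document); dimensions are illustrative and this is not the authors' implementation.

```python
# Attention pooling: a learned context vector scores each position, and the
# representation is the attention-weighted sum of the hidden states.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))   # learned context vector

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, positions, dim) -- word states within a sentence, or
        # sentence states within a document.
        u = torch.tanh(self.proj(h))
        alpha = torch.softmax(u @ self.context, dim=1)   # (batch, positions)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)      # (batch, dim)

words = torch.randn(2, 15, 100)           # 2 sentences, 15 words, states of size 100
sentence_vec = AttentionPool(100)(words)
print(sentence_vec.shape)                  # torch.Size([2, 100])
```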
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
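A minimal sketch using gensim's Doc2Vec, an independent reimplementation of Paragraph Vector; the tiny corpus and hyperparameters are illustrative only.

```python
# Learning fixed-length document vectors with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["a", "powerful", "storm", "hit", "the", "coast"], tags=["doc0"]),
    TaggedDocument(words=["a", "strong", "wind", "damaged", "the", "harbor"], tags=["doc1"]),
]

# dm=1 is the distributed-memory variant (PV-DM); dm=0 gives PV-DBOW.
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1, dm=1, epochs=100)

# Fixed-length vector inferred for an unseen piece of text.
vec = model.infer_vector(["powerful", "storm", "near", "the", "harbor"])
print(vec.shape)  # (50,)
```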
Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on regular grids, e.g., sequences) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on non-grid structures, e.g., arbitrary graphs) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document-word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to less training data in text classification.
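A hedged sketch of the graph construction described above: document-word edges weighted by TF-IDF and word-word edges weighted by positive PMI over sliding windows. This is an illustrative reconstruction on a toy corpus, not the authors' code.

```python
# Building the two edge types of a Text GCN-style corpus graph.
import numpy as np
from collections import Counter
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

# Document-word edges: TF-IDF weights.
tfidf = TfidfVectorizer()
doc_word = tfidf.fit_transform(docs)           # shape: (num_docs, num_words)
vocab = tfidf.get_feature_names_out()

# Word-word edges: PMI computed from co-occurrence in sliding windows.
window_size = 3
windows = []
for doc in docs:
    toks = doc.split()
    for i in range(max(1, len(toks) - window_size + 1)):
        windows.append(toks[i:i + window_size])

word_count = Counter(w for win in windows for w in set(win))
pair_count = Counter(frozenset(p) for win in windows
                     for p in combinations(set(win), 2))

n = len(windows)
pmi = {}
for pair, c in pair_count.items():
    w1, w2 = tuple(pair)
    val = np.log((c / n) / ((word_count[w1] / n) * (word_count[w2] / n)))
    if val > 0:                                 # keep only positive-PMI edges
        pmi[(w1, w2)] = val

print(doc_word.shape, len(vocab), len(pmi))
```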
The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016b) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.
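An illustrative PyTorch sketch of the gated linear unit (GLU) convolution block this line of work uses, of the form h = (X*W + b) * sigmoid(X*V + c); hyperparameters are illustrative and this is not the authors' implementation.

```python
# Causal convolution followed by a gated linear unit.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Left-pad so the convolution is causal (no future tokens leak in).
        self.pad = nn.ConstantPad1d((kernel_size - 1, 0), 0.0)
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        a, b = self.conv(self.pad(x)).chunk(2, dim=1)
        return a * torch.sigmoid(b)   # gated linear unit

x = torch.randn(4, 64, 32)            # (batch, channels, tokens)
y = GatedConvBlock(64)(x)
print(y.shape)                         # torch.Size([4, 64, 32])
```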
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, light-weight and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.
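A minimal sketch of the compressor-plus-kNN idea described above: the normalized compression distance (NCD) computed with gzip, followed by a k-nearest-neighbor vote. The toy data is illustrative only.

```python
# Training-free text classification with gzip distances and kNN voting.
import gzip
from collections import Counter

def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

train = [("the team won the championship game", "sports"),
         ("parliament passed the new budget law", "politics"),
         ("the striker scored twice in the final", "sports")]

def predict(text: str, k: int = 3) -> str:
    nearest = sorted(train, key=lambda ex: ncd(text, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(predict("the goalkeeper saved a penalty in the match"))
```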
This paper addresses the problem of efficiently representing complex structured data, with the natural application of web page and element classification. We hypothesize that the context surrounding an element inside a web page is of high value to the problem and is currently under-exploited. The paper aims to solve the problem of classifying web elements, represented as subtrees of a DOM tree, by taking their context into account. To achieve this, we first discuss current systems that work on the structure, such as Tree-LSTM, and then propose a context-aware extension to this model. We show that the new model achieves an average F1 score of 0.7973 on a multi-class web classification task. The model generates better representations for a variety of subtrees and can be used for applications such as element classification, state estimation in reinforcement learning over the web, and more.
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
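A hedged sketch of the two ideas above using gensim: skip-gram with negative sampling and subsampling of frequent words, plus a simple co-occurrence-based phrase detector. The tiny corpus is illustrative only, and this is not the authors' original tool.

```python
# Phrase detection followed by skip-gram training with negative sampling.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["air", "canada", "flight", "delayed"],
    ["air", "canada", "announced", "new", "routes"],
    ["canada", "is", "a", "large", "country"],
]

# Detect collocations such as "air_canada" and rewrite the corpus.
phrases = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
phrased = [phrases[s] for s in sentences]

# sg=1: skip-gram; negative=5: five negative samples per positive example;
# sample=1e-3: subsampling of frequent words.
model = Word2Vec(phrased, vector_size=50, sg=1, negative=5,
                 sample=1e-3, min_count=1, epochs=50)
print([w for w in model.wv.index_to_key if "_" in w])
```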
In this work, we model abstractive text summarization using Attentional Encoder-Decoder Recurrent Neural Networks, and show that they achieve state-of-the-art performance on two different corpora. We propose several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time. Our work shows that many of our proposed models contribute to further improvement in performance. We also propose a new dataset consisting of multi-sentence summaries, and establish performance benchmarks for further research.
State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields, and another that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.
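An illustrative sketch of a BiLSTM-CRF tagger in the spirit of the first architecture above, written with PyTorch plus the third-party `pytorch-crf` package (assumed available); the character-level features and pretrained embeddings from the paper are omitted for brevity.

```python
# Minimal BiLSTM-CRF sequence labeler.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens, tags):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags)          # negative log-likelihood

    def decode(self, tokens):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions)           # Viterbi decoding

model = BiLSTMCRF(vocab_size=1000, num_tags=9)      # e.g. BIO tags for 4 entity types + O
tokens = torch.randint(0, 1000, (2, 7))              # (batch, seq_len) toy input
tags = torch.randint(0, 9, (2, 7))
print(model.loss(tokens, tags).item(), model.decode(tokens))
```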
Automatic part-of-speech (POS) tagging is a preprocessing step for many natural language processing (NLP) tasks such as named entity recognition (NER), speech processing, information extraction, word sense disambiguation, and machine translation. It has achieved encouraging results for English and European languages, but it remains under-explored for Indian languages, especially Odia, owing to the lack of supporting tools and resources and the morphological richness of the language. Unfortunately, we could not find an open-source POS tagger for Odia, and only a few attempts have been made to develop POS taggers for the Odia language. The main contribution of this research work is to present conditional random field (CRF) and deep-learning-based approaches (CNN and bidirectional long short-term memory) for developing an Odia part-of-speech tagger. We used a publicly accessible corpus and annotated the dataset with the Bureau of Indian Standards (BIS) tag set. However, most languages around the world use datasets annotated with the Universal Dependencies (UD) tag set; hence, to maintain uniformity, the Odia dataset should use the same tag set, and we therefore constructed a simple mapping from the BIS tag set to the UD tag set. We experimented with various feature-set inputs to the CRF model and observed the influence of the constructed feature sets. The deep-learning-based models include a Bi-LSTM network, a CNN network, a CRF layer, character-sequence information, and pre-trained word vectors. Character-sequence information is extracted with a convolutional neural network (CNN) and a Bi-LSTM network. Six different combinations of neural sequence-labeling models were implemented and their performance metrics were studied. We observed that the Bi-LSTM model with character-sequence features and pre-trained word vectors achieved notable state-of-the-art results.
In recent years, Transformer-based pre-trained models have made great progress and become one of the most important backbones in natural language processing. Recent work has shown that the attention mechanism inside the Transformer may not be necessary, and both convolutional neural networks and multi-layer-perceptron-based models have been studied as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communication, together with a sentence-level representation decoupled from the other tokens. The original model performs well on domain-specific text classification under supervised training; however, its potential for learning transferable knowledge through self-supervised pre-training has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks, in both English and Chinese. In terms of model efficiency, our model has linear complexity rather than the quadratic complexity of Transformer-based models, and is more efficient at inference time. Moreover, we find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.
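Two of the fine-tuning techniques the ULMFiT paper introduces are discriminative learning rates per layer group and gradual unfreezing of the pretrained encoder. The sketch below is a plain-PyTorch illustration of those two ideas only (not the authors' fastai code); the layer sizes and the 2.6 decay factor follow the paper, everything else is illustrative.

```python
# Discriminative learning rates and gradual unfreezing, sketched in PyTorch.
import torch
import torch.nn as nn

encoder = nn.ModuleList([nn.LSTM(400, 1150, batch_first=True) for _ in range(3)])
head = nn.Linear(1150, 2)                       # classifier head on top

# Discriminative learning rates: deeper (earlier) layers get smaller rates.
base_lr = 1e-3
param_groups = [{"params": head.parameters(), "lr": base_lr}]
for depth, layer in enumerate(reversed(list(encoder))):
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr / (2.6 ** (depth + 1))})
optimizer = torch.optim.Adam(param_groups)

# Gradual unfreezing: start with only the head trainable, then unfreeze one
# layer group per fine-tuning stage (one or more epochs each).
for p in encoder.parameters():
    p.requires_grad = False
for stage, layer in enumerate(reversed(list(encoder))):
    for p in layer.parameters():
        p.requires_grad = True
    # ... run one fine-tuning stage here with `optimizer` ...
```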
We present two approaches to use unlabeled data to improve Sequence Learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a "pretraining" algorithm for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models. In our experiments, we find that long short-term memory recurrent networks, when pretrained with the two approaches, become more stable to train and generalize better. With pretraining, we were able to achieve strong performance in many classification tasks, such as text classification with IMDB and DBpedia, or image recognition with CIFAR-10.
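An illustrative PyTorch sketch of the second approach above: a sequence autoencoder that reads the input into a vector and reconstructs it, whose encoder weights could later initialize a supervised classifier LSTM. Sizes and data are illustrative, not the authors' setup.

```python
# LSTM sequence autoencoder usable as a pretraining step.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size: int, emb: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        _, state = self.encoder(x)            # compress the sequence into a state
        dec, _ = self.decoder(x, state)       # teacher-forced reconstruction
        return self.out(dec)

model = SeqAutoencoder(vocab_size=5000)
tokens = torch.randint(0, 5000, (8, 20))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000), tokens.reshape(-1))
loss.backward()                                # one pretraining step
print(loss.item())
```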
Sentiment analysis is one of the most classical and most extensively studied natural language processing tasks. The problem has seen notable advances, driven by increasingly complex and scalable machine learning models. Despite this progress, Brazilian Portuguese still has only limited linguistic resources, such as datasets dedicated to sentiment classification, especially when considering predefined partitions into training, test, and validation sets, which would allow fairer comparisons between different algorithmic alternatives. Motivated by these issues, this work analyzes the predictive performance of a range of document embedding strategies, with polarity assumed as the system output. The analysis covers five sentiment analysis datasets in Brazilian Portuguese, unified into a single dataset, along with reference partitions into training, test, and validation sets, both made publicly available through a digital repository. Cross-evaluation of dataset-specific models in different contexts is carried out to assess their generalization capability and the feasibility of adopting a single model to address all scenarios.
Despite decades of research on authorship attribution (AA) and authorship verification (AV), inconsistent dataset splits/filtering and mismatched evaluation methods make it difficult to assess the state of the art. In this paper, we present a survey of the field, resolve points of confusion, introduce Valla, which standardizes and benchmarks AA/AV datasets and metrics, provide a large-scale empirical evaluation, and offer apples-to-apples comparisons between existing methods. We evaluate eight promising methods on fifteen datasets (including challenge sets with distribution shift) and introduce a new large-scale dataset based on texts from the Project Gutenberg archive. Surprisingly, we find that traditional n-gram-based models perform best on 5 of 7 AA tasks, achieving an average macro-accuracy of 76.50% (compared to 66.71% for BERT-based models). However, on the two AA datasets with the largest number of words per author, and on the AV datasets, BERT-based models perform best. While AV methods are easily applied to AA, they are rarely included as baselines in AA papers. We show that, with the application of hard-negative mining, AV methods are competitive alternatives to AA methods. Valla and all experimental code can be found at https://github.com/jacobtyo/valla
Recently, text classification models based on graph neural networks (GNNs) have attracted increasing attention. Most of these models adopt a similar network paradigm: initialization with pre-trained node embeddings followed by two-layer graph convolution. In this work, we propose TextRGNN, an improved GNN structure that introduces residual connections to deepen the convolutional network. Our structure obtains a wider node receptive field and effectively suppresses over-smoothing of node features. In addition, we integrate a probabilistic language model into the initialization of the graph node embeddings, which allows better extraction of non-graph semantic information. Experimental results show that our model is general and efficient: it significantly improves classification accuracy at both the corpus level and the text level, and achieves SOTA performance on a variety of text classification datasets.
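An illustrative PyTorch sketch of the residual graph-convolution idea described above: each layer adds its input back to the propagated features, which lets the stack go deeper without over-smoothing. The toy graph and sizes are hypothetical, and this is not the authors' implementation.

```python
# A residual GCN layer stacked more than two layers deep.
import torch
import torch.nn as nn

class ResidualGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, a_norm: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim); a_norm: normalized adjacency (num_nodes, num_nodes)
        return torch.relu(a_norm @ self.linear(x)) + x   # residual connection

num_nodes, dim = 6, 32
a = torch.eye(num_nodes) + torch.rand(num_nodes, num_nodes).round()  # toy graph
a_norm = a / a.sum(dim=1, keepdim=True)                  # simple row normalization

x = torch.randn(num_nodes, dim)
for layer in [ResidualGCNLayer(dim) for _ in range(4)]:   # deeper than two layers
    x = layer(x, a_norm)
print(x.shape)
```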
Online reviews play an important role in decision-making for e-commerce. A large part of the population decides which places or restaurants to visit and where to buy from based on the reviews posted on the respective platforms. A fraudulent review, or opinion spam, is categorized as a dishonest or deceptive review. Positive reviews of a product or restaurant help attract customers and thus increase sales, whereas negative reviews may hamper the progress of a restaurant or a product's sales, resulting in a tarnished reputation and losses. Fraudulent reviews are deliberately posted on various online review platforms to deceive customers into buying, visiting, or being distracted from a product or restaurant; they are also written to promote or disparage a product. This work aims to detect and classify reviews as deceptive or genuine. It uses various deep learning techniques for review classification and outlines a proposed approach based on bidirectional LSTMs to handle the semantic information in reviews, together with a comparative study against baseline machine learning techniques for review classification.