Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GloVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embeddings have significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner. The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee," thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locates similar words.
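As a rough illustration of what such a clause-based representation looks like in use, the sketch below evaluates hand-written propositional clauses over a set of context words. The clause sets and the `matches` helper are hypothetical examples in the spirit of the abstract, not the authors' Tsetlin Machine autoencoder.

```python
# Illustrative sketch only: hypothetical clause sets, not learned by a Tsetlin Machine.
clauses = {
    # Each clause is a conjunction of contextual literals; a word is described
    # by the disjunction of its clauses.
    "coffee": [{"black", "cup"}, {"hot", "drink"}],
    "tea":    [{"hot", "cup"}, {"green", "drink"}],
}

def matches(word: str, context: set[str]) -> bool:
    """Return True if any clause for `word` is satisfied by the context words."""
    return any(clause <= context for clause in clauses.get(word, []))

print(matches("coffee", {"black", "cup", "morning"}))  # True
print(matches("tea", {"black", "cup", "morning"}))     # False
```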
The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embedding models that can be extended to any natural language task. We assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embedding algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show that the integration between lexical chains and word embedding representations sustains state-of-the-art results, even against more complex systems.
Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best previous representation approaches primarily capture relatedness (whether two variables are linked at all) rather than similarity (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs while maximizing the distance between dissimilar ones. This requires labeled training data, so we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated general-purpose language models like BERT to variable name representations, and thus also to related downstream tasks such as variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state of the art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we release all of our data, code, and pre-trained models, aiming to provide a drop-in replacement for the variable representations used in existing or future program analyses.
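For readers unfamiliar with the contrastive objective mentioned above, the sketch below shows an InfoNCE-style loss over paired variable-name embeddings. The `encode` function and the example name pairs are assumptions for illustration, not VarCLR's actual training code.

```python
# Sketch of a contrastive (InfoNCE-style) objective over variable-name pairs.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired variable names."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature   # similarity of every anchor/positive pair
    labels = torch.arange(anchor.size(0))          # the i-th anchor matches the i-th positive
    return F.cross_entropy(logits, labels)

# Usage with a hypothetical encoder (e.g., a BERT-style model over name subtokens):
# loss = info_nce_loss(encode(["avg_len", "mean"]), encode(["average_length", "avg"]))
```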
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embedding models became prominent, when they proved themselves able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short on intrinsic issues of linguistics, such as polysemy and homonymy. Any expert system that makes use of natural language at its core can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), which disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embedding algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that it can produce state-of-the-art results and outperform several more complex state-of-the-art systems.
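A minimal sketch of the sense-annotation idea follows, using NLTK's simplified Lesk algorithm as a stand-in disambiguator (MSSA itself uses a different, gloss-based procedure); the sentence and the resulting sense identifiers are illustrative only.

```python
# Stand-in sketch: annotate each token with a WordNet sense before embedding training.
# Requires: nltk.download("punkt"); nltk.download("wordnet")
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I deposited the check at the bank"
tokens = word_tokenize(sentence.lower())

annotated = []
for tok in tokens:
    sense = lesk(tokens, tok)                             # pick a context-appropriate WordNet sense
    annotated.append(sense.name() if sense else tok)      # e.g., 'bank' -> 'bank.n.01' (or similar)

print(annotated)  # sense-annotated tokens fed to a standard word-embedding algorithm
```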
Obtaining labelled data in a particular context can be expensive and time-consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method stems from representing words using pre-trained GloVe word embeddings and finding the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small number of manually labelled cargo content records was used to evaluate the classification performance of the unsupervised classification and the Transformer-based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer-based supervised classification, even after increasing the size of the training dataset by 30%. The lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from being applied successfully in practice. Unsupervised classification provides an alternative, efficient and effective method to classify text when training data is scarce.
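A minimal sketch of this zero-training-data scheme, assuming a standard GloVe vector file and placeholder SITC label descriptions, could look as follows; the file name and example labels are not the authors' actual data.

```python
# Sketch: embed the cargo text and each label description with GloVe, pick the
# label whose description has the highest cosine similarity to the text.
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed(text: str, vectors: dict[str, np.ndarray]) -> np.ndarray:
    # Assumes at least one token is in the GloVe vocabulary.
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

def classify(text: str, label_descriptions: dict[str, str], vectors) -> str:
    doc = embed(text, vectors)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(label_descriptions,
               key=lambda code: cosine(doc, embed(label_descriptions[code], vectors)))

# vectors = load_glove("glove.6B.300d.txt")  # a standard GloVe release
# classify("frozen fish fillets", {"03": "fish, crustaceans and molluscs"}, vectors)
```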
Patent data is an important source of knowledge for innovation research, and the technological similarity between pairs of patents is a key indicator for patent analysis. Recently, researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents in order to better understand innovation, patent landscaping, technology mapping, and patent quality evaluation. To the best of our knowledge, no comprehensive survey has established the big picture of how embedding models perform when computing patent similarity indicators. Therefore, in this study, we provide an overview of the accuracy of these algorithms based on patent classification performance. In a detailed discussion, we report the performance of the top-3 algorithms at the section, class, and subclass levels. The results based on the first claim of patents indicate that PatentSBERTa, Bert-for-Patents, and TF-IDF weighted word embeddings have the best accuracy for computing sentence embeddings at the subclass level. According to the results on first claims, the performance of the models varies across different classes, suggesting that researchers in patent analysis can use the results of this study to select the most appropriate model based on the specific section of the patent data they work with.
This tutorial demonstrates workflows for incorporating text data into actuarial classification and regression tasks. The main focus is on methods based on transformer models. A dataset of car accident descriptions with an average length of 400 words, available in English and German, and a dataset of short property insurance claim descriptions are used to demonstrate these techniques. The case studies address challenges related to a multilingual setting and long input sequences. They also show ways to interpret model output and to assess and improve model performance by adjusting the models to the application domain or to a specific prediction task. Finally, the tutorial provides practical approaches for handling classification tasks with no or only few labeled data. The results achieved with minimal pre-processing and fine-tuning, using the language-understanding skills of off-the-shelf natural language processing (NLP) models, clearly demonstrate the power of transfer learning for practical applications.
Grounding language in vision is an active field of research that aims to enrich text-based representations of word meanings by leveraging knowledge from visual perception. Despite many grounding attempts, it remains unclear how to inject visual knowledge into word embeddings in a way that preserves a proper balance between textual and visual knowledge. Some common questions are the following: Is visual grounding beneficial for abstract words, or is its contribution limited to concrete words? What is the best way to bridge the gap between text and vision? How much do we gain by visually grounding textual embeddings? The present study addresses these questions by proposing a simple yet very effective grounding approach for pre-trained word embeddings. Our model aligns textual embeddings with vision while largely preserving the distributional statistics that characterize word usage in text corpora. By applying the learned alignment, we are able to generate visually grounded embeddings for unseen words, including abstract words. A series of evaluations on word similarity benchmarks shows that visual grounding is beneficial not only for concrete words but also for abstract words. We also show that our method of visual grounding offers advantages for contextualized embeddings, but only when these are trained on corpora of relatively modest size. Code and grounded embeddings for English are available at https://github.com/hazel1994/visually_grounded_word_embeddings_2.
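One simple way to realize such an alignment, sketched below under the assumption that paired textual and visual vectors of the same dimensionality are available for a subset of words, is a ridge-regression linear map blended with the original embedding; this illustrates the general idea rather than the authors' exact model.

```python
# Sketch: learn a text-to-vision alignment on seen words, apply it to unseen
# (e.g., abstract) words, and blend with the original vector so that
# distributional statistics are largely preserved.
import numpy as np

def fit_alignment(X_text: np.ndarray, X_vision: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Solve M = argmin ||X_text @ M - X_vision||^2 + lam * ||M||^2."""
    d = X_text.shape[1]
    return np.linalg.solve(X_text.T @ X_text + lam * np.eye(d), X_text.T @ X_vision)

def ground(word_vec: np.ndarray, M: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Assumes the visual space was projected to the textual dimensionality;
    # alpha is a free knob controlling how much visual information is injected.
    return (1 - alpha) * word_vec + alpha * (word_vec @ M)
```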
The Tsetlin Machine (TM) has been gaining popularity as an inherently interpretable machine learning method that is able to achieve promising performance with low computational complexity on a variety of applications. The interpretability and the low computational complexity of the TM are inherited from the Boolean expressions used to represent various sub-patterns. Although it possesses favorable properties, the TM has not been the go-to method for AI applications, mainly due to its conceptual and theoretical differences compared with perceptrons and neural networks, which are more widely known and better understood. In this paper, we provide detailed insights into the operational concept of the TM and try to bridge the gap in theoretical understanding between the perceptron and the TM. More specifically, we study the operational concept of the TM following the analytical structure of perceptrons, showing the resemblance between the two. Through this analysis, we show that the TM's weight update can be considered a special case of the gradient weight update. We also perform an empirical analysis of the TM by showing the flexibility in determining the clause length, visualizing decision boundaries, and obtaining interpretable Boolean expressions from the TM. In addition, we discuss the advantages of the TM in terms of its structure and its ability to solve more complex problems.
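For reference, the error-driven perceptron update that this comparison builds on can be written in a few lines; the snippet below is a generic textbook update, not code from the paper, and the TM's clause-weight feedback is argued above to be a special case of this kind of gradient-style correction.

```python
# Generic perceptron update used as the reference point for the analysis.
import numpy as np

def perceptron_update(w: np.ndarray, x: np.ndarray, y: int, lr: float = 1.0) -> np.ndarray:
    """Single error-driven step: move the weights only when the prediction is wrong."""
    y_hat = 1 if w @ x > 0 else 0
    return w + lr * (y - y_hat) * x  # correction proportional to the prediction error
```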
GloVe learns word embeddings by leveraging statistical information from a word co-occurrence matrix. However, the word pairs in the matrix are extracted from a predefined local context window, which may lead to limited word pairs and potentially semantically irrelevant word pairs. In this paper, we propose SemGloVe, which distills semantic co-occurrence statistics from BERT into static GloVe word embeddings. In particular, we propose two models to extract co-occurrence statistics based either on the masked language model or on the multi-head attention weights of BERT. Our methods can extract word pairs without being limited by the local window assumption and can define co-occurrence weights by directly considering the semantic distance between word pairs. Experiments on several word similarity datasets and four external tasks show that SemGloVe can outperform GloVe.
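The attention-based variant can be approximated with the Hugging Face transformers API as sketched below: attention weights averaged over layers and heads serve as co-occurrence weights between token pairs. The example sentence and the simple averaging choice are illustrative assumptions, not SemGloVe's exact recipe.

```python
# Sketch: use BERT's multi-head attention weights as semantic co-occurrence
# weights between word pairs, without a fixed local context window.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "the hot coffee was served in a black cup"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads to obtain a token-to-token co-occurrence weight matrix.
attn = torch.stack(outputs.attentions).mean(dim=(0, 2))[0]   # (seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
```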
In recent years, there has been growing interest in developing interpretable models for natural language processing (NLP). Most existing models aim to identify input features, such as words or phrases, that are important to a model's prediction. However, neural models developed in NLP typically compose word semantics in a hierarchical manner, and text classification requires hierarchical modeling to aggregate local information in order to deal with topic and label shifts more effectively. Therefore, interpretations at the word or phrase level cannot faithfully explain model decisions in text classification. This paper proposes a novel hierarchical interpretable neural text classifier, called HINT, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretations are no longer at the word level but are built on topics as the basic semantic unit. Experimental results on both review and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers and generates interpretations that are more faithful to model predictions and better understood by humans than other interpretable neural text classifiers.
This survey draws a broad, panoramic picture of the state of the art (SOTA) of research on generative methods for the analysis of social media data. It fills a gap, as existing survey articles are either limited in scope or dated. We include two important aspects that are currently gaining importance in mining and modeling social media: dynamics and networks. Social dynamics are important for understanding the spread of influence or diseases, the formation of friendships, and so on. Networks, on the other hand, can capture various complex relationships, providing additional insight and identifying important patterns that would otherwise go unnoticed.
Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document classification and information retrieval are two applications that may benefit from unsupervised learning, such as text clustering and topic modeling, including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues: the initialization can lead to variability, depending on the machine learning algorithm. Furthermore, distortions concerning cluster geometry can be misleading. Among the causes, the presence of outliers and anomalies can be a determining factor. Although initialization and outlier issues are relevant to text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures go by different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, factorization, and clustering algorithms that are directly or indirectly related to the reviewed works.
It is well-known that typical word embedding methods such as Word2Vec and GloVe have the property that the meaning can be composed by adding up the embeddings (additive compositionality). Several theories have been proposed to explain additive compositionality, but the following questions remain unanswered: (Q1) The assumptions of those theories do not hold for practical word embeddings. (Q2) Ordinary additive compositionality can be seen as an AND operation of word meanings, but it is not well understood how other operations, such as OR and NOT, can be computed by the embeddings. We address these issues with frequency-weighted centering as the core idea. This paper proposes a post-processing method for bridging the gap between practical word embeddings and the theoretical assumptions about additive compositionality, as an answer to (Q1). It also gives a method for taking the OR or NOT of meanings by linear operations on word embeddings, as an answer to (Q2). Moreover, we confirm experimentally that the accuracy of the AND operation, i.e., ordinary additive compositionality, can be improved by our post-processing method (a 3.5x improvement in top-100 accuracy) and that the OR and NOT operations can be performed correctly.
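A minimal sketch of two of the ingredients named above, frequency-weighted centering as post-processing and addition as the AND operation, assuming an embedding matrix and corpus frequencies are given; the exact weighting scheme and the OR/NOT constructions follow the paper, and this only illustrates the basic idea.

```python
# Sketch: frequency-weighted centering and additive (AND) composition.
import numpy as np

def center(embeddings: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Subtract the corpus-frequency-weighted mean vector from every embedding."""
    mu = (freqs[:, None] * embeddings).sum(axis=0) / freqs.sum()
    return embeddings - mu

def and_compose(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Ordinary additive compositionality: the meaning of 'u AND v'."""
    return u + v
```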
A growing body of research in natural language processing (NLP) and natural language understanding (NLU) investigates human-like knowledge learned or encoded in the word embeddings of large language models. This is a step towards understanding what knowledge language models capture that resembles human understanding of language and communication. Here, we investigate the affect of words (i.e., valence, arousal, dominance) and how it is encoded in word embeddings pre-trained in large neural networks. We use human-labeled datasets as ground truth and perform a variety of correlational and classification tests on four types of word embeddings. The embeddings differ in being static or contextualized, and in the extent to which affect-specific information is prioritized during the pre-training and fine-tuning stages. Our analyses show that word embeddings from the vanilla BERT model do not saliently encode the affect information of English words. Only when the BERT model is fine-tuned on emotion-related tasks, or when additional contextualized information from emotion-rich contexts is included, do the corresponding embeddings encode more relevant affect information.
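One of the correlational probes described above can be sketched as follows: fit a linear model from word embeddings to human valence ratings and report Pearson's r. The arrays below are random placeholders standing in for real embeddings and human-labeled affect scores.

```python
# Sketch of a valence probe: linear regression from embeddings to human ratings.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X = np.random.rand(500, 768)   # placeholder embeddings of 500 words (e.g., from BERT)
y = np.random.rand(500)        # placeholder human valence ratings for the same words

pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
r, p = pearsonr(y, pred)
print(f"valence probe: r = {r:.2f} (p = {p:.3g})")
```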
Modeling law search and retrieval as prediction problems has recently emerged as a predominant approach in legal intelligence. Focusing on the law article retrieval task, we present a deep learning framework named LamBERTa, which is designed for civil-law codes and specifically trained on the Italian Civil Code. To our knowledge, this is the first study proposing an advanced approach to law article prediction for the Italian legal system based on a BERT (Bidirectional Encoder Representations from Transformers) learning framework, which has recently attracted increasing attention among deep learning approaches and has shown outstanding effectiveness in several natural language processing and learning tasks. We define LamBERTa models by fine-tuning an Italian pre-trained BERT on the Italian Civil Code or its portions, framing law article retrieval as a classification task. One key aspect of our LamBERTa framework is that we conceived it to address an extreme classification scenario, characterized by a high number of classes, a few-shot learning problem, and the lack of test-query benchmarks for Italian legal prediction tasks. To deal with these issues, we define different methods for the unsupervised labeling of law articles, which in principle can be applied to any legal system. We provide insights into the explainability and interpretability of our LamBERTa models, and we present an extensive experimental analysis over query templates for single-label as well as multi-label evaluation tasks. Empirical evidence demonstrates the effectiveness of LamBERTa and its superiority over widely used deep learning text classifiers and a few-shot learner conceived for an attribute-aware prediction task.
Word Mover's Distance (WMD) computes the distance between words and models text similarity with the moving cost between words in two text sequences. However, it does not perform well in sentence similarity evaluation, since it does not incorporate word importance and fails to take the inherent contextual and structural information of a sentence into account. An improved WMD method that uses the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), is proposed to address these two shortcomings in this work. First, a weighted graph is built upon word co-occurrence statistics extracted from the syntactic parse trees of sentences. The importance of each word is inferred from graph connectivity. Second, the local syntactic parse structure of words is considered when computing the distance between words. To demonstrate the effectiveness of the proposed SynWMD, we conduct experiments on 6 textual semantic similarity (STS) datasets and 4 sentence classification datasets. Experimental results show that SynWMD achieves state-of-the-art performance on the STS tasks. It also outperforms other WMD-based methods on the sentence classification tasks.
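The first ingredient, word importance inferred from graph connectivity, can be sketched as below; PageRank over a toy co-occurrence graph stands in for the paper's parse-tree-based weighting, and the edges are illustrative only.

```python
# Sketch: build a weighted word co-occurrence graph (in the paper, edges come from
# syntactic parse trees) and infer word importance from graph connectivity.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("coffee", "hot", 2.0), ("coffee", "cup", 3.0),
    ("cup", "black", 1.0), ("hot", "drink", 1.0),
])
importance = nx.pagerank(G, weight="weight")   # word weights later used inside the WMD
print(sorted(importance.items(), key=lambda kv: -kv[1]))
```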
Short text classification is a crucial and challenging aspect of Natural Language Processing. For this reason, there are numerous highly specialized short text classifiers. However, in recent short text research, State-of-the-Art (SOTA) methods for traditional text classification, particularly the pure use of Transformers, have remained largely unexploited. In this work, we examine the performance of a variety of short text classifiers as well as the top-performing traditional text classifier. We further investigate the effects on two new real-world short text datasets in an effort to address the issue of becoming overly dependent on benchmark datasets with a limited number of characteristics. Our experiments unambiguously demonstrate that Transformers achieve SOTA accuracy on short text classification tasks, raising the question of whether specialized short text techniques are necessary.
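The "pure Transformer" baseline referred to above amounts to standard fine-tuning of an off-the-shelf pre-trained model; a minimal sketch follows, with the model name, label count, and toy short texts as placeholders rather than the paper's benchmark setup.

```python
# Sketch: one fine-tuning step of an off-the-shelf Transformer for short text classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

texts = ["cheap flights to rome", "new gpu drops next week"]   # toy short texts
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()        # one gradient step of standard fine-tuning
```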
Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyzed preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99% (F1 = 99.68% for SpinBot and F1 = 71.64% for SpinnerChief cases), while human evaluators achieved F1 = 78.4% for SpinBot and F1 = 65.6% for SpinnerChief cases. We show that automated classification alleviates the shortcomings of widely used text-matching systems such as Turnitin and PlagScan. To facilitate future research, all data, code, and two web applications showcasing our contributions are openly available.
The categorization of mutual funds or Exchange-Traded Funds (ETFs) has long served financial analysts for peer analysis, ranging from competitor analysis to quantifying portfolio diversification. The categorization methodology usually relies on fund composition data in structured format extracted from the Form N-1A. Here, we initiate a study to learn the categorization system directly from the unstructured data as depicted in the form, using natural language processing (NLP). Taking as input data only the investment strategy description as reported in the form, with the target variable being the Lipper Global categories, and using various NLP models, we show that the categorization system can indeed be learned with high accuracy. We discuss the implications and applications of our findings as well as the limitations of existing pre-trained architectures when applying them to learn fund categorization.