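We use paraphrases as a unique source of data to analyze contextualized embeddings, with a particular focus on BERT. Because paraphrases naturally encode consistent word and phrase semantics, they provide a unique lens for investigating the properties of embeddings. Using the alignments in the Paraphrase Database, we study words within paraphrases as well as phrase representations. We find that contextual embeddings handle polysemous words effectively, but in many cases give synonyms surprisingly different representations. We confirm previous findings that BERT is sensitive to word order, but find slightly different patterns than prior work in the level of contextualization across BERT's layers.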
translated by 谷歌翻译
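Contextualized word embeddings in language models have brought great advances to NLP. Intuitively, sentential information is integrated into the representation of words, which can help model polysemy. However, context sensitivity also leads to variance in representations, which may break the semantic consistency of synonyms. We quantify how much the contextualized embeddings of each word sense vary across contexts in typical pre-trained models. Results show that contextualized embeddings can be highly consistent across contexts. In addition, part of speech, number of word senses, and sentence length influence the variance of sense representations. Interestingly, we find that word representations are position-biased: the first words in different contexts tend to be more similar. We analyze this phenomenon and also propose a simple way to alleviate the bias in distance-based word sense disambiguation settings.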
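One simple way to make "consistency across contexts" concrete is the mean pairwise cosine similarity between all embeddings of the same word sense. The sketch below is only an illustration under that assumption; the abstract does not state which variance measure the authors use, and the function names and toy vectors are hypothetical stand-ins for real contextual embeddings.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sense_consistency(vectors):
    """Mean pairwise cosine similarity over all occurrences of one sense.

    Assumed measure for illustration only; higher means the sense is
    represented more consistently across contexts.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical toy vectors: three occurrences of a single sense ...
same_sense = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]
# ... versus three occurrences that point in unrelated directions.
mixed = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]

print(sense_consistency(same_sense))
print(sense_consistency(mixed))
```

With real models, the vectors would come from a chosen layer of the encoder at the position of the target word; the position bias reported in the abstract would show up as unusually high similarity between sentence-initial tokens.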
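Transformer-based language models have recently achieved remarkable results on many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics to modern natural language processing. In this dissertation, I present several case studies to illustrate that theoretical linguistics and neural language models are still relevant to each other. On the one hand, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do with traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data for probing language models on specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part, I propose a method for measuring surprisal at the intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives on the interpretation of language models.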
Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embedding models became prominent by proving able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short on intrinsic linguistic issues such as polysemy and homonymy. Any expert system that uses natural language at its core can be affected by a weak semantic representation of text, resulting in inaccurate outcomes based on poor decisions. To mitigate such issues, we propose a novel approach called Most Suitable Sense Annotation (MSSA), which disambiguates and annotates each word by its specific sense, considering the semantic effects of its context. Our approach brings three main contributions to the semantic representation scenario: (i) an unsupervised technique that disambiguates and annotates words by their senses, (ii) a multi-sense embeddings model that can be extended to any traditional word embeddings algorithm, and (iii) a recurrent methodology that allows our models to be re-used and their representations refined. We test our approach on six different benchmarks for the word similarity task, showing that it can produce state-of-the-art results and outperforms several more complex state-of-the-art systems.
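Idiomatic expressions (IEs), characterized by their non-compositionality, are an important part of natural language. They have been a classical challenge for NLP, including for the pre-trained language models that drive today's state of the art. Prior work has identified deficiencies in their contextualized representations, stemming from the underlying compositional paradigm of representation. In this work, we take a first-principles approach to building idiomaticity, using an adapter as a lightweight non-compositional language expert trained on idiomatic sentences. The improved capability over baselines (e.g., BART) is seen through both intrinsic and extrinsic methods: the homogeneity score for embedding clustering is 0.19 points higher, and performance on the idiom processing tasks of IE sense disambiguation and span detection improves by up to 25%.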
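A growing body of research in natural language processing (NLP) and natural language understanding (NLU) is investigating human-like knowledge learned or encoded in the word embeddings of large language models. This is a step towards understanding what knowledge language models capture that resembles human understanding of language and communication. Here, we investigate whether and how the affective meaning of a word (i.e., valence, arousal, dominance) is encoded in word embeddings pre-trained in large neural networks. We use a human-labeled dataset as the ground truth and perform various correlational and classification tests on four types of word embeddings. The embeddings vary in being static or contextualized, and in how much affect-specific information is prioritized during the pre-training and fine-tuning phases. Our analyses show that word embeddings from the vanilla BERT model do not saliently encode the affective information of English words. Only when the BERT model is fine-tuned on emotion-related tasks, or contains extra contextualized information from emotion-rich contexts, can the corresponding embeddings encode more relevant affective information.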
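Idioms are unlike most phrases. First, the words in an idiom have non-canonical meanings. Second, the non-canonical meanings of the words in an idiom are contingent on the presence of the other words in the idiom. Linguistic theories differ on whether these properties depend on one another, and on whether special theoretical machinery is needed to accommodate idioms. We define two measures that correspond to the properties above, and we implement them using BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). We show that idioms fall at the expected intersection of the two dimensions, but that the dimensions themselves are not correlated. Our results suggest that special machinery to handle idioms may not be warranted.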
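Most unsupervised NLP models represent each word with a single point or a single region in semantic space, while existing multi-sense word embeddings cannot represent longer word sequences such as phrases or sentences. We propose a novel embedding method for text sequences (phrases or sentences) in which each sequence is represented by a distinct set of multi-mode codebook embeddings that capture different semantic facets of its meaning. The codebook embeddings can be viewed as cluster centers that summarize the distribution of possibly co-occurring words in a pre-trained word embedding space. We introduce an end-to-end trainable neural model that directly predicts the set of cluster centers from the input text sequence at test time. Our experiments show that the per-sentence codebook embeddings significantly improve performance on unsupervised sentence similarity and extractive summarization benchmarks. In phrase similarity experiments, we find that the multi-facet embeddings provide an interpretable semantic representation but do not outperform the single-facet baseline.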
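Idiomatic expressions can be problematic for natural language processing applications because their meaning cannot be inferred from their constituent words. A lack of successful methodological approaches and of sufficiently large datasets has prevented the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach called MICE that uses contextual embeddings for this purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train classifiers based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform better than existing approaches and are able to detect idiomatic word use even for expressions that are not present in the training set. We demonstrate cross-lingual transfer of the developed models and analyze the size of the required dataset.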
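We provide a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method that outperforms previously described contextualized approaches. This method is used as the basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words that are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail, and a linguistic categorization of them is proposed. Our conclusion is that pre-trained contextualized language models are prone to confounding changes in lexicographic senses with changes in contextual variance, which naturally stems from their distributional nature but differs from the types of issues observed in methods based on static embeddings. Additionally, they often merge the syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.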
Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.
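In this paper, we test two different approaches to the unsupervised word sense disambiguation task for Polish. In both methods, we use neural language models to predict words similar to those being disambiguated and, on the basis of these words, predict the partition of word senses in different ways. In the first method, we cluster selected similar words, while in the second, we cluster vectors representing their subsets. The evaluation was carried out on texts annotated with plWordNet senses and gave relatively good results (F1 = 0.68 for all ambiguous words). The results are significantly better than those obtained by the neural model-based unsupervised method of \cite{WAW:MYK:17:Sense} and are at the level of the supervised method presented there. The proposed method may be a way of solving the word sense disambiguation problem for languages that lack sense-annotated data.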
The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose a serious threat to the credibility of various forms of media, including scientific literature and news sources, if these technologies are used for plagiarism. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.
News articles both shape and reflect public opinion across the political spectrum. Analyzing them for social bias can thus provide valuable insights, such as prevailing stereotypes in society and the media, which are often adopted by NLP models trained on respective data. Recent work has relied on word embedding bias measures, such as WEAT. However, several representation issues of embeddings can harm the measures' accuracy, including low-resource settings and token frequency differences. In this work, we study what kind of embedding algorithm serves best to accurately measure types of social bias known to exist in US online news articles. To cover the whole spectrum of political bias in the US, we collect 500k articles and review psychology literature with respect to expected social bias. We then quantify social bias using WEAT along with embedding algorithms that account for the aforementioned issues. We compare how models trained with the algorithms on news articles represent the expected social bias. Our results suggest that the standard way to quantify bias does not align well with knowledge from psychology. While the proposed algorithms reduce the gap, they still do not fully match the literature.
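For readers unfamiliar with WEAT, the measure compares how strongly two sets of target words (X, Y) associate with two sets of attribute words (A, B) in embedding space. The following is a minimal sketch of the standard WEAT effect size, not the authors' code: the toy two-dimensional vectors are hypothetical, and the use of the sample standard deviation is an assumption (implementations differ on this detail).

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of X and Y, normalised by the
    standard deviation of associations over all target words.
    Assumption: sample standard deviation (statistics.stdev)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Hypothetical toy embeddings: X leans towards A, Y leans towards B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.1]]
B = [[0.1, 1.0]]

print(weat_effect_size(X, Y, A, B))  # positive: X is closer to A than Y is
```

A positive effect size indicates that the X targets are more associated with the A attributes than the Y targets are; swapping X and Y flips the sign.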
Incorporating contrastive learning objectives in sentence representation learning (SRL) has yielded significant improvements on many sentence-level NLP tasks. However, it is not well understood why contrastive learning works for learning sentence-level semantics. In this paper, we take a closer look at contrastive sentence representation learning through the lens of isotropy and learning dynamics. We interpret its success stories through the geometry of the representation shifts. We show that contrastive learning brings isotropy, and surprisingly learns to converge tokens to similar positions in the semantic space if given the signal that they are in the same sentence. Also, what we formalize as "spurious contextualization" is mitigated for semantically meaningful tokens, while augmented for functional ones. The embedding space is pushed toward the origin during training, with more areas now better defined. We ablate these findings by observing the learning dynamics with different training temperatures, batch sizes and pooling methods. With these findings, we aim to shed light on future designs of sentence representation learning methods.
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
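The minimal-pair evaluation described above can be sketched as follows: a model "passes" a pair when it scores the plausible sentence higher than its implausible counterpart. This is only an illustration; the scoring function is a placeholder for whatever sentence score an LLM provides (for example a summed token log-probability), and the numbers below are invented for the example, not results from the paper.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (plausible, implausible) pairs where the plausible
    sentence receives the strictly higher score."""
    correct = sum(1 for plausible, implausible in pairs
                  if score(plausible) > score(implausible))
    return correct / len(pairs)

# Example sentences taken from the abstract; the scores are hypothetical
# stand-ins for model log-probabilities.
toy_scores = {
    "The teacher bought the laptop.": -20.1,
    "The laptop bought the teacher.": -27.4,
    "The nanny tutored the boy.": -22.0,
    "The boy tutored the nanny.": -21.5,
}

pairs = [
    ("The teacher bought the laptop.", "The laptop bought the teacher."),
    ("The nanny tutored the boy.", "The boy tutored the nanny."),
]

print(minimal_pair_accuracy(pairs, toy_scores.get))
```

In practice `score` would wrap a real model, and accuracy would be reported separately for the possible vs. impossible and likely vs. unlikely subsets.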
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
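We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). Unlike prior WSI datasets for Russian, RuDSI is completely data-driven (based on texts from the Russian National Corpus), with no external word senses imposed on annotators. Depending on the parameters of the graph clustering, different derivative datasets can be produced from the raw annotation. We report the performance that several baseline WSI methods obtain on RuDSI and discuss possibilities for improving these scores.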
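We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state of the art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing the morphophonetic or morphosyntactic biases induced by article-noun agreement.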
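The retrieval idea can be sketched in a few lines: a candidate's decontextualised embedding is the average of its contextual vectors over several contexts, and candidates are ranked by cosine similarity to the contextual vector of the target word. This is a simplified illustration under those assumptions, not the authors' exact scoring; the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def decontextualise(context_vectors):
    """Average a word's contextual vectors over multiple contexts."""
    n = len(context_vectors)
    dim = len(context_vectors[0])
    return [sum(v[i] for v in context_vectors) / n for i in range(dim)]

def rank_substitutes(target_vector, candidates):
    """Rank candidate words by similarity between the target's contextual
    vector and each candidate's decontextualised embedding."""
    averaged = {word: decontextualise(vecs) for word, vecs in candidates.items()}
    return sorted(averaged, key=lambda w: cosine(target_vector, averaged[w]), reverse=True)

# Hypothetical contextual vector for "bright" in "a bright student".
target = [0.9, 0.1]
# Hypothetical contextual vectors for two candidates in two contexts each.
candidates = {
    "shiny": [[0.0, 1.0], [0.2, 0.8]],
    "smart": [[1.0, 0.0], [0.8, 0.2]],
}

print(rank_substitutes(target, candidates))
```

Because the ranking relies on embedding similarity rather than on the language model's output distribution, candidates are not filtered by how well they fit the surrounding function words, which is consistent with the reduced article-noun agreement bias the abstract reports.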
Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are limited by a lack of annotations, unitary simplification types, and outdated models, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus that contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.