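To interpret neural NLI models and their reasoning strategies, we carry out a systematic probing study that investigates whether these models capture the semantic features central to natural logic: monotonicity and concept inclusion. Correctly identifying valid inferences in downward-monotone contexts is a known stumbling block for NLI performance, subsuming linguistic phenomena such as negation scope and generalized quantifiers. To understand this difficulty, we emphasize monotonicity as a property of a context and examine the extent to which models capture monotonicity information in the contextual embeddings that are intermediate to their decision-making process. Drawing on recent advances in the probing paradigm, we compare the presence of monotonicity features across various models. We find that monotonicity information is notably weak in the representations of popular NLI models that achieve high scores on benchmarks, and observe that previous improvements to these models based on fine-tuning strategies have introduced stronger monotonicity features together with their improved performance on challenge sets.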
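The two natural-logic notions named in this abstract, monotonicity and concept inclusion, can be illustrated with a minimal sketch. It is not the probing setup of the paper: the tiny `HYPERNYMS` relation, the function names, and the example contexts are all assumptions made for illustration.

```python
# Toy concept-inclusion relation (assumed for illustration):
# each pair (a, b) means "every a is a b".
HYPERNYMS = {("poodle", "dog"), ("dog", "animal"), ("poodle", "animal")}


def includes(narrow, broad):
    """True if the concept `narrow` is included in the concept `broad`."""
    return narrow == broad or (narrow, broad) in HYPERNYMS


def substitution_is_valid(monotonicity, premise_term, hypothesis_term):
    """Check whether swapping premise_term for hypothesis_term preserves truth.

    In an upward-monotone context a term may be replaced by a broader one;
    in a downward-monotone context it may be replaced by a narrower one.
    """
    if monotonicity == "up":
        return includes(premise_term, hypothesis_term)
    if monotonicity == "down":
        return includes(hypothesis_term, premise_term)
    raise ValueError("monotonicity must be 'up' or 'down'")


# "I saw a ___" is upward monotone: dog -> animal is a valid inference.
print(substitution_is_valid("up", "dog", "animal"))
# "I did not see a ___" is downward monotone: dog -> animal is not valid,
# but dog -> poodle is.
print(substitution_is_valid("down", "dog", "animal"))
print(substitution_is_valid("down", "dog", "poodle"))
```

The point of the sketch is that the same pair of terms licenses opposite inferences depending on the context's monotonicity, which is why the abstract treats monotonicity as a property of the context rather than of the words.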
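Progress in pre-trained language models has led to impressive results on downstream tasks for natural language understanding. Recent work on probing pre-trained language models has uncovered a wide range of linguistic properties encoded in their contextualized representations. However, it is unclear whether they encode the semantic knowledge that is crucial to symbolic inference methods. We propose a methodology for probing linguistic information for logical inference in pre-trained language model representations. Our probing datasets cover a list of linguistic phenomena required by major symbolic inference systems. We find that (i) pre-trained language models do encode several types of linguistic information for inference, although some types of information are only weakly encoded, and (ii) language models can effectively learn missing linguistic information through fine-tuning. Overall, our findings provide insight into which aspects of linguistic information for logical inference language models and their pre-training procedures capture. Moreover, we demonstrate the potential of language models as semantic and background knowledge bases for supporting symbolic inference methods.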
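We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models do show evidence of structural priming, but also that the generalizations they learn are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful additional tool for gaining insight into the capacities of language models, and it opens the door to future priming-based investigations that probe the models' internal states.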
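Transformer-based language models have recently achieved remarkable results on many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate that theoretical linguistics and neural language models are still relevant to each other. On the one hand, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modeling research by providing frameworks and sources of data for probing language models on specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favor of analyzing word class flexibility as a directional process. In the second part, I propose a method for measuring surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis establishes new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives on the interpretation of language models.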
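Recursive noun phrases (NPs) have interesting semantic properties. For example, "my favorite new movie" is not necessarily "my favorite movie", whereas "my new favorite movie" is. This is common sense to humans, yet it is unknown whether pre-trained language models have such knowledge. We introduce the Recursive Noun Phrase Challenge (RNPC), a challenge set targeting the understanding of recursive NPs. When evaluated on our dataset, state-of-the-art Transformer models achieve only around chance performance. Still, we show that such knowledge is learnable with appropriate data. We further probe the models for relevant linguistic features that can be learned from our tasks, including modifier semantic category and modifier scope. Finally, models trained on RNPC achieve strong zero-shot performance on an extrinsic harm detection task, showing the usefulness of understanding recursive NPs in downstream applications. All code and data will be released at https://github.com/veronica320/recursive-nps.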
Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of seventeen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.
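The idea of a linear model trained on top of frozen representations can be sketched as follows. This is a minimal illustration, not the probing setup of the paper: the two-dimensional `FROZEN_VECTORS`, their labels, and the use of a perceptron as the linear model are assumptions chosen so that the example runs with the standard library alone.

```python
# Hypothetical frozen representations (never updated during probe training)
# and a binary linguistic label for each one. Both are made up for illustration.
FROZEN_VECTORS = [
    [2.0, 0.5], [1.5, 1.0], [1.8, -0.3],
    [-1.0, 0.2], [-2.0, -0.5], [-1.2, 0.8],
]
LABELS = [1, 1, 1, 0, 0, 0]


def predict(weights, bias, vector):
    """Linear decision rule: label 1 if w.x + b > 0, otherwise label 0."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score > 0 else 0


def train_linear_probe(features, labels, epochs=20, lr=1.0):
    """Train a perceptron probe; only the probe's weights and bias change."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(features, labels):
            error = label - predict(weights, bias, vector)
            if error:
                weights = [w + lr * error * x for w, x in zip(weights, vector)]
                bias += lr * error
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples the probe labels correctly."""
    hits = sum(predict(weights, bias, v) == y for v, y in zip(features, labels))
    return hits / len(labels)


weights, bias = train_linear_probe(FROZEN_VECTORS, LABELS)
print(probe_accuracy(weights, bias, FROZEN_VECTORS, LABELS))
```

Because the probe is restricted to a linear function, high probe accuracy suggests the labeled property is linearly recoverable from the representations themselves rather than computed by the probe. A real study would evaluate on held-out data instead of the training set used in this toy example.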
Long-distance agreement, taken as evidence for syntactic structure, is increasingly used to assess the syntactic generalization of neural language models. Much work has shown that transformers are capable of high accuracy in varied agreement tasks, but the mechanisms by which the models accomplish this behavior are still not well understood. To better understand transformers' internal workings, this work contrasts how they handle two superficially similar but theoretically distinct agreement phenomena: subject-verb and object-past participle agreement in French. Using probing and counterfactual analysis methods, our experiments show that i) the agreement task suffers from several confounders which partially call into question the conclusions drawn so far and ii) transformers handle subject-verb and object-past participle agreement in a way that is consistent with their modeling in theoretical linguistics.
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.
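For natural language processing systems, two kinds of evidence support the use of text representations from neural language models pretrained on large unannotated corpora: performance on application-inspired benchmarks (Peters et al., 2018, inter alia) and the emergence of syntactic abstractions in those representations (Tenney et al., 2019, inter alia). On the other hand, the lack of grounded supervision calls into question how well these representations can capture meaning (Bender and Koller, 2020). We apply novel probes to recent language models, focusing specifically on predicate-argument structure as operationalized by semantic dependencies (Ivanova et al., 2012), and find that, unlike syntax, semantics is not brought to the surface by today's pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning, yielding benefits on natural language understanding (NLU) tasks in the GLUE benchmark. This approach demonstrates the potential of general-purpose (rather than task-specific) linguistic supervision, above and beyond conventional pretraining and finetuning. Several diagnostics help to localize the benefits of our approach.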
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
State-of-the-art deep-learning-based approaches to Natural Language Processing (NLP) are credited with various capabilities that involve reasoning with natural language texts. In this paper we carry out a large-scale empirical study investigating the detection of formally valid inferences in controlled fragments of natural language for which the satisfiability problem becomes increasingly complex. We find that, while transformer-based language models perform surprisingly well in these scenarios, a deeper analysis reveals that they appear to overfit to superficial patterns in the data rather than acquiring the logical principles governing the reasoning in these fragments.
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
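The minimal-pair evaluation described in this abstract can be sketched as follows: a model is counted as correct on a pair when it scores the plausible sentence above its implausible counterpart. The function names are assumptions, and `toy_score` is a hypothetical stand-in for a language model's sentence score, not the scoring method used in the study. The sentence pairs are the two examples quoted in the abstract.

```python
# Each pair is (plausible sentence, implausible or less likely counterpart).
PAIRS = [
    ("The teacher bought the laptop", "The laptop bought the teacher"),
    ("The nanny tutored the boy", "The boy tutored the nanny"),
]

# Hypothetical word list used only by the toy scorer below.
ANIMATE = {"teacher", "nanny", "boy"}


def toy_score(sentence):
    """Stand-in for an LLM score: 1.0 if the subject noun is animate, else 0.0."""
    subject = sentence.split()[1].lower()
    return 1.0 if subject in ANIMATE else 0.0


def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the plausible sentence gets the higher score."""
    correct = sum(
        1 for plausible, implausible in pairs
        if score(plausible) > score(implausible)
    )
    return correct / len(pairs)


print(minimal_pair_accuracy(PAIRS, toy_score))
```

The toy scorer separates the possible/impossible pair but ties on the likely/unlikely pair, since both subjects are animate. That mirrors, in a very simplified way, the kind of gap the abstract reports between the two conditions.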
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
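The proliferation of deep neural networks across various domains has increased the need for interpretability of these models. Preliminary work along this line, and the papers that surveyed it, focused on high-level representation analysis. However, a recent branch of work has concentrated on interpretability at a more granular level, analyzing the neurons within these models. In this paper, we survey the work done on neuron analysis, including: i) methods to discover and understand neurons in a network, ii) evaluation methods, iii) major findings, including cross-architectural comparisons that neuron analysis has uncovered, iv) applications of neuron probing, such as controlling the model and domain adaptation, and v) a discussion of open issues and future research directions.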
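Specialized transformer-based models (such as BioBERT and BioMegatron) are adapted to the biomedical domain using publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility for supporting inference in cancer precision medicine, namely the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs, and diseases. We show that these models do indeed encode biological knowledge, although some of it is lost during fine-tuning for specific tasks. Finally, we analyze how the models behave with respect to biases and imbalances in the dataset.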
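With the success of contextualized language models, much research has explored what these models really learn and in which cases they still fail. Most of this work focuses on specific NLP tasks and on the learning outcome. Little research has attempted to decouple the models' weaknesses from specific tasks and to focus on the embeddings themselves and the way they are learned. In this paper, we take up this research opportunity: based on theoretical linguistic insights, we investigate whether the semantic constraints of function words are learned and how the surrounding context affects their embeddings. We create suitable datasets, provide new insights into the inner workings of LMs vis-a-vis function words, and implement an assisting visual web interface for qualitative analysis.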
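Although sentence anomalies have been applied periodically for testing in NLP, we have yet to establish a picture of the precise status of anomaly information in representations from NLP models. In this paper we aim to fill two primary gaps, focusing on the domain of syntactic anomalies. First, we explore fine-grained differences in anomaly encoding by designing probing tasks that vary the hierarchical level at which anomalies occur in a sentence. Second, we test not only the models' ability to detect a given anomaly, but also the generality of the detected anomaly signal, by examining transfer between distinct anomaly types. Results suggest that all models encode some information supporting anomaly detection, but detection performance varies between anomalies, and only representations from more recent transformer models show signs of generalized knowledge of anomalies. Follow-up analyses support the notion that these models pick up on a legitimate, general notion of sentence oddity, while coarser-grained word position information is likely also a contributor to the observed anomaly detection.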
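While much work has been done on understanding the representations learned within deep NLP models and the knowledge they capture, little attention has been paid to individual neurons. We present a technique called Linguistic Correlation Analysis for extracting the salient neurons in a model with respect to any extrinsic property, with the goal of understanding how such knowledge is preserved within neurons. We carry out a fine-grained analysis to answer the following questions: (i) can we identify subsets of neurons in the network that capture specific linguistic properties? (ii) how localized or distributed are these neurons across the network? (iii) how redundantly is the information preserved? (iv) how does fine-tuning pre-trained models for downstream NLP tasks affect the learned linguistic knowledge? (v) how do architectures vary in learning different linguistic properties? Our data-driven, quantitative analysis yields interesting findings: (i) we found small subsets of neurons that can predict different linguistic tasks, (ii) neurons capturing basic lexical information (such as suffixation) are localized in the lowermost layers, (iii) while those learning complex concepts (such as syntactic role) are found predominantly in the middle and higher layers, (iv) salient linguistic neurons are relocated from higher to lower layers during transfer learning, as the network preserves the higher layers for task-specific information, (v) we found interesting differences across pre-trained models with respect to how linguistic information is preserved within them, and (vi) we found that concepts exhibit similar neuronal distributions across different languages in multilingual transformer models. Our code is publicly available as part of the NeuroX toolkit.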
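Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as "eye is to seeing what ear is to hearing", sometimes referred to as analogical proportions, shapes how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models using benchmarks obtained from educational settings as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and that results are highly sensitive to model architecture and hyperparameters. Overall, the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.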