Long-distance agreement, a phenomenon taken as evidence for syntactic structure, is increasingly used to assess the syntactic generalization abilities of Neural Language Models. Much work has shown that transformers can achieve high accuracy on a variety of agreement tasks, but the mechanisms by which the models accomplish this behavior are still not well understood. To better understand transformers' internal workings, this work contrasts how they handle two superficially similar but theoretically distinct agreement phenomena: subject-verb and object-past participle agreement in French. Using probing and counterfactual analysis methods, our experiments show that i) the agreement task suffers from several confounders that partially call into question the conclusions drawn so far, and ii) transformers handle subject-verb and object-past participle agreement in a way that is consistent with their modeling in theoretical linguistics.
We study to what extent modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more likely in follow-up sentences. We explore how priming can be used to study the potential of these models to learn abstract structural information, a prerequisite for good performance on tasks requiring natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but that the generalizations they learn are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models encode not only abstract sequential structure but involve a certain level of hierarchical syntactic information as well. More generally, our study shows that the priming paradigm is a useful tool for gaining insight into the capacities of language models and opens the door to future priming-based investigations that probe the models' internal states.
Transformer-based language models have recently achieved remarkable results on many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into the neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate that theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists by providing an objective tool for measuring semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Taking mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of the thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies. Finally, in the third part of the thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In sum, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to offer fresh perspectives for the interpretation of language models.
How do neural language models keep track of number agreement between subject and verb? We show that 'diagnostic classifiers', trained to predict number from the internal states of a language model, give a detailed understanding of how, when, and where this information is represented. Moreover, in cases where the language model ends up making an agreement error, they tell us when and where the number information gets corrupted. To demonstrate the causal role played by the representations we uncover, we then use the agreement information to influence the course of the LSTM during the processing of difficult sentences. The results of this intervention reveal a large increase in the language model's accuracy. These results show that diagnostic classifiers give us an unrivalled, detailed look into the representation of linguistic information in neural models, and demonstrate that this knowledge can be used to improve their performance.
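As a rough, self-contained illustration of the diagnostic-classifier idea described above (not the paper's LSTM setup), the sketch below trains a linear probe on hidden states from a pretrained language model to predict the subject's grammatical number; the model choice, the layer, and the toy prefixes are assumptions made for the example.

```python
# Sketch of a diagnostic (probing) classifier: predict the grammatical number
# of the subject from a language model's hidden states. GPT-2 and the toy
# prefixes below are placeholders, not the original paper's LSTM and data.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prefixes = [
    ("The key to the cabinets", 0),   # 0 = singular subject
    ("The keys to the cabinet", 1),   # 1 = plural subject
    ("The author of the books", 0),
    ("The authors of the book", 1),
]

features, labels = [], []
for text, number in prefixes:
    with torch.no_grad():
        out = model(**tokenizer(text, return_tensors="pt"))
    # Probe input: the last token's vector from an intermediate layer.
    features.append(out.hidden_states[6][0, -1].numpy())
    labels.append(number)

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```

In a full study, a probe of this kind would be trained on many sentences and evaluated at every layer and time step in order to localize where number information is carried or lost.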
Both humans and neural language models are able to perform subject-verb number agreement (SVA). In principle, semantics should not interfere with this task, which only requires syntactic knowledge. In this work we test whether meaning interferes with English agreement across syntactic structures of various complexity. To this end, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base with that of humans, obtained through a psycholinguistic online crowdsourcing experiment. We find that both BERT and humans are sensitive to our semantic manipulation: they make more errors when presented with nonsensical items, especially when the syntactic structure features an attractor (a noun phrase between the subject and the verb that differs from the subject in number). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing the former's higher lexical sensitivity on this task.
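One common way to operationalize such an agreement test with a masked language model is to mask the verb slot and compare the scores of its singular and plural forms. The snippet below is a sketch under assumptions (the template sentence and the use of BERT-base are illustrative, not the authors' stimuli or exact protocol):

```python
# Sketch of a masked-LM agreement test: mask the verb and compare the logits
# of the singular and plural forms, with an attractor between subject and verb.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def verb_preference(template, singular, plural):
    enc = tokenizer(template.format(tokenizer.mask_token), return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    sg_id = tokenizer.convert_tokens_to_ids(singular)
    pl_id = tokenizer.convert_tokens_to_ids(plural)
    return logits[sg_id].item(), logits[pl_id].item()

# Singular subject "key" with plural attractor "cabinets": the correct verb is "is".
sg, pl = verb_preference("The key to the cabinets {} on the table.", "is", "are")
print(f"is: {sg:.2f}  are: {pl:.2f}  correct form preferred: {sg > pl}")
```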
This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.'s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
While sentence anomalies have periodically been applied to testing in NLP, we have yet to establish a picture of the precise status of anomaly information in representations from NLP models. In this paper we aim to fill two primary gaps, focusing on the domain of syntactic anomalies. First, we explore fine-grained differences in anomaly encoding by designing probing tasks that vary the hierarchical level at which anomalies occur in a sentence. Second, we test not only models' ability to detect a given anomaly but also the generality of the detected anomaly signal, by examining transfer between distinct anomaly types. Results suggest that all models encode some information supporting anomaly detection, but detection performance varies between anomalies, and only representations from more recent transformer models show signs of generalized knowledge of anomalies. Follow-up analyses support the notion that these models pick up on a legitimate, general notion of sentence oddity, while coarser word-position information is likely also a contributor to the observed anomaly detection.
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
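A minimal-pair comparison of this kind can be sketched by scoring each sentence with a causal language model and checking which variant receives the higher log-probability. The snippet below uses GPT-2 as a stand-in and illustrates only the scoring idea, not the paper's curated dataset or evaluation pipeline.

```python
# Sketch of minimal-pair scoring with a causal language model: sum token
# log-probabilities for each sentence and check whether the plausible
# variant scores higher than its implausible counterpart.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the loss is the mean cross-entropy over the
        # predicted tokens; multiply by their count to get a total log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

pair = ("The teacher bought the laptop.", "The laptop bought the teacher.")
scores = {s: sentence_logprob(s) for s in pair}
print(scores)
print("plausible variant preferred:", scores[pair[0]] > scores[pair[1]])
```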
Syntax is a latent hierarchical structure which underpins the robust and compositional nature of human language. An active line of inquiry is whether large pretrained language models (LLMs) are able to acquire syntax by training on text alone; understanding a model's syntactic capabilities is essential to understanding how it processes and makes use of language. In this paper, we propose a new method, SSUD, which allows for the induction of syntactic structures without supervision from gold-standard parses. Instead, we seek to define formalism-agnostic, model-intrinsic syntactic parses by using a property of syntactic relations: syntactic substitutability. We demonstrate both quantitative and qualitative gains on dependency parsing tasks using SSUD, and induce syntactic structures which we hope provide clarity into LLMs and linguistic representations, alike.
Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today's most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners without significant advantages over humans. If an appropriate model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will allow us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims that argue for innate linguistic knowledge in humans on the basis of speculations about learnability. Comparable experiments will never be possible with human subjects due to practical and ethical considerations, making model learners an indispensable resource. So far, attempts to deprive current models of unfair advantages have obtained sub-human results on key grammar-dependent behaviours such as acceptability judgements. But before we can justifiably conclude that language learning requires more domain-specific knowledge than current models possess, we must first explore non-linguistic input, in the form of multimodal stimuli and multi-agent interaction, as ways to make learners more efficient at learning from limited linguistic input.
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure (e.g., individual dependencies), text generated by the models is significantly less novel than the baseline of human-generated text from each model's test set. For larger-scale structure (e.g., overall sentence structure), model-generated text is as novel as, or even more novel than, the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform an extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has fairly frequent semantic issues (e.g., being self-contradictory).
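The n-gram side of such a novelty analysis can be illustrated with a toy computation: count how many n-grams of the generated text never occur in the training corpus. The strings below are placeholders, and the actual RAVEN analyses are considerably more elaborate.

```python
# Toy sketch of n-gram novelty: what fraction of the n-grams in generated
# text are absent from the training corpus?
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

training_text = "the dog chewed the bone and the cat watched the dog"
generated_text = "the cat chewed the bone while the dog slept"

train_tokens = training_text.split()
gen_tokens = generated_text.split()

for n in (2, 3, 4):
    gen = ngrams(gen_tokens, n)
    novel = gen - ngrams(train_tokens, n)
    print(f"{n}-grams: {len(novel)}/{len(gen)} novel")
```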
A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
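As a concrete illustration of the kind of shortcut HANS targets, the toy function below implements a bare-bones lexical overlap heuristic; the example sentences are illustrative rather than taken from the dataset.

```python
# Toy lexical overlap heuristic: predict "entailment" whenever every word of
# the hypothesis also appears in the premise. HANS is constructed so that
# heuristics like this fail on its non-entailment examples.
def overlap_heuristic(premise: str, hypothesis: str) -> str:
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# The heuristic happens to be right here:
print(overlap_heuristic("The doctor near the actor danced", "The doctor danced"))
# ...but wrong on a HANS-style counterexample (gold label: non-entailment):
print(overlap_heuristic("The actor was paid by the doctor", "The actor paid the doctor"))
```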
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
End-to-end neural NLP architectures are notoriously difficult to understand, which has given rise to numerous efforts towards model explainability in recent years. A fundamental principle of model explanation is faithfulness, i.e., an explanation should accurately represent the reasoning process behind the model's prediction. This survey first discusses the definition and evaluation of faithfulness, as well as its significance for explainability. We then introduce recent advances in faithful explanation by grouping methods into five categories: similarity-based methods, analysis of model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. Each category is illustrated with its representative studies, strengths, and shortcomings. Finally, we discuss all of the above methods in terms of their common virtues and limitations, and reflect on future directions for faithful explainability. For researchers interested in studying interpretability, this survey offers an accessible and comprehensive overview of the field and a foundation for further exploration. For users hoping to better understand their own models, it serves as an introductory manual that helps choose the most suitable explanation method(s).
We propose reconstruction probing, a new analysis method for contextualized representations based on reconstruction probabilities in masked language models (MLMs). This method relies on comparing the reconstruction probabilities of tokens in a given sequence when conditioned on the representation of a single token that has been fully contextualized and when conditioned on only the decontextualized lexical prior of the model. This comparison can be understood as quantifying the contribution of contextualization towards reconstruction -- the difference in the reconstruction probabilities can only be attributed to the representational change of the single token induced by contextualization. We apply this analysis to three MLMs and find that contextualization boosts reconstructability of tokens that are close to the token being reconstructed in terms of linear and syntactic distance. Furthermore, we extend our analysis to finer-grained decomposition of contextualized representations, and we find that these boosts are largely attributable to static and positional embeddings at the input layer.
Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of seventeen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.
There is an ongoing debate in the NLP community on whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we examine whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objective retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates linguistic information and the language modeling objective. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic data and on real NLP tasks in English.
Grammatical cues are sometimes redundant with word meanings in natural language. For instance, English word order rules constrain the word order of a sentence like 'The dog chewed the bone', even though the status of 'dog' as the agent and 'bone' as the patient can be inferred from world knowledge and plausibility. Quantifying how often this redundancy occurs, and how the level of redundancy varies across typologically diverse languages, can shed light on the function and evolution of grammar. To this end, we carried out a behavioural experiment in English and Russian and a cross-linguistic computational analysis measuring the redundancy of grammatical cues in transitive clauses extracted from naturalistic text. Subjects, verbs, and objects extracted from naturally occurring sentences (presented in random order and with morphological markers removed) were shown to English and Russian speakers (n = 484), who were asked to identify which noun is the agent of the action. Accuracy was high in both languages (~89% in English, ~87% in Russian). Next, we trained a neural-network classifier on a similar task: predicting which nominal in a subject-verb-object triad is the subject. Across 30 languages from eight language families, performance was consistently high, with a median accuracy of 87%, comparable to the accuracy observed in the human experiments. The conclusion is that grammatical cues such as word order are necessary for conveying agenthood and patienthood in only about 10-15% of natural sentences; nevertheless, they can (a) provide an important source of redundancy and (b) be crucial for conveying intended meanings that cannot be inferred from the words alone, including descriptions of human interactions, where roles are often reversible (e.g., Ray helped Lu / Lu helped Ray), and expressions of non-prototypical meanings (e.g., 'The bone chewed the dog').
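A loose approximation of this classification setup (not the paper's multilingual neural classifier) is to score each noun of a clause as a candidate agent of the verb and pick the higher-scoring one; the tiny training set and bag-of-words features below are invented for illustration.

```python
# Toy approximation of agent identification from word identities alone:
# score (noun, verb) pairs with a linear model trained on a few labelled
# examples, then choose the noun with the higher agent score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (noun, verb) pairs labelled 1 if the noun is the agent of the verb, else 0.
pairs = [
    ("dog chewed", 1), ("bone chewed", 0),
    ("teacher bought", 1), ("laptop bought", 0),
    ("girl kicked", 1), ("ball kicked", 0),
    ("nanny tutored", 1), ("boy tutored", 1),  # reversible roles do occur
]
texts, labels = zip(*pairs)
scorer = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scorer.fit(texts, labels)

def guess_agent(verb, noun_a, noun_b):
    scores = scorer.predict_proba([f"{noun_a} {verb}", f"{noun_b} {verb}"])[:, 1]
    return noun_a if scores[0] >= scores[1] else noun_b

print(guess_agent("chewed", "bone", "dog"))  # likely prints "dog"
```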
We investigate the extent to which verb alternation classes, as described by Levin (1993), are encoded in the embeddings of pre-trained language models (PLMs), using word- and sentence-level prediction tasks. We follow and expand upon the experiments of Kann et al. (2019), which aim to probe whether static embeddings encode frame-selectional properties of verbs. At both the word and sentence level, we find that contextual embeddings from PLMs not only outperform non-contextual embeddings but achieve astonishingly high accuracy on tasks across most alternation classes. Furthermore, we find evidence that the intermediate layers of PLMs achieve better performance on average than the lower layers across all probing tasks.
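Layer-wise probing of this sort can be sketched as fitting one linear probe per layer of a pretrained model and comparing accuracies across layers. In the snippet below, the model, the mean pooling, and the sentence labels are placeholder assumptions rather than the paper's diagnostic tasks.

```python
# Sketch of layer-wise probing: fit one linear probe per layer of a masked
# language model and compare accuracies. The sentences and binary labels are
# invented placeholders for a frame-selectional property.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

examples = [
    ("The child broke the vase.", 1), ("The vase broke.", 1),
    ("The cook melted the butter.", 1), ("The butter melted.", 1),
    ("The author wrote the book.", 0), ("The dog chased the cat.", 0),
    ("The chef ate the soup.", 0), ("The girl saw the star.", 0),
]

def sentence_vector(sentence, layer):
    with torch.no_grad():
        out = model(**tokenizer(sentence, return_tensors="pt"))
    # Mean-pool the chosen layer's token vectors into one sentence vector.
    return out.hidden_states[layer][0].mean(dim=0).numpy()

labels = [y for _, y in examples]
for layer in (1, 6, 12):
    feats = [sentence_vector(s, layer) for s, _ in examples]
    probe = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(f"layer {layer:2d}: train accuracy {probe.score(feats, labels):.2f}")
```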