神经语言模型如何跟踪主题和动词之间的数字协议?我们显示“诊断分类器”,以预测语言模型的内部状态的预测数字,详细了解如何,何时和位置。此外,在语言模型最终达成协议错误的情况下,他们向我们讨论数字信息损坏的时间和地点。为了展示我们发现的陈述所扮演的因果作用,我们在处理困难句子期间使用协议信息来影响LSTM的过程。这种干预的结果揭示了语言模型的准确性大幅增加。这些结果表明,诊断分类器给我们一个无与伦比的详细研究神经模型中语言信息的表示,并证明了这种知识可用于提高它们的性能。
translated by 谷歌翻译
The long-distance agreement, evidence for syntactic structure, is increasingly used to assess the syntactic generalization of Neural Language Models. Much work has shown that transformers are capable of high accuracy in varied agreement tasks, but the mechanisms by which the models accomplish this behavior are still not well understood. To better understand transformers' internal working, this work contrasts how they handle two superficially similar but theoretically distinct agreement phenomena: subject-verb and object-past participle agreement in French. Using probing and counterfactual analysis methods, our experiments show that i) the agreement task suffers from several confounders which partially question the conclusions drawn so far and ii) transformers handle subject-verb and object-past participle agreements in a way that is consistent with their modeling in theoretical linguistics.
translated by 谷歌翻译
对理解通过语言模型(LMS)的隐藏表示捕获的信息有很多兴趣。通常,解释方法I)不保证模型实际使用编码信息,并且II)不发现负责考虑现象的小型神经元。灵感来自因果调解分析,我们提出了一种在神经LM内发现的方法,该方法在负责特定语言现象的小神经元的小神经元中发现,即引起相应令牌排放概率的变化的子集。我们使用可怜的放松来近似搜索组合空间。 $ L_0 $正常化术语可确保搜索收敛到离散和稀疏解决方案。我们应用我们在LSTMS中分析主题 - 动词号协议和性别偏见检测的方法。我们观察到它是快速的,而不是替代方面的解决方案(加强)。我们的实验证实,这些现象中的每一个都是通过不发挥任何其他可辨别作用的小神经元的小型介导的。
translated by 谷歌翻译
基于变压器的语言模型最近在许多自然语言任务中取得了显着的结果。但是,通常通过利用大量培训数据来实现排行榜的性能,并且很少通过将明确的语言知识编码为神经模型。这使许多人质疑语言学对现代自然语言处理的相关性。在本文中,我介绍了几个案例研究,以说明理论语言学和神经语言模型仍然相互关联。首先,语言模型通过提供一个客观的工具来测量语义距离,这对语言学家很有用,语义距离很难使用传统方法。另一方面,语言理论通过提供框架和数据源来探究我们的语言模型,以了解语言理解的特定方面,从而有助于语言建模研究。本论文贡献了三项研究,探讨了语言模型中语法 - 听觉界面的不同方面。在论文的第一部分中,我将语言模型应用于单词类灵活性的问题。我将Mbert作为语义距离测量的来源,我提供了有利于将单词类灵活性分析为方向过程的证据。在论文的第二部分中,我提出了一种方法来测量语言模型中间层的惊奇方法。我的实验表明,包含形态句法异常的句子触发了语言模型早期的惊喜,而不是语义和常识异常。最后,在论文的第三部分中,我适应了一些心理语言学研究,以表明语言模型包含了论证结构结构的知识。总而言之,我的论文在自然语言处理,语言理论和心理语言学之间建立了新的联系,以为语言模型的解释提供新的观点。
translated by 谷歌翻译
Syntax is a latent hierarchical structure which underpins the robust and compositional nature of human language. An active line of inquiry is whether large pretrained language models (LLMs) are able to acquire syntax by training on text alone; understanding a model's syntactic capabilities is essential to understanding how it processes and makes use of language. In this paper, we propose a new method, SSUD, which allows for the induction of syntactic structures without supervision from gold-standard parses. Instead, we seek to define formalism-agnostic, model-intrinsic syntactic parses by using a property of syntactic relations: syntactic substitutability. We demonstrate both quantitative and qualitative gains on dependency parsing tasks using SSUD, and induce syntactic structures which we hope provide clarity into LLMs and linguistic representations, alike.
translated by 谷歌翻译
我们研究了现代神经语言模型容易受到结构启动的程度,这种现象使句子的结构在后续句子中更有可能使相同的结构更有可能。我们探索如何使用启动来研究这些模型学习抽象结构信息的潜力,这是需要自然语言理解技能的任务良好表现的先决条件。我们引入了一种新型的度量标准和释放Prime-LM,这是一个大型语料库,我们可以控制与启动强度相互作用的各种语言因素。我们发现,变压器模型确实显示了结构启动的证据,但他们所学到的概括在某种程度上是由语义信息调节的。我们的实验还表明,模型获得的表示不仅可以编码抽象的顺序结构,而且还涉及一定级别的层次句法信息。更普遍的是,我们的研究表明,启动范式是一种有用的,可用于洞悉语言模型能力的有用的,并为未来的基于底漆的调查打开了探测模型内部状态的未来大门。
translated by 谷歌翻译
This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.'s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
translated by 谷歌翻译
当前的语言模型可以产生高质量的文本。他们只是复制他们之前看到的文本,或者他们学习了普遍的语言抽象吗?要取笑这些可能性,我们介绍了乌鸦,这是一套评估生成文本的新颖性,专注于顺序结构(n-gram)和句法结构。我们将这些分析应用于四种神经语言模型(LSTM,变压器,变换器-XL和GPT-2)。对于本地结构 - 例如,单个依赖性 - 模型生成的文本比来自每个模型的测试集的人类生成文本的基线显着不那么新颖。对于大规模结构 - 例如,总句结构 - 模型生成的文本与人生成的基线一样新颖甚至更新颖,但模型仍然有时复制,在某些情况下,在训练集中重复超过1000字超过1,000字的通道。我们还表现了广泛的手动分析,表明GPT-2的新文本通常在形态学和语法中形成良好,但具有合理的语义问题(例如,是自相矛盾)。
translated by 谷歌翻译
语法提示有时具有自然语言的单词含义。例如,英语单词顺序规则限制了句子的单词顺序,例如“狗咀嚼骨头”,即使可以从世界知识和合理性中推断出“狗”作为代理人和“骨头”的状态。量化这种冗余的发生频率,以及冗余水平如何在类型上多样化的语言中变化,可以阐明语法的功能和演变。为此,我们在英语和俄语中进行了一个行为实验,并进行了跨语言计算分析,以测量从自然主义文本中提取的及物子句中语法线索的冗余性。从自然发生的句子中提取的主题,动词和物体(按随机顺序和形态标记)提出了英语和俄罗斯说话者(n = 484),并被要求确定哪个名词是该动作的推动者。两种语言的准确性都很高(英语约为89%,俄语为87%)。接下来,我们在类似的任务上训练了神经网络机分类器:预测主题对象三合会中的哪个名义是主题。在来自八个语言家庭的30种语言中,性能始终很高:中位准确性为87%,与人类实验中观察到的准确性相当。结论是,语法提示(例如单词顺序)对于仅在10-15%的自然句子中传达了代理和耐心是必要的。然而,他们可以(a)提供重要的冗余来源,(b)对于传达无法从单词中推断出的预期含义至关重要,包括对人类互动的描述,在这些含义中,角色通常是可逆的(例如,雷(Ray)帮助lu/ Lu帮助雷),表达了非典型的含义(例如,“骨头咀嚼狗”。)。
translated by 谷歌翻译
在本文中,我们试图通过引入深度学习模型的句法归纳偏见来建立两所学校之间的联系。我们提出了两个归纳偏见的家族,一个家庭用于选区结构,另一个用于依赖性结构。选区归纳偏见鼓励深度学习模型使用不同的单位(或神经元)分别处理长期和短期信息。这种分离为深度学习模型提供了一种方法,可以从顺序输入中构建潜在的层次表示形式,即更高级别的表示由高级表示形式组成,并且可以分解为一系列低级表示。例如,在不了解地面实际结构的情况下,我们提出的模型学会通过根据其句法结构组成变量和运算符的表示来处理逻辑表达。另一方面,依赖归纳偏置鼓励模型在输入序列中找到实体之间的潜在关系。对于自然语言,潜在关系通常被建模为一个定向依赖图,其中一个单词恰好具有一个父节点和零或几个孩子的节点。将此约束应用于类似变压器的模型之后,我们发现该模型能够诱导接近人类专家注释的有向图,并且在不同任务上也优于标准变压器模型。我们认为,这些实验结果为深度学习模型的未来发展展示了一个有趣的选择。
translated by 谷歌翻译
人类和神经语言模型都能够执行主题 - 动词数协议(SVA)。原则上,语义不应干扰此任务,这仅需要句法知识。在这项工作中,我们测试含义是否干扰了各种复杂性的句法结构中的英语一致性。为此,我们同时生成语义上良好的和荒谬的项目。我们将Bert Base与人类的表现进行了比较,该表现是通过心理语言在线众包实验获得的。我们发现伯特和人类都对我们的语义操纵敏感:出现荒谬的项目时,它们的频率更高,尤其是当它们的句法结构具有吸引子(主题和动词之间的名词短语和与该数字不同的名词短语)时主题)。我们还发现,有意义性对SVA错误的影响对于BERT而言比对人类的影响更强,显示前者对这项任务的词汇敏感性更高。
translated by 谷歌翻译
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
translated by 谷歌翻译
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
虽然句子异常已经定期应用于NLP中的测试,但我们尚未建立从NLP模型中的表示中的异常信息的确切状态的图片。在本文中,我们的目标是填补两个主要间隙,重点关注句法异常的领域。首先,我们通过设计改变异常在句子中发生的分层级别的探测任务来探讨异常编码的细粒度差异。其次,我们不仅测试了模型能够通过检查不同异常类型之间的转移来检测给定异常的能力,还能检测给定的异常信号的一般性。结果表明,所有型号都编码一些支持异常检测的信息,但检测性能在异常之间变化,并且只有最近的变压器模型的唯一表示显示了异常知识的概括知识的迹象。随访分析支持这些模型在合法的句子奇迹上接受合法的概念,而粗糙的单词位置信息也可能是观察到的异常检测的贡献者。
translated by 谷歌翻译
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
translated by 谷歌翻译
我们表明,LSTM和统一进化的复发神经网络(URN)都可以在两种类型的句法模式上实现令人鼓舞的准确性:无上下文的长距离一致性和轻微的上下文敏感的交叉序列依赖性。这项工作扩展了有关深层无上下文的长距离依赖性的最新实验,结果相似。URN与LSTM的不同之处在于它们避免了非线性激活函数,并且它们将矩阵乘法应用于编码为单位矩阵的单词嵌入。这使他们可以将所有信息保留在任意距离上的输入字符串中。这也使他们满足严格的组成性。在应用于NLP的深度学习中寻找可解释的模型时,URN构成了重大进步。
translated by 谷歌翻译
This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
translated by 谷歌翻译