Transformer-based language models have shown strong performance on an array of natural language understanding tasks. However, the question of how these models react to implicit meaning has been largely unexplored. We investigate this using the complement coercion phenomenon, which involves sentences like "The student finished the book about sailing" where the action "reading" is implicit. We compare LMs' surprisal estimates at various critical sentence regions in sentences with and without implicit meaning. Effects associated with recovering implicit meaning were found at a critical region other than where sentences minimally differ. We then use follow-up experiments to factor out potential confounds, revealing different perspectives that offer a richer and more accurate picture.
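As a rough illustration of the region-level surprisal comparison described above, the sketch below converts per-token conditional probabilities into surprisal in bits and sums them over a critical region. The function names and the half-open region convention are assumptions made for this example; the abstract does not specify how the authors aggregate surprisal, and the probabilities would in practice come from a language model rather than being typed in by hand.

```python
import math
from typing import List, Sequence


def token_surprisals(probs: Sequence[float]) -> List[float]:
    """Return surprisal in bits, -log2(p), for each conditional token probability."""
    if any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return [-math.log2(p) for p in probs]


def region_surprisal(probs: Sequence[float], start: int, end: int) -> float:
    """Sum surprisal over tokens start..end-1 (hypothetical half-open region)."""
    return sum(token_surprisals(probs[start:end]))
```

Comparing `region_surprisal` for the same region in a sentence with and without implicit meaning gives one simple effect measure; averaging instead of summing is an equally plausible choice when regions differ in length.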
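Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, performance on leaderboards is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists by providing an objective tool to measure semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisals earlier in language models than semantic and commonsense anomalies. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.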
This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with more parameters and lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times. First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequently, analysis of residual errors reveals a systematic deviation of the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words such as modals and conjunctions. These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pre-trained language models to study human language processing.
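We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalisations they have learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful additional tool for gaining insights into the capacities of language models, and it opens the door to future priming-based investigations that probe the model's internal states.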
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
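While sentence anomalies have been applied periodically for testing in NLP, we have yet to establish a picture of the precise status of anomaly information in representations from NLP models. In this paper we aim to fill two primary gaps, focusing on the domain of syntactic anomalies. First, we explore fine-grained differences in anomaly encoding by designing probing tasks that vary the hierarchical level at which anomalies occur in a sentence. Second, we test not only models' ability to detect a given anomaly, but also the generality of the detected anomaly signal, by examining transfer between distinct anomaly types. Results suggest that all models encode some information supporting anomaly detection, but detection performance varies between anomalies, and only representations from more recent transformer models show signs of generalized knowledge of anomalies. Follow-up analyses support the notion that these models pick up on a legitimate, general notion of sentence oddity, while coarser-grained word position information is likely also a contributor to the observed anomaly detection.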
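Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today's most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners without significant advantages over humans. If an appropriate model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will enable us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims arguing for innate linguistic knowledge in humans on the basis of speculations about learnability. Comparable experiments will never be possible with human subjects due to practical and ethical considerations, making model learners an indispensable resource. So far, attempts to deprive current models of unfair advantages obtain sub-human results for key grammatical behaviors such as acceptability judgments. But before we can justifiably conclude that language learning requires more prior domain-specific knowledge than current models possess, we must first explore non-linguistic inputs in the form of multimodal stimuli and multi-agent interaction as ways to make our learners more efficient at learning from limited linguistic input.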
Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.
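Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure (e.g., individual dependencies), model-generated text is substantially less novel than our baseline of human-generated text from each model's test set. For larger-scale structure (e.g., overall sentence structure), model-generated text is as novel as or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).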
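The idea of n-gram novelty in the abstract above can be sketched as the fraction of n-grams in a generated text that never occur in the training text. This is a minimal sketch under stated assumptions, not the RAVEN implementation: the function names are hypothetical, tokenization is assumed to have been done already, and the training corpus is assumed small enough to hold its n-grams in an in-memory set.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty_rate(generated: List[str], training: List[str], n: int) -> float:
    """Fraction of generated n-grams that do not appear in the training tokens."""
    seen = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        # Text shorter than n tokens has no n-grams to judge.
        return 0.0
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)
```

Computing `novelty_rate` for several values of `n`, once for model output and once for held-out human text, gives the kind of model-versus-baseline comparison the abstract describes for sequential structure.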
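Research on human reading has long documented that reading behavior shows task-specific effects, but it has been challenging to build general models that predict what reading behavior humans will show in a given task. We introduce NEAT, a computational model of the allocation of attention in human reading, based on the hypothesis that human reading optimizes a tradeoff between economy of attention and success at a task. Our model is implemented using contemporary neural network modeling techniques, and makes explicit and testable predictions about how the allocation of attention varies across different tasks. We test this in an eye-tracking study comparing two versions of a reading comprehension task, finding that our model successfully accounts for reading behavior across the tasks. Our work thus provides evidence that task effects can be modeled as optimal adaptation to task demands.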
Long-distance agreement, taken as evidence for syntactic structure, is increasingly used to assess the syntactic generalization of neural language models. Much work has shown that transformers are capable of high accuracy in varied agreement tasks, but the mechanisms by which the models accomplish this behavior are still not well understood. To better understand transformers' inner workings, this work contrasts how they handle two superficially similar but theoretically distinct agreement phenomena: subject-verb and object-past participle agreement in French. Using probing and counterfactual analysis methods, our experiments show that i) the agreement task suffers from several confounders which partially call into question the conclusions drawn so far and ii) transformers handle subject-verb and object-past participle agreement in a way that is consistent with their modeling in theoretical linguistics.
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Pragmatics is an essential part of communication, but it remains unclear what mechanisms underlie human pragmatic communication and whether NLP systems capture pragmatic language understanding. To investigate both these questions, we perform a fine-grained comparison of language models and humans on seven pragmatic phenomena, using zero-shot prompting on an expert-curated set of English materials. We ask whether models (1) select pragmatic interpretations of speaker utterances, (2) make similar error patterns as humans, and (3) use similar linguistic cues as humans to solve the tasks. We find that the largest models achieve high accuracy and match human error patterns: within incorrect responses, models favor the literal interpretation of an utterance over heuristic-based distractors. We also find evidence that models and humans are sensitive to similar linguistic cues. Our results suggest that even paradigmatic pragmatic phenomena may be solved without explicit representations of other agents' mental states, and that artificial models can be used to gain mechanistic insights into human pragmatic processing.
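Many contextualized word representations are now learned by intricate neural network models, such as masked neural language models (MNLMs), which consist of huge neural network structures and are trained to restore masked text. Such representations have shown superhuman performance on some reading comprehension (RC) tasks, which extract an appropriate answer from a context given a question. However, identifying the detailed knowledge trained in MNLMs is challenging because of the large number of model parameters. This paper provides new insights and empirical analyses on the commonsense knowledge contained in MNLMs. First, we use a diagnostic test to evaluate whether commonsense knowledge is properly trained in MNLMs. We observe that much commonsense knowledge is not appropriately trained in MNLMs and that MNLMs often do not accurately understand the semantic meaning of relations. In addition, we find that MNLM-based RC models are still vulnerable to semantic variations that require commonsense knowledge. Finally, we identify the fundamental reason why some knowledge is not trained. We further suggest that utilizing an external commonsense knowledge repository can be an effective solution. In controlled experiments, we illustrate the possibility of overcoming the limitations of MNLM-based RC models by enriching text passages with knowledge from an external commonsense knowledge repository.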
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
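The evaluation described above can be sketched as a minimal-pair accuracy: the share of pairs for which a model scores the acceptable sentence above the unacceptable one, optionally after a shared context has been prepended. Everything named here is an assumption for illustration. In particular, `score` is a hypothetical callable standing in for a real language model's log-probability of the sentence given the context, and the paper's exact metric may differ.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical scorer type: (context, sentence) -> log-probability of the sentence.
Scorer = Callable[[str, str], float]


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Scorer,
    context: str = "",
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1 for good, bad in pair_list if score(context, good) > score(context, bad)
    )
    return correct / len(pair_list)
```

Running the same pairs with different `context` strings (empty, randomly sampled, structurally matching, or matching but ungrammatical) and comparing the resulting accuracies mirrors the manipulation the abstract reports.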
This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.'s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
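With the success of contextualized language models, much research has explored what these models really learn and in which cases they still fail. Most of this work focuses on specific NLP tasks and on learning outcomes. Little research has attempted to decouple the models' weaknesses from specific tasks and to focus on the embeddings themselves and how they are learned. In this paper, we take up this research opportunity: based on theoretical linguistic insights, we explore whether the semantic constraints of function words are learned and how the surrounding context affects their embeddings. We create suitable datasets, provide new insights into the inner workings of LMs vis-à-vis function words, and implement an assisting visual web interface for qualitative analysis.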
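Idioms are unlike most phrases in two important ways. First, the words in an idiom have non-canonical meanings. Second, the non-canonical meanings of words in an idiom are contingent on the presence of other words in the idiom. Linguistic theories differ on whether these properties depend on one another, as well as on whether special theoretical machinery is needed to accommodate idioms. We define two measures that correspond to the properties above, and we implement them using BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). We show that idioms fall at the expected intersection of the two dimensions, but that the dimensions themselves are not correlated. Our results suggest that special machinery to handle idioms may not be warranted.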
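While pre-trained language models (LMs) have brought great improvements in many NLP tasks, there is increasing attention on exploring the capabilities of LMs and interpreting their predictions. However, existing work usually focuses only on a specific capability in some downstream tasks. There is a lack of datasets for directly evaluating the masked word prediction performance and the interpretability of pre-trained LMs. To fill this gap, we propose a novel evaluation benchmark that provides annotated data in both English and Chinese. It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning, and computation. In addition, it provides carefully annotated token-level rationales that satisfy sufficiency and compactness. It contains perturbed instances for each original instance, so that rationale consistency under perturbation can be used as the metric for faithfulness, one perspective on interpretability. We conduct experiments on several widely used pre-trained LMs. The results show that they perform poorly on the dimensions of knowledge and computation, and that their plausibility in all dimensions is far from satisfactory, especially when the rationale is short. In addition, the pre-trained LMs we evaluated are not robust on syntax-aware data. We will release this evaluation benchmark at \url{http://xyz}, and hope it can facilitate research progress on pre-trained LMs.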
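Recursive noun phrases (NPs) have interesting semantic properties. For example, "my favorite new movie" is not necessarily "my favorite movie", whereas "my new favorite movie" is. This is common sense to humans, yet it is unknown whether pre-trained language models have such knowledge. We introduce the Recursive Noun Phrase Challenge (RNPC), a challenge set targeting the understanding of recursive NPs. When evaluated on our dataset, state-of-the-art Transformer models achieve only around chance performance. Still, we show that such knowledge is learnable with appropriate data. We further probe the models for relevant linguistic features that can be learned from our tasks, including modifier semantic category and modifier scope. Finally, models trained on RNPC achieve strong zero-shot performance on an extrinsic harm detection task, showing the usefulness of understanding recursive NPs in downstream applications. All code and data will be released at https://github.com/veronica320/recursive-nps.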