Syntax is a latent hierarchical structure which underpins the robust and compositional nature of human language. An active line of inquiry is whether large pretrained language models (LLMs) are able to acquire syntax by training on text alone; understanding a model's syntactic capabilities is essential to understanding how it processes and makes use of language. In this paper, we propose a new method, SSUD, which allows for the induction of syntactic structures without supervision from gold-standard parses. Instead, we seek to define formalism-agnostic, model-intrinsic syntactic parses by using a property of syntactic relations: syntactic substitutability. We demonstrate both quantitative and qualitative gains on dependency parsing tasks using SSUD, and induce syntactic structures which we hope provide clarity into LLMs and linguistic representations, alike.
translated by 谷歌翻译
Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention. Code will be released at https://github.com/clarkkev/attention-analysis. We use the English base-sized model.
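The idea of scoring an attention head against syntactic relations can be illustrated with a small sketch. It assumes, hypothetically, that a head "predicts" each word's syntactic head as the token it attends to most, and that gold heads are given as indices; the function name, the toy weights, and the use of -1 for skipped words are illustrative choices, not taken from the paper.

```python
def head_accuracy(attention, gold_heads):
    """Fraction of words whose most-attended token is their gold syntactic head.

    attention:  list of n rows; row i holds word i's attention over all n words.
    gold_heads: length-n list; gold_heads[i] is the index of word i's head,
                or -1 for words that should be skipped (e.g. the root).
    """
    scored = [i for i, head in enumerate(gold_heads) if head >= 0]
    if not scored:
        return 0.0
    correct = 0
    for i in scored:
        row = attention[i]
        # The head's "prediction" is the position with the largest weight.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == gold_heads[i])
    return correct / len(scored)


# Toy sentence "the dog barks" with made-up attention weights (assumption).
toy_attention = [
    [0.1, 0.8, 0.1],  # "the" attends mostly to "dog"
    [0.2, 0.1, 0.7],  # "dog" attends mostly to "barks"
    [0.5, 0.3, 0.2],  # "barks" is the root and is skipped
]
toy_heads = [1, 2, -1]
print(head_accuracy(toy_attention, toy_heads))
```

In practice the attention rows would come from a specific layer and head of a pre-trained model, and the gold heads from a treebank.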
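In this work, we try to build a connection between the two schools of thought by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation gives deep learning models a way to build latent hierarchical representations from sequential inputs, such that a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowledge of the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators according to their syntactic structure. The dependency inductive bias, on the other hand, encourages models to find the latent relations between entities in the input sequence. For natural language, latent relations are usually modeled as a directed dependency graph, in which a word has exactly one parent node and zero or more child nodes. After applying this constraint to a Transformer-like model, we find that the model is able to induce directed graphs that are close to human expert annotations, and that it also outperforms the standard Transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models.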
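The dependency constraint mentioned above (each word has exactly one parent, with zero or more children) can be made concrete with a small validity check. This is a minimal sketch under assumed conventions: the structure is given as a hypothetical `heads` list where `heads[i]` is the parent index of word `i` and -1 marks the root. It is not the model's own mechanism for enforcing the constraint.

```python
def is_dependency_tree(heads):
    """Check that a parent-index list encodes a single-rooted dependency tree.

    heads[i] is the index of word i's parent, or -1 if word i is the root.
    Storing one entry per word already guarantees "exactly one parent";
    what remains to verify is a single root, valid indices, and no cycles.
    """
    n = len(heads)
    if n == 0:
        return False
    if sum(1 for h in heads if h == -1) != 1:
        return False
    if any(h != -1 and not (0 <= h < n) for h in heads):
        return False
    for start in range(n):
        seen = set()
        node = start
        # Follow parent links; every word must reach the root without looping.
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


print(is_dependency_tree([1, -1, 1]))  # "dog" is the root with two children
print(is_dependency_tree([1, 0, -1]))  # words 0 and 1 point at each other
```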
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Long-distance agreement, which provides evidence for syntactic structure, is increasingly used to assess the syntactic generalization of neural language models. Much work has shown that transformers are capable of high accuracy in varied agreement tasks, but the mechanisms by which the models accomplish this behavior are still not well understood. To better understand transformers' internal workings, this work contrasts how they handle two superficially similar but theoretically distinct agreement phenomena: subject-verb and object-past participle agreement in French. Using probing and counterfactual analysis methods, our experiments show that i) the agreement task suffers from several confounders which partially call into question the conclusions drawn so far and ii) transformers handle subject-verb and object-past participle agreement in a way that is consistent with their modeling in theoretical linguistics.
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability, whereby it enables effective zero-shot cross-lingual transfer of syntactic knowledge. The transfer is more successful between some languages, but it is not well understood what leads to this variation and whether it fairly reflects difference between languages. In this work, we investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages. We demonstrate that the distance between the distributions of different languages is highly consistent with the syntactic difference in terms of linguistic formalisms. Such difference learnt via self-supervision plays a crucial role in the zero-shot transfer performance and can be predicted by variation in morphosyntactic properties between languages. These results suggest that mBERT properly encodes languages in a way consistent with linguistic diversity and provide insights into the mechanism of cross-lingual transfer.
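Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, leaderboard performance is usually achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. On the one hand, language models are useful to linguists by providing an objective tool for measuring semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modeling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of the thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies. Finally, in the third part of the thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.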
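In this study, we propose a morpheme-based scheme for Korean dependency parsing and adapt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation for, and the necessity of, adopting the morpheme-based format, and develop scripts that convert automatically between the original format used by Universal Dependencies and the proposed morpheme-based format. The effectiveness of the proposed format for Korean dependency parsing is then demonstrated with both statistical and neural models, including UDPipe and Stanza, using our carefully constructed morpheme-based word embeddings for Korean. MorphUD outperforms the parsing results for all Korean UD treebanks, and we also present a detailed error analysis.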
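Data-hungry deep neural networks have established themselves as the standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a wealth of text augmentation techniques, the field still lacks a systematic performance analysis across a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methods that perform changes at the syntax level (e.g., cropping sub-sentences), the token level (e.g., random word insertion), and the character level (e.g., character swapping). We systematically compare them on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families, using various models including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find that the techniques we experimented with are generally effective on morphologically rich languages rather than on analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss how the results depend most heavily on the task, the language pair, and the model type.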
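Two of the augmentation categories named above, character swapping and random word insertion, can be sketched in a few lines. This is an illustrative sketch only: the function names, the adjacent-character swap, and the toy vocabulary are assumptions, and the paper's actual augmenters may differ in detail. For sequence tagging, an inserted token would also need a label, which this sketch leaves out.

```python
import random


def swap_characters(word, rng):
    """Character-level change: swap two adjacent characters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def insert_random_word(tokens, vocabulary, rng):
    """Token-level change: insert one word from `vocabulary` at a random position."""
    position = rng.randrange(len(tokens) + 1)
    return tokens[:position] + [rng.choice(vocabulary)] + tokens[position:]


rng = random.Random(0)  # fixed seed so repeated runs give the same augmentations
sentence = ["the", "parser", "reads", "text"]
print([swap_characters(token, rng) for token in sentence])
print(insert_random_word(sentence, ["quickly", "often"], rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which matters when comparing techniques across runs.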
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
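It is generally believed that language models are able to encode syntax [Tenney et al., 2019; Jawahar et al., 2019; Hewitt and Manning, 2019]. In this paper, we propose UPOA, an unsupervised constituency parsing model that calculates an out association score, based solely on the self-attention weight matrix learned in a pretrained language model, as the syntactic distance for span segmentation. We further propose an enhanced version, UPIO, which exploits both inside association and outside association scores to estimate the likelihood of a span. Experiments with UPOA and UPIO reveal that the linear projection matrices for the query and key in the self-attention mechanism play an important role in parsing. We therefore extend the unsupervised models to few-shot parsing models (FPOA, FPIO) that use a few annotated trees to learn better linear projection matrices for parsing. Experiments on the Penn Treebank show that our unsupervised parsing model UPIO achieves competitive results on short sentences (length <= 10). Our few-shot parsing model FPIO, trained with only 20 annotated trees, outperforms a previous few-shot parsing method trained with 50 annotated trees. Cross-lingual parsing experiments show that both the unsupervised and the few-shot parsing methods are better than previous methods on most languages of SPMRL [Seddah et al., 2013].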
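Using a syntactic distance for span segmentation can be illustrated with the common top-down scheme of splitting a span at its largest distance. The sketch below assumes, hypothetically, one distance value between each pair of adjacent words. It shows the general technique only, not the UPOA or UPIO scoring, and the distance values are made up.

```python
def distances_to_tree(words, distances):
    """Build a binary tree by recursively splitting at the largest distance.

    words:     list of n tokens.
    distances: list of n - 1 values; distances[k] sits between
               words[k] and words[k + 1].
    """
    if len(words) != len(distances) + 1:
        raise ValueError("need exactly one distance per adjacent word pair")
    if len(words) == 1:
        return words[0]
    # The largest distance marks the weakest link, so the span is split there.
    split = max(range(len(distances)), key=distances.__getitem__)
    left = distances_to_tree(words[: split + 1], distances[:split])
    right = distances_to_tree(words[split + 1 :], distances[split + 1 :])
    return (left, right)


# Made-up distances: the boundary between "dog" and "barks" is the largest.
print(distances_to_tree(["the", "dog", "barks"], [1.0, 3.0]))
```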
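We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus in which we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalizations they have learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but also involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful additional tool for gaining insights into the capacities of language models, and it opens the door to future priming-based investigations that probe the model's internal states.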
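We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling. With an ensemble setup consisting of a pretrained Transformer and ground-truth graphs from one of 7 different formalisms, we find that, overall, semantic constituency structures are most useful to language modeling performance, outpacing syntactic constituency structures as well as syntactic and semantic dependency structures. Furthermore, the effects vary greatly depending on part-of-speech class. In sum, our findings point to promising tendencies in neuro-symbolic language modeling and invite future research quantifying the design choices made by different formalisms.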
We propose a transition-based approach that, by training a single model, can efficiently parse any input sentence with both constituent and dependency trees, supporting both continuous/projective and discontinuous/non-projective syntactic structures. To that end, we develop a Pointer Network architecture with two separate task-specific decoders and a common encoder, and follow a multitask learning strategy to jointly train them. The resulting quadratic system not only becomes the first parser that can jointly produce both unrestricted constituent and dependency trees from a single model, but also proves that both syntactic formalisms can benefit from each other during training, achieving state-of-the-art accuracies on several widely used benchmarks such as the continuous English and Chinese Penn Treebanks, as well as the discontinuous German NEGRA and TIGER datasets.
We propose reconstruction probing, a new analysis method for contextualized representations based on reconstruction probabilities in masked language models (MLMs). This method relies on comparing the reconstruction probabilities of tokens in a given sequence when conditioned on the representation of a single token that has been fully contextualized and when conditioned on only the decontextualized lexical prior of the model. This comparison can be understood as quantifying the contribution of contextualization towards reconstruction -- the difference in the reconstruction probabilities can only be attributed to the representational change of the single token induced by contextualization. We apply this analysis to three MLMs and find that contextualization boosts reconstructability of tokens that are close to the token being reconstructed in terms of linear and syntactic distance. Furthermore, we extend our analysis to finer-grained decomposition of contextualized representations, and we find that these boosts are largely attributable to static and positional embeddings at the input layer.
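Grammatical error correction (GEC) is the task of detecting and correcting grammatical errors in sentences. Recently, neural machine translation systems have become popular approaches for this task. However, these methods lack the use of syntactic knowledge, which plays an important role in the correction of grammatical errors. In this work, we propose a syntax-guided GEC model (SG-GEC) which adopts a graph attention mechanism to utilize the syntactic knowledge of dependency trees. Considering that the dependency trees of grammatically incorrect source sentences may provide incorrect syntactic knowledge, we propose a dependency tree correction task to deal with it. Combined with data augmentation methods, our model achieves strong performance without using any large pre-trained models. We evaluate our model on public benchmarks of the GEC task, and it achieves competitive results.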
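This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that these two variants have fundamental differences that cannot be attributed solely to pronunciation discrepancies. Given that informal Persian exhibits particular characteristics, any computational model trained on formal Persian is unlikely to transfer well to informal Persian, which necessitates the creation of dedicated treebanks for this variety. We therefore detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e., the development set of our informal treebank. Our results show that the parsers experience a substantial performance drop when we move across the two domains, as they face more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most represent the unique properties of the informal variant. The ultimate goal of this study, which has broader implications, is to provide a stepping stone towards revealing the significance of informal variants of languages, which have been widely overlooked in natural language processing tools across languages.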
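Both humans and neural language models are able to perform subject-verb number agreement (SVA). In principle, semantics should not interfere with this task, which requires only syntactic knowledge. In this work, we test whether meaning interferes with this type of agreement in English in syntactic structures of various complexities. To do so, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base to that of humans, obtained through a psycholinguistic online crowdsourcing experiment. We find that BERT and humans are both sensitive to our semantic manipulation: they fail more often when presented with nonsensical items, especially when the syntactic structure features an attractor (a noun phrase between the subject and the verb that does not have the same number as the subject). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing the higher lexical sensitivity of the former on this task.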
Open Information Extraction (OpenIE) aims to extract relational tuples from open-domain sentences. Traditional rule-based or statistical models have been developed based on syntactic structures of sentences, identified by syntactic parsers. However, previous neural OpenIE models under-explore the useful syntactic information. In this paper, we model both constituency and dependency trees into word-level graphs, and enable neural OpenIE to learn from the syntactic structures. To better fuse heterogeneous information from both graphs, we adopt multi-view learning to capture multiple relationships from them. Finally, the finetuned constituency and dependency representations are aggregated with sentential semantic representations for tuple generation. Experiments show that both constituency and dependency information, and the multi-view learning are effective.
We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are implemented through a special attention mask and deterministic transformation of the linearized tree. We find that TGs outperform various strong baselines on sentence-level language modeling perplexity, as well as on multiple syntax-sensitive language modeling evaluation metrics. Additionally, we find that the recursive syntactic composition bottleneck which represents each sentence as a single vector harms perplexity on document-level language modeling, providing evidence that a different kind of memory mechanism -- one that is independent of composed syntactic representations -- plays an important role in current successful models of long text.