The increasingly widespread adoption of large language models has highlighted the need for improving their explainability. We present context length probing, a novel explanation technique for causal language models, based on tracking the predictions of a model as a function of the length of available context, and allowing to assign differential importance scores to different contexts. The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities. We apply context length probing to large pre-trained language models and offer some initial analyses and insights, including the potential for studying long-range dependencies. The source code and a demo of the method are available.
translated by 谷歌翻译
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models. 1
translated by 谷歌翻译
我们研究了原则上的程度,原则上,语言图表表示可以补充和改进神经语言建模。通过一个由7种不同的形式主义之一的预磨削变压器和地面真相图组成的集合设置,我们发现,总体而言,语义构成结构对语言建模性能最有用 - 超越句法选区结构以及句法和语义依赖结构。此外,效果取决于语音级别的级别大大变化。总而言之,我们的调查结果指出了神经象征性语言建模的有希望的趋势,并邀请未来的研究量化不同形式主义所制作的设计选择。
translated by 谷歌翻译
尽管变压器语言模型(LMS)是信息提取的最新技术,但长文本引入了需要次优的预处理步骤或替代模型体系结构的计算挑战。稀疏注意的LMS可以代表更长的序列,克服性能障碍。但是,目前尚不清楚如何解释这些模型的预测,因为并非所有令牌都在自我发项层中相互参加,而在运行时,长序列对可解释性算法提出了计算挑战,而当运行时取决于文档长度。这些挑战在文档可能很长的医学环境中是严重的,机器学习(ML)模型必须是审核和值得信赖的。我们介绍了一种新颖的蒙版抽样程序(MSP),以识别有助于预测的文本块,将MSP应用于预测医学文本诊断的背景下,并通过两位临床医生的盲目审查来验证我们的方法。我们的方法比以前的最先进的临床信息块高约1.7倍,速度更快100倍,并且可用于生成重要的短语对。 MSP特别适合长LMS,但可以应用于任何文本分类器。我们提供了MSP的一般实施。
translated by 谷歌翻译
当前的语言模型可以产生高质量的文本。他们只是复制他们之前看到的文本,或者他们学习了普遍的语言抽象吗?要取笑这些可能性,我们介绍了乌鸦,这是一套评估生成文本的新颖性,专注于顺序结构(n-gram)和句法结构。我们将这些分析应用于四种神经语言模型(LSTM,变压器,变换器-XL和GPT-2)。对于本地结构 - 例如,单个依赖性 - 模型生成的文本比来自每个模型的测试集的人类生成文本的基线显着不那么新颖。对于大规模结构 - 例如,总句结构 - 模型生成的文本与人生成的基线一样新颖甚至更新颖,但模型仍然有时复制,在某些情况下,在训练集中重复超过1000字超过1,000字的通道。我们还表现了广泛的手动分析,表明GPT-2的新文本通常在形态学和语法中形成良好,但具有合理的语义问题(例如,是自相矛盾)。
translated by 谷歌翻译
随着越来越多的可用文本数据,能够自动分析,分类和摘要这些数据的算法的开发已成为必需品。在本研究中,我们提出了一种用于关键字识别的新颖算法,即表示给定文档的关键方面的一个或多字短语的提取,称为基于变压器的神经标记器,用于关键字识别(TNT-KID)。通过将变压器架构适用于手头的特定任务并利用域特定语料库上的预先磨损的语言模型,该模型能够通过提供竞争和强大的方式克服监督和无监督的最先进方法的缺陷在各种不同的数据集中的性能,同时仅需要最佳执行系统所需的手动标记的数据。本研究还提供了彻底的错误分析,具有对模型内部运作的有价值的见解和一种消融研究,测量关键字识别工作流程的特定组分对整体性能的影响。
translated by 谷歌翻译
测序技术容易出错,对下游应用程序进行纠错(EC)。需要手动配置EC工具以获得最佳性能。我们发现最佳参数(例如,k-mer大小)是依赖于工具和数据集。此外,评估给定工具的性能(即,对准速率或增益)通常依赖于参考基因组,但是质量参考基因组并不总是可用的。我们介绍了基于K-MEC的自动配置的Lerna。 Lerna首先创建未校正的基因组读取的语言模型(LM);然后,计算困惑度量以评估不同参数选择的校正读取。接下来,在不使用参考基因​​组的情况下发现产生最高对准率的那个。我们的方法的基本直觉是困惑度量与纠错后的组件的质量与组件的质量相反。结果:首先,我们表明,即使对于相同的EC工具,不同的数据集也可以对不同的数据集格变化。其次,我们使用其组件基于关注的变压器显示了我们的LM的收益。我们展示了误差校正前后困惑度量的模型的估计。校正后的困惑越低,k-mer大小越好。我们还表明,用于校正读取的对准率和组装质量与困惑强烈地呈负相关,从而实现了k-mer值的自动选择以获得更好的纠错,因此改善的组装质量。此外,我们表明我们的注意力模型对于整个管道的重大运行时间改善 - 由于并行化注意机制和JIT编译对GPU推理的使用JIT编译,因此整个管道的运行时间更快。
translated by 谷歌翻译
We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are implemented through a special attention mask and deterministic transformation of the linearized tree. We find that TGs outperform various strong baselines on sentence-level language modeling perplexity, as well as on multiple syntax-sensitive language modeling evaluation metrics. Additionally, we find that the recursive syntactic composition bottleneck which represents each sentence as a single vector harms perplexity on document-level language modeling, providing evidence that a different kind of memory mechanism -- one that is independent of composed syntactic representations -- plays an important role in current successful models of long text.
translated by 谷歌翻译
序列模型是现代NLP系统的关键组成部分,但它们的预测难以解释。我们考虑虽然可以解释单个模型预测的基础,但是可以解释各种模型预测的上下文的模型解释。通过解决组合优化来找到顺序律师:最佳理由是输入令牌的最小子集,这些令牌将预测与完整序列相同的输出。枚举所有子集是棘手的,因此我们提出了一种高效的贪婪算法来近似这个目标。称为贪婪合理化的算法适用于任何模型。对于这种方法有效,模型应该在对上下文的不完整子集进行预测时形成兼容的条件分布。这种情况可以用短的微调步骤强制执行。我们研究语言建模与机器翻译的贪婪合理化。与现有的基线相比,贪婪合理化是最优化组合目标的,并提供最忠实的理由。在注释的顺序理由的新数据集中,贪婪的理由与人类理由最相似。
translated by 谷歌翻译
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
translated by 谷歌翻译
在NLP社区中有一个正在进行的辩论,无论现代语言模型是否包含语言知识,通过所谓的探针恢复。在本文中,我们研究了语言知识是否是现代语言模型良好表现的必要条件,我们称之为\ Texit {重新发现假设}。首先,我们展示了语言模型,这是显着压缩的,但在预先磨普目标上表现良好,以便在语言结构探讨时保持良好的分数。这一结果支持重新发现的假设,并导致我们的论文的第二款贡献:一个信息 - 理论框架,与语言建模目标相关。该框架还提供了测量语言信息对字词预测任务的影响的度量标准。我们通过英语综合和真正的NLP任务加固我们的分析结果。
translated by 谷歌翻译
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
translated by 谷歌翻译
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce mem-ories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories--local, long-term, and external memory--at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
translated by 谷歌翻译
It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
translated by 谷歌翻译
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset -matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
translated by 谷歌翻译
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long-Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long-Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
translated by 谷歌翻译
Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. Building on recent token-level probing work, we introduce a novel edge probing task design and construct a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline. We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for syntactic phenomena, but only offer comparably small improvements on semantic tasks over a non-contextual baseline.
translated by 谷歌翻译
大型语言模型在各种任务上显示出令人印象深刻的几次结果。但是,当知识是此类结果的关键时,就像问题回答和事实检查之类的任务一样,似乎需要存储知识的大量参数计数。众所周知,检索增强模型可以在不需要多个参数的情况下在知识密集的任务上表现出色,但是目前尚不清楚它们是否在几个弹药设置中工作。在这项工作中,我们介绍了地图集,这是一个经过精心设计和预先训练的增强语言模型,能够通过很少的培训示例学习知识密集型任务。我们对包括MMLU,苏格兰短裙和归类等各种任务进行评估,并研究文档索引内容的影响,表明它可以很容易地进行更新。值得注意的是,在自然问题上仅使用64个示例在自然问题上达到超过42 \%的准确性,尽管参数少了50倍,但比540B参数模型的表现优于540b参数模型。
translated by 谷歌翻译
基于变压器的语言模型最近在许多自然语言任务中取得了显着的结果。但是,通常通过利用大量培训数据来实现排行榜的性能,并且很少通过将明确的语言知识编码为神经模型。这使许多人质疑语言学对现代自然语言处理的相关性。在本文中,我介绍了几个案例研究,以说明理论语言学和神经语言模型仍然相互关联。首先,语言模型通过提供一个客观的工具来测量语义距离,这对语言学家很有用,语义距离很难使用传统方法。另一方面,语言理论通过提供框架和数据源来探究我们的语言模型,以了解语言理解的特定方面,从而有助于语言建模研究。本论文贡献了三项研究,探讨了语言模型中语法 - 听觉界面的不同方面。在论文的第一部分中,我将语言模型应用于单词类灵活性的问题。我将Mbert作为语义距离测量的来源,我提供了有利于将单词类灵活性分析为方向过程的证据。在论文的第二部分中,我提出了一种方法来测量语言模型中间层的惊奇方法。我的实验表明,包含形态句法异常的句子触发了语言模型早期的惊喜,而不是语义和常识异常。最后,在论文的第三部分中,我适应了一些心理语言学研究,以表明语言模型包含了论证结构结构的知识。总而言之,我的论文在自然语言处理,语言理论和心理语言学之间建立了新的联系,以为语言模型的解释提供新的观点。
translated by 谷歌翻译
虽然已经提出了许多背景感知神经机器转换模型在翻译中包含语境,但大多数模型在句子级别对齐的并行文档上培训结束到底。因为只有少数域(和语言对)具有此类文档级并行数据,所以我们无法在大多数域中执行准确的上下文感知转换。因此,我们通过将文档级语言模型结合到解码器中,提出了一种简单的方法将句子级转换模型转换为上下文感知模型。我们的上下文感知解码器仅在句子级并行语料库和单语演模板上构建;因此,不需要文档级并行数据。在理论上,这项工作的核心部分是使用上下文和当前句子之间的点亮互信息的语境信息的新颖表示。我们以三种语言对,英语到法语,英语到俄语,以及日语到英语,通过评估,通过评估以及对上下文意识翻译的对比测试。
translated by 谷歌翻译