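Petroni et al. (2019) demonstrated that it is possible to retrieve world facts from a pre-trained language model by expressing them as cloze-style prompts, and to interpret the model's prediction accuracy as a lower bound on the amount of factual information it encodes. Subsequent work has attempted to tighten the estimate by searching for better prompts, using a disjoint set of facts as training data. In this work, we make two complementary contributions to better understand these factual probing techniques. First, we propose OptiPrompt, a novel and efficient method which directly optimizes in continuous embedding space. We find this simple method is able to predict an additional 6.4% of facts in the LAMA benchmark. Second, we raise a more important question: can we really interpret these probing results as a lower bound? Is it possible that these prompt-search methods also learn from the training data? We find, somewhat surprisingly, that the training data used by these methods contains certain regularities of the underlying fact distribution, and that all existing prompt methods, including ours, are able to exploit them for better fact prediction. We conduct a set of control experiments to disentangle "learning" from "learning to recall", providing a more detailed picture of what different prompts can reveal about pre-trained language models.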
The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge; however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AUTOPROMPT, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AUTOPROMPT, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
Recent work has presented intriguing results examining the knowledge contained in language models (LMs) by having the LM fill in the blanks of prompts such as "Obama is a ___ by profession". These prompts are usually manually created, and quite possibly suboptimal; another prompt such as "Obama worked as a ___" may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA. Note: some models we use in this paper, e.g. BERT (Devlin et al., 2019), are bi-directional and do not directly define a probability distribution over text, which is the underlying definition of an LM. Nonetheless, we call them LMs for simplicity.
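To make the idea of combining answers from different prompts concrete, the following is a minimal sketch of one possible ensembling rule: a weighted average of per-prompt log probabilities over a shared candidate set. It is an illustration under stated assumptions, not the released LPAQA implementation. The function names, the uniform default weights, and the toy probabilities are all hypothetical, and every probability is assumed to be strictly positive.

```python
import math


def ensemble_log_probs(per_prompt_probs, weights=None):
    """Combine candidate-answer probabilities obtained from several prompts.

    per_prompt_probs: list of dicts mapping candidate answer -> probability (> 0).
    weights: optional per-prompt weights; uniform weights are used if omitted.
    Returns a dict mapping each shared candidate to its weighted log probability.
    """
    n = len(per_prompt_probs)
    if weights is None:
        weights = [1.0 / n] * n
    # Only score candidates that every prompt assigned a probability to.
    candidates = set(per_prompt_probs[0])
    for probs in per_prompt_probs[1:]:
        candidates &= set(probs)
    return {
        c: sum(w * math.log(p[c]) for w, p in zip(weights, per_prompt_probs))
        for c in candidates
    }


def predict(per_prompt_probs, weights=None):
    """Return the candidate with the highest ensembled score."""
    scores = ensemble_log_probs(per_prompt_probs, weights)
    return max(scores, key=scores.get)


# Toy numbers invented for illustration; a real system would read these
# from a language model's output distribution for each prompt.
prompt_a = {"politician": 0.30, "lawyer": 0.40, "actor": 0.30}
prompt_b = {"politician": 0.70, "lawyer": 0.20, "actor": 0.10}

single_prompt_answer = predict([prompt_a])
ensembled_answer = predict([prompt_a, prompt_b])
```

In this toy case the first prompt alone prefers a different answer than the two prompts together, which is the situation in which an ensemble can recover a fact that a single suboptimal prompt misses.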
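Language models (LMs) have proven to be useful in various downstream applications, such as summarization, translation, question answering, and text classification. LMs are becoming increasingly important tools in artificial intelligence because of the vast quantity of information they can store. In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large language model originally proposed by OpenAI in 2020, to perform the task of knowledge base construction (KBC). ProP implements a multi-step approach that combines a variety of prompting techniques to achieve this. Our results show that manual prompt curation is essential, that the LM must be encouraged to give answer sets of variable length, in particular including empty answer sets, that true/false questions are a useful device to increase the precision of suggestions generated by the LM, that the size of the LM is a crucial factor, and that a dictionary of entity aliases improves the LM score. Our evaluation study indicates that these proposed techniques can substantially enhance the quality of the final predictions: ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points. Our implementation is available at https://github.com/hemile/iswc-challenge.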
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF (better few-shot fine-tuning of language models; alternatively, language models' best friends forever), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low-resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning. Our implementation is publicly available at https://github.com/princeton-nlp/LM-BFF. The first two authors contributed equally.
Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA.
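The cloze-style querying described above can be sketched as a small evaluation loop: fill a relation template with a subject, ask a model to rank fillers for the blank, and count how often the top-ranked filler is the gold object. This is a hedged illustration only. The `[X]` and `[MASK]` placeholders, the function names, and the lookup-table "model" are assumptions made for this sketch and do not reproduce the LAMA codebase; in practice `fill_mask` would wrap a real masked language model.

```python
def precision_at_1(facts, template, fill_mask):
    """Fraction of facts whose gold object is the top-ranked filler.

    facts: list of (subject, gold_object) pairs for one relation.
    template: string with "[X]" for the subject and "[MASK]" for the object.
    fill_mask: callable mapping a query string to a ranked list of fillers.
    """
    if not facts:
        return 0.0
    hits = 0
    for subject, gold in facts:
        query = template.replace("[X]", subject)
        ranked = fill_mask(query)
        if ranked and ranked[0] == gold:
            hits += 1
    return hits / len(facts)


def toy_fill_mask(query):
    """Hypothetical stand-in for a masked LM: returns canned rankings."""
    lookup = {
        "Dante was born in [MASK].": ["Florence", "Rome"],
        "Mozart was born in [MASK].": ["Vienna", "Salzburg"],
    }
    return lookup.get(query, [])


facts = [("Dante", "Florence"), ("Mozart", "Salzburg")]
score = precision_at_1(facts, "[X] was born in [MASK].", toy_fill_mask)
```

Here the toy model ranks the gold object first for one of the two facts, so `score` is 0.5. Because the score depends on the chosen template, a different wording of the same relation could yield a different number from the same model.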
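Many contextualized word representations are now learned by intricate neural network models, such as masked neural language models (MNLMs), which consist of huge neural network structures and are trained to restore masked text. Such representations have shown superhuman performance on some reading comprehension (RC) tasks, which extract a proper answer from a context given a question. However, because of the large number of model parameters, identifying the detailed knowledge trained into MNLMs is challenging. This paper provides new insights and empirical analyses on the commonsense knowledge contained in MNLMs. First, we use a diagnostic test to evaluate whether commonsense knowledge is properly trained in MNLMs. We observe that a large proportion of commonsense knowledge is not appropriately trained in MNLMs, and that MNLMs often do not accurately understand the semantic meaning of relations. In addition, we find that MNLM-based RC models are still vulnerable to semantic variations that require commonsense knowledge. Finally, we identify the fundamental reasons why some knowledge is left untrained. We further suggest that utilizing an external commonsense knowledge repository can be an effective solution. In controlled experiments, we illustrate the possibility of overcoming the limitations of MNLM-based RC models by enriching text passages with an external commonsense knowledge repository.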
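Probing is a popular method to discern what linguistic information is contained in the representations of pre-trained language models. However, the mechanism of selecting the probe model has recently been subject to intense debate, as it is not clear whether the probes are merely extracting information or modeling the linguistic property themselves. To address this challenge, this paper introduces a novel approach to probing by formulating probing as a prompting task. We conduct experiments on five probing tasks and show that our approach is comparable to or better than diagnostic probes at extracting information, while learning much less on its own. We further combine the probing-via-prompting approach with attention head pruning to analyze where the model stores the linguistic information in its architecture. We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.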
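Large transformer-based pre-trained language models have achieved impressive performance on a variety of knowledge-intensive tasks and can capture factual knowledge in their parameters. We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements. We posit that a more efficient alternative is to provide the model with explicit access to contextually relevant structured knowledge and train it to use that knowledge. We present LM-CORE, a general framework to achieve this, which allows decoupling of the language model training from the external knowledge source and allows the latter to be updated without affecting the already trained model. Experimental results show that LM-CORE, having access to external knowledge, achieves significant and robust outperformance over state-of-the-art knowledge-enhanced language models on knowledge probing tasks; can effectively handle knowledge updates; and performs well on two downstream tasks. We also present a thorough error analysis highlighting the successes and failures of LM-CORE.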
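Symbolic knowledge graphs (KGs) have been constructed either by expensive human crowdsourcing or with domain-specific complex information extraction pipelines. Emerging large language models (LMs), such as BERT, have been shown to implicitly encode massive knowledge which can be queried with properly designed prompts. However, compared to explicit KGs, the knowledge in black-box LMs is often difficult to access or edit and lacks explainability. In this work, we aim at harvesting symbolic KGs from LMs, a new framework for automatic KG construction empowered by the flexibility and scalability of neural LMs. Compared to prior works that often rely on large human-annotated data or existing massive KGs, our approach requires only the minimal definition of relations as inputs, and hence is suitable for extracting knowledge of rich new relations not available before. The approach automatically generates diverse prompts and performs efficient knowledge search within a given LM for consistent and extensive outputs. The knowledge harvested with our approach is substantially more accurate than with previous methods, as shown in both automatic and human evaluation. As a result, we derive from diverse LMs a family of new KGs (e.g., BertNet and RoBERTaNet) that contain a richer set of commonsense relations, including complex ones (e.g., "A is capable of but not good at B"), than human-annotated KGs (e.g., ConceptNet). In addition, the resulting KGs also serve as a vehicle to interpret the respective source LMs, leading to new insights into the varying knowledge capability of different LMs.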
In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling." * Work done as a Google AI Resident.
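There is an ongoing debate in the NLP community about whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the rediscovery hypothesis. First, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives to linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with experiments on synthetic and real NLP tasks in English.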
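Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, performance on leaderboards is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists by providing an objective tool to measure semantic distance, which is difficult to do using traditional methods. On the other hand, linguistic theory contributes to language modeling research by providing frameworks and sources of data to probe our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favor of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.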
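Progress in pre-trained language models has led to impressive results on downstream tasks for natural language understanding. Recent work on probing pre-trained language models has uncovered a wide range of linguistic properties encoded in their contextualized representations. However, it is unclear whether they encode semantic knowledge that is crucial to symbolic inference methods. We propose a methodology for probing linguistic information for logical inference in pre-trained language model representations. Our probing datasets cover a list of linguistic phenomena required by major symbolic inference systems. We find that (i) pre-trained language models do encode several types of linguistic information for inference, but some types of information are only weakly encoded, and (ii) language models can effectively learn missing linguistic information through fine-tuning. Overall, our findings provide insights into which aspects of linguistic information for logical inference language models and their pre-training procedures capture. Moreover, we have demonstrated the potential of language models as semantic and background knowledge bases for supporting symbolic inference methods.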
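Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.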
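As a rough illustration of the kind of simple data statistic mentioned above, the sketch below counts how many sentences in a corpus mention both the subject and the object of a fact. It shows only the statistic, not the causal framework itself. The function name, the whitespace tokenization, the restriction to single-token entity names, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, pairs):
    """Count sentences in which both members of each (subject, object) pair occur.

    sentences: iterable of plain-text sentences.
    pairs: list of (subject, object) tuples; single-token names are assumed.
    Returns a Counter keyed by the (subject, object) tuples.
    """
    counts = Counter()
    for sentence in sentences:
        # Naive lowercase whitespace tokenization, chosen only for brevity.
        tokens = set(sentence.lower().split())
        for subj, obj in pairs:
            if subj.lower() in tokens and obj.lower() in tokens:
                counts[(subj, obj)] += 1
    return counts


# Toy corpus invented for illustration.
corpus = [
    "Paris is the capital of France",
    "France and Paris host many tourists",
    "Berlin is in Germany",
    "Paris is lovely in spring",
]
pairs = [("Paris", "France"), ("Berlin", "Germany"), ("Berlin", "France")]
counts = cooccurrence_counts(corpus, pairs)
```

A statistic like this could then be compared against a model's predictions for the same subject, for example to check whether frequently co-occurring objects are predicted more often regardless of whether they are correct.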
Recent work has shown that pre-trained language models (PLMs) store the relational knowledge learned from data and utilize it for performing downstream tasks. However, commonsense knowledge across different regions may vary. For instance, the color of the bridal dress is white in American weddings whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of the relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger multilingual PLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language to probe its knowledge; and 4) a language may better probe knowledge about a non-native country than about its native country. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA.
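Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to those of comparably sized GPT-3 models.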
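Prompt tuning has been an extremely effective tool for adapting a pre-trained model to downstream tasks. However, standard prompt-based methods mainly consider the case where sufficient data is available for the downstream tasks. It is still unclear whether the advantage can be transferred to the few-shot regime, where only limited data is available for each downstream task. Although some works have demonstrated the potential of prompt tuning under the few-shot setting, the mainstream methods of searching for discrete prompts or tuning soft prompts with limited data remain very challenging. Through extensive empirical studies, we find that there is still a gap between prompt tuning and full fine-tuning in few-shot learning. To bridge the gap, we propose a new prompt-tuning framework, called Soft Template Tuning (STT). STT combines manual and automatic prompts, and treats downstream classification tasks as a masked language modeling task. Comprehensive evaluation in different settings suggests that STT can close the gap between fine-tuning and prompt-based methods without introducing additional parameters. Notably, it can even outperform the time- and resource-consuming fine-tuning method on sentiment classification tasks.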
How can we extend a pre-trained model to many language understanding tasks, without labeled or additional unlabeled data? Pre-trained language models (PLMs) have been effective for a wide range of NLP tasks. However, existing approaches either require fine-tuning on downstream labeled datasets or manually constructing proper prompts. In this paper, we propose nonparametric prompting PLM (NPPrompt) for fully zero-shot language understanding. Unlike previous methods, NPPrompt uses only pre-trained language models and does not require any labeled data or additional raw corpus for further fine-tuning, nor does it rely on humans to construct a comprehensive set of prompt label words. We evaluate NPPrompt against previous major few-shot and zero-shot learning methods on diverse NLP tasks: including text classification, text entailment, similar text retrieval, and paraphrasing. Experimental results demonstrate that our NPPrompt outperforms the previous best fully zero-shot method by big margins, with absolute gains of 12.8% in accuracy on text classification and 18.9% on the GLUE benchmark.
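Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as "eye is to seeing what ear is to hearing", sometimes referred to as analogical proportions, shapes how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this task, using benchmarks obtained from educational settings as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and that results are highly sensitive to model architecture and hyperparameters. Overall, the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.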