估算预测预测的语言模型的不确定性对于提高NLP的可靠性是重要的。虽然许多以前的作品侧重于量化预测不确定性,但在解释不确定性时几乎没有工作。本文进一步推动了一个关于解释后校准的预训练的语言模型的不确定预测。我们适应了两种基于扰动的后宫释放方法,留出次出来和采样福利,以识别引起预测中不确定性的输入中的单词。我们以三项任务测试BERT和Roberta上提出的方法:情绪分类,自然语言推断和解释域,在域内和域外设置。实验表明,两种方法都始终捕获引起预测不确定性的输入中的单词。
translated by 谷歌翻译
Calibration strengthens the trustworthiness of black-box models by producing better accurate confidence estimates on given examples. However, little is known about if model explanations can help confidence calibration. Intuitively, humans look at important features attributions and decide whether the model is trustworthy. Similarly, the explanations can tell us when the model may or may not know. Inspired by this, we propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions. The idea is that when the model is not highly confident, it is difficult to identify strong indications of any class, and the tokens accordingly do not have high attribution scores for any class and vice versa. We conduct extensive experiments on six datasets with two popular pre-trained language models in the in-domain and out-of-domain settings. The results show that CME improves calibration performance in all settings. The expected calibration errors are further reduced when combined with temperature scaling. Our findings highlight that model explanations can help calibrate posterior estimates.
translated by 谷歌翻译
Some recent works observed the instability of post-hoc explanations when input side perturbations are applied to the model. This raises the interest and concern in the stability of post-hoc explanations. However, the remaining question is: is the instability caused by the neural network model or the post-hoc explanation method? This work explores the potential source that leads to unstable post-hoc explanations. To separate the influence from the model, we propose a simple output probability perturbation method. Compared to prior input side perturbation methods, the output probability perturbation method can circumvent the neural model's potential effect on the explanations and allow the analysis on the explanation method. We evaluate the proposed method with three widely-used post-hoc explanation methods (LIME (Ribeiro et al., 2016), Kernel Shapley (Lundberg and Lee, 2017a), and Sample Shapley (Strumbelj and Kononenko, 2010)). The results demonstrate that the post-hoc methods are stable, barely producing discrepant explanations under output probability perturbations. The observation suggests that neural network models may be the primary source of fragile explanations.
translated by 谷歌翻译
Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
translated by 谷歌翻译
变形金刚目前是自然语言理解(NLU)任务的最新技术,容易产生未校准的预测或极端概率,从而根据其输出相对困难而做出不同的决策过程。在本文中,我们建议构建几个电感Venn - 持续预测因子(IVAP),这些预测因子(IVAP)可以根据预先训练的变压器的选择在最小的假设下可以很好地校准。我们在一组不同的NLU任务上测试了它们的性能,并表明它们能够产生均匀分布在[0,1]间隔的概率预测的良好概率预测,同时均保留了原始模型的预测准确性。
translated by 谷歌翻译
估计数据集的难度通常涉及将最新模型与人类进行比较;性能差距越大,据说数据集就越难。但是,这种比较几乎没有理解给定分布中的每个实例的难度,或者什么属性使给定模型的数据集难以进行。为了解决这些问题,我们将数据集难度框架 - W.R.T.模型$ \ MATHCAL {V} $ - 由于缺乏$ \ Mathcal {V} $ - $ \ textit {usable Information} $(Xu等,2019),其中较低的值表示更困难的数据集用于$ \ mathcal {v} $。我们进一步介绍了$ \ textit {pointSise $ \ mathcal {v} $ - 信息} $(pvi),以测量单个实例的难度W.R.T.给定的分布。虽然标准评估指标通常仅比较同一数据集的不同模型,但$ \ MATHCAL {V} $ - $ \ textit {usable Information} $ and PVI也允许相反:对于给定的模型$ \ Mathcal {v} $,我们,我们,我们可以比较同一数据集的不同数据集以及不同的实例/切片。此外,我们的框架可以通过输入的转换来解释不同的输入属性,我们用来在广泛使用的NLP基准中发现注释人工制品。
translated by 谷歌翻译
通过利用仅偏置模型的输出来调整学习目标,可以有效地显示了基于组合的脱叠方法。在本文中,我们专注于这些基于集合的方法的偏见模型,这起到了重要作用,但在现有文献中没有大量关注。从理论上讲,我们证明了脱结性能可能因偏见模型的不准确性估计而受损。凭经验,我们表明现有的偏见模型在产生准确的不确定性估计方面不足。这些发现的动机,我们建议在唯一的模型上进行校准,从而实现基于三阶段的脱叠框架,包括偏置建模,模型校准和脱叠。 NLI的实验结果和事实验证任务表明,我们提出的三阶段脱叠框架始终如一地优于传统的两级,以分配的准确性。
translated by 谷歌翻译
Pre-trained language models (PLMs) achieve remarkable performance on many downstream tasks, but may fail in giving reliable estimates of their predictive uncertainty. Given the lack of a comprehensive understanding of PLMs calibration, we take a close look into this new research problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance in training. We consider six factors as control variables, including dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. In experiments, we observe a consistent change in calibration performance across six factors. We find that PLMs don't learn to become calibrated in training, evidenced by the continual increase in confidence, no matter the predictions are correct or not. We highlight that our finding presents some contradiction with two established conclusions: (a) Larger PLMs are more calibrated; (b) Pretraining improves model calibration. Next, we study the effectiveness of existing calibration methods in mitigating the overconfidence issue, in both in-distribution and various out-of-distribution settings. Besides unlearnable calibration methods, we adapt two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimations. Also, we propose extended learnable methods based on existing ones to further improve or maintain PLMs calibration without sacrificing the original task performance. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions, and our methods exhibit superior performance compared with previous methods.
translated by 谷歌翻译
许多可解释性工具使从业人员和研究人员可以解释自然语言处理系统。但是,每个工具都需要不同的配置,并提供不同形式的解释,从而阻碍了评估和比较它们的可能性。原则上的统一评估基准将指导用户解决中心问题:哪种解释方法对我的用例更可靠?我们介绍了雪貂,这是一个易于使用的,可扩展的Python库,以解释与拥抱面枢纽集成的基于变形金刚的模型。它提供了一个统一的基准测试套件来测试和比较任何文本或可解释性语料库的广泛最先进的解释器。此外,雪貂提供方便的编程摘要,以促进新的解释方法,数据集或评估指标的引入。
translated by 谷歌翻译
标记数据可以是昂贵的任务,因为它通常由域专家手动执行。对于深度学习而言,这是繁琐的,因为它取决于大型标记的数据集。主动学习(AL)是一种范式,旨在通过仅使用二手车型认为最具信息丰富的数据来减少标签努力。在文本分类设置中,在AL上完成了很少的研究,旁边没有涉及最近的最先进的自然语言处理(NLP)模型。在这里,我们介绍了一个实证研究,可以将基于不确定性的基于不确定性的算法与Bert $ _ {base} $相比,作为使用的分类器。我们评估两个NLP分类数据集的算法:斯坦福情绪树木银行和kvk-Front页面。此外,我们探讨了旨在解决不确定性的al的预定问题的启发式;即,它是不可规范的,并且易于选择异常值。此外,我们探讨了查询池大小对al的性能的影响。虽然发现,AL的拟议启发式没有提高AL的表现;我们的结果表明,使用BERT $ _ {Base} $概率使用不确定性的AL。随着查询池大小变大,性能的这种差异可以减少。
translated by 谷歌翻译
尽管对检测分配(OOD)示例的重要性一致,但就OOD示例的正式定义几乎没有共识,以及如何最好地检测到它们。我们将这些示例分类为它们是否表现出背景换档或语义移位,并发现ood检测,模型校准和密度估计(文本语言建模)的两个主要方法,对这些类型的ood数据具有不同的行为。在14对分布和ood英语自然语言理解数据集中,我们发现密度估计方法一致地在背景移位设置中展开校准方法,同时在语义移位设置中执行更糟。此外,我们发现两种方法通常都无法检测到挑战数据中的示例,突出显示当前方法的弱点。由于没有单个方法在所有设置上都效果很好,因此在评估不同的检测方法时,我们的结果呼叫了OOD示例的明确定义。
translated by 谷歌翻译
Recently it has been shown that state-of-the-art NLP models are vulnerable to adversarial attacks, where the predictions of a model can be drastically altered by slight modifications to the input (such as synonym substitutions). While several defense techniques have been proposed, and adapted, to the discrete nature of text adversarial attacks, the benefits of general-purpose regularization methods such as label smoothing for language models, have not been studied. In this paper, we study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks in both in-domain and out-of-domain settings. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
translated by 谷歌翻译
最近,与“预训练,及时和预测”的新范式相比,与“预训练,微调”范式相比,新的范式“预训练,及时和预测”取得了显着的成就。在基于及时的GPT-3成功之后,一系列基于蒙版的语言模型(MLM)(例如Bert,Roberta)及时学习方法变得流行并广泛使用。但是,另一个有效的预训练的判别模型Electra可能被忽略了。在本文中,我们尝试使用拟议的替换代替令牌检测(RTD)基于基于的及时学习方法来完成零摄像的几个NLP任务。实验结果表明,基于RTD-Prompt学习的Electra模型可达到令人惊讶的最先进的零拍性能。在数字上,与MLM-Roberta-Large和MLM-Bert-Large相比,我们的RTD-Electra-Large在所有15个任务上平均提高了约8.4%和13.7%。特别是在SST-2任务上,我们的RTD-Electra-Large在没有任何培训数据的情况下达到了令人惊讶的90.1%精度。总体而言,与预先训练的蒙版语言模型相比,预先训练的代替令牌检测模型在零拍学习中的性能更好。因此,Electra是一位出色的零球学习者。源代码可在以下网址获得:https://github.com/nishiwen1214/rtd-electra。
translated by 谷歌翻译
在本文中,我们研究了现代神经网络的事后校准,这个问题近年来引起了很多关注。已经为任务提出了许多不同复杂性的校准方法,但是关于这些任务的表达方式尚无共识。我们专注于置信度缩放的任务,特别是在概括温度缩放的事后方法上,我们将其称为自适应温度缩放家族。我们分析了改善校准并提出可解释方法的表达功能。我们表明,当有大量数据复杂模型(例如神经网络)产生更好的性能时,但是当数据量受到限制时,很容易失败,这是某些事后校准应用(例如医学诊断)的常见情况。我们研究表达方法在理想条件和设计更简单的方法下学习但对这些表现良好的功能具有强烈的感应偏见的功能。具体而言,我们提出了基于熵的温度缩放,这是一种简单的方法,可根据其熵缩放预测的置信度。结果表明,与其他方法相比,我们的方法可获得最先进的性能,并且与复杂模型不同,它对数据稀缺是可靠的。此外,我们提出的模型可以更深入地解释校准过程。
translated by 谷歌翻译
众所周知,端到端的神经NLP体系结构很难理解,这引起了近年来为解释性建模的许多努力。模型解释的基本原则是忠诚,即,解释应准确地代表模型预测背后的推理过程。这项调查首先讨论了忠诚的定义和评估及其对解释性的意义。然后,我们通过将方法分为五类来介绍忠实解释的最新进展:相似性方法,模型内部结构的分析,基于反向传播的方法,反事实干预和自我解释模型。每个类别将通过其代表性研究,优势和缺点来说明。最后,我们从它们的共同美德和局限性方面讨论了上述所有方法,并反思未来的工作方向忠实的解释性。对于有兴趣研究可解释性的研究人员,这项调查将为该领域提供可访问且全面的概述,为进一步探索提供基础。对于希望更好地了解自己的模型的用户,该调查将是一项介绍性手册,帮助选择最合适的解释方法。
translated by 谷歌翻译
对比的学习技术已广泛用于计算机视野中作为增强数据集的手段。在本文中,我们将这些对比学习嵌入的使用扩展到情绪分析任务,并证明了对这些嵌入的微调在基于BERT的嵌入物上的微调方面提供了改进,以在评估时实现更高的基准。在Dynasent DataSet上。我们还探讨了我们的微调模型在跨域基准数据集上执行的。此外,我们探索了ups采样技术,以实现更平衡的班级分发,以进一步改进我们的基准任务。
translated by 谷歌翻译
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 19 popularly used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
translated by 谷歌翻译
当前的自动驾驶汽车技术主要集中于将乘客从A点带到B。但是,已经证明乘客害怕乘坐自动驾驶汽车。减轻此问题的一种方法是允许乘客给汽车提供自然语言命令。但是,汽车可能会误解发布的命令或视觉环境,这可能导致不确定的情况。希望自动驾驶汽车检测到这些情况并与乘客互动以解决它们。本文提出了一个模型,该模型检测到命令时不确定的情况并找到引起该命令的视觉对象。可选地,包括描述不确定对象的系统生成的问题。我们认为,如果汽车可以以人类的方式解释这些物体,乘客就可以对汽车能力获得更多信心。因此,我们研究了如何(1)检测不确定的情况及其根本原因,以及(2)如何为乘客产生澄清的问题。在对Talk2CAR数据集进行评估时,我们表明所提出的模型\ acrfull {pipeline},改善\ gls {m:模棱两可 - absolute-Increse},与$ iou _ {.5} $相比,与不使用\ gls {pipeline {pipeline {pipeline { }。此外,我们设计了一个引用表达生成器(reg)\ acrfull {reg_model}量身定制的自动驾驶汽车设置,该设置可产生\ gls {m:流星伴侣} Meteor的相对改进,\ gls \ gls {m:rouge felative}}与最先进的REG模型相比,Rouge-L的速度快三倍。
translated by 谷歌翻译
Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TEXTFOOLER, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate three advantages of this framework:(1) effective-it outperforms previous attacks by success rate and perturbation rate, (2) utility-preserving-it preserves semantic content, grammaticality, and correct types classified by humans, and (3) efficient-it generates adversarial text with computational complexity linear to the text length. 1
translated by 谷歌翻译
Confidence calibration -the problem of predicting probability estimates representative of the true correctness likelihood -is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-ofthe-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -a singleparameter variant of Platt Scaling -is surprisingly effective at calibrating predictions.
translated by 谷歌翻译