通过利用仅偏置模型的输出来调整学习目标,可以有效地显示了基于组合的脱叠方法。在本文中,我们专注于这些基于集合的方法的偏见模型,这起到了重要作用,但在现有文献中没有大量关注。从理论上讲,我们证明了脱结性能可能因偏见模型的不准确性估计而受损。凭经验,我们表明现有的偏见模型在产生准确的不确定性估计方面不足。这些发现的动机,我们建议在唯一的模型上进行校准,从而实现基于三阶段的脱叠框架,包括偏置建模,模型校准和脱叠。 NLI的实验结果和事实验证任务表明,我们提出的三阶段脱叠框架始终如一地优于传统的两级,以分配的准确性。
translated by 谷歌翻译
通过推断培训数据中的潜在群体,最近的作品将不可用的注释不可用的情况引入不变性学习。通常,在大多数/少数族裔分裂下学习群体不变性在经验上被证明可以有效地改善许多数据集的分布泛化。但是,缺乏这些关于学习不变机制的理论保证。在本文中,我们揭示了在防止分类器依赖于培训集中的虚假相关性的情况下,现有小组不变学习方法的不足。具体来说,我们提出了两个关于判断这种充分性的标准。从理论和经验上讲,我们表明现有方法可以违反标准,因此未能推广出虚假的相关性转移。在此激励的情况下,我们设计了一种新的组不变学习方法,该方法构建具有统计独立性测试的组,并按组标签重新启动样本,以满足标准。关于合成数据和真实数据的实验表明,新方法在推广到虚假相关性转移方面显着优于现有的组不变学习方法。
translated by 谷歌翻译
本文介绍了分类器校准原理和实践的简介和详细概述。校准的分类器正确地量化了与其实例明智的预测相关的不确定性或信心水平。这对于关键应用,最佳决策,成本敏感的分类以及某些类型的上下文变化至关重要。校准研究具有丰富的历史,其中几十年来预测机器学习作为学术领域的诞生。然而,校准兴趣的最近增加导致了新的方法和从二进制到多种子体设置的扩展。需要考虑的选项和问题的空间很大,并导航它需要正确的概念和工具集。我们提供了主要概念和方法的介绍性材料和最新的技术细节,包括适当的评分规则和其他评估指标,可视化方法,全面陈述二进制和多字数分类的HOC校准方法,以及几个先进的话题。
translated by 谷歌翻译
具有大量偏见的数据集当前威胁要培训有关NLU任务的值得信赖的模型。尽管取得了巨大进展,但当前的偏见方法却过分依赖偏见属性的知识。但是,属性的​​定义是难以捉摸的,并且在不同的数据集上有所不同。此外,利用输入级别的这些属性到偏置缓解可能会留下内在属性与基本决策规则之间的差距。为了缩小这一差距并解放有关偏见的监督,我们建议将缓解偏见扩展到特征空间。因此,开发了一个新型模型,即恢复具有无知识(风险)的预期功能子空间。假设由各种偏见引起的快捷键特征是为了预测而无意的,则风险将其视为冗余特征。当研究较低的歧管以去除冗余时,风险表明,具有预期功能的极低维度子空间可以牢固地表示高度偏见的数据集。经验结果表明,我们的模型可以始终如一地提高模型的概括到分布式集合,并实现新的最新性能。
translated by 谷歌翻译
神经网络通常使预测依赖于数据集的虚假相关性,而不是感兴趣的任务的内在特性,面对分布外(OOD)测试数据的急剧下降。现有的De-Bias学习框架尝试通过偏置注释捕获特定的DataSet偏差,它们无法处理复杂的“ood方案”。其他人在低能力偏置模型或损失上隐含地识别数据集偏置,但在训练和测试数据来自相同分布时,它们会降低。在本文中,我们提出了一般的贪婪去偏见学习框架(GGD),它贪婪地训练偏置模型和基础模型,如功能空间中的梯度下降。它鼓励基础模型专注于用偏置模型难以解决的示例,从而仍然在测试阶段中的杂散相关性稳健。 GGD在很大程度上提高了各种任务的模型的泛化能力,但有时会过度估计偏置水平并降低在分配测试。我们进一步重新分析了GGD的集合过程,并将课程正规化为由课程学习启发的GGD,这取得了良好的分配和分发性能之间的权衡。对图像分类的广泛实验,对抗问题应答和视觉问题应答展示了我们方法的有效性。 GGD可以在特定于特定于任务的偏置模型的设置下学习更强大的基础模型,其中具有现有知识和自组合偏置模型而无需先验知识。
translated by 谷歌翻译
Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.
translated by 谷歌翻译
Calibration strengthens the trustworthiness of black-box models by producing better accurate confidence estimates on given examples. However, little is known about if model explanations can help confidence calibration. Intuitively, humans look at important features attributions and decide whether the model is trustworthy. Similarly, the explanations can tell us when the model may or may not know. Inspired by this, we propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions. The idea is that when the model is not highly confident, it is difficult to identify strong indications of any class, and the tokens accordingly do not have high attribution scores for any class and vice versa. We conduct extensive experiments on six datasets with two popular pre-trained language models in the in-domain and out-of-domain settings. The results show that CME improves calibration performance in all settings. The expected calibration errors are further reduced when combined with temperature scaling. Our findings highlight that model explanations can help calibrate posterior estimates.
translated by 谷歌翻译
深度神经网络具有令人印象深刻的性能,但是他们无法可靠地估计其预测信心,从而限制了其在高风险领域中的适用性。我们表明,应用多标签的一VS损失揭示了分类的歧义并降低了模型的过度自信。引入的Slova(单标签One-Vs-All)模型重新定义了单个标签情况的典型单VS-ALL预测概率,其中只有一个类是正确的答案。仅当单个类具有很高的概率并且其他概率可忽略不计时,提议的分类器才有信心。与典型的SoftMax函数不同,如果所有其他类的概率都很小,Slova自然会检测到分布的样本。该模型还通过指数校准进行了微调,这使我们能够与模型精度准确地对齐置信分数。我们在三个任务上验证我们的方法。首先,我们证明了斯洛伐克与最先进的分布校准具有竞争力。其次,在数据集偏移下,斯洛伐克的性能很强。最后,我们的方法在检测到分布样品的检测方面表现出色。因此,斯洛伐克是一种工具,可以在需要不确定性建模的各种应用中使用。
translated by 谷歌翻译
在本文中,我们研究了现代神经网络的事后校准,这个问题近年来引起了很多关注。已经为任务提出了许多不同复杂性的校准方法,但是关于这些任务的表达方式尚无共识。我们专注于置信度缩放的任务,特别是在概括温度缩放的事后方法上,我们将其称为自适应温度缩放家族。我们分析了改善校准并提出可解释方法的表达功能。我们表明,当有大量数据复杂模型(例如神经网络)产生更好的性能时,但是当数据量受到限制时,很容易失败,这是某些事后校准应用(例如医学诊断)的常见情况。我们研究表达方法在理想条件和设计更简单的方法下学习但对这些表现良好的功能具有强烈的感应偏见的功能。具体而言,我们提出了基于熵的温度缩放,这是一种简单的方法,可根据其熵缩放预测的置信度。结果表明,与其他方法相比,我们的方法可获得最先进的性能,并且与复杂模型不同,它对数据稀缺是可靠的。此外,我们提出的模型可以更深入地解释校准过程。
translated by 谷歌翻译
The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based calibration and metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant variable-based calibration error as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing recalibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based recalibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.
translated by 谷歌翻译
在在下游决策取决于预测概率的安全关键应用中,校准神经网络是最重要的。测量校准误差相当于比较两个实证分布。在这项工作中,我们引入了由经典Kolmogorov-Smirnov(KS)统计测试的自由校准措施,其中主要思想是比较各自的累积概率分布。由此,通过通过Quidsime使用可微分函数来近似经验累积分布,我们获得重新校准函数,将网络输出映射到实际(校准的)类分配概率。使用停滞校准组进行脊柱拟合,并在看不见的测试集上评估所获得的重新校准功能。我们测试了我们对各种图像分类数据集的现有校准方法的方法,并且我们的样条键的重新校准方法始终如一地优于KS错误的现有方法以及其他常用的校准措施。我们的代码可在https://github.com/kartikgupta-at-anu/spline-calibration获得。
translated by 谷歌翻译
现实世界数据普遍面对严重的类别 - 不平衡问题,并且展示了长尾分布,即,大多数标签与有限的情况有关。由此类数据集监督的NA \“IVE模型更愿意占主导地位标签,遇到严重的普遍化挑战并变得不佳。我们从先前的角度提出了两种新的方法,以减轻这种困境。首先,我们推导了一个以平衡为导向的数据增强命名均匀的混合物(Unimix)促进长尾情景中的混合,采用先进的混合因子和采样器,支持少数民族。第二,受贝叶斯理论的动机,我们弄清了贝叶斯偏见(北美),是由此引起的固有偏见先前的不一致,并将其补偿为对标准交叉熵损失的修改。我们进一步证明了所提出的方法理论上和经验地确保分类校准。广泛的实验验证我们的策略是否有助于更好校准的模型,以及他们的策略组合在CIFAR-LT,ImageNet-LT和Inattations 2018上实现最先进的性能。
translated by 谷歌翻译
神经网络校准是深度学习的重要任务,以确保模型预测的信心与真正的正确性可能性之间的一致性。在本文中,我们提出了一种称为Neural夹紧的新的后处理校准方法,该方法通过可学习的通用输入扰动和输出温度扩展参数在预训练的分类器上采用简单的联合输入输出转换。此外,我们提供了理论上的解释,说明为什么神经夹具比温度缩放更好。在CIFAR-100和Imagenet图像识别数据集以及各种深神经网络模型上进行了评估,我们的经验结果表明,神经夹具明显优于最先进的后处理校准方法。
translated by 谷歌翻译
由于模型可信度对于敏感的现实世界应用至关重要,因此从业者越来越重视改善深神经网络的不确定性校准。校准误差旨在量化概率预测的可靠性,但其估计器通常是偏见且不一致的。在这项工作中,我们介绍了适当的校准误差的框架,该校准误差将每个校准误差与适当的分数联系起来,并提供具有最佳估计属性的相应上限。这种关系可用于可靠地量化模型校准改进。与我们的方法相比,我们从理论上和经验上证明了常用估计量的缺点。由于适当的分数的广泛适用性,这可以自然地扩展到分类之外的重新校准。
translated by 谷歌翻译
Confidence calibration -the problem of predicting probability estimates representative of the true correctness likelihood -is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-ofthe-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -a singleparameter variant of Platt Scaling -is surprisingly effective at calibrating predictions.
translated by 谷歌翻译
Several recent works find empirically that the average test error of deep neural networks can be estimated via the prediction disagreement of models, which does not require labels. In particular, Jiang et al. (2022) show for the disagreement between two separately trained networks that this `Generalization Disagreement Equality' follows from the well-calibrated nature of deep ensembles under the notion of a proposed `class-aggregated calibration.' In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis space view employed by Jiang et al. (2022).
translated by 谷歌翻译
学习推迟(L2D)框架有可能使AI系统更安全。对于给定的输入,如果人类比模型更有可能采取正确的行动,则系统可以将决定推迟给人类。我们研究L2D系统的校准,研究它们输出的概率是否合理。我们发现Mozannar&Sontag(2020)多类框架没有针对专家正确性进行校准。此外,由于其参数化是为此目的而退化的,因此甚至不能保证产生有效的概率。我们提出了一个基于单VS-ALL分类器的L2D系统,该系统能够产生专家正确性的校准概率。此外,我们的损失功能也是多类L2D的一致替代,例如Mozannar&Sontag(2020)。我们的实验验证了我们的系统校准不仅是我们的系统校准,而且这种好处无需准确。我们的模型的准确性始终可与Mozannar&Sontag(2020)模型的模型相当(通常是优越),从仇恨言语检测到星系分类到诊断皮肤病变的任务。
translated by 谷歌翻译
尽管图形神经网络(GNNS)已经取得了显着的准确性,但结果是否值得信赖仍未开发。以前的研究表明,许多现代神经网络对预测过度充满信心,然而,令人惊讶的是,我们发现GNN主要呈相反方向,即,GNN是不受自信的。因此,非常需要GNN的置信度校准。在本文中,我们通过设计拓扑知识的后HOC校准函数提出了一种新型值得信赖的GNN模型。具体而言,我们首先验证图形中的置信度分布具有同眼性的财产,而且这一发现激发了我们设计校准GNN模型(CAGCN)以学习校准功能。 CAGCN能够从GNN的Logits对每个节点的校准置信度获得独特的变换,同时,这种变换能够在类之间保留课程之间的顺序,满足精度保留的属性。此外,我们将校准GNN应用于自培训框架,表明可以通过校准的置信度获得更可靠的伪标签,并进一步提高性能。广泛的实验证明了我们所提出的模型在校准和准确性方面的有效性。
translated by 谷歌翻译
How do we know when the predictions made by a classifier can be trusted? This is a fundamental problem that also has immense practical applicability, especially in safety-critical areas such as medicine and autonomous driving. The de facto approach of using the classifier's softmax outputs as a proxy for trustworthiness suffers from the over-confidence issue; while the most recent works incur problems such as additional retraining cost and accuracy versus trustworthiness trade-off. In this work, we argue that the trustworthiness of a classifier's prediction for a sample is highly associated with two factors: the sample's neighborhood information and the classifier's output. To combine the best of both worlds, we design a model-agnostic post-hoc approach NeighborAgg to leverage the two essential information via an adaptive neighborhood aggregation. Theoretically, we show that NeighborAgg is a generalized version of a one-hop graph convolutional network, inheriting the powerful modeling ability to capture the varying similarity between samples within each class. We also extend our approach to the closely related task of mislabel detection and provide a theoretical coverage guarantee to bound the false negative. Empirically, extensive experiments on image and tabular benchmarks verify our theory and suggest that NeighborAgg outperforms other methods, achieving state-of-the-art trustworthiness performance.
translated by 谷歌翻译
In this paper, we investigate the problem of predictive confidence in face and kinship verification. Most existing face and kinship verification methods focus on accuracy performance while ignoring confidence estimation for their prediction results. However, confidence estimation is essential for modeling reliability in such high-risk tasks. To address this issue, we first introduce a novel yet simple confidence measure for face and kinship verification, which allows the verification models to transform the similarity score into a confidence score for a given face pair. We further propose a confidence-calibrated approach called angular scaling calibration (ASC). ASC is easy to implement and can be directly applied to existing face and kinship verification models without model modifications, yielding accuracy-preserving and confidence-calibrated probabilistic verification models. To the best of our knowledge, our approach is the first general confidence-calibrated solution to face and kinship verification in a modern context. We conduct extensive experiments on four widely used face and kinship verification datasets, and the results demonstrate the effectiveness of our approach.
translated by 谷歌翻译