As machine learning techniques are increasingly adopted in new domains, particularly in safety-critical systems such as autonomous vehicles, it is crucial to provide accurate estimates of output uncertainty. Accordingly, many methods have been proposed to calibrate neural networks so that they accurately estimate the likelihood of misclassification. However, while these methods achieve low calibration error, there is room for further improvement, especially in high-dimensional settings such as ImageNet. In this paper, we introduce a calibration algorithm, named Hoki, that works by applying random transformations to the neural network logits. We provide a sufficient condition for calibration based on the number of label prediction changes observed after applying the transformations. We perform experiments on multiple datasets and show that the proposed approach generally outperforms state-of-the-art calibration algorithms across multiple datasets and models, especially on the challenging ImageNet dataset. Finally, Hoki is also scalable, as it requires an execution time comparable to that of temperature scaling.
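The transformation idea can be illustrated with a toy sketch: perturb the logits with random noise and record how often the predicted label survives. This is only an illustration of the general mechanism, not Hoki's actual algorithm or its sufficient condition; the Gaussian perturbation and the `transformation_stability` helper are assumptions for this sketch.

```python
import numpy as np

def transformation_stability(logits, n_transforms=100, scale=1.0, seed=0):
    """Fraction of random logit perturbations that leave the predicted label unchanged."""
    rng = np.random.default_rng(seed)
    base = logits.argmax(axis=1)                    # original predictions
    stable = np.zeros(len(logits))
    for _ in range(n_transforms):
        noisy = logits + rng.normal(scale=scale, size=logits.shape)
        stable += (noisy.argmax(axis=1) == base)
    return stable / n_transforms                    # high value = robust prediction

# a confident (large-margin) sample vs. an ambiguous (small-margin) one
logits = np.array([[5.0, 0.0, 0.0],
                   [1.0, 0.9, 0.8]])
stability = transformation_stability(logits)
```

Intuitively, predictions whose labels flip often under such transformations deserve lower confidence.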
Neural network calibration is an important task in deep learning, ensuring consistency between a model's predicted confidence and the true likelihood of correctness. In this paper, we propose a new post-processing calibration method called Neural Clamping, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide a theoretical explanation of why Neural Clamping is better than temperature scaling. Evaluated on the CIFAR-100 and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods.
In this paper, we study the post-hoc calibration of modern neural networks, a problem that has drawn a lot of attention in recent years. Many calibration methods of varying complexity have been proposed for the task, but there is no consensus on how expressive they should be. We focus on the task of confidence scaling, and specifically on post-hoc methods that generalize temperature scaling, which we refer to as the adaptive temperature scaling family. We analyze expressive functions that improve calibration and propose interpretable methods. We show that when there is plenty of data, complex models such as neural networks yield better performance, but they easily fail when the amount of data is limited, a common situation for certain post-hoc calibration applications such as medical diagnosis. We study the functions that expressive methods learn under ideal conditions and design simpler methods with strong inductive biases towards these well-performing functions. Concretely, we propose entropy-based temperature scaling, a simple method that scales the confidence of a prediction according to its entropy. Results show that our method obtains state-of-the-art performance compared to other approaches and, unlike complex models, is robust to data scarcity. Moreover, our proposed model enables a deeper interpretation of the calibration process.
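As a minimal sketch of the inductive bias described above, one can make the temperature an affine function of the predictive entropy and fit the two coefficients on a validation set. The affine form, the grid search, and all names below are assumptions for illustration, not the authors' exact parameterization.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def entropy_temperature(logits, a, b):
    # per-sample temperature as an affine function of predictive entropy
    return np.maximum(a * entropy(softmax(logits)) + b, 1e-2)

def nll(logits, labels, temps):
    p = softmax(logits / temps[:, None])
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_entropy_scaling(logits, labels):
    # tiny grid search over the two coefficients; (a=0, b=1) recovers the identity
    grid_a = np.linspace(-2.0, 2.0, 9)
    grid_b = np.linspace(0.5, 3.0, 6)
    return min(((a, b) for a in grid_a for b in grid_b),
               key=lambda ab: nll(logits, labels, entropy_temperature(logits, *ab)))

# toy overconfident validation set
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=400)
logits = rng.normal(size=(400, 5))
logits[np.arange(400), labels] += 2.0   # informative logits
logits *= 3.0                           # inflate confidence
a, b = fit_entropy_scaling(logits, labels)
calibrated_nll = nll(logits, labels, entropy_temperature(logits, a, b))
identity_nll = nll(logits, labels, np.ones(400))
```

Because the identity map (a=0, b=1) sits inside the search grid, the fitted validation NLL can never be worse than applying no scaling at all.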
We address the problem of uncertainty calibration and introduce a novel calibration method, Parameterized Temperature Scaling (PTS). Standard deep neural networks typically yield uncalibrated predictions, which can be transformed into calibrated confidence scores using post-hoc calibration methods. In this contribution, we demonstrate that the performance of accuracy-preserving state-of-the-art post-hoc calibrators is limited by their intrinsic expressive power. We generalize temperature scaling by computing prediction-specific temperatures, parameterized by a neural network. We show with extensive experiments that our novel accuracy-preserving approach consistently outperforms existing algorithms across a large number of model architectures, datasets, and metrics.
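The core idea, a temperature that is itself predicted from the logits by a small parametric model, can be sketched as follows. The linear-softplus parameterization, the numerical-gradient fitting loop, and all names are illustrative assumptions; the actual PTS method trains a neural network for this mapping.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predicted_temperature(logits, params):
    # tiny "temperature model": softplus of a linear map of the sorted logits
    feats = np.sort(logits, axis=1)[:, ::-1]
    raw = np.clip(feats @ params[:-1] + params[-1], -50.0, 50.0)
    return np.log1p(np.exp(raw)) + 1e-2             # keep temperatures positive

def nll(logits, labels, params):
    T = predicted_temperature(logits, params)
    p = softmax(logits / T[:, None])
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_pts(logits, labels, steps=200, lr=0.1, eps=1e-4):
    params = np.zeros(logits.shape[1] + 1)
    best, best_loss = params.copy(), nll(logits, labels, params)
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(len(params)):                # numerical gradient: fine for a handful of params
            d = np.zeros_like(params)
            d[i] = eps
            grad[i] = (nll(logits, labels, params + d)
                       - nll(logits, labels, params - d)) / (2 * eps)
        params = params - lr * grad
        loss = nll(logits, labels, params)
        if loss < best_loss:                        # keep the best parameters seen
            best, best_loss = params.copy(), loss
    return best

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=300)
logits = rng.normal(size=(300, 3))
logits[np.arange(300), labels] += 2.0
logits *= 4.0                                       # overconfident toy classifier
params = fit_pts(logits, labels)
fitted_nll = nll(logits, labels, params)
initial_nll = nll(logits, labels, np.zeros(4))
```

Unlike a single global temperature, this mapping can assign different temperatures to differently shaped logit vectors, which is what gives the family its extra expressive power.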
Confidence calibration, the problem of predicting probability estimates representative of the true correctness likelihood, is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling, a single-parameter variant of Platt Scaling, is surprisingly effective at calibrating predictions.
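Temperature scaling itself is simple enough to sketch in a few lines: divide the logits by a single scalar T chosen to minimize negative log-likelihood on a held-out validation set. The grid search below (rather than the gradient-based optimization typically used) and the toy data are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels):
    # one scalar parameter: pick the T minimising validation NLL
    grid = np.linspace(0.5, 8.0, 151)
    return min(grid, key=lambda T: nll(logits, labels, T))

# toy validation set from an overconfident classifier
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 3))
logits[np.arange(500), labels] += 2.0   # informative logits
logits *= 4.0                           # inflated confidence
T = fit_temperature(logits, labels)
```

Because T scales all logits uniformly, the argmax (and hence accuracy) is unchanged; only the confidence is softened or sharpened.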
Calibrating neural networks is of utmost importance in safety-critical applications where downstream decision-making depends on predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution with a differentiable function via splines, we obtain a recalibration function that maps the network outputs to actual (calibrated) class-assignment probabilities. The spline fitting is performed using a held-out calibration set, and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures. Our code is available at https://github.com/kartikgupta-at-anu/spline-calibration.
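The binning-free measure can be sketched directly: sort predictions by confidence and take the maximum gap between the cumulative confidence and the cumulative accuracy. This follows the KS idea described in the abstract; the exact normalization used in the paper may differ.

```python
import numpy as np

def ks_calibration_error(confidences, correct):
    # maximum gap between cumulative confidence and cumulative accuracy (binning-free)
    order = np.argsort(confidences)
    cum_conf = np.cumsum(confidences[order]) / len(confidences)
    cum_acc = np.cumsum(correct[order].astype(float)) / len(correct)
    return np.max(np.abs(cum_conf - cum_acc))

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
# accuracy matches confidence (well calibrated) vs. sits 0.3 below it (overconfident)
well_calibrated = (rng.uniform(size=5000) < conf).astype(float)
overconfident = (rng.uniform(size=5000) < conf - 0.3).astype(float)
ks_good = ks_calibration_error(conf, well_calibrated)
ks_bad = ks_calibration_error(conf, overconfident)
```

Because no binning is involved, the measure avoids the bin-count sensitivity that plagues ECE-style estimators.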
Since model trustworthiness is crucial for sensitive real-world applications, practitioners are placing increasing emphasis on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions, but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify model calibration improvement. We demonstrate, both theoretically and empirically, the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this naturally extends recalibration beyond classification.
Deep neural networks (DNN) are prone to miscalibrated predictions, often exhibiting a mismatch between the predicted output and the associated confidence scores. Contemporary model calibration techniques mitigate the problem of overconfident predictions by pushing down the confidence of the winning class while increasing the confidence of the remaining classes across all test samples. However, from a deployment perspective, an ideal model is desired to (i) generate well-calibrated predictions for high-confidence samples with predicted probability, say, >0.95, and (ii) generate a higher proportion of legitimate high-confidence samples. To this end, we propose a novel regularization technique that can be used with classification losses, leading to state-of-the-art calibrated predictions at test time; from a deployment standpoint in safety-critical applications, only high-confidence samples from a well-calibrated model are of interest, as the remaining samples have to undergo manual inspection. Predictive confidence reduction of these potentially "high-confidence samples" is a downside of existing calibration approaches. We mitigate this by proposing a dynamic train-time data pruning strategy that prunes low-confidence samples every few epochs, providing an increase in "confident yet calibrated samples". We demonstrate state-of-the-art calibration performance across image classification benchmarks, reducing training time without much compromise in accuracy. We provide insights into why our dynamic pruning strategy that prunes low-confidence training samples leads to an increase in high-confidence samples at test time.
Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated from data, and empirically that it reveals finer-grained miscalibration patterns than the ECE can detect. Our key result is a novel local recalibration method that improves the confidence scores of individual predictions and reduces the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision-making on classification tasks with image and tabular data.
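A minimal kernel-weighted version of the idea: measure the average gap between confidence and correctness, weighted by feature-space similarity to a query point. The Gaussian kernel and the simple weighted mean are assumptions for this sketch; the paper's estimator additionally involves a binning scheme over confidences.

```python
import numpy as np

def local_calibration_error(confidences, correct, features, query, bandwidth=1.0):
    # kernel-weighted confidence/accuracy gap around a query point in feature space
    d2 = ((features - query) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    w = w / w.sum()
    return abs(np.sum(w * (confidences - correct)))

# two regions of feature space: one calibrated, one overconfident
features = np.concatenate([np.zeros((100, 1)), np.full((100, 1), 5.0)])
confidences = np.full(200, 0.9)
correct = np.concatenate([
    np.tile([1.0] * 9 + [0.0], 10),    # region A: 90% correct, matches confidence
    np.tile([1.0, 0.0], 50),           # region B: 50% correct, overconfident
])
lce_a = local_calibration_error(confidences, correct, features, np.array([0.0]))
lce_b = local_calibration_error(confidences, correct, features, np.array([5.0]))
```

A global measure like ECE would average the two regions together; the local measure isolates the overconfident one.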
It is now well known that neural networks can be overly confident in their predictions, leading to poor calibration. The most common post-hoc approach to compensate for this is temperature scaling, which adjusts the confidence of the predictions for any input by scaling the logits by a fixed value. Whilst this approach typically improves the average calibration across the whole test dataset, the improvement usually reduces the individual confidence of the predictions irrespective of whether the classification of a given input is correct or incorrect. With this insight, we base our method on the observation that different samples contribute to the calibration error by varying amounts, with some needing to increase their confidence and others needing to decrease it. Therefore, for each input, we propose to predict a different temperature value, allowing us to adjust the mismatch between confidence and accuracy at a finer granularity. Furthermore, we observe improved results on OOD detection, and we can also extract a notion of hardness for the data points. Our method is applied post-hoc, consequently using very little computation time and a negligible memory footprint, and is applied to off-the-shelf pre-trained classifiers. We test our method on the ResNet50 and WideResNet28-10 architectures using the CIFAR10/100 and Tiny-ImageNet datasets, showing that producing per-data-point temperatures is beneficial also for the expected calibration error across the whole test set. Code is available at: https://github.com/thwjoy/adats.
Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.
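KCal's prediction step, kernel density estimation over a held-out calibration set in embedding space, can be sketched as below. The supervised metric learning that KCal performs before this step is omitted; the Gaussian kernel, bandwidth, and names are assumptions for illustration.

```python
import numpy as np

def kde_class_probs(query, cal_embeddings, cal_labels, n_classes, bandwidth=1.0):
    # class probabilities from kernel-weighted votes of calibration points
    d2 = ((cal_embeddings - query) ** 2).sum(axis=1)
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))
    votes = np.bincount(cal_labels, weights=k, minlength=n_classes)
    return votes / votes.sum()

# toy embedding space: two well-separated class clusters
rng = np.random.default_rng(0)
cal_embeddings = np.concatenate([rng.normal(0.0, 0.5, size=(50, 2)),
                                 rng.normal(4.0, 0.5, size=(50, 2))])
cal_labels = np.array([0] * 50 + [1] * 50)
probs = kde_class_probs(np.array([0.1, 0.0]), cal_embeddings, cal_labels, n_classes=2)
```

Because the prediction is an average of held-out outcomes rather than a transformed logit, confidence is tied directly to empirical frequencies in the neighborhood of the query.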
This paper provides an introduction to and detailed overview of the principles and practice of classifier calibration. A calibrated classifier correctly quantifies the level of uncertainty or confidence associated with its instance-wise predictions. This is essential for critical applications, optimal decision-making, cost-sensitive classification, and some types of context change. Calibration research has a rich history that predates the birth of machine learning as an academic field by decades. However, the recent increase in interest in calibration has led to new methods and to extensions from the binary to the multiclass setting. The space of options and issues to consider is large, and navigating it requires the right set of concepts and tools. We provide both introductory material and up-to-date technical details of the main concepts and methods, including proper scoring rules and other evaluation metrics, visualization approaches, a comprehensive account of post-hoc calibration methods for binary and multiclass classification, and several advanced topics.
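Among the evaluation metrics such an overview covers, the expected calibration error is the most common and easy to sketch: partition predictions into confidence bins and average the |confidence − accuracy| gap, weighted by bin size. The 15-bin equal-width scheme below is one standard choice, not the only one.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap   # bin-size-weighted |confidence - accuracy|
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
calibrated = (rng.uniform(size=5000) < conf).astype(float)      # accuracy tracks confidence
overconfident = (rng.uniform(size=5000) < conf - 0.3).astype(float)
ece_good = expected_calibration_error(conf, calibrated)
ece_bad = expected_calibration_error(conf, overconfident)
```

ECE is simple but binning-dependent, which is precisely what motivates several of the alternative measures discussed in the surrounding abstracts.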
A multiclass classifier is said to be top-label calibrated if the reported probability for the predicted class (the top label) is calibrated, conditioned on the top label. This conditioning is absent in the closely related and popular notion of confidence calibration, which we argue makes confidence calibration difficult to interpret for decision-making. We propose top-label calibration as a rectification of confidence calibration. Further, we outline a multiclass-to-binary (M2B) reduction framework that unifies confidence, top-label, and class-wise calibration, among others. As its name suggests, M2B works by reducing multiclass calibration to numerous binary calibration problems, each of which can be solved with a simple binary calibration routine. We instantiate the M2B framework with the well-studied histogram binning (HB) binary calibrator and prove that the overall procedure is multiclass calibrated without making any assumptions on the underlying data distribution. In an empirical evaluation with four deep-net architectures on CIFAR-10 and CIFAR-100, we find that the M2B + HB procedure achieves lower top-label and class-wise calibration error than other approaches such as temperature scaling. Code for this work is available at \url{https://github.com/aigen/df-posthoc-calibration}.
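The binary routine at the heart of the M2B instantiation, histogram binning, replaces each score by the empirical frequency of its bin on a calibration set. A minimal sketch with equal-width bins and illustrative names (the paper's instantiation may use a different binning scheme):

```python
import numpy as np

def fit_histogram_binning(scores, outcomes, n_bins=10):
    # learn per-bin empirical frequencies on a held-out calibration set
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    bin_freq = np.array([outcomes[idx == b].mean() if (idx == b).any() else centers[b]
                         for b in range(n_bins)])

    def calibrate(s):
        j = np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
        return bin_freq[j]
    return calibrate

# scores cluster near 0.95 but the binary event only occurs 60% of the time
rng = np.random.default_rng(0)
scores = rng.uniform(0.9, 1.0, size=2000)
outcomes = (rng.uniform(size=2000) < 0.6).astype(float)
calibrate = fit_histogram_binning(scores, outcomes)
adjusted = calibrate(np.array([0.95]))
```

The distribution-free guarantees mentioned in the abstract come from the fact that each bin outputs an empirical frequency rather than a parametric fit.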
Despite the remarkable accuracy that graph neural networks (GNNs) have achieved, whether their results are trustworthy remains unexplored. Previous studies suggest that many modern neural networks are over-confident in their predictions; however, surprisingly, we discover that GNNs mainly lean in the opposite direction, i.e., GNNs are under-confident. Confidence calibration for GNNs is therefore highly desired. In this paper, we propose a novel trustworthy GNN model by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has the property of homophily, and this finding inspires us to design a calibration GNN model (CaGCN) to learn the calibration function. CaGCN is able to obtain a unique transformation from the logits of a GNN to the calibrated confidence for each node; meanwhile, this transformation preserves the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to the self-training framework, showing that more trustworthy pseudo-labels can be obtained with the calibrated confidence, further improving performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.
In spite of the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated, resulting in over-confident predictions. Miscalibration can be exacerbated by overfitting due to the minimization of the cross-entropy loss during training, as it promotes the predicted softmax probabilities to match the one-hot label assignments. This yields a pre-softmax activation for the correct class that is significantly larger than the remaining activations. Recent evidence from the literature suggests that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performance. We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses can be viewed as approximations of a linear penalty (or a Lagrangian) imposing equality constraints on logit distances. This points to an important limitation of such underlying equality constraints, whose ensuing gradients constantly push towards a non-informative solution, which might prevent reaching the best compromise between the discriminative performance and calibration of the model during gradient-based optimization. Following our observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of image classification, semantic segmentation, and NLP benchmarks demonstrate that our method sets new state-of-the-art results on these tasks in terms of network calibration, without affecting the discriminative performance. Code is available at https://github.com/by-liu/mbls.
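The inequality-constrained view can be sketched as a penalty that only activates when a logit distance exceeds a margin m, in contrast to an equality constraint that pushes all distances to zero. The penalty below follows the description in the abstract; the margin value and the names are illustrative assumptions.

```python
import numpy as np

def margin_logit_penalty(logits, margin=10.0):
    # penalise only logit distances that exceed the margin (inequality constraint),
    # instead of pushing all distances to zero (equality constraint)
    dist = logits.max(axis=1, keepdims=True) - logits   # distances to the max logit
    return np.maximum(dist - margin, 0.0).sum(axis=1).mean()

inside = np.array([[3.0, 0.0, -2.0]])    # all distances <= margin: no penalty
outside = np.array([[20.0, 0.0, -5.0]])  # distances 20 and 25 exceed the margin
p_inside = margin_logit_penalty(inside)
p_outside = margin_logit_penalty(outside)
```

During training, such a penalty would be added to the cross-entropy loss with a weighting coefficient, leaving moderate logit gaps untouched and only curbing the extreme ones that drive overconfidence.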
Model calibration, which is concerned with how frequently the model predicts correctly, not only plays a vital part in statistical model design, but also has substantial practical applications, such as optimal decision-making in the real world. However, it has been discovered that modern deep neural networks are generally poorly calibrated due to the overestimation (or underestimation) of predictive confidence, which is closely related to overfitting. In this paper, we propose Annealing Double-Head, a simple-to-implement but highly effective architecture for calibrating the DNN during training. To be precise, we construct an additional calibration head, a shallow neural network that typically has one latent layer, on top of the last latent layer in the normal model to map the logits to the aligned confidence. Furthermore, a simple Annealing technique that dynamically scales the logits by the calibration head during the training procedure is developed to improve its performance. Under both the in-distribution and distributional-shift circumstances, we exhaustively evaluate our Annealing Double-Head architecture on multiple pairs of contemporary DNN architectures and vision and speech datasets. We demonstrate that our method achieves state-of-the-art model calibration performance without post-processing while simultaneously providing comparable predictive accuracy in comparison to other recently proposed calibration methods on a range of learning tasks.
The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based calibration and metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant variable-based calibration error as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing recalibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based recalibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.
Recent studies have revealed that, beyond conventional accuracy, calibration should also be considered for training modern deep neural networks. To address miscalibration during learning, some methods have explored different penalty functions as part of the learning objective, alongside a standard classification loss, with a hyper-parameter controlling the relative contribution of each term. Nevertheless, these methods share two major drawbacks: 1) the scalar balancing weight is the same for all classes, hindering the ability to address different intrinsic difficulties or imbalance among classes; and 2) the balancing weight is usually fixed without an adaptive strategy, which may prevent from reaching the best compromise between accuracy and calibration, and requires hyper-parameter search for each application. We propose Class Adaptive Label Smoothing (CALS) for calibrating deep networks, which allows to learn class-wise multipliers during training, yielding a powerful alternative to common label smoothing penalties. Our method builds on a general Augmented Lagrangian approach, a well-established technique in constrained optimization, but we introduce several modifications to tailor it for large-scale, class-adaptive training. Comprehensive evaluation and multiple comparisons on a variety of benchmarks, including standard and long-tailed image classification, semantic segmentation, and text classification, demonstrate the superiority of the proposed method. The code is available at https://github.com/by-liu/CALS.
The recent surge of research on out-of-distribution (OOD) detection has moved beyond testing on in-distribution data. Recent attempts to categorize OOD data have introduced the concepts of near and far OOD detection. Specifically, prior works define the characteristics of OOD data in terms of detection difficulty. We propose instead to characterize the spectrum of OOD data using two types of distribution shift: covariate shift and concept shift, where covariate shift corresponds to changes in style, e.g., noise, and concept shift indicates changes in semantics. This characterization reveals that sensitivity to each type of shift is important for both OOD detection and confidence calibration. Consequently, we investigate score functions that capture sensitivity to each type of dataset shift and methods that improve them. To this end, we theoretically derive two score functions for OOD detection, the covariate shift score and the concept shift score, based on a decomposition of the KL divergence, and propose a geometrically-inspired method (Geometric ODIN) to improve OOD detection under both shifts using only in-distribution data. In addition, the proposed method naturally leads to an expressive post-hoc calibration function that yields state-of-the-art calibration performance on both in-distribution and out-of-distribution data. We are the first to propose a method that works well across both OOD detection and calibration and under different types of shift. See the project page at https://sites.google.com/view/geometric-decomposition.
Optimal decision-making requires that classifiers produce uncertainty estimates consistent with their empirical accuracy. However, deep neural networks are often under- or over-confident in their predictions. Consequently, methods have been developed to improve the calibration of their predictive uncertainty, both during training and post hoc. In this work, we propose differentiable losses to improve calibration, based on a soft (continuous) version of the binning operation underlying popular calibration-error estimators. When incorporated into training, these soft calibration losses achieve state-of-the-art single-model ECE across multiple datasets with less than a 1% decrease in accuracy. For instance, we observe an 82% reduction in ECE (70% relative to the post-hoc rescaled ECE) in exchange for a 0.7% relative decrease in accuracy relative to the cross-entropy baseline on CIFAR-100. When incorporated post-training, the soft-binning-based calibration error objective improves upon temperature scaling, a popular recalibration method. Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross-entropy loss with post-hoc recalibration methods.
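The soft-binning idea can be sketched by replacing hard bin membership with a smooth (here Gaussian) assignment, which makes the resulting ECE-style estimate differentiable in the confidences and hence usable as a training loss. The specific kernel and widths below are assumptions, not the paper's exact construction.

```python
import numpy as np

def soft_binned_ece(confidences, correct, n_bins=15, width=0.02):
    # soft (differentiable) bin membership instead of hard assignment
    centers = (np.arange(n_bins) + 0.5) / n_bins
    w = np.exp(-((confidences[:, None] - centers[None, :]) ** 2) / (2.0 * width))
    w = w / w.sum(axis=1, keepdims=True)            # each sample spreads mass over bins
    mass = w.sum(axis=0)
    bin_conf = (w * confidences[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
    bin_acc = (w * correct[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
    return np.sum((mass / len(confidences)) * np.abs(bin_conf - bin_acc))

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
calibrated = (rng.uniform(size=5000) < conf).astype(float)
overconfident = (rng.uniform(size=5000) < conf - 0.3).astype(float)
soft_good = soft_binned_ece(conf, calibrated)
soft_bad = soft_binned_ece(conf, overconfident)
```

In an actual training setup this quantity would be computed on framework tensors so that gradients flow back into the network; numpy is used here only to show the arithmetic.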