Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.
translated by 谷歌翻译
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
translated by 谷歌翻译
Confidence calibration -the problem of predicting probability estimates representative of the true correctness likelihood -is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-ofthe-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -a singleparameter variant of Platt Scaling -is surprisingly effective at calibrating predictions.
translated by 谷歌翻译
如果预测类的概率(顶级标签)是校准的,则在顶部标签上进行条件,则据说多类分类器将是顶级标签的校准。在密切相关和流行的置信度校准概念中,这种条件不存在,我们认为这使得置信校准难以解释决策。我们提出顶级标签校准作为置信校准的纠正。此外,我们概述了一个多类对二进制(M2B)还原框架,该框架统一了信心,顶级标签和班级校准等。顾名思义,M2B通过将多类校准减少到众多二元校准问题来起作用,每个二进制校准问题都可以使用简单的二进制校准例程来解决。我们将M2B框架实例化使用经过良好研究的直方图(HB)二进制校准器,并证明整体过程是多类校准的,而无需对基础数据分布进行任何假设。在CIFAR-10和CIFAR-100上具有四个深净体系结构的经验评估中,我们发现M2B + HB程序比其他方法(例如温度缩放)获得了较低的顶级标签和类别校准误差。这项工作的代码可在\ url {https://github.com/aigen/df-posthoc-calibration}中获得。
translated by 谷歌翻译
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
translated by 谷歌翻译
Model calibration, which is concerned with how frequently the model predicts correctly, not only plays a vital part in statistical model design, but also has substantial practical applications, such as optimal decision-making in the real world. However, it has been discovered that modern deep neural networks are generally poorly calibrated due to the overestimation (or underestimation) of predictive confidence, which is closely related to overfitting. In this paper, we propose Annealing Double-Head, a simple-to-implement but highly effective architecture for calibrating the DNN during training. To be precise, we construct an additional calibration head-a shallow neural network that typically has one latent layer-on top of the last latent layer in the normal model to map the logits to the aligned confidence. Furthermore, a simple Annealing technique that dynamically scales the logits by calibration head in training procedure is developed to improve its performance. Under both the in-distribution and distributional shift circumstances, we exhaustively evaluate our Annealing Double-Head architecture on multiple pairs of contemporary DNN architectures and vision and speech datasets. We demonstrate that our method achieves state-of-the-art model calibration performance without post-processing while simultaneously providing comparable predictive accuracy in comparison to other recently proposed calibration methods on a range of learning tasks.
translated by 谷歌翻译
在本文中,我们研究了现代神经网络的事后校准,这个问题近年来引起了很多关注。已经为任务提出了许多不同复杂性的校准方法,但是关于这些任务的表达方式尚无共识。我们专注于置信度缩放的任务,特别是在概括温度缩放的事后方法上,我们将其称为自适应温度缩放家族。我们分析了改善校准并提出可解释方法的表达功能。我们表明,当有大量数据复杂模型(例如神经网络)产生更好的性能时,但是当数据量受到限制时,很容易失败,这是某些事后校准应用(例如医学诊断)的常见情况。我们研究表达方法在理想条件和设计更简单的方法下学习但对这些表现良好的功能具有强烈的感应偏见的功能。具体而言,我们提出了基于熵的温度缩放,这是一种简单的方法,可根据其熵缩放预测的置信度。结果表明,与其他方法相比,我们的方法可获得最先进的性能,并且与复杂模型不同,它对数据稀缺是可靠的。此外,我们提出的模型可以更深入地解释校准过程。
translated by 谷歌翻译
由于模型可信度对于敏感的现实世界应用至关重要,因此从业者越来越重视改善深神经网络的不确定性校准。校准误差旨在量化概率预测的可靠性,但其估计器通常是偏见且不一致的。在这项工作中,我们介绍了适当的校准误差的框架,该校准误差将每个校准误差与适当的分数联系起来,并提供具有最佳估计属性的相应上限。这种关系可用于可靠地量化模型校准改进。与我们的方法相比,我们从理论上和经验上证明了常用估计量的缺点。由于适当的分数的广泛适用性,这可以自然地扩展到分类之外的重新校准。
translated by 谷歌翻译
本文介绍了分类器校准原理和实践的简介和详细概述。校准的分类器正确地量化了与其实例明智的预测相关的不确定性或信心水平。这对于关键应用,最佳决策,成本敏感的分类以及某些类型的上下文变化至关重要。校准研究具有丰富的历史,其中几十年来预测机器学习作为学术领域的诞生。然而,校准兴趣的最近增加导致了新的方法和从二进制到多种子体设置的扩展。需要考虑的选项和问题的空间很大,并导航它需要正确的概念和工具集。我们提供了主要概念和方法的介绍性材料和最新的技术细节,包括适当的评分规则和其他评估指标,可视化方法,全面陈述二进制和多字数分类的HOC校准方法,以及几个先进的话题。
translated by 谷歌翻译
有效的决策需要了解预测中固有的不确定性。在回归中,这种不确定性可以通过各种方法估算;然而,许多这些方法对调谐进行费力,产生过度自确性的不确定性间隔,或缺乏敏锐度(给予不精确的间隔)。我们通过提出一种通过定义具有两个不同损失功能的神经网络来捕获回归中的预测分布的新方法来解决这些挑战。具体地,一个网络近似于累积分布函数,第二网络近似于其逆。我们将此方法称为合作网络(CN)。理论分析表明,优化的固定点处于理想化的解决方案,并且该方法是渐近的与地面真理分布一致。凭经验,学习是简单且强大的。我们基准CN对两个合成和六个现实世界数据集的几种常见方法,包括预测来自电子健康记录的糖尿病患者的A1C值,其中不确定是至关重要的。在合成数据中,所提出的方法与基本上匹配地面真理。在真实世界数据集中,CN提高了许多性能度量的结果,包括对数似然估计,平均误差,覆盖估计和预测间隔宽度。
translated by 谷歌翻译
最佳决策要求分类器产生与其经验准确性一致的不确定性估计。然而,深度神经网络通常在他们的预测中受到影响或过度自信。因此,已经开发了方法,以改善培训和后HOC期间的预测性不确定性的校准。在这项工作中,我们提出了可分解的损失,以改善基于频流校准误差估计底层的钻孔操作的软(连续)版本的校准。当纳入训练时,这些软校准损耗在多个数据集中实现最先进的单一模型ECE,精度低于1%的数量。例如,我们观察到ECE的82%(相对于HOC后射出ECE 70%),以换取相对于CIFAR-100上的交叉熵基线的准确性0.7%的相对降低。在培训后结合时,基于软合成的校准误差目标会改善温度缩放,一种流行的重新校准方法。总体而言,跨损失和数据集的实验表明,使用校准敏感程序在数据集移位下产生更好的不确定性估计,而不是使用跨熵损失和后HOC重新校准方法的标准做法。
translated by 谷歌翻译
在在下游决策取决于预测概率的安全关键应用中,校准神经网络是最重要的。测量校准误差相当于比较两个实证分布。在这项工作中,我们引入了由经典Kolmogorov-Smirnov(KS)统计测试的自由校准措施,其中主要思想是比较各自的累积概率分布。由此,通过通过Quidsime使用可微分函数来近似经验累积分布,我们获得重新校准函数,将网络输出映射到实际(校准的)类分配概率。使用停滞校准组进行脊柱拟合,并在看不见的测试集上评估所获得的重新校准功能。我们测试了我们对各种图像分类数据集的现有校准方法的方法,并且我们的样条键的重新校准方法始终如一地优于KS错误的现有方法以及其他常用的校准措施。我们的代码可在https://github.com/kartikgupta-at-anu/spline-calibration获得。
translated by 谷歌翻译
概率分类器输出置信信心得分随着他们的预测,并且应该校准这些置信分数,即,它们应该反映预测的可靠性。最小化标准度量的置信度分数,例如预期的校准误差(ECE)准确地测量整个人口平均值的可靠性。然而,通常不可能测量单独预测的可靠性。在这项工作中,我们提出了本地校准误差(LCE),以跨越平均值和各个可靠性之间的间隙。对于每个单独的预测,LCE测量一组类似预测的平均可靠性,其中通过预先训练的特征空间上的内核函数和通过预测模型信仰的融合方案来量化相似性。我们从理论上显示了LCE可以从数据估计,并经验地发现它显示出比ECE可以检测到更细粒度的错误级别模式。我们的关键结果是一种新颖的局部重新校准方法,以改善个人预测的置信度分数并减少LCE。实验,我们表明我们的重新校准方法产生更准确的置信度分数,从而提高了具有图像和表格数据的分类任务的下游公平性和决策。
translated by 谷歌翻译
当疑问以获得更好的有效精度时,选择性分类允许模型放弃预测(例如,说“我不知道”)。尽管典型的选择性模型平均可以有效地产生更准确的预测,但它们仍可能允许具有很高置信度的错误预测,或者跳过置信度较低的正确预测。提供校准的不确定性估计以及预测(与真实频率相对应的概率)以及具有平均准确的预测一样重要。但是,不确定性估计对于某些输入可能不可靠。在本文中,我们开发了一种新的选择性分类方法,其中我们提出了一种拒绝“不确定”不确定性的示例的方法。通过这样做,我们旨在通过对所接受示例的分布进行{良好校准}的不确定性估计进行预测,这是我们称为选择性校准的属性。我们提出了一个用于学习选择性校准模型的框架,其中训练了单独的选择器网络以改善给定基本模型的选择性校准误差。特别是,我们的工作重点是实现强大的校准,该校准有意地设计为在室外数据上进行测试。我们通过受分配强大的优化启发的训练策略实现了这一目标,在该策略中,我们将模拟输入扰动应用于已知的,内域培训数据。我们证明了方法对多个图像分类和肺癌风险评估任务的经验有效性。
translated by 谷歌翻译
神经网络校准是深度学习的重要任务,以确保模型预测的信心与真正的正确性可能性之间的一致性。在本文中,我们提出了一种称为Neural夹紧的新的后处理校准方法,该方法通过可学习的通用输入扰动和输出温度扩展参数在预训练的分类器上采用简单的联合输入输出转换。此外,我们提供了理论上的解释,说明为什么神经夹具比温度缩放更好。在CIFAR-100和Imagenet图像识别数据集以及各种深神经网络模型上进行了评估,我们的经验结果表明,神经夹具明显优于最先进的后处理校准方法。
translated by 谷歌翻译
The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
translated by 谷歌翻译
适应数据分布的结构(例如对称性和转型Imarerces)是机器学习中的重要挑战。通过架构设计或通过增强数据集,可以内在学习过程中内置Inhormces。两者都需要先验的了解对称性的确切性质。缺乏这种知识,从业者求助于昂贵且耗时的调整。为了解决这个问题,我们提出了一种新的方法来学习增强变换的分布,以新的\ emph {转换风险最小化}(trm)框架。除了预测模型之外,我们还优化了从假说空间中选择的转换。作为算法框架,我们的TRM方法是(1)有效(共同学习增强和模型,以\ emph {单训练环}),(2)模块化(使用\ emph {任何训练算法),以及(3)一般(处理\ \ ich {离散和连续}增强)。理论上与标准风险最小化的TRM比较,并在其泛化误差上给出PAC-Bayes上限。我们建议通过块组成的新参数化优化富裕的增强空间,导致新的\ EMPH {随机成分增强学习}(SCALE)算法。我们在CIFAR10 / 100,SVHN上使用先前的方法(快速自身自动化和武术器)进行实际比较规模。此外,我们表明规模可以在数据分布中正确地学习某些对称性(恢复旋转Mnist上的旋转),并且还可以改善学习模型的校准。
translated by 谷歌翻译
Deep neural networks (DNN) are prone to miscalibrated predictions, often exhibiting a mismatch between the predicted output and the associated confidence scores. Contemporary model calibration techniques mitigate the problem of overconfident predictions by pushing down the confidence of the winning class while increasing the confidence of the remaining classes across all test samples. However, from a deployment perspective, an ideal model is desired to (i) generate well-calibrated predictions for high-confidence samples with predicted probability say >0.95, and (ii) generate a higher proportion of legitimate high-confidence samples. To this end, we propose a novel regularization technique that can be used with classification losses, leading to state-of-the-art calibrated predictions at test time; From a deployment standpoint in safety-critical applications, only high-confidence samples from a well-calibrated model are of interest, as the remaining samples have to undergo manual inspection. Predictive confidence reduction of these potentially ``high-confidence samples'' is a downside of existing calibration approaches. We mitigate this by proposing a dynamic train-time data pruning strategy that prunes low-confidence samples every few epochs, providing an increase in "confident yet calibrated samples". We demonstrate state-of-the-art calibration performance across image classification benchmarks, reducing training time without much compromise in accuracy. We provide insights into why our dynamic pruning strategy that prunes low-confidence training samples leads to an increase in high-confidence samples at test time.
translated by 谷歌翻译
学习推迟(L2D)框架有可能使AI系统更安全。对于给定的输入,如果人类比模型更有可能采取正确的行动,则系统可以将决定推迟给人类。我们研究L2D系统的校准,研究它们输出的概率是否合理。我们发现Mozannar&Sontag(2020)多类框架没有针对专家正确性进行校准。此外,由于其参数化是为此目的而退化的,因此甚至不能保证产生有效的概率。我们提出了一个基于单VS-ALL分类器的L2D系统,该系统能够产生专家正确性的校准概率。此外,我们的损失功能也是多类L2D的一致替代,例如Mozannar&Sontag(2020)。我们的实验验证了我们的系统校准不仅是我们的系统校准,而且这种好处无需准确。我们的模型的准确性始终可与Mozannar&Sontag(2020)模型的模型相当(通常是优越),从仇恨言语检测到星系分类到诊断皮肤病变的任务。
translated by 谷歌翻译
由于机器学习技术在新域中被广泛采用,特别是在诸如自主车辆的安全关键系统中,这是提供准确的输出不确定性估计至关重要。因此,已经提出了许多方法来校准神经网络以准确估计错误分类的可能性。但是,虽然这些方法实现了低校准误差,但有空间以进一步改进,尤其是在大维设置(如想象成)中。在本文中,我们介绍了一个名为Hoki的校准算法,它通过将随机转换应用于神经网络编程来工作。我们为基于应用转换后观察到的标签预测变化的数量提供了足够的条件。我们在多个数据集上执行实验,并表明所提出的方法通常优于多个数据集和模型的最先进的校准算法,尤其是在充满挑战的ImageNet数据集上。最后,Hoki也是可扩展的,因为它需要可比较的执行时间到温度缩放的执行时间。
translated by 谷歌翻译