本文重新讨论了一个非常简单但非常有效的计算范式,深度共同学习(DML)。我们观察到,有效性与其出色的概括质量高度相关。在本文中,我们从新的角度来解释了DML的性能改善,即这大约是贝叶斯后的采样程序。这也为应用R \'{e} nyi Divergence改善原始DML的基础建立了基础,因为它带来了先验的差异控制(在DML的上下文中)。因此,我们提出了r \'{e} nyi Divergence深度共同学习(RDML)。我们的经验结果代表了DML和\ renyi {}差异的婚姻的优势。R \'{E} nyi Divergence施加的灵活控制能够进一步改进DML,以学习更好的广义模型。
translated by 谷歌翻译
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
translated by 谷歌翻译
现代深度学习方法构成了令人难以置信的强大工具,以解决无数的挑战问题。然而,由于深度学习方法作为黑匣子运作,因此与其预测相关的不确定性往往是挑战量化。贝叶斯统计数据提供了一种形式主义来理解和量化与深度神经网络预测相关的不确定性。本教程概述了相关文献和完整的工具集,用于设计,实施,列车,使用和评估贝叶斯神经网络,即使用贝叶斯方法培训的随机人工神经网络。
translated by 谷歌翻译
随机梯度马尔可夫链蒙特卡洛(SGMCMC)被认为是大型模型(例如贝叶斯神经网络)中贝叶斯推断的金标准。由于从业人员在这些模型中面临速度与准确性权衡,因此变异推理(VI)通常是可取的选择。不幸的是,VI对后部的分解和功能形式做出了有力的假设。在这项工作中,我们提出了一个新的非参数变分近似,该近似没有对后验功能形式进行假设,并允许从业者指定算法应尊重或断裂的确切依赖性。该方法依赖于在修改的能量函数上运行的新的langevin型算法,其中潜在变量的一部分是在马尔可夫链的早期迭代中平均的。这样,统计依赖性可以以受控的方式破裂,从而使链条混合更快。可以以“辍学”方式进一步修改该方案,从而导致更大的可扩展性。我们在CIFAR-10,SVHN和FMNIST上测试RESNET-20的计划。在所有情况下,与SG-MCMC和VI相比,我们都会发现收敛速度和/或最终精度的提高。
translated by 谷歌翻译
我们使用高斯过程扰动模型在高维二次上的真实和批量风险表面之间的高斯过程扰动模型分析和解释迭代平均的泛化性能。我们从我们的理论结果中获得了三个现象\姓名:}(1)将迭代平均值(ia)与大型学习率和正则化进行了改进的正规化的重要性。 (2)对较少频繁平均的理由。 (3)我们预计自适应梯度方法同样地工作,或者更好,而不是其非自适应对应物的迭代平均值。灵感来自这些结果\姓据{,一起与}对迭代解决方案多样性的适当正则化的重要性,我们提出了两个具有迭代平均的自适应算法。与随机梯度下降(SGD)相比,这些结果具有明显更好的结果,需要较少调谐并且不需要早期停止或验证设定监视。我们在各种现代和古典网络架构上展示了我们对CiFar-10/100,Imagenet和Penn TreeBank数据集的方法的疗效。
translated by 谷歌翻译
Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation.
translated by 谷歌翻译
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
translated by 谷歌翻译
嵌套辍学是辍学操作的变体,能够根据训练期间的预定义重要性订购网络参数或功能。它已被探索:I。构造嵌套网络:嵌套网是神经网络,可以在测试时间(例如基于计算约束)中立即调整架构的架构。嵌套的辍学者隐含地对网络参数进行排名,生成一组子网络,从而使任何较小的子网络构成较大的子网络的基础。 ii。学习排序表示:应用于生成模型的潜在表示(例如自动编码器)对特征进行排名,从而在尺寸上执行密集表示的明确顺序。但是,在整个训练过程中,辍学率是固定为高参数的。对于嵌套网,当删除网络参数时,性能衰减在人类指定的轨迹中而不是从数据中学到的轨迹中。对于生成模型,特征的重要性被指定为恒定向量,从而限制了表示学习的灵活性。为了解决该问题,我们专注于嵌套辍学的概率对应物。我们提出了一个嵌套掉落(VND)操作,该操作以低成本绘制多维有序掩码的样品,为嵌套掉落的参数提供了有用的梯度。基于这种方法,我们设计了一个贝叶斯嵌套的神经网络,以了解参数分布的顺序知识。我们在不同的生成模型下进一步利用VND来学习有序的潜在分布。在实验中,我们表明所提出的方法在分类任务中的准确性,校准和室外检测方面优于嵌套网络。它还在数据生成任务上胜过相关的生成模型。
translated by 谷歌翻译
项目反应理论(IRT)是一个无处不在的模型,可以根据他们对问题的回答理解人类行为和态度。大型现代数据集为捕捉人类行为的更多细微差别提供了机会,从而有可能改善心理测量模型,从而改善科学理解和公共政策。但是,尽管较大的数据集允许采用更灵活的方法,但许多用于拟合IRT模型的当代算法也可能具有禁止现实世界应用的巨大计算需求。为了解决这种瓶颈,我们引入了IRT的变异贝叶斯推理算法,并表明它在不牺牲准确性的情况下快速可扩展。将此方法应用于认知科学和教育的五个大规模项目响应数据集中,比替代推理算法更高的对数可能性和更高的准确性。然后,使用这种新的推论方法,我们将IRT概括为具有表现力的贝叶斯响应模型,利用深度学习的最新进展来捕获具有神经网络的非线性项目特征曲线(ICC)。使用TIMSS的特定级数学测试,我们显示我们的非线性IRT模型可以捕获有趣的不对称ICC。该算法实现是开源的,易于使用。
translated by 谷歌翻译
除了使用硬标签的标准监督学习外,通常在许多监督学习设置中使用辅助损失来改善模型的概括。例如,知识蒸馏增加了第二个教师模仿模型训练的损失,在该培训中,教师可能是一个验证的模型,可以输出比标签更丰富的分布。同样,在标记数据有限的设置中,弱标记信息以标签函数的形式使用。此处引入辅助损失来对抗标签函数,这些功能可能是基于嘈杂的规则的真实标签近似值。我们解决了学习以原则性方式结合这些损失的问题。我们介绍AMAL,该AMAL使用元学习在验证度量上学习实例特定的权重,以实现损失的最佳混合。在许多知识蒸馏和规则降解域中进行的实验表明,Amal在这些领域中对竞争基准的增长可显着。我们通过经验分析我们的方法,并分享有关其提供性能提升的机制的见解。
translated by 谷歌翻译
独立训练的神经网络的集合是一种最新的方法,可以在深度学习中估算预测性不确定性,并且可以通过三角洲函数的混合物解释为后验分布的近似值。合奏的培训依赖于损失景观的非跨性别性和其单个成员的随机初始化,从而使后近似不受控制。本文提出了一种解决此限制的新颖和原则性的方法,最大程度地减少了函数空间中真实后验和内核密度估计器(KDE)之间的$ f $ divergence。我们从组合的角度分析了这一目标,并表明它在任何$ f $的混合组件方面都是supporular。随后,我们考虑了贪婪合奏结构的问题。从负$ f $ didivergence上的边际增益来量化后近似的改善,通过将新组件添加到KDE中得出,我们得出了集合方法的新型多样性项。我们的方法的性能在计算机视觉的分布外检测基准测试中得到了证明,该基准在多个数据集中训练的一系列架构中。我们方法的源代码可在https://github.com/oulu-imeds/greedy_ensembles_training上公开获得。
translated by 谷歌翻译
我们引入了一种新的经验贝叶斯方法,用于大规模多线性回归。我们的方法结合了两个关键思想:(i)使用灵活的“自适应收缩”先验,该先验近似于正常分布的有限混合物,近似于正常分布的非参数家族; (ii)使用变分近似来有效估计先前的超参数并计算近似后期。将这两个想法结合起来,将快速,灵活的方法与计算速度相当,可与快速惩罚的回归方法(例如Lasso)相当,并在各种场景中具有出色的预测准确性。此外,我们表明,我们方法中的后验平均值可以解释为解决惩罚性回归问题,并通过直接解决优化问题(而不是通过交叉验证来调整)从数据中学到的惩罚函数的精确形式。 。我们的方法是在r https://github.com/stephenslab/mr.ash.ash.alpha的r软件包中实现的
translated by 谷歌翻译
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
translated by 谷歌翻译
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve on the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures the proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. By cooperatively and simultaneously training, the models the KL distance becomes incapable of properly minimizing the teacher's and student's distributions. Alongside accuracy, critical edge device applications are in need of well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
translated by 谷歌翻译
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a minmax optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, Ima-geNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https: //github.com/google-research/sam. * Work done as part of the Google AI Residency program.
translated by 谷歌翻译
知识蒸馏是一种培训小型学生网络的流行技术,以模仿更大的教师模型,例如网络的集合。我们表明,虽然知识蒸馏可以改善学生泛化,但它通常不得如此普遍地工作:虽然在教师和学生的预测分布之间,甚至在学生容量的情况下,通常仍然存在令人惊讶的差异完美地匹配老师。我们认为优化的困难是为什么学生无法与老师匹配的关键原因。我们还展示了用于蒸馏的数据集的细节如何在学生与老师匹配的紧密关系中发挥作用 - 以及教师矛盾的教师并不总是导致更好的学生泛化。
translated by 谷歌翻译
具有许多预训练模型(PTM)的模型中心已经是深度学习的基石。尽管以高成本建造,但它们仍然保持\ emph {探索}:从业人员通常会通过普及从提供的模型中心中选择一个PTM,然后对PTM进行微调以解决目标任务。这种na \“我的但共同的实践构成了两个障碍,以充分利用预训练的模型中心:(1)通过受欢迎程度选择的PTM选择没有最佳保证;(2)仅使用一个PTM,而其余的PTM则被忽略。理想情况下。理想情况下。 ,为了最大程度地利用预训练的模型枢纽,需要尝试所有PTM的所有组合和广泛的微调每个PTM组合,这会产生指数组合和不可偿还的计算预算。在本文中,我们提出了一种新的范围排名和调整预训练的模型:(1)我们的会议论文〜\ citep {you_logme:_2021}提出的logMe,以估算预先训练模型提取的标签证据的最大值,该标签证据可以在模型中排名所有PTMS用于各种类型的PTM和任务的枢纽\ Emph {微调之前}。(2)如果我们不偏爱模型的体系结构,则可以对排名最佳的PTM进行微调和部署,或者可以通过TOPE调整目标PTM -k通过t排名PTM他提出了b-tuning算法。排名部分基于会议论文,我们在本文中完成了其理论分析,包括启发式证据最大化程序的收敛证明和特征维度的影响。调整零件引入了一种用于调整多个PTM的新型贝叶斯调整(B-Tuning)方法,该方法超过了专门的方法,该方法旨在调整均匀的PTMS,并为调整异质PTMS设置了一种新的技术。利用PTM枢纽的新范式对于整个机器学习社区的大量受众来说可能会很有趣。
translated by 谷歌翻译
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. Our algorithm, ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of the-art methods for structured pruning, while maintaining better test-accuracy and more importantly in a manner robust with respect to network initialization and initial size.
translated by 谷歌翻译
适应数据分布的结构(例如对称性和转型Imarerces)是机器学习中的重要挑战。通过架构设计或通过增强数据集,可以内在学习过程中内置Inhormces。两者都需要先验的了解对称性的确切性质。缺乏这种知识,从业者求助于昂贵且耗时的调整。为了解决这个问题,我们提出了一种新的方法来学习增强变换的分布,以新的\ emph {转换风险最小化}(trm)框架。除了预测模型之外,我们还优化了从假说空间中选择的转换。作为算法框架,我们的TRM方法是(1)有效(共同学习增强和模型,以\ emph {单训练环}),(2)模块化(使用\ emph {任何训练算法),以及(3)一般(处理\ \ ich {离散和连续}增强)。理论上与标准风险最小化的TRM比较,并在其泛化误差上给出PAC-Bayes上限。我们建议通过块组成的新参数化优化富裕的增强空间,导致新的\ EMPH {随机成分增强学习}(SCALE)算法。我们在CIFAR10 / 100,SVHN上使用先前的方法(快速自身自动化和武术器)进行实际比较规模。此外,我们表明规模可以在数据分布中正确地学习某些对称性(恢复旋转Mnist上的旋转),并且还可以改善学习模型的校准。
translated by 谷歌翻译
We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions, for which we extensively discuss the Gaussian case for which the updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is of simple implementation as the updates of the variational posterior do not involve gradients with respect to the model parameters, nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details about its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
translated by 谷歌翻译