In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a minmax optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, Ima-geNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https: //github.com/google-research/sam. * Work done as part of the Google AI Residency program.
translated by 谷歌翻译
清晰度感知最小化(SAM)是一种最近的训练方法,它依赖于最严重的重量扰动,可显着改善各种环境中的概括。我们认为,基于pac-bayes概括结合的SAM成功的现有理由,而收敛到平面最小值的想法是不完整的。此外,没有解释说在SAM中使用$ m $ sharpness的成功,这对于概括而言至关重要。为了更好地理解SAM的这一方面,我们理论上分析了其对角线性网络的隐式偏差。我们证明,SAM总是选择一种比标准梯度下降更好的解决方案,用于某些类别的问题,并且通过使用$ m $ -sharpness可以放大这种效果。我们进一步研究了隐性偏见在非线性网络上的特性,在经验上,我们表明使用SAM进行微调的标准模型可以导致显着的概括改进。最后,当与随机梯度一起使用时,我们为非凸目标提供了SAM的收敛结果。我们从经验上说明了深层网络的这些结果,并讨论了它们与SAM的概括行为的关系。我们的实验代码可在https://github.com/tml-epfl/understanding-sam上获得。
translated by 谷歌翻译
已知最近的清晰度感知最小化(SAM)可以找到平坦的最小值,这有助于改善稳健性。 Sam通过报告当前迭代周围的小社区内的最大损失值来修改损失函数。但是,它使用欧几里得球来定义邻域,这可能是不准确的,因为神经网络的损失函数通常是根据概率分布(例如类预测概率)定义的,从而使参数空间空间非欧几里得。在本文中,我们在定义邻里时考虑了模型参数空间的信息几何形状,即用Fisher信息引起的椭圆形取代Sam的欧几里得球。我们称为Fisher Sam的方法定义了符合基础统计歧管的内在度量的更准确的邻域结构。例如,由于我们的Fisher Sam避免了参数空间几何形状,因此SAM可能会在附近或不当远处探测最坏情况下的损失值。最近,另一种自适应SAM方法会根据参数幅度的规模拉伸/收缩欧几里得球。这可能是危险的,有可能破坏邻里结构。我们证明了在几个基准数据集/任务上提出的Fisher SAM的性能提高。
translated by 谷歌翻译
模型不合时宜的元学习(MAML)目前是少量元学习的主要方法之一。尽管它具有有效性,但由于先天的二聚体问题结构,MAML的优化可能具有挑战性。具体而言,MAML的损失格局比其经验风险最小化的对应物更为复杂,可能的鞍点和局部最小化可能更复杂。为了应对这一挑战,我们利用了最近发明的清晰度最小化的最小化,并开发出一种清晰感的MAML方法,我们称其为Sharp MAML。我们从经验上证明,Sharp-MAML及其计算有效的变体可以胜过流行的现有MAML基准(例如,Mini-Imagenet上的$+12 \%$ $精度)。我们通过收敛速率分析和尖锐MAML的概括结合进行了经验研究。据我们所知,这是在双层学习背景下对清晰度感知最小化的第一个经验和理论研究。该代码可在https://github.com/mominabbass/sharp-maml上找到。
translated by 谷歌翻译
随着最近在移动和边缘设备上部署神经网络模型的需求,希望提高模型对看不见的测试数据的普遍性,以及提高模型在固定点量化下的稳健性,以实现有效部署。然而,最大限度地减少培训损失在泛化和量化性能上提供了一些保证。在这项工作中,我们通过在改善模型对界限重量扰动的框架下理论上统一它们的理论上统一并最小化模型权重的稳健性并最小化了模型权重的框架的框架,同时履行泛化和量化性能。因此,我们提出了HESSIAN增强的鲁棒优化方法,以通过基于梯度的训练过程最小化Hessian特征值,同时提高泛化和量化性能。 HERO在测试准确性上高达3.8%,高度高达30%,在80%的培训标签扰动下的准确性高达30%,以及各种精度范围内的最佳训练后量化精度,包括在SGD上的高精度改善> 10%在各种数据集上的共同模型架构培训模型。
translated by 谷歌翻译
我们使用高斯过程扰动模型在高维二次上的真实和批量风险表面之间的高斯过程扰动模型分析和解释迭代平均的泛化性能。我们从我们的理论结果中获得了三个现象\姓名:}(1)将迭代平均值(ia)与大型学习率和正则化进行了改进的正规化的重要性。 (2)对较少频繁平均的理由。 (3)我们预计自适应梯度方法同样地工作,或者更好,而不是其非自适应对应物的迭代平均值。灵感来自这些结果\姓据{,一起与}对迭代解决方案多样性的适当正则化的重要性,我们提出了两个具有迭代平均的自适应算法。与随机梯度下降(SGD)相比,这些结果具有明显更好的结果,需要较少调谐并且不需要早期停止或验证设定监视。我们在各种现代和古典网络架构上展示了我们对CiFar-10/100,Imagenet和Penn TreeBank数据集的方法的疗效。
translated by 谷歌翻译
经过深入的研究,最低限度的损失景观的局部形状,尤其是平坦度对于深层模型的概括起重要作用。我们开发了一种称为POF的培训算法:特征提取器的训练后培训,该培训更新了已经训练的深层模型的特征提取器部分,以搜索最小的最小值。特征是两倍:1)特征提取器在高层参数空间中的参数扰动下受到训练,基于表明使更高层参数空间变平的观测值,以及2)扰动范围以数据驱动的方式确定旨在减少由正损失曲率引起的一部分测试损失。我们提供了理论分析,该分析表明所提出的算法隐含地减少了目标Hessian组件以及损失。实验结果表明,POF仅针对CIFAR-10和CIFAR-100数据集的基线方法提高了模型性能,仅用于10个上学后培训,以及用于50个上学后培训的SVHN数据集。源代码可用:\ url {https://github.com/densoitlab/pof-v1
translated by 谷歌翻译
在神经网络的经验风险景观中扁平最小值的性质已经讨论了一段时间。越来越多的证据表明他们对尖锐物质具有更好的泛化能力。首先,我们讨论高斯混合分类模型,并分析显示存在贝叶斯最佳点估算器,其对应于属于宽平区域的最小值。可以通过直接在分类器(通常是独立的)或学习中使用的可分解损耗函数上应用最大平坦度算法来找到这些估计器。接下来,我们通过广泛的数值验证将分析扩展到深度学习场景。使用两种算法,熵-SGD和复制-SGD,明确地包括在优化目标中,所谓的非局部平整度措施称为本地熵,我们一直提高常见架构的泛化误差(例如Resnet,CeffectnNet)。易于计算的平坦度测量显示与测试精度明确的相关性。
translated by 谷歌翻译
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
translated by 谷歌翻译
如何训练深度神经网络(DNNS)很好地概括了深度学习的核心问题,尤其是对于当今严重的过度参数化网络。在本文中,我们提出了一种有效的方法来通过对优化过程中损失函数的梯度规范进行惩罚来改善模型的概括。我们证明,限制损失功能的梯度规范可以帮助引导优化者找到平坦的最小值。我们利用一阶近似来有效地实现相应的梯度,以适应梯度下降框架。在我们的实验中,我们确认使用我们的方法时,可以在不同的数据集中改善各种模型的概括性能。另外,我们表明,最近的清晰度最小化方法(Foret等,2021)是我们方法的特殊情况,但不是最好的情况,我们方法的最佳情况可以给出新的最先进的性能在这些任务上。代码可从{https://github.com/zhaoyang-0204/gnp}获得。
translated by 谷歌翻译
Real-world datasets exhibit imbalances of varying types and degrees. Several techniques based on re-weighting and margin adjustment of loss are often used to enhance the performance of neural networks, particularly on minority classes. In this work, we analyze the class-imbalanced learning problem by examining the loss landscape of neural networks trained with re-weighting and margin-based techniques. Specifically, we examine the spectral density of Hessian of class-wise loss, through which we observe that the network weights converge to a saddle point in the loss landscapes of minority classes. Following this observation, we also find that optimization methods designed to escape from saddle points can be effectively used to improve generalization on minority classes. We further theoretically and empirically demonstrate that Sharpness-Aware Minimization (SAM), a recent technique that encourages convergence to a flat minima, can be effectively used to escape saddle points for minority classes. Using SAM results in a 6.2\% increase in accuracy on the minority classes over the state-of-the-art Vector Scaling Loss, leading to an overall average increase of 4\% across imbalanced datasets. The code is available at: https://github.com/val-iisc/Saddle-LongTail.
translated by 谷歌翻译
对抗性培训(AT)已成为培训强大网络的热门选择。然而,它倾向于牺牲清洁精度,以令人满意的鲁棒性,并且遭受大的概括误差。为了解决这些问题,我们提出了平稳的对抗培训(SAT),以我们对损失令人歉端的损失的终人谱指导。 We find that curriculum learning, a scheme that emphasizes on starting "easy" and gradually ramping up on the "difficulty" of training, smooths the adversarial loss landscape for a suitably chosen difficulty metric.我们展示了对普通环境中的课程学习的一般制定,并提出了一种基于最大Hessian特征值(H-SAT)和软MAX概率(P-SA)的两个难度指标。我们展示SAT稳定网络培训即使是大型扰动规范,并且允许网络以更好的清洁精度运行而与鲁棒性权衡曲线相比。与AT,交易和其他基线相比,这导致清洁精度和鲁棒性的显着改善。为了突出一些结果,我们的最佳模型将分别在CIFAR-100上提高6%和1%的稳健准确性。在Imagenette上,一个十一级想象成的子集,我们的模型分别以正常和强大的准确性达到23%和3%。
translated by 谷歌翻译
众所周知,随机梯度噪声(SGN)是深度学习的隐式正则化,对于深层网络的优化和概括至关重要。一些作品试图通过注入随机噪声来改善深度学习来人为地模拟SGN。但是,事实证明,注入的简单随机噪声不能像sgn一样工作,而sgn是各向异性和参数依赖性的。为了以低计算成本模拟SGN,并且在不更改学习率或批处理大小的情况下,我们提出了正面的动量(PNM)方法,这是经典优化器中常规动量的强大替代方法。引入的PNM方法维持两个近似独立的动量项。然后,我们可以通过调整动量差异来明确控制SGN的大小。从理论上讲,我们证明了PNM比随机梯度下降(SGD)的收敛保证和概括优势。通过将PNM与动量和Adam合并到两个常规优化器SGD中,我们的广泛实验在经验上验证了基于PNM的变体的显着优势,而不是相应的常规动量基于动量的优化器。
translated by 谷歌翻译
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
translated by 谷歌翻译
清晰度感知最小化(SAM)和自适应清晰度感知最小化(ASAM)旨在改善模型的概括。在这个项目中,我们提出了三个实验,以从清晰度意识到的角度有效地概括它们。我们的实验表明,基于清晰度的优化技术可以帮助提供具有强大概括能力的模型。我们的实验还表明,ASAM可以改善对非归一化数据的概括性能,但是需要进一步的研究来确认这一点。
translated by 谷歌翻译
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
translated by 谷歌翻译
It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ( ∞ and 2 ). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/ locuslab/robust_overfitting.
translated by 谷歌翻译
我们研究了使用尖刺,现场依赖的随机矩阵理论研究迷你批次对深神经网络损失景观的影响。我们表明,批量黑森州的极值值的大小大于经验丰富的黑森州。我们还获得了类似的结果对Hessian的概括高斯牛顿矩阵近似。由于我们的定理,我们推导出作为批量大小的最大学习速率的分析表达式,为随机梯度下降(线性缩放)和自适应算法(例如ADAM(Square Root Scaling)提供了通知实际培训方案,例如光滑,非凸深神经网络。虽然随机梯度下降的线性缩放是在我们概括的更多限制性条件下导出的,但是适应优化者的平方根缩放规则是我们的知识,完全小说。随机二阶方法和自适应方法的百分比,我们得出了最小阻尼系数与学习率与批量尺寸的比率成比例。我们在Cifar-$ 100 $和ImageNet数据集上验证了我们的VGG / WimerEsnet架构上的索赔。根据我们对象检的调查,我们基于飞行学习率和动量学习者开发了一个随机兰齐齐竞争,这避免了对这些关键的超参数进行昂贵的多重评估的需求,并在预残留的情况下显示出良好的初步结果Cifar的architecure - $ 100 $。
translated by 谷歌翻译
彩票票证假设(LTH)引起了人们的关注,因为它可以解释为什么过度参数化模型通常显示出很高的概括能力。众所周知,当我们使用迭代幅度修剪(IMP)时,这是一种算法,可以找到具有高概括能力的稀疏网络,可以独立从初始权重训练,称为获胜票,最初的大型学习率在深层神经网络,例如重新连接。但是,由于最初的较大学习率通常有助于优化器收敛到平坦的最小值,因此我们假设获胜票的最小值相对较高,这在概括能力方面被认为是不利的。在本文中,我们证实了这一假设,并表明Pac-Bayesian理论可以对LTH与概括行为之间的关系有明确的理解。根据我们的实验发现,平坦度可用于提高标签噪声的准确性和稳健性,并且与初始权重的距离深深涉及获胜的门票,我们提供了使用尖峰和slab分布的PAC-Bayes绑定到的pac-bayes分析获胜门票。最后,我们重新审视了现有的算法,以从Pac-Bayesian的角度查找获奖门票,并对这些方法提供新的见解。
translated by 谷歌翻译
差异隐私(DP)提供了正式的隐私保证,以防止对手可以访问机器学习模型,从而从提取有关单个培训点的信息。最受欢迎的DP训练方法是差异私有随机梯度下降(DP-SGD),它通过在训练过程中注入噪声来实现这种保护。然而,以前的工作发现,DP-SGD通常会导致标准图像分类基准的性能显着降解。此外,一些作者假设DP-SGD在大型模型上固有地表现不佳,因为保留隐私所需的噪声规范与模型维度成正比。相反,我们证明了过度参数化模型上的DP-SGD可以比以前想象的要好得多。将仔细的超参数调整与简单技术结合起来,以确保信号传播并提高收敛速率,我们获得了新的SOTA,而没有额外数据的CIFAR-10,在81.4%的81.4%下(8,10^{ - 5}) - 使用40 -layer wide-Resnet,比以前的SOTA提高了71.7%。当对预训练的NFNET-F3进行微调时,我们在ImageNet(0.5,8*10^{ - 7})下达到了83.8%的TOP-1精度。此外,我们还在(8,8 \ cdot 10^{ - 7})下达到了86.7%的TOP-1精度,DP仅比当前的非私人SOTA仅4.3%。我们认为,我们的结果是缩小私人图像分类和非私有图像分类之间准确性差距的重要一步。
translated by 谷歌翻译