With the recent demand for deploying neural network models on mobile and edge devices, it is desirable to improve a model's generalizability to unseen test data as well as its robustness under fixed-point quantization for efficient deployment. Minimizing the training loss, however, provides few guarantees on either generalization or quantization performance. In this work, we fulfill the need to improve generalization and quantization performance simultaneously by theoretically unifying them under a single framework of improving the model's robustness against bounded weight perturbations and minimizing the eigenvalues of the Hessian matrix with respect to the model weights. We therefore propose HERO, a Hessian-enhanced robust optimization method, which minimizes the Hessian eigenvalues through a gradient-based training process and thereby improves generalization and quantization performance at the same time. HERO achieves up to a 3.8% gain in test accuracy, up to 30% higher accuracy under 80% training-label perturbation, and the best post-training quantization accuracy across a wide range of precisions, including a >10% accuracy improvement over SGD-trained models, for common model architectures on various datasets.
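HERO's core ingredient is a regularizer on the Hessian eigenvalues of the loss with respect to the weights. As a rough illustration of how such a term can be driven by gradient-based training, the sketch below estimates the top Hessian eigenvalue with power iteration over Hessian-vector products and adds it to the loss; the function names, the number of power-iteration steps, and the weighting `lam` are assumptions for illustration, not the authors' implementation.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=5):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products (illustrative sketch)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / v_norm for u in v]
    eig = loss.new_zeros(())
    for _ in range(iters):
        # Hessian-vector product: d/dparams of (grads . v)
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True, create_graph=True)
        # Rayleigh quotient v^T H v with ||v|| = 1 (differentiable w.r.t. params)
        eig = sum((h * u).sum() for h, u in zip(hv, v))
        hv_norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-12
        v = [(h / hv_norm).detach() for h in hv]
    return eig

# Hypothetical usage inside a training step:
#   loss = criterion(model(x), y)
#   total = loss + lam * top_hessian_eigenvalue(loss, list(model.parameters()))
#   total.backward(); optimizer.step()
```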
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam. * Work done as part of the Google AI Residency program.
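The min-max formulation above is typically realized as a two-step update: an ascent step to the (approximate) worst-case weights within an L2 ball of radius rho, followed by a descent step that applies the gradient computed there to the original weights. Below is a minimal PyTorch-style sketch of one such update; it mirrors the published procedure in spirit but is not the authors' released implementation, and the helper name `sam_step` is ours.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One simplified Sharpness-Aware Minimization update."""
    # 1) gradient at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.norm(torch.stack(
        [g.norm(p=2) for g in grads if g is not None]), p=2) + 1e-12

    # 2) perturb weights to the approximate worst case: w + rho * g / ||g||
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = rho * g / grad_norm if g is not None else None
            if e is not None:
                p.add_(e)
            eps.append(e)

    # 3) gradient at the perturbed weights
    model.zero_grad()
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation and take the actual optimizer step
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```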
Mixed-precision quantization has been widely applied on deep neural networks (DNNs) as it leads to significantly better efficiency-accuracy tradeoffs compared to uniform quantization. Meanwhile, determining the exact precision of each layer remains challenging. Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. In this work, we propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability. CSQ stabilizes the bit-level mixed-precision training process with a bi-level gradual continuous sparsification on both the bit values of the quantized weights and the bit selection in determining the quantization precision of each layer. The continuous sparsification scheme enables fully-differentiable training without gradient approximation while achieving an exact quantized model in the end. A budget-aware regularization of total model size enables the dynamic growth and pruning of each layer's precision towards a mixed-precision quantization scheme of the desired size. Extensive experiments show CSQ achieves better efficiency-accuracy tradeoff than previous methods on multiple models and datasets.
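The key mechanism is a continuous relaxation of binary choices (bit values and bit selection) that is gradually hardened during training. A minimal sketch of that idea is below, assuming a sigmoid gate with a temperature `beta` that grows over training and a hypothetical bit-plane decomposition of the weights; the exact parameterization in CSQ differs.

```python
import torch

def soft_gate(s, beta):
    """Continuous relaxation of a binary gate: sigmoid(beta * s).
    As beta grows during training, the gate hardens toward {0, 1}."""
    return torch.sigmoid(beta * s)

def gated_weight(w_bits, gate_scores, beta):
    """Hypothetical illustration: gate the contribution of each bit-plane
    of a weight tensor, so the effective precision can shrink or grow.
    w_bits: list of tensors (one per bit-plane, most to least significant),
    gate_scores: learnable scalars, one per bit-plane."""
    w = torch.zeros_like(w_bits[0])
    for k, (plane, s) in enumerate(zip(w_bits, gate_scores)):
        w = w + soft_gate(s, beta) * plane / (2 ** k)
    return w
```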
Network quantization is an effective compression method for reducing model size and computational cost. Despite the high compression ratio, training a low-precision model is difficult due to the discrete and non-differentiable nature of quantization, leading to considerable performance degradation. Recently, Sharpness-Aware Minimization (SAM) was proposed to improve the generalization performance of models by simultaneously minimizing the loss value and the loss curvature. In this paper, we devise a Sharpness-Aware Quantization (SAQ) method to train quantized models, leading to better generalization performance. Moreover, since each layer contributes differently to the loss value and the loss sharpness of a network, we further devise an effective method that learns a configuration generator to automatically determine the bit-width configuration of each layer, encouraging lower bits in flat regions and, conversely, higher bits in sharp landscapes, while also promoting the flatness of minima to enable more aggressive quantization. Extensive experiments on CIFAR-100 and ImageNet show the superiority of the proposed methods. For example, our quantized ResNet-18 with a 55.1x reduction in Bit-Operations (BOPs) even outperforms its uniformly quantized counterpart by 0.7% in Top-1 accuracy. Code is available at https://github.com/zhuang-group/saq.
Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Approximation of gradients through the non-differentiable quantization operator is typically achieved using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator. We show this method combines gradient scale and quantization noise in a better optimized way, providing finer-grained estimation of gradients at each weight and activation layer's quantizer bin size. Our controlled noise also contains an implicit curvature term that could encourage flatter minima, which we show is indeed the case in our experiments. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100 and ImageNet benchmarks show that our method obtains state-of-the-art top-1 classification accuracy for uniform (non-mixed-precision) quantization, outperforming previous methods by 0.5-1.2% absolute.
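To make the idea concrete, the sketch below fake-quantizes a weight tensor by blending decaying uniform noise with the true rounding error, so the forward pass approaches hard rounding as training proceeds while gradients flow through the unmodified weights. The specific noise form, schedule, and constant `decay` are assumptions for illustration rather than the paper's exact formulation.

```python
import math
import torch

def noisy_fake_quant(w, step, t, total_steps, decay=4.0):
    """Differentiable stand-in for rounding: add quantization-error-aware
    noise whose amplitude decays exponentially over training.
    w: weight tensor, step: quantization step size, t: current step."""
    alpha = math.exp(-decay * t / total_steps)              # decaying amplitude
    q_err = (torch.round(w / step) - w / step).detach()     # true rounding error
    noise = torch.rand_like(w) - 0.5                        # uniform in [-0.5, 0.5]
    # early in training: mostly random noise; late: mostly the true error,
    # so the forward pass converges to exact quantization.
    return w + step * (alpha * noise + (1 - alpha) * q_err)
```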
Sharpness-Aware Minimization (SAM) is a recent training method that relies on worst-case weight perturbations and significantly improves generalization in various settings. We argue that the existing justifications for the success of SAM, which are based on a PAC-Bayes generalization bound and the idea of convergence to flat minima, are incomplete. Moreover, there is no explanation for the success of using m-sharpness in SAM, which has been shown to be essential for generalization. To better understand this aspect of SAM, we theoretically analyze its implicit bias for diagonal linear networks. We prove that, for a certain class of problems, SAM always chooses a solution that enjoys better generalization properties than standard gradient descent, and that this effect is amplified by using m-sharpness. We further study the properties of this implicit bias on non-linear networks empirically, where we show that fine-tuning a standard model with SAM can lead to significant generalization improvements. Finally, we provide convergence results for SAM on non-convex objectives when used with stochastic gradients. We illustrate these results empirically for deep networks and discuss their relation to the generalization behavior of SAM. The code of our experiments is available at https://github.com/tml-epfl/understanding-sam.
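m-sharpness computes the worst-case perturbation separately on micro-batches of size m rather than on the full batch, and averages the resulting perturbed gradients. A simplified sketch of that loop is shown below; it assumes every parameter receives a gradient, omits the optimizer step, and is not the authors' released code.

```python
import torch

def m_sharpness_grads(model, loss_fn, x, y, m, rho=0.05):
    """Average the SAM gradients computed on micro-batches of size m
    (illustrative sketch of m-sharpness)."""
    accum = [torch.zeros_like(p) for p in model.parameters()]
    chunks = list(zip(x.split(m), y.split(m)))
    for xb, yb in chunks:
        # gradient on this micro-batch
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
        eps = [rho * g / norm for g in grads]
        # ascend, re-evaluate the gradient, descend back
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.add_(e)
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)
        for a, p in zip(accum, model.parameters()):
            a.add_(p.grad.detach() / len(chunks))
    return accum  # average gradient over micro-batch-wise perturbations
```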
Adversarial training (AT) has become a popular choice for training robust networks. However, it tends to sacrifice clean accuracy in exchange for satisfactory robustness, and it suffers from a large generalization error. To address these concerns, we propose Smoothed Adversarial Training (SAT), guided by our analysis of the eigenspectrum of the loss landscape. We find that curriculum learning, a scheme that emphasizes starting "easy" and gradually ramping up the "difficulty" of training, smooths the adversarial loss landscape for a suitably chosen difficulty metric. We present a general formulation of curriculum learning in the adversarial setting and propose two difficulty metrics based on the maximal Hessian eigenvalue (H-SAT) and the softmax probability (P-SAT). We show that SAT stabilizes network training even for large perturbation norms and allows the network to operate at a better clean-accuracy-versus-robustness trade-off curve. This leads to significant improvements in both clean accuracy and robustness compared to AT, TRADES, and other baselines. To highlight a few results, our best model improves clean and robust accuracy on CIFAR-100 by 6% and 1%, respectively. On Imagenette, a ten-class subset of ImageNet, our model improves clean and robust accuracy by 23% and 3%, respectively.
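As an illustration of a softmax-probability difficulty metric in the spirit of P-SAT, the sketch below scores an adversarial example as harder when the model assigns lower probability to its true class, and gates training examples by a curriculum level that is ramped up over training. The exact metric and scheduling used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def softmax_difficulty(model, x_adv, y):
    """Difficulty of an adversarial example: 1 minus the softmax
    probability of the true class (illustrative definition only)."""
    with torch.no_grad():
        probs = F.softmax(model(x_adv), dim=1)
        p_true = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    return 1.0 - p_true            # in [0, 1]; larger = harder

def curriculum_mask(difficulty, level):
    """Keep only examples no harder than the current curriculum level,
    which is ramped from 0 toward 1 over training."""
    return (difficulty <= level).float()
```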
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitude of the extremal eigenvalues of the batch Hessian is larger than that of the empirical Hessian. We obtain similar results for the generalized Gauss-Newton approximation of the Hessian. As a consequence of our theorems, we derive an analytical expression for the maximal learning rate as a function of batch size, informing practical training regimes for both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square-root scaling) on smooth, non-convex deep neural networks. While the linear scaling rule for stochastic gradient descent has previously been derived under more restrictive conditions, which we generalize, the square-root scaling rule for adaptive optimizers is, to our knowledge, entirely novel. For stochastic second-order and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims on VGG/WideResNet architectures on the CIFAR-100 and ImageNet datasets. Based on our investigation of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature based on-the-fly learning-rate and momentum learner, which avoids the need for expensive repeated evaluations of these key hyperparameters and shows good preliminary results on a pre-residual architecture for CIFAR-100.
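The two scaling rules can be summarized in a small helper: the maximal stable learning rate grows linearly with batch size for SGD and with the square root of batch size for adaptive optimizers such as Adam. The helper below is a convenience for applying those rules, not code from the paper.

```python
def scaled_lr(base_lr, base_batch, batch, optimizer="sgd"):
    """Scale a reference learning rate to a new batch size:
    linear scaling for SGD, square-root scaling for adaptive optimizers."""
    ratio = batch / base_batch
    if optimizer == "sgd":
        return base_lr * ratio          # linear scaling
    return base_lr * ratio ** 0.5       # square-root scaling (e.g., Adam)
```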
Robust quantization improves the tolerance of networks to various implementations, allowing reliable outputs under different bit-widths or fragmented low-precision arithmetic. In this work, we perform extensive analyses to identify the sources of quantization error and present three insights for robustifying a network against quantization: reduction of error propagation, range clamping for error minimization, and inherited robustness against quantization. Based on these insights, we propose two novel methods called symmetry regularization (SymReg) and saturating nonlinearity (SatNL). Applying the proposed methods during training enhances the robustness of arbitrary neural networks against quantization under existing post-training quantization (PTQ) and quantization-aware training (QAT) algorithms and under various conditions. We conduct extensive studies on the CIFAR and ImageNet datasets and validate the effectiveness of the proposed methods.
The local shape of the loss landscape at a minimum, in particular its flatness, has been studied intensively and is known to play an important role in the generalization of deep models. We develop a training algorithm called PoF: post-training of the feature extractor, which updates the feature extractor part of an already-trained deep model to search for a flatter minimum. Its characteristics are twofold: 1) the feature extractor is trained under parameter perturbations in the higher-layer parameter space, based on observations suggesting that the higher-layer parameter space should be flattened, and 2) the perturbation range is determined in a data-driven manner, aiming to reduce the part of the test loss caused by positive loss curvature. We provide a theoretical analysis showing that the proposed algorithm implicitly reduces the target Hessian components as well as the loss. Experimental results show that PoF improves model performance over baseline methods on the CIFAR-10 and CIFAR-100 datasets with only 10 epochs of post-training, and on the SVHN dataset with 50 epochs of post-training. Source code is available at https://github.com/densoitlab/pof-v1
Real-world datasets exhibit imbalances of varying types and degrees. Several techniques based on re-weighting and margin adjustment of loss are often used to enhance the performance of neural networks, particularly on minority classes. In this work, we analyze the class-imbalanced learning problem by examining the loss landscape of neural networks trained with re-weighting and margin-based techniques. Specifically, we examine the spectral density of Hessian of class-wise loss, through which we observe that the network weights converge to a saddle point in the loss landscapes of minority classes. Following this observation, we also find that optimization methods designed to escape from saddle points can be effectively used to improve generalization on minority classes. We further theoretically and empirically demonstrate that Sharpness-Aware Minimization (SAM), a recent technique that encourages convergence to flat minima, can be effectively used to escape saddle points for minority classes. Using SAM results in a 6.2% increase in accuracy on the minority classes over the state-of-the-art Vector Scaling Loss, leading to an overall average increase of 4% across imbalanced datasets. The code is available at: https://github.com/val-iisc/Saddle-LongTail.
Recent studies show that deep neural networks (DNNs) are extremely vulnerable to elaborately designed adversarial examples. Adversarial learning on such adversarial examples has proven to be one of the most effective ways to defend against these attacks. At present, most existing adversarial example generation methods are based on first-order gradients, which can hardly improve model robustness further, especially when facing second-order adversarial attacks. Compared with first-order gradients, second-order gradients provide a more accurate approximation of the loss landscape around natural examples. Inspired by this, our work crafts second-order adversarial examples and uses them to train DNNs. However, second-order optimization involves a time-consuming computation of the Hessian inverse. We propose an approximation method that transforms the problem into an optimization in the Krylov subspace, which remarkably reduces the computational complexity and speeds up the training procedure. Extensive experiments conducted on the MNIST and CIFAR-10 datasets show that our adversarial learning with second-order adversarial examples outperforms other first-order methods and improves model robustness against a wide range of attacks.
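Krylov-subspace methods only require repeated Hessian-vector products rather than an explicit Hessian inverse. The sketch below shows the basic building block, a Hessian-vector product of the loss with respect to the input computed by double backpropagation; the surrounding Krylov iteration and the full second-order attack from the paper are not reproduced here.

```python
import torch

def input_hvp(model, loss_fn, x, y, v):
    """Hessian-vector product of the loss w.r.t. the *input* x, computed
    with double backpropagation. `v` has the same shape as x."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)
    (hv,) = torch.autograd.grad((grad * v).sum(), x)
    return hv
```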
Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques that creates an enormous search space. To tackle this difficulty, a body of literature has emerged recently, and several frameworks achieving promising accuracy results have been proposed. In this paper, we first summarize the quantization techniques commonly used in the literature. We then present a thorough survey of mixed-precision frameworks, categorized by their optimization techniques, such as reinforcement learning, and their quantization techniques, such as deterministic rounding. Furthermore, we discuss the advantages and shortcomings of each framework and present them side by side. We finally give guidelines for future mixed-precision frameworks.
Deep learning has achieved promising results on a wide spectrum of AI applications. Larger datasets and models consistently yield better performance, but generally at the cost of longer training time, more computation, and more communication. In this survey, we aim to provide a clear sketch of the optimizations for large-scale deep learning with regard to both model accuracy and model efficiency. We investigate the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and review state-of-the-art strategies for addressing communication overhead and reducing memory footprints.
Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions occur which deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In particular, digital noise corruptions arise commonly in practice during the image acquisition stage and present a significant challenge for current robustness approaches. In this paper, we propose a diverse Gaussian noise consistency regularization method for improving robustness of image classifiers under a variety of noise corruptions while still maintaining high clean accuracy. We derive bounds to motivate and understand the behavior of our Gaussian noise consistency regularization using a local loss landscape analysis. We show that this simple approach improves robustness against various unforeseen noise corruptions by 4.2-18.4% over adversarial training and other strong diverse data augmentation baselines across several benchmarks. Furthermore, when combined with state-of-the-art diverse data augmentation techniques, experiments against state-of-the-art show our method further improves robustness accuracy by 3.7% and uncertainty calibration by 5.5% for all common corruptions on several image classification benchmarks.
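A minimal sketch of a diverse Gaussian noise consistency regularizer is shown below: it matches predictions on inputs corrupted at several noise levels to the clean predictions via a KL term. The particular noise levels and the KL form are assumptions for illustration and not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gaussian_consistency_loss(model, x, sigmas=(0.05, 0.1, 0.2)):
    """Encourage predictions on Gaussian-noise-corrupted inputs to match
    the clean predictions (illustrative consistency regularizer)."""
    with torch.no_grad():
        p_clean = F.softmax(model(x), dim=1)
    loss = x.new_zeros(())
    for sigma in sigmas:
        x_noisy = x + sigma * torch.randn_like(x)
        log_p_noisy = F.log_softmax(model(x_noisy), dim=1)
        loss = loss + F.kl_div(log_p_noisy, p_clean, reduction="batchmean")
    return loss / len(sigmas)
```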
State-of-the-art classifiers have been shown to be largely vulnerable to adversarial perturbations. One of the most effective strategies to improve robustness is adversarial training. In this paper, we investigate the effect of adversarial training on the geometry of the classification landscape and decision boundaries. We show in particular that adversarial training leads to a significant decrease in the curvature of the loss surface with respect to inputs, leading to a drastically more "linear" behaviour of the network. Using a locally quadratic approximation, we provide theoretical evidence on the existence of a strong relation between large robustness and small curvature. To further show the importance of reduced curvature for improving the robustness, we propose a new regularizer that directly minimizes curvature of the loss surface, and leads to adversarial robustness that is on par with adversarial training. Besides being a more efficient and principled alternative to adversarial training, the proposed regularizer confirms our claims on the importance of exhibiting quasi-linear behavior in the vicinity of data points in order to achieve robustness.
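A curvature regularizer of this kind can be approximated with finite differences of input gradients, avoiding explicit second derivatives. The sketch below penalizes how much the input gradient changes along the gradient-sign direction over a step of size h; the direction choice and step size are illustrative assumptions in the spirit of the proposed regularizer, not the authors' exact code.

```python
import torch

def curvature_penalty(model, loss_fn, x, y, h=1.5):
    """Finite-difference curvature penalty: approximately ||H z||^2, where
    H is the input Hessian and z the normalized gradient-sign direction."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    z = torch.sign(g).detach()
    z = z / (z.flatten(1).norm(dim=1).view(-1, *([1] * (z.dim() - 1))) + 1e-12)
    loss_p = loss_fn(model(x + h * z), y)
    (g_p,) = torch.autograd.grad(loss_p, x, create_graph=True)
    return ((g_p - g).flatten(1).norm(dim=1) ** 2).mean()
```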
We analyze and explain the improved generalization performance of iterate averaging using a Gaussian-process perturbation model between the true and batch risk surfaces of a high-dimensional quadratic. From our theoretical results we derive three phenomena: (1) the importance of combining iterate averaging (IA) with large learning rates and regularization for improved regularization; (2) a justification for less frequent averaging; (3) that we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with the importance of appropriate regularization for the diversity of the iterate solutions, we propose two adaptive algorithms with iterate averaging. These give significantly better results than stochastic gradient descent (SGD), require less tuning, and do not require early stopping or validation-set monitoring. We showcase the efficacy of our approaches on the CIFAR-10/100, ImageNet, and Penn Treebank datasets across a variety of modern and classical network architectures.
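The backbone of these methods is plain iterate averaging: keep a running average of the weights, updated only every few optimizer steps, and evaluate with the averaged weights. The class below is a minimal sketch of that bookkeeping; the adaptive variants and the regularization choices proposed in the paper are built on top of this and are not shown.

```python
import torch

class IterateAverager:
    """Running average of model weights, updated every `every` steps."""
    def __init__(self, model, every=10):
        self.every = every
        self.count = 0
        self.steps = 0
        self.avg = {k: v.detach().clone() for k, v in model.state_dict().items()}

    def update(self, model):
        self.steps += 1
        if self.steps % self.every:
            return                       # less frequent averaging
        self.count += 1
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.avg[k] += (v.detach() - self.avg[k]) / self.count
            else:
                self.avg[k] = v.detach().clone()   # e.g., batch-norm counters

    def load_into(self, model):
        """Load the averaged weights for evaluation."""
        model.load_state_dict(self.avg)
```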
Studying the sensitivity of neural networks to weight perturbations and its impact on model performance, including generalization and robustness, is an active research topic due to its implications for a wide range of machine learning tasks such as model compression, generalization-gap assessment, and adversarial attacks. In this paper, we provide the first integrated study and analysis of feed-forward neural networks in terms of both their robustness and their generalization behavior under weight perturbations. We further design a new theory-driven loss function for training neural networks that generalize well and are robust to weight perturbations. Empirical experiments are conducted to validate our theoretical analysis. Our results offer fundamental insights for characterizing the generalization and robustness of neural networks against weight perturbations.
Neural network quantization aims to transform the high-precision weights and activations of a given neural network into low-precision weights and activations in order to reduce memory usage and computation while preserving the performance of the original model. However, extreme quantization (1-bit weights / 1-bit activations) of compactly designed backbone architectures (e.g., MobileNets), which are often used for edge-device deployment, results in severe performance degradation. This paper proposes a novel quantization-aware training (QAT) method that can effectively alleviate this degradation even under extreme quantization by focusing on the inter-weight dependencies between the weights within and across layers. To minimize the quantization impact of each weight on the others, we perform an orthogonal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, so that each weight is disentangled from the others. We then quantize the weights based on their importance so as to minimize the loss of information from the original weights and activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that the quantization of each layer reflects the quantized distributions of the weights and activations of the preceding layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates the performance degradation on ImageNet and successfully preserves full-precision model performance with compact backbone networks on CIFAR-100.
How to train deep neural networks (DNNs) that generalize well is a central question in deep learning, especially for today's severely overparameterized networks. In this paper, we propose an effective method to improve model generalization by penalizing the gradient norm of the loss function during optimization. We demonstrate that constraining the gradient norm of the loss helps guide the optimizer toward flat minima. We leverage a first-order approximation to efficiently implement the corresponding gradient so that it fits into the gradient descent framework. In our experiments, we confirm that our method improves the generalization performance of various models on different datasets. We also show that the recent sharpness-aware minimization method (Foret et al., 2021) is a special, but not the best, case of our method, and that the best case of our method can yield new state-of-the-art performance on these tasks. Code is available at https://github.com/zhaoyang-0204/gnp.
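The objective amounts to adding a penalty on the parameter-gradient norm to the usual loss. The sketch below implements it directly with double backpropagation for clarity, whereas the paper uses a cheaper first-order (finite-difference) approximation; the weighting `lam` is an illustrative choice.

```python
import torch

def gradient_norm_penalized_loss(model, loss_fn, x, y, lam=0.1):
    """Return loss + lam * ||grad of loss w.r.t. parameters||_2.
    Calling .backward() on the result differentiates through the norm."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads) + 1e-12)
    return loss + lam * grad_norm
```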