L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L2 regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available online.
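The inequivalence is easiest to see in code. The sketch below is a minimal illustration, not the paper's reference implementation: the first variant folds the L2 penalty into the gradient, so the decay term is rescaled by Adam's adaptive denominator, while the second applies the decay directly to the weights after the adaptive step.

```python
import numpy as np

def adam_step_l2(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Adam with L2 regularization: the decay term is added to the gradient,
    so it is rescaled by the adaptive denominator like any other gradient."""
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adam_step_decoupled(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                        eps=1e-8, wd=1e-2):
    """Decoupled weight decay: the Adam step uses the loss gradient only,
    and the decay is applied directly to the weights afterwards."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v
```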
Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR
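A minimal sketch of a cosine-annealed schedule with warm restarts, in the spirit of the technique described above; the initial restart length `t_0` and the period multiplier `t_mult` are illustrative choices, not prescriptions from the paper.

```python
import math

def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.1, t_0=10, t_mult=2):
    """Cosine annealing with warm restarts: within each restart period of
    length T_i, decay from eta_max to eta_min along half a cosine, then jump
    back to eta_max; t_mult grows the period after each restart."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:          # locate the current restart period
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

# Example: learning rates for the first 30 epochs, with a restart at epoch 10.
schedule = [sgdr_learning_rate(e) for e in range(30)]
```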
We analyze and explain the generalization performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surfaces on high-dimensional quadratics. We derive three phenomena from our theoretical results: (1) the importance of combining iterate averaging (IA) with large learning rates and regularization for improved regularization; (2) a justification for less frequent averaging; (3) that we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with the importance of appropriate regularization for the diversity of the iterate solutions, we propose two adaptive algorithms with iterate averaging. These give significantly better results than stochastic gradient descent (SGD), require less tuning, and do not require early stopping or validation set monitoring. We showcase the efficacy of our approaches on the CIFAR-10/100, ImageNet and Penn Treebank datasets, on a range of modern and classical network architectures.
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
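A minimal sketch of the two-weight scheme, with the inner optimizer abstracted as a `fast_step` callable; the values of k and alpha shown here are commonly quoted defaults and should be treated as assumptions.

```python
import numpy as np

def lookahead(w, grad_fn, fast_step, num_outer, k=5, alpha=0.5):
    """Lookahead: run k 'fast' inner-optimizer steps, then move the slow
    weights a fraction alpha toward where the fast weights ended up."""
    slow = np.copy(w)
    for _ in range(num_outer):
        fast = np.copy(slow)
        for _ in range(k):                     # fast weights explore ahead
            fast = fast_step(fast, grad_fn(fast))
        slow = slow + alpha * (fast - slow)    # slow-weight interpolation
    return slow
```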
It is well known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential to both the optimization and the generalization of deep networks. Some works have attempted to artificially simulate SGN by injecting random noise to improve deep learning. However, it turns out that simple injected random noise cannot work as well as SGN, which is anisotropic and parameter-dependent. To simulate SGN at low computational cost, and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to the conventional momentum used in classical optimizers. The introduced PNM method maintains two approximately independent momentum terms, which allows us to control the magnitude of SGN explicitly by adjusting the momentum difference. Theoretically, we prove a convergence guarantee and a generalization advantage of PNM over stochastic gradient descent (SGD). By incorporating PNM into the two conventional optimizers, SGD with momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding conventional momentum-based optimizers.
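The core idea, two momentum buffers whose difference injects controllable, gradient-shaped noise, can be illustrated schematically as below. This is not the paper's exact update rule; the alternating buffer updates, the combination coefficients, and the absence of any normalization are simplifications assumed for illustration.

```python
import numpy as np

def pnm_like_step(w, g, m_a, m_b, t, lr=0.1, beta=0.9, beta0=1.0):
    """Schematic positive-negative momentum: two buffers are updated on
    alternating steps, so they see approximately independent mini-batches.
    The update combines them with a positive and a negative coefficient;
    a larger beta0 amplifies the difference between the buffers and hence
    the injected gradient-noise component."""
    if t % 2 == 0:
        m_a = beta * m_a + (1 - beta) * g      # buffer A sees even steps
    else:
        m_b = beta * m_b + (1 - beta) * g      # buffer B sees odd steps
    m = (1 + beta0) * m_a - beta0 * m_b        # positive-negative combination
    w = w - lr * m
    return w, m_a, m_b
```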
Efficiently approximating local curvature information of the loss function is a key tool for the optimization and compression of deep neural networks. Yet most existing methods for approximating second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating inverse-Hessian-vector products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension d, if the Hessian is given as a sum of m rank-one matrices, using O(dm^2) precomputation, O(dm) cost for computing the IHVP, and O(m) cost for querying any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost O(dm + m^2) for computing the IHVP and O(dm + m^3) for adding or removing any gradient from the sliding window. Both algorithms yield state-of-the-art results for network pruning and for optimization with low computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].
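A small dense sketch of the underlying linear algebra, assuming the damped empirical-Fisher form of the Hessian and using the Woodbury identity; the paper's optimized recursive algorithms are not reproduced here.

```python
import numpy as np

def ihvp_rank_m(G, x, damp=1e-4):
    """Inverse-Hessian-vector product for H ~= damp*I + (1/m) * G^T G, where
    G is an (m, d) matrix of gradients (empirical Fisher), via the Woodbury
    identity: H^{-1} x = (x - G^T (damp*m*I + G G^T)^{-1} G x) / damp.
    Precomputing the (m, m) part once gives O(d m^2) setup and roughly
    O(d m) per subsequent product, matching the costs quoted above."""
    m = G.shape[0]
    K = G @ G.T                                   # (m, m), the O(d m^2) part
    A = damp * m * np.eye(m) + K
    Gx = G @ x                                    # O(d m)
    y = np.linalg.solve(A, Gx)                    # small m x m solve
    return (x - G.T @ y) / damp

# Sanity check on a tiny example against an explicit inverse.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 50))
x = rng.standard_normal(50)
H = 1e-4 * np.eye(50) + G.T @ G / 8
assert np.allclose(ihvp_rank_m(G, x, 1e-4), np.linalg.solve(H, x))
```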
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam.
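The two-step structure of a SAM update can be sketched as follows; this is a simplified, single-batch illustration, and the released code at the URL above is the authoritative implementation.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: (1) climb to the approximate
    worst-case point within an L2 ball of radius rho around w, (2) take the
    descent step at w using the gradient evaluated at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # first-order inner maximizer
    g_sharp = grad_fn(w + eps)                    # gradient at the perturbed point
    return w - lr * g_sharp
```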
The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests that they possess better generalization capabilities with respect to sharp ones. First, we discuss Gaussian mixture classification models and show analytically that there exist Bayes-optimal pointwise estimators which correspond to minimizers belonging to wide flat regions. These estimators can be found by applying maximum-flatness algorithms either directly on the classifier (which is norm independent) or on the differentiable loss function used in learning. Next, we extend the analysis to the deep learning scenario via extensive numerical validations. Using two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include in the optimization objective a non-local flatness measure known as the local entropy, we consistently improve the generalization error for common architectures (e.g., ResNet, EfficientNet). An easy-to-compute flatness measure shows a clear correlation with test accuracy.
Deep learning has achieved promising results in a wide range of AI applications. Larger datasets and models consistently yield better performance. However, they also generally mean longer training times, with more computation and communication. In this survey, we aim to provide a clear sketch of large-scale deep learning optimization in terms of both model accuracy and model efficiency. We survey the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and review the state-of-the-art strategies for addressing communication overhead and reducing memory footprint.
A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, so their gradient optimization dynamics can be represented via spherical optimization with a varying effective learning rate (ELR), which has been studied previously. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail, both through a theoretical examination of toy examples and through a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have close connections with previous studies on the training of both conventional and scale-invariant neural networks. Finally, we demonstrate how the discovered regimes are reflected in the conventional training of normalized networks and how they can be exploited to achieve better optima.
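For readers unfamiliar with the ELR notion used above, the standard argument, recalled here as background rather than quoted from this paper, is that scale invariance forces the gradient to be orthogonal to the weights and to shrink as the weight norm grows, so a Euclidean gradient step acts like a spherical step of size η/‖w‖²:

```latex
f(cw) = f(w)\ \ \forall c > 0
\;\;\Longrightarrow\;\;
\nabla f(cw) = \frac{1}{c}\,\nabla f(w),\qquad
\langle \nabla f(w),\, w \rangle = 0,\qquad
\eta_{\mathrm{eff}} = \frac{\eta}{\lVert w \rVert^{2}}.
```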
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
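A rough sketch of the moment collection and sampling steps, assuming the commonly cited low-rank-plus-diagonal form of the SWAG covariance; the scaling constants and the use of all K collected iterates (rather than a sliding window of deviations) are simplifying assumptions.

```python
import numpy as np

def swag_moments(iterates):
    """Collect SWAG moments from a list of 1-D weight vectors (SGD iterates):
    the running mean (SWA solution), a diagonal second-moment estimate, and
    the deviation matrix D used for the low-rank covariance part."""
    W = np.stack(iterates)                       # (K, d)
    mean = W.mean(axis=0)
    sq_mean = (W ** 2).mean(axis=0)
    diag = np.clip(sq_mean - mean ** 2, 0.0, None)
    D = W - mean                                 # (K, d) deviations
    return mean, diag, D

def swag_sample(mean, diag, D, rng):
    """Draw one weight sample from the approximate posterior
    N(mean, 0.5 * (diag + D^T D / (K - 1)))."""
    K, d = D.shape
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(K)
    return (mean
            + np.sqrt(0.5) * np.sqrt(diag) * z1
            + D.T @ z2 / np.sqrt(2.0 * (K - 1)))
```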
Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of the optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but instead give rise to a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions under which it occurs in practice. We also demonstrate that the periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks, the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as momentum and Adam. We call this stochastic optimization method KOALA, which stands for Kalman Optimization Algorithm with Loss Adaptivity. KOALA is an easy-to-implement, scalable, and efficient method for training neural networks. We provide convergence analysis and show experimentally that it yields parameter estimates that are on par with or better than those of existing state-of-the-art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.
The grokking phenomenon reported by Power et al. [power2021grokking] refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of grokking through a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that, without explicit regularization, grokking as reported in [power2021grokking] almost exclusively happens at the onset of Slingshots and is absent without them. While common and easy to reproduce in more general settings, the Slingshot Mechanism does not follow from any known optimization theories we are aware of and can easily be overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of its origin.
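Monitoring the quantity mentioned above is straightforward; a hedged PyTorch-style snippet, where `model.fc` is a placeholder for however the final layer happens to be exposed:

```python
import torch

def last_layer_norm(model):
    """L2 norm of the final layer's weights; logging this once per epoch makes
    the cyclic norm growth and collapse described above easy to spot."""
    with torch.no_grad():
        params = list(model.fc.parameters())     # 'fc' is a placeholder name
        return torch.sqrt(sum(p.pow(2).sum() for p in params)).item()
```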
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on ImageNet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static.
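The drop/grow topology update driven by parameter magnitudes and occasional dense gradients can be sketched as follows; this is a simplified flat-mask illustration with an arbitrary drop fraction, not the authors' implementation.

```python
import numpy as np

def update_topology(w, mask, dense_grad, drop_frac=0.3):
    """One magnitude/gradient topology update for a sparse layer: drop the
    smallest-magnitude active weights and regrow the same number of
    previously inactive connections with the largest gradient magnitude."""
    active = np.flatnonzero(mask == 1)
    inactive = np.flatnonzero(mask == 0)
    n = int(drop_frac * active.size)
    # Drop: active weights with the smallest magnitude.
    drop = active[np.argsort(np.abs(w[active]))[:n]]
    # Grow: inactive positions with the largest (infrequently computed) gradient.
    grow = inactive[np.argsort(-np.abs(dense_grad[inactive]))[:n]]
    mask[drop] = 0
    w[drop] = 0.0
    mask[grow] = 1
    w[grow] = 0.0                     # new connections start at zero
    return w, mask
```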
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitude of the extremal values of the batch Hessian is larger than that of the empirical Hessian. We derive similar results for the generalized Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive an analytical expression for the maximal learning rate as a function of batch size, informing practical training regimes for both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square-root scaling) on smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived under more restrictive conditions, which we generalize, the square-root scaling rule for adaptive optimizers is, to our knowledge, completely novel. For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims on VGG/WideResNet architectures on the CIFAR-100 and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature based, on-the-fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations of these key hyperparameters and shows good preliminary results on the pre-residual architecture for CIFAR-100.
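The two scaling rules translate into a simple prescription for adjusting the learning rate with batch size; the helper below is an illustrative sketch rather than code from the paper.

```python
def scaled_learning_rate(base_lr, base_batch, batch, optimizer="sgd"):
    """Scale a learning rate tuned at `base_batch` to a new batch size:
    linear scaling for SGD, square-root scaling for adaptive methods such
    as Adam, following the rules described above."""
    ratio = batch / base_batch
    return base_lr * ratio if optimizer == "sgd" else base_lr * ratio ** 0.5

# Example: rates tuned at batch size 128, moved to batch size 1024.
sgd_lr = scaled_learning_rate(0.1, 128, 1024, "sgd")      # 0.8
adam_lr = scaled_learning_rate(0.001, 128, 1024, "adam")  # ~0.0028
```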
Re-initializing a neural network during training has been observed to improve generalization in recent works. Yet it is neither widely adopted in deep learning practice nor frequently used in state-of-the-art training protocols. This raises the question of when re-initialization works, and whether it should be used together with regularization techniques such as data augmentation, weight decay and learning rate schedules. In this work, we conduct an extensive empirical comparison of standard training with a selection of re-initialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, re-initialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of re-initialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, re-initialization significantly improves upon standard training even in the presence of other carefully tuned regularization techniques.
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
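The averaging itself is tiny to implement; a minimal sketch, with the cyclical or constant learning-rate schedule and the batch-norm statistics update that usually accompany SWA omitted:

```python
import numpy as np

def swa_train(w_init, grad_fn, sgd_step, num_steps, swa_start, swa_freq):
    """Stochastic Weight Averaging: run SGD as usual and, after `swa_start`
    steps, fold every `swa_freq`-th iterate into a running average, which is
    returned as the final solution."""
    w = np.copy(w_init)
    w_swa, n = np.copy(w_init), 0
    for t in range(num_steps):
        w = sgd_step(w, grad_fn(w))
        if t >= swa_start and (t - swa_start) % swa_freq == 0:
            n += 1
            w_swa += (w - w_swa) / n            # incremental average of iterates
    return w_swa
```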
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
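The nested-loop structure can be sketched roughly as follows: an inner SGLD loop samples around the current weights and tracks the mean of its iterates, and the outer step moves the weights toward that mean, which corresponds (up to the scope factor gamma) to a step along the negative local-entropy gradient. Constants are illustrative, and the plain running mean simplifies the exponential averaging typically used in practice.

```python
import numpy as np

def entropy_sgd_step(x, grad_fn, rng, lr=0.1, gamma=0.03,
                     inner_steps=20, inner_lr=0.1, noise=1e-4):
    """One Entropy-SGD outer step: the inner Langevin loop samples from a
    local Gibbs measure centered at x and tracks the running mean mu of its
    iterates; the outer update moves x toward mu."""
    xp = np.copy(x)
    mu = np.copy(x)
    for i in range(1, inner_steps + 1):
        g = grad_fn(xp) + gamma * (xp - x)              # gradient of the modified loss
        xp = xp - inner_lr * g \
             + np.sqrt(inner_lr) * noise * rng.standard_normal(x.shape)
        mu += (xp - mu) / (i + 1)                       # running mean of SGLD iterates
    return x - lr * gamma * (x - mu)                    # local-entropy gradient step
```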
Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix they typically require approximations to be computationally tractable. The most successful family of approximations are Kronecker-factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations to establish a surprising result: due to its approximations, KFAC is not closely related to second-order updates, and in particular it significantly outperforms true second-order updates. This challenges widely held beliefs and immediately raises the question of why KFAC performs so well. Towards answering this question, we present strong evidence that KFAC approximates a first-order algorithm which performs gradient descent on neurons rather than on weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data efficiency.
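For context, a generic sketch of what a Kronecker-factored (KFAC-style) update looks like for a single fully connected layer; this is standard background on the approximation family, not this paper's experimental code.

```python
import numpy as np

def kfac_update(dW, A_cov, G_cov, damping=1e-3):
    """KFAC-style preconditioning of a layer gradient dW (out_dim x in_dim):
    the layer's Fisher block is approximated as G_cov (x) A_cov, where A_cov
    is the covariance of the layer inputs and G_cov the covariance of the
    backpropagated pre-activation gradients, so the natural-gradient step
    factorizes into two small damped inverses applied on either side of dW."""
    d_out, d_in = dW.shape
    A_inv = np.linalg.inv(A_cov + damping * np.eye(d_in))
    G_inv = np.linalg.inv(G_cov + damping * np.eye(d_out))
    return G_inv @ dW @ A_inv
```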