Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR
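The abstract above does not spell out the schedule itself. As a minimal sketch, assuming the cosine-annealing form commonly associated with SGDR, the learning rate within each cycle could be computed as follows; the function name, parameter names, and default values are illustrative assumptions, not taken from the authors' code.

```python
import math


def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.1, t_0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (illustrative sketch).

    epoch   : epochs elapsed since training started (may be fractional)
    t_0     : length of the first cycle, in epochs (assumed default)
    t_mult  : integer factor (>= 1) by which each later cycle is lengthened
    """
    t_i = t_0        # length of the current cycle
    t_cur = epoch    # epochs elapsed since the last restart
    # Walk forward through completed cycles to locate the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Anneal from eta_max down towards eta_min within the cycle.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

With these assumed defaults the rate starts at `eta_max`, decays towards `eta_min` over 10 epochs, then jumps back to `eta_max` and decays over 20 epochs, and so on.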
L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L2 regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is
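To make the distinction between the two regularizers concrete, here is a minimal single-parameter sketch of an Adam step. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `l2` and `decoupled_wd` arguments, and the choice to scale the decoupled decay by the learning rate are all assumptions.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update on a scalar weight (illustrative sketch).

    l2           : L2 coefficient; the penalty is added to the gradient and is
                   therefore rescaled by Adam's adaptive denominator.
    decoupled_wd : decoupled weight decay; the weight is shrunk directly,
                   bypassing the adaptive scaling.
    """
    g = grad + l2 * w                       # L2 penalty enters the gradient
    m = beta1 * m + (1 - beta1) * g         # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias corrections
    v_hat = v / (1 - beta2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * decoupled_wd * w      # decay applied to the old weight
    return w - adaptive_step - decay_step, m, v
```

With a zero loss gradient, the L2 variant moves a weight of 1.0 and a weight of 10.0 by roughly the same absolute amount (about `lr`), because the adaptive denominator normalizes the penalty away, whereas the decoupled variant shrinks each weight in proportion to its size.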
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
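The averaging step described above can be sketched as a running mean over weight snapshots. This is a minimal illustration with weights held in plain Python lists; the function name and arguments are assumptions, and details the abstract does not cover (for example, when snapshots are taken) are left out.

```python
def update_swa(swa_weights, weights, n_averaged):
    """Fold one new weight snapshot into a running average (illustrative sketch).

    swa_weights : current averaged weights (list of floats)
    weights     : new snapshot taken along the SGD trajectory
    n_averaged  : number of snapshots already included in swa_weights
    """
    return [(s * n_averaged + w) / (n_averaged + 1)
            for s, w in zip(swa_weights, weights)]


# Usage: average three snapshots collected during training.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swa = [0.0, 0.0]
for n, snapshot in enumerate(snapshots):
    swa = update_swa(swa, snapshot, n)
```

Keeping only the running mean and a counter means the extra memory is one additional copy of the weights, which is consistent with the low overhead the abstract claims.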
The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10, CIFAR-100, and ImageNet. * Equal contribution. Footnote 1: Suppose we have three weight vectors $w_1, w_2, w_3$. We set $u = w_2 - w_1$ and $v = (w_3 - w_1) - \frac{\langle w_3 - w_1, w_2 - w_1 \rangle}{\|w_2 - w_1\|^2} (w_2 - w_1)$. Then the normalized vectors $\hat{u} = u / \|u\|$ and $\hat{v} = v / \|v\|$ form an orthonormal basis in the plane containing $w_1, w_2, w_3$. To visualize the loss in this plane, we define a Cartesian grid in the basis $\hat{u}, \hat{v}$ and evaluate the networks corresponding to each of the points in the grid. A point $P$ with coordinates $(x, y)$ in the plane would then be given by $P = w_1 + x \hat{u} + y \hat{v}$.
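The plane construction in the footnote is a single Gram-Schmidt step. A minimal sketch with weight vectors as plain Python lists follows; the helper names are illustrative assumptions, and the code assumes the three vectors are not collinear.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def plane_basis(w1, w2, w3):
    """Orthonormal basis (u_hat, v_hat) of the plane through w1, w2, w3."""
    u = [b - a for a, b in zip(w1, w2)]            # u = w2 - w1
    d = [c - a for a, c in zip(w1, w3)]            # d = w3 - w1
    coef = dot(d, u) / dot(u, u)                   # projection coefficient of d onto u
    v = [di - coef * ui for di, ui in zip(d, u)]   # remove the component along u
    u_norm = math.sqrt(dot(u, u))
    v_norm = math.sqrt(dot(v, v))                  # zero if the points are collinear
    return [ui / u_norm for ui in u], [vi / v_norm for vi in v]


def point_in_plane(w1, u_hat, v_hat, x, y):
    """Weight vector P = w1 + x * u_hat + y * v_hat."""
    return [a + x * ui + y * vi for a, ui, vi in zip(w1, u_hat, v_hat)]


# Usage with three toy weight vectors.
u_hat, v_hat = plane_basis([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 3.0, 0.0])
p = point_in_plane([0.0, 0.0, 0.0], u_hat, v_hat, 1.0, 3.0)
```

Evaluating the network at `point_in_plane(...)` for every grid coordinate `(x, y)` gives the loss surface plot the footnote describes.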
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
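The abstract describes two sets of weights but not the update rule. The sketch below assumes the commonly described form, in which the slow weights are moved a fraction `alpha` towards the fast weights after every `k` inner steps; plain SGD is used as the inner optimizer, and all names and default values are illustrative assumptions.

```python
def lookahead_sgd(grad_fn, w0, outer_steps=20, k=5, alpha=0.5, lr=0.1):
    """Lookahead wrapped around plain SGD (illustrative sketch).

    grad_fn : function mapping a weight list to its gradient list
    k       : number of fast (inner) steps per slow update (assumed default)
    alpha   : interpolation factor for the slow weights (assumed default)
    """
    slow = list(w0)
    for _ in range(outer_steps):
        fast = list(slow)                      # fast weights start from the slow weights
        for _ in range(k):                     # k steps of the inner optimizer
            g = grad_fn(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # Move the slow weights part of the way towards the final fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


# Usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
result = lookahead_sgd(lambda w: list(w), [1.0, -2.0])
```

The only extra state is one copy of the weights, in line with the negligible memory cost the abstract mentions.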
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior to their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks.
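We analyze and explain the improved generalization performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surfaces on the high-dimensional quadratic. We derive three phenomena from our theoretical results: (1) the importance of combining iterate averaging (IA) with large learning rates and regularization for improved regularization; (2) a justification for less frequent averaging; (3) the expectation that adaptive gradient methods work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with the importance of appropriate regularization for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results than stochastic gradient descent (SGD), require less tuning, and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet, and Penn Treebank datasets on a variety of modern and classical network architectures.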
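The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests that they possess better generalization capabilities than sharp ones. First, we discuss Gaussian mixture classification models and show analytically that there exist Bayes-optimal pointwise estimators which correspond to minimizers belonging to wide flat regions. These estimators can be found by applying maximum-flatness algorithms either directly on the classifier (which is norm independent) or on the differentiable loss function used in learning. Next, we extend the analysis to the deep learning scenario through extensive numerical validation. Using two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include in the optimization objective a non-local flatness measure known as local entropy, we consistently improve the generalization error for common architectures (e.g. ResNet, EfficientNet). An easy-to-compute flatness measure shows a clear correlation with test accuracy.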
It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds": linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
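The abstract says only that the learning rate varies cyclically between two bounds. One simple way to realize that, sketched here under the assumption of a triangular (linear up, linear down) cycle, is shown below; the function name, parameter names, and default bounds are illustrative assumptions.

```python
import math


def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate (illustrative sketch).

    iteration : training iteration, counted from 0
    step_size : iterations in half a cycle (rise from base_lr to max_lr)
    """
    cycle = math.floor(1 + iteration / (2 * step_size))    # 1-based cycle index
    x = abs(iteration / step_size - 2 * cycle + 1)         # distance from the peak, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The rate rises linearly from `base_lr` to `max_lr` over `step_size` iterations, falls back over the next `step_size`, and then repeats.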
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam. * Work done as part of the Google AI Residency program.
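The min-max formulation can be approximated with two gradient evaluations per step. The sketch below assumes the usual first-order approximation, with the inner maximization taken over an L2 ball of radius `rho`; the function name, argument names, and default values are illustrative assumptions rather than the released implementation.

```python
import math


def sam_step(grad_fn, w, lr=0.1, rho=0.05):
    """One sharpness-aware update on a weight list (illustrative sketch).

    grad_fn : function mapping a weight list to its loss-gradient list
    rho     : radius of the neighborhood searched for the worst-case loss
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)                                   # stationary point: nothing to do
    # Ascent step: move to the approximate worst-case point within the rho-ball.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_next = sam_step(lambda w: list(w), [1.0, 0.0])
```

Each update costs roughly two gradient computations instead of one, which is the main overhead of this approach.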
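Modern deep learning (DL) architectures are trained using variants of the SGD algorithm run with a manually defined learning rate schedule, i.e., the learning rate is dropped at pre-defined epochs, typically when the training loss is expected to saturate. In this paper we develop an algorithm that realizes the learning rate drop automatically. The proposed method, which we refer to as AutoDrop, is motivated by the observation that the angular velocity of the model parameters, i.e., the velocity of the changes of the convergence direction, for a fixed learning rate initially increases rapidly and then progresses towards soft saturation. At saturation the optimizer slows down, so the saturation of the angular velocity is a good indicator for dropping the learning rate. After the drop, the angular velocity "resets" and follows the previously described pattern: it increases again until saturation. We show that our method improves over SOTA training approaches: it accelerates the training of DL models and leads to better generalization. We also show that our method does not require any extra hyperparameter tuning. AutoDrop is furthermore extremely simple to implement and computationally cheap. Finally, we develop a theoretical framework for analyzing our algorithm and provide convergence guarantees.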
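It is well known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both the optimization and the generalization of deep networks. Some works have attempted to artificially simulate SGN by injecting random noise to improve deep learning. However, it turned out that simple injected random noise cannot work as well as SGN, which is anisotropic and parameter-dependent. To simulate SGN at low computational cost and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to conventional momentum in classic optimizers. The PNM method maintains two approximately independent momentum terms. We can then control the magnitude of SGN explicitly by adjusting the momentum difference. We theoretically prove the convergence guarantee and the generalization advantage of PNM over stochastic gradient descent (SGD). By incorporating PNM into two conventional optimizers, SGD with momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding conventional momentum-based optimizers.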
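Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known which schedules of SGD achieve the best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. This condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay in such situations. For other situations, the proposed schedulers are superior to cosine decay.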
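Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks, the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KOALA, which is short for Kalman Optimization Algorithm with Loss Adaptivity. KOALA is an easy-to-implement, scalable, and efficient method for training neural networks. We provide a convergence analysis and show experimentally that it yields parameter estimates that are on par with or better than existing state-of-the-art optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.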
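Training neural networks with batch normalization and weight decay has become common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of the optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but instead start a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions under which it occurs in practice. We also demonstrate that the periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.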
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes, from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
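Deep learning has achieved promising results on a wide spectrum of AI applications. Larger datasets and models consistently yield better performance. However, they generally require longer training time, with more computation and communication. In this survey, we aim to provide a clear sketch of optimization for large-scale deep learning with regard to model accuracy and model efficiency. We investigate the algorithms most commonly used for optimization, elaborate on the debated topic of the generalization gap that arises in large-batch training, and review the SOTA strategies for addressing communication overhead and reducing memory footprint.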
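Stochastic weight averaging (SWA) is recognized as a simple yet effective approach to improving the generalization of stochastic gradient descent (SGD) for training deep neural networks (DNNs). A common explanation of its success is that averaging weights along an SGD process run with cyclical or high constant learning rates can discover wider optima, which then lead to better generalization. We give a new insight that does not concur with this explanation. We show that SWA's performance is highly dependent on the extent to which the SGD process that runs before SWA has converged, and that the weight averaging operation contributes only to variance reduction. This new insight suggests practical guidance for better algorithm design. As an instantiation, we show that following an SGD process with insufficient convergence, running SWA more times leads to continual incremental benefits in terms of generalization. Our findings are corroborated by extensive experiments across different network architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10, VGG16, ResNet-50, ResNet-152, and DenseNet-161, and different datasets, including CIFAR-{10, 100} and ImageNet.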
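We propose a globally convergent multilevel training method for deep residual networks (ResNets). The devised method can be seen as a novel variant of the recursive multilevel trust-region (RMTR) method, which operates in hybrid (stochastic-deterministic) settings by adaptively adjusting mini-batch sizes during training. The multilevel hierarchy and the transfer operators are constructed by exploiting a dynamical systems viewpoint, which interprets forward propagation through the ResNet as a forward Euler discretization of an initial value problem. In contrast to traditional training approaches, our novel RMTR method also incorporates curvature information on all levels of the multilevel hierarchy by means of the limited-memory SR1 method. The overall performance and the convergence properties of our multilevel training method are investigated numerically using examples from the fields of classification and regression.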
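Despite the evolutionary development of convolutional neural networks (CNNs), their performance remains surprisingly dependent on the choice of hyperparameters. However, efficiently exploring a large hyperparameter search space is still challenging because of the long training times of modern CNNs. Multi-fidelity optimization makes it possible to explore more hyperparameter configurations by terminating unpromising configurations early. However, it often results in selecting a sub-optimal configuration, because training with a high-performing configuration typically converges slowly in the early phase. In this paper, we propose Multi-fidelity Optimization with a Recurring Learning rate (MORL), which incorporates the optimization process of CNNs into multi-fidelity optimization. MORL alleviates the slow-starter problem and achieves a more precise low-fidelity approximation. Our comprehensive experiments on general image classification, transfer learning, and semi-supervised learning demonstrate the effectiveness of MORL over other multi-fidelity optimization methods such as the Successive Halving Algorithm (SHA) and Hyperband. Furthermore, it achieves significant performance improvements over hand-tuned hyperparameter configurations within a practical budget.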