We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions, including feed-forward fully connected and convolutional deep neural networks with linear, ReLU, or Leaky ReLU activations. We rigorously prove that gradient flow (i.e., gradient descent with infinitesimal step size) effectively enforces that the differences between the squared norms of different layers remain invariant, without any explicit regularization. This result implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers. Using a discretization argument, we analyze gradient descent with positive step size for the non-convex low-rank asymmetric matrix factorization problem without any regularization. Inspired by our findings for gradient flow, we prove that gradient descent with step sizes $\eta_t = O\left(t^{-\left(\frac{1}{2}+\delta\right)}\right)$ ($0 < \delta \le \frac{1}{2}$) automatically balances the two low-rank factors and converges to a bounded global optimum. Furthermore, for rank-$1$ asymmetric matrix factorization, we give a finer analysis showing that gradient descent with a constant step size converges to the global minimum at a globally linear rate. We believe that the idea of examining the invariants maintained by first-order algorithms when learning homogeneous models can serve as a fundamental building block for studying optimization in deep learning.
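To make the balancing phenomenon concrete, here is a minimal NumPy sketch (the target matrix, its size, and the constant in the step-size schedule are illustrative assumptions, not the paper's setup) that runs gradient descent on the asymmetric factorization loss $\frac{1}{2}\|UV^\top - M\|_F^2$ with small initialization and step sizes $\eta_t = O(t^{-(\frac{1}{2}+\delta)})$, and prints the gap $\|U\|_F^2 - \|V\|_F^2$, which, per the result above, should remain close to its tiny initial value while the loss decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 3                                     # illustrative sizes, not from the paper
M = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
M /= np.linalg.norm(M, 2)                        # rank-r target, normalized for convenience

# small initialization, as the analysis assumes
U = 1e-3 * rng.normal(size=(d, r))
V = 1e-3 * rng.normal(size=(d, r))

delta = 0.25                                     # any 0 < delta <= 1/2 works in the theorem
for t in range(1, 5001):
    eta = 0.5 * t ** -(0.5 + delta)              # step size eta_t = O(t^{-(1/2 + delta)})
    R = U @ V.T - M                              # residual of the factorization
    gU, gV = R @ V, R.T @ U                      # gradients of 0.5 * ||U V^T - M||_F^2
    U, V = U - eta * gU, V - eta * gV
    if t % 1000 == 0:
        gap = np.linalg.norm(U) ** 2 - np.linalg.norm(V) ** 2
        print(f"t={t:5d}  loss={0.5 * np.linalg.norm(R) ** 2:.3e}  "
              f"||U||_F^2 - ||V||_F^2 = {gap:+.2e}")
```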
Deep learning models are often successfully trained using gradient descent, despite the worst-case hardness of the underlying non-convex optimization problem. The key question is under what conditions optimization can be shown to succeed. Here we provide a strong result of this kind. We consider a neural network with one hidden layer, a convolutional structure with no overlap, and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global-optimality guarantee for gradient descent on a convolutional neural network with ReLU activations.
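As a concrete reading of the architecture described above, here is a minimal NumPy sketch of a one-hidden-layer ConvNet with non-overlapping patches and a shared ReLU filter; splitting the input into equal contiguous patches and averaging the hidden units at the output are illustrative assumptions, not necessarily the paper's exact parameterization.

```python
import numpy as np

def no_overlap_conv_relu(x, w, num_patches):
    """One-hidden-layer ConvNet with non-overlapping patches and ReLU.

    The input is split into disjoint patches, the same filter w is applied
    to each patch, a ReLU is applied, and the hidden units are averaged.
    The patch split and the output averaging are illustrative assumptions.
    """
    patches = x.reshape(num_patches, -1)      # disjoint (non-overlapping) patches
    hidden = np.maximum(patches @ w, 0.0)     # shared ReLU filter on every patch
    return hidden.mean()

rng = np.random.default_rng(0)
d, k = 16, 4                                  # input dim and patch count (illustrative)
x = rng.normal(size=d)                        # Gaussian input, as in the analysis
w = rng.normal(size=d // k)
print(no_overlap_conv_relu(x, w, k))
```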
We consider the problem of learning a one-hidden-layer neural network: we assume the input x ∈ R^d is drawn from a Gaussian distribution and the label y = a^⊤ σ(Bx) + ξ, where a is a nonnegative vector in R^m with m ≤ d, B ∈ R^{m×d} is a full-rank weight matrix, and ξ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by the formula, we design a non-convex objective function G(·) whose landscape is guaranteed to have the following properties: 1. All local minima of G are also global minima. 2. All global minima of G correspond to the ground-truth parameters. 3. The value and gradient of G can be estimated using samples. With these properties, stochastic gradient descent on G provably converges to the global minimum and learns the ground-truth parameters. We also prove a finite sample complexity result and validate the results by simulations.
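A minimal sketch of the generative model stated above (y = a^⊤ σ(Bx) + ξ with Gaussian x); taking σ to be the ReLU and ξ to be Gaussian noise are illustrative assumptions of this sketch, since the abstract leaves both generic.

```python
import numpy as np

def sample_one_hidden_layer(n, a, B, noise_std=0.1, rng=None):
    """Draw n samples from the model y = a^T sigma(B x) + xi with Gaussian x.

    sigma is taken to be ReLU and xi to be Gaussian noise purely for
    illustration; the abstract leaves both generic.
    """
    rng = rng or np.random.default_rng(0)
    m, d = B.shape
    X = rng.normal(size=(n, d))                   # x ~ N(0, I_d)
    H = np.maximum(X @ B.T, 0.0)                  # sigma(Bx) with sigma = ReLU (assumption)
    y = H @ a + noise_std * rng.normal(size=n)    # y = a^T sigma(Bx) + xi
    return X, y

rng = np.random.default_rng(1)
m, d = 5, 10                                      # m <= d, as assumed above
a = np.abs(rng.normal(size=m))                    # nonnegative output weights
B = rng.normal(size=(m, d))                       # full-rank weight matrix (w.h.p.)
X, y = sample_one_hidden_layer(1000, a, B)
print(X.shape, y.shape)
```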
We study the learning problem for a one-hidden-layer neural network with rectified linear unit (ReLU) activations, where the input is sampled from the standard Gaussian distribution and the output is generated from a noisy teacher network. We analyze the performance of gradient descent for training such a neural network based on empirical risk minimization, and provide algorithm-dependent guarantees. In particular, we prove that tensor initialization followed by gradient descent converges to the ground-truth parameters at a linear rate, up to some statistical error. To the best of our knowledge, this is the first work that characterizes recovery guarantees for practically learning one-hidden-layer ReLU networks with multiple neurons. Numerical experiments verify our theoretical findings.
We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a shallow network with $k$ hidden nodes, quadratic activation, and $n$ training data points, we show that as long as $k \ge \sqrt{2n}$, over-parametrization enables local search algorithms to find a \emph{globally} optimal solution in the general case. Further, although the number of parameters may exceed the sample size, using Rademacher complexity theory we show that the solution also generalizes well if the data is sampled from a regular distribution such as the Gaussian. To prove that when $k \ge \sqrt{2n}$ the loss function has a benign landscape property, we adopt ideas from smoothed analysis, which may have other applications in studying loss surfaces of neural networks.
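For concreteness, a minimal sketch of a shallow network with quadratic activation together with the over-parametrization level $k \ge \sqrt{2n}$ discussed above; whether each hidden unit carries a separate output weight is left out here for simplicity and is an assumption of this sketch.

```python
import numpy as np

def quadratic_net(X, W):
    """Shallow network with quadratic activation: f(x) = sum_j (w_j^T x)^2.

    A minimal sketch of the model class discussed above; omitting separate
    output weights per hidden unit is a simplification of this sketch.
    """
    return np.sum((X @ W.T) ** 2, axis=1)

n, d = 200, 50                                   # illustrative sample size and input dimension
k = int(np.ceil(np.sqrt(2 * n)))                 # over-parametrization level k >= sqrt(2n)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W = rng.normal(size=(k, d)) / np.sqrt(d)
print(k, quadratic_net(X, W).shape)
```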
Recent works have shed light on the mystery of why deep nets fit any data and generalize despite being heavily over-parameterized. This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent work: (i) using a tighter characterization of training speed than recent papers, an explanation for why training a neural network with random labels leads to slower training, as originally observed in [Zhang et al., ICLR'17]; (ii) a generalization bound independent of the network size, based on a data-dependent complexity measure; our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments, and moreover, while recent papers require the sample complexity to increase (slowly) with the network size, our sample complexity is completely independent of it; (iii) learnability of a broad class of smooth functions by two-layer ReLU networks trained with gradient descent. The key idea is to track the dynamics of training and generalization via properties of a related kernel.
Training activation-quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for standard back-propagation or the chain rule. An empirical workaround for this problem is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification of the STE concept by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We refer to the unusual "gradient" given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available for training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
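A minimal sketch of the straight-through-estimator idea for a binarized ReLU activation; the particular surrogate derivative used in the backward pass (that of the clipped ReLU) is just one common choice among the several STEs the paper compares, and is an assumption of this sketch.

```python
import numpy as np

def binarized_relu_forward(z):
    """Binarized ReLU used in the forward pass: 1 where z > 0, else 0."""
    return (z > 0).astype(z.dtype)

def ste_backward(z, grad_out):
    """Backward pass through the binarized activation with an STE.

    The true derivative is zero almost everywhere, so the chain rule uses a
    surrogate derivative instead. Here the surrogate is the derivative of
    the clipped ReLU min(max(z, 0), 1), i.e. 1 on (0, 1) and 0 elsewhere --
    one common STE choice, used only for illustration.
    """
    surrogate = ((z > 0) & (z < 1)).astype(z.dtype)
    return grad_out * surrogate          # "coarse gradient" flowing back through the activation

z = np.array([-0.5, 0.2, 0.8, 1.5])
grad_out = np.ones_like(z)
print(binarized_relu_forward(z), ste_backward(z, grad_out))
```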
We study the problem of training deep neural networks with rectified linear unit (ReLU) activation functions using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that, for a broad family of loss functions, with proper random weight initialization both gradient descent and stochastic gradient descent can find the global minimum of the training loss of an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning, and lay the foundation for studying the optimization dynamics of training modern deep neural networks.
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points; it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
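A minimal sketch contrasting the two algorithms discussed above: gradient descent that injects a small random perturbation whenever the gradient is tiny, in the spirit of [Ge et al., 2015, Jin et al., 2017]; the threshold, perturbation radius, and test function are illustrative choices, not the constants from those papers.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=0.05, g_thresh=1e-3,
                               radius=1e-2, steps=2000, rng=None):
    """Gradient descent that injects a small random perturbation whenever
    the gradient is tiny, so it does not stall near saddle points."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh:          # near a stationary point: perturb
            x = x + radius * rng.normal(size=x.shape)
        else:
            x = x - eta * g
    return x

# f(x, y) = x^2/2 + y^4/4 - y^2/2 has a strict saddle at the origin and minima
# at (0, +/-1); plain GD started on the x-axis converges to the saddle, while
# the perturbed variant escapes to a minimum.
grad = lambda v: np.array([v[0], v[1] ** 3 - v[1]])
print(perturbed_gradient_descent(grad, [1.0, 0.0]))
```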
In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, with standard O(1/√d) initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike normal vanilla networks, the "identity mapping" makes our network asymmetric and thus the global minimum is unique. To complement our theory, we are also able to show experimentally that multi-layer networks with this mapping have better performance compared with normal vanilla networks. Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in "two phases": in phase I, the gradient points in the wrong direction, but a potential function g gradually decreases. Then in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiments verify our claims.
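A minimal sketch of one plausible reading of the "identity mapping" structure, namely that the first layer computes ReLU((I + W)x) instead of ReLU(Wx); this reading, the linear readout, and the O(1/√d) initialization scale are assumptions of the sketch rather than the paper's exact model.

```python
import numpy as np

def identity_mapping_net(x, W, a):
    """Two-layer ReLU net with an identity mapping in the first layer.

    Sketch under the assumption that the "identity mapping" means the first
    layer computes ReLU((I + W) x) rather than ReLU(W x); the second layer
    is a plain linear readout. This is one illustrative reading of the
    structure described above, not necessarily the paper's exact model.
    """
    hidden = np.maximum(x + W @ x, 0.0)          # ReLU((I + W) x): input added back in
    return a @ hidden

d = 32
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # standard O(1/sqrt(d)) init scale
a = rng.normal(scale=1.0 / np.sqrt(d), size=d)
x = rng.normal(size=d)                               # Gaussian input, as assumed above
print(identity_mapping_net(x, W, a))
```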
We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given Õ(dr^2) random linear measurements of a rank-r positive semidefinite matrix X⋆, we can recover X⋆ by parameterizing it by UU^⊤ with U ∈ R^{d×d} and minimizing the squared loss, even if r ≪ d. We prove that starting from a small initialization, gradient descent recovers X⋆ in Õ(√r) iterations approximately. The results solve the conjecture of Gunasekar et al. [16] under the restricted isometry property. The technique can be applied to analyzing neural networks with one-hidden-layer quadratic activations with some technical modifications.
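A minimal sketch of the setting above: random Gaussian linear measurements of a rank-r PSD matrix, gradient descent on the squared loss over the over-parameterized factor U ∈ R^{d×d}, started from a small initialization; the number of measurements, the step size, and the initialization scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 2
m = 10 * d * r ** 2                                   # number of measurements (illustrative constant)

# ground-truth rank-r PSD matrix X* and symmetric Gaussian measurement matrices
U_star = rng.normal(size=(d, r)) / np.sqrt(d)
X_star = U_star @ U_star.T
G = rng.normal(size=(m, d, d))
A = 0.5 * (G + np.transpose(G, (0, 2, 1)))            # symmetrized for convenience
b = np.einsum('mij,ij->m', A, X_star)                 # b_i = <A_i, X*>

U = 1e-3 * rng.normal(size=(d, d))                    # small init, over-parameterized U in R^{d x d}
eta = 0.05                                            # illustrative step size
for _ in range(500):
    resid = np.einsum('mij,ij->m', A, U @ U.T) - b    # <A_i, U U^T> - b_i
    grad = (2.0 / m) * np.einsum('m,mij->ij', resid, A) @ U   # gradient of (1/2m) sum resid_i^2
    U -= eta * grad

print(np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))
```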
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients. Dedicated to the memory of Maryam Mirzakhani.
We study whether a depth two neural network can learn another depth two network using gradient descent. Assuming a linear output node, we show that the question of whether gradient descent converges to the target function is equivalent to the following question in electrodynamics: Given k fixed protons in R^d, and k electrons, each moving due to the attractive force from the protons and repulsive force from the remaining electrons, whether at equilibrium all the electrons will be matched up with the protons, up to a permutation. Under the standard electrical force, this follows from the classic Earnshaw's theorem. In our setting, the force is determined by the activation function and the input distribution. Building on this equivalence, we prove the existence of an activation function such that gradient descent learns at least one of the hidden nodes in the target network. Iterating, we show that gradient descent can be used to learn the entire network one node at a time.
Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. [2017] gives an analysis of the hitting time of SGLD to first- and second-order stationary points. The proof of Zhang et al. [2017] is a two-stage procedure that goes through the Cheeger constant, which is rather complicated and leads to loose bounds. In this paper, using intuition from stochastic differential equations, we provide a direct analysis of the hitting time of SGLD to first- and second-order stationary points. Our analysis is simple: it relies only on basic tools from linear algebra and probability theory. Our direct analysis also leads to tighter bounds than Zhang et al. [2017] and reveals the explicit dependence of the hitting time on different factors, including dimension, smoothness, noise strength, and step size. Under suitable conditions, we show that the hitting time of SGLD to a first-order stationary point can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.
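For reference, a minimal sketch of the SGLD update itself, x ← x − η g + √(2η/β) N(0, I) with a stochastic gradient g and inverse temperature β; the toy objective, step size, and temperature are illustrative choices, not the settings analyzed in the paper.

```python
import numpy as np

def sgld(grad_fn, x0, eta=1e-3, beta=10.0, steps=5000, rng=None):
    """Stochastic gradient Langevin dynamics (SGLD).

    Standard update x <- x - eta * g + sqrt(2 * eta / beta) * N(0, I), where
    g is a (possibly stochastic) gradient estimate and beta is the inverse
    temperature. The specific eta and beta values are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_fn(x, rng)                                   # stochastic gradient
        x = x - eta * g + np.sqrt(2 * eta / beta) * rng.normal(size=x.shape)
    return x

# toy objective: f(x) = ||x||^2 / 2 with additive gradient noise
grad_fn = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)
print(np.linalg.norm(sgld(grad_fn, np.ones(10))))
```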
We prove that for an $L$-layer fully connected linear neural network, if the width of every hidden layer is $\tilde\Omega(L \cdot r \cdot d_{\mathrm{out}} \cdot \kappa^3)$, where $r$ and $\kappa$ are the rank and the condition number of the input data and $d_{\mathrm{out}}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks, together with the $\exp\left(\Omega\left(L\right)\right)$ lower bound for narrow deep linear neural networks [Shamir, 2018], demonstrates that wide layers are necessary for optimizing deep models.
We analyze the speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) the dimensions of the hidden layers are at least the minimum of the input and output dimensions; (ii) the weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a fully random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).
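A minimal sketch of the parameterization $x \mapsto W_N \cdots W_1 x$ together with a check of condition (ii); reading "approximately balanced" as $W_{j+1}^\top W_{j+1} \approx W_j W_j^\top$ for consecutive layers is a common formalization and is an assumption of this sketch, as are the layer sizes and initialization scale.

```python
import numpy as np

def deep_linear(x, Ws):
    """Deep linear network x -> W_N ... W_1 x."""
    for W in Ws:
        x = W @ x
    return x

def balancedness_gaps(Ws):
    """Frobenius norms of W_{j+1}^T W_{j+1} - W_j W_j^T for consecutive layers.

    Small gaps correspond to the "approximately balanced" initialization
    condition (one common formalization of condition (ii) above).
    """
    return [np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T)
            for j in range(len(Ws) - 1)]

rng = np.random.default_rng(0)
dims = [10, 10, 10, 1]                            # d_in = 10, two hidden layers, d_out = 1
Ws = [1e-2 * rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
print(deep_linear(rng.normal(size=dims[0]), Ws), balancedness_gaps(Ws))
```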
Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for non-convex, non-smooth losses (such as modern deep networks), the population risk is generally significantly better behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework that aims to optimize a smooth non-convex function $F$ (the population risk) given only an approximation $f$ (the empirical risk) that is pointwise close to $F$ (i.e., $\|F - f\|_\infty \le \nu$). Our goal is to find an $\epsilon$-approximate local minimum of the underlying function $F$, while avoiding the shallow local minima that arise because of the tolerance $\nu$ and exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves the optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries to $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
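A minimal sketch of SGD on a Gaussian-smoothed surrogate of $f$: sampling $\nabla f(x + \sigma z)$ with $z \sim N(0, I)$ gives an unbiased stochastic gradient of the smoothed function wherever $f$ is differentiable; the smoothing radius, step size, and toy objective are illustrative choices, not the paper's specific algorithm parameters.

```python
import numpy as np

def sgd_on_smoothed(grad_f, x0, sigma=0.1, eta=0.05, steps=3000, rng=None):
    """SGD on the Gaussian-smoothed surrogate F_sigma(x) = E_z f(x + sigma*z).

    grad_f(x + sigma*z) with z ~ N(0, I) is an unbiased stochastic gradient
    of F_sigma wherever f is differentiable, so plain SGD with these samples
    optimizes the smoothed function. Smoothing radius and step size are
    illustrative choices.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        z = rng.normal(size=x.shape)
        x = x - eta * grad_f(x + sigma * z)
    return x

# toy "empirical risk": a smooth bowl plus a small high-frequency wiggle that
# creates shallow spurious local minima; the smoothing averages the wiggle away.
grad_f = lambda x: x + 0.05 * 30 * np.cos(30 * x)
# the returned point hovers around the smoothed minimizer (the origin) up to SGD noise
print(sgd_on_smoothed(grad_f, np.full(5, 2.0)))
```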
Normalization techniques such as Batch Normalization have been applied successfully to train deep neural networks. Yet, despite their apparent empirical benefits, the reasons behind the success of Batch Normalization remain largely hypothetical. Here we aim to provide a more thorough theoretical understanding from a classical optimization perspective. Our main contribution towards this goal is to identify various problem instances in machine learning where, under certain assumptions, Batch Normalization can provably accelerate optimization. We argue that this acceleration is due to the fact that Batch Normalization decouples the optimization task into optimizing the length and the direction of the parameters separately. This allows gradient-based methods to exploit a favorable global structure that we prove to exist in learning halfspace problems and in training neural networks with Gaussian inputs. Thereby, we turn Batch Normalization from an effective practical heuristic into a provable algorithm for these settings. Furthermore, we corroborate our analysis with empirical evidence suggesting that our theoretical results are valid in a broader context.
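To illustrate the length/direction decoupling mentioned above, a minimal sketch of the reparameterization w = g · v/‖v‖ and the gradients it induces; this weight-norm-style split is used here only to make the idea concrete and is not claimed to be the paper's exact construction.

```python
import numpy as np

def decoupled_gradients(v, g, grad_w):
    """Gradients for the reparameterization w = g * v / ||v||.

    Splitting w into a length g and a direction v/||v|| is one way to make
    the "optimize length and direction separately" idea concrete; this
    sketch just applies the chain rule to that reparameterization.
    """
    norm_v = np.linalg.norm(v)
    w_dir = v / norm_v
    grad_g = w_dir @ grad_w                                       # gradient w.r.t. the length
    grad_v = (g / norm_v) * (grad_w - (w_dir @ grad_w) * w_dir)   # gradient w.r.t. the direction
    return grad_g, grad_v

rng = np.random.default_rng(0)
v, g = rng.normal(size=8), 1.0
grad_w = rng.normal(size=8)                                       # upstream gradient w.r.t. w
print(decoupled_gradients(v, g, grad_w))
```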