2017-02-26

2018-06-20

2018-06-04

We consider the problem of learning a one-hidden-layer neural network: we assume the input $x \in \mathbb{R}^d$ is drawn from a Gaussian distribution and the label is $y = a^\top \sigma(Bx) + \xi$, where $a$ is a nonnegative vector in $\mathbb{R}^m$ with $m \le d$, $B \in \mathbb{R}^{m \times d}$ is a full-rank weight matrix, and $\xi$ is a noise vector. We first give an analytic formula for the population risk of the standard squared loss and demonstrate that it implicitly attempts to decompose a sequence of low-rank tensors simultaneously. Inspired by this formula, we design a non-convex objective function $G(\cdot)$ whose landscape is guaranteed to have the following properties: (1) all local minima of $G$ are also global minima; (2) all global minima of $G$ correspond to the ground-truth parameters; (3) the value and gradient of $G$ can be estimated using samples. With these properties, stochastic gradient descent on $G$ provably converges to the global minimum and learns the ground-truth parameters. We also prove a finite-sample complexity result and validate the results by simulations.
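
As a concrete illustration of the generative model above, the following sketch samples data from $y = a^\top\sigma(Bx) + \xi$ with Gaussian inputs and evaluates the empirical squared loss of a candidate network by Monte Carlo; it does not implement the paper's objective $G$, and all dimensions, weight scales, and the noise level are illustrative choices.

```python
# Hypothetical illustration of the generative model above (NOT the paper's
# objective G): draw x ~ N(0, I_d), form y = a^T sigma(B x) + xi, and evaluate
# the empirical squared loss of a candidate network by Monte Carlo.
# Dimensions, weight scales, and the noise level are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 5, 10_000                    # input dim, hidden width (m <= d), samples

a = np.abs(rng.normal(size=m))             # nonnegative second-layer weights
B = rng.normal(size=(m, d)) / np.sqrt(d)   # weight matrix (full rank w.h.p.)

def relu(z):
    return np.maximum(z, 0.0)

X = rng.normal(size=(n, d))                # x ~ N(0, I_d)
y = relu(X @ B.T) @ a + 0.01 * rng.normal(size=n)   # y = a^T sigma(B x) + xi

def squared_loss(a_hat, B_hat):
    """Empirical squared loss of a candidate one-hidden-layer network."""
    return np.mean((relu(X @ B_hat.T) @ a_hat - y) ** 2)

print("loss at the ground truth:", squared_loss(a, B))
print("loss at a random point  :", squared_loss(np.abs(rng.normal(size=m)),
                                                 rng.normal(size=(m, d)) / np.sqrt(d)))
```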

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points: it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
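
The following toy sketch (not the construction analyzed in the paper) illustrates the qualitative point on the simple function $f(x, y) = x^4/4 - x^2/2 + y^2/2$, whose origin is a strict saddle: plain gradient descent started on the $x = 0$ slice converges to the saddle, while adding a small random kick whenever the gradient is tiny lets it reach a minimizer. The step size and noise scale are arbitrary.

```python
# Toy illustration (not the construction from the paper): on
# f(x, y) = x**4/4 - x**2/2 + y**2/2 the origin is a strict saddle and
# (+-1, 0) are the minimizers. Plain GD started on the x = 0 slice converges
# to the saddle; a small random kick whenever the gradient is tiny lets it
# escape. Step size and noise scale are arbitrary illustrative values.
import numpy as np

def grad(w):
    x, y = w
    return np.array([x**3 - x, y])

def run(perturb, steps=2000, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array([0.0, 1.0])        # lies on the stable manifold of the saddle
    for _ in range(steps):
        if perturb and np.linalg.norm(grad(w)) < 1e-3:
            w = w + 1e-2 * rng.normal(size=2)   # small kick near a critical point
        w = w - eta * grad(w)
    return w

print("plain GD     ends at", run(perturb=False))   # close to the saddle (0, 0)
print("perturbed GD ends at", run(perturb=True))    # close to a minimizer (+-1, 0)
```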

In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution, with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps. Unlike normal vanilla networks, the "identity mapping" makes our network asymmetric and thus the global minimum is unique. To complement our theory, we also show experimentally that multi-layer networks with this mapping have better performance than normal vanilla networks. Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to the optimum in "two phases": in phase I, the gradient points in the wrong direction, but a potential function $g$ gradually decreases; in phase II, SGD enters a nice one-point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiments verify our claims.
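
Below is a hypothetical teacher-student sketch consistent with this setup, under the assumption that the "identity mapping" means the hidden pre-activation is $(I + W)x$ (a ReLU layer with a skip connection) and the output sums the hidden units; the width, step size, teacher scale, and iteration count are illustrative and not the paper's choices.

```python
# Hypothetical teacher-student sketch, assuming "identity mapping" means the
# hidden pre-activation is (I + W)x (a ReLU layer with a skip connection) and
# the output sums the hidden units. Plain per-sample SGD on the squared loss
# with Gaussian inputs; scales, step size, and iteration count are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d = 10
I = np.eye(d)

def f(W, x):
    return np.sum(np.maximum((I + W) @ x, 0.0))

W_star = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)   # teacher parameters
W = 0.1 * rng.normal(size=(d, d)) / np.sqrt(d)        # O(1/sqrt(d)) initialization
eta = 0.005

print("||W - W*||_F before training:", np.linalg.norm(W - W_star))
for t in range(50_000):
    x = rng.normal(size=d)                            # x ~ N(0, I_d)
    h = (I + W) @ x
    err = np.sum(np.maximum(h, 0.0)) - f(W_star, x)   # residual against the teacher
    # gradient of 0.5 * err^2 w.r.t. W is err * 1{h > 0} x^T (outer product)
    W -= eta * err * np.outer((h > 0).astype(float), x)
print("||W - W*||_F after  training:", np.linalg.norm(W - W_star))
```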

2018-10-09

2017-12-26
We show that the gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations. Concretely, we show that given $\tilde{O}(dr^2)$ random linear measurements of a rank-$r$ positive semidefinite matrix $X^\star$, we can recover $X^\star$ by parameterizing it as $UU^\top$ with $U \in \mathbb{R}^{d \times d}$ and minimizing the squared loss, even if $r \ll d$. We prove that starting from a small initialization, gradient descent recovers $X^\star$ in $\tilde{O}(\sqrt{r})$ iterations approximately. The results solve the conjecture of Gunasekar et al. [16] under the restricted isometry property. The technique can be applied to analyzing neural networks with one-hidden-layer quadratic activations with some technical modifications.
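
A small sketch of the matrix-sensing setup described above: gradient descent on the squared measurement loss of the over-parameterized factorization $UU^\top$ with $U \in \mathbb{R}^{d\times d}$, started from a small initialization. The problem sizes, step size, and number of measurements below are illustrative rather than the paper's constants.

```python
# Illustrative sketch of the setup described above (sizes and constants are
# not the paper's): recover a rank-r PSD matrix X* from random linear
# measurements b_i = <A_i, X*> by gradient descent on the squared loss of the
# over-parameterized factorization U U^T, U in R^{d x d}, from a small init.
import numpy as np

rng = np.random.default_rng(2)
d, r = 30, 2
n = 300                                         # measurements, well below d*(d+1)/2

Z = rng.normal(size=(d, r))
X_star = Z @ Z.T                                # rank-r PSD ground truth
A = rng.normal(size=(n, d, d)) / np.sqrt(n)     # Gaussian measurement matrices
A_sym = A + np.transpose(A, (0, 2, 1))          # used in the gradient
b = np.einsum('nij,ij->n', A, X_star)           # b_i = <A_i, X*>

U = 1e-3 * rng.normal(size=(d, d))              # small initialization
eta = 0.003

for t in range(400):
    res = np.einsum('nij,ij->n', A, U @ U.T) - b          # <A_i, UU^T> - b_i
    # gradient of 0.5 * sum_i res_i^2 w.r.t. U is sum_i res_i (A_i + A_i^T) U
    U -= eta * np.einsum('n,nij->ij', res, A_sym) @ U

err = np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star)
print("relative recovery error:", err)
```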

2017-03-02
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.
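
The loop below is a schematic of a perturbed-gradient-descent procedure in the spirit of the one described above: take gradient steps, and when the gradient is small and no perturbation was added recently, add noise drawn uniformly from a small ball so the iterate can leave the neighborhood of a saddle point. The thresholds, radius, and step size are illustrative simplifications, not the paper's precise choices.

```python
# Schematic perturbed-gradient-descent loop in the spirit described above,
# with thresholds, radius, and step size set to illustrative constants rather
# than the paper's precise choices: take gradient steps, and when the gradient
# is small and no perturbation was added recently, add noise drawn uniformly
# from a small ball so the iterate can leave the neighborhood of a saddle.
import numpy as np

def perturbed_gd(grad, x0, eta=0.05, g_thres=1e-3, radius=1e-2,
                 t_thres=50, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    t_noise = -t_thres                              # step of the last perturbation
    for t in range(steps):
        if np.linalg.norm(grad(x)) <= g_thres and t - t_noise > t_thres:
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi                              # uniform draw from a small ball
            t_noise = t
        x = x - eta * grad(x)
    return x

# Example: escape the strict saddle of f(x, y) = x**4/4 - x**2/2 + y**2/2 at the origin.
g = lambda w: np.array([w[0]**3 - w[0], w[1]])
print(perturbed_gd(g, [0.0, 1.0]))                  # ends near a minimizer (+-1, 0)
```
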
In this paper, we explore theoretical properties of training a two-layered ReLU network $g(x; w) = \sum_{j=1}^{K} \sigma(w_j^\top x)$ with centered $d$-dimensional spherical Gaussian input $x$ ($\sigma$ = ReLU). We train our network with gradient descent on $w$ to mimic the output of a teacher network with the same architecture and fixed parameters $w^*$. We show that its population gradient has an analytical formula, leading to interesting theoretical analysis of critical points and convergence behaviors. First, we prove that critical points outside the hyperplane spanned by the teacher parameters ("out-of-plane") are not isolated and form manifolds, and we characterize in-plane critical-point-free regions for the two-ReLU case. On the other hand, convergence to $w^*$ for one ReLU node is guaranteed with probability at least $(1 - \epsilon)/2$, if the weights are initialized randomly with standard deviation upper-bounded by $O(\epsilon/\sqrt{d})$, consistent with empirical practice. For a network with many ReLU nodes, we prove that an infinitesimal perturbation of the weight initialization results in convergence towards $w^*$ (or its permutation), a phenomenon known as spontaneous symmetry breaking (SSB) in physics. We assume no independence of ReLU activations. Simulations verify our findings.
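
The paper derives a closed-form expression for the population gradient; the sketch below does not reproduce that formula and instead estimates the population gradient by Monte Carlo over spherical Gaussian inputs in the teacher-student setting $g(x; w) = \sum_j \sigma(w_j^\top x)$. Sizes and sample counts are illustrative.

```python
# Teacher-student sketch for the setting above. The paper derives a
# closed-form population gradient; this snippet does NOT reproduce that
# formula and simply estimates the population gradient by Monte Carlo over
# spherical Gaussian inputs. Sizes and sample counts are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, K, n = 10, 3, 200_000

W_star = rng.normal(size=(K, d))           # teacher parameters w*
W = rng.normal(size=(K, d)) / np.sqrt(d)   # student parameters w

def g(W, X):
    """Network g(x; w) = sum_j relu(w_j . x), applied to the rows of X."""
    return np.maximum(X @ W.T, 0.0).sum(axis=1)

X = rng.normal(size=(n, d))                # x ~ N(0, I_d)
err = g(W, X) - g(W_star, X)               # residual of the student network

# Monte Carlo estimate of the population gradient E[err * dg/dw_j] for each node j
gates = (X @ W.T > 0).astype(float)        # ReLU gates of the student
grad = (gates * err[:, None]).T @ X / n    # shape (K, d), row j is the gradient for w_j

print("estimated population-gradient norm per node:", np.linalg.norm(grad, axis=1))
```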

2018-03-25

We study whether a depth-two neural network can learn another depth-two network using gradient descent. Assuming a linear output node, we show that the question of whether gradient descent converges to the target function is equivalent to the following question in electrodynamics: given $k$ fixed protons in $\mathbb{R}^d$ and $k$ electrons, each moving due to the attractive force from the protons and the repulsive force from the remaining electrons, will all the electrons at equilibrium be matched up with the protons, up to a permutation? Under the standard electrical force, this follows from the classic Earnshaw's theorem. In our setting, the force is determined by the activation function and the input distribution. Building on this equivalence, we prove the existence of an activation function such that gradient descent learns at least one of the hidden nodes in the target network. Iterating, we show that gradient descent can be used to learn the entire network one node at a time.
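
Below is a toy simulation of the described dynamical system under the standard inverse-square (Coulomb-type) force in the plane, with a small softening term added for numerical stability; in the paper the force is instead determined by the activation function and the input distribution, so this is purely illustrative.

```python
# Toy simulation of the dynamics described above, using the standard
# inverse-square (Coulomb-type) force in the plane with a small softening term
# for numerical stability. In the paper the force is instead determined by the
# activation function and the input distribution, so this is purely illustrative.
import numpy as np

rng = np.random.default_rng(4)
k = 4
soft = 1e-2                                 # softening constant (illustrative)
protons = rng.normal(size=(k, 2))           # fixed attracting charges
electrons = rng.normal(size=(k, 2))         # moving charges

def force(electrons):
    # attraction toward every proton
    d_ep = protons[None, :, :] - electrons[:, None, :]            # (k, k, 2)
    attract = (d_ep / (np.sum(d_ep**2, axis=-1, keepdims=True) + soft) ** 1.5).sum(axis=1)
    # repulsion from the other electrons (the i == j term vanishes since d_ee = 0 there)
    d_ee = electrons[:, None, :] - electrons[None, :, :]          # (k, k, 2)
    repel = (d_ee / (np.sum(d_ee**2, axis=-1, keepdims=True) + soft) ** 1.5).sum(axis=1)
    return attract + repel

# first-order (overdamped) dynamics: each electron moves along its net force
for _ in range(20_000):
    electrons += 2e-4 * force(electrons)

# distance from each electron to its nearest proton at the end of the run
dists = np.linalg.norm(electrons[:, None, :] - protons[None, :, :], axis=-1)
print("nearest-proton distances:", dists.min(axis=1))
```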

2019-01-21

2018-05-27

In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This latter result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients. Dedicated to the memory of Maryam Mirzakhani.
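
A minimal sketch of the realizable, over-parameterized setting described above: a one-hidden-layer network with quadratic activations $f(x; W) = \sum_j (w_j^\top x)^2$, Gaussian inputs, labels from planted weights, and more parameters than observations, trained by plain gradient descent on the squared loss. Widths, step size, and iteration count are illustrative.

```python
# Minimal sketch of the realizable, over-parameterized setting described above
# (widths, step size, and iteration count are illustrative): a one-hidden-layer
# network with quadratic activations, f(x; W) = sum_j (w_j . x)^2, Gaussian
# inputs, labels from planted weights, and more parameters (k*d) than
# observations (n), trained by gradient descent on the squared loss.
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 10, 10, 60                          # k*d = 100 parameters vs n = 60 samples

W_star = rng.normal(size=(3, d)) / np.sqrt(d) # planted weights (3 teacher units)
X = rng.normal(size=(n, d))                   # i.i.d. Gaussian inputs
y = ((X @ W_star.T) ** 2).sum(axis=1)         # labels from the planted model

W = 0.1 * rng.normal(size=(k, d))             # student with k hidden units
eta = 1e-3

for t in range(30_000):
    H = X @ W.T                               # pre-activations, shape (n, k)
    res = (H ** 2).sum(axis=1) - y            # residuals f(x_i; W) - y_i
    # gradient of (1/n) * 0.5 * sum_i res_i^2 w.r.t. W
    W -= eta * 2 * (H * res[:, None]).T @ X / n

pred = ((X @ W.T) ** 2).sum(axis=1)
print("final training loss:", 0.5 * np.mean((pred - y) ** 2))
```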