Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
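Given a dense shallow neural network, we focus on iteratively creating, training, and combining randomly selected subnetworks (surrogate functions) in order to train the full model. By carefully analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error, up to an error region, for an overparameterized single-hidden-layer perceptron with ReLU activations on a regression task. Our result implies that, for a fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps for each selected subnetwork. The considered framework generalizes and provides new insights into dropout training, multi-sample dropout training, and independent subnet training; for each case, we provide corresponding convergence results as corollaries of our main theorem.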
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" (namely, the number of channels in convolutional layers and the number of nodes in fully-connected internal layers) is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK), which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
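A recent line of work has analyzed the theoretical properties of deep neural networks via the Neural Tangent Kernel (NTK). In particular, the smallest eigenvalue of the NTK has been related to the memorization capacity, the global convergence of gradient descent algorithms, and the generalization of deep nets. However, existing results either provide bounds in the two-layer setting or assume that the spectrum of the NTK matrices is bounded away from 0 for multi-layer networks. In this paper, we provide tight bounds on the smallest eigenvalue of NTK matrices, both in the limiting case of infinite widths and for finite widths. In the finite-width setting, the network architectures we consider are fairly general: we require the existence of a wide layer with roughly order of $n$ neurons, $n$ being the number of data samples, and the scaling of the remaining layer widths is arbitrary (up to logarithmic factors). To obtain our results, we analyze various quantities of independent interest: we give lower bounds on the smallest singular value of hidden feature matrices, and upper bounds on the Lipschitz constant of input-output feature maps.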
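As a rough illustration of the quantity bounded in the abstract above, the sketch below computes the smallest eigenvalue of an NTK Gram matrix in the simplest setting only: a two-layer ReLU network in which just the first layer is trained. This is a swapped-in toy case, not the deep architecture the abstract treats. The sizes `n`, `d`, `m`, the helper names, and the unit-norm inputs are assumptions made for the example.

```python
import numpy as np

def empirical_ntk_gram(X, W):
    """Gram matrix of first-layer gradients for f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x).

    With a_r in {-1, +1} the signs cancel in the inner product, so
    H[i, j] = (x_i . x_j) * mean_r 1[w_r . x_i > 0 and w_r . x_j > 0].
    """
    active = (X @ W.T > 0).astype(float)          # (n, m) ReLU activation patterns
    return (X @ X.T) * (active @ active.T) / W.shape[0]

def limiting_ntk_gram(X):
    """Infinite-width limit for unit-norm inputs:
    H[i, j] = (x_i . x_j) * (pi - arccos(x_i . x_j)) / (2 * pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)               # clip guards arccos against rounding
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

rng = np.random.default_rng(0)
n, d, m = 6, 10, 20000                            # hypothetical sizes, chosen for the demo
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs
W = rng.standard_normal((m, d))                   # Gaussian first-layer weights

H_emp = empirical_ntk_gram(X, W)
H_inf = limiting_ntk_gram(X)
lam_emp = np.linalg.eigvalsh(H_emp)[0]            # eigvalsh returns eigenvalues in ascending order
lam_inf = np.linalg.eigvalsh(H_inf)[0]
print(f"smallest eigenvalue: finite width {lam_emp:.4f}, infinite width {lam_inf:.4f}")
```

For a wide hidden layer the two eigenvalues should be close, and both stay away from zero when no two inputs are parallel; this is the kind of spectral property that such bounds quantify.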
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer. The theory of multi-layer networks remains unsettled. In this work, we prove simple algorithms such as stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in $L$, the number of DNN layers, and in $n$, the number of training samples. As concrete examples, starting from randomly initialized weights, we show that SGD attains 100% training accuracy in classification tasks, or minimizes regression loss in linear convergence speed $\varepsilon \propto e^{-\Omega(T)}$, with running time polynomial in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet). * Equal contribution. Full version and future updates are available at https://arxiv.org/abs/1811.03962. This paper is a follow up to the recurrent neural network (RNN) paper (Allen-Zhu et al., 2018b) by the same set of authors. Most of the techniques used in this paper were already discovered in the RNN paper, and this paper can be viewed as a simplification (or to some extent a special case) of the RNN setting in order to reach out to a wider audience. We compare the difference and mention our additional contribution in Section 1.2.
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.
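To make the setting above concrete, here is a minimal sketch of gradient descent on a wide 2-layer ReLU net with random initialization, fitted to arbitrary labels under squared loss. The sizes `n`, `d`, `m`, the learning rate, the step count, and the choice to train only the first layer with fixed random output signs are all assumptions for the demo, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 10, 2000                       # hypothetical sizes with m much larger than n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)           # arbitrary labels

W = rng.standard_normal((m, d))              # first layer, trained
a = rng.choice([-1.0, 1.0], size=m)          # second layer, kept fixed (assumption)

def predict(W):
    """f(x_i) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x_i) for every training input."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def loss(W):
    return 0.5 * np.sum((predict(W) - y) ** 2)

lr, steps = 0.3, 500                         # hypothetical hyperparameters
W0 = W.copy()
losses = [loss(W)]
for _ in range(steps):
    Z = X @ W.T                              # (n, m) pre-activations
    resid = np.maximum(Z, 0.0) @ a / np.sqrt(m) - y
    # Gradient of the squared loss with respect to each hidden weight vector.
    grad = ((resid[:, None] * (Z > 0)) * a[None, :]).T @ X / np.sqrt(m)
    W -= lr * grad
    losses.append(loss(W))

rel_move = np.linalg.norm(W - W0) / np.linalg.norm(W0)
print(f"loss {losses[0]:.3e} -> {losses[-1]:.3e}, relative weight movement {rel_move:.3f}")
```

In a run like this the weights typically move only slightly from initialization while the training loss drops by orders of magnitude, which is why the dynamics can be tracked through a kernel fixed at initialization.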
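A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors. We observe that by instead initializing the weights into independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $\gamma^{-8}$ [Ji and Telgarsky, ICLR 2020] to $\gamma^{-2}$, where $\gamma$ denotes the separation margin with a Neural Tangent Kernel, and in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting, we also prove new lower bounds that improve upon prior work and that, under certain assumptions, are best possible.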
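The paired initialization described above can be sketched in a few lines. The abstract only states that each pair consists of two identical Gaussian vectors; giving the two neurons of a pair opposite output signs, so that the network outputs zero at initialization, is an assumption added here for illustration. The function names and sizes are likewise hypothetical.

```python
import numpy as np

def coupled_init(m, d, rng):
    """Draw m/2 Gaussian vectors and duplicate each one, so the m hidden weights come in identical pairs."""
    assert m % 2 == 0, "m must be even to form pairs"
    half = rng.standard_normal((m // 2, d))
    W = np.repeat(half, 2, axis=0)           # rows 2k and 2k+1 are identical
    a = np.tile([1.0, -1.0], m // 2)         # assumption: opposite output signs within a pair
    return W, a

def forward(X, W, a):
    """Two-layer ReLU network f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
W, a = coupled_init(m=8, d=3, rng=rng)
X = rng.standard_normal((5, 3))              # arbitrary inputs
out = forward(X, W, a)
print(np.abs(out).max())                     # zero up to floating-point rounding
```

Under this assumed sign convention the paired neurons cancel for every input, so the initial output vanishes regardless of the data, whereas with fully independent weights the initial output is a random quantity.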
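The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization, and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least one layer with $\Omega(n)$ neurons, $n$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(n)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{n})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.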
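Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd \gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).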
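In this paper, we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions, we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen from a Gaussian distribution and the labels are generated according to planted weight coefficients.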
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
We study the training and generalization of deep neural networks (DNNs) in the overparameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that the expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model. For data distributions that can be classified by the NTRF model with sufficiently small error, our result yields a generalization error bound of order $\tilde{O}(n^{-1/2})$ that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only is the bias allowed to be updated by gradient descent under our setting, but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
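We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m = \mathrm{poly}(n, d)$), which induces a prohibitively large weight matrix $W \in \mathbb{R}^{m \times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that uses $m^2$ cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. To obtain this result, we make use of various techniques, including a shifted ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast matrix multiplication, tensor-based sketching techniques, and preconditioning.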
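We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d-1}$ and rotation-invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "damped deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Aside from the underparameterized regime, the damped deviations point of view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offer a simple and unifying perspective on the dynamics when optimizing the squared error.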
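Neural networks have achieved great empirical success in many fields. It has been observed that randomly initialized neural networks trained by first-order methods are able to achieve near-zero training loss, even though the loss landscape is non-convex and non-smooth. There are few theoretical explanations for this phenomenon. Recently, some attempts have been made to bridge this gap between practice and theory by analyzing the trajectories of gradient descent (GD) and the heavy-ball method (HB) in the over-parameterized regime. In this work, we make further progress by considering Nesterov's accelerated gradient method (NAG) with a constant momentum parameter. We analyze its convergence for an over-parameterized two-layer fully connected neural network with ReLU activation. Specifically, we prove that the training error of NAG converges to zero at a non-asymptotic linear convergence rate $(1-\Theta(1/\sqrt{\kappa}))^t$ after $t$ iterations, where $\kappa > 1$ is determined by the initialization and the architecture of the neural network. In addition, we provide a comparison between NAG and the existing convergence results for GD and HB. Our theoretical results show that NAG achieves an acceleration over GD and that its convergence rate is comparable to that of HB. Furthermore, numerical experiments validate the correctness of our theoretical analysis.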
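The gap between a $(1-\Theta(1/\kappa))^t$ rate and a $(1-\Theta(1/\sqrt{\kappa}))^t$ rate can be seen on a toy problem. The sketch below swaps the neural network for a strongly convex diagonal quadratic, where $\kappa$ is simply the condition number (an assumption for the demo; in the abstract $\kappa$ depends on the network's initialization and architecture). The eigenvalues, the step size $1/L$, and the standard constant momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ are illustrative choices, not values from the paper.

```python
import math

def run(eigs, steps, momentum):
    """Minimize f(x) = 0.5 * sum_i eigs[i] * x[i]**2 with step size 1/L, starting from x = (1, ..., 1).

    momentum = 0 gives plain gradient descent; momentum > 0 gives NAG with a
    constant momentum parameter (the gradient is evaluated at the look-ahead point).
    """
    lr = 1.0 / max(eigs)
    x = [1.0] * len(eigs)
    x_prev = list(x)
    for _ in range(steps):
        look = [xi + momentum * (xi - pi) for xi, pi in zip(x, x_prev)]
        x_prev = x
        x = [li - lr * e * li for li, e in zip(look, eigs)]   # gradient of f at `look` is eigs * look
    return 0.5 * sum(e * xi * xi for e, xi in zip(eigs, x))

kappa = 100.0
eigs = [1.0, 10.0, kappa]                                     # condition number equals kappa
beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)    # constant momentum
steps = 200
f_gd = run(eigs, steps, momentum=0.0)
f_nag = run(eigs, steps, momentum=beta)
print(f"after {steps} steps: GD {f_gd:.3e}, NAG {f_nag:.3e}")
```

On this quadratic the slowest mode contracts by a factor of about $1 - 1/\kappa$ per step under GD and about $1 - 1/\sqrt{\kappa}$ under NAG, so after the same number of steps NAG's objective value is many orders of magnitude smaller.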
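Modern neural networks often have great expressive power and can be trained to overfit the training data while still achieving good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged that studies benign overfitting from a theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant-level test loss. Together, these results demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.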
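Existing analyses of optimization in deep learning are either continuous, focusing on (variants of) gradient flow, or discrete, directly treating (variants of) gradient descent. Gradient flow is amenable to theoretical analysis, but it is stylized and disregards computational efficiency. The extent to which it represents gradient descent is an open question in the theory of deep learning. The current paper studies this question. Viewing gradient descent as an approximate numerical solution to the initial value problem of gradient flow, we find that the degree of approximation depends on the curvature around the gradient flow trajectory. We then show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature, suggesting that they are well approximated by gradient descent. This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to a global minimum almost surely under random initialization. Experiments suggest that over simple deep neural networks, gradient descent with conventional step size is indeed close to gradient flow. We hypothesize that the theory of gradient flows will unravel mysteries behind deep learning.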
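The view of gradient descent as a numerical solution of the gradient flow initial value problem can be checked directly on a toy objective. The sketch below uses a diagonal quadratic, for which the flow has a closed form, rather than a neural network; the eigenvalues, starting point, horizon, and step sizes are assumptions chosen for the demo.

```python
import math

def gd_vs_flow(eigs, x0, step, horizon):
    """Largest coordinate-wise gap between gradient descent iterates and the exact gradient flow
    on f(x) = 0.5 * sum_i eigs[i] * x[i]**2, compared at times t = k * step up to `horizon`."""
    x = list(x0)
    worst = 0.0
    for k in range(1, round(horizon / step) + 1):
        # One explicit Euler step of dx/dt = -grad f(x) is exactly one gradient descent step.
        x = [xi - step * e * xi for xi, e in zip(x, eigs)]
        # Closed-form gradient flow solution at time t = k * step.
        flow = [x0i * math.exp(-e * k * step) for x0i, e in zip(x0, eigs)]
        worst = max(worst, max(abs(a - b) for a, b in zip(x, flow)))
    return worst

eigs = [0.5, 2.0, 5.0]        # Hessian eigenvalues, i.e. the curvature of the toy objective
x0 = [1.0, 1.0, 1.0]
gap_coarse = gd_vs_flow(eigs, x0, step=0.1, horizon=2.0)
gap_fine = gd_vs_flow(eigs, x0, step=0.01, horizon=2.0)
print(f"max gap: step 0.1 -> {gap_coarse:.4f}, step 0.01 -> {gap_fine:.4f}")
```

Shrinking the step size shrinks the gap roughly in proportion, and on this objective the largest discrepancy comes from the highest-curvature coordinate, in line with the abstract's point that curvature along the trajectory governs how well gradient descent tracks gradient flow.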
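Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under the standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling but either require initializations that lead to "lazy training" or train only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling of the network width. We achieve these desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions on the data, and we use tools from random matrix theory.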
Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.