近年来,人们对无限宽网络与高斯流程之间的对应关系产生了越来越多的兴趣。尽管当前的神经网络高斯过程理论具有有效性和优雅性,但据我们所知,所有神经网络高斯过程基本上都是通过增加宽度引起的。但是,在深度学习的时代,关于神经网络的更多关注是它的深度以及深度如何影响网络的行为。受宽度深度对称考虑因素的启发,我们使用快捷网络表明,增加神经网络的深度也会引起高斯过程,这是对现有理论的宝贵补充,并有助于揭示的真实情况深度学习。除了深入提出的高斯过程之外,我们从理论上表征了其均匀的紧密度和高斯工艺过程中最小的特征值。这些特征不仅可以增强我们对拟议深度引起的高斯过程的理解,而且还可以为未来的应用铺平道路。最后,我们通过对两个基准数据集的回归实验来检查提出的高斯过程的性能。
translated by 谷歌翻译
Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature. 1
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
神经切线内核(NTK)已成为提供记忆,优化和泛化的强大工具,可保证深度神经网络。一项工作已经研究了NTK频谱的两层和深网,其中至少具有$ \ omega(n)$神经元的层,$ n $是培训样本的数量。此外,有越来越多的证据表明,只要参数数量超过样品数量,具有亚线性层宽度的深网是强大的记忆和优化器。因此,一个自然的开放问题是NTK是否在如此充满挑战的子线性设置中适应得很好。在本文中,我们以肯定的方式回答了这个问题。我们的主要技术贡献是对最小的深网的最小NTK特征值的下限,最小可能的过度参数化:参数的数量大约为$ \ omega(n)$,因此,神经元的数量仅为$ $ $ \ omega(\ sqrt {n})$。为了展示我们的NTK界限的适用性,我们为梯度下降训练提供了两个有关记忆能力和优化保证的结果。
translated by 谷歌翻译
最近的一项工作已经通过神经切线核(NTK)分析了深神经网络的理论特性。特别是,NTK的最小特征值与记忆能力,梯度下降算法的全球收敛性和深网的概括有关。但是,现有结果要么在两层设置中提供边界,要么假设对于多层网络,将NTK矩阵的频谱从0界限为界限。在本文中,我们在无限宽度和有限宽度的限制情况下,在最小的ntk矩阵的最小特征值上提供了紧密的界限。在有限宽度的设置中,我们认为的网络体系结构相当笼统:我们需要大致订购$ n $神经元的宽层,$ n $是数据示例的数量;剩余层宽度的缩放是任意的(取决于对数因素)。为了获得我们的结果,我们分析了各种量的独立兴趣:我们对隐藏特征矩阵的最小奇异值以及输入输出特征图的Lipschitz常数上的上限给出了下限。
translated by 谷歌翻译
众所周知,$ O(n)$参数足以让神经网络记住任意$ N $ INPUT-LABE标签对。通过利用深度,我们显示$ O(n ^ {2/3})$参数足以在输入点的分离的温和条件下记住$ n $对。特别是,更深的网络(即使是宽度为3美元),也会显示比浅网络更有成对,这也同意最近的作品对函数近似的深度的好处。我们还提供支持我们理论发现的经验结果。
translated by 谷歌翻译
在现代深度学习中,最近又越来越多的文献,关于深高斯神经网络(NNS)的大宽度渐近性能之间的相互作用,即具有高斯分布重量的深NNS和高斯随机过程(SPS)。事实证明,这种相互作用在高斯SP先验下的贝叶斯推论中至关重要,对通过梯度下降训练的无限宽的深NN的内核回归以及无限宽的NN中的信息传播。通过经验分析的激励,该经验分析表明了用稳定的NN重量代替高斯分布的潜力,在本文中,我们对(完全连接的)进料深度稳定NN的大差异行为进行了严格的分析,即深NNS,即具有稳定的分布重量。我们表明,随着宽度共同在NN的层上共同进入无限,即``关节生长''的设置,重新缩放的深稳定nn弱收敛到稳定的SP,其分布通过NN的层递归地表征。 NN的三角结构,这是一个非标准的渐近问题,我们提出了一种独立利益的感应方法。然后,我们在````''''下建立了对稳定的SP的Sup-Norm收敛速率,``关节增长和``顺序增长''的宽度在NN的层上。这样的结果提供了'关节增长'和``顺序增长''的差异,表明前者的速率比速度慢。后者根据层的深度和NN的投入数量。我们的工作扩展了有关深gaussian nns无限宽限制的一些最新结果,以至于更通用的深稳定稳定性NNS,这是第一个结果,这是对融合率的第一个结果。``联合增长''环境。
translated by 谷歌翻译
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width"-namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers -is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
translated by 谷歌翻译
The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
translated by 谷歌翻译
为了理论上了解训练有素的深神经网络的行为,有必要研究来自随机初始化的梯度方法引起的动态。然而,这些模型的非线性和组成结构使得这些动态难以分析。为了克服这些挑战,最近出现了大宽度的渐近学作为富有成效的观点,并导致了对真实世界的深网络的实用洞察。对于双层神经网络,已经通过这些渐近学理解,训练模型的性质根据初始随机权重的规模而变化,从内核制度(大初始方差)到特征学习制度(对于小初始方差)。对于更深的网络,更多的制度是可能的,并且在本文中,我们详细研究了与神经网络的“卑鄙字段”限制相对应的“小”初始化的特定选择,我们称之为可分配的参数化(IP)。首先,我们展示了标准I.I.D.零平均初始化,具有多于四个层的神经网络的可集参数,从无限宽度限制的静止点开始,并且不会发生学习。然后,我们提出了各种方法来避免这种琐碎的行为并详细分析所得到的动态。特别是,这些方法中的一种包括使用大的初始学习速率,并且我们表明它相当于最近提出的最大更新参数化$ \ mu $ p的修改。我们将结果与图像分类任务的数值实验确认,其另外显示出在尚未捕获的激活功能的各种选择之间的行为中的强烈差异。
translated by 谷歌翻译
现代神经网络通常以强烈的过度构造状态运行:它们包含许多参数,即使实际标签被纯粹随机的标签代替,它们也可以插入训练集。尽管如此,他们在看不见的数据上达到了良好的预测错误:插值训练集并不会导致巨大的概括错误。此外,过度散色化似乎是有益的,因为它简化了优化景观。在这里,我们在神经切线(NT)制度中的两层神经网络的背景下研究这些现象。我们考虑了一个简单的数据模型,以及各向同性协变量的矢量,$ d $尺寸和$ n $隐藏的神经元。我们假设样本量$ n $和尺寸$ d $都很大,并且它们在多项式上相关。我们的第一个主要结果是对过份术的经验NT内核的特征结构的特征。这种表征意味着必然的表明,经验NT内核的最低特征值在$ ND \ gg n $后立即从零界限,因此网络可以在同一制度中精确插值任意标签。我们的第二个主要结果是对NT Ridge回归的概括误差的表征,包括特殊情况,最小值-ULL_2 $ NORD插值。我们证明,一旦$ nd \ gg n $,测试误差就会被内核岭回归之一相对于无限宽度内核而近似。多项式脊回归的误差依次近似后者,从而通过与激活函数的高度组件相关的“自我诱导的”项增加了正则化参数。多项式程度取决于样本量和尺寸(尤其是$ \ log n/\ log d $)。
translated by 谷歌翻译
通过建立神经网络和内核方法之间的联系,无限宽度极限阐明了深度学习的概括和优化方面。尽管它们的重要性,但这些内核方法的实用性在大规模学习设置中受到限制,因为它们(超)二次运行时和内存复杂性。此外,大多数先前关于神经内核的作品都集中在relu激活上,这主要是由于其受欢迎程度,但这也是由于很难计算此类内核来进行一般激活。在这项工作中,我们通过提供进行一般激活的方法来克服此类困难。首先,我们编译和扩展激活功能的列表,该函数允许精确的双重激活表达式计算神经内核。当确切的计算未知时,我们提出有效近似它们的方法。我们提出了一种快速的素描方法,该方法近似于任何多种多层神经网络高斯过程(NNGP)内核和神经切线核(NTK)矩阵,以实现广泛的激活功能,这超出了常见的经过分析的RELU激活。这是通过显示如何使用任何所需激活函​​数的截短的Hermite膨胀来近似神经内核来完成的。虽然大多数先前的工作都需要单位球体上的数据点,但我们的方法不受此类限制的影响,并且适用于$ \ Mathbb {r}^d $中的任何点数据集。此外,我们为NNGP和NTK矩阵提供了一个子空间嵌入,具有接近输入的距离运行时和接近最佳的目标尺寸,该目标尺寸适用于任何\ EMPH {均质}双重激活功能,具有快速收敛的Taylor膨胀。从经验上讲,关于精确的卷积NTK(CNTK)计算,我们的方法可实现$ 106 \ times $速度,用于在CIFAR-10数据集上的5层默特网络的近似CNTK。
translated by 谷歌翻译
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
translated by 谷歌翻译
This paper investigates the stability of deep ReLU neural networks for nonparametric regression under the assumption that the noise has only a finite p-th moment. We unveil how the optimal rate of convergence depends on p, the degree of smoothness and the intrinsic dimension in a class of nonparametric regression functions with hierarchical composition structure when both the adaptive Huber loss and deep ReLU neural networks are used. This optimal rate of convergence cannot be obtained by the ordinary least squares but can be achieved by the Huber loss with a properly chosen parameter that adapts to the sample size, smoothness, and moment parameters. A concentration inequality for the adaptive Huber ReLU neural network estimators with allowable optimization errors is also derived. To establish a matching lower bound within the class of neural network estimators using the Huber loss, we employ a different strategy from the traditional route: constructing a deep ReLU network estimator that has a better empirical loss than the true function and the difference between these two functions furnishes a low bound. This step is related to the Huberization bias, yet more critically to the approximability of deep ReLU networks. As a result, we also contribute some new results on the approximation theory of deep ReLU neural networks.
translated by 谷歌翻译
古典统计学习理论表示,拟合太多参数导致过度舒服和性能差。尽管大量参数矛盾,但是现代深度神经网络概括了这一发现,并构成了解释深度学习成功的主要未解决的问题。随机梯度下降(SGD)引起的隐式正规被认为是重要的,但其特定原则仍然是未知的。在这项工作中,我们研究了当地最小值周围的能量景观的局部几何学如何影响SGD的统计特性,具有高斯梯度噪声。我们争辩说,在合理的假设下,局部几何形状力强制SGD保持接近低维子空间,这会引起隐式正则化并导致深神经网络的泛化误差界定更严格的界限。为了获得神经网络的泛化误差界限,我们首先引入局部最小值周围的停滞迹象,并施加人口风险的局部基本凸性财产。在这些条件下,推导出SGD的下界,以保留在这些停滞套件中。如果发生停滞,我们会导出涉及权重矩阵的光谱规范的深神经网络的泛化误差的界限,但不是网络参数的数量。从技术上讲,我们的证据基于控制SGD中的参数值的变化以及基于局部最小值周围的合适邻域的熵迭代的参数值和局部均匀收敛。我们的工作试图通过统一收敛更好地连接非凸优化和泛化分析。
translated by 谷歌翻译
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
translated by 谷歌翻译
神经体系结构搜索(NAS)促进了神经体系结构的自动发现,从而实现了图像识别的最新精度。尽管NAS取得了进展,但到目前为止,NAS对理论保证几乎没有关注。在这项工作中,我们研究了NAS在统一框架下的概括属性,从而实现(深)层跳过连接搜索和激活功能搜索。为此,我们从搜索空间(包括混合的激活功能,完全连接和残留的神经网络)的(包括)有限宽度方向上得出了神经切线核的最小特征值的下(和上)边界。由于在统一框架下的各种体系结构和激活功能的耦合,我们的分析是不平凡的。然后,我们利用特征值边界在随机梯度下降训练中建立NAS的概括误差界。重要的是,我们从理论上和实验上展示了衍生结果如何指导NAS,即使在没有培训的情况下,即使在没有培训的情况下,也可以根据我们的理论进行无训练的算法。因此,我们的数值验证阐明了NAS计算有效方法的设计。
translated by 谷歌翻译
神经切线内核(NTK)是分析神经网络及其泛化界限的训练动力学的强大工具。关于NTK的研究已致力于典型的神经网络体系结构,但对于Hadamard产品(NNS-HP)的神经网络不完整,例如StyleGAN和多项式神经网络。在这项工作中,我们为特殊类别的NNS-HP(即多项式神经网络)得出了有限宽度的NTK公式。我们证明了它们与关联的NTK与内核回归预测变量的等效性,该预测扩大了NTK的应用范围。根据我们的结果,我们阐明了针对外推和光谱偏置,PNN在标准神经网络上的分离。我们的两个关键见解是,与标准神经网络相比,PNN能够在外推方案中拟合更复杂的功能,并承认相应NTK的特征值衰减较慢。此外,我们的理论结果可以扩展到其他类型的NNS-HP,从而扩大了我们工作的范围。我们的经验结果验证了更广泛的NNS-HP类别的分离,这为对神经体系结构有了更深入的理解提供了良好的理由。
translated by 谷歌翻译
对于由缺陷线性回归中的标签噪声引起的预期平均平方概率,我们证明了无渐近分布的下限。我们的下部结合概括了过度公共数据(内插)制度的类似已知结果。与最先前的作品相比,我们的分析适用于广泛的输入分布,几乎肯定的全排列功能矩阵,允许我们涵盖各种类型的确定性或随机特征映射。我们的下限是渐近的锐利,暗示在存在标签噪声时,缺陷的线性回归不会在任何这些特征映射中围绕内插阈值进行良好的。我们详细分析了强加的假设,并为分析(随机)特征映射提供了理论。使用此理论,我们可以表明我们的假设对于具有(Lebesgue)密度的输入分布以及随机深神经网络给出的特征映射,具有Sigmoid,Tanh,SoftPlus或Gelu等分析激活功能。作为进一步的例子,我们示出了来自随机傅里叶特征和多项式内核的特征映射也满足我们的假设。通过进一步的实验和分析结果,我们补充了我们的理论。
translated by 谷歌翻译
在分析过度参数化神经网络的训练动力学方面的最新进展主要集中在广泛的网络上,因此无法充分解决深度在深度学习中的作用。在这项工作中,我们介绍了第一个无限深层但狭窄的神经网络的训练保证。我们研究具有特定初始化的多层感知器(MLP)的无限深度极限,并使用NTK理论建立了可训练性保证。然后,我们将分析扩展到无限深的卷积神经网络(CNN),并进行简短的实验。
translated by 谷歌翻译