A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
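As a sketch of the linear model this abstract refers to, the block below writes out the first-order Taylor expansion and the closed-form training-set predictions it gives under gradient flow. The notation is an assumption made here for illustration and is not taken from the abstract: $f_t$ is the network function at time $t$, $\theta_t$ the parameters, $\nabla_\theta f_0$ the Jacobian at initialization, $\Theta_0$ the resulting kernel matrix, $\eta$ the learning rate, $X$ and $Y$ the training inputs and targets, and the loss is taken to be $\tfrac{1}{2}\lVert f(X) - Y\rVert^2$.

```latex
\begin{align}
f^{\mathrm{lin}}_t(x) &= f_0(x) + \nabla_\theta f_0(x)\,(\theta_t - \theta_0), \\
\dot{\theta}_t &= -\eta\, \nabla_\theta f_0(X)^{\top} \bigl(f^{\mathrm{lin}}_t(X) - Y\bigr), \\
\Theta_0 &= \nabla_\theta f_0(X)\, \nabla_\theta f_0(X)^{\top}, \\
f^{\mathrm{lin}}_t(X) &= \bigl(I - e^{-\eta \Theta_0 t}\bigr)\, Y + e^{-\eta \Theta_0 t} f_0(X).
\end{align}
```

The last line follows because the linearized output obeys $\dot{f}^{\mathrm{lin}}_t(X) = -\eta\,\Theta_0\,(f^{\mathrm{lin}}_t(X) - Y)$, a linear ODE with constant coefficients.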
translated by Google Translate
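We analyze feature learning in infinite-width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters, which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel, and consequently the output predictions. We show that the field theory derivation recovers the recursive stochastic process of infinite-width feature learning networks obtained by Yang and Hu (2021) with Tensor Programs. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to solve self-consistently for the kernel order parameters. We compare the self-consistent solution with various approximation schemes. Lastly, we provide experiments in more realistic settings which show that the loss and kernel dynamics of CNNs are preserved across different widths on a CIFAR classification task.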
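We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior into a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. For both fully-connected and residual networks, we observe up to 50 times higher effective sample size. Improvements are achieved at all widths, with the margin between reparameterised and standard BNNs growing with layer width.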
It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network. In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs. We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR-10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks. Finally we connect the performance of these GPs to the recent theory of signal propagation in random neural networks.
* Both authors contributed equally to this work.
† Work done as a member of the Google AI Residency program (g.co/airesidency).
1. Throughout this paper, we assume the conditions on the parameter distributions and nonlinearities are such that the Central Limit Theorem will hold; for instance, that the weight variance is scaled inversely proportional to the layer width.
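To make the idea of a covariance function for a deep network concrete, here is a minimal sketch of a layer-by-layer kernel recursion. It assumes ReLU nonlinearities, for which the Gaussian expectation has a known closed form (the arc-cosine kernel); the abstract does not fix a nonlinearity, and the function name `nngp_relu_kernel` and the parameters `sigma_w2` and `sigma_b2` (weight and bias variances) are hypothetical names chosen for illustration.

```python
import math


def nngp_relu_kernel(x1, x2, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Covariance K(x1, x2) of an infinitely wide ReLU network with `depth` hidden layers.

    Assumes weights ~ N(0, sigma_w2 / fan_in) and biases ~ N(0, sigma_b2).
    Inputs must give nonzero k11 and k22 (e.g. nonzero vectors when sigma_b2 == 0).
    """
    d = len(x1)

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    # Kernel of the first pre-activations (an affine map of the input).
    k11 = sigma_b2 + sigma_w2 * dot(x1, x1) / d
    k22 = sigma_b2 + sigma_w2 * dot(x2, x2) / d
    k12 = sigma_b2 + sigma_w2 * dot(x1, x2) / d

    for _ in range(depth):
        norm = math.sqrt(k11 * k22)
        cos_t = max(-1.0, min(1.0, k12 / norm))  # clamp against rounding error
        theta = math.acos(cos_t)
        # Closed form of E[relu(u) relu(v)] for jointly Gaussian (u, v).
        k12 = sigma_b2 + sigma_w2 / (2 * math.pi) * norm * (
            math.sin(theta) + (math.pi - theta) * cos_t
        )
        # On the diagonal the angle is zero, so E[relu(u)^2] = k / 2.
        k11 = sigma_b2 + sigma_w2 * k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12


same = nngp_relu_kernel([1.0, 0.0], [1.0, 0.0], depth=3)
orthogonal = nngp_relu_kernel([1.0, 0.0], [0.0, 1.0], depth=1)
print(same, orthogonal)
```

With `sigma_w2=2.0` and `sigma_b2=0.0` the diagonal of the kernel is preserved from layer to layer, which is why the first call returns the same value at any depth.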
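To better understand the theoretical behavior of large neural networks, several works have analyzed the case where the network width tends to infinity. In this regime, the effect of random initialization and the process of training a neural network can be formally expressed with analytical tools such as Gaussian processes and neural tangent kernels. In this paper, we review methods for quantifying uncertainty in such infinite-width neural networks and compare their relationship to Gaussian processes in the Bayesian inference framework. Along the way, we make use of several equivalence results to obtain exact closed-form solutions for predictive uncertainty.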
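A seminal work [Jacot et al., 2018] demonstrated that training a neural network under a specific parameterization is equivalent to performing a particular kernel method as the width goes to infinity. This equivalence opened a promising direction for applying results from the rich literature on kernel methods to neural networks, which are much harder to tackle. This survey covers key results on kernel convergence as the width goes to infinity, finite-width corrections, applications, and a discussion of the limitations of the corresponding method.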
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit (16; 4; 7; 13; 6), thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_\theta$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_\theta$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
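The claim that convergence is fastest along the largest kernel principal components can be illustrated with a small numerical sketch. It assumes a fixed kernel matrix `K` on the training points and the residual dynamics dr/dt = -K r, where r is the difference between the network outputs and the targets. The 2x2 matrix, the helper name `gradient_flow_residual`, and the Euler integration are illustrative choices, not anything specified in the abstract.

```python
import math


def gradient_flow_residual(kernel, r0, t, steps=10000):
    """Euler-integrate dr/dt = -K r from time 0 to time t and return r(t)."""
    n = len(r0)
    r = list(r0)
    dt = t / steps
    for _ in range(steps):
        kr = [sum(kernel[i][j] * r[j] for j in range(n)) for i in range(n)]
        r = [r[i] - dt * kr[i] for i in range(n)]
    return r


# Toy kernel with eigenvalue 3 along (1, 1)/sqrt(2) and eigenvalue 1 along (1, -1)/sqrt(2).
K = [[2.0, 1.0], [1.0, 2.0]]
r0 = [1.0, 0.0]  # initial residual, with equal weight on both eigenvectors
r1 = gradient_flow_residual(K, r0, t=1.0)

s = 1.0 / math.sqrt(2.0)
fast = (r1[0] + r1[1]) * s  # component along the eigenvalue-3 direction
slow = (r1[0] - r1[1]) * s  # component along the eigenvalue-1 direction
print(fast, slow)
```

Each component decays approximately as exp(-lambda * t) times its initial value, so the residual along the larger eigenvalue shrinks much sooner. Stopping early therefore fits the leading components while leaving the slow ones largely untouched.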
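Recent advances in analyzing the training dynamics of overparameterized neural networks have focused primarily on wide networks and therefore do not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee for infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.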
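Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality. We first study the basic symmetries associated with various learning algorithms (M, I), focusing on four prototypical architectures in deep learning: fully-connected networks (FCN), locally-connected networks (LCN), and convolutional networks with and without pooling (GAP/VEC). We find that learning is most efficient when these symmetries are compatible with those of the data distribution, and that performance deteriorates significantly when any member of the (D, M, I) triplet is inconsistent or suboptimal.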
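The loss surface of an overparameterized neural network (NN) possesses many global minima with zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error that are absent from underparameterized models.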
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" (namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers) is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
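Neural networks in the lazy training regime converge to kernel machines. Can neural networks in the rich feature learning regime learn a kernel machine with a data-dependent kernel? We demonstrate that this can indeed happen due to a phenomenon we term silent alignment, which requires that the tangent kernel of a network evolves in eigenstructure while small and before the loss appreciably decreases, and grows only in overall scale afterwards. We show that such an effect takes place in homogeneous neural networks with small initialization and whitened data. We provide an analytical treatment of this effect in the linear network case. In general, we find that the kernel develops a low-rank contribution in the early phase of training and then evolves in overall scale, yielding a function equivalent to a kernel regression solution with the final network's tangent kernel. The early spectral learning of the kernel depends on the depth. We also demonstrate that non-whitened data can weaken the silent alignment effect.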
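As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes the standard, NTK, and mean field parametrizations. We show that 1) any parametrization in this space either admits feature learning or has infinite-width training dynamics given by kernel gradient descent, but not both; 2) any such infinite-width limit can be computed using the Tensor Programs technique. Code for our experiments can be found at github.com/edwardjhu/tp4.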
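We prove that neural networks initialized by, e.g., the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations for some multi-dimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.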
Whilst deep neural networks have shown great empirical success, there is still much work to be done to understand their theoretical properties. In this paper, we study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks. To evaluate convergence rates empirically, we use maximum mean discrepancy. We then compare finite Bayesian deep networks from the literature to Gaussian processes in terms of the key predictive quantities of interest, finding that in some cases the agreement can be very close. We discuss the desirability of Gaussian process behaviour and review non-Gaussian alternative models from the literature.
In a series of recent theoretical works, it was shown that strongly overparameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to overparameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.
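Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute than infinite-width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.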
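The size difference between the full eNTK and an $N \times N$ approximation can be seen in a toy sketch. It assumes the simplest possible model, a bias-free linear map f(x) = W x with O outputs, for which the Jacobian is available in closed form. The approximation is taken here to be the kernel of the scalar function g(x) = sum_o f_o(x) / sqrt(O); that normalization and all function names are assumptions made for illustration. For this linear toy model the full eNTK equals the small kernel on its same-output blocks exactly, which is a special case and not the general statement of the abstract.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def jacobian(x, num_outputs):
    """Jacobian of f(x) = W x w.r.t. flattened W: d f_o / d W[p][q] = x[q] if p == o else 0."""
    d = len(x)
    return [
        [x[q] if p == o else 0.0 for p in range(num_outputs) for q in range(d)]
        for o in range(num_outputs)
    ]


def full_entk(xs, num_outputs):
    """(N*O) x (N*O) matrix; entry [(i, o), (j, o')] = <grad f_o(x_i), grad f_o'(x_j)>."""
    rows = [row for x in xs for row in jacobian(x, num_outputs)]  # ordered by (i, o)
    return [[dot(r1, r2) for r2 in rows] for r1 in rows]


def sum_of_logits_ntk(xs, num_outputs):
    """N x N kernel of the scalar function g(x) = sum_o f_o(x) / sqrt(O)."""
    scale = 1.0 / math.sqrt(num_outputs)
    grads = []
    for x in xs:
        jac = jacobian(x, num_outputs)
        # Gradient of g is the scaled sum of the per-output gradients.
        grads.append([scale * sum(col) for col in zip(*jac)])
    return [[dot(g1, g2) for g2 in grads] for g1 in grads]


xs = [[1.0, 2.0], [3.0, -1.0]]  # N = 2 inputs of dimension 2
num_outputs = 3                 # O = 3
full = full_entk(xs, num_outputs)
approx = sum_of_logits_ntk(xs, num_outputs)
print(len(full), "x", len(full[0]), "versus", len(approx), "x", len(approx[0]))
```

Here the full matrix has (N·O)² = 36 entries while the approximation has N² = 4, and the gap grows quadratically with the number of outputs.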
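Recent works have examined the theoretical and empirical properties of wide neural networks trained in the Neural Tangent Kernel (NTK) regime. Given that biological neural networks are much wider than their artificial counterparts, we consider NTK-regime wide neural networks as a possible model of biological neural networks. Leveraging NTK theory, we show theoretically that gradient descent drives layerwise weight updates that are aligned with their input activity correlations weighted by error, and we demonstrate empirically that the result also holds in finite-width wide networks. The alignment result allows us to formulate biologically motivated, backpropagation-free learning rules that are theoretically equivalent to backpropagation in infinite-width networks. We test these learning rules on benchmark problems in feedforward and recurrent neural networks and demonstrate, in wide networks, performance comparable to backpropagation. The proposed rules are particularly effective in low-data regimes, which are common in biological learning settings.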
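To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists of using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation function that is not yet captured by theory.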
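We propose a novel theoretical framework for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling the training scheme of GANs, are subject to ill-defined discriminator gradients. We overcome this issue, which impedes a principled study of GAN training, and solve it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its neural tangent kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of the training dynamics of GANs. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.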