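Stochastic gradient descent with backpropagation is the workhorse of artificial neural networks. It has long been recognized that backpropagation fails to be a biologically plausible algorithm. Fundamentally, it is a non-local procedure: updating one neuron's synaptic weights requires knowledge of the synaptic weights or receptive fields of downstream neurons. This limits the use of artificial neural networks as a tool for understanding the biological principles of information processing in the brain. Lillicrap et al. (2016) propose a more biologically plausible "feedback alignment" algorithm that uses random and fixed backpropagation weights, and show promising simulations. In this paper we study the mathematical properties of the feedback alignment procedure by analyzing convergence and alignment for two-layer networks under squared error loss. In the overparameterized setting, we prove that the error converges to zero exponentially fast, and also that regularization is necessary in order for the parameters to become aligned with the random backpropagation weights. Simulations are given that are consistent with this analysis and suggest further generalizations. These results contribute to our understanding of how biologically plausible algorithms might carry out weight learning in a manner different from Hebbian learning, with performance that is comparable to the full non-local backpropagation algorithm.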
translated by 谷歌翻译
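Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd \gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd \gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).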
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
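Given a dense shallow neural network, we focus on iteratively creating, training, and combining randomly selected subnetworks (surrogate functions), towards training the full model. By carefully analyzing (i) the subnetworks' neural tangent kernel, (ii) the surrogate functions' gradient, and (iii) how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error, within an error region, for an overparameterized single-hidden-layer perceptron with ReLU activations for a regression task. Our result implies that, for a fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps for each selected subnetwork. The considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide corresponding convergence results, as corollaries of our main theorem.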
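In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model. We show that with quadratic activations the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients.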
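The success of gradient descent in ML, and especially for learning neural networks, is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure: low-rank matrix factorization. In this problem, given a matrix $Y_{n \times m}$, the goal is to find a low-rank factorization $Z_{n \times r} W_{r \times m}$ that minimizes the error $\|ZW - Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r \ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and the (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates the convergence of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their errors $\|ZW - Y\|_F$ are approximately equal.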
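To make the feedback alignment update concrete, here is a minimal sketch for the smallest possible instance of the factorization problem, the scalar case $n = m = r = 1$. It is an illustration under stated assumptions, not the paper's algorithm or experiments: the function name `feedback_alignment_scalar`, the zero initialization, the step size and the step count are all hypothetical choices. The only difference from gradient descent on $\frac{1}{2}(zw - y)^2$ is that the update of `w` uses a fixed feedback weight `c` where the true gradient would use `z`.

```python
def feedback_alignment_scalar(y, c, lr=0.05, steps=2000):
    """Fit z * w ~= y, where the update of w uses fixed feedback c instead of z.

    Hypothetical helper for illustration only (scalar case n = m = r = 1).
    """
    z, w = 0.0, 0.0              # assumed zero initialization
    for _ in range(steps):
        e = z * w - y            # residual of the factorization
        grad_z = e * w           # exact gradient of 0.5 * e**2 with respect to z
        fa_w = c * e             # FA direction; the exact gradient would be z * e
        z -= lr * grad_z
        w -= lr * fa_w
    return z, w


results = {}
for c in (1.0, -1.0):
    z, w = feedback_alignment_scalar(y=2.0, c=c)
    results[c] = (z, w)
    print(f"c={c:+.0f}  z={z:.4f}  w={w:.4f}  residual={z * w - 2.0:.2e}")
```

In this toy run the forward weight `z` ends up with the same sign as the fixed feedback weight `c`, which is the scalar analogue of the alignment between forward and feedback matrices that the abstract describes.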
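We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the network found essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations for some multi-dimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.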
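Recent works have demonstrated a double descent phenomenon in over-parameterized learning: as the number of model parameters increases, the excess risk has a $\mathsf{U}$-shape at the beginning, then decreases again when the model is highly over-parameterized. Although this phenomenon has been investigated by recent works under different settings, such as linear models, random feature models and kernel methods, it has not been fully understood in theory. In this paper, we consider a double random feature model (DRFM) consisting of two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high-dimensional framework where the training sample size, the dimension of the data, and the dimension of the random features tend to infinity proportionally. Based on the calculation, we demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide an explanation of the triple descent phenomenon, and discuss how the ratio between the random feature dimensions, the regularization parameter and the signal-to-noise ratio control the shape of the risk curves of DRFMs. Finally, we extend our study to the multiple random feature model (MRFM), and show that MRFMs with $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in random-feature-based regression. Another interesting finding is that our result can recover the risk peak locations reported in the literature when learning neural networks in the "neural tangent kernel" regime.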
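A common method in training neural networks is to initialize all the weights to be independent Gaussian vectors. We observe that by instead initializing the weights into independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $\gamma^{-8}$ [Ji and Telgarsky, ICLR 2020] to $\gamma^{-2}$, where $\gamma$ denotes the separation margin with a Neural Tangent Kernel, as well as in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting we also prove new lower bounds that improve upon prior work and that, under certain assumptions, are best possible.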
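The paired initialization can be sketched in a few lines. This is a hedged illustration rather than the paper's construction: the abstract only states that each pair consists of two identical Gaussian vectors. The opposite output signs within a pair, the $1/\sqrt{m}$ scaling, and the names `coupled_init` and `two_layer_relu` are assumptions added here, chosen so that the network output is exactly zero at initialization.

```python
import math
import random


def coupled_init(num_pairs, dim, rng):
    """Draw hidden weights in pairs of identical Gaussian vectors.

    Assumption (not stated in the abstract): the two neurons of a pair
    get output signs +1 and -1, so their contributions cancel at init.
    """
    weights, signs = [], []
    for _ in range(num_pairs):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        weights.extend([w, list(w)])   # two identical copies
        signs.extend([1.0, -1.0])      # assumed opposite output weights
    return weights, signs


def two_layer_relu(x, weights, signs):
    """Compute f(x) = (1 / sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    total = 0.0
    for w, a in zip(weights, signs):
        pre = sum(wi * xi for wi, xi in zip(w, x))
        total += a * max(pre, 0.0)
    return total / math.sqrt(len(weights))


rng = random.Random(0)
weights, signs = coupled_init(num_pairs=8, dim=5, rng=rng)
x = [0.3, -1.2, 0.7, 2.0, -0.5]
output_at_init = two_layer_relu(x, weights, signs)
print(output_at_init)
```

Under these assumptions the two neurons of each pair cancel, so the printed value is zero for any input, while the pairs themselves remain independent Gaussian draws.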
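Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged that studies benign overfitting from a theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant-level test loss. Together, these results demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.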
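In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the step size is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both the $\ell_2$-statistical rate and variable selection consistency.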
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to $\log n$-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
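The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb{R}^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and with noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.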
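This article introduces a new criterion for convergence of gradient descent to a global minimum. The criterion is used to show that gradient descent with proper initialization converges to a global minimum when training any feedforward neural network with smooth and strictly increasing activation functions, provided that the input dimension is greater than or equal to the number of data points. The main difference from prior work is that the width of the network can be a fixed number instead of growing unrealistically as some multiple or power of the number of data points.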
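We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m = \mathrm{poly}(n, d)$), which induces a prohibitively large weight matrix $W \in \mathbb{R}^{m \times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both the forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that uses $m^2$ cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of $m$, i.e., $m^{2 - \Omega(1)}$ per iteration. To obtain this result, we make use of various techniques, including a shifted ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast rectangular matrix multiplication, tensor-based sketching techniques and preconditioning.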
In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
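For reference, the two estimators this abstract compares have standard closed forms. The notation below is an assumption, since the abstract does not fix any: $X \in \mathbb{R}^{n \times p}$ is the design matrix with $p \ge n$ and full row rank, $y \in \mathbb{R}^n$ is the response vector, and $\lambda$ is the ridge parameter (scaling conventions for $\lambda$ vary between papers).

```latex
\hat{\theta}_{\min}
  = \operatorname*{arg\,min}_{\theta \,:\, X\theta = y} \|\theta\|_2
  = X^\top (X X^\top)^{-1} y,
\qquad
\hat{\theta}_{\lambda}
  = X^\top (X X^\top + \lambda I_n)^{-1} y .
```

For $\lambda > 0$ the second expression coincides with the usual ridge solution $(X^\top X + \lambda I_p)^{-1} X^\top y$, and it reduces to the minimum-norm interpolator as $\lambda \to 0^+$. It remains well defined for negative $\lambda$ as long as $X X^\top + \lambda I_n$ is invertible, which is the sense in which a negative regularization parameter can be considered at all.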
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
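We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations $\sigma: \mathbb{R} \mapsto \mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast \in \mathbb{R}^d$ achieving $F(\mathbf{w}^\ast) = \epsilon$, where $F(\mathbf{w}) = \mathbf{E}_{(\mathbf{x}, y) \sim D}[(\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability, where $C > 1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries. For the logistic activation, we obtain the first polynomial-time constant-factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d \, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.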
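As a rough illustration of the kind of procedure described, the sketch below runs plain gradient descent on the empirical $L_2^2$-loss of a single logistic neuron. It is a simplified toy under explicit assumptions, not the paper's algorithm: the labels here are noise-free, no regularization or projection is used, and the data, the target vector `w_star`, the step size and the iteration count are all hypothetical.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def squared_loss(w, xs, ys):
    """Empirical L2^2 loss: mean of (sigma(w . x) - y)^2."""
    return sum((sigmoid(dot(w, x)) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Gradient of the empirical squared loss with respect to w."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(dot(w, x))
        coeff = 2.0 * (p - y) * p * (1.0 - p)   # chain rule through the sigmoid
        for j, xj in enumerate(x):
            g[j] += coeff * xj / len(xs)
    return g


rng = random.Random(0)
w_star = [1.0, -2.0, 0.5]                        # hypothetical target neuron
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
ys = [sigmoid(dot(w_star, x)) for x in xs]       # assumed noise-free labels

w = [0.0, 0.0, 0.0]
initial_loss = squared_loss(w, xs, ys)
for _ in range(2000):
    g = gradient(w, xs, ys)
    w = [wj - 0.5 * gj for wj, gj in zip(w, g)]  # assumed step size 0.5
final_loss = squared_loss(w, xs, ys)
print(initial_loss, final_loss)
```

The loss is non-convex in `w`, which is why the structural result about stationary points mentioned in the abstract matters: it is what would justify trusting the point that gradient descent stops at.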
Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even if the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and truly little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.