We continue a long line of research aimed at proving convergence of depth 2 neural networks, trained via gradient descent, to a global minimum. As in many previous works, our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances and network initialization, adversarial labels. It is more general in the sense that we allow both layers to be trained simultaneously and at {\em different} rates. Our results improve on the state of the art [Oymak Soltanolkotabi 20] (training the first layer only) and [Nguyen 21, Section 3.2] (training both layers with LeCun's initialization). We also report several simple experiments with synthetic data. They strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the ``NTK regime''.
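The setting described in this abstract can be illustrated with a minimal sketch, under stated assumptions: a depth 2 ReLU network $f(x) = \sum_j a_j \, \mathrm{ReLU}(\langle w_j, x \rangle)$, quadratic loss, Gaussian inputs, and full-batch gradient descent that updates both layers simultaneously with separate step sizes. The function names, the $1/\sqrt{d}$ and $1/\sqrt{m}$ initialization scalings, the random $\pm 1$ labels standing in for arbitrary labels, and the step sizes are all illustrative choices, not taken from the paper.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(W, a, x):
    # Depth 2 network: f(x) = sum_j a_j * relu(<w_j, x>)
    return sum(aj * relu(dot(wj, x)) for wj, aj in zip(W, a))


def quadratic_loss(W, a, X, y):
    return 0.5 * sum((predict(W, a, x) - yi) ** 2 for x, yi in zip(X, y))


def make_data(n, d, seed=1):
    # Gaussian inputs; random +/-1 labels are a stand-in for arbitrary labels.
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    return X, y


def train(X, y, width, steps, lr_first, lr_second, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    # Gaussian initialization; the 1/sqrt(d) and 1/sqrt(width) scalings are assumptions.
    W = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)] for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) / width ** 0.5 for _ in range(width)]
    losses = [quadratic_loss(W, a, X, y)]
    for _ in range(steps):
        grad_W = [[0.0] * d for _ in range(width)]
        grad_a = [0.0] * width
        for x, yi in zip(X, y):
            pre = [dot(wj, x) for wj in W]
            residual = sum(aj * relu(p) for aj, p in zip(a, pre)) - yi
            for j in range(width):
                grad_a[j] += residual * relu(pre[j])
                if pre[j] > 0.0:  # ReLU derivative: 1 on active units, 0 otherwise
                    for k in range(d):
                        grad_W[j][k] += residual * a[j] * x[k]
        # Simultaneous update of both layers, each with its own step size.
        for j in range(width):
            for k in range(d):
                W[j][k] -= lr_first * grad_W[j][k]
            a[j] -= lr_second * grad_a[j]
        losses.append(quadratic_loss(W, a, X, y))
    return W, a, losses
```

For example, `train(*make_data(n=5, d=10), width=50, steps=200, lr_first=0.02, lr_second=0.005)` returns the trained weights together with the loss after every step, so the effect of changing the ratio between the two rates can be inspected directly.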
translated by 谷歌翻译
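A common approach to training neural networks is to initialize all the weights as independent Gaussian vectors. We observe that by instead initializing the weights as independent pairs, where each pair consists of two identical Gaussian vectors, we can significantly improve the convergence analysis. While a similar technique has been studied for random inputs [Daniely, NeurIPS 2020], it has not been analyzed with arbitrary inputs. Using this technique, we show how to significantly reduce the number of neurons required for two-layer ReLU networks, both in the under-parameterized setting with logistic loss, from roughly $\gamma^{-8}$ [Ji and Telgarsky, ICLR 2020] to $\gamma^{-2}$, where $\gamma$ denotes the separation margin with a neural tangent kernel, and in the over-parameterized setting with squared loss, from roughly $n^4$ [Song and Yang, 2019] to $n^2$, implicitly also improving the recent running time bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the under-parameterized setting we also prove new lower bounds that improve on prior work and that, under certain assumptions, are best possible.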
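A minimal sketch of the paired initialization described above, under stated assumptions: hidden weights are drawn as independent pairs of identical Gaussian vectors, as the abstract says. The opposite-sign output weights $(+1, -1)$ within each pair and the $1/\sqrt{m}$ output scaling are assumptions added here for illustration (they make the network output exactly zero at initialization); the abstract itself does not specify the output layer. The names `paired_init` and `network` are hypothetical.

```python
import random


def paired_init(width, dim, seed=0):
    """Draw width/2 independent Gaussian vectors and duplicate each one."""
    if width % 2 != 0:
        raise ValueError("width must be even so neurons can be paired")
    rng = random.Random(seed)
    W, a = [], []
    for _ in range(width // 2):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Two identical copies of the same Gaussian vector form one pair.
        W.append(list(w))
        W.append(list(w))
        # Assumption: opposite output signs within a pair, so the pair cancels.
        a.append(1.0)
        a.append(-1.0)
    return W, a


def network(W, a, x):
    """Two-layer ReLU network with (assumed) 1/sqrt(width) output scaling."""
    total = 0.0
    for wj, aj in zip(W, a):
        pre = sum(wi * xi for wi, xi in zip(wj, x))
        total += aj * max(0.0, pre)
    return total / len(W) ** 0.5
```

With these assumed output signs, `network(W, a, x)` is zero for every input `x` at initialization, since each pair contributes the same activation once with each sign.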
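In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is smaller than the number of parameters in the model. We show that with quadratic activations the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen according to a Gaussian distribution and the labels are generated according to planted weight coefficients.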
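Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparameterized models have been studied extensively in recent years, and the virtues of overparameterization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective, via the structural properties of the optimization landscape. Despite the remarkable success of deep learning architectures in the overparameterized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is therefore imperative to understand how overparameterization fundamentally affects robustness. In this paper, we provide a precise characterization of the role of overparameterization in robustness by focusing on random features regression models (two-layer neural networks with random first-layer weights). We consider a regime in which the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our theory reveals the nontrivial effect of overparameterization on robustness and indicates that, for adversarially trained random features models, high overparameterization can hurt robust generalization.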
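The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least one layer with $\Omega(n)$ neurons, $n$ being the number of training samples. Furthermore, there is increasing evidence that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $\Omega(n)$ and, hence, the number of neurons is as little as $\Omega(\sqrt{n})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.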
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
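Modern neural networks are often operated in a strongly overparameterized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparameterization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparameterized regime. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd \gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).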
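A recent line of work has analyzed the theoretical properties of deep neural networks via the Neural Tangent Kernel (NTK). In particular, the smallest eigenvalue of the NTK has been related to the memorization capacity, the global convergence of gradient descent algorithms and the generalization of deep nets. However, existing results either provide bounds in the two-layer setting or assume that the spectrum of the NTK matrices is bounded away from 0 for multi-layer networks. In this paper, we provide tight bounds on the smallest eigenvalue of NTK matrices, both in the limiting case of infinite widths and for finite widths. In the finite-width setting, the network architectures we consider are fairly general: we require the existence of a wide layer with roughly order of $n$ neurons, $n$ being the number of data samples; the scaling of the remaining layer widths is arbitrary (up to logarithmic factors). To obtain our results, we analyze various quantities of independent interest: we give lower bounds on the smallest singular value of hidden feature matrices, and upper bounds on the Lipschitz constant of input-output feature maps.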
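In the negative perceptron problem we are given $n$ data points $({\boldsymbol x}_i, y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i \in \{+1, -1\}$ is a binary label. The data are not linearly separable, and hence we content ourselves with finding a linear classifier with the largest possible \emph{negative} margin. In other words, we want to find a unit-norm vector ${\boldsymbol \theta}$ that maximizes $\min_{i \le n} y_i \langle {\boldsymbol \theta}, {\boldsymbol x}_i \rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum-norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n, d \to \infty$ with $n/d \to \delta$, and prove upper and lower bounds on the maximum margin $\kappa_{\text{s}}(\delta)$ or, equivalently, on its inverse function $\delta_{\text{s}}(\kappa)$. In other words, $\delta_{\text{s}}(\kappa)$ is the overparameterization threshold: for $n/d \le \delta_{\text{s}}(\kappa) - \varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d \ge \delta_{\text{s}}(\kappa) + \varepsilon$ it does not. Our bounds on $\delta_{\text{s}}(\kappa)$ match as $\kappa \to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.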
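The objective in this abstract, $\min_{i \le n} y_i \langle {\boldsymbol \theta}, {\boldsymbol x}_i \rangle$ for a unit-norm ${\boldsymbol \theta}$, can be evaluated directly. The sketch below only computes the margin of a given candidate direction; it does not solve the non-convex maximization, and the function name `margin` and the toy data set are illustrative assumptions.

```python
import math


def margin(theta, X, y):
    """Return min_i y_i * <theta/|theta|, x_i> for a candidate direction theta."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm == 0.0:
        raise ValueError("theta must be nonzero")
    unit = [t / norm for t in theta]  # project onto the unit sphere
    return min(yi * sum(u * xi for u, xi in zip(unit, x)) for x, yi in zip(X, y))


# Toy data that is not linearly separable: four points on the axes, all labeled +1.
X_toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_toy = [1.0, 1.0, 1.0, 1.0]
```

On this toy set every direction has a negative margin: `margin([1.0, 0.0], X_toy, y_toy)` is $-1$, while the diagonal direction `[1.0, 1.0]` gives $-1/\sqrt{2}$, which is the best achievable value here because the margin equals $-\max(|\theta_1|, |\theta_2|)$.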
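We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d-1}$ and rotation-invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "damped deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Aside from the underparameterized regime, the damped deviations point of view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offer a simple and unifying perspective on the dynamics when optimizing the squared error.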
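This paper surveys numerical optimization methods for machine learning problems. Since machine learning models are highly parameterized, we focus on methods suited to high-dimensional optimization. We build intuition on quadratic models to determine which methods are suitable for non-convex optimization, and develop convergence proofs on convex functions for this selection of methods. With this theoretical foundation for stochastic gradient descent and momentum methods, we try to explain why the methods commonly used in the machine learning field are so successful. Besides explaining successful heuristics, the last chapter also provides a more extensive review of more theoretical methods, which are not as popular in practice. So in some sense this work attempts to answer the question: why are the default TensorFlow optimizers included in the defaults?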
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve convergence as fast as the original network. The contribution over previous work is that not only is the bias allowed to be updated by gradient descent under our setting, but a finer analysis is also given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
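We consider the problem of training a multi-layer over-parameterized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parameterization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m = \mathrm{poly}(n, d)$), which induces a prohibitively large weight matrix $W \in \mathbb{R}^{m \times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both the forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that uses $m^2$ cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. To obtain this result, we make use of various techniques, including a shifted ReLU-based sparsifier, a lazy low-rank maintenance data structure, fast matrix multiplication, tensor-based sketching techniques and preconditioning.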
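Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem in explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded as important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that, under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace, and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property on the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks that involves the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and on local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis through uniform convergence.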
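We prove that networks initialized by, for example, the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the network found essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations for some multi-dimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.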
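Approximate message passing (AMP) is an effective iterative paradigm for solving high-dimensional statistical problems. However, existing AMP theory falls short of predicting the AMP dynamics once the number of iterations exceeds $o\big(\frac{\log n}{\log \log n}\big)$ (with $n$ the problem dimension). To address this inadequacy, this paper develops a non-asymptotic framework for understanding AMP in spiked matrix estimation. Built upon a new decomposition of the AMP updates and controllable residual terms, we lay out an analysis recipe to characterize the finite-sample behavior of AMP in the presence of an independent initialization, which is further generalized to allow for spectral initialization. As two concrete consequences of the proposed analysis recipe: (i) when solving $\mathbb{Z}_2$ synchronization, we predict the behavior of spectrally initialized AMP for up to $O\big(\frac{n}{\mathrm{poly} \log n}\big)$ iterations, showing that the algorithm succeeds without the need for a subsequent refinement stage (as conjectured recently by \citet{celentano2021local}); (ii) we characterize the non-asymptotic behavior of AMP in sparse PCA (in the spiked Wigner model) for a broad range of signal-to-noise ratios.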
Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even if the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and truly little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.
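We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and the type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks with inputs bounded in Euclidean norm. We begin by proving that, in general, controlling the spectral norm of the hidden-layer weight matrix is insufficient to obtain uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where (perhaps surprisingly) spectral norm control alone turns out to be sufficient: first, when the network's activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches.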
We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.
Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.