We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.
translated by 谷歌翻译
现代神经网络通常以强烈的过度构造状态运行:它们包含许多参数,即使实际标签被纯粹随机的标签代替,它们也可以插入训练集。尽管如此,他们在看不见的数据上达到了良好的预测错误:插值训练集并不会导致巨大的概括错误。此外,过度散色化似乎是有益的,因为它简化了优化景观。在这里,我们在神经切线(NT)制度中的两层神经网络的背景下研究这些现象。我们考虑了一个简单的数据模型,以及各向同性协变量的矢量,$ d $尺寸和$ n $隐藏的神经元。我们假设样本量$ n $和尺寸$ d $都很大,并且它们在多项式上相关。我们的第一个主要结果是对过份术的经验NT内核的特征结构的特征。这种表征意味着必然的表明,经验NT内核的最低特征值在$ ND \ gg n $后立即从零界限,因此网络可以在同一制度中精确插值任意标签。我们的第二个主要结果是对NT Ridge回归的概括误差的表征,包括特殊情况,最小值-ULL_2 $ NORD插值。我们证明,一旦$ nd \ gg n $,测试误差就会被内核岭回归之一相对于无限宽度内核而近似。多项式脊回归的误差依次近似后者,从而通过与激活函数的高度组件相关的“自我诱导的”项增加了正则化参数。多项式程度取决于样本量和尺寸(尤其是$ \ log n/\ log d $)。
translated by 谷歌翻译
在本文中,我们研究了学习最适合培训数据集的浅层人工神经网络的问题。我们在过度参数化的制度中研究了这个问题,在该制度中,观测值的数量少于模型中的参数数量。我们表明,通过二次激活,训练的优化景观这种浅神经网络具有某些有利的特征,可以使用各种局部搜索启发式方法有效地找到全球最佳模型。该结果适用于输入/输出对的任意培训数据。对于可区分的激活函数,我们还表明,适当初始化的梯度下降以线性速率收敛到全球最佳模型。该结果着重于选择输入的可实现模型。根据高斯分布和标签是根据种植的重量系数生成的。
translated by 谷歌翻译
在深度学习中的优化分析是连续的,专注于(变体)梯度流动,或离散,直接处理(变体)梯度下降。梯度流程可符合理论分析,但是风格化并忽略计算效率。它代表梯度下降的程度是深度学习理论的一个开放问题。目前的论文研究了这个问题。将梯度下降视为梯度流量初始值问题的近似数值问题,发现近似程度取决于梯度流动轨迹周围的曲率。然后,我们表明,在具有均匀激活的深度神经网络中,梯度流动轨迹享有有利的曲率,表明它们通过梯度下降近似地近似。该发现允许我们将深度线性神经网络的梯度流分析转换为保证梯度下降,其几乎肯定会在随机初始化下有效地收敛到全局最小值。实验表明,在简单的深度神经网络中,具有传统步长的梯度下降确实接近梯度流。我们假设梯度流动理论将解开深入学习背后的奥秘。
translated by 谷歌翻译
我们为特殊神经网络架构,称为运营商复发性神经网络的理论分析,用于近似非线性函数,其输入是线性运算符。这些功能通常在解决方案算法中出现用于逆边值问题的问题。传统的神经网络将输入数据视为向量,因此它们没有有效地捕获与对应于这种逆问题中的数据的线性运算符相关联的乘法结构。因此,我们介绍一个类似标准的神经网络架构的新系列,但是输入数据在向量上乘法作用。由较小的算子出现在边界控制中的紧凑型操作员和波动方程的反边值问题分析,我们在网络中的选择权重矩阵中促进结构和稀疏性。在描述此架构后,我们研究其表示属性以及其近似属性。我们还表明,可以引入明确的正则化,其可以从所述逆问题的数学分析导出,并导致概括属性上的某些保证。我们观察到重量矩阵的稀疏性改善了概括估计。最后,我们讨论如何将运营商复发网络视为深度学习模拟,以确定诸如用于从边界测量的声波方程中重建所未知的WAVESTED的边界控制的算法算法。
translated by 谷歌翻译
我们考虑估计与I.I.D的排名$ 1 $矩阵因素的问题。高斯,排名$ 1 $的测量值,这些测量值非线性转化和损坏。考虑到非线性的两种典型选择,我们研究了从随机初始化开始的此非convex优化问题的天然交流更新规则的收敛性能。我们通过得出确定性递归,即使在高维问题中也是准确的,我们显示出算法的样本分割版本的敏锐收敛保证。值得注意的是,虽然无限样本的种群更新是非信息性的,并提示单个步骤中的精确恢复,但算法 - 我们的确定性预测 - 从随机初始化中迅速地收敛。我们尖锐的非反应分析也暴露了此问题的其他几种细粒度,包括非线性和噪声水平如何影响收敛行为。从技术层面上讲,我们的结果可以通过证明我们的确定性递归可以通过我们的确定性顺序来预测我们的确定性序列,而当每次迭代都以$ n $观测来运行时,我们的确定性顺序可以通过$ n^{ - 1/2} $的波动。我们的技术利用了源自有关高维$ m $估计文献的遗留工具,并为通过随机数据的其他高维优化问题的随机初始化而彻底地分析了高阶迭代算法的途径。
translated by 谷歌翻译
强大的机器学习模型的开发中的一个重要障碍是协变量的转变,当训练和测试集的输入分布时发生的分配换档形式在条件标签分布保持不变时发生。尽管现实世界应用的协变量转变普遍存在,但在现代机器学习背景下的理论理解仍然缺乏。在这项工作中,我们检查协变量的随机特征回归的精确高尺度渐近性,并在该设置中提出了限制测试误差,偏差和方差的精确表征。我们的结果激发了一种自然部分秩序,通过协变速转移,提供足够的条件来确定何时何时损害(甚至有助于)测试性能。我们发现,过度分辨率模型表现出增强的协会转变的鲁棒性,为这种有趣现象提供了第一个理论解释之一。此外,我们的分析揭示了分销和分发外概率性能之间的精确线性关系,为这一令人惊讶的近期实证观察提供了解释。
translated by 谷歌翻译
我们研究了神经网络中平方损耗训练问题的优化景观和稳定性,但通用非线性圆锥近似方案。据证明,如果认为非线性圆锥近似方案是(以适当定义的意义)比经典线性近似方法更具表现力,并且如果存在不完美的标签向量,则在方位损耗的训练问题必须在其中不稳定感知其解决方案集在训练数据中的标签向量上不连续地取决于标签向量。我们进一步证明对这些不稳定属性负责的效果也是马鞍点出现的原因和杂散的局部最小值,这可能是从全球解决方案的任意遥远的,并且既不训练问题也不是训练问题的不稳定性通常,杂散局部最小值的存在可以通过向目标函数添加正则化术语来克服衡量近似方案中参数大小的目标函数。无论可实现的可实现性是否满足,后一种结果都被证明是正确的。我们表明,我们的分析特别适用于具有可变宽度的自由结插值方案和深层和浅层神经网络的培训问题,其涉及各种激活功能的任意混合(例如,二进制,六骨,Tanh,arctan,软标志, ISRU,Soft-Clip,SQNL,Relu,Lifley Relu,Soft-Plus,Bent Identity,Silu,Isrlu和ELU)。总之,本文的发现说明了神经网络和一般非线性圆锥近似仪器的改进近似特性以直接和可量化的方式与必须解决的优化问题的不期望的性质链接,以便训练它们。
translated by 谷歌翻译
我们证明了由例如He等人提出的广泛使用的方法。(2015年)并使用梯度下降对最小二乘损失进行训练并不普遍。具体而言,我们描述了一大批一维数据生成分布,较高的概率下降只会发现优化景观的局部最小值不好,因为它无法将其偏离偏差远离其初始化,以零移动。。事实证明,在这些情况下,即使目标函数是非线性的,发现的网络也基本执行线性回归。我们进一步提供了数值证据,表明在实际情况下,对于某些多维分布而发生这种情况,并且随机梯度下降表现出相似的行为。我们还提供了有关初始化和优化器的选择如何影响这种行为的经验结果。
translated by 谷歌翻译
了解通过随机梯度下降(SGD)训练的神经网络的特性是深度学习理论的核心。在这项工作中,我们采取了平均场景,并考虑通过SGD培训的双层Relu网络,以实现一个非变量正则化回归问题。我们的主要结果是SGD偏向于简单的解决方案:在收敛时,Relu网络实现输入的分段线性图,以及“结”点的数量 - 即,Relu网络估计器的切线变化的点数 - 在两个连续的训练输入之间最多三个。特别地,随着网络的神经元的数量,通过梯度流的解决方案捕获SGD动力学,并且在收敛时,重量的分布方法接近相关的自由能量的独特最小化器,其具有GIBBS形式。我们的主要技术贡献在于分析了这一最小化器产生的估计器:我们表明其第二阶段在各地消失,除了代表“结”要点的一些特定地点。我们还提供了经验证据,即我们的理论预测的不同可能发生与数据点不同的位置的结。
translated by 谷歌翻译
在负面的感知问题中,我们给出了$ n $数据点$({\ boldsymbol x} _i,y_i)$,其中$ {\ boldsymbol x} _i $是$ d $ -densional vector和$ y_i \ in \ { + 1,-1 \} $是二进制标签。数据不是线性可分离的,因此我们满足自己的内容,以找到最大的线性分类器,具有最大的\ emph {否定}余量。换句话说,我们想找到一个单位常规矢量$ {\ boldsymbol \ theta} $,最大化$ \ min_ {i \ le n} y_i \ langle {\ boldsymbol \ theta},{\ boldsymbol x} _i \ rangle $ 。这是一个非凸优化问题(它相当于在Polytope中找到最大标准矢量),我们在两个随机模型下研究其典型属性。我们考虑比例渐近,其中$ n,d \ to \ idty $以$ n / d \ to \ delta $,并在最大边缘$ \ kappa _ {\ text {s}}(\ delta)上证明了上限和下限)$或 - 等效 - 在其逆函数$ \ delta _ {\ text {s}}(\ kappa)$。换句话说,$ \ delta _ {\ text {s}}(\ kappa)$是overparametization阈值:以$ n / d \ le \ delta _ {\ text {s}}(\ kappa) - \ varepsilon $一个分类器实现了消失的训练错误,具有高概率,而以$ n / d \ ge \ delta _ {\ text {s}}(\ kappa)+ \ varepsilon $。我们在$ \ delta _ {\ text {s}}(\ kappa)$匹配,以$ \ kappa \ to - \ idty $匹配。然后,我们分析了线性编程算法来查找解决方案,并表征相应的阈值$ \ delta _ {\ text {lin}}(\ kappa)$。我们观察插值阈值$ \ delta _ {\ text {s}}(\ kappa)$和线性编程阈值$ \ delta _ {\ text {lin {lin}}(\ kappa)$之间的差距,提出了行为的问题其他算法。
translated by 谷歌翻译
Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.
translated by 谷歌翻译
通过梯度流优化平均平衡误差,研究了功能空间中神经网络的动态。我们认为,在underParameterized制度中,网络了解由与其特征值对应的率的神经切线内核(NTK)确定的整体运算符$ t_ {k ^ \ infty} $的特征功能。例如,对于SPENTE $ S ^ {D-1} $和旋转不变的权重分配的均匀分布式数据,$ t_ {k ^ \ infty} $的特征函数是球形谐波。我们的结果可以理解为描述interparameterized制度中的光谱偏压。证据使用“阻尼偏差”的概念,其中NTK物质对具有由于阻尼因子的发生而具有大特征值的特征的偏差。除了下公共条例的制度之外,阻尼偏差可用于跟踪过度分辨率设置中经验风险的动态,允许我们在文献中延长某些结果。我们得出结论,阻尼偏差在优化平方误差时提供了动态的简单和统一的视角。
translated by 谷歌翻译
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets.This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed-either explicitly or implicitly-to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
套索是一种高维回归的方法,当时,当协变量$ p $的订单数量或大于观测值$ n $时,通常使用它。由于两个基本原因,经典的渐近态性理论不适用于该模型:$(1)$正规风险是非平滑的; $(2)$估算器$ \ wideHat {\ boldsymbol {\ theta}} $与true参数vector $ \ boldsymbol {\ theta}^*$无法忽略。结果,标准的扰动论点是渐近正态性的传统基础。另一方面,套索估计器可以精确地以$ n $和$ p $大,$ n/p $的订单为一。这种表征首先是在使用I.I.D的高斯设计的情况下获得的。协变量:在这里,我们将其推广到具有非偏差协方差结构的高斯相关设计。这是根据更简单的``固定设计''模型表示的。我们在两个模型中各种数量的分布之间的距离上建立了非反应界限,它们在合适的稀疏类别中均匀地固定在信号上$ \ boldsymbol {\ theta}^*$。作为应用程序,我们研究了借助拉索的分布,并表明需要校正程度对于计算有效的置信区间是必要的。
translated by 谷歌翻译
成功的深度学习模型往往涉及培训具有比训练样本数量更多的参数的神经网络架构。近年来已经广泛研究了这种超分子化的模型,并且通过双下降现象和通过优化景观的结构特性,从统计的角度和计算视角都建立了过分统计化的优点。尽管在过上分层的制度中深入学习架构的显着成功,但也众所周知,这些模型对其投入中的小对抗扰动感到高度脆弱。即使在普遍培训的情况下,它们在扰动输入(鲁棒泛化)上的性能也会比良性输入(标准概括)的最佳可达到的性能更糟糕。因此,必须了解如何从根本上影响稳健性的情况下如何影响鲁棒性。在本文中,我们将通过专注于随机特征回归模型(具有随机第一层权重的两层神经网络)来提供超分度化对鲁棒性的作用的精确表征。我们考虑一个制度,其中样本量,输入维度和参数的数量彼此成比例地生长,并且当模型发生前列地训练时,可以为鲁棒泛化误差导出渐近精确的公式。我们的发达理论揭示了过分统计化对鲁棒性的非竞争效果,表明对于普遍训练的随机特征模型,高度公正化可能会损害鲁棒泛化。
translated by 谷歌翻译
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
translated by 谷歌翻译
In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
translated by 谷歌翻译
神经切线内核(NTK)已成为提供记忆,优化和泛化的强大工具,可保证深度神经网络。一项工作已经研究了NTK频谱的两层和深网,其中至少具有$ \ omega(n)$神经元的层,$ n $是培训样本的数量。此外,有越来越多的证据表明,只要参数数量超过样品数量,具有亚线性层宽度的深网是强大的记忆和优化器。因此,一个自然的开放问题是NTK是否在如此充满挑战的子线性设置中适应得很好。在本文中,我们以肯定的方式回答了这个问题。我们的主要技术贡献是对最小的深网的最小NTK特征值的下限,最小可能的过度参数化:参数的数量大约为$ \ omega(n)$,因此,神经元的数量仅为$ $ $ \ omega(\ sqrt {n})$。为了展示我们的NTK界限的适用性,我们为梯度下降训练提供了两个有关记忆能力和优化保证的结果。
translated by 谷歌翻译