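To understand the behavior of trained deep neural networks theoretically, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and have led to practical insights on real-world deep networks. For two-layer neural networks, these asymptotics have shown that the nature of the trained model changes radically with the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit, and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze the resulting dynamics in detail. In particular, one of these methods consists of using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization $\mu$P. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation function that is not yet captured by theory.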
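Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists of scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that, with standard i.i.d. initializations, the only non-trivial dynamics arise for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to the identity mapping). In the continuous-time limit, this scaling factor corresponds to a neural stochastic differential equation, contrary to the widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between the scaling and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly affect performance before and after training.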
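The effect of the scaling factor at initialization can be illustrated numerically. The sketch below is not the authors' experimental setup: it assumes a simplified residual update h_{k+1} = h_k + alpha * W_k ReLU(h_k) with i.i.d. Gaussian weights of variance 1/width, and the function name `resnet_norm_ratio` and all default sizes are hypothetical choices made for this illustration.

```python
import numpy as np

def resnet_norm_ratio(depth, alpha, width=256, seed=0):
    """Return ||h_L|| / ||h_0|| for a randomly initialized toy ResNet.

    Assumed update (illustrative only): h <- h + alpha * W @ relu(h),
    with W drawn i.i.d. N(0, 1/width) independently at every layer.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    h0_norm = np.linalg.norm(h)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + alpha * (W @ np.maximum(h, 0.0))
    return np.linalg.norm(h) / h0_norm

depth = 200
ratio_one = resnet_norm_ratio(depth, alpha=1.0)                   # no scaling
ratio_sqrt = resnet_norm_ratio(depth, alpha=1.0 / np.sqrt(depth)) # 1/sqrt(L)
ratio_inv = resnet_norm_ratio(depth, alpha=1.0 / depth)           # 1/L
print(ratio_one, ratio_sqrt, ratio_inv)
```

Under these assumptions, the unscaled network should blow up with depth, the 1/L scaling should leave the hidden state almost unchanged, and the 1/sqrt(L) scaling should give an output norm of the same order as the input norm without reducing to the identity.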
The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which has been shown to be crucial to the final performance but slows down the optimization and requires more hyperparameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to the baselines while requiring significantly less training time and hyperparameter tuning on a wide range of applications.
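The two placements of layer normalization described above can be written down in a few lines. This is a minimal sketch, not the paper's code: the layer normalization has no learnable gain or bias, and `sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Post-LN: normalization sits between residual blocks, after the sum."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-LN: normalization sits inside the residual branch."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d_model = 16
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def sublayer(z):
    # Hypothetical stand-in for attention or a feed-forward sublayer.
    return np.maximum(z @ W, 0.0)

x = rng.standard_normal((4, d_model))
post_out = post_ln_block(x, sublayer)
pre_out = pre_ln_block(x, sublayer)
```

In the Pre-LN variant the skip connection carries `x` through unnormalized, so the output is the input plus a residual term. In the Post-LN variant every block output is renormalized.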
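The Neural Tangent Kernel (NTK) is widely used to analyze overparametrized neural networks because of the famous result by Jacot et al. (2018): in the infinite-width limit, the NTK is deterministic and constant during training. However, this result cannot explain the behavior of deep networks, since it generally does not hold if depth and width tend to infinity simultaneously. In this paper, we study the NTK of fully-connected ReLU networks with depth comparable to width. We prove that the NTK properties depend significantly on the depth-to-width ratio and on the distribution of parameters at initialization. In fact, our results indicate the importance of the three phases in the hyperparameter space identified in Poole et al. (2016): ordered, chaotic, and the edge of chaos (EOC). We derive exact expressions for the NTK dispersion in the infinite-depth-and-width limit in all three phases and conclude that the NTK variability grows exponentially with depth at the EOC and in the chaotic phase, but not in the ordered phase. We also show that the NTK of deep networks may stay constant during training only in the ordered phase, and we discuss how the structure of the NTK matrix changes during training.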
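How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson-Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with ReLU activation, the angle between two inputs contracts according to a known mapping. The question becomes much more intricate for non-linear convolutional neural networks (CNNs). To answer it, we introduce a geometric framework. For linear CNNs, we show that the Johnson-Lindenstrauss lemma continues to hold, namely that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: the angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.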
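The fully-connected case can be checked with a short simulation. The sketch below assumes that the "known mapping" is the standard degree-1 arc-cosine kernel formula, cos(theta') = (sin(theta) + (pi - theta) cos(theta)) / pi, which holds for a wide ReLU layer with i.i.d. Gaussian weights. The function names and sizes are hypothetical, and the convolutional setting studied in the paper is not covered.

```python
import numpy as np

def relu_angle_map(theta):
    """Output angle of a wide Gaussian ReLU layer (arc-cosine kernel, degree 1)."""
    c = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    return np.arccos(np.clip(c, -1.0, 1.0))

def angle(u, v):
    """Angle in radians between two vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def empirical_angles(theta, dim=50, width=20000, seed=0):
    """Angles after one random linear layer and after one random ReLU layer."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    x[0] = 1.0
    y = np.zeros(dim)
    y[0], y[1] = np.cos(theta), np.sin(theta)   # unit vector at angle theta to x
    W = rng.standard_normal((width, dim))
    linear_angle = angle(W @ x, W @ y)
    relu_angle = angle(np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0))
    return linear_angle, relu_angle

theta = np.pi / 2
linear_angle, relu_angle = empirical_angles(theta)
predicted = relu_angle_map(theta)
print(linear_angle, relu_angle, predicted)
```

For orthogonal inputs, the linear layer should keep the angle close to 90 degrees, while the ReLU layer should contract it to roughly the value given by the formula.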
We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.
Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.
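Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd \gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$-norm interpolation. We prove that, as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).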
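The role of the condition $Nd \gg n$ can be illustrated with a small numerical experiment. This is only a sketch under simplifying assumptions that are not taken from the paper: ReLU activation, unit-norm Gaussian inputs, second-layer weights of magnitude one, and only the first-layer part of the NT kernel. The function name `empirical_ntk_min_eig` and the sizes are hypothetical.

```python
import numpy as np

def empirical_ntk_min_eig(n, d, N, seed=0):
    """Smallest eigenvalue of a first-layer empirical NT kernel.

    Assumed kernel (illustrative):
        K_ij = (1/N) * sum_r 1[w_r.x_i > 0] * 1[w_r.x_j > 0] * <x_i, x_j>,
    i.e. the Gram matrix of the n x (N*d) Jacobian of a two-layer ReLU
    network with respect to its first-layer weights.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # isotropic unit-norm inputs
    W = rng.standard_normal((N, d))                 # first-layer weights
    S = (X @ W.T > 0).astype(float)                 # ReLU derivative pattern
    K = (S @ S.T / N) * (X @ X.T)                   # elementwise product
    return np.linalg.eigvalsh(K)[0]

n, d = 60, 20
lam_small = empirical_ntk_min_eig(n, d, N=2)      # N*d = 40 < n
lam_large = empirical_ntk_min_eig(n, d, N=2000)   # N*d = 40000 >> n
print(lam_small, lam_large)
```

When $Nd < n$ the kernel has rank at most $Nd$, so its smallest eigenvalue is numerically zero and arbitrary labels cannot be interpolated. With $Nd$ far above $n$ the smallest eigenvalue should be clearly positive in this toy setting.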
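Stochastic gradient descent with backpropagation is the workhorse of artificial neural networks. It has long been recognized that backpropagation fails to be a biologically plausible algorithm. Fundamentally, it is a non-local procedure: updating one neuron's synaptic weights requires knowledge of the synaptic weights or receptive fields of downstream neurons. This limits the use of artificial neural networks as a tool for understanding the biological principles of information processing in the brain. Lillicrap et al. (2016) propose a more biologically plausible "feedback alignment" algorithm that uses random and fixed backpropagation weights, and show promising simulations. In this paper, we study the mathematical properties of the feedback alignment procedure by analyzing convergence and alignment for two-layer networks under squared error loss. In the overparametrized setting, we prove that the error converges to zero exponentially fast, and also that regularization is necessary in order for the parameters to become aligned with the random backpropagation weights. Simulations are given that are consistent with this analysis and suggest further generalizations. These results contribute to our understanding of how biologically plausible algorithms might carry out weight learning in a manner different from Hebbian learning, with performance comparable to that of the full non-local backpropagation algorithm.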
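The difference between backpropagation and feedback alignment in a two-layer network fits in a few lines. The following is a sketch under assumptions chosen for illustration, not the paper's exact setup: a ReLU network f(x) = beta . relu(W x) / sqrt(p), full-batch gradient descent without regularization, and hypothetical sizes and step size. The only change for feedback alignment is that the first-layer update uses a fixed random vector `b` in place of the output weights `beta`.

```python
import numpy as np

def train_two_layer(use_feedback_alignment=True, n=20, d=10, p=500,
                    lr=0.1, steps=500, seed=0):
    """Train f(x) = beta . relu(W x) / sqrt(p) on random data, squared loss.

    Returns the loss history and the cosine between beta and the fixed
    feedback vector b at the end of training.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)
    W = rng.standard_normal((p, d))     # first-layer weights
    beta = rng.standard_normal(p)       # output weights
    b = rng.standard_normal(p)          # fixed random feedback weights
    losses = []
    for _ in range(steps):
        pre = X @ W.T                               # (n, p) pre-activations
        H = np.maximum(pre, 0.0)                    # hidden activations
        err = H @ beta / np.sqrt(p) - y             # (n,) residuals
        losses.append(0.5 * np.sum(err ** 2))
        # Backprop sends the error through beta; feedback alignment uses b.
        back = b if use_feedback_alignment else beta
        delta = err[:, None] * (pre > 0) * back[None, :]   # (n, p)
        grad_W = delta.T @ X / np.sqrt(p)
        grad_beta = H.T @ err / np.sqrt(p)          # always the true gradient
        W -= lr * grad_W
        beta -= lr * grad_beta
    cosine = beta @ b / (np.linalg.norm(beta) * np.linalg.norm(b))
    return np.array(losses), cosine

fa_losses, fa_cosine = train_two_layer(use_feedback_alignment=True)
bp_losses, _ = train_two_layer(use_feedback_alignment=False)
print(fa_losses[0], fa_losses[-1], bp_losses[-1], fa_cosine)
```

The returned cosine is one simple way to monitor alignment between `beta` and `b`. The sketch makes no claim about its value, since the abstract states that regularization, which is omitted here, is needed for alignment.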
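As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (for example, given by the Neural Tangent Kernel (NTK)) if it is parametrized appropriately (for example, with the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning. We propose simple modifications to the standard parametrization that allow for feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. On Word2Vec and on few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes the standard, NTK, and mean field parametrizations. We show that 1) any parametrization in this space either admits feature learning or has infinite-width training dynamics given by kernel gradient descent, but not both; and 2) any such infinite-width limit can be computed using the Tensor Programs technique. Code for our experiments can be found at github.com/edwardjhu/tp4.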
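We consider the problem of estimating the factors of a rank-$1$ matrix from i.i.d. Gaussian, rank-$1$ measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem, starting from a random initialization. We show sharp convergence guarantees for a sample-split version of the algorithm by deriving a deterministic recursion that is accurate even in high-dimensional problems. Notably, while the infinite-sample population update is uninformative and suggests exact recovery in a single step, the algorithm, and our deterministic prediction, converges geometrically fast from a random initialization. Our sharp, non-asymptotic analysis also exposes several other fine-grained properties of this problem, including how the nonlinearity and the noise level affect convergence behavior. On a technical level, our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of order $n^{-1/2}$ when each iteration is run with $n$ observations. Our technique leverages leave-one-out tools originating in the literature on high-dimensional $M$-estimation and provides an avenue for sharply analyzing higher-order iterative algorithms from a random initialization in other high-dimensional optimization problems with random data.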
The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?
The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net, a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for the widespread empirical success in compressing deep nets. Analysis of the correctness of our compression relies upon some newly identified "noise stability" properties of trained deep nets, which are also experimentally verified. The study of these properties and the resulting generalization bounds are also extended to convolutional nets, which had eluded earlier attempts at proving generalization.
