With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.
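As a concrete illustration of the scale-normalization point, here is a minimal sketch (not the paper's exact protocol) of a sharpness proxy whose perturbation scale is tied to each weight tensor's own norm; `model`, `loss_fn`, and the `(x, y)` batch are assumed to be supplied by the caller:

```python
# A minimal sketch of a scale-normalized sharpness proxy: perturb each weight
# tensor with Gaussian noise whose standard deviation is proportional to that
# tensor's own norm, and measure the average increase in loss.
import copy
import torch

def scale_normalized_sharpness(model, loss_fn, x, y, rel_std=0.01, n_trials=10):
    base_loss = loss_fn(model(x), y).item()
    increases = []
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                # Tie the noise scale to the parameter's norm so the measure
                # is insensitive to a global rescaling of the weights.
                sigma = rel_std * p.norm().item() / (p.numel() ** 0.5)
                p.add_(torch.randn_like(p) * sigma)
            increases.append(loss_fn(noisy(x), y).item() - base_loss)
    return sum(increases) / len(increases)
```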
We present a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The key ingredient is a bound on the changes in the output of a network with respect to perturbation of its weights, thereby bounding the sharpness of the network. We combine this perturbation bound with the PAC-Bayes analysis to derive the generalization bound.
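The capacity term in bounds of this form can be computed directly from the weights. Below is a hedged sketch of the product-of-squared-spectral-norms times sum-of-squared-Frobenius-ratios quantity; the paper's exact constants, depth factors, and margin normalization are omitted:

```python
# A sketch of the capacity term: the product of squared per-layer spectral
# norms times the sum of squared Frobenius-to-spectral ratios.
import numpy as np

def spectral_frobenius_complexity(weights):
    """weights: list of 2-D numpy arrays, one per layer."""
    spec = [np.linalg.norm(W, ord=2) for W in weights]  # largest singular value
    frob = [np.linalg.norm(W) for W in weights]         # Frobenius norm
    prod_spec_sq = np.prod([s ** 2 for s in spec])
    ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
    return prod_spec_sq * ratio_sum

rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 784)), rng.standard_normal((10, 256))]
print(spectral_frobenius_complexity(layers))
```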
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net: a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for the widespread empirical success in compressing deep nets. Analysis of the correctness of our compression relies upon some newly identified "noise stability" properties of trained deep nets, which are also experimentally verified. The study of these properties and the resulting generalization bounds is also extended to convolutional nets, which had eluded earlier attempts at proving generalization.
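To convey the flavor of the compress-then-bound framework, here is an illustrative, deliberately simple compression step (not the paper's scheme): truncate a weight matrix to rank r via SVD and count the parameters retained in factored form.

```python
# Illustrative low-rank compression of one weight matrix.
import numpy as np

def low_rank_compress(W, r):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_hat = (U[:, :r] * s[:r]) @ Vt[:r]            # best rank-r approximation
    n_params = r * (W.shape[0] + W.shape[1] + 1)   # params in factored form
    return W_hat, n_params

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
W_hat, n = low_rank_compress(W, 32)
print(n, W.size, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```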
Despite being heavily overparameterized, deep neural networks trained by stochastic gradient descent (SGD) generalize surprisingly well. Based on the Rademacher complexity of a pre-specified hypothesis set, various norm-based generalization bounds have been developed to explain this phenomenon. However, recent studies suggest that these bounds can grow with the size of the training set, contrary to empirical evidence. In this work, we argue that the hypothesis set explored by SGD is trajectory-dependent and may therefore admit a tighter bound on its Rademacher complexity. To this end, we characterize the SGD recursion by a stochastic differential equation, assuming that the stochastic gradient noise follows fractional Brownian motion. We then identify the Rademacher complexity in terms of covering numbers and relate it to the Hausdorff dimension of the optimization trajectory. By invoking hypothesis-set stability, we derive a novel generalization bound for deep neural networks. Extensive experiments demonstrate that it predicts the generalization gap well under several common experimental interventions. We further show that the Hurst parameter of the fractional Brownian motion is more informative than existing generalization indicators such as the power-law exponent and the upper Blumenthal-Getoor index.
This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR-10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.
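A hedged sketch of the margin-normalized spectral complexity described above, with the (2,1)-norm correction factor; the paper's reference matrices and constants are omitted:

```python
# Sketch: prod_i ||A_i||_2 times
# (sum_i (||A_i^T||_{2,1} / ||A_i||_2)^{2/3})^{3/2}, divided by the margin.
import numpy as np

def spectral_complexity(weights, margin):
    spec = [np.linalg.norm(A, ord=2) for A in weights]
    # ||A^T||_{2,1}: sum of the l2 norms of the rows of A.
    two_one = [np.linalg.norm(A, axis=1).sum() for A in weights]
    lipschitz = np.prod(spec)
    correction = sum((t / s) ** (2 / 3) for t, s in zip(two_one, spec)) ** 1.5
    return lipschitz * correction / margin
```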
We observe that given two (compatible) function classes $\mathcal{F}$ and $\mathcal{H}$ with small capacity, as measured by their uniform covering numbers, the composition class $\mathcal{H} \circ \mathcal{F}$ can become prohibitively large or even unbounded. We then show that adding a small amount of Gaussian noise to the output of $\mathcal{F}$ before composing it with $\mathcal{H}$ can effectively control the capacity of $\mathcal{H} \circ \mathcal{F}$, offering a general recipe for modular design. To prove our results, we define new notions of uniform covering numbers of random functions with respect to the total variation and Wasserstein distances. We instantiate our results for multi-layer sigmoid neural networks. Preliminary empirical results on the MNIST dataset suggest that the amount of noise required to improve over existing uniform bounds can be numerically negligible (i.e., element-wise i.i.d. Gaussian noise with standard deviation $10^{-240}$). The source code is available at https://github.com/fathollahpour/composition_noise.
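A minimal sketch of the noise-injected composition: evaluate f, add elementwise Gaussian noise, then evaluate h. The toy modules and sigma value below are illustrative, not the paper's experimental setting.

```python
import numpy as np

def noisy_compose(f, h, sigma, rng=None):
    """Compose h after f, injecting elementwise Gaussian noise in between."""
    if rng is None:
        rng = np.random.default_rng()
    def composed(x):
        z = f(x)
        return h(z + rng.normal(0.0, sigma, size=z.shape))
    return composed

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: sigmoid(x @ np.ones((4, 3)))  # toy first module
h = lambda z: sigmoid(z @ np.ones((3, 2)))  # toy second module
g = noisy_compose(f, h, sigma=1e-3)
print(g(np.zeros((2, 4))))
```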
We give a general recipe for derandomising PAC-Bayesian bounds using margins; the critical ingredient is that our randomised predictions concentrate around some value. The tools we develop lead directly to margin bounds for various classifiers, including linear prediction (a class that includes boosting and the support vector machine), single-hidden-layer neural networks with an unusual \(\mathrm{erf}\) activation function, and deep ReLU networks. Further, we extend to partially derandomised predictors, where only some of the randomness is removed, letting us cover cases where the concentration properties of our predictors are otherwise poor.
This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
In many contexts, simpler models are preferable to more complex ones, and controlling model complexity is the goal of many methods in machine learning, such as regularization, hyperparameter tuning, and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not well suited to deep neural networks. Here we develop the notion of geometric complexity, a measure of the variability of the model function computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics, such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization, and the choice of parameter initialization, all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
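A sketch of the discrete Dirichlet energy underlying geometric complexity: the mean squared Frobenius norm of the model's input-Jacobian over a batch. `model` and `x` are assumed given; one backward pass per output unit keeps the code simple rather than fast.

```python
import torch

def geometric_complexity(model, x):
    """Mean squared Frobenius norm of the input-Jacobian over the batch."""
    x = x.clone().requires_grad_(True)
    out = model(x)                  # shape (batch, n_out)
    sq = 0.0
    for k in range(out.shape[1]):   # accumulate per-output-unit gradients
        g, = torch.autograd.grad(out[:, k].sum(), x, retain_graph=True)
        sq = sq + (g ** 2).sum()
    return sq / x.shape[0]
```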
Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite their large number of parameters contradicts this finding and constitutes a major open problem in explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) is believed to be important, but its specific principles remain unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that, under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace, which induces implicit regularization and yields tighter bounds on the generalization error of deep neural networks. To derive these bounds, we first introduce stagnation sets around local minima and impose a local essential convexity property on the population risk. Under these conditions, we derive lower bounds for SGD to remain within these stagnation sets. If stagnation occurs, we obtain bounds on the generalization error of deep neural networks that involve the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values along the SGD iterates, together with local uniform convergence of the empirical loss based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis through uniform convergence.
We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function ("erf") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through PAC-Bayesian theory; unlike most existing bounds, they apply to neural networks with deterministic or randomised parameters. Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.
Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., understanding a model's power to generalize. Various capacity measures try to capture this ability, but they often fall short in explaining important characteristics of models observed in practice. In this study, we propose the local effective dimension as a capacity measure that appears to correlate well with generalization error on standard datasets. Importantly, we prove that the local effective dimension bounds the generalization error, and we discuss the aptness of this capacity measure for machine learning models.
We investigate the capacity, convexity, and characterization of a general family of norm-constrained feed-forward networks.
Existing generalization bounds fail to explain crucial factors that drive generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization, and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
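As a rough illustration of the quantity such instance-dependent bounds depend on, here is a sketch (not the paper's estimator) of a local Lipschitz estimate around a single input `x`:

```python
# Largest ratio ||f(x + d) - f(x)|| / ||d|| over small random directions d.
import torch

def local_lipschitz_estimate(model, x, radius=0.1, n_dirs=64):
    with torch.no_grad():
        fx = model(x)
        ratios = []
        for _ in range(n_dirs):
            d = torch.randn_like(x)
            d = radius * d / d.norm()
            ratios.append(((model(x + d) - fx).norm() / d.norm()).item())
    return max(ratios)
```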
Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information-theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.
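A minimal sketch, in the style of FGSM, of the small gradient-aligned perturbations referred to in the opening sentences; `model`, `loss_fn`, and a labeled example `(x, y)` are assumed:

```python
# Illustrative gradient-sign input perturbation.
import torch

def fgsm_perturb(model, loss_fn, x, y, eps=0.03):
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()
```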
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
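The randomization test at the heart of these experiments is easy to reproduce in miniature. A compact sketch with illustrative sizes and hyperparameters:

```python
# Train an over-parameterized network on randomly labeled data and watch
# training accuracy approach 100%.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 256, 32, 10
x = torch.randn(n, d)
y = torch.randint(0, k, (n,))          # labels carry no signal at all

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"train accuracy on random labels: {acc:.3f}")  # typically ~1.0
```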
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach [2018], we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.
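In the spirit of the setup described (the paper's exact parameterization differs), the kernel-to-rich transition can be probed by a single output scale alpha:

```python
# Large alpha (with learning rate ~ 1/alpha^2) keeps weights near their
# initialization ("lazy"/kernel regime); small alpha lets features move
# ("rich" regime).
import torch.nn as nn

class ScaledTwoLayer(nn.Module):
    def __init__(self, d, width, alpha):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(nn.Linear(d, width), nn.ReLU(),
                                 nn.Linear(width, 1))

    def forward(self, x):
        return self.alpha * self.net(x)
```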
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian, and we obtain similar results for the generalized Gauss-Newton approximation of the Hessian. As a consequence of our theorems, we derive an analytical expression for the maximal learning rate as a function of batch size, informing practical training regimens for smooth, non-convex deep neural networks with both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square-root scaling). Whilst the linear scaling rule for stochastic gradient descent has previously been derived under more restrictive conditions, which we generalize, the square-root scaling rule for adaptive optimizers is, to our knowledge, completely novel. For stochastic second-order and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims for VGG/WideResNet architectures on the CIFAR-100 and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop an on-the-fly learning rate and momentum learner based on stochastic Lanczos quadrature, which avoids the need for expensive multiple evaluations of these key hyperparameters and shows good preliminary results on the Pre-Residual architecture for CIFAR-100.
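The two scaling rules are simple enough to state in code. A sketch, where `base_lr` is the rate tuned at batch size `base_batch`; both names are placeholders, not from the paper:

```python
# Linear scaling for SGD, square-root scaling for adaptive methods.
def scaled_lr(base_lr, base_batch, new_batch, rule="sgd"):
    ratio = new_batch / base_batch
    if rule == "sgd":      # linear scaling for SGD
        return base_lr * ratio
    if rule == "adam":     # square-root scaling for adaptive optimizers
        return base_lr * ratio ** 0.5
    raise ValueError(rule)

print(scaled_lr(0.1, 128, 1024, "sgd"))    # 0.8
print(scaled_lr(1e-3, 128, 1024, "adam"))  # ~2.83e-3
```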
Studying the sensitivity of neural networks to weight perturbations and its impact on model performance, including generalization and robustness, is an active research topic because of its implications for a wide range of machine learning tasks such as model compression, generalization-gap assessment, and adversarial attacks. In this paper, we provide the first integral study and analysis of feedforward neural networks in terms of their robustness and generalization behavior under weight perturbations. We further design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations. Empirical experiments are conducted to validate our theoretical analysis. Our results offer fundamental insights into characterizing the generalization and robustness of neural networks against weight perturbations.
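A small illustrative diagnostic (not the paper's loss or analysis) for empirical robustness to weight perturbation: average accuracy after adding i.i.d. Gaussian noise to the weights; `model`, `x`, `y` are assumed to exist:

```python
import copy
import torch

def accuracy_under_weight_noise(model, x, y, std=0.01, n_trials=20):
    accs = []
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * std)
            accs.append((noisy(x).argmax(dim=1) == y).float().mean().item())
    return sum(accs) / len(accs)
```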
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.