We present a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The key ingredient is a bound on the changes in the output of a network with respect to perturbation of its weights, thereby bounding the sharpness of the network. We combine this perturbation bound with the PAC-Bayes analysis to derive the generalization bound.
translated by 谷歌翻译
With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.
translated by 谷歌翻译
我们使用边缘赋予易于思考Pac-Bayesian界的一般配方,临界成分是我们随机预测集中在某种程度上集中。我们开发的工具直接导致各种分类器的裕度界限,包括线性预测 - 一个类,包括升高和支持向量机 - 单隐藏层神经网络,具有异常\(\ ERF \)激活功能,以及深度释放网络。此外,我们延伸到部分易碎的预测器,其中只去除一些随机性,让我们延伸到我们预测器的浓度特性否则差的情况。
translated by 谷歌翻译
This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.
translated by 谷歌翻译
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that're orders of magnitude better in practice. These rely upon new succinct reparametrizations of the trained net -a compression that is explicit and efficient. These yield generalization bounds via a simple compression-based framework introduced here. Our results also provide some theoretical justification for widespread empirical success in compressing deep nets.Analysis of correctness of our compression relies upon some newly identified "noise stability"properties of trained deep nets, which are also experimentally verified. The study of these properties and resulting generalization bounds are also extended to convolutional nets, which had eluded earlier attempts on proving generalization.
translated by 谷歌翻译
我们观察到,给定两个(兼容的)函数类别$ \ MATHCAL {f} $和$ \ MATHCAL {h} $,具有较小的容量,按其均匀覆盖的数字测量,组成类$ \ Mathcal {H} \ Circ \ Mathcal {f} $可能会变得非常大,甚至无限。然后,我们证明,在用$ \ Mathcal {h} $构成$ \ Mathcal {f} $的输出中,添加少量高斯噪声可以有效地控制$ \ Mathcal {H} \ Circ \ Mathcal { F} $,提供模块化设计的一般配方。为了证明我们的结果,我们定义了均匀覆盖随机函数数量的新概念,相对于总变异和瓦斯坦斯坦距离。我们将结果实例化,以实现多层Sigmoid神经​​网络。 MNIST数据集的初步经验结果表明,在现有统一界限上改善所需的噪声量在数值上可以忽略不计(即,元素的I.I.D. I.I.D.高斯噪声,具有标准偏差$ 10^{ - 240} $)。源代码可从https://github.com/fathollahpour/composition_noise获得。
translated by 谷歌翻译
We investigate the capacity, convexity and characterization of a general family of normconstrained feed-forward networks.
translated by 谷歌翻译
我们研究神经网络的基于规范的统一收敛范围,旨在密切理解它们如何受到规范约束的架构和类型的影响,对于简单的标量价值一类隐藏的一层网络,并在其中界定了输入。欧几里得规范。我们首先证明,通常,控制隐藏层重量矩阵的光谱规范不足以获得均匀的收敛保证(与网络宽度无关),而更强的Frobenius Norm Control是足够的,扩展并改善了以前的工作。在证明构造中,我们识别和分析了两个重要的设置,在这些设置中(可能令人惊讶)仅光谱规范控制就足够了:首先,当网络的激活函数足够平滑时(结果扩展到更深的网络);其次,对于某些类型的卷积网络。在后一种情况下,我们研究样品复杂性如何受到参数的影响,例如斑块之间的重叠量和斑块的总数。
translated by 谷歌翻译
这项工作研究了浅relu网络通过梯度下降训练的浅relu网络,在底层数据分布一般的二进制分类数据上,(最佳)贝叶斯风险不一定为零。在此设置中,表明,在早期停止的梯度下降达到人口风险在不仅仅是逻辑和错误分类损失方面,也可以在校准方面任意接近最佳,这意味着其输出的符合矩阵映射近似于真正的条件分布任意精细。此外,这种分析的必要迭代,样本和架构复杂性,并且在真实条件模型的某种复杂度测量方面都是自然的。最后,虽然没有表明需要早期停止是必要的,但是显示满足局部内插特性的任何单变量分类器是不一致的。
translated by 谷歌翻译
我们专注于具有单个隐藏层的特定浅神经网络,即具有$ l_2 $ normalistization的数据以及Sigmoid形状的高斯错误函数(“ ERF”)激活或高斯错误线性单元(GELU)激活。对于这些网络,我们通过Pac-Bayesian理论得出了新的泛化界限。与大多数现有的界限不同,它们适用于具有确定性或随机参数的神经网络。当网络接受Mnist和Fashion-Mnist上的香草随机梯度下降训练时,我们的界限在经验上是无效的。
translated by 谷歌翻译
We investigate the sample complexity of bounded two-layer neural networks using different activation functions. In particular, we consider the class \[ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, \sigma \circ W\textbf{x} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^d, W \in \mathbb{R}^{T\times d}, \textbf{v} \in \mathbb{R}^{T}\right\} \] where the spectral norm of $W$ and $\textbf{v}$ is bounded by $O(1)$, the Frobenius norm of $W$ is bounded from its initialization by $R > 0$, and $\sigma$ is a Lipschitz activation function. We prove that if $\sigma$ is element-wise, then the sample complexity of $\mathcal{H}$ is width independent and that this complexity is tight. Moreover, we show that the element-wise property of $\sigma$ is essential for width-independent bound, in the sense that there exist non-element-wise activation functions whose sample complexity is provably width-dependent. For the upper bound, we use the recent approach for norm-based bounds named Approximate Description Length (ADL) by arXiv:1910.05697. We further develop new techniques and tools for this approach, that will hopefully inspire future works.
translated by 谷歌翻译
我们提出了Pac-Bayes风格的概括结合,该结合可以用各种积分概率指标(IPM)替换KL-Divergence。我们提供了这种结合的实例,IPM是总变异度量和Wasserstein距离。获得的边界的一个显着特征是,它们在最坏的情况下(当前和后距离彼此远距离时)在经典均匀收敛边界之间自然插值,并且在更好的情况下(后验和先验都关闭时)优选界限。这说明了使用算法和数据依赖性组件加强经典概括界限的可能性,从而使它们更适合分析使用大假设空间的算法。
translated by 谷歌翻译
古典统计学习理论表示,拟合太多参数导致过度舒服和性能差。尽管大量参数矛盾,但是现代深度神经网络概括了这一发现,并构成了解释深度学习成功的主要未解决的问题。随机梯度下降(SGD)引起的隐式正规被认为是重要的,但其特定原则仍然是未知的。在这项工作中,我们研究了当地最小值周围的能量景观的局部几何学如何影响SGD的统计特性,具有高斯梯度噪声。我们争辩说,在合理的假设下,局部几何形状力强制SGD保持接近低维子空间,这会引起隐式正则化并导致深神经网络的泛化误差界定更严格的界限。为了获得神经网络的泛化误差界限,我们首先引入局部最小值周围的停滞迹象,并施加人口风险的局部基本凸性财产。在这些条件下,推导出SGD的下界,以保留在这些停滞套件中。如果发生停滞,我们会导出涉及权重矩阵的光谱规范的深神经网络的泛化误差的界限,但不是网络参数的数量。从技术上讲,我们的证据基于控制SGD中的参数值的变化以及基于局部最小值周围的合适邻域的熵迭代的参数值和局部均匀收敛。我们的工作试图通过统一收敛更好地连接非凸优化和泛化分析。
translated by 谷歌翻译
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
translated by 谷歌翻译
Deep neural networks are vulnerable to adversarial attacks. Ideally, a robust model shall perform well on both the perturbed training data and the unseen perturbed test data. It is found empirically that fitting perturbed training data is not hard, but generalizing to perturbed test data is quite difficult. To better understand adversarial generalization, it is of great interest to study the adversarial Rademacher complexity (ARC) of deep neural networks. However, how to bound ARC in multi-layers cases is largely unclear due to the difficulty of analyzing adversarial loss in the definition of ARC. There have been two types of attempts of ARC. One is to provide the upper bound of ARC in linear and one-hidden layer cases. However, these approaches seem hard to extend to multi-layer cases. Another is to modify the adversarial loss and provide upper bounds of Rademacher complexity on such surrogate loss in multi-layer cases. However, such variants of Rademacher complexity are not guaranteed to be bounds for meaningful robust generalization gaps (RGG). In this paper, we provide a solution to this unsolved problem. Specifically, we provide the first bound of adversarial Rademacher complexity of deep neural networks. Our approach is based on covering numbers. We provide a method to handle the robustify function classes of DNNs such that we can calculate the covering numbers. Finally, we provide experiments to study the empirical implication of our bounds and provide an analysis of poor adversarial generalization.
translated by 谷歌翻译
由学习的迭代软阈值算法(Lista)的动机,我们介绍了一种适用于稀疏重建的一般性网络,从少数线性测量。通过在层之间允许各种重量共享度,我们为非常不同的神经网络类型提供统一分析,从复发到网络更类似于标准前馈神经网络。基于训练样本,通过经验风险最小化,我们旨在学习最佳网络参数,从而实现从其低维线性测量的最佳网络。我们通过分析由这种深网络组成的假设类的RadeMacher复杂性来衍生泛化界限,这也考虑了阈值参数。我们获得了对样本复杂性的估计,基本上只取决于参数和深度的数量。我们应用主要结果以获得几个实际示例的特定泛化界限,包括(隐式)字典学习和卷积神经网络的不同算法。
translated by 谷歌翻译
通过梯度流优化平均平衡误差,研究了功能空间中神经网络的动态。我们认为,在underParameterized制度中,网络了解由与其特征值对应的率的神经切线内核(NTK)确定的整体运算符$ t_ {k ^ \ infty} $的特征功能。例如,对于SPENTE $ S ^ {D-1} $和旋转不变的权重分配的均匀分布式数据,$ t_ {k ^ \ infty} $的特征函数是球形谐波。我们的结果可以理解为描述interparameterized制度中的光谱偏压。证据使用“阻尼偏差”的概念,其中NTK物质对具有由于阻尼因子的发生而具有大特征值的特征的偏差。除了下公共条例的制度之外,阻尼偏差可用于跟踪过度分辨率设置中经验风险的动态,允许我们在文献中延长某些结果。我们得出结论,阻尼偏差在优化平方误差时提供了动态的简单和统一的视角。
translated by 谷歌翻译
In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short). This class of systems is widely used in control engineering and econometrics, in particular, they represent a special case of recurrent neural networks. In this paper we 1) formalize the learning problem for stochastic LTI systems with inputs, 2) derive a PAC-Bayesian-Like error bound for such systems, 3) discuss various consequences of this error bound.
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
研究神经网络中重量扰动的敏感性及其对模型性能的影响,包括泛化和鲁棒性,是一种积极的研究主题,因为它对模型压缩,泛化差距评估和对抗攻击等诸如模型压缩,泛化差距评估和对抗性攻击的广泛机器学习任务。在本文中,我们在重量扰动下的鲁棒性方面提供了前馈神经网络的第一积分研究和分析及其在体重扰动下的泛化行为。我们进一步设计了一种新的理论驱动损失功能,用于培训互动和强大的神经网络免受重量扰动。进行实证实验以验证我们的理论分析。我们的结果提供了基本洞察,以表征神经网络免受重量扰动的泛化和鲁棒性。
translated by 谷歌翻译