通常针对具有特定模型的特定输入而生成的对抗性示例,对于神经网络而言是无处不在的。在本文中,我们揭示了对抗声音的令人惊讶的属性,即,如果配备了相应的标签,则通过一步梯度方法制作的对抗性噪声是线性分离的。从理论上讲,我们为具有随机初始化条目的两层网络和神经切线内核设置证明了此属性,其中参数远离初始化。证明的想法是显示标签信息可以有效地反向输入,同时保持线性可分离性。我们的理论和实验证据进一步表明,对训练数据的对抗噪声进行训练的线性分类器可以很好地对测试数据的对抗噪声进行分类,这表明对抗性噪声实际上将分布扰动注入了原始数据分布。此外,我们从经验上证明,当上述条件受到损害时,在它们仍然比原始功能更容易分类时,对抗性的噪声可能会变得线性分离。
translated by 谷歌翻译
尽管使用对抗性训练捍卫深度学习模型免受对抗性扰动的经验成功,但到目前为止,仍然不清楚对抗性扰动的存在背后的原则是什么,而对抗性培训对神经网络进行了什么来消除它们。在本文中,我们提出了一个称为特征纯化的原则,在其中,我们表明存在对抗性示例的原因之一是在神经网络的训练过程中,在隐藏的重量中积累了某些小型密集混合物;更重要的是,对抗训练的目标之一是去除此类混合物以净化隐藏的重量。我们介绍了CIFAR-10数据集上的两个实验,以说明这一原理,并且一个理论上的结果证明,对于某些自然分类任务,使用随机初始初始化的梯度下降训练具有RELU激活的两层神经网络确实满足了这一原理。从技术上讲,我们给出了我们最大程度的了解,第一个结果证明,以下两个可以同时保持使用RELU激活的神经网络。 (1)对原始数据的训练确实对某些半径的小对抗扰动确实不舒适。 (2)即使使用经验性扰动算法(例如FGM),实际上也可以证明对对抗相同半径的任何扰动也可以证明具有强大的良好性。最后,我们还证明了复杂性的下限,表明该网络的低复杂性模型,例如线性分类器,低度多项式或什至是神经切线核,无论使用哪种算法,都无法防御相同半径的扰动训练他们。
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
“良性过度装备”,分类器记住嘈杂的培训数据仍然达到良好的概括性表现,在机器学习界造成了很大的关注。为了解释这种令人惊讶的现象,一系列作品在过度参数化的线性回归,分类和内核方法中提供了理论典范。然而,如果在对逆势实例存在下仍发生良性的过度,则尚不清楚,即欺骗分类器的微小和有意的扰动的例子。在本文中,我们表明,良性过度确实发生在对抗性培训中,是防御对抗性实例的原则性的方法。详细地,我们证明了在$ \ ell_p $普发的扰动下的子高斯数据的混合中的普遍培训的线性分类器的风险限制。我们的结果表明,在中度扰动下,尽管过度禁止嘈杂的培训数据,所以发生前列训练的线性分类器可以实现近乎最佳的标准和对抗性风险。数值实验验证了我们的理论发现。
translated by 谷歌翻译
现代神经网络通常具有很大的表现力,并且可以接受训练以使培训数据过高,同时仍能达到良好的测试性能。这种现象被称为“良性过度拟合”。最近,从理论角度出现了一系列研究“良性过度拟合”的作品。但是,它们仅限于线性模型或内核/随机特征模型,并且仍然缺乏关于何时以及如何在神经网络中发生过度拟合的理论理解。在本文中,我们研究了训练两层卷积神经网络(CNN)的良性过度拟合现象。我们表明,当信噪比满足一定条件时,通过梯度下降训练的两层CNN可以实现任意小的训练和测试损失。另一方面,当这种情况无法成立时,过度拟合就会有害,并且获得的CNN只能实现恒定的测试损失。这些共同证明了由信噪比驱动的良性过度拟合和有害过度拟合之间的急剧过渡。据我们所知,这是第一部精确地表征良性过度拟合在训练卷积神经网络中的条件的工作。
translated by 谷歌翻译
Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.
translated by 谷歌翻译
Deep neural networks are vulnerable to adversarial attacks. Ideally, a robust model shall perform well on both the perturbed training data and the unseen perturbed test data. It is found empirically that fitting perturbed training data is not hard, but generalizing to perturbed test data is quite difficult. To better understand adversarial generalization, it is of great interest to study the adversarial Rademacher complexity (ARC) of deep neural networks. However, how to bound ARC in multi-layers cases is largely unclear due to the difficulty of analyzing adversarial loss in the definition of ARC. There have been two types of attempts of ARC. One is to provide the upper bound of ARC in linear and one-hidden layer cases. However, these approaches seem hard to extend to multi-layer cases. Another is to modify the adversarial loss and provide upper bounds of Rademacher complexity on such surrogate loss in multi-layer cases. However, such variants of Rademacher complexity are not guaranteed to be bounds for meaningful robust generalization gaps (RGG). In this paper, we provide a solution to this unsolved problem. Specifically, we provide the first bound of adversarial Rademacher complexity of deep neural networks. Our approach is based on covering numbers. We provide a method to handle the robustify function classes of DNNs such that we can calculate the covering numbers. Finally, we provide experiments to study the empirical implication of our bounds and provide an analysis of poor adversarial generalization.
translated by 谷歌翻译
深入学习在现代分类任务中取得了许多突破。已经提出了众多架构用于不同的数据结构,但是当涉及丢失功能时,跨熵损失是主要的选择。最近,若干替代损失已经看到了深度分类器的恢复利益。特别是,经验证据似乎促进了方形损失,但仍然缺乏理论效果。在这项工作中,我们通过系统地研究了在神经切线内核(NTK)制度中的过度分化的神经网络的表现方式来促进对分类方面损失的理论理解。揭示了关于泛化误差,鲁棒性和校准错误的有趣特性。根据课程是否可分离,我们考虑两种情况。在一般的不可分类案例中,为错误分类率和校准误差建立快速收敛速率。当类是可分离的时,错误分类率改善了速度快。此外,经过证明得到的余量被证明是低于零的较低,提供了鲁棒性的理论保证。我们希望我们的调查结果超出NTK制度并转化为实际设置。为此,我们对实际神经网络进行广泛的实证研究,展示了合成低维数据和真实图像数据中方损的有效性。与跨熵相比,方形损耗具有可比的概括误差,但具有明显的鲁棒性和模型校准的优点。
translated by 谷歌翻译
良性过度拟合,即插值模型在存在嘈杂数据的情况下很好地推广的现象,首先是在接受梯度下降训练的神经网络模型中观察到的。为了更好地理解这一经验观察,我们考虑了通过梯度下降训练的两层神经网络的概括误差,后者是随机初始化后的逻辑损失。我们假设数据来自分离良好的集体条件对数符合分布,并允许训练标签的持续部分被对手损坏。我们表明,在这种情况下,神经网络表现出良性过度拟合:它们可以驱动到零训练错误,完美拟合所有嘈杂的训练标签,并同时达到最小值最佳测试错误。与以前需要线性或基于内核预测的良性过度拟合的工作相反,我们的分析在模型和学习动力学基本上是非线性的环境中。
translated by 谷歌翻译
我们研究(选定的)宽,狭窄,深而浅,较浅,懒惰和非懒惰的训练环境中(选定的)深度神经网络中的平均鲁棒性概念。我们证明,在参数不足的环境中,宽度具有负面影响,而在过度参数化的环境中提高了鲁棒性。深度的影响紧密取决于初始化和训练模式。特别是,当用LeCun初始化初始化时,深度有助于通过懒惰训练制度进行稳健性。相反,当用神经切线核(NTK)初始化并进行初始化时,深度会损害稳健性。此外,在非懒惰培训制度下,我们演示了两层relu网络的宽度如何使鲁棒性受益。我们的理论发展改善了Huang等人的结果。[2021],Wu等。[2021]与Bubeck and Sellke [2021],Bubeck等人一致。[2021]。
translated by 谷歌翻译
It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the testing performance of the original dense models, but also sometimes even slightly boost the generalization performance. Theoretical understanding for such experimental observations are yet to be developed. This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned according to different rates at the initialization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound gets better as the pruning fraction gets larger. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. Up to our knowledge, this is the \textbf{first} generalization result for pruned neural networks, suggesting that pruning can improve the neural network's generalization.
translated by 谷歌翻译
Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.
translated by 谷歌翻译
数据增强是机器学习管道的基石,但其理论基础尚不清楚。它只是人为增加数据集大小的一种方法吗?还是鼓励模型满足某些不变性?在这项工作中,我们考虑了另一个角度,我们研究了数据增强对学习过程动态的影响。我们发现,数据增强可以改变各种功能的相对重要性,从而有效地使某些信息性但难以学习的功能更有可能在学习过程中捕获。重要的是,我们表明,对于非线性模型,例如神经网络,这种效果更为明显。我们的主要贡献是对Allen-Zhu和Li [2020]最近提出的多视图数据模型中两层卷积神经网络的学习动态数据的详细分析。我们通过进一步的实验证据来补充这一分析,证明数据增加可以看作是特征操纵。
translated by 谷歌翻译
联合学习(FL)是一种趋势培训范式,用于利用分散培训数据。 FL允许客户端在本地更新几个时期的模型参数,然后将它们共享到全局模型以进行聚合。在聚集之前,该训练范式具有多本地步骤更新,使对抗性攻击暴露了独特的漏洞。对手训练是一种流行而有效的方法,可以提高网络对抗者的鲁棒性。在这项工作中,我们制定了一种一般形式的联邦对抗学习(FAL),该形式是从集中式环境中的对抗性学习改编而成的。在FL培训的客户端,FAL具有一个内部循环,可以生成对抗性样本进行对抗训练和外循环以更新本地模型参数。在服务器端,FAL汇总了本地模型更新并广播聚合的模型。我们设计了全球强大的训练损失,并将FAL培训作为最小最大优化问题。与依赖梯度方向的经典集中式培训中的收敛分析不同,由于三个原因,很难在FAL中分析FAL的收敛性:1)Min-Max优化的复杂性,2)模型未在梯度方向上更新聚合之前的客户端和3)客户间异质性的多局部更新。我们通过使用适当的梯度近似和耦合技术来应对这些挑战,并在过度参数化的制度中介绍收敛分析。从理论上讲,我们的主要结果表明,我们的算法下的最小损失可以收敛到$ \ epsilon $ Small,并具有所选的学习率和交流回合。值得注意的是,我们的分析对于非IID客户是可行的。
translated by 谷歌翻译
在这项工作中,我们在两层relu网络中提供了特征学习过程的表征,这些网络在随机初始化后通过梯度下降对逻辑损失进行了训练。我们考虑使用输入功能的XOR样函数生成的二进制标签的数据。我们允许不断的培训标签被对手破坏。我们表明,尽管线性分类器并不比随机猜测我们考虑的分布更好,但通过梯度下降训练的两层relu网络达到了接近标签噪声速率的概括误差。我们开发了一种新颖的证明技术,该技术表明,在初始化时,绝大多数神经元充当随机特征,仅与有用特征无关紧要,而梯度下降动力学则“放大”这些弱,随机的特征到强,有用的特征。
translated by 谷歌翻译
由路由器控制的稀疏激活模型(MOE)层的混合物(MOE)层在深度学习方面取得了巨大的成功。但是,对这种建筑的理解仍然难以捉摸。在本文中,我们正式研究了MOE层如何改善神经网络学习的性能以及为什么混合模型不会崩溃成单个模型。我们的经验结果表明,基本问题的集群结构和专家的非线性对于MOE的成功至关重要。为了进一步理解这一点,我们考虑了固有群集结构的具有挑战性的分类问题,这很难使用单个专家学习。然而,使用MOE层,通过将专家选择为两层非线性卷积神经网络(CNN),我们表明可以成功地学习问题。此外,我们的理论表明,路由器可以学习群集中心的特征,这有助于将输入复杂问题分为单个专家可以征服的更简单的线性分类子问题。据我们所知,这是正式了解MOE层的深度学习机制的第一个结果。
translated by 谷歌翻译
This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
translated by 谷歌翻译
现代神经网络通常以强烈的过度构造状态运行:它们包含许多参数,即使实际标签被纯粹随机的标签代替,它们也可以插入训练集。尽管如此,他们在看不见的数据上达到了良好的预测错误:插值训练集并不会导致巨大的概括错误。此外,过度散色化似乎是有益的,因为它简化了优化景观。在这里,我们在神经切线(NT)制度中的两层神经网络的背景下研究这些现象。我们考虑了一个简单的数据模型,以及各向同性协变量的矢量,$ d $尺寸和$ n $隐藏的神经元。我们假设样本量$ n $和尺寸$ d $都很大,并且它们在多项式上相关。我们的第一个主要结果是对过份术的经验NT内核的特征结构的特征。这种表征意味着必然的表明,经验NT内核的最低特征值在$ ND \ gg n $后立即从零界限,因此网络可以在同一制度中精确插值任意标签。我们的第二个主要结果是对NT Ridge回归的概括误差的表征,包括特殊情况,最小值-ULL_2 $ NORD插值。我们证明,一旦$ nd \ gg n $,测试误差就会被内核岭回归之一相对于无限宽度内核而近似。多项式脊回归的误差依次近似后者,从而通过与激活函数的高度组件相关的“自我诱导的”项增加了正则化参数。多项式程度取决于样本量和尺寸(尤其是$ \ log n/\ log d $)。
translated by 谷歌翻译
对抗性的鲁棒性已经成为深度学习的核心目标,无论是在理论和实践中。然而,成功的方法来改善对抗的鲁棒性(如逆势训练)在不受干扰的数据上大大伤害了泛化性能。这可能会对对抗性鲁棒性如何影响现实世界系统的影响(即,如果它可以提高未受干扰的数据的准确性),许多人可能选择放弃鲁棒性)。我们提出内插对抗培训,该培训最近雇用了在对抗培训框架内基于插值的基于插值的培训方法。在CiFar -10上,对抗性训练增加了标准测试错误(当没有对手时)从4.43%到12.32%,而我们的内插对抗培训我们保留了对抗性的鲁棒性,同时实现了仅6.45%的标准测试误差。通过我们的技术,强大模型标准误差的相对增加从178.1%降至仅为45.5%。此外,我们提供内插对抗性培训的数学分析,以确认其效率,并在鲁棒性和泛化方面展示其优势。
translated by 谷歌翻译
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep overparameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
translated by 谷歌翻译