We study the generalization of over-parameterized classifiers where Empirical Risk Minimization (ERM) for learning leads to zero training error. In these over-parameterized settings there are many global minima with zero training error, some of which generalize better than others. We show that under certain conditions the fraction of "bad" global minima with a true error larger than {\epsilon} decays to zero exponentially fast with the number of training data n. The bound depends on the distribution of the true error over the set of classifier functions used for the given classification problem, and does not necessarily depend on the size or complexity (e.g. the number of parameters) of the classifier function set. This might explain the unexpectedly good generalization even of highly over-parameterized Neural Networks. We support our mathematical framework with experiments on a synthetic data set and a subset of MNIST.
translated by 谷歌翻译
大型多层神经网络的概括性能越来越兴趣,可以接受训练以达到零训练错误,同时对测试数据进行良好的推广。该制度被称为“第二次下降”,似乎与常规观点相矛盾,即最佳模型复杂性应反映出不足和过度拟合之间的最佳平衡,即偏见差异权衡。本文介绍了双重下降的VC理论分析,并表明可以通过经典的VC将军范围来充分解释。我们说明了分析性VC结合的应用,用于对分类问题进行两次下降进行建模,并使用多种学习方法(例如SVM,最小二乘正方形和多层观察者分类器)的经验结果。此外,我们讨论了对深度学习社区中VC理论结果误解的几个原因。
translated by 谷歌翻译
With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.
translated by 谷歌翻译
我们证明了由例如He等人提出的广泛使用的方法。(2015年)并使用梯度下降对最小二乘损失进行训练并不普遍。具体而言,我们描述了一大批一维数据生成分布,较高的概率下降只会发现优化景观的局部最小值不好,因为它无法将其偏离偏差远离其初始化,以零移动。。事实证明,在这些情况下,即使目标函数是非线性的,发现的网络也基本执行线性回归。我们进一步提供了数值证据,表明在实际情况下,对于某些多维分布而发生这种情况,并且随机梯度下降表现出相似的行为。我们还提供了有关初始化和优化器的选择如何影响这种行为的经验结果。
translated by 谷歌翻译
在本文中,我们研究了全球最小值的泛化性能,以实现过度参数化的深度Relu网上的经验风险最小化(ERM)。利用新型Relu网的新型深化方案,我们严格证明,在温和条件下,有几种类型的数据实现了几乎最佳的全局概率界限。由于过度参数化是至关重要的,可以通过广泛使用的随机梯度下降(SGD)算法来实现深度Relu网的全局Minima,因此我们的结果确实填补了优化和泛化之间的差距。
translated by 谷歌翻译
我们从统计学习理论的角度调查分类生物神经网络的功能,以简化的设置为具有身份激活功能的连续时间随机经常性神经网络(RNN)。在纯粹的随机(鲁棒)制度中,我们提供了具有高概率的概括误差,从而表明经验风险最低限度是最典型的假设。我们表明RNNS保留了作为攻击培训和分类任务的唯一信息的路径的部分签名。我们认为这些RNNS很容易培训和强大,并在合成和实际数据的数值实验中培训和稳健。我们还在准确性和稳健性之间表现出权衡现象。
translated by 谷歌翻译
Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of wellseparated distributions, we prove that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels. Furthermore, the analysis provides interesting insights into several aspects of learning neural networks and can be verified based on empirical studies on synthetic data and on the MNIST dataset.
translated by 谷歌翻译
古典统计学习理论表示,拟合太多参数导致过度舒服和性能差。尽管大量参数矛盾,但是现代深度神经网络概括了这一发现,并构成了解释深度学习成功的主要未解决的问题。随机梯度下降(SGD)引起的隐式正规被认为是重要的,但其特定原则仍然是未知的。在这项工作中,我们研究了当地最小值周围的能量景观的局部几何学如何影响SGD的统计特性,具有高斯梯度噪声。我们争辩说,在合理的假设下,局部几何形状力强制SGD保持接近低维子空间,这会引起隐式正则化并导致深神经网络的泛化误差界定更严格的界限。为了获得神经网络的泛化误差界限,我们首先引入局部最小值周围的停滞迹象,并施加人口风险的局部基本凸性财产。在这些条件下,推导出SGD的下界,以保留在这些停滞套件中。如果发生停滞,我们会导出涉及权重矩阵的光谱规范的深神经网络的泛化误差的界限,但不是网络参数的数量。从技术上讲,我们的证据基于控制SGD中的参数值的变化以及基于局部最小值周围的合适邻域的熵迭代的参数值和局部均匀收敛。我们的工作试图通过统一收敛更好地连接非凸优化和泛化分析。
translated by 谷歌翻译
制定关于涉及新数据的任务的训练模型的表现的陈述是机器学习的主要目标之一,即了解模型的泛化功率。各种能力措施试图捕捉这种能力,但通常在解释我们在实践中观察到的模型的重要特征。在这项研究中,我们将本地有效维度提出作为一种容量测量,似乎与标准数据集的泛化误差很好。重要的是,我们证明了本地有效维度界限泛化误差并讨论了机器学习模型的这种容量措施的适应性。
translated by 谷歌翻译
This paper provides theoretical insights into why and how deep learning can generalize well, despite its large capacity, complexity, possible algorithmic instability, nonrobustness, and sharp minima, responding to an open question in the literature. We also discuss approaches to provide non-vacuous generalization guarantees for deep learning. Based on theoretical observations, we propose new open problems and discuss the limitations of our results.
translated by 谷歌翻译
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models. * Work performed while interning at Google Brain.† Work performed at Google Brain.
translated by 谷歌翻译
尽管过度参数化的模型已经在许多机器学习任务上表现出成功,但与培训不同的测试分布的准确性可能会下降。这种准确性下降仍然限制了在野外应用机器学习的限制。同时,重要的加权是一种处理分配转移的传统技术,已被证明在经验和理论上对过度参数化模型的影响较小甚至没有影响。在本文中,我们提出了重要的回火来改善决策界限,并为过度参数化模型取得更好的结果。从理论上讲,我们证明在标签移位和虚假相关设置下,组温度的选择可能不同。同时,我们还证明正确选择的温度可以解脱出少数群体崩溃的分类不平衡。从经验上讲,我们使用重要性回火来实现最严重的小组分类任务的最新结果。
translated by 谷歌翻译
标签 - 不平衡和组敏感分类中的目标是优化相关的指标,例如平衡错误和相同的机会。经典方法,例如加权交叉熵,在训练深网络到训练(TPT)的终端阶段时,这是超越零训练误差的训练。这种观察发生了最近在促进少数群体更大边值的直观机制之后开发启发式替代品的动力。与之前的启发式相比,我们遵循原则性分析,说明不同的损失调整如何影响边距。首先,我们证明,对于在TPT中训练的所有线性分类器,有必要引入乘法,而不是添加性的Logit调整,以便对杂项边缘进行适当的变化。为了表明这一点,我们发现将乘法CE修改的连接到成本敏感的支持向量机。也许是违反,我们还发现,在培训开始时,相同的乘法权重实际上可以损害少数群体。因此,虽然在TPT中,添加剂调整无效,但我们表明它们可以通过对乘法重量的初始负效应进行抗衡来加速会聚。通过这些发现的动机,我们制定了矢量缩放(VS)丢失,即捕获现有技术作为特殊情况。此外,我们引入了对群体敏感分类的VS损失的自然延伸,从而以统一的方式处理两种常见类型的不平衡(标签/组)。重要的是,我们对最先进的数据集的实验与我们的理论见解完全一致,并确认了我们算法的卓越性能。最后,对于不平衡的高斯 - 混合数据,我们执行泛化分析,揭示平衡/标准错误和相同机会之间的权衡。
translated by 谷歌翻译
Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
translated by 谷歌翻译
经典的错误发现率(FDR)控制程序提供了强大而可解释的保证,而它们通常缺乏灵活性。另一方面,最近的机器学习分类算法是基于随机森林(RF)或神经网络(NN)的算法,具有出色的实践表现,但缺乏解释和理论保证。在本文中,我们通过引入新的自适应新颖性检测程序(称为Adadetect)来使这两个相遇。它将多个测试文献的最新作品范围扩展到高维度的范围,尤其是Yang等人的范围。 (2021)。显示AD​​ADETECT既可以强烈控制FDR,又具有在特定意义上模仿甲骨文之一的力量。理论结果,几个基准数据集上的数值实验以及对天体物理数据的应用,我们的方法的兴趣和有效性得到了证明。特别是,虽然可以将AdadEtect与任何分类器结合使用,但它在带有RF的现实世界数据集以及带有NN的图像上特别有效。
translated by 谷歌翻译
我们研究了基于分布强大的机会约束的对抗性分类模型。我们表明,在Wasserstein模糊性下,该模型旨在最大限度地减少距离分类距离的条件值 - 风险,并且我们探讨了前面提出的对抗性分类模型和最大限度的分类机的链接。我们还提供了用于线性分类的分布鲁棒模型的重构,并且表明它相当于最小化正则化斜坡损失目标。数值实验表明,尽管这种配方的非凸起,但是标准的下降方法似乎会聚到全球最小值器。灵感来自这种观察,我们表明,对于某一类分布,正则化斜坡损失最小化问题的唯一静止点是全球最小化器。
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (gopher), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and single hidden layer of ReLU activation units. We establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results which suggest that this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation can also generate new insights. Among other things, it suggests that, as the input space dimension or latent space complexity grows, as might be the case for example if a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than training data set.
translated by 谷歌翻译
多级分类问题的广义线性模型是现代机器学习任务的基本构建块之一。在本手稿中,我们通过具有任何凸损耗和正规化的经验风险最小化(ERM)来描述与通用手段和协方士的k $高斯的混合。特别是,我们证明了表征ERM估计的精确渐近剂,以高维度,在文献中扩展了关于高斯混合分类的几个先前结果。我们举例说明我们在统计学习中的两个兴趣任务中的两个任务:a)与稀疏手段的混合物进行分类,我们研究了$ \ ell_2 $的$ \ ell_1 $罚款的效率; b)Max-Margin多级分类,在那里我们在$ k> 2 $的多级逻辑最大似然估计器上表征了相位过渡。最后,我们讨论了我们的理论如何超出合成数据的范围,显示在不同的情况下,高斯混合在真实数据集中密切地捕获了分类任务的学习曲线。
translated by 谷歌翻译
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to log nfactors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
translated by 谷歌翻译