Overparameterized deep neural networks can achieve excellent training accuracy while keeping the generalization error small. They have also been found to be able to fit arbitrary labels, a behavior referred to as the memorization phenomenon. In this work, we study the memorization phenomenon with turn-over dropout, an efficient method for estimating influence and memorization, on data with true labels (real data) and data with random labels (random data). Our main findings are: (i) for both real data and random data, the network optimizes easy examples (e.g., real data) and difficult examples (e.g., random data) simultaneously, with the easy examples optimized faster; (ii) for real data, a correctly labeled difficult example in the training dataset is more informative than an easy one. By exhibiting memorization on both random data and real data, we highlight the consistency between them and underline the implications of memorization during optimization.
We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap of the model when the instance is removed from the dataset. Such approaches reveal characteristics and importance of individual instances, which may provide useful information in diagnosing and improving deep learning. However, most of the existing works on data valuation require actual training of a model, which often demands high-computational cost. In this paper, we provide a training-free data valuation score, called complexity-gap score, which is a data-centric score to quantify the influence of individual instances in generalization of two-layer overparameterized neural networks. The proposed score can quantify irregularity of the instances and measure how much each data instance contributes in the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics.
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Understanding the influence of a training instance on a neural network model leads to improved interpretability. However, evaluating this influence, which shows how a model's prediction would change if a training instance were not used, is difficult and inefficient. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents that sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influence, enhance the interpretability of error predictions, and clean the training dataset to improve generalization.
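Below is a minimal sketch of the turn-over-dropout idea described above, assuming a simple feed-forward classifier in PyTorch. The per-instance mask derivation (`instance_mask`) and the mask rate are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 128

class MaskedMLP(nn.Module):
    """Two-layer classifier whose hidden layer can be masked per training instance."""
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, n_classes)

    def forward(self, x, mask):
        h = F.relu(self.fc1(x))
        # Zero out half of the hidden units; rescale as in inverted dropout (p = 0.5).
        return self.fc2(h * mask * 2.0)

def instance_mask(index, p=0.5, seed=0):
    """Deterministic per-instance dropout mask derived from the example index."""
    g = torch.Generator().manual_seed(seed + index)
    return (torch.rand(HIDDEN, generator=g) > p).float()

def influence_estimate(model, x, y, index):
    """Loss gap between the sub-network that never learned the instance
    (flipped mask) and the one that was trained on it (original mask)."""
    m = instance_mask(index)
    with torch.no_grad():
        loss_with = F.cross_entropy(model(x, m), y)
        loss_without = F.cross_entropy(model(x, 1.0 - m), y)
    return (loss_without - loss_with).item()
```

During training, example `i` would always be forwarded with `instance_mask(i)`, so the complementary half of the hidden units receives no gradient from that example; flipping the mask at evaluation time then exposes a sub-network that effectively never learned it.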
It is commonly believed that pruning a network not only reduces the computational cost of deep networks but also prevents overfitting by reducing model capacity. However, our work surprisingly finds that network pruning can sometimes even aggravate overfitting. We report an unexpected sparse double descent phenomenon: as model sparsity increases via network pruning, test performance first gets worse (due to overfitting), then gets better (as the overfitting is relieved), and finally gets worse again (due to forgetting useful information). While recent studies have focused on model over-parameterization, they fail to recognize that sparsity can also cause double descent. In this paper, we make three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, we propose a novel learning-distance interpretation for this phenomenon, namely that the $\ell_2$ learning distance of sparse models (from initialized to final parameters) may correlate well with the sparse double descent curve and reflect generalization better than minima flatness. Third, in the context of sparse double descent, the winning ticket of the lottery ticket hypothesis surprisingly does not always win.
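A minimal sketch of the $\ell_2$ learning-distance quantity mentioned above, assuming the distance is measured only over the weights that survive pruning; that masking convention is an assumption on my part, not necessarily the authors' exact protocol.

```python
import torch

def learning_distance(init_params, final_params, masks):
    """L2 distance from initialization to the final solution, restricted to
    the weights kept by the pruning masks (mask == 1)."""
    sq = 0.0
    for p0, p1, m in zip(init_params, final_params, masks):
        sq += ((p1 - p0) * m).pow(2).sum().item()
    return sq ** 0.5
```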
Deep learning has achieved remarkable success in numerous domains with the help of massive amounts of big data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) has become an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological differences, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we perform an in-depth analysis of noise rate estimation and summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guide for future studies. All the contents will be available at https://github.com/songhwanjun/awesome-noisy-labels.
Machine learning (ML) models need to be frequently retrained on changing datasets in a variety of application scenarios, including data valuation and uncertainty quantification. To retrain models efficiently, linear approximation methods such as influence functions have been proposed to estimate the impact of data changes on model parameters. However, these methods become inaccurate for large changes in the dataset. In this work, we focus on convex learning problems and propose a general framework for learning the optimized model parameters for different training sets using neural networks. We propose to enforce the predicted model parameters to obey optimality conditions and maintain utility through regularization techniques, which significantly improves generalization. Furthermore, we rigorously characterize the expressive power of neural networks to approximate the optimizer of convex problems. Empirical results demonstrate the advantages of the proposed method in accurate and efficient model parameter estimation compared with state-of-the-art approaches.
对网络规模数据进行培训可能需要几个月的时间。但是,在已经学习或不可学习的冗余和嘈杂点上浪费了很多计算和时间。为了加速训练,我们引入了可减少的持有损失选择(Rho-loss),这是一种简单但原则上的技术,它大致选择了这些训练点,最大程度地减少了模型的概括损失。结果,Rho-loss减轻了现有数据选择方法的弱点:优化文献中的技术通常选择“硬损失”(例如,高损失),但是这种点通常是嘈杂的(不可学习)或更少的任务与任务相关。相反,课程学习优先考虑“简单”的积分,但是一旦学习,就不必对这些要点进行培训。相比之下,Rho-Loss选择了可以学习的点,值得学习的,尚未学习。与先前的艺术相比,Rho-loss火车的步骤要少得多,可以提高准确性,并加快对广泛的数据集,超参数和体系结构(MLP,CNNS和BERT)的培训。在大型Web绑带图像数据集服装1M上,与统一的数据改组相比,步骤少18倍,最终精度的速度少2%。
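A minimal sketch of the reducible-holdout-loss selection rule described above, assuming a small "irreducible loss" model has already been trained on a holdout set; the candidate-batch scoring and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rho_loss_select(model, irreducible_model, xb, yb, n_select):
    """Keep the examples in a candidate batch with the largest reducible loss:
    current training loss minus the irreducible (holdout-model) loss."""
    with torch.no_grad():
        train_loss = F.cross_entropy(model(xb), yb, reduction="none")
        irreducible = F.cross_entropy(irreducible_model(xb), yb, reduction="none")
        reducible = train_loss - irreducible
    top = torch.topk(reducible, k=n_select).indices
    return xb[top], yb[top]
```

At each step, a larger candidate batch is scored and only the `n_select` highest-reducible-loss points are passed on to the gradient update.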
Deep neural networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can make them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights into how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight an important risk: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed that all models manifest some forms of memorization. This can be troublesome in most code intelligence tasks, since they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models that are often overlooked by software engineering researchers.
We address the efficient computation of influence functions for tracing predictions back to the training data. We propose and analyze a new approach for speeding up the inverse-Hessian computation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with billions of parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with millions of training examples. Our code will be available at https://github.com/google-research/jax-influence.
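A sketch of the general Arnoldi/top-eigenspace idea behind the method above, not the authors' JAX implementation: build a Krylov basis from Hessian-vector products, diagonalize the small projected matrix, and approximate the inverse-Hessian product in the dominant eigenspace. The `hvp` callable is assumed to be supplied by the user (e.g., via double backpropagation over flattened parameters).

```python
import torch

def arnoldi_eigs(hvp, dim, n_iters=50, top_k=10, device="cpu"):
    """Approximate the dominant eigenpairs of the Hessian using only
    Hessian-vector products, via Arnoldi iteration on a Krylov subspace."""
    Q = [torch.randn(dim, device=device)]
    Q[0] /= Q[0].norm()
    H = torch.zeros(n_iters + 1, n_iters, device=device)
    for j in range(n_iters):
        w = hvp(Q[j])
        for i in range(j + 1):                   # modified Gram-Schmidt
            H[i, j] = torch.dot(w, Q[i])
            w = w - H[i, j] * Q[i]
        H[j + 1, j] = w.norm()
        if H[j + 1, j] < 1e-8:                   # invariant subspace found
            break
        Q.append(w / H[j + 1, j])
    m = len(Q) - 1
    evals, evecs = torch.linalg.eigh(H[:m, :m])  # Hessian is symmetric
    idx = evals.abs().argsort(descending=True)[:top_k]
    basis = torch.stack(Q[:m], dim=1)            # dim x m Arnoldi basis
    return evals[idx], basis @ evecs[:, idx]     # Ritz values / vectors

def influence(grad_test, grad_train, evals, evecs):
    """g_test^T H^{-1} g_train restricted to the top eigenspace (up to sign convention)."""
    a = evecs.T @ grad_test
    b = evecs.T @ grad_train
    return torch.sum(a * b / evals)
```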
Deep Learning with noisy labels is a practically challenging problem in weakly supervised learning. The state-of-the-art approaches "Decoupling" and "Co-teaching+" claim that the "disagreement" strategy is crucial for alleviating the problem of learning with noisy labels. In this paper, we start from a different perspective and propose a robust learning paradigm called JoCoR, which aims to reduce the diversity of two networks during training. Specifically, we first use two networks to make predictions on the same mini-batch data and calculate a joint loss with Co-Regularization for each training example. Then we select small-loss examples to update the parameters of both networks simultaneously. Trained by the joint loss, these two networks would be more and more similar due to the effect of Co-Regularization. Extensive experimental results on corrupted data from benchmark datasets including MNIST, CIFAR-10, CIFAR-100 and Clothing1M demonstrate that JoCoR is superior to many state-of-the-art approaches for learning with noisy labels.
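A minimal sketch of the joint loss with Co-Regularization and small-loss selection described above, assuming PyTorch; the `lam` weight and `keep_ratio` are illustrative (in the paper the kept fraction is scheduled from the estimated noise rate).

```python
import torch
import torch.nn.functional as F

def jocor_loss(logits1, logits2, targets, lam=0.65, keep_ratio=0.7):
    """Per-example joint loss = supervised CE of both networks plus a symmetric
    KL co-regularization term; only the smallest-loss fraction of the batch
    (treated as likely clean) contributes to the update of both networks."""
    ce = F.cross_entropy(logits1, targets, reduction="none") + \
         F.cross_entropy(logits2, targets, reduction="none")
    log_p1 = F.log_softmax(logits1, dim=1)
    log_p2 = F.log_softmax(logits2, dim=1)
    # Symmetric KL between the two networks' predictive distributions.
    kl = F.kl_div(log_p1, log_p2.exp(), reduction="none").sum(dim=1) + \
         F.kl_div(log_p2, log_p1.exp(), reduction="none").sum(dim=1)
    per_example = (1 - lam) * ce + lam * kl
    n_keep = int(keep_ratio * targets.size(0))
    keep = torch.topk(per_example, k=n_keep, largest=False).indices
    return per_example[keep].mean()
```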
We propose a framework for predictive uncertainty in neural networks that replaces the traditional Bayesian notion of weight probability density functions (PDFs) with a physics-inspired latent-field representation of the model weights based on Gaussian reproducing kernel Hilbert space (RKHS) embeddings. This allows us to use perturbation theory from quantum physics to formulate a moment decomposition problem for the model weight-output relationship. The extracted moments reveal successive regularizations of the weight PDF in the local neighborhood of the model output. These local moments determine, with high sensitivity, the local heterogeneity of the weight PDF, thereby characterizing the model's predictive uncertainty with greater accuracy than is typical of Bayesian and ensemble methods. We show that this leads to better detection of false model predictions on test data that has shifted away from the training PDF learned by the model. We evaluate our approach against baseline uncertainty quantification methods on several benchmark datasets corrupted with common distortion techniques. Our approach provides fast model predictive uncertainty estimates with greater accuracy and calibration.
Training accurate deep neural networks (DNNs) in the presence of noisy labels is an important and challenging task. Though a number of approaches have been proposed for learning with noisy labels, many open issues remain. In this paper, we show that DNN learning with Cross Entropy (CE) exhibits overfitting to noisy labels on some classes ("easy" classes), but more surprisingly, it also suffers from significant under learning on some other classes ("hard" classes). Intuitively, CE requires an extra term to facilitate learning of hard classes, and more importantly, this term should be noise tolerant, so as to avoid overfitting to noisy labels. Inspired by the symmetric KL-divergence, we propose the approach of Symmetric cross entropy Learning (SL), boosting CE symmetrically with a noise robust counterpart Reverse Cross Entropy (RCE). Our proposed SL approach simultaneously addresses both the under learning and overfitting problem of CE in the presence of noisy labels. We provide a theoretical analysis of SL and also empirically show, on a range of benchmark and real-world datasets, that SL outperforms state-of-the-art methods. We also show that SL can be easily incorporated into existing methods in order to further enhance their performance.
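A minimal sketch of the Symmetric cross entropy Learning (SL) loss described above, assuming PyTorch; the `alpha`/`beta` weights and the clipping constant `A` are illustrative defaults, as the paper tunes them per dataset.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0, A=-4.0):
    """SL = alpha * CE + beta * RCE, where Reverse Cross Entropy swaps the roles
    of the prediction and the one-hot label and clips log(0) to the constant A."""
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    # log of the one-hot label: 0 for the true class, A (clipped log 0) elsewhere.
    log_label = torch.where(one_hot > 0, torch.zeros_like(one_hot),
                            torch.full_like(one_hot, A))
    rce = -(pred * log_label).sum(dim=1).mean()
    return alpha * ce + beta * rce
```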
The grokking phenomenon reported by Power et al. (2021) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of grokking through a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that, without explicit regularization, grokking as reported by Power et al. (2021) almost exclusively happens at the onset of slingshots, and is absent without it. While common and easy to reproduce in more general settings, the Slingshot Mechanism does not follow from any known optimization theory we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised understanding of its origins.
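Since the abstract says slingshots can be tracked through the norm of the last layer's weights, here is a minimal monitoring sketch; taking the last registered `weight` parameter as the classifier head is an assumption about the module ordering, not the authors' code.

```python
def last_layer_weight_norm(model):
    """L2 norm of the final weight matrix; the slingshot signature is a cyclic
    rise and sharp drop of this curve late in training."""
    weights = [p for name, p in model.named_parameters() if name.endswith("weight")]
    return weights[-1].norm().item()

# Inside the training loop (sketch): log this scalar every few steps alongside
# train/validation accuracy to check whether generalization jumps coincide
# with the onset of a slingshot.
```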
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at https://github.com/google-research/sam.
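A minimal sketch of the two-pass SAM update, assuming PyTorch and a single parameter group; `rho=0.05` is an illustrative default and this is not the authors' released implementation.

```python
import torch

def sam_step(model, loss_fn, xb, yb, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights toward the
    approximate worst point in a rho-ball, then descend using the gradient
    taken at that perturbed point."""
    # First pass: gradient at the current weights.
    loss_fn(model(xb), yb).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                        # climb to the worst-case point
            eps.append(e)
    model.zero_grad()
    # Second pass: gradient at the perturbed weights.
    loss_fn(model(xb), yb).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                    # restore the original weights
    base_optimizer.step()                    # apply the sharpness-aware gradient
    model.zero_grad()
```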
Very large deep learning models trained using gradient descent are remarkably resistant to memorization given their huge capacity, but are at the same time capable of fitting large datasets of pure noise. Here methods are introduced by which models may be trained to memorize datasets that normally are generalized. We find that memorization is difficult relative to generalization, but that adding noise makes memorization easier. Increasing the dataset size exaggerates the characteristics of that dataset: model access to more training samples makes overfitting easier for random data, but somewhat harder for natural images. The bias of deep learning towards generalization is explored theoretically, and we show that generalization results from a model's parameters being attracted to points of maximal stability with respect to that model's inputs during gradient descent.
A question of great interest in machine learning is understanding which examples are challenging for a model to classify. Identifying atypical examples ensures the safe deployment of models, isolates samples that require further inspection, and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VOG) as a valuable and efficient metric for ranking data by difficulty and surfacing a tractable subset of the most challenging examples for human auditing. We show that data points with high VOG scores are far more difficult for the model to learn and over-index on corrupted or memorized examples. Furthermore, restricting the evaluation to the test-set instances with the lowest VOG improves the model's generalization performance. Finally, we show that VOG is a valuable and efficient ranking for out-of-distribution detection.
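A minimal sketch of a per-example VOG score, assuming a list of model checkpoints saved during training; this is the unnormalized variant (the paper also normalizes gradients per checkpoint), and the helper name is illustrative.

```python
import torch

def vog_score(checkpoints, x, y):
    """Variance of Gradients for one example: per-pixel variance, across
    training checkpoints, of the gradient of the true-class logit w.r.t. the
    input, averaged over pixels."""
    grads = []
    for model in checkpoints:
        model.eval()
        inp = x.clone().detach().requires_grad_(True)
        logit = model(inp.unsqueeze(0))[0, y]     # pre-softmax score of the true class
        grads.append(torch.autograd.grad(logit, inp)[0])
    g = torch.stack(grads)                        # (n_checkpoints, *input_shape)
    return g.var(dim=0, unbiased=False).mean().item()
```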
Recent results suggest that reinitializing a subset of the parameters of a neural network during training can improve generalization, particularly for small training sets. We study the impact of different reinitialization methods in several convolutional architectures across 12 benchmark image classification datasets, analyzing their potential gains and highlighting their limitations. We also introduce a new layer-wise reinitialization algorithm that outperforms previous methods and suggest explanations for the observed improved generalization. First, we show that it can increase the margin on the training examples without increasing the norm of the weights, hence leading to an improvement of margin-based generalization bounds for neural networks. Second, we demonstrate that it settles in flatter local minima of the loss surface. Third, it encourages learning general rules and discourages memorization by placing emphasis on the lower layers of the neural network. Our take-home message is that the accuracy of convolutional neural networks can be improved for small datasets using bottom-up layer-wise reinitialization, where the number of reinitialized layers may vary depending on the available compute budget.
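A minimal sketch of bottom-up layer-wise reinitialization, assuming PyTorch: keep the lowest blocks and re-draw fresh initial weights for everything above them, then continue training. The block granularity and the choice of which modules count as "lower" are assumptions, not the paper's exact algorithm.

```python
import torch.nn as nn

def reinitialize_above(model, keep_blocks):
    """Keep the parameters of the first `keep_blocks` child modules and
    reinitialize the remaining (upper) layers."""
    for i, block in enumerate(model.children()):
        if i < keep_blocks:
            continue
        for module in block.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear, nn.BatchNorm2d)):
                module.reset_parameters()
```

A typical protocol under this sketch: train to convergence, call `reinitialize_above`, train again, and repeat with more or fewer preserved blocks depending on the available compute budget.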