Influence functions efficiently estimate the effect of removing a single training data point on a model's learned parameters. While influence estimates align well with leave-one-out retraining for linear models, recent works have shown that this alignment is often poor in neural networks. In this work, we investigate the specific factors that cause this discrepancy by decomposing it into five separate terms. We study the contributions of each term on a variety of architectures and datasets and how they vary with factors such as network width and training time. While practical influence function estimates may be a poor match to leave-one-out retraining for nonlinear networks, we show that they are often a good approximation to a different object we term the proximal Bregman response function (PBRF). Since the PBRF can still be used to answer many of the questions motivating influence functions, such as identifying influential or mislabeled examples, our results suggest that current algorithms for influence function estimation give more informative results than previous error analyses would suggest.
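For reference, the first-order leave-one-out approximation that influence functions compute, written in the usual notation (standard background, not quoted from this paper): removing a training example $z$ changes the loss on a test example $z_{\mathrm{test}}$ by approximately
$$\Delta L(z_{\mathrm{test}}) \;\approx\; \tfrac{1}{n}\,\nabla_\theta L(z_{\mathrm{test}},\hat\theta)^{\top} H_{\hat\theta}^{-1}\,\nabla_\theta L(z,\hat\theta), \qquad H_{\hat\theta}=\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\nabla^2_\theta L(z_i,\hat\theta),$$
where $\hat\theta$ are the learned parameters and $H_{\hat\theta}$ is the training-loss Hessian.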
How can we explain the predictions of a black-box model? In this paper, we use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks.
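As a concrete illustration of the "only gradients and Hessian-vector products" claim, here is a minimal sketch (mine, not the authors' released code) that estimates the influence of a training point on a test point's loss for a small PyTorch model, with the inverse-Hessian-vector product approximated by damped gradient descent on the corresponding quadratic:

```python
# Minimal influence-function sketch using only gradients and Hessian-vector products.
# `full_loss_fn` returns the (mean) training loss; `test_loss` / `train_loss` are
# scalar losses on the query test point and the candidate training point.
import torch

def flat_grad(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hvp(full_loss_fn, params, v):
    # Hessian-vector product H v via two backward passes.
    g = flat_grad(full_loss_fn(), params, create_graph=True)
    return flat_grad(g @ v, params)

def inverse_hvp(full_loss_fn, params, b, damping=0.01, steps=200, lr=0.1):
    # Solve (H + damping * I) x = b by gradient descent on the quadratic;
    # damping keeps the system positive definite for non-convex models.
    x = torch.zeros_like(b)
    for _ in range(steps):
        x = x - lr * (hvp(full_loss_fn, params, x) + damping * x - b)
    return x

def influence(full_loss_fn, params, test_loss, train_loss):
    # Approximates -grad_test^T H^{-1} grad_train (the "upweighting" influence).
    g_test = flat_grad(test_loss, params)
    x = inverse_hvp(full_loss_fn, params, g_test)
    g_train = flat_grad(train_loss, params)
    return -torch.dot(x, g_train).item()
```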
Many problems in machine learning involve bilevel optimization (BLO), including hyperparameter optimization, meta-learning, and dataset distillation. Bilevel problems consist of two nested sub-problems, called the outer and inner problems, respectively. In practice, often at least one of these sub-problems is overparameterized. In this case, there are many ways to choose among optima that achieve equivalent objective values. Inspired by recent studies of the implicit bias induced by optimization algorithms in single-level optimization, we investigate the implicit bias of gradient-based algorithms for bilevel optimization. We delineate two standard BLO methods -- cold-start and warm-start -- and show that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation. We also show that the inner solutions obtained by warm-start BLO can encode a surprising amount of information about the outer objective, even when the outer parameters are low-dimensional. We believe that implicit bias deserves as central a role in the study of bilevel optimization as it has attained in the study of single-level neural net optimization.
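An illustrative sketch (mine, not the paper's code) of the two BLO variants the abstract contrasts: cold-start re-initializes the inner parameters at every outer step, while warm-start reuses the previous inner solution. `init_inner`, `inner_step`, and `outer_step` are placeholder callables.

```python
# Cold-start vs. warm-start bilevel optimization loop (schematic).
def bilevel_optimize(init_inner, inner_step, outer_step, outer_params,
                     warm_start=True, outer_iters=100, inner_iters=50):
    inner_params = init_inner()
    for _ in range(outer_iters):
        if not warm_start:
            inner_params = init_inner()  # cold-start: solve the inner problem from scratch
        for _ in range(inner_iters):
            inner_params = inner_step(inner_params, outer_params)  # approximate inner solution
        # hypergradient step on the outer objective, evaluated at the current inner solution
        outer_params = outer_step(outer_params, inner_params)
    return outer_params, inner_params
```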
Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.
Machine learning (ML) models often need to be retrained on changed datasets in a variety of application scenarios, including data valuation and uncertainty quantification. To retrain models efficiently, linear approximation methods such as influence functions have been proposed to estimate the effect of data changes on model parameters. However, these methods become inaccurate for large changes to the dataset. In this work, we focus on convex learning problems and propose a general framework for learning to estimate the optimized model parameters for different training sets using neural networks. We propose to enforce that the predicted model parameters obey the optimality conditions and maintain utility through regularization techniques, which significantly improves generalization. Furthermore, we rigorously characterize the expressive power of neural networks for approximating the optimizers of convex problems. Empirical results demonstrate the advantage of the proposed method over the state of the art in accurate and efficient model parameter estimation.
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
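To make the Kronecker-factored structure concrete, here is a small NumPy sketch (my illustration, not the paper's implementation) of the preconditioning step for one fully-connected layer $s = W a$, approximating that layer's curvature block by the Kronecker product of an input-covariance factor and an output-gradient-covariance factor, so that only two small inverses are needed:

```python
# Kronecker-factored preconditioning for one fully-connected layer (illustrative sketch).
import numpy as np

def kfac_precondition(grad_W, acts, grads_out, damping=1e-3):
    """grad_W: (d_out, d_in) gradient; acts: (batch, d_in) layer inputs;
    grads_out: (batch, d_out) back-propagated gradients w.r.t. the layer outputs."""
    batch = acts.shape[0]
    A = acts.T @ acts / batch            # input-covariance factor, (d_in, d_in)
    G = grads_out.T @ grads_out / batch  # output-gradient factor, (d_out, d_out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # (A kron G)^{-1} vec(grad_W)  ==  vec(G^{-1} grad_W A^{-1})
    return G_inv @ grad_W @ A_inv

# usage: W -= lr * kfac_precondition(dW, layer_inputs, layer_output_grads)
```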
Training machine learning models on massive datasets incurs substantial computational costs. To alleviate such costs, there has been a sustained effort to develop data-efficient training methods that carefully select subsets of the training examples that generalize on par with the full training data. However, existing methods are limited in providing theoretical guarantees for the quality of models trained on the extracted subsets, and may perform poorly in practice. We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning. The key idea behind our method is to dynamically approximate the curvature of the loss function via an exponentially-averaged estimate of the Hessian in order to select weighted subsets (coresets) that provide a close approximation of the full gradient preconditioned with the Hessian. We prove rigorous guarantees for the convergence of various first- and second-order methods applied to the subsets selected by AdaCore. Our extensive experiments show that, compared to baselines, AdaCore extracts coresets of higher quality and speeds up the training of convex and non-convex machine learning models, such as logistic regression and neural networks, by over 2.9x over the full data and over 4.5x over random subsets.
Deep learning has achieved promising results in a wide range of AI applications. Larger datasets and models consistently yield better performance, but we generally pay for it with longer training times, more computation, and more communication. In this survey, we aim to provide a clear sketch of optimizations for large-scale deep learning with regard to model accuracy and model efficiency. We survey the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and review the state-of-the-art strategies for addressing communication overhead and reducing memory footprints.
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitude of the extremal values of the batch Hessian is larger than that of the empirical Hessian. We also obtain similar results for the generalized Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive an analytical expression for the maximal learning rate as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square root scaling), for smooth, non-convex deep neural networks. While the linear scaling rule for stochastic gradient descent has previously been derived under more restrictive conditions, which we generalize, the square root scaling rule for adaptive optimizers is, to our knowledge, completely novel. For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims on the VGG/WideResNet architectures on the CIFAR-100 and ImageNet datasets. Based on our investigation, we develop an on-the-fly learning rate and momentum learner based on stochastic Lanczos quadrature, which avoids the need for expensive multiple evaluations of these key hyperparameters and shows good preliminary results on a pre-residual architecture for CIFAR-100.
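Stated compactly, the scaling rules claimed above (with $B$ the batch size, $\eta$ the learning rate, and $\delta$ the damping coefficient; this is only a restatement of the abstract's claims, not a derivation):
$$\eta_{\mathrm{SGD}}^{\max} \;\propto\; B, \qquad \eta_{\mathrm{Adam}}^{\max} \;\propto\; \sqrt{B}, \qquad \delta_{\min} \;\propto\; \frac{\eta}{B}.$$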
Data augmentation is commonly applied to improve deep learning performance by enforcing the knowledge that certain transformations of the input preserve the output. Currently, the data augmentations used are chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work for Gaussian processes but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.
Recent legislation has led to interest in machine unlearning, i.e., removing specific training samples from a predictive model as if they had never existed in the training dataset. Unlearning may also be required due to corrupted/adversarial data or simply a user's updated privacy requirement. For models that require no training (k-NN), simply deleting the closest original sample can be effective. But this idea is not applicable to models that learn richer representations. Recent ideas based on optimization-style updates scale poorly with the model dimension d, because they require inverting the Hessian of the loss function. We use a variant of a new conditional independence coefficient, L-CODEC, to identify a subset of the model parameters with the most semantic overlap at the individual sample level. Our approach completely avoids the need to invert a (possibly) huge matrix. By exploiting Markov blanket selection, we premise that L-CODEC is also suitable for deep unlearning, as well as other applications in vision. Compared to alternatives, L-CODEC makes approximate unlearning possible in settings that would otherwise be infeasible, including vision models used for face recognition, person re-identification, and NLP models that may require unlearning of samples identified for exclusion. Code can be found at https://github.com/vsingh-group/lcodec-deep-unlearning/
We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network -- where every weight is a hyperparameter tuned for validation performance -- that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.
• We scale IFT-based hyperparameter optimization to modern, large neural architectures, including AlexNet and LSTM-based language models.
• We demonstrate several uses for fitting hyperparameters almost as easily as weights, including per-parameter regularization, data distillation, and learned-from-scratch data augmentation methods.
• We explore how training-validation splits should change when tuning many hyperparameters.
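A minimal PyTorch sketch of the IFT hypergradient described above, with the inverse Hessian approximated by a truncated Neumann series (one of the efficient inverse-Hessian approximations the abstract alludes to). Function names and constants are mine, and the direct gradient of the validation loss with respect to the hyperparameters is assumed to be zero (as when the hyperparameters only enter the training loss):

```python
# IFT hypergradient with a truncated Neumann-series inverse-Hessian approximation.
import torch

def flat_grad(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def neumann_inverse_hvp(train_loss_fn, w_params, v, alpha=0.01, k=20):
    # H^{-1} v  ~=  alpha * sum_{j=0}^{k} (I - alpha * H)^j v
    p = v.clone()
    acc = v.clone()
    for _ in range(k):
        g = flat_grad(train_loss_fn(), w_params, create_graph=True)
        hvp = flat_grad(g @ p, w_params)   # Hessian-vector product H p
        p = p - alpha * hvp
        acc = acc + p
    return alpha * acc

def hypergradient(train_loss_fn, val_loss_fn, w_params, hyper_params):
    v = flat_grad(val_loss_fn(), w_params)                        # d L_val / d w
    ihvp = neumann_inverse_hvp(train_loss_fn, w_params, v)        # H^{-1} (d L_val / d w)
    g = flat_grad(train_loss_fn(), w_params, create_graph=True)   # d L_train / d w
    # mixed second derivative applied to ihvp: d/d lambda of (g . ihvp)
    mixed = torch.autograd.grad(g @ ihvp, hyper_params)
    return [-m for m in mixed]   # d L_val / d lambda (direct term assumed zero)
```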
The behaviors of deep neural networks (DNNs) are notoriously resistant to human interpretations. In this paper, we propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of their training data. Existing approaches generally estimate data contributions around the final model parameters and ignore how the training data shape the optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the weights of training data, HYDRA assesses the contribution of training data toward test data points throughout the training trajectory. In order to accelerate computation, we remove the Hessian from the calculation and prove that, under moderate conditions, the approximation error is bounded. Corroborating this theoretical claim, empirical results indicate the error is indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels. The source code is available at https://github.com/cyyever/aaai_hydra_8686.
Removing the influence of a specified subset of training data from a machine learning model may be required to address issues such as privacy, fairness, and data quality. Retraining the model from scratch on the data remaining after removal of the subset is effective but often infeasible due to its computational expense. The past few years have therefore seen several novel approaches for efficient removal, forming the field of "machine unlearning"; however, many aspects of the literature published so far are disparate and lack consensus. In this paper, we summarise and compare seven state-of-the-art machine unlearning algorithms, consolidate definitions of core concepts used in the field, reconcile different approaches for evaluating algorithms, and discuss issues related to applying machine unlearning in practice.
Second-order optimizers are believed to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix they typically require approximations to be computable. The most successful family of approximations are Kronecker-factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work for evaluating exact second-order updates with careful ablations to establish a surprising result: due to its approximations, KFAC is not closely related to second-order updates, and in particular it significantly outperforms true second-order updates. This challenges a widely held belief and immediately raises the question of why KFAC performs so well. Towards answering this question, we present strong evidence suggesting that KFAC approximates a first-order algorithm that performs gradient descent on neurons rather than on weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data efficiency.
We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions. Our work leverages a convex reformulation of the standard weight-decay-penalized training problem as a set of group-$\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints. In the special case of zero regularization, we show that this problem is exactly equivalent to unconstrained optimization of a convex "gated ReLU" network. For problems with non-zero regularization, we show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem. To optimize the convex reformulations, we develop an accelerated proximal gradient method and a practical augmented Lagrangian solver. We show that these approaches are faster than standard training heuristics for the non-convex problem, such as SGD, and outperform commercial interior-point solvers. Experimentally, we verify our theoretical results, explore the group-$\ell_1$ regularization path, and scale convex optimization for neural networks to image classification on MNIST and CIFAR-10.
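For concreteness, one standard way to write a "gated ReLU" model, with fixed gate vectors $h_i$ so that the output is linear (and the training problem convex) in the weights $w_i$ (notation mine, not taken from this paper):
$$ f(x) \;=\; \sum_{i=1}^{m} \mathbb{1}\!\left[ h_i^{\top} x \ge 0 \right] \, w_i^{\top} x. $$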
We introduce SketchySGD, a stochastic quasi-Newton method that uses sketching to approximate the curvature of the loss function. Quasi-Newton methods are among the most effective algorithms in traditional optimization, where they converge much faster than first-order methods such as SGD. However, for contemporary deep learning, quasi-Newton methods are considered inferior to first-order methods like SGD and Adam owing to higher per-iteration complexity and fragility due to inexact gradients. SketchySGD circumvents these issues by a novel combination of subsampling, randomized low-rank approximation, and dynamic regularization. In the convex case, we show SketchySGD with a fixed stepsize converges to a small ball around the optimum at a faster rate than SGD for ill-conditioned problems. In the non-convex case, SketchySGD converges linearly under two additional assumptions, interpolation and the Polyak-Lojasiewicz condition, the latter of which holds with high probability for wide neural networks. Numerical experiments on image and tabular data demonstrate the improved reliability and speed of SketchySGD for deep learning, compared to standard optimizers such as SGD and Adam and existing quasi-Newton methods.
Deep Learning optimization involves minimizing a high-dimensional loss function in the weight space, which is often perceived as difficult due to inherent challenges such as saddle points, local minima, ill-conditioning of the Hessian, and limited compute resources. In this paper, we provide a comprehensive review of 12 standard optimization methods successfully used in deep learning research and a theoretical assessment of the difficulties in numerical optimization from the optimization literature.
The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, which in turn allows model hyperparameters to be selected. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning, namely stochastic approximation methods and normalisation layers, and we make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with normalisation layers, generative autoencoders, and transformers.
We analyse a class of bilevel problems in which the upper-level problem consists of the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problem includes instances of meta-learning, equilibrium models, hyperparameter optimization, and data-poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e., they use the previous lower-level approximate solution as a starting point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, such as meta-learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work, we show that without warm-start it is still possible to achieve order-wise (near-)optimal sample complexity. In particular, we propose a simple method which uses stochastic fixed-point iterations at the lower level and projected inexact gradient descent at the upper level, and which reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and deterministic settings, respectively. Finally, compared to methods that use warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.