Deep neural networks have been shown to be very powerful modeling tools for many supervised learning tasks involving complex input patterns. However, they can also easily overfit to training set biases and label noises. In addition to various regularizers, example reweighting algorithms are popular solutions to these problems, but they require careful tuning of additional hyperparameters, such as example mining schedules and regularization hyperparameters. In contrast to past reweighting methods, which typically consist of functions of the cost value of each example, in this work we propose a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions. To determine the example weights, our method performs a meta gradient descent step on the current mini-batch example weights (which are initialized from zero) to minimize the loss on a clean unbiased validation set. Our proposed method can be easily implemented on any type of deep network, does not require any additional hyperparameter tuning, and achieves impressive performance on class imbalance and corrupted label problems where only a small amount of clean validation data is available.
translated by 谷歌翻译
Current deep neural networks (DNNs) can easily overfit to biased training data with corrupted labels or class imbalance. Sample re-weighting strategy is commonly used to alleviate this issue by designing a weighting function mapping from training loss to sample weight, and then iterating between weight recalculating and classifier updating. Current approaches, however, need manually pre-specify the weighting function as well as its additional hyper-parameters. It makes them fairly hard to be generally applied in practice due to the significant variation of proper weighting schemes relying on the investigated problem and training data. To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, constituting a universal approximator to almost any continuous functions, making the method able to fit a wide range of weighting functions including those assumed in conventional research. Guided by a small amount of unbiased meta-data, the parameters of the weighting function can be finely updated simultaneously with the learning process of the classifiers. Synthetic and real experiments substantiate the capability of our method for achieving proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, and more complicated scenarios beyond conventional cases. This naturally leads to its better accuracy than other state-of-the-art methods. Source code is available at https://github.com/xjtushujun/meta-weight-net. * Corresponding author. 1 We call the training data biased when they are generated from a joint sample-label distribution deviating from the distribution of evaluation/test set [1].
translated by 谷歌翻译
In this paper, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD where we leverage an attentional mechanism to assign an individual importance weight to each gradient in the mini-batch. Unlike many existing heuristic-driven methods for tackling data imbalance, our method is grounded in {\it theoretically justified distributionally robust optimization (DRO)}, which is guaranteed to converge to a stationary point of an information-regularized DRO problem. The individual-level weight of a sampled data is systematically proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized DRO. Compared with existing class-level weighting schemes, our method can capture the diversity between individual examples within each class. Compared with existing individual-level weighting methods using meta-learning that require three backward propagations for computing mini-batch stochastic gradients, our method is more efficient with only one backward propagation at each iteration as in standard deep learning methods. To balance between the learning of feature extraction layers and the learning of the classifier layer, we employ a two-stage method that uses SGD for pretraining followed by ABSGD for learning a robust classifier and finetuning lower layers. Our empirical studies on several benchmark datasets demonstrate the effectiveness of the proposed method.
translated by 谷歌翻译
模型 - 不可知的元学习(MAML),一种流行的基于梯度的元学习框架,假设每个任务或实例对元学习​​者的贡献相等。因此,在几次拍摄学习中,它无法解决基本和新颖类之间的域转移。在这项工作中,我们提出了一种新颖的鲁棒元学习算法,巢式MAML,它学会为训练任务或实例分配权重。我们将权重用为超参数,并使用嵌套双级优化方法中设置的一小组验证任务迭代优化它们(与MAML中的标准双级优化相比)。然后,我们在元培训阶段应用NestedMaml,涉及(1)从不同于元测试任务分发的分布中采样的多个任务,或(2)具有嘈杂标签的某些数据样本。对综合和现实世界数据集的广泛实验表明,巢式米姆有效地减轻了“不需要的”任务或情况的影响,从而实现了最先进的强大的元学习方法的显着改善。
translated by 谷歌翻译
标签噪声显着降低了应用中深度模型的泛化能力。有效的策略和方法,\ Texit {例如}重新加权或损失校正,旨在在训练神经网络时缓解标签噪声的负面影响。这些现有的工作通常依赖于预指定的架构并手动调整附加的超参数。在本文中,我们提出了翘曲的概率推断(WARPI),以便在元学习情景中自适应地整理分类网络的培训程序。与确定性模型相比,WARPI通过学习摊销元网络来制定为分层概率模型,这可以解决样本模糊性,因此对严格的标签噪声更加坚固。与直接生成损耗的重量值的现有近似加权功能不同,我们的元网络被学习以估计从登录和标签的输入来估计整流向量,这具有利用躺在它们中的足够信息的能力。这提供了纠正分类网络的学习过程的有效方法,证明了泛化能力的显着提高。此外,可以将整流载体建模为潜在变量并学习元网络,可以无缝地集成到分类网络的SGD优化中。我们在嘈杂的标签上评估了四个强大学习基准的Warpi,并在变体噪声类型下实现了新的最先进的。广泛的研究和分析还展示了我们模型的有效性。
translated by 谷歌翻译
最近关于使用嘈杂标签的学习的研究通过利用小型干净数据集来显示出色的性能。特别是,基于模型不可知的元学习的标签校正方法进一步提高了性能,通过纠正了嘈杂的标签。但是,标签错误矫予没有保障措施,导致不可避免的性能下降。此外,每个训练步骤都需要至少三个背部传播,显着减慢训练速度。为了缓解这些问题,我们提出了一种强大而有效的方法,可以在飞行中学习标签转换矩阵。采用转换矩阵使分类器对所有校正样本持怀疑态度,这减轻了错误的错误问题。我们还介绍了一个双头架构,以便在单个反向传播中有效地估计标签转换矩阵,使得估计的矩阵紧密地遵循由标签校正引起的移位噪声分布。广泛的实验表明,我们的方法在训练效率方面表现出比现有方法相当或更好的准确性。
translated by 谷歌翻译
部分标签学习(PLL)是一个典型的弱监督学习框架,每个培训实例都与候选标签集相关联,其中只有一个标签是有效的。为了解决PLL问题,通常方法试图通过使用先验知识(例如培训数据的结构信息)或以自训练方式提炼模型输出来对候选人集进行歧义。不幸的是,由于在模型训练的早期阶段缺乏先前的信息或不可靠的预测,这些方法通常无法获得有利的性能。在本文中,我们提出了一个新的针对部分标签学习的框架,该框架具有元客观指导性的歧义(MOGD),该框架旨在通过在小验证集中求解元目标来从设置的候选标签中恢复地面真相标签。具体而言,为了减轻假阳性标签的负面影响,我们根据验证集的元损失重新权重。然后,分类器通过最大程度地减少加权交叉熵损失来训练。通过使用普通SGD优化器的各种深网络可以轻松实现所提出的方法。从理论上讲,我们证明了元目标的收敛属性,并得出了所提出方法的估计误差界限。在各种基准数据集和实际PLL数据集上进行的广泛实验表明,与最先进的方法相比,所提出的方法可以实现合理的性能。
translated by 谷歌翻译
Recent deep networks are capable of memorizing the entire data even when the labels are completely random. To overcome the overfitting on corrupted labels, we propose a novel technique of learning another neural network, called Men-torNet, to supervise the training of the base deep networks, namely, StudentNet. During training, MentorNet provides a curriculum (sample weighting scheme) for StudentNet to focus on the sample the label of which is probably correct. Unlike the existing curriculum that is usually predefined by human experts, MentorNet learns a data-driven curriculum dynamically with StudentNet. Experimental results demonstrate that our approach can significantly improve the generalization performance of deep networks trained on corrupted training data. Notably, to the best of our knowledge, we achieve the best-published result on We-bVision, a large benchmark containing 2.2 million images of real-world noisy labels. The code are at https://github.com/google/mentornet.
translated by 谷歌翻译
Deep learning algorithms can fare poorly when the training dataset suffers from heavy class-imbalance but the testing criterion requires good generalization on less frequent classes. We design two novel methods to improve performance in such scenarios. First, we propose a theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound. This loss replaces the standard cross-entropy objective during training and can be applied with prior strategies for training with class-imbalance such as re-weighting or re-sampling. Second, we propose a simple, yet effective, training schedule that defers re-weighting until after the initial stage, allowing the model to learn an initial representation while avoiding some of the complications associated with re-weighting or re-sampling. We test our methods on several benchmark vision tasks including the real-world imbalanced dataset iNaturalist 2018. Our experiments show that either of these methods alone can already improve over existing techniques and their combination achieves even better performance gains 1 .
translated by 谷歌翻译
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
translated by 谷歌翻译
深度学习在大量大数据的帮助下取得了众多域中的显着成功。然而,由于许多真实情景中缺乏高质量标签,数据标签的质量是一个问题。由于嘈杂的标签严重降低了深度神经网络的泛化表现,从嘈杂的标签(强大的培训)学习是在现代深度学习应用中成为一项重要任务。在本调查中,我们首先从监督的学习角度描述了与标签噪声学习的问题。接下来,我们提供62项最先进的培训方法的全面审查,所有这些培训方法都按照其方法论差异分为五个群体,其次是用于评估其优越性的六种性质的系统比较。随后,我们对噪声速率估计进行深入分析,并总结了通常使用的评估方法,包括公共噪声数据集和评估度量。最后,我们提出了几个有前途的研究方向,可以作为未来研究的指导。所有内容将在https://github.com/songhwanjun/awesome-noisy-labels提供。
translated by 谷歌翻译
In this paper, we introduced the novel concept of advisor network to address the problem of noisy labels in image classification. Deep neural networks (DNN) are prone to performance reduction and overfitting problems on training data with noisy annotations. Weighting loss methods aim to mitigate the influence of noisy labels during the training, completely removing their contribution. This discarding process prevents DNNs from learning wrong associations between images and their correct labels but reduces the amount of data used, especially when most of the samples have noisy labels. Differently, our method weighs the feature extracted directly from the classifier without altering the loss value of each data. The advisor helps to focus only on some part of the information present in mislabeled examples, allowing the classifier to leverage that data as well. We trained it with a meta-learning strategy so that it can adapt throughout the training of the main model. We tested our method on CIFAR10 and CIFAR100 with synthetic noise, and on Clothing1M which contains real-world noise, reporting state-of-the-art results.
translated by 谷歌翻译
除了使用硬标签的标准监督学习外,通常在许多监督学习设置中使用辅助损失来改善模型的概括。例如,知识蒸馏增加了第二个教师模仿模型训练的损失,在该培训中,教师可能是一个验证的模型,可以输出比标签更丰富的分布。同样,在标记数据有限的设置中,弱标记信息以标签函数的形式使用。此处引入辅助损失来对抗标签函数,这些功能可能是基于嘈杂的规则的真实标签近似值。我们解决了学习以原则性方式结合这些损失的问题。我们介绍AMAL,该AMAL使用元学习在验证度量上学习实例特定的权重,以实现损失的最佳混合。在许多知识蒸馏和规则降解域中进行的实验表明,Amal在这些领域中对竞争基准的增长可显着。我们通过经验分析我们的方法,并分享有关其提供性能提升的机制的见解。
translated by 谷歌翻译
对网络规模数据进行培训可能需要几个月的时间。但是,在已经学习或不可学习的冗余和嘈杂点上浪费了很多计算和时间。为了加速训练,我们引入了可减少的持有损失选择(Rho-loss),这是一种简单但原则上的技术,它大致选择了这些训练点,最大程度地减少了模型的概括损失。结果,Rho-loss减轻了现有数据选择方法的弱点:优化文献中的技术通常选择“硬损失”(例如,高损失),但是这种点通常是嘈杂的(不可学习)或更少的任务与任务相关。相反,课程学习优先考虑“简单”的积分,但是一旦学习,就不必对这些要点进行培训。相比之下,Rho-Loss选择了可以学习的点,值得学习的,尚未学习。与先前的艺术相比,Rho-loss火车的步骤要少得多,可以提高准确性,并加快对广泛的数据集,超参数和体系结构(MLP,CNNS和BERT)的培训。在大型Web绑带图像数据集服装1M上,与统一的数据改组相比,步骤少18倍,最终精度的速度少2%。
translated by 谷歌翻译
我们考虑采用转移学习方法,可以在目标任务上微调一个预处理的深神经网络。我们研究微调的概括特性,以了解过度拟合的问题,而这种问题通常在实践中发生。先前的工作表明,约束与微调初始化的距离可改善概括。使用Pac-bayesian分析,我们观察到,除了初始化的距离外,黑森人还通过深神网络的噪声稳定性影响噪声注射。在观察过程中,我们为广泛的微调方法开发了基于HESSIAN距离的概括界。此外,我们研究了在嘈杂标签的情况下进行微调的鲁棒性。在我们的理论中,我们设计了一种算法,该算法结合了一致的损失和基于距离的正则化,以进行微调,以及在训练集标签中有条件独立噪声下的概括错误保证。我们对各种嘈杂的环境和体系结构进行了详细的经验研究。在六个图像分类任务上,其训练标签是通过编程标签生成的,我们发现比先前的微调方法的精度增长了3.26%。同时,微型模型的Hessian距离度量降低了六倍,是现有方法的六倍。
translated by 谷歌翻译
然而,他们的性能在火车时间存在嘈杂的标签存在下降。灵感来自于使用专家建议的学习,其中乘法权重(MW)更新最近被证明是在专家建议中适度的数据损坏的强大,我们建议在神经网络优化期间使用MW进行重新免除示例。我们理论上建立了当与梯度下降一起使用时的方法的收敛性,并证明其在1D案例中的标签噪声的优势。然后,我们通过表明MW在CIFAR-10,CIFAR-100和服装1M上的标签噪声存在下提高神经网络精度来验证我们的调查结果。我们还展示了我们对对抗性鲁棒性的影响。
translated by 谷歌翻译
大量数据集上的培训机学习模型会产生大量的计算成本。为了减轻此类费用,已经持续努力开发数据有效的培训方法,这些方法可以仔细选择培训示例的子集,以概括为完整的培训数据。但是,现有方法在为在提取子集训练的模型的质量提供理论保证方面受到限制,并且在实践中的表现可能差。我们提出了Adacore,该方法利用数据的几何形状提取培训示例的子集以进行有效的机器学习。我们方法背后的关键思想是通过对Hessian的指数平均估计值动态近似损耗函数的曲率,以选择加权子集(核心),这些子集(核心)可提供与Hessian的完整梯度预处理的近似值。我们证明,对应用于Adacore选择的子集的各种一阶和二阶方法的收敛性有严格的保证。我们的广泛实验表明,与基准相比,ADACORE提取了质量更高的核心,并加快了对凸和非凸机学习模型的训练,例如逻辑回归和神经网络,超过2.9倍,超过4.5倍,而随机子集则超过4.5倍。 。
translated by 谷歌翻译
Deep Neural Networks (DNNs) have been shown to be susceptible to memorization or overfitting in the presence of noisily-labelled data. For the problem of robust learning under such noisy data, several algorithms have been proposed. A prominent class of algorithms rely on sample selection strategies wherein, essentially, a fraction of samples with loss values below a certain threshold are selected for training. These algorithms are sensitive to such thresholds, and it is difficult to fix or learn these thresholds. Often, these algorithms also require information such as label noise rates which are typically unavailable in practice. In this paper, we propose an adaptive sample selection strategy that relies only on batch statistics of a given mini-batch to provide robustness against label noise. The algorithm does not have any additional hyperparameters for sample selection, does not need any information on noise rates and does not need access to separate data with clean labels. We empirically demonstrate the effectiveness of our algorithm on benchmark datasets.
translated by 谷歌翻译
对机器学习模型进行了训练,以最大程度地减少单个度量标准的平均损失,因此通常不考虑公平和稳健性。当培训数据不平衡或测试分布不同时,忽略培训中的这种指标可能会使这些模型容易违反公平。这项工作介绍了通过元学习(FormL)进行公平优化的重新加权,这是一种训练算法,通过共同学习培训样本权重和神经网络参数来平衡公平和鲁棒性与准确性。该方法通过学习通过动态重新重量从用户指定的保留集合中学到的数据来平衡分布的数据来平衡超额和代表性不足的子组的贡献来提高模型的公平性。 Forml提高了图像分类任务上的机会公平标准的平等性,减少了损坏的标签的偏见,并通过数据凝结促进了建立更多公平数据集。这些改进是在没有预处理数据或后处理模型输出的情况下实现的,而无需学习额外的加权函数,没有更改模型体系结构,而是在原始预测指标上保持准确性。
translated by 谷歌翻译
不平衡的数据对基于深度学习的分类模型构成挑战。解决不平衡数据的最广泛使用的方法之一是重新加权,其中训练样本与损失功能的不同权重相关。大多数现有的重新加权方法都将示例权重视为可学习的参数,并优化了元集中的权重,因此需要昂贵的双重优化。在本文中,我们从分布的角度提出了一种基于最佳运输(OT)的新型重新加权方法。具体而言,我们将训练集视为其样品上的不平衡分布,该分布由OT运输到从元集中获得的平衡分布。训练样品的权重是分布不平衡的概率质量,并通过最大程度地减少两个分布之间的ot距离来学习。与现有方法相比,我们提出的一种方法可以脱离每次迭代时的体重学习对相关分类器的依赖性。图像,文本和点云数据集的实验表明,我们提出的重新加权方法具有出色的性能,在许多情况下实现了最新的结果,并提供了一种有希望的工具来解决不平衡的分类问题。
translated by 谷歌翻译