Spurious correlations in training data often lead to robustness issues since models learn to use them as shortcuts. For example, when predicting whether an object is a cow, a model might learn to rely on its green background, so it would do poorly on a cow on a sandy background. A standard dataset for measuring state-of-the-art on methods mitigating this problem is Waterbirds. The best method (Group Distributionally Robust Optimization - GroupDRO) currently achieves 89\% worst group accuracy and standard training from scratch on raw images only gets 72\%. GroupDRO requires training a model in an end-to-end manner with subgroup labels. In this paper, we show that we can achieve up to 90\% accuracy without using any sub-group information in the training set by simply using embeddings from a large pre-trained vision model extractor and training a linear classifier on top of it. With experiments on a wide range of pre-trained models and pre-training datasets, we show that the capacity of the pre-training model and the size of the pre-training dataset matters. Our experiments reveal that high capacity vision transformers perform better compared to high capacity convolutional neural networks, and larger pre-training dataset leads to better worst-group accuracy on the spurious correlation dataset.
translated by 谷歌翻译
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
translated by 谷歌翻译
接受经验风险最小化(ERM)训练的机器学习模型的预测性能可以大大降解分配变化。在训练数据集中存在虚假相关性的存在导致ERM训练的模型在对不存在此类相关性的少数群体评估时表现出很高的损失。已经进行了广泛的尝试来开发改善最差的鲁棒性的方法。但是,他们需要每个培训输入的组信息,或者至少需要一个带有组标签的验证设置来调整其超参数,这可能是昂贵的或未知的。在本文中,我们应对在培训或验证期间没有小组注释的情况下提高组鲁棒性的挑战。为此,我们建议根据``识别''模型提取的特征的革兰氏集矩阵将训练数据集分为组,并根据这些伪组应用强大的优化。在不可用的小组标签的现实情况下,我们的实验表明,我们的方法不仅可以改善对ERM的稳健性,而且还优于所有最近的基线
translated by 谷歌翻译
虽然大型审计的基础模型(FMS)对数据集级别的分布变化显示出显着的零击分类鲁棒性,但它们对亚群或组移动的稳健性相对却相对不受欢迎。我们研究了这个问题,并发现诸如剪辑之类的FMS可能对各种群体转移可能不健壮。在9个稳健性基准中,其嵌入式分类零射击分类导致平均和最差组精度之间的差距高达80.7个百分点(PP)。不幸的是,现有的改善鲁棒性的方法需要重新培训,这在大型基础模型上可能非常昂贵。我们还发现,改善模型推理的有效方法(例如,通过适配器,具有FM嵌入式作为输入的轻量级网络)不会持续改进,有时与零击相比会伤害组鲁棒性(例如,将精度差距提高到50.1 pp on 50.1 pp on On on 50.1 pp on Celeba)。因此,我们制定了一种适配器培训策略,以有效有效地改善FM组的鲁棒性。我们激励的观察是,尽管同一阶级中的群体中较差的鲁棒性在基础模型“嵌入空间”中分开,但标准适配器训练可能不会使这些要点更加紧密。因此,我们提出了对比度的适应,该适应器会通过对比度学习进行训练适配器,以使样品嵌入在同一类中的地面真相类嵌入和其他样品嵌入。在整个9个基准测试中,我们的方法始终提高组鲁棒性,使最差的组精度提高了8.5至56.0 pp。我们的方法也是有效的,这样做的方法也没有任何FM芬太尼,只有一组固定的冷冻FM嵌入。在水鸟和Celeba等基准上,这导致最差的组精度可与最先进的方法相媲美,而最先进的方法可以重新训练整个模型,而仅训练$ \ leq $ 1%的模型参数。
translated by 谷歌翻译
显示过次分辨率化,导致在亚组信息的各种设置下在罕见的子组上的测试精度差。为了获得更完整的图片,我们考虑子组信息未知的情况。我们调查模型规模在多种设置的经验风险最小化(ERM)下最差组泛化的影响,不同:1)架构(Reset,VGG或BERT),2)域(视觉或自然语言处理)3)模型尺寸(宽度或深度)和4)初始化(具有预先培训或随机重量)。我们的系统评价显示,模型大小的增加不会受到伤害,并且可以帮助所有设置的ERM下的最差群体测试性能。特别是,增加预先训练的模型大小一致地提高水鸟和多液体的性能。当子组标签未知时,我们建议从业者使用更大的预训练模型。
translated by 谷歌翻译
Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Such models typically lead to poor performance during inference for data lacking such correlations. In this work, we explicitly consider a situation where potential spurious correlations are present in the majority of training data. In contrast with existing approaches, which use the ERM model outputs to detect the samples without spurious correlations, and either heuristically upweighting or upsampling those samples; we propose the logit correction (LC) loss, a simple yet effective improvement on the softmax cross-entropy loss, to correct the sample logit. We demonstrate that minimizing the LC loss is equivalent to maximizing the group-balanced accuracy, so the proposed LC could mitigate the negative impacts of spurious correlations. Our extensive experimental results further reveal that the proposed LC loss outperforms the SoTA solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels. LC is also competitive with oracle methods that make use of the attribute labels. Code is available at https://github.com/shengliu66/LC.
translated by 谷歌翻译
Machine learning models have been found to learn shortcuts -- unintended decision rules that are unable to generalize -- undermining models' reliability. Previous works address this problem under the tenuous assumption that only a single shortcut exists in the training data. Real-world images are rife with multiple visual cues from background to texture. Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts or struggle in a Whac-A-Mole game, i.e., where mitigating one shortcut amplifies reliance on others. To address this shortcoming, we propose two benchmarks: 1) UrbanCars, a dataset with precisely controlled spurious cues, and 2) ImageNet-W, an evaluation set based on ImageNet for watermark, a shortcut we discovered affects nearly every modern vision model. Along with texture and background, ImageNet-W allows us to study multiple shortcuts emerging from training on natural images. We find computer vision models, including large foundation models -- regardless of training set, architecture, and supervision -- struggle when multiple shortcuts are present. Even methods explicitly designed to combat shortcuts struggle in a Whac-A-Mole dilemma. To tackle this challenge, we propose Last Layer Ensemble, a simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole behavior. Our results surface multi-shortcut mitigation as an overlooked challenge critical to advancing the reliability of vision systems. The datasets and code are released: https://github.com/facebookresearch/Whac-A-Mole.git.
translated by 谷歌翻译
神经网络倾向于在训练数据的主要部分中表现出的类和潜在属性之间的虚假相关性,这破坏了其概括能力。本文提出了一种新的方法,用于培训错误的分类器,没有虚假属性标签。该方法的关键思想是采用分类器委员会作为辅助模块,该模块可以识别偏置冲突的数据,即没有虚假相关性的数据,并在训练主要分类器时向它们分配了很大的权重。该委员会被学到了一个自举的合奏,因此大多数分类器都具有偏见和多样化,并且故意无法相应地预测偏见的偏见。因此,预测难度委员会的共识为识别和加权偏见冲突数据提供了可靠的提示。此外,该委员会还接受了从主要分类器转移的知识的培训,以便它逐渐与主要分类器一起变得偏见,并强调随着培训的进行而更加困难的数据。在五个现实世界数据集中,我们的方法在没有像我们这样的虚假属性标签的现有方法上优于现有方法,甚至偶尔会超越依靠偏见标签的方法。
translated by 谷歌翻译
我们考虑了OOD概括的问题,其目标是训练在与训练分布不同的测试分布上表现良好的模型。已知深度学习模型在这种转变上是脆弱的,即使对于略有不同的测试分布,也可能遭受大量精度下降。我们提出了一种基于直觉的新方法 - 愚蠢的方法,即大量丰富特征的对抗性结合应提供鲁棒性。我们的方法仔细提炼了一位强大的老师的知识,该知识使用标准培训学习了几个判别特征,同时使用对抗性培训将其结合在一起。对标准的对抗训练程序进行了修改,以产生可以更好地指导学生的教师。我们评估DAFT在域床框架中的标准基准测试中,并证明DAFT比当前最新的OOD泛化方法取得了重大改进。 DAFT始终超过表现良好的ERM和蒸馏基线高达6%,对于较小的网络而言,其增长率更高。
translated by 谷歌翻译
Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization to distributional shifts. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups over training data. G-DRO successfully improves performance of the worst-group, where the correlation does not hold. However, G-DRO assumes that the spurious correlations and associated worst groups are known in advance, making it challenging to apply it to new tasks with potentially multiple unknown spurious correlations. We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization -- an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model to find a group assignment for training examples which maximizes worst-case loss over the discovered groups. On the WILDS benchmark, AGRO results in 8% higher model performance on average on known worst-groups, compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO -- datasets where potential spurious correlations are as yet uncharacterized. Human evaluation of ARGO groups shows that they contain well-defined, yet previously unstudied spurious correlations that lead to model errors.
translated by 谷歌翻译
数据集偏见和虚假相关性可能会严重损害深层神经网络中的概括。许多先前的努力已经使用替代性损失功能或集中在稀有模式上的采样策略来解决此问题。我们提出了一个新的方向:修改网络体系结构以施加归纳偏见,从而使网络对数据集偏置进行鲁棒性。具体而言,我们提出了OCCAMNET,这些OCCAMNET有偏见以通过设计偏爱更简单的解决方案。 OCCAMNET具有两个电感偏见。首先,他们有偏见地使用单个示例所需的网络深度。其次,它们偏向使用更少的图像位置进行预测。尽管Occamnets偏向更简单的假设,但必要时可以学习更多复杂的假设。在实验中,OCCAMNET的表现优于或竞争对手的最先进方法在不包含这些电感偏见的体系结构上运行。此外,我们证明,当最先进的伪造方法与OCCAMNETS结合使用时,结果进一步改善。
translated by 谷歌翻译
通常对机器学习分类器进行培训,以最大程度地减少数据集的平均误差。不幸的是,在实践中,这个过程通常会利用训练数据中亚组不平衡引起的虚假相关性,从而导致高平均性能,但跨亚组的性能高度可变。解决此问题的最新工作提出了使用骆驼进行模型修补。这种先前的方法使用生成的对抗网络来执行类内的群间数据增强,需要(a)训练许多计算昂贵的模型以及(b)给定域模型的合成输出的足够质量。在这项工作中,我们提出了RealPatch,这是一个基于统计匹配的简单,更快,更快的数据增强的框架。我们的框架通过使用真实样本增强数据集来执行模型修补程序,从而减轻了为目标任务训练生成模型的需求。我们证明了RealPatch在三个基准数据集,Celeba,Waterbird和IwildCam的一部分中的有效性,显示了最差的亚组性能和二进制分类中亚组性能差距的改进。此外,我们使用IMSITU数据集进行了211个类的实验,在这种设置中,基于生成模型的修补(例如骆驼)是不切实际的。我们表明,RealPatch可以成功消除数据集泄漏,同时减少模型泄漏并保持高实用程序。可以在https://github.com/wearepal/realpatch上找到RealPatch的代码。
translated by 谷歌翻译
Overparameterized neural networks can be highly accurate on average on an i.i.d.test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization-a stronger-than-typical 2 penalty or early stopping-we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
translated by 谷歌翻译
尽管无偏见的机器学习模型对于许多应用程序至关重要,但偏见是一个人为定义的概念,可以在任务中有所不同。只有输入标签对,算法可能缺乏足够的信息来区分稳定(因果)特征和不稳定(虚假)特征。但是,相关任务通常具有类似的偏见 - 我们可以利用在转移环境中开发稳定的分类器的观察结果。在这项工作中,我们明确通知目标分类器有关源任务中不稳定功能的信息。具体而言,我们得出一个表示,该表示通过对比源任务中的不同数据环境来编码不稳定的功能。我们通过根据此表示形式将目标任务的数据聚类来实现鲁棒性,并最大程度地降低这些集群中最坏情况的风险。我们对文本和图像分类进行评估。经验结果表明,我们的算法能够在合成生成的环境和现实环境的目标任务上保持鲁棒性。我们的代码可在https://github.com/yujiabao/tofu上找到。
translated by 谷歌翻译
Does the dominant approach to learn representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions. Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently na\"ive ensembling technique: concatenating the representations obtained with multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
translated by 谷歌翻译
域的概括(DG)旨在仅使用有限的源域学习一个通用模型。先前的DG尝试仅由于训练和测试域之间的显着域移动而无法从源域中学习域不变表示。取而代之的是,我们使用Oracle模型使用共同信息重新构建了DG目标,该模型将概括为任何可能的域。我们通过通过预训练的模型近似oracle模型来得出一个可拖动的变化下限,称为使用Oracle(Miro)的相互信息正则化。我们的广泛实验表明,Miro可显着提高分布性能。此外,我们的缩放实验表明,预训练模型的尺度越大,miro的性能提高就越大。源代码可在https://github.com/kakaobrain/miro中获得。
translated by 谷歌翻译
学习不变表示是在数据集中虚假相关驱动的机器学习模型时的重要要求。这些杂散相关性,在输入样本和目标标签之间,错误地指导了神经网络预测,导致某些组的性能差,尤其是少数群体。针对这些虚假相关性的强大培训需要每个样本的组成员资格。这种要求在少数群体或稀有群体的数据标签努力的情况下是显着费力的,或者包括数据集的个人选择隐藏敏感信息的情况。另一方面,存在这种数据收集的存在力度导致包含部分标记的组信息的数据集。最近的作品解决了完全无监督的场景,没有用于组的标签。因此,我们的目标是通过解决更现实的设置来填补文献中的缺失差距,这可以在培训期间利用部分可用的敏感或群体信息。首先,我们构造一个约束集并导出组分配所属的高概率绑定到该集合。其次,我们提出了一种从约束集中优化了优化最严格的组分配的算法。通过对图像和表格数据集的实验,我们显示少数集团的性能的改进,同时在跨组中保持整体汇总精度。
translated by 谷歌翻译
Benchmark performance of deep learning classifiers alone is not a reliable predictor for the performance of a deployed model. In particular, if the image classifier has picked up spurious features in the training data, its predictions can fail in unexpected ways. In this paper, we develop a framework that allows us to systematically identify spurious features in large datasets like ImageNet. It is based on our neural PCA components and their visualization. Previous work on spurious features of image classifiers often operates in toy settings or requires costly pixel-wise annotations. In contrast, we validate our results by checking that presence of the harmful spurious feature of a class is sufficient to trigger the prediction of that class. We introduce a novel dataset "Spurious ImageNet" and check how much existing classifiers rely on spurious features.
translated by 谷歌翻译
神经网络分类器已成为当前“火车前的Fine-Tune”范例的De-Facto选择。在本文中,我们调查了K $ -Nearest邻居(K-NN)分类器,这是一种从预先学习时代的无古典无模型学习方法,作为基于现代神经网络的方法的增强。作为懒惰的学习方法,K-Nn简单地聚集了训练集中的测试图像和顶-k邻居之间的距离。我们采用k-nn具有由监督或自我监督方法产生的预训练的视觉表现,分为两个步骤:(1)利用K-NN预测概率作为培训期间容易\〜〜硬示例的迹象。 (2)用增强分类器的预测分布线性地插入k-nn。通过广泛的实验在广泛的分类任务中,我们的研究揭示了K-NN集成与额外见解的一般性和灵活性:(1)K-NN实现竞争结果,有时甚至优于标准的线性分类器。 (2)结合K-NN对参数分类器执行不良和/或低数据制度的任务特别有益。我们希望这些发现将鼓励人们重新考虑预先学习的角色,计算机愿景中的古典方法。我们的代码可用于:https://github.com/kmnp/nn-revisit。
translated by 谷歌翻译
几项研究在经验上比较了各种模型的分布(ID)和分布(OOD)性能。他们报告了计算机视觉和NLP中基准的频繁正相关。令人惊讶的是,他们从未观察到反相关性表明必要的权衡。这重要的是确定ID性能是否可以作为OOD概括的代理。这篇简短的论文表明,ID和OOD性能之间的逆相关性确实在现实基准中发生。由于模型的选择有偏见,因此在过去的研究中可能被错过。我们使用来自多个训练时期和随机种子的模型展示了Wilds-Amelyon17数据集上模式的示例。我们的观察结果尤其引人注目,对经过正规化器训练的模型,将解决方案多样化为ERM目标。我们在过去的研究中得出了细微的建议和结论。 (1)高OOD性能有时确实需要交易ID性能。 (2)仅专注于ID性能可能不会导致最佳OOD性能:它可能导致OOD性能的减少并最终带来负面回报。 (3)我们的示例提醒人们,实证研究仅按照现有方法来制定制度:在提出规定的建议时有必要进行护理。
translated by 谷歌翻译