Machine learning models often use spurious patterns such as "relying on the presence of a person to detect a tennis racket", which do not generalize. In this work, we present an end-to-end pipeline for identifying and mitigating spurious patterns in image classifiers. We begin by finding patterns such as "the model's prediction for tennis racket changes 63% of the time if we hide the people." Then, if a pattern is spurious, we mitigate it via a novel form of data augmentation. We demonstrate that this approach identifies a diverse set of spurious patterns and mitigates them, producing a model that is both more accurate on a distribution where the spurious pattern is not helpful and more robust to distribution shift.
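The pipeline's core measurement, how often a prediction flips when a co-occurring object is hidden, is easy to sketch. A minimal PyTorch sketch, assuming per-image binary masks for the object to hide; the function name and the grey-fill heuristic are illustrative, not the authors' exact procedure:

```python
import torch

def flip_rate(model, images, masks, target_class, fill=0.5):
    """Fraction of images predicted as `target_class` whose prediction
    changes once the co-occurring object (e.g. the people) is hidden."""
    model.eval()
    with torch.no_grad():
        orig = model(images).argmax(dim=1)
        hidden = images * (1 - masks) + fill * masks  # grey out the object
        new = model(hidden).argmax(dim=1)
    was_target = orig == target_class
    flipped = was_target & (new != target_class)
    return (flipped.sum() / was_target.sum().clamp(min=1)).item()
```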
Models trained via empirical risk minimization (ERM) are known to rely on spurious correlations between labels and task-independent input features, resulting in poor generalization to distributional shifts. Group distributionally robust optimization (G-DRO) can alleviate this problem by minimizing the worst-case loss over a set of pre-defined groups of training data. G-DRO successfully improves performance on the worst group, where the correlation does not hold. However, G-DRO assumes that the spurious correlations and associated worst groups are known in advance, making it challenging to apply to new tasks with potentially multiple unknown spurious correlations. We propose AGRO -- Adversarial Group discovery for Distributionally Robust Optimization -- an end-to-end approach that jointly identifies error-prone groups and improves accuracy on them. AGRO equips G-DRO with an adversarial slicing model that finds a group assignment for training examples which maximizes worst-case loss over the discovered groups. On the WILDS benchmark, AGRO results in 8% higher model performance on average on known worst-groups, compared to prior group discovery approaches used with G-DRO. AGRO also improves out-of-distribution performance on SST2, QQP, and MS-COCO -- datasets where potential spurious correlations are as yet uncharacterized. Human evaluation of AGRO groups shows that they contain well-defined, yet previously unstudied, spurious correlations that lead to model errors.
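The worst-case objective that AGRO's slicing model pushes against is the standard G-DRO quantity. A minimal sketch of that inner loss, assuming hard group assignments (AGRO itself learns soft, adversarial assignments):

```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids, num_groups):
    """Maximum over groups of the average per-group loss: the quantity
    G-DRO minimizes and AGRO's slicing model tries to drive up."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(num_groups):
        in_group = group_ids == g
        if in_group.any():
            group_losses.append(per_example[in_group].mean())
    return torch.stack(group_losses).max()
```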
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
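The pre-training task reduces to a symmetric contrastive loss over a batch of (image, text) pairs. A sketch following the pseudocode in the CLIP paper; the temperature value is illustrative (CLIP learns it):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss: each image must identify its own
    caption within the batch, and each caption its own image."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```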
Understanding and explaining the mistakes made by trained models is critical to many machine learning objectives, such as improving robustness, addressing concept drift, and mitigating biases. However, this is often an ad hoc process that involves manually looking at the model's mistakes on many test samples and guessing at the underlying reasons for those incorrect predictions. In this paper, we propose a systematic approach, conceptual counterfactual explanations (CCE), that explains why a classifier makes a mistake on a particular test sample in terms of human-understandable concepts (e.g., this zebra is misclassified as a dog because of faint stripes). We build on two prior ideas, counterfactual explanations and concept activation vectors, and validate our approach on well-known pretrained models, showing that it meaningfully explains the models' mistakes. In addition, for new models trained on data with spurious correlations, CCE accurately identifies the spurious correlation as the cause of model mistakes from a single misclassified test sample. On two challenging medical applications, CCE generated useful insights, confirmed by clinicians, into biases and mistakes the model makes in real-world settings.
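At a high level, CCE searches for a sparse shift along concept directions that would have corrected the prediction. A loose sketch under several assumptions: a frozen linear head, a precomputed bank of concept activation vectors (one row per concept), and an illustrative L1 penalty and optimizer that need not match the authors' exact objective:

```python
import torch
import torch.nn.functional as F

def conceptual_counterfactual(embedding, true_label, head, concept_bank,
                              steps=100, lr=0.1, l1=0.1):
    """Find sparse concept weights w such that shifting the sample's
    embedding by w @ concept_bank recovers the true label; large
    positive entries name the concepts whose absence caused the error."""
    w = torch.zeros(concept_bank.size(0), requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        shifted = embedding + w @ concept_bank
        loss = (F.cross_entropy(head(shifted).unsqueeze(0), true_label)
                + l1 * w.abs().sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # true_label is a length-1 LongTensor, e.g. tensor([3])
```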
Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem-solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
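The Spectral Relevance Analysis step can be approximated with off-the-shelf spectral clustering over per-example relevance heatmaps. A minimal sketch, assuming the heatmaps (e.g. from layer-wise relevance propagation) are precomputed; the affinity and cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def relevance_clusters(heatmaps, n_clusters=5):
    """Cluster per-example relevance heatmaps so that examples the model
    handles with the same strategy (e.g. a watermark shortcut) group together."""
    flat = np.asarray(heatmaps).reshape(len(heatmaps), -1)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="nearest_neighbors").fit_predict(flat)
```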
We present a framework for ranking images within their class based on the strength of spurious cues present. By measuring the gap in accuracy on the highest and lowest ranked images (we call this spurious gap), we assess spurious feature reliance for $89$ diverse ImageNet models, finding that even the best models underperform in images with weak spurious presence. However, the effect of spurious cues varies far more dramatically across classes, emphasizing the crucial, often overlooked, class-dependence of the spurious correlation problem. While most spurious features we observe are clarifying (i.e. improving test-time accuracy when present, as is typically expected), we surprisingly find many cases of confusing spurious features, where models perform better when they are absent. We then close the spurious gap by training new classification heads on lowly ranked (i.e. without common spurious cues) images, resulting in improved effective robustness to distribution shifts (ObjectNet, ImageNet-R, ImageNet-Sketch). We also propose a second metric to assess feature reliability, finding that spurious features are generally less reliable than non-spurious (core) ones, though again, spurious features can be more reliable for certain classes. To enable our analysis, we annotated $5,000$ feature-class dependencies over {\it all} of ImageNet as core or spurious using minimal human supervision. Finally, we show the feature discovery and spuriosity ranking framework can be extended to other datasets like CelebA and WaterBirds in a lightweight fashion with only linear layer training, leading to the discovery of a previously unknown racial bias in CelebA hair classification.
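Closing the spurious gap requires only a new linear head on a frozen backbone. A minimal sketch, using a single global spuriosity threshold for brevity where the paper ranks within each class; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def retrain_head(backbone, head, images, labels, spuriosity,
                 keep_frac=0.25, epochs=10, lr=1e-3):
    """Fit a fresh linear head on the least-spurious fraction of the data,
    keeping the feature extractor frozen."""
    keep = spuriosity <= torch.quantile(spuriosity, keep_frac)
    with torch.no_grad():
        feats = backbone(images[keep])
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(head(feats), labels[keep])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```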
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
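The two stages of JTT translate almost directly into code. A minimal sketch of the second-stage weighting, with the upweighting factor as a hyperparameter:

```python
import torch

def jtt_weights(first_model, loader, upweight=20.0):
    """Per-example weights for stage 2: points the stage-1 ERM model
    misclassified are upweighted by a constant factor."""
    first_model.eval()
    weights = []
    with torch.no_grad():
        for x, y in loader:
            wrong = first_model(x).argmax(dim=1) != y
            weights.append(1.0 + (upweight - 1.0) * wrong.float())
    return torch.cat(weights)

# Stage 2 then minimizes the weighted loss, e.g.
# (w_batch * F.cross_entropy(model(x), y, reduction="none")).mean()
```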
Machine learning classifiers are typically trained to minimize the average error across a dataset. Unfortunately, in practice, this process often exploits spurious correlations caused by subgroup imbalance within the training data, resulting in high average performance but highly variable performance across subgroups. Recent work to address this problem proposes model patching with CAMEL. This prior approach uses generative adversarial networks to perform intra-class inter-group data augmentation, requiring (a) the training of a number of computationally expensive models and (b) sufficient quality of the model's synthetic outputs for the given domain. In this work, we propose RealPatch, a framework for simpler, faster, and more data-efficient data augmentation based on statistical matching. Our framework performs model patching by augmenting the dataset with real samples, mitigating the need to train generative models for the target task. We demonstrate the effectiveness of RealPatch on three benchmark datasets, CelebA, Waterbirds, and a subset of iWildCam, showing improvements in worst-case subgroup performance and in the subgroup performance gap in binary classification. Furthermore, we conduct experiments with the imSitu dataset, which has 211 classes, a setting where generative-model-based patching such as CAMEL is impractical. We show that RealPatch can successfully eliminate dataset leakage while reducing model leakage and maintaining high utility. The code for RealPatch can be found at https://github.com/wearepal/realpatch.
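The statistical-matching idea can be approximated with a nearest-neighbour search: each training point is paired with a real point of the same class but the opposite subgroup. This simplified sketch omits RealPatch's actual matching machinery (e.g. propensity-score-style calipers) and is illustrative only:

```python
import numpy as np

def match_across_subgroups(feats, labels, groups):
    """For each sample, index of its nearest real sample with the same
    class label but the opposite subgroup; matched pairs then augment
    the training set. O(n^2) brute force for clarity."""
    matches = np.empty(len(feats), dtype=int)
    for i in range(len(feats)):
        pool = np.where((labels == labels[i]) & (groups != groups[i]))[0]
        dists = np.linalg.norm(feats[pool] - feats[i], axis=1)
        matches[i] = pool[dists.argmin()]
    return matches
```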
We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots remaining in ImageNet. ImageNet-R was specifically created with renditions in mind, since representations are no longer simply natural but include artistic and other renditions. The CAOS suite is built on the CARLA simulator, allows the inclusion of anomalous objects, and can create reproducible synthetic environments and scenes for testing robustness. All of these datasets were created for testing robustness and measuring progress in robustness. They have been used in various other works to measure their own progress in robustness, and they allow for tangential progress that does not focus exclusively on natural accuracy. Given these datasets, we create several novel methods aimed at advancing robustness research. We build simple baselines in the form of Maximum Logit and typicality scores, and create a new data augmentation method in the form of DeepAugment, which improves on the aforementioned benchmarks. Maximum Logit considers the logit values instead of the values after the softmax operation; while a small change, it produces noticeable improvements. The typicality score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline in all but the segmentation task, and speculate that at the pixel level the semantic information of a pixel is less meaningful than class-level information. Finally, the new augmentation technique DeepAugment uses neural networks to create augmentations on images that are radically different from the traditional geometric and camera-based transformations used previously.
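The Maximum Logit score itself is a one-line change from the common maximum-softmax-probability baseline. A minimal sketch:

```python
import torch

def max_logit_score(logits):
    """Anomaly score from the raw maximum logit (lower = more anomalous),
    skipping the softmax that the usual MSP baseline applies."""
    return logits.max(dim=-1).values

def msp_score(logits):
    """Standard maximum-softmax-probability baseline, for comparison."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values
```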
Gender biases are known to exist within large-scale visual datasets and can be reflected or even amplified in downstream models. Many prior works have sought to mitigate gender bias, often by attempting to remove gender expression information from images. To understand the feasibility and practicality of these approaches, we investigate the $\textit{gender artifacts}$ present in large-scale visual datasets. We define a $\textit{gender artifact}$ as a visual cue that is correlated with gender, focusing specifically on those cues that are learnable by a modern image classifier and have an interpretable human corollary. Through our analyses, we find that gender artifacts are ubiquitous in the COCO and OpenImages datasets, occurring everywhere from low-level information (e.g., the mean value of the color channels) to the higher-level composition of the image (e.g., pose and location of people). Given the prevalence of gender artifacts, we claim that attempts to remove them from such datasets are largely infeasible. Instead, the responsibility lies with researchers and practitioners to be aware that the distribution of images within datasets is highly gendered, and hence to develop methods that are robust to these distributional shifts across groups.
We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger and more diverse datasets, which in multiple cases increases robustness, but is still far from closing the performance gaps. Our results indicate that distribution shifts arising in real data are currently an open research problem. We provide our testbed and data as a resource for future work at https://modestyachts.github.io/imagenet-testbed/.
Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
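The method is a weight-space average, with no change to architecture or inference. A minimal sketch over PyTorch state dicts (a full implementation would special-case integer buffers such as BatchNorm counters):

```python
def wise_ft(zero_shot_state, fine_tuned_state, alpha=0.5):
    """Elementwise interpolation of two state dicts: alpha=0 recovers the
    zero-shot model, alpha=1 recovers standard fine-tuning."""
    return {k: (1 - alpha) * zero_shot_state[k] + alpha * fine_tuned_state[k]
            for k in zero_shot_state}

# merged = wise_ft(zero_shot.state_dict(), fine_tuned.state_dict())
# model.load_state_dict(merged)
```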
Saliency methods are a popular class of feature attribution explanation methods that aim to capture a model's predictive reasoning by identifying "important" pixels in an input image. However, the development and adoption of these methods are hindered by the lack of access to ground-truth model reasoning, which prevents accurate evaluation. In this work, we design a synthetic benchmarking framework, SMERF, that allows us to perform ground-truth-based evaluation while controlling the complexity of the model's reasoning. Experimentally, SMERF reveals significant limitations of existing saliency methods and thus represents a useful tool for developing new saliency methods.
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes of the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.
Image classification with small datasets has been an active research area in the recent past. However, as research in this scope is still in its infancy, two key ingredients are missing for ensuring reliable and truthful progress: a systematic and extensive overview of the state of the art, and a common benchmark to allow for objective comparisons between published methods. This article addresses both issues. First, we systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered. Second, we propose a common benchmark that allows for an objective comparison of approaches. It consists of five datasets spanning various domains (e.g., natural images, medical imagery, satellite data) and data types (RGB, grayscale, multispectral). We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues. Surprisingly, we find that thorough hyper-parameter tuning on held-out validation data results in a highly competitive baseline and highlights a stunted growth of performance over the years. Indeed, only a single specialized method dating back to 2019 clearly wins our benchmark and outperforms the baseline classifier.
Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ∼2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%-15% on CIFAR-10 and 11%-14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
Despite being responsible for state-of-the-art results in several computer vision and natural language processing tasks, neural networks have faced harsh criticism due to some of their current shortcomings. One of them is that neural networks are correlation machines prone to model biases within the data instead of focusing on actual useful causal relationships. This problem is particularly serious in application domains affected by aspects such as race, gender, and age. To prevent models from making unfair decisions, the AI community has concentrated efforts on correcting algorithmic biases, giving rise to the research area now widely known as fairness in AI. In this survey paper, we provide an in-depth overview of the main debiasing methods for fairness-aware neural networks in the context of vision and language research. We propose a novel taxonomy to better organize the literature on debiasing methods for fairness, and we discuss the current challenges, trends, and important future work directions for the interested researcher and practitioner.
Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention, which can make them labor-intensive and dataset-specific. To address these shortcomings, we propose a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns, which in turn induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset, and to intervene to improve the model's performance on these subpopulations. Code is available at https://github.com/madrylab/failure-directions
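The distillation step amounts to fitting a linear model in the network's feature space. A minimal sketch with scikit-learn, where the regularization strength is illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def failure_direction(features, is_correct):
    """Fit a linear classifier separating a model's correct from incorrect
    examples in feature space; its normalized weight vector represents
    the failure mode as a direction."""
    svm = LinearSVC(C=0.1).fit(features, is_correct.astype(int))
    direction = svm.coef_[0] / np.linalg.norm(svm.coef_[0])
    scores = features @ direction  # lowest-scoring examples form the hard slice
    return direction, scores
```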
While neural networks have shown remarkable success on classification tasks in terms of average-case performance, they often fail to perform well on certain groups of the data. Such group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worst-group performance even when group labels are unavailable for the training data. However, these methods generally underperform methods that use group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose a simple two-step framework that leverages this partial group information to improve worst-group performance: train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of worst-group performance, showing how the generalization error scales with respect to both the total number of training points and the number of training points with group labels. Empirically, our method outperforms baselines that do not use group information, even when only 1-33% of points have group labels. We provide ablation studies to support the robustness and extensibility of our framework.
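Step one of the framework is a plain supervised problem on the group-labeled subset. A minimal sketch with illustrative names; the predicted labels would then feed a robust objective such as the worst-group loss sketched earlier:

```python
import torch
import torch.nn.functional as F

def impute_group_labels(group_model, x_labeled, g_labeled, x_unlabeled,
                        epochs=5, lr=1e-3):
    """Step 1: fit a group classifier on the few group-labeled points,
    then predict group labels for the rest. Step 2 (not shown) plugs the
    predictions into a robust optimization objective such as group DRO."""
    opt = torch.optim.Adam(group_model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(group_model(x_labeled), g_labeled)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return group_model(x_unlabeled).argmax(dim=1)
```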