Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high-scoring convolutional neural networks (CNNs) on popular benchmarks exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say the classifier has overinterpreted its input, finding too much class evidence in patterns that appear nonsensical to humans. Here, we demonstrate that neural networks trained on CIFAR-10 and ImageNet suffer from overinterpretation, and we find that models on CIFAR-10 make confident predictions even when 95% of the input image is masked and humans cannot discern salient features in the remaining pixel subsets. We introduce Batched Gradient SIS, a new method for discovering sufficient input subsets for complex datasets, and use this method to show the sufficiency of border pixels in ImageNet for training and testing. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the benchmark that alone suffice to attain high test accuracy. Unlike adversarial examples, overinterpretation relies upon unmodified image pixels. We find that ensembling and input dropout can help mitigate overinterpretation.
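As a rough illustration of the kind of probe involved (a minimal sketch, not the authors' Batched Gradient SIS procedure), the snippet below keeps only the most salient 5% of pixels of an image, as ranked by input gradients, and re-checks the model's confidence; a confident prediction on such a sparse, human-uninterpretable subset is the signature of overinterpretation. The model interface and tensor shapes are assumptions.

```python
import torch

def confidence_on_sparse_subset(model, image, label, keep_frac=0.05):
    """Crude overinterpretation probe: keep only the top `keep_frac` pixels by
    gradient magnitude, zero out the rest, and re-check the model's confidence."""
    image = image.clone().requires_grad_(True)           # image: (C, H, W) tensor
    logits = model(image.unsqueeze(0))
    logits[0, label].backward()
    saliency = image.grad.abs().sum(dim=0)               # (H, W) per-pixel saliency
    k = max(1, int(keep_frac * saliency.numel()))
    thresh = saliency.flatten().topk(k).values.min()
    mask = (saliency >= thresh).float()                   # 1 = keep pixel, 0 = mask out
    masked = image.detach() * mask                        # masked pixels set to 0
    with torch.no_grad():
        probs = model(masked.unsqueeze(0)).softmax(dim=-1)
    return probs[0, label].item()
```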
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%-15% on CIFAR-10 and 11%-14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
Benchmark performance of deep learning classifiers alone is not a reliable predictor for the performance of a deployed model. In particular, if the image classifier has picked up spurious features in the training data, its predictions can fail in unexpected ways. In this paper, we develop a framework that allows us to systematically identify spurious features in large datasets like ImageNet. It is based on our neural PCA components and their visualization. Previous work on spurious features of image classifiers often operates in toy settings or requires costly pixel-wise annotations. In contrast, we validate our results by checking that the presence of the harmful spurious feature of a class is sufficient to trigger the prediction of that class. We introduce a novel dataset "Spurious ImageNet" and check how much existing classifiers rely on spurious features.
We identify label errors in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate at least 3.3% errors on average across the 10 datasets, where, for example, label errors comprise at least 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (51% of the algorithmically flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy; our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower-capacity models may be practically more useful than higher-capacity models on real-world datasets with a high proportion of erroneously labeled data. For example, on ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%; on CIFAR-10 with corrected labels, VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Test set errors across the 10 datasets can be viewed at https://labelerrors.com, and all label errors can be reproduced via https://github.com/cleanlab/label-errors.
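For intuition, here is a simplified sketch of the confident-learning idea referenced above: an example is flagged as a potential label error when the model's predicted probability for some other class exceeds that class's average self-confidence. This is an illustration only, not the cleanlab implementation, and it assumes out-of-sample predicted probabilities are already available.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Simplified confident-learning-style filter. labels: (n,) int array of given
    labels; pred_probs: (n, k) out-of-sample predicted probabilities.
    Assumes every class appears at least once in `labels`."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold t_j: average self-confidence of examples labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(n_classes)])
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Candidate error: confidently above threshold for some other class.
        others = [j for j in range(n_classes) if j != y and p[j] >= thresholds[j]]
        if others:
            issues.append(i)
    return issues
```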
The emerging field of eXplainable Artificial Intelligence (XAI) aims to bring transparency to today's powerful but opaque deep learning models. While local XAI methods explain individual predictions in the form of attribution maps, thereby identifying where important features occur (but not providing information about what they represent), global explanation techniques visualize what concepts a model has generally learned to encode. Both types of methods therefore only provide partial insights and leave the burden of interpreting the model's reasoning to the user. Only a few contemporary techniques aim at combining the principles behind both local and global XAI to obtain more informative explanations. Those methods, however, are often limited to specific model architectures or impose additional requirements on training regimes or on data and label availability, which renders post-hoc application to arbitrary pre-trained models practically impossible. In this work, we introduce the Concept Relevance Propagation (CRP) approach, which combines the local and global perspectives of XAI and thus allows answering both the "where" and "what" questions without additional constraints. We further introduce the principle of Relevance Maximization for finding representative examples of encoded concepts based on their usefulness to the model. We thereby lift the dependency on the common practice of Activation Maximization and its limitations. We demonstrate the capabilities of our methods in various settings, showcasing that Concept Relevance Propagation and Relevance Maximization lead to more human-interpretable explanations and provide deep insights into the model's representations and reasoning through concept atlases, concept composition analyses, and quantitative investigations of concept subspaces and their role in fine-grained decision making.
Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.
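Calibration in such benchmarks is commonly summarized with the Expected Calibration Error (ECE); a minimal sketch of that metric, assuming NumPy arrays of predicted probabilities and integer labels, is below.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and compare
    average confidence with empirical accuracy inside each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece
```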
To improve model generalization, model designers often restrict the features that their models use, either implicitly or explicitly. In this work, we explore the design space of leveraging such feature priors by viewing them as distinct perspectives on the data. Specifically, we find that models trained with diverse sets of feature priors have less overlapping failure modes and can thus be combined more effectively. Moreover, we demonstrate that jointly training such models on additional (unlabeled) data allows them to correct each other's mistakes, which in turn leads to better generalization and resilience to spurious correlations. Code is available at https://github.com/MadryLab/copriors.
We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties, semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less generalizable signal than standard adversarial perturbations; that is, UAPs leverage non-robust features to a smaller extent than standard adversarial perturbations do.
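A minimal sketch of how a targeted UAP can be produced with projected gradient descent is shown below: a single shared perturbation is optimized over many images and repeatedly projected back onto an L-infinity ball. The input size, step sizes, and epoch count are illustrative assumptions, not the exact setup studied.

```python
import torch
import torch.nn.functional as F

def targeted_uap(model, loader, target_class, eps=10/255, step=1/255, epochs=5):
    """Optimize one perturbation `delta`, shared across all images, that pushes
    predictions toward `target_class`, keeping ||delta||_inf <= eps."""
    delta = torch.zeros(3, 224, 224, requires_grad=True)   # assumed input resolution
    for _ in range(epochs):
        for images, _ in loader:                            # labels are ignored (targeted attack)
            targets = torch.full((images.size(0),), target_class)
            loss = F.cross_entropy(model(images + delta), targets)
            loss.backward()
            with torch.no_grad():
                delta -= step * delta.grad.sign()           # targeted: descend the loss
                delta.clamp_(-eps, eps)                     # project onto the L-inf ball
            delta.grad.zero_()
    return delta.detach()
```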
We present a framework for ranking images within their class based on the strength of spurious cues present. By measuring the gap in accuracy on the highest and lowest ranked images (we call this spurious gap), we assess spurious feature reliance for 89 diverse ImageNet models, finding that even the best models underperform in images with weak spurious presence. However, the effect of spurious cues varies far more dramatically across classes, emphasizing the crucial, often overlooked, class-dependence of the spurious correlation problem. While most spurious features we observe are clarifying (i.e. improving test-time accuracy when present, as is typically expected), we surprisingly find many cases of confusing spurious features, where models perform better when they are absent. We then close the spurious gap by training new classification heads on lowly ranked (i.e. without common spurious cues) images, resulting in improved effective robustness to distribution shifts (ObjectNet, ImageNet-R, ImageNet-Sketch). We also propose a second metric to assess feature reliability, finding that spurious features are generally less reliable than non-spurious (core) ones, though again, spurious features can be more reliable for certain classes. To enable our analysis, we annotated 5,000 feature-class dependencies over all of ImageNet as core or spurious using minimal human supervision. Finally, we show the feature discovery and spuriosity ranking framework can be extended to other datasets like CelebA and WaterBirds in a lightweight fashion with only linear layer training, leading to discovering a previously unknown racial bias in the Celeb-A hair classification.
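As an illustration of the spurious gap described above, the sketch below compares a model's accuracy on a class's most and least spurious images according to a precomputed spuriosity ranking; the fraction used and the data layout are assumptions for the example.

```python
import numpy as np

def spurious_gap(spuriosity_rank, correct, frac=0.1):
    """Per-class spurious gap: accuracy on the images ranked highest for
    spurious-cue strength minus accuracy on those ranked lowest.
    spuriosity_rank: image indices of one class sorted from low to high spuriosity.
    correct: boolean array, model correctness for every image index."""
    k = max(1, int(frac * len(spuriosity_rank)))
    low, high = spuriosity_rank[:k], spuriosity_rank[-k:]
    return correct[high].mean() - correct[low].mean()
```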
When developing deep learning models, we usually decide what task we want to solve then search for a model that generalizes well on the task. An intriguing question would be: what if, instead of fixing the task and searching in the model space, we fix the model and search in the task space? Can we find tasks that the model generalizes on? How do they look, or do they indicate anything? These are the questions we address in this paper. We propose a task discovery framework that automatically finds examples of such tasks via optimizing a generalization-based quantity called agreement score. We demonstrate that one set of images can give rise to many tasks on which neural networks generalize well. These tasks are a reflection of the inductive biases of the learning framework and the statistical patterns present in the data, thus they can make a useful tool for analysing the neural networks and their biases. As an example, we show that the discovered tasks can be used to automatically create adversarial train-test splits which make a model fail at test time, without changing the pixels or labels, but by only selecting how the datapoints should be split between the train and test sets. We end with a discussion on human-interpretability of the discovered tasks.
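A minimal sketch of the agreement score underlying such a task discovery framework: train two networks of the same architecture (with different random seeds) on a candidate labeling of the images and measure how often their predictions agree on held-out data. The `make_model` and `train_fn` helpers here are placeholders for the reader's own pipeline, not functions from the paper.

```python
import torch

def agreement_score(make_model, train_fn, task_trainset, test_images):
    """Train two independently initialized copies of the same architecture on a
    candidate task and return the fraction of held-out images on which their
    predicted labels agree (higher = the task generalizes more consistently)."""
    preds = []
    for seed in (0, 1):
        torch.manual_seed(seed)                        # independent initialization
        model = train_fn(make_model(), task_trainset)  # user-supplied training loop
        with torch.no_grad():
            preds.append(model(test_images).argmax(dim=-1))
    return (preds[0] == preds[1]).float().mean().item()
```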
We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots remaining in ImageNet. ImageNet-R was created specifically in pursuit of robust representations, as the representations are no longer simply natural but include artistic and other renditions. The CAOS suite is built on the CARLA simulator, allows the inclusion of anomalous objects, and enables the creation of reproducible synthetic environments and scenes for testing robustness. All of the datasets were created for testing robustness and measuring progress in robustness. They have been used in various other works to measure their own progress on robustness and to allow for tangential progress that does not focus solely on natural accuracy. Given these datasets, we created several new methods that aim to advance robustness research. We build simple baselines in the form of MaxLogit and typicality scores, and create a new data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. MaxLogit considers the logit values rather than the values after the softmax operation; this small change yields noticeable improvements. The typicality score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline except on the segmentation task, and we speculate that, at the pixel level, the semantic information of a pixel is less meaningful than class-level information. Finally, the new DeepAugment augmentation technique leverages neural networks to create augmentations of images that are radically different from the traditional geometric and camera transformations used previously.
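The MaxLogit baseline mentioned above amounts to scoring inputs by their largest raw logit instead of their largest softmax probability; a minimal sketch of the two scores side by side (the function name and batching are assumptions):

```python
import torch

def ood_scores(logits):
    """Two simple anomaly-detection scores for a batch of logits: MSP uses the
    maximum softmax probability, MaxLogit keeps the raw maximum logit value."""
    msp = logits.softmax(dim=-1).max(dim=-1).values
    max_logit = logits.max(dim=-1).values
    return msp, max_logit   # higher score = more in-distribution under both
```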
Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the k highest-ranked pixels of the original input and replacing the rest with "uninformative" pixels, and checking if the net's output is mostly unchanged. This is usually seen as an explanation of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by the logic concepts of completeness & soundness, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an intrinsic framework, i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
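A minimal sketch of the masked-input evaluation described above (a completeness-style check): retain the k highest-ranked pixels of a saliency heat map, fill the rest with an "uninformative" value, and test whether the prediction is unchanged. The fill value and pixel fraction are assumptions.

```python
import torch

def masked_input_check(model, image, heatmap, k_frac=0.1, fill=0.0):
    """Keep the top-ranked pixels of `heatmap`, replace the rest with `fill`,
    and compare the net's prediction on the masked and original inputs.
    image: (C, H, W); heatmap: (H, W)."""
    k = max(1, int(k_frac * heatmap.numel()))
    thresh = heatmap.flatten().topk(k).values.min()
    mask = (heatmap >= thresh).float()                 # 1 = keep, 0 = replace
    masked = image * mask + fill * (1 - mask)
    with torch.no_grad():
        orig_pred = model(image.unsqueeze(0)).argmax(dim=-1)
        new_pred = model(masked.unsqueeze(0)).argmax(dim=-1)
    return bool((orig_pred == new_pred).item())
```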
Missingness, or the absence of input features, is a concept fundamental to many model debugging tools. However, in computer vision, pixels cannot simply be removed from an image. One thus tends to resort to heuristics such as blacking out pixels, which may in turn introduce bias into the debugging process. We study such biases and, in particular, show how transformer-based architectures enable a more natural implementation of missingness, which sidesteps these issues and improves the reliability of model debugging in practice. Our code is available at https://github.com/madrylab/missingness.
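To illustrate why transformers make missingness natural to implement, the sketch below drops the tokens of "missing" patches from a ViT-style input sequence instead of filling their pixels with a heuristic value. The encoder interface is an assumption for this example, not the repository's API.

```python
import torch

def forward_with_missing_patches(encoder, cls_token, patch_tokens, missing):
    """Drop missing patch tokens from the sequence; self-attention accepts the
    shorter sequence, so no fill pixels are needed.
    cls_token: (1, dim); patch_tokens: (num_patches, dim); missing: (num_patches,) bool.
    `encoder` is assumed to map a (1, seq_len, dim) sequence to embeddings."""
    kept = patch_tokens[~missing]                   # remove missing patch tokens entirely
    seq = torch.cat([cls_token, kept], dim=0)       # prepend the class token
    return encoder(seq.unsqueeze(0))                # (1, 1 + num_kept, dim) input
```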
We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger and more diverse datasets, which in multiple cases increases robustness, but is still far from closing the performance gaps. Our results indicate that distribution shifts arising in real data are currently an open research problem. We provide our testbed and data as a resource for future work at https://modestyachts.github.io/imagenet-testbed/.
Recent progress in self-supervised learning has demonstrated promising results across multiple vision tasks. An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, ignoring the semantic relevance of parts of an image, e.g. a subject vs. a background, which can lead to learning spurious correlations. Our work addresses this problem by investigating a class of simple but highly effective "background augmentations", which encourage models to focus on semantically relevant content by discouraging them from focusing on image backgrounds. Through a systematic investigation, we show that background augmentations lead to substantial improvements in performance across a range of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, with roughly +1-2% gains on ImageNet, enabling performance on par with the supervised baseline. Further, we find that the improvements in limited-label settings are even larger (up to 4.2%). Background augmentations also improve robustness to a number of distribution shifts, including natural adversarial examples, ImageNet-9, adversarial attacks, and ImageNet-Renditions. We also make progress towards completely unsupervised saliency detection in the process of generating the saliency masks used for background augmentation.
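One simple way to realize a background augmentation of the kind described (a hypothetical variant for illustration, not necessarily the paper's exact pipeline) is to paste the salient foreground onto a different background using a saliency mask, for example one produced by an unsupervised saliency detector:

```python
import torch

def background_swap(image, saliency_mask, background):
    """Keep the salient foreground of `image` and paste it onto `background`,
    so the self-supervised objective cannot rely on background cues.
    image, background: (C, H, W); saliency_mask: (H, W) with values in {0, 1}."""
    return image * saliency_mask + background * (1 - saliency_mask)
```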
Differential Privacy (DP) provides a formal privacy guarantee that prevents adversaries with access to a machine learning model from extracting information about individual training points. The most popular DP training method, Differentially Private Stochastic Gradient Descent (DP-SGD), realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA on CIFAR-10 without extra data of 81.4% under (8, 10^-5)-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve 83.8% top-1 accuracy on ImageNet under (0.5, 8·10^-7)-DP. Additionally, we achieve 86.7% top-1 accuracy under (8, 8·10^-7)-DP, which is just 4.3% below the current non-private SOTA. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
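For reference, here is a minimal (non-vectorized) sketch of the DP-SGD mechanism described above: per-example gradients are clipped, summed, and perturbed with Gaussian noise before the parameter update. Real implementations vectorize the per-example pass and track the privacy budget; the hyperparameters below are placeholders.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each per-example gradient to `clip_norm`, sum,
    add Gaussian noise with std `noise_mult * clip_norm`, then update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    xs, ys = batch
    for x, y in zip(xs, ys):                                   # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        scale = torch.clamp(clip_norm / (norm + 1e-6), max=1.0)  # clip to clip_norm
        for g, p in zip(summed, params):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(summed, params):
            noise = torch.randn_like(g) * noise_mult * clip_norm
            p -= lr * (g + noise) / len(xs)                    # noisy averaged update
```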
Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively). In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from ImageNet do not transfer well to fine-grained tasks. Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that this richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, given the same level of robustness, the clean accuracy of our part models is up to 15 percentage points higher. Our experiments also indicate that these models reduce texture bias and yield better robustness against common corruptions and spurious correlations. The code is publicly available at https://github.com/chawins/adv-part-model.
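A minimal sketch of the part-based design described above: a segmentation stage produces per-pixel part predictions, which a tiny head then classifies, with both stages trained end-to-end (classification loss plus a segmentation loss on the part annotations). The specific layers below are placeholders, not the paper's architecture.

```python
import torch.nn as nn

class PartBasedClassifier(nn.Module):
    """Segment-then-classify sketch: per-pixel part masks feed a tiny classifier."""

    def __init__(self, num_parts=10, num_classes=100):
        super().__init__()
        self.segmenter = nn.Conv2d(3, num_parts, kernel_size=3, padding=1)  # stand-in segmenter
        self.classifier = nn.Linear(num_parts, num_classes)                 # tiny classifier head

    def forward(self, x):                                      # x: (B, 3, H, W)
        part_logits = self.segmenter(x)                        # (B, num_parts, H, W)
        pooled = part_logits.softmax(dim=1).mean(dim=(2, 3))   # per-part area features
        return self.classifier(pooled), part_logits            # class logits + part maps
```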
Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We identify difficulties in optimization as a key reason for why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher, and that, paradoxically, matching the teacher more closely does not always lead to better student generalization.
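For context, the standard (Hinton-style) distillation objective that such studies build on mixes a temperature-softened KL term against the teacher with the usual cross-entropy on hard labels; a minimal sketch, with temperature and mixing weight chosen arbitrarily:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """KL divergence between temperature-softened teacher and student
    distributions, combined with cross-entropy on the ground-truth labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```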