We introduce two challenging datasets that reliably cause machine learning model performance to substantially degrade. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. Our datasets' real-world, unmodified examples transfer to various unseen models reliably, demonstrating that computer vision models have shared weaknesses. The first dataset is called IMAGENET-A and is like the ImageNet test set, but it is far more challenging for existing models. We also curate an adversarial out-of-distribution detection dataset called IMAGENET-O, which is the first out-of-distribution detection dataset created for ImageNet models. On IMAGENET-A a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%, and its out-of-distribution detection performance on IMAGENET-O is near random chance levels. We find that existing data augmentation techniques hardly boost performance, and using other public training datasets provides limited improvements. However, we find that improvements to computer vision architectures provide a promising path towards robust models.
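The adversarial filtration idea is simple enough to sketch. Below is a minimal, hypothetical PyTorch illustration of the core filtering step; function and variable names are ours, and the paper's full pipeline also involves per-class curation and human review.

```python
import torch

@torch.no_grad()
def adversarial_filtration(images, labels, model):
    """Keep only the natural, unmodified examples that a fixed,
    pretrained classifier misclassifies, so the retained examples
    are hard by construction and carry fewer spurious cues."""
    predictions = model(images).argmax(dim=-1)
    hard = predictions != labels
    return images[hard], labels[hard]
```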
We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots that remain in ImageNet. ImageNet-R was specifically created in pursuit of robust representations, since its examples are no longer simply natural images but include artistic and other renditions. The CAOS suite, built on the CARLA simulator, allows the inclusion of anomalous objects and the creation of reproducible synthetic environments and scenes for testing robustness. All of the datasets were created for testing robustness and measuring progress toward robust models. They have been used in various other works to measure progress in robustness and to allow for tangential progress that does not focus solely on natural accuracy. Given these datasets, we created several new methods aimed at advancing robustness research. We build simple baselines in the form of the maximum logit and of typicality, and we create a new data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. The maximum logit considers the logit values rather than the values after the softmax operation, a small change that yields noticeable improvements. Typicality compares the output distribution to the posterior distribution of a class. We show that this improves performance over the baseline on all tasks except segmentation, conjecturing that at the pixel level the semantic information of an individual pixel is less meaningful than class-level information. Finally, the new augmentation technique DeepAugment uses neural networks to create augmentations of images that are radically different from the traditional geometric and camera transformations used previously.
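The maximum-logit baseline lends itself to a compact sketch. The minimal PyTorch example below (function names are ours) contrasts the common maximum-softmax-probability score with the MaxLogit score; in both cases, a lower score flags a likely anomaly.

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability: the classic OOD baseline."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def max_logit_score(logits: torch.Tensor) -> torch.Tensor:
    """MaxLogit: skip the softmax and keep the raw top logit.

    Softmax normalization discards the overall magnitude of the
    logits, which is informative for anomaly detection; keeping the
    unnormalized maximum tends to separate in- and out-of-distribution
    inputs better, especially with many classes."""
    return logits.max(dim=-1).values

# Usage sketch: inputs scoring below a threshold chosen on held-out
# in-distribution data are treated as anomalous.
logits = torch.randn(4, 1000)  # stand-in for model(x)
print(msp_score(logits), max_logit_score(logits))
```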
In real-world machine learning applications, reliable and safe systems must consider measures of performance beyond standard test-set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance toward these goals is often a balancing act that today's methods cannot manage without sacrificing performance on other safety axes. For instance, adversarial training improves adversarial robustness but sharply degrades other classifier performance metrics. Similarly, strong data augmentation and regularization techniques often improve OOD robustness but harm anomaly detection, raising the question of whether a Pareto improvement on all existing safety measures is possible. To meet this challenge, we design a new data augmentation strategy utilizing the natural structural complexity of pictures such as fractals, which outperforms numerous baselines, is near Pareto-optimal, and roundly improves safety measures.
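The abstract describes the augmentation only at a high level. The sketch below illustrates the general idea under our own assumptions: repeatedly blending a training image with a structurally complex picture such as a fractal, alternating additive and multiplicative mixing. The exact recipe, mixing set, and constants are not specified here and are ours.

```python
import random
import torch

def mix_with_complex_image(image, mixers, rounds=3, beta=3.0):
    """Hedged sketch of mixing a training image with structurally
    complex pictures (e.g., fractals). Names and constants are ours;
    `image` and each element of `mixers` are float tensors in [0, 1].
    """
    mixed = image
    for _ in range(random.randint(1, rounds)):
        mixer = random.choice(mixers)  # a fractal-like picture
        w = torch.distributions.Beta(beta, beta).sample()
        if random.random() < 0.5:
            mixed = (1 - w) * mixed + w * mixer    # additive blend
        else:
            mixed = mixed ** (1 - w) * mixer ** w  # multiplicative blend
        mixed = mixed.clamp(0.0, 1.0)
    return mixed
```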
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
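For a multiclass classifier, the OE objective can be written as the usual cross-entropy on in-distribution data plus a term pushing predictions on auxiliary outliers toward the uniform distribution. A minimal PyTorch sketch follows; the names and the default weighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_in: torch.Tensor,
                          labels_in: torch.Tensor,
                          logits_out: torch.Tensor,
                          lam: float = 0.5) -> torch.Tensor:
    """Sketch of the OE objective for a multiclass classifier.

    Standard cross-entropy on in-distribution data, plus the
    cross-entropy between the uniform distribution and the softmax
    output on auxiliary outliers, which pushes the model toward
    maximal uncertainty on outlier inputs. The weight lam = 0.5 is
    an assumption, not a prescribed constant."""
    ce_in = F.cross_entropy(logits_in, labels_in)
    # Cross-entropy to the uniform distribution over k classes
    # equals the negative mean of the log-softmax values.
    ce_out = -F.log_softmax(logits_out, dim=-1).mean(dim=-1).mean()
    return ce_in + lam * ce_out
```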
We introduce four new real-world distribution shift datasets consisting of changes in image style, image blurriness, geographic location, camera operation, and more. With our new datasets, we take stock of previously proposed methods for improving out-of-distribution robustness and put them to the test. We find that using larger models and artificial data augmentations can improve robustness on real-world distribution shifts, contrary to claims in prior work. We find improvements in artificial robustness benchmarks can transfer to real-world distribution shifts, contrary to claims in prior work. Motivated by our observation that data augmentations can help with real-world distribution shifts, we also introduce a new data augmentation method which advances the state-of-the-art and outperforms models pretrained with 1000× more labeled data. Overall we find that some methods consistently help with distribution shifts in texture and local image statistics, but these methods do not help with some other distribution shifts like geographic changes. Our results show that future research must study multiple distribution shifts simultaneously, as we demonstrate that no evaluated method consistently improves robustness.
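The new augmentation method is only named elsewhere in this collection (DeepAugment), where it is characterized as passing images through neural image-to-image networks whose transformations differ radically from geometric and camera operations. The sketch below illustrates that idea under our own assumptions; `img2img_net` is a placeholder for any pretrained image-to-image model, and the released method also perturbs activations and uses several backbones.

```python
import copy
import torch

@torch.no_grad()
def deepaugment_like(image, img2img_net, noise_scale=0.1):
    """Hedged sketch: run an image through a copy of an image-to-image
    network whose weights have been randomly perturbed, producing
    distortions unlike standard geometric/photometric augmentations."""
    net = copy.deepcopy(img2img_net)
    for p in net.parameters():
        p.add_(noise_scale * torch.randn_like(p) * p.abs())
    return net(image.unsqueeze(0)).squeeze(0).clamp(0.0, 1.0)
```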
Self-supervision provides effective representations for downstream tasks without requiring labels. However, existing approaches lag behind fully supervised training and are often not thought beneficial beyond obviating or reducing the need for annotations. We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions. Additionally, self-supervision greatly benefits out-of-distribution detection on difficult, near-distribution outliers, so much so that it exceeds the performance of fully supervised methods. These results demonstrate the promise of self-supervision for improving robustness and uncertainty estimation and establish these tasks as new axes of evaluation for future self-supervised learning research.
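The abstract does not name the self-supervised tasks used. One common auxiliary task in this line of work is predicting which of four rotations was applied to the input, trained alongside the main objective; below is a minimal sketch under that assumption, with `features_fn` and `rot_head` as our placeholder names for the shared backbone and a small rotation-prediction head.

```python
import torch
import torch.nn.functional as F

def rotation_batch(images: torch.Tensor):
    """Rotate each image by 0/90/180/270 degrees; label = rotation index."""
    rotated = torch.cat([torch.rot90(images, k, dims=(-2, -1))
                         for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels.to(images.device)

def auxiliary_rotation_loss(rot_head, features_fn, images):
    """Self-supervised auxiliary loss added to the usual
    classification loss during training."""
    rotated, labels = rotation_batch(images)
    logits = rot_head(features_fn(rotated))
    return F.cross_entropy(logits, labels)
```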
He et al. (2018) have called into question the utility of pre-training by showing that training from scratch can often yield similar performance to pre-training. We show that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates. Through extensive experiments on adversarial examples, label corruption, class imbalance, out-of-distribution detection, and confidence calibration, we demonstrate large gains from pre-training and complementary effects with task-specific methods. We introduce adversarial pre-training and show approximately a 10% absolute improvement over the previous state-of-the-art in adversarial robustness. In some cases, using pre-training without task-specific methods also surpasses the state-of-the-art, highlighting the need for pre-training when evaluating future methods on robustness and uncertainty tasks.
Modern deep neural network models are known to erroneously classify out-of-distribution (OOD) test data as one of the in-distribution (ID) training classes with high confidence, which can have catastrophic consequences for safety-critical applications. A popular mitigation strategy is to train a separate classifier that detects such OOD samples at test time. In most practical settings, OOD examples are unknown at train time, so a key question is: how do we augment the ID data with synthetic OOD samples to train such an OOD detector? In this paper, we propose a novel compound corruption technique for OOD data augmentation, called CnC. One of the major advantages of CnC is that it does not require any hold-out data apart from the training set. Furthermore, unlike current state-of-the-art (SOTA) techniques, CnC requires no backpropagation or ensembling at test time, making our method much faster at inference. Our extensive comparison with 20 methods from major conferences of the past 4 years shows that a model trained with CnC-based data augmentation outperforms the SOTA in both OOD detection accuracy and inference time. We include a detailed post-hoc analysis investigating the reasons for our method's success, and identify the higher relative entropy and diversity of CnC samples as probable causes. We also provide theoretical insights via a piece-wise decomposition analysis on a two-dimensional dataset, revealing (both visually and quantitatively) that our method yields tighter boundaries around the ID classes, leading to better detection of OOD samples. Source code: https://github.com/cnc-ood
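The abstract does not spell out the compound corruption recipe. As a purely illustrative reading, one can compose several image corruptions and use the results as synthetic OOD training samples; the sketch below shows only that generic idea under our assumptions, not CnC's actual procedure.

```python
import random

def compound_corrupt(image, corruptions, k: int = 2):
    """Hedged sketch of compound-corruption augmentation: apply a
    random composition of k distinct corruption functions to a
    training image to synthesize an OOD-like sample. `corruptions`
    is a list of image -> image functions (e.g., blur, noise,
    pixelation); the detector is then trained to flag such samples.
    """
    for fn in random.sample(corruptions, k):
        image = fn(image)
    return image
```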
Machine learning models often encounter samples that differ from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assigning that sample a class label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains have tackled the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently; accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they tend to focus on a specific domain without examining the relationships between domains. This survey aims to provide a cross-domain, comprehensive review of numerous eminent works in the respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and can develop future methodologies synergistically. Furthermore, to the best of our knowledge, while there are surveys on anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which this survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.
In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, IMAGENET-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called IMAGENET-P which enables researchers to benchmark a classifier's robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations, not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize.
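IMAGENET-C's headline metric, the mean corruption error (mCE), averages a model's top-1 errors over five severities per corruption type, normalized by AlexNet's errors so that corruption types of different difficulty are comparable. A small self-contained sketch with hypothetical numbers:

```python
def mean_corruption_error(err, alexnet_err):
    """mCE sketch: err[c][s] is a model's top-1 error on corruption
    type c at severity s (1-5). Each corruption's summed error is
    normalized by AlexNet's summed error, then averaged over types."""
    ces = []
    for c in err:
        num = sum(err[c][s] for s in range(1, 6))
        den = sum(alexnet_err[c][s] for s in range(1, 6))
        ces.append(num / den)
    return sum(ces) / len(ces)

# Hypothetical usage with two corruption types and made-up errors:
err = {"gaussian_noise": {s: 0.45 + 0.05 * s for s in range(1, 6)},
       "motion_blur":    {s: 0.40 + 0.06 * s for s in range(1, 6)}}
alex = {"gaussian_noise": {s: 0.70 + 0.04 * s for s in range(1, 6)},
        "motion_blur":    {s: 0.65 + 0.05 * s for s in range(1, 6)}}
print(mean_corruption_error(err, alex))
```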
Enhancing the robustness of vision algorithms in real-world scenarios has proven to be very challenging. One reason is that existing robustness benchmarks are limited: they either rely on synthetic data or simply reduce robustness to generalization between datasets, thereby ignoring the effects of individual nuisance factors. In this work, we introduce ROBIN, a benchmark dataset for diagnosing the robustness of vision algorithms to individual nuisances in real-world images. ROBIN builds on 10 rigid categories from the Pascal VOC 2012 and ImageNet datasets and includes out-of-distribution examples of objects' 3D pose, shape, texture, background, and weather conditions. ROBIN is richly annotated to enable benchmarking of models for image classification, object detection, and 3D pose estimation. We provide results for a number of popular baselines and make several interesting observations: 1. Some nuisance factors have a much stronger negative effect on performance than others; moreover, the negative effect of an OOD nuisance depends on the downstream vision task. 2. Current approaches for improving robustness that exploit strong data augmentation have only marginal effects in real-world OOD scenarios, and can sometimes even degrade performance. 3. We do not observe any significant difference in robustness between convolutional and transformer architectures. We believe our dataset provides a rich testbed to study the robustness of vision algorithms and can help to significantly advance prospective research in this area.
We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger and more diverse datasets, which in multiple cases increases robustness, but is still far from closing the performance gaps. Our results indicate that distribution shifts arising in real data are currently an open research problem. We provide our testbed and data as a resource for future work at https://modestyachts.github.io/imagenet-testbed/.
Modern deep neural networks can achieve high accuracy when the training distribution and test distribution are identically distributed, but this assumption is frequently violated in practice. When the train and test distributions are mismatched, accuracy can plummet. Currently there are few techniques that improve robustness to unforeseen data shifts encountered during deployment. In this work, we propose a technique to improve the robustness and uncertainty estimates of image classifiers. We propose AUGMIX, a data processing technique that is simple to implement, adds limited computational overhead, and helps models withstand unforeseen corruptions. AUGMIX significantly improves robustness and uncertainty measures on challenging image classification benchmarks, closing the gap between previous methods and the best possible performance in some cases by more than half.
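AUGMIX mixes several randomly composed augmentation chains with Dirichlet weights, blends the mixture with the original image using a Beta-distributed weight, and trains with a Jensen-Shannon consistency loss across the clean image and two augmented views. A condensed PyTorch sketch, with variable names and defaults that are ours:

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def augmix(image, augment_ops, width=3, depth=3, alpha=1.0):
    """Mix `width` randomly composed augmentation chains with
    Dirichlet weights, then blend with the original image using a
    Beta-distributed weight. `augment_ops` is a list of functions
    mapping an image tensor to an augmented image tensor."""
    ws = np.random.dirichlet([alpha] * width)
    m = float(np.random.beta(alpha, alpha))
    mixture = torch.zeros_like(image)
    for w in ws:
        aug = image
        for _ in range(random.randint(1, depth)):
            aug = random.choice(augment_ops)(aug)
        mixture = mixture + float(w) * aug
    return m * image + (1 - m) * mixture

def jensen_shannon_consistency(logits_clean, logits_aug1, logits_aug2):
    """JS-divergence consistency term across clean and two AugMix views."""
    p = torch.stack([F.softmax(l, dim=-1)
                     for l in (logits_clean, logits_aug1, logits_aug2)])
    m = p.mean(dim=0).clamp(1e-7, 1.0).log()
    # F.kl_div(log_m, p_i) computes KL(p_i || m); average over views.
    return sum(F.kl_div(m, pi, reduction="batchmean") for pi in p) / 3.0
```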
In image classification, there have been many developments in detecting out-of-distribution (OOD) data. However, most OOD detection methods are evaluated on a standard set of datasets that are arbitrarily different from the training data, and there is no clear definition of what constitutes a "good" OOD dataset. Moreover, state-of-the-art OOD detection methods already achieve near-perfect results on these standard benchmarks. In this paper, we define two categories of OOD data using the subtle notions of perceptual/visual and semantic similarity to in-distribution (ID) data: near-OOD samples are perceptually similar but semantically different from ID samples, while shifted samples are visually different but semantically akin to ID data. We then propose a GAN-based framework for generating OOD samples from each of these two categories, given an ID dataset. Through extensive experiments on MNIST, CIFAR-10/100 and ImageNet, we show that state-of-the-art OOD detection methods that perform well on conventional benchmarks are significantly less robust on our proposed benchmarks, and vice versa, suggesting that a separate OOD set may not even be needed to reliably evaluate OOD detection performance.
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we want the driving system to issue an alert and hand control over to the human driver when it detects unusual scenes or objects that it has never seen during training and for which it cannot make a safe decision. The term "OOD detection" first emerged in 2017 and has since received increasing attention from the research community, leading to a plethora of methods, ranging from classification-based to density-based to distance-based approaches. Meanwhile, several other problems, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD), are closely related to OOD detection in terms of motivation and methodology. Despite their common goals, these topics have developed in isolation, and their subtle differences in definition and problem setting often confuse readers and practitioners. In this survey, we first present a unified framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks, and are easier to distinguish. We then review each of these five areas by summarizing their recent technical developments, with a special focus on OOD detection methodologies. We conclude this survey with open challenges and potential research directions.
Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of OOD examples and how best to detect them. We categorize these examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weakness of current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
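For text, the density-estimation approach scores an input by its likelihood under a language model trained on in-distribution data, while the calibration approach uses classifier confidence. A minimal sketch of the former follows; the `lm` interface (token ids in, next-token logits out) is our assumption.

```python
import torch

@torch.no_grad()
def lm_ood_score(lm, token_ids: torch.Tensor) -> float:
    """Average per-token log-likelihood under an in-distribution
    language model; lower scores flag likely OOD inputs. `lm` maps
    a token-id tensor to next-token logits of shape [seq, vocab]."""
    logits = lm(token_ids[:-1])
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, token_ids[1:].unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()
```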
Deep neural networks have achieved outstanding performance across various tasks, but they have a critical issue: overconfident predictions even for completely unknown samples. Many studies have been proposed to successfully filter out such unknown samples, but they considered only narrow and specific tasks, referred to as misclassification detection, open set recognition, or out-of-distribution detection. In this work, we argue that these tasks should be treated as fundamentally the same problem, because an ideal model should possess detection capability for all of them. We therefore introduce the unknown detection task, an integration of these previously separate tasks, for a rigorous examination of the detection capability of deep neural networks on a broad spectrum of unknown samples. To this end, we construct unified benchmark datasets at different scales and compare the unknown detection capabilities of existing popular methods. We find that deep ensembles consistently outperform the other methods in detecting unknowns; however, every method succeeds only against specific types of unknowns. Reproducible code and benchmark datasets are available at https://github.com/daintlab/unknown-detection-benchmarks.
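The deep-ensemble detector that performed best in this benchmark can be sketched in a few lines: average the softmax outputs of several independently trained models and use the maximum averaged probability as the confidence score. Names below are ours.

```python
import torch

@torch.no_grad()
def ensemble_confidence(models, x: torch.Tensor) -> torch.Tensor:
    """Average the softmax outputs of independently trained models;
    low maximum averaged probability flags an unknown sample."""
    probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs.max(dim=-1).values
```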
Commonly used AI networks are often highly confident in their predictions, even when the evidence for a given decision is dubious. Investigating a deep learning model's output is pivotal for understanding its decision process and assessing its capabilities and limitations. By analyzing the distributions of raw network output vectors, it can be observed that each class has its own decision boundary and, thus, the same raw output value lends different support to different classes. Inspired by this fact, we have developed a new method for out-of-distribution detection. The method offers an explanatory step beyond simple thresholding of the softmax output, towards understanding and interpretation of the model's learning process and its output. Instead of assigning the class label of the highest logit to each new sample presented to the network, it takes the distributions over all classes into consideration. A probability score interpreter (PSI) is created based on the joint logit values in relation to their respective correct-vs-wrong class distributions. The PSI suggests whether the sample is likely to belong to a specific class, whether the network is unsure, or whether the sample is likely an outlier or of a type unknown to the network. The simple PSI has the benefit of being applicable to already-trained networks. The correct-vs-wrong class distributions for each output node are established by simply running the training examples through the trained network. We demonstrate our OOD detection method on a challenging transmission electron microscopy virus image dataset. We simulate a real-world application in which images of virus types unknown to a trained virus classifier, yet acquired with the same procedures and instruments, constitute the OOD samples.
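The abstract describes the PSI only at a high level. The sketch below approximates it under our own assumptions: fit per-output-node Gaussians to "correct class" vs "wrong class" logit values collected from the training set, then score a test logit by the difference of log-densities. All names and the Gaussian simplification are ours.

```python
import numpy as np

def fit_logit_distributions(logits: np.ndarray, labels: np.ndarray):
    """Per-output-node (mean, std) fits for 'correct class' vs
    'wrong class' logit values, estimated from the training set."""
    k = logits.shape[1]
    stats = []
    for c in range(k):
        correct = logits[labels == c, c]
        wrong = logits[labels != c, c]
        stats.append(((correct.mean(), correct.std() + 1e-8),
                      (wrong.mean(), wrong.std() + 1e-8)))
    return stats

def psi_scores(logit_row, stats):
    """For each class, how much more typical the logit is of the
    'correct' distribution than the 'wrong' one (difference of
    Gaussian log-densities); uniformly low scores suggest an
    outlier or unknown-type input."""
    def logpdf(x, mu, sd):
        return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
    return [logpdf(logit_row[c], *st[0]) - logpdf(logit_row[c], *st[1])
            for c, st in enumerate(stats)]
```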
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
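Zero-shot classification with the released model takes only a few lines using the `clip` package from the linked repository; the image path and class names below are placeholders for illustration.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "dog.jpg" and the class list are hypothetical placeholders.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
classes = ["dog", "cat", "car"]
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarities between the image and each class prompt.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```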
With the surge in popularity of transformers in computer vision, several studies have attempted to determine whether they are more robust to distribution shifts and provide better uncertainty estimates than convolutional neural networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority is to be attributed to the self-attention mechanism. In this paper, we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt) can be as robust and reliable as, and sometimes even more so than, the current state-of-the-art transformers. However, there is no clear winner. Therefore, although it is tempting to proclaim the definitive superiority of one family of architectures over the other, they seem to enjoy similarly extraordinary performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.