Smart Paper Notes

Towards Open Set Deep Networks

Abhijit Bendale , Terrance Boult

Categories:

2015-11-19

Deep networks have produced significant gains for various visual recognition problems, leading to high-impact academic and commercial applications. Recent work in deep networks highlighted that it is easy to generate images that humans would never classify as a particular object class, yet networks classify such images with high confidence as that given class; deep networks are easily fooled with images humans do not consider meaningful. The closed set nature of deep networks forces them to choose from one of the known classes, leading to such artifacts. Recognition in the real world is open set, i.e. the recognition system should reject unknown/unseen classes at test time. We present a methodology to adapt deep networks for open set recognition, by introducing a new model layer, OpenMax, which estimates the probability of an input being from an unknown class. A key element of estimating the unknown probability is adapting Meta-Recognition concepts to the activation patterns in the penultimate layer of the network. OpenMax allows rejection of "fooling" and unrelated open set images presented to the system; OpenMax greatly reduces the number of obvious errors made by a deep network. We prove that the OpenMax concept provides bounded open space risk, thereby formally providing an open set recognition solution. We evaluate the resulting open set deep networks using pre-trained networks from the Caffe Model-zoo on ImageNet 2012 validation data, and thousands of fooling and open set images. The proposed OpenMax model significantly outperforms open set recognition accuracy of basic deep networks as well as deep networks with thresholding of SoftMax probabilities.
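
The abstract above compares OpenMax against a baseline that thresholds SoftMax probabilities. As a rough illustration of that baseline only (not of OpenMax itself), the sketch below rejects an input as "unknown" when the largest SoftMax probability falls under a cutoff. The function names and the 0.9 threshold are assumptions made for this example and do not come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_rejection(logits, threshold=0.9):
    """Return the predicted class index, or None to signal 'unknown'.

    The threshold is a hypothetical value chosen for illustration.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

# One class clearly dominates: the prediction is accepted.
confident = predict_with_rejection([8.0, 1.0, 0.5])
# Scores are nearly tied: the input is rejected as unknown.
ambiguous = predict_with_rejection([2.0, 1.9, 1.8])
```

A "fooling" image that drives a single logit very high would still pass this check, which is the weakness of plain thresholding that the abstract points to.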

translated by Google Translate

Towards Open World Recognition

Abhijit Bendale , Terrance Boult

Categories:

2014-12-18

With the advent of rich classification models and high computational power, visual recognition systems have found many operational applications. Recognition in the real world poses multiple challenges that are not apparent in controlled lab environments. The datasets are dynamic and novel categories must be continuously detected and then added. At prediction time, a trained system has to deal with myriad unseen categories. Operational systems require minimum down time, even to learn. To handle these operational issues, we present the problem of Open World recognition and formally define it. We prove that thresholding sums of monotonically decreasing functions of distances in linearly transformed feature space can balance "open space risk" and empirical risk. Our theory extends existing algorithms for open world recognition. We present a protocol for evaluation of open world recognition systems. We present the Nearest Non-Outlier (NNO) algorithm, which evolves the model efficiently, adding object categories incrementally while detecting outliers and managing open space risk. We perform experiments on the ImageNet dataset with 1.2M+ images to validate the effectiveness of our method on large scale visual recognition tasks. NNO consistently yields superior results on open world recognition.

A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges

Mohammadreza Salehi , Hossein Mirzaei , Dan Hendrycks , Yixuan Li , Mohammad Hossein Rohban , Mohammad Sabokrou

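Categories: Computer Vision | Machine Learning

2021-10-26

Machine learning models often encounter samples that diverge from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assigning that sample to an in-class label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains tackle the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite having similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. Accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they seem to focus only on a specific domain without examining the relationship between different domains. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and develop future methodology synergistically. Furthermore, to the best of our knowledge, while there are surveys on anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which our survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.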

Measuring Human Perception to Improve Open Set Recognition

Jin Huang , Derek Prijatelj , Justin Dulay , Walter Scheirer

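Categories: Computer Vision

2022-09-08

The human ability to recognize when an object is known or novel currently outperforms all open set recognition algorithms. Human perception, as measured by the methods and procedures of visual psychophysics from psychology, can provide an additional data stream for managing novelty in visual recognition tasks in computer vision. For instance, measured reaction time from human subjects can offer insight into whether a known class sample may be confused with a novel one. In this work, we designed and performed a large-scale behavioral experiment that collected over 200,000 human reaction time measurements associated with object recognition. The data collected indicated that reaction time varies meaningfully across objects at the sample level. We therefore designed a new psychophysical loss function that enforces consistency with human behavior in deep networks, which exhibit variable reaction time for different images. As in biological vision, this approach allows us to achieve good open set recognition performance in regimes with limited labeled training data. Through experiments using data from ImageNet, significant improvement is observed when training Multi-Scale DenseNets with this new formulation: models trained with our loss function improved top-1 validation accuracy by 7%, top-1 test accuracy on known samples by 18%, and top-1 test accuracy on unknown samples by 33%. We compared our method to 10 open set recognition methods from the literature, all of which were outperformed on multiple metrics.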

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Anastasios N. Angelopoulos , Stephen Bates

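Categories: Machine Learning | Artificial Intelligence | Machine Learning (Statistics)

2021-07-15

Black-box machine learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy to understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, the false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader with a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.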
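
To make the idea of a confidence set concrete, here is a minimal sketch of split conformal prediction for classification in plain Python rather than PyTorch. It assumes the conformal score is one minus the probability assigned to the true label; the calibration values and the miscoverage level `alpha=0.25` are toy numbers invented for this example (a real calibration set would be far larger and would support the 90% level mentioned above).

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected quantile of the calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """All labels whose score 1 - p(label) does not exceed qhat."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration data (hypothetical): probability the model gave the TRUE label.
cal_true_probs = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
cal_scores = [1.0 - p for p in cal_true_probs]
qhat = conformal_quantile(cal_scores, alpha=0.25)

easy_set = prediction_set([0.97, 0.02, 0.01], qhat)  # one dominant label
hard_set = prediction_set([0.50, 0.45, 0.05], qhat)  # two plausible labels
```

The harder input receives the larger set, which is the adaptivity the abstract describes.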

Classifier Calibration: How to assess and improve predicted class probabilities: a survey

Telmo Silva Filho , Hao Song , Miquel Perello-Nieto , Raul Santos-Rodriguez , Meelis Kull , Peter Flach

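Categories: Machine Learning | Machine Learning (Statistics)

2021-12-20

This paper provides both an introduction to and a detailed overview of the principles and practice of classifier calibration. A well-calibrated classifier correctly quantifies the level of uncertainty or confidence associated with its instance-wise predictions. This is essential for critical applications, optimal decision making, cost-sensitive classification, and for some types of context change. Calibration research has a rich history that predates the birth of machine learning as an academic field by decades. However, a recent increase in interest in calibration has led to new methods and the extension from the binary to the multiclass setting. The space of options and issues to consider is large, and navigating it requires the right set of concepts and tools. We provide both introductory material and up-to-date technical details of the main concepts and methods, including proper scoring rules and other evaluation metrics, visualisation approaches, a comprehensive account of post-hoc calibration methods for binary and multiclass classification, and several advanced topics.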
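
As a small illustration of the proper scoring rules mentioned above, the sketch below computes the multiclass Brier score and the log loss for two toy classifiers that have the same accuracy but different confidence. The data and function names are invented for this example and are not taken from the survey.

```python
import math

def brier_score(probs, labels):
    """Mean squared distance between predicted vectors and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true label."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

labels = [0, 1, 0, 1]
# Both toy classifiers get the first two samples right and the last two wrong.
overconfident = [[0.99, 0.01], [0.01, 0.99], [0.01, 0.99], [0.99, 0.01]]
hedged = [[0.7, 0.3], [0.3, 0.7], [0.3, 0.7], [0.7, 0.3]]

brier_over = brier_score(overconfident, labels)
brier_hedged = brier_score(hedged, labels)
```

Accuracy cannot separate the two classifiers, while both scoring rules penalize the one that is confidently wrong.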

Interpreting deep learning output for out-of-distribution detection

Damian Matuszewski , Ida-Maria Sintorn

Categories: Computer Vision

2022-11-07

Commonly used AI networks are very self-confident in their predictions, even when the evidence for a certain decision is dubious. The investigation of a deep learning model output is pivotal for understanding its decision processes and assessing its capabilities and limitations. By analyzing the distributions of raw network output vectors, it can be observed that each class has its own decision boundary and, thus, the same raw output value has different support for different classes. Inspired by this fact, we have developed a new method for out-of-distribution detection. The method offers an explanatory step beyond simple thresholding of the softmax output towards understanding and interpretation of the model learning process and its output. Instead of assigning the class label of the highest logit to each new sample presented to the network, it takes the distributions over all classes into consideration. A probability score interpreter (PSI) is created based on the joint logit values in relation to their respective correct vs wrong class distributions. The PSI suggests whether the sample is likely to belong to a specific class, whether the network is unsure, or whether the sample is likely an outlier or unknown type for the network. The simple PSI has the benefit of being applicable on already trained networks. The distributions for correct vs wrong class for each output node are established by simply running the training examples through the trained network. We demonstrate our OOD detection method on a challenging transmission electron microscopy virus image dataset. We simulate a real-world application in which images of virus types unknown to a trained virus classifier, yet acquired with the same procedures and instruments, constitute the OOD samples.

The PASCAL Visual Object Classes (VOC) Challenge

Categories:

The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

Classification Under Ambiguity: When Is Average-K Better Than Top-K?

Titouan Lorieul , Alexis Joly , Dennis Shasha

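Categories: Machine Learning (Statistics) | Computer Vision | Machine Learning

2021-12-16

When many labels are possible, choosing a single one can lead to low precision. A common alternative, referred to as top-$K$ classification, is to choose some number $K$ (commonly around 5) and to return the $K$ labels with the highest scores. Unfortunately, for unambiguous cases, $K > 1$ is too many and, for very ambiguous cases, $K \leq 5$ (for example) can be too small. An alternative sensible strategy is to use an adaptive approach in which the number of labels returned varies as a function of the computed ambiguity, but must average to some particular $K$ over all the samples. We denote this alternative average-$K$ classification. This paper formally characterizes the ambiguity profile under which average-$K$ classification can achieve a lower error rate than a fixed top-$K$ classification. Moreover, it provides natural estimation procedures for both the fixed-size and the adaptive classifier and proves their consistency. Finally, it reports experiments on real-world image datasets revealing the benefit of average-$K$ classification over top-$K$ in practice. Overall, when the ambiguity is known precisely, average-$K$ is never worse than top-$K$, and, in our experiments, this also holds when it is estimated.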
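
The contrast between the two strategies can be sketched on toy data. The version of average-$K$ below is one simple empirical reading, assumed for illustration: keep the $nK$ highest-scoring (sample, label) pairs across the whole batch, which amounts to a single global score threshold. It is not the estimation procedure analysed in the paper, and the function names are hypothetical.

```python
def top_k_sets(prob_rows, k):
    """Fixed-size sets: the k highest-scoring labels for every sample."""
    return [sorted(range(len(p)), key=lambda j: -p[j])[:k] for p in prob_rows]

def average_k_sets(prob_rows, k):
    """Adaptive sets: keep the n*k best (sample, label) pairs overall.

    Set sizes vary per sample but average exactly k over the batch.
    A sample may end up with an empty set if all its scores are low.
    """
    pairs = [(p, i, j) for i, row in enumerate(prob_rows) for j, p in enumerate(row)]
    pairs.sort(key=lambda t: -t[0])
    budget = k * len(prob_rows)
    sets = [[] for _ in prob_rows]
    for _, i, j in pairs[:budget]:
        sets[i].append(j)
    return sets

probs = [
    [0.97, 0.02, 0.01, 0.00],  # unambiguous sample
    [0.30, 0.28, 0.22, 0.20],  # highly ambiguous sample
]
fixed = top_k_sets(probs, 2)         # two labels for each sample
adaptive = average_k_sets(probs, 2)  # one label for the first, three for the second
```

With the same label budget, the adaptive rule spends fewer labels on the clear case and more on the ambiguous one.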

Towards Robustness of Neural Networks

Steven Basart

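Categories: Computer Vision | Machine Learning

2021-12-30

We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots remaining in ImageNet. ImageNet-R was created specifically to track robust representations, as its images are no longer simply natural but include artistic and other renditions. The CAOS suite is built on the CARLA simulator, which allows for the inclusion of anomalous objects and can create reproducible synthetic environments and scenes for testing robustness. All of the datasets were created for testing robustness and measuring progress in robustness. The datasets have been used in various other works to measure their own progress in robustness, allowing for tangential progress that does not focus exclusively on natural accuracy. Given these datasets, we created several novel methods that aim to advance robustness research. We build on simple baselines in the form of Maximum Logit and Typicality Score, and create a novel data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. Maximum Logit considers the logit values instead of the values after the softmax operation; this small change produces noticeable improvements. The Typicality Score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline in all but the segmentation task, speculating that at the pixel level the semantic information of a pixel is perhaps less meaningful than class-level information. Finally, the new augmentation technique DeepAugment uses neural networks to create augmentations of images that are radically different from the traditional geometric and camera-based transformations used previously.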
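
One way to see why scoring with logits can differ from scoring with softmax outputs: softmax is unchanged when every logit is shifted by the same constant, so two inputs with very different raw evidence can receive the same maximum softmax probability. The sketch below uses made-up logit vectors and hypothetical function names; it illustrates this property only and is not the thesis's evaluation code.

```python
import math

def max_softmax_score(logits):
    """Largest softmax probability, computed stably."""
    m = max(logits)
    return 1.0 / sum(math.exp(z - m) for z in logits)

def max_logit_score(logits):
    """Largest raw logit, taken before any normalization."""
    return max(logits)

# Hypothetical logits: the second vector is the first shifted down by 9.
strong = [10.0, 8.0, 7.0]
weak = [1.0, -1.0, -2.0]

same_softmax = abs(max_softmax_score(strong) - max_softmax_score(weak)) < 1e-12
logit_gap = max_logit_score(strong) - max_logit_score(weak)
```

The maximum softmax score cannot tell the two inputs apart, while the maximum logit keeps the difference in overall activation.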

Proceedings of the 3rd International Workshop on Reading Music Systems

Jorge Calvo-Zaragoza , Alexander Pacha

Categories: Computer Vision | Machine Learning

2022-12-01

The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.

A Review of Open-World Learning and Steps Toward Open-World Learning Without Labels

Mohsen Jafarzadeh , Akshay Raj Dhamija , Steve Cruz , Chunchun Li , Touqeer Ahmad , Terrance E. Boult

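Categories: Computer Vision | Artificial Intelligence | Machine Learning

2020-11-25

In open-world learning, an agent starts with a set of known classes, detects and manages things that it does not know, and learns them over time from a non-stationary stream of data. Open-world learning is related to but also distinct from a multitude of other learning problems, and this paper briefly analyzes the key differences between a wide range of problems, including incremental learning, generalized novelty discovery, and generalized zero-shot learning. This paper formalizes various open-world learning problems, including open-world learning without labels. These open-world problems can be addressed with modifications to known elements. We present a new framework that enables agents to combine various modules for novelty detection, novelty characterization, incremental learning, and instance management to learn new classes from a stream of unlabeled data in an unsupervised manner, survey how to adapt a few state-of-the-art techniques to fit the framework, and use them to define seven baselines for performance on the open-world learning without labels problem. We then discuss open-world learning quality and analyze how it can improve instance management. We also discuss some of the general ambiguity issues that occur in open-world learning without labels.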

Estimating Classification Confidence Using Kernel Densities

Peter Salamon , David Salamon , V. Adrian Cantu , Michelle An , Tyler Perry , Robert A. Edwards , Anca M. Segall

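Categories: Machine Learning (Statistics) | Machine Learning

2022-07-13

This paper investigates the post-hoc calibration of confidence for "exploratory" machine learning classification problems. The difficulty in these problems stems from the continuing desire, when curating datasets, to push the boundaries of which categories have enough examples to generalize from, and from confusion regarding the validity of those categories. We argue that for such problems the "one-versus-all" approach (top-label calibration) must be used rather than the "calibrate-the-full-response-matrix" approach advocated elsewhere in the literature. We introduce and test four new algorithms designed to handle the idiosyncrasies of category-specific confidence estimation. Chief among these methods is the use of kernel density ratios for confidence calibration, including a novel, bulletproof algorithm for choosing the bandwidth. We test our claims and explore the limits of calibration on a bioinformatics application (PhANNs) as well as the classic MNIST benchmark. Finally, our analysis argues that post-hoc calibration should always be performed, should be based only on the test dataset, and should be sanity-checked visually.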

Generalized Out-of-Distribution Detection: A Survey

Jingkang Yang , Kaiyang Zhou , Yixuan Li , Ziwei Liu

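Categories: Computer Vision | Artificial Intelligence | Machine Learning

2021-10-21

Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we would like the driving system to issue an alert and hand over control to humans when it detects unusual scenes or objects that it has never seen during training and cannot make a safe decision. The term OOD detection first emerged in 2017 and has since received increasing attention from the research community, leading to a plethora of methods, ranging from classification-based to density-based to distance-based ones. Meanwhile, several other problems, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD), are closely related to OOD detection in terms of motivation and methodology. Despite common goals, these topics have developed in isolation, and their subtle differences in definition and problem setting often confuse readers and practitioners. In this survey, we first present a unified framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks, and are easier to distinguish. We then review each of these five areas by summarizing their recent technical developments, with a special focus on OOD detection methodologies. We conclude this survey with open challenges and potential research directions.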

A Novel Data Augmentation Technique for Out-of-Distribution Sample Detection using Compounded Corruptions

Ramya S. Hebbalaguppe , Soumya Suvra Goshal , Jatin Prakash , Harshad Khadilkar , Chetan Arora

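Categories: Computer Vision | Artificial Intelligence | Machine Learning

2022-07-28

Modern deep neural network models are known to erroneously classify out-of-distribution (OOD) test data into one of the in-distribution (ID) training classes with high confidence. This can have disastrous consequences for safety-critical applications. A popular mitigation strategy is to train a separate classifier that can detect such OOD samples at test time. In most practical settings, OOD examples are not known at training time, and hence a key question is: how can the ID data be augmented with synthetic OOD samples for training such an OOD detector? In this paper, we propose a novel Compounded Corruption technique for OOD data augmentation, termed CnC. One of the major advantages of CnC is that it does not require any hold-out data apart from the training set. Further, unlike current state-of-the-art (SOTA) techniques, CnC does not require backpropagation or ensembling at test time, making our method much faster at inference. Our extensive comparison with 20 methods from the major conferences of the last 4 years shows that a model trained using CnC-based data augmentation significantly outperforms SOTA, both in terms of OOD detection accuracy and inference time. We include a detailed post-hoc analysis to investigate the reasons for the success of our method and identify the higher relative entropy and diversity of CnC samples as probable causes. We also provide theoretical insights via a piece-wise decomposition analysis on a two-dimensional dataset to reveal (visually and quantitatively) that our approach leads to a tighter boundary around ID classes, leading to better detection of OOD samples. Source code link: https://github.com/cnc-ood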

Deep Open-Set Recognition for Silicon Wafer Production Monitoring

Luca Frittoli , Diego Carrera , Beatrice Rossi , Pasqualina Fragneto , Giacomo Boracchi

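Categories: Computer Vision

2022-08-30

The chips contained in any electronic device are manufactured over circular silicon wafers, which are monitored by inspection machines at different production stages. Inspection machines detect and locate any defect within the wafer and return a Wafer Defect Map (WDM), i.e., a list of the coordinates where defects lie, which can be considered a huge, sparse, and binary image. In normal conditions, wafers exhibit a small number of randomly distributed defects, while defects grouped in specific patterns might indicate known or novel categories of failures in the production line. Needless to say, a primary concern of the semiconductor industry is to identify these patterns and intervene as soon as possible to restore normal production conditions. Here we address WDM monitoring as an open-set recognition problem to accurately classify WDMs into known categories and promptly detect novel patterns. In particular, we propose a comprehensive pipeline for wafer monitoring based on a Submanifold Sparse Convolutional Network, a deep architecture designed to process sparse data at an arbitrary resolution, which is trained on the known classes. To detect novelties, we define an outlier detector based on a Gaussian Mixture Model fitted on the latent representation of the classifier. Our experiments on a real dataset of WDMs show that directly processing full-resolution WDMs with Submanifold Sparse Convolutions yields better classification performance on known classes than traditional Convolutional Neural Networks, which require a preliminary binning to reduce the size of the binary images representing WDMs. Moreover, our solution outperforms state-of-the-art open-set recognition solutions in detecting novelties.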

How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review

Florian Tambon , Gabriel Laberge , Le An , Amin Nikanjam , Paulina Stevia Nouwou Mindom , Yann Pequignot , Foutse Khomh , Giulio Antoniol , Ettore Merlo , François Laviolette

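Categories: Machine Learning

2021-07-26

Context: Machine Learning (ML) has been at the heart of many innovations over the past years. However, including it in so-called "safety-critical" systems, such as automotive or aeronautic ones, has proven to be very challenging, since the shift in paradigm that ML brings completely changes traditional certification approaches. Objective: This paper aims to elucidate the challenges related to the certification of ML-based safety-critical systems, as well as the solutions proposed in the literature to tackle them, answering the question "How to Certify Machine Learning Based Safety-critical Systems?". Method: We conduct a Systematic Literature Review (SLR) of research papers published between 2015 and 2020, covering topics related to the certification of ML systems. In total, we identified 217 papers covering topics considered to be the main pillars of ML certification: Robustness, Uncertainty, Explainability, Verification, Safe Reinforcement Learning, and Direct Certification. We analyzed the main trends and problems of each sub-field and provided summaries of the papers extracted. Results: The SLR results highlighted the enthusiasm of the community for this subject, as well as the lack of diversity in terms of datasets and types of models. It also emphasized the need to further develop connections between academia and industry to deepen the study of the domain. Finally, it illustrated the necessity of building connections between the above-mentioned main pillars, which are for now mainly studied separately. Conclusion: We highlight the efforts currently deployed to enable the certification of ML-based software systems, and discuss some future research directions.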

Unmasking Clever Hans Predictors and Assessing What Machines Really Learn

Sebastian Lapuschkin , Stephan Wäldchen , Alexander Binder , Grégoire Montavon , Wojciech Samek , Klaus-Robert Müller

Categories:

2019-02-26

Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.

The Familiarity Hypothesis: Explaining the Behavior of Deep Open Set Methods

Thomas G. Dietterich , Alexander Guyer

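Categories: Computer Vision | Machine Learning | Machine Learning (Statistics)

2022-03-04

In many object recognition applications, the set of possible categories is an open set, and the deployed recognition system will encounter novel objects belonging to categories unseen during training. Detecting such "novel category" objects is usually formulated as an anomaly detection problem. Anomaly detection algorithms for feature-vector data identify anomalies as outliers, but outlier detection has not worked well in deep learning. Instead, methods based on the computed logits of visual object classifiers give state-of-the-art performance. This paper proposes the Familiarity Hypothesis: these methods succeed because they are detecting the absence of familiar learned features rather than the presence of novelty. This distinction is important, because familiarity-based detection will fail in many situations where novelty is present. For example, when an image contains both a novel object and a familiar one, the familiarity score will be high, so the novel object will not be noticed. The paper reviews evidence from the literature and presents additional evidence from our own experiments that provides strong support for this hypothesis. The paper concludes with a discussion of whether familiarity-based detection is an inevitable consequence of representation learning.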

Do ImageNet Classifiers Generalize to ImageNet?

Benjamin Recht , Rebecca Roelofs , Ludwig Schmidt , Vaishaal Shankar

Categories:

2019-02-13

We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% to 15% on CIFAR-10 and 11% to 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
