可说明的机器学习吸引了越来越多的关注,因为它提高了模型的透明度,这有助于机器学习在真实应用中受到信任。然而,最近证明了解释方法易于操纵,在那里我们可以在保持其预测常数的同时轻松改变模型的解释。为了解决这个问题,已经支付了一些努力来使用更稳定的解释方法或更改模型配置。在这项工作中,我们从训练角度解决了问题,并提出了一种称为对抗的解释培训的新培训计划(ATEX),以改善模型的内部解释稳定性,无论应用的具体解释方法如何。而不是直接指定数据实例上的解释值,而是仅为模型预测提供了要求,该预测避免了涉及在优化中的二阶导数。作为进一步的讨论,我们还发现解释稳定性与模型的另一个性质密切相关,即暴露于对抗性攻击的风险。通过实验,除了表明ATEX改善了针对操纵靶向解释的模型鲁棒性,它还带来了额外的益处,包括平滑解释,并在应用于模型时提高对抗性训练的功效。
translated by 谷歌翻译
Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
translated by 谷歌翻译
已知深度神经网络(DNN)容易受到用不可察觉的扰动制作的对抗性示例的影响,即,输入图像的微小变化会引起错误的分类,从而威胁着基于深度学习的部署系统的可靠性。经常采用对抗训练(AT)来通过训练损坏和干净的数据的混合物来提高DNN的鲁棒性。但是,大多数基于AT的方法在处理\ textit {转移的对抗示例}方面是无效的,这些方法是生成以欺骗各种防御模型的生成的,因此无法满足现实情况下提出的概括要求。此外,对抗性训练一般的国防模型不能对具有扰动的输入产生可解释的预测,而不同的领域专家则需要一个高度可解释的强大模型才能了解DNN的行为。在这项工作中,我们提出了一种基于Jacobian规范和选择性输入梯度正则化(J-SIGR)的方法,该方法通过Jacobian归一化提出了线性化的鲁棒性,还将基于扰动的显着性图正规化,以模仿模型的可解释预测。因此,我们既可以提高DNN的防御能力和高解释性。最后,我们评估了跨不同体系结构的方法,以针对强大的对抗性攻击。实验表明,提出的J-Sigr赋予了针对转移的对抗攻击的鲁棒性,我们还表明,来自神经网络的预测易于解释。
translated by 谷歌翻译
Gradient-based explanation is the cornerstone of explainable deep networks, but it has been shown to be vulnerable to adversarial attacks. However, existing works measure the explanation robustness based on $\ell_p$-norm, which can be counter-intuitive to humans, who only pay attention to the top few salient features. We propose explanation ranking thickness as a more suitable explanation robustness metric. We then present a new practical adversarial attacking goal for manipulating explanation rankings. To mitigate the ranking-based attacks while maintaining computational feasibility, we derive surrogate bounds of the thickness that involve expensive sampling and integration. We use a multi-objective approach to analyze the convergence of a gradient-based attack to confirm that the explanation robustness can be measured by the thickness metric. We conduct experiments on various network architectures and diverse datasets to prove the superiority of the proposed methods, while the widely accepted Hessian-based curvature smoothing approaches are not as robust as our method.
translated by 谷歌翻译
In order for machine learning to be trusted in many applications, it is critical to be able to reliably explain why the machine learning algorithm makes certain predictions. For this reason, a variety of methods have been developed recently to interpret neural network predictions by providing, for example, feature importance maps. For both scientific robustness and security reasons, it is important to know to what extent can the interpretations be altered by small systematic perturbations to the input data, which might be generated by adversaries or by measurement biases. In this paper, we demonstrate how to generate adversarial perturbations that produce perceptively indistinguishable inputs that are assigned the same predicted label, yet have very different interpretations. We systematically characterize the robustness of interpretations generated by several widely-used feature importance interpretation methods (feature importance maps, integrated gradients, and DeepLIFT) on ImageNet and CIFAR-10. In all cases, our experiments show that systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly susceptible to adversarial attack. Our analysis of the geometry of the Hessian matrix gives insight on why robustness is a general challenge to current interpretation approaches.
translated by 谷歌翻译
This study provides a new understanding of the adversarial attack problem by examining the correlation between adversarial attack and visual attention change. In particular, we observed that: (1) images with incomplete attention regions are more vulnerable to adversarial attacks; and (2) successful adversarial attacks lead to deviated and scattered attention map. Accordingly, an attention-based adversarial defense framework is designed to simultaneously rectify the attention map for prediction and preserve the attention area between adversarial and original images. The problem of adding iteratively attacked samples is also discussed in the context of visual attention change. We hope the attention-related data analysis and defense solution in this study will shed some light on the mechanism behind the adversarial attack and also facilitate future adversarial defense/attack model design.
translated by 谷歌翻译
最近的研究表明,深度神经网络(DNNS)极易受到精心设计的对抗例子的影响。对那些对抗性例子的对抗性学习已被证明是防御这种攻击的最有效方法之一。目前,大多数现有的对抗示例生成方法基于一阶梯度,这几乎无法进一步改善模型的鲁棒性,尤其是在面对二阶对抗攻击时。与一阶梯度相比,二阶梯度提供了相对于自然示例的损失格局的更准确近似。受此启发的启发,我们的工作制作了二阶的对抗示例,并使用它们来训练DNNS。然而,二阶优化涉及Hessian Inverse的耗时计算。我们通过将问题转换为Krylov子空间中的优化,提出了一种近似方法,该方法显着降低了计算复杂性以加快训练过程。在矿工和CIFAR-10数据集上进行的广泛实验表明,我们使用二阶对抗示例的对抗性学习优于其他FISRT-阶方法,这可以改善针对广泛攻击的模型稳健性。
translated by 谷歌翻译
在本讨论文件中,我们调查了有关机器学习模型鲁棒性的最新研究。随着学习算法在数据驱动的控制系统中越来越流行,必须确保它们对数据不确定性的稳健性,以维持可靠的安全至关重要的操作。我们首先回顾了这种鲁棒性的共同形式主义,然后继续讨论训练健壮的机器学习模型的流行和最新技术,以及可证明这种鲁棒性的方法。从强大的机器学习的这种统一中,我们识别并讨论了该地区未来研究的迫切方向。
translated by 谷歌翻译
有必要提高某些特殊班级的表现,或者特别保护它们免受对抗学习的攻击。本文提出了一个将成本敏感分类和对抗性学习结合在一起的框架,以训练可以区分受保护和未受保护的类的模型,以使受保护的类别不太容易受到对抗性示例的影响。在此框架中,我们发现在训练深神经网络(称为Min-Max属性)期间,一个有趣的现象,即卷积层中大多数参数的绝对值。基于这种最小的最大属性,该属性是在随机分布的角度制定和分析的,我们进一步建立了一个针对对抗性示例的新防御模型,以改善对抗性鲁棒性。构建模型的一个优点是,它的性能比标准模型更好,并且可以与对抗性训练相结合,以提高性能。在实验上证实,对于所有类别的平均准确性,我们的模型在没有发生攻击时几乎与现有模型一样,并且在发生攻击时比现有模型更好。具体而言,关于受保护类的准确性,提议的模型比发生攻击时的现有模型要好得多。
translated by 谷歌翻译
尽管机器学习系统的效率和可扩展性,但最近的研究表明,许多分类方法,尤其是深神经网络(DNN),易受对抗的例子;即,仔细制作欺骗训练有素的分类模型的例子,同时无法区分从自然数据到人类。这使得在安全关键区域中应用DNN或相关方法可能不安全。由于这个问题是由Biggio等人确定的。 (2013)和Szegedy等人。(2014年),在这一领域已经完成了很多工作,包括开发攻击方法,以产生对抗的例子和防御技术的构建防范这些例子。本文旨在向统计界介绍这一主题及其最新发展,主要关注对抗性示例的产生和保护。在数值实验中使用的计算代码(在Python和R)公开可用于读者探讨调查的方法。本文希望提交人们将鼓励更多统计学人员在这种重要的令人兴奋的领域的产生和捍卫对抗的例子。
translated by 谷歌翻译
对抗性的鲁棒性已经成为深度学习的核心目标,无论是在理论和实践中。然而,成功的方法来改善对抗的鲁棒性(如逆势训练)在不受干扰的数据上大大伤害了泛化性能。这可能会对对抗性鲁棒性如何影响现实世界系统的影响(即,如果它可以提高未受干扰的数据的准确性),许多人可能选择放弃鲁棒性)。我们提出内插对抗培训,该培训最近雇用了在对抗培训框架内基于插值的基于插值的培训方法。在CiFar -10上,对抗性训练增加了标准测试错误(当没有对手时)从4.43%到12.32%,而我们的内插对抗培训我们保留了对抗性的鲁棒性,同时实现了仅6.45%的标准测试误差。通过我们的技术,强大模型标准误差的相对增加从178.1%降至仅为45.5%。此外,我们提供内插对抗性培训的数学分析,以确认其效率,并在鲁棒性和泛化方面展示其优势。
translated by 谷歌翻译
对抗性实例的有趣现象引起了机器学习中的显着关注,对社区可能更令人惊讶的是存在普遍对抗扰动(UAPS),即欺骗目标DNN的单一扰动。随着对深层分类器的关注,本调查总结了最近普遍对抗攻击的进展,讨论了攻击和防御方的挑战,以及uap存在的原因。我们的目标是将此工作扩展为动态调查,该调查将定期更新其内容,以遵循关于在广泛的域中的UAP或通用攻击的新作品,例如图像,音频,视频,文本等。将讨论相关更新:https://bit.ly/2sbqlgg。我们欢迎未来的作者在该领域的作品,联系我们,包括您的新发现。
translated by 谷歌翻译
对抗性可转移性是一种有趣的性质 - 针对一个模型制作的对抗性扰动也是对另一个模型有效的,而这些模型来自不同的模型家庭或培训过程。为了更好地保护ML系统免受对抗性攻击,提出了几个问题:对抗性转移性的充分条件是什么,以及如何绑定它?有没有办法降低对抗的转移性,以改善合奏ML模型的鲁棒性?为了回答这些问题,在这项工作中,我们首先在理论上分析和概述了模型之间的对抗性可转移的充分条件;然后提出一种实用的算法,以减少集合内基础模型之间的可转换,以提高其鲁棒性。我们的理论分析表明,只有促进基础模型梯度之间的正交性不足以确保低可转移性;与此同时,模型平滑度是控制可转移性的重要因素。我们还在某些条件下提供了对抗性可转移性的下界和上限。灵感来自我们的理论分析,我们提出了一种有效的可转让性,减少了平滑(TRS)集合培训策略,以通过实施基础模型之间的梯度正交性和模型平滑度来培训具有低可转换性的强大集成。我们对TRS进行了广泛的实验,并与6个最先进的集合基线进行比较,防止不同数据集的8个白箱攻击,表明所提出的TRS显着优于所有基线。
translated by 谷歌翻译
了解神经网络的决策过程很难。解释的一种重要方法是将其决定归因于关键特征。尽管提出了许多算法,但其中大多数仅改善了模型的忠诚。但是,真实的环境包含许多随机噪声,这可能会导致解释中的波动。更严重的是,最近的作品表明,解释算法容易受到对抗性攻击的影响。所有这些使解释很难在实际情况下信任。为了弥合这一差距,我们提出了一种模型 - 不稳定方法\ emph {特征归因}(METFA)的中位数测试,以量化不确定性并提高使用理论保证的解释算法的稳定性。 METFA具有以下两个函数:(1)检查一个特征是显着重要还是不重要,并生成METFA相关的映射以可视化结果; (2)计算特征归因评分的置信区间,并生成一个平滑的图表以提高解释的稳定性。实验表明,METFA提高了解释的视觉质量,并在保持忠诚的同时大大减少了不稳定。为了定量评估不同噪音设置下解释的忠诚,我们进一步提出了几个强大的忠诚指标。实验结果表明,METFA平滑的解释可以显着提高稳健的忠诚。此外,我们使用两种方案来显示METFA在应用程序中的潜力。首先,当应用于SOTA解释方法来定位语义分割模型的上下文偏见时,METFA很重要的解释使用较小的区域来维持99 \%+忠实。其次,当通过不同的以解释为导向的攻击进行测试时,METFA可以帮助捍卫香草,以及自适应的对抗性攻击,以防止解释。
translated by 谷歌翻译
对抗培训,培训具有对抗性数据的深层学习模型的过程,是深度学习模型中最成功的对抗性防御方法之一。我们发现,如果我们在推理阶段微调这一模型以适应对抗的输入,可以进一步提高对普遍训练模型的白箱攻击的鲁棒性,以适应对手输入,其中包含额外信息。我们介绍了一种算法,即“邮政列车”在原始输出类和“邻居”类之间的推断阶段的模型,具有现有培训数据。预训练的FAST-FGSM CIFAR10分类器基础模型对白盒预计梯度攻击(PGD)的准确性可以通过我们的算法显着提高46.8%至64.5%。
translated by 谷歌翻译
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS-and which illustrate a diverse set of defense strategies-can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result-showing that a defense was ineffective-this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
translated by 谷歌翻译
Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives.This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
translated by 谷歌翻译
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples-inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. 1
translated by 谷歌翻译
Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern on the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability of generating barely noticeable (to both human and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs.Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to * Pin-Yu Chen and Huan Zhang contribute equally to this work.
translated by 谷歌翻译
Although deep learning has made remarkable progress in processing various types of data such as images, text and speech, they are known to be susceptible to adversarial perturbations: perturbations specifically designed and added to the input to make the target model produce erroneous output. Most of the existing studies on generating adversarial perturbations attempt to perturb the entire input indiscriminately. In this paper, we propose ExploreADV, a general and flexible adversarial attack system that is capable of modeling regional and imperceptible attacks, allowing users to explore various kinds of adversarial examples as needed. We adapt and combine two existing boundary attack methods, DeepFool and Brendel\&Bethge Attack, and propose a mask-constrained adversarial attack system, which generates minimal adversarial perturbations under the pixel-level constraints, namely ``mask-constraints''. We study different ways of generating such mask-constraints considering the variance and importance of the input features, and show that our adversarial attack system offers users good flexibility to focus on sub-regions of inputs, explore imperceptible perturbations and understand the vulnerability of pixels/regions to adversarial attacks. We demonstrate our system to be effective based on extensive experiments and user study.
translated by 谷歌翻译