Deep learning methods have gained increased attention in various applications due to their outstanding performance. To explore how this high performance relates to the proper use of data artifacts and the accurate problem formulation of a given task, interpretation models have become a crucial component in developing deep learning-based systems. Interpretation models enable the understanding of the inner workings of deep learning models and offer a sense of security in detecting the misuse of artifacts in the input data. Similar to prediction models, interpretation models are also susceptible to adversarial inputs. This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model. We assess the effectiveness of the proposed attacks against two deep learning model architectures coupled with four interpretation models that represent different categories of interpretation models. Our experiments include the attack implementation using various attack frameworks. We also explore potential countermeasures against such attacks. Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters, and highlights insights to improve and circumvent the attacks.
translated by Google Translate
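Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., a small change in an input image can induce a misclassification, which threatens the reliability of deployed systems based on deep learning. Adversarial training (AT) is often adopted to improve the robustness of DNNs by training on a mixture of corrupted and clean data. However, most AT-based methods are ineffective against transferred adversarial examples, which are generated to fool a wide range of defense models, and thus cannot satisfy the generalization requirement raised in real-world scenarios. Moreover, adversarially training a defense model generally cannot produce interpretable predictions for perturbed inputs, whereas domain experts require a highly interpretable robust model to understand the behavior of a DNN. In this work, we propose an approach based on Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and also regularizes perturbation-based saliency maps to imitate the model's interpretable predictions. As such, we achieve both improved defense and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments demonstrate that the proposed J-SIGR confers robustness against transferred adversarial attacks, and we also show that the predictions from the neural network are easy to interpret.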
Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern on the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability of generating barely noticeable (to both human and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to [...]
* Pin-Yu Chen and Huan Zhang contribute equally to this work.
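The gradient-estimation idea in the ZOO abstract above can be illustrated with a minimal sketch. It assumes a symmetric difference quotient for the per-coordinate estimate and a plain gradient step on one randomly chosen coordinate per iteration; the function names, the toy loss, the step size and the smoothing constant `h` are all hypothetical and chosen for illustration. It is not the authors' implementation and omits the dimension reduction, hierarchical attack and importance sampling mentioned in the abstract.

```python
import random


def estimate_coordinate_gradient(f, x, i, h=1e-4):
    """Estimate df/dx_i using only function evaluations (no backpropagation)."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += h
    x_minus[i] -= h
    # Symmetric difference quotient: two queries to the black-box function.
    return (f(x_plus) - f(x_minus)) / (2.0 * h)


def zeroth_order_coordinate_descent(f, x0, steps=2000, lr=0.1, h=1e-4, seed=0):
    """Minimize f by updating one randomly chosen coordinate per step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        i = rng.randrange(len(x))  # stochastic coordinate selection
        g = estimate_coordinate_gradient(f, x, i, h)
        x[i] -= lr * g
    return x


def toy_loss(x):
    # Hypothetical stand-in for an attack loss computed from confidence scores.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


x_min = zeroth_order_coordinate_descent(toy_loss, [0.0, 0.0])
```

In a real black-box setting, `f` would wrap queries to the target model, so each coordinate update costs two queries; that cost is presumably why the abstract pairs coordinate descent with techniques that reduce the number of coordinates to explore.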
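Many state-of-the-art ML models have outperformed humans on various tasks, such as image classification. With such outstanding performance, ML models are widely used today. However, the existence of adversarial attacks and data poisoning attacks calls the robustness of ML models into question. For example, Engstrom et al. demonstrated that state-of-the-art image classifiers can be easily fooled by a small rotation of an arbitrary image. As ML systems are increasingly integrated into safety- and security-sensitive applications, adversarial attacks and data poisoning attacks pose a considerable threat. This chapter focuses on two broad and important areas of ML security: adversarial attacks and data poisoning attacks.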
Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. The addition of calculated small distortion to images, for instance, can deceive a well-trained image classification network. In this work, we propose a novel attack technique called Sparse Adversarial and Interpretable Attack Framework (SAIF). Specifically, we design imperceptible attacks that contain low-magnitude perturbations at a small number of pixels and leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to simultaneously optimize the attack perturbations for bounded magnitude and sparsity with $O(1/\sqrt{T})$ convergence. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples, and outperforms state-of-the-art sparse attack methods on the ImageNet dataset.
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS (and which illustrate a diverse set of defense strategies) can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
Although deep learning models have made remarkable progress in processing various types of data such as images, text and speech, they are known to be susceptible to adversarial perturbations: perturbations specifically designed and added to the input to make the target model produce erroneous output. Most of the existing studies on generating adversarial perturbations attempt to perturb the entire input indiscriminately. In this paper, we propose ExploreADV, a general and flexible adversarial attack system that is capable of modeling regional and imperceptible attacks, allowing users to explore various kinds of adversarial examples as needed. We adapt and combine two existing boundary attack methods, DeepFool and the Brendel & Bethge attack, and propose a mask-constrained adversarial attack system, which generates minimal adversarial perturbations under pixel-level constraints, namely "mask-constraints". We study different ways of generating such mask-constraints considering the variance and importance of the input features, and show that our adversarial attack system offers users good flexibility to focus on sub-regions of inputs, explore imperceptible perturbations and understand the vulnerability of pixels/regions to adversarial attacks. We demonstrate our system to be effective based on extensive experiments and a user study.
Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
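The detection scheme described in the feature squeezing abstract above (compare the prediction on the original input with the prediction on a squeezed input) can be sketched as follows. This is a minimal illustration under stated assumptions: only bit-depth reduction is shown, the L1 distance between prediction vectors and the threshold value are assumptions made for this sketch, and `toy_model` is a hypothetical brittle classifier, not a real DNN.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize intensities in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]


def l1_distance(p, q):
    """Sum of absolute differences between two prediction vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def looks_adversarial(model, pixels, bits=1, threshold=0.5):
    """Flag an input whose prediction changes a lot after squeezing."""
    original = model(pixels)
    squeezed = model(reduce_bit_depth(pixels, bits))
    return l1_distance(original, squeezed) > threshold


def toy_model(pixels):
    # Hypothetical two-class classifier that thresholds the mean intensity.
    mean = sum(pixels) / len(pixels)
    return [1.0, 0.0] if mean < 0.5 else [0.0, 1.0]


clean = [0.0, 0.0, 1.0, 0.0]        # already on the quantization grid
perturbed = [0.3, 0.3, 1.0, 0.45]   # pushed across the toy decision boundary
```

Here squeezing leaves `clean` unchanged, so the two predictions agree, while `perturbed` is mapped back onto the grid and its prediction flips, which is the disagreement the detector looks for. The perturbation in this toy case is deliberately large so that the effect is visible with four pixels.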
Video classification systems are vulnerable to adversarial attacks, which can create severe security problems in video verification. Current black-box attacks need a large number of queries to succeed, resulting in high computational overhead in the process of attack. On the other hand, attacks with restricted perturbations are ineffective against defenses such as denoising or adversarial training. In this paper, we focus on unrestricted perturbations and propose StyleFool, a black-box video adversarial attack via style transfer to fool the video classification system. StyleFool first utilizes color theme proximity to select the best style image, which helps avoid unnatural details in the stylized videos. Meanwhile, the target class confidence is additionally considered in targeted attacks to influence the output distribution of the classifier by moving the stylized video closer to or even across the decision boundary. A gradient-free method is then employed to further optimize the adversarial perturbations. We carry out extensive experiments to evaluate StyleFool on two standard datasets, UCF-101 and HMDB-51. The experimental results demonstrate that StyleFool outperforms the state-of-the-art adversarial attacks in terms of both the number of queries and the robustness against existing defenses. Moreover, 50% of the stylized videos in untargeted attacks do not need any query since they can already fool the video classification model. Furthermore, we evaluate the indistinguishability through a user study to show that the adversarial samples of StyleFool look imperceptible to human eyes, despite unrestricted perturbations.
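Over the past decades, the rise of artificial intelligence has given us the ability to solve some of the most challenging problems in our daily lives, such as cancer prediction and autonomous navigation. However, these applications may not be reliable if they are not secured against adversarial attacks. In addition, recent work has shown that some adversarial examples are transferable across different models. It is therefore crucial to prevent such transferability through robust models that resist adversarial manipulation. In this paper, we propose a feature randomization-based approach that resists eight adversarial attacks targeting deep learning models in the testing phase. Our novel approach consists of changing the training strategy in the target network classifier and selecting random feature samples. We consider attackers under limited-knowledge and semi-knowledge conditions carrying out the most prevalent types of adversarial attacks. We evaluate the robustness of our approach using the well-known UNSW-NB15 dataset, which includes realistic and synthetic attacks. We then demonstrate that our strategy outperforms the existing state-of-the-art approach, such as the Most Powerful Attack, which consists of fine-tuning the network model against specific adversarial attacks. Finally, our experimental results show that our method can secure the target network and resist the transferability of adversarial attacks by over 60%.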
With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks have recently been found vulnerable to well-designed input samples, called adversarial examples. Adversarial perturbations are imperceptible to humans but can easily fool deep neural networks in the testing/deploying stage. The vulnerability to adversarial examples has become one of the major risks for applying deep neural networks in safety-critical environments. Therefore, attacks and defenses on adversarial examples draw great attention. In this paper, we review recent findings on adversarial examples for deep neural networks, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications for adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples. In addition, three major challenges in adversarial examples and the potential solutions are discussed.
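With the rise of deep neural networks, the challenge of explaining the predictions of these networks has become increasingly recognized. While many methods for explaining the decisions of deep neural networks exist, there is currently no consensus on how to evaluate them. Robustness, on the other hand, is a popular topic in deep learning research; however, until very recently it was hardly discussed in the context of explainability. In this tutorial, we first present gradient-based interpretability methods. These techniques use gradient signals to attribute the decision to the input features. We then discuss how gradient-based methods can be evaluated for their robustness and the role that adversarial robustness plays in obtaining meaningful explanations. We also discuss the limitations of gradient-based methods. Finally, we present best practices and properties that should be examined before choosing an explainability method. We conclude with future directions for research at the convergence of robustness and explainability.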
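Advances in deep learning have enabled a wide range of promising applications. However, these systems are vulnerable to adversarial machine learning (AML) attacks: adversarially crafted perturbations to their inputs can cause them to misclassify. Several state-of-the-art adversarial attacks have demonstrated that they can reliably fool classifiers, making these attacks a significant threat. Adversarial attack generation algorithms focus primarily on creating successful examples while controlling the noise magnitude and distribution to make detection more difficult. The underlying assumption of these attacks is that the adversarial noise is generated offline, making execution time a secondary consideration. Recently, however, just-in-time adversarial attacks, in which an attacker opportunistically generates adversarial examples on the fly, have been shown to be possible. This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support such real-time adversarial attacks? Understanding this problem improves our understanding of the threat these attacks pose to real-time systems and provides security evaluation benchmarks for future defenses. We therefore first conduct a run-time analysis of adversarial generation algorithms. Universal attacks produce a general attack offline, incur no online overhead, and can be applied to any input; however, their success rate is limited because of their generality. In contrast, online algorithms that work on a specific input are computationally expensive, making them unsuitable for operation under time constraints. We therefore propose ROOM, a novel real-time online-offline attack construction model in which an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints.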
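Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision-based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications can only proceed to successful deployment once they are able to overcome safety-critical challenges. Among these challenges are the defense against and/or the detection of adversarial examples (AEs). An adversary can carefully craft small, often imperceptible, noise, called a perturbation, and add it to a clean image to generate an AE. The aim of an AE is to fool the DL model, which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures, i.e., defense or detection methods, have been proposed in the literature. Moreover, a few reviews and surveys have been published that present, from a theoretical standpoint, taxonomies of the threats and the countermeasure methods, with little focus on AE detection methods. In this paper, we focus on the image classification task and attempt to provide a survey of detection methods for test-time evasion attacks on neural network classifiers. A detailed discussion of such methods is provided, together with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also present potential challenges and future perspectives for this research direction.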
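Despite the efficiency and scalability of machine learning systems, recent studies have demonstrated that many classification methods, especially deep neural networks (DNNs), are vulnerable to adversarial examples, i.e., examples that are carefully crafted to fool a well-trained classification model while being indistinguishable from natural data to humans. This makes it potentially unsafe to apply DNNs or related methods in security-critical areas. Since this issue was first identified by Biggio et al. (2013) and Szegedy et al. (2014), much work has been done in this field, including the development of attack methods to generate adversarial examples and the construction of defense techniques to guard against such examples. This paper aims to introduce this topic and its latest developments to the statistical community, focusing primarily on the generation of and defense against adversarial examples. The computing code (in Python and R) used in the numerical experiments is publicly available for readers to explore the surveyed methods. The authors hope that this paper will encourage more statisticians to work in this important and exciting field of generating and defending against adversarial examples.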
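Time-series data arises in many real-world applications (e.g., mobile health), and deep neural networks (DNNs) have shown great success in solving them. Despite this success, little is known about their robustness to adversarial attacks. In this paper, we propose a novel adversarial framework referred to as Time-Series Attacks via STATistical Features (TSA-STAT). To address the unique challenges of the time-series domain, TSA-STAT employs constraints on statistical features of the time-series data to construct adversarial examples. Optimized polynomial transformations are used to create attacks that are more effective (in terms of successfully fooling DNNs) than those based on additive perturbations. We also provide certified bounds on the norm of the statistical features for constructing adversarial examples. Our experiments on diverse real-world benchmark datasets show the effectiveness of TSA-STAT in fooling DNNs for the time-series domain and in improving their robustness. The source code of the TSA-STAT algorithms is available at https://github.com/tahabelkhouja/time-series-series-attacks-via-statity-features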
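In the past few years, convolutional neural networks (CNNs) have demonstrated promising performance in various real-world cybersecurity applications, such as network and multimedia security. However, the underlying fragility of CNN structures poses major security problems, making them unsuitable for security-oriented applications, including such computer networks. Protecting these architectures from adversarial attacks requires security-oriented architectures that are challenging to attack. In this study, we present a novel architecture based on an ensemble classifier that combines the enhanced security of 1-class classification (known as 1C) with the high performance of conventional 2-class classification (known as 2C) in the absence of attacks. Our architecture is referred to as the 1.5-class (Spritz-1.5C) classifier and is constructed using a final dense classifier, one 2C classifier (i.e., CNNs), and two parallel 1C classifiers (i.e., auto-encoders). In our experiments, we evaluated the robustness of the proposed architecture by considering eight possible adversarial attacks in various scenarios. We performed these attacks on the 2C and Spritz-1.5C architectures separately. The experimental results show that the attack success rate (ASR) of the I-FGSM attack against a 2C classifier trained on the N-BaIoT dataset is 0.9900. In contrast, the ASR for the Spritz-1.5C classifier is 0.0000.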
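Synthetic aperture radar (SAR) automatic target recognition (ATR) systems based on deep neural networks (DNNs) have been shown to be highly vulnerable to adversarial perturbations that are deliberately designed yet almost imperceptible, but that can bias DNN inference when added to targeted objects. This leads to serious safety concerns when applying DNNs to high-stakes SAR ATR applications. Enhancing the adversarial robustness of DNNs is therefore essential for deploying DNNs in modern real-world SAR ATR systems. Toward building more robust DNN-based SAR ATR models, this article explores the domain knowledge of the SAR imaging process and proposes a novel Scattering Model Guided Adversarial Attack (SMGAA) algorithm, which can generate adversarial perturbations in the form of electromagnetic scattering responses (called adversarial scatterers). The proposed SMGAA consists of two parts: 1) a parametric scattering model and corresponding imaging method, and 2) a customized gradient-based optimization algorithm. First, we introduce the effective Attributed Scattering Center Model (ASCM) and a general imaging method to describe the scattering behavior of typical geometric structures in the SAR imaging process. By further devising several strategies to take the domain knowledge of SAR target images into account and to relax the greedy search procedure, the proposed method does not need careful fine-tuning, yet can efficiently find effective ASCM parameters to fool SAR classifiers and facilitate robust model training. Comprehensive evaluations on the MSTAR dataset show that the adversarial scatterers generated by SMGAA are more robust to perturbations and transformations in the SAR processing chain than currently studied attacks, and are effective for constructing a defensive model against malicious scatterers.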
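In this paper, we propose a defense strategy that improves adversarial robustness by incorporating hidden layer representations. The key idea of this defense strategy is to compress or filter the input information, including adversarial perturbations. The defense strategy can be regarded as an activation function and can therefore be applied to any kind of neural network. We also prove the effectiveness of this defense strategy theoretically under certain conditions. In addition, by incorporating hidden layer representations, we propose three types of adversarial attacks that generate three types of adversarial examples, respectively. Experiments show that our defense method can significantly improve the adversarial robustness of deep neural networks and achieves state-of-the-art performance even without adversarial training.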