Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance over many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone -- sans a reasoning of how said answer was derived -- is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image -- hardly influencing the network's output -- we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-agnostic, adversarial attack on XAI algorithms, only requires access to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We compare our method's performance on two benchmark datasets -- CIFAR100 and ImageNet -- using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. Our novel algorithm is successfully able to manipulate an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.
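To make the black-box setting described above concrete, here is a minimal, hypothetical sketch of an evolution-strategy loop that perturbs an image so its explanation map moves toward a target while the logits stay close to the clean prediction. The functions `query_logits` and `query_explanation` are placeholders standing in for the two oracles the attack assumes (classifier logits and the XAI method's map); this is an illustrative sketch, not the authors' AttaXAI implementation.

```python
import numpy as np

def query_logits(x):          # placeholder: returns classifier logits for image x
    return np.tanh(x.mean(axis=(1, 2)))

def query_explanation(x):     # placeholder: returns an explanation map for image x
    return np.abs(x).mean(axis=0)

def attack_explanation(x, target_map, sigma=0.01, pop=32, iters=200, alpha=1.0):
    """Evolve an additive perturbation so the explanation map approaches
    `target_map` while the logits stay close to those of the clean image."""
    base_logits = query_logits(x)
    delta = np.zeros_like(x)
    for _ in range(iters):
        noise = np.random.randn(pop, *x.shape) * sigma
        scores = []
        for n in noise:
            cand = np.clip(x + delta + n, 0.0, 1.0)
            expl_loss = np.mean((query_explanation(cand) - target_map) ** 2)
            logit_loss = np.mean((query_logits(cand) - base_logits) ** 2)
            scores.append(expl_loss + alpha * logit_loss)      # lower is better
        scores = np.asarray(scores)
        # Crude ES-style step: move the perturbation toward the lowest-loss candidates.
        delta += noise[np.argsort(scores)[: pop // 4]].mean(axis=0)
    return np.clip(x + delta, 0.0, 1.0)

adv = attack_explanation(np.random.rand(3, 32, 32), target_map=np.zeros((32, 32)))
```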
Deep neural networks (DNNs) are sensitive to adversarial data in a variety of scenarios, including the black-box scenario, in which the attacker is only allowed to query the trained model and receive an output. Existing black-box methods for creating adversarial instances are costly, often using gradient estimation or training a substitute network. This paper introduces \textit{Attackar}, an evolutionary, score-based, black-box attack. Attackar is based on a novel objective function that can be used in gradient-free optimization problems. The attack only requires access to the output logits of the classifier and is thus not affected by gradient masking. No additional information is needed, rendering our method more suitable to real-life situations. We test its performance on three benchmark datasets (MNIST, CIFAR10, and ImageNet) using three different state-of-the-art models (Inception-v3, ResNet-50, and VGG-16-BN). Furthermore, we evaluate Attackar's performance against non-differentiable transformation defenses and state-of-the-art robust models. Our results demonstrate the superior performance of Attackar in terms of both accuracy score and query efficiency.
Current neural network-based classifiers are susceptible to adversarial examples even in the black-box setting, where the attacker only has query access to the model. In practice, the threat model for real-world systems is often more restrictive than the typical black-box model where the adversary can observe the full output of the network on arbitrarily many chosen inputs. We define three realistic threat models that more accurately characterize many real-world classifiers: the query-limited setting, the partial-information setting, and the label-only setting. We develop new attacks that fool classifiers under these more restrictive threat models, where previous methods would be impractical or ineffective. We demonstrate that our methods are effective against an ImageNet classifier under our proposed threat models. We also demonstrate a targeted black-box attack against a commercial classifier, overcoming the challenges of limited query access, partial information, and other practical issues to break the Google Cloud Vision API.
Many state-of-the-art ML models have outperformed humans in various tasks such as image classification. With such outstanding performance, ML models are widely used today. However, the existence of adversarial attacks and data poisoning attacks raises real concerns about the robustness of ML models. For instance, Engstrom et al. demonstrated that state-of-the-art image classifiers can easily be fooled by small rotations of arbitrary images. As ML systems are increasingly incorporated into safety- and security-sensitive applications, adversarial attacks and data poisoning attacks pose a considerable threat. This chapter focuses on two broad and important areas of ML security: adversarial attacks and data poisoning attacks.
Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern about the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability to generate barely noticeable (to both humans and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack on DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack, and importance sampling techniques to efficiently attack black-box models.
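The core idea behind ZOO can be illustrated with a short sketch of zeroth-order coordinate-wise gradient estimation: each coordinate's partial derivative is approximated by a symmetric finite difference using only score queries. `f` below is a placeholder for an attack loss built from the targeted DNN's confidence scores; this is illustrative, not the paper's code.

```python
import numpy as np

def zoo_coordinate_gradients(f, x, coords, h=1e-4):
    """Estimate the gradient of f at x along the chosen coordinates only."""
    grad = np.zeros_like(x)
    for i in coords:                      # e.g., a random or importance-sampled subset
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)   # two queries per coordinate
    return grad

# Toy usage: one zeroth-order coordinate-descent step on a quadratic "loss".
f = lambda x: float(np.sum((x - 0.5) ** 2))
x = np.random.rand(32 * 32 * 3)
coords = np.random.choice(x.size, size=128, replace=False)
x -= 0.01 * zoo_coordinate_gradients(f, x, coords)
```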
Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
Fooling deep neural networks (DNNs) with black-box optimization has become a popular adversarial attack fashion, as the structural prior knowledge of DNNs is always unknown. Nevertheless, recent black-box adversarial attacks may struggle to balance attack capability against the visual quality of the generated adversarial examples (AEs) when tackling high-resolution images. In this paper, we propose an attention-guided black-box adversarial attack based on large-scale multiobjective evolutionary optimization, termed LMOA. By considering the spatial semantic information of images, we first utilize the attention map to determine which pixels to perturb. Rather than attacking the entire image, reducing the number of perturbed pixels with the attention mechanism helps avoid the notorious curse of dimensionality and thereby improves attack performance. Second, a large-scale multiobjective evolutionary algorithm is employed to traverse the reduced pixels within the salient region. Benefiting from its characteristics, the generated AEs have the potential to fool the target DNN while remaining imperceptible to human vision. Extensive experimental results have verified the effectiveness of the proposed LMOA on the ImageNet dataset. More importantly, it is more competitive than existing black-box adversarial attacks at generating high-resolution AEs with better visual quality.
Deep learning methods have gained increased attention in various applications due to their outstanding performance. For exploring how this high performance relates to the proper use of data artifacts and the accurate problem formulation of a given task, interpretation models have become a crucial component in developing deep learning-based systems. Interpretation models enable the understanding of the inner workings of deep learning models and offer a sense of security in detecting the misuse of artifacts in the input data. Similar to prediction models, interpretation models are also susceptible to adversarial inputs. This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model. We assess the effectiveness of proposed attacks against two deep learning model architectures coupled with four interpretation models that represent different categories of interpretation models. Our experiments include the attack implementation using various attack frameworks. We also explore the potential countermeasures against such attacks. Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters, and highlights insights to improve and circumvent the attacks.
With the rapid development of machine learning technology, deep learning models have been deployed in almost every aspect of daily life. However, the privacy and security of these models are threatened by adversarial attacks, among which black-box attacks, in which only limited knowledge of the model can be obtained, are closer to reality. In this paper, we provide basic background knowledge on adversarial attacks and comprehensively analyze four black-box attack algorithms: Bandits, NES, Square Attack, and ZO-SignSGD. We also explore a newly proposed variant of the Square Attack method in the hope of improving its query efficiency.
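As a concrete illustration of one of the surveyed methods, here is a small sketch of the NES gradient estimator: the gradient of a black-box attack loss is approximated from antithetic Gaussian samples, using only loss values returned by queries. `loss` is a placeholder objective; this is illustrative, not the surveyed implementation.

```python
import numpy as np

def nes_gradient(loss, x, sigma=0.001, n=50):
    """Antithetic NES estimate of grad_x loss(x) from 2*n queries."""
    grad = np.zeros_like(x)
    for _ in range(n):
        u = np.random.randn(*x.shape)
        grad += (loss(x + sigma * u) - loss(x - sigma * u)) * u
    return grad / (2 * n * sigma)

loss = lambda x: float(np.sum(x ** 2))          # toy stand-in for the attack loss
x = np.random.rand(3, 32, 32)
g = nes_gradient(loss, x)
x_adv = np.clip(x + 0.01 * np.sign(g), 0, 1)    # one sign-gradient step on the estimate
```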
In recent years, explainable artificial intelligence (XAI) has emerged as a well-suited framework for generating human-understandable explanations of "black-box" models. In this paper, a novel XAI visual explanation algorithm called the Similarity Difference and Uniqueness (SIDU) method is presented, which can effectively localize the entire object regions responsible for a prediction. The robustness and effectiveness of the SIDU algorithm are analyzed through various computational and human-subject experiments. In particular, the SIDU algorithm is assessed using three different types of evaluation (application-grounded, human-grounded, and functionally grounded) to demonstrate its superior performance. The robustness of SIDU is further studied in the presence of adversarial attacks on the "black-box" model to better understand its performance. Our code is available at: https://github.com/satyamahesh84/sidu_xai_code.
Recent research has revealed that the output of Deep Neural Networks (DNN) can be easily altered by adding relatively small perturbations to the input vector. In this paper, we analyze an attack in an extremely limited scenario where only one pixel can be modified. For that we propose a novel method for generating one-pixel adversarial perturbations based on differential evolution (DE). It requires less adversarial information (a black-box attack) and can fool more types of networks due to the inherent features of DE. The results show that 67.97% of the natural images in the Kaggle CIFAR-10 test dataset and 16.04% of the ImageNet (ILSVRC 2012) test images can be perturbed to at least one target class by modifying just one pixel, with 74.03% and 22.91% confidence on average. We also show the same vulnerability on the original CIFAR-10 dataset. Thus, the proposed attack explores a different take on adversarial machine learning in an extremely limited scenario, showing that current DNNs are also vulnerable to such low-dimension attacks. Besides, we also illustrate an important application of DE (or, broadly speaking, evolutionary computation) in the domain of adversarial machine learning: creating tools that can effectively generate low-cost adversarial attacks against neural networks for evaluating robustness.
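A compact, hypothetical sketch of the one-pixel idea above: each DE candidate encodes (row, col, r, g, b), and the fitness to minimize is the model's confidence in the true class. `predict_proba` is a placeholder for the attacked classifier; the DE settings are illustrative, not the paper's.

```python
import numpy as np
from scipy.optimize import differential_evolution

def predict_proba(img):                     # placeholder black-box classifier
    p = np.abs(img.mean(axis=(0, 1))) + 1e-6
    return p / p.sum()

def one_pixel_attack(img, true_label):
    h, w, _ = img.shape

    def apply(candidate, image):
        x, y, r, g, b = candidate
        out = image.copy()
        out[int(x), int(y)] = (r, g, b)     # overwrite a single pixel
        return out

    def fitness(candidate):                 # confidence in the true class (minimize)
        return predict_proba(apply(candidate, img))[true_label]

    bounds = [(0, h - 1), (0, w - 1), (0, 1), (0, 1), (0, 1)]
    result = differential_evolution(fitness, bounds, maxiter=75, popsize=10, tol=1e-5)
    return apply(result.x, img)

adv = one_pixel_attack(np.random.rand(32, 32, 3), true_label=0)
```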
The authors thank Nicholas Carlini (UC Berkeley) and Dimitris Tsipras (MIT) for feedback to improve the survey quality. We also acknowledge X. Huang (Uni. Liverpool), K. R. Reddy (IISC), E. Valle (UNICAMP), Y. Yoo (CLAIR) and others for providing pointers to make the survey more comprehensive.
Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., small changes to an input image that induce misclassification, thereby threatening the reliability of deployed deep-learning-based systems. Adversarial training (AT) is often adopted to improve the robustness of DNNs by training on a mixture of corrupted and clean data. However, most AT-based methods are ineffective at dealing with \textit{transferred adversarial examples}, which are generated to fool a wide variety of defense models, and thus cannot satisfy the generalization requirements of real-world scenarios. Moreover, adversarially training a defense model generally does not produce interpretable predictions for perturbed inputs, while domain experts require a highly interpretable robust model to understand the behavior of DNNs. In this work, we propose a method based on the Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and also regularizes perturbation-based saliency maps so as to imitate the model's interpretable predictions. In this way, we achieve both improved defense capability and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments show that the proposed J-SIGR confers robustness against transferred adversarial attacks, and we also show that the predictions of the neural network are easy to interpret.
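One plausible reading of the training objective sketched above combines a standard classification loss, a Jacobian-norm term promoting linearized robustness, and a selective input-gradient term shaping the saliency maps. The weights $\lambda_J$, $\lambda_S$ and the masking operator $M$ are illustrative notation, not taken from the paper.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{(x,y)}\Big[
    \mathcal{L}_{\mathrm{CE}}\big(f_\theta(x), y\big)
    + \lambda_J \,\big\lVert J_{f_\theta}(x) \big\rVert_F^{2}
    + \lambda_S \,\big\lVert M(x) \odot \nabla_x \mathcal{L}_{\mathrm{CE}}\big(f_\theta(x), y\big) \big\rVert_1
  \Big]
```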
Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
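A minimal sketch of the detection scheme described above: compare the model's prediction on the original input with predictions on a bit-depth-reduced copy and a median-smoothed copy, and flag the input when the largest distance exceeds a threshold. `predict_proba`, the squeezer settings, and the threshold are illustrative placeholders, not the paper's configuration.

```python
import numpy as np
from scipy.ndimage import median_filter

def predict_proba(img):                          # placeholder for the defended classifier
    p = np.abs(img.mean(axis=(0, 1))) + 1e-6
    return p / p.sum()

def reduce_bit_depth(img, bits=4):
    """Squeeze pixel values (here: floats in [0,1]) down to `bits` bits of depth."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def is_adversarial(img, threshold=0.5):
    p_orig = predict_proba(img)
    p_bits = predict_proba(reduce_bit_depth(img))
    p_smooth = predict_proba(median_filter(img, size=(2, 2, 1)))   # spatial smoothing
    score = max(np.abs(p_orig - p_bits).sum(), np.abs(p_orig - p_smooth).sum())
    return score > threshold                     # large L1 gap -> likely adversarial

print(is_adversarial(np.random.rand(32, 32, 3)))
```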
The existence of adversarial attacks on convolutional neural networks (CNNs) questions the fitness of such models for serious applications. The attacks manipulate an input image such that misclassification is evoked while the image still appears normal to a human observer; hence they are not easily detected. In a different context, back-propagated activations of CNN hidden layers -- the "feature responses" to a given input -- have been used to help human "debuggers" visualize what a CNN "looks at" while computing its output. In this work, we propose a novel detection method to protect against such attacks. We do so by tracking adversarial perturbations in the feature responses, which allows automatic detection using the average local spatial entropy. The method does not alter the original network architecture and is fully human-interpretable. Experiments confirm the effectiveness of our method against state-of-the-art attack methods on large-scale models trained on ImageNet.
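A rough sketch of the "average local spatial entropy" statistic mentioned above, computed on a 2-D feature-response map: slide a window over the map, histogram the values inside each window, and average the Shannon entropies. The window size, bin count, and how the threshold is set are illustrative choices, not taken from the paper.

```python
import numpy as np

def average_local_spatial_entropy(feature_map, win=8, bins=16):
    h, w = feature_map.shape
    fmin, fmax = feature_map.min(), feature_map.max()
    entropies = []
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            patch = feature_map[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(fmin, fmax))
            p = hist / hist.sum()
            p = p[p > 0]
            entropies.append(-(p * np.log2(p)).sum())   # Shannon entropy of the patch
    return float(np.mean(entropies))

# The statistic of a suspect input's feature response would be compared against
# statistics gathered on clean data to flag a likely attack.
print(average_local_spatial_entropy(np.random.rand(56, 56)))
```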
With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks have been recently found vulnerable to well-designed input samples, called adversarial examples. Adversarial perturbations are imperceptible to human but can easily fool deep neural networks in the testing/deploying stage. The vulnerability to adversarial examples becomes one of the major risks for applying deep neural networks in safety-critical environments. Therefore, attacks and defenses on adversarial examples draw great attention. In this paper, we review recent findings on adversarial examples for deep neural networks, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications for adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples. In addition, three major challenges in adversarial examples and the potential solutions are discussed.
Video classification systems are vulnerable to adversarial attacks, which can create severe security problems in video verification. Current black-box attacks need a large number of queries to succeed, resulting in high computational overhead in the process of attack. On the other hand, attacks with restricted perturbations are ineffective against defenses such as denoising or adversarial training. In this paper, we focus on unrestricted perturbations and propose StyleFool, a black-box video adversarial attack via style transfer to fool the video classification system. StyleFool first utilizes color theme proximity to select the best style image, which helps avoid unnatural details in the stylized videos. Meanwhile, the target class confidence is additionally considered in targeted attacks to influence the output distribution of the classifier by moving the stylized video closer to or even across the decision boundary. A gradient-free method is then employed to further optimize the adversarial perturbations. We carry out extensive experiments to evaluate StyleFool on two standard datasets, UCF-101 and HMDB-51. The experimental results demonstrate that StyleFool outperforms the state-of-the-art adversarial attacks in terms of both the number of queries and the robustness against existing defenses. Moreover, 50% of the stylized videos in untargeted attacks do not need any query since they can already fool the video classification model. Furthermore, we evaluate the indistinguishability through a user study to show that the adversarial samples of StyleFool look imperceptible to human eyes, despite unrestricted perturbations.
As deep neural networks (DNNs) are applied ever more broadly across algorithms and frameworks, security threats have become one of the main concerns. Adversarial attacks disturb DNN-based image classifiers, whereby an attacker can deliberately add imperceptible adversarial perturbations to fool the classifier. In this paper, we propose a novel purification approach, referred to as Guided Diffusion Model for Purification (GDMP), to help protect classifiers against adversarial attacks. The core of our approach is to embed purification into the diffusion-denoising process of a Denoising Diffusion Probabilistic Model (DDPM), so that its diffusion process can submerge adversarial perturbations under gradually added Gaussian noise, and both kinds of noise can be removed simultaneously during the guided denoising process. In comprehensive experiments across various datasets, the proposed GDMP is shown to reduce the perturbations introduced by adversarial attacks to a shallow range, thereby significantly improving classification correctness. GDMP improves robust accuracy by 5%, obtaining 90.1% under PGD attack on the CIFAR10 dataset. Moreover, GDMP achieves 70.94% robustness on the challenging ImageNet dataset.
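A highly simplified sketch of diffusion-based purification in the spirit described above: push the (possibly adversarial) input a few steps into the forward diffusion process, drowning small perturbations in Gaussian noise, then iteratively denoise. `denoise_step` stands in for a pretrained DDPM reverse step plus guidance; it is a placeholder, not the authors' model.

```python
import numpy as np

def denoise_step(x_t, t):                      # placeholder for a guided DDPM reverse step
    return 0.9 * x_t + 0.1 * np.clip(x_t, 0, 1)

def purify(x_adv, t_star=10, betas=None):
    betas = betas if betas is not None else np.linspace(1e-4, 0.02, 1000)
    alpha_bar = np.cumprod(1.0 - betas)[t_star]
    # Forward diffusion: scale the image and submerge the perturbation in Gaussian noise.
    x_t = np.sqrt(alpha_bar) * x_adv + np.sqrt(1 - alpha_bar) * np.random.randn(*x_adv.shape)
    # Reverse (guided) denoising back to t = 0.
    for t in range(t_star, 0, -1):
        x_t = denoise_step(x_t, t)
    return np.clip(x_t, 0, 1)

purified = purify(np.random.rand(3, 32, 32))
```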
Time-series data arises in many real-world applications (e.g., mobile health), and deep neural networks (DNNs) have achieved great success in solving them. Despite this success, little is known about their robustness to adversarial attacks. In this paper, we propose a novel adversarial framework called Time-Series Attacks via STATistical features (TSA-STAT). To address the unique challenges of the time-series domain, TSA-STAT imposes constraints on the statistical features of the time-series data to construct adversarial examples. Optimized polynomial transformations are used to create attacks that are more effective (in terms of successfully fooling DNNs) than attacks based on additive perturbations. We also provide certified bounds on the norm of the statistical features used to construct adversarial examples. Our experiments on diverse real-world benchmark datasets show the effectiveness of TSA-STAT in fooling DNNs in the time-series domain and in improving their robustness. The source code of the TSA-STAT algorithm is available at https://github.com/tahabelkhouja/time-series-series-attacks-via-statity-features
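One way to write the constrained objective implied above, in illustrative notation: find a polynomial transformation of the input time-series that maximizes the classifier's loss while keeping a set of statistical features close to those of the clean signal. Here $S_j$ denotes the $j$-th statistical feature (e.g., mean, standard deviation), $a_k$ the polynomial coefficients, and $\epsilon_j$ the attacker's statistical tolerances; none of these symbols are taken from the paper.

```latex
\max_{a_0,\dots,a_d} \ \mathcal{L}\big(f_\theta(x'),\, y\big)
\quad \text{with} \quad x' = \sum_{k=0}^{d} a_k \, x^{k},
\qquad \text{s.t.} \quad \big| S_j(x') - S_j(x) \big| \le \epsilon_j \ \ \forall j
```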
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.