Adversarial machine learning has been both a major concern and a hot topic recently, especially with the ubiquitous use of deep neural networks in the current landscape. Adversarial attacks and defenses are usually likened to a cat-and-mouse game in which defenders and attackers evolve over the time. On one hand, the goal is to develop strong and robust deep networks that are resistant to malicious actors. On the other hand, in order to achieve that, we need to devise even stronger adversarial attacks to challenge these defense models. Most of existing attacks employs a single $\ell_p$ distance (commonly, $p\in\{1,2,\infty\}$) to define the concept of closeness and performs steepest gradient ascent w.r.t. this $p$-norm to update all pixels in an adversarial example in the same way. These $\ell_p$ attacks each has its own pros and cons; and there is no single attack that can successfully break through defense models that are robust against multiple $\ell_p$ norms simultaneously. Motivated by these observations, we come up with a natural approach: combining various $\ell_p$ gradient projections on a pixel level to achieve a joint adversarial perturbation. Specifically, we learn how to perturb each pixel to maximize the attack performance, while maintaining the overall visual imperceptibility of adversarial examples. Finally, through various experiments with standardized benchmarks, we show that our method outperforms most current strong attacks across state-of-the-art defense mechanisms, while retaining its ability to remain clean visually.
translated by 谷歌翻译
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
translated by 谷歌翻译
The field of defense strategies against adversarial attacks has significantly grown over the last years, but progress is hampered as the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses could be broken later on, making it difficult to identify the state-of-the-art. Frequent pitfalls in the evaluation are improper tuning of hyperparameters of the attacks, gradient obfuscation or masking. In this paper we first propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 50 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
translated by 谷歌翻译
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS-and which illustrate a diverse set of defense strategies-can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result-showing that a defense was ineffective-this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
translated by 谷歌翻译
最近的研究表明,深度神经网络(DNNS)极易受到精心设计的对抗例子的影响。对那些对抗性例子的对抗性学习已被证明是防御这种攻击的最有效方法之一。目前,大多数现有的对抗示例生成方法基于一阶梯度,这几乎无法进一步改善模型的鲁棒性,尤其是在面对二阶对抗攻击时。与一阶梯度相比,二阶梯度提供了相对于自然示例的损失格局的更准确近似。受此启发的启发,我们的工作制作了二阶的对抗示例,并使用它们来训练DNNS。然而,二阶优化涉及Hessian Inverse的耗时计算。我们通过将问题转换为Krylov子空间中的优化,提出了一种近似方法,该方法显着降低了计算复杂性以加快训练过程。在矿工和CIFAR-10数据集上进行的广泛实验表明,我们使用二阶对抗示例的对抗性学习优于其他FISRT-阶方法,这可以改善针对广泛攻击的模型稳健性。
translated by 谷歌翻译
Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks. Unfortunately, the high cost of generating strong adversarial examples makes standard adversarial training impractical on large-scale problems like ImageNet. We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters.Our "free" adversarial training algorithm achieves comparable robustness to PGD adversarial training on the CIFAR-10 and CIFAR-100 datasets at negligible additional cost compared to natural training, and can be 7 to 30 times faster than other strong adversarial training methods. Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train a robust model for the large-scale ImageNet classification task that maintains 40% accuracy against PGD attacks. The code is available at https://github.com/ashafahi/free_adv_train.
translated by 谷歌翻译
对对抗性攻击的鲁棒性通常以对抗精度评估。但是,该指标太粗糙,无法正确捕获机器学习模型的所有鲁棒性。当对强烈的攻击进行评估时,许多防御能力并不能提供准确的改进,同时仍会部分贡献对抗性鲁棒性。流行的认证方法遇到了同一问题,因为它们提供了准确性的下限。为了捕获更精细的鲁棒性属性,我们提出了一个针对L2鲁棒性,对抗角稀疏性的新指标,该指标部分回答了“输入周围有多少个对抗性示例”的问题。我们通过评估“强”和“弱”的防御能力来证明其有用性。我们表明,一些最先进的防御能力具有非常相似的精度,在它们不强大的输入上可能具有截然不同的稀疏性。我们还表明,一些弱防御能力实际上会降低鲁棒性,而另一些防御能力则以无法捕获的准确性来加强它。这些差异可以预测这种防御与对抗性训练相结合时的实用性。
translated by 谷歌翻译
Designing powerful adversarial attacks is of paramount importance for the evaluation of $\ell_p$-bounded adversarial defenses. Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries. The search space of PGD is dictated by the steepest ascent directions of an objective. Despite the plethora of objective function choices, there is no universally superior option and robustness overestimation may arise from ill-suited objective selection. Driven by this observation, we postulate that the combination of different objectives through a simple loss alternating scheme renders PGD more robust towards design choices. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_{\infty}$-robust models and 3 datasets. The performance improvement is consistent, when compared to the single loss counterparts. In the CIFAR-10 dataset, our strongest adversarial attack outperforms all of the white-box components of AutoAttack (AA) ensemble, as well as the most powerful attacks existing on the literature, achieving state-of-the-art results in the computational budget of our study ($T=100$, no restarts).
translated by 谷歌翻译
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimizationbased attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining noncertified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
translated by 谷歌翻译
最大限度的训练原则,最大限度地减少最大的对抗性损失,也称为对抗性培训(AT),已被证明是一种提高对抗性鲁棒性的最先进的方法。尽管如此,超出了在对抗环境中尚未经过严格探索的最小最大优化。在本文中,我们展示了如何利用多个领域的最小最大优化的一般框架,以推进不同类型的对抗性攻击的设计。特别是,给定一组风险源,最小化最坏情况攻击损失可以通过引入在域集的概率单纯x上最大化的域权重来重新重整为最小最大问题。我们在三次攻击生成问题中展示了这个统一的框架 - 攻击模型集合,在多个输入下设计了通用扰动,并制作攻击对数据转换的弹性。广泛的实验表明,我们的方法导致对现有的启发式策略以及对培训的最先进的防御方法而言,鲁棒性改善,培训对多种扰动类型具有稳健。此外,我们发现,从我们的MIN-MAX框架中学到的自调整域权重可以提供整体工具来解释跨域攻击难度的攻击水平。代码可在https://github.com/wangjksjtu/minmaxsod中获得。
translated by 谷歌翻译
许多最先进的ML模型在各种任务中具有优于图像分类的人类。具有如此出色的性能,ML模型今天被广泛使用。然而,存在对抗性攻击和数据中毒攻击的真正符合ML模型的稳健性。例如,Engstrom等人。证明了最先进的图像分类器可以容易地被任意图像上的小旋转欺骗。由于ML系统越来越纳入安全性和安全敏感的应用,对抗攻击和数据中毒攻击构成了相当大的威胁。本章侧重于ML安全的两个广泛和重要的领域:对抗攻击和数据中毒攻击。
translated by 谷歌翻译
We propose the Square Attack, a score-based black-box l2and l∞-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. Square Attack is based on a randomized search scheme which selects localized squareshaped updates at random positions so that at each iteration the perturbation is situated approximately at the boundary of the feasible set. Our method is significantly more query efficient and achieves a higher success rate compared to the state-of-the-art methods, especially in the untargeted setting. In particular, on ImageNet we improve the average query efficiency in the untargeted setting for various deep networks by a factor of at least 1.8 and up to 3 compared to the recent state-ofthe-art l∞-attack of Al-Dujaili & OReilly (2020). Moreover, although our attack is black-box, it can also outperform gradient-based white-box attacks on the standard benchmarks achieving a new state-of-the-art in terms of the success rate. The code of our attack is available at https://github.com/max-andr/square-attack.
translated by 谷歌翻译
深度学习(DL)系统的安全性是一个极为重要的研究领域,因为它们正在部署在多个应用程序中,因为它们不断改善,以解决具有挑战性的任务。尽管有压倒性的承诺,但深度学习系统容易受到制作的对抗性例子的影响,这可能是人眼无法察觉的,但可能会导致模型错误分类。对基于整体技术的对抗性扰动的保护已被证明很容易受到更强大的对手的影响,或者证明缺乏端到端评估。在本文中,我们试图开发一种新的基于整体的解决方案,该解决方案构建具有不同决策边界的防御者模型相对于原始模型。通过(1)通过一种称为拆分和剃须的方法转换输入的分类器的合奏,以及(2)通过一种称为对比度功能的方法限制重要特征,显示出相对于相对于不同的梯度对抗性攻击,这减少了将对抗性示例从原始示例转移到针对同一类的防御者模型的机会。我们使用标准图像分类数据集(即MNIST,CIFAR-10和CIFAR-100)进行了广泛的实验,以实现最新的对抗攻击,以证明基于合奏的防御的鲁棒性。我们还在存在更强大的对手的情况下评估稳健性,该对手同时靶向合奏中的所有模型。已经提供了整体假阳性和误报的结果,以估计提出的方法的总体性能。
translated by 谷歌翻译
Neural networks provide state-of-the-art results for most machine learning tasks. Unfortunately, neural networks are vulnerable to adversarial examples: given an input x and any target classification t, it is possible to find a new input x that is similar to x but classified as t. This makes it difficult to apply neural networks in security-critical areas. Defensive distillation is a recently proposed approach that can take an arbitrary neural network, and increase its robustness, reducing the success rate of current attacks' ability to find adversarial examples from 95% to 0.5%.In this paper, we demonstrate that defensive distillation does not significantly increase the robustness of neural networks by introducing three new attack algorithms that are successful on both distilled and undistilled neural networks with 100% probability. Our attacks are tailored to three distance metrics used previously in the literature, and when compared to previous adversarial example generation algorithms, our attacks are often much more effective (and never worse). Furthermore, we propose using high-confidence adversarial examples in a simple transferability test we show can also be used to break defensive distillation. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples.
translated by 谷歌翻译
评估防御模型的稳健性是对抗对抗鲁棒性研究的具有挑战性的任务。僵化的渐变,先前已经发现了一种梯度掩蔽,以许多防御方法存在并导致鲁棒性的错误信号。在本文中,我们确定了一种更细微的情况,称为不平衡梯度,也可能导致过高的对抗性鲁棒性。当边缘损耗的一个术语的梯度主导并将攻击朝向次优化方向推动时,发生不平衡梯度的现象。为了利用不平衡的梯度,我们制定了分解利润率损失的边缘分解(MD)攻击,并通过两阶段过程分别探讨了这些术语的攻击性。我们还提出了一个Multared和Ensemble版本的MD攻击。通过调查自2018年以来提出的17个防御模型,我们发现6种型号易受不平衡梯度的影响,我们的MD攻击可以减少由最佳基线独立攻击评估的鲁棒性另外2%。我们还提供了对不平衡梯度的可能原因和有效对策的深入分析。
translated by 谷歌翻译
The evaluation of robustness against adversarial manipulation of neural networks-based classifiers is mainly tested with empirical attacks as methods for the exact computation, even when available, do not scale to large networks. We propose in this paper a new white-box adversarial attack wrt the l p -norms for p ∈ {1, 2, ∞} aiming at finding the minimal perturbation necessary to change the class of a given input. It has an intuitive geometric meaning, yields quickly high quality results, minimizes the size of the perturbation (so that it returns the robust accuracy at every threshold with a single run). It performs better or similar to stateof-the-art attacks which are partially specialized to one l p -norm, and is robust to the phenomenon of gradient masking.
translated by 谷歌翻译
现代神经网络能够在涉及对象分类和图像生成的许多任务中执行至少和人类。然而,人类难以察觉的小扰动可能会显着降低训练有素的深神经网络的性能。我们提供了分布稳健的优化(DRO)框架,其集成了基于人的图像质量评估方法,以设计对人类来说难以察觉而难以察觉的最佳攻击,而是针对深度神经网络造成显着损害。通过广泛的实验,我们表明我们的攻击算法比其他最先进的人类难以察觉的攻击方法产生更好的质量(对人类)的攻击。此外,我们证明了使用我们最佳设计的人类难以察觉的攻击的DRO培训可以改善图像分类中的群体公平。在最后,我们提供了一种算法实现,以显着加速DRO训练,这可能是独立的兴趣。
translated by 谷歌翻译
评估对抗性鲁棒性的量,以找到有输入样品被错误分类所需的最小扰动。底层优化的固有复杂性需要仔细调整基于梯度的攻击,初始化,并且可能为许多计算苛刻的迭代而被执行,即使专门用于给定的扰动模型也是如此。在这项工作中,我们通过提出使用不同$ \ ell_p $ -norm扰动模型($ p = 0,1,2,\ idty $)的快速最小规范(FMN)攻击来克服这些限制(FMN)攻击选择,不需要对抗性起点,并在很少的轻量级步骤中收敛。它通过迭代地发现在$ \ ell_p $ -norm的最大信心被错误分类的样本进行了尺寸的尺寸$ \ epsilon $的限制,同时适应$ \ epsilon $,以最小化当前样本到决策边界的距离。广泛的实验表明,FMN在收敛速度和计算时间方面显着优于现有的攻击,同时报告可比或甚至更小的扰动尺寸。
translated by 谷歌翻译
我们表明,当考虑到图像域$ [0,1] ^ D $时,已建立$ L_1 $ -Projected梯度下降(PGD)攻击是次优,因为它们不认为有效的威胁模型是交叉点$ l_1 $ -ball和$ [0,1] ^ d $。我们研究了这种有效威胁模型的最陡渐进步骤的预期稀疏性,并表明该组上的确切投影是计算可行的,并且产生更好的性能。此外,我们提出了一种自适应形式的PGD,即使具有小的迭代预算,这也是非常有效的。我们的结果$ l_1 $ -apgd是一个强大的白盒攻击,表明先前的作品高估了他们的$ l_1 $ -trobustness。使用$ l_1 $ -apgd for vercersarial培训,我们获得一个强大的分类器,具有sota $ l_1 $ -trobustness。最后,我们将$ l_1 $ -apgd和平方攻击的适应组合到$ l_1 $ to $ l_1 $ -autoattack,这是一个攻击的集合,可靠地评估$ l_1 $ -ball与$的威胁模型的对抗鲁棒性进行对抗[ 0,1] ^ d $。
translated by 谷歌翻译
虽然深度神经网络在各种任务中表现出前所未有的性能,但对对抗性示例的脆弱性阻碍了他们在安全关键系统中的部署。许多研究表明,即使在黑盒设置中也可能攻击,其中攻击者无法访问目标模型的内部信息。大多数黑匣子攻击基于查询,每个都可以获得目标模型的输入输出,并且许多研究侧重于减少所需查询的数量。在本文中,我们注意了目标模型的输出完全对应于查询输入的隐含假设。如果将某些随机性引入模型中,它可以打破假设,因此,基于查询的攻击可能在梯度估计和本地搜索中具有巨大的困难,这是其攻击过程的核心。从这种动机来看,我们甚至观察到一个小的添加剂输入噪声可以中和大多数基于查询的攻击和名称这个简单但有效的方法小噪声防御(SND)。我们分析了SND如何防御基于查询的黑匣子攻击,并展示其与CIFAR-10和ImageNet数据集的八种最先进的攻击有效性。即使具有强大的防御能力,SND几乎保持了原始的分类准确性和计算速度。通过在推断下仅添加一行代码,SND很容易适用于预先训练的模型。
translated by 谷歌翻译