基于随机搜索方案的对抗攻击已经获得最近的最先进的导致黑盒稳健性评估。然而,正如我们在这项工作中所展示的那样,他们在不同查询预算制度中的效率取决于潜在的提案分布的手动设计和启发式调整。我们研究如何通过基于攻击期间获得的信息在线调整提案分发来解决这个问题。我们考虑方攻击,这是一种最先进的基于分数的黑盒攻击,并演示了学习控制器如何改进其性能,该控制器可以在攻击期间调整提案分布的参数。我们使用白色框访问的CIFAR10模型上使用基于梯度的端到端培训培训控制器。我们展示将学习的控制器堵塞到攻击中,始终如一地将其黑盒稳健性估计在不同的查询制度中提高到20%,对于具有黑盒访问的广泛不同型号。我们进一步表明,学习的自适应原理将原理转移到其他数据分布,例如CIFAR100或ImageNet以及目标攻击设置。
translated by 谷歌翻译
We propose the Square Attack, a score-based black-box l2and l∞-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. Square Attack is based on a randomized search scheme which selects localized squareshaped updates at random positions so that at each iteration the perturbation is situated approximately at the boundary of the feasible set. Our method is significantly more query efficient and achieves a higher success rate compared to the state-of-the-art methods, especially in the untargeted setting. In particular, on ImageNet we improve the average query efficiency in the untargeted setting for various deep networks by a factor of at least 1.8 and up to 3 compared to the recent state-ofthe-art l∞-attack of Al-Dujaili & OReilly (2020). Moreover, although our attack is black-box, it can also outperform gradient-based white-box attacks on the standard benchmarks achieving a new state-of-the-art in terms of the success rate. The code of our attack is available at https://github.com/max-andr/square-attack.
translated by 谷歌翻译
我们表明,当考虑到图像域$ [0,1] ^ D $时,已建立$ L_1 $ -Projected梯度下降(PGD)攻击是次优,因为它们不认为有效的威胁模型是交叉点$ l_1 $ -ball和$ [0,1] ^ d $。我们研究了这种有效威胁模型的最陡渐进步骤的预期稀疏性,并表明该组上的确切投影是计算可行的,并且产生更好的性能。此外,我们提出了一种自适应形式的PGD,即使具有小的迭代预算,这也是非常有效的。我们的结果$ l_1 $ -apgd是一个强大的白盒攻击,表明先前的作品高估了他们的$ l_1 $ -trobustness。使用$ l_1 $ -apgd for vercersarial培训,我们获得一个强大的分类器,具有sota $ l_1 $ -trobustness。最后,我们将$ l_1 $ -apgd和平方攻击的适应组合到$ l_1 $ to $ l_1 $ -autoattack,这是一个攻击的集合,可靠地评估$ l_1 $ -ball与$的威胁模型的对抗鲁棒性进行对抗[ 0,1] ^ d $。
translated by 谷歌翻译
The field of defense strategies against adversarial attacks has significantly grown over the last years, but progress is hampered as the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses could be broken later on, making it difficult to identify the state-of-the-art. Frequent pitfalls in the evaluation are improper tuning of hyperparameters of the attacks, gradient obfuscation or masking. In this paper we first propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 50 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
translated by 谷歌翻译
许多最先进的ML模型在各种任务中具有优于图像分类的人类。具有如此出色的性能,ML模型今天被广泛使用。然而,存在对抗性攻击和数据中毒攻击的真正符合ML模型的稳健性。例如,Engstrom等人。证明了最先进的图像分类器可以容易地被任意图像上的小旋转欺骗。由于ML系统越来越纳入安全性和安全敏感的应用,对抗攻击和数据中毒攻击构成了相当大的威胁。本章侧重于ML安全的两个广泛和重要的领域:对抗攻击和数据中毒攻击。
translated by 谷歌翻译
Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern on the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability of generating barely noticeable (to both human and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs.Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to * Pin-Yu Chen and Huan Zhang contribute equally to this work.
translated by 谷歌翻译
The goal of a decision-based adversarial attack on a trained model is to generate adversarial examples based solely on observing output labels returned by the targeted model. We develop HopSkipJumpAttack, a family of algorithms based on a novel estimate of the gradient direction using binary information at the decision boundary. The proposed family includes both untargeted and targeted attacks optimized for 2 and ∞ similarity metrics respectively. Theoretical analysis is provided for the proposed algorithms and the gradient direction estimate. Experiments show HopSkipJumpAttack requires significantly fewer model queries than several state-of-the-art decision-based adversarial attacks. It also achieves competitive performance in attacking several widely-used defense mechanisms.
translated by 谷歌翻译
深度卷积神经网络(CNN)很容易被输入图像的细微,不可察觉的变化所欺骗。为了解决此漏洞,对抗训练会创建扰动模式,并将其包括在培训设置中以鲁棒性化模型。与仅使用阶级有限信息的现有对抗训练方法(例如,使用交叉渗透损失)相反,我们建议利用功能空间中的其他信息来促进更强的对手,这些信息又用于学习强大的模型。具体来说,我们将使用另一类的目标样本的样式和内容信息以及其班级边界信息来创建对抗性扰动。我们以深入监督的方式应用了我们提出的多任务目标,从而提取了多尺度特征知识,以创建最大程度地分开对手。随后,我们提出了一种最大边缘对抗训练方法,该方法可最大程度地减少源图像与其对手之间的距离,并最大程度地提高对手和目标图像之间的距离。与最先进的防御能力相比,我们的对抗训练方法表明了强大的鲁棒性,可以很好地推广到自然发生的损坏和数据分配变化,并保留了清洁示例的模型准确性。
translated by 谷歌翻译
已知神经网络容易受到对抗性攻击的影响 - 轻微但精心构建的输入扰动,这会造成巨大损害网络的性能。已经提出了许多防御方法来通过培训对抗对抗扰动的投入来改善深网络的稳健性。然而,这些模型通常仍然容易受到在训练期间没有看到的新类型的攻击,甚至在以前看到的攻击中稍微强大。在这项工作中,我们提出了一种新的对抗性稳健性的方法,这在域适应领域的见解中建立了洞察力。我们的方法称为对抗性特征脱敏(AFD),目的是学习功能,这些特征是不变的对输入的对抗扰动。这是通过游戏实现的,我们学习了预测和鲁棒(对对抗性攻击不敏感)的特征,即不能用于区分自然和对抗数据。若干基准测试的经验结果证明了提出的方法对广泛的攻击类型和攻击优势的有效性。我们的代码可在https://github.com/bashivanlab/afd获得。
translated by 谷歌翻译
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
translated by 谷歌翻译
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS-and which illustrate a diverse set of defense strategies-can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result-showing that a defense was ineffective-this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
translated by 谷歌翻译
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
translated by 谷歌翻译
尽管机器学习系统的效率和可扩展性,但最近的研究表明,许多分类方法,尤其是深神经网络(DNN),易受对抗的例子;即,仔细制作欺骗训练有素的分类模型的例子,同时无法区分从自然数据到人类。这使得在安全关键区域中应用DNN或相关方法可能不安全。由于这个问题是由Biggio等人确定的。 (2013)和Szegedy等人。(2014年),在这一领域已经完成了很多工作,包括开发攻击方法,以产生对抗的例子和防御技术的构建防范这些例子。本文旨在向统计界介绍这一主题及其最新发展,主要关注对抗性示例的产生和保护。在数值实验中使用的计算代码(在Python和R)公开可用于读者探讨调查的方法。本文希望提交人们将鼓励更多统计学人员在这种重要的令人兴奋的领域的产生和捍卫对抗的例子。
translated by 谷歌翻译
Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with stronger robustness to blackbox attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks (Kurakin et al., 2017c). However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.
translated by 谷歌翻译
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimizationbased attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining noncertified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
translated by 谷歌翻译
Current neural network-based classifiers are susceptible to adversarial examples even in the black-box setting, where the attacker only has query access to the model. In practice, the threat model for real-world systems is often more restrictive than the typical black-box model where the adversary can observe the full output of the network on arbitrarily many chosen inputs. We define three realistic threat models that more accurately characterize many real-world classifiers: the query-limited setting, the partialinformation setting, and the label-only setting. We develop new attacks that fool classifiers under these more restrictive threat models, where previous methods would be impractical or ineffective. We demonstrate that our methods are effective against an ImageNet classifier under our proposed threat models. We also demonstrate a targeted black-box attack against a commercial classifier, overcoming the challenges of limited query access, partial information, and other practical issues to break the Google Cloud Vision API.
translated by 谷歌翻译
对抗性培训已被广泛用于增强神经网络模型对抗对抗攻击的鲁棒性。但是,自然准确性与强大的准确性之间仍有一个显着的差距。我们发现其中一个原因是常用的标签,单热量矢量,阻碍了图像识别的学习过程。在本文中,我们提出了一种称为低温蒸馏(LTD)的方法,该方法基于知识蒸馏框架来产生所需的软标记。与以前的工作不同,LTD在教师模型中使用相对较低的温度,采用不同但固定的,温度为教师模型和学生模型。此外,我们已经调查了有限公司协同使用自然数据和对抗性的方法。实验结果表明,在没有额外的未标记数据的情况下,所提出的方法与上一项工作相结合,可以分别在CiFar-10和CiFar-100数据集上实现57.72 \%和30.36 \%的鲁棒精度,这是州的大约1.21 \%通常的方法平均。
translated by 谷歌翻译
作为研究界,我们仍然缺乏对对抗性稳健性的进展的系统理解,这通常使得难以识别训练强大模型中最有前途的想法。基准稳健性的关键挑战是,其评估往往是出错的导致鲁棒性高估。我们的目标是建立对抗性稳健性的标准化基准,尽可能准确地反映出考虑在合理的计算预算范围内所考虑的模型的稳健性。为此,我们首先考虑图像分类任务并在允许的型号上引入限制(可能在将来宽松)。我们评估了与AutoAtrack的对抗鲁棒性,白和黑箱攻击的集合,最近在大规模研究中显示,与原始出版物相比,改善了几乎所有稳健性评估。为防止对自动攻击进行新防御的过度适应,我们欢迎基于自适应攻击的外部评估,特别是在自动攻击稳健性潜在高估的地方。我们的排行榜,托管在https://robustbench.github.io/,包含120多个模型的评估,并旨在反映在$ \ ell_ \ infty $的一套明确的任务上的图像分类中的当前状态 - 和$ \ ell_2 $ -Threat模型和共同腐败,未来可能的扩展。此外,我们开源源是图书馆https://github.com/robustbench/robustbench,可以提供对80多个强大模型的统一访问,以方便他们的下游应用程序。最后,根据收集的模型,我们分析了稳健性对分布换档,校准,分配检测,公平性,隐私泄漏,平滑度和可转移性的影响。
translated by 谷歌翻译
Designing powerful adversarial attacks is of paramount importance for the evaluation of $\ell_p$-bounded adversarial defenses. Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries. The search space of PGD is dictated by the steepest ascent directions of an objective. Despite the plethora of objective function choices, there is no universally superior option and robustness overestimation may arise from ill-suited objective selection. Driven by this observation, we postulate that the combination of different objectives through a simple loss alternating scheme renders PGD more robust towards design choices. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_{\infty}$-robust models and 3 datasets. The performance improvement is consistent, when compared to the single loss counterparts. In the CIFAR-10 dataset, our strongest adversarial attack outperforms all of the white-box components of AutoAttack (AA) ensemble, as well as the most powerful attacks existing on the literature, achieving state-of-the-art results in the computational budget of our study ($T=100$, no restarts).
translated by 谷歌翻译
深度神经网络已成为现代图像识别系统的驱动力。然而,神经网络对抗对抗性攻击的脆弱性对受这些系统影响的人构成严重威胁。在本文中,我们专注于一个真实的威胁模型,中间对手恶意拦截和erturbs网页用户上传在线。这种类型的攻击可以在简单的性能下降之上提高严重的道德问题。为了防止这种攻击,我们设计了一种新的双层优化算法,该算法在对抗对抗扰动的自然图像附近找到点。CiFar-10和Imagenet的实验表明我们的方法可以有效地强制在给定的修改预算范围内的自然图像。我们还显示所提出的方法可以在共同使用随机平滑时提高鲁棒性。
translated by 谷歌翻译