The field of defense strategies against adversarial attacks has grown significantly in recent years, but progress is hampered because the evaluation of adversarial defenses is often insufficient and thus gives a false impression of robustness. Many promising defenses have later been broken, making it difficult to identify the state of the art. Frequent pitfalls in evaluation are improper tuning of attack hyperparameters and gradient obfuscation or masking. In this paper we first propose two extensions of the PGD attack that overcome failures due to suboptimal step sizes and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 50 models from papers published at recent top machine learning and computer vision venues. In all but one case we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
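To illustrate the step-size issue, below is a minimal sketch (not the authors' exact APGD schedule) of a PGD variant that tracks the best point found so far and halves its step size at fixed checkpoints when the objective stops improving. All hyperparameters, the checkpoint rule, and the gradient reuse after a restart are simplifying assumptions.

```python
import torch

def apgd_style_attack(model, x, y, eps=8/255, n_iter=100):
    """Sketch of a PGD variant with an adaptive step size: keep the best
    point found, and halve the step at checkpoints without progress
    (a simplification of APGD's checkpoint rule)."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")
    step = 2 * eps                               # start with a large step
    delta = torch.zeros_like(x, requires_grad=True)
    best_loss, best_delta = -float("inf"), delta.detach().clone()
    checkpoint = max(n_iter // 10, 1)

    for t in range(n_iter):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            if loss.item() > best_loss:
                best_loss, best_delta = loss.item(), delta.detach().clone()
            elif (t + 1) % checkpoint == 0:
                step /= 2                        # no progress: shrink the step
                delta.copy_(best_delta)          # and restart from the best point
            delta += step * grad.sign()          # (gradient reuse kept simple)
            delta.clamp_(-eps, eps)              # project onto the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # stay in the image domain
    return (x + best_delta).clamp(0, 1)
```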
We show that when taking into account the image domain $[0,1]^d$, established $l_1$-projected gradient descent (PGD) attacks are suboptimal, as they do not consider that the effective threat model is the intersection of the $l_1$-ball and $[0,1]^d$. We study the expected sparsity of the steepest descent step for this effective threat model and show that the exact projection onto this set is computationally feasible and yields better performance. Moreover, we propose an adaptive form of PGD which is highly effective even with a small budget of iterations. The resulting $l_1$-APGD is a strong white-box attack showing that prior works overestimated their $l_1$-robustness. Using $l_1$-APGD for adversarial training, we obtain a robust classifier with SOTA $l_1$-robustness. Finally, we combine $l_1$-APGD and an adaptation of the Square Attack to $l_1$ into $l_1$-AutoAttack, an ensemble of attacks which reliably assesses adversarial robustness for the threat model of the $l_1$-ball intersected with $[0,1]^d$.
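The paper derives a specialized exact projection onto this intersection; as a generic conceptual stand-in (not the paper's routine), the sketch below uses Dykstra's algorithm, which alternates projections onto the $l_1$-ball (via the sort-based method of Duchi et al., 2008) and the box, and converges to the exact Euclidean projection onto the intersection. It assumes `radius > 0`.

```python
import torch

def project_l1_ball(v, radius):
    """Row-wise Euclidean projection onto the l1-ball (Duchi et al., 2008).
    Assumes radius > 0."""
    flat = v.flatten(1)
    u, _ = flat.abs().sort(dim=1, descending=True)
    css = u.cumsum(dim=1) - radius
    ks = torch.arange(1, u.size(1) + 1, device=v.device, dtype=v.dtype)
    rho = ((u - css / ks) > 0).sum(dim=1)                  # support size
    theta = css.gather(1, (rho - 1)[:, None]) / rho[:, None].to(v.dtype)
    theta = theta.clamp(min=0)                             # rows already inside
    return (torch.sign(flat) * (flat.abs() - theta).clamp(min=0)).view_as(v)

def project_intersection(delta, x, radius, n_iter=50):
    """Dykstra's alternating projections onto
    {d : ||d||_1 <= radius} intersected with {d : x + d in [0, 1]};
    converges to the exact projection onto the intersection."""
    p, q = torch.zeros_like(delta), torch.zeros_like(delta)
    z = delta.clone()
    for _ in range(n_iter):
        y = project_l1_ball(z + p, radius)
        p = z + p - y
        z = (x + y + q).clamp(0, 1) - x                    # box projection
        q = y + q - z
    return z
```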
The evaluation of robustness against adversarial manipulation of neural-network-based classifiers is mainly carried out with empirical attacks, as methods for exact computation, even when available, do not scale to large networks. In this paper we propose a new white-box adversarial attack w.r.t. the $l_p$-norms for $p \in \{1, 2, \infty\}$ aimed at finding the minimal perturbation necessary to change the class of a given input. It has an intuitive geometric meaning, quickly yields high-quality results, and minimizes the size of the perturbation, so that it returns the robust accuracy at every threshold with a single run. It performs comparably to or better than state-of-the-art attacks, which are partially specialized to a single $l_p$-norm, and is robust to the phenomenon of gradient masking.
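Because a minimal-perturbation attack returns one perturbation norm per sample, a single run determines robust accuracy at every threshold: a sample counts as robust at threshold $\epsilon$ if it is correctly classified and its minimal perturbation exceeds $\epsilon$. A small illustrative sketch (function and variable names are ours):

```python
import numpy as np

def robust_accuracy_curve(min_pert_norms, clean_correct, thresholds):
    """Robust accuracy at every threshold from one run of a minimal-norm attack.

    min_pert_norms: per-sample norm of the smallest perturbation found
                    (np.inf if no adversarial example was found).
    clean_correct:  boolean mask of correctly classified clean samples.
    """
    norms = np.asarray(min_pert_norms, dtype=float)
    correct = np.asarray(clean_correct, dtype=bool)
    return {eps: float(np.mean(correct & (norms > eps))) for eps in thresholds}

# Example: 5 samples, evaluated at thresholds 2/255 and 8/255.
curve = robust_accuracy_curve(
    [0.01, 0.05, np.inf, 0.002, 0.2], [True, True, True, False, True],
    thresholds=[2 / 255, 8 / 255])
print(curve)   # {0.00784...: 0.8, 0.03137...: 0.6}
```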
We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. The Square Attack is based on a randomized search scheme which selects localized square-shaped updates at random positions so that at each iteration the perturbation is situated approximately at the boundary of the feasible set. Our method is significantly more query-efficient and achieves a higher success rate compared to the state-of-the-art methods, especially in the untargeted setting. In particular, on ImageNet we improve the average query efficiency in the untargeted setting for various deep networks by a factor of at least 1.8 and up to 3 compared to the recent state-of-the-art $l_\infty$-attack of Al-Dujaili & O'Reilly (2020). Moreover, although our attack is black-box, it can also outperform gradient-based white-box attacks on the standard benchmarks, achieving a new state of the art in terms of success rate. The code of our attack is available at https://github.com/max-andr/square-attack.
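A minimal sketch of the random-search scheme for the $l_\infty$ case follows; the real attack uses a specific square-size schedule and a vertical-stripe initialization, both simplified here, and `loss_fn` is assumed to return a per-image score to maximize from model queries only.

```python
import torch

def square_attack_linf(loss_fn, x, eps=8/255, n_iter=1000, p_init=0.1):
    """Minimal sketch of an l_inf random-search attack with square-shaped
    updates. loss_fn(x_adv) must return a per-image loss tensor of shape
    (batch,), computed from model queries only (no gradients needed)."""
    b, c, h, w = x.shape
    # Initialize at a random corner of the feasible set.
    x_adv = (x + eps * torch.sign(torch.randn_like(x))).clamp(0, 1)
    best = loss_fn(x_adv)
    for t in range(n_iter):
        frac = p_init * (1 - t / n_iter)            # simplified size schedule
        s = max(min(int((frac * h * w) ** 0.5), h, w), 1)
        r = int(torch.randint(0, h - s + 1, (1,)))
        col = int(torch.randint(0, w - s + 1, (1,)))
        cand = x_adv.clone()
        # Resample a per-channel sign on one square window.
        patch = eps * torch.sign(torch.randn(b, c, 1, 1, device=x.device))
        cand[:, :, r:r + s, col:col + s] = (
            x[:, :, r:r + s, col:col + s] + patch).clamp(0, 1)
        cand_loss = loss_fn(cand)
        improved = cand_loss > best                 # keep only improving moves
        x_adv[improved] = cand[improved]
        best = torch.where(improved, cand_loss, best)
    return x_adv
```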
Adaptive defenses that use test-time optimization promise to improve adversarial robustness. We categorize such adaptive test-time defenses, explain their potential benefits and drawbacks, and evaluate a representative selection of the latest adaptive defenses for image classification. Unfortunately, none significantly improve upon static defenses when subjected to our careful case-study evaluation, and some even weaken the underlying static model while increasing inference-time computation. While these results are disappointing, we still believe that adaptive test-time defenses are a promising avenue of research and, as such, we provide recommendations for their thorough evaluation. We extend the checklist of Carlini et al. (2019) by providing concrete steps specific to adaptive defenses.
Designing powerful adversarial attacks is of paramount importance for the evaluation of $\ell_p$-bounded adversarial defenses. Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries. The search space of PGD is dictated by the steepest ascent directions of an objective. Despite the plethora of objective function choices, there is no universally superior option, and robustness overestimation may arise from ill-suited objective selection. Driven by this observation, we postulate that the combination of different objectives through a simple loss-alternating scheme renders PGD more robust towards design choices. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_{\infty}$-robust models and 3 datasets. The performance improvement is consistent when compared to the single-loss counterparts. On the CIFAR-10 dataset, our strongest adversarial attack outperforms all of the white-box components of the AutoAttack (AA) ensemble, as well as the most powerful attacks in the literature, achieving state-of-the-art results within the computational budget of our study ($T=100$, no restarts).
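A minimal sketch of the loss-alternation idea, switching between cross-entropy and a difference-of-logits objective on a fixed period; the paper's exact loss set and alternation schedule are simplified assumptions here.

```python
import torch
import torch.nn.functional as F

def ce_loss(logits, y):
    return F.cross_entropy(logits, y, reduction="sum")

def dl_loss(logits, y):
    """Difference of logits: best wrong logit minus true logit (maximized)."""
    true = logits.gather(1, y[:, None]).squeeze(1)
    wrong = logits.scatter(1, y[:, None], -float("inf")).max(1).values
    return (wrong - true).sum()

def alternating_pgd(model, x, y, eps=8/255, alpha=2/255, n_iter=100, period=10):
    """PGD whose objective switches every `period` iterations (a sketch)."""
    losses = [ce_loss, dl_loss]
    delta = (torch.rand_like(x) * 2 - 1) * eps          # random start
    delta = ((x + delta).clamp(0, 1) - x).detach().requires_grad_(True)
    for t in range(n_iter):
        loss = losses[(t // period) % len(losses)](model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)                     # l_inf projection
            delta.copy_((x + delta).clamp(0, 1) - x)    # image-domain projection
    return (x + delta).detach().clamp(0, 1)
```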
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS, illustrating a diverse set of defense strategies, can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
As a research community, we still lack a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas for training robust models. A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation. Our goal is to establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. To this end, we first consider the image classification task and introduce restrictions (possibly loosened in the future) on the allowed models. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks, which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. To prevent overadaptation of new defenses to AutoAttack, we welcome external evaluations based on adaptive attacks, especially where AutoAttack potentially overestimates robustness. Our leaderboard, hosted at https://robustbench.github.io/, contains evaluations of 120+ models and aims at reflecting the current state of the art in image classification on a clearly defined set of tasks in the $\ell_\infty$- and $\ell_2$-threat models and on common corruptions, with possible extensions in the future. Additionally, we open-source the library https://github.com/robustbench/robustbench that provides unified access to 80+ robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze the impact of robustness on distribution shift, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness and transferability.
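A minimal usage sketch of the open-sourced library follows; the model name is one leaderboard entry we believe exists, and API details may differ across library versions.

```python
# pip install robustbench  -- a usage sketch; API details may vary by version.
import torch
from robustbench.data import load_cifar10
from robustbench.utils import load_model

x_test, y_test = load_cifar10(n_examples=50)
model = load_model(model_name="Carmon2019Unlabeled",
                   dataset="cifar10", threat_model="Linf")
with torch.no_grad():
    acc = (model(x_test).argmax(1) == y_test).float().mean().item()
print(f"clean accuracy on 50 examples: {acc:.2%}")
```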
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
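One standard way to repurpose gradient-free optimization into an attack of this kind is SPSA-style finite-difference estimation, which replaces the true gradient with an estimate built from function evaluations only; a minimal sketch with illustrative hyperparameters, where `loss_fn` is assumed to return a scalar loss to maximize:

```python
import torch

def spsa_gradient(loss_fn, x, n_samples=64, sigma=1e-3):
    """SPSA gradient estimate of loss_fn at x from function values only."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        v = torch.sign(torch.randn_like(x))            # Rademacher direction
        grad += (loss_fn(x + sigma * v) - loss_fn(x - sigma * v)) / (2 * sigma) * v
    return grad / n_samples

def spsa_attack(loss_fn, x, eps=8/255, alpha=1/255, n_iter=100):
    """Gradient-free PGD: the true gradient is replaced by an SPSA estimate,
    so gradient masking in the defended model is bypassed."""
    x_adv = x.clone()
    for _ in range(n_iter):
        g = spsa_gradient(loss_fn, x_adv)
        x_adv = torch.min(torch.max(x_adv + alpha * g.sign(), x - eps), x + eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```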
A major drawback of adversarially robust models, in particular for large-scale datasets, is the extremely long training time compared to standard training. Moreover, models should ideally be robust not only to one $l_p$-threat model but to all of them. In this paper we propose Extreme norm Adversarial Training (E-AT) for multiple-norm robustness, which is based on geometric properties of $l_p$-balls. E-AT costs up to three times less than other adversarial training methods for multiple-norm robustness. Using E-AT we show that a single epoch for ImageNet, and three epochs for CIFAR-10, are sufficient to turn any $l_p$-robust model into a multiple-norm robust model. In this way we obtain the first multiple-norm robust model for ImageNet and boost the state of the art for multiple-norm robustness on CIFAR-10 to more than $51\%$. Finally, we study the general transfer of adversarial robustness between different individual $l_p$-threat models via fine-tuning, and improve the previous SOTA $l_1$-robustness on both CIFAR-10 and ImageNet. Extensive experiments show that our scheme works across datasets and architectures, including vision transformers.
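A minimal sketch of the fine-tuning idea, sampling one of the two extreme threat models ($l_1$ or $l_\infty$) per batch; `pgd_attack` is an assumed generic adversarial-example routine, and the radii are illustrative values, not the paper's.

```python
import random
import torch

def eat_finetune(model, loader, optimizer, pgd_attack, epochs=1,
                 eps_linf=8/255, eps_l1=12.0):
    """Sketch of extreme-norm fine-tuning: each batch is adversarially
    perturbed in either the l1 or the l_inf threat model, chosen at random.
    pgd_attack(model, x, y, norm, eps) is an assumed generic PGD routine."""
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            norm, eps = random.choice([("linf", eps_linf), ("l1", eps_l1)])
            x_adv = pgd_attack(model, x, y, norm=norm, eps=eps)
            optimizer.zero_grad()
            criterion(model(x_adv), y).backward()
            optimizer.step()
    return model
```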
AutoAttack (AA) is the most reliable method to evaluate adversarial robustness when considerable computational resources are available. However, its high computational cost (e.g., 100 times more than the Projected Gradient Descent attack) makes AA infeasible for practitioners with limited computational resources, and also hinders the application of AA in adversarial training (AT). In this paper, we propose a novel method, the Minimum-Margin (MM) attack, to fast and reliably evaluate adversarial robustness. Compared with AA, our method achieves comparable performance but consumes only 3% of the computational time in extensive experiments. The reliability of our method lies in evaluating the quality of adversarial examples by the margin between two targets, which can precisely identify the most adversarial example. The computational efficiency of our method lies in an effective Sequential TArget Ranking selection (STAR) method, which ensures that the cost of the MM attack is independent of the number of classes. The MM attack opens a new way to evaluate adversarial robustness and provides a feasible and reliable way to generate high-quality adversarial examples.
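A sketch of the minimum-margin quantity and a simplified stand-in for target selection (the paper's STAR procedure is more elaborate than the top-k shortcut used here):

```python
import torch

def minimum_margin(logits, y):
    """Per-sample minimum margin: true logit minus the best other logit.
    An attack seeks the perturbation minimizing this quantity; a negative
    value means the example is already misclassified."""
    true = logits.gather(1, y[:, None]).squeeze(1)
    other = logits.scatter(1, y[:, None], -float("inf")).max(1).values
    return true - other

def candidate_targets(logits, y, k=3):
    """Simplified stand-in for the paper's STAR selection: consider only the
    k non-true classes with the largest logits as attack targets."""
    masked = logits.scatter(1, y[:, None], -float("inf"))
    return masked.topk(k, dim=1).indices                # shape (batch, k)
```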
Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from a high computational cost due to the iterative adversarial attacks performed in every training step. Recent studies have shown that fast adversarial training can be achieved by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms in both stability and model robustness. In this work, we develop a new understanding of fast adversarial training by viewing random initialization as performing randomized smoothing for the better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, that further improves the stability and model robustness of single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness to state-of-the-art adversarial training methods while using much less training time ($\sim$3x improvement with the same training schedule).
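For context, here is a minimal sketch of the single-step, random-initialization training step that this line of work builds on; the paper's backward-smoothing initialization would replace the uniform random start, and we do not attempt to reconstruct it here.

```python
import torch

def fast_at_step(model, x, y, optimizer, eps=8/255, alpha=10/255):
    """One step of single-step adversarial training with a random start
    (the baseline scheme; backward smoothing replaces the initialization)."""
    criterion = torch.nn.CrossEntropyLoss()
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = criterion(model((x + delta).clamp(0, 1)), y)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()  # one step
    optimizer.zero_grad()
    criterion(model((x + delta).clamp(0, 1)), y).backward()
    optimizer.step()
```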
Evaluating the robustness of defense models is a challenging task in adversarial robustness research. Obfuscated gradients, a type of gradient masking, have previously been found to exist in many defense methods and to cause a false signal of robustness. In this paper, we identify a more subtle situation called imbalanced gradients that can also cause overestimated adversarial robustness. The phenomenon of imbalanced gradients occurs when the gradient of one term of the margin loss dominates and pushes the attack towards a suboptimal direction. To exploit imbalanced gradients, we formulate a Margin Decomposition (MD) attack that decomposes the margin loss and explores the attackability of these terms separately via a two-stage process. We also propose a multi-targeted and an ensemble version of the MD attack. By investigating 17 defense models proposed since 2018, we find that 6 models are susceptible to imbalanced gradients, and our MD attack can decrease their robustness, as evaluated by the best standalone baseline attack, by another 2%. We also provide an in-depth analysis of the likely causes of imbalanced gradients and effective countermeasures.
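A minimal sketch of the two-stage idea: first ascend a single term of the margin loss (pushing the true logit down), then switch to the full margin. The exact staging and term choices of the MD attack are simplified assumptions here.

```python
import torch

def md_attack(model, x, y, eps=8/255, alpha=2/255, n1=50, n2=50):
    """Sketch of a two-stage margin-decomposition attack (simplified)."""
    def step(delta, loss):
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)
        return delta

    delta = torch.zeros_like(x).requires_grad_(True)
    for _ in range(n1):                      # stage 1: only the -z_y term
        logits = model(x + delta)
        delta = step(delta, -logits.gather(1, y[:, None]).sum())
    for _ in range(n2):                      # stage 2: the full margin loss
        logits = model(x + delta)
        true = logits.gather(1, y[:, None]).squeeze(1)
        other = logits.scatter(1, y[:, None], -float("inf")).max(1).values
        delta = step(delta, (other - true).sum())
    return (x + delta).detach().clamp(0, 1)
```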
Evaluating adversarial robustness amounts to finding the minimal perturbation needed to make an input sample misclassified. The inherent complexity of the underlying optimization requires current gradient-based attacks to be carefully tuned and initialized, and they may have to be executed for many computationally demanding iterations, even when specialized to a given perturbation model. In this work, we overcome these limitations by proposing a fast minimum-norm (FMN) attack that works with different $\ell_p$-norm perturbation models ($p = 0, 1, 2, \infty$), is robust to hyperparameter choices, does not require an adversarial starting point, and converges within few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\ell_p$-norm constraint of size $\epsilon$, while adapting $\epsilon$ to minimize the distance of the current sample to the decision boundary. Extensive experiments show that FMN significantly outperforms existing attacks in terms of convergence speed and computation time, while reporting comparable or even smaller perturbation sizes.
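A heavily simplified $l_2$-only sketch of the adaptive-$\epsilon$ scheme: estimate the distance to the decision boundary by a first-order approximation, move $\epsilon$ toward it, and project. The actual FMN attack handles all four norms and uses a more careful step-size and $\epsilon$ schedule; everything below is an illustrative reconstruction.

```python
import torch

def fmn_l2(model, x, y, n_iter=100, alpha=1.0, gamma=0.05):
    """Heavily simplified l2-only FMN-style sketch."""
    delta = torch.zeros_like(x).requires_grad_(True)
    for _ in range(n_iter):
        logits = model(x + delta)
        true = logits.gather(1, y[:, None]).squeeze(1)
        other = logits.scatter(1, y[:, None], -float("inf")).max(1).values
        margin = true - other                         # > 0: still correct
        grad, = torch.autograd.grad(margin.sum(), delta)
        with torch.no_grad():
            gnorm = grad.flatten(1).norm(dim=1).clamp_min(1e-12)
            dnorm = delta.flatten(1).norm(dim=1)
            boundary = dnorm + margin / gnorm         # first-order estimate
            eps = torch.where(margin < 0,
                              dnorm * (1 - gamma),    # adversarial: shrink
                              boundary * (1 + gamma)) # not yet: grow toward it
            delta -= alpha * grad / gnorm.view(-1, 1, 1, 1)   # descend margin
            dn = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
            delta *= (eps / dn).clamp(max=1.0).view(-1, 1, 1, 1)  # l2 projection
            delta.copy_((x + delta).clamp(0, 1) - x)
    return (x + delta).detach().clamp(0, 1)
```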
Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks. Unfortunately, the high cost of generating strong adversarial examples makes standard adversarial training impractical on large-scale problems like ImageNet. We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters. Our "free" adversarial training algorithm achieves comparable robustness to PGD adversarial training on the CIFAR-10 and CIFAR-100 datasets at negligible additional cost compared to natural training, and can be 7 to 30 times faster than other strong adversarial training methods. Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train a robust model for the large-scale ImageNet classification task that maintains 40% accuracy against PGD attacks. The code is available at https://github.com/ashafahi/free_adv_train.
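A minimal sketch of the recycling idea: each minibatch is replayed $m$ times, and a single backward pass per replay provides both the parameter gradients for the model update and the input gradient for the perturbation update. Hyperparameters are illustrative.

```python
import torch

def free_adversarial_training(model, loader, optimizer, epochs, m=4, eps=8/255):
    """Sketch of 'free' adversarial training: one backward pass per replay
    yields gradients w.r.t. both the parameters and the input, so the
    perturbation is updated at no extra cost."""
    criterion = torch.nn.CrossEntropyLoss()
    delta = None                                     # recycled across batches
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            if delta is None or delta.shape != x.shape:
                delta = torch.zeros_like(x)
            for _ in range(m):                       # replay the same minibatch
                adv = (x + delta).clamp(0, 1).requires_grad_(True)
                loss = criterion(model(adv), y)
                optimizer.zero_grad()
                loss.backward()                      # grads for params AND input
                with torch.no_grad():
                    delta = (delta + eps * adv.grad.sign()).clamp(-eps, eps)
                optimizer.step()
    return model
```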
Deep Convolutional Neural Networks (CNNs) can easily be fooled by subtle, imperceptible changes to input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., via a cross-entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries, which are in turn used to learn a robust model. Specifically, we use the style and content information of a target sample from another class, alongside its class-boundary information, to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between a source image and its adversary and maximizes the distance between the adversary and the target image. Compared to state-of-the-art defenses, our adversarial training approach demonstrates strong robustness, generalizes well to naturally occurring corruptions and data distribution shifts, and retains the model's accuracy on clean examples.
As one of the most effective defense methods against adversarial attacks, adversarial training tends to learn an inclusive decision boundary to improve the robustness of deep learning models. However, due to the large and unnecessary increase of the margin along adversarial directions, adversarial training causes serious cross-over between natural examples and adversarial examples, which is not conducive to balancing the trade-off between robustness and natural accuracy. In this paper, we propose a novel adversarial training scheme to achieve a better trade-off between robustness and natural accuracy. It aims to learn a moderately inclusive decision boundary, meaning that the margins of natural examples under the decision boundary are moderate. We call this scheme Moderate-Margin Adversarial Training (MMAT), which generates finer-grained adversarial examples to mitigate the cross-over problem. We also take advantage of the logits of a well-trained teacher model to guide the learning of our model. Finally, MMAT achieves high natural accuracy and robustness under both black-box and white-box attacks. On SVHN, for example, state-of-the-art robustness and natural accuracy are achieved.
The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of the Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain consists of adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can easily be incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL into the adversarial training process improves both robustness and standard accuracy.
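A sketch of how the domain-adversarial component could look, with a standard gradient-reversal layer and a domain head trying to distinguish natural from adversarial features; the loss weights and the `domain_head` module are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward
    pass, as in DANN."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, g):
        return -ctx.lam * g, None

def dial_style_loss(feats_nat, feats_adv, logits_nat, logits_adv, y,
                    domain_head, lam=0.1, beta=1.0):
    """Sketch of a DIAL-style objective: task losses on natural and adversarial
    examples plus a domain loss through gradient reversal, so the feature
    extractor is pushed not to distinguish the two domains.
    domain_head must map features to 2 logits (natural vs adversarial)."""
    ce = nn.functional.cross_entropy
    task = ce(logits_nat, y) + beta * ce(logits_adv, y)
    feats = torch.cat([feats_nat, feats_adv], dim=0)
    domains = torch.cat([torch.zeros(len(feats_nat)),       # 0 = natural
                         torch.ones(len(feats_adv))]).long().to(feats.device)
    domain_logits = domain_head(GradReverse.apply(feats, lam))
    return task + ce(domain_logits, domains)
```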
Neural network quantization has become increasingly popular due to the efficient memory consumption and faster computation resulting from bitwise operations on quantized networks. Even though they exhibit excellent generalization capabilities, their robustness properties are not well understood. In this work, we systematically study the robustness of quantized networks against gradient-based adversarial attacks and demonstrate that these quantized models suffer from gradient vanishing issues and show a fake sense of robustness. By attributing gradient vanishing to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Despite being a simple modification to existing gradient-based adversarial attacks, experiments on multiple image classification datasets with multiple network architectures demonstrate that our temperature-scaled attacks obtain near-perfect success rates on quantized networks, while outperforming the original attacks on adversarially trained models as well as floating-point networks. The code is available at https://github.com/kartikgupta-at-anu/Attack-bnn.
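The mitigation amounts to a monotone rescaling of the logits before the attack's loss, which leaves the decision boundary unchanged, since $\arg\max(z/T) = \arg\max(z)$ for $T > 0$, while changing how saturated the softmax is. A minimal sketch; the temperature value is an illustrative assumption, not the paper's.

```python
import torch.nn as nn

class TemperatureScaled(nn.Module):
    """Wraps a model so that attacks see logits z / T. Choosing T to
    de-saturate the softmax restores a usable gradient signal, while
    argmax(z / T) = argmax(z) keeps the decision boundary intact.
    T = 10.0 is illustrative; the right value is model-dependent."""
    def __init__(self, model, temperature=10.0):
        super().__init__()
        self.model = model
        self.temperature = temperature

    def forward(self, x):
        return self.model(x) / self.temperature

# Usage sketch: run any gradient-based attack on the wrapped model, e.g.
#   x_adv = pgd(TemperatureScaled(quantized_net), x, y)
```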
Minimal adversarial perturbations added to inputs have proven effective at fooling deep neural networks. In this paper, we introduce several innovations that make white-box targeted attacks follow the attacker's goal: to trick the model into assigning a higher probability to the target class than to any other, while staying within a specified distance of the original input. First, we propose a new loss function that explicitly captures the goal of targeted attacks, in particular by using the logits of all classes instead of just a subset of them. We show that Auto-PGD with this loss function finds more adversarial examples than it does with other commonly used loss functions. Second, we propose a new attack method that uses a further developed version of our loss function capturing both the misclassification objective and the $l_\infty$ distance limit $\epsilon$. This new attack method is relatively 1.5-4.2% more successful on the CIFAR10 dataset, and also more successful on the ImageNet dataset, than the next best state-of-the-art attack. We confirm with statistical tests that our attack outperforms state-of-the-art attacks across different datasets, values of $\epsilon$, and defenses.
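One plausible instantiation of a targeted loss that uses the logits of all classes (our reconstruction, not necessarily the paper's exact formulation) replaces the runner-up logit with a log-sum-exp over all non-target logits, a smooth surrogate of "the target beats every other class":

```python
import torch

def targeted_all_logits_loss(logits, target):
    """Target logit minus log-sum-exp over all OTHER logits (to maximize).
    Uses every logit rather than only the runner-up; the attacker's goal is
    reached exactly when the target logit exceeds all others."""
    z_t = logits.gather(1, target[:, None]).squeeze(1)
    others = logits.scatter(1, target[:, None], -float("inf"))
    return (z_t - torch.logsumexp(others, dim=1)).sum()
```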