Evaluating the robustness of machine learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of robustness by causing gradient-based attacks to fail, and they have subsequently been broken under more rigorous evaluations. Although guidelines and best practices have been proposed to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by: (i) categorizing attack failures based on how they affect the optimization of gradient-based attacks, while also unveiling two novel failures affecting many popular attack implementations and past evaluations; (ii) proposing six novel indicators of failure to automatically detect the presence of such failures in the attack optimization process; and (iii) suggesting a systematic protocol to apply the corresponding fixes. Our extensive experimental analysis, involving more than 15 models in 3 distinct application domains, shows that our indicators of failure can be used to debug and improve current adversarial robustness evaluations, thereby providing a first step towards automating and systematizing them. Our open-source code is available at: https://github.com/pralab/indicatorsofattackfailure.
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS, which illustrate a diverse set of defense strategies, can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
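The gradient-free attacks mentioned above can be as simple as a random-search procedure over the perturbation. The sketch below, in PyTorch, is a generic illustration of that idea under assumed defaults (query budget, sparsity of each proposal); it is not the specific optimizer repurposed in the paper.

```python
import torch

def random_search_attack(model, x, y, eps=8 / 255, queries=1000):
    """Gradient-free L-infinity attack: keep a random perturbation only if it raises the loss."""
    loss_fn = torch.nn.CrossEntropyLoss(reduction='none')
    delta = torch.zeros_like(x)
    with torch.no_grad():
        best_loss = loss_fn(model(x), y)
        for _ in range(queries):
            # Propose a sparse sign flip on ~5% of the coordinates (arbitrary illustrative choice).
            candidate = (delta + eps * torch.randn_like(x).sign() *
                         (torch.rand_like(x) < 0.05)).clamp(-eps, eps)
            loss = loss_fn(model((x + candidate).clamp(0, 1)), y)
            improved = loss > best_loss
            delta[improved] = candidate[improved]
            best_loss = torch.maximum(best_loss, loss)
    return (x + delta).clamp(0, 1)
```

Because it only needs forward passes, such a search is unaffected by an obscured (masked or shattered) gradient signal.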
Evaluating adversarial robustness amounts to finding the minimum perturbation needed to have an input sample misclassified. The inherent complexity of the underlying optimization requires gradient-based attacks to be carefully tuned and initialized, and possibly executed for many computationally demanding iterations, even when specialized to a given perturbation model. In this work, we overcome these limitations by proposing a fast minimum-norm (FMN) attack that works with different $\ell_p$-norm perturbation models ($p = 0, 1, 2, \infty$), is robust to hyperparameter choices, does not require an adversarial starting point, and converges within few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\ell_p$-norm constraint of size $\epsilon$, while adapting $\epsilon$ to minimize the distance of the current sample to the decision boundary. Extensive experiments show that FMN significantly outperforms existing attacks in terms of convergence speed and computation time, while reporting comparable or even smaller perturbation sizes.
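A minimal PyTorch-style sketch of the kind of update loop described above, for the $\ell_2$ case: a margin-based gradient step combined with an adaptive norm constraint $\epsilon$. The decaying step size and the multiplicative $\epsilon$-update used here are simplifying assumptions, not the authors' reference implementation, and the code assumes 4-D image inputs in [0, 1].

```python
import torch

def fmn_l2_sketch(model, x, y, steps=100, alpha=1.0, gamma=0.05):
    """Simplified, illustrative L2 minimum-norm attack loop (not the reference FMN code)."""
    delta = torch.zeros_like(x, requires_grad=True)
    eps = torch.full((x.shape[0],), float("inf"), device=x.device)  # per-sample constraint
    best_delta = torch.full_like(x, float("inf"))                   # smallest adversarial delta found

    for t in range(steps):
        logits = model(x + delta)
        # Margin loss: positive while the sample is still correctly classified.
        top_wrong = logits.scatter(1, y[:, None], -float("inf")).amax(dim=1)
        margin = logits.gather(1, y[:, None]).squeeze(1) - top_wrong
        margin.sum().backward()

        with torch.no_grad():
            is_adv = margin < 0
            norms = delta.flatten(1).norm(dim=1)
            # Adapt eps: shrink it below the current norm once adversarial, enlarge it otherwise
            # (this multiplicative schedule is an assumption of the sketch).
            eps = torch.where(is_adv, torch.minimum(eps, norms) * (1 - gamma), eps * (1 + gamma))
            # Track the best (smallest-norm) adversarial perturbation found so far.
            better = is_adv & (norms < best_delta.flatten(1).norm(dim=1))
            best_delta[better] = delta[better]
            # Normalized gradient step that decreases the margin, then project onto the eps-ball.
            g = delta.grad
            g = g / g.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
            delta -= alpha * (1 - t / steps) * g
            d_norms = delta.flatten(1).norm(dim=1).clamp_min(1e-12)
            delta *= (eps / d_norms).clamp(max=1.0).view(-1, 1, 1, 1)
            delta.copy_(torch.clamp(x + delta, 0, 1) - x)            # keep x + delta in [0, 1]
            delta.grad.zero_()

    return best_delta
```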
The field of defense strategies against adversarial attacks has significantly grown over the last years, but progress is hampered as the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses could be broken later on, making it difficult to identify the state-of-the-art. Frequent pitfalls in the evaluation are improper tuning of hyperparameters of the attacks, gradient obfuscation or masking. In this paper we first propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 50 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
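The resulting parameter-free ensemble is known as AutoAttack. A hedged usage sketch, assuming the interface of the publicly released autoattack package (class and method names follow its README and may differ across versions):

```python
import torch
from autoattack import AutoAttack   # assumed import, as documented in the package README


def evaluate_robust_accuracy(model, x_test, y_test, eps=8 / 255):
    """Run the standard AutoAttack ensemble (APGD-CE, APGD-T, FAB-T, Square) and report robust accuracy."""
    model.eval()
    adversary = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=128)
    with torch.no_grad():
        robust_acc = (model(x_adv).argmax(1) == y_test).float().mean().item()
    return robust_acc
```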
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
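One of the techniques developed to overcome shattered gradients is the Backward Pass Differentiable Approximation (BPDA), which differentiates through a non-differentiable preprocessor g by using g(x) in the forward pass and the identity in the backward pass (valid when g(x) ≈ x). A minimal PyTorch sketch with illustrative PGD defaults, not the paper's exact attack settings:

```python
import torch

def bpda_identity(x, preprocess):
    """Forward through the non-differentiable preprocessor, backward as identity (BPDA)."""
    # preprocess(x) is used in the forward pass; gradients flow straight through to x.
    return x + (preprocess(x) - x).detach()

def pgd_with_bpda(model, preprocess, x, y, eps=8 / 255, alpha=2 / 255, steps=40):
    """Standard L-infinity PGD that differentiates through the BPDA surrogate."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(bpda_identity(x_adv, preprocess))
        loss = torch.nn.functional.cross_entropy(logits, y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project onto the eps-ball around x
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```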
Adaptive defenses that use test-time optimization promise to improve adversarial robustness. We categorize such adaptive test-time defenses, explain their potential benefits and drawbacks, and evaluate a representative variety of the latest adaptive defenses for image classification. Unfortunately, none significantly improve upon static defenses when subjected to our careful case-study evaluations, and some even weaken the underlying static model while increasing inference computation. Although these results are disappointing, we still believe that adaptive test-time defenses are a promising avenue of research and, as such, we provide recommendations for their thorough evaluation. We extend the checklist of Carlini et al. (2019) by providing concrete steps specific to adaptive defenses.
Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims, because correctly evaluating robustness is extremely challenging: weak attacks often fail to find adversarial examples even when they exist, making a vulnerable network appear robust. In this paper, we propose a test to identify weak attacks and, thus, weak defense evaluations. Our test slightly modifies a neural network to guarantee the existence of an adversarial example for every sample. Consequently, any correct attack must succeed in breaking this modified network. For eleven out of thirteen previously published defenses, the original evaluation of the defense fails our test, while stronger attacks that break these defenses pass it. We hope that attack unit tests such as ours will become a major component of future robustness evaluations and increase confidence in an empirical field that is currently viewed with skepticism.
As a research community, we still lack a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas for training robust models. A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to overestimated robustness. Our goal is to establish a standardized benchmark of adversarial robustness that reflects, as accurately as possible, the robustness of the considered models within a reasonable computational budget. To this end, we first consider the image classification task and introduce restrictions (possibly loosened in the future) on the allowed models. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. To prevent new defenses from overfitting to AutoAttack, we welcome external evaluations based on adaptive attacks, especially where AutoAttack may overestimate robustness. Our leaderboard, hosted at https://robustbench.github.io/, contains evaluations of more than 120 models and aims to reflect the current state of the art in image classification on a clearly defined set of tasks in the $\ell_\infty$ and $\ell_2$ threat models and on common corruptions, with possible extensions in the future. Additionally, we open-source the library https://github.com/robustbench/robustbench, which provides unified access to more than 80 robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze the impact of robustness on distribution shift, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
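A hedged sketch of how a leaderboard model can be loaded from the library for downstream use, assuming the load_model helper documented in the RobustBench README (the model name is only an example, and the exact signature may differ across versions):

```python
from robustbench.utils import load_model   # assumed import, per the project README

# Download a leaderboard model for the CIFAR-10 L-infinity threat model.
# 'Standard' is an example entry name; any other leaderboard identifier could be used.
model = load_model(model_name='Standard', dataset='cifar10', threat_model='Linf')
model.eval()
# `model` behaves like an ordinary torch.nn.Module and can now be attacked,
# benchmarked, or fine-tuned downstream.
```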
Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
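A minimal sketch of the joint detection idea, assuming an image classifier that returns logits; the bit depth, the smoothing kernel size, and the L1 threshold below are illustrative placeholders rather than the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def reduce_bit_depth(x, bits=4):
    """Squeeze color depth: quantize pixel values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def median_smooth(x, k=3):
    """Spatial smoothing squeezer: simple k x k median filter via unfold."""
    pad = k // 2
    patches = F.unfold(F.pad(x, (pad, pad, pad, pad), mode='reflect'), kernel_size=k)
    b, _, n = patches.shape
    patches = patches.view(b, x.shape[1], k * k, n)
    return patches.median(dim=2).values.reshape(x.shape)

def is_adversarial(model, x, threshold=1.0):
    """Flag inputs whose predictions change too much under squeezing (max L1 distance)."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
        p_bits = F.softmax(model(reduce_bit_depth(x)), dim=1)
        p_smooth = F.softmax(model(median_smooth(x)), dim=1)
    score = torch.maximum((p - p_bits).abs().sum(1), (p - p_smooth).abs().sum(1))
    return score > threshold
```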
Neural networks provide state-of-the-art results for most machine learning tasks. Unfortunately, neural networks are vulnerable to adversarial examples: given an input x and any target classification t, it is possible to find a new input x' that is similar to x but classified as t. This makes it difficult to apply neural networks in security-critical areas. Defensive distillation is a recently proposed approach that can take an arbitrary neural network, and increase its robustness, reducing the success rate of current attacks' ability to find adversarial examples from 95% to 0.5%. In this paper, we demonstrate that defensive distillation does not significantly increase the robustness of neural networks by introducing three new attack algorithms that are successful on both distilled and undistilled neural networks with 100% probability. Our attacks are tailored to three distance metrics used previously in the literature, and when compared to previous adversarial example generation algorithms, our attacks are often much more effective (and never worse). Furthermore, we propose using high-confidence adversarial examples in a simple transferability test we show can also be used to break defensive distillation. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples.
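For reference, the core of the L2 attack minimizes the perturbation norm plus a weighted logit-margin term, using a change of variables to keep the image in [0, 1]. The sketch below follows that formulation but simplifies the optimization (fixed constant c, no binary search), so it should be read as an illustration rather than the authors' implementation.

```python
import torch

def cw_l2_sketch(model, x, target, c=1.0, kappa=0.0, steps=500, lr=1e-2):
    """Simplified Carlini-Wagner L2 attack: minimize ||x' - x||_2^2 + c * f(x')."""
    # Change of variables so that x' = 0.5 * (tanh(w) + 1) always lies in [0, 1].
    w = torch.atanh((2 * x - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)
        target_logit = logits.gather(1, target[:, None]).squeeze(1)
        other_logit = logits.scatter(1, target[:, None], -float('inf')).amax(dim=1)
        # f(x') <= 0 exactly when the target class wins by at least kappa.
        f = (other_logit - target_logit + kappa).clamp_min(0)
        loss = ((x_adv - x) ** 2).flatten(1).sum(1) + c * f
        optimizer.zero_grad()
        loss.sum().backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```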
Learning-based pattern classifiers, including deep networks, have shown impressive performance in several application domains, ranging from computer vision to cybersecurity. However, it has also been shown that adversarial input perturbations carefully crafted either at training or at test time can easily subvert their predictions. The vulnerability of machine learning to such wild patterns (also referred to as adversarial examples), along with the design of suitable countermeasures, have been investigated in the research field of adversarial machine learning. In this work, we provide a thorough overview of the evolution of this research area over the last ten years and beyond, starting from pioneering, earlier work on the security of non-deep learning algorithms up to more recent work aimed to understand the security properties of deep learning algorithms, in the context of computer vision and cybersecurity tasks. We report interesting connections between these apparently-different lines of work, highlighting common misconceptions related to the security evaluation of machine-learning algorithms. We review the main threat models and attacks defined to this end, and discuss the main limitations of current work, along with the corresponding future challenges towards the design of more secure learning algorithms.
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples: inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models.
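The inner maximization of this robust-optimization objective is typically solved with projected gradient descent (PGD). A minimal $\ell_\infty$ sketch with illustrative defaults (random start, step size, iteration count), not the paper's exact training settings:

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Inner maximization: max over ||delta||_inf <= eps of the classification loss."""
    delta = torch.empty_like(x).uniform_(-eps, eps)      # random start inside the ball
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend along the gradient sign, then project back onto the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).clamp(0, 1)
```

Adversarial training then simply minimizes the classification loss on pgd_linf(model, x, y) instead of on the clean inputs x.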
Evaluating the robustness of a defense model is a challenging task in adversarial robustness research. Obfuscated gradients, a type of gradient masking, have previously been found in many defense methods and lead to false signals of robustness. In this paper, we identify a more subtle situation called imbalanced gradients that can also cause overestimated adversarial robustness. The phenomenon of imbalanced gradients occurs when the gradient of one term of the margin loss dominates and pushes the attack towards a suboptimal direction. To exploit imbalanced gradients, we formulate a Margin Decomposition (MD) attack that decomposes the margin loss into individual terms and explores the attackability of these terms separately via a two-stage process. We also propose multi-targeted and ensemble versions of the MD attack. By investigating 17 defense models proposed since 2018, we find that 6 of them are susceptible to imbalanced gradients, and our MD attack can reduce their robustness, as evaluated by the best standalone baseline attack, by a further 2%. We also provide an in-depth analysis of the possible causes of imbalanced gradients and effective countermeasures.
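A hedged sketch of the decomposition idea: writing the margin loss as the difference of two logit terms and attacking them in two stages. The stage split and schedule below are simplified illustrations of that idea, not the authors' exact MD attack.

```python
import torch

def margin_terms(logits, y):
    """Margin loss = z_y - max_{i != y} z_i, returned as its two constituent terms."""
    z_y = logits.gather(1, y[:, None]).squeeze(1)
    z_other = logits.scatter(1, y[:, None], -float('inf')).amax(dim=1)
    return z_y, z_other

def md_style_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=40):
    """Two-stage sketch: stage 1 maximizes the runner-up logit only, stage 2 uses the full margin."""
    delta = torch.zeros_like(x)
    for t in range(steps):
        delta.requires_grad_(True)
        z_y, z_other = margin_terms(model((x + delta).clamp(0, 1)), y)
        if t < steps // 2:
            loss = z_other.sum()             # stage 1: one term only, avoiding the dominant gradient
        else:
            loss = (z_other - z_y).sum()     # stage 2: the full margin loss
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).clamp(0, 1)
```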
Defending against adversarial examples remains an open problem. A common belief is that randomness at inference time increases the cost of finding adversarial inputs. One example of such a defense is to apply a random transformation to inputs before feeding them to the model. In this paper, we empirically and theoretically investigate such stochastic pre-processing defenses and demonstrate that they are flawed. First, we show that most stochastic defenses are weaker than previously thought: they lack sufficient randomness to withstand even standard attacks such as projected gradient descent. This casts doubt on the long-held assumption that stochastic defenses invalidate attacks designed to evade deterministic defenses and force attackers to integrate the Expectation over Transformation (EOT) concept. Second, we show that stochastic defenses face a trade-off between adversarial robustness and model invariance: they become less effective as the defended model acquires more invariance to their randomization. Future work will need to decouple these two effects. Our code is available in the supplementary material.
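For reference, the Expectation over Transformation (EOT) strategy mentioned above attacks a randomized defense by averaging gradients over its randomness. A minimal sketch, assuming the stochastic preprocessing is exposed as a differentiable callable random_transform and using an arbitrary sample count:

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, random_transform, x, y, n_samples=16):
    """Monte Carlo estimate of the gradient of E_t[loss(model(t(x)), y)] w.r.t. x."""
    # random_transform must be differentiable; otherwise combine with a BPDA-style surrogate.
    x = x.clone().detach().requires_grad_(True)
    losses = torch.stack([
        F.cross_entropy(model(random_transform(x)), y) for _ in range(n_samples)
    ])
    grad, = torch.autograd.grad(losses.mean(), x)
    return grad
```

Any PGD-style loop can then use this averaged gradient in place of a single noisy gradient sample.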
Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., small changes to an input image can induce misclassifications, threatening the reliability of deep-learning-based deployed systems. Adversarial training (AT) is often adopted to improve the robustness of DNNs by training on a mixture of corrupted and clean data. However, most AT-based methods are ineffective against transferred adversarial examples, which are generated to fool a wide spectrum of defense models, and thus cannot satisfy the generalization requirements of real-world scenarios. Moreover, adversarially trained defense models generally cannot produce interpretable predictions for perturbed inputs, whereas domain experts require a highly interpretable robust model to understand the behavior of DNNs. In this work, we propose a method based on the Jacobian norm and selective input gradient regularization (J-SIGR), which encourages linearized robustness through Jacobian normalization and also regularizes perturbation-based saliency maps so that the model's predictions remain interpretable. In this way, we achieve both improved defense and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments show that the proposed J-SIGR confers robustness against transferred adversarial attacks, and we also show that the predictions of the neural network are easy to interpret.
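A hedged sketch of a training objective combining a Jacobian-norm penalty with a selective input-gradient (saliency) regularizer, written from the abstract's description; the Jacobian estimator, the 'selective' masking rule, and the weights below are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def j_sigr_style_loss(model, x, y, lambda_jac=0.01, lambda_sal=0.01, top_q=0.9):
    """Cross-entropy + approximate Jacobian-norm penalty + selective input-gradient penalty."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Cheap Jacobian-norm proxy: squared norm of the input gradient of a random
    # projection of the logits (a Hutchinson-style estimator; an assumption of this sketch).
    v = torch.randn_like(logits)
    jac_vec, = torch.autograd.grad((logits * v).sum(), x, create_graph=True)
    jac_pen = jac_vec.pow(2).flatten(1).sum(1).mean()

    # Selective input-gradient regularizer: penalize only the smaller gradient entries
    # so that the dominant, interpretable saliency is preserved (the selection rule is assumed).
    sal, = torch.autograd.grad(ce, x, create_graph=True)
    flat = sal.abs().flatten(1)
    thresh = flat.quantile(top_q, dim=1, keepdim=True)
    sal_pen = (flat * (flat < thresh)).pow(2).sum(1).mean()

    return ce + lambda_jac * jac_pen + lambda_sal * sal_pen
```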
Designing powerful adversarial attacks is of paramount importance for the evaluation of $\ell_p$-bounded adversarial defenses. Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries. The search space of PGD is dictated by the steepest ascent directions of an objective. Despite the plethora of objective function choices, there is no universally superior option and robustness overestimation may arise from ill-suited objective selection. Driven by this observation, we postulate that the combination of different objectives through a simple loss alternating scheme renders PGD more robust towards design choices. We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_{\infty}$-robust models and 3 datasets. The performance improvement is consistent when compared to the single-loss counterparts. In the CIFAR-10 dataset, our strongest adversarial attack outperforms all of the white-box components of the AutoAttack (AA) ensemble, as well as the most powerful attacks existing in the literature, achieving state-of-the-art results in the computational budget of our study ($T=100$, no restarts).
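A minimal sketch of the alternation scheme, assuming the objectives being cycled are cross-entropy and a margin (CW-style) loss; this particular pair and the per-iteration cycling rule are illustrative choices, not necessarily the combination evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def margin_loss(logits, y):
    """CW-style margin: runner-up logit minus true-class logit (summed over the batch)."""
    z_y = logits.gather(1, y[:, None]).squeeze(1)
    z_other = logits.scatter(1, y[:, None], -float('inf')).amax(dim=1)
    return (z_other - z_y).sum()

def alternating_pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=100):
    """PGD whose objective alternates between cross-entropy and the margin loss at each step."""
    objectives = [lambda l, t: F.cross_entropy(l, t, reduction='sum'), margin_loss]
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for t in range(steps):
        delta.requires_grad_(True)
        logits = model((x + delta).clamp(0, 1))
        loss = objectives[t % len(objectives)](logits, y)    # alternate the objective
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).clamp(0, 1)
```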
Although machine learning is vulnerable to adversarial examples, it still lacks systematic procedures and tools for evaluating its security in different application contexts. In this paper, we discuss how to develop automated and scalable security evaluations of machine learning using practical attacks, and we report a use case on Windows malware detection.
Adversarial training is an effective approach to make deep neural networks robust against adversarial attacks. Recently, different adversarial training defenses have been proposed that not only maintain a high clean accuracy but also show significant robustness against popular and well-studied adversarial attacks such as PGD. High adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as 'gradient masking'. In this work, we analyse the effect of label smoothing on adversarial training as one of the potential causes of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack approach is based on a 'match and deceive' loss that finds optimal adversarial directions through guidance from a surrogate model. Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size. Furthermore, our proposed G-PGA is generic, so it can be combined with an ensemble attack strategy, as we demonstrate for the case of Auto-Attack, leading to efficiency and convergence speed improvements. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation- and adversarial-training-based defenses for securing SRSs. According to the characteristics of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks for speaker recognition (4 white-box and 3 black-box). With careful regard for best practices in defense evaluations, we analyze the strength of the transformations to withstand adaptive attacks. We also evaluate and understand their effectiveness against adaptive attacks when combined with adversarial training. Our study provides many useful insights and findings, many of which are new or inconsistent with conclusions in the image and speech recognition domains; for example, variable- and constant-bit-rate speech compression perform differently, and some non-differentiable transformations remain effective against current promising evasion techniques that often work well in the image domain. We demonstrate that the proposed novel feature-level transformation, combined with adversarial training, is rather effective compared to adversarial training alone in the complete white-box setting, e.g., increasing accuracy by 13.62% and the attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work further sheds light on research directions in this field. We also release our evaluation platform, SpeakerGuard, to foster further research.