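Robustness against adversarial attacks is typically evaluated with adversarial accuracy. This metric, however, is too coarse to properly capture all robustness properties of machine learning models. When evaluated against a strong attack, many defenses provide no accuracy improvement while still contributing partially to adversarial robustness. Popular certification methods suffer from the same issue, as they provide a lower bound on accuracy. To capture finer robustness properties, we propose a new metric for L2 robustness, adversarial angular sparsity, which partially answers the question "how many adversarial examples are there around an input?" We demonstrate its usefulness by evaluating both "strong" and "weak" defenses. We show that some state-of-the-art defenses with very similar accuracy can have very different sparsity on the inputs on which they are not robust. We also show that some weak defenses actually decrease robustness, while others strengthen it in a way that accuracy cannot capture. These differences are predictive of how useful such defenses become when combined with adversarial training.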
translated by 谷歌翻译
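While automatic speech recognition (ASR) has been shown to be vulnerable to adversarial attacks, defenses against these attacks are still lagging. Existing, naive defenses can be partially broken with an adaptive attack. In classification tasks, the randomized smoothing paradigm has been shown to be effective at defending models. However, it is difficult to apply this paradigm to ASR tasks because of their complexity and the sequential nature of their outputs. Our paper overcomes some of these challenges by leveraging speech-specific tools such as enhancement and ROVER voting to design an ASR model that is robust to perturbations. We apply adaptive versions of state-of-the-art attacks, such as the imperceptible ASR attack, to our model, and show that our strongest defense is robust to all attacks that use inaudible noise and can only be broken with very high distortion.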
This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate adversarial risk as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as obscurity to an adversary, and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses.
Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with stronger robustness to blackbox attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks (Kurakin et al., 2017c). However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples: inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models.
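As a rough illustration of the kind of first-order adversary this abstract refers to, the sketch below runs projected gradient ascent on the loss inside an $\ell_\infty$ ball. It is not the authors' implementation: the linear logistic model, the function names, and the values of `eps`, `step`, and `n_steps` are all assumptions chosen to keep the example small, and the input is not clipped to any valid data range.

```python
import numpy as np

def logistic_loss_and_grad(w, b, x, y):
    """Binary cross-entropy of a linear model and its gradient w.r.t. the input x.

    Hypothetical toy model used only for illustration; y is 0 or 1.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad_x = (p - y) * w
    return loss, grad_x

def pgd_linf(w, b, x, y, eps=0.1, step=0.02, n_steps=20, rng=None):
    """Projected gradient ascent on the loss within an l_inf ball of radius eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Start from a random point inside the ball (assumed initialization).
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(n_steps):
        _, grad = logistic_loss_and_grad(w, b, x_adv, y)
        x_adv = x_adv + step * np.sign(grad)       # steepest ascent step for l_inf
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back onto the ball
    return x_adv
```

In adversarial training, the examples returned by such an inner maximization would replace the clean inputs in each parameter update.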
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS (and which illustrate a diverse set of defense strategies) can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
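We show that, when the image domain $[0,1]^d$ is also taken into account, established $l_1$-projected gradient descent (PGD) attacks are suboptimal, because they do not consider that the effective threat model is the intersection of the $l_1$-ball and $[0,1]^d$. We study the expected sparsity of the steepest descent step for this effective threat model and show that the exact projection onto this set is computationally feasible and yields better performance. Moreover, we propose an adaptive form of PGD that is highly effective even with a small budget of iterations. Our resulting $l_1$-APGD is a strong white-box attack, showing that prior works overestimated their $l_1$-robustness. Using $l_1$-APGD for adversarial training, we obtain a robust classifier with SOTA $l_1$-robustness. Finally, we combine $l_1$-APGD and an adaptation of the Square Attack to $l_1$ into $l_1$-AutoAttack, an ensemble of attacks that reliably assesses adversarial robustness for the threat model of the $l_1$-ball intersected with $[0,1]^d$.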
Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks. Unfortunately, the high cost of generating strong adversarial examples makes standard adversarial training impractical on large-scale problems like ImageNet. We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters. Our "free" adversarial training algorithm achieves comparable robustness to PGD adversarial training on the CIFAR-10 and CIFAR-100 datasets at negligible additional cost compared to natural training, and can be 7 to 30 times faster than other strong adversarial training methods. Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train a robust model for the large-scale ImageNet classification task that maintains 40% accuracy against PGD attacks. The code is available at https://github.com/ashafahi/free_adv_train.
Adversarial training, a method for learning robust deep networks, is typically assumed to be more expensive than traditional training due to the necessity of constructing adversarial examples via a first-order method like projected gradient descent (PGD). In this paper, we make the surprising discovery that it is possible to train empirically robust models using a much weaker and cheaper adversary, an approach that was previously believed to be ineffective, rendering the method no more costly than standard training in practice. Specifically, we show that adversarial training with the fast gradient sign method (FGSM), when combined with random initialization, is as effective as PGD-based training but has significantly lower cost. Furthermore, we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy to PGD attacks with $\epsilon = 8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon = 2/255$ in 12 hours, in comparison to past work based on "free" adversarial training which took 10 and 50 hours to reach the same respective thresholds. Finally, we identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail. All code for reproducing the experiments in this paper as well as pretrained model weights are at https://github.com/locuslab/fast_adversarial.
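The single-step adversary described here, FGSM started from a random point, can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the linked repository: the linear logistic model, the function names, and the default step size `alpha = 1.25 * eps` are illustrative choices.

```python
import numpy as np

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss of a linear model w.r.t. the input x (y in {0, 1}).

    Hypothetical toy model standing in for a deep network.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm_random_init(w, b, x, y, eps=8 / 255, alpha=None, rng=None):
    """One FGSM step taken from a uniformly random point inside the eps-ball."""
    rng = np.random.default_rng(0) if rng is None else rng
    alpha = 1.25 * eps if alpha is None else alpha   # assumed step size
    delta = rng.uniform(-eps, eps, size=x.shape)     # random initialization
    grad = input_gradient(w, b, x + delta, y)        # single gradient evaluation
    delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)  # stay inside the ball
    return x + delta
```

Only one gradient with respect to the input is computed per example, which is where the cost advantage over a multi-step PGD adversary comes from.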
Adversarial machine learning has recently been both a major concern and a hot topic, especially given the ubiquitous use of deep neural networks. Adversarial attacks and defenses are often likened to a cat-and-mouse game in which defenders and attackers evolve over time. On the one hand, the goal is to develop strong and robust deep networks that are resistant to malicious actors. On the other hand, in order to achieve that, we need to devise even stronger adversarial attacks to challenge these defense models. Most existing attacks employ a single $\ell_p$ distance (commonly, $p\in\{1,2,\infty\}$) to define the concept of closeness and perform steepest gradient ascent w.r.t. this $p$-norm to update all pixels in an adversarial example in the same way. Each of these $\ell_p$ attacks has its own pros and cons, and no single attack can successfully break through defense models that are robust against multiple $\ell_p$ norms simultaneously. Motivated by these observations, we come up with a natural approach: combining various $\ell_p$ gradient projections on a pixel level to achieve a joint adversarial perturbation. Specifically, we learn how to perturb each pixel to maximize the attack performance, while maintaining the overall visual imperceptibility of adversarial examples. Finally, through various experiments with standardized benchmarks, we show that our method outperforms most current strong attacks across state-of-the-art defense mechanisms, while its perturbations remain visually clean.
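Adversarial robustness has become a central goal in deep learning, both in theory and in practice. However, successful methods for improving adversarial robustness (such as adversarial training) greatly hurt generalization performance on unperturbed data. This could have a major impact on how adversarial robustness affects real-world systems (i.e., many may opt to forgo robustness if doing so improves accuracy on unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation-based training methods within the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated Adversarial Training we retain adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide a mathematical analysis of Interpolated Adversarial Training to confirm its efficiency and demonstrate its advantages in terms of robustness and generalization.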
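The abstract does not spell out which interpolation-based training method is used. One widely used example of this family is mixup-style interpolation, where pairs of inputs and their labels are mixed with a shared random coefficient. The sketch below shows that interpolation step only; it is an assumption-laden illustration (the function name, the Beta parameter `alpha`, and the use of one coefficient per batch are all hypothetical choices), not the authors' training procedure.

```python
import numpy as np

def interpolate_batch(x, y_onehot, alpha=1.0, rng=None):
    """Mixup-style interpolation of a batch with a shuffled copy of itself.

    x: array of shape (batch, ...); y_onehot: array of shape (batch, classes).
    Returns the mixed inputs, the mixed soft labels, and the coefficient used.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    lam = rng.beta(alpha, alpha)       # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))     # partner example for each row
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

In an adversarial training loop, one plausible use is to apply the same interpolation to adversarially perturbed inputs as well as to clean ones, and to train on both mixed batches.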
We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.
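Adaptive defenses, which optimize at test time, promise to improve adversarial robustness. We categorize such adaptive test-time defenses, explain their potential benefits and drawbacks, and evaluate a representative variety of the latest adaptive defenses for image classification. Unfortunately, none of them significantly improves upon static defenses when subjected to our careful case-study evaluation. Some even weaken the underlying static model while increasing inference computation. While these results are disappointing, we still believe that adaptive test-time defenses are a promising avenue of research and, as such, we provide recommendations for their thorough evaluation. We extend the checklist of Carlini et al. (2019) by providing concrete steps specific to adaptive defenses.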
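As a research community, we still lack a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas for training robust models. A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation. Our goal is to establish a standardized benchmark of adversarial robustness that reflects, as accurately as possible, the robustness of the considered models within a reasonable computational budget. To this end, we start by considering the image classification task and introduce restrictions (possibly loosened in the future) on the allowed models. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks, which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. To prevent over-adaptation of new defenses to AutoAttack, we welcome external evaluations based on adaptive attacks, especially where AutoAttack flags a potential overestimation of robustness. Our leaderboard, hosted at https://robustbench.github.io/, contains evaluations of more than 120 models and aims to reflect the current state of the art in image classification on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models and on common corruptions, with possible extensions in the future. Additionally, we open-source the library https://github.com/robustbench/robustbench, which provides unified access to more than 80 robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze the impact of robustness on performance under distribution shift, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.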
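Recently, Wong et al. showed that adversarial training with single-step FGSM leads to a characteristic failure mode named catastrophic overfitting (CO), in which a model suddenly becomes vulnerable to multi-step attacks. They showed that adding a random perturbation prior to FGSM (RS-FGSM) seemed to be sufficient to prevent CO. However, Andriushchenko and Flammarion observed that RS-FGSM still leads to CO for larger perturbations, and proposed an expensive regularizer (GradAlign) to avoid CO. In this work, we methodically revisit the role of noise and clipping in single-step adversarial training. Contrary to previous intuitions, we find that using a stronger noise around the clean sample, combined with not clipping, is highly effective in avoiding CO for large perturbation radii. Based on these observations, we propose Noise-FGSM (N-FGSM), which provides the benefits of single-step adversarial training without suffering from CO. Empirical analyses on a large suite of experiments show that N-FGSM is able to match or surpass the performance of previous single-step methods while achieving a $3\times$ speed-up. Code can be found at https://github.com/pdejorge/n-fgsm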
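The two ingredients named in this abstract, stronger noise around the clean sample and no clipping afterwards, can be sketched in a few lines. This is an illustrative sketch rather than the code from the linked repository: the function name, the generic `grad_fn` callback, the noise factor `k = 2`, and the step size `alpha = eps` are all assumptions.

```python
import numpy as np

def n_fgsm_perturbation(grad_fn, x, eps, k=2.0, alpha=None, rng=None):
    """Single-step perturbation: noisy start, one signed-gradient step, no clipping.

    grad_fn: hypothetical callable returning the loss gradient w.r.t. its input.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    alpha = eps if alpha is None else alpha                  # assumed step size
    noise = rng.uniform(-k * eps, k * eps, size=x.shape)     # noise wider than the eps-ball
    grad = grad_fn(x + noise)                                # gradient at the noisy point
    return noise + alpha * np.sign(grad)                     # deliberately not clipped to eps
```

Because the result is not projected back, the perturbation can exceed `eps` in individual coordinates, and its magnitude is bounded by `(k + 1) * eps` instead.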
The field of defense strategies against adversarial attacks has significantly grown over the last years, but progress is hampered as the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses could be broken later on, making it difficult to identify the state-of-the-art. Frequent pitfalls in the evaluation are improper tuning of hyperparameters of the attacks, gradient obfuscation or masking. In this paper we first propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 50 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
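Neural networks' lack of robustness against attacks raises concerns in security-sensitive settings such as autonomous vehicles. While many countermeasures may look promising, only a few withstand rigorous evaluation. Defenses using random transformations (RT) have shown impressive results, particularly BaRT (Raff et al., 2019) on ImageNet. However, this type of defense has not been rigorously evaluated, leaving its robustness properties poorly understood. Their stochastic properties make evaluation more challenging and render many attacks proposed for deterministic models inapplicable. First, we show that the BPDA attack (Athalye et al., 2018a) used in BaRT's evaluation is ineffective and likely overestimates its robustness. We then attempt to construct the strongest possible RT defense through the informed selection of transformations and Bayesian optimization for tuning their parameters. Furthermore, we create the strongest possible attack to evaluate our RT defense. Our new attack vastly outperforms the baseline, reducing the accuracy by 83% compared to the 19% reduction achieved by the commonly used EoT attack (a $4.3\times$ improvement). Our results indicate that the RT defense on the Imagenette dataset (a ten-class subset of ImageNet) is not robust against adversarial examples. Extending the study further, we use our new attack to adversarially train the RT defense (called AdvRT), resulting in a large robustness gain. Code is available at https://github.com/wagner-group/demystify-random-transform.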
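Evaluating adversarial robustness amounts to finding the minimum perturbation needed to have an input sample misclassified. The inherent complexity of the underlying optimization requires current gradient-based attacks to be carefully tuned, initialized, and possibly executed for many computationally demanding iterations, even if specialized to a given perturbation model. In this work, we overcome these limitations by proposing a fast minimum-norm (FMN) attack that works with different $\ell_p$-norm perturbation models ($p = 0, 1, 2, \infty$), is robust to hyperparameter choices, does not require an adversarial starting point, and converges within a few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\ell_p$-norm constraint of size $\epsilon$, while adapting $\epsilon$ to minimize the distance of the current sample to the decision boundary. Extensive experiments show that FMN significantly outperforms existing attacks in terms of convergence speed and computation time, while reporting comparable or even smaller perturbation sizes.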
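In this discussion paper, we survey recent research on the robustness of machine learning models. As learning algorithms become increasingly popular in data-driven control systems, their robustness to data uncertainty must be ensured in order to maintain reliable safety-critical operations. We begin by reviewing common formalisms for such robustness, and then move on to discuss popular and state-of-the-art techniques for training robust machine learning models, as well as methods for provably certifying such robustness. From this unification of robust machine learning, we identify and discuss pressing directions for future research in the area.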
We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by 11.41% in terms of mean $\ell_2$ perturbation distance.