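Deep neural networks are vulnerable to adversarial examples (AEs), which exhibit adversarial transferability: AEs generated for a source model can mislead the predictions of another (target) model. However, transferability has not yet been understood from the perspective of which class the target model's predictions are misled to (i.e., class-aware transferability). In this paper, we distinguish the cases in which a target model predicts the same wrong class as the source model ("same mistake") or a different wrong class ("different mistake") in order to analyze the mechanism and provide an explanation for it. First, our analysis shows (1) that same mistakes are associated with "non-targeted transferability" and (2) that different mistakes occur between similar models regardless of the perturbation size. Second, we present evidence that the difference between same and different mistakes can be explained by non-robust features, which are predictive patterns that humans cannot interpret: different mistakes occur when the non-robust features in AEs are used differently by the models. Non-robust features can therefore provide a consistent explanation for the class-aware transferability of AEs.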
translated by Google Translate
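The same-mistake versus different-mistake distinction in the abstract above can be written down as a small bookkeeping routine. The following is a minimal sketch, not the authors' code: the function names and the category strings (including the two extra categories for examples that fail to fool the source or the target) are assumptions made for illustration.

```python
from collections import Counter


def classify_transfer(true_label, source_pred, target_pred):
    """Categorise how one adversarial example transfers from source to target.

    Category names are hypothetical labels chosen for this sketch.
    """
    if source_pred == true_label:
        return "source_not_fooled"   # not adversarial even for the source model
    if target_pred == true_label:
        return "not_transferred"     # target model still predicts correctly
    if target_pred == source_pred:
        return "same_mistake"        # both models pick the same wrong class
    return "different_mistake"       # target is wrong, but in a different class


def summarize(records):
    """Count categories over (true_label, source_pred, target_pred) triples."""
    return Counter(classify_transfer(*r) for r in records)
```

Calling `summarize` on the predictions collected for a batch of AEs gives the per-category counts from which same-mistake and different-mistake rates can be computed.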
An intriguing property of deep neural networks is the existence of adversarial examples, which can transfer among different architectures. These transferable adversarial examples may severely hinder deep neural network-based applications. Previous works mostly study the transferability using small scale datasets. In this work, we are the first to conduct an extensive study of the transferability over large models and a large scale dataset, and we are also the first to study the transferability of targeted adversarial examples with their target labels. We study both non-targeted and targeted adversarial examples, and show that while transferable non-targeted adversarial examples are easy to find, targeted adversarial examples generated using existing approaches almost never transfer with their target labels. Therefore, we propose novel ensemble-based approaches to generating transferable adversarial examples. Using such approaches, we observe a large proportion of targeted adversarial examples that are able to transfer with their target labels for the first time. We also present some geometric studies to help understand the transferable adversarial examples. Finally, we show that the adversarial examples generated using ensemble-based approaches can successfully attack Clarifai.com, which is a black-box image classification system. * Work is done while visiting UC Berkeley.
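As one of the most effective defense methods against adversarial attacks, adversarial training tends to learn an inclusive decision boundary in order to improve the robustness of deep learning models. However, because of the large and unnecessary increase in the margin along adversarial directions, adversarial training causes heavy cross-over between natural examples and adversarial examples, which makes it harder to balance the trade-off between robustness and natural accuracy. In this paper, we propose a novel adversarial training scheme that achieves a better trade-off between robustness and natural accuracy. It aims to learn a moderately inclusive decision boundary, meaning that the margins of natural examples under the decision boundary are moderate. We call this scheme Moderate-Margin Adversarial Training (MMAT); it generates finer-grained adversarial examples to mitigate the cross-over problem. We also use the logits of a well-trained teacher model to guide the learning of our model. Finally, MMAT achieves high natural accuracy and robustness under both black-box and white-box attacks. On SVHN, for example, it achieves state-of-the-art robustness and natural accuracy.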
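The intriguing phenomenon of adversarial examples has attracted significant attention in machine learning, and what may be even more surprising to the community is the existence of universal adversarial perturbations (UAPs), i.e., a single perturbation that fools the target DNN. Focusing on UAPs against deep classifiers, this survey summarizes recent progress on universal adversarial attacks and discusses the challenges on both the attack and defense sides, as well as the reasons why UAPs exist. We aim to extend this work into a dynamic survey whose content will be updated regularly to follow new works on UAPs or universal attacks in a wide range of domains, such as image, audio, video, text, etc. Relevant updates will be discussed at: https://bit.ly/2sbqlgg. We welcome authors of future works in this field to contact us so that their new findings can be included.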
Deep neural networks are vulnerable to adversarial examples, which poses security concerns on these algorithms due to the potentially severe consequences. Adversarial attacks serve as an important surrogate to evaluate the robustness of deep learning models before they are deployed. However, most existing adversarial attacks can only fool a black-box model with a low success rate. To address this issue, we propose a broad class of momentum-based iterative algorithms to boost adversarial attacks. By integrating the momentum term into the iterative process for attacks, our methods can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples. To further improve the success rates for black-box attacks, we apply momentum iterative algorithms to an ensemble of models, and show that the adversarially trained models with a strong defense ability are also vulnerable to our black-box attacks. We hope that the proposed methods will serve as a benchmark for evaluating the robustness of various deep models and defense methods. With this method, we won first place in both the NIPS 2017 Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions.
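To make the idea of integrating a momentum term into an iterative attack concrete, here is a minimal sketch on a toy logistic-regression model. Several details are assumptions rather than facts stated in the abstract: the L1 normalisation of the gradient, the sign step, the L-infinity projection, the step size `eps / steps`, and all function and parameter names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input x of a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def momentum_iterative_attack(x, y, w, b, eps=0.3, steps=10, mu=1.0):
    """Iterative sign-gradient ascent on the loss with an accumulated momentum term."""
    alpha = eps / steps                 # per-step size (assumed schedule)
    g = [0.0] * len(x)                  # accumulated (momentum) gradient
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_grad(x_adv, y, w, b)
        l1 = sum(abs(v) for v in grad) or 1.0   # avoid dividing by zero
        # Momentum update: decay the old direction, add the normalised gradient.
        g = [mu * gi + v / l1 for gi, v in zip(g, grad)]
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv
```

Because `g` accumulates past gradients, a single noisy gradient changes the update direction less than it would in a plain iterative attack, which is the stabilising effect the abstract describes.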
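We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties, semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less signal for generalization than standard adversarial perturbations; that is, UAPs exploit non-robust features to a smaller extent than standard adversarial perturbations do.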
The authors thank Nicholas Carlini (UC Berkeley) and Dimitris Tsipras (MIT) for feedback to improve the survey quality. We also acknowledge X. Huang (Uni. Liverpool), K. R. Reddy (IISC), E. Valle (UNICAMP), Y. Yoo (CLAIR) and others for providing pointers to make the survey more comprehensive.
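The transferability of adversarial samples has become a serious concern because of its impact on the reliability of machine learning system deployments, as these systems find their way into many critical applications. Understanding the factors that influence the transferability of adversarial samples can help experts make informed decisions about how to build robust and reliable machine learning systems. The goal of this study is to provide insight into the mechanisms behind the transferability of adversarial samples through an attack-centric approach. This attack-centric perspective interprets adversarial samples by assessing the impact, on a given input dataset, of the machine learning attacks that generated them. To achieve this goal, we generate adversarial samples using attacker models and transfer these samples to victim models. We analyze the behavior of the victim models on the adversarial samples and outline four factors that may influence the transferability of adversarial samples. Although these factors are not necessarily exhaustive, they provide useful insights to researchers and practitioners of machine learning systems.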
Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.
Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most of the studies focus on natural images with standardized benchmarks like ImageNet and CIFAR, little research has considered real world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, robustness of chest x-ray classification is much harder to evaluate and leads to very different assessments based on the dataset, the architecture and robustness metric. We argue that previous studies did not take into account the peculiarity of medical diagnosis, like the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks and the risk implications for each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and suggest new methodological considerations for evaluating the robustness of chest x-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of robustness of chest x-ray classification models.
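Advances in deep learning have enabled a wide range of promising applications. However, these systems are vulnerable to adversarial machine learning (AML) attacks: adversarially crafted perturbations to their inputs can cause them to misclassify. Several state-of-the-art adversarial attacks have been shown to fool classifiers reliably, making these attacks a significant threat. Adversarial attack generation algorithms focus primarily on creating successful examples while controlling the magnitude and distribution of the noise to make detection more difficult. The underlying assumption of these attacks is that the adversarial noise is generated offline, which makes execution time a secondary consideration. Recently, however, just-in-time adversarial attacks, in which an attacker opportunistically generates adversarial examples on the fly, have become possible. This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support such real-time adversarial attacks? Understanding this problem improves our understanding of the threat these attacks pose to real-time systems and provides security evaluation benchmarks for future defenses. We therefore first conduct a run-time analysis of adversarial generation algorithms. Universal attacks produce a general attack offline, incur no online overhead, and can be applied to any input; however, their success rate is limited because of their generality. In contrast, online algorithms that work on a specific input are computationally expensive, which makes them unsuitable for operation under time constraints. We therefore propose ROOM, a novel real-time online-offline attack construction model in which an offline component is used to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints.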
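Adversarial examples crafted by an explicit adversary have attracted significant attention in machine learning. However, the security risk posed by a potential false friend has been largely overlooked. In this paper, we reveal the threat of hypocritical examples: inputs that are originally misclassified but are perturbed by a false friend to force correct predictions. While such perturbed examples seem harmless, we point out for the first time that they could be used maliciously to conceal the mistakes of a substandard (i.e., not as good as required) model during evaluation. Once a deployer trusts the hypocritical performance and applies the "well-performing" model in real-world applications, unexpected failures may occur even in benign environments. More seriously, this security risk appears to be pervasive: we find that many types of substandard models are vulnerable to hypocritical examples across multiple datasets. Furthermore, we make a first attempt to characterize the threat with a metric called hypocritical risk and try to circumvent it with several countermeasures. The results demonstrate the effectiveness of the countermeasures, although the risk remains non-negligible even after adaptive robust training.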
Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with stronger robustness to blackbox attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks (Kurakin et al., 2017c). However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.
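The single-step attack that first takes a small random step can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the split of the budget into `alpha` for the random step and `eps - alpha` for the gradient step, the use of random signs, and the names `random_start_single_step` and `grad_fn` are all assumptions.

```python
import random


def sign(v):
    return (v > 0) - (v < 0)


def random_start_single_step(x, grad_fn, eps=0.3, alpha=0.15, rng=None):
    """Random sign step of size alpha, then one gradient-sign step of size eps - alpha.

    grad_fn is a hypothetical callable returning the loss gradient w.r.t. its input.
    The total perturbation stays within eps in the L-infinity norm.
    """
    rng = rng or random.Random()
    # Random step: move away from the data point, where the loss surface may be
    # dominated by small curvature artifacts that mislead a linear approximation.
    x_rand = [xi + alpha * rng.choice((-1.0, 1.0)) for xi in x]
    # Gradient step: evaluated at the randomly shifted point, not at x itself.
    grad = grad_fn(x_rand)
    return [xr + (eps - alpha) * sign(g) for xr, g in zip(x_rand, grad)]
```

In practice `grad_fn` would wrap a framework's automatic differentiation for the model under attack; a plain Python callable is used here only to keep the sketch self-contained.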
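Recent work on black-box adversarial attacks against NLP systems has attracted much attention. Prior black-box attacks assume that the attacker can observe the output labels of the target model for selected inputs. In this work, inspired by adversarial transferability, we propose a new type of black-box NLP adversarial attack in which the attacker chooses a similar domain and transfers adversarial examples to the target domain, causing poor performance in the target model. Based on domain adaptation theory, we propose a defense strategy called Learn2Weight, which is trained to predict weight adjustments for the target model in order to defend against attacks by similar-domain adversarial examples. Using the Amazon multi-domain sentiment classification dataset, we show empirically that Learn2Weight is effective against the attack compared with standard black-box defense methods such as adversarial training and defensive distillation. This work contributes to the growing literature on machine learning security.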
Neural networks are vulnerable to adversarial examples, which poses a threat to their application in security sensitive systems. We propose high-level representation guided denoiser (HGD) as a defense for image classification. A standard denoiser suffers from the error amplification effect, in which small residual adversarial noise is progressively amplified and leads to wrong classifications. HGD overcomes this problem by using a loss function defined as the difference between the target model's outputs activated by the clean image and denoised image. Compared with ensemble adversarial training which is the state-of-the-art defending method on large images, HGD has three advantages. First, with HGD as a defense, the target model is more robust to either white-box or black-box adversarial attacks. Second, HGD can be trained on a small subset of the images and generalizes well to other images and unseen classes. Third, HGD can be transferred to defend models other than the one guiding it. In the NIPS competition on defense against adversarial attacks, our HGD solution won the first place and outperformed other models by a large margin. * Equal contribution.
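The HGD loss described above compares the target model's outputs for the clean image and for the denoised image, rather than comparing pixels. A minimal sketch follows; the use of an L1 distance and the names `hgd_loss` and `features` are assumptions for illustration, and `features` stands in for whichever layer of the target model guides the denoiser.

```python
def hgd_loss(features, clean, denoised):
    """Distance between the target model's outputs on clean and denoised inputs.

    features: hypothetical callable mapping an input to a list of activations.
    The L1 distance is an assumed choice of "difference".
    """
    f_clean = features(clean)
    f_denoised = features(denoised)
    return sum(abs(a - b) for a, b in zip(f_clean, f_denoised))
```

Measuring the loss in the model's output space penalises exactly the residual noise that the model would amplify, which is how the abstract motivates the approach against the error amplification effect.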
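Representation learning, i.e., the generation of representations that are useful for downstream applications, is a task of fundamental importance that underlies the success of deep neural networks (DNNs). Recently, robustness to adversarial examples has emerged as a desirable property for DNNs, spurring the development of robust training methods that account for adversarial examples. In this paper, we aim to understand how the properties of representations learned by robust training differ from those obtained by standard, non-robust training. This is critical for diagnosing numerous salient pitfalls in robust networks, such as degraded performance on benign inputs, poor generalization of robustness, and increased overfitting. We use a powerful set of tools known as representation similarity metrics, across three vision datasets, to obtain layer-wise comparisons between robust and non-robust DNNs with different architectures, training procedures, and adversarial constraints. Our experiments highlight hitherto unseen properties of robust representations that, we posit, underlie the behavioral differences of robust networks. We find a lack of specialization in the representations of robust networks, along with a disappearance of "block structure". We also find that overfitting during robust training largely affects the deeper layers. These and other findings suggest ways forward for the design and training of better robust networks.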
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
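Despite the efficiency and scalability of machine learning systems, recent studies have shown that many classification methods, especially deep neural networks (DNNs), are vulnerable to adversarial examples, i.e., examples that are carefully crafted to fool a well-trained classification model while being indistinguishable from natural data to humans. This makes it potentially unsafe to apply DNNs or related methods in security-critical areas. Since this issue was identified by Biggio et al. (2013) and Szegedy et al. (2014), much work has been done in this field, including the development of attack methods to generate adversarial examples and the construction of defense techniques to guard against them. This paper aims to introduce this topic and its latest developments to the statistical community, focusing primarily on the generation of and defense against adversarial examples. The computing code (in Python and R) used in the numerical experiments is publicly available for readers to explore the surveyed methods. The authors hope that this paper will encourage more statisticians to work in this important and exciting field of generating and defending against adversarial examples.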
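The security of deep learning (DL) systems is an extremely important field of study, because these systems are being deployed in many applications as their performance on challenging tasks keeps improving. Despite their overwhelming promise, deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye but can cause the model to misclassify. Protections against adversarial perturbations based on ensemble techniques have either been shown to be vulnerable to stronger adversaries or been shown to lack an end-to-end evaluation. In this paper, we attempt to develop a new ensemble-based solution that constructs defender models with decision boundaries that differ from those of the original model. The ensemble of classifiers, constructed by (1) transforming the input with a method called Split-and-Shuffle and (2) restricting the significant features with a method called Contrast-Significant-Features, is shown to yield diverse gradients with respect to adversarial attacks, which reduces the chance of transferring adversarial examples from the original model to a defender model targeting the same class. We present extensive experiments on standard image classification datasets, namely MNIST, CIFAR-10, and CIFAR-100, against state-of-the-art adversarial attacks to demonstrate the robustness of the proposed ensemble-based defense. We also evaluate robustness in the presence of a stronger adversary that targets all models in the ensemble simultaneously. Results for the overall false positives and false negatives are provided to estimate the overall performance of the proposed method.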
In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our code will be released.