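Adversarial training has been widely used to enhance the robustness of neural network models against adversarial attacks. However, there is still a notable gap between natural accuracy and robust accuracy. We found that one of the reasons is that the commonly used labels, one-hot vectors, hinder the learning process for image recognition. In this paper, we propose a method called Low Temperature Distillation (LTD), which is based on the knowledge distillation framework to generate the desired soft labels. Unlike previous work, LTD uses a relatively low temperature in the teacher model and employs different, but fixed, temperatures for the teacher model and the student model. Moreover, we have investigated methods to synergize the use of natural data and adversarial data in LTD. Experimental results show that, without extra unlabeled data, the proposed method combined with previous work can achieve 57.72% and 30.36% robust accuracy on the CIFAR-10 and CIFAR-100 datasets respectively, which is an improvement of about 1.21% on average over the state-of-the-art methods.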
translated by 谷歌翻译
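To make the role of the temperature in the LTD abstract concrete, the following is a minimal sketch of a temperature-scaled softmax, the standard way knowledge distillation turns teacher logits into soft labels. The logits and the two temperature values are illustrative assumptions; the abstract does not state which temperatures LTD actually uses.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a 3-class problem (illustrative values only).
teacher_logits = [4.0, 2.0, 1.0]

# A lower temperature keeps the label close to one-hot;
# a higher temperature spreads probability mass over the other classes.
sharp_labels = softmax_with_temperature(teacher_logits, temperature=0.5)
soft_labels = softmax_with_temperature(teacher_logits, temperature=5.0)
```

Under these assumed values, `sharp_labels` puts most of its mass on the first class, while `soft_labels` is much flatter; both still rank the classes in the same order.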
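As one of the most effective defenses against adversarial attacks, adversarial training tends to learn an inclusive decision boundary to increase the robustness of deep learning models. However, due to the large and unnecessary increase in the margin along adversarial directions, adversarial training causes heavy cross-over between natural examples and adversarial examples, which is not conducive to balancing the trade-off between robustness and natural accuracy. In this paper, we propose a novel adversarial training scheme to achieve a better trade-off between robustness and natural accuracy. It aims to learn a moderately inclusive decision boundary, which means that the margins of natural examples under the decision boundary are moderate. We call this scheme Moderate-Margin Adversarial Training (MMAT); it generates finer-grained adversarial examples to mitigate the cross-over problem. We also use the logits of a well-trained teacher model to guide the learning of our model. Finally, MMAT achieves high natural accuracy and robustness under both black-box and white-box attacks. On SVHN, for example, state-of-the-art robustness and natural accuracy are achieved.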
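Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., a small change in an input image can induce a misclassification, which threatens the reliability of deployed deep-learning-based systems. Adversarial training (AT) is often adopted to improve the robustness of DNNs by training on a mixture of corrupted and clean data. However, most AT-based methods are ineffective in dealing with transferred adversarial examples, which are generated to fool a wide spectrum of defense models, and thus cannot satisfy the generalization requirement raised in real-world scenarios. Moreover, adversarially training a defense model in general cannot produce interpretable predictions for perturbed inputs, whereas domain experts need a highly interpretable robust model in order to understand the behavior of a DNN. In this work, we propose an approach based on Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and also regularizes perturbation-based saliency maps to imitate the model's interpretable predictions. As such, we achieve both improved defense and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments demonstrate that the proposed J-SIGR confers improved robustness against transferred adversarial attacks, and we also show that the predictions from the neural network are easy to interpret.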
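Evaluating the robustness of a defense model is a challenging task in adversarial robustness research. Obfuscated gradients, a type of gradient masking, have previously been found to exist in many defense methods and to cause a false signal of robustness. In this paper, we identify a more subtle situation, called imbalanced gradients, that can also cause overestimated adversarial robustness. The phenomenon of imbalanced gradients occurs when the gradient of one term of the margin loss dominates and pushes the attack towards a suboptimal direction. To exploit imbalanced gradients, we formulate a Margin Decomposition (MD) attack that decomposes a margin loss into individual terms and then explores the attackability of these terms separately via a two-stage process. We also propose a MultiTargeted version and an ensemble version of our MD attack. By investigating 17 defense models proposed since 2018, we find that 6 models are susceptible to imbalanced gradients, and our MD attack can decrease their robustness, as evaluated by the best baseline standalone attack, by another 2%. We also provide an in-depth analysis of the likely causes of imbalanced gradients and of effective countermeasures.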
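As an illustration of what decomposing a margin loss into individual terms can look like, here is a minimal sketch. It assumes a commonly used margin loss on logits (largest wrong-class logit minus true-class logit); the abstract does not spell out the exact loss, and the two-stage attack schedule itself is not shown.

```python
def margin_loss_terms(logits, true_label):
    """Split an assumed margin loss into its two terms.

    Returns (other_term, true_term), where
      other_term = largest logit among the wrong classes
      true_term  = negative logit of the true class
    """
    other_term = max(z for i, z in enumerate(logits) if i != true_label)
    true_term = -logits[true_label]
    return other_term, true_term

def margin_loss(logits, true_label):
    """Margin loss as the sum of its two terms; positive means misclassified."""
    other_term, true_term = margin_loss_terms(logits, true_label)
    return other_term + true_term

# Hypothetical logits for a 3-class example whose true class is 0.
example_logits = [3.0, 1.0, 2.5]
other_term, true_term = margin_loss_terms(example_logits, true_label=0)
total = margin_loss(example_logits, true_label=0)
```

An attacker maximizes the sum, but the gradients of the two terms with respect to the input can differ greatly in magnitude; inspecting them separately is one way to see whether a single term is steering the attack.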
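The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to overcome this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of the Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain consists of the adversarially perturbed examples, our method learns a feature representation that is constrained not to discriminate between natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL into the adversarial training process improves both robustness and standard accuracy.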
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS (and which illustrate a diverse set of defense strategies) can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
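In adversarial learning, it is sometimes necessary to improve the performance of certain special classes, or to give them particular protection against attacks. This paper proposes a framework that combines cost-sensitive classification and adversarial learning to train a model that can distinguish between protected and unprotected classes, so that the protected classes are less vulnerable to adversarial examples. Within this framework, we find an interesting phenomenon during the training of deep neural networks, called the Min-Max property, which concerns the absolute values of most parameters in the convolutional layers. Based on this Min-Max property, which is formulated and analyzed from the viewpoint of random distributions, we further build a new defense model against adversarial examples to improve adversarial robustness. An advantage of the resulting model is that it performs better than the standard model and can be combined with adversarial training to improve performance further. Experiments confirm that, in terms of average accuracy over all classes, our model is almost as good as existing models when no attack occurs and better than existing models when an attack occurs. In particular, in terms of accuracy on the protected classes, the proposed model is much better than existing models when an attack occurs.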
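Despite the efficiency and scalability of machine learning systems, recent studies have demonstrated that many classification methods, especially deep neural networks (DNNs), are vulnerable to adversarial examples, i.e., examples that are carefully crafted to fool a well-trained classification model while being indistinguishable from natural data to humans. This makes it potentially unsafe to apply DNNs or related methods in security-critical areas. Since this issue was first identified by Biggio et al. (2013) and Szegedy et al. (2014), much work has been done in this field, including the development of attack methods to generate adversarial examples and the construction of defense techniques to guard against such examples. This paper aims to introduce this topic and its latest developments to the statistical community, focusing primarily on the generation of and defense against adversarial examples. The computing code (in Python and R) used in the numerical experiments is publicly available for readers to explore the surveyed methods. The authors hope that this paper will encourage more statisticians to work in this important and exciting field of generating and defending against adversarial examples.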
Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern on the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability of generating barely noticeable (to both humans and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to * Pin-Yu Chen and Huan Zhang contribute equally to this work.
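The idea of estimating gradients from function values alone, combined with stochastic coordinate descent, can be sketched as follows. This is a toy illustration under stated assumptions: the symmetric difference quotient, the step size, and the quadratic `toy_objective` are choices made here for demonstration, and the dimension reduction, hierarchical attack and importance sampling mentioned in the abstract are not shown.

```python
import random

def estimate_partial_derivative(f, x, index, h=1e-4):
    """Estimate df/dx[index] from two function evaluations only."""
    x_plus = list(x)
    x_plus[index] += h
    x_minus = list(x)
    x_minus[index] -= h
    return (f(x_plus) - f(x_minus)) / (2.0 * h)

def zeroth_order_coordinate_descent(f, x0, steps=2000, lr=0.1, h=1e-4, seed=0):
    """Minimize f by updating one randomly chosen coordinate per step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        i = rng.randrange(len(x))  # pick a coordinate at random
        g = estimate_partial_derivative(f, x, i, h)
        x[i] -= lr * g  # gradient step along that coordinate only
    return x

def toy_objective(x):
    """Hypothetical black-box objective with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x_min = zeroth_order_coordinate_descent(toy_objective, [0.0, 0.0])
```

The optimizer never calls a gradient routine: every update costs two queries to `f`, which mirrors the black-box setting where only inputs and confidence scores are available.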
Adversarial training is an effective approach to make deep neural networks robust against adversarial attacks. Recently, different adversarial training defenses have been proposed that not only maintain a high clean accuracy but also show significant robustness against popular and well-studied adversarial attacks such as PGD. High adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as 'gradient masking'. In this work, we analyse the effect of label smoothing on adversarial training as one of the potential causes of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack approach is based on a 'match and deceive' loss that finds optimal adversarial directions through guidance from a surrogate model. Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size. Furthermore, our proposed G-PGA is generic, thus it can be combined with an ensemble attack strategy as we demonstrate for the case of Auto-Attack, leading to efficiency and convergence speed improvements. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
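For readers unfamiliar with label smoothing, the sketch below shows its common uniform form, in which a fraction of the probability mass is moved from the true class to all classes equally. The abstract does not say which smoothing variant or strength was analysed, so the function name and the `smoothing` value here are assumptions for illustration.

```python
def smooth_labels(true_label, num_classes, smoothing=0.1):
    """Return a uniformly smoothed label vector.

    Each class receives smoothing / num_classes, and the true class
    additionally receives the remaining 1 - smoothing.
    """
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    off_value = smoothing / num_classes
    labels = [off_value] * num_classes
    labels[true_label] += 1.0 - smoothing
    return labels

# Illustrative 4-class example with an assumed smoothing strength of 0.2.
smoothed = smooth_labels(true_label=1, num_classes=4, smoothing=0.2)
```

With these assumed values the true class ends up with a target of about 0.85 and each other class with about 0.05, so the targets still sum to one.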
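Most adversarial attack defense methods rely on obfuscating gradients. These methods are successful in defending against gradient-based attacks; however, they are easily circumvented by attacks that either do not use the gradient or that approximate and use a corrected gradient. Defenses that do not obfuscate gradients, such as adversarial training, exist, but these approaches generally make assumptions about the attack, such as its magnitude. We propose a classification model that does not obfuscate gradients and is robust by construction without assuming any prior knowledge about the attack. Our method casts classification as an optimization problem in which we "invert" a conditional generator trained on unperturbed, natural images to find the class that generates the sample closest to the query image. We hypothesize that a potential source of brittleness against adversarial attacks is the high-to-low-dimensional nature of feed-forward classifiers, which allows an adversary to find small perturbations in the input space that lead to large changes in the output space. A generative model, on the other hand, is typically a low-to-high-dimensional mapping. While the method is related to Defense-GAN, the use of a conditional generative model and inversion in our model, instead of feed-forward classification, is a critical difference. Unlike Defense-GAN, which was shown to generate obfuscated gradients that are easily circumvented, we show that our method does not obfuscate gradients. We demonstrate that our model is extremely robust against black-box attacks and has improved robustness against white-box attacks compared to naturally trained, feed-forward classifiers.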
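Adversarial training (AT) has shown excellent performance in defending against adversarial examples. Recent studies demonstrate that examples are not equally important to the final robustness of models during AT; that is, the so-called hard examples, which can be attacked easily, have more influence on the final robustness than robust examples do. Therefore, guaranteeing the robustness of hard examples is crucial for improving the final robustness of the model. However, defining effective heuristics to search for hard examples is still difficult. In this paper, inspired by the information bottleneck (IB) principle, we find that an example with high mutual information between the input and its associated latent representation is more likely to be attacked. Based on this observation, we propose a novel and effective adversarial training method (InfoAT). InfoAT is encouraged to find examples with high mutual information and to exploit them efficiently to improve the final robustness of models. Experimental results show that InfoAT achieves the best robustness across different datasets and models in comparison with several state-of-the-art methods.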
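To combat the threat of adversarial examples, adversarial training provides an attractive option for improving model robustness by training models on online-augmented adversarial examples. However, most existing adversarial training methods focus on improving robust accuracy by strengthening the adversarial examples but neglect the increasing shift between natural data and adversarial examples, leading to a dramatic decrease in natural accuracy. To maintain the trade-off between natural and robust accuracy, we alleviate this shift from the perspective of feature adaptation and propose Feature Adaptive Adversarial Training (FAAT), which optimizes class-conditional feature adaptation across natural data and adversarial examples. Specifically, we propose to incorporate a class-conditional discriminator to encourage the features to become (1) class-discriminative and (2) invariant to the changes caused by adversarial attacks. The novel FAAT framework enables the trade-off between natural and robust accuracy by generating features with similar distributions across natural and adversarial data, and achieves higher overall robustness by benefiting from class-discriminative features. Experiments on various datasets demonstrate that FAAT produces more discriminative features and performs favorably against state-of-the-art methods. Code is available at https://github.com/visionflow/faat.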
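Deep convolutional neural networks (CNNs) can easily be fooled by subtle, imperceptible changes to the input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., using a cross-entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries, which are in turn used to learn a robust model. Specifically, we use the style and content information of a target sample from another class, alongside its class-boundary information, to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between the source image and its adversary and maximizes the distance between the adversary and the target image. Compared to state-of-the-art defenses, our adversarial training approach demonstrates strong robustness, generalizes well to naturally occurring corruptions and data distribution shifts, and retains the model's accuracy on clean examples.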
Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with stronger robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks (Kurakin et al., 2017c). However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models.
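The adversarial robustness of deep neural networks has been actively investigated. However, most existing defense approaches are limited to a specific type of adversarial perturbation. Specifically, they often fail to offer resistance to multiple attack types simultaneously, i.e., they lack multi-perturbation robustness. Furthermore, compared to image recognition problems, the adversarial robustness of video recognition models is relatively unexplored. While several studies have proposed how to generate adversarial videos, only a handful of approaches to defense strategies have been published in the literature. In this paper, we propose one of the first defense strategies against multiple types of adversarial videos for video recognition. The proposed method, referred to as MultiBN, performs adversarial training on multiple adversarial video types using multiple independent batch normalization (BN) layers with a learning-based BN selection module. With a multiple-BN structure, each BN branch is responsible for learning the distribution of a single perturbation type and thus provides more precise distribution estimates. This mechanism is beneficial for dealing with multiple perturbation types. The BN selection module detects the attack type of an input video and sends it to the corresponding BN branch, making MultiBN fully automatic and allowing end-to-end training. Compared to present adversarial training approaches, the proposed MultiBN exhibits stronger multi-perturbation robustness against different and even unforeseen adversarial video types, ranging from Lp-bounded attacks to physically realizable attacks. This holds true across different datasets and target models. Moreover, we conduct an extensive analysis to study the properties of the multiple-BN structure.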
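Adversarial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. However, most AT methods incur expensive time and computational costs for calculating gradients at multiple steps when generating adversarial examples. To boost training efficiency, the fast gradient sign method (FGSM) is adopted in fast AT methods, which calculate the gradient only once. Unfortunately, the resulting robustness is far from satisfactory. One reason may be the way the perturbation is initialized. Existing fast AT methods generally use a random, sample-agnostic initialization, which facilitates efficiency but hinders further robustness improvement. Up to now, initialization in fast AT has not been extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., the output of a generative network conditioned on a benign image and its gradient information from the target network. As the generative network and the target network are optimized jointly in the training phase, the former can adaptively generate an effective initialization with respect to the latter, which leads to gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as robustness comparable to advanced multi-step AT methods. The code is released at https://github.com//jiaxiaojunqaq//fgsm-sdi.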
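The single-gradient step that FGSM relies on can be sketched on a toy model. The example below assumes a linear classifier with a logistic loss so that the input gradient has a closed form; the weights, input and `epsilon` are hypothetical values, and the sample-dependent initialization proposed in the abstract is not shown (the step here starts from the clean input).

```python
import math

def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def logistic_loss(w, x, y):
    """Loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))

def loss_gradient_wrt_input(w, x, y):
    """Closed-form gradient of the logistic loss with respect to x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def fgsm(w, x, y, epsilon):
    """One FGSM step: move each input coordinate by epsilon along the gradient sign."""
    grad = loss_gradient_wrt_input(w, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical model weights, clean input and label.
weights = [2.0, -1.0]
clean_x = [0.5, 0.5]
label = 1
adv_x = fgsm(weights, clean_x, label, epsilon=0.1)
```

Because the gradient is computed only once, the cost per example is a single backward pass, which is the efficiency argument behind fast AT; for this toy model the step increases the loss from the clean input to `adv_x`.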
We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by 11.41% in terms of mean ℓ2 perturbation distance.
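The decomposition described in this abstract can be written compactly. The symbols below are notation chosen here for illustration (the abstract itself gives no symbols): $\mathcal{R}_{\mathrm{rob}}$ for the robust error, $\mathcal{R}_{\mathrm{nat}}$ for the natural classification error, and $\mathcal{R}_{\mathrm{bdy}}$ for the boundary error of a classifier $f$.

```latex
\[
  \mathcal{R}_{\mathrm{rob}}(f)
  \;=\;
  \mathcal{R}_{\mathrm{nat}}(f)
  \;+\;
  \mathcal{R}_{\mathrm{bdy}}(f)
\]
```

Read this way, lowering the robust error requires controlling both terms at once: the first rewards accuracy on clean inputs, while the second penalizes inputs that are classified correctly but lie close enough to the decision boundary to be perturbed across it.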
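Neural networks are known to be vulnerable to adversarial attacks: slight but carefully constructed perturbations of the inputs that can drastically impair the network's performance. Many defense methods have been proposed for improving the robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness that builds upon insights from the field of domain adaptation. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant to adversarial perturbations of the inputs. This is achieved through a game in which we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e., that cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/bashivanlab/afd.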
Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.