有针对性的训练集攻击将恶意实例注入训练集中,以导致训练有素的模型错误地标记一个或多个特定的测试实例。这项工作提出了目标识别的任务,该任务决定了特定的测试实例是否是训练集攻击的目标。目标识别可以与对抗性识别相结合,以查找(并删除)攻击实例,从而减轻对其他预测的影响,从而减轻攻击。我们没有专注于单个攻击方法或数据模式,而是基于影响力估计,这量化了每个培训实例对模型预测的贡献。我们表明,现有的影响估计量的不良实际表现通常来自于他们对训练实例和迭代次数的过度依赖。我们重新归一化的影响估计器解决了这一弱点。他们的表现远远超过了原始估计量,可以在对抗和非对抗环境中识别有影响力的训练示例群体,甚至发现多达100%的对抗训练实例,没有清洁数据误报。然后,目标识别简化以检测具有异常影响值的测试实例。我们证明了我们的方法对各种数据域的后门和中毒攻击的有效性,包括文本,视觉和语音,以及针对灰色盒子的自适应攻击者,该攻击者专门优化了逃避我们方法的对抗性实例。我们的源代码可在https://github.com/zaydh/target_indistification中找到。
translated by 谷歌翻译
Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.
translated by 谷歌翻译
与令人印象深刻的进步触动了我们社会的各个方面,基于深度神经网络(DNN)的AI技术正在带来越来越多的安全问题。虽然在考试时间运行的攻击垄断了研究人员的初始关注,但是通过干扰培训过程来利用破坏DNN模型的可能性,代表了破坏训练过程的可能性,这是破坏AI技术的可靠性的进一步严重威胁。在后门攻击中,攻击者损坏了培训数据,以便在测试时间诱导错误的行为。然而,测试时间误差仅在存在与正确制作的输入样本对应的触发事件的情况下被激活。通过这种方式,损坏的网络继续正常输入的预期工作,并且只有当攻击者决定激活网络内隐藏的后门时,才会发生恶意行为。在过去几年中,后门攻击一直是强烈的研究活动的主题,重点是新的攻击阶段的发展,以及可能对策的提议。此概述文件的目标是审查发表的作品,直到现在,分类到目前为止提出的不同类型的攻击和防御。指导分析的分类基于攻击者对培训过程的控制量,以及防御者验证用于培训的数据的完整性,并监控DNN在培训和测试中的操作时间。因此,拟议的分析特别适合于参考他们在运营的应用方案的攻击和防御的强度和弱点。
translated by 谷歌翻译
从外界培训的机器学习模型可能会被数据中毒攻击损坏,将恶意指向到模型的培训集中。对这些攻击的常见防御是数据消毒:在培训模型之前首先过滤出异常培训点。在本文中,我们开发了三次攻击,可以绕过广泛的常见数据消毒防御,包括基于最近邻居,训练损失和奇异值分解的异常探测器。通过增加3%的中毒数据,我们的攻击成功地将Enron垃圾邮件检测数据集的测试错误从3%增加到24%,并且IMDB情绪分类数据集从12%到29%。相比之下,没有明确占据这些数据消毒防御的现有攻击被他们击败。我们的攻击基于两个想法:(i)我们协调我们的攻击将中毒点彼此放置在彼此附近,(ii)我们将每个攻击制定为受限制的优化问题,限制旨在确保中毒点逃避检测。随着这种优化涉及解决昂贵的Bilevel问题,我们的三个攻击对应于基于影响功能的近似近似这个问题的方式; minimax二元性;和karush-kuhn-tucker(kkt)条件。我们的结果强调了对数据中毒攻击产生更强大的防御的必要性。
translated by 谷歌翻译
对抗训练实例可能会严重扭曲模型的行为。这项工作调查了经过认证的回归防御措施,该防御措施提供了保证的限制,即在训练集攻击下回归者的预测可能会发生多少变化。我们的关键见解是,使用中位数作为模型的主要决策功能时,经认证的回归减少了认证分类。将我们的减少与现有认证分类器相结合,我们提出了六个新的可证明的回归剂。就我们的知识而言,这是第一部证明单个回归预测的鲁棒性的工作,而没有任何关于数据分布和模型体系结构的假设。我们还表明,现有的最先进的认证分类器通常会做出过分的假设,可以降低其可证明的保证。我们引入了对模型鲁棒性的更严格的分析,在许多情况下,这会大大改善认证的保证。最后,我们从经验上证明了我们的方法对回归和分类数据的有效性,在1%的训练集腐败和4%以下腐败以下预测中,可以保证多达50%的测试预测准确性。我们的源代码可在https://github.com/zaydh/certified-regression上获得。
translated by 谷歌翻译
在对抗机器学习中,防止对深度学习系统的攻击的新防御能力在释放更强大的攻击后不久就会破坏。在这种情况下,法医工具可以通过追溯成功的根本原因来为现有防御措施提供宝贵的补充,并为缓解措施提供前进的途径,以防止将来采取类似的攻击。在本文中,我们描述了我们为开发用于深度神经网络毒物攻击的法医追溯工具的努力。我们提出了一种新型的迭代聚类和修剪解决方案,该解决方案修剪了“无辜”训练样本,直到所有剩余的是一组造成攻击的中毒数据。我们的方法群群训练样本基于它们对模型参数的影响,然后使用有效的数据解读方法来修剪无辜簇。我们从经验上证明了系统对三种类型的肮脏标签(后门)毒物攻击和三种类型的清洁标签毒药攻击的功效,这些毒物跨越了计算机视觉和恶意软件分类。我们的系统在所有攻击中都达到了98.4%的精度和96.8%的召回。我们还表明,我们的系统与专门攻击它的四种抗纤维法措施相对强大。
translated by 谷歌翻译
后门攻击已被证明是对深度学习模型的严重安全威胁,并且检测给定模型是否已成为后门成为至关重要的任务。现有的防御措施主要建立在观察到后门触发器通常尺寸很小或仅影响几个神经元激活的观察结果。但是,在许多情况下,尤其是对于高级后门攻击,违反了上述观察结果,阻碍了现有防御的性能和适用性。在本文中,我们提出了基于新观察的后门防御范围。也就是说,有效的后门攻击通常需要对中毒训练样本的高预测置信度,以确保训练有素的模型具有很高的可能性。基于此观察结果,Dtinspector首先学习一个可以改变最高信心数据的预测的补丁,然后通过检查在低信心数据上应用学习补丁后检查预测变化的比率来决定后门的存在。对五次后门攻击,四个数据集和三种高级攻击类型的广泛评估证明了拟议防御的有效性。
translated by 谷歌翻译
最近的研究表明,深神经网络(DNN)易受对抗性攻击的影响,包括逃避和后门(中毒)攻击。在防守方面,有密集的努力,改善了对逃避袭击的经验和可怜的稳健性;然而,对后门攻击的可稳健性仍然很大程度上是未开发的。在本文中,我们专注于认证机器学习模型稳健性,反对一般威胁模型,尤其是后门攻击。我们首先通过随机平滑技术提供统一的框架,并展示如何实例化以证明对逃避和后门攻击的鲁棒性。然后,我们提出了第一个强大的培训过程Rab,以平滑训练有素的模型,并证明其稳健性对抗后门攻击。我们派生机学习模型的稳健性突出了培训的机器学习模型,并证明我们的鲁棒性受到紧张。此外,我们表明,可以有效地训练强大的平滑模型,以适用于诸如k最近邻分类器的简单模型,并提出了一种精确的平滑训练算法,该算法消除了从这种模型的噪声分布采样采样的需要。经验上,我们对MNIST,CIFAR-10和Imagenet数据集等DNN,差异私有DNN和K-NN模型等不同机器学习(ML)型号进行了全面的实验,并为反卧系攻击提供认证稳健性的第一个基准。此外,我们在SPAMBase表格数据集上评估K-NN模型,以展示所提出的精确算法的优点。对多元化模型和数据集的综合评价既有关于普通训练时间攻击的进一步强劲学习策略的多样化模型和数据集的综合评价。
translated by 谷歌翻译
计算能力和大型培训数据集的可用性增加,机器学习的成功助长了。假设它充分代表了在测试时遇到的数据,则使用培训数据来学习新模型或更新现有模型。这种假设受到中毒威胁的挑战,这种攻击会操纵训练数据,以损害模型在测试时的表现。尽管中毒已被认为是行业应用中的相关威胁,到目前为止,已经提出了各种不同的攻击和防御措施,但对该领域的完整系统化和批判性审查仍然缺失。在这项调查中,我们在机器学习中提供了中毒攻击和防御措施的全面系统化,审查了过去15年中该领域发表的100多篇论文。我们首先对当前的威胁模型和攻击进行分类,然后相应地组织现有防御。虽然我们主要关注计算机视觉应用程序,但我们认为我们的系统化还包括其他数据模式的最新攻击和防御。最后,我们讨论了中毒研究的现有资源,并阐明了当前的局限性和该研究领域的开放研究问题。
translated by 谷歌翻译
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS-and which illustrate a diverse set of defense strategies-can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result-showing that a defense was ineffective-this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
translated by 谷歌翻译
We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.
translated by 谷歌翻译
Vertical federated learning (VFL) is an emerging paradigm that enables collaborators to build machine learning models together in a distributed fashion. In general, these parties have a group of users in common but own different features. Existing VFL frameworks use cryptographic techniques to provide data privacy and security guarantees, leading to a line of works studying computing efficiency and fast implementation. However, the security of VFL's model remains underexplored.
translated by 谷歌翻译
许多最先进的ML模型在各种任务中具有优于图像分类的人类。具有如此出色的性能,ML模型今天被广泛使用。然而,存在对抗性攻击和数据中毒攻击的真正符合ML模型的稳健性。例如,Engstrom等人。证明了最先进的图像分类器可以容易地被任意图像上的小旋转欺骗。由于ML系统越来越纳入安全性和安全敏感的应用,对抗攻击和数据中毒攻击构成了相当大的威胁。本章侧重于ML安全的两个广泛和重要的领域:对抗攻击和数据中毒攻击。
translated by 谷歌翻译
We conduct a systematic study of backdoor vulnerabilities in normally trained Deep Learning models. They are as dangerous as backdoors injected by data poisoning because both can be equally exploited. We leverage 20 different types of injected backdoor attacks in the literature as the guidance and study their correspondences in normally trained models, which we call natural backdoor vulnerabilities. We find that natural backdoors are widely existing, with most injected backdoor attacks having natural correspondences. We categorize these natural backdoors and propose a general detection framework. It finds 315 natural backdoors in the 56 normally trained models downloaded from the Internet, covering all the different categories, while existing scanners designed for injected backdoors can at most detect 65 backdoors. We also study the root causes and defense of natural backdoors.
translated by 谷歌翻译
深度神经网络(DNNS)在训练过程中容易受到后门攻击的影响。该模型以这种方式损坏正常起作用,但是当输入中的某些模式触发时,会产生预定义的目标标签。现有防御通常依赖于通用后门设置的假设,其中有毒样品共享相同的均匀扳机。但是,最近的高级后门攻击表明,这种假设在动态后门中不再有效,在动态后门中,触发者因输入而异,从而击败了现有的防御。在这项工作中,我们提出了一种新颖的技术BEATRIX(通过革兰氏矩阵检测)。 BEATRIX利用革兰氏矩阵不仅捕获特征相关性,还可以捕获表示形式的适当高阶信息。通过从正常样本的激活模式中学习类条件统计,BEATRIX可以通过捕获激活模式中的异常来识别中毒样品。为了进一步提高识别目标标签的性能,BEATRIX利用基于内核的测试,而无需对表示分布进行任何先前的假设。我们通过与最先进的防御技术进行了广泛的评估和比较来证明我们的方法的有效性。实验结果表明,我们的方法在检测动态后门时达到了91.1%的F1得分,而最新技术只能达到36.9%。
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
随着机器学习数据的策展变得越来越自动化,数据集篡改是一种安装威胁。后门攻击者通过培训数据篡改,以嵌入在该数据上培训的模型中的漏洞。然后通过将“触发”放入模型的输入中的推理时间以推理时间激活此漏洞。典型的后门攻击将触发器直接插入训练数据,尽管在检查时可能会看到这种攻击。相比之下,隐藏的触发后托攻击攻击达到中毒,而无需将触发器放入训练数据即可。然而,这种隐藏的触发攻击在从头开始培训的中毒神经网络时无效。我们开发了一个新的隐藏触发攻击,睡眠代理,在制备过程中使用梯度匹配,数据选择和目标模型重新培训。睡眠者代理是第一个隐藏的触发后门攻击,以对从头开始培训的神经网络有效。我们展示了Imagenet和黑盒设置的有效性。我们的实现代码可以在https://github.com/hsouri/sleeper-agent找到。
translated by 谷歌翻译
我们开发了一种新的原则性算法,用于估计培训数据点对深度学习模型的行为的贡献,例如它做出的特定预测。我们的算法估计了AME,该数量量衡量了将数据点添加到训练数据子集中的预期(平均)边际效应,并从给定的分布中采样。当从均匀分布中采样子集时,AME将还原为众所周知的Shapley值。我们的方法受因果推断和随机实验的启发:我们采样了训练数据的不同子集以训练多个子模型,并评估每个子模型的行为。然后,我们使用套索回归来基于子集组成共同估计每个数据点的AME。在稀疏假设($ k \ ll n $数据点具有较大的AME)下,我们的估计器仅需要$ O(k \ log n)$随机的子模型培训,从而改善了最佳先前的Shapley值估算器。
translated by 谷歌翻译
Data poisoning is an attack on machine learning models wherein the attacker adds examples to the training set to manipulate the behavior of the model at test time. This paper explores poisoning attacks on neural nets. The proposed attacks use "clean-labels"; they don't require the attacker to have any control over the labeling of training data. They are also targeted; they control the behavior of the classifier on a specific test instance without degrading overall classifier performance. For example, an attacker could add a seemingly innocuous image (that is properly labeled) to a training set for a face recognition engine, and control the identity of a chosen person at test time. Because the attacker does not need to control the labeling function, poisons could be entered into the training set simply by leaving them on the web and waiting for them to be scraped by a data collection bot. We present an optimization-based method for crafting poisons, and show that just one single poison image can control classifier behavior when transfer learning is used. For full end-to-end training, we present a "watermarking" strategy that makes poisoning reliable using multiple (≈ 50) poisoned training instances. We demonstrate our method by generating poisoned frog images from the CIFAR dataset and using them to manipulate image classifiers.
translated by 谷歌翻译
A recent trojan attack on deep neural network (DNN) models is one insidious variant of data poisoning attacks. Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty in interpretability of the learned model to misclassify any inputs signed with the attacker's chosen trojan trigger. Since the trojan trigger is a secret guarded and exploited by the attacker, detecting such trojan inputs is a challenge, especially at run-time when models are in active operation. This work builds STRong Intentional Perturbation (STRIP) based run-time trojan attack detection system and focuses on vision system. We intentionally perturb the incoming input, for instance by superimposing various image patterns, and observe the randomness of predicted classes for perturbed inputs from a given deployed model-malicious or benign. A low entropy in predicted classes violates the input-dependence property of a benign model and implies the presence of a malicious input-a characteristic of a trojaned input. The high efficacy of our method is validated through case studies on three popular and contrasting datasets: MNIST, CIFAR10 and GTSRB. We achieve an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers. Using CIFAR10 and GTSRB, we have empirically achieved result of 0% for both FRR and FAR. We have also evaluated STRIP robustness against a number of trojan attack variants and adaptive attacks.
translated by 谷歌翻译