Deep-learning-based systems are susceptible to adversarial attacks, where small, imperceptible changes to the input alter the model's prediction. However, to date most approaches for detecting these attacks have been designed for image processing systems. Many popular image adversarial detection methods identify adversarial examples from the embedding feature space, whereas in the NLP domain existing state-of-the-art detection approaches focus solely on input text features, without considering the model's embedding space. This work examines what happens when these image-designed strategies are ported to natural language processing (NLP) tasks, and finds that they do not port over well. This is to be expected, since NLP systems have a very different form of input: discrete and sequential in nature, rather than the continuous, fixed-size inputs of images. As an equivalent model-focused NLP detection approach, this work proposes a simple sentence-embedding "residue"-based detector to identify adversarial examples. On many tasks it outperforms the ported image-domain detectors and state-of-the-art NLP-specific detectors.
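As a rough illustration of what a sentence-embedding "residue" detector could look like, the sketch below assumes that the residue is the part of an embedding left over after removing its projection onto the top principal directions of clean data, with a simple classifier trained on top; the function names and the PCA-based construction are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a sentence-embedding "residue" detector.
# Assumption: "residue" = the part of an embedding left after removing
# its projection onto the top-k principal directions of clean data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def residues(embeddings: np.ndarray, pca: PCA) -> np.ndarray:
    """Remove the top-k principal components; keep what is left over."""
    top = pca.transform(embeddings) @ pca.components_  # projection onto top-k directions
    return embeddings - pca.mean_ - top

def fit_detector(clean_emb, adv_emb, k=10):
    pca = PCA(n_components=k).fit(clean_emb)           # fit on clean embeddings only
    X = np.vstack([residues(clean_emb, pca), residues(adv_emb, pca)])
    y = np.array([0] * len(clean_emb) + [1] * len(adv_emb))
    clf = LogisticRegression(max_iter=1000).fit(np.abs(X), y)
    return pca, clf

def is_adversarial(sentence_emb, pca, clf, threshold=0.5):
    r = np.abs(residues(sentence_emb.reshape(1, -1), pca))
    return clf.predict_proba(r)[0, 1] > threshold
```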
Despite the efficiency and scalability of machine learning systems, recent studies have shown that many classification methods, especially deep neural networks (DNNs), are vulnerable to adversarial examples, i.e., examples carefully crafted to fool a well-trained classification model while remaining indistinguishable from natural data to humans. This makes it potentially unsafe to apply DNNs or related methods in security-critical areas. Since this issue was identified by Biggio et al. (2013) and Szegedy et al. (2014), much work has been done in this field, including the development of attack methods for generating adversarial examples and the construction of defense techniques to guard against them. This paper aims to introduce this topic and its latest developments to the statistical community, focusing primarily on the generation of, and defense against, adversarial examples. The computing code (in Python and R) used in the numerical experiments is publicly available for readers to explore the surveyed methods. It is the authors' hope that this paper will encourage more statisticians to work on this important and exciting field of generating and defending against adversarial examples.
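As a concrete example of the kind of attack such a survey discusses, the fast gradient sign method (FGSM) perturbs an input along the sign of the loss gradient; the PyTorch sketch below is purely illustrative and is not taken from the paper's released Python/R code.

```python
# Minimal FGSM-style attack sketch: one signed gradient step on the loss,
# clipped back to a valid image range. `model` is any differentiable classifier.
import torch

def fgsm(model, x, y, eps=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clip to [0, 1].
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```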
Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
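A minimal sketch of the two squeezers named above and of the prediction-comparison step might look as follows; `model.predict` (assumed to return softmax probabilities for a batch of images in [0, 1] with shape (N, H, W, C)) and the threshold value are placeholders, not part of the paper's code.

```python
# Sketch of feature squeezing: compare predictions on the original input with
# predictions on bit-depth-reduced and median-smoothed versions of it.
import numpy as np
from scipy.ndimage import median_filter

def reduce_bit_depth(x: np.ndarray, bits: int = 4) -> np.ndarray:
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels                 # quantize pixels to 2^bits levels

def spatial_smooth(x: np.ndarray, size: int = 2) -> np.ndarray:
    return median_filter(x, size=(1, size, size, 1))     # local median over H and W

def is_adversarial(model, x, threshold=1.0):
    p0 = model.predict(x)
    p1 = model.predict(reduce_bit_depth(x))
    p2 = model.predict(spatial_smooth(x))
    # A large L1 gap between original and squeezed predictions flags the input.
    score = max(np.abs(p0 - p1).sum(axis=-1).max(),
                np.abs(p0 - p2).sum(axis=-1).max())
    return score > threshold
```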
Although deep neural networks have achieved great success on many machine learning tasks, they remain vulnerable to adversarial examples. While gradient-based adversarial attack methods are well explored in the field of computer vision, it is impractical to apply them directly in natural language processing due to the discrete nature of text. To bridge this gap, we propose a general framework for adapting existing gradient-based methods to craft textual adversarial samples. In this framework, gradient-based continuous perturbations are added to the embedding layer and amplified during forward propagation. The final perturbed latent representations are then decoded with a masked language model head to obtain candidate adversarial samples. In this paper, we instantiate our framework with Textual Projected Gradient Descent (T-PGD). We evaluate the framework by performing transfer black-box attacks on three benchmark datasets. Experimental results show that our method achieves better overall performance and produces more fluent and grammatical adversarial samples than strong baseline methods. All code and data will be made public.
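A rough sketch of the framework as described (perturb the embedding layer, then decode the perturbed representation with a masked language model head) is given below. It assumes a Hugging Face-style victim classifier that accepts `inputs_embeds` and shares the BERT embedding space; the single-step sign update and hyperparameters are illustrative simplifications, not the exact T-PGD procedure.

```python
# Illustrative sketch: continuous perturbation in embedding space, decoded by an MLM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def perturb_and_decode(victim, text, label, eps=0.01):
    enc = tok(text, return_tensors="pt")
    emb = mlm.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

    # One ascent step on the victim's loss, taken in the continuous embedding space.
    out = victim(inputs_embeds=emb, attention_mask=enc["attention_mask"],
                 labels=torch.tensor([label]))
    out.loss.backward()
    adv_emb = emb + eps * emb.grad.sign()

    # Decode the perturbed latent representation with the masked-language-model head.
    logits = mlm(inputs_embeds=adv_emb, attention_mask=enc["attention_mask"]).logits
    return tok.decode(logits.argmax(dim=-1)[0].tolist(), skip_special_tokens=True)
```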
Recent natural language processing (NLP) techniques have achieved high performance on benchmark datasets, primarily due to significant improvements in deep learning. Advances in the research community have led to great enhancements in state-of-the-art production systems for NLP tasks such as virtual assistants, speech recognition, and sentiment analysis. However, such NLP systems still often fail when tested with adversarial attacks. This lack of robustness exposes troubling gaps in the language understanding capabilities of current models and creates problems when NLP systems are deployed in real life. In this paper, we present a structured overview of NLP robustness research by summarizing the literature in a systematic way across various dimensions. We then take a deep dive into the different dimensions of robustness, across techniques, metrics, embeddings, and benchmarks. Finally, we argue that robustness should be multi-dimensional, provide insights into current research, and identify gaps in the literature, suggesting directions worth pursuing to address them.
The authors thank Nicholas Carlini (UC Berkeley) and Dimitris Tsipras (MIT) for feedback to improve the survey quality. We also acknowledge X. Huang (Uni. Liverpool), K. R. Reddy (IISC), E. Valle (UNICAMP), Y. Yoo (CLAIR) and others for providing pointers to make the survey more comprehensive.
With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks have been recently found vulnerable to well-designed input samples, called adversarial examples. Adversarial perturbations are imperceptible to human but can easily fool deep neural networks in the testing/deploying stage. The vulnerability to adversarial examples becomes one of the major risks for applying deep neural networks in safety-critical environments. Therefore, attacks and defenses on adversarial examples draw great attention. In this paper, we review recent findings on adversarial examples for deep neural networks, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications for adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples. In addition, three major challenges in adversarial examples and the potential solutions are discussed.
Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that achieving both robustness and generalization in adversarially trained models involves a trade-off. The goal of this work is to provide an in-depth comparison of different approaches to adversarial training in language models. Specifically, we study the effect of pre-training data augmentation, as well as training-time input perturbations vs. embedding space perturbations, on the robustness and generalization of BERT-like language models. Our findings suggest that better robustness can be achieved by pre-training data augmentation or by training with input space perturbation. However, training with embedding space perturbation significantly improves generalization. A linguistic correlation analysis of the neurons of the learned models reveals that the improved generalization is due to 'more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples in adversarial training of language models.
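As an illustration of the embedding-space perturbation style of training compared in this work, a single-step variant of such a training step could look like the sketch below; the model is assumed to be a Hugging Face-style classifier exposing `get_input_embeddings`, and the one-step sign update is a simplification rather than the paper's exact setup.

```python
# Illustrative single training step with an embedding-space perturbation.
import torch

def adv_train_step(model, batch, optimizer, eps=1e-2):
    emb = model.get_input_embeddings()(batch["input_ids"]).detach().requires_grad_(True)

    # First pass: get the gradient of the loss with respect to the embeddings.
    loss = model(inputs_embeds=emb, attention_mask=batch["attention_mask"],
                 labels=batch["labels"]).loss
    loss.backward()

    # Perturb the embeddings along the gradient sign and train on the perturbed point.
    adv_emb = (emb + eps * emb.grad.sign()).detach()
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=adv_emb, attention_mask=batch["attention_mask"],
                     labels=batch["labels"]).loss
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```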
Robustness evaluation against adversarial examples has become increasingly important to unveil the trustworthiness of the prevailing deep models in natural language processing (NLP). However, in contrast to the computer vision domain where the first-order projected gradient descent (PGD) is used as the benchmark approach to generate adversarial examples for robustness evaluation, there lacks a principled first-order gradient-based robustness evaluation framework in NLP. The emerging optimization challenges lie in 1) the discrete nature of textual inputs together with the strong coupling between the perturbation location and the actual content, and 2) the additional constraint that the perturbed text should be fluent and achieve a low perplexity under a language model. These challenges make the development of PGD-like NLP attacks difficult. To bridge the gap, we propose TextGrad, a new attack generator using gradient-driven optimization, supporting high-accuracy and high-quality assessment of adversarial robustness in NLP. Specifically, we address the aforementioned challenges in a unified optimization framework. And we develop an effective convex relaxation method to co-optimize the continuously-relaxed site selection and perturbation variables and leverage an effective sampling method to establish an accurate mapping from the continuous optimization variables to the discrete textual perturbations. Moreover, as a first-order attack generation method, TextGrad can be baked into adversarial training to further improve the robustness of NLP models. Extensive experiments are provided to demonstrate the effectiveness of TextGrad not only in attack generation for robustness evaluation but also in adversarial defense.
Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications such as security surveillance systems, autonomous vehicles, and healthcare. Such safety-critical applications can only be deployed successfully once they are able to overcome safety-critical challenges. Among these challenges is the defense against, and/or the detection of, adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise called perturbations to be added to a clean image to produce an AE. The aim of an AE is to fool the DL model, which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures, i.e., defense or detection methods, have been proposed in the literature. Moreover, the few reviews and surveys that have been published mostly present the taxonomy of threats and countermeasure methods theoretically, with little focus on AE detection methods. In this paper, we focus on the image classification task and attempt to provide a survey of test-time evasion attack detection methods for neural network classifiers. A detailed discussion of such methods is provided, with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also present potential challenges and future perspectives for this research direction.
State-of-the-art attacks on NLP models lack a shared definition of a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that, to successfully preserve semantics, we need to significantly increase the minimum cosine similarity between the embeddings of swapped words and between the sentence encodings of the original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
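The kind of constraint tightening suggested here can be expressed as a simple filter over candidate swaps; in the sketch below, `word_vec` and `encode_sentence` are placeholder embedding functions, and the thresholds are illustrative values rather than the ones recommended in the paper.

```python
# Accept a synonym swap only if word-embedding and sentence-encoding cosine
# similarities both clear minimum thresholds.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_constraints(orig_word, new_word, orig_sent, adv_sent,
                       word_vec, encode_sentence,
                       min_word_sim=0.9, min_sent_sim=0.98):
    word_ok = cosine(word_vec(orig_word), word_vec(new_word)) >= min_word_sim
    sent_ok = cosine(encode_sentence(orig_sent), encode_sentence(adv_sent)) >= min_sent_sim
    return word_ok and sent_ok
```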
Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS, illustrating a diverse set of defense strategies, can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result of showing that a defense was ineffective, this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. Some of our attack strategies are generalizable, but no single strategy would have been sufficient for all defenses. This underlines our key message that adaptive attacks cannot be automated and always require careful and appropriate tuning to a given defense. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models.
Protecting NLP models against misspellings has been an object of research interest for the past few years. Existing remediations typically compromise accuracy or require full model retraining for each new class of attacks. We propose a novel method for adding resilience to misspellings to transformer-based NLP models. This robustness can be achieved without retraining the original NLP model and with only a minimal loss of language understanding performance on inputs without misspellings. Additionally, we propose a new efficient approximate method for generating adversarial misspellings, which significantly reduces the cost of evaluating a model's resilience to adversarial attacks.
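For intuition, a toy generator of candidate misspellings (character swap, drop, duplicate) is sketched below; the paper's efficient approximate attack would score such candidates against the model rather than sampling them blindly, so this is only an illustration.

```python
# Toy generator of candidate misspellings for a single word.
import random

def misspellings(word: str, n: int = 5, attempts: int = 50):
    """Generate up to n candidate misspellings via swap / drop / duplicate edits."""
    candidates = set()
    for _ in range(attempts):
        if len(candidates) >= n or len(word) <= 2:
            break
        i = random.randrange(1, len(word) - 1)           # keep first and last letter intact
        op = random.choice(["swap", "drop", "dup"])
        if op == "swap":
            w = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif op == "drop":
            w = word[:i] + word[i + 1:]
        else:
            w = word[:i] + word[i] + word[i:]
        if w != word:
            candidates.add(w)
    return list(candidates)

print(misspellings("transformer"))
```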
Adversarial attacks in NLP challenge the way we look at language models. The goal of this kind of adversarial attack is to modify the input text to fool a classifier while maintaining the original meaning of the text. Although most existing adversarial attacks claim to fulfill the constraint of semantics preservation, careful scrutiny shows otherwise. We show that the problem lies in the text encoders used to determine the similarity of adversarial examples, specifically in the way they are trained. Unsupervised training methods make these encoders more susceptible to problems with antonym recognition. To overcome this, we introduce a simple, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). The results show that our solution minimizes the variation in the meaning of the adversarial examples generated. It also significantly improves the overall quality of adversarial examples, as confirmed by human evaluators. Furthermore, it can be used as a component in any existing attack to speed up its execution while maintaining similar attack success.
With the rapid progress and adoption of deep learning models in image recognition, security becomes a major concern for their deployment in safety-critical systems. Since the accuracy and robustness of deep learning models are largely attributable to the purity of the training samples, deep learning architectures are often susceptible to adversarial attacks. Adversarial attacks are obtained by making subtle perturbations to normal images that are mostly imperceptible to humans but can seriously confuse state-of-the-art machine learning models. What is so special about these slight, intelligently chosen perturbations or noise added to normal images that leads to catastrophic misclassifications by deep neural networks? Using statistical hypothesis testing, we find that Conditional Variational AutoEncoders (CVAEs) are surprisingly good at detecting imperceptible image perturbations. In this paper, we show how CVAEs can be used effectively to detect adversarial attacks on image classification networks. We demonstrate our results on the CIFAR-10 dataset and show how our method performs comparably to prior methods for detecting adversaries while not getting confused by noisy images, where most existing methods falter.
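A heavily simplified version of the detection idea, replacing the paper's statistical hypothesis test with a percentile threshold on class-conditional reconstruction error, might look like this; `cvae.reconstruct` and `classifier` are placeholders for already-trained models.

```python
# Flag inputs whose CVAE reconstruction error (conditioned on the classifier's
# predicted label) is extreme relative to errors observed on clean data.
import numpy as np

def calibrate_threshold(cvae, classifier, clean_images, percentile=95):
    labels = classifier(clean_images).argmax(axis=-1)
    errs = np.mean((cvae.reconstruct(clean_images, labels) - clean_images) ** 2,
                   axis=(1, 2, 3))
    return np.percentile(errs, percentile)

def detect(cvae, classifier, images, threshold):
    labels = classifier(images).argmax(axis=-1)
    errs = np.mean((cvae.reconstruct(images, labels) - images) ** 2, axis=(1, 2, 3))
    return errs > threshold          # True -> likely adversarial
```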
Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TEXTFOOLER, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate three advantages of this framework: (1) effective: it outperforms previous attacks in both success rate and perturbation rate; (2) utility-preserving: it preserves semantic content, grammaticality, and correct types classified by humans; and (3) efficient: it generates adversarial text with computational complexity linear in the text length.
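A stripped-down sketch of the two-stage recipe described in the abstract (rank words by importance, then greedily substitute synonyms) is shown below; `predict_proba` and `synonyms` are placeholder functions, and the real TEXTFOOLER additionally enforces part-of-speech and semantic-similarity checks on each swap.

```python
# Simplified word-importance ranking plus greedy synonym substitution.
import numpy as np

def word_importance(words, label, predict_proba):
    base = predict_proba(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append(base - predict_proba(reduced)[label])
    return np.argsort(scores)[::-1]                     # most important word first

def attack(text, label, predict_proba, synonyms, max_candidates=10):
    words = text.split()
    for i in word_importance(words, label, predict_proba):
        best = None
        for cand in synonyms(words[i])[:max_candidates]:
            trial = words[:i] + [cand] + words[i + 1:]
            probs = predict_proba(" ".join(trial))
            if probs.argmax() != label:
                return " ".join(trial)                  # attack succeeded
            if best is None or probs[label] < best[0]:
                best = (probs[label], cand)
        if best is not None:
            words[i] = best[1]                          # keep the best swap and continue
    return None                                         # no adversarial example found
```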
The intriguing phenomenon of adversarial examples has attracted significant attention in machine learning, and what may be more surprising to the community is the existence of universal adversarial perturbations (UAPs), i.e., a single perturbation that fools the target DNN on most images. With a focus on deep classifiers, this survey summarizes recent progress on universal adversarial attacks, discussing the challenges on both the attack and defense sides as well as the reasons for the existence of UAPs. We aim to extend this work as a dynamic survey that will regularly update its content to follow new works on UAPs or universal attacks in a wide range of domains, such as image, audio, video, and text. Relevant updates will be discussed at: https://bit.ly/2sbqlgg. We welcome authors of future work in this field to contact us about including your new findings.
Deep transformer neural network models have improved the predictive accuracy of intelligent text processing systems in the biomedical domain. They have achieved state-of-the-art performance scores on a wide variety of biomedical and clinical natural language processing (NLP) benchmarks. However, the robustness and reliability of these models have so far been less explored. Neural NLP models can be easily fooled by adversarial samples, i.e., minor changes to the input that preserve the meaning and understandability of the text but force the NLP system to make erroneous decisions. This raises serious concerns about the security and trustworthiness of biomedical NLP systems, especially when they are intended to be deployed in real-world use cases. We investigated the robustness of several transformer neural language models, namely BioBERT, SciBERT, BioMed-RoBERTa, and Bio-ClinicalBERT, on a wide range of biomedical and clinical text processing tasks. We implemented various adversarial attack methods to test the NLP systems in different attack scenarios. Experimental results show that the biomedical NLP models are sensitive to adversarial samples; their performance drops, on average, by 21 and 18.9 absolute percentage points under character-level and word-level adversarial noise, respectively. Conducting extensive adversarial training experiments, we fine-tuned the NLP models on a mixture of clean samples and adversarial inputs. Results show that adversarial training is an effective defense mechanism against adversarial noise; the models' robustness improves by 11.3 absolute percentage points on average. In addition, the models' performance on clean data increases by 2.4 absolute percentage points on average, demonstrating that adversarial training can boost the generalization abilities of biomedical NLP systems.
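The adversarial fine-tuning setup described above amounts to training on a mixture of clean and attacked inputs; a hedged sketch of how such a mixed dataset might be assembled is shown below, with `attack_fn` and `fine_tune` as placeholders for a character- or word-level attack and a standard training routine.

```python
# Build a mixed dataset of clean and adversarially perturbed (text, label) pairs.
import random

def build_mixed_dataset(clean_pairs, attack_fn, adv_fraction=0.5):
    mixed = list(clean_pairs)
    n_adv = int(adv_fraction * len(clean_pairs))
    for text, label in random.sample(clean_pairs, n_adv):
        adv_text = attack_fn(text, label)        # e.g. character- or word-level noise
        if adv_text is not None:
            mixed.append((adv_text, label))      # adversarial copy keeps the clean label
    random.shuffle(mixed)
    return mixed

# fine_tune(model, build_mixed_dataset(train_pairs, attack_fn))
```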
Existing studies have shown that adversarial examples can be directly attributed to the presence of non-robust features that are highly predictive but can easily be manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features while eliminating the non-robust ones using information bottleneck theory. Through extensive experiments, we show that models trained with our information bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods, while suffering almost no drop in clean accuracy on the SST-2, AGNews, and IMDB datasets.
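For reference, methods in this family typically optimize a variational information bottleneck objective of the following general form (the notation is generic and not necessarily the paper's exact formulation): the first term fits the label from the compressed representation z, while the KL term, weighted by beta, limits how much information z retains about the input x.

```latex
\min_{\theta}\;
\mathbb{E}_{(x,y)}\!\left[
  \mathbb{E}_{z \sim q_{\theta}(z \mid x)}\big[-\log p_{\theta}(y \mid z)\big]
  \;+\; \beta\, \mathrm{KL}\!\big(q_{\theta}(z \mid x)\,\|\, p(z)\big)
\right]
```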
Deep learning models are known to be vulnerable to adversarial examples that are elaborately designed for malicious purposes and are imperceptible to the human perceptual system. Autoencoders, when trained solely on benign examples, have been widely used for (self-supervised) adversarial detection, based on the assumption that adversarial examples yield larger reconstruction errors. However, because adversarial examples are absent from training and autoencoders generalize too strongly, this assumption does not always hold in practice. To alleviate this problem, we explore how to detect adversarial examples using disentangled label/semantic features under an autoencoder structure. Specifically, we propose Disentangled Representation-based Reconstruction (DRR). In DRR, we train an autoencoder on both correctly paired and incorrectly paired label/semantic features to reconstruct benign examples and counterexamples. This mimics the behavior of adversarial examples and reduces the unnecessary generalization ability of the autoencoder. We compare our method with state-of-the-art self-supervised detection methods under different adversarial attacks and different victim models, and it exhibits better performance on various metrics (area under the ROC curve, true positive rate, and true negative rate) for most attack settings. Although DRR was initially designed for vision tasks, we demonstrate that it can also be easily extended to natural language tasks. Notably, unlike other autoencoder-based detectors, our method can provide resistance to adaptive adversaries.