A Perturbation Resistant Transformation and Classification System for Deep Neural Networks

Nathaniel Dean nxd551@miami.edu Dilip Sarkar sarkar@cs.miami.edu

Abstract

Deep convolutional neural networks accurately classify a diverse range of natural images, but may be easily deceived when designed, imperceptible perturbations are embedded in the images. In this paper, we design a multi-pronged training, input transformation, and image ensemble system that is attack agnostic and not easily estimated. Our system incorporates two novel features. The first is a transformation layer that computes feature level polynomial kernels from class-level training data samples and iteratively updates input image copies at inference time based on their feature kernel differences to create an ensemble of transformed inputs. The second is a classification system that incorporates the prediction of the undefended network with a hard vote on the ensemble of filtered images. Our evaluations on the CIFAR10 dataset show our system improves the robustness of an undefended network against a variety of bounded and unbounded white-box attacks under different distance metrics, while sacrificing little accuracy on clean images. Against adaptive full-knowledge attackers creating end-to-end attacks, our system successfully augments the existing robustness of adversarially trained networks, for which our methods are most effectively applied.

1 Introduction

Modern neural networks provide state-of-the-art classification performance in many applications such as image recognition and language processing. As a result, neural networks are being deployed in a rapidly growing range of everyday tasks, and their safety and security have become correspondingly important.

Despite their increasing sophistication and diminishing error rates, original work by Szegedy et al. [30] demonstrated that small manipulations to images, called adversarial examples, can easily fool neural networks. In a white-box adversarial setting, a malicious actor exploits knowledge of a neural network’s parameters and defenses to intelligently create these examples, which are then misclassified by the network. Previous efforts to combat these perturbations rely on several strategies to formulate algorithms or network architectures that are more resistant to adversarial attacks. The two main camps of defenses either manipulate network learning procedures at training time (adversarial training) [18, 10, 21, 23, 36, 35, 32, 22] or apply transformations to inputs that counter adversarial perturbations before presenting them to the network for classification or detection [29, 9, 28, 20, 33, 13, 1, 34].

In response to these defenses, many white-box attack techniques have been developed that reduce their effectiveness or bypass them altogether, particularly against the latter camp of defense techniques [7, 6, 15, 2].

In this paper, we present a novel system whose elements span the work of both camps. We adversarially train a neural network and then encapsulate it with custom preprocessing and classification layers. The core element of our preprocessing technique borrows ideas from neural style transfer [11]: we copy the input once for each possible label and impart the feature level polynomial kernel statistics of sampled training set images from each class onto the copies. Our classification layer treats the transformed copies as a hard vote committee for classification. However, by default, the system assumes that the classification on the original input is correct and only changes its decision if a minimum quorum of the committee agrees on a new label.

We evaluate our defense both statically and adaptively against a ResNet18 [14] classifying the CIFAR10 [16] dataset.

Our system does share similarities with previous defense strategies that have known vulnerabilities, namely in the preprocessing and ensembling effects, but it differs in one fundamental way. Previous ensembles of preprocessing defenses have been rendered ineffective [15] by adaptive attackers, but to our knowledge these defenses incorporated processing filters that attempted to ‘clean’ contaminated examples and maintain (or simplify) their salient features in the pixel space. In contrast, our polynomial kernel optimization strategy splays the input image towards all other labels at the feature level. Coupled with our classification system, the full-knowledge attacker has the burden of convincing a critical mass of these seemingly opposing transformations to coalesce around a single incorrect label with a single perturbation set; otherwise, the perturbed image passes through the baseline adversarially trained network (the default classification unless the committee overrules it).

We make the following contributions in this paper:

We construct a novel defense system that shows increased robustness against a diverse set of attacks of varying type, imperceptibility, and strength.
We demonstrate how imparting feature level polynomial kernels from training set sample images to new examples can correctly reclassify weak adversaries and form an ensemble of diversely transformed images.
We form a robust classification system that adapts the final prediction based on both the original input and transformed input copies.

1.1 Preliminary Notations

A convolutional neural network classifier is considered as a function $f(x; θ)$ that maps an input image $x \in [0, 1]^{C \times W \times H}$ to a vector $y \in \mathbb{R}^{K}$ given $θ$, a set of network weights and biases. Element $y_k$ of $y = \{y_1, y_2, \dots, y_K\}$ represents the probability that $x$ belongs to class $k$, such that $\sum_{k=1}^{K} y_k = 1$, and $K$ is the number of possible distinct labels that can be individually assigned to $x$. A final predicted label, $y_p$, is assigned to $x$ by the network using the classification function $Y = \operatorname{argmax}_k (y)$. If the ground-truth label of $x$, $y_t$, is equal to $y_p$, then the network has correctly classified $x$.

Given a training set $T$, a classifier $f(x; θ)$ trained on $T$ and evaluated on a batch of images $X = \{x_1, x_2, \dots, x_B\}$, an illicit actor may gain unhindered access to $θ$ and compute a set of perturbations $Δ = \{δ_1, δ_2, \dots, δ_B\}$ to define a set of adversarial inputs, $x_{adv,b} = x_b + δ_b$, $b = 1, \dots, B$. The goal of each adversary is to trick the network into a misclassification $Y(f(x_{adv,b}; θ)) \neq y_t |_{x_b}$, and more generally the actor maximizes $E[\mathbf{1}(Y(f(x_{adv}; θ)) \neq y_t |_x)]$.

1.2 Problem Statement

In this work, we propose and evaluate defense mechanisms that minimize the effect of adversarial attacks. Formally,

Given a trained neural network $f(x; θ)$ and a set of potentially perturbed images in a white-box adversarial setting, we must present a defense that minimizes $E[\mathbf{1}(Y(f(x_{adv}; θ)) \neq y_t |_x)]$.

1.3 Threat Model

The threat model determines under what conditions we claim the defense mechanism to be secure [2, 5]. We attempt to evaluate performance in a white-box adversarial setting with the following assumptions that are specific to our defense:

The attacker knows the neural network’s training set, $T$ , and the network’s weights and biases, $θ$ .
The attacker is aware of the defense mechanism’s algorithmic steps, hyperparameter settings, and feature level loss calculations.

Before we present our system we briefly review existing attack methods and previously proposed defenses.

2 Review

First we present attack methods and then the defenses working to mitigate the attacks.

2.1 Attack Methods

2.1.1 Fast Gradient Sign Method (FGSM)

Goodfellow et al. [12] proposed the Fast Gradient Sign Method as an efficient approach to generate adversarial examples based on the gradient of the classification loss function with respect to the image. An image $x$ is perturbed to obtain an adversarial image $x_{adv}$ as follows:

$$x_{adv} = x + α \cdot \operatorname{sign}(\nabla_x L(x, y_t; θ)) \tag{1}$$

where $α$ is an $L_\infty$-bounded step size, $y_t$ is the ground-truth label of $x$, $L$ is a classification loss function such as cross-entropy loss, and $θ$ is the set of neural network parameters. FGSM calculates the adversarial example in a single step by taking the sign of the gradient and moving a fixed distance to stay within an $L_\infty$-norm boundary around $x$.

2.1.2 Basic Iterative Method (BIM)

The Basic Iterative Method (BIM) [17] is an iterative version of FGSM using smaller incremental step sizes and clipping pixel values after each update as necessary. The basic update step to the input is

$$x_{n+1} = \mathrm{Clip}_ϵ (x_n + α \cdot \operatorname{sign}(\nabla_{x_n} L(x_n, y_t; θ))) \tag{2}$$

where $\mathrm{Clip}_ϵ$ projects the image to a maximum $L_\infty$-bound set by $ϵ$.
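As a rough illustration of update (2), the sketch below runs BIM against a toy logistic-regression "network" whose input gradient can be written by hand. The model, the weights `w` and `b`, and the helper names are assumptions made for this example only; they are not part of the evaluated system. With `steps=1` and `alpha` equal to `eps`, the same routine reduces to the single-step FGSM update (1).

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def loss_gradient(x, y_true, w, b):
    """Gradient of binary cross-entropy w.r.t. x for a toy logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
    return [(p - y_true) * wi for wi in w]

def bim(x, y_true, w, b, eps, alpha, steps):
    """Basic Iterative Method: signed-gradient steps, each followed by clipping."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y_true, w, b)
        x_adv = [xi + alpha * sign(g) for xi, g in zip(x_adv, grad)]
        # Clip_eps: stay within eps of the clean image and inside [0, 1].
        x_adv = [min(max(xa, x0 - eps, 0.0), x0 + eps, 1.0)
                 for xa, x0 in zip(x_adv, x)]
    return x_adv

x_clean = [0.5, 0.5]
x_bim = bim(x_clean, y_true=1, w=[2.0, -1.0], b=0.0, eps=0.1, alpha=0.04, steps=5)
x_fgsm = bim(x_clean, y_true=1, w=[2.0, -1.0], b=0.0, eps=0.1, alpha=0.1, steps=1)
```

In this toy setting both attacks end at the corner of the $L_\infty$ ball that lowers the logit of the true class, which is the expected behaviour when the gradient sign does not change between steps.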

2.1.3 Projected Gradient Descent (PGD)

Madry et al. [18] generate adversarial examples from a generalized version of BIM that projects to different $L_{p}$ -bounds. They have also shown that networks trained using adversarial examples generated by their methods are resilient against adversarial attacks.

2.1.4 DeepFool

Moosavi-Dezfooli et al. [19] convert the local non-linear problem of finding an adversarial example to that of an affine problem by linearizing the decision boundary between classes and projecting the input image towards a classification boundary in an iterative manner until misclassification occurs. This process extends to the multinomial classification problem by approximating the non-linear boundaries of multiple classes surrounding $x$ as a polyhedron and finding the minimum distance to its surface.

2.1.5 Carlini-Wagner

Carlini and Wagner [7] formulate an optimization problem that minimizes the $L_p$-norm distance between the clean and adversarial image, subject to the constraint that the adversarial image is misclassified by the neural network. They move the misclassification constraint into the objective function to form the non-convex problem:

$$\min \; \| x_{adv} - x \|_p + c \cdot f(x_{adv}) \quad \text{s.t.} \quad x_{adv} \in [0, 1]^n \tag{3}$$

where $f(x_{adv})$ is a function that signals when the adversary is found and $c$ is a hyperparameter that controls the relative objective weights of minimizing distance and finding the adversary. The authors propose several different functions for $f$ and show general trends for selecting $c$. We make special note that a confidence parameter, $κ$, may be prescribed to increase the confidence of misclassification at the expense of higher distortion. Carlini-Wagner (CW) attacks are strong, unbounded attacks that seek imperceptibility and are employed in our evaluations.

2.1.6 Backward Pass Differentiable Approximation and Expectation Over Transformation

Although not technically new attacks, backward pass differentiable approximation (BPDA) and expectation over transformation (EOT) [2, 3] allow existing attacks to handle defense systems that apply non-differentiable or randomized techniques.

Athalye et al. point out that these defenses effectively corrupt available gradient information through shattering (making gradients unavailable at the input due to a non-differentiable operation), stochasticity (applying randomness such that a single calculation of gradients is not representative), and explosion/vanishment (artificial overflow or underflow of gradient values). From a high level, by sampling transformations in the forward pass (EOT) and approximating non-differentiable operations as the identity function in the backward pass (BPDA), gradient based attacks can still find approximate gradients at the input and over many iterations still find a successful perturbation set.

2.1.7 Decision-Based Attack

In situations where gradient information is not reliable, a decision-based attack, like the Boundary Attack proposed by Brendel et al. [4], relies only on the forward pass prediction of the network. The Boundary Attack finds an initial perturbation distribution that is adversarial to the network and then progressively draws new perturbations that both maintain misclassification and reduce the vector norm distance to the original image.

2.2 Review of previously proposed defenses

2.2.1 Adversarial Training

Conceptually, the adversarial training approach forces the network to learn more robust decision boundaries at training time via exposure to training examples generated using one or more adversarial attack methods or modifications to the learning process.

Madry et al. [18] increased network robustness by perturbing training examples up to a fixed $L_p$-bound using PGD at each epoch and letting the network learn from the adversarial distribution. As an extension of [18], Ding et al. [10] proposed in Max-Margin training (MMA) that the $L_p$-bound placed on adversarially generated examples need not be fixed and could instead be optimized on a per-example basis to optimize the average decision boundary margin. Zhang et al. [36] included a regularization term in their training loss that searches for an adversarial example in an $ϵ$-neighborhood that maximally changes the output of the network compared to the clean example before updating the network parameters, thus encouraging decision boundaries to move maximally away from the data.

Further advancement in adversarial training focused on crafting adversarial examples based on not only classification loss, but also on the inter-example feature relationships between clean and generated examples. In [35], a PGD adversarial distribution is selected based on maximizing an optimal transport distance from the clean example distribution at the feature map level resulting in adversarial examples that are diverse and distant from clean examples during training.

2.2.2 Image Transformation and Reconstruction

A different defensive strategy is to apply voluntary transformations to inputs or network behaviors at inference time in the hope that these effects cannot be easily approximated by the attacker. Examples of image transformation include random resizing and padding [33], JPEG compression [9], and image quilting [13].

A related defensive strategy involves attempting to eliminate adversarial perturbations through various means [9, 20, 29, 28]. This strategy rests on the observation that clean and adversarial images, despite often being visually similar in pixel space, come from different manifolds in feature space, and that any mapping which bridges these two distributions could reduce adversarial corruptions in an image.

In the next section, we present the polynomial kernel matrix, which we use to transform the feature maps of denoised input images.

3 Polynomial Kernels for Image Transformation

A convolutional neural network, $f (x; θ)$ , can be viewed as a composition of functions:

$$f(x; θ) = (F_S \circ F_{fc} \circ F_L \circ F_{L-1} \circ \dots \circ F_1)(x) \tag{4}$$

where $F_S$ is the softmax activation function, $F_{fc}$ is a sequence of fully connected layers, and $F_1, \dots, F_L$ are convolutional layers that perform the feature extraction function of the network. Each $F_l$, $l \in \{1, \dots, L\}$, is itself a composition of functions, transforming the output of the convolutional layers before it,

$$F_l(x; θ_l) = [h \circ [\mathrm{Conv}(F_{l-1} \circ \dots \circ F_1)]](x) \tag{5}$$

such that $h$ is a non-linear mapping and $\mathrm{Conv}(\cdot)$ is a convolutional operation.

Let the set $V_l = \{v_1, v_2, \dots, v_C\}$ represent the $C$ feature maps of dimension $H \times W$ at layer $l$.

Given any pair of feature maps within the same layer, $v_i$ and $v_j$, we define a polynomial kernel function,

$$κ(v_i, v_j) = (⟨ v_i, v_j ⟩ + e)^d \tag{6}$$

where $e \geq 0$ and integer $d \geq 1$ are chosen parameters and $⟨ \cdot, \cdot ⟩$ is the inner product. Treating input maps $v_i$ and $v_j$ as having dimension $N = H \times W$, the kernel function $κ(\cdot, \cdot)$ has the property that there exists a mapping

$$ϕ(v_i) : \mathbb{R}^N \to \mathbb{R}^{\binom{N+d}{d}}$$

such that

$$κ(v_i, v_j) = ⟨ ϕ(v_i), ϕ(v_j) ⟩ \tag{7}$$

Given a set of feature vectors $V$ in a convolutional layer, the elements of a kernel matrix $G \in \mathbb{S}_+^{C \times C}$ for that layer are computed by

$$G_{i,j} = κ(v_i, v_j) \tag{8}$$

The kernel matrix is a positive semi-definite matrix unique to each layer. Note that for $e = 0$ and $d = 1$ , the kernel matrix is identical to the Gram matrix.
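The kernel matrix of a layer can be computed directly from its flattened feature maps. The following is a minimal pure-Python sketch; the three 2x2 feature maps in `V` are made-up values chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
def inner(u, v):
    """Inner product of two flattened feature maps."""
    return sum(a * b for a, b in zip(u, v))

def kernel_matrix(feature_maps, e=0.0, d=1):
    """Polynomial kernel matrix with entries G[i][j] = (<v_i, v_j> + e) ** d."""
    return [[(inner(vi, vj) + e) ** d for vj in feature_maps]
            for vi in feature_maps]

# Three hypothetical 2x2 feature maps, each flattened to length N = 4.
V = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]

gram = kernel_matrix(V, e=0.0, d=1)  # e = 0, d = 1 gives the Gram matrix
poly = kernel_matrix(V, e=1.0, d=2)  # a second-order polynomial kernel
```

Here `gram` holds the plain inner products between channels, while `poly` applies the shift and power of the kernel entry by entry; both matrices are symmetric with one row and column per feature map.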

Specifically for polynomial kernels, the mapping $ϕ(v_i)$ transforms the input feature map into a higher dimensional vector whose components are formed by all monomial combinations of the input vector components up to total order $d$. For example, if $N = 2$, $e = 1$, and $d = 3$, then $ϕ(v_i) : \mathbb{R}^2 \to \mathbb{R}^{10}$ maps the components of $v_i = [v_{i1}, v_{i2}]$ to a vector that includes all monomial terms from the set $\{a_{r,s} v_{i1}^{r} v_{i2}^{s} \mid r, s = 0, \dots, 3,\; r + s \leq 3,\; a_{r,s} \in \mathbb{R}\}$.
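The two-dimensional, third-order example can be checked numerically. The sketch below builds one explicit choice of $ϕ$ for $N = 2$ from the multinomial expansion of the kernel, taking the additive constant to be $e = 1$, and compares the inner product in the mapped space with the kernel value. The coefficient choice and the test vectors `u` and `v` are assumptions for illustration; other scalings of the monomials would work equally well as long as the products of coefficients match.

```python
import math

def poly_feature_map(v, e, d):
    """Explicit phi for N = 2 with <phi(u), phi(v)> == (<u, v> + e) ** d."""
    v1, v2 = v
    phi = []
    for r in range(d + 1):
        for s in range(d + 1 - r):
            t = d - r - s  # remaining power assigned to the constant e
            coeff = math.factorial(d) // (
                math.factorial(r) * math.factorial(s) * math.factorial(t))
            phi.append(math.sqrt(coeff * e ** t) * v1 ** r * v2 ** s)
    return phi

u, v = [0.5, -1.0], [2.0, 0.25]
phi_u = poly_feature_map(u, e=1.0, d=3)
phi_v = poly_feature_map(v, e=1.0, d=3)

kernel_value = (u[0] * v[0] + u[1] * v[1] + 1.0) ** 3
mapped_value = sum(a * b for a, b in zip(phi_u, phi_v))
```

The mapped vectors have one component per monomial $v_{i1}^{r} v_{i2}^{s}$ with $r + s \leq 3$, which is where the dimension $\binom{2+3}{3} = 10$ comes from.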

The next section details our defense system construction and the conceptual ideas behind our design choices.

4 Proposed Training and Defense Methods

As shown in Fig. 1, we describe a training phase (left column) and defense methodology (right column) to produce a perturbation resistant deep neural network. All of the steps contained in the following section can be found in Fig. 1.

4.1 Training Phase

Train neural network on clean and smoothed data. Since our defense system applies a traditional image smoothing operation such as a median filter to the image, we wish for the network to be able to recognize the smoothed versions of the original image. Therefore, for each image $t \in T$ , we generate a corresponding smoothed image $t_{R}$ to obtain a set $T_{R}$ and then use all examples in ${T} \cup {T_{R}}$ to train the network.

Figure 1: Schematic of proposed training and defense processes.

4.2 Inference Phase

Receive new input $x$ . Classify $x$ with $f (x; θ)$ and store result $y_{p} | x$ .

Apply smoothing operation to $x$ to create $x_{R}$ . Apply the same smoothing operation to $x$ that was used to generate set ${T_{R}}$ in the training phase.

From the correctly classified, denoised training data set $T_{R,c}$, draw one sample image corresponding to each class to create a sample set $\{s_1, \dots, s_K\}$. Compute and store the polynomial kernel matrices for each sample image. The $K$ stored polynomial kernel matrix sets from $\{s_1, \dots, s_K\}$ will act as target kernel matrices towards which our smoothed $x_R$ will be transformed.

Create $K$ copies of $x_R$. Each copy’s polynomial kernels will be transformed independently towards one of the $K$ sets of polynomial kernels stored from $\{s_1, \dots, s_K\}$ in the next step.

4.2.1 Iteratively impart polynomial kernel values of $s_{k}$ to $x_{R, k}$ .

The following discussion is based on the flowchart in Fig. 2. Conceptually similar to but substantially different from neural style transfer [11], the polynomial kernel module computes losses in one or more layers of the network comparing the kernel matrix values of the $k^{th}$ copy of $x_R$ to the $k^{th}$ sampled image $s_k$ (from class $k$). The MSE loss for a single layer is:

$$L_{G,l} = \sum_{i,j} (G_{x_{R,k}, i, j} - G_{s_k, i, j})^2 \tag{9}$$

We compute MSE losses at chosen network feature extraction layers and backpropagate their summed loss to the input, updating $x_{R,k}$ using a gradient descent algorithm. After every pixel update, we apply both an $L_1$ [8] and an $L_\infty$ distance constraint to the current updated input, limiting the vector distance between $x_{R,k}$ and its final transformed version $x_{RT,k}$. The values of these constraints are hyperparameters obtained from a validation study. We begin the iteration process by randomly perturbing the target $x_R$ within the chosen $L_\infty$ ball.

We iterate on the loss-backpropagate-constraint loop for a chosen number of iterations, recalculating the layer losses, pixel gradients, and vector norm projections independently for each iteration. We emphasize that this transfer procedure is completed for each $x_{R,k}$ corresponding to a different polynomial kernel set generated from $s_k$. The end result is that we output $K$ differently transformed versions of the input, $x_{RT,1}, \dots, x_{RT,K}$, to the next step.
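Two pieces of this loop are easy to isolate: the per-layer loss of Eq. 9 and the $L_\infty$ constraint applied after each pixel update. The sketch below shows only those two pieces on made-up values. It deliberately omits the backpropagated gradient step and the $L_1$ projection, and the function names, the toy feature maps, and the additional clamp to the pixel range $[0, 1]$ are assumptions for illustration rather than details taken from the system.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel_matrix(feature_maps, e=0.0, d=1):
    """Polynomial kernel matrix with entries (<v_i, v_j> + e) ** d."""
    return [[(inner(vi, vj) + e) ** d for vj in feature_maps]
            for vi in feature_maps]

def kernel_loss(G_x, G_s):
    """Summed squared difference between two kernel matrices of one layer."""
    return sum((gx - gs) ** 2
               for row_x, row_s in zip(G_x, G_s)
               for gx, gs in zip(row_x, row_s))

def project_linf(x, x_ref, c2):
    """Clamp each pixel to within c2 of x_ref and to the range [0, 1]."""
    return [min(max(xi, ri - c2, 0.0), ri + c2, 1.0)
            for xi, ri in zip(x, x_ref)]

# Hypothetical feature maps of the k-th input copy and of the class-k sample.
features_copy = [[1.0, 0.0], [0.0, 1.0]]
features_sample = [[1.0, 1.0], [0.0, 1.0]]
layer_loss = kernel_loss(kernel_matrix(features_copy),
                         kernel_matrix(features_sample))

# Hypothetical pixels after one update, pulled back to the c2 = 0.02 ball.
x_ref = [0.45, 0.10, 0.99]
x_updated = [0.50, 0.10, 1.03]
x_projected = project_linf(x_updated, x_ref, c2=0.02)
```

In a full implementation the loss would be summed over the chosen layers and differentiated with respect to the input by an automatic differentiation framework before each projection.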

Figure 2: Schematic of polynomial kernel transformation procedure.

4.2.2 Form voting committee of transformed image predictions and classify $x$ based on images that changed label prediction from $f (x; θ)$ .

Having transformed all $K$ copies of $x_R$, we run inference on each transformed copy through the network and store each label prediction in a set, $\{y_p | x_{RT,1}, \dots, y_p | x_{RT,K}\}$. From this set of predictions, we create a subset $H$ keeping only the predictions that differ from the stored, original prediction on $x$. We then apply a simple hard-vote rule to the remaining predictions: if the most common predicted label, $\operatorname{mode}(H)$, of the predictions that changed due to transformation has a count greater than or equal to a threshold hyperparameter $c_3$, we change our final prediction to this most common label. Otherwise, we keep the original network prediction on $x$.
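The voting rule can be written in a few lines. The sketch below is one possible reading of it; the function name is hypothetical, and the tie-breaking behaviour (the first label to reach the highest count wins) is an assumption, since the rule as stated does not say how ties within $H$ are resolved.

```python
from collections import Counter

def committee_decision(original_pred, transformed_preds, c3):
    """Keep the original prediction unless at least c3 changed votes agree."""
    # H: predictions that differ from the bare-network prediction on x.
    changed = [p for p in transformed_preds if p != original_pred]
    if not changed:
        return original_pred
    label, count = Counter(changed).most_common(1)[0]  # mode(H) and its count
    return label if count >= c3 else original_pred

# Ten transformed copies (K = 10); the bare network predicted class 3.
votes = [5, 5, 5, 3, 3, 1, 5, 3, 3, 3]
low_threshold = committee_decision(3, votes, c3=3)   # four votes for class 5
high_threshold = committee_decision(3, votes, c3=9)  # quorum not reached
```

With the lower threshold the committee overrules the bare network, while with the higher threshold the same votes leave the original prediction in place, which illustrates how $c_3$ trades sensitivity to weak adversaries against accidental reclassification of clean inputs.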

4.3 Defense System Intuition

4.3.1 From an Attacker’s Perspective

As shown in Fig. 3, the attacker can choose to attack the network directly (option A) or simulate the full defense system (option B) by implementing the defense algorithm on the forward pass, using gradient approximations on the backward pass, and increasing their sample sizes to better estimate the polynomial kernel transformations that the defender will use at inference time. These two choices result in one of two different adversarial images, $x_{adv,A}$ or $x_{adv,B}$, being fed to the defense system. The defense system begins by making a prediction on the adversarial image through the ‘bare’ network, thus storing either $f(x_{adv,A}; θ)$ or $f(x_{adv,B}; θ)$. Note that $x_{adv,B}$ was not designed to attack the bare network $f(\cdot; θ)$ without the defense transformations. Adversarial image $x_{adv,B}$ could still transfer to $f(\cdot; θ)$, but we posit that it is no more likely to succeed than if the attack algorithm had attacked $f(\cdot; θ)$ directly, i.e. crafted $x_{adv,A}$ in the first place.

The ideal scenario B outcome for the attacker is that enough of the polynomial kernel transforms (at least the threshold parameter $c_3$) move $x_{adv,B}$ to the same incorrect label; otherwise the system effectively drops its own defense and falls back to $f(x_{adv,B}; θ)$, for which, as stated above, $x_{adv,B}$ was not designed.

Figure 3: Decision flowchart available to attacker showing two possible scenarios.

4.3.2 The Polynomial Kernel Transforms

Recall that our defense system takes $K$ (number of classes) copies of the smoothed input $x_R$ and iteratively imparts the polynomial kernel values of $K$ sampled images from the clean training set (one sample from each class). In the toy example of Fig. 4, a CIFAR10 ‘dog’ is initially correctly classified as class 5, but incorrectly classified as class 3 ‘cat’ when the ResNet18 network is directly attacked (BIM, scenario A). We draw a sample of images from a few classes and also include the original clean image, then apply 25 iterations of kernel transformation from the samples to the adversarial example using feature maps from each residual block in ResNet18 (early and deep layers). Several transformations are successful in restoring the correct classification and some are not.

Our results suggest several important conclusions. First, if the kernel matrix information of the original image is imparted on a perturbed version of itself, the network recovers the correct classification. Second, transformations derived from a sample class $k'$ can recover the correct classification of images from other classes $k \neq k'$, as can be seen from the boat sample reclassifying the dog. Third, we track the average polynomial kernel loss (Eq. 9) at iteration 0 for the deepest residual block in ResNet18 as a percentage share of the total loss across all layers and observe that the original image ‘sample’ has the highest loss share compared to other samples; likewise, the early layers have relatively low kernel loss. This effect implies that adversarial attacks’ imperceptible pixel manipulations create correspondingly small changes in early layer kernel matrices, but larger changes in the deeper layers, which are closer to classification. Lastly, not all transformations result in correct classifications. An important consideration not shown in this particular example is that sometimes we receive random classifications $k'' \notin \{k, k'\}$.

Our interpretation of the toy experiment is pictured on the far right of Fig. 4: a clean example $x$ (black) is perturbed across the decision boundary to a misclassified $x_{adv}$. Several transformation directions are shown altering $x_{adv}$ to $x_{RT,k}$ depending on the sample’s class. Some transformations cross back over the decision boundary and are correctly classified, some remain misclassified, and others end up in random other classes. Let the tuple $(k', k'')$ represent the decision boundary between class $k'$ and $k''$. We believe and have found that if the adversary is close to the boundary of the correct class, or ‘weak’, then $(k, k')$ presents a larger cross-sectional target to the kernel transforms than the others in $\{(k'', k), (k'', k'), (k'', k'')\}$. Therefore, if any reclassification is to occur, a propensity exists for a correct classification over a random classification. What about clean examples: will they be mistakenly reclassified? Particularly in the case of adversarially trained networks, the clean data is pushed away from the boundary, so the risk of any reclassification is lower than for a weak adversary pushed just over the boundary. Further, we perform validation studies to choose a voting threshold parameter $c_3$ that allows for a few random reclassifications of clean examples without the defense changing its prediction due to the voting committee.

How do these ideas interact with the attack scenario of Fig. 3? The polynomial kernel transforms are not attempting to cleanse adversarial perturbations, rather they are independent feature transformations to all possible classes. The scenario B attacker, to achieve the ideal scenario, must coalesce these different transformations with a single perturbation set or risk that the designed adversary is transferred back to the bare network as a weaker adversary than if it had just targeted the bare network originally.

Figure 4: Simple toy example of polynomial kernel transformation for a CIFAR10 ‘dog’ image. Adversary created using 25 iterations of Basic Iterative Method ($ϵ = 8/255$) against an adversarially trained ResNet18 network. Each ‘sample’ image’s polynomial kernel values are imparted onto the adversary to create a separate ‘transform’ image. Some of the pixel variations between the clean, adversarial, and transformed examples can be seen on the dog’s bed surface.

5 Empirical Evaluation

5.0.1 Evaluation networks and datasets

To evaluate our defense system, we applied our techniques to a basic ResNet18 neural network [14] and tested it against a variety of attacks on the CIFAR10 dataset [16] in the PyTorch environment [24] with 1 GPU. CIFAR10 consists of 50,000 training and 10,000 test 32x32 3-channel color images of various natural objects ranging from airplanes to dogs. We trained the network in two different ways. In the first, ‘Std’ training, the network was trained for 100 epochs using standard cross-entropy loss; in the second, ‘Adv’ training, the network was adversarially trained for 100 epochs using the free adversarial training method (m=5) [27] and TRADES loss [36] with $λ = 1$.

5.0.2 Hyperparameter values

We held out 400 randomly selected images from the CIFAR10 test set and performed validation studies with the trained networks to find suitable settings for kernel layer selections, $L_1$-limits ($c_1$), $L_\infty$-limits ($c_2$), voting thresholds ($c_3$), transform iterations, initial learning rate, and optimizer momentum. We employed the RMSprop [31] optimizer ($ρ = 0.1$, $lr = 0.07$) for updating input images. The kernel layers are indexed [0,1,2,3,4,5], referring to ‘after the 1st convolution’, ‘after the 1st ReLU()’, ‘after residual block 1’, ‘after residual block 2’, and so on in ResNet18. Our smoothing operation was a median filter with window size 2x2.

5.0.3 Generation of Adversarial Examples

We used the FoolBox [26, 25] library to create a set of $L_\infty$-bounded (BIM) and unbounded attacks (CW-L2, DeepFool) generated against 1000 randomly selected CIFAR10 test set images (drawn separately from the validation study points). The Carlini-Wagner attacks optimized the $L_2$ distance metric and executed 15 binary search steps beginning with $c = 0.01$. For scenario B adaptive attacks, we used BPDA across the kernel transforms and PyTorch subgradients across the median filter. To handle the ensemble function, we modified the logit output similar to [15]. If $Z_k$ is the output logit vector of sample $s_k$, then our adaptive attack model outputs $\sum_{k=1}^{K} \frac{Z_k}{\| Z_k \|}$.
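The aggregated output used by the adaptive attack model is simple to compute once the per-copy logits are available. The following sketch assumes the norm is Euclidean, which the expression above does not state explicitly, and uses made-up two-class logits; the function name is hypothetical.

```python
import math

def aggregate_logits(logits_per_copy):
    """Sum the logit vectors of all copies, each divided by its own norm."""
    num_classes = len(logits_per_copy[0])
    total = [0.0] * num_classes
    for z in logits_per_copy:
        norm = math.sqrt(sum(v * v for v in z))  # assumed L2 norm, must be nonzero
        total = [t + v / norm for t, v in zip(total, z)]
    return total

# Hypothetical logits from two transformed copies of a two-class problem.
Z = [[3.0, 4.0], [0.0, 2.0]]
combined = aggregate_logits(Z)
```

Normalising each logit vector before summing keeps a single confident copy from dominating the surrogate output, so the attacker's gradient reflects all committee members rather than only the largest logits.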

Parameter	Std	Adv
$c_{1}$	30.0	30.0
$c_{2}$	.02	.02
$c_{3}$	3	9
Iterations	10	10

Table 1: Defense System Hyperparameters

5.1 CIFAR10 Evaluation

Our results in Table 2 show several important trends. We applied our defense system, although intended for an adversarially trained network, to a standard trained network for completeness. Those results show that while the system correctly classifies static scenario A adversarial examples quite frequently, particularly the weaker $L_{2}$ adversaries (DeepFool, CW $κ = 0$ ), the system is completely bypassed in scenario B by stronger attacks.

In the case of an adversarially trained network under an adaptive BIM $ϵ = 8/255$ attack (scenario B), the defense system maintains a minimum accuracy of $40.4\%$ compared to the baseline network accuracy of $38.6\%$ when we use polynomial kernel transforms from layers 1-5. When we use kernel matrices from only the early layers of the network (0-2), the accuracy instead falls slightly below the baseline, to $38.4\%$. This difference in performance suggests that our system must impart information from the deeper feature extraction layers, which contain more class-specific features.

We notice a bifurcation in trend between the weaker (DeepFool, CW $κ = 0$) and stronger (CW $κ = 5$) attacks between scenarios A and B, where the system improves accuracy against weaker attacks, but lowers accuracy against stronger attacks. As stated before, we believe the system is better suited for weaker attacks; in this case, these attacks close to the decision boundary fail to find a consensus incorrect label in the ensemble and the perturbed image is transferred back to the bare network for classification. The system is not as adept at handling stronger attacks, and presumably the attacks can take advantage of our transformations by applying small perturbations in the image space that are favorably magnified by our transforms, so we see the opposite trend against CW $κ = 5$ attacks. Increasing $κ$ has also been shown to increase the transferability of an attack [7] and would more likely cause our bare network to misclassify even if a voting threshold is not reached. Accordingly, if we were to increase $κ$ further, the system would collapse, but at increasing risk to the attacker that the perturbations become visually perceptible. Lastly, using a higher order polynomial kernel (d=2) did not provide a benefit in performance and in most cases hurt the system.

Network & Defense	Training	Scenario	Layers	Clean	DeepFool	CW	CW	BIM	BIM
Parameters:	-	-	-	-	-	$κ = 0$	$κ = 5$	$ϵ = 4/255$	$ϵ = 8/255$
Attack Iterations:	-	-	-	-	50	20	20	20	20
ResNet18, No Defense	Std	A	-	95.8	0.0	0.0	0.0	0.0	0.0
ResNet18, Full $e, d = 0, 1$	Std	A	0-2	92.8	92.7	93.0	88.5	32.5	4.7
ResNet18, Full $e, d = 0, 1$	Std	B	0-2	92.8	24.9	27.2	0.0	0.0	0.0
ResNet18, No Defense	Adv	A	-	87.0	0.0	8.0	64.0	64.5	38.6
ResNet18, Full $e, d = 0, 1$	Adv	A	0-2	84.7	56.5	63.9	61.3	66.2	43.1
ResNet18, Full $e, d = 0, 1$	Adv	A	1-5	84.6	46.1	51.6	61.2	65.3	41.4
ResNet18, Full $e, d = 1, 2$	Adv	A	1-5	84.5	49.1	54.6	61.2	65.2	42.0
ResNet18, Full $e, d = 0, 1$	Adv	B	0-2	84.7	75.4	76.2	48.5	62.8	38.4
ResNet18, Full $e, d = 0, 1$	Adv	B	1-5	84.6	72.9	77.8	46.9	66.4	40.4
ResNet18, Full $e, d = 1, 2$	Adv	B	1-5	84.5	62.5	74.3	47.1	65.2	39.6

All results are accuracies in %. The Layers column indicates the feature extraction layers chosen to impart kernel transforms.

Table 2: CIFAR10 Test Set Results - ResNet18
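As a reading aid, the short script below computes the change in accuracy of the defended, adversarially trained network in scenario B relative to the undefended adversarially trained baseline. The numbers are copied from Table 2. The variable names and the helper `deltas` are introduced here for illustration only.

```python
# Accuracy (%) per attack, copied from Table 2 (adversarially trained ResNet18).
ATTACKS = ["DeepFool", "CW k=0", "CW k=5", "BIM eps=4", "BIM eps=8"]
baseline = [0.0, 8.0, 64.0, 64.5, 38.6]  # No Defense, Adv row
defended_b = {
    "layers 0-2, e,d=0,1": [75.4, 76.2, 48.5, 62.8, 38.4],
    "layers 1-5, e,d=0,1": [72.9, 77.8, 46.9, 66.4, 40.4],
    "layers 1-5, e,d=1,2": [62.5, 74.3, 47.1, 65.2, 39.6],
}


def deltas(row):
    """Percentage-point change of a defended row relative to the baseline."""
    return [round(d - b, 1) for d, b in zip(row, baseline)]


for name, row in defended_b.items():
    changes = ", ".join(
        f"{attack}: {delta:+.1f}" for attack, delta in zip(ATTACKS, deltas(row))
    )
    print(f"{name} -> {changes}")
```

For layers 1-5 with $e, d = 0, 1$, the changes are +72.9, +69.8, -17.1, +1.9, and +1.8 percentage points. The system gains against DeepFool, CW $κ = 0$, and both BIM settings, and loses against CW $κ = 5$.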

6 Conclusion

In this paper, we developed an adversarial defense system that improves the robustness of a neural network against a diverse array of white-box attacks. Our GPU resources limited our generated attacks to fewer iterations than in previous works [7], but on a comparative basis our defense still increased robust accuracy against adaptive attacks under certain hyperparameter settings.

References

[1] M. Abbasi and C. Gagné (2017) Robustness to adversarial examples through an ensemble of specialists. CoRR abs/1702.06856.
[2] A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. CoRR abs/1802.00420.
[3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2017) Synthesizing robust adversarial examples. CoRR abs/1707.07397.
[4] W. Brendel, J. Rauber, and M. Bethge (2017) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv.
[5] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. J. Goodfellow, A. Madry, and A. Kurakin (2019) On evaluating adversarial robustness. CoRR abs/1902.06705.
[6] N. Carlini and D. Wagner (2017) MagNet and "Efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv:1711.08478.
[7] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. arXiv:1608.04644.
[8] T. Chandra (2008) Efficient projections onto the l1-ball for learning in high dimensions. International Conference on Machine Learning.
[9] N. Das, M. Shanbhogue, S. Chen, F. Hohman, S. Li, L. Chen, M. E. Kounavis, and D. H. Chau (2018) Shield: fast, practical defense and vaccination for deep learning using JPEG compression. CoRR abs/1802.06816.
[10] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2020) MMA training: direct input space margin maximization through adversarial training. arXiv:1812.02637.
[11] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. CoRR abs/1508.06576.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. arXiv:1412.6572.
[13] C. Guo, M. Rana, M. Cissé, and L. van der Maaten (2017) Countering adversarial images using input transformations. CoRR abs/1711.00117.
[14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv:1512.03385.
[15] W. He, J. Wei, X. Chen, N. Carlini, and D. Song (2017) Adversarial example defenses: ensembles of weak defenses are not strong. CoRR abs/1706.04701.
[16] A. Krizhevsky, V. Nair, and G. Hinton (2009) CIFAR-10 (Canadian Institute for Advanced Research).
[17] A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. arXiv:1607.02533.
[18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2019) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083.
[19] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2015) DeepFool: a simple and accurate method to fool deep neural networks. CoRR abs/1511.04599.
[20] A. Mustafa, S. H. Khan, M. Hayat, J. Shen, and L. Shao (2019) Image super-resolution as a defense against adversarial attacks. CoRR abs/1901.01677.
[21] A. Mustafa, S. Khan, M. Hayat, R. Goecke, J. Shen, and L. Shao (2019) Adversarial defense by restricting the hidden space of deep neural networks. arXiv:1904.00887.
[22] T. Na, J. H. Ko, and S. Mukhopadhyay (2018) Cascade adversarial machine learning regularized with a unified embedding. arXiv:1708.02582.
[23] M. Naseer, S. Khan, M. Hayat, F. S. Khan, and F. Porikli (2020) Stylized adversarial defense. arXiv:2007.14672.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
[25] J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning.
[26] J. Rauber, R. Zimmermann, M. Bethge, and W. Brendel (2020) Foolbox Native: fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX. Journal of Open Source Software 5 (53), pp. 2607.
[27] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free! arXiv.
[28] S. Shen, G. Jin, K. Gao, and Y. Zhang (2017) APE-GAN: adversarial perturbation elimination with GAN. arXiv:1707.05474.
[29] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2017) PixelDefend: leveraging generative models to understand and defend against adversarial examples. CoRR abs/1710.10766.
[30] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. arXiv:1312.6199.
[31] T. Tieleman and G. Hinton (2012) Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
[32] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2020) Ensemble adversarial training: attacks and defenses. arXiv:1705.07204.
[33] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille (2017) Mitigating adversarial effects through randomization. CoRR abs/1711.01991.
[34] W. Xu, D. Evans, and Y. Qi (2018) Feature squeezing: detecting adversarial examples in deep neural networks. In Proceedings 2018 Network and Distributed System Security Symposium.
[35] H. Zhang and J. Wang (2019) Defense against adversarial attacks using feature scattering-based adversarial training. arXiv:1907.10764.
[36] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. arXiv:1901.08573.