BinImg2Vec: Augmenting Malware Binary Image Classification with Data2Vec
^†^†thanks:

Lee Joon Sern Ensign Labs
Ensign InfoSecurity
Singapore
lee_joonsern@ensigninfosecurity.com Tay Kai Keng Ensign Consulting
Ensign InfoSecurity
Singapore
kaikeng_tay@ensigninfosecurity.com Chua Zong Fu Ensign Consulting
Ensign InfoSecurity
Singapore
zongfu_chua@ensigninfosecurity.com

Abstract

Rapid digitalisation spurred by the Covid-19 pandemic has resulted in more cyber crime. Malware-as-a-service is now a booming business for cyber criminals. With the surge in malware activities, it is vital for cyber defenders to understand more about the malware samples they have at hand as such information can greatly influence their next course of actions during a breach. Recently, researchers have shown how malware family classification can be done by first converting malware binaries into grayscale images and then passing them through neural networks for classification. However, most work focus on studying the impact of different neural network architectures on classification performance. In the last year, researchers have shown that augmenting supervised learning with self-supervised learning can improve performance. Even more recently, Data2Vec was proposed as a modality agnostic self-supervised framework to train neural networks. In this paper, we present BinImg2Vec, a framework of training malware binary image classifiers that incorporates both self-supervised learning and supervised learning to produce a model that consistently outperforms one trained only via supervised learning. We were able to achieve a 4% improvement in classification performance and a 0.5% reduction in performance variance over multiple runs. We also show how our framework produces embeddings that can be well clustered, facilitating model explanability.

Malware, Neural Networks

I Introduction

The rapid digital adoption, spurred largely by the Covid-19 pandemic, to support work-from-home arrangements all over the world, has resulted in a surge in worldwide Internet usage. [internetPandemic]. This has in turn fueled cyber-attacks; In 2021, Governments worldwide experienced a 1,885% increase in ransomware attacks, while the healthcare industry alone experienced a 775% increase in those attacks in 2021 [surgeattacks]. In fact, Malware-as-a-service (MaaS) is now a booming business for cybercrime organisations with new malware samples increasing 106% year-on-year in 2021 [maas]. Traditional signature-based matching approaches to detect and profile malware into their respective families may no longer work simply because of the inability to capture and obtain signatures for every single malware out there in the wild. In fact, evaded found that around three quarters of malware samples collected in Q1 2021 would have evaded traditional signature-based secturity tools [evaded]. As such, it is increasingly important for cyber defenders to move away from signature-based methods and find generalizable, scalable methods to detect and profile the rapidly increasing number of malware. In this paper, we study how image classification methods can be applied to the malware family classification problem. This is particularly important as the ability to classify the malware family will provide cyber defenders an idea of what the malware was designed to do, thus reducing the search space for cyber incident responders when trying to contain a breach by the malware.

Ii Related Work

In 2011, firstmalimg coined a novel idea to turn malware binaries into byte images. They then made use of the well known GIST image descriptor method to extract feature vectors from each of the images. They found that malware belonging to the same family tend to have similar feature vectors resulting in them being loosely clustered together [firstmalimg]. Though extremely promising, it was noted that improvements could be made through a learning based algorithm such as artificial neural networks[ann2015].

In 2015, ann2015 proposed to enhance firstmalimg’s work by applying the Discrete Wavelet Transformation on the malware images to extract additional features, on top of GIST, before passing them all through a multi-layered perceptron (MLP) neural network to train a malware family classifier [ann2015]. Subsequently, with the advent and adoption of Convolutional Neural Networks (CNN) over multiple domains, cnn2017 proposed passing the entire malware image binary through a CNN without any hand engineered features [cnn2017]. They showed that they were able to get 93.17% classification accuracy over 32 malware families. Separately, cnn2020 also showed good performance of CNN based techniques, achieving 99.24% accuracy over 25 families [cnn2020]. With deep learning showing promising results in this domain, cnn2021, in 2021, conducted an empirical analysis of various well known deep learning methods ranging from MLPs, CNNs, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), etc. He found that CNN based methods provided the best results empirically [cnn2021].

To the best of our knowledge, all previous works in this domain trained the deep neural networks via supervised learning, using the softmax cross-entropy loss function, which is commonly used to train multi-class classifiers. While supervised learning has conventionally been shown to be the de-facto method to train neural networks, if labels are available, due to the fact that its associated loss functions like the softmax cross-entropy loss function is convex, a number of recent works have proposed methods that have been shown to be able to surpass supervised-only training. In particular, recent work showed that augmenting supervised learning with self-supervised learning techniques could improve model performance significantly. For example, maskedautoenc, in 2021, showed that randomly masking pixels of an input image helps an autoencoder learn more robust embeddings, which would be useful for subsequent fine tuning tasks [maskedautoenc]. Separately, beit showed that pre-training transformer based neural networks via a self-supervised approach helped improve classification accuracy by approximately 2% [beit]. A significant drawback of these findings is that they are domain dependent, meaning that some of these proposed methods only work well for natural language processing, while others work well on image tasks; they may not work well across multiple domains (i.e. modalities). To address this issue data2vec, in Feb 2022, proposed a possible modality agnostic framework that they showed could work equally well across multiple modalities, from image processing to natural language processing and even speech processing [data2vec].

In particular, data2vec proposed the data2vec framework as shown in Fig. 1 [data2vec]. Similar to maskedautoenc, they first proposed to apply a random mask to the input, be it an image, speech or even text. The unmasked input is then passed to a teacher neural network (TNN) to output a series of embeddings. Similarly, the masked input is passed to a student neural network (SNN) to output another series of embeddings. It is important to note that the TNN and the SNN are identical in architecture. However, the weights of the TNN are actually an exponential moving average of the SNN as it is updated. The weights of the TNN are parameterised as shown in Equation (1), where $△_{n + 1}$ represents the weights of the TNN at training step $n + 1$ , $△_{n}$ represents the weights of the TNN at step $n$ , $θ$ represents the weights of the SNN at step $n$ and $τ$ represents the exponetial moving average multiplier. Finally, the loss between the series of embeddings output by both the TNN and SNN is minimised, while holding the embedding output of the TNN constant (i.e. a fixed non-differentiable target).

△_{n + 1} = △_{n} τ + (1 - τ) θ_{n}

(1)

Data2Vec has been well received by the deep learning community not only because of it being modality agnostic but also because it precludes the need to train the decoding portion of auto-encoders. Conventionally, one would train an auto-encoder neural network comprising an encoder and a decoder, to obtain embeddings that contain summarized information of the encoder’s input [maskedautoenc]. However, the decoder is rarely used once the embeddings are obtained as the main purpose of the autoencoder was to obtain the encoder and its ouput embeddings. Furthermore, conducting backpropagation of gradients through the decoder is an expensive process as the decoder needs to reconstruct the encoder’s input. Data2Vec is a significant step up as it precludes the need to train a decoder. Instead, it bootstraps the training using an exponential moving average of the SNN’s weights. This together with its modality agnostic characteristic have resulted in data2vec garnering alot of interest from the research community recently [d2v1] [d2v2].

We now highlight our main contributions. Our major contribution is the study of how the data2vec framework can be applied to the malware image binary classification problem. We will validate if data2vec is applicable even to malware binary images and we will also propose a method to incorporate the data2vec framework in order to train a malware binary image family classifier in an end-to-end fashion. We will also evaluate the performance of our proposed framework. Our other contribution is an in-depth study of the neural network’s output to better understand what the network has learnt in order to aid explanability and deeper analysis.

Fig. 2: Examples of Malware Binary Images

Iii Preliminaries and Problem Setup

In this section, we will review how malware binaries can be converted to images and formally define the classification problem. In this paper, the malware binaries that we analyse pertain only to Windows malware. Thus, they would be in Portable Executable format, having file extensions .bin, .dll and .exe [cnn2020]. These malware binaries comprise of a series of bits and can be converted 8 bits at a time to pixels in a grayscale image, comprising of textural patterns. This is because each byte (i.e. 8 bit) can be mapped to an integer between 0 to 255 and the entire byte array can be mapped to a grayscale image.

Our dataset of malware binaries were obtained from multiple online data sources [malsamp1] [malsamp2] and comprise of 61 Windows malware families. To convert them to images, we first studied the size of the byte array. We found that a majority of the malware binaries, when converted to a byte array had lengths around 640,000 (i.e. 640 kilobytes). Thus we set the width of the image to be 800 so that most of the images would be square in nature. This would facilitate downstream resizing without significant information loss. It would also facilitate easier CNN designs as the kernel sizes can all be square too. The steps below summarize how we preprocess each of the malware binaries into images before passing them as input to the classification model.

Read in malware binaries as a byte array.
Zero pad the byte array so that its length is divisible by 800.
Reshape the zero padded byte array such that its width is 800.
Resize the entire array into a 400 $\times$ 400 grayscale image. This facilitates a larger batch size when training on RAM constrained Graph Processing Units (GPUs)

Figure 2 depicts samples of 2 malware families from our dataset after the above preprocessing steps. As can be seen, malware of the same family show similar visual textures. In particular, those of the Gandcrab malware have near identical characteristics at the top of the image, while those of AgentTesla have diagonal lines cutting across the image. These hint at the possibility of image classification methods being used to classify malwares into their family. After images of the malware binaries are obtained, methods described in previous works [cnn2017][cnn2020][cnn2021] all train a CNN using the softmax cross entropy loss function. The softmax cross entropy loss is as defined in (2), where $m$ is the batch size, $y_{i}$ is a one hot vector indicating the labelled class for that particular sample and $_{i}$ is the softmax output of the CNN, which is actually a discrete probability distribution describing the probability a particular malware binary image sample belongs to each of the known malware families in the dataset. This loss function is convex and is therefore typically used to train multi-class classifiers.

L_{c e} = - 1 / m m \sum i = 1 y_{i} \cdot l n (_{i})

(2)

As mentioned earlier, recent deep learning research has shown that incorporating self-supervised learning as part of the neural network’s training process can improve model performance [maskedautoenc][beit][data2vec]. Inspired by these promising results, we will next describe how we incorporated data2vec’s data2vec algorithm into our training process.

Iv Methodology

One of the drawbacks of neural networks is the lack of explanability. To the best of our knowledge, all previous works in this domain treated the CNN models that they trained as a blackbox. After training the CNN, they only evaluated the output of the CNN with no additional insight as to how the model was able to produce its prediction. In this paper, we propose to make use of data2vec to not only enhance performance but also improve explanability of the neural network so as to provide greater confidence that the model is indeed learning as we expect it to be. To do this, we broke the model into 2 sub nueral networks as shown in Figure 3. Essentially, we will make use of a CNN to ingest the various malware binary images and produce embeddings. These embeddings will then be passed to a MLP to produce a softmax output, indicating the probability of the input malware binary image belonging to each of the known malware families in the training dataset. Intuitively, we would like the CNN to be a feature extractor while the MLP to be a feature analyzer and class prediction module.

We propose to train the entire network (both the CNN and MLP) using the softmax cross entropy loss, while regularising the embeddings output by the CNN using the data2vec framework. Figure 4 shows our proposed framework.

As shown, it comprises 2 main workflows. The first workflow is the conventional supervised learning workflow whereby the entire model, comprising both the CNN and MLP are trained end-to-end using the labelled dataset via the conventional softmax cross entropy loss function described in (2). Do note that in this workflow, the entire unmasked image is fed to the CNN model. Also, we use the student CNN for this workflow as the teacher CNN is in fact a frozen moving average of the student CNN (i.e. gradient propagation should not go through the teacher CNN). The second workflow aims to regularise the output of the student CNN to encourage the model’s output embeddings contain information representative of the input image. It also helps to prevent the model from over-fitting to the dataset. This workflow is as what was proposed in [data2vec] whereby the student CNN is input a masked version of the malware binary image and the teacher CNN is input the entire malware binary image. The difference between the output embeddings are then minimized via (3), where $z_{t}$ is the output embeddings produced by the teacher CNN, $_{t}$ is the output embeddings produced by the student CNN. Note that $β$ is used to control the loss function’s sensitivity to outliers. In our experiments, we set this to 0.5.

Our final proposed composite loss function to train the model end-to-end is (4), where $λ$ is a weighting factor, which we set to 1.

Ldata2vec={12(zt−^zt)2/β,if |zt−^zt|≤β(|zt−^zt|−12β),otherwise

(3)

L = L_{c e} + λ L_{d a t a 2 v e c}

(4)

Iv-a Network Design

Table I summarizes the network architectural design. It should be noted that although data2vec proposed a transformer architecture, he also noted that the data2vec methodology would also be applicable to other alternative architectures [data2vec]. In this paper, we chose a CNN architecture in place of transformer given that it has achieved good results in this domain and is less computationally intensive. The first 3 convolutional layers that does "same" padding aims to extract features from the input image, with minimal information loss from the periphereal pixels. The next few convolutional layers are feature extractors. It should be noted that the output of the convolutional layers is a matrix of size 3 $\times$ 3 $\times$ 256. This is then passed through a dense layer (i.e. Dense (CNN)) to output a matrix of the same size but with linear activation. This matrix forms the embedding layer, which will be trained using $L_{d a t a 2 v e c}$ . The next few layers are part of the MLP network. It first flattens the 3 dimensional matrix into a single vector, before passing it through 3 Dense layers of the same number of units. These Dense layers follow the ResNet architecture to facilitate back-propagation of gradients. In addition, we added a dropout of 0.2 throughout the entire network to prevent overfitting.

Filters Stride Kernel Convolution Type Padding activation 32 [1,1] [3,3] Conv2D (CNN) Same leaky relu 64 [1,1] [3,3] Conv2D (CNN) Same leaky relu 128 [1,1] [3,3] Conv2D (CNN) Same leaky relu 128 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 128 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 256 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 256 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 256 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 256 [2,2] [5,5] Conv2D (CNN) Valid leaky relu 256 - - Dense (CNN) - linear - - - Flatten (MLP) - - 128 - - Dense (MLP) - leaky relu 128 - - Dense (MLP) - leaky relu 128 - - Dense (MLP) - leaky relu 61 - - Dense (MLP) - softmax

TABLE I: Neural Network Architecture

Iv-B Experimental Setup

To showcase the improvement of our methods over the existing literature, we will train 2 models on 90% of our dataset, segmented by family to ensure that the test dataset has every malware family. The first model is trained using only $L_{c e}$ while the second model is trained using our composite loss function $L$ . We will then repeat this process thrice using different random seeds to objectively determine whether our proposed framework can improve the malware family classification accuracy. All models took approximately 12 hours of training to converge. Loss curves were also observed throughout the entire training process to ensure convergence has been reached.

V Results and Discussion

The results of our experiments are as shown in II. As can be seen our proposed method of training the malware binary image classification model produce a marked improvement of nearly 4% consistently across the 3 runs. Furthermore the variance across the runs was reduced from 0.7% to 0.2%. This shows that our framework of using both supervised and self-supervised training to train the neural network gives better results.

Experiment $L_{c e}$ $L$ Family Classification Accuracy 1 0.956 0.989 2 0.950 0.991 3 0.949 0.990 Mean Accuracy 0.952 0.991

TABLE II: Accuracy & F1-Score of various Experiments and Scenarios

To better understand why this is so, we present the training loss curves of the 2 different approaches in Figure 5. The red curve depicts the softmax cross entropy loss as the model is trained using just $L_{c e}$ while the blue curve depicts the softmax cross entropy loss component of our composite loss function when the model is trained using $L$ . As can be seen, the blue curve settles on a higher loss but is much more stable compared to the red one. We opine that this is one reason why the model, when trained using $L$ results in less variation over multiple seeds compared to if it were just trained using $L_{c e}$ .

Secondly, we also note that the $L_{d a t a 2 v e c}$ loss function for the red curve settled on at least 2 orders of magnitude higher than the blue curve. This is as expected as the red curve was not trained using this loss function (although we log it too for comparison). However, it is interesting to discover that the difference would be so huge. Intuitively, what this means is that the embeddings produced by the 2 networks could have vastly different meanings. For the blue curve, the embeddings must contain information of the original input image. On the other hand, this may not be the case for the red curve; its embeddings may well be tailored to overfit to the training dataset.

We attempt to explain this phenomenon by trying to understand the inner-workings of the model. To do this, we took the embeddings produced by the CNN model (1 trained using $L_{c e}$ and another using $L$ ) and passed them through UMAP [umap] to reduce the embeddings to 2 dimensions, allowing us to plot them in a 2-dimensional plane. As can be seen in Figure 6, the embeddings produced by our proposed framework are actually well clustered by the various families in the dataset (i.e. each color represents a particular family and each cluster is predominantly made up of one color, indicating that each cluster is predominantly made up of 1 family), even though this was not explicitly expressed in the $L_{d a t a 2 v e c}$ loss function. On the other hand, the embeddings of the model trained using just the softmax cross entropy loss is rather erratic with no clear cluster. This characteristic is extremely helpful to us because of the following:

Firstly, it enhances the explanability of the model. We are now sure that the CNN trained under our proposed framework indeed extracts interpretable features for the MLP network to analyse. In contrast, this is not the case if the model was simply trained using the $L_{c e}$ loss function.
Secondly, it facilitates the discovery of new malware families. If the MLP’s confidence is low, we can analyse the embeddings output by the CNN to see if it is indeed far from all the known clusters. If so, the malware sample is probably a new malware family.

Vi Conclusion and Future Work

In this paper we have made a few findings. First, we verify that the data2vec framework indeed also works in the cyber domain. To the best of our knowledge, we are the first to apply this framework to the cybersecurity domain. Secondly, we demonstrate how the addition of $L_{d a t a 2 v e c}$ as a regularising term in the loss function is able to improve malware family classification and also stabilise training so that the variance of the model’s performance over multiple runs is relatively stable. Thirdly, we show how our proposed framework can enhance explanability of the model and possibly identify new malware families through clustering of the embeddings output by the CNN network. In this work, we have verified our proposed framework over 61 malware families, which is more than all previous work to the best of our knowledge. Future work, could extend this to even more families and also explore malware functionality classification (i.e. Trojan, Worm, etc.)

BinImg2Vec: Augmenting Malware Binary Image Classification with Data2Vec ††thanks: