Deep Learning for Malware Classification

Published in AI/ML at Symantec, Mar 28, 2019

*Figure 1:* Bytes of malware executable file

Malicious software can inflict huge damage on enterprises and users, and it continues to pose a significant challenge for cybersecurity. Malware changes and evolves over time, making it difficult to detect using traditional signature-based techniques. Malware authors have developed techniques such as malware packers to counter traditional signature-based approaches. In response, machine learning techniques have become popular for malware detection. Traditional machine learning approaches are based on heuristic feature engineering, which is expensive and does not scale well. In more recent years, advanced machine learning techniques based on artificial intelligence (AI), specifically deep learning, have been actively investigated for this critical cybersecurity application. The hope for these techniques is that they will lead to more robust solutions with higher accuracy.

In this post, we provide an overview of deep learning based malware classification and how it can be employed for static analysis of malware.

Machine Learning

In machine learning, classification is the problem of assigning an input sample to one of the target categories. For malware detection, the two categories are benign and malicious files. Training data consists of data samples with ground truth labels. Some well-known examples of machine learning models are support vector machines, logistic regression, decision trees and neural networks. In classical machine learning, suitable features, which are signals or characteristics of data capable of distinguishing between different categories, are extracted using a process called feature engineering. For a data sample, the extracted features are input into the machine learning model and the output is the classification. Using a training algorithm, the parameters of a machine learning model are optimized under a suitably chosen cost function over the training set.
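To make the idea of optimizing parameters under a cost function concrete, here is a minimal sketch of logistic regression trained by gradient descent on the cross-entropy cost. The two features and all data values are made up for illustration; they are assumptions, not measurements from real files.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that sample x belongs to the positive (malicious) class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by gradient descent on the mean cross-entropy cost."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y  # gradient of the cost w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical hand-engineered features: [entropy-like score, suspicious-import ratio]
X = [[0.9, 0.8], [0.8, 0.9], [0.95, 0.7], [0.2, 0.1], [0.3, 0.2], [0.1, 0.3]]
y = [1, 1, 1, 0, 0, 0]  # 1 = malicious, 0 = benign (ground truth labels)
w, b = train_logistic_regression(X, y)
```

After training, `predict(w, b, sample)` returns a value between 0 and 1 that can be thresholded to obtain the classification.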

Feature engineering takes non-trivial effort and requires domain expertise, because one has to know which characteristics make objects belong to different categories and then implement and fine-tune all such features. Furthermore, such feature-based solutions for malware detection suffer from a drift problem: malware evolves constantly, so their efficacy decays over time.

Deep Learning

Deep learning is the application of deep neural networks to machine learning. A neural network consists of layers of neurons, starting with an input layer, followed by a few hidden layers, and ending with an output layer. A neuron receives inputs from the previous layer, performs simple mathematical operations and outputs a value. The output of a layer becomes the input to the next layer in a cascaded manner. A deep neural network consists of many hidden layers, sometimes hundreds, and is therefore deep. A large variety of deep neural networks exists in practice, customized for different applications. A well-known example is the convolutional neural network, which is based on the convolution operation, in which a local receptive field from the previous layer is processed in a sliding window manner. Convolutional neural networks are popular in computer vision applications.
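The sliding window operation and the cascading of layers can be sketched in a few lines of plain Python. This is a toy illustration with hand-picked weights, not a trained network. As in most deep learning libraries, the "convolution" here is technically a cross-correlation because the kernel is not flipped.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel over the signal; each output sees one local receptive field."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def relu(values):
    """Simple nonlinearity: keep positive responses, zero out the rest."""
    return [max(0.0, v) for v in values]

def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of the previous layer's outputs, then ReLU."""
    return max(0.0, sum(w * x for w, x in zip(weights, inputs)) + bias)

signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # hand-picked: responds where the signal steps upward

features = relu(conv1d(signal, edge_kernel))  # output of the convolutional layer
output = neuron(features, [1.0] * len(features), -0.5)  # next layer consumes it
```

Here `features` is nonzero only at the position where the signal rises, and `output` combines those local responses into a single value.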

Deep learning based solutions can work on raw data without any need for manual feature engineering. Different layers of a neural network automatically learn features at different levels, from low-level local features to high-level global features. The classifier’s performance typically keeps improving as we increase the size of the training data. In contrast, it has been empirically observed that the performance of a classical machine learning model gradually plateaus as the training data grows. One potential advantage of deep learning is that more training data makes the neural network learn more robust features. Learning good robust features is an active area of research in AI. For malware detection, the hope is that these techniques will be less prone to the time drift problem. And if the performance does plateau at some point, we can increase the capacity of the neural network by adding neurons, layers, or both, and repeat the cycle of continuous improvement. That is precisely why neural networks have shown steadily increasing gains in computer vision, speech recognition, natural language understanding and game playing.

The Convolutional Neural Network (CNN)

Before we review how deep learning is employed for malware classification, let us revisit how convolutional neural networks are used for image classification. An image is input to the network in its raw pixel format. The image goes through a sequence of convolutional layers, which can be viewed as automatically computing image features at different levels of abstraction. The spatial dimension of the feature maps decreases due to max pooling layers. Neurons in higher layers correspond to larger receptive fields of pixels in the input image over which features are being computed. These convolutional layers are followed by fully connected layers (dense layers) or, in more modern architectures, by a global average pooling layer. At the very end, a classification output layer outputs the probabilities of the image belonging to the different categories. For speech recognition, we can convert the speech signal into a 2-D image called a spectrogram, in which one axis is time and the other is frequency, and apply similar techniques.
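The pooling and output stages described above can be sketched as follows. For brevity the sketch uses 1-D feature maps with made-up values instead of 2-D images, and it treats the pooled values directly as class scores, which a real network would compute with a learned layer.

```python
import math

def max_pool1d(values, size=2):
    """Keep the strongest response in each non-overlapping window, shrinking the map."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def global_average_pool(feature_maps):
    """Collapse each feature map to a single number by averaging over all positions."""
    return [sum(fm) / len(fm) for fm in feature_maps]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

feature_maps = [[1, 3, 2, 8], [0, 0, 4, 2]]           # made-up outputs of two filters
pooled = [max_pool1d(fm) for fm in feature_maps]      # each map shrinks from 4 to 2
summary = global_average_pool(pooled)                 # one value per feature map
probabilities = softmax(summary)                      # class probabilities
```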

*Figure 2*: LeNet-5, a convolutional neural network dating back to the 1990s for digit recognition. See reference: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, *Gradient-Based Learning Applied to Document Recognition,* Proceedings of the IEEE, November 1998.

Embedding

For images or for speech, it is natural to use raw data as the input. This is because an image pixel which is similar in value to another is also visually similar to it. That is why image pixels exhibit correlation in local neighborhoods. In natural language understanding, the situation is radically different. The input is a sequence of words rather than numbers. We would like to convert these words to numerical vectors based on their neighborhood context with adjacent words. Think of them as multi-dimensional thought pixels of the image of a document. In color images, which have red, green and blue components, we have 3-dimensional pixels. In natural languages, the number of dimensions or colors can be in the hundreds, as words have complex meanings and subtle nuances. This approach is called embedding in AI. The good news is that it can be implemented as the very first layer of a neural network and trained by the same algorithm. By converting documents to sequences of these embedding vectors, we can apply deep neural networks to them just as in computer vision and speech recognition.
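An embedding layer is essentially a lookup table with one vector per vocabulary entry. The sketch below uses a tiny hypothetical vocabulary and random, untrained vectors. In a real network these vectors would be adjusted during training along with all other weights.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One vector per vocabulary entry, randomly initialized (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace each token id with its embedding vector."""
    return [table[t] for t in token_ids]

# Hypothetical four-word vocabulary, for illustration only.
vocab = {"the": 0, "file": 1, "is": 2, "malicious": 3}
table = make_embedding_table(len(vocab), dim=4)

sentence = ["the", "file", "is", "malicious"]
vectors = embed([vocab[word] for word in sentence], table)
# `vectors` is a sequence of 4-dimensional "pixels", ready for convolutional layers.
```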

Classifying Raw Bytes

Now that we have understood how deep neural networks work, we can see how they can be employed for malware classification by static analysis, that is, by just looking at the contents of the file without any dynamic analysis. Consider the specific case of a Windows Portable Executable (PE) file. It is a sequence of bytes organized into sections according to the PE file format. A classical machine learning method for static analysis will extract features such as byte n-grams, strings, import tables, entropy features, etc., based on a training set of malicious and benign files, and then input these extracted features to a machine learning model such as an SVM, nearest neighbors, decision trees or neural networks.
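Two of the hand-engineered features mentioned above, byte entropy and byte n-grams, are simple to compute. The following is a minimal sketch operating on an in-memory byte string. A practical pipeline would typically compute these per section or per sliding window, which is left out here.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0 for constant data, 8 for uniform bytes."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

sample = b"MZ\x90\x00MZ\x90\x00"      # toy bytes, not a real executable
entropy = byte_entropy(sample)
bigrams = byte_ngrams(sample, n=2)
```

High entropy is often associated with packed or encrypted content, which is why it is a popular feature, but it is only one signal among many.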

In deep learning, the raw bytes will be the input. Consider a deep convolutional neural network for this task. Since PE files follow the format of machine language and data, these bytes go through an embedding layer, as in the case of natural language understanding. The initial convolutional layers process local regions in the file and learn discriminating features. Subsequent layers process these local features and combine them to form intermediate features, and finally the network learns global patterns over the entire file. By providing a large amount of training data, we can expect the neural network to learn malware-family-specific patterns as well as time-invariant robust features which distinguish malware from benign files.
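The flow from raw bytes to a classification score can be sketched as a single forward pass. This is a deliberately tiny illustration with one convolutional layer and random, untrained weights, so its output is meaningless as a verdict. The layer sizes, the global max pooling step and the function name `score` are assumptions made for the sketch, not the architecture shown in the figure.

```python
import math
import random

rng = random.Random(0)
EMBED_DIM, NUM_FILTERS, KERNEL = 4, 3, 8  # illustrative sizes, chosen arbitrarily

# Embedding: one vector per possible byte value 0..255 (random, i.e. untrained).
embedding = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(256)]
# Each filter spans KERNEL consecutive bytes across all embedding dimensions.
filters = [
    [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(KERNEL)]
    for _ in range(NUM_FILTERS)
]
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(NUM_FILTERS)]
out_bias = 0.0

def score(data: bytes) -> float:
    """Forward pass: embed bytes, convolve, ReLU, global max pool, sigmoid."""
    if len(data) < KERNEL:
        raise ValueError("input must contain at least KERNEL bytes")
    x = [embedding[b] for b in data]
    pooled = []
    for f in filters:
        activations = []
        for i in range(len(x) - KERNEL + 1):  # slide over local regions of the file
            s = sum(f[j][d] * x[i + j][d] for j in range(KERNEL) for d in range(EMBED_DIM))
            activations.append(max(0.0, s))   # ReLU
        pooled.append(max(activations))       # strongest response anywhere in the file
    z = sum(w * p for w, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-z))         # probability-like score in (0, 1)

result = score(bytes(range(64)))  # toy input, not a real PE file
```

A real model would stack many such layers and learn all of these weights from a large labeled training set.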

*Figure 3*: Convolutional neural network for classification of malware. The visual representation of data on the left is one that a trained human analyst would consider suspicious. Specifically, it is data from a threat called ‘Trojan.Tracur’ that has been encrypted to avoid traditional signatures. This type of pattern is rare in clean files and common in malware.

Convolutional neural networks are a natural choice because they work well for computer vision, and human malware analysts can often tell whether a file is malware by looking at the raw bytes and picking up certain visual features. For expert malware analysts, this visual analysis approach is very resistant to drift because the extracted features are more fundamentally related to malware. For example, humans can recognize data obfuscated in a way designed to avoid traditional signatures.

Enrichment / Attribution as a Side Benefit

One advantage of using deep learning is that it can also tell us which portions of a file are malicious by looking at receptive fields of neurons which get activated. This opens the possibility of a malware analyst using deep learning as a tool to assist him or her in the task of classifying difficult cases.
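Mapping an activated neuron back to the bytes it covers comes down to receptive field arithmetic. The sketch below assumes a stack of 1-D convolution or pooling layers without padding, each described by a hypothetical (kernel size, stride) pair. The activation values are made up.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs, input side first, no padding assumed.

    Returns (size, jump): how many input bytes one output unit sees, and how far
    apart in the input the windows of two adjacent output units start.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size, jump

def byte_span(index, layers):
    """Half-open byte range [start, end) of the file seen by output unit `index`."""
    size, jump = receptive_field(layers)
    start = index * jump
    return start, start + size

def strongest_span(activations, layers):
    """Byte range behind the most strongly activated unit of the top layer."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return byte_span(best, layers)

layers = [(8, 4), (4, 2)]                  # hypothetical two-layer stack
activations = [0.1, 0.0, 0.2, 2.7, 0.3]    # made-up top-layer activations
span = strongest_span(activations, layers)
```

A strong activation is a hint about where to look, not proof that the bytes are malicious, so the resulting range is best treated as a starting point for the analyst.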

Conclusion

In cybersecurity, there is now growing interest in applying deep learning to malware classification, as evidenced by several publications from academia as well as industry. A review of these publications shows that, although their neural network architectures differ, they all share the common themes of very large training datasets and the underlying motivation of automatic feature engineering.

The application area of malware classification is very different from conventional areas such as computer vision, speech recognition and natural language understanding, but it is exciting to see that concepts of complex pattern recognition and of hierarchical features learned in an automated data driven manner still carry over to this new domain. Future research work will further advance applications of AI techniques for malware classification and other problems in cybersecurity.

Acknowledgements: The author thanks Andrew Gardner and Nolan Kent for very helpful comments and feedback.