LSTM-based malware detection (Python & TensorFlow)

Islem BOUZENIA
Published in Nerd For Tech
4 min read · Apr 9, 2021

In one of our previous posts, we showed how to create a malware detector using convolutional neural networks by transforming an executable into a grayscale image. In this post, we will see how we can explore the assembly code of an executable to analyze it for malicious content.

Starting from the middle…

We can represent an executable by its sequence of instructions: each sequence of N instructions is treated as a text of N words written in assembly language. We are therefore dealing with a text classification problem where the language is assembly. That language has more than 1600 instructions in its dictionary (vocabulary). Each instruction can be represented by a vector, for example with a one-hot representation. A one-hot vector is simply a vector in which every entry is 0 except one, which is 1. In other words, it is a vector of the canonical basis of a space whose dimension equals the vocabulary size.
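To make the one-hot idea concrete, here is a minimal sketch in plain Python. The four-word vocabulary is a toy assumption for illustration; the real vocabulary has more than 1600 entries.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside a vocabulary of size {size}")
    vec = [0] * size
    vec[index] = 1
    return vec


# Toy vocabulary (assumption): each mnemonic gets one position in the vector.
VOCAB = ["mov", "push", "add", "jmp"]
INDEX = {mnemonic: i for i, mnemonic in enumerate(VOCAB)}

encoded = [one_hot(INDEX[m], len(VOCAB)) for m in ["push", "mov", "add"]]
print(encoded)  # [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
```

With the full vocabulary, each of these vectors would have more than 1600 entries, almost all of them zero.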

Using the canonical representation on 1600 instructions has several drawbacks. First, representing each instruction with a sparse vector of size 1600 hurts the performance of the neural network. Second, a canonical basis assumes that its elements are independent: the words should be enough to generate any expression (meaning) in the language, with no correlation between the components of the basis. This is clearly not the case in assembly language, since some instructions are correlated. To overcome this problem, we regroup the instructions into families, each family containing a set of instructions with the same use: arithmetic instructions, data transfer instructions, and so on. This increases the correlation inside a family of instructions and reduces it between different families, and it also reduces the dimension of the space.

In my work, I mapped the different instructions of assembly language into 49 families, using the classification given by Intel in their documentation. Here is the clustering of instructions:

I didn’t go crazy and classify every instruction, so this isn’t the whole dictionary of assembly language.

I know this is not a pleasant thing to look at, but fortunately I did it so no one has to do it again.
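As a rough illustration of how such a mapping can be applied, the sketch below turns a list of mnemonics into family IDs. The four families, their members and the ID numbering are illustrative assumptions, not the actual 49-family table; the reserved ID 0 for unclassified mnemonics is also an assumption, motivated by the fact that not every instruction was classified.

```python
# Illustrative families only (assumption): the real table has 49 families.
FAMILIES = {
    "data_transfer": ["mov", "push", "pop", "xchg"],
    "arithmetic": ["add", "sub", "mul", "div", "inc", "dec"],
    "logical": ["and", "or", "xor", "not"],
    "control_transfer": ["jmp", "je", "jne", "call", "ret"],
}

UNKNOWN_ID = 0  # reserved for mnemonics that were not classified (assumption)

# Families are numbered from 1 in the order they are declared above.
FAMILY_ID = {name: i for i, name in enumerate(FAMILIES, start=1)}
MNEMONIC_TO_FAMILY = {
    mnemonic: FAMILY_ID[name]
    for name, members in FAMILIES.items()
    for mnemonic in members
}


def to_family_ids(mnemonics):
    """Replace each mnemonic with the ID of its family (0 if unclassified)."""
    return [MNEMONIC_TO_FAMILY.get(m.lower(), UNKNOWN_ID) for m in mnemonics]


print(to_family_ids(["push", "mov", "xor", "ret", "cpuid"]))  # [1, 1, 3, 4, 0]
```

The sequence the network sees is then a list of small integers instead of more than 1600 distinct words.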

And then comes LSTM

Since we are dealing with sequence analysis in a space of reasonable dimension, LSTM models are a natural choice.

LSTM models are powerful when it comes to sequence analysis, but they are also tricky to tune. To make sure I chose the right parameters for the network, I varied several of them and compared the average performance across all the values tried.
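One possible reading of this comparison is a marginal average: for each value of one parameter, average the score over every configuration that used that value. The sketch below assumes this reading; the function name and the shape of `results` are hypothetical, and no real scores are included.

```python
from collections import defaultdict
from statistics import mean


def marginal_means(results):
    """Average the score for every (parameter, value) pair.

    `results` is a list of (config, score) pairs, where `config` is a dict
    such as {"layers": 3, "units": 128, "seq_len": 128}.
    """
    buckets = defaultdict(list)
    for config, score in results:
        for name, value in config.items():
            buckets[(name, value)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Looking up `("layers", 3)` in the returned dict, for instance, gives the average score of every 3-layer configuration, whatever its other parameters were.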

For the purpose of this study, the best architecture uses 3 LSTM layers of size 128 and a sequence length of 128 instructions (mapped into families).

3-layer LSTM model
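A minimal Keras sketch of an architecture along these lines: three stacked LSTM layers of 128 units reading sequences of 128 instructions. The one-hot input over 49 families, the single sigmoid output, the optimizer and the loss are all assumptions, since the text only fixes the number of layers, their size and the sequence length.

```python
import tensorflow as tf


def build_lstm_detector(seq_len=128, n_families=49, units=128):
    """Three stacked LSTM layers followed by a binary (malware / benign) output."""
    model = tf.keras.Sequential([
        # Assumption: each time step is a one-hot vector over the families.
        tf.keras.Input(shape=(seq_len, n_families)),
        # The first two layers return the full sequence so the next LSTM can read it.
        tf.keras.layers.LSTM(units, return_sequences=True),
        tf.keras.layers.LSTM(units, return_sequences=True),
        # The last LSTM layer keeps only its final state.
        tf.keras.layers.LSTM(units),
        # Assumption: a single sigmoid unit giving the probability of "malicious".
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_lstm_detector()
model.summary()
```

Stacking requires `return_sequences=True` on every LSTM layer except the last one; otherwise the next layer would receive a single vector instead of a sequence.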

Straight to the implementation

To implement our idea, we first need to disassemble an executable. Luckily, I already explained how to create a disassembler in Python in one of my previous posts (https://isleem.medium.com/create-your-own-disassembler-in-python-pefile-capstone-754f863b2e1c).

But don’t worry if you don’t have time to look at it: I have a full Python script that gets the sequence of instructions from an exe file, though I encourage you to understand how it works so you can maintain it. You can find the entire Python script for this in my GitHub at this location:

If you open that script, what really matters is the last block, the loop that does the extraction, where you have to change the path to your folder of exe files. Also remove the few lines (3 or so) that are not important, since they might raise an error.

Once the disassembly is done, we need to encode the data for our neural network and create the model, something like this:
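A rough sketch of the encoding step, under stated assumptions: the instructions have already been mapped to integer family IDs in the range 0 to 48, the sequence is cut into non-overlapping windows of 128 instructions, and any incomplete tail is dropped. How the original code segments and pads sequences is not described, so treat these choices as placeholders.

```python
import numpy as np

SEQ_LEN = 128      # instructions per sequence, as stated in the text
N_FAMILIES = 49    # number of instruction families, as stated in the text


def encode_windows(family_ids, seq_len=SEQ_LEN, n_families=N_FAMILIES):
    """Cut a list of family IDs into windows and one-hot encode each instruction.

    Returns an array of shape (n_windows, seq_len, n_families).
    Assumes every ID is an integer in range(n_families).
    """
    ids = np.asarray(family_ids, dtype=np.int64)
    n_windows = len(ids) // seq_len
    # Drop the incomplete tail, then group the IDs into fixed-length windows.
    ids = ids[: n_windows * seq_len].reshape(n_windows, seq_len)
    # Row i of the identity matrix is the one-hot vector for family i.
    return np.eye(n_families, dtype=np.float32)[ids]


x = encode_windows(list(range(49)) * 6)  # 294 instructions
print(x.shape)  # (2, 128, 49)
```

Every window from a given executable then takes that executable's label (malicious or benign) as its training target.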

Now, you want to see a demo? No problem! Refer to my previous post: https://medium.com/analytics-vidhya/deep-learning-based-malware-detection-demo-d545c5653200

In that post, you will find a link to the Kaggle notebook that contains all the exported models, so you can download and reuse them. You will also find the entire code for the demo, which you can link to a beautiful dashboard of your choice. I chose Anvil.works; give it a try, it’s simple and easy.

And that’s it for LSTMs. I will soon share the training dataset of assembly instruction sequences.

Update:

You can find the full implementation, the models and more on this GitHub repo: https://github.com/islem-esi/DeepMalwareDetector
