Machine Learning for Malware Analysis

Jul 31, 2017

Malware developers find new ways to avoid detection (Source: http://www.worldtechtoday.com)

Editor’s Note: This article describes some of the work done for the project “GotMalware?”, which explores malware fingerprinting and the use of machine learning and visualization techniques. The work was carried out with help from MalwareBytes & Lib13 Inc by Polina Khapikova, Muhammad Qureshi, Akshatha Muralidhar, and Willie Santos as part of the Cyber Defenders 2017 Program.

Introduction

In our first blog post, we gave an overview of our project and our internship at the Cyber Defenders program. In this post, we focus on the machine learning part of our project: the research we have done on the topic and how we are beginning to incorporate machine learning into our work.

A quick note: In our initial project work, we identified virus fingerprints. Doing so helped us create a test bed and understand the behavior of viruses. We are now moving toward our next goal: using machine learning to classify executable files as malware or benign.

Research

Machine learning is a technique that allows computers to learn and improve from past experience without being explicitly programmed. In other words, we can imagine machine learning as a student who learns from prior mistakes. Machine learning can be divided into five learning protocols: unsupervised learning, supervised learning, semi-supervised learning, active learning, and reinforcement learning.

Unsupervised learning: Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. A common example is clustering, which groups similar inputs together. Labeled data, by contrast, supports two kinds of prediction: “classification” and “regression”. Classification is the process of taking an input “X” and mapping it to some discrete label, like true or false. For example, given a photo of a person, a classifier might decide whether the person looks male or female. Regression, however, predicts a continuous value rather than a label.

Introduction to Machine Learning in Python with scikit-learn (Source: http://ipython-books.github.io)

Supervised learning: Supervised learning is the machine learning protocol of inferring a function from labeled training data. It can be thought of as a teacher supervising the learning process. We know the correct answers, and the algorithm iteratively makes predictions on the training data and is corrected if necessary. Learning stops when the algorithm achieves an acceptable level of performance.
Semi-supervised learning: In semi-supervised learning, the algorithm has access to both labeled and unlabeled examples.
Active learning: In active learning, the algorithm and teacher interact with each other. For example, the program can ask the teacher to label the examples it is least sure about.
Reinforcement learning: Reinforcement learning allows a machine or software agent to learn its behavior from feedback from its environment. The agent can settle on a good solution to a given problem while continuing to learn from its environment in search of a better one. This type of machine learning can be thought of as learning through reward, much as a dog learns commands through treats or praise. The agent can also learn from negative feedback, which steers it away from poor decisions.
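To make the idea of inferring a function from labeled training data concrete, here is a minimal sketch of a classifier in plain Python. It uses a nearest-neighbor rule, which is not one of the algorithms from our tutorial; we picked it only because it is short. The two numeric features and all of their values are invented for illustration.

```python
import math

# Toy labeled training data: (feature vector, label).
# The features and values are made up; they do not come from real files.
TRAINING_DATA = [
    ((1.0, 1.2), "benign"),
    ((0.8, 1.0), "benign"),
    ((1.1, 0.9), "benign"),
    ((4.0, 4.2), "malware"),
    ((4.3, 3.9), "malware"),
    ((3.8, 4.1), "malware"),
]


def predict(sample, training_data=TRAINING_DATA):
    """Return the label of the training example closest to `sample`."""
    _nearest_features, nearest_label = min(
        training_data,
        key=lambda example: math.dist(sample, example[0]),
    )
    return nearest_label


if __name__ == "__main__":
    print(predict((0.9, 1.1)))  # lies near the "benign" examples
    print(predict((4.1, 4.0)))  # lies near the "malware" examples
```

The labels play the role of the teacher: the program never sees a rule for what makes a file malicious, only examples whose answers are already known.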

Machine Learning Tutorial

In addition to research, our group has begun working with machine learning algorithms. This week, we completed the “Machine Learning for Malware Detection” tutorial from the InfoSec Institute. The original tutorial can be found here. Our annotated, modified tutorial can be found here in the form of a Jupyter notebook. Through this tutorial, we went through all the steps required in a machine learning pipeline.

First, we downloaded over a hundred virus executable files from the OpenMalware.org site and gathered benign executable files from one of our computers. Combined, these formed our dataset. (The program could only process PE files, which we explain in the PE File Format section.) Next, we extracted 12 different features from each of these files. Then we used matplotlib to graph the data to get a better idea of how important each feature was to the classification.

We used two algorithms to classify the data: a random forest classifier and a multi-layer perceptron.
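Comparing two classifiers usually comes down to measuring how often each one is right on files it did not see during training. The sketch below shows that idea in plain Python. The helper names, the one-feature dataset, and the two stand-in classifiers are all hypothetical; neither stand-in is actually trained, and this is not the evaluation code from the tutorial.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, labeled_samples):
    """Fraction of samples for which classifier(features) equals the label."""
    correct = sum(
        1 for features, label in labeled_samples if classifier(features) == label
    )
    return correct / len(labeled_samples)


# Hypothetical dataset with a single made-up feature per file.
DATASET = [
    ((0.1,), "benign"), ((0.2,), "benign"), ((0.3,), "benign"), ((0.4,), "benign"),
    ((0.6,), "malware"), ((0.7,), "malware"), ((0.8,), "malware"), ((0.9,), "malware"),
]


def threshold_classifier(features):
    """Stand-in classifier: flags a file when its feature exceeds 0.5."""
    return "malware" if features[0] > 0.5 else "benign"


def always_benign(features):
    """Stand-in baseline that never flags anything."""
    return "benign"


# A real model would be fitted on train_set; here only test_set is used.
train_set, test_set = train_test_split(DATASET)
print("threshold:", accuracy(threshold_classifier, test_set))
print("baseline: ", accuracy(always_benign, test_set))
```

Including a trivial baseline such as `always_benign` is a useful habit: a classifier is only interesting if it beats the score you get by guessing the same answer every time.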

Thoughts About the Tutorial

Performance of Algorithms

We noticed that the random forest classifier performed better than the multi-layer perceptron, which was a bit unexpected. Random forest is a classification method that builds multiple decision trees (each a simple logical flowchart of sorts) and combines the outcomes of all the trees into a final answer. Below is a diagram of the random forest algorithm that helped us understand this concept.

Random Forest Template for TIBCO Spotfire® — Wiki page (Source: https://community.tibco.com)

Here, we can see that the algorithm takes the results from three decision trees (two trees have voted for Class-B and one has voted for Class-A). The algorithm then combines the individual trees’ results, typically by majority vote, to reach a final conclusion.
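The voting step in the diagram can be sketched in a few lines of Python. This assumes a plain majority vote, which is one common way to combine trees; some libraries average the trees' predicted probabilities instead. The three "trees" are hypothetical one-question rules, and the feature names and thresholds are invented for illustration.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    If two classes tie, the one that appeared first in the list wins.
    """
    counts = Counter(tree_predictions)
    label, _count = counts.most_common(1)[0]
    return label


def forest_predict(trees, sample):
    """Ask every tree for a prediction and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Three hypothetical single-question "trees" over made-up features.
def tree_one(sample):
    return "Class-B" if sample["entropy"] > 7.0 else "Class-A"


def tree_two(sample):
    return "Class-B" if sample["num_sections"] > 5 else "Class-A"


def tree_three(sample):
    return "Class-B" if sample["size_kb"] < 200 else "Class-A"


if __name__ == "__main__":
    sample = {"entropy": 7.5, "num_sections": 3, "size_kb": 120}
    # The trees vote Class-B, Class-A, Class-B, as in the diagram.
    print(forest_predict([tree_one, tree_two, tree_three], sample))
```

A real random forest also trains each tree on a different random slice of the data and features, which is what keeps the trees from all making the same mistake.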

In contrast, the multi-layer perceptron is a neural network, which works completely differently.

Schematic drawing of multilayer perceptron neural network (Source: https://www.researchgate.net)

Neural networks are characterized by their hidden layers, the grey circles in the picture above. Each hidden layer is made up of units that take a weighted combination of their inputs and apply a function to it, passing the result on toward the output that classifies the file. Although the network in the picture has only one hidden layer, it is common to have several; ours has six. Neural networks can be extremely useful, but to provide accurate results they need a large amount of data, often tens of thousands of files. Therefore, for our purposes, it was better to use the simpler algorithm.
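To show what a hidden layer actually computes, here is a minimal sketch of the forward pass of a tiny multi-layer perceptron in plain Python. The layer sizes, weights, and biases are arbitrary numbers chosen for illustration, and the sigmoid activation is our assumption; this is not the network from the tutorial, and the training step that would learn the weights is left out.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute one layer: each unit takes a weighted sum of the inputs,
    adds its bias, and passes the result through the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Illustrative network: 2 inputs -> 3 hidden units -> 1 output unit.
HIDDEN = ([[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]], [0.0, 0.1, -0.1])
OUTPUT = ([[1.0, -1.0, 0.5]], [0.2])

if __name__ == "__main__":
    score = mlp_forward([0.9, 0.4], [HIDDEN, OUTPUT])[0]
    print("malware" if score > 0.5 else "benign", score)
```

Every weight and bias is a number the network has to learn, which is one reason deeper networks need so much more data than a handful of decision trees.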

PE File Format

Another interesting thing the tutorial introduced us to was the PE file format. PE stands for “portable executable”, which gives a hint as to what it is. PE is a highly structured binary file format for executables and DLLs that the Windows operating system knows how to load and run. Because of this structure, we were able to extract features from each file for the algorithms to analyze.
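Because the format is so structured, a few header fields can be read with nothing more than Python's standard `struct` module. The sketch below reads the fixed-layout COFF file header that follows the PE signature. The function names are our own, the fields shown are not necessarily among the 12 features from the tutorial, and the sample bytes are a hand-built header rather than a real executable.

```python
import struct


def parse_pe_header(data):
    """Extract a few COFF header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature; not a PE file")
    # Offset 0x3C of the DOS header stores where the PE signature starts.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, number_of_sections, time_date_stamp, _symbol_table_ptr,
     _number_of_symbols, size_of_optional_header, characteristics) = (
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4))
    return {
        "machine": machine,
        "number_of_sections": number_of_sections,
        "time_date_stamp": time_date_stamp,
        "size_of_optional_header": size_of_optional_header,
        "characteristics": characteristics,
    }


def build_minimal_header():
    """Build a synthetic DOS stub + PE signature + COFF header for the demo."""
    dos_header = bytearray(64)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 64)  # PE signature at byte 64
    # 0x014C is the machine code for 32-bit x86; the other values are made up.
    coff_header = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 224, 0x0102)
    return bytes(dos_header) + b"PE\x00\x00" + coff_header


if __name__ == "__main__":
    print(parse_pe_header(build_minimal_header()))
```

To try it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to `parse_pe_header`. Only do this with malware samples inside an isolated test environment.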

Matplotlib

Matplotlib is a plotting library for the Python programming language. It provides a way to visualize data as 2D graphics. The library can be used to make different types of plots, charts, tables, and grids. In the tutorial, we used matplotlib to visualize the relationship between individual features and whether a file was malware.

Next Steps

Our next goal will be to implement a machine learning program to classify Android viruses. We will be working with .apk files, the Android application package format, which plays a role on Android similar to that of PE files on Windows.