Code Similarity Analysis Using Machine Learning, Part 1

Crovax
6 min read · May 14, 2022


Overview

In this series we’ll continue analyzing Hancitor (poor Hancitor), but from a different angle: we’ll concentrate on the decryption algorithm Hancitor uses. We don’t want to reanalyze a sample if we don’t have to (we can use our config extractor), but we are interested in new, updated variants that might thwart our config extractor (link to building the config extractor).

We could use Ghidra’s program diff feature or other third-party tools for comparative analysis, but what if we have 5 samples, or 8? We don’t want to manually load a bunch of samples into a program like Ghidra (yes, you can use headless mode for mass analysis) when we can build automation based around code similarity using machine learning (also, ML is pretty cool).

This is going to be the main focus of this series: can we build effective automation around detecting new encryption algorithms employed by Hancitor? This will enable us to focus our reverse engineering on new variants while letting our config extractor take care of the rest. This first part covers installing the necessary tools and the thought process, then demonstrates a few use cases to get familiar with the tools.

Installing tools

The main environment we’ll be programming in is Jupyter notebooks. It makes code execution and troubleshooting much easier than your typical Python IDE. It’s also one of the most widely used environments in the ML community, thanks to how easily it displays charts, graphs, and other output. You can download the standalone application (here), or you can install VS Code (here) and add the extension (I’ve chosen the latter).

Once you have VS Code installed, we’ll need to install some extensions and packages to get our environment ready. On the left-hand side in VS Code you should see an icon with four blocks; click on it. At the top, you can search for extensions to install. Below are the 5 extensions you want installed:

Jupyter
Jupyter Keymap
Jupyter Notebook Renderers
Python
Pylance

Once everything is installed, your extension list should look something like this (see below).

List of extensions

Next, we need to create a directory to work out of. Clicking the file icon on the left-hand side will give you the option to open a directory (if you haven’t already created one), or you can create one from the File drop-down menu.

Once you’ve created your working directory, we’ll need to install a couple of packages via pip (which comes preinstalled with Python). Open a command shell and run the following to get the packages installed:

# Get updated version of setuptools
pip install setuptools --upgrade
# Install Lief
pip install lief
# Install capstone
pip install capstone
# Install numpy
pip install numpy
# Install pandas
pip install pandas
# Install matplotlib
pip install matplotlib
# install scikit-learn
pip install scikit-learn

Don’t worry if you haven’t heard of these tools before; below is a short description of each, with links to their documentation. Together, these tools will allow us to parse binaries, extract features, and produce visualizations to help us accomplish our objective.

Lief is a cross-platform library which can parse, modify and abstract ELF, PE, MachO and Android formats. (Link)

Capstone is a lightweight multi-platform, multi-architecture disassembly framework. (Link)

Numpy is the fundamental package for scientific computing in Python. (Link)

Pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. (Link)

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. (Link)

Scikit-learn provides various classification, regression and clustering algorithms. (Link)
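Before moving on, it’s worth confirming that every package actually installed. Here’s a minimal sanity check you can run in a notebook cell, using only the standard library’s `importlib.metadata` (the helper name `check_packages` is my own, not part of any of these libraries):

```python
# Sanity check: confirm each pip-installed package is visible to Python.
from importlib.metadata import version, PackageNotFoundError

def check_packages(names):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

packages = ["lief", "capstone", "numpy", "pandas", "matplotlib", "scikit-learn"]
for name, ver in check_packages(packages).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If anything prints `NOT INSTALLED`, rerun the corresponding pip command above before continuing.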

Quick demo

Now that we have our environment set up, we can get familiar with some of the basic commands and features we’ll be using throughout this series.

Going back to VS Code (after creating a new directory), we’ll want to create a Jupyter notebook file (a file with the .ipynb extension). Clicking on the down arrow in our directory, a few icons will appear; clicking the first allows us to create a file of our choosing. Create a file and give it a .ipynb extension.

Create a new file

Once you have created the new file, you should be presented with a blank notebook with an open cell.

Newly created Jupyter notebook

For this demo we’re going to use lief to parse a file, display its imports, and give us the total length (we’ll leave the fancy stuff for part 2).

First, we’ll want to import the necessary library (lief) into our notebook and execute the cell to make sure everything works OK.

Note: while in a cell, if you hold Shift and press Enter, Jupyter will execute the current cell and create a new one below it.
Also, if it’s your first time executing the first cell, Jupyter might need to start the local server. Before doing so, VS Code might ask which Python version (if you have multiple) you want to use. Just select the newest or suggested one, and you should be good to go.

To start, enter the code below into the first cell of your notebook. The first line is just a comment describing what the code does. The second line imports the lief package.

code snippet:

#import library
import lief

Once executed, and if everything worked OK, the first cell should run and create another below it.

We’ll now need to provide lief with the path to a binary to parse. You can use your own, or you can download the Hancitor sample set we’re going to use in part 2 and use one of those (link). Regardless of your choice, make sure to provide the full path.

Note: all the samples’ extensions have been changed to .bin to prevent accidental execution, but as always, proceed with caution.

Code snippet:

  • The first line assigns the full path of our binary to a variable called “path”
  • The second line passes the path variable to lief’s parse function and assigns the output to another variable called “binary”

path = "/home/crovax/Malware_Research/Malware_projects/Hancitor1.bin"
binary = lief.parse(path)
Current code layout
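One gotcha worth knowing here: lief.parse returns None when it can’t handle a file, rather than raising an exception, so a typo in the path can fail silently. A small defensive wrapper (a sketch of my own, the name parse_binary is not part of lief) makes failures explicit:

```python
# Hedged sketch: wrap lief.parse so a bad path or unparseable file
# raises immediately instead of silently yielding None.
import lief
from pathlib import Path

def parse_binary(path_str):
    """Parse a binary with lief, raising instead of returning None on failure."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"sample not found: {path}")
    binary = lief.parse(str(path))
    if binary is None:
        raise ValueError(f"lief could not parse: {path}")
    return binary
```

With this in place, `binary = parse_binary(path)` behaves like the cell above but complains loudly when something is wrong.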

Now that we have access to our binary data, we can leverage some of lief’s other functions to extract all kinds of information (sections, imports, exports, etc.). What we want to do is list all the libraries imported by the binary. To do this, we’ll use lief’s libraries attribute, which gives us an iterable object. We’ll loop through the object and append each result to a list called “my_library”, then print the results.

code snippet:

  • Lines 1 and 2 are just comments describing the code
  • Line 3 creates a list to store our results in
  • The rest of the code loops through the binary.libraries object, assigning each entry to “library” and appending it to our list. Once the loop completes, we print the list to the screen.

# create a list to store each library retrieved from the binary
# then print the list
my_library = []
for library in binary.libraries:
    my_library.append(library)
print(my_library)
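If you want more than just the DLL names, lief’s PE objects also expose the individual imported functions. Here’s a short sketch (the helper name imports_by_library is my own, and it assumes the parsed binary is a PE) that groups each imported function under its library:

```python
# Hedged sketch: map each imported library to its imported function names,
# assuming "binary" is a parsed lief PE object (lief.PE.Binary).
def imports_by_library(binary):
    """Return {dll_name: [imported function names]} for a parsed PE."""
    table = {}
    for imported in binary.imports:          # lief.PE.Import objects
        table[imported.name] = [entry.name for entry in imported.entries]
    return table

# Usage with the binary parsed above:
# print(imports_by_library(binary))
```

This kind of grouped view will come in handy later, since import combinations are themselves a cheap feature for comparing samples.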

Note: if you want a copy of the notebook created I have uploaded it to my GitHub(link).

Conclusion

To wrap up this part, we’ve covered how to install the necessary tools and packages to get started with code similarity analysis and feature extraction. In the next part of this series, we’ll use lief to parse the .text section of Hancitor, then use capstone to disassemble the bytes for our code analysis.

As always, don’t expect much, as I have no clue what I’m doing. 😃
