Revolutionizing Malware Detection with Custom Machine Learning Classifier — Part 2

Deepak
4 min read · Nov 10, 2023


Welcome to the second part of this blog series. Here’s the link to Part 1.

100% accuracy in detecting Malware by our own Classifier

Capturing System Calls: The Heart of Our Methodology

Think of a system call as a program’s special request to the computer’s operating system, a transition from user mode to kernel mode. To monitor and display the interaction between a process and the kernel, we can use diagnostic tools such as strace (on Linux) and NtTrace (on Windows).

NtTrace is like a spyglass that lets us zoom in on the critical spots in Ntdll, a vital part of the Windows operating system. It allows us to set up ‘breakpoints’ around the Windows system calls; think of these as hidden surveillance cameras monitoring the kernel’s interactions.

Now, what happens when one of these system calls hits one of our carefully placed breakpoints? That’s when our tool, playing the role of a vigilant security officer, steps in. It quickly takes note of the arguments that were passed in and the values that were returned during that interaction. Here’s what it looks like when we run strace (on Linux):

This is what system calls look like when we run a command

Here, strace is used to run the common command ‘pwd’. strace acts as an interceptor and recorder for every system call made by ‘pwd’ and every signal it receives, and it prints them to the console (on standard error by default) as the command runs.

If we take a closer look at the output above, the first point to note is that each line corresponds to one system call made by the command. For our ‘pwd’ example, the first line shows that the ‘execve’ system call is invoked when the command starts. ‘execve’ is how a process asks the kernel to run a new program, specifically the program whose path is given as the first argument.

Reading further, we can see that strace also lists the exact arguments passed to each system call. For our ‘pwd’ command, the ‘execve’ system call executes the binary at the path ‘/usr/bin/pwd’ and passes ‘pwd’ as the first entry of its argument list.

As we go deeper into the output, we can move line by line and examine what the command does at each stage. Each system call, be it ‘read’, ‘write’, ‘connect’, or another, tells its own story about the operations the command performs.
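For readers who want to reproduce a capture like this from a script, here is a minimal sketch in Python. It assumes strace is installed on a Linux machine; the helper names `build_strace_command` and `capture_trace` and the output file name are hypothetical and are not part of the original tooling. The only strace option used is `-o`, which writes the trace to a file instead of the console.

```python
import shutil
import subprocess


def build_strace_command(command, output_path):
    """Return the argument list that runs `command` under strace.

    `-o` makes strace write the trace to `output_path` instead of stderr.
    """
    return ["strace", "-o", str(output_path), *command]


def capture_trace(command, output_path):
    """Run `command` under strace and return the recorded trace as text."""
    if shutil.which("strace") is None:
        raise RuntimeError("strace is not installed or not on PATH")
    # check=False: the traced program may legitimately exit with a non-zero code.
    subprocess.run(build_strace_command(command, output_path), check=False)
    with open(output_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Calling `capture_trace(["pwd"], "pwd.trace")` would run ‘pwd’ under strace and return one line of text per system call.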

Segregating System Calls: Preparing Data for Analysis

After capturing the raw system calls for both malware and legitimate applications, it’s essential to keep the two sets apart and store them separately for later analysis, as shown in the images below:

Legitimate app system calls
Malware app system calls
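One way to keep the two sets apart is to store each capture as a text file in a folder named after its class. The sketch below assumes a hypothetical layout with `legitimate/` and `malware/` subfolders holding `.txt` trace files; the folder names and the `load_labeled_traces` helper are illustrative assumptions, not the layout used in the screenshots.

```python
from pathlib import Path


def load_labeled_traces(root):
    """Return a list of (label, trace_text) pairs.

    Label 0 marks a legitimate application, label 1 marks malware.
    """
    samples = []
    for label, folder in ((0, "legitimate"), (1, "malware")):
        for path in sorted(Path(root, folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            samples.append((label, text))
    return samples
```

Keeping the label in the folder name means every trace already carries its class when we reach the training step.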

Dissecting System Calls: The Key to Unlock Malware Secrets

Now, let’s turn our attention to examining the structure of a system call. Think of it as opening a book to understand its content. We’ll open one of the captured files and take a peek into its anatomy.

As you will see from the image below, every system call carries a unique identity. It has a distinctive name, and it comes with a set of arguments and a return value.

Raw System Calls
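To make that anatomy concrete, here is a small sketch that splits one strace-style line into its name, arguments, and return value. The pattern assumes the common `name(args) = result` layout; the `parse_syscall` helper and the sample line in the usage note are illustrative and not taken from the captured files. Lines in other layouts, such as unfinished or resumed calls, simply return `None`.

```python
import re

# name, then everything inside the outer parentheses, then the value after "=".
SYSCALL_LINE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>.+)$")


def parse_syscall(line):
    """Split one trace line into name, args and ret, or return None."""
    match = SYSCALL_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()
```

For an illustrative line such as `execve("/usr/bin/pwd", ["pwd"], 0x7ffd0 /* 20 vars */) = 0`, the `name` field is `execve` and the `ret` field is `0`.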

A system call can be any of those shown in the images below, which offer vital clues about what the system call is, what it does, and how it works. It’s like unlocking the secret language of your computer’s operating system.

System Call Symbols
Examples of System Calls

As we venture forward in this exploration, our focus will narrow down to system call names and their corresponding sequences and frequencies. This is where machine learning enters the picture, transforming complex data into digestible insights.
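To illustrate what a sequence feature could look like, the sketch below counts pairs of consecutive system call names (bigrams). Treating sequences as bigrams is an assumption made for this example, and `bigram_counts` is a hypothetical helper; the series does not state which sequence representation is used.

```python
from collections import Counter


def bigram_counts(names):
    """Count each pair of consecutive system call names."""
    return Counter(zip(names, names[1:]))
```

Given the names `["openat", "read", "close", "openat", "read"]`, the pair `("openat", "read")` is counted twice, which hints at a repeated open-then-read pattern.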

To do this, I’ve harnessed the power of regular expressions, an invaluable tool in parsing textual data. It helps us sift through the plethora of system calls and identify the ones that are significant to our analysis.

Below, you can view an image that visually represents how regular expressions enable this process, simplifying the extraction of crucial information from the vast sea of system calls. Remember, despite the technicality of the process, it’s like sorting through a mixed bag of items to pick out the ones we need.

Count of each system call of one program
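A count like this can be sketched with Python’s `re` and `collections.Counter`. The pattern below is an assumption: it treats the word before the first opening parenthesis on each line as the system call name. It is not necessarily the exact expression used to produce the screenshot, and `count_syscalls` is a hypothetical helper.

```python
import re
from collections import Counter

# The system call name is the word at the start of a line, right before "(".
SYSCALL_NAME = re.compile(r"^(\w+)\(", re.MULTILINE)


def count_syscalls(trace_text):
    """Return how many times each system call name appears in a trace."""
    return Counter(SYSCALL_NAME.findall(trace_text))
```

Lines that do not start with a call name, such as strace’s final exit message, are skipped, so the result holds only system call names and their counts.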

In the next part of this blog series, we will train our own classifier on this dataset and test it. Then we will look at the predictions it produces…

Link to Part 3
