SAFE: a step into the creation of embeddings for binary code similarity detection

May 7 · 4 min read
“Can you spot the differences between function A and function B? The same source code can easily produce widely different binary code if compiled with different software or settings.”

Have you ever compiled a simple function written in C with two different compilers and looked at the resulting binary code? Chances are the two outputs look very different, and it will take some time (and experience!) to realize that the two binary functions actually implement the same C function!

When I learned assembly a few years ago, one aspect I found a bit disturbing was that the same semantics can be implemented in infinitely many ways. Personally, I don't like things that expose this level of variability; I much prefer numbers and math. So what if we could answer the question "have these two binary functions been compiled from the same C code?" by looking at a single number? That would be great for me. But it would be great for a computer too!

Recently, we implemented a neural network that tries to answer this question. Our objective was to transform the binary code functions contained inside any software into vectors (also called embeddings). The vectors are generated in such a way that if two binary functions have been compiled from the same source code, their vectors are close in terms of cosine similarity. We called our solution SAFE (Self-Attentive Function Embedding); it computes binary function embeddings that let us decide whether two functions are similar or not.
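The cosine-similarity criterion is easy to state in code. Here is a minimal sketch in NumPy, with made-up three-dimensional vectors standing in for real SAFE embeddings (which are much higher-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a . b / (|a| |b|), in [-1, 1]; closer to 1 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of two compilations of the same C function
# and one unrelated function.
f_gcc   = np.array([0.9, 0.1, 0.4])
f_clang = np.array([0.8, 0.2, 0.5])
f_other = np.array([-0.7, 0.6, -0.2])

print(cosine_similarity(f_gcc, f_clang))  # high: same source code
print(cosine_similarity(f_gcc, f_other))  # low: different functions
```

A high cosine similarity between two function embeddings is the single number that answers "same source code?".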

How does SAFE work?

SAFE first transforms each instruction into an embedding vector. Then, using a self-attentive network, the instruction embeddings are aggregated to compute the embedding of the whole function.
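The aggregation step can be sketched as attention-weighted pooling: each instruction gets a relevance score, the scores become weights via softmax, and the function embedding is the weighted sum. This toy NumPy version uses a single learned query vector `w` as a stand-in for SAFE's actual self-attention mechanism, so it is an illustration of the idea, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(instr_embs, w):
    # instr_embs: (n_instructions, d) matrix of instruction embeddings.
    # w: (d,) attention query (hypothetical stand-in for SAFE's attention net).
    scores = instr_embs @ w       # one relevance score per instruction
    alpha = softmax(scores)       # attention weights, sum to 1
    return alpha @ instr_embs     # weighted sum = function embedding

rng = np.random.default_rng(0)
instr_embs = rng.normal(size=(5, 8))  # 5 instructions, 8-dim embeddings
w = rng.normal(size=8)
func_emb = attentive_pool(instr_embs, w)
print(func_emb.shape)  # (8,): one vector for the whole function
```

Whatever the number of instructions, the output is a single fixed-size vector per function.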

I will try to briefly explain how we designed SAFE. SAFE leverages two famous techniques borrowed from natural language processing:

  • Word2vec
  • Self Attentive Recurrent Neural Network

The former lets us associate an embedding vector with each assembly instruction, treating it as a term in a natural language. We trained the word2vec model on a large corpus of assembly code and called the resulting model Instruction2Vec (i2v). The power of the i2v model lies in its ability to identify instructions that have the same semantics. For example, after training we tested the model by asking it to solve analogies like:

  • push rbp : push rsp = pop rbp : ?
  • add rax rbx: sub rax rbx = add rax rcx : ?

The model was able to solve these simple analogies, answering "pop rsp" for the first and "sub rax rcx" for the second.
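Such analogies are answered by vector arithmetic, as in word2vec: the nearest vocabulary entry to v("push rsp") - v("push rbp") + v("pop rbp") should be v("pop rsp"). The following toy illustration builds hypothetical compositional vectors (mnemonic part plus operand part) by hand; the real i2v model learns comparable regularities from a large assembly corpus:

```python
import numpy as np

# Hypothetical compositional embeddings: each instruction is the sum of a
# mnemonic vector and a register vector, plus a little noise.
rng = np.random.default_rng(1)
parts = {p: rng.normal(size=16)
         for p in ["push", "pop", "rbp", "rsp", "rax"]}
vocab = {f"{m} {r}": parts[m] + parts[r] + 0.01 * rng.normal(size=16)
         for m in ["push", "pop"] for r in ["rbp", "rsp", "rax"]}

def solve_analogy(a, b, c):
    # a : b = c : ?  ->  nearest vocab entry to v(b) - v(a) + v(c)
    target = vocab[b] - vocab[a] + vocab[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((k for k in vocab if k not in (a, b, c)),
               key=lambda k: cos(vocab[k], target))

print(solve_analogy("push rbp", "push rsp", "pop rbp"))  # "pop rsp"
```

Because the "push"/"rbp" components cancel out, the arithmetic isolates the "pop" + "rsp" direction, which is exactly the structure the analogy test probes.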

After this first encouraging result, we used a self-attentive recurrent neural network to "merge" the embeddings of single instructions into an embedding for a whole function. The network takes as input the list of instructions contained in a function and returns the function's embedding.

To train the network we used a siamese approach: we fed pairs of similar and dissimilar functions to two identical networks with shared weights. The loss function was based on the cosine similarity between the embeddings produced by the two networks.
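The siamese objective can be sketched in a few lines: one shared set of weights embeds both functions of each pair, and the loss pushes the cosine of each pair toward its label (+1 for similar, -1 for dissimilar). This NumPy sketch uses a hypothetical linear map `W` in place of SAFE's self-attentive recurrent network, and only evaluates the loss (the real training minimises it by gradient descent):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def siamese_loss(W, pairs, labels):
    # W: weights shared by BOTH branches (the "two identical networks").
    # pairs: (x1, x2) raw function representations.
    # labels: +1 for "compiled from the same source", -1 otherwise.
    loss = 0.0
    for (x1, x2), y in zip(pairs, labels):
        e1, e2 = W @ x1, W @ x2          # same W on both sides
        loss += (cosine(e1, e2) - y) ** 2
    return loss / len(pairs)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
pairs = [(rng.normal(size=6), rng.normal(size=6)) for _ in range(8)]
labels = [1, -1, 1, -1, 1, -1, 1, -1]
print(siamese_loss(W, pairs, labels))
```

Because the weights are shared, "the two equal networks" are really one network applied twice, which is why a single branch suffices at inference time.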

After training, we kept only one of the two identical networks and used it to infer embeddings for the functions given in input.

SAFE can also identify similar functions compiled for different platforms (e.g., ARM and x86)! This makes it useful for analyzing the firmware of embedded devices, which is often built on top of different hardware architectures.

Embedding Power!

SAFE embeddings can be used for multiple tasks to help the reverse engineer.

Building a function search engine. Oftentimes, when disassembling a program, the analyst spends a lot of time analyzing functions that are simple variants of something they have already seen. Reverse engineering tools offer some features to avoid this (e.g., IDA's FLIRT). However, embeddings created with SAFE are resilient to recompilation of a function, so they can match many more functions than standard tools.
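Such a search engine reduces to nearest-neighbour lookup under cosine similarity. Here is a toy sketch with random stand-in vectors; a real deployment would use actual SAFE embeddings (and, at scale, an approximate index such as Annoy or Faiss):

```python
import numpy as np

def search(query, db, top_k=3):
    # Normalise rows so a dot product equals cosine similarity.
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n
    best = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in best]

rng = np.random.default_rng(2)
db = rng.normal(size=(100, 16))              # embeddings of 100 known functions
query = db[42] + 0.05 * rng.normal(size=16)  # a slightly perturbed variant of #42
results = search(query, db)
print(results[0][0])  # nearest match: function 42
```

The same lookup drives the vulnerability-hunting use case below: the query embedding is the known-vulnerable function, and the database is the firmware under analysis.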

Finding vulnerabilities. What if you want to find out whether your home router is affected by the Heartbleed vulnerability? With SAFE embeddings, you just need one example of the function affected by the bug: compute its embedding, then use it to explore your router's firmware in search of the vulnerability!

Extracting function semantics. SAFE embeddings also capture the semantics of a function. This means that, for example, it is possible to distinguish a function that implements a sorting algorithm from one that implements an encryption algorithm.

Semantic classification of functions using SAFE embeddings.

Automatic creation of function signatures. Using SAFE you can also build function signatures to include in your YARA rules! This gives you a faster way to create rules that match the functions you are searching for, without looking at a single line of assembly code. We developed a YARA module that implements this functionality!

Finally, SAFE is open-source! To start playing with embeddings, take a look at our repositories:

Written by Luca Massarelli, PhD Student @ Sapienza University of Rome (massarelli@diag.uniroma1.it)
