Judging a book by its cover — Part Two: Analysing Malware Metadata at Scale

Chase
Published in CSIT tech blog
Aug 19, 2022

Introduction

In Part One of this series, we shared what PE metadata is and how to identify Developer-related PE metadata. We caught a glimpse of how insights from PE metadata can help us detect malware, attribute it to its source, and understand its evolution. You may also recall that we talked about another type of PE metadata that can be useful for analysis. That’s right, Design-related PE metadata!

Unlike Part One’s Developer-related PE metadata, which are akin to “fingerprints” of individual malware authors, this article’s Design-related PE metadata characteristics are produced based on the design and compilation environment of the malware. Such metadata not only provide analysts with a different set of insights, but are also harder to modify without affecting the consistency of the PE file, such as disrupting the execution flow of the malware, or leaving “breadcrumbs” that show evidence of tampering by the malware authors.

At CSIT, analysts have to plough through a mountain of malware every day. After all, the sheer volume of new malware found each day can be enormous! The good news is that there are benefits to working with huge amounts of data: it opens up analysis opportunities that can reveal meaningful perspectives beyond what small-scale analysis of samples can offer.

Machine-learning techniques are part of our toolkit to build an automated system that crunches malware on a large scale and churns out calculated predictions. To do this, we need to build a model that can do all the heavy processing for us, and prepare suitable input data for our automated system. PE metadata (both developer and design-related) is one such valuable input. It can be used to distinguish members of one malware family from another, or identify similarities between a new sample and previously analysed samples. We saw some of this in Part One and will see more in this article.

For this article, let’s start by unpacking what lies within Design-related PE metadata. Then, we will use our newfound understanding to build an automated system that applies machine learning to both types of metadata and starts clustering malware samples at scale!

Section names — Chapters of a malware

The Sections Table lists all the sections present within the PE file. We may think of it as something like the contents page of a book: it records the details associated with each section, such as its name, location, and data size.

Some common sections that you may find in most PE files are .text (executable code), .data (initialised global data), .rdata (read-only data such as strings and constants), .rsrc (resources such as icons and version information), and .reloc (base relocation information):

Common section names and their usage

Sometimes, we do see uncommon section names that have been intentionally named by the malware authors or added by other programs such as packers. Packers are used by malware authors to compress and obfuscate their malware, and in doing so, hide their malicious code. UPX is one example of a packer that is widely used in malware. The original file is passed into the packer routine, stored in a packed section, and usually given a unique section name, such as “UPX0”, in a new PE file.

Sections Table of a UPX packed malware

As the section names added by packers tend to stick out, identifying them helps us to surface suspicious files. Thanks to @Hexacorn, this comprehensive list of known section names can be used to quickly filter out PE files with suspicious section names from a heap of samples.
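To make this concrete, here is a minimal sketch (using the open-source pefile library) of how section names could be checked against a small allow-list of common names; the allow-list below is an illustrative subset, not Hexacorn’s full reference list.

```python
import pefile

# Illustrative subset of common section names; reference lists such as
# Hexacorn's cover far more.
COMMON_SECTIONS = {".text", ".data", ".rdata", ".bss", ".idata",
                   ".edata", ".rsrc", ".reloc", ".tls", ".pdata"}

def suspicious_sections(path):
    """Return section names in the PE file that are not in the common list."""
    pe = pefile.PE(path, fast_load=True)
    names = [s.Name.rstrip(b"\x00").decode(errors="replace") for s in pe.sections]
    return [n for n in names if n not in COMMON_SECTIONS]

# Example: a UPX-packed file would typically surface "UPX0"/"UPX1" here.
print(suspicious_sections("sample.exe"))
```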

Import Data — Importing the similarity

There are many ways to write a program: engineers use different APIs, libraries, and sequences of importing functions. These differences result in varied constructions of the Import Address Table (IAT), which stores information on imported functions. Malware authors sometimes reuse code across samples or even variants, and this may produce exact or similar import data among their samples. Hence, spotting identical or similar IATs (for example, by comparing hashes of the IATs, also called ImpHash-es) is a way to identify malware clusters.
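pefile can compute this hash directly from the parsed import table; a minimal sketch:

```python
import pefile

def imphash(path):
    """Compute the ImpHash: an MD5 over the normalised dll.function import list."""
    pe = pefile.PE(path)
    return pe.get_imphash()

# Samples sharing an ImpHash very likely share the same import table layout.
print(imphash("sample.exe"))
```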

The malware family MiniDuke, used in the Dukes campaigns, shows Import data metadata at play. In the visualisation below, there are 11 unique ImpHash-es that are mapped from more than 2000 samples. It is likely that these samples were created from a small pool of code bases, where clusters represent samples that possibly share very similar code implementations.

11 Unique ImpHash-es clusters formed over 2000+ Miniduke malware samples

While ImpHash-es are a powerful way to identify related malware based on the relative uniqueness of the import data, they are not robust enough to pick up minor import differences. Malware authors could change the order in which they import functions or libraries so that their samples would not generate the same ImpHash. To make up for the shortcomings of ImpHash, FuzzyImportHash can be used to measure the degree of similarity of import data instead of doing a binary comparison.
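The exact FuzzyImportHash implementation is not shown here, but the idea can be sketched with any similarity-preserving hash; below we use ssdeep over the ordered import names purely as a stand-in, so treat it as an illustration of the concept rather than the actual algorithm.

```python
import pefile
import ssdeep  # pip install ssdeep

def fuzzy_import_hash(path):
    """Fuzzy-hash the ordered list of dll!function import names.

    This sketches the *idea* behind FuzzyImportHash using ssdeep as a
    stand-in similarity hash; the real implementation may differ.
    """
    pe = pefile.PE(path)
    imports = []
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="replace").lower()
        for imp in entry.imports:
            name = imp.name.decode(errors="replace") if imp.name else f"ord{imp.ordinal}"
            imports.append(f"{dll}!{name}")
    return ssdeep.hash(",".join(imports))

# Unlike ImpHash's exact match, ssdeep.compare() yields a 0-100 similarity score,
# so minor reordering or small additions still register as "similar".
score = ssdeep.compare(fuzzy_import_hash("a.exe"), fuzzy_import_hash("b.exe"))
print(score)
```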

Rich Header — The “Rich”-est information indeed!

The Rich Header has been part of the PE file format since the release of Visual Studio 1997 SP3. It contains information such as the compiler, linker, and imports that were used in the compilation, and a brief description of the overall structure of the PE file. However, to this day, Microsoft has not released any introduction or documentation for it, and it remains a mystery “treasure box”.

The Rich Header consists of several Rich Header entries and an XOR key that is used to encrypt the entire Rich Header. Each entry has the following structure:

Breakdown of Rich Header entry

Each Rich Header entry is identified by a unique pair of Product ID and Product Version, and the total number of occurrences of that pair is recorded in the Count field.

In this section, we will use PE Studio to extract the Rich Header, and show how it can be used for malware clustering. If you would like to explore unlocking this “treasure box” by yourself, you may refer to this detailed guide by Erik Pistelli.

In the figure (extracted Rich Header) below, we can glean information such as:

  1. The estimated number of files and imports that were used in the development project;
  2. An indication of the existence of data directories in the PE file (Export and Resource directories); and
  3. The most “probable” compilation timestamp, which can be inferred by comparing the release dates of the compiler and linker.

Rich Header of a Miniduke sample

Hence, the Rich Header entries, when placed together, can be considered relatively unique to a project as they can remain unchanged for weeks or even months throughout its development cycle.

An extensive research study by SANS examined the effectiveness of using the Rich Header for malware detection and linking. In the paper, the Rich Header Hash (Product ID, Product Version & Count) and the Rich Header PV Hash (Product ID & Product Version) were computed and used as fingerprints of the malware. Two PE files with identical Rich Header Hashes (plausibly different versions of the same project) can be identified as having been built in the same environment.
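As a rough sketch of how such fingerprints might be computed, the snippet below derives both hashes from the decoded Rich Header entries using pefile. Note that parse_rich_header() and the layout of its returned values are assumptions based on recent pefile versions, and the exact hashing recipe in the SANS paper may differ.

```python
import hashlib
import pefile

def rich_hashes(path):
    """Compute MD5 fingerprints over Rich Header entries.

    Returns (rich_hash, rich_pv_hash):
      rich_hash    covers (Product ID, Product Version, Count)
      rich_pv_hash covers (Product ID, Product Version) only
    Sketch only: the exact recipe used in the SANS study may differ.
    """
    pe = pefile.PE(path)
    rich = pe.parse_rich_header()      # assumed available in recent pefile versions
    if not rich:
        return None, None
    values = rich.get("values", [])    # assumed: alternating comp.id / count dwords
    full, pv = hashlib.md5(), hashlib.md5()
    for comp_id, count in zip(values[::2], values[1::2]):
        prod_id, prod_ver = comp_id >> 16, comp_id & 0xFFFF
        full.update(f"{prod_id}.{prod_ver}.{count};".encode())
        pv.update(f"{prod_id}.{prod_ver};".encode())
    return full.hexdigest(), pv.hexdigest()

print(rich_hashes("sample.exe"))
```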

While it is possible to modify a Rich Header, it is much more difficult to forge a realistic one and still maintain consistency with the other parts of the PE metadata. This difficulty can be seen in one of the “most deceptive hacks in history”, the Olympic Destroyer case in 2018, where the Rich Header was deliberately modified to mask its provenance and cause misattribution. In the article, analysts determined from its Rich Header that the malware appeared to have been built using Visual Studio 6.0. However, when this information was compared against the other parts of the PE file, the sample appeared to be importing functions from a library (mscoree.dll) that did not even exist at that point in time!

The Rich Header forgery ultimately resulted in an inconsistency between the Rich Header and the other components of the malware. This triggered the suspicions of analysts, who went on to expose the malware author’s intent of imitating other threat groups’ samples. Indeed, the act of modifying the Rich Header convincingly is not easy, and may instead become an indicator used to reveal the malware’s true colours. Hence, looking out for such inconsistencies can potentially assist us in performing attribution more accurately.

Dealing with all the malware

As we mentioned in the beginning, according to Kaspersky, around 360,000 malicious files were detected daily in 2020, and this number increases every year! When a new suspicious artefact is detected, analysts must first determine whether it is malicious or benign, then identify the family it belongs to, and finally craft a signature to detect it. This can be very challenging, and the time required to manually analyse and craft a signature for each malware can range from hours to days, or even months.

Based on an average of 360,000 new malware samples detected daily, approximately 5,000,000 hours a day (roughly 14 hours per sample) would be needed to manually analyse and respond to all of them promptly. You can tell where this is going: it is humanly impossible to manually analyse such a tremendous amount of malware. This is where machine learning techniques come into play, allowing us to build an automated system that helps us analyse, sieve, and prioritise the files that require deeper analysis. But there are so many machine learning techniques; which one do we use?

We will describe how we employ Clustering, a common unsupervised machine learning approach, to cluster and analyse samples based on PE metadata. Clustering works on datasets in which there is no outcome (target) variable nor anything known about the relationship among the data points, i.e. unlabelled data. Clustering algorithms find structure in the data so that elements of the same cluster (or group) are more similar to each other than to those of different clusters. The machine learning model will be able to infer the most probable classes to group the samples purely by comparing the samples against each other, without any prior information about the data. Hence, it is also known as learning without a teacher.

Metadata-driven Malware Clustering — Where does this belong?

There are many models and algorithms that we can use to perform cluster analysis. For this article, we will use one of the off-the-shelf techniques, the K-means algorithm, with input data drawn from a combination of both design-related and developer-related PE metadata. This application can perform a quick triage that sieves out interesting samples within minutes, helping us prioritise the samples of interest for more detailed analysis.

The K-means algorithm requires us to provide an assumption, or “guess”, of the “true” number of clusters k, and then identifies these clusters among the input data through a learning process. To find a satisfactory clustering result, a few iterations are usually needed to experiment with different values of k. This can be done with several statistical measures or with visual verification. Visual verification is widely used to present clustering results because of its simplicity and explainability.

To keep things simple, we will work with PE metadata extracted from samples of the three malware campaigns (Cloud Hopper, Hangover, and Dukes) that we have seen throughout this series, and use visual verification to identify an optimal number of clusters for our datasets. For simplicity, we will also assume that the samples have not been intentionally altered. In our demonstration, we will increase the k values (k = 3, 4 and 5) and compare the formation of the different clusters.
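For a feel of what this looks like in code, here is a minimal sketch that hashes string-valued PE metadata into fixed-length vectors and runs K-means for a few candidate values of k with scikit-learn; the feature encoding and example values are illustrative, not our production pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction import FeatureHasher

def cluster_metadata(samples, k_values=(3, 4, 5)):
    """Cluster samples described by dicts of PE metadata strings, e.g.
    {"imphash": "c24262a0...", "richhash": "9072098e...", "sections": ".text|.rdata"}.
    """
    # Hash the string-valued metadata into fixed-length numeric feature vectors.
    hasher = FeatureHasher(n_features=256, input_type="string")
    X = hasher.transform(
        [f"{key}={val}" for key, val in s.items()] for s in samples
    ).toarray()

    # Try a few candidate values of k and compare the resulting clusterings.
    models = {}
    for k in k_values:
        models[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, "inertia:", round(models[k].inertia_, 2))

    # Project to 2-D (e.g. with PCA) for the visual verification described above.
    coords = PCA(n_components=2).fit_transform(X)
    return models, coords
```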

K-means clustering with different k values (Centroids are marked with red cross)

In the diagrams for the different values of k above, we can see that many of the samples in the first two diagrams are not tightly grouped around their cluster’s centroid (marked with a red cross). This is most apparent for the samples in the cluster labelled A (the brown region when k is 3) and the cluster labelled B (the cyan region when k is 4). From here, we can hypothesise that k = 5 might be an optimal value for our dataset.

Our automated system (an unsupervised learning model) has clustered these samples without a training data set or any known outcomes. Essentially, our automated system approaches the clustering problem blind, using only the input data it processes to sort the samples into clusters. To further expand our understanding of these clusters, let’s see how they turn out when we reveal the “model answers” (the malware campaigns that the samples belong to).

K-means clustering with different k values and labels included (Centroids are marked with red cross)

In the diagrams (updated with the “model answers”) above, our automated system was able to identify and cluster most of the malware samples together with other samples from their respective malware campaigns (Cloud Hopper, Hangover, and Dukes). It also reaffirms our hypothesis that k = 5 might be the more optimal value. However, it may be puzzling that five homogeneous groups (i.e. clusters) emerged as the optimal number when there are only three malware campaigns. Let’s dive deeper into possible explanations for this outcome.

K-means clustering with malware campaigns (Centroids are marked with red cross)

After identifying the optimal value of k, we can now study the degree of similarity between malware from different families used in the same campaign. We can see that the Dukes and Cloud Hopper families formed tighter clusters among their respective samples than the Hangover family did. A tighter cluster could be due to the samples within the family exhibiting stronger similarities in their PE metadata. From the results, we can reasonably deduce that the Dukes and Cloud Hopper samples belong to their respective campaigns, as they exhibit similar PE metadata traits within their own campaigns. Analysts can use these traits to derive generalised signatures that work against them.

One observation you may have made is that the Hangover family itself forms three distinct clusters. At this point, we may conjecture that samples from the Hangover campaigns share three unique sets of PE metadata. But what does that mean, and how do we find out? Let’s take a closer look to understand how these three Hangover clusters may have formed.

Three Hangover clusters? — Understanding the clusters

Hangover clusters relabeled into Group A, B and C

Since there are three clusters, we have relabelled them as Group A, Group B, and Group C for easier identification. Now, let’s zoom into each group’s PE metadata to discover whether anything stands out! By analysing each group’s PE metadata, we may be able to deduce patterns that our model picked up but that analysts could have missed.

Group A is the biggest of the three groups (more than 100 Hangover samples) and, at first glance, has one prominent characteristic: the lack of information from Export data (0 exported functions) and Debug data (only 3 unique PDB paths extracted from 3 different samples). This may mean that the samples classified into Group A are executables (EXE), and that their debug data were removed during the compilation process.

Extracted ImpHash-es and RichHash-es from samples in Group A

As we continue to look at the other PE metadata in Group A — Import data (ImpHash generated from the imported functions) and Rich Header (RichHash generated from the Rich Header entries), we see some interesting points: Over 50 samples have both identical ImpHash `c24262a0a40a5019cec3329c7b32c5a3` and RichHash `9072098ed4c00fe3b7cac80383a6f05e`. One possible explanation for the above observation is that these samples have similar implementations and were built in one unique development environment. This is an example of how the combination of PE metadata, when used together, can corroborate malware samples’ similarity to each other.
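Once the hashes have been extracted, surfacing such overlaps can be as simple as counting how many samples share each (ImpHash, RichHash) pair; a small sketch using only the standard library:

```python
from collections import Counter

def top_metadata_pairs(records, n=5):
    """Return the n most common (ImpHash, RichHash) pairs and their counts.

    `records` is a list of dicts produced by the metadata extraction step, e.g.
    {"sha256": "...", "imphash": "c24262a0a40a5019cec3329c7b32c5a3",
     "richhash": "9072098ed4c00fe3b7cac80383a6f05e"}.
    """
    pairs = Counter((r.get("imphash"), r.get("richhash")) for r in records)
    return pairs.most_common(n)

# A pair shared by 50+ samples suggests a common code base and build environment.
```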

You may be wondering how this information can be useful. Remember that one of our reasons for building an automated analyser is to sieve through massive amounts of data and flag the interesting files for analysts’ attention? From the results of our analysis, you can see that these 50+ samples are highly likely to have similar implementations, routines, and functions. If such samples have been analysed before, they may not require as many resources to re-analyse, provided there is no significant difference when compared against their older versions. This reduces the additional effort required to attribute these samples to their author. Analysts can then focus their efforts on dealing with novel or enhanced malware.

Comparison of PDB Paths extracted from samples in Group B and Group C

Let’s continue to look at Group B’s and Group C’s PE metadata. One clear similarity between the two groups is the overlapping PDB file names such as “HTTP_T.pdb”, “Ron.pdb”, “FirstBloodA1.pdb” found in the Debug data (PDB paths extracted from the samples). When analysing PDB paths, we may look at the absolute PDB path or just the PDB file name, both of which are valuable as part of PE metadata. For our demonstration, our model used the absolute PDB path as one of its features. Certainly, there are several ways of implementing features for Debug data, such as using the PDB file name, or further processing it by slicing the absolute path into several parts and using them as features.
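For illustration, the sketch below pulls the PDB path from the CodeView debug entry with pefile and derives the candidate features mentioned above (absolute path, file name, and path components); the attribute names assume pefile’s CodeView parsing and may need adjusting for other debug entry types.

```python
import ntpath
import pefile

def pdb_features(path):
    """Extract the PDB path from the CodeView debug entry and derive features.

    Returns the absolute path, the bare file name, and the split path
    components; which of these to feed the model is a design choice.
    """
    pe = pefile.PE(path)
    for dbg in getattr(pe, "DIRECTORY_ENTRY_DEBUG", []):
        pdb = getattr(dbg.entry, "PdbFileName", None)  # present for CodeView entries
        if pdb:
            full = pdb.rstrip(b"\x00").decode(errors="replace")
            return {
                "pdb_path": full,
                "pdb_name": ntpath.basename(full),
                "pdb_parts": [p for p in full.replace("/", "\\").split("\\") if p],
            }
    return {}
```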

Taking a closer look at the Group B and Group C samples, we see that most of the PDB paths are unique and do not indicate clear cluster patterns. Similarly, no defining similarities were found in the Import data and Rich Header extracted from the samples in these two groups. Considered on their own, these features do not offer a strong explanation for why the two clusters were formed. We still do not know why there are two disparate clusters instead of just one, or several mini clusters.

But when we examined the other PE metadata, we found one possible feature that our model had identified and used to form these two groups: the resource languages. Each group had its own unique set of resource languages (captured as a ResourcesLangHash generated from the resources’ language identifiers), and the samples sharing each set were also found to carry different sets of PE metadata such as Debug data, Import data, and Rich Header. This may be why our model identified two clusters. In this case, the separation may be worth a second look, with analysts placing these samples at a higher priority.

This separation may be an indication of an evolution in the features of the samples, or it may help identify other malware actors who are trying to insert misleading footprints to misdirect analysts’ attribution efforts. In either scenario, these samples could be ranked higher so that they are reassessed.
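For completeness, here is one way such a resource-language feature could be derived: walk the resource directory with pefile and hash the sorted language IDs. The name ResourcesLangHash and this exact recipe are assumptions for demonstration; the actual feature may be computed differently.

```python
import hashlib
import pefile

def resources_lang_hash(path):
    """Hash the set of resource language IDs found in the PE's resource tree.

    Sketch only: the ResourcesLangHash described above may be computed differently.
    """
    pe = pefile.PE(path)
    root = getattr(pe, "DIRECTORY_ENTRY_RESOURCE", None)
    if root is None:
        return None
    langs = set()
    for res_type in root.entries:
        if not hasattr(res_type, "directory"):
            continue
        for res_id in res_type.directory.entries:
            if not hasattr(res_id, "directory"):
                continue
            for res_lang in res_id.directory.entries:
                if res_lang.id is not None:      # language identifier of this resource
                    langs.add(res_lang.id)
    joined = ",".join(str(lang) for lang in sorted(langs))
    return hashlib.md5(joined.encode()).hexdigest()
```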

So far, we have discussed how machine learning techniques can be applied to PE metadata to help analysts perform a quick triage on the new samples that emerge daily. For example, rather than manually trying to identify clusters, we can offload the task to the automated system, which is not only faster but can also be more effective! But we must not forget that machine learning techniques might not always produce the “right” answer. The outcomes can still be biased due to inadequate models or unrepresentative datasets. We therefore need to know how to improve our model, so let’s consider experimenting with various features and training parameters.

Answering the bigger questions (I) — What about crafted PE metadata?

One limitation of our model is that it does not account for the problem of misinformation in PE metadata. In our two-part series, we have seen how some malware authors craft PE metadata to mislead analysts’ attribution efforts. On the other hand, we also saw how these crafted PE metadata can be a double-edged sword that can be used against them by revealing tell-tale signs.

For example, if a malware always has its Rich Header copied from another malware family, this becomes an identifiable trait of the malware families or even the authors. Such hypotheses can be folded into the clustering model, to help analysts detect misinformation in PE metadata, and perform more precise attribution. In other words, once misinformation is identified, it flips from being an obstacle to a valuable component of your analysis.

Answering the bigger questions (II) — How can we boost the effectiveness of the model?

A limitation of the earlier clustering model is that all PE metadata are treated with equal weightage, but in reality, some PE metadata are more useful than others. Another limitation is that we only used PE metadata, which is just one part of the PE file; other parts can also help to refine our model. Let’s consider loosening these conditions.

Moving away from treating PE metadata equally …

As Design-related PE metadata cannot be easily altered without implications, they deserve a higher weighting than Developer-related PE metadata. The natural follow-up question is: how much more weight? Machine-learning techniques can help us answer this question. Iterative training can fine-tune the weights to give us optimal values for the PE metadata features based on their relative correlations with family and authorship. These refinements will improve the model’s ability to take on large-scale PE metadata.
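One simple way to express such a weighting before clustering is to scale each feature block, up-weighting the design-related columns; the weights below are placeholders that would be tuned iteratively, not values we have validated.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_kmeans(X, k, design_cols, developer_cols, w_design=2.0, w_developer=1.0):
    """Cluster after up-weighting design-related feature columns.

    The weight values are placeholders; in practice they would be tuned
    iteratively, e.g. by checking cluster agreement with known family labels.
    """
    Xw = np.asarray(X, dtype=float).copy()
    Xw[:, design_cols] *= w_design        # e.g. Rich Header, imports, sections
    Xw[:, developer_cols] *= w_developer  # e.g. PDB paths, timestamps
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xw)
```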

Moving away from only PE metadata …

Cluster analysis that only uses PE metadata does not factor in the dynamic behaviour and heuristics of malware families, usually classified as Tactics, Techniques and Procedures (TTPs). Some examples include a malware’s Command and Control communication mechanisms, or the unique file paths that the malware creates for its execution. These unique traits are useful information that can be used as part of malware family or author identification. Two common frameworks used by malware analysts to describe these characteristics are MITRE ATT&CK and the Malware Behaviour Catalog (MBC). Behaviours mapped to these frameworks can be obtained from open-source dynamic analysis tools such as Cuckoo Sandbox, or capability detection tools such as Capa by Mandiant’s FLARE team, and used to improve our models.

Incorporating such data points and features can give better insights to help with identifying threats, crafting more effective signatures, and making attribution hypotheses. This article does not cover how to use such TTPs for analysis, but if you are interested, do look it up as your next learning chapter.

Malware Samples — Where to get them?

Now, we have looked at how malware clusters of different families (or even campaigns) can be formed with PE metadata. At the same time, we derived some meaningful insights that analysts can use to refine our model. If you are excited to get started but are missing the most critical component, malware samples, fear not! You don’t need them.

EMBER, a labelled benchmark dataset (scanned with VirusTotal), is an open-source repository that uses the LIEF project to extract PE metadata, byte histograms, and more from over 1 million benign and malicious PE files. The features in EMBER have been, and can be, used to create different machine learning models that statically detect malicious Windows PE files. With this, you can now kickstart your journey as a malware analyst! If you are keen to further your interest (and be paid a monthly salary while doing so — wink), we welcome you to join us at CSIT, or to try out other such opportunities.
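To sketch what getting started might look like, the snippet below loads EMBER’s vectorised features and clusters a subset of the malicious samples; read_vectorized_features and its return values are assumptions based on the ember package’s documented helpers, so check the project’s README for the exact API.

```python
import ember                      # the package from the EMBER repository
from sklearn.cluster import KMeans

# Assumes the vectorised EMBER features have already been created on disk
# (the repo provides a helper to build them from the raw JSONL feature files).
X_train, y_train, X_test, y_test = ember.read_vectorized_features("/data/ember2018/")

# In EMBER, y == -1 marks unlabelled rows; keep only labelled malware for a quick test.
malicious = X_train[y_train == 1]

# Cluster a subset of the malicious samples, mirroring the workflow above.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(malicious[:10000])
print(km.labels_[:20])
```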

Conclusion

PE metadata holds a wealth of clues that analysts can tap into for malware clustering studies. We can then anticipate upcoming threats, and create rules and signatures to eradicate new malware before it becomes the next headline. The analysis has to be done carefully, as PE metadata may be crafted to misdirect. But this dimension also makes the whole process challenging and constantly interesting.

The large-scale availability of samples is an information-rich asset which can be explored with emerging technologies like machine learning. Machines can be vastly better than humans at detecting patterns in such data, in terms of both speed and performance. This paves the way for exciting new ideas at the crossroads of cyber security and machine learning analytics. The union of data and technology opens up a new world of possibilities for building cyber security solutions. Be part of the CSIT team looking out for game-changers in the defence against malware!
