Judging a book by its cover — Part One: Dissecting Malware Metadata for Insights

By Chase · Published in CSIT tech blog · 12 min read · Mar 31, 2022

A book is made up of pages of information, and a book cover describes what the book is about. Similarly, malware contains information that is represented as bits and bytes, and akin to a book cover, metadata describes the contents of the malware. And just like books of the same series, a set of malware determined to have significant code overlaps is known as a malware family. Because samples from the same family commonly share the same author, they may exhibit similar fragments of that author’s techniques and quirks.

Featured Illustration: Chase

Unlike authors of books who identify themselves readily, malware authors do not typically identify themselves within the malware to avoid being pinpointed. Narrowing down the authorship of a malware helps analysts to derive strategic and operational insights. It allows them to uncover the origin of the malware, correlate the malware to previously known threats, or assist in developing techniques for thwarting future similar malware. This process of uncovering the provenance of the malware is called Malware Authorship Attribution.

As part of our work in CSIT, we analyse malware to understand the evolution of malware, craft effective signatures against them and hypothesise the identification of threat actors. Continuous research is also conducted to study state-of-the-art techniques used by threat actors. These provide early warning for the safeguarding of our cyberspace. Welcome to Part One of a two-part series on malware metadata analysis and the new world of possibilities it brings!

A Brief Primer on Portable Executable (PE) file — The structure of a malware

Before we get into identifying malware samples, we need to understand what malware consists of. Malware on Windows operating systems commonly appears as PE files. PE is a widely used file format developed by Microsoft. The most common ones that you are likely to recognise are files with the extensions exe (executables) and dll (dynamic link libraries). We use them every day! Some examples are our Internet-surfing applications like Google Chrome or note-taking tools such as Notepad++.

Illustration of PE File — Header and Sections

As illustrated above, a PE file (e.g. Hello.exe) is made of two components: Header and Sections. The Header comprises several headers that contain data about the file and information that supports loading the PE file into memory, while the Sections contain the code and data needed for execution. If you are interested in the technical details, we highly recommend a good primer and Microsoft’s article on the PE Format.
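
To make the Header and Sections layout concrete, here is a minimal sketch that reads a few header fields straight from the raw bytes of a PE file using only the Python standard library. The function name `parse_pe_headers` and the returned dictionary keys are our own choices for illustration. The offsets follow the published PE format, but this is not a full parser and it does no bounds checking on malformed files.

```python
import struct
from datetime import datetime, timezone


def parse_pe_headers(data: bytes) -> dict:
    """Return a handful of header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF File Header (20 bytes): Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
    machine, n_sections, timestamp, _, _, opt_size, characteristics = (
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    )
    # The section table follows the optional header. Each entry is 40 bytes
    # and starts with an 8-byte, NUL-padded section name.
    table = pe_offset + 4 + 20 + opt_size
    sections = []
    for i in range(n_sections):
        raw_name = data[table + 40 * i: table + 40 * i + 8]
        sections.append(raw_name.rstrip(b"\x00").decode("ascii", errors="replace"))
    return {
        "machine": machine,
        "timestamp": timestamp,
        "compiled_utc": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "is_dll": bool(characteristics & 0x2000),  # IMAGE_FILE_DLL flag
        "sections": sections,
    }
```

Calling `parse_pe_headers(open("Hello.exe", "rb").read())` on a real file would return the section names (for example `.text` or `.rsrc`) alongside the File Header timestamp discussed later in this post.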

Malware (PE) metadata — Looking at the “book cover” of a malware

Metadata is a static property, found in both Header and Sections of PE files, that can be analysed safely without executing the files. Traits found in PE Metadata can often be drawn upon to help analysts detect and attribute malware. But how?

Since humans are creatures of habit and habits tend to persist, the unique styles and techniques of malware authors tend to show up and persist in their malware. These patterns may appear in the metadata or the content of the malware. For this post, we will be focusing on the metadata. Malware metadata can be broadly grouped into two categories:

  1. Developer-related PE metadata — details that may reveal attributes of the malware author
  2. Design-related PE metadata — intrinsic properties of the malware that contain traits of its code bases and development environments
Developer-related & Design-related PE Metadata

Developer-related PE metadata can be understood as fragments of the authors’ habits or carelessness that can be regarded as “fingerprints” and author-specific malware traits. On the other hand, design-related PE metadata contains properties intrinsic to the design and the development environment of the malware that can be used to identify similar code bases, build environments, and even its family lineage. Hence, PE metadata is a valuable type of information found in malware: analysts may be able to use it to identify characteristics of the malware author and to craft detection rules.

In both Part One and Two of this series, we will be using PE Studio to dissect malware metadata and demonstrate how it can be analysed.

PE Checksum — The unchecked Checksum

The optional header contains a field for the checksum of the file. Typically, the build toolchain generates a checksum at build time and writes it into the checksum field. PE checksums were initially implemented to prevent corrupted drivers and critical system DLLs from being loaded. However, this validation does not apply to EXEs. In other words, a valid checksum is not required for an EXE to run on Windows.

Malware authors tend to modify parts of the file to evade detection, which results in the initially generated checksum becoming invalid. They may choose to null the checksum altogether, or sometimes simply forget to update the checksum after making the modifications. And this is why we often see malware with null or invalid checksums.
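
As a rough illustration of how a null or invalid checksum can be spotted, the sketch below recomputes the checksum from the raw file bytes and compares it with the stored value. It follows the algorithm commonly implemented in open-source PE tools (sum the file as 32-bit little-endian words with the checksum field treated as zero, fold the sum to 16 bits, then add the file length). The function names are ours, and we assume a well-formed file whose header offsets are intact.

```python
import struct


def checksum_field_offset(data: bytes) -> int:
    """File offset of OptionalHeader.CheckSum (same for PE32 and PE32+)."""
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[:2] != b"MZ" or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # PE signature (4) + COFF header (20) + 64 bytes into the optional header.
    return pe_offset + 4 + 20 + 64


def compute_pe_checksum(data: bytes) -> int:
    offset = checksum_field_offset(data)
    buf = bytearray(data)
    buf[offset:offset + 4] = b"\x00\x00\x00\x00"  # the field itself is excluded
    if len(buf) % 4:
        buf += b"\x00" * (4 - len(buf) % 4)  # pad to a whole number of dwords
    total = 0
    for (dword,) in struct.iter_unpack("<I", buf):
        total += dword
        total = (total & 0xFFFFFFFF) + (total >> 32)  # carry wraps around
    total = (total & 0xFFFF) + (total >> 16)  # fold to 16 bits
    total = (total + (total >> 16)) & 0xFFFF
    return total + len(data)


def classify_checksum(data: bytes) -> str:
    """Return 'null', 'valid' or 'invalid' for the stored checksum."""
    (stored,) = struct.unpack_from("<I", data, checksum_field_offset(data))
    if stored == 0:
        return "null"
    return "valid" if stored == compute_pe_checksum(data) else "invalid"
```

Running `classify_checksum` over a folder of samples gives the kind of valid versus invalid/null tally that can be used as a first, coarse filter.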

The diagram below illustrates the disparity between a random sampling of 10,000 benign and 10,000 malicious PE files with respect to valid and invalid/null checksum values.

Disparity between Benign and Malicious PE files

An astounding 94.2% of malicious PE files (malware) contain invalid/null PE checksum values. With so many samples and variants of malware sharing this PE trait, it can be a useful characteristic to broadly sieve out suspicious samples. Having said that, PE checksum should be used together with the analysis of other metadata to derive more definite findings.

PE Timestamps — Are you in the Zone?

Timestamps in PE files represent their (possible) compilation datetime. While the File Header is the first place that we look for timestamps, additional timestamps can be found in several other directories in the PE file.

TimeDateStamp fields located in different directories in a PE file

Malware authors may attempt to forge the timestamps to hide the true creation date and time of the malware. While it is difficult or even impossible for analysts to deduce the original timestamp from a forged one, we may be able to find inconsistencies that suggest its invalidity.

The following snippet of an Emotet sample is an example of timestamp inconsistencies of the different TimeDateStamp fields within the PE file.

Timestamp values of different directories in an Emotet malware sample

We can see that the compiler timestamp value has been modified to 0x00000000. However, the malware author probably forgot to modify the other timestamps in the PE file, such as the debug and export timestamps. These suggest a different creation date and time for the malware and expose the author’s intention to conceal the true compilation timestamp.
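
The same consistency check can be sketched in a few lines of Python. We assume the TimeDateStamp values have already been extracted into a dictionary keyed by where they were found. The key names, the one-day tolerance and the wording of the findings are illustrative choices rather than an established standard.

```python
from datetime import datetime, timedelta, timezone


def find_timestamp_anomalies(stamps, now=None, tolerance=timedelta(days=1)):
    """stamps maps a location name (e.g. 'file_header') to a 32-bit epoch value."""
    now = now or datetime.now(timezone.utc)
    findings = []
    decoded = {}
    for name, value in stamps.items():
        if value == 0:
            findings.append(f"{name}: zeroed")
            continue
        when = datetime.fromtimestamp(value, tz=timezone.utc)
        if when > now:
            findings.append(f"{name}: in the future ({when:%Y-%m-%d})")
        decoded[name] = when
    # Timestamps written by the same build should sit close together.
    if len(decoded) >= 2:
        earliest = min(decoded, key=decoded.get)
        latest = max(decoded, key=decoded.get)
        gap = decoded[latest] - decoded[earliest]
        if gap > tolerance:
            findings.append(f"{earliest} and {latest} are {gap.days} days apart")
    return findings
```

For a sample like the one above, with a zeroed File Header timestamp but populated debug and export timestamps, the function would report the zeroed field while the remaining values still hint at the real build time.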

Other than spotting such mistakes, timestamp values may reveal faint links to the malware’s author. Interestingly, malware authors do have work schedules, just as some of us have a 9-to-5, 5-day work week. This means that malware would likely be compiled at dates and times that correspond to the malware authors’ work periods.

With a substantial number of samples, we may be able to make some interesting observations. Simple statistical methods can be applied to determine the patterns of life of the malware authors. The malware families from the Cloud Hopper campaign are one such example.

Probable Cloud Hopper authors’ work schedules

From the illustration above, we can deduce that most Cloud Hopper samples statistically have compile timestamps between 7:14am and 5:03pm. By observing which time zones they fall under, we may be one step closer to understanding the operation times of the authors.
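
One simple way to explore this idea is to ask which UTC offset places the largest share of compile timestamps inside ordinary office hours. The sketch below does exactly that. The 9-to-5 window, the whole-hour offsets and the function names are assumptions made for illustration, and the result is only as trustworthy as the timestamps themselves.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_histogram(timestamps, utc_offset_hours=0):
    """Count compile timestamps per local hour for a given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return Counter(datetime.fromtimestamp(ts, tz=tz).hour for ts in timestamps)


def rank_utc_offsets(timestamps, start_hour=9, end_hour=17):
    """Rank UTC offsets by the share of samples compiled within work hours."""
    timestamps = list(timestamps)
    if not timestamps:
        return []
    scores = []
    for offset in range(-12, 15):  # UTC-12 to UTC+14, whole hours only
        hist = hour_histogram(timestamps, offset)
        in_hours = sum(n for hour, n in hist.items() if start_hour <= hour < end_hour)
        scores.append((in_hours / len(timestamps), offset))
    return sorted(scores, key=lambda score: score[0], reverse=True)
```

The top few entries of `rank_utc_offsets(compile_times)` give candidate time zones worth a closer look. Neighbouring offsets usually score similarly, so the ranking narrows the search rather than settling it.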

But be aware: even though PE timestamps are valuable forensic artifacts, they can also be forged with ease. Thus, they need to be corroborated with other PE metadata.

Debug Data — Hey, isn’t that the Malware Name?

Malware authors, being developers, test their malware before using it. Modern Integrated Development Environments (IDEs), such as Microsoft Visual Studio (VS), provide debugging functionality. If the debugging functionality is enabled during compilation, VS generates additional information called a program database (PDB) in the directory of the development environment. It also embeds the absolute path of the PDB in the PE file so that the debug information can be found during execution for testing purposes.

For example, after compiling a Hello World project (Hello.exe) in VS with debug functionality enabled, the generated PDB path contains the username, project directory, and project name, as illustrated below:

Breakdown of PDB path found in a PE file
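
A minimal sketch of how such a path can be pulled out of a file and broken down is shown here. It searches the raw bytes for a CodeView "RSDS" debug record (signature, 16-byte GUID, 4-byte age, then the NUL-terminated PDB path) instead of walking the debug directory properly, which is a shortcut taken for brevity. The function names and the `alice` username in the usage note are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# "RSDS" + 16-byte GUID + 4-byte age, followed by a printable path ending in .pdb
RSDS_RECORD = re.compile(
    rb"RSDS.{20}([\x20-\x7e]+?\.pdb)\x00", re.DOTALL | re.IGNORECASE
)


def extract_pdb_paths(data: bytes) -> list:
    """Return every PDB path found in a CodeView RSDS record."""
    return [m.group(1).decode("ascii") for m in RSDS_RECORD.finditer(data)]


def describe_pdb_path(path: str) -> dict:
    """Split a PDB path into the parts that tend to interest analysts."""
    p = PureWindowsPath(path)
    parts = p.parts
    lowered = [part.lower() for part in parts]
    username = None
    # The component after a profile folder is usually the account name.
    for marker in ("users", "documents and settings"):
        if marker in lowered:
            idx = lowered.index(marker)
            if idx + 1 < len(parts) - 1:
                username = parts[idx + 1]
            break
    return {
        "username": username,
        "directory": str(p.parent),
        "pdb_name": p.name,
        "project": p.stem,
    }
```

For a path such as `C:\Users\alice\source\repos\Hello\Debug\Hello.pdb`, `describe_pdb_path` would report `alice` as the username and `Hello` as the project name.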

Dates, malware names and version numbers may sometimes be visible from the PDB path. These may inspire the formulation of malware families or campaign names, which can then be used during the classification of related malware samples. “Operation Hangover” is one such example, where the name of the campaign, Hangover, was derived from the PDB path of its samples.

“HangOver” string and dates found in “Operation Hangover” malware samples

In some malware samples, unique keywords in the PDB paths like “kaam” or “BNaga”, as seen in the example below, provide quick indications of malware belonging to the same family. In this instance, these keywords reveal the malware author’s name and project directory. By identifying unique keywords or patterns indicative of a particular author, suspicious samples exhibiting similar patterns from a collection of malware samples can be surfaced.

Unique keywords identified in other malware samples related to “Operation Hangover”

PDB paths can be used for detecting and eradicating malware from the same family. One useful tool for hunting samples from the same family by their PDB paths is YARA, an open-source pattern-matching tool commonly used by malware researchers. Malware analysts create YARA rules, which are descriptions of malware families based on textual or binary patterns, to identify malware samples containing those patterns. For instance, the AV company McAfee Corp shared their “Operation HangOver” YARA rule, which is built from a series of PDB paths similar to the ones we just found! This is one example of using metadata to craft signatures that detect and eradicate a malware family.
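
To show the idea behind such a rule without reproducing the actual McAfee rule, here is a small Python sketch (not YARA) that surfaces samples whose bytes contain a PDB path with one of the keywords mentioned earlier. The function names and the sample dictionary are hypothetical, and a production hunt would normally be written as a YARA rule instead.

```python
import re


def build_pdb_hunter(keywords):
    """Compile a pattern for a printable run that contains a keyword and ends in .pdb."""
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(
        rb"[\x20-\x7e]*(?:" + alternation.encode("ascii") + rb")[\x20-\x7e]*\.pdb",
        re.IGNORECASE,
    )


def hunt(samples, keywords):
    """samples maps a sample name to its raw bytes; returns matching PDB paths."""
    pattern = build_pdb_hunter(keywords)
    hits = {}
    for name, data in samples.items():
        found = [m.group(0).decode("ascii") for m in pattern.finditer(data)]
        if found:
            hits[name] = found
    return hits
```

Calling `hunt(samples, ["kaam", "BNaga", "HangOver"])` over a collection returns only the samples whose embedded PDB paths mention one of those keywords, which is the same pivot a PDB-based YARA rule performs.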

Export Data — Exporting the evidence

Export data allows PE files to export existing functions to other programs. All the exported functions are stored in what is called an export table. It usually exists in DLLs but can also be found in some EXEs.

One common trait of malware is the attempt to blend into its targeted environment. This is to deter any unwanted attention from users, administrators, and security analysts. For instance, malware authors may design their malware to resemble DLLs released by Microsoft.

Malicious PE file (ChChes malware sample masquerading as Microsoft’s rcdll.dll)
Benign PE (Microsoft’s rcdll.dll)

The above depicts the export tables of a malware sample from the ChChes family used in the Cloud Hopper campaign and a benign rcdll.dll from Microsoft. By comparing the export tables, we can unmask the differences between the benign Microsoft PE file and the malware:

  1. The malware has more functions in the export table than Microsoft’s rcdll.dll, and
  2. Exported function names found in the malware overlap with those found in Microsoft’s rcdll.dll

But does it really have more functions than Microsoft’s rcdll.dll? If you look at the table closely, you will notice something suspicious. The exported functions of the malware sample point to only two unique locations, 0x10001A20 and 0x10001A80, which are referenced by nine different function names. This trait could be unique to the design of the malware, or it could be a mistake (possibly a leftover from debugging the exported functions) left behind by the author. Nevertheless, the names of the exported functions can be hashed with a hashing algorithm (e.g. MD5) to create an exphash, which can be used as an indicator to correlate malware samples.
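
Both observations can be sketched in Python once the export table has been read into a list of (name, address) pairs. The normalisation used for the hash here (lowercased names, joined with commas, in table order) is an assumption made for illustration, so hashes are only comparable between tools that follow the same convention. The export names in the usage note are made up.

```python
import hashlib
from collections import defaultdict


def exphash(export_names):
    """MD5 over the normalised list of exported function names."""
    normalised = ",".join(name.lower() for name in export_names)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def shared_addresses(exports):
    """Group export names by address and keep addresses used by several names."""
    by_address = defaultdict(list)
    for name, address in exports:
        by_address[address].append(name)
    return {addr: names for addr, names in by_address.items() if len(names) > 1}
```

With a hypothetical table such as `[("FuncA", 0x10001A20), ("FuncB", 0x10001A20), ("FuncC", 0x10001A80)]`, `shared_addresses` flags 0x10001A20 as being referenced by two names, and `exphash(["FuncA", "FuncB", "FuncC"])` yields a value that can be compared across samples.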

Resource Data — TODO or not to do

Have you ever wondered where the icon of the PE file comes from? The PE file has an independent resource segment stored in the resource directory, linking all required resource files for program execution.

Each resource holds three essential fields (Type, ID/Name and Language). Language, in particular, is useful in the process of malware attribution to find clues about malware authors’ build environments. By default, the Language field in the resource of a PE file, represented in code pages, is set based on the developer’s machine. For instance, if a PE file is developed on a machine that uses Chinese Windows with a Chinese keyboard layout, the code page representing Chinese will be inserted in the Language field. One such example is the PlugX malware samples from the Cloud Hopper campaign, where most of the resources’ language fields have code pages representing Chinese, as seen below:

Resource entries of a PlugX malware sample

Even though code pages for resources are set, they can still be arbitrarily changed. Thus, code pages alone should not be used as a singular indicator in malware attribution. With that being said, as long as supporting evidence is available to validate the hypothesis, it can be used to shape one’s understanding of the provenance of the malware.
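
As a small sketch of how this signal can be summarised per sample, the snippet below tallies the primary language of each resource. It assumes the Language values have already been extracted as Windows language identifiers (for example 0x0804 or 0x0409), whose low 10 bits encode the primary language. The lookup table is deliberately tiny and the function names are ours.

```python
from collections import Counter

# A few primary language IDs (low 10 bits of a Windows language identifier).
PRIMARY_LANGUAGES = {
    0x00: "Neutral",
    0x04: "Chinese",
    0x09: "English",
    0x12: "Korean",
    0x19: "Russian",
}


def primary_language(langid: int) -> str:
    primary = langid & 0x3FF
    return PRIMARY_LANGUAGES.get(primary, f"unknown (0x{primary:02x})")


def summarise_languages(resources):
    """resources is an iterable of (type, name_or_id, langid) tuples."""
    return Counter(primary_language(langid) for _, _, langid in resources)
```

A sample whose resources mostly resolve to one language, with a lone exception such as an English version resource, is the kind of pattern that is worth noting and then checking against other metadata.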

Another common resource, VersionInfo, can also be used as a pivot to find interesting samples and traits of a malware author. VersionInfo is used to represent information about the file such as “CompanyName”, “FileDescription”, “InternalName”, “OriginalFilename” etc. Malware samples can have plausible values as attempts to divert suspicions. However, frequent typos, incorrect copyright symbols, casing of names, and spelling mistakes can also be observed in malware samples. This can be seen in the sample below:

VersionInfo of a PlugX malware sample

Such mistakes or typos may reveal how the author customises and reviews the malware. And these can be used in conjunction with other parts of the resource, such as language, for malware detection and attribution.
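
One way to surface such near-miss values automatically is to compare each VersionInfo field against the values that legitimate software is known to use. The sketch below uses the standard library's `difflib` for the comparison. The 0.85 similarity threshold, the function name and the example strings are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def near_misses(version_info, known_values, threshold=0.85):
    """Flag fields that almost, but not exactly, match a known-good value.

    version_info maps field name -> value found in the sample.
    known_values maps field name -> list of legitimate values.
    """
    findings = []
    for field, value in version_info.items():
        expected_values = known_values.get(field, [])
        if value in expected_values:
            continue  # exact match, nothing suspicious about this field
        for expected in expected_values:
            ratio = SequenceMatcher(None, value, expected).ratio()
            if ratio >= threshold:
                findings.append((field, value, expected, round(ratio, 3)))
    return findings
```

For instance, a hypothetical "CompanyName" of "Microsoft Corpration" checked against "Microsoft Corporation" would be flagged, whereas an exact match or a completely unrelated string would not.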

Digital Signatures — A certified malware?

Digital signatures are like electronic “fingerprints” that are used to validate the authenticity and integrity of a file. As they are extremely difficult to forge or replicate, a valid digital signature within the PE file gives us assurance that the file has not been altered since it was signed, and that it is certified “APPROVED” by a software publisher whose certificate was issued by a well-known certificate authority. As a result, it is no surprise that malware authors are incentivised to find various means, such as “becoming a business” or stealing private keys, to sign their malware.

One of the reasons malware authors do so is to remove some of the security obstacles in their attacks. Fortunately for analysts, these certificates can be used to detect malware or assemble clues about the author, and in many cases both. The private keys that companies use to create digital signatures are usually stored in dedicated and well-protected modules, which makes them harder to steal. Once digital certificates are known to be compromised or revoked, PE files signed with them are no longer trusted. Analysts are sometimes able to group malware based on shared digital certificates, because the difficulty of sourcing new legitimate keys often leads malware authors to reuse the same compromised keys across their samples.
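
Grouping by shared certificate is straightforward once a certificate identifier has been extracted for each sample. The sketch below assumes that step has already been done and that each sample comes with a certificate thumbprint (or `None` if unsigned). The function names and the idea of a locally maintained list of known-bad thumbprints are illustrative assumptions.

```python
from collections import defaultdict


def group_by_certificate(samples):
    """samples is an iterable of (sample_id, thumbprint_or_None) pairs."""
    clusters = defaultdict(list)
    for sample_id, thumbprint in samples:
        if thumbprint:
            clusters[thumbprint.lower()].append(sample_id)
    return dict(clusters)


def flag_known_bad(clusters, known_bad_thumbprints):
    """Keep only clusters signed with a certificate known to be compromised or revoked."""
    known_bad = {thumbprint.lower() for thumbprint in known_bad_thumbprints}
    return {t: ids for t, ids in clusters.items() if t in known_bad}
```

Clusters containing several samples are candidates for a shared author or toolchain, subject to the caution that the same key may have passed through more than one pair of hands.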

While not commonly observed, private keys of legitimate software publishers are sometimes sold in underground markets. More than one threat actor could be associated with a digital signature, and therefore caution should be exercised to verify the uniqueness of this indicator to avoid attribution mishaps. In most cases, they still serve as a useful piece of evidence in malware detection.

Conclusion

Today, malware metadata remains one of the keys to helping analysts derive meaningful insights. At CSIT, in-house malware analysis systems are developed to harness malware metadata as an analytical pivot for clustering malware campaigns and families, which supports understanding malware evolution, crafting detection rules and correlating malware of the same families.

We hope that this series so far has given you a nice foray into this exciting adventure of malware metadata analysis.

Keen readers of the article will also recall that there is another type of PE metadata (design-related) mentioned at the beginning of the article. Design-related PE metadata can be leveraged to analyse the code design and functionalities of the malware. As design-related PE metadata is usually a by-product of the development process, it is much more difficult for authors to modify or craft without leaving traces.

We will be diving into design-related PE metadata traits, as well as exploring the application of machine learning techniques to malware metadata, in Part Two of this series: Judging a book by its cover: Analysing Malware Metadata at Scale.

Do look out for it!
