Detailed Guide of PE Structure for Reversers

Jasem Al-Sadi
Jun 30, 2020

One of the things that really confused me when I started learning reverse engineering was getting a solid understanding of how the portable executable (PE) is structured and why it’s structured that way. So I decided to be brave and jump into the MS PE specification hoping to get a good understanding, but I was wrong 😂

So, the guide below is (maybe :) a good starting point if you have started to get confused about PE structures.

Before we get to PE file structures, we need to rehash some concepts:

Virtual address (VA) vs relative virtual address (RVA):
Let’s assume that the image is loaded at this base address:
0x10000000
and the main function is located at:
0x10001000
Then:
VA = 0x10001000
RVA = VA - Base_address = 0x10001000 - 0x10000000 = 0x00001000
So the RVA is an offset in memory from the base address. It was invented to ease relocation: if the image is loaded at a different base address, the RVAs stay the same.
That’s why it is called an address relative to the virtual (base) address.
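
The arithmetic above can be written as two tiny helpers. This is only a sketch: the function names are mine, and it assumes a 32-bit image whose base address is already known.

```c
#include <stdint.h>

/* VA -> RVA: subtract the base address the image was loaded at. */
uint32_t va_to_rva(uint32_t va, uint32_t image_base)
{
    return va - image_base;
}

/* RVA -> VA: add the base address back. */
uint32_t rva_to_va(uint32_t rva, uint32_t image_base)
{
    return image_base + rva;
}
```

Calling va_to_rva(0x10001000, 0x10000000) gives the 0x00001000 from the example. If the same image is loaded at another base, rva_to_va gives the new address of main without the RVA changing.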

The RVA of a function or variable is NOT necessarily its offset from the beginning of the file.
Because sections are aligned differently in memory and on disk, the RVA of an item will almost always differ from its position within the file on disk.

PE file = MS-DOS MZ header + stub program + PE file signature + PE file header + the PE optional header + all of the section headers + all of the section bodies.

The PE file format has a set of common predefined sections (such as .text, .data, .rdata and .rsrc), but a developer can also define their own custom sections.

A PE file may contain a debug section, but the debug information can also be extracted into a separate debug file.

Most definitions of the PE file structures exist in the WINNT.h file.

Structure of a Portable Executable file image:

1. MS-DOS/Real-Mode Header

The main reason for keeping this structure intact at the beginning of the PE file format is backward compatibility. When you try to run a PE file on an OS that does not understand the format (such as MS-DOS), the OS can still read the MS-DOS header, treat the file as a valid DOS executable and run the stub program, which prints this message:
“This program cannot be run in DOS mode.”
and exits. If the header didn’t exist, the OS would fail to load the file and show a far less useful message:
“The name specified is not recognized as an internal or external command, operable program or batch file.”

DOS header Struct :
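
As a rough sketch, the layout of the DOS header can be mirrored with fixed-width types. The DosHeader type and read_dos_header function below are my own names, written by hand after IMAGE_DOS_HEADER from WINNT.h, and the code assumes a little-endian host.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DOS_MAGIC 0x5A4Du /* "MZ" read as a little-endian 16-bit value */

/* Hand-written mirror of IMAGE_DOS_HEADER (64 bytes). */
typedef struct {
    uint16_t e_magic;    /* magic number, "MZ" */
    uint16_t e_cblp;     /* bytes on last page of file */
    uint16_t e_cp;       /* pages in file */
    uint16_t e_crlc;     /* relocations */
    uint16_t e_cparhdr;  /* size of header in paragraphs */
    uint16_t e_minalloc; /* minimum extra paragraphs needed */
    uint16_t e_maxalloc; /* maximum extra paragraphs needed */
    uint16_t e_ss;       /* initial SS value */
    uint16_t e_sp;       /* initial SP value */
    uint16_t e_csum;     /* checksum */
    uint16_t e_ip;       /* initial IP value */
    uint16_t e_cs;       /* initial CS value */
    uint16_t e_lfarlc;   /* file address of relocation table */
    uint16_t e_ovno;     /* overlay number */
    uint16_t e_res[4];   /* reserved */
    uint16_t e_oemid;    /* OEM identifier */
    uint16_t e_oeminfo;  /* OEM information */
    uint16_t e_res2[10]; /* reserved */
    int32_t  e_lfanew;   /* file offset of the PE signature */
} DosHeader;

/* Copies the header out of buf; returns 1 when it starts with "MZ", else 0. */
int read_dos_header(const uint8_t *buf, size_t len, DosHeader *out)
{
    if (len < sizeof(DosHeader))
        return 0;
    memcpy(out, buf, sizeof(DosHeader)); /* memcpy avoids unaligned reads */
    return out->e_magic == DOS_MAGIC;
}
```

For a reverser, only two fields really matter here: e_magic to confirm the file is an executable, and e_lfanew to jump to the PE part.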

2. Real-Mode Stub Program

The stub is a real MS-DOS program that runs when the PE file is executed under MS-DOS, an OS that cannot load the PE part of the file. It acts as “backup code” that tells the user the program is not supported on their OS. By default it just prints the message:
“This program cannot be run in DOS mode”
But the developer can put any code and any message in the stub.

The developer can specify a different stub program by using the /STUB linker option.

3. PE Signature and File Header

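The PE file signature is located at base_address + e_lfanew (we will call this NTSIGNATURE). e_lfanew, the last field of the DOS header, points to the 4-byte PE signature, which is the start of the NT headers; the COFF file header follows immediately after it:

We can define it as a C macro:

Note: the “a” parameter is a pointer to the start of the PE file in memory.

The file signature identifies the executable format, and therefore the target OS family (for example NE for 16-bit Windows and OS/2, LE for OS/2 and VxD files, PE for Win32). The size of the signature differs between formats (2 bytes for NE and LE, 4 bytes for PE), but in all cases it sits right before the file header.

To determine the type of the file from the signature, code like the following can be used: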
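
Here is a minimal sketch of that check. It is not the code from the Microsoft sample: the FileKind enum and image_file_kind function are names I made up, and it assumes the whole file sits in a byte buffer on a little-endian host.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum FileKind { KIND_UNKNOWN, KIND_DOS, KIND_NE, KIND_LE, KIND_PE };

#define E_LFANEW_OFFSET 0x3C /* position of e_lfanew inside the DOS header */

/* Classify a file image by the signature found at e_lfanew. */
enum FileKind image_file_kind(const uint8_t *buf, size_t len)
{
    uint32_t e_lfanew;

    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return KIND_UNKNOWN;          /* no DOS header at all */

    memcpy(&e_lfanew, buf + E_LFANEW_OFFSET, sizeof e_lfanew);
    if (e_lfanew > len || len - e_lfanew < 4)
        return KIND_DOS;              /* nothing readable behind the stub */

    const uint8_t *sig = buf + e_lfanew;
    if (memcmp(sig, "PE\0\0", 4) == 0)
        return KIND_PE;               /* 4-byte signature */
    if (sig[0] == 'N' && sig[1] == 'E')
        return KIND_NE;               /* 2-byte signature */
    if (sig[0] == 'L' && sig[1] == 'E')
        return KIND_LE;               /* 2-byte signature */
    return KIND_DOS;                  /* plain MS-DOS executable */
}
```

The bounds checks matter when you parse untrusted samples: a malformed e_lfanew should never make you read outside the buffer.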

The PE file header is defined as a struct in the WINNT.h file:

PE file header struct
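
For reference, a hand-written mirror of that struct with fixed-width types could look like the sketch below. FileHeader, the two flag macros and is_dll are my own names; the field names and order follow IMAGE_FILE_HEADER.

```c
#include <stdint.h>

/* Hand-written mirror of IMAGE_FILE_HEADER (20 bytes). */
typedef struct {
    uint16_t Machine;              /* target CPU, e.g. 0x014c = x86, 0x8664 = x64 */
    uint16_t NumberOfSections;     /* number of section headers that follow */
    uint32_t TimeDateStamp;        /* link time, seconds since 1970-01-01 UTC */
    uint32_t PointerToSymbolTable; /* COFF symbol table offset, usually 0 in images */
    uint32_t NumberOfSymbols;      /* usually 0 in images */
    uint16_t SizeOfOptionalHeader; /* size of the optional header that follows */
    uint16_t Characteristics;      /* bit flags describing the file */
} FileHeader;

#define FILE_EXECUTABLE_IMAGE 0x0002u /* IMAGE_FILE_EXECUTABLE_IMAGE */
#define FILE_DLL              0x2000u /* IMAGE_FILE_DLL */

/* Returns 1 when the Characteristics field marks the file as a DLL. */
int is_dll(const FileHeader *fh)
{
    return (fh->Characteristics & FILE_DLL) != 0;
}
```

Characteristics is a bit field, so you always test it with a mask rather than comparing the whole value.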

You can see an example of the Characteristics flags in tools like CFF Explorer:

4. PE Optional Header:

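Its size is 224 bytes for 32-bit (PE32) images and 240 bytes for 64-bit (PE32+) images, and it comes directly after the file header:
#define OPTHDROFFSET(a) ((LPVOID)((BYTE *)(a) + \
((PIMAGE_DOS_HEADER)(a))->e_lfanew + SIZE_OF_NT_SIGNATURE + \
sizeof (IMAGE_FILE_HEADER)))
// SIZE_OF_NT_SIGNATURE is the 4-byte size of the PE signature
// IMAGE_FILE_HEADER is the struct defined above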

The optional header contains the most important information, such as the entry point, preferred base address, section alignment and so forth.
It’s a long one, so we will discuss only the important fields. The structure is divided into two groups of fields:
1. Standard fields.
These are the fields common to the Common Object File Format (COFF), which most UNIX executable files use.
2. Windows NT additional fields.
These fields were added to the Windows NT PE file format to provide loader support for much of the Windows NT-specific process behavior.

I detailed every element of the PE optional header struct:

Subsystem field possible targets :

Data Directories: think of them as pointers to tables or strings that Windows uses. Each directory entry is a small struct that points you to the starting point of the table/string:

They are loaded into memory and used at run time.
There are 16 possible data directories:
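
As a sketch, the entry struct and the 16 indexes can be written like this. The DataDirectory type, the DIR_* names and has_directory are my own shorthand; the index order follows the IMAGE_DIRECTORY_ENTRY_* constants in WINNT.h.

```c
#include <stdint.h>

/* Mirror of IMAGE_DATA_DIRECTORY (8 bytes). */
typedef struct {
    uint32_t VirtualAddress; /* RVA of the table (a file offset for the security entry) */
    uint32_t Size;           /* size of the table in bytes */
} DataDirectory;

enum {
    DIR_EXPORT = 0,        DIR_IMPORT = 1,        DIR_RESOURCE = 2,
    DIR_EXCEPTION = 3,     DIR_SECURITY = 4,      DIR_BASERELOC = 5,
    DIR_DEBUG = 6,         DIR_ARCHITECTURE = 7,  DIR_GLOBALPTR = 8,
    DIR_TLS = 9,           DIR_LOAD_CONFIG = 10,  DIR_BOUND_IMPORT = 11,
    DIR_IAT = 12,          DIR_DELAY_IMPORT = 13, DIR_COM_DESCRIPTOR = 14,
    DIR_RESERVED = 15,     DIR_COUNT = 16
};

/* A directory is present when both its address and size are non-zero. */
int has_directory(const DataDirectory dirs[DIR_COUNT], int index)
{
    return index >= 0 && index < DIR_COUNT
        && dirs[index].VirtualAddress != 0
        && dirs[index].Size != 0;
}
```

Most files fill in only a handful of these entries; the unused ones are simply left as zero.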

Can we get the file location of a certain data directory directly from the VirtualAddress in the above struct?
No. The VirtualAddress is an RVA, so it is relative to the base address of the image once it is loaded in memory, and the layout in memory differs from the layout on disk. The data the directory points to also lives inside one of the sections. To illustrate:

So the virtual address in the data directory struct has to be turned into an offset inside the body of the section that holds it.
Note: the image base in our case is 0, since the file is not loaded into memory and we are inspecting it on disk.
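
The usual way to do that conversion is to find the section that contains the RVA and rebase it onto that section's file offset. Below is a small sketch; SectionRange and rva_to_file_offset are hypothetical names, and the struct only carries the three section header fields the conversion needs.

```c
#include <stdint.h>
#include <stddef.h>

/* The section header fields needed for the conversion. */
typedef struct {
    uint32_t VirtualAddress;   /* RVA where the section starts in memory */
    uint32_t SizeOfRawData;    /* size of the section body on disk */
    uint32_t PointerToRawData; /* file offset of the section body */
} SectionRange;

/* Returns the file offset for rva, or -1 when no section body on disk holds it. */
int64_t rva_to_file_offset(uint32_t rva, const SectionRange *secs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t start = secs[i].VirtualAddress;
        if (rva >= start && rva - start < secs[i].SizeOfRawData)
            return (int64_t)secs[i].PointerToRawData + (rva - start);
    }
    return -1;
}
```

An RVA that falls in a part of a section with no data on disk (for example uninitialized data) has no file offset, which is why the function can return -1.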

5. PE File Sections

A PE file consists of many sections, each section = section header + section body. The bodies hold the code, data, resources and other executable information.
1. Section header:
The section headers are located right after the optional header. Each one is 40 bytes, with no padding between them, and they are laid out sequentially. Example:

The section header is a struct defined in the WINNT.h file:

Characteristics important values :
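
A few of the flags you will meet most often are sketched below. The values match the IMAGE_SCN_* constants in WINNT.h; the shortened macro names and the two helper functions are mine.

```c
#include <stdint.h>

#define SCN_CNT_CODE               0x00000020u /* section contains code */
#define SCN_CNT_INITIALIZED_DATA   0x00000040u /* section contains initialized data */
#define SCN_CNT_UNINITIALIZED_DATA 0x00000080u /* section contains uninitialized data */
#define SCN_MEM_EXECUTE            0x20000000u /* section can be executed */
#define SCN_MEM_READ               0x40000000u /* section can be read */
#define SCN_MEM_WRITE              0x80000000u /* section can be written */

/* Fill out[4] with an "rwx"-style permission string. */
void section_perms(uint32_t characteristics, char out[4])
{
    out[0] = (characteristics & SCN_MEM_READ)    ? 'r' : '-';
    out[1] = (characteristics & SCN_MEM_WRITE)   ? 'w' : '-';
    out[2] = (characteristics & SCN_MEM_EXECUTE) ? 'x' : '-';
    out[3] = '\0';
}

/* Returns 1 when a section is both writable and executable. */
int is_writable_and_executable(uint32_t characteristics)
{
    uint32_t wx = SCN_MEM_WRITE | SCN_MEM_EXECUTE;
    return (characteristics & wx) == wx;
}
```

A typical .text section has the value 0x60000020 (code, readable, executable). A section that is writable and executable at the same time is worth a closer look, since packers often need it.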

How do you go about getting the section header information for a particular section?
Since the section headers are stored sequentially, we can loop over them and compare names. Section names are usually unique, although the format does not enforce it.
The following function can be used:
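
Here is one way such a lookup could be sketched. SectionHeader is a hand-written mirror of IMAGE_SECTION_HEADER and find_section is my own helper, not the function from the Microsoft sample; it assumes the headers were already read into an array.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hand-written mirror of IMAGE_SECTION_HEADER (40 bytes). */
typedef struct {
    uint8_t  Name[8];              /* NUL-padded; no terminator when 8 chars long */
    uint32_t VirtualSize;          /* size of the section in memory */
    uint32_t VirtualAddress;       /* RVA of the section in memory */
    uint32_t SizeOfRawData;        /* size of the section body on disk */
    uint32_t PointerToRawData;     /* file offset of the section body */
    uint32_t PointerToRelocations;
    uint32_t PointerToLinenumbers;
    uint16_t NumberOfRelocations;
    uint16_t NumberOfLinenumbers;
    uint32_t Characteristics;      /* content and permission flags */
} SectionHeader;

/* Returns the first header whose name matches, or NULL when none does. */
const SectionHeader *find_section(const SectionHeader *secs, size_t count,
                                  const char *name)
{
    uint8_t padded[8] = {0};
    size_t n = strlen(name);

    if (n > sizeof padded)
        return NULL;               /* section names hold at most 8 bytes */
    memcpy(padded, name, n);       /* compare all 8 bytes, NUL padding included */

    for (size_t i = 0; i < count; i++)
        if (memcmp(secs[i].Name, padded, sizeof padded) == 0)
            return &secs[i];
    return NULL;
}
```

Comparing the full 8 bytes instead of using strcmp avoids reading past a name that uses all 8 characters and therefore has no terminator.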

2. Section body
To define a new custom section, you can use the compiler’s directives for naming code and data segments, or the name-segment compiler option -NT.
A) .text section
It contains the executable code. The Windows NT compiler combines all the code segments into a single section, “.text”, because that makes the code base easier for the memory manager to manage.
Note: malware authors can put code in any section and later jump to it. Execution starts at the entry point, which normally lies in the .text section, but nothing prevents a developer from jumping to another section that contains executable code.

In files built by older toolchains, the .text section starts with the Import Address Table (IAT); modern linkers usually place it in .rdata or .idata instead. The IAT is just an array of address slots for the imported functions (typically Windows API functions) that the code in .text calls through.


The loader overwrites these slots with the addresses at which the corresponding library functions are actually loaded in memory, because a library might be loaded at a different address than the one assumed at link time.
In that layout, the entry point of the executable code comes right after the IAT.

So how does the loader find the IAT?
The optional header contains a field called “AddressOfEntryPoint”, which gives the RVA of the entry point of the executable code. According to the older Microsoft documentation, the IAT occurs immediately before the entry point and every entry in it has the same size, so the loader can walk backward from the entry point to find the start of the table.
In current PE files this is not something to rely on: the loader locates the IAT through the FirstThunk field of each import descriptor (covered in the import section below), and the data directory also has a dedicated IAT entry.

B) .bss section

The name historically stands for “Block Started by Symbol” (“Better Save Space” is a common mnemonic). It contains all uninitialized data, e.g. uninitialized static and global variables. Not all data items need to have values before the program begins running. When you’re reading data from a disk file, for example, you need a place for the data to go after it comes in from disk. So .bss only records the size (in bytes) needed for the uninitialized variables and takes no space in the file. The main objective of this strategy is to reduce the binary size on disk and allow a faster loading time.
For example, let’s see this code:

Variable “a” will be in the .data section while “b” will be in the .bss section. Once the program is loaded, the distinction becomes immaterial. At run time, b occupies 20 * sizeof(int) bytes.
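
A snippet matching that description could be as simple as the two globals below. The value 5 is just an example I picked, and where each variable actually lands depends on the compiler and linker settings.

```c
int a = 5;  /* initialized: the value 5 must be stored in the file, so it goes to .data */
int b[20];  /* uninitialized: only its size is recorded, so it goes to .bss */
```

On disk the file has to carry the bytes for a, while for b it only needs to reserve space that the loader zero-fills.
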
C) .rdata section :

Read-only data, such as literal strings, constants, and debug directory information.

D) .data section :

Initialized global and static variables of the application or module.

E) .rsrc Resource section :

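It contains resources such as images, videos, icons and strings.
What is the difference between the strings stored in the resource section and the strings stored in the .rdata section?
Strings can be located either in the resources or in the read-only data section of a PE file; the most common place is the .rdata section. Literal strings used directly by the code end up in .rdata, while strings in .rsrc are string-table resources that the program loads at run time through the resource API, which makes them easy to localize.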

The resource directory is stored as a tree, where each branch leads to one resource. You can think of it like the Linux file system, where there is a root directory
that contains subdirectories, and each subdirectory might contain further subdirectories or raw data (files).
To illustrate more, let’s look at this resource example of a PE file:

So if we take it as an analogy:
1. The root node of the image above corresponds to the root directory in the file system. The root node follows the IMAGE_RESOURCE_DIRECTORY structure.
2. The children come in two types:

a. Raw data, or a file in file-system terminology ===> corresponds to the raw data (pink boxes), called a “data entry” in PE terminology. It follows the _IMAGE_RESOURCE_DATA_ENTRY struct. These are the leaves of the tree.
The _IMAGE_RESOURCE_DATA_ENTRY struct:

typedef struct _IMAGE_RESOURCE_DATA_ENTRY {
ULONG OffsetToData;
ULONG Size;
ULONG CodePage;
ULONG Reserved;
} IMAGE_RESOURCE_DATA_ENTRY, *PIMAGE_RESOURCE_DATA_ENTRY;

b. Subdirectory (file-system terminology) ===> corresponds to another _IMAGE_RESOURCE_DIRECTORY structure in the resource section. The job of this struct is to describe its child resources.
Sometimes a resource contains multiple sub-resources, just like a directory in Linux might have multiple subdirectories.

The struct definition:

typedef struct _IMAGE_RESOURCE_DIRECTORY {
ULONG Characteristics;
ULONG TimeDateStamp;
USHORT MajorVersion;
USHORT MinorVersion;
USHORT NumberOfNamedEntries;
USHORT NumberOfIdEntries;
} IMAGE_RESOURCE_DIRECTORY, *PIMAGE_RESOURCE_DIRECTORY;

You’ll notice it doesn’t have a pointer to any child node. That is because an array of _IMAGE_RESOURCE_DIRECTORY_ENTRY immediately follows the struct:

IMAGE_RESOURCE_DIRECTORY_ENTRY DirectoryEntries[]
So to illustrate more, the picture shows how Microsoft lays them out in the PE file on disk:

So, the directory struct sits at the start of the resource section, and its entries begin sizeof(_IMAGE_RESOURCE_DIRECTORY) bytes after it.
The _IMAGE_RESOURCE_DIRECTORY_ENTRY contains the name/ID of the resource and an offset to either a child _IMAGE_RESOURCE_DIRECTORY or a data entry.

typedef struct _IMAGE_RESOURCE_DIRECTORY_ENTRY {
ULONG Name; // Either an integer ID or an offset to a string name.
/*
If the high bit is zero, this field is interpreted as an integer ID. If the high bit is nonzero, the lower 31 bits are an offset (relative to the start of the resources) to an IMAGE_RESOURCE_DIR_STRING_U structure that holds the name.
*/
ULONG OffsetToData; // Offset (relative to the start of the resources) to the child node. High bit set: the child is another IMAGE_RESOURCE_DIRECTORY. High bit clear: the child is an IMAGE_RESOURCE_DATA_ENTRY.
} IMAGE_RESOURCE_DIRECTORY_ENTRY, *PIMAGE_RESOURCE_DIRECTORY_ENTRY;
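
The high-bit rules for both fields are easy to get wrong, so here is a small sketch of how they can be decoded. The ResourceDirEntry type and the helper names are my own.

```c
#include <stdint.h>

#define RES_HIGH_BIT 0x80000000u

/* Same two fields as IMAGE_RESOURCE_DIRECTORY_ENTRY (8 bytes). */
typedef struct {
    uint32_t Name;
    uint32_t OffsetToData;
} ResourceDirEntry;

/* High bit of Name set: the entry is identified by a string, not an ID. */
int entry_is_named(const ResourceDirEntry *e)
{
    return (e->Name & RES_HIGH_BIT) != 0;
}

/* Offset of the name string from the start of the resources (named entries only). */
uint32_t entry_name_offset(const ResourceDirEntry *e)
{
    return e->Name & ~RES_HIGH_BIT;
}

/* High bit of OffsetToData set: the child is another directory, not a leaf. */
int entry_is_directory(const ResourceDirEntry *e)
{
    return (e->OffsetToData & RES_HIGH_BIT) != 0;
}

/* Offset of the child node from the start of the resources. */
uint32_t entry_child_offset(const ResourceDirEntry *e)
{
    return e->OffsetToData & ~RES_HIGH_BIT;
}
```

Both offsets are relative to the start of the resource data, not RVAs, so you add them to the address of the root directory.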

What is the difference between _IMAGE_RESOURCE_DIRECTORY_ENTRY and _IMAGE_RESOURCE_DIRECTORY?
_IMAGE_RESOURCE_DIRECTORY: think of it as a container that holds an array of _IMAGE_RESOURCE_DIRECTORY_ENTRY.
_IMAGE_RESOURCE_DIRECTORY_ENTRY: contains the metadata of one child (e.g. type, ID or language, depending on the level) and the offset of the child node.
To illustrate it more, let’s look at this _IMAGE_RESOURCE_DIRECTORY example:

You can notice that it’s the root node because the ID equals zero. It also has three _IMAGE_RESOURCE_DIRECTORY_ENTRY elements, each pointing to a child _IMAGE_RESOURCE_DIRECTORY.

Here is another one, where the directory entry points to raw data:

You can see that the child contains a pointer to the raw heart image.

To get the big picture, look at this example that contains a resource section:

You can see that the resource root node has 3 subnodes == 3 resources.

Let’s look at the depths/levels of the tree:

Level 1 is an _IMAGE_RESOURCE_DIRECTORY struct that contains two entries. Each entry follows the _IMAGE_RESOURCE_DIRECTORY_ENTRY struct; it defines the resource type and points to another _IMAGE_RESOURCE_DIRECTORY.

The resource types are defined in the Windows headers (winuser.h):

/*
* Predefined Resource Types
*/
#define RT_CURSOR MAKEINTRESOURCE(1)
#define RT_BITMAP MAKEINTRESOURCE(2)
#define RT_ICON MAKEINTRESOURCE(3)
#define RT_MENU MAKEINTRESOURCE(4)
#define RT_DIALOG MAKEINTRESOURCE(5)
#define RT_STRING MAKEINTRESOURCE(6) // represent string in unicode or ascii hex format.
#define RT_FONTDIR MAKEINTRESOURCE(7)
#define RT_FONT MAKEINTRESOURCE(8)
#define RT_ACCELERATOR MAKEINTRESOURCE(9)
#define RT_RCDATA MAKEINTRESOURCE(10)
#define RT_MESSAGETABLE MAKEINTRESOURCE(11)

Level 2 is also an _IMAGE_RESOURCE_DIRECTORY struct; its entries represent the identifier of the resource. The identifier can be:
1. A pointer to an IMAGE_RESOURCE_DIR_STRING_U struct: the left-hand side of the tree uses the “Name” field as an offset to another object, which follows IMAGE_RESOURCE_DIR_STRING_U.
2. An integer identifier: the right-hand side of the tree uses the “Name” field as an integer ID.


Level 3: defines the language of the resource, using a hexadecimal language ID. The “OffsetToData” field of these entries points to the data entry that describes the actual raw resource data.
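
The language ID packs two values into 16 bits: the primary language in the low 10 bits and the sublanguage in the upper 6 bits. A tiny sketch of the split (the function names are mine; Windows exposes the same idea through the PRIMARYLANGID and SUBLANGID macros):

```c
#include <stdint.h>

/* Low 10 bits: the primary language, e.g. 0x09 for English. */
uint16_t primary_lang(uint16_t langid)
{
    return langid & 0x3FF;
}

/* Upper 6 bits: the sublanguage, e.g. 0x01 for the US variant of English. */
uint16_t sub_lang(uint16_t langid)
{
    return langid >> 10;
}
```

So the very common ID 0x0409 splits into primary language 0x09 and sublanguage 0x01, which is English (United States).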

Leaves: just an _IMAGE_RESOURCE_DATA_ENTRY struct.

Why is the .rsrc section so complex?
To gain flexibility and to save time when gathering the metadata of a single resource without digging into the actual resource data.
For example, you can very quickly gather all resources for a certain language.

F) .edata Export section :
It contains the exported functions/data of the application or the DLL. It starts with a directory structure that describes the export tables. The struct is defined in the WINNT.h file:

typedef struct _IMAGE_EXPORT_DIRECTORY {
ULONG Characteristics; // unused
ULONG TimeDateStamp; // the time the table was created
USHORT MajorVersion; // often 0
USHORT MinorVersion; // often 0
ULONG Name; // RVA of the name of the executable module
ULONG Base; // starting ordinal number of the exports, usually 1
ULONG NumberOfFunctions; // Total number of exported functions, whether called by name or by ordinal.
ULONG NumberOfNames; // Number of exported functions that have a name.
ULONG AddressOfFunctions; // RVA of an array of RVAs of the exported function entry points
ULONG AddressOfNames; // RVA of an array of RVAs, each pointing to a null-terminated export function name
ULONG AddressOfNameOrdinals; // RVA of an array of 16-bit indexes into the AddressOfFunctions array, parallel to AddressOfNames

/*
The three AddressOf... fields are relative virtual addresses into the address space of a process once the module has been loaded
*/
} IMAGE_EXPORT_DIRECTORY, *PIMAGE_EXPORT_DIRECTORY;

So, to give a better idea of the exported function layout in the PE file, let’s assume that we have a DLL that has 10 exported functions:

We can see that it has 10 exported functions, but only 5 of them are named. But why?
Because functions can be called either by name or by ordinal (integer).
We can see that the AddressOfNameOrdinals field points to a list of numbers, each one corresponding to the entry at the same index in the AddressOfNames list. Each number is then mapped into the AddressOfFunctions list.
Note: Remember that nothing in the AddressOfNames array itself indicates which name links to which address. We drew the arrows just to better illustrate the example.
But how do we know which name belongs to which address?
By using the ordinal values. Each entry in the ordinals list sits at the same index as its name in the names list, and the value of that entry is the index of the function’s address in the AddressOfFunctions list.
For example, we can see that the ordinal value “8” corresponds to addr8 in the AddressOfFunctions list.
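
The lookup described above can be sketched in a few lines. ExportView is a hypothetical, already-parsed view of the three arrays (in a real file the names are RVAs that you resolve first), and the linear scan is a simplification.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simplified, already-parsed view of the export arrays (hypothetical type). */
typedef struct {
    const uint32_t *functions;     /* AddressOfFunctions: one RVA per export */
    const char *const *names;      /* AddressOfNames, resolved to C strings */
    const uint16_t *name_ordinals; /* AddressOfNameOrdinals, parallel to names */
    uint32_t number_of_functions;
    uint32_t number_of_names;
} ExportView;

/* Returns the RVA of the named export, or 0 when the name is not exported. */
uint32_t export_rva_by_name(const ExportView *ex, const char *name)
{
    for (uint32_t i = 0; i < ex->number_of_names; i++) {
        if (strcmp(ex->names[i], name) == 0) {
            uint16_t index = ex->name_ordinals[i]; /* same index as the name */
            if (index < ex->number_of_functions)
                return ex->functions[index];
            return 0;
        }
    }
    return 0;
}

/* Lookup by ordinal: the public ordinal is the array index plus Base. */
uint32_t export_rva_by_ordinal(const ExportView *ex, uint32_t base,
                               uint32_t ordinal)
{
    if (ordinal < base || ordinal - base >= ex->number_of_functions)
        return 0;
    return ex->functions[ordinal - base];
}
```

Note the difference between the two paths: a name goes through the ordinals array first, while an ordinal indexes the function array directly after subtracting Base.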

G) .idata Import section == import table
NOTE: the import table is different from the import address table; the import table is the container of the import address table and other structures.
Most Microsoft code is shipped as DLLs, which are just libraries with public functions that anyone can use.
In order to call a certain function from a certain DLL, I need to know the address of the function. The issue is that the developer of the DLL might rename the function or update the DLL code, which moves the function to a different address.
So relying on static global addresses for every function in every DLL written in the world is a bad idea, because everything changes.
The IDT (Import Directory Table) is the table that contains the IMAGE_IMPORT_DESCRIPTOR entries.
So Microsoft invented a structure called “_IMAGE_IMPORT_DESCRIPTOR”, which defines which functions I import from a certain DLL. The Windows loader reads it and then does the necessary steps to load the DLL into memory, or just grabs the addresses if the DLL is already loaded.

But let’s see the big picture of the _IMAGE_IMPORT_DESCRIPTOR structure:

Hint names table == Import names table == Import lookup table
The OriginalFirstThunk field points to the Import Name Table (INT, also called the Hint Name Table, HNT) and the FirstThunk field points to the Import Address Table (IAT).
So the .idata section starts with an array of _IMAGE_IMPORT_DESCRIPTOR structs,
each of which describes the functions imported from a certain DLL.
There is no length field to count the array elements. Instead, the array is terminated by an empty (all-zero) _IMAGE_IMPORT_DESCRIPTOR structure.

To be more precise, the struct:
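
As a sketch, the descriptor can be mirrored with fixed-width fields, together with the "walk until the empty entry" loop described above. ImportDescriptor and count_import_descriptors are my own names, and the max parameter is an extra safety bound I added for malformed files.

```c
#include <stdint.h>
#include <stddef.h>

/* Hand-written mirror of IMAGE_IMPORT_DESCRIPTOR (20 bytes). */
typedef struct {
    uint32_t OriginalFirstThunk; /* RVA of the Import Name Table (INT) */
    uint32_t TimeDateStamp;      /* 0 unless the imports are bound */
    uint32_t ForwarderChain;     /* index of the first forwarder reference */
    uint32_t Name;               /* RVA of the DLL name (NUL-terminated ASCII) */
    uint32_t FirstThunk;         /* RVA of the Import Address Table (IAT) */
} ImportDescriptor;

/* Returns 1 when every field is zero, which marks the end of the array. */
int is_null_descriptor(const ImportDescriptor *d)
{
    return d->OriginalFirstThunk == 0 && d->TimeDateStamp == 0
        && d->ForwarderChain == 0 && d->Name == 0 && d->FirstThunk == 0;
}

/* Counts descriptors up to the terminator, never reading more than max. */
size_t count_import_descriptors(const ImportDescriptor *d, size_t max)
{
    size_t n = 0;
    while (n < max && !is_null_descriptor(&d[n]))
        n++;
    return n;
}
```

The count you get is the number of DLLs the file imports from, since there is one descriptor per DLL.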

Each entry of the IAT and INT follows this struct:
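
In a 32-bit file each of those entries is a single 32-bit value whose meaning depends on its high bit. The sketch below shows how to read one; Thunk32 and the helper names are mine, and it covers PE32 only (PE32+ uses 64-bit entries with the flag in bit 63).

```c
#include <stdint.h>

#define ORDINAL_FLAG32 0x80000000u /* same value as IMAGE_ORDINAL_FLAG32 */

/* One INT/IAT entry of a PE32 file, before the loader touches it. */
typedef uint32_t Thunk32;

/* High bit set: the function is imported by ordinal, not by name. */
int thunk_is_ordinal(Thunk32 t)
{
    return (t & ORDINAL_FLAG32) != 0;
}

/* The ordinal lives in the low 16 bits (ordinal imports only). */
uint16_t thunk_ordinal(Thunk32 t)
{
    return (uint16_t)(t & 0xFFFFu);
}

/* Otherwise the low 31 bits are the RVA of an IMAGE_IMPORT_BY_NAME struct. */
uint32_t thunk_name_rva(Thunk32 t)
{
    return t & 0x7FFFFFFFu;
}
```

Like the descriptor array, each thunk array has no length field and ends with a zero entry.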

On disk, before the file is loaded, these two tables (HNT and IAT) are exactly the same.
But why do we need two tables?
Because the HNT is a fixed table that is never modified, while the IAT is overwritten by the PE loader when your PE file is loaded into memory.
So when you have an instruction like this:
JMP DWORD PTR [XXXXXXXX]
let’s assume that it calls a function in another DLL. The [XXXXXXXX] part points to a slot in the import address table, because that is the table that holds the real address of the function.
So at run time the code never uses the HNT; it only remains as the unmodified record of which names and ordinals were imported.
So before the PE loader overwrites the IAT, the structure looks like this:

Both the INT and the IAT entries point to the same IMAGE_IMPORT_BY_NAME struct, which contains only two fields: a hint (an index into the exporting DLL’s name table that speeds up the lookup) and the name of the function.
But after the PE loader overwrites the IAT, they become different in terms of values:

Note: the IAT and INT keep the same order, e.g. look at LoadLibraryA: in both tables it is at index 2. So let’s see the bigger picture from a developer’s perspective:

Note: ILT == Import Lookup Table == just another name for the INT == Import Name Array.
Note: Each imported DLL has its own IMAGE_IMPORT_DESCRIPTOR. Each function imported from a certain DLL has an entry in both the IAT and the ILT.
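
To make the "two tables" idea concrete, here is a toy model of what the loader does for one DLL. Everything in it is made up for illustration: toy_resolve stands in for the real export lookup, the addresses are invented, and the INT is simplified to an array of names.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the loader's export lookup (invented addresses). */
uint32_t toy_resolve(const char *name)
{
    if (strcmp(name, "LoadLibraryA") == 0)
        return 0x75001000u;
    if (strcmp(name, "GetProcAddress") == 0)
        return 0x75002000u;
    return 0; /* unknown import */
}

/* Walk the INT (read only) and overwrite the IAT slot at the same index. */
void bind_imports(const char *const *int_names, uint32_t *iat, size_t count)
{
    for (size_t i = 0; i < count; i++)
        iat[i] = toy_resolve(int_names[i]);
}
```

After bind_imports runs, the names table is untouched while every IAT slot holds a real address, which is exactly the before/after difference between the two tables.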

Conclusions :

Well, going through the PE file structure is never an easy task 😂. Hopefully, this will serve you as a good reference when you get lost in the details of the PE file. If you have any comments, additions or suggestions to improve it, let me know and I will add you as a collaborator to modify the article.


Jasem Al-Sadi

Computer engineer. Interested in Reverse Engineering, Exploit Development and Penetration testing.