User mode vs. Kernel mode

Sagi Dana
18 min read · Dec 12, 2017


One of the most interesting and commonly used concepts in the x86 architecture is Protected mode and its support for four privilege levels (aka rings).

It was a challenging idea to grasp and I’ll try to explain it as clearly as possible in this post. We’ll cover the following concepts:

  • GDT, LDT, IDT.
  • Virtual memory translation.
  • ASLR and Kernel ASLR (KASLR).

Let’s start with the basics: any computer has at least (hopefully) the following components: CPU, disk, and RAM. Each of these components holds a key role in the flow of the system. The CPU executes the commands and operations on the memory (RAM), the RAM holds the data we’re using and enables fast and reliable access to it, and the disk holds the persistent data that needs to exist even after a reboot or shutdown. I start here because, even though this is very basic, it’s important to keep in mind; as you read through this article, ask yourself which component we are talking about at any given moment.

The operating system is the software that orchestrates it all, and also the one that provides a fast, convenient, consistent, and efficient interface to access all of its capabilities: some of them give access to the hardware, and others enhance convenience and performance.

Like any good software, the OS works in layers. The kernel is the first layer and, in my opinion, the most important one. To understand the importance of the kernel, we first need to understand its actions and the challenges it faces, so let’s look at some of its responsibilities:

  • Handle system calls (the very same interface we talked about).
  • Allocate resources (RAM, CPU, and much more) to the processes/threads in hand.
  • Secure the operations performed.
  • Intermediate between the hardware and the software.

Many of these actions are performed with the generous help of the processor. In the case of x86, Protected mode is the mode that enables us to limit the power (instruction set) of the currently running execution context.

Let’s assume we have two worlds: the user’s world and the supervisor’s world. At any given time you can only be in one of them. When you are in the user’s world, you see the world as the supervisor wants you to see it. Let’s see what I mean by that:

Let’s say you are a process. A process is a container of one or more threads. A thread is an execution context; it is the logical unit in which machine instructions are executed. That means that when a thread reads from, say, memory address 0x80808080, it is actually referencing the virtual address 0x80808080 of the current process. As you can guess, the content of that address will differ between two processes. Now, the virtual address space is at the process level, which means all the threads of the same process share the same address space and can access the same virtual memory. To give an example of a resource that lives at the thread level, let’s use the famous stack.

So I have a thread that executes the following code:

Our thread executes the main function, which calls our “func” function. Let’s say we break the thread inside “func”, right after variable_b is assigned; the stack layout will be as follows:

  1. variable_a.
  2. parameter.
  3. return address (the address of the instruction in main right after the call to func).
  4. variable_b.

To illustrate:

In the given code, we create 3 threads in our process, and each of them prints its id, stack segment, and stack pointer.

A possible output of that program is:

As you can see, all the threads have the same stack segment because they share the same virtual address space. The stack pointer of each one is different because each has its own stack to store its values in.

A side note about the stack segment: I will explain segment registers in more detail in the GDT/LDT section; for now, take my word for it.

So why is this important? At any given time, the processor can freeze a thread and give control to any other thread it wants. The scheduler, a part of the kernel, is what allocates the CPU to the currently existing (and “ready”) threads. In order for threads to run reliably and efficiently, it is essential that each has its own stack in which it can save its relevant values (local variables and return addresses, for example).

To manage its threads, the operating system keeps a special structure for each thread called the TCB (Thread Control Block). In that structure it saves, among other things, the context of the thread and its state (running / ready / etc.). The context contains, again among other things, the CPU register values:

  • EBP -> Base pointer of the current stack frame; each function uses this address as the base from which it offsets to access local variables and parameters.
  • ESP -> The pointer to the last value pushed (first to pop) on the stack.
  • General purpose registers -> EAX, EBX, etc.
  • Flags register.
  • CR3 -> Contains the location of the page directory (will be discussed later).
  • EIP -> The address of the next instruction to be executed.
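To make the TCB concrete, here is a sketch of such a structure in C (the field names are illustrative, not taken from any real OS):

```c
#include <stdint.h>

enum thread_state { READY, RUNNING, BLOCKED };

/* A minimal Thread Control Block: the saved CPU context plus
   the scheduling state. */
struct tcb {
    /* context saved on a switch, restored when the thread runs again */
    uint32_t eip, esp, ebp, eflags;
    uint32_t eax, ebx, ecx, edx, esi, edi;
    uint32_t cr3;                /* page directory of the owning process */
    enum thread_state state;     /* running / ready / etc. */
    struct tcb *next;            /* link in the scheduler's ready queue */
};
```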

Besides threads, the operating system needs to keep track of a lot of other things, including processes. For each process the OS saves a PCB (Process Control Block) structure. We said that each process has an isolated address space; for now, let’s assume there is a table that maps each virtual address to a physical one, and that this table is saved in the PCB. The OS is responsible for keeping that table updated to reflect the correct state of physical memory. Every time the scheduler switches execution to a given thread, the table saved for that thread’s owning process is applied to the CPU so it will be able to translate virtual addresses correctly.

That’s enough concepts; let’s understand how it is actually done. For that, let’s look at the world from the processor’s perspective:

Global Descriptor Table

We all know the processor has registers that help it make calculations, some registers more than others (;)). By design, x86 supports multiple modes, but the most important are user and supervisor. The CPU has a special register called gdtr (Global Descriptor Table Register) that holds the address of a very important table. That table maps ranges of addresses to the processor mode allowed to access them, and it also contains the permissions for those addresses (READ | WRITE | EXECUTE). Obviously, that register can be changed only from supervisor mode. As part of the processor’s execution, it checks which instruction to execute next (and what address it is at); it checks that address against the GDT, and that way it knows whether it is a valid instruction, based on the required mode (matching the CPU’s current mode to the mode in the GDT) and permissions (if not executable, it’s invalid). An example is ‘lgdt’, the instruction that loads a value into the gdtr register; as stated, it can be executed only from supervisor mode. The key point to stress here is that all protection over memory operations (executing instructions / writing to invalid locations / reading from invalid locations) is done at the processor level by the GDT and LDT (coming next), using these structures that were built by the OS.

This is what an entry in the GDT / LDT looks like:

http://wiki.osdev.org/Global_Descriptor_Table

As you can see, it holds the range of addresses the entry is relevant to (base and limit), and its attributes (permissions), as you’d expect.
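To make the layout concrete, here is a sketch in C of an 8-byte descriptor and a helper that packs a base/limit pair into it (the field and function names are my own, following the osdev layout above):

```c
#include <stdint.h>

/* One 8-byte GDT/LDT descriptor. */
struct gdt_entry {
    uint16_t limit_low;      /* limit bits 0-15  */
    uint16_t base_low;       /* base  bits 0-15  */
    uint8_t  base_middle;    /* base  bits 16-23 */
    uint8_t  access;         /* present bit, ring (DPL), type: code/data, R/W/X */
    uint8_t  granularity;    /* limit bits 16-19 plus flags (4K granularity, 32-bit) */
    uint8_t  base_high;      /* base  bits 24-31 */
} __attribute__((packed));

/* Pack a base/limit pair and attribute bytes into one descriptor. */
struct gdt_entry make_descriptor(uint32_t base, uint32_t limit,
                                 uint8_t access, uint8_t flags)
{
    struct gdt_entry e;
    e.limit_low   = limit & 0xFFFF;
    e.base_low    = base & 0xFFFF;
    e.base_middle = (base >> 16) & 0xFF;
    e.access      = access;
    e.granularity = ((limit >> 16) & 0x0F) | (flags & 0xF0);
    e.base_high   = (base >> 24) & 0xFF;
    return e;
}
```

make_descriptor(0, 0xFFFFF, 0x9A, 0xC0), for example, builds the classic ring-0 flat code segment covering the whole 4GB.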

Local Descriptor Table

Everything we said about the GDT is also true for the LDT, with one small (but big) difference. As its name suggests, the GDT is applied globally across the system, while the LDT is applied locally. What do I mean by globally and locally? The GDT keeps track of the permissions of all the processes, for every thread, and it does not change between context switches; the LDT, on the other hand, does. It only makes sense that if every process has its own address space, it’s possible that for one process address 0x10000000 is executable, while for another it’s read/write only. This is especially true with ASLR on (discussed later). The LDT is responsible for holding the permissions that distinguish each process.

One thing to note is that everything said here is the intended purpose of each structure, but in reality an OS may or may not use some of them at all. For example, it is possible to use only the GDT, change it between context switches, and never use the LDT. It is all part of designing the OS and its trade-offs. The entries of that table look similar to those of the GDT.

Selectors

How does the processor know where to look in the GDT or LDT when it executes a specific instruction? The processor has special registers called segment registers:

https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture

  • Stack Segment (SS). Pointer to the stack.
  • Code Segment (CS). Pointer to the code.
  • Data Segment (DS). Pointer to the data.
  • Extra Segment (ES). Pointer to extra data ('E' stands for 'Extra').
  • F Segment (FS). Pointer to more extra data ('F' comes after 'E').
  • G Segment (GS). Pointer to still more extra data ('G' comes after 'F').

Each register is 16 bits long, and its structure is as follows:

http://www.c-jump.com/CIS77/ASM/Memory/M77_0290_segment_registers_protected.htm

So we have the index into the GDT/LDT, the bit that says whether to use the LDT or the GDT (TI), and the requested privilege level (RPL: 0 is supervisor, 3 is user).
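As a sketch, decoding those three fields from a raw selector value looks like this in C (the names are illustrative):

```c
#include <stdint.h>

/* The three fields packed into a 16-bit segment selector:
   bits 3-15 index the GDT/LDT, bit 2 picks the table (TI),
   bits 0-1 are the requested privilege level (RPL). */
struct selector {
    uint16_t index;   /* which descriptor in the table */
    uint8_t  ti;      /* 0 = GDT, 1 = LDT */
    uint8_t  rpl;     /* 0 = supervisor ... 3 = user */
};

struct selector decode_selector(uint16_t raw)
{
    struct selector s;
    s.index = raw >> 3;
    s.ti    = (raw >> 2) & 1;
    s.rpl   = raw & 3;
    return s;
}
```

For example, 0x1B decodes to index 3, GDT, RPL 3 (a ring-3 code selector).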

Interrupt Descriptor Table

Besides the GDT and LDT we also have the IDT (Interrupt Descriptor Table). The IDT is simply a table that holds the addresses of very important functions; some of them belong to the OS, others to drivers and physical devices connected to the PC. Like the gdtr, we have the idtr, which, as you probably guessed, is the register that holds the address of the IDT. What makes the IDT so special? When an interrupt fires, the CPU automatically switches to supervisor mode, which means every function inside the IDT runs in supervisor mode. A thread in any mode can trigger an interrupt by issuing the ‘int’ instruction followed by a number, which tells the CPU at what index the target function resides. With that being said, it is now obvious that every function inside the IDT is a potential gateway into supervisor mode.
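For illustration, a sketch of a 32-bit IDT gate descriptor in C; note how the handler’s address is split across the entry, just like the base address in a GDT descriptor (the field names are my own):

```c
#include <stdint.h>

/* One 8-byte IDT gate descriptor (32-bit protected mode). */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0-15 */
    uint16_t selector;     /* code segment the ISR will run with */
    uint8_t  zero;         /* always 0 */
    uint8_t  type_attr;    /* present bit, DPL (who may 'int' into it), gate type */
    uint16_t offset_high;  /* handler address bits 16-31 */
} __attribute__((packed));

struct idt_gate make_gate(uint32_t handler, uint16_t selector, uint8_t type_attr)
{
    struct idt_gate g = { 0 };
    g.offset_low  = handler & 0xFFFF;
    g.selector    = selector;
    g.type_attr   = type_attr;
    g.offset_high = (handler >> 16) & 0xFFFF;
    return g;
}
```

The DPL bits in type_attr are what decide whether a user mode ‘int’ may enter through this gate at all.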

So we know that we have the GDT/LDT, which tells the CPU the permissions of each virtual address, and the IDT, which points to the ‘gateway’ functions into our beloved kernel (which obviously resides inside the supervisor section of memory). How do these structures behave in a running system?

Virtual memory

Before we can understand how it all plays together, we need to cover one more concept: virtual memory. Remember when I said there is a table that maps each virtual memory address to its physical one? It’s actually a bit more complicated than that. First, we can’t simply map every virtual address, as that would take more space than we actually have. And putting efficiency aside, the OS can also swap pages of memory out to disk (for efficiency and performance), so it’s possible that the memory page of a needed virtual address is not in memory at the moment. So besides translating the virtual address to a physical one, we also need to record whether the memory is in RAM and, if not, where it is (there could be more than one page file). The MMU (Memory Management Unit) is the component responsible for translating virtual memory into physical memory.

One really important thing to understand is that every instruction in every mode goes through the process of virtual address translation, even code in supervisor mode. Once the CPU is in protected mode, every instruction it executes uses a virtual address, never a physical one (there are tricks by which a given virtual address will always translate to the exact same physical memory, but that is outside the scope of this post).

So once in protected mode, how does the CPU know where to look when it needs to translate a virtual address? The answer is the CR3 register. This register holds the address of the structure that contains the required information: the page directory. Its value changes with the currently running process (again, different virtual address spaces).

So what does this page directory look like? When it comes to efficiency, we need to be able to query this “table” as fast as possible, and we also need it to be as small as possible, because it is going to be created for each process. The solution to that problem is nothing less than brilliant. The best image I could find to illustrate the translation process is this one (from Wikipedia):

The MMU has 2 inputs: the virtual address to translate and CR3 (the address of the currently relevant page directory). The x86 specification chops the 32-bit virtual address into 3 pieces:

  • 10-bit number: index into the page directory.
  • 10-bit number: index into the page table.
  • 12-bit number: offset into the physical page itself.

So the processor takes the first 10-bit number and uses it as an index into the page directory. Each entry in the page directory points to a page table, into which the processor uses the next 10-bit number as an index. Each page table entry points to a 4K-aligned memory page, and the last 12 bits of the virtual address are used as an offset to the exact location within that page. The brilliance of that solution is:

  • The flexibility: each virtual address can map to a completely unrelated physical one.
  • The space efficiency of the structures involved is amazing.
  • Not every entry of every table is used; only virtual addresses that are actually used and mapped by the process exist in the tables.
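To make the 10/10/12 split concrete, here is a toy translation in C over plain arrays standing in for the page directory and page tables (a sketch; real entries also carry present/writable/user bits):

```c
#include <stdint.h>

/* Split a 32-bit virtual address the way the x86 MMU does. */
uint32_t pd_index(uint32_t va)  { return va >> 22; }           /* top 10 bits    */
uint32_t pt_index(uint32_t va)  { return (va >> 12) & 0x3FF; } /* middle 10 bits */
uint32_t pg_offset(uint32_t va) { return va & 0xFFF; }         /* low 12 bits    */

/* Toy two-level walk: directory entry picks a page table, table entry
   picks a 4K frame, the offset picks the exact byte. */
uint32_t translate(uint32_t va, uint32_t page_dir[1024],
                   uint32_t (*page_tables)[1024])
{
    uint32_t pt    = page_dir[pd_index(va)];         /* which page table */
    uint32_t frame = page_tables[pt][pt_index(va)];  /* which 4K frame   */
    return (frame << 12) | pg_offset(va);            /* physical address */
}
```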

I’m truly sorry for not explaining this process in more detail; it is a well-documented process that many people have worked hard on explaining better than I ever could. Google it.

Kernel vs User

This is where it gets interesting (and magical, if I may).

We started this article by stating that the OS orchestrates it all, and that it does so using the kernel. As already stated, the kernel runs in a memory section that is mapped as supervisor-mode-only in the GDT, for all processes. Yes, I know each process has its own address space, but the kernel cuts out a piece of that address space (usually the upper half, depending on the OS) for its personal use. And it doesn’t just cut out the address space, it does so at the very same addresses in all processes. This is important because the kernel’s code is fixed, and every reference to its variables and structures needs to be at the same location in all processes. You can look at the kernel as a special library loaded into each and every process at the very same location.

Deeper into interrupts

We know that the IDT contains the addresses of functions; these functions are called ISRs (Interrupt Service Routines). Some execute when a hardware event occurs (a key press on the keyboard), and others when software initiates the interrupt, for example to switch to kernel mode.

Windows has a cool concept for interrupts and for prioritizing them. One especially important interrupt is the clock tick: with every tick of the clock there is an interrupt, handled by its ISR. The OS scheduler uses this clock event to control how long each thread has been running and whether it is another one’s turn. As you can understand, this interrupt is super important and needs to be served as soon as it happens, but not all ISRs have the same importance, and this is where priorities between interrupts kick in.

Let’s take a key press on the keyboard, for example, and assume it has a priority of 1. I just pressed a key, and its ISR is executing; while the keyboard’s ISR executes, all interrupts of the same priority and lower are ignored. While executing that ISR, the clock’s ISR is triggered with a priority of 2 (which is why it wasn’t masked); an immediate switch occurs to the clock’s ISR, and once it finishes, control returns to the keyboard’s ISR right where it stopped. These interrupt priorities are called IRQLs (Interrupt ReQuest Levels); the higher the IRQL of an interrupt, the higher its priority. Interrupts at the highest priority are never interrupted in the middle; they always run to the end. IRQLs are Windows-specific (a number between 0 and 31); on Linux they do not exist. Linux handles every interrupt with the same priority and simply disables all interrupts when it really needs a specific routine not to be disturbed. As you can see, it is all a matter of design and preferences.

Let’s connect it all to our beloved user mode. The ISR of the clock event is going to execute regardless of which thread is currently running, and might even interrupt another ISR on an unrelated task. This is a perfect example of why the kernel is at the same addresses in all processes: we don’t want to change the GDT and the page directory (in CR3) each time we handle an interrupt, since it happens MANY times during even a single function of any given user mode process. A lot is happening between those lines of code you write when you develop your user mode application (;)).

Another way of looking at interrupts is as external and independent inputs to our OS. This definition is not accurate (not all interrupts are external or independent), but it is good for making a point: a big part of the kernel’s job is to make sense of the events occurring all the time from every location (input devices, for example), to serve those events on one side, and to make sure everything is correlated correctly on the other.

Putting it all together

So to make sense of it all, let’s start with a simple user mode application executing the following instruction:

0x0000051d push ebp;

For each instruction the CPU executes, it first examines the address of that instruction (in this case ‘0x0000051d’) against the GDT/LDT, using the code segment register (‘cs’, because it is an instruction to execute) to know which index to look for in the table (remember, the segment register tells the CPU exactly where to look). Once the CPU knows the instruction is in an executable location and we are in the right ring (user mode / kernel mode), it continues to execute the instruction. In this case the ‘push ebp’ instruction affects not only the register but also the program’s stack (it pushes ebp’s content onto the stack), so the CPU also checks the address inside the esp register against the GDT/LDT (the address of the current location on the stack; because it is a stack location, the CPU knows to use the stack segment register for it) to make sure it is writable in that specific ring. Please note that if this instruction were also reading from memory, the CPU would check the relevant address for read access too.

This is not all. After the CPU has checked all the security aspects, it now needs to access and manipulate the memory; as you recall, the addresses are in their virtual form. The MMU now translates each virtual address specified by the instruction into a physical address, using the CR3 register that points to the page directory (which points to the page tables), eventually translating the address into a physical one. Note that the address might not be in memory at the moment it is needed; in that case the CPU generates a page fault (an exception that generates an interrupt), the OS brings the data into physical memory for us, and then execution continues (this is transparent to the user mode app).

From user to kernel

Every exchange between user mode and kernel mode happens through the IDT. From a user mode application, the instruction ‘int <num>’ transfers execution to the function in the IDT at index num. When execution is in kernel mode, a lot of the rules change: each thread has different stacks for user and kernel mode, memory access checks are much more complicated and mandatory, and in kernel mode there is very little you cannot do and a lot you can break.

ASLR and KASLR

More often than not, it is “only” the lack of knowledge that prevents us from achieving the impossible.

ASLR (Address Space Layout Randomization) is a concept implemented differently in each OS. The concept is to randomize the virtual addresses of processes and their loaded libraries.

Before we dive in, I wanted to note that I decided to include ASLR in this post because it is a nice way to see how protected mode and its structures enable this kind of capability, even though protected mode is not what implements it, nor is it responsible for it, for that matter.

Why ASLR?

The why is easy: to prevent attacks. When someone is able to inject code into a running process, not knowing the addresses of some beneficial functions is what can make the attack fail.

We already have a different address space for each process; this means that without ASLR, all processes would share the same base addresses. This is because, when each process is in its own virtual address space, we do not have to worry about collisions between processes. When we link a program, the linker chooses a fixed base address against which it links the executable. On paper, all executable files linked by the same linker with the default parameters (the base address can be configured if needed) will have the same base address. As an example, I wrote two applications, one called “1.exe” and the second “2.exe”. They are different projects in Visual Studio, and yet they both got the same base address (I used Exeinfo PE to view the base address in the PE file):

Not only do these two executables have the same base address, they both do not support ASLR (I disabled it):

You can also see it in the PE format, under File Characteristics:

Now let’s run both executables at the same time; they both share the same base address (I am using VMMap from Sysinternals to view the base image):

We can see that both processes do not use ASLR and have the same base address of 0x00400000. If we were attackers and had access to these executables, we could know exactly what addresses would be available to the process once we found a way to inject ourselves into its execution. Let’s enable ASLR in our executable 1.exe and see its magic:

It changed!

KASLR (Kernel ASLR) is the same as ASLR, only it works at the kernel level, which means that once an attacker manages to inject himself into the kernel’s context, he (hopefully) won’t be able to know what addresses contain what structures (for example, where the GDT sits in memory). One thing to mention here is that ASLR works its magic on every spawn of a new process (that supports ASLR, of course), while KASLR does it on every restart, as this is when the kernel is “spawned”.

How ASLR?

So how does it work, and how is it connected to protected mode? The one responsible for implementing ASLR is the loader. When a process is launched, the loader is the one that needs to put it in memory, create the relevant structures, and fire up its threads. The loader first checks whether the executable supports ASLR, and if so, it randomizes a base address within the range of available addresses (the kernel space, for example, is obviously not available). Based on that address, the loader then initializes the page directory of the process to point the randomized address space at the physical one. The flexibility of the LDT also comes to our rescue, as the loader can simply create an LDT that corresponds to the randomized addresses with the relevant permissions. The beauty here is that protected mode is not even aware that ASLR is being used; it is flexible enough not to care.

An interesting implementation detail is that in Windows, the randomized address for a specific executable is fixed, for efficiency reasons. What I mean is that if we randomize an address for, say, calc.exe, the second time it is executed its base address will be the same. So if I open 2 calculators at the same time, they will have the same base address. Once I close both calculators and open them again, they will both have the same address again, only this time it will be different from the address of the former calculators. Why is this efficient, you ask? Think about commonly used DLLs. Many processes use them, and if their base addresses were different in each process instance, their code would also be different (the code references data using this base address), and if the code is different, the DLL would need to be loaded into memory separately for each process. In reality, the OS loads such images only once, for all the processes that use them. It saves space, a lot of it!

Conclusion

By now you should be able to picture the kernel at work and understand how all the key structures of the x86 architecture play together in a bigger picture, enabling us to run possibly dangerous applications in user mode with little (or no) fear.

Until next time 🙂

Originally published at http://unravelit.net on December 12, 2017.
