How to Read x86/x64 Disassembler Output

There is also an interactive intro to this: http://wordsandbuttons.online/you_dont_have_to_learn_assembly_to_read_disassembly.html

There are several reasons why you might want to read x86 disassembly instead of some readable source files, but they all boil down to one: compilers do not always do what you expect from them. Sometimes they fail to do pretty obvious things, like inlining a function, unrolling a loop, or using SIMD instructions. And sometimes we can help them with very little effort.

You don’t have to get your hands on the assembly itself, but knowing the guts of your code helps you help the compiler. You can reimplement some function so that it becomes pure, and the compiler will gladly inline it. Or change a loop index to a native type so the compiler can unroll the loop. Or tinker with function arguments to hint the compiler into packing them all into one big register. These are the easiest things that bring immediate gain at little cost.

And you don’t have to actually know assembly language to get useful information from disassembly output. Although language proficiency helps, you only need a few basics to collect useful evidence. For instance, most of the time your point of interest will be the simple `call` instruction, which is meant to literally call some function. By looking at calls alone, you can see a lot about the intrinsic structure of your code. At the very least, you can tell which functions get inlined and which don't. Or you can take your debugger, go down into some third-party code with no source at hand, and get an impression of what's going on in it just by reading function names and observing the call graph. So even knowing a single instruction is enough to make use of disassembly.
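
For instance, here is a made-up fragment of a listing. The names and addresses are invented, but the pattern is what you'd look for: one helper survived as a call, while the other left no call behind, so it was evidently inlined.

00000010 lea rcx,[rsp+28h]
00000015 call parse_header ; not inlined: the call is right there
0000001a mov eax,dword ptr [rsp+30h]
0000001e imul eax,eax ; no call to a `square` helper anywhere: it was inlined
00000021 ret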

Still, there are about seven hundred of them in the x86 instruction list as of 2016, and the number will only grow, since Intel is determined to preserve backward compatibility. And there are not only instructions, but registers, flags, interrupts, directives, macros, calling conventions and so on. Although assembly programming is fun and educational, it is hardly a skill you want to invest that much into. So let's only highlight the very essentials.

Structure of Disassembly

Usually, what you have to deal with looks somewhat like this:

000000be mov eax,dword ptr [rsp+20h]
000000c2 mov dword ptr [rsp+00000088h],eax
000000c9 mov rax,qword ptr [rsp+000000C0h]
000000d1 cmp byte ptr [rax],0
000000d4 mov rdx,qword ptr [rsp+000000C0h]
000000dc mov r9d,dword ptr [rsp+34h]
000000e1 mov r8d,dword ptr [rsp+38h]
000000e6 lea rcx,[rsp+68h]
000000eb call FFFFFFFFF18E52D8

The hexadecimal number on the left is the instruction address. Most of the time it's not important. But sometimes the disassembler can't decipher a function name for you, so it uses raw addresses in calls and conditional jumps, and then the address column suddenly becomes useful.

The short word in the middle is the instruction mnemonic. It is not the same as the opcode, or machine code, since the opcode also depends on the mnemonic's arguments. For instance, `call` may turn into 3 different opcodes depending on what type of address it is provided with.
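
As an illustration (the addresses and offsets below are made up), these three calls share a mnemonic but encode differently:

call 000000000012B4D8h   ; direct call: the offset to the target is baked into the instruction
call rax                 ; indirect call: the target address sits in a register
call qword ptr [rax+18h] ; indirect call through memory, typical for virtual method calls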

The unspeakables on the right are the mnemonic's arguments. In the Intel tradition, they are read right to left, so:

mov eax, dword ptr [rsp+20h]

actually means:

move the 4 bytes that lie 20h (that is, 32) bytes from the stack head into the `eax` register.
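
The `dword ptr` part is a size specifier: it tells how much memory the operand covers. A sketch of the same load done at every basic size:

mov al,   byte ptr [rsp+20h] ; 1 byte
mov ax,   word ptr [rsp+20h] ; 2 bytes
mov eax, dword ptr [rsp+20h] ; 4 bytes
mov rax, qword ptr [rsp+20h] ; 8 bytes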

Instructions are imagined to run sequentially from top to bottom. This is not exactly so: modern processors can execute several instructions at the same time, or even out of order. But we can rest assured that when the order of execution matters, it will be preserved. This way we can walk through the code with the debugger line by line, and the result will not differ from an actual run.

Data Types

Assembly has a very simple type system; it's all native types. You might know them as byte, word, double word and quadruple word. SIMD instructions operate on "packed" versions of these types. For instance, a packed word in a 128-bit SSE context is 8 words packed together in one register.

The floating point co-processor reuses these types as well, although with a bit of confusion: a single precision IEEE 754 floating point number is actually a double word, and a double precision one is a quadruple word. There are also the "extended real" and "packed BCD", which are untypical 80-bit structures. You will hardly ever meet them in disassembly; they are mostly preserved for backward compatibility.

Traditionally, the most important data type for every architecture is not the byte but the word. It's the machine's natural type: 16-bit machines have 16-bit words, 32-bit machines have 32-bit words, and so on. There were even machines with 12-bit words and 6-bit bytes.

In the x86 architecture, it would be fair to call the byte a "halfword". And, to be consistent, the word on x64 should actually be 64 bits long. That would have destroyed backward compatibility though, so we keep the historical names. Even on the most modern x64 processors, the byte is 8 bits, the word is 16, the double word is 32, the quadruple word is 64, and the 128-, 256- and 512-bit SIMD types are just "packed" bytes and words.
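
To make "packed" concrete, here is a small sketch of two SSE additions; each single instruction adds a whole register's worth of elements at once:

addps xmm0, xmm1 ; add 4 packed single precision floats (4 x 32 = 128 bits)
paddw xmm0, xmm1 ; add 8 packed words (8 x 16 = 128 bits)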

Registers

Registers are bits of memory that live especially close to the processor. They do not have addresses, as they do not belong to RAM; instead they are referenced by names, or even implicitly. They are very fast to read and write, so in x86 most of the integer arithmetic is done on them. There are a lot of registers in the modern architecture. There is no point in memorizing them all, but some are definitely worth mentioning.

eax, ebx, ecx and edx are 32-bit general purpose registers. Most double word data manipulation is done through them. They all have names: accumulator, base, counter and data, so it's not just the alphabet. But the names only imply conventional usage in hand-written assembly; compilers don't have to follow these conventions, so there is little point in getting attached to the names.

The lower word of each of these registers can be accessed as a separate register, named ax, bx, cx and dx respectively. These were the general purpose registers of the 16-bit architectures, such as the original 8086 or 80286, but now they are just parts of their younger siblings.

Each of the 16-bit registers gives access to its lower and higher bytes: al, bl, cl and dl for the lower; ah, bh, ch and dh for the higher.

And each of the 32-bit registers has its 64-bit extension: rax, rbx, rcx and rdx. Just like ax is the lower half of eax, eax itself is the lower half of rax.
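
Here is a sketch of how the aliases overlap; the values are made up. One x64 quirk worth knowing: writing to a 32-bit register zeroes the upper half of its 64-bit parent, while writing to the 16- and 8-bit parts leaves the rest intact.

mov rax, 1122334455667788h ; rax = 1122334455667788h
mov eax, 0AABBCCDDh        ; the upper half is zeroed: rax = 00000000AABBCCDDh
mov ax, 1234h              ; only the lower word changes: rax = 00000000AABB1234h
mov al, 56h                ; only the lowest byte changes: rax = 00000000AABB1256h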

There are sixteen 64-bit registers in total: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and the rest are just r8 to r15. The first eight have historical names and meanings; the rest do not.

Two of them are the most important for us, as we will see them all the time: rbp and rsp.

rbp is the Base Pointer, also known as the Frame Pointer. It points to the current stack frame. It is important because all the local variables allocated on the stack have the form [rbp-SOME_NUMBER], and stack-passed arguments the form [rbp+SOME_NUMBER]. Note that in x86 the stack grows downward, so the more you allocate in your local context, the bigger that negative offset gets.

rsp is the Stack Pointer; it points to the top of the stack. With some other compiler, you might see local variables and function arguments addressed as [rsp+SOME_NUMBER] instead. It is less convenient for a human reader, since every stack operation changes the offset of every variable, but some compilers do that, so it's only fair to give a heads up.
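
To see both pointers at work, here is a sketch of a typical rbp-based prologue and epilogue; it is not taken from any particular compiler's output:

push rbp                 ; save the caller's frame pointer
mov rbp, rsp             ; the new frame starts at the current stack top
sub rsp, 10h             ; reserve 16 bytes for local variables
mov dword ptr [rbp-4], 1 ; a local variable lives below the frame pointer
mov rsp, rbp             ; release the locals
pop rbp                  ; restore the caller's frame
ret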

There are also floating point registers `ST0` to `ST7` that form a stack of their own; MMX registers `MM0` to `MM7`; and SIMD registers: `XMM0` to `XMM15` for 128 bits, `YMM0` to `YMM15` for 256 bits, and `ZMM0` to `ZMM31` for 512 bits (that's in 64-bit mode; 32-bit code only sees the first eight).

You don’t have to remember them all, just acknowledge that they exist.

Calling a Procedure

The full name of call is Call Procedure, and that's basically all it does. What it does exactly, though, is store the address of the next instruction on the stack and only then jump to the procedure. Or a function, or a method; they are all procedures in assembly.

And at the end of the procedure there is a ret. It reads an address from the stack and jumps to that address.

So if you haven’t corrupted your stack, and modern compilers will preserve it in a perfect state for you, the control flow will enter the procedure; then, if there is a call inside, it will go further into the next procedure, and then the next, and so on. All this time, return addresses will be populating the stack. But since it is a stack, in other words a "last in, first out" structure, every ret unloads just one address at a time, and eventually, when you get back from the very first call, the stack will be empty.

Or you can overflow the stack, for instance with a really deep recursion, and never get back. Modern compilers have a remedy for this, though. It is called Tail Call Optimization, and it turns a call in tail position into a jump, so the same return address isn't stored over and over on the stack. You can still get stuck in an infinite loop, but that's a completely different story.
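
As a sketch, with my_function being a made-up name, the transformation is just this:

; a call in tail position followed by a return...
call my_function
ret

; ...turns into a single jump: the callee will return straight to our caller
jmp my_function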

Function arguments may also be passed to a procedure on the stack. In fact, that's the conventional way for hand-written assembly or WinAPI calls. Still, modern compilers know that pushing and popping things through the stack is not free, and they try to minimize this overhead for you. They pass arguments in registers when applicable: general purpose ones, floating point ones, and especially SIMD registers, whose contents are particularly costly to move through the stack.
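
For instance, under the Microsoft x64 calling convention the first four integer arguments travel in rcx, rdx, r8 and r9, and that's exactly the pattern visible right before the call in the listing at the top of this page. A made-up sketch of a three-argument call:

mov r8d, 3         ; the third argument
mov edx, 2         ; the second argument
mov ecx, 1         ; the first argument
call some_function ; some_function(1, 2, 3); the name is invented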

Branching, AKA `if`s, `for`s and `while`s

Although modern macro assemblers have nice macros to implement branching in a convenient, structured manner, you will not see them in disassembly. All the actual branching is done with conditional and unconditional jumps.

The unconditional jump is easy. It's the jmp instruction followed by an address, often in the form of an explicit relative offset. It works just like goto: it carries the control flow anywhere it is asked to.

Conditional jumps are a bit trickier. They all start with a j, so if you see something like jng, jz or jl in the code, it is definitely one of them. They are relatively easy to comprehend once you understand the naming scheme: jng stands for "jump if not greater", jz for "jump if zero", and jl for "jump if less".

They are often preceded by a cmp instruction that does the comparison for them:

cmp eax, ebx ; compare `eax` and `ebx`
jg SOMEWHERE ; jump if the former is greater than the latter (eax > ebx)

But actually, all cmp does is set flags in a special flags register, so the two instructions don't always come bundled together. In fact, cmp is not the only instruction that sets flags; many others do too. For instance, this is completely legitimate:

sub eax, 42       ; subtract 42 from `eax`
jz SOMEWHERE_ELSE ; jump if `eax` is now zero

There are a lot of flags, and a matching lot of conditional jumps. Once again, you don't have to know them all to start reading code.
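
A loop is also nothing but branching: a label, a comparison, and a backward conditional jump. Here is a sketch of how a simple counting loop might come out; the label name is made up:

    xor ecx, ecx ; i = 0, the idiomatic way to zero a register
loop_body:
    ; ... whatever work the loop does ...
    inc ecx      ; ++i
    cmp ecx, 10  ; compare i with 10
    jl loop_body ; jump back while i < 10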

Computation

You will see at least two different kinds of computation in x86. The first is integer arithmetic, done with instructions like add, sub, mul and div, which work on general purpose registers. The second is floating point arithmetic: fadd, fsub, fmul and fdiv, which work on the special stack made of the 8 floating point registers. There are other kinds of computation involving SIMD registers and instructions, but they are generally an extension of floating point computation.

Integer arithmetic is historically the most native to x86. It is also much more esoteric. For instance, the div instruction always implicitly divides the *dx:*ax register pair, leaving the quotient in *ax and the remainder in *dx. So all you see in the code is:

div ebx

While what it actually implies is:

{edx, eax} = {(edx * 2^32 + eax) % ebx, (edx * 2^32 + eax) / ebx}
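
This is why a complete unsigned division usually comes with some setup around it. A small sketch:

xor edx, edx ; clear the upper half of the dividend
mov eax, 100 ; the lower half of the dividend
mov ebx, 7   ; the divisor
div ebx      ; now eax = 14 (the quotient) and edx = 2 (the remainder)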

Multiplication similarly implies *ax as the first multiplier and the *dx:*ax pair as the output. Luckily, add and sub are explicit. They work just like the mov instruction, meaning add eax, 8 can be translated as "move (`eax` + 8) into the `eax` register".
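
To make mul's implicit operands concrete, another small sketch:

mov eax, 6 ; the first multiplier always goes in eax
mov ebx, 7 ; the second is explicit
mul ebx    ; edx:eax = 42, that is, eax = 42 and edx = 0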

There are also rules about what can and cannot be an argument of an arithmetic instruction. For instance, you can't add together two bytes from arbitrary places in memory; you have to load at least one of them into a register first. And you can't multiply registers of different sizes. Still, if you are going to read the code and not write it, it is fine not to know all the details; the compiler will have them settled just right for you.
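
As an illustration of the first rule, a sketch:

; add byte ptr [rax], byte ptr [rbx] ; illegal: two memory operands are not allowed
mov cl, byte ptr [rbx]               ; so one of them goes into a register first
add byte ptr [rax], cl               ; and only then the addition happens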
 
Floating point arithmetic is much more consistent. The x87 co-processor is organized as a stack machine: the registers form a stack, and the operations work on its top. It's basically postfix notation with floating point numbers. Just look at this piece:

fld   dword ptr [ebp-1Ch] ; A
fmul  dword ptr [ebp-0Ch] ; x
fld   dword ptr [ebp-20h] ; B
fmul  dword ptr [ebp-10h] ; y
faddp st(1),st
fadd  dword ptr [ebp-24h] ; C
fmul  dword ptr [ebp-4Ch] ; inv_d
fstp  dword ptr [ebp-50h] ; new_x

If you just read it from top to bottom you will get:

load A onto the stack
multiply it by x and leave the result on the stack
load B onto the stack
multiply it by y and leave the result on the stack
add together what's on the stack top and one step below
add C to it
multiply it by inv_d
take it from the stack and store it in the local variable new_x

In Forth-style postfix notation, also known as reverse Polish notation, the same thing reads:

A x * B y * + C + inv_d * new_x !

And in conventional infix notation:

new_x = (A * x + B * y + C) * inv_d

You might need to get used to it, but it's consistent and explicit, so it won't take much effort.
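
For comparison, a modern compiler targeting x64 would more likely compute the same expression with scalar SSE instructions, which work on explicitly named xmm registers instead of a stack. A sketch with made-up offsets:

movss xmm0, dword ptr [rsp+1Ch] ; load A
mulss xmm0, dword ptr [rsp+0Ch] ; A * x
movss xmm1, dword ptr [rsp+20h] ; load B
mulss xmm1, dword ptr [rsp+10h] ; B * y
addss xmm0, xmm1                ; A * x + B * y
addss xmm0, dword ptr [rsp+24h] ; + C
mulss xmm0, dword ptr [rsp+4Ch] ; * inv_d
movss dword ptr [rsp+50h], xmm0 ; store into new_x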

However, even by looking at this code and just counting the instructions, you can tell whether the compiler optimized the code well enough or needs a little help. For instance, this `inv_d` is here for a reason. For performance, you don't want to do many divisions where you can get by with multiplications. So instead of doing `… / d` several times, you're generally better off with `inv_d = 1/d; … * inv_d; … * inv_d; … * inv_d`.

Compilers know that trick, but they also know that you would lose a little bit of precision by doing it. This makes the compiler uncertain: should it undertake this risky optimization or not? What is of utmost importance to you, performance or precision? The compiler doesn't know. But you do. And you can either select the right compiler flag or simply do the substitution yourself. It's that simple. All you need is evidence that the compiler won't do it for you.
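
In disassembly terms, the evidence is simply which instruction repeats. A sketch with invented offsets, pretending d lives at [rsp+30h] and a prepared 1.0f at [rsp+40h]:

; not optimized: the expensive division repeats
divss xmm0, dword ptr [rsp+30h] ; ... / d
divss xmm1, dword ptr [rsp+30h] ; ... / d again

; optimized: one division up front, cheap multiplications after
movss xmm2, dword ptr [rsp+40h] ; load 1.0f
divss xmm2, dword ptr [rsp+30h] ; inv_d = 1 / d, paid once
mulss xmm0, xmm2                ; ... * inv_d
mulss xmm1, xmm2                ; ... * inv_d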

Conclusion and Further Reading

This introductory text should be just enough to help you start harvesting important evidence from disassembly output. Of course, it's nowhere near comprehensive, and maybe not even that insightful, but it's something to start with.

If you're willing to dive deeper into reading and writing x86 assembly, better sources would be Agner Fog's optimization manuals http://www.agner.org/optimize/ and the Intel Software Developer Manuals http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.