Update: user RelativeTrifle at the ReverseEngineering subreddit pointed out to me that this is not a new concept, as Johannes Kinder had already written about it in 2010 while calling it “overlapping instructions”. This paper was mentioned at StackExchange in 2013. While I did research terms and expressions in order to figure out if this technique had been mentioned before, it is very hard to guess what it could be called by someone else. Therefore, I apologize to Johannes Kinder for having had the “new” in the title, as it isn’t. However, I will still keep this article as it shows some more practical examples/code that helps in its understanding and execution, and is based on x64, instead of x86, which gives the programmer more “room to play”. I just wanted to credit Johannes Kinder for the original idea.
You can see this technique as an improvement on what is called “impossible disassembly” which you can read about in a classic: “Practical Malware Analysis”, chapter 15. I’ll be showing some slightly more advanced examples and, of course, the technique itself.
In order to explain this, I’ll be going through the basics till the more advanced stuff, always explaining with code:
- Background — Linear sweep and Recursive descent
- “Impossible disassembly” — the technique
- Example in Windows using C
- Advanced version — “jmp short -9”
- Assembly wrapping
Disclaimer: While this technique might be used for malicious purposes, I do not condone it. The only reason I look into such techniques is that 1: they are technically interesting; and 2: as part of a Red Team, you will undoubtedly be developing your own tool arsenal and, as such, end up going deep into reverse engineering (RE) and implementing anti-RE techniques.
I will be using x64 assembly and C for proof-of-concept code. The C code will be compiled with mingw-w64 compiler, and if you are wondering “why aren’t you using Visual Studio?”, that would be because they (Microsoft) don’t support inline assembly for x64 architecture. Just to be clear, this is not an incapability issue, it’s actually a choice they made for better code compiling optimization.
Also, the whole point of this blog is to show a specific anti-disassembly technique in disassemblers. But not only in the purest of forms, such as in tools like objdump or IDA, but also the disassemblers inside debuggers (e.g. x64dbg). So, while I would not intentionally compare apples and oranges, when I do mention debuggers, what I actually mean is the disassembler inside them. Also, while I do use Immunity quite a lot, it doesn’t support 64-bit PE files, so it won’t be featured here.
Background — Linear sweep and Recursive descent
As a beginner in RE, one tends to assume that the disassembled code shown by disassemblers (e.g. IDA, objdump) or debuggers (e.g. x64dbg, gdb) is definitive. However, that is not true, as proven by the fact that most of these tools allow the researcher to rearrange the code/data analysis.
Bottom line, the linear sweep starts from entry point / start of function and just runs through the opcodes assuming everything is code and all linear (one instruction starts after the other), while the recursive descent is smarter and follows the control flow to better distinguish between code and data. However, there’s still linear sweep when doing recursive descent, as it only stops doing the linear sweep when finding control flow instructions such as conditional jumps, absolute jumps, calls, and returns.
As an example, check the following code, with data in between. It’s not relevant here to understand all instructions, but rather the fact that there’s data (“buffer db …”) in between the code:
This simply prints out “Australia” in the console. If you want to know more about this code, I’ve written extensively about shellcoding (even though this is not shellcoding as it has null bytes and absolute references to the buffer memory position) in previous blogs such as https://pentesterslife.blog/2017/11/01/x86_64-tcp-bind-shellcode-with-basic-authentication-on-linux-systems/.
The point here is to show objdump and gdb doing a linear sweep, and IDA using the recursive descent to analyze the code, so let’s see how each distinguish between code and data. Note that the “len” line will not show up as it will simply be calculated and replaced in the relevant locations — mov rdx,len — by the compiler.
You can clearly tell that the recursive descent (done by IDA here) is better at telling the difference between code and data.
“Impossible disassembly” — the technique
By understanding the algorithms previously mentioned, one can imagine there are a few ways to exploit the disassembly process. One such way is called impossible disassembly. The name is a bit unfortunate because it’s not actually impossible, as it is only referenced that way because of the predicament in which the disassembler will find itself: one byte belonging to two instructions.
In the following image, the disassembly you see above the hex bytes is what the disassembler will show you or “see”, but the instructions below is what it turns out to be (after the jump is made).
The difficulty here (and hence the “impossible”) is a basic assumption in disassembly, which states that one byte is only interpreted in the context of one instruction. This is obviously not true, as shown above where 0xff belongs to two instructions.
Note that, after the jump, there is an increment to eax. Make sure this does not impact your hidden code! If you were messing with eax and were expecting it to be a specific value, this will change it of course. Also, you can swap the 0xc0 for something else, but you have to keep in mind that this new byte has to be the start of an instruction that will consume some of the next bytes of “real code” and only partially.
So, let’s see it in action! Using the previous assembly example, but this time printing out “Evil code!”, which will be what we’re trying to hide from the disassembler.
Let’s inject the anti-disassembly code right at the beginning of the _start entry point:
After compiled with nasm, you can see the code is hidden, even with recursive descent:
And this runs just as before:
Keep in mind that, as I’ve pointed out before, a reverse engineer can instruct IDA to interpret bytes/opcodes as data (pressing ‘D’) or as code (pressing ‘C’) after identifying this technique. So, this won’t stop any decent reverse engineer but might slow them down. And you can slow them down even more with the advanced variation of this that we’ll look at ahead.
Also, the “inc eax” (ff c0 executed after the jump -1) is irrelevant in this example, given that I set rax to 1 right after.
Example in Windows using C
Why not show a similar example but with assembly, in Windows? Because the concept is exactly the same. The only actual difference is that instead of doing the “db …” (stands for define byte) in the beginning (define byte), you’d be doing a “.byte 0xeb,0xff,0xc0” (different compilers…).
So, for our proof-of-concept, let’s use the following, and see if we can hide the “evil” part from disassemblers:
Which then executes into the following, where you can see the “evil” code being executed:
But when looking at it, using different tools, they can’t disassembly it right, at first — again, you can do this manually in both the following tools, but it requires extra work in your RE.
So, let’s look at IDA:
The first “printf” (“puts”) is clearly shown but not the second “evil” code. So, the added 3 bytes do exactly what they are supposed to do.
In x64dbg, it compromises its analysis as well. A common task in RE is to search for string references and intermodular calls, and in this case, the “evil code” doesn’t show up and, in looking for intermodular calls, it only shows one occurrence of the “puts” function:
And, of course, the disassembled code:
Advanced version: “jmp short -9”
Now, just in case you’re wondering if one couldn’t simply write down a script (e.g. IDAPython) and patch that “eb ff c0” sequence by removing them, think again. While that would solve this specific problem, it wouldn’t solve the million variations that you can come up with. Also, after the CPU does the jump, it’ll execute “inc eax”, and to avoid complexity, we neglected that instruction. But one could write code right after, that would depend on that increment, or validate eax to a specific/expected value. So, you can see there’s no universal solution here. Moreover, what if we don’t just jump back one byte (jmp short -1 | eb ff) but, instead, jump 9 bytes back?
Let’s see such an example:
To understand this image, read the code as disassembled above the hex code, and then after the “jmp” backwards (-9), read the code below the hex code, which will be hidden from the disassembler.
What you’ll see in the disassembler is:
The jump (right before the last instruction call) will go backwards and land the CPU (rip) right in the “middle” of the data I first put into rax, in the eb 08 part to be more precise, which is another jmp but ahead, right after the e8. The e8 is important here because it’s the start of a call instruction and will consume the next bytes as a memory address (as function/code memory location) and will, therefore, hide the instructions that those bytes actually represent. And that’s why the “jmp” ahead (eb 08) lands right after the e8 (fake call).
Let’s see it working then:
The other interesting thing is that, because of the nature of this code, the ff’s are quite irrelevant. So you can replace them with any other bytes and it would still work. Ideally, if you have different variations on the same executable, it would make it harder to automate detection and elimination, through scripting, of these byte sequences in the code.
So the previous code has something interesting about it: it jumps back into a large instruction (mov rax,…) and executes code that is inside the value you’re putting into rax. The ff’s are quite irrelevant but, what if they weren’t? What if I could build a “skeleton” code where I could then place hidden code instead of the ff’s? This is what I came up with:
The hidden code execution will be triggered by the jmp instruction. The values are reversed (little-endian), but what’s happening here is a jump back to the first byte of the 8-byte value placed in rax, which will be (in the correct order) “ff ff ff ff ff ff eb 04”. Now, all ff’s will be replaced with my real/hidden code which will be executed, and then the “eb 04” is simply a jump ahead into the start of the next 8-byte value placed into the next “mov rax,…”, which will again execute the code that will be placed there until it reaches the “eb 02” which, again, jumps ahead into the next hidden instruction.
Before writing up code to be hidden inside this skeleton, we must acknowledge some limitations:
- Given the fact that I chose the “mov rax,…” (a 10-byte instruction) as my “skeleton”, none of the hidden code’s instructions must be longer than 6 bytes. This is because the bytes/opcodes on a single instruction must be placed right next to each other when being read by the CPU, and I only have 6 available, given that “mov rax,…” is made of 2 bytes that identify the instruction, and 8 bytes to put in the register. I still need to take on 2 of these 8 bytes for the “jmp short 2 / eb 02”, so I’m left with 6 bytes to play with. However, I can still join instructions together, as long as the total number of bytes doesn’t exceed the number 6. And you’ll also notice that I sometimes have to “pad” (encryption term) the instructions when they’re shorter than 6 bytes with NOPs (0x90), otherwise, the compiler nasm will have null bytes appended to the higher end of the 8-byte value.
- Given the previous point, you can now understand why I can’t write this in C, as you’ll definitely have the C compiler throw assembly code with instructions longer than 6 bytes at you. So I need full control on writing the assembly, which forces me to write it myself.
I’ll mention the advantages after the example as you’ll understand my point better.
So the code we wish to hide is the following:
This is not as simple or straight-forward as the previous assembly codes I’ve shown, because this is more like actual shellcode, while still just printing something — “Evil\n” — out on the command line. I chose to do this, this way because I want to:
- have as short instructions as possible to fit the most inside those 6 bytes, in a single “mov rax,…”.
- have no null bytes, again to save on space.
- need position-independent code as absolute memory positions will change once placed inside the skeleton code.
These are all characteristics of shellcode, which I’ve written extensively about in my previous blog so I won’t delve into the details of the code, but suffice to say it simply prints out “Evil\n”:
The executable has the following opcode:
Notice that there are, as in previous examples, two syscalls: the write to stdout file descriptor and the exit (process). Without this last one, the process breaks (rip is incremented out of the .text memory section and tries to execute data in memory where it has no permissions to execute) and you’ll see an error being shown.
This is interesting because, as you’ll see ahead, the skeleton itself doesn’t properly exit the process, so it should crash. However, it doesn’t actually crash, because the actual code it’ll be executing does exit properly.
Success!! And the disassemblers will simply show the skeleton and not the hidden code:
Now while this specific example is very easy to recognize as an anti-disassembly technique (a first mov rax,… then a jmp -x, and a never-ending sequence of mov rax,…), you have to consider its flexibility. If you spread the movs further (even though no longer than 256 bytes as per the relative jump: “jmp short”) and place other code (that will simply never be executed) in between, you can make this look a lot like something else completely benign, which could be a huge advantage in hiding the real code. Another advantage is the fact that this is a pain to manually instruct the disassembler on how to interpret the code/data. So you’d have to end up writing some plugin to help you if the hidden code is large (albeit shellcode), which could be very tricky if you think about the fact that you’ll have to automate the distinction between real “mov rax,…” instructions and the “wrappers”.
Also, you can choose longer instructions as wrappers, which will give you more space, per line/instruction to fit in your hidden code.
So, there you go. I hope you found it as interesting as I did.