Deobfuscating obfuscated code for fun and no profit
This article is meant to be the first in a series on figuring out what obfuscated code does. The reasons for this are pretty simple:
- I’m masochistic
- Obfuscated code is pretty interesting to me
A lot of people, when they see obfuscated code, go “what the hell”, maybe followed by “cool” if they bother to compile and run it. I think obfuscated code deserves more attention: it’s an art unto itself, and it’s worth the effort of figuring out how these things work. To that end, I’m going to draw from winning IOCCC entries (IOCCC = International Obfuscated C Code Contest) and explain in detail how they work. I should note that I’m by no means a C expert; I mainly poke the program in different ways until I figure out why it does what it does. If you find anything incorrect in my analysis or want to clarify something further, I’d be happy to hear from you! Finally, these posts might be a bit long, since I’m trying to clarify everything involved in making the chosen programs work.
With all that in mind let’s begin!
Part 1: The Simple Stuff
First we’re going to look at “anonymous.c” from 1984. It is reproduced in full below:
int i;main(){for(;i["]<i;++i){--i;}"];read('-'-'-',i+++"hell\
o, world!\n",'/'/'/'));}read(j,i,p){write(j/p+p,i---j,i/i);}
I compiled the above on Ubuntu 16.04 using cc --std=c89 -w <filename> and it ran fine. If you couldn’t guess from the one English string in there, it prints “hello, world!”. So, how does that happen? First things first: let’s consider whether the preprocessor is doing anything here. It may look like it isn’t, but it actually is. The backslash at the end of line one is called a “backslash newline”, and it is a signal to merge the next line onto the first, effectively turning the program into one long line. (This splice happens very early in translation, before the source is even split into tokens, which is why it works in the middle of a string literal.) After considering what the preprocessor does I like to “clean up” the code a bit, so here is my attempt at that (note: medium broke the one long line, not me):
int i;main(){
for(;i["]<i;++i){--i;}"];read('-'-'-',i+++"hello, world!\n",'/'/'/'));
}read(j,i,p){
write(j/p+p,i---j,i/i);
}
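As an aside on the backslash-newline described above, here is a minimal sketch of the splice at work inside a string literal. The names spliced and demo_line_splice are made up for illustration and are not part of the IOCCC entry.

```c
#include <assert.h>
#include <string.h>

/* The backslash-newline pair is removed before the compiler tokenizes
   the source, so the two physical lines below form one string literal.
   The continuation line must start in column 0, or its leading
   whitespace would end up inside the string. */
const char spliced[] = "hell\
o, world!\n";

/* Hypothetical helper: checks that the splice left no gap or newline. */
void demo_line_splice(void)
{
    assert(strcmp(spliced, "hello, world!\n") == 0);
    assert(strlen(spliced) == 14);
}
```

Calling demo_line_splice() from a main function should pass silently, which is consistent with the original program printing an unbroken “hello, world!”.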
If you compile the cleaned-up version in the same manner as the first time, you’ll get the same result. Cool. At this point, I ran strace on the program and discovered that the characters were being printed to the screen one by one, each with its own write call.
So the for-loop is running, and on each iteration it calls the “read” function, which in turn calls the “write” system call. Note that this read is the program’s own function, not the usual read system call. From what I can tell it’s a K&R-style function, which can be declared without types (the types just default to int). Now, about the call itself: it has 3 arguments. We’ll focus on the first and third and come back to the second later. The first one is '-'-'-', and this is actually pretty simple. Character constants in C have type int, so this is just integer subtraction, and since the value is being subtracted from itself this gives us 0. Similarly, '/'/'/' is '/' divided by itself, giving 1. This lets us simplify the call into: read(0, i+++"hello, world!\n", 1). Compiling and running once again tells us nothing has changed.
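The two constant arguments can be checked in isolation. This is a small sketch; first_arg, third_arg and demo_char_arithmetic are hypothetical names chosen for this example.

```c
#include <assert.h>

/* Character constants have type int in C, so these are plain integer
   expressions. The results do not depend on the character encoding,
   because each operand is combined with itself. */
int first_arg(void) { return '-' - '-'; }   /* x - x */
int third_arg(void) { return '/' / '/'; }   /* x / x, with x non-zero */

/* Hypothetical self-check for the two values derived in the text. */
void demo_char_arithmetic(void)
{
    assert(first_arg() == 0);
    assert(third_arg() == 1);
}
```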
Since the function is being called with two constant values, we can probably eliminate some variables, so let’s do that. read is defined as:
read(j,i,p){
write(j/p+p,i---j,i/i);
}
We know j is 0 and p is 1. The first argument to write is j/p+p, or 0/1+1. In C this evaluates to 1: division binds tighter than addition, so it does 0/1 first (giving 0) and then adds 1. The third argument to write is i/i, and although we don’t have a clear understanding of the second argument yet, we can safely assume the result of this will be 1 (as long as i is non-zero). Speaking of j, notice it’s hanging around in the second argument to write, i.e. i---j. The compiler reads this as (i--) - j, because it always grabs the longest token it can. It would be nice to get rid of j, and because j is just 0 we can remove it with no issue: anything minus 0 is just itself, after all. Knowing all of this, we can simplify the read / write function into:
read(i){
write(1,i--,1);
}
We adjust the read call in the main for-loop accordingly. Once again, compiling and running gives us the same result.
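To double-check the simplification, the two expressions can be pulled out into ordinary functions. This is only a sketch: fd_arg, buf_arg and demo_simplify are made-up names, and plain ints stand in for the values the real program passes around.

```c
#include <assert.h>

/* Same expression as the first argument to write: j/p+p. */
int fd_arg(int j, int p) { return j / p + p; }

/* Same expression as the second argument: i---j. The lexer takes the
   longest token first, so this is (i--) - j, never i - (--j). */
int buf_arg(int i, int j) { return i---j; }

void demo_simplify(void)
{
    assert(fd_arg(0, 1) == 1);      /* 0/1 + 1 */
    assert(buf_arg(42, 0) == 42);   /* old value of i, minus 0 */
    assert(buf_arg(42, 2) == 40);   /* i - (--j) would have given 41 */
}
```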
Now would be a good time to explain the write system call for those of you unfamiliar with it. According to the man pages, write takes 3 arguments: a file descriptor, a pointer to a buffer, and the number of bytes to be written from that buffer. If the first argument, the file descriptor, is 1, then it writes to standard output. So in our case, we’re writing 1 byte to standard output from whatever is at i--. Speaking of i--, it is a post-decrement, so write is called with the value i had before the decrement. Since i is local to read (shadowing the global i) and is never used again after the call, the decremented value is effectively tossed out, and we can remove the decrement with no change to our program. Now, moving on to the more arcane bits.
Part 2: The Tricky Stuff
Before we go further, this is what my code currently looks like:
int i;main(){
for(;i["]<i;++i){--i;}"];read(i+++"hello, world!\n"));
}read(i){
write(1,i,1);
}
Less confusing than the original, but still a bit weird. There isn’t much left to do though so let’s get to it!
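As a bit of a sidenote, the global variable i is set to 0 when the program runs. That’s just how C do: variables with static storage duration, which includes globals, are zero-initialized when they have no explicit initializer. (I couldn’t think of a good place to shoehorn in this fact but it’s worth keeping in mind.)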
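A tiny sketch of that zero-initialization rule, using made-up names (counter, static_local_start, demo_zero_init):

```c
#include <assert.h>

int counter;    /* file scope, no initializer: starts at 0 */

/* Hypothetical helper: a static local gets the same guarantee.
   An ordinary (automatic) local would not. */
int static_local_start(void)
{
    static int calls;   /* also zero-initialized */
    return calls++;
}

void demo_zero_init(void)
{
    assert(counter == 0);
    assert(static_local_start() == 0);
    assert(static_local_start() == 1);
}
```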
The first thing I want to explore is i["]<i;++i){--i;}"], which is the condition that terminates the for loop (lol what). Okay, so how the hell does that terminate the loop? Well, apparently because C is insane you can index an array backwards. Yes, you can do something like 4["12345"] and actually get '5'. The reason for this has to do with the relationship arrays and pointers have in C. Basically, when you access an array like arr[4], it’s the same as doing *(arr + 4), and since addition is commutative, C thinks it’s totally cool to do this backwards, as 4[arr]. (Note: I’m aware arrays and pointers have subtle differences but I don’t want to get into it here.)
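The backwards indexing is easy to verify on its own. In this sketch, reversed_index and demo_index are hypothetical names:

```c
#include <assert.h>

/* Index written on the outside, array on the inside. */
char reversed_index(void) { return 4["12345"]; }

void demo_index(void)
{
    const char arr[] = "12345";
    assert(reversed_index() == '5');
    assert(arr[4] == *(arr + 4));   /* subscripting is pointer arithmetic */
    assert(arr[4] == 4[arr]);       /* so the operands can swap places */
}
```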
Now, going back to the actual expression: to terminate the for-loop we need a false condition, i.e. a value of 0. (Every non-0 value in C is considered to be ‘true’.) As the loop runs, i walks through the string "]<i;++i){--i;}" one character at a time. Each character evaluates to whatever int value it has, which is non-zero, so the loop continues. What about when we reach the end of the string, though? Well, remember that strings in C are null-terminated, so there is an implicit null byte at the end of this string. That 0 is enough to stop the loop.
If you didn’t guess it by now, the contents of that string don’t matter: you can have i["11111111111111"] and it still works the same. The code-like contents are a distraction made to look meaningful. The only important thing is that this string is the same length as the hello, world! string (14 characters, counting the \n).
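To see the string acting as nothing more than a loop counter, here is a sketch with the same loop shape. The function names count_iterations and demo_loop are assumptions for this example.

```c
#include <assert.h>
#include <string.h>

/* Counts iterations of a loop shaped like the one in the entry: the
   condition is the character at index n, and the terminating '\0'
   is what finally makes it false. */
int count_iterations(const char *decoy)
{
    int n;
    for (n = 0; n[decoy]; ++n) {
        /* the real program prints one character here */
    }
    return n;
}

void demo_loop(void)
{
    assert(count_iterations("]<i;++i){--i;}") == 14);
    assert(count_iterations("11111111111111") == 14);
    assert(strlen("hello, world!\n") == 14);
}
```

A shorter decoy string would cut the message off early, and a longer one would make the loop read past the end of the hello, world! literal.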
So now the final mystery to uncover: how does i++ + "hello, world!\n" know to pass in only 1 character at a time to read? To solve this I ran things in gdb to see what values the read function was getting. (Sidenote: i++ is a post-increment, so the addition uses the old value of i and i is incremented afterwards.)
With a breakpoint on line 8 (where we call write), I looked at the value of i that read receives. It took me a bit of time, but eventually I realized that value is a pointer address. The value I saw was 4195892, which in hex is 0x400634 (the exact address will differ between builds and machines). Examining the memory at that address (plus an additional 14 bytes) and reading those bytes as characters shows the letters of hello, world! laid out one after another.
Yep, it’s our hello world string directly in memory. Crazy. So it turns out that when you add an integer to a string literal, you are adding to the pointer that points to the first character in that string literal. Since write accepts a pointer to some chunk of memory (regardless of type), it’s totally cool with this and will just write out however many bytes you ask for, starting from that memory address.
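Putting the pieces together, here is a plain-C sketch of what the program ends up doing. It is my reconstruction rather than anything from the contest entry: nth_char and print_hello are made-up names, it uses real pointer types instead of the original’s int-for-everything K&R style, and it assumes a POSIX system for write.

```c
#include <assert.h>
#include <unistd.h>

/* Adding an int to a string literal offsets the pointer to its first
   character, so the result points at the i-th character. */
const char *nth_char(int i) { return "hello, world!\n" + i; }

/* Writes the message to fd one byte at a time, using the decoy string
   only as a 14-step loop counter. Returns the number of bytes written,
   or -1 on a failed write. */
int print_hello(int fd)
{
    int i;
    for (i = 0; i["]<i;++i){--i;}"]; ++i) {
        if (write(fd, nth_char(i), 1) != 1)
            return -1;
    }
    return i;
}

void demo_pointer_offsets(void)
{
    assert(*nth_char(0) == 'h');
    assert(*nth_char(7) == 'w');
    assert(*nth_char(13) == '\n');
}
```

Calling print_hello(1) from main should print the same line as the original, with one write per character.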
I think that about covers it! I hope you had fun reading this and learning about this program. This one was from 1984 and is actually one of the easier ones I’ve looked at — stay tuned for further dissections.