How to “de-obfuscate” Jim Hague’s IOCCC winner program

Laura Roudge
10 min read · Mar 12, 2019

The IOCCC, the International Obfuscated C Code Contest, is a competition whose goal is to write the most obscure C program. In 1986, Jim Hague won the contest with his (almost) undecipherable Morse code translator. Today, we are going to try to understand how his program works, step by step. Here’s what it looks like:

#define DIT (
#define DAH )
#define __DAH ++
#define DITDAH *
#define DAHDIT for
#define DIT_DAH malloc
#define DAH_DIT gets
#define _DAHDIT char
_DAHDIT _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:"
;main DIT DAH{_DAHDIT
DITDAH _DIT,DITDAH DAH_,DITDAH DIT_,
DITDAH _DIT_,DITDAH DIT_DAH DIT
DAH,DITDAH DAH_DIT DIT DAH;DAHDIT
DIT _DIT=DIT_DAH DIT 81 DAH,DIT_=_DIT
__DAH;_DIT==DAH_DIT DIT _DIT DAH;__DIT
DIT'\n'DAH DAH DAHDIT DIT DAH_=_DIT;DITDAH
DAH_;__DIT DIT DITDAH
_DIT_?_DAH DIT DITDAH DIT_ DAH:'?'DAH,__DIT
DIT' 'DAH,DAH_ __DAH DAH DAHDIT DIT
DITDAH DIT_=2,_DIT_=_DAH_; DITDAH _DIT_&&DIT
DITDAH _DIT_!=DIT DITDAH DAH_>='a'? DITDAH
DAH_&223:DITDAH DAH_ DAH DAH; DIT
DITDAH DIT_ DAH __DAH,_DIT_ __DAH DAH
DITDAH DIT_+= DIT DITDAH _DIT_>='a'? DITDAH _DIT_-'a':0
DAH;}_DAH DIT DIT_ DAH{ __DIT DIT
DIT_>3?_DAH DIT DIT_>>1 DAH:'\0'DAH;return
DIT_&1?'-':'.';}__DIT DIT DIT_ DAH _DAHDIT
DIT_;{DIT void DAH write DIT 1,&DIT_,1 DAH;}

Pretty unreadable, isn’t it? And yet, so beautiful…

First things first: what happens when we try to compile it with gcc? We see a lot of errors and warnings popping up, mostly about undeclared variables and implicit declarations. The program was written in 1986, before C was standardized, so it uses old-style declarations (functions without a return type, parameters declared after the parameter list) that modern compilers complain about. Let’s take a closer look.

Macros, macros everywhere

One of the first things we can see in the program is a long list of “#define” preprocessor directives. The preprocessor is the part of the compiler that gets rid of comments, includes header files and expands macros. What is a macro? It’s an identifier that will be replaced by a piece of code. In our program, for example, we have #define DIT (, which means that every time the macro DIT appears later in the program, it will be replaced by the character ‘(’. We can see there are A LOT of macros in this program, and that’s what makes it messy.
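To make the replacement mechanism concrete, here is a minimal sketch that reuses three of the program’s macros on a tiny function. The function sum_below() is made up for this illustration and is not part of Jim Hague’s program.

```c
#define DIT (
#define DAH )
#define DAHDIT for

/* After preprocessing, the header reads: int sum_below ( int n )
   and the loop reads: for ( i = 0; i < n; i++ ) */
int sum_below DIT int n DAH
{
    int i, total = 0;

    DAHDIT DIT i = 0; i < n; i++ DAH
        total += i;
    return total;
}
```

Calling sum_below(5) adds 0 + 1 + 2 + 3 + 4. If you want to see what the compiler really receives, gcc’s -E option stops after the preprocessing step and prints the expanded text.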

I will put my de-obfuscated version of the code here, and explain each step that led to this transformation.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#define DIT (
#define DAH )
#define __DAH ++
#define DITDAH *
#define DAHDIT for
#define DIT_DAH malloc
#define DAH_DIT gets
#define _DAHDIT char
char morse[] = "ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";
char translate(int c);
int _putchar(char c);
int main(void)
{
    char *string, *c, *next, *morsecpy, *gets(char *);
    /* read the input line by line */
    for (string = malloc(81), next = string++; gets(string); _putchar('\n'))
    {
        /* go through the line character by character */
        for (c = string; *c; _putchar(*morsecpy ? translate(*next) : '?'), _putchar(' '), c++)
        {
            /* search morse[] for the character, counting in *next */
            for (*next = 2, morsecpy = morse; *morsecpy && (*morsecpy != (*c >= 'a' ? *c & 223 : *c)); (*next)++, morsecpy++)
            {
                if (*morsecpy >= 'a')
                    *next += *morsecpy - 'a';
                else
                    *next += 0;
            }
        }
    }
    return (0);
}
char translate(int c)
{
    if (c > 3)
        _putchar(translate(c >> 1));
    else
        _putchar('\0');
    if (c & 1)
        return ('-');
    else
        return ('.');
}
int _putchar(char c)
{
    return (write(1, &c, 1));
}

Step 1: replacing the macros

In order to understand the logic of the code, I chose to replace every macro I saw with its value, get rid of the unnecessary white space and indent the code where I felt it needed it. Thanks to the macro DAHDIT that has the value “for”, we can already see three nested for loops. In C, a for loop is a piece of code that is executed repeatedly until its condition becomes false, and a nested loop is just a loop within a loop. We can also see {} pairs that indicate that we have several functions: main(), which is our entry point, the code that the program will run first, __DIT() and _DAH().

Step 2: replacing the names of the functions and variables

Inside and outside the functions, we can see variables being declared, like _DAH_[], which is a character array, or _DIT. For more clarity I decided to give these variables meaningful names and to rename the functions we already have.

  • _DIT will become string
  • DAH_ will become c
  • DIT_ will become next
  • _DAH_[] will become morse
  • _DIT_ will become morsecpy

Note that thanks to the macro _DAHDIT (char) followed by DITDAH (*), we know that the variables declared inside main() are of type char *, meaning pointer to a char, while _DAH_[] is an array of char.

After analyzing the behavior of the functions a bit, I saw that __DIT() does the same thing as the library function putchar(), which prints a character to the standard output, our terminal window. Therefore, I decided to rename it _putchar(). Then I changed the name of the function _DAH() to translate(); I’ll explain why in the next steps.

Step 3: the role of the translate() function

char translate(int c)
{
    if (c > 3)
        _putchar(translate(c >> 1));
    else
        _putchar('\0');
    if (c & 1)
        return ('-');
    else
        return ('.');
}

From what I could gather, the function needs a parameter of type int, because it is compared to an int, so I decided to pass it the parameter c. The function is recursive, meaning that it calls itself and builds a stack of calls with a different value each time until it reaches a base case. Here, the base case is c being less than or equal to 3: in that case it prints a null character (‘\0’), the character that marks the end of strings in C, which normally shows nothing on a terminal.

We can already see two bitwise operators in the function (& and >>), so the binary value of c must matter. 3 is the largest number that fits in 2 bits (11). But wait, isn’t Morse code made of two symbols, just like binary? Yes it is! Now it starts to make sense why we deal with binary values: each bit of c stands for one Morse symbol, 1 for ‘-’ and 0 for ‘.’, and the leftmost 1 is only a marker that says where the code starts. For example, 5 is 101 in binary: drop the leading 1 and you get 01, which is ‘.-’, the letter A.

Back to our translate() function. If c is above 3, meaning c has more than two bits, we print whatever translate() returns when we pass it c shifted by 1 bit to the right (that’s what the >> operator does), and so on until the value is 3 or less, where only the marker and one symbol are left. Because each call prints the result of the call below it before returning its own symbol, the symbols come out from left to right.

But what about the second part? That’s where the real translation happens: c & 1 keeps only the lowest bit of c. If that bit is 1 the function returns ‘-’, otherwise it returns ‘.’. The fact that these return statements return characters is what allowed me to determine the return type of translate(), which is char.
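To watch the recursion without a terminal in the way, here is a small sketch of the same idea that stores the symbols in a buffer instead of printing them. The name counter_to_morse() and the buffer parameter are my own additions for this illustration; the original function prints with write() and returns only the last symbol.

```c
#include <stddef.h>

/* Writes the Morse symbols encoded in counter into out, adds a
   terminating '\0', and returns the number of symbols written.
   The highest set bit of counter is only a start marker. */
size_t counter_to_morse(int counter, char *out)
{
    size_t n = 0;

    if (counter > 3)                          /* more than marker + one symbol left */
        n = counter_to_morse(counter >> 1, out);
    out[n++] = (counter & 1) ? '-' : '.';     /* lowest bit: 1 is a dash, 0 is a dot */
    out[n] = '\0';
    return n;
}
```

With a buffer of a few bytes, counter_to_morse(3, buf) leaves “-” in buf, counter_to_morse(5, buf) leaves “.-” and counter_to_morse(16, buf) leaves “....”.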

This part is tricky to understand, but it’s the key to our morse translation. That’s why I chose the name “translate” for this function.

Step 4: tweaking our _putchar() function

int _putchar(char c)
{
    return (write(1, &c, 1));
}

Just for clarity, I thought it was wiser to make our _putchar() function look like the real one from the standard library. So I added a return type, int, and a return statement in which we will put the system call “write” given by the original program. I also added a parameter c of type char, which is the character we want to print.

Step 5: the main() function, explained

This is by far the hardest part of the code to understand, so I will do my best to explain it. I’ll break down each nested loop and explain what every variable is for.

  • morse[]: string that contains all the characters we are going to compare our input string with
  • *string: pointer to the string we will get from the standard input, aka the keyboard, that we will loop through, character by character
  • *c: pointer to the current character in the input string
  • *next: pointer to a one-byte counter stored just before the input string; it ends up holding the number that translate() turns into dots and dashes
  • *morsecpy: pointer that walks through morse[] one character at a time (morse[] itself is never modified)
  • *gets: this is the standard library function that gets a string from the standard input

For each for loop, I will break it down into three parts: the initialization, which sets the value of a variable and is our starting point; the condition, which is checked before every pass and keeps the loop going as long as it is true; and the iteration, which is an action that happens at the end of every pass through the loop. The three parts are separated by ‘;’ and they are all inside parentheses, but each part can contain several actions, separated by a ‘,’.

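But why so many loops? I was also wondering this. The first loop handles one line of input at a time, the second one handles each character of that line, and the third one searches morse[] for that character. The dots and dashes of a single letter (e.g. “H” is “....”) are not produced by a loop at all but by the recursion in translate().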

The first loop

for (string = malloc(81), next = string++; gets(string); _putchar('\n'))
  1. Initialization: we dynamically allocate 81 bytes of memory. Because string++ is a post-increment, next gets the address of the first byte and string then moves on to the second byte: the first byte is kept for the counter, and the other 80 are for the input string
  2. Condition: gets() returns NULL when there is nothing left to read, so we loop as long as a new line of input was read
  3. Iteration: we print a new line every time

This loop reads a line, runs the code below on it, and then prints a new line, so the program is ready to take the next string from the standard input.
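The part of this loop that is easiest to misread is next = string++. Here is a small sketch of what that expression does to the two pointers; the function name counter_precedes_string() is invented for this illustration. Since string++ is a post-increment, next receives the address returned by malloc() and string ends up one byte further.

```c
#include <stdlib.h>

/* Returns 1 if, after next = string++, next points one byte before
   string; returns -1 if the allocation fails. */
int counter_precedes_string(void)
{
    char *string = malloc(81);
    char *next;
    int result;

    if (string == NULL)
        return -1;
    next = string++;      /* next keeps the old address, string moves on by one */
    *next = 2;            /* byte 0: the counter used by the third loop */
    string[0] = 'T';      /* bytes 1 to 80: the input line and its '\0' */
    string[1] = '\0';
    result = (string - next == 1) && (*next == 2);
    free(next);           /* free the address malloc() returned, not string */
    return result;
}
```

This also shows why the program asks for 81 bytes: one for the counter and 80 for the input line.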

The second loop

for (c = string; *c; _putchar(*morsecpy ? translate(*next) : '?'), _putchar(' '), c++)
  1. Initialization: we point c at the first character of the input string
  2. Condition: we loop until c reaches the ‘\0’ character at the end of the input string
  3. Iteration: here we have a ternary operator inside the call to _putchar, which works like an if/else conditional statement. Is the current character in morsecpy different from ‘\0’, meaning the search in the third loop found a match before the end of morse? If yes, call translate() with *next as its parameter: it prints all the symbols but the last one, and we print the symbol it returns (reminder: ‘-’ or ‘.’); if not, print a ‘?’. Then we have another action that simply prints a space, which separates the Morse letters, and then we move on to the next character in the input string

This loop will be the one that prints the morse symbols for each letter. This loop will run entirely for every iteration of the previous loop.

The last nested loop

for (*next = 2, morsecpy = morse; *morsecpy && (*morsecpy != (*c >= 'a' ? *c & 223 : *c)); (*next)++, morsecpy++)
  1. Initialization: we set the counter *next to 2, and we point morsecpy at the first character of morse
  2. Condition: we loop while we have not reached the end of morse AND (&&) while *morsecpy is different from the result of another ternary operator. Is *c ≥ ‘a’, aka is it lowercase (all characters are actually numerical values, cf. the ASCII table)? If yes, the result is *c & 223; if not, it’s just *c. The expression “*c & 223” does a bitwise AND on *c and 223, and with the help of an online compiler I found that it converts the character to its uppercase value! (223 is 11011111 in binary, so the AND clears the one bit, worth 32, that separates a lowercase ASCII letter from its uppercase version.) That means that the second part of the condition is *morsecpy being different from *c, or from its uppercase value if *c is lowercase. In essence, the loop continues as long as *c doesn’t match the current character in morse[]
  3. Iteration: increase the counter *next by one and move to the next character in morse

This loop will run entirely for every iteration of the previous loop.
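The uppercase trick from the condition can be tried on its own. This is a minimal sketch that assumes ASCII; fold_case() is a name I picked for the illustration, the original keeps the expression inline.

```c
/* 223 is 0xDF: every bit set except 0x20, the bit that separates
   'a' (0x61) from 'A' (0x41) in ASCII. */
int fold_case(int ch)
{
    return ch >= 'a' ? ch & 223 : ch;
}
```

fold_case('h') gives ‘H’, while ‘H’ and ‘,’ come back unchanged because they are below ‘a’. Note that the test is only “at or above ‘a’”, so a character such as ‘{’ is changed too (it becomes ‘[’); that is harmless here because neither is in morse[].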

The code inside the loops

if (*morsecpy >= 'a')
    *next += *morsecpy - 'a';
else
    *next += 0;

Here, I modified the ternary operator to an if/else for clarity. This code is executed for every iteration of the last loop.

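So what happens here is that if the character in morsecpy is lowercase, its distance from ‘a’ (*morsecpy - ‘a’) is added to the counter *next; otherwise nothing is added. The lowercase letters in morse[] are not there to be matched (the input is converted to uppercase before the comparison); they stand for dot/dash combinations that have no character in the table, and they make the counter jump over those combinations.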
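Put together, the third loop and its body are a table search that counts as it goes. Here is a sketch of that search as a standalone function; lookup_counter() and the return value 0 for “not found” are assumptions made for this illustration, since the original keeps the counter in *next and signals “not found” by leaving morsecpy on the final ‘\0’.

```c
static const char morse_table[] =
    "ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

/* Returns the counter value for ch, or 0 if ch is not in the table. */
int lookup_counter(int ch)
{
    int counter = 2;                  /* the first entry, 'E', has counter 2 */
    const char *p;

    if (ch >= 'a')
        ch &= 223;                    /* uppercase, as in the loop condition */
    for (p = morse_table; *p && *p != ch; p++, counter++)
        if (*p >= 'a')
            counter += *p - 'a';      /* lowercase entry: jump over unused codes */
    return *p ? counter : 0;
}
```

For instance, lookup_counter('T') is 3, lookup_counter('h') is 16 (10000 in binary, so four dots after the marker) and lookup_counter(',') is 115 (1110011 in binary, “--..--”). A space is not in the table, so it gives 0.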

Example:

I think it will be easier to understand with an example. Let’s take the letter ‘T’, and say it is the first letter of our input string. In the second loop, c is set to point at ‘T’. The third loop then searches morse: *next starts at 2, ‘E’ doesn’t match so *next becomes 3, and ‘T’ matches, so the loop stops. Back in the second loop, we _putchar (we print) the result of the ternary operation: we haven’t reached the end of morse, so we print the return value of the call to translate(*next), that is translate(3). translate(3) prints a ‘\0’ character (3 is not above 3), and since 3 & 1 = 1, it returns ‘-’. So the loop prints ‘-’, which is the correct Morse translation of ‘T’.

Now if we compile the program and name it “f”, we can run it by entering “./f” in the command line and pressing enter. The program waits for us to input a string (remember our gets()). So I can input something like Hello, Holberton and the result will be:

.... . .-.. .-.. --- --..-- ? .... --- .-.. -... . .-. - --- -.

We can check with an online Morse code translator, and it works!! (The ‘?’ in the middle is the space between the two words, which is not in morse[].)
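For a check that doesn’t depend on a website, here is a sketch that encodes a whole line into a buffer so the result can be compared with the output above. encode_char() and encode_line() are names invented for this illustration, and the symbols are produced with a plain loop over the bits instead of the original’s recursion. The caller is assumed to provide an output buffer of at least 8 bytes per input character plus one.

```c
#include <stddef.h>

static const char table[] =
    "ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

/* Appends the Morse form of one character to out (or '?' if the
   character is not in the table); returns the number of chars written. */
static size_t encode_char(int ch, char *out)
{
    int counter = 2, bit;
    const char *p;
    size_t n = 0;

    if (ch >= 'a')
        ch &= 223;                          /* same uppercase trick */
    for (p = table; *p && *p != ch; p++, counter++)
        if (*p >= 'a')
            counter += *p - 'a';            /* lowercase entries are gaps */
    if (!*p) {
        out[n++] = '?';
        return n;
    }
    for (bit = 1; bit * 2 <= counter; bit *= 2)
        ;                                   /* find the marker (highest set bit) */
    for (bit /= 2; bit > 0; bit /= 2)       /* then emit the bits below it */
        out[n++] = (counter & bit) ? '-' : '.';
    return n;
}

/* Encodes a whole line; each letter is followed by one space. */
size_t encode_line(const char *in, char *out)
{
    size_t n = 0;

    for (; *in; in++) {
        n += encode_char(*in, out + n);
        out[n++] = ' ';
    }
    out[n] = '\0';
    return n;
}
```

Passing “Hello, Holberton” to encode_line() fills the buffer with the same sequence as above, followed by one trailing space, because a space is written after every letter, including the last one.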

Don’t forget the standard libraries

Before, we didn’t include any standard library headers in our file, even though we were using functions declared in them, such as gets() or malloc(). The compiler was really not happy about it, so in order for it to compile, I included them at the top of the file:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
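One caveat about gets(): it was removed from the C standard in C11 because it cannot limit how much it reads, so a recent toolchain may still refuse it even with the headers in place. Here is a sketch of a bounded stand-in built on fgets(); read_line() is a hypothetical helper, not something the original program or my version uses.

```c
#include <stdio.h>
#include <string.h>

/* Reads one line into buf (at most size - 1 characters), strips the
   trailing newline, and returns buf, or NULL when there is no more input. */
char *read_line(char *buf, int size, FILE *in)
{
    if (fgets(buf, size, in) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* fgets() keeps the '\n', gets() does not */
    return buf;
}
```

With this helper, the first loop’s condition could become read_line(string, 80, stdin). Unlike gets(), a line longer than the buffer is then split over several reads instead of overflowing it.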

Little modifications for clarity

I also allowed myself to add {} when I felt like it needed it, and I added small things like a return type (int) and a return value (0 for success) to the main function. This is just to make things easier for me to read and to compile.

Laura Roudge

Software Engineer at Deezer in Paris, former student at Holberton School in San Francisco, always striving to build a better world.