How To Start Reverse Engineering — A Guide
As the name suggests, reverse engineering means working backward from a finished product to find out how it works. In computer science, it means finding out how a program works. It comes in handy in the real world for various reasons, from recreating software with enhancements or simply understanding how something works to more professional uses like malware analysis.
The Ground Basics
Now that the idea of reverse engineering is covered, let’s briefly cover some concepts.
The first step in understanding reverse engineering is knowing how programs work. Much of the software we use daily is written in compiled languages (C, C++). Other languages are interpreted (Python, JavaScript) or use slightly different mechanisms (like Java).
Here, we are going to focus only on reversing compiled programs.
Compilers
A compiler is a program that converts code written in a higher-level language into lower-level machine instructions. This bundle of instructions (called an executable binary, an executable, or simply a binary) can then be executed by the computer.
With reverse engineering, our goal is to understand how the binary works.
For example, a very simple snippet of C code might look like the following:
int a = 12, b = 5;
int c = a + b;
This snippet might get translated to x86-64 assembly in the following way:
push 12
push 5
mov rax, [rsp + 0x8]
mov rbx, [rsp]
add rax, rbx
push rax
When this assembly is further assembled into machine instructions (AMD64 Linux), it might look like the following (hexadecimal representation of a sequence of bytes):
0x6A 0x0C 0x6A 0x05 0x48 0x8B 0x44 0x24 0x08 0x48 0x8B 0x1C 0x24 0x48 0x01 0xD8 0x50
If you open any binary in a hex editor, the sequence of bytes represented as hexadecimal would look pretty similar.
Some Useful Tools
Now imagine that you are given an executable. Viewing the hex dump alone wouldn’t take us anywhere.
As seen in the above pictures, even a simple Hello World program compiles to far more bytes than its source suggests. For instance, this simple 7-line program had 997 lines in its hex dump!
Disassemblers
To make things more readable, we use disassemblers. Just as assemblers convert assembly code to machine instructions, disassemblers do the opposite.
GDB is one of the oldest and most powerful tools that can be used as a disassembler. However, GDB is primarily a terminal-based program, and without a thorough knowledge of its commands it can be hard to use. Tools like Ghidra, IDA, and Binary Ninja make working with binaries more interactive and easier.
Now, this is certainly better than trying to outright read binary. However, reading assembly code directly can still be challenging and confusing, especially for more complicated programs.
Decompilers
Enter decompilers. Again, going by the name you might’ve guessed what they do. They take the assembly generated by disassemblers and output readable (well, most of the time) code in a desired language.
Tools like Ghidra and IDA come with their own decompilers, which are very capable. However, one important thing to note is that, unlike disassemblers, which produce an accurate one-to-one translation of machine code to assembly, decompilers may not produce a full translation of the assembly.
Decompiling is a much harder process, especially as the source code gets more complex (code involving classes and objects, for instance). The resulting code should be treated only as pseudo-code.
A Review of a CTF Rev Challenge
CTF (Capture The Flag) competitions are a great way to ethically practice your cybersecurity skills. The competition includes challenges from various categories of cybersecurity. Each challenge involves unearthing a “flag”, which is usually a string in a particular format, for example, flag{...}. The contents inside the curly brackets are of course unique to the challenge.
Reverse engineering is a very popular category and is part of most CTFs.
CSAW CTF 2023 / Rebug 2
CSAW CTF, which was hosted around mid-September, had some beginner-friendly challenges. Rebug 2 is one such challenge from the reverse category.
The challenge description was the following
After downloading the file, the first thing I did (and what is usually done) was to run the file command.
This gives us some very basic but useful information about the file. We can mainly infer that the file is a 64-bit ELF executable, along with a few other details.
The next step is to run the strings command.
Running it on this binary doesn’t give us anything useful, but sometimes you can find crucial leads using the strings command.
There are some other basic Linux commands like strace and ltrace that you can run.
Decompilation
I opened the binary using Ghidra. After running the initial analysis that Ghidra does, I started looking at the decompilation of the main function.
The code might not look like regular C, but upon close inspection, it’s quite the same.
There’s a single function call in the main function, and it is obviously of interest. Double-clicking on the function takes us to its decompilation.
Again, there’s a bunch of code that looks almost like gibberish. But yet again, there’s a function call of interest, this time to xoring(). Navigating to xoring() gives the following.
What immediately caught my eye was the array called flag. Remember that the challenge description also mentions that we need to find the flag in the binary.
It’s pretty clear from the code around the variable that this function is “creating” the flag.
Our goal at this point is to obtain the contents of flag. A point to note is that within the main function, printbinchar() is called inside a for loop. Hence, to get the full value of the flag, we need to obtain the contents of the array after control breaks out of the loop.
Dynamic Analysis
Until this point, whatever we had been doing was static analysis. Static analysis is mainly used to figure out the logical flow of a program.
Now that we know how the code works, and need to find out the contents of its memory during run-time, we need to switch to dynamic analysis.
There are various tools for dynamic analysis, the most basic albeit powerful one again being GDB. However, because of the aforementioned reasons, GDB alone can prove to be a difficult tool to work with.
Extensions such as pwndbg, PEDA, or GEF can be added to GDB to make it friendlier. Another alternative is Radare2.
We launch Radare2 using the following command
The two arguments used are -d for debugging mode and -AA for a complete analysis of the binary.
Final Bit
To figure out the contents of the flag variable, we need to follow three simple steps:
- Find the memory address of flag.
- Halt the program right after it exits the loop in main().
- View the contents of flag from memory using the previously found address.
1. Memory Address
Fortunately, in the case of this binary, we don’t need to track down the memory address of flag ourselves. If we look at the assembly instructions that correspond to the array operations, we can see that the memory address of flag was loaded into the RDX register. If we also observe all the code following this, RDX isn’t used anywhere else. This means that instead of keeping track of the address of flag, we can get it at any point by checking the RDX register.
2. Halt Execution
One of the best features of dynamic analysis is setting breakpoints. Breakpoints allow you to halt the execution of code at almost any point and then step through the following instructions manually.
To set the breakpoint, you need to access the main() function in Radare2. You can do this by typing s main to reach the main() function, followed by pdf to disassemble it. The name of the main function in Radare2 is mostly just ‘main’.
Since the for loop is the last piece of code in the main function, I set a breakpoint right before the program exits.
db is used to set a breakpoint. It requires the address of the instruction where the breakpoint is to be placed.
3. Obtain the Flag
Next, we need to run the binary. We can run it by using dc.
The program executes and halts at the set breakpoint. Now, we need to look at the memory where flag is located. We can do this with the pxw command: p stands for print, x for hex, and w for word, so it prints the memory contents in hexadecimal as 32-bit words. The address we want is stored in RDX, so we can use the dr command to read the contents of RDX.
The final command becomes:
The backticks surrounding dr rdx cause that command to run first; its output is then substituted into the pxw command.
The output obtained is
The yellow text is the ASCII representation of the memory contents. This is what we’re looking for.
And that's it, we have successfully obtained the flag!
Putting it in the format specified in the challenge description, we get csawctf{01011100010001110000}.