From zero to building my own compiler

Neva Krien
9 min readJun 15, 2024

--

I wrote my first-ever assembly program 3 months ago. Now I am writing my own compiler from scratch. I’m not talking about something on the scale or quality of GCC or Clang. Heck I am not even talking about a language that allows variables.

Why am I doing this? Because it’s fun and I want to learn about how things actually work. Sure I could take something off the shelf like LLVM and get some parser library and hook them together. But that won’t teach me much.

For this project, I decided to go with a binary Turing machine, because it’s both incredibly simple and Turing complete.

If you’re not familiar, a Turing machine is a basic model of computation. It can simulate any computer algorithm. This felt like a nice bar to clear since its just complex enough to be potentially useful.

My version uses a tape of 1s and 0s as input, and the goal is to produce a result tape if the machine halts. It can always decide to run forever and my version even allows for other errors but that’s not really super important.

The whole thing is implemented in C. At times I was really tempted to switch to Python or learn Rust or C++. But I stuck with C for the entire thing and I am happy I did.

Assembly

Assembly is very different than anything you will see out there. Unlike high level languages like Python, or even medium level languages such as C, assembly doesn’t allow for variables or isolated function calls.

Its basically a nice dressing on the actual hardware operations a computer does. There really isn’t much lower you can actually code in.

Assembly is a weakly dynamically typed systems language… all these words together basically mean assembly has so many things that can go wrong, and no guard rails whatsoever. You can’t even printf debug with it.

Now, my two main languages are C, which is weakly statically typed, and Python, which is strongly dynamically typed. So I felt like I was in a pretty good position to start figuring out assembly.

C is usually seen as a low-level language, and in many ways it is. But it’s got nothing on assembly. You can write C code that works on any hardware and OS . You just need to make sure to follow the C standard (easier said than done) and not use any platform specific features.

In assembly you write to one target and one target only… You are forced to deal with the actual hardware and OS in literally anything you do. You also don’t have a compiler to fix your badly optimized code. What you write is what you get, end of story.

Learning Assembly

Around 3 months ago I started learning assembly in my free time. At first it was just playing “human resource machine”, which is a game with a very limited assembly language.

After taking a weekend to ace the entire game in one long sitting I got hooked. the last level was a prime factorization algorithm. It was very challenging because “human resource machine” does not let you divide.

I started porting the final program I made in the game into x86_64 assembly. It took some work but eventually, I had the whole thing working. It was really bad code because x86_64 does let you divide, but it was fun to see that my algorithm worked.

I then moved to reading some assembly. I looked at some of my old C code from a school project: walked through it and tried to map the assembly into the code itself. It was very informative.

Since most functions in that project had less than 15 variables (the number of registers in x64) I could really map registers directly onto C variables.

After doing that I started looking at how memcpy and memmove are implemented. I talked about it in the C subreddits, and it evolved into an article. Got a lot of feedback and the reception was relatively good.

I then took a break from assembly for a bit. For at least a month. I focused on the OS side of things.

Wasting Memory

When I decided to write the compiler I already had a working Turing machine implementation in C, from a previous project. I ended up redoing the entire thing.

Why? memory locality.

The previous implementation used 2 dynamic arrays: 1 for each side of the tape. We would then index into them with an integer. I knew this would be annoying to implement in assembly because for each read we need to:

  1. do a comparison to choose the right vector to index into
  2. potentially convert the negative index into a positive index
  3. figure out if the vector has enough memory
  4. potentially reallocate the memory (which is an entire function call).
  5. index into the vector

To avoid all of that I want to guarantee that the tape is actually one big chunk of memory. So we never need to ask for more, and we can also put the starting position in the middle of it. That way it expands “infinitely” to both sides.

Then I can just:

  1. figure out if the memory we are about to read is initialized
  2. potentially initialize it to zero or exit if we are out of memory.

It’s also probably faster because having the memory as long block of actual honest to god physical memory would do wonders for cache locality and branch predictions.

A downside of this scheme is that we have a bunch of extra unused memory. We then make it even more wasteful by taking a full 32-bit to represent a binary value.

We could have gone with 8-bits 16-bits or even 64-bits. But 32-bit is what I choose. It might seem counterintuitive but taking more memory for each value can actually be faster.

Essentially every system has a native size it prefers to work with, called the “word size”. Operations on this type of variable are usually quicker. This is why you will see C code use ints as a boolean. Even tho chars are smaller and are always available (booleans were only added at c99) int is the default way to handle boolean values.

Now on a modern 64-bit chip the word size is technically 64. And for historical reasons, an int in C is 32-bit on these machines. However, because this is the case, 32-bit operations are still highly optimized on these machines. From the googling I have done it seems they are just as fast, if not faster. I later confirmed this with an expert, they gave me this source.

This is why for my compiler I went with 32-bit. I figured it’s probably what the architecture would prefer. Now it could very well be the wrong choice, it’s not that I am actually using a lot of arithmetic operations, and having 4x the size does mean the memory bus needs to work harder.

Actually starting the project

I never really managed a large C project. Usually most things I build fit into headers, and maybe the odd object file. I knew for this project that won’t cut it. There is just too many tests and tools that would have to be built.

C is often given as an example of a language with a bad build system. I went with make and it was pretty nice. It took a bit of back and forth with ChatGPT to get something I could actually maintain. But once I did it was like magic.

I used Python and Bash for calling my test code, and doing some housekeeping. At the time of writing a full recompile and test takes 0.63 seconds (on a crappy laptop its 6), this gives me those very tight feedback loops I just love.

Tests

If there is anything I learned about systems programming its that you have to write good tests. You have to make sure that your code does what you think it does because otherwise when you have a bug you have no idea where to start.

Pretty much any module I wrote had a test written afterwards. And pretty much every single one of these tests failed at least once. Some of these were kind of challenging to properly test. For instance, how do you test that something doesn’t halt?

For testing the compiler itself I made an interpreter. I used the results from an interpreter as a reference. Both for tests and for actually writing the assembly. I made 2 interpreters: one that just ran, and a second that counted steps and would stop after a while.

From looking at my GitHub at least 15% of my project is Python… I only use Python for testing and it’s not even the main part of most tests. The exception to this is some pretty big end-to-end tests I have written in Python.

I would generate a big peace of code. Run the interpreter on it, then run the compiler. If they both gave the same result I knew the compiler is not bugged.

Random Bugs

Still, some things you just can’t test for. I spent 5 hours chasing a bug in this project that was just a badly placed “;” in the assembly I generated for NASM. So I could have saved 5 hours if I didn’t have comments in my generated assembly. Oh well.

There was also just a random 100x slow down in NASM assembling speeds. It would go from basically instant to 4 seconds for one specific file. All because of 2 lines of assembly… The hacky fix I found was moving some label to the top of the file. And it would magically get better.

I also did a total rewrite of the parser. After spending 2 full days trying to do it in C. And then in a feat of rage writing it in Python in 15 minutes. I then used that as a reference for the C solution.

It’s still terrible code I don’t have a lexer at all. I am just mutating the input text. But it works and I can maintain it. In the end that what matters. Sure it can’t be used in a real compiler but that’s not the point of this project.

Parser

So as I mentioned writing the parser was way trickier than it should have been. This is because C does not have good string handling. I didn’t really foresee that being the problem.

You would think that not having a hash table would really slow things down, at least I did. I even considered moving to Rust or C++ just so I have one. But it was actually trivial to implement and I would argue it made things easier to work with.

However, having the main string type be null terminated definitely stung. Not because it’s hard to get around, heck I got around it in a few minutes. You just make your own string. But I was so focused on how I should implement the string handling. That I forgot the part where I am actually solving parsing.

I later learned that you should actually split the text into LOGICAL tokens. What I did was just split based on white space and try and fix the errors later, needless to say that was bad.

Performance

Even tho I didn’t add any optimizations yet, I still wanted to see how well the compiler does compared to the interpreters. Benchmarking can be a bit tricky for something like this. Things like heat, memory alignment, and background processes, can effect the results.

Luckily this is all single-threaded code, so we can just throw a bunch of processes at the same time. And they all have basically the same conditions if we do it right.

I elected to test on 4 code snipets with 1000 samples each. I would then see which of the implementations came on top. My compiler’s code beat the 2 interpreters over 60% of the time. This is statistically significant.

So why C?

Whenever someone uses C the immediate question is “Why not X?”. The biggest of these is “Why not Rust? Its memory safe.” Okay… is memory safety a problem here?

I had 0 unintentional memory leaks in this project… and all of my out-of-bounds segfaults gave pretty good error messages in Valgrind. I am not going to be coding in Rust just because I am a queer woman.

Then it goes: “Why not C++? it’s C with more features.” More features are always better right? No… they are not.

I have a bad tendency to over-abstract things. C++ and Rust really push me in that direction. It’s almost never actually needed and it’s a nightmare to maintain later on.

Lastly, people ask “Why not Zig?”. Well, Zig is kinda cool but it keeps changing all the time. They added and removed async twice already.

On the other hand, C has stayed mostly the same for over 20 years. This means that EVERYTHING written in/on C in the past 20 years is basically still valid. You don’t need to bother relearning the language or rewriting your code. As a result, ChatGPT can code C fairly well.

So ya I like C because it’s simple, stable, and effective.

Conclusion

Yes, you can write a compiler in C without using anything extra fancy. No, it’s not that hard you the reader, can take a few months and go do this. I didn’t even know C six months ago. Apparently that does not matter. You just need determination and free time.

I still have a lot more to do with this project. Originally the goal was to learn about optimizations. Now that I have all the basic infrastructure and tests, it’s time to get to the fun part.

I will keep you posted

Neva.

--

--

Neva Krien

she/her 🏳️‍⚧️ first year uni student with published papers on LLMs. transitioning into systems programing.