Interesting Codebases

I love studying codebases, and when I come across beautiful designs or novel ideas and algorithms, it feels a lot like exploring a cave and finding gold hidden behind a rock.

Usually, I will dive into a codebase to figure out how a problem is solved or how an algorithm is implemented, and then branch out to other parts, mentally mapping everything as I go. Other times, I go in hoping to find something interesting, and, very rarely, I do it simply because I love reading it, just like I sometimes enjoy reading a book more than once.

I don’t do this as much as I used to, and certainly not as much as I’d like to, but I still come across the occasional gem. When my brother and I were younger and didn’t have Internet access, anyone else to talk to about our hobbies, or even related books to read, we relied on language interpreters (and later, compilers) and the sample application code included on the floppy disks to learn how to do anything. The closest thing we had to a go-to resource was the Amiga Basic manual, bundled with the Amiga 500, our first computer.

We eventually got Internet access: I would take the bus to the university once a week, stay there for a few hours, learn as much as I could, and fill as many disks as I could with stuff to bring back home and examine on our Amigas. That is how I found the source code of one of the most impressive and well-known Amiga applications, Term. That was a milestone for me. You could read all about it here. I suppose that’s when my real fascination with codebases began.

I thought it might benefit others if I compiled a short list of some of the codebases that stood out for me. This list is going to be short (I am definitely forgetting many that were interesting to me at the time), and it’s mostly in no particular order, but you may want to consider it if you want to be surprised like I was, or perhaps learn a thing or two.

  • Seastar and ScyllaDB: The most fun I’ve had studying a codebase in a long time; Seastar is a very impressive achievement, and easily the all-around greatest C++ codebase I’ve gone through. ScyllaDB is an application of the Seastar framework, and it too exhibits, for the most part, the same qualities as Seastar. If you care for “modern” C++, those are the codebases to understand inside out.
  • Aeron and Disruptor: I am not fond of Java, or any other JVM language. I will write Java when I need to, but the language doesn’t appeal to me. Those two codebases, by Martin Thompson, are by far the most interesting, well-written Java codebases I’ve studied. Their beauty matches the elegance and cleverness of the designs they implement.
  • HashiCorp codebases: I am not fond of the Go language either, though I don’t mind it. Of all the Go codebases I’ve studied, HashiCorp’s easily stand out from the rest.
  • Linux Kernel: No commentary required.
  • Folly: Facebook’s incredible C++11 library. The code quality is, in general, very high, and there are just too many learning opportunities there for anyone interested in learning new algorithms and designs. F14, in particular, is such a beautiful design with an equally beautiful codebase/implementation.
  • HHVM: The most interesting and highest quality Facebook codebase I’ve studied. A true gem.
  • TensorFlow: This is the finest Google codebase I’ve studied. Great design, very high-quality code, and it’s easy to understand how everything fits together.
  • Unreal Engine: The Unreal Engine codebase is somewhat hard to navigate and mentally track due to the sheer size and complexity of the components that comprise it. If you are patient, you will be rewarded with plenty of ‘a-ha’ moments.
  • SQLite: This is here for the incredibly well-documented codebase. It’s really easy to understand how everything works, and it’s very valuable to anyone interested in how datastores are built.
  • Chrome and V8: The Chromium codebase taught me a lot about building UIs on various platforms, and the V8 codebase is important because JavaScript is so important nowadays. Neither is hard to understand; they are well structured and quite nice to look at.
  • Postgres: See SQLite commentary. A minor issue with this codebase is that the authors do not seem to have incorporated updates/advancements to various APIs and the new syscalls introduced in the past few years, but that in no way lessens the importance and quality of the code. I really like the design that influenced the implementation of the code in procarray.c, which is similar to MariaDB’s group commit implementation, which in turn is a spin on the flat combining idea.
  • HAProxy: If you care for network I/O, you should study this codebase. It’s also simple to understand.
  • Doom: To say that I was impressed when I first saw and played Doom would be a huge understatement. So when I first got access to the codebase, I spent a long time going through it, trying to understand how it was even possible. Carmack is a genius.
  • LMDB: The code is very minimal, and what really matters here are the ideas and the design of this memory-mapped database. Its extremely high performance is due to that clever design and the pragmatic decisions of its developer.
  • Erlang: You should study it to understand how light-weight processes work, how message passing is implemented, and how its VM and scheduler work. The code quality is not that high, all things considered, but it’s worth it. You may also want to check out QNX.
  • LLVM: I learned a lot about compilers and optimizers by studying LLVM. The codebase quality is high, and it’s very easy to find your way around it.
  • WebKit: See V8 commentary.
  • LuaJIT: The finest C codebase I’ve studied. It’s so elegant, so beautiful and so well designed (both LuaJIT, and the code that implements it). I can’t recommend it enough.
  • Redis: The implementations of HyperLogLog and radix trees in Redis are very elegant and well documented. In general, it is straightforward to understand how everything works in Redis. If you are interested in the design of high-performance single-core servers, you should study it.
  • Abseil: Google’s open source collection of C++ library code designed to augment the C++ standard library (as described by the developers). Very high-quality code, and very clever design choices, especially in their fantastic hash map implementation.
  • Compile Time Regular Expressions: @hankadusikova’s constexpr/compile-time regular expressions library is an incredible accomplishment. It’s a real eye-opener too; her design is based on types and type matches, and you should watch her CppCon 2018 presentation and go through her slides to see how that works. If you care for C++ and constexpr/compile-time programming, this is the codebase you need to study.
  • RE2: This is one of the nicest codebases open sourced by Google. It is heavily commented and very well implemented, and it’s important because, among other reasons, it shows how powerful DFAs/NFAs can be.

I will update this list with more codebases that I found interesting, as I remember them. I will also add some comments about them and the reasons why they are interesting or important.

In the meantime, you should check out the ones listed here; I am pretty sure they will be worth your time and effort.

Have fun exploring them!

This blog post is on Hacker News. You may want to read the comments here. You can find me on Twitter (@markpapadakis).