Kentucky Mule — limits of Scala typechecking speed
Back in August, I wrote wrote a stream-of-consciousness style post Can Scala have a highly parallel typechecker?. My original plan was to write down my thoughts on the subject and move on with my life. I thought to myself at the time: “wouldn’t it be cool to pretend I’m a Software Architect?”. So I did what I thought Software Architects do: I hand-waved about an unproven idea in a blog post.
Today, I’m sharing my findings on the limits of Scala typechecking speed I explored in the Kentucky Mule project. Lessons learnt from Kentucky Mule apply to a wide range of possible compilers and are not limited to just Scala compiler.
I was convinced the idea of parallelizing and distributing Scala typechecking was an interesting project to hack on and I was hoping somebody else would take up the baton. After a month of waiting I realized “Oh wait, there’s around fifteen people in the world who can readily run an experiment on a different implementation of Scala typechecking and I know all of them”. And I was the only hobo in that circle — everybody else is busy with their projects.
In my previous blog post I described the idea of computing outline types that would carry just enough information to figure out an efficient plan for parallelizing and distributing full Scala typechecking. You can think of computing outline types as a pre-typechecking phase. The big unknown to me was: can pre-typechecking be done in a fraction of the time it takes to fully typecheck a Scala program? If pre-typechecking turned out to be too slow then my idea from the blog post would turn out to be a flop.
Scala typechecking is a complicated problem and it’s a labour intensive area to experiment in. The prospect of investing several weeks of work into prototyping my idea and having just a negative validation as a result didn’t sound exciting. I knew that the negative outcome was the most likely one. I’ve been in business of speeding Scala compilation times since 2012, first working with Martin Odersky and later on my own.
I was looking for ways to hedge my risk. I realized I could reframe my problem into a winning-only, twofold question:
How to build a highly parallel typechecker for Scala, or does
Scala have a fundamental language design flaw that prevents one from being built for?
Either Scala community would win by learning about a path towards a high performance Scala typechecker that anybody programming in Scala wants. Or there would be a lesson of high interest to the entire programming language community in what kind of language feature to avoid in a programming language they are designing.
Armed with two different sets of lenses for looking at my challenge, I capped myself at 15 days of full time work to investigate.
Outline types are cheap
I’ll kill the suspense right off for you: Kentucky Mule pre-typechecks sources at the speed of 4.4 million lines of Scala code per second on a warm JVM. I got the units right, it’s that fast. On a cold JVM, it can process the entire scalap sources (2k lines of code) and summarize API-level dependencies in around 600ms. This time includes JVM’s startup time, loading Kentucky Mule’s bytecode, parsing time and actual processing.
I’ve got some good news for all Scala programmers out there:
Scala does not come with a language design flaw that prevents a fast compiler to be written for.
The pre-typechecking Kentucky Mule does is a small subset of real typechecking Scala compiler performs. It’s natural to ask: does Kentucky Mule have a better architecture and implementation or is it simply cheating but not doing the same work as Scala compiler does?
Kentucky Mule repo on GitHub contains a simple benchmark that brings both implementations very close to each other in terms of work they need to do. In that benchmark Scala source is very simple and Kentucky Mule’s pre-typechecking is essentially the same as real typechecking. Kentucky Mule performs at speed of 15 million lines of code per second and the Scala compiler at 7 thousand lines of code per second. In this specific case, Kentucky Mule is over three orders of magnitude faster than the Scala compiler.
The performance gap is so large that it’s interesting to explore what I did differently in Kentucky Mule.
When I was prototyping the basics of the typechecking in Kentucky Mule, I departed from the typical way of computing types and handling the dependencies between them.
Many compilers have a concept of a completer designated as a fundamental unit of computation in the typechecking process. The Scala compiler, the dotty compiler and the Java compiler have all a concept of a type completer at the center of its typechecking implementation.
A type completer is a deferred (lazy) computation that is triggered when someone asks for a type of an entity. For example, if a method call is typechecked, the typechecker will ask the symbol representing the method for its type. If this particular symbol hasn’t been inquired for its type before, the associated type c0mpleter will be executed and the computed type will be stored. Once a type is completed, an associated type completer is discarded.
At the very high level the type checking consists of setting up the right completers for symbols and then forcing all completers according to the dynamic graph of dependencies between types and other entities in the typechecked program.
At the first sight, the idea of having lazy computations for all steps of the typechecking process looks elegant. However, at closer inspection completers as implemented in Scala and Dotty turn out to be problematic. Let me list a few issues.
The completer is supposed to hide an order in which types are computed. You just ask an entity for its type and either it has it already, or it will be computed for you and you don’t need to think about it. However, types in any non-trivial programming language permit many kinds of cyclic references. What happens when you have a cyclic dependency between completers? You get either some form of an infinite recursion (usually observed as an StackOverflowError) or logical errors when one of the completers observes a half-baked type and derives wrong results based on a wrong state of its dependencies. This issue is real and is a life-sustaining source of bugs. Here’s one instance of code that tries to handle cycles in Dotty:
The long comment refers to implementation details of other completers (we don’t even know which ones exactly). Type completers as a simple deferred computations do not offer any mechanism for dealing with cycles. An ad-hoc measures are invoked that cause the type completer abstraction to leak.
The design of type completers triggering other type completers and blocking on the result of dependent computation promotes deep stack traces. Each step of forcing a type completer is adding at least a few method calls to the stack. It’s common to see very deep stack traces while the Scala’s typechecking is running.
Deep method call stacks are not handled well by JVM’s JIT compiler. This results in a suboptimal runtime performance.
Measuring performance is hard
The profiling tools for JVM are also not handling deep stacktraces well enough. The Java tooling for optimizing performance can be thrown out of the window because it doesn’t produce any actionable results for Scala compiler or any other program written in the style resembling Scala compiler.
A large graph of lazy computations makes performance accounting hard to do: you don’t know if the completer you’re looking at is slow or maybe one out of thousand completers that happen to be forced is slow.
Unpredictable performance model
A programmer working with type completers have a hard time building a mental model for performance characteristics of the code she is writing.
When an inquiry for a type can take anywhere between 1ms and 1s depending how large portion of the completer graph is forced, it’s the right time to give up on reasoning about code’s performance.
If a programmer observes her completer to wait on other entity that takes 1 second to compute its type, she has no tools to help her determine whether it’s a genuine performance bug or an accidental order of computation she is observing.
All those issues lead to a guess-driven programming when you’re trying to guess your way out of either exceptions caused by cycles or performance problems.
Interruptible type completers
In Kentucky Mule, I have tried an alternative design for type completers that can interrupt their execution and report back the reason for interruption, e.g. a missing dependency.
You can think of a regular completer to be defined as a lazy function
() => T. The completer in Kentucky Mule is defined as:
The completer can interrupt its execution when it encounters a symbol with an incomplete type. The caller to the completer is free to schedule the execution of that completer to be done after the incomplete dependencies are taken care of.
Interruptible type completers enable programming that reminds in style a cooperative multitasking. The completer explicitly yields its control back to a work queue upon encountering an incomplete dependency. The scheduler for the work queue rearranges computations according to incomplete dependencies reported by completers.
The incomplete dependencies, that are bread and butter of typechecking, are expressed as explicit values and receive an explicit handling.
The design with an explicit work queue and global view of dependencies enables nifty tricks. For example, completers touching I/O (like reading class files from JAR files) can be bundled together to reduce the cost of context switching.
Organizing typechecking tasks in an explicit job queue filled with interruptible completers promotes shallow stack traces and make performance accounting easier.
One downside of interruptible completers is that they make programming against them more elaborate compared to regular lazy type completers. An example taken from Kentucky Mule’s implementation illustrates the problem:
The pattern matching on
resolved implements yielding the control flow back to work queue in case a dependency is missing. This piece of logic has to be introduced every time two interruptible completers are composed. It’s clear that for the design based on interruptible completers to scale to the size of a real compiler, a better way for expressing composition has to be found.
I believe I can attribute the success of creating a high performance implementation in Kentucky Mule to a different development process than most compilers go through.
A typical way to implement compiler is to split its functionality into several phases that transform types and trees. You can think of a compiler as a pipeline of several phases, that starts with the original program’s source code and through several simplifying transformations arrives at some form of bytecode or machine code.
It’s natural to first implement the first phase (e.g. parsing) and once fully done, implement the second phase (typechecking), and so on. In this style of development, the pipeline is built incrementally in the order of phases. Typically, the work on performance starts once all phases are in place — the whole compiler is implemented.
I decided to take a different route in Kentucky Mule and I iterated over language features. I started with a trivial prototype that couldn’t typecheck anything; it could just register declarations in the source code. Once I understood performance characteristics of this piece, I added support for typechecking trivial method declarations but, for example, class inheritance was still not implemented.
I iterated many times; each time I would focus on a single language feature that I wanted to add a support for. At each step, I would carefully measure the performance cost of implementing a language feature support.
Originally, this style of work was motivated by my goal of isolating a single Scala language feature that was particularly expensive to support. I was looking for the potential language design flaw.
In retrospect, I believe this is an excellent approach for implementing a high performing compiler. This approach encourages monitoring from very early on in the development process the final performance number that everybody cares about: the raw speed of the entire compiler.
Kentucky Mule is not a fully functional Scala typechecker. It’s not even a fully functional pre-typechecker that computes all possible outline types. The goal of Kentucky Mule was to explore the limits of Scala typechecking speed and have a fresh look at a compiler architecture geared towards high-performance implementation. In that regard, I think of it as a success.
I hacked enough of Kentucky Mule’s functionality to convince myself that there’s no performance cliff lurking around the corner and the numbers I’m seeing are representative of the true potential. Gathering trustworthy numbers and gathering insights (collected in notes.md) was my goal in working on Kentucky Mule.
Kentucky Mule demonstrates a path towards a high performance compiler architecture and shows that both Scala and Dotty compilers do not even scratch the surface of possible compilation speed for Scala.
There’s a very long way from the prototype like Kentucky Mule to a robust, feature-complete parallel or distributed Scala typechecker. However, while working on Kentucky Mule I realized that the concept of the outline types that are extremely cheap to compute has many use cases on its own:
- Outline types can be used for an efficient computation of API differences in the style of JDiff.
- Outline types can be used in an incremental compiler to understand which parts of the source code should be recompiled.
- Outline types can be used for validation purposes in tools like Bazel that require an explicit import and export of transitive dependencies that are visible at the API level.
- Outline types can be used for a very fast, semantic indexing of sources, as seen from the API point of view.
I’m no longer a hobo so I’m planning to leave Kentucky Mule in its current state. I also tend to think it has served its purpose as a research project.