On Kentucky Mule: a transcript
The world of statically typed programming languages appears to be at peace with an unfortunate division. Either you have an advanced and interesting type system in your programming language but pay for it in slow compile times, or you have an uninspiring type system with fast compile times.
The economics of designing a statically typed programming language with a good user experience (and compilation speed is a major component of UX!) hasn't been explored in depth.
Kentucky Mule, my slow-burning side project, has been blue-collar research into that question. The aim was to produce insight that's highly practical and to focus my attention on one language I understand really well: Scala. A few months ago, I reached a milestone in Kentucky Mule's development that greatly reduced the risk and clarified how this project can graduate from a prototype of an unproven idea to a solid plan for re-engineering the real Scala compiler for a 5x-10x performance improvement.
Below is a transcript of the talk I gave a few months ago on this topic. It's been slightly revised since then, because revising history is fun.
Can the Scala compiler go faster? What's the limit? How much are we leaving on the table?
As I was thinking about these questions, I realized that the search for a high performance compiler implementation can have two outcomes. Either the compiler implementation I'm looking for exists, or Scala has a design flaw that prevents it from ever being fast to compile.
Scala was released in 2004 but its real rise in popularity started in 2006 with the Scala 2.0 release. By 2012, as people were building bigger projects and adopting Scala within companies, compiler speed became the top concern.
Over the years, many efforts went into improving it. I worked on the redesign of zinc — the incremental compiler for Scala. It's the piece of software that figures out what to recompile when you make a change in your project. The algorithmic redesign led to 10–50x improvements. In 2018, compiler performance is still listed as a top priority within the Scala community.
During my work on the incremental compiler, I stumbled upon Bret Victor's talk Inventing on Principle. It was a revelation for me — Bret's talk was overflowing with idealistic ideas on how programming should be done. I saw a deeply compelling case for the principle that if you make a change, you need to experience its effects immediately. The consequences went far deeper than just saving time. It changed how you do programming.
After watching it, I went on to tell all of my friends about it.
And I started to wonder why I was working on a programming language so far removed from what I saw. Was I working on the wrong cause?
2016 was the hobo year for me. I didn't have a perf cycle to worry about or a paper submission deadline. During that year, I happened to have a job interview at Facebook. I talked with the Hack team and learned about their typechecker project. It handles static checking at the scale of tens of millions of lines of code. It's massively parallel and distributed.
And now I became a hobo with a plan.
A popular argument for Scala’s slow compilation speed is that Scala’s compiler simply has to do more work. The language has sophisticated features and supporting them is computationally expensive.
So I ran an experiment! I generated simple Scala and Java code side by side. Except for language differences, the two look exactly the same. In the code I stayed away from the rich type system features that are normally blamed for Scala's slow compile times.
I compiled both with the Scala and Java compilers and compared the execution times.
On the chart you see a wide gap between the two compilers. Fundamentally, both are solving a problem of the same complexity, but the Scala compiler is 6x slower than the Java one. Whatever extra work the Scala compiler is doing, it's waste. None of the performance analysis work done so far explains that gap.
While my experiment shows there are potential gains in single-threaded performance, the real riches await in the world of parallelism. People working on programming languages consider parallelizing a compiler to be an unsolvable problem. But the Hack team built a non-trivial typechecker with a massively parallel, distributed and incremental architecture. Interestingly, they designed the language to enable that architecture.
Can Scala borrow Hack ideas? Can Scala have a highly parallel compiler without dramatically changing the language?
Ok, so we want to build a highly parallel compiler. If you plot the time spent in the different phases of the compiler's pipeline, you'll see one standing out: the typer. This is the heavyweight phase that performs the type checking and takes 30–40% of the total execution time. It's also the most difficult phase to reason about. We'll focus on it.
This is the final architecture of typechecking we're heading for. And boy, a lot is going on here. Let's unpack it. The goal of the architecture is to stack as many little boxes on top of each other as possible. The more we stack them, the more parallelism we get.
On the left we have the parsing phase, which is trivial to parallelize: I can parse each source file in parallel. On the far right is the massively parallel typechecking of method bodies. All methods are independent of each other, but they depend on the computation that's happening in the middle. We'll get back to it.
For comparison, this is the current design of scalac. It appears to be simpler but the picture is deceptive. All the complexity is packed into what is known in software as a Big Ball of Mud. What we did in the previous slide is carefully break up the hidden complexity into small, manageable pieces.
When I asked a friend who does research in programming languages why researchers don't spend more time on parallel architectures for typechecking, she cut me short with "it's a messy data-dependency problem". I'm now getting to the core of the problem.
“Messy data-dependency problem”
Let's suppose I want to resolve the type T in the piece of code on the left. The sequence goes like this:
- Is the type T defined in the class B? No
- Is it defined in the enclosing class A? Yes, ok, great, I have it
Now, let’s suppose I remove the class declaration and bring in an import statement. I still want to resolve the type T:
- Is that type defined in the class B? No
- How about class A? No but there’s the import statement
- It's a wildcard import; does C include the type T as a member?
- But… what’s C? I have to first work this out. Is C declared in the class A? No
- How about the package foo? Let me kick off a global search
- I see that the object C is declared in package foo in other source file
- Ok, does object C include the type T? Yes, we have it.
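The two situations in the walkthrough above might look like this in Scala (the names match the example; `foo` is an object here only so the snippet fits in one self-contained file — in the walkthrough it's a package):

```scala
// Case 1: T is defined in the enclosing class A, so the lookup that
// starts inside B walks out one scope and finds it there.
class A {
  type T = Int
  class B {
    def f: T = 42 // resolving T: not in B, found in A
  }
}

// Case 2: T comes in through a wildcard import. Resolving T now first
// requires resolving C itself.
object foo {
  object C {
    type T = String
  }
}

class A2 {
  import foo.C._ // wildcard import: does C have a member T?
  class B2 {
    def g: T = "hello" // resolving T: not in B2, not in A2, found in foo.C
  }
}
```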
So what’s going on here? We’re resolving named identifiers. We’re turning a string of characters into a reference to a specific definition. The resolution can be recursive: e.g. we first resolve the object C and then we ask it for the type T. The order in which we perform the resolution is determined both by the language specification and the structure of the program. The dependency structure that arises from the resolution process can be very complex and there’s no obvious opportunity for parallelism.
It’s a messy data dependency problem.
Skeleton & tissue
We can start untangling our messy problem by dragging the code to the sides and drawing a dividing line between the two columns. Definitions and signatures go to the left and method implementations go to the right. Now, you can think of what's on the left as "the skeleton" of the program. The stuff on the right is the tissue, and it hangs on the skeleton.
There's the skeleton on the left, and the little boxes on the right are pieces of the tissue. The pink arrows represent references, or dependencies, between parts of the tissue and the skeleton. The tissue only reaches out to the skeleton: the arrows on the right go only to the left. The tissue both hangs on the skeleton and reaches out to other parts of it.
The skeleton is more complicated: it has self-references. Types can refer to other types, and methods refer to types in their signatures. But the skeleton is self-contained: it never reaches out to the tissue. Aaaand I know what you're thinking: but what about inferred return types?!
We’ll get to them soon! I promise.
We have a first major win: we divided our problem into two large components, the tissue and the skeleton.
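In Scala code, the dividing line might fall like this (an illustrative snippet of my own, with the split marked in comments):

```scala
class Ledger(val owner: String) {      // skeleton: class and constructor signature
  def entries: List[Int] = List(1, 2)  // the signature `entries: List[Int]` is skeleton
  def balance: Int =                   // the signature `balance: Int` is skeleton
    entries.sum                        // tissue: the body only reaches out to the skeleton
}
```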
The skeleton is expensive to compute in Scala and, frankly, in most modern programming languages. For the overall plan of parallelization to pan out, the computation of the skeleton, of all signatures, has to be parallelized too. It's a tricky problem because parts of the skeleton, or bones, cross-reference each other. To avoid thread contention, I would like to figure out upfront the order in which to process all the pieces.
However, this problem sounds circular at the outset: the order is determined by the dependencies between the parts of the skeleton, and the dependencies are available only once I have fully computed the type signatures of all parts of the skeleton: classes, type members, methods, etc.
To break that cycle, I came up with the notion of outline types. The idea is to approximate the real types with very simple types that are extremely cheap to compute and capture dependencies.
Once computed, outline types give an IKEA-style assembly manual for building up our skeleton: they tell you in which order to put the bones together.
The architecture again
Let's go back to our architecture. The computation of types in the tissue of the program is on the far right; in the middle is the computation of the skeleton. You can see many small boxes arranged in a predetermined order marked by the dependency arrows. Further to the left is where the assembly plan for the skeleton's construction is put together. That's where outline types are computed.
Looking at this plan, you can see that the outline types computation is represented by a single box. It's the one computation that cannot be parallelized. And the key question is: can outline types be computed fast enough for the whole scheme to make sense? Fast enough means an order of magnitude faster than the computation of the skeleton.
We would like to avoid the comical situation where we spend more time figuring out the assembly plan than actually building the thing.
Outline types are stripped down to their essence. They are only concerned with the signatures of classes, objects, type definitions and vals. I cut everything that's non-essential to computing the dependencies. Outline types do not support subtyping checks and are purely structural.
Outline types are so stripped down that it's initially hard to imagine they can sustain the weight of Scala's complexity. I had no idea whether they would work, and work fast enough, but I was very eager to find out. And I went on building Kentucky Mule.
For building a proof of concept, I picked the little-known scalap project as my test case. The project is a great fit: it's self-contained and small, consisting of only two thousand lines of code. It's not so small as to be completely trivial, though. From the first line of code of my prototype, I was paranoid about performance. I measured the performance at each step of the initial Kentucky Mule implementation. I was interested in staying as close as possible to how fast computers can compute.
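As a rough sketch, an outline type could be as simple as this (the names and shapes are my illustration, not Kentucky Mule's actual data structures):

```scala
// A hypothetical sketch of outline types. An outline records only the
// names a signature refers to; there is no subtyping, and parents are
// unresolved strings, not real types.
sealed trait Outline
case class ClassOutline(
    name: String,
    parents: List[String],         // raw names of parent classes
    members: Map[String, Outline]  // vals, defs, type members by name
) extends Outline
case class SigOutline(referencedNames: List[String]) extends Outline

object Outline {
  // The dependencies of an outline are just the names it mentions;
  // collecting them requires no typechecking at all.
  def dependencies(o: Outline): List[String] = o match {
    case ClassOutline(_, parents, members) =>
      parents ++ members.values.toList.flatMap(dependencies)
    case SigOutline(names) => names
  }
}
```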
The performance numbers I got looked really great: I was computing outline types for scalap at the speed of 4 million lines of code per second.
I knew that the numbers I got so far were purely for validating whether the idea of outline types has any grounding in reality. If they were too slow to compute, I wanted to know early. They did run fast, but in my proof of concept I conveniently left out all the Scala features that I felt were potential deal breakers.
The next task was to meet the deal breakers head-on. I picked Scala's standard library as the test case and listed all the language features I thought I'd need to implement.
Starting with really great performance and my predetermined list of 21 features, I went on implementing them. I want to show you the result.
In the demo you saw Kentucky Mule process 30 thousand lines of Scala code in around 200 milliseconds. That's a bargain for the assembly plan. And it inspires a great deal of confidence in the overall architecture.
I'd like to talk about some lower-level architecture now. Internally, compilers often have a concept of a type completer. It's a function that goes from Unit to Type; that is, it has the signature () => Type. It's a lazy computation returning a type. Recall the big ball of mud and the messy data-dependency problem. Because the order of typechecking is not known upfront, the popular choice is to set up lazy computations that refer to each other and let laziness sort out the order of computation. This is a very natural trick that the Scala compiler, the Java compiler and the Swift compiler all use.
It comes with downsides, though. We're performing a deeply recursive depth-first graph search that confuses the JIT optimizer on the JVM and slows down the compiler. Reasoning about the cost of a computation in the typechecker is difficult: you don't know whether the piece you're looking at is slow itself or just happens to trigger a long chain of other computations. Deep stack traces are also tricky for JVM profilers to handle. And finally, some types are loaded from disk, e.g. for your external libraries. They're loaded lazily and in small chunks, which leads to I/O thrashing. I'd like to batch I/O operations instead.
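The lazy-completer trick can be sketched in a few lines (a toy illustration, not any compiler's actual code):

```scala
// Toy stand-ins for the compiler's internals.
case class Type(name: String)

// A type completer is a function () => Type; the symbol memoizes
// the result of running it.
class Sym(completer: () => Type) {
  lazy val tpe: Type = completer() // forced on first access
}

object LazyDemo {
  // B's completer forces A's completer transitively: laziness, not the
  // programmer, determines the order of the two computations.
  val a: Sym = new Sym(() => Type("A"))
  val b: Sym = new Sym(() => Type("B extends " + a.tpe.name))
}
```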
These pain-points made me consider an alternative abstraction.
I came up with the notion of an interruptible type completer, inspired by cooperative multitasking. It's the control flow scheme used by, for example, Windows 3.1. The idea is that completers are tasks that either return the Type they are completing or yield control back to the event loop if they're stuck on some dependency that hasn't run yet. The interruptible type completer has the signature () => Type | IncompleteDependency. Control is returned together with a signal saying which dependency is blocking the execution. And once the dependency is satisfied, the scheduler retries the task.
In this scheme, tasks are organized in a work queue. Tasks are isolated and lend themselves to straightforward performance attribution. Shallow stack traces make JVM profilers much more usable. And I'm free to reorder the execution of tasks in the queue; for example, I can batch I/O tasks.
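A minimal sketch of the interruptible scheme, using `Either` to stand in for `Type | IncompleteDependency` (the names are illustrative, not Kentucky Mule's actual API):

```scala
import scala.collection.mutable

object InterruptibleDemo {
  case class Type(name: String)
  case class IncompleteDependency(name: String)

  // The interruptible completer: either the completed Type, or a signal
  // naming the dependency the task is stuck on.
  type Completer = () => Either[IncompleteDependency, Type]

  val done = mutable.Map.empty[String, Type]

  // A completer that is stuck until `needs` (if any) has been completed.
  def completerFor(name: String, needs: Option[String]): Completer = () =>
    needs match {
      case Some(dep) if !done.contains(dep) => Left(IncompleteDependency(dep))
      case _                                => Right(Type(name))
    }

  // The event loop: a blocked task goes back on the queue and is retried
  // once its missing dependency has been completed.
  def run(tasks: List[(String, Completer)]): Unit = {
    val queue = mutable.Queue(tasks: _*)
    while (queue.nonEmpty) {
      val (name, task) = queue.dequeue()
      task() match {
        case Right(tpe)                    => done(name) = tpe
        case Left(IncompleteDependency(_)) => queue.enqueue((name, task))
      }
    }
  }
}
```

Note that a real scheduler would detect unsatisfiable dependencies instead of spinning; this sketch assumes every dependency eventually completes.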
I used Kentucky Mule as a lab to try out the idea of interruptible type completers for outline types. Much of the Kentucky Mule implementation is an exploration of this single change of function signature: going from () => Type to () => Type | IncompleteDependency. I went from returning a type to returning a type or signaling a missing dependency.
By doing that, I decoupled dependency discovery from dependency computation. In the old scheme, the depth-first traversal ties these two together.
Shea's Law: The ability to improve a design occurs primarily at the interfaces. This is also the prime location for screwing it up. — Akin's Laws of Spacecraft Design
I recently came across this quote on spacecraft design. All software deteriorates over time for a variety of reasons, and the only thing that prevents a total collapse is good interfaces. This simple idea of cooperative multitasking has far-reaching implications. By changing the interface, I unlocked the application of decades of systems programming thinking that went into work queues and schedulers.
I had a profound "aha" moment when I learned from my systems programming friends that there's a simple, very low-overhead parallel scheduler for interruptible completers. The computation of outline types itself can be parallelized. That one box in the architecture slide can be broken up! It can be broken up! (See this explanation for details on how it works.)
A hint for design of fast compilers
Finally, I would like to talk about the search space of a fast compiler implementation. There are two ways you can go about finding a fast implementation. You can first implement all language features and all compiler phases and then start profiling and optimizing.
The other way is to implement a tiny subset of language features, let's say what programming languages supported in the 70s. You implement this small set of features end-to-end. You want to get great performance for the final implementation. And then you iterate over language features under the constraint of defending the excellent performance you already have.
The first way is the typical way software is constructed and it’s best summarized by this quote:
Make It Work Make It Right Make It Fast — Kent Beck
Rarely do you stumble upon a problem in software design where you need to step back and reevaluate even the most common practices in software construction. While working on Kentucky Mule I took a big step back and came to the conclusion that I like the second method because the search space is simpler: I know at each step where I need to go. All 40 language features to be implemented are in the language specification. Each step is small enough that, when I notice I'm losing performance, it's easy to back off and figure out why. More importantly, at each step I know how far I am from being done with the implementation.
With the more common practice of optimizing at a late stage of development, your search space is enormous. And you're shooting in the dark, which quickly becomes demoralizing.
It's time to wrap up. So, are we up for a rewrite of the Scala compiler? Conveniently, we have one rewrite on our hands, called Dotty. Recall the experiment I started with. I added Dotty's performance to the chart. It aligns perfectly with the old implementation. I repeated this experiment on different code bases; the Scala compiler and the Dotty compiler come neck and neck in compilation speed on the same code base.
It’s not about rewrite.
I believe Scala’s compiler performance is a hard but solvable problem.
My proposal is to carefully reexamine the implementation. Break down the big ball of mud into phases with the intent of future parallelization. Pick interruptible completers as the basic building block. This interface enables decoupling and unlocks opportunities for borrowing great ideas from the systems programming community.
Lastly, start with minimal, end-to-end implementation. Iterate over language features and defend performance at each step.