Can Scala have a highly parallel typechecker?

Update (Dec 23): I wrote a follow-up post “Limits of Scala typechecking speed”.

A couple of months ago I had an interview with the Hack team at Facebook. One of the things we discussed was the architecture of Hack’s typechecker, which is designed to handle the enormous scale of Facebook’s codebase. Hack’s typechecker processes over ten million lines of code in just two minutes and, once the codebase has gone through initial processing, provides nearly instantaneous feedback to the code editor.

The high performance of Hack’s typechecker implementation is a driving force behind Hack’s growing adoption inside Facebook: PHP programmers are used to a fast feedback loop upon edits, and Hack’s typechecking errors reach them nearly as fast as they edit their code.

I’m sharing here what I learnt about Hack’s typechecker during the interview and offering my initial thoughts on the question that popped into my head: could Scala steal ideas from Hack and get a highly parallel typechecker too?

Hack’s typechecker: PDI

The implementation of Hack’s typechecker has three properties: it’s parallel, distributed and incremental (PDI). Here distribution is parallelism under the constraint of a slow communication link between computing nodes. The parallel and incremental properties are both enabled by the same capability: understanding the data dependencies between the type information of different parts of the program before the full typechecking is performed. Understanding data dependencies enables scheduling independent pieces of typechecking work on parallel threads and helps avoid data races.

The structure of data dependencies has both static and dynamic parts. The static dependencies hold regardless of the structure of the particular Hack program being typechecked. For example, typechecking depends on parsing source files to produce trees. Source files can be parsed independently of each other, so parsing is easily parallelizable. There’s a static dependency between building a symbol table, which contains information about all declared public classes and methods (visible to other source files), and typechecking method bodies. However, once the symbol table is fully populated, method bodies are statically independent of each other. One method body can be typechecked completely independently of another because the computed information is visible only inside the method body. This is true in Hack because programmers have to explicitly declare the return types of all methods: a method’s full signature can be computed without looking at its own body or any other method’s body.
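A sketch in Scala syntax (the property carries over to any language with explicit return types): both signatures below are fully determined by the declarations alone, so the two bodies can be typechecked on separate threads.

class Accounts {
  // Both signatures are known without inspecting either body.
  def balance(id: Long): BigDecimal = entries(id).sum
  def entries(id: Long): List[BigDecimal] = Nil // stub body for the sketch
}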

The dynamic dependencies are all the type dependencies determined by the structure of a program. For example, if class B inherits from class A, you need to typecheck class A first to compute the full type information (e.g. the list of all members) of class B. Dynamic dependencies are introduced by referring to a type somewhere in the typechecked program. Not all dynamic dependencies are of the same strength. The dependency introduced by inheritance between classes is strong: you can’t start typechecking the signatures of methods in class B before class A is fully typechecked, because you need to know the members of class A for things like override detection. The dependency introduced by referring to a class in a method’s signature (e.g. as a return type) is weak: we only need the class to exist but we don’t need full type information about it.
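A hypothetical Scala rendering of the two strengths:

class A { def name: String = "a" }

// Strong dependency: checking B's signatures needs A's full member list,
// e.g. to detect that `name` is an override.
class B extends A { override def name: String = "b" }

// Weak dependency: this signature only needs class A to exist;
// A's members don't matter until the body of `make` is checked.
class Factory { def make(): A = new A }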

Hack’s typechecker works in three phases:

  1. Parsing
  2. Symbol table computation
  3. Typechecking method bodies

Parsing is trivially parallelizable. The symbol table is the global data structure that contains all the information required for the third phase. The third phase is also trivially parallelizable because method bodies can be typechecked independently.

The level of parallelism of the whole pipeline is limited by how quickly one can construct the symbol table. The computation of the symbol table can be broken down further into two steps:

  1. Enter declared classes and methods
  2. Compute signatures of classes and methods

The first step extracts declarations from trees and puts them into the symbol table; it only writes to the symbol table. The second step requires lookups in the symbol table to resolve references to declared types (classes), so it both reads from and writes to the symbol table.
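A minimal, hypothetical sketch in Scala of the two kinds of access: step 1 only enters symbols, step 2 looks them up and fills in signatures.

import scala.collection.concurrent.TrieMap

final case class Signature(rendered: String)
final class Symbol(val name: String) {
  @volatile var signature: Option[Signature] = None // filled in by step 2
}

final class SymbolTable {
  private val symbols = TrieMap.empty[String, Symbol]
  def enter(sym: Symbol): Unit = symbols.update(sym.name, sym) // step 1: writes only
  def lookup(name: String): Option[Symbol] = symbols.get(name) // step 2: reads, then writes signatures
}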

My understanding is that Hack’s typechecker parallelizes the work in both steps. The parallelism of the first step is limited by write contention on the symbol table. The second step is the hardest to parallelize because there’s a large number of dynamic dependencies between the computed pieces of data. One trick Hack’s typechecker uses to reduce contention is to never block on missing data. Instead, it computes the missing information locally in a thread and continues execution without waiting for the symbol table to be updated. This means the same signature may be computed multiple times, and one of the writes ultimately wins. This is possible because the computations are set up to be purely functional: you can rerun them any number of times and you’re guaranteed to get the same answer back.
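A rough Scala sketch of that trick; computeSig stands in for the real signature computation, which must be pure:

import java.util.concurrent.ConcurrentHashMap

final class SignatureCache(computeSig: String => String) {
  private val sigs = new ConcurrentHashMap[String, String]()

  def signatureOf(method: String): String = {
    val cached = sigs.get(method)
    if (cached != null) cached
    else {
      // Possibly duplicated across threads: harmless, because computeSig
      // is pure and every run produces the same value.
      val sig = computeSig(method)
      sigs.putIfAbsent(method, sig) // one of the concurrent writes wins
      sigs.get(method)
    }
  }
}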

To summarize, Hack’s typechecker achieves a high level of parallelism by carefully considering data dependencies and scheduling computations accordingly. One of the insights is that typechecking class and method signatures is a separate problem from typechecking the expressions in method bodies. Once the symbol table is fully constructed, you can trivially parallelize the work across source files or even single method bodies. This is enabled by Hack’s lack of inferred return types, and I believe the language’s design decisions were driven exactly by the goal of separating the typechecking of method signatures from that of method bodies. The construction of the symbol table in Hack is further simplified by the clear separation between terms and types: method signatures can refer only to types, not to other members’ signatures (terms).

Incremental typechecking is also enabled by the explicit consideration of data dependencies. Incremental typechecking is all about following the data dependencies to determine which parts of the typechecking computation need to be rerun to take into account updates to the typechecked program.
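In its simplest form that is a transitive walk over the recorded dependency edges; a hypothetical sketch:

// Given edges from each definition to the definitions that depend on it,
// collect everything that must be rechecked after `edited` changes.
def invalidate(edited: String, dependents: Map[String, Set[String]]): Set[String] = {
  def go(pending: List[String], stale: Set[String]): Set[String] = pending match {
    case Nil => stale
    case d :: rest =>
      val fresh = dependents.getOrElse(d, Set.empty).diff(stale)
      go(fresh.toList ::: rest, stale ++ fresh)
  }
  go(List(edited), Set(edited))
}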

Scala’s typechecking: is it too complicated?

Scala’s typechecking is a much more complicated problem than Hack’s because Scala’s type system is much more expressive than Hack’s. However, is it so complicated that Scala can’t draw on Hack’s ideas?

Scala’s typechecking can be expressed in the same three phases as Hack’s:

  1. Parsing
  2. Symbol table computation
  3. Typechecking method bodies

This is almost true. Scala allows return types to be inferred, so you might need to typecheck a method body just to compute the full method signature you want to put into the symbol table. In that case, phases 2 and 3 get mixed together. This is the first challenge.
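For example (the names are made up):

object Config {
  // The signature of `parse` is unknown until its body is typechecked:
  // the return type Array[String] is inferred from the expression.
  def parse(line: String) = line.split('=')
}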

The computation of the symbol table for a Scala program is more complicated for another reason: Scala has path-dependent types. Method signatures can refer to both declared types and terms:

class Storage(val h: Heap) {
  def store(i: h.Item): Unit = …
}

The signature of the store method refers to the type h.Item, which is dependent on the type of val h. In order to check for the existence of the type Item we need to know the signature of the val h. This introduces a dependency between typechecking the signature of def store and that of val h. It’s a kind of dependency between signatures that doesn’t exist in Hack. For that reason, the graph of dynamic dependencies between method signatures in Scala is potentially more complicated, with tricky cyclic dependencies looming. This is the second challenge.

If you’re an experienced Scala programmer you might be thinking “But what about everybody’s favorite magic wands: implicits and macros?”. Fortunately, both are limited to the third phase of typechecking: typechecking of method bodies (expressions). Implicits and macros do not affect parsing or symbol table computation.

To overcome the challenge of inferred return types we can postpone the computation of full signatures until the third phase. That is: in phase two we populate the symbol table with the signatures of all methods but leave an unknown hole wherever there’s an inferred return type, and we schedule the completion of those holes as the initial tasks of the third phase. This works as long as inferred return types are not needed for typechecking other signatures. That’s mostly true, except that signatures can refer to vals to express path-dependent types. If a val has an inferred type and is referred to from some signature as part of a path, then you have no choice but to drop down to expression-level typechecking, which would break the separation between phases 2 and 3. The solution I see is to outlaw such Scala programs: if you have a val that you want to use in a path-dependent type, it can’t have an inferred type; its type has to be declared explicitly.
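Under that restriction, a program like the following would be rejected:

class Heap { type Item }

class Storage {
  val h = new Heap // inferred type, yet used as the prefix of a path below;
                   // it would have to be declared as: val h: Heap = new Heap
  def store(i: h.Item): Unit = ()
}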

The computation of a symbol table in Scala is an expensive task, so ideally you would like to parallelize it. Let’s recall that there are two steps to the symbol table computation:

  1. Enter declared classes and methods
  2. Compute signatures of classes and methods

The first step can be parallelized easily, in the same way as in Hack. What about the second? Hack’s approach is to start computing the signature of an arbitrary method and resolve its dependencies as they’re discovered. The fact that signatures can refer to only a limited number of types makes it less likely that you’ll need to block on many dependencies or duplicate too much work (Hack’s typechecker doesn’t block, but it might run the same computation multiple times on different threads).

Because of path-dependent types and Scala’s rich scoping rules, I suspect it would be hard to follow Hack’s approach and get good performance. Is there another way? Could we schedule the computation of signatures in an order determined by their dependencies? That way we could parallelize as much as possible and avoid paying any synchronization costs. The problem appears circular at the outset: to understand dependencies between signatures you need to resolve identifiers, which requires both local scopes and the member lists of all referred identifiers, and hence their types. Local scopes are cheap to compute. And we don’t need full type information to calculate dependencies: we just need each type’s list of members, where each member’s type can recursively answer the same member queries.

Let’s call types that carry just a list of members outline types. I think outline Scala types can be computed rather cheaply. To compute members you don’t need to perform the really expensive operations involving subtype checks: bound checking, computing least upper bounds, etc. However, you still need to properly substitute type parameters and type members as defined in Scala’s spec in the section on As Seen From (ASF). I’ve taken a closer look at As Seen From recently and shared my notes with examples. There are few mentions of subtyping checks in the rules defining ASF, so the definition of ASF carries over to outline types naturally. The ASF rules refer to base types, and the one rule that deals with types introduced by mixin composition of traits does rely on subtyping checks. This rule answers the question of what happens when you have:

trait A[+T] { def foo: T }
trait B1 extends A[String]
trait B2 extends A[AnyRef]
class C extends B1 with B2

What should the signature of C.foo be? The rules for computing base types determine just one instance of the type A[T] (namely A[String]) as a base type of C. However, it seems to me that outline types could simply take all ancestors as base types and not try to unify them in any way. That would mean we’d get duplicated members with different signatures and would have to pick among them non-deterministically. The algorithm would be similar to how non-deterministic automata work in regular expression matching. The rest of the Scala spec that is relevant to determining the members of a type hardly mentions subtyping.

Outline types would miss some typechecking errors, but that’s not their job anyway. The idea of outline types is to compute, cheaply, just enough typing information that accurate dependencies between types and signatures can be extracted.

I believe Hack’s approach to achieving highly parallel typechecking can be brought to Scala. The main insight is to track dependencies precisely and cheaply so that typechecking tasks can be scheduled in the order of their data dependencies. The overall structure of three phases:

  1. Parsing
  2. Symbol table computation
  3. Typechecking method bodies

can be preserved. However, the second phase has to be tweaked by the introduction of outline types that serve as bags of members (on which one can recursively query for members). For that reason, the symbol table computation would consist of three steps:

  1. Enter declared classes and methods
  2. Compute outline types
  3. Compute signatures of classes and methods

The second step lets the typechecking algorithm schedule tasks according to data dependencies so the third (heavyweight) step can be parallelized as much as possible.
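To make the idea concrete, here’s a hypothetical sketch of the shape an outline type could take: nothing but a bag of members, each of which can recursively answer further member queries, with no subtyping machinery involved.

final case class OutlineMember(name: String, tpe: () => OutlineType) // lazy to allow cycles
final case class OutlineType(name: String, members: List[OutlineMember]) {
  // The only query outline types need to support: look up members by name.
  // Duplicated members (e.g. from mixin composition) are all kept.
  def membersNamed(name: String): List[OutlineMember] = members.filter(_.name == name)
}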

Summary

I believe the architecture of Hack’s highly parallel and distributed typechecker can be stolen and brought to Scala. Scala’s rich type system poses some challenges that Hack’s typechecker implementors didn’t need to worry about, but I don’t see inferred return types or path-dependent types as showstoppers.

The architecture I described briefly here is different from the architecture of either scalac (the Scala 2.x compiler) or dotc (the research Scala compiler). However, the prospect of building a Scala compiler service that can efficiently deal with codebases of enormous size (millions of lines of code) is appealing enough to give an alternative architecture a try.