Haskell compilation pipeline and STG language

Published in

Superstring Theory

10 min readSep 15, 2019

Haskell to .Net compiler series: part 1

We’ve started exploring Haskell to .Net platform compilation in frames of a larger project and decided to share learnings, ideas and ask for feedback as we progress. We at Superstring Solutions firmly believe that functional programming in general and Haskell as it’s “flag bearer” in particular present a much better approach to developing complex real-world systems than predominantly imperative approach in the industry. There are 2 big hurdles for this adoption: lack of comprehensible learning resources, especially at school or undergrad level, and lack of GUI / Web integration (even though there’s significant horsepower available today for both with functional reactive programming and various web frameworks and haskell-2-js compilers). To address the latter, we think that having Haskell compile to .Net with easy seamless interoperability with existing .Net libraries would significantly boost Haskell’s popularity with “mainstream” developers and help with wider industry adoption.

Welcome to the Haskell to .Net compiler series! It will be useful to all of you who want to hack GHC, write your own compiler or are generally interested in advanced functional programming topics. In this post, we:

describe STG language and its’ data types in some detail with reference links
show how to get to STG from Haskell source via GHC plugins and in a standalone program using GHC API
Sketch .Net compilation approaches

To wet your appetite, here is the status of compiling the map function to pseudo-C# code we use in this very early design exploration stage:

Haskell → STG

Pseudo-C#

The compiler is (and will be) based on GHC (duh!), whose compilation pipeline is described from top level here. In short, it goes from Haskell source to Core language to STG language to C-minus-minus to various backend code generators (llvm and native), with desugaring, typechecking and a bunch of optimizations applied along the way. Most of this functionality is readily available via GHC API, which unfortunately has a couple of annoying problems:

It is not documented very well (do read the source files if you want to get serious, comments there are very helpful)
The API changes slightly from version to version, which makes existing resources describing how to use it obsolete with persistent regularity (e.g., we did use this excellent Stephen Diehl’s post series, as everything by this author, but it is already quite obsolete and required a lot of dancing around with types to get the current GHC version to work)

We will compile to .Net from STG, not Core and not Cmm due to a number of reasons that will be touched upon later on, hence our focus on STG here. In short, Core is too high-level and Cmm is too low-level for .Net CLR, but there are other considerations.

Everything below is based on GHC 8.6.5, but we will do our best to explain general principles which allow easy keeping up with the changes.

STG and Basic Design

STG stands for Spineless Tagless Graph (reduction) machine, a very small functional language and an approach (probably state of the art, certainly the most studied and tried in production due to being central to GHC compiler) to compile various functional languages to real hardware. It is described by Simon Peyton Jones and others in several papers, out of which you may want to read 2: “Spineless Tagless G-machine” and “Push/Enter vs Eval / Apply” (curiously, in the first STG machine function calls were implemented via push/enter, but after the 2nd GHC implementation was changed to eval / apply as more superior). All of Haskell gets compiled via this tiny language.

Very briefly summarizing key STG features that make it an excellent source for compilation:

It’s a very small purely functional language, which also has explicitly defined operational semantics, which makes mapping to hardware or low-level virtual machines (e.g., .Net or Java) pretty straightforward
All function applications and constructor arguments are simple variables or constants, while all data constructor applications and operator calls are fully saturated
All pattern matching is performed by case statements, and evaluation (or graph reduction) is driven also only by them; heap is allocated only by let statements
All objects are uniformly represented at runtime as an info pointer, code pointer and payload, and have only 3 types: FUN — function object (so, a function value), CON — data constructor application (so, a data value) and a THUNK — an as yet unevaluated suspension (which are ubiquitous due to laziness of Haskell) — more on this below

STG is easy to understand, both papers referenced above are very accessible, but the first thing that will jump at you after you’ve read them and understood STG principles is this (comments added by me):

data GenStgExpr bndr occ = 
    StgApp occ [GenStgArg occ] -- function application
  | StgLit Literal -- literal
  | StgConApp DataCon [GenStgArg occ] [Type] -- data constructor application
  | StgOpApp StgOp [GenStgArg occ] Type -- primitive operator or foreign call application
-- only used during core to stg passes:
  | StgLam (NonEmpty bndr) StgExpr 
-- case expressions:  
  | StgCase (GenStgExpr bndr occ) bndr AltType [GenStgAlt bndr occ] -- let bindings:
  | StgLet (GenStgBinding bndr occ) (GenStgExpr bndr occ)
  | StgLetNoEscape (GenStgBinding bndr occ) (GenStgExpr bndr occ)
  | StgTick (Tickish bndr) (GenStgExpr bndr occ)

… current main data type encoding STG expressions, which is significantly more cluttered and confusing than the STG description given in the papers, and then there are bindings and closures with their own data types to deal with on top of it. However, here we are primarily interested in StgApp — function application;StgConApp — constructor application that maps to CON objects at runtime; StgCase — case expressions that drive evaluation and pattern matching inside function definitions; and StgLet — let bindings that allocate heap objects for closures.

Let’s look at it and dissect the types top to bottom, step by step, to be able to write our compiler.

So what is an STG program?

Full type hierarchy to unpack and analyze an STG program: GenStgTopBinding → GenStgBinding describe bindings to .. → GenStgRhs closures, which in turn contain .. → GenStgExpr expressions. All of these types are currently parametrized by Var type for both bndr and occ type variables.

STG program is merely a list of top level bindings of closures to variables given by the type [GenStgTopBinding bndr occ] (here and below links to documentation are given right over the code piece). You can get STG program from the core program by calling coreToStg function; we describe how exactly to arrive there from Haskell source code in detail in the next section, for now let’s continue looking at the types. This type, as all other STG types as of the time of this writing, are parametrized by theVar type for both bndr and occ type variables, which encodes variable name and type, so de-facto the program is of type [GenStgTopBinding Var Var].

Now, all of the types in STG have some constructors that we will ignore below because they are not essential for understanding of how to manipulate STG and even write basic compilers. They are mostly used for Cmm specific code generation optimizations and we omit them to reduce clutter.

Inside GenStgTopBinding there is StgTopLifted (GenStgBinding bndr occ) constructor, which we want to simply unpack to get to the actual type of bindings:

data GenStgBinding bndr occ
  = StgNonRec bndr (GenStgRhs bndr occ)
  | StgRec    [(bndr, GenStgRhs bndr occ)]

In essence StgNonRec var rhs describes binding closures rhs::GenStgRhs Var Var to var::Var variable, with recursive and non-recursive variants, pretty straightforward.

Closures

Now, closures are more interesting — as mentioned above, they are the only type of heap objects in the STG machine, and here is how they are represented:

data GenStgRhs bndr occ
  = StgRhsClosure
        CostCentreStack         -- CCS to be attached (default is CurrentCCS)
        StgBinderInfo           -- Info about how this binder is used (see below)
        [occ]                   -- non-global free vars; a list, rather than
                                -- a set, because order is important
        !UpdateFlag             -- ReEntrant | Updatable | SingleEntry
        [bndr]                  -- arguments; if empty, then not a function;
                                -- as above, order is important.
        (GenStgExpr bndr occ)   -- body


  | StgRhsCon
        CostCentreStack  -- CCS to be attached (default is CurrentCCS).
                         -- Top-level (static) ones will end up with
                         -- DontCareCCS, because we don't count static
                         -- data in heap profiles, and we don't set CCCS
                         -- from static closure.
        DataCon          -- Constructor. Never an unboxed tuple or sum, as those
                         -- are not allocated.
        [GenStgArg occ]  -- Args

Remember that bndr and occ are always of type Var as of now.

StgRhsCon represents a “normal” value — data constructor DataCon application to a list of GenStgArg Var — a type that describes arguments to function, operator and constructor applications — a very simple sum type between Var and a literal. Remember that constructor applications in STG are always saturated, so you don’t need to deal with partial applications — and this translates to CON heap objects at runtime. Just 4 translates to something like StgRhsCon <'Just' DataCon rep> [StgLitArg 4] (not actual code!)

StgRhsClosure ccs binfo freeVars::[Var] flag::UpdateFlag vars::[Var] expr::GenStgExpr Var Var describes everything else — so, a function or a thunk. Ignore Cost Center and binder info stuff, they are used in code gen optimizations, and then signature above encodes STG code similar to: {xf1 ... xfn} \r {x1 ... xm} -> expr, where xfi are free variables, xj are arguments and expr::GenStgExpr Var Var is actual code.

This closure representation is key to understanding how STG machine functions, so we look at it in more detail.

This closure form has several interesting properties:

Free variables freeVars of the code represented by expr are referenced right here together with the code, so at runtime there’s no need to look them up in some global table, they are passed explicitly (for details please refer to the papers mentioned above)
Update flag can be either ReEntrant — which encodes a “normal” function with lambda-bound arguments vars and which may or may not have free variables freeVars (top level functions never have free variables)
Updatable — which encodes a THUNK — a lazy, as of yet unevaluated suspension, that normally has free variables and a code to execute (expr) but does not have arguments vars(otherwise it would be a function). At runtime, THUNK in STG is represented as a self-updating structure, so when we need a value, it is entered, evaluated and the pointer is updated with the calculated value for much faster subsequent access — hence the name call-by-need that is used to describe Haskell’s evaluation strategy (as opposed to much less efficient call-by-name or strict call-by-value)
SingleEntry — this is also a THUNK, but such that the compiler was able to prove it will be entered at most once, so it does not need to be updated. We simply evaluate it to normal form when its’ value is needed and then let garbage collector collect it — a very nice optimization increasing overall code efficiency

Then finally we get to GenStgExpr Var Var type that describes the actual code and listed in the beginning of the section. We will look at this key data type, how to deconstruct it and use it for compilation to .Net or other targets in more detail in the next post in this series, while now let’s see how we can get to STG from Haskell source!

Summarizing, STG program is a list of bindings of closures to variables. Closures are represented uniformly at runtime as a code pointer with code generated from GenStgExpr expressions, info pointer that tells runtime system what kind of closure this is, and payload with free vars and arguments. Closures can be of types FUN — function, CON — constructor application, so a value, and a THUNK — as yet unevaluated, self-updating suspension that will be evaluated when it's value is needed.

In our .Net compiler, we will mostly reuse FUN and THUNK ideas from the existing STG machine, but will represent CON quite differently via .Net CLR types to provide for much easier interop with .Net libraries and rely on inherently typed .Net CLR virtual machine optimizations. This is the key difference and contribution of our work compared to existing GHC approach. We will describe .Net STG machine design ideas (which will not really be tagless due to preserving some type information) in the future posts of the series.

Haskell to STG

There are (at least) two ways to compile Haskell program to STG with the help of GHC — as a standalone program using GHC API or via GHC plugin mechanism. The latter seems cleaner and we explored it initially, but unfortunately at this time there seems to be no way to tell GHC at which point in the compilation pipeline to insert the plugin, which doesn’t allow running some key optimizations of the Core program, so we ended up using the standalone program. However, here we briefly describe both.

Via Plugin

Do read the relevant section in GHC manual, it’s quite informative. Full code to get to STG from inside the Core plugin is this (imports take more space than code itself):

GHC operates a structure of type ModGuts while compiling a source file to Core Program, contained in mg_binds field of ModGuts. It contains loads of info and we plan to use some type information directly from there in our .Net compilation process, to be described at a later stage. Our plugin gets this ModGuts object at some stage during Core to Core passes of GHC compiler and then we compile it to STG (after a bit of dancing with getting relevant variables from the environment) via these 3 lines:

(prep, _) <- liftIO $ corePrepPgm env' mod loc core tcs
let (stg,_) = coreToStg dflags' mod prep 
stg_binds2 <- liftIO $ stg2stg dflags' stg

To run the compilation with the plugin, you launch ghc with relevant options: stack ghc — -fplugin Plugin Example.hs — but remember to register the module containing the plugin in the package database first, otherwise it will not be found.

Now, as mentioned, the problem with this approach is that we want to run core2core passes as well, before compiling to STG, otherwise we end up missing some important optimizations. The line of code we need to run is: guts’ <- liftIO $ core2core env’ coreMod but unfortunately we cannot call core2core from inside core2core for obvious reasons. The only way to resolve it at this point is to go

Via standalone program

We omit imports etc which is essentially the same as in the plugin option and only give the relevant main function that loads the module (file) separately from scratch:

This is obviously just a starting point and you can experiment quite a bit with ModGuts, flags etc, but at least this function sequence gets you through most of the compilation pipeline down to optimized STG — which we will be using as our primary source for further transformation into intermediate .Net representation type, which will subsequently be used for code generation.

More on this in the next posts of the series.

Appreciate your comments and feedback, as usual!