Writing CLR(1) Parsers in Swift

Markus Kasperczyk
11 min read · Oct 30, 2023

Disclaimer

I usually don’t write disclaimers for my articles, because they tend to cover topics on which there’s plenty of material to learn from, so I trust people’s judgement. This time, though, it was difficult for me to find all the sources I needed for this project. I don’t have the feeling that I understood it very deeply, and therefore I don’t encourage anyone to rely on the information provided here for their learning.

Here are the most instructive sources that I could dig up that don’t require reading entire books of technical specifications:

All I did here was try to apply what I think I learned from these to write a small compiler-compiler in Swift — for grammar specifications written in Swift. If you want to understand this article, I highly recommend you first read/watch the above.

Rethinking Parsers

When I was done with my last parser lib and had written an article about it, I soon learned that my approach was known as top-down parsing, which is not only slow (the runtime can be worse than linear in the length of the input), but also not very general. For instance, it can’t handle a left-recursive grammar like

A -> Aa | a

I quickly learned that bottom-up parsers can be way more general and, being simple state machines, way more efficient.

Unfortunately, in order to code them, I had to abandon my nice little parser lib: I needed to know exactly what the terminals and non-terminals in my grammar were. But I still wanted it all to be Swift-y. So, I came up with a design.

What We Want

What we really want is for our rule set to be an enum with a computed property that says what each rule does. As stated before, we also need to know what the terminals and non-terminals are, so we make them enums as well. Let’s go ahead and do that:


public protocol Terminal : RawRepresentable, CaseIterable, Codable, Hashable
where RawValue == Character {}


public protocol NonTerminal : RawRepresentable, CaseIterable, Codable,
Hashable where RawValue == String {}

// we need a type that can hold terminals and non-terminals

public enum Expr<T : Hashable, NT : Hashable> : Hashable {
case term(T)
case nonTerm(NT)
}

// this encodes what a rule "does"

public struct Rule<T : Terminal, NT : NonTerminal> {
public let lhs : NT
public let rhs : [Expr<T, NT>]
public init(_ lhs: NT, expression rhs: Expr<T, NT>...) {
self.lhs = lhs
self.rhs = rhs
}
init(_ lhs: NT, rhs: [Expr<T, NT>]) {
self.lhs = lhs
self.rhs = rhs
}
}

// this is our main type

public protocol Rules : RawRepresentable, CaseIterable, Codable, Hashable
where RawValue == String {
associatedtype Term : Terminal
associatedtype NTerm : NonTerminal
static var goal : NTerm {get}
var rule : Rule<Term, NTerm> {get}
}

Now, we don’t want to constantly write .term(…) or .nonTerm(…), so we define prefix operators:


prefix operator /

public prefix func /<T, NT>(_ t: T) -> Expr<T, NT> {
.term(t)
}

public prefix func /<T, NT>(_ nt: NT) -> Expr<T, NT> {
.nonTerm(nt)
}

Et voilà, we can write simple grammars:

enum NonLLTerminal : Character, Terminal {
case a = "a"
}

enum NonLLNTerminal : String, NonTerminal {
case A
}

enum NonLL : String, Rules {
case ouch
case term
static var goal : NonLLNTerminal {.A}
var rule : Rule<NonLLTerminal, NonLLNTerminal> {
switch self {
case .ouch:
Rule(.A, expression: /.A, /.a)
case .term:
Rule(.A, expression: /.a)
}
}
}

Let’s think ahead for a moment. According to Wikipedia, the output of a bottom-up parser is just a stack of the rules that the parser has recognized. So, we need a stack type:


public struct Stack<T> {

var rep : [T] = []
public init() {}

public mutating func push(_ t: T) {
rep.append(t)
}

public mutating func pop() -> T? {
guard peek() != nil else {return nil}
return rep.removeLast()
}

public func peek() -> T? {
rep.last
}

}

It would be way nicer, though, if we could pop those rules from the stack on the fly during parsing, as soon as a new rule is detected. That way, we could construct our ultimate output, e.g. an AST, during parsing. Moreover, it would be super convenient to specify this construction process right next to the rule definition.

So… Let’s do this!


public protocol Constructions : Rules {
associatedtype Term
associatedtype NTerm
associatedtype Output
var construction : Construction<Term, NTerm, Output> {get}
}

public extension Constructions {
var rule: Rule<Term, NTerm> {
let cons = construction
return Rule(cons.lhs, rhs: cons.rhs)
}
}

public struct Construction<T : Terminal, NT : NonTerminal, Output> {
public let lhs : NT
public let rhs : [Expr<T, NT>]
public let parse : (inout Stack<Output>) throws -> Void
public init(_ lhs: NT, expression rhs: Expr<T, NT>...,
parse: @escaping (inout Stack<Output>) throws -> Void) {
self.lhs = lhs
self.rhs = rhs
self.parse = parse
}
}

Super nice! Now, we can already write stuff like this:


enum MyTerm : Character, Terminal {
case zero = "0"
case one = "1"
case plus = "+"
case times = "*"
}

enum MyNTerm : String, NonTerminal {
case E
case B
}

extension String : Error {}

enum MyRules : String, Constructions {

case eTimes
case ePlus
case eB
case bZero
case bOne

static var goal: MyNTerm {.E}

var construction: Construction<MyTerm, MyNTerm, eAST> {
switch self {
case .eTimes:
Construction(.E, expression: /.E, /.times, /.B) {stack in
guard let eb = stack.pop(),
let b = eb.asB,
let e = stack.pop() else {
throw "Ouch"
}
stack.push(.times(e, b))
}
case .ePlus:
Construction(.E, expression: /.E, /.plus, /.B) {stack in
guard let eb = stack.pop(),
let b = eb.asB,
let e = stack.pop() else {
throw "Ouch"
}
stack.push(.plus(e, b))
}
case .eB:
Construction(.E, expression: /.B) {stack in
guard let eb = stack.peek(),
nil != eb.asB else {
throw "Ouch"
}
}
case .bZero:
Construction(.B, expression: /.zero) {stack in
stack.push(.b(.zero))
}
case .bOne:
Construction(.B, expression: /.one) {stack in
stack.push(.b(.one))
}
}
}
}

// parser output defined here

enum bAST : Equatable {
case zero
case one
}

indirect enum eAST : Equatable {
case b(bAST)
case plus(Self, bAST)
case times(Self, bAST)
var asB : bAST? {
guard case .b(let b) = self else {
return nil
}
return b
}
}

This looks promising! Now let’s go ahead and define the parser. Bottom-up LR parsers generally consist of two tables: the action and goto tables.

Actions can be a shift (push a state onto the state stack and move to the next input character), a reduce (pop as many states from the state stack as there are symbols on the right-hand side of the newly recognized rule), or an accept (the stream has successfully been parsed).


public enum Action<R : Rules> : Codable, Equatable {
case shift(Int)
case reduce(R)
case accept
}

Gotos, on the other hand, always make us push a state onto the state stack and can therefore be modeled simply by integers. A goto is always performed right after a reduction.
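
To see how actions and gotos interact, here is the step sequence for the E/B grammar above on the input 1+0. Concrete state numbers depend on how the tables are constructed, so they are omitted:

```
input: 1 + 0

shift  '1'
reduce B -> 1        pop 1 state, then goto on B
reduce E -> B        pop 1 state, then goto on E
shift  '+'
shift  '0'
reduce B -> 0        pop 1 state, then goto on B
reduce E -> E + B    pop 3 states, then goto on E
accept               (next symbol is end of input)
```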

Let’s now write down what a parser is and does.


public struct Parser<R : Rules> : Codable, Equatable {

// actions happen for each terminal or eof (modeled as nil)
public let actions : [R.Term? : [Int : Action<R>]]
// gotos happen for each recognized rule
public let gotos : [R.NTerm : [Int : Int]]

}

public extension Parser {

func parse<Out>(_ stream: String,
do construction: (R, inout Stack<Out>) throws -> Void)
throws -> Stack<Out> {

// we need fine grained control when to move ahead
var iterator = stream.makeIterator()
var current = iterator.next()

var stateStack = Stack<Int>()
stateStack.push(0)

var outStack = Stack<Out>()

loop:
while true {

// nil is a valid value, but some characters are not
let term = try current.map{char in
guard let res = R.Term(rawValue: char) else {
throw InvalidChar(char: char)
}
return res
}
// we need a state
guard let stateBefore = stateStack.peek() else {
throw UndefinedState()
}
// we need an action
guard let dict = actions[term] else {
throw InvalidChar(char: current ?? "$")
}
guard let action = dict[stateBefore] else {
throw UndefinedState()
}

switch action {

// accept input character and push new state onto stack
case .shift(let shift):
stateStack.push(shift)
current = iterator.next()

// reduce and goto
case .reduce(let reduce):
let rule = reduce.rule
for _ in rule.rhs {
_ = stateStack.pop()
}
guard let stateAfter = stateStack.peek() else {
throw UndefinedState()
}
try construction(reduce, &outStack)
guard let nextState = gotos[rule.lhs]?[stateAfter] else {
throw NoGoTo(nonTerm: rule.lhs, state: stateAfter)
}
stateStack.push(nextState)

// string has successfully been parsed
case .accept:
break loop
}

}
return outStack
}

func buildStack(_ stream: String) throws -> Stack<R> {
try parse(stream, do: {$1.push($0)})
}

func parse(_ stream: String) throws -> R.Output? where R : Constructions {
var stack = try parse(stream, do: {try $0.construction.parse(&$1)})
return stack.pop()
}

}
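
The parser above throws a handful of error types that never appear in the article. They carry no logic, so a minimal sketch is easy to reconstruct — note that the names and stored properties here are inferred from the call sites and may differ from the original project:

```swift
// Inferred from the call sites in `parse`; the original
// project's definitions may differ in detail.
struct InvalidChar: Error {
    let char: Character // the offending input character ("$" stands for eof)
}

struct UndefinedState: Error {} // empty state stack, or no action for a state

struct NoGoTo<NT>: Error { // no goto entry after a reduction
    let nonTerm: NT
    let state: Int
}

struct ShiftReduceConflict: Error {}

struct ReduceReduceConflict<R>: Error {
    let matching: [R] // the rules that could all be reduced here
}

struct AcceptConflict: Error {} // "should not happen"
```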

That was a bit tricky, but ultimately we just do what we said we’d do.

Now comes the actually tricky part: initializing parsers just from grammars.

Graph Closures

Here’s the nice part: a lot of what follows can be expressed in terms of graph closures. Basically, whenever the graph has edges leading out of it, we add the target nodes to the graph, and we repeat this until nothing new is reachable. There’s just one peculiarity:


public protocol Node : Hashable {
associatedtype Edge : Hashable
func canReach() throws -> [Edge: [Self]]
}

Instead of an edge always leading to one new node, an edge can lead to multiple nodes. Basically, you can think of the type Edge as a set of labels, where each label stands for a whole bunch of edges.
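
To make that concrete, here is a toy conformance (the protocol is repeated so the snippet stands alone): nodes are integers, and the single edge label "children" reaches two successor nodes at once.

```swift
// The Node protocol, repeated here so the snippet is self-contained.
protocol Node: Hashable {
    associatedtype Edge: Hashable
    func canReach() throws -> [Edge: [Self]]
}

// A toy conformance: from n, the one edge label "children"
// reaches *two* nodes at once, 2n and 2n + 1.
struct Doubling: Node {
    let n: Int
    func canReach() -> [String: [Doubling]] {
        guard n < 4 else { return [:] } // stop expanding at some bound
        return ["children": [Doubling(n: 2 * n), Doubling(n: 2 * n + 1)]]
    }
}
```

Seeding the closure below with [Doubling(n: 1)] would collect the nodes 1 through 7 and record, per start node, every end node reachable under "children".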

Now let’s construct a closed graph:


public struct ClosedGraph<N : Node> : Hashable {

public let nodes : [N]
public let edges : [N.Edge : [Int : [Int]]]

public init(seeds: [N]) throws {

// local variables so they're mutable
var nodes = seeds
var edges : [N.Edge : [N : Set<N>]] = [:]

// we need these to keep track of nodes to expand
var newNodes = seeds
var seenNodes : Set<N> = Set(seeds)

while true {

// we need this to keep track if there are *next* nodes
// to expand

var newNewNodes : [N] = []

for start in newNodes {
for (content, ends) in try start.canReach() {
for end in ends {

if !seenNodes.contains(end) {
seenNodes.insert(end)
newNewNodes.append(end)
}
if edges[content] == nil {
edges[content] = [start : [end]]
}
else if edges[content]![start] == nil {
edges[content]![start] = [end]
}
else {
edges[content]![start]!.insert(end)
}
}
}
}

newNodes = newNewNodes
if newNodes.isEmpty {
break
}
nodes.append(contentsOf: newNodes)
}

self.nodes = nodes

// every edge must now begin and end at a node in the nodes array
self.edges = edges.mapValues{dict in
Dictionary(uniqueKeysWithValues: dict.map{key, vals in
(nodes.firstIndex(of: key)!,
vals.map{nodes.firstIndex(of: $0)!})})
}
}

}

A crucial property that we unfortunately cannot encode at the type level without rewriting the type: if each node’s edges only ever point to one target node, then this carries over to our closed graph. This is why you will see some assertions in the next section.

LR(0) Items

I will not show the construction of the LR(1) items, because they’re very similar to the conceptually somewhat simpler LR(0) items. The other reason is that I’m not too confident (yet) about my implementation of the LR(1) items, as I haven’t tested them enough. My implementation can be found here. Reviews and PRs are highly welcome.

Let us have a look at the LR(0) items though.


// MARK: LR(0) ITEMS

fileprivate struct Item<R : Rules> : Node {

let rule : R? // augmented rule has tag nil
let all : [Expr<R.Term, R.NTerm>] // rhs of the rule
let ptr : Int // represents the next position to parse

// we can reach any "unparsed" rule beginning with
// the next symbol, provided the symbol is a non-terminal

func canReach() -> [R.NTerm : [Item<R>]] {
guard let next = tbd.first, case .nonTerm(let nT) = next else {
return [:]
}
return [nT : R.allCases.compactMap {rule in
let ru = rule.rule
guard ru.lhs == nT else {return nil}
return Item(rule: rule, all: ru.rhs, ptr: 0)
}]
}

}

// MARK: LR(0) ITEM SETS

fileprivate struct ItemSet<R : Rules> {
// auto generated graph closure!
let graph : ClosedGraph<Item<R>>
}

// MARK: HELPERS

extension Item {

func tryAdvance(_ expr: Expr<R.Term, R.NTerm>) -> Item<R>? {
tbd.first.flatMap{$0 == expr ?
Item(rule: rule, all: all, ptr: ptr + 1) :
nil
}
}
var tbd : some Collection<Expr<R.Term, R.NTerm>> {
all[ptr...]
}

}
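
For intuition, here is how an item advances (a sketch only — Item is fileprivate in the project, and the rule tags refer to the MyRules grammar from earlier):

```
// The item E -> E + · B, i.e. we have parsed "E +" so far:
let item = Item<MyRules>(rule: .ePlus,
                         all: [/.E, /.plus, /.B],
                         ptr: 2)

item.tbd                 // [.nonTerm(.B)] — what is still to be parsed
item.tryAdvance(/.B)     // the item E -> E + B · : fully parsed,
                         // a candidate for reduction
item.tryAdvance(/.plus)  // nil — '+' is not the next expected symbol
```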

// ItemSets form nodes in a higher order graph
extension ItemSet : Node {

// we can reach an item set by "advancing" the items
// here, we can already detect shift-reduce conflicts
// and reduce-reduce conflicts

func canReach() throws -> [Expr<R.Term, R.NTerm> : [ItemSet<R>]] {
let exprs = Set(graph.nodes.compactMap(\.tbd.first))
if exprs.isEmpty {
_ = try reduceRule()
return [:]
}
guard try reduceRule() == nil else {throw ShiftReduceConflict()}
return try Dictionary(uniqueKeysWithValues: exprs.map{expr in
try (expr,
[ItemSet(graph: ClosedGraph(seeds: graph.nodes
.compactMap{$0.tryAdvance(expr)}))])
})
}
}

// MARK: LR(0) GRAPH

fileprivate struct ItemSetTable<R : Rules> {

let graph : ClosedGraph<ItemSet<R>>

init(rules: R.Type) throws {
// our initial state is the augmented rule, tagged by nil
let augmentedRule = Item<R>(rule: nil,
all: [.nonTerm(R.goal)],
ptr: 0)
// now, do both graph closures
let itemSetGraph = try ClosedGraph(seeds: [augmentedRule])
graph = try ClosedGraph(seeds: [ItemSet(graph: itemSetGraph)])
}

}

// MARK: HELPER

extension ItemSet {

func reduceRule() throws -> R? {
let results : [R] = graph.nodes.lazy.filter(\.tbd.isEmpty)
.compactMap(\.rule)
if results.count > 1 {
throw ReduceReduceConflict(matching: results)
}
return results.first
}


}

// MARK: ACTION + GOTO TABLES

extension ItemSetTable {

func actionTable() throws -> [R.Term? : [Int : Action<R>]] {

// shifts

let keyAndVals = graph.edges
.compactMap{(key : Expr<R.Term, R.NTerm>, vals : [Int : [Int]])
-> (R.Term, [Int : Action<R>])? in
guard case .term(let t) = key else {return nil}
let dict = Dictionary(uniqueKeysWithValues: vals
.map{start, ends in
assert(ends.count == 1)
return (start, Action<R>.shift(ends.first!))
})
return (t, dict)
}

var dict = Dictionary(uniqueKeysWithValues: keyAndVals)
as [R.Term? : [Int : Action<R>]]

for start in graph.nodes.indices {

// reductions

if let rule = try graph.nodes[start].reduceRule() {
for term in Array(R.Term.allCases) as [R.Term?] + [nil] {
if dict[term] == nil {
dict[term] = [start: .reduce(rule)]
}
else {
if dict[term]?[start] != nil {
throw ShiftReduceConflict()
}
dict[term]?[start] = .reduce(rule)
}
}
}

// accepts

if graph.nodes[start].graph.nodes
.contains(where: {$0.rule == nil && $0.tbd.isEmpty}) {
if dict[nil] == nil {
dict[nil] = [start : .accept]
}
else {
if dict[nil]?[start] != nil {
throw AcceptConflict() //should not happen?
}
dict[nil]?[start] = .accept
}
}
}
return dict
}

var gotoTable : [R.NTerm : [Int : Int]] {
Dictionary(uniqueKeysWithValues: graph.edges
.compactMap{(key : Expr<R.Term, R.NTerm>, vals : [Int : [Int]]) in
guard case .nonTerm(let nT) = key else {return nil}
return (nT, vals.mapValues{ints in
assert(ints.count == 1)
return ints.first!
})
})
}

}

// MARK: LR(0) PARSER

public extension Parser {

// Et voilà
static func LR0(rules: R.Type) throws -> Self {
let table = try ItemSetTable(rules: rules)
return Parser(actions: try table.actionTable(),
gotos: table.gotoTable)
}

}
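
Putting it all together, using the MyRules grammar from earlier (a usage sketch; it assumes all the snippets of this article live in one module):

```
// Generate the tables once from the grammar...
let parser = try Parser<MyRules>.LR0(rules: MyRules.self)

// ...then run the constructions to get an AST back:
let ast = try parser.parse("1+0*1")
// should yield .times(.plus(.b(.one), .zero), .one)

// ...or just record the sequence of reductions:
let reductions = try parser.buildStack("1+0")
```

Since Parser is Codable, the generated tables can also be serialized once and shipped with your app instead of being recomputed on every launch.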

The full project can be found here.

Challenges

Something that doesn’t seem possible is writing modular grammars.

Consider this:


protocol Terms0 : Terminal {
static var a : Self {get}
static var b : Self {get}
}

protocol NTerms0 : NonTerminal {
static var E : Self {get}
static var B : Self {get}
}

enum Rules0<Term : Terms0, NTerm : NTerms0> : String, Rules {
case aE
...
}

protocol Terms1 : Terms0 {
static var c : Self {get}
static var d : Self {get}
}

protocol NTerms1 : NTerms0 {
static var Foo : Self {get}
static var Bar : Self {get}
}

enum Rules1<Term: Terms1, NTerm: NTerms1> : Rules {
case R0(Rules0<Term, NTerm>)
case FooE
var rawValue: String {
// have to implement :(
}
init?(rawValue: String) {
// have to implement :(
}
static var allCases: [Rules1<Term, NTerm>] {
// have to implement :(
}
...
}

Unfortunately, if we try to modularize our grammar like this, we have to manually implement rawValue and allCases, which are completely obvious boilerplate. Additionally, we have to implement init?(rawValue:), which is also boilerplate, except that we actually have to check that the raw values are unique.
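
For illustration, that boilerplate would look roughly like this (sketched against the hypothetical Rules1 above; a real implementation should additionally verify that the raw values of the two enums don’t collide):

```
extension Rules1 {
    var rawValue: String {
        switch self {
        case .R0(let rules): return rules.rawValue // delegate to Rules0
        case .FooE: return "FooE"
        }
    }
    init?(rawValue: String) {
        if let rules = Rules0<Term, NTerm>(rawValue: rawValue) {
            self = .R0(rules)
        } else if rawValue == "FooE" {
            self = .FooE
        } else {
            return nil
        }
    }
    static var allCases: [Rules1<Term, NTerm>] {
        Rules0<Term, NTerm>.allCases.map(Self.R0) + [.FooE]
    }
}
```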

A macro could probably help, but those are a pain in the ass, so I’ll leave that to you as an exercise.
