Writing CLR(1) Parsers in Swift

Markus Kasperczyk
11 min read · Oct 30, 2023

Disclaimer

I usually don’t write disclaimers for my articles, because they tend to cover topics on which there’s plenty of material to learn from, so I trust people’s judgement. This time, though, it was difficult for me to find all the sources I needed for this project. I don’t have the feeling that I understood it very deeply, and therefore I don’t encourage anyone to rely on the information provided here for their learning.

Here are the most instructive sources that I could dig up that don’t require reading entire books of technical specifications:

All I did here was try to apply what I think I learned from these to write a small compiler-compiler in Swift — for grammar specifications written in Swift. If you want to understand this article, I highly recommend you first read/watch the above.

Rethinking Parsers

When I was done with my last parser lib and had written an article about it, I soon learned that my approach was known as top-down parsing, which is not only slow (the runtime can be worse than linear in the length of the input), but also not very general. For instance, it can’t handle a left-recursive grammar like

A -> Aa | a

I quickly learned that bottom-up parsers can be way more general and, being simple state machines, way more efficient.

Unfortunately, in order to code them, I had to abandon my nice little parser lib: I needed to know exactly what the terminals and non-terminals in my grammar were. But I still wanted it all to be Swift-y. So, I came up with a design.

What We Want

What we really want is for our rule set to be an enum with a computed property that says what each rule does. As stated before, we also need to know what the terminals and non-terminals are, so we make them enums as well. Let’s go ahead and do that:


public protocol Terminal : RawRepresentable, CaseIterable, Codable, Hashable
where RawValue == Character {}


public protocol NonTerminal : RawRepresentable, CaseIterable, Codable,
Hashable where RawValue == String {}

// we need a type that can hold terminals and non-terminals

public enum Expr<T : Hashable, NT : Hashable> : Hashable {
case term(T)
case nonTerm(NT)
}

// this encodes what a rule "does"

public struct Rule<T : Terminal, NT : NonTerminal> {
public let lhs : NT
public let rhs : [Expr<T, NT>]
public init(_ lhs: NT, expression rhs: Expr<T, NT>...) {
self.lhs = lhs
self.rhs = rhs
}
init(_ lhs: NT, rhs: [Expr<T, NT>]) {
self.lhs = lhs
self.rhs = rhs
}
}

// this is our main type

public protocol Rules : RawRepresentable, CaseIterable, Codable, Hashable
where RawValue == String {
associatedtype Term : Terminal
associatedtype NTerm : NonTerminal
static var goal : NTerm {get}
var rule : Rule<Term, NTerm> {get}
}

Now, we don’t want to constantly write .term(…) or .nonTerm(…), so we define prefix operators:


prefix operator /

public prefix func /<T, NT>(_ t: T) -> Expr<T, NT> {
.term(t)
}

public prefix func /<T, NT>(_ nt: NT) -> Expr<T, NT> {
.nonTerm(nt)
}

Et voilà, we can write simple grammars:

enum NonLLTerminal : Character, Terminal {
case a = "a"
}

enum NonLLNTerminal : String, NonTerminal {
case A
}

enum NonLL : String, Rules {
case ouch
case term
static var goal : NonLLNTerminal {.A}
var rule : Rule<NonLLTerminal, NonLLNTerminal> {
switch self {
case .ouch:
Rule(.A, expression: /.A, /.a)
case .term:
Rule(.A, expression: /.a)
}
}
}

Let’s think ahead for a moment. According to Wikipedia, the output of a bottom-up parser is just a stack of the rules that the parser has recognized. So, we need a stack type:


public struct Stack<T> {

var rep : [T] = []
public init() {}

public mutating func push(_ t: T) {
rep.append(t)
}

public mutating func pop() -> T? {
guard peek() != nil else {return nil}
return rep.removeLast()
}

public func peek() -> T? {
rep.last
}

}

It would be way nicer, though, if we could pop those rules from the stack on the fly during parsing, as soon as a new rule is detected. That way, we could construct our ultimate output, e.g. an AST, during parsing. Moreover, it would be super convenient to specify this construction process right next to the rule definition.

So… Let’s do this!


public protocol Constructions : Rules {
associatedtype Term
associatedtype NTerm
associatedtype Output
var construction : Construction<Term, NTerm, Output> {get}
}

public extension Constructions {
var rule: Rule<Term, NTerm> {
let cons = construction
return Rule(cons.lhs, rhs: cons.rhs)
}
}

public struct Construction<T : Terminal, NT : NonTerminal, Output> {
public let lhs : NT
public let rhs : [Expr<T, NT>]
public let parse : (inout Stack<Output>) throws -> Void
public init(_ lhs: NT, expression rhs: Expr<T, NT>...,
parse: @escaping (inout Stack<Output>) throws -> Void) {
self.lhs = lhs
self.rhs = rhs
self.parse = parse
}
}

Super nice! Now, we can already write stuff like this:


enum MyTerm : Character, Terminal {
case zero = "0"
case one = "1"
case plus = "+"
case times = "*"
}

enum MyNTerm : String, NonTerminal {
case E
case B
}

extension String : Error {}

enum MyRules : String, Constructions {

case eTimes
case ePlus
case eB
case bZero
case bOne

static var goal: MyNTerm {.E}

var construction: Construction<MyTerm, MyNTerm, eAST> {
switch self {
case .eTimes:
Construction(.E, expression: /.E, /.times, /.B) {stack in
guard let eb = stack.pop(),
let b = eb.asB,
let e = stack.pop() else {
throw "Ouch"
}
stack.push(.times(e, b))
}
case .ePlus:
Construction(.E, expression: /.E, /.plus, /.B) {stack in
guard let eb = stack.pop(),
let b = eb.asB,
let e = stack.pop() else {
throw "Ouch"
}
stack.push(.plus(e, b))
}
case .eB:
Construction(.E, expression: /.B) {stack in
guard let eb = stack.peek(),
nil != eb.asB else {
throw "Ouch"
}
}
case .bZero:
Construction(.B, expression: /.zero) {stack in
stack.push(.b(.zero))
}
case .bOne:
Construction(.B, expression: /.one) {stack in
stack.push(.b(.one))
}
}
}
}

// parser output defined here

enum bAST : Equatable {
case zero
case one
}

indirect enum eAST : Equatable {
case b(bAST)
case plus(Self, bAST)
case times(Self, bAST)
var asB : bAST? {
guard case .b(let b) = self else {
return nil
}
return b
}
}

This looks promising! Now let’s go ahead and define the parser. Bottom-up LR parsers generally consist of two tables: the action and goto tables.

Actions can be a shift (push a state onto the state stack and move to the next input character), a reduce (pop as many states from the state stack as there are symbols on the right-hand side of the newly recognized rule), or an accept (the stream has successfully been parsed).


public enum Action<R : Rules> : Codable, Equatable {
case shift(Int)
case reduce(R)
case accept
}

Gotos, on the other hand, always make us push a state onto the state stack and can therefore be modeled simply by integers. A goto is always performed right after a reduction.
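
To see how actions and gotos interact, here is the step sequence for the E/B grammar above on the input 1+0. Concrete state numbers depend on how the tables are constructed, so they are omitted:

```
input: 1 + 0

shift  '1'
reduce B -> 1        pop 1 state, then goto on B
reduce E -> B        pop 1 state, then goto on E
shift  '+'
shift  '0'
reduce B -> 0        pop 1 state, then goto on B
reduce E -> E + B    pop 3 states, then goto on E
accept               (next symbol is end of input)
```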

Let’s now write down what a parser is and does.


public struct Parser<R : Rules> : Codable, Equatable {

// actions happen for each terminal or eof (modeled as nil)
public let actions : [R.Term? : [Int : Action<R>]]
// gotos happen for each recognized rule
public let gotos : [R.NTerm : [Int : Int]]

}

public extension Parser {

func parse<Out>(_ stream: String,
do construction: (R, inout Stack<Out>) throws -> Void)
throws -> Stack<Out> {

// we need fine grained control when to move ahead
var iterator = stream.makeIterator()
var current = iterator.next()

var stateStack = Stack<Int>()
stateStack.push(0)

var outStack = Stack<Out>()

loop:
while true {

// nil is a valid value, but some characters are not
let term = try current.map{char in
guard let res = R.Term(rawValue: char) else {
throw InvalidChar(char: char)
}
return res
}
// we need a state
guard let stateBefore = stateStack.peek() else {
throw UndefinedState()
}
// we need an action
guard let dict = actions[term] else {
throw InvalidChar(char: current ?? "$")
}
guard let action = dict[stateBefore] else {
throw UndefinedState()
}

switch action {

// accept input character and push new state onto stack
case .shift(let shift):
stateStack.push(shift)
current = iterator.next()

// reduce and goto
case .reduce(let reduce):
let rule = reduce.rule
for _ in rule.rhs {
_ = stateStack.pop()
}
guard let stateAfter = stateStack.peek() else {
throw UndefinedState()
}
try construction(reduce, &outStack)
guard let nextState = gotos[rule.lhs]?[stateAfter] else {
throw NoGoTo(nonTerm: rule.lhs, state: stateAfter)
}
stateStack.push(nextState)

// string has successfully been parsed
case .accept:
break loop
}

}
return outStack
}

func buildStack(_ stream: String) throws -> Stack<R> {
try parse(stream, do: {$1.push($0)})
}

func parse(_ stream: String) throws -> R.Output? where R : Constructions {
var stack = try parse(stream, do: {try $0.construction.parse(&$1)})
return stack.pop()
}

}
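
The parser above throws a handful of error types that never appear in the article. They carry no logic, so a minimal sketch is easy to reconstruct — note that the names and stored properties here are inferred from the call sites and may differ from the original project:

```swift
// Inferred from the call sites in `parse`; the original
// project's definitions may differ in detail.
struct InvalidChar: Error {
    let char: Character // the offending input character ("$" stands for eof)
}

struct UndefinedState: Error {} // empty state stack, or no action for a state

struct NoGoTo<NT>: Error { // no goto entry after a reduction
    let nonTerm: NT
    let state: Int
}

struct ShiftReduceConflict: Error {}

struct ReduceReduceConflict<R>: Error {
    let matching: [R] // the rules that could all be reduced here
}

struct AcceptConflict: Error {} // "should not happen"
```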

That was a bit tricky, but ultimately we just do what we said we’d do.

Now comes the actually tricky part: initializing parsers just from grammars.

Graph Closures

Here’s the nice part: a lot of what follows can be expressed in terms of graph closures. Basically, whenever the graph has edges leading out of it, we add the target nodes to the graph, and we repeat this until nothing new is reachable. There’s just one peculiarity:


public protocol Node : Hashable {
associatedtype Edge : Hashable
func canReach() throws -> [Edge: [Self]]
}

Instead of an edge always leading to one new node, an edge can lead to multiple nodes. Basically, you can think of the type Edge as a set of labels, where each label stands for a whole bunch of edges.
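
To make that concrete, here is a toy conformance (the protocol is repeated so the snippet stands alone): nodes are integers, and the single edge label "children" reaches two successor nodes at once.

```swift
// The Node protocol, repeated here so the snippet is self-contained.
protocol Node: Hashable {
    associatedtype Edge: Hashable
    func canReach() throws -> [Edge: [Self]]
}

// A toy conformance: from n, the one edge label "children"
// reaches *two* nodes at once, 2n and 2n + 1.
struct Doubling: Node {
    let n: Int
    func canReach() -> [String: [Doubling]] {
        guard n < 4 else { return [:] } // stop expanding at some bound
        return ["children": [Doubling(n: 2 * n), Doubling(n: 2 * n + 1)]]
    }
}
```

Seeding the closure below with [Doubling(n: 1)] would collect the nodes 1 through 7 and record, per start node, every end node reachable under "children".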

Now let’s construct a closed graph:


public struct ClosedGraph<N : Node> : Hashable {

public let nodes : [N]
public let edges : [N.Edge : [Int : [Int]]]

public init(seeds: [N]) throws {

// local variables so they're mutable
var nodes = seeds
var edges : [N.Edge : [N : Set<N>]] = [:]

// we need these to keep track of nodes to expand
var newNodes = seeds
var seenNodes : Set<N> = Set(seeds)

while true {

// we need this to keep track if there are *next* nodes
// to expand

var newNewNodes : [N] = []

for start in newNodes {
for (content, ends) in try start.canReach() {
for end in ends {

if !seenNodes.contains(end) {
seenNodes.insert(end)
newNewNodes.append(end)
}
if edges[content] == nil {
edges[content] = [start : [end]]
}
else if edges[content]![start] == nil {
edges[content]![start] = [end]
}
else {
edges[content]![start]!.insert(end)
}
}
}
}

newNodes = newNewNodes
if newNodes.isEmpty {
break
}
nodes.append(contentsOf: newNodes)
}

self.nodes = nodes

// every edge must now begin and end at a node in the nodes array
self.edges = edges.mapValues{dict in
Dictionary(uniqueKeysWithValues: dict.map{key, vals in
(nodes.firstIndex(of: key)!,
vals.map{nodes.firstIndex(of: $0)!})})
}
}

}

A crucial property that we unfortunately cannot encode at the type level without rewriting the type: if each node’s edges only ever point to one target node, then this carries over to our closed graph. This is why you will see some assertions in the next section.

LR(0) Items

I will not show the construction of the LR(1) items, because they’re very similar to the conceptually somewhat simpler LR(0) items. The other reason is that I’m not too confident (yet) about my implementation of the LR(1) items, as I haven’t tested them enough. My implementation can be found here. Reviews and PRs are highly welcome.

Let us have a look at the LR(0) items though.


// MARK: LR(0) ITEMS

fileprivate struct Item<R : Rules> : Node {

let rule : R? // augmented rule has tag nil
let all : [Expr<R.Term, R.NTerm>] // rhs of the rule
let ptr : Int // represents the next position to parse

// we can reach any "unparsed" rule beginning with
// the next symbol, provided the symbol is a non-terminal

func canReach() -> [R.NTerm : [Item<R>]] {
guard let next = tbd.first, case .nonTerm(let nT) = next else {
return [:]
}
return [nT : R.allCases.compactMap {rule in
let ru = rule.rule
guard ru.lhs == nT else {return nil}
return Item(rule: rule, all: ru.rhs, ptr: 0)
}]
}

}

// MARK: LR(0) ITEM SETS

fileprivate struct ItemSet<R : Rules> {
// auto generated graph closure!
let graph : ClosedGraph<Item<R>>
}

// MARK: HELPERS

extension Item {

func tryAdvance(_ expr: Expr<R.Term, R.NTerm>) -> Item<R>? {
tbd.first.flatMap{$0 == expr ?
Item(rule: rule, all: all, ptr: ptr + 1) :
nil
}
}
var tbd : some Collection<Expr<R.Term, R.NTerm>> {
all[ptr...]
}

}
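
For intuition, here is how an item advances (a sketch only — Item is fileprivate in the project, and the rule tags refer to the MyRules grammar from earlier):

```
// The item E -> E + · B, i.e. we have parsed "E +" so far:
let item = Item<MyRules>(rule: .ePlus,
                         all: [/.E, /.plus, /.B],
                         ptr: 2)

item.tbd                 // [.nonTerm(.B)] — what is still to be parsed
item.tryAdvance(/.B)     // the item E -> E + B · : fully parsed,
                         // a candidate for reduction
item.tryAdvance(/.plus)  // nil — '+' is not the next expected symbol
```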

// ItemSets form nodes in a higher order graph
extension ItemSet : Node {

// we can reach an item set by "advancing" the items
// here, we can already detect shift-reduce conflicts
// and reduce-reduce conflicts

func canReach() throws -> [Expr<R.Term, R.NTerm> : [ItemSet<R>]] {
let exprs = Set(graph.nodes.compactMap(\.tbd.first))
if exprs.isEmpty {
_ = try reduceRule()
return [:]
}
guard try reduceRule() == nil else {throw ShiftReduceConflict()}
return try Dictionary(uniqueKeysWithValues: exprs.map{expr in
try (expr,
[ItemSet(graph: ClosedGraph(seeds: graph.nodes
.compactMap{$0.tryAdvance(expr)}))])
})
}
}

// MARK: LR(0) GRAPH

fileprivate struct ItemSetTable<R : Rules> {

let graph : ClosedGraph<ItemSet<R>>

init(rules: R.Type) throws {
// our initial state is the augmented rule, tagged by nil
let augmentedRule = Item<R>(rule: nil,
all: [.nonTerm(R.goal)],
ptr: 0)
// now, do both graph closures
let itemSetGraph = try ClosedGraph(seeds: [augmentedRule])
graph = try ClosedGraph(seeds: [ItemSet(graph: itemSetGraph)])
}

}

// MARK: HELPER

extension ItemSet {

func reduceRule() throws -> R? {
let results : [R] = graph.nodes.lazy.filter(\.tbd.isEmpty)
.compactMap(\.rule)
if results.count > 1 {
throw ReduceReduceConflict(matching: results)
}
return results.first
}


}

// MARK: ACTION + GOTO TABLES

extension ItemSetTable {

func actionTable() throws -> [R.Term? : [Int : Action<R>]] {

// shifts

let keyAndVals = graph.edges
.compactMap{(key : Expr<R.Term, R.NTerm>, vals : [Int : [Int]])
-> (R.Term, [Int : Action<R>])? in
guard case .term(let t) = key else {return nil}
let dict = Dictionary(uniqueKeysWithValues: vals
.map{start, ends in
assert(ends.count == 1)
return (start, Action<R>.shift(ends.first!))
})
return (t, dict)
}

var dict = Dictionary(uniqueKeysWithValues: keyAndVals)
as [R.Term? : [Int : Action<R>]]

for start in graph.nodes.indices {

// reductions

if let rule = try graph.nodes[start].reduceRule() {
for term in Array(R.Term.allCases) as [R.Term?] + [nil] {
if dict[term] == nil {
dict[term] = [start: .reduce(rule)]
}
else {
if dict[term]?[start] != nil {
throw ShiftReduceConflict()
}
dict[term]?[start] = .reduce(rule)
}
}
}

// accepts

if graph.nodes[start].graph.nodes
.contains(where: {$0.rule == nil && $0.tbd.isEmpty}) {
if dict[nil] == nil {
dict[nil] = [start : .accept]
}
else {
if dict[nil]?[start] != nil {
throw AcceptConflict() //should not happen?
}
dict[nil]?[start] = .accept
}
}
}
return dict
}

var gotoTable : [R.NTerm : [Int : Int]] {
Dictionary(uniqueKeysWithValues: graph.edges
.compactMap{(key : Expr<R.Term, R.NTerm>, vals : [Int : [Int]]) in
guard case .nonTerm(let nT) = key else {return nil}
return (nT, vals.mapValues{ints in
assert(ints.count == 1)
return ints.first!
})
})
}

}

// MARK: LR(0) PARSER

public extension Parser {

// Et voilà
static func LR0(rules: R.Type) throws -> Self {
let table = try ItemSetTable(rules: rules)
return Parser(actions: try table.actionTable(),
gotos: table.gotoTable)
}

}
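
Putting it all together, using the MyRules grammar from earlier (a usage sketch; it assumes all the snippets of this article live in one module):

```
// Generate the tables once from the grammar...
let parser = try Parser<MyRules>.LR0(rules: MyRules.self)

// ...then run the constructions to get an AST back:
let ast = try parser.parse("1+0*1")
// should yield .times(.plus(.b(.one), .zero), .one)

// ...or just record the sequence of reductions:
let reductions = try parser.buildStack("1+0")
```

Since Parser is Codable, the generated tables can also be serialized once and shipped with your app instead of being recomputed on every launch.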

The full project can be found here.

Challenges

Something that doesn’t seem possible is writing modular grammars.

Consider this:


protocol Terms0 : Terminal {
static var a : Self {get}
static var b : Self {get}
}

protocol NTerms0 : NonTerminal {
static var E : Self {get}
static var B : Self {get}
}

enum Rules0<Term : Terms0, NTerm : NTerms0> : String, Rules {
case aE
...
}

protocol Terms1 : Terms0 {
static var c : Self {get}
static var d : Self {get}
}

protocol NTerms1 : NTerms0 {
static var Foo : Self {get}
static var Bar : Self {get}
}

enum Rules1<Term: Terms1, NTerm: NTerms1> : Rules {
case R0(Rules0<Term, NTerm>)
case FooE
var rawValue: String {
// have to implement :(
}
init?(rawValue: String) {
// have to implement :(
}
static var allCases: [Rules1<Term, NTerm>] {
// have to implement :(
}
...
}

Unfortunately, if we try to modularize our grammar like this, we have to manually implement rawValue and allCases, which are completely obvious boilerplate. Additionally, we have to implement init?(rawValue:), which is also boilerplate, except that we actually have to check that the raw values are unique.
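
For illustration, that boilerplate would look roughly like this (sketched against the hypothetical Rules1 above; a real implementation should additionally verify that the raw values of the two enums don’t collide):

```
extension Rules1 {
    var rawValue: String {
        switch self {
        case .R0(let rules): return rules.rawValue // delegate to Rules0
        case .FooE: return "FooE"
        }
    }
    init?(rawValue: String) {
        if let rules = Rules0<Term, NTerm>(rawValue: rawValue) {
            self = .R0(rules)
        } else if rawValue == "FooE" {
            self = .FooE
        } else {
            return nil
        }
    }
    static var allCases: [Rules1<Term, NTerm>] {
        Rules0<Term, NTerm>.allCases.map(Self.R0) + [.FooE]
    }
}
```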

A macro could probably help, but those are a pain in the ass, so I’ll leave that to you as an exercise.
