SuperCombinators: a Swifty Parser Combinator Framework

Sasha Lopoukhine
Snips Blog
Published in
6 min readMar 31, 2017

--

This article is about SuperCombinators, a parser combinator library I wrote to do some of the String parsing we do at Snips. Read on to find out what parser combinators are, and why we implemented our own framework to use them.

Parser combinators are a powerful tool to process text. While not very well suited to natural language, they allow to implement a very expressible, performant parser in a succinct way. In our work in creating cross-platform, multi-language NLP tools, we often need to create a flexible, legible and compact way to dynamically change the logic of our program that doesn’t quite fit in plain text, CSV or JSON. Parser combinators are well suited for this.

Existing frameworks

There are already quite a few open-source parser combinator libraries in Swift. At the time when we first started looking at the existing implementations, none were quite what we wanted.

  1. This was just toward the end of the Swift 3.0 migration season, and there were no frameworks already migrated at the time, although we had already migrated our code base.
  2. The libraries seemed to be written either as a showcase of a functional programming approach or as a developer’s first Swift framework, with readability and Swift conventions as a low priority.
  3. All the implementations of recursive parsers that we found leaked memory.
  4. It wasn’t possible to instantiate a parser from a regular expression or to implement a way to scan for characters in a CharacterSet.

After forking one of the available libraries, updating, and playing around with it, we decided to have a go at writing one from the ground up that would deal with the drawbacks above.

Usage

Types in Swift tend to have a semantic meaning associated with them, along with their syntactic properties. In this spirit, the framework has two key types, Pattern and Parser. A Parser traverses some text and extracts the value it represents. A Pattern traverses some text the same way, but does not capture a value. This allows us to have a clear semantic separation between in the usage of the parsers, particularly when combining them. There is a number of ways to make a Parser from a Pattern, and vice versa, as shown below.

Integer parser

Here is a parser that extracts an Int from text:

Let’s go through this code line by line.

  1. We create digits, a pattern that traverses the prefix of a String that consists of UnicodeScalars found in CharacterSet.decimalDigits.
  2. We create minus, that matches a String that starts with "-" using a String literal.
  3. minus.optional creates a parser that parses a "-" if it's there, or the empty prefix of any String. We combine it with the digits pattern using an overloaded & to give the pattern that matches both original patterns in sequence.
  4. We capture the prefix of a string that was traversed using this pattern, giving us a parser of type Parser<String>.
  5. We map the prefix captured to an Int by using the native failing initialiser from string.
  6. We call Parser.parseAll(_:): Parser<Value> -> Value?, that gives us the value obtained from parsing the entire string.

String parser

Implementing a basic string parser is fairly straightforward:

notQuote is a pattern that greedily parses characters that aren't ". We use that to capture the contents of a string that does not have escape sequences. We then combine it with patterns declared using string literals.

Note how Swift’s compiler helps us here. It infers the intended type of the literals, while keeping it clear which part of the information extracted we want to retain.

Toy CSV

Using the parsers we’ve defined so far, we can build a basic CSV parser, where each comma-separated cell contains either an int or a string with no escaped characters.

There are a couple more operators defined here: infix || and postfix *.

The parser formed using || attempts to parse using the first one, and then the second one if the first one fails.

SuperCombinators defines two postfix operators: * and + to mirror the familiar regex quantifiers. They both attempt parse using the original parser as many times as possible, the second one failing if there wasn't at least one match.

Recursive parsing

There is often a need to parse elements that may contain themselves. Parser and Pattern contain a factory method to do this:dA

In defining these recursive structures, the correct parse order is crucial.

  1. int || (int + "+" + sum).map(+) would greedily match using the first option and never recurse
  2. (sum + "+" + int).map(+) || int would run forever, recursing indefinitely

Note how Swift automatically infers that while the "+" pattern should be ignored, both the values from int and sum should be kept to create a parser of type Parser<(Int, Int)>.

Implementation

Due to this task having a relatively low priority, a fair bit of the implementation of this framework was done in my free time. This gave me the luxury of trying out a large number of approaches, with the freedom of chucking out working code and starting again, meaning that there is not a lot of code left from the first working implementation. Nonetheless, a few of the original building blocks have remained.

Swift encourages using types as a means of communication of the semantics of an instance. In the case of parser combinators, there can be a clear distinction between a part of the text that is used purely for structure, and a part of text that has meaning. For example, an array contains a list of values that have meaning, but that are difficult to parse correctly if they are not delimited by commas. By having a separate Parser type for extracting meaning, and Pattern type for extracting structure, the operation of chaining parsing units became much more clear without obscuring which parts of the information were conserved.

This separation also allowed for separate operators on the types, as well as StringLiteralExpressible conformance for Patterns, allowing for clear, readable and succinct expression of parsers.

The recursive parser is a familiar ARC logic puzzle. You want to give a reference type that has a single strong reference to the parsing function no matter how many times it is copied. This was done using an additional, private type that would lazily generate the parser when it was first used, with no strong references to itself, captured by the parsing function of the returned Parser.

An insight into the String type in Swift allowed for use of all the familiar Foundation techniques. This is that creating a String that is a substring of an existing String does not copy its contents. This allowed for switching from using either a CharacterView into a String, or some other generic collection to using String types directly, without fearing a large overhead in memory copying. This unfortunately obscures the fact that such a substring could retain a much larger string in memory, but is a worthy tradeoff for the speed and convenience of parsing in our case.

Going forward

While the existing library fits our needs, we would like it to also be useful to others. There is likely to be a use for some sort of a base for frequently used parsers, such as for optional whitespace and integers, and a way to share these between unrelated parsers that could be added to the framework itself. A mechanism for debugging parsing and error throwing/handling will also be useful for more complex parsers.

You can find the Swift code on Github, we would love to hear from people who find this framework useful, and pull requests are very welcome!

--

--