SuperCombinators: a Swifty Parser Combinator Framework
This article is about SuperCombinators, a parser combinator library I wrote to do some of the String parsing we do at Snips. Read on to find out what parser combinators are, and why we implemented our own framework to use them.
Parser combinators are a powerful tool to process text. While not very well suited to natural language, they allow to implement a very expressible, performant parser in a succinct way. In our work in creating cross-platform, multi-language NLP tools, we often need to create a flexible, legible and compact way to dynamically change the logic of our program that doesn’t quite fit in plain text, CSV or JSON. Parser combinators are well suited for this.
Existing frameworks
There are already quite a few open-source parser combinator libraries in Swift. At the time when we first started looking at the existing implementations, none were quite what we wanted.
- This was just toward the end of the Swift 3.0 migration season, and there were no frameworks already migrated at the time, although we had already migrated our code base.
- The libraries seemed to be written either as a showcase of a functional programming approach or as a developer’s first Swift framework, with readability and Swift conventions as a low priority.
- All the implementations of recursive parsers that we found leaked memory.
- It wasn’t possible to instantiate a parser from a regular expression or to implement a way to scan for characters in a
CharacterSet
.
After forking one of the available libraries, updating, and playing around with it, we decided to have a go at writing one from the ground up that would deal with the drawbacks above.
Usage
Types in Swift tend to have a semantic meaning associated with them, along with their syntactic properties. In this spirit, the framework has two key types, Pattern
and Parser
. A Parser
traverses some text and extracts the value it represents. A Pattern
traverses some text the same way, but does not capture a value. This allows us to have a clear semantic separation between in the usage of the parsers, particularly when combining them. There is a number of ways to make a Parser
from a Pattern
, and vice versa, as shown below.
Integer parser
Here is a parser that extracts an Int
from text:
Let’s go through this code line by line.
- We create
digits
, a pattern that traverses the prefix of aString
that consists ofUnicodeScalars
found inCharacterSet.decimalDigits
. - We create
minus
, that matches aString
that starts with "-" using aString
literal. minus.optional
creates a parser that parses a "-" if it's there, or the empty prefix of anyString
. We combine it with thedigits
pattern using an overloaded&
to give the pattern that matches both original patterns in sequence.- We capture the prefix of a string that was traversed using this pattern, giving us a parser of type
Parser<String>
. - We map the prefix captured to an
Int
by using the native failing initialiser from string. - We call
Parser.parseAll(_:): Parser<Value> -> Value?
, that gives us the value obtained from parsing the entire string.
String parser
Implementing a basic string parser is fairly straightforward:
notQuote
is a pattern that greedily parses characters that aren't "
. We use that to capture the contents of a string that does not have escape sequences. We then combine it with patterns declared using string literals.
Note how Swift’s compiler helps us here. It infers the intended type of the literals, while keeping it clear which part of the information extracted we want to retain.
Toy CSV
Using the parsers we’ve defined so far, we can build a basic CSV parser, where each comma-separated cell contains either an int or a string with no escaped characters.
There are a couple more operators defined here: infix ||
and postfix *
.
The parser formed using ||
attempts to parse using the first one, and then the second one if the first one fails.
SuperCombinators defines two postfix operators: *
and +
to mirror the familiar regex quantifiers. They both attempt parse using the original parser as many times as possible, the second one failing if there wasn't at least one match.
Recursive parsing
There is often a need to parse elements that may contain themselves. Parser
and Pattern
contain a factory method to do this:dA
In defining these recursive structures, the correct parse order is crucial.
int || (int + "+" + sum).map(+)
would greedily match using the first option and never recurse(sum + "+" + int).map(+) || int
would run forever, recursing indefinitely
Note how Swift automatically infers that while the "+"
pattern should be ignored, both the values from int
and sum
should be kept to create a parser of type Parser<(Int, Int)>
.
Implementation
Due to this task having a relatively low priority, a fair bit of the implementation of this framework was done in my free time. This gave me the luxury of trying out a large number of approaches, with the freedom of chucking out working code and starting again, meaning that there is not a lot of code left from the first working implementation. Nonetheless, a few of the original building blocks have remained.
Swift encourages using types as a means of communication of the semantics of an instance. In the case of parser combinators, there can be a clear distinction between a part of the text that is used purely for structure, and a part of text that has meaning. For example, an array contains a list of values that have meaning, but that are difficult to parse correctly if they are not delimited by commas. By having a separate Parser
type for extracting meaning, and Pattern
type for extracting structure, the operation of chaining parsing units became much more clear without obscuring which parts of the information were conserved.
This separation also allowed for separate operators on the types, as well as StringLiteralExpressible
conformance for Pattern
s, allowing for clear, readable and succinct expression of parsers.
The recursive parser is a familiar ARC logic puzzle. You want to give a reference type that has a single strong reference to the parsing function no matter how many times it is copied. This was done using an additional, private
type that would lazily generate the parser when it was first used, with no strong references to itself, captured by the parsing function of the returned Parser
.
An insight into the String
type in Swift allowed for use of all the familiar Foundation techniques. This is that creating a String
that is a substring of an existing String
does not copy its contents. This allowed for switching from using either a CharacterView
into a String
, or some other generic collection to using String
types directly, without fearing a large overhead in memory copying. This unfortunately obscures the fact that such a substring could retain a much larger string in memory, but is a worthy tradeoff for the speed and convenience of parsing in our case.
Going forward
While the existing library fits our needs, we would like it to also be useful to others. There is likely to be a use for some sort of a base for frequently used parsers, such as for optional whitespace and integers, and a way to share these between unrelated parsers that could be added to the framework itself. A mechanism for debugging parsing and error throwing/handling will also be useful for more complex parsers.
You can find the Swift code on Github, we would love to hear from people who find this framework useful, and pull requests are very welcome!