Building a Functional Parsing Library in JavaScript and Flow
An introduction to combinator libraries, from the ground up
--
While they aren’t often discussed outside of the functional programming world, combinator libraries are a common and useful technique for building functional libraries. Their core characteristic is the construction of a small set of base functions and types, and a few higher-order functions for combining them. Most importantly, the functions produced by these higher-order functions are as composable as the base functions built into the library, and can be recombined with the library’s higher-order functions any number of times. This enables the creation of arbitrarily large and complex abstractions from a small number of simple pieces, while maintaining type safety. In this post we’ll look at parsing, one of the most common introductions to combinator libraries, and will build a simple parser combinator library from scratch. The goal of this post is twofold: to demonstrate the benefits of combinator libraries, and to give the reader a familiarity with the concepts underlying parser combinators, providing a jumping-off point for those who want to work with more industrial-strength combinator parser solutions.
This post and the code it contains are heavily based on Graham Hutton’s paper “Higher-Order Functions for Parsing”, which introduces these concepts in Haskell.
Why use combinators?
When parsing is discussed outside of the realm of functional programming, usually it’s in reference to parser generators. These tools provide their own language for describing grammars, which is then compiled into source code that can be imported into your project. Examples of these include systems like lex and yacc in the UNIX world, ANTLR, or PEG.js and nearley.js in the JavaScript world. Parser generators usually allow you to embed functions from your main language into the grammar files, allowing you to conveniently produce the data structures that will be used by your main program. They provide high performance and good tooling, and are a good choice for many applications. However, their approach has several downsides.
The most obvious downside may be that a project using a parser generator requires developers to understand an additional language, with its own syntax and tooling. This is a language that most developers are unlikely to use frequently enough to be completely familiar with.
Embedding snippets of the “main” language into a sublanguage which is then compiled into a form importable by the main application presents challenges to debugging and type checking. A developer debugging code they embedded into a grammar file may find themselves stepping through a great deal of generated code, a task made more complicated by the fact that the control flow and execution model of this generated code is often unfamiliar to anyone not familiar with the parsing algorithm being used. This presents challenges to type checking as well. Say we want to implement a parser for a calculator language, capable of expressing basic operations on numbers and expressions. We might say that this language contains NUMBER
terms, which will be parsed into a Num
type, and PLUS
terms of the form NUMBER WHITESPACE "+" WHITESPACE NUMBER
, which will be parsed into a Plus
type. It would be desirable for the function responsible for turning the parsed PLUS
data into a Plus
type to be type safe, and detect errors if Plus
parsing code misuses the Num
types that a PLUS
contains. We would also like to have the guarantee that if a parse is successful, that the result of the parse that’s turned over to the main program is type safe. Threading types together like this can be challenging when implementing a parser generator, but as we’ll soon see, it is trivial when writing parser combinators.
Finally, parser combinators have an advantage over parser generators in that they give you the expressivity of a full programming language when building new abstractions. This argument will be familiar to JavaScript developers who have seen the debates between React and templating languages; custom abstractions which are difficult or impossible to implement in a restricted sublanguage become feasible when implemented using the power of a full language. This expressivity is where parser combinators really shine.
The combinator approach
The fundamental idea of parser combinators is that we will implement extremely simple parsers, which are too simple to do anything practical. Then we’ll implement some combinators: pure, higher-order functions which combine functions into other functions. These will let us build our simple parsers into more complex parsers, and those into more complex parsers, until we have the full parser that we want.
It’s customary for parser combinators to treat their input strings as a list of characters. In JavaScript and Flow, strings don’t have the full set of array methods, and there is no character type. For these reasons, we’ll declare a type alias type Char = string
, and use Char
as the type at interfaces where we expect strings of length 1
. To parse a string str
, we’ll do str.split("")
to break it into an array of strings of length 1
before parsing. We’ll define the input to our parsers as so:
type ParseInput = Array<Char>;
Because our basic parsers will be simple functions which only parse a small piece of our input, the output type of these parsers will need to include both the result of the parse along with the unparsed input. Furthermore, we’ll be implementing a library which handles ambiguous grammars; languages for which there may be more than one acceptable parse result. When a string can be parsed in multiple ways, we’ll return all possible parse results. Based on these two constraints, this will be the type returned by our parsers:
type ParseResult<R> = Array<[R, ParseInput]>;
Because our parsers will be capable of parsing values of any type, we represent the type of their final result with the generic R
. Each of these results is paired with the unparsed remainder of the string in question. In the case where a parse fails, we’ll return an empty array.
Now we can see the full type definition for a parser:
type Parser<R> = Array<Char> => ParseResult<R>;
The signature of our parsers can be remembered using a rhyme written by Graham Hutton: “A parser for things is a function from strings to lists of pairs of things and strings”.
Implementing our basic parsers
The foundational parsers of our library will be absurdly simple. First, let’s make a helper function for constructing a successful parse result:
function success<R>(result: R, input: ParseInput) : ParseResult<R> {
return [[result, input]];
}
Bear in mind that the outer array here is a proper array, but the inner one is being used as a tuple. Now, let’s make the most basic parser possible: one which always fails.
function fail<R>(input: ParseInput) : ParseResult<R> {
return [];
}
No matter what input this parser receives, it returns an empty success list. Using these two functions, we can produce a factory for slightly more useful parsers:
function satisfy<R>(test: Char => boolean) : Parser<R> {
return (input: ParseInput) =>
input.length === 0 ? fail(input)
: test(input[0]) ? success(input[0], input.slice(1))
: /* doesn't match */ fail(input);
}
satisfy
takes a test function, and produces a parser that consumes a character and succeeds if this test function returns true
for this character, and fails otherwise. We can make a basic parser that parses one of the characters a
, b
, c
, or d
, as so:
const parseABCD = satisfy(x => ["a", "b", "c", "d"].includes(x));parseABCD("asdf".split("")); // returns [['a', ['s', 'd', 'f']]]
parseABCD([]); // returns []
parseABCD(["Z"]); // returns []
We can use satisfy
to define a simple interface for parsing a single given character:
function character(c: Char) : Parser<Char> {
return satisfy(x => x === c);
}
These are the building blocks that we’ll put together to produce full-fledged parsers.
Combining parsers
Now that we have a way to create basic parsers for single characters, we need a way to put them together to parse longer strings. Let’s make a function that sequences one parser after another:
function then<R, S>(first: Parser<R>,
second: Parser<S>) : Parser<[R, S]> {
return input =>
flatMap(
first(input),
([r, remainder]) =>
second(remainder).map(
([s, secondRemainder]) => [[r, s], secondRemainder]));
}
This will be the most complicated function of this post, so let’s take a minute to break it down. flatMap
is a common utility function (show here), with the signature (Array<A>, A => Array<B>) => Array<B>
, which maps an array into an array of arrays, and flattens those together. We call it on first(result)
, mapping over the results of our first parser. Each entry in this array of results is a pair of the result of the first parser (r
), and the unconsumed input from the first parser (remainder
). Now we do second(remainder)
to get an array of results from the second parser. These results are pairs of the result of the second parser (s
) and the unparsed input left over from the second parser (secondRemainder
). What we’ve actually parsed is an r
followed by an s
, though, so we map over this array of results to return the pair [r, s]
as the parsed result.
Sequencing one parser after another is only one of the ways we can combine two parsers. We can also make an “or” or “alternatives” combinator, corresponding to the |
operator in BNF notation. This one is much simpler: it returns the results of parsing with the first parser, concatenated with the results of the second parser:
function alt<R>(left: Parser<R>, right: Parser<R>) : Parser<R> {
return input => left(input).concat(right(input));
}
In theory, using then
for sequential items and alt
for alternatives is all that we need to implement parsers for BNF grammars. Unfortunately the type signatures we’re working with will not allow this quite yet. Consider this grammar, for a 1
surrounded by any number of balanced pairs of parenthesis:
parenOne ::= "(" parenOne ")" | "1"
We would want to express this in our combinators by doing something like this:
const parenOne = alt(
then(character("("), then(parenOne, character(")"))),
character("1"));
Problem is, alt
requires that each of the parsers its given have the same return type, and we can see here that our first alternative’s return type will involve nested tuples, while our second will return a string. We need to make a few more tools for manipulating our parsers
Additional combinators
Let’s make a combinator that will let us transform the results of a parser. This way we can transform the types of our alternatives until they match:
export function map<A, B>(fn: A => B, p: Parser<A>) : Parser<B> {
return input =>
p(input).map(([result, remaining]) => [fn(result), remaining]);
}
All we need to do here is apply our given function to each return value of the parser.
Let’s define a cleaner type for the return value of our parse:
class ParenPair {
contents: ParenPair | One;
constructor(contents: ParenPair | One) {
this.contents = contents;
}
}class One {};
We have one more hurdle to hop over, here. Recall the definition of parenOne
that we gave earlier:
const parenOne = alt(
then(character("("), then(parenOne, character(")"))),
character("1"));
Ignoring the type issues, the definition of parenOne
is recursive. We can’t pass parenOne
to then
until after parenOne
is defined, but we need to do so as part of parenOne
's definition. We can deal with this by making a lazy version of then
:
function lazyThen<R, S>(first: () => Parser<R>,
second: () => Parser<S>) : Parser<[R, S]> {
return input => then(first(), second())(input);
}
All this does is wrap then
, such that rather than taking two parsers, we take two functions which return parsers. We don’t execute these functions until we actually need them to return the parse, so it’s fine to create them before these parsers are ready to use. (that is, we can create a function which references parenOne
during parenOne
's definition, as long as this function isn’t actually called until the definition is complete.) alt
can be wrapped in the same way, producing lazyAlt
.
Now that these extra concerns are accounted for, we can finally implement a parser for our parenOne
grammar:
const parenOne : Parser<ParenPair | One> =
alt(
map(
([openParen, [contents, closeParen]]) =>
new ParenPair(contents),
then(character("("),
lazyThen(() => parenOne, () => character(")")))),
map(
_ => new One(),
character("1")));
Let’s run this parser on "((1))".split("")
and see what it produces:
[ [ ParenPair { contents: ParenPair { contents: One {} } }, [] ] ]
It works! We have one result: the case where two ParenPairs
are parsed surrounding a One
, leaving an empty array as the remaining input. Note that we achieved one of the goals which lead to us using combinator parsers: we have type signatures on our parser, telling us that if a parse succeeds, the parse data will be of a known type.
Higher-level combinators
We now have the functions necessary to translate grammars from BNF into working parsers: replace |
with alt
or lazyAlt
, replace successive terms with nested calls to then
or lazyThen
, and use map
to turn the raw parser output into a nicely usable form. We can do better than this, though. In the above example, we represented our grammar as a sequence of parenthesis, parenPair
s, and 1
, but it’s much more elegant to think of this as a 1
surrounded on either side by pairs of parenthesis. Let’s implement a function to bracket a parser with any number of pairs of two others, using only the combinators which we’ve already declared:
function bracket<R, S, T>(left: Parser<R>,
middle: Parser<S>,
right: Parser<T>) :
Parser<[Array<[R, T]>, S]> {
return input =>
alt(
map(
m => [[], m],
middle),
map(
([l, [[lrs, m], r]]) => [[[l, r]].concat(lrs), m],
then(left,
then(bracket(left, middle, right), right))))(input);
}
The return type of a parser produced by bracket
is [Array<[R, T]>, S]
, where R
and T
are the types of the parsers wrapping either side, and S
is the type returned by the parser being wrapped. While the transformations in the map
calls may seem daunting, the core logic of this parser is quite simple: it returns an alternative of the wrapped middle value, or the left value, followed by a recursive call to bracket, followed by the right value.
Now we can rewrite our parenOne
function more naturally, without the need for self-reference, laziness, or new types:
const parenOneBracketed =
bracket(character("("), character("1"), character(")"));
If we want to produce the same output as before, we can still map
this with our additional types:
const parenOneBracketed =
map(
(parens, one) =>
parens.reduce((acc, _) => new ParenPair(acc), new One()),
bracket(character("("), character("1"), character(")")));
This is a substantial improvement, and demonstrates another of our motivations for using a combinator library: because we had the power of a full language at our fingertips, we were able to easily create a new parsing abstraction, that can be reused and composed just as readily as our base abstractions.
To round out our parsing library, let’s add a couple more functions for standard parsing cases: parsing a value repeated some number of times. First, let’s handle the case where a parser is match zero or more times, as with the symbol *
in regular expressions:
function many<R>(p: Parser<R>) : Parser<Array<R>> {
return alt(
map(
([r, rs]) => [r].concat(rs),
then(p, input => many(p)(input))),
input => success([], input));
}
The item we parse is of type R
, so we create a parser with the return type Array<R>
. We can use many
to similarly implement a function for when a parser should be matched one or more times, as with the regular expression operator +
:
function some<R>(p: Parser<R>) : Parser<Array<R>> {
return map(
([r, rs]) => [r].concat(rs),
then(p, input => many(p)(input)));
}
More useful functions may be defined. Here are the type signatures of few that are commonly useful:
// Eat any amount of whitespace, returning nothing
whitespace : Parser<null>;// Parse a sequence of items separated by some junk,
// like a "," - separated list
separatedBy : (Parser<R>, Parser<S>) => Parser<Array<R>>;// Parse a sequence of items separated by some important thing
separatedByImportant :
(Parser<R>, Parser<S>) => Parser<[R, Array<[S, R]>]>
Implementations of these are omitted for succinctness, and to provide a suggested challenge for those looking to try their hand at defining combinators themselves.
Note that each of these high-level combinators still have the Parser
type as both their arguments and their return values. This allows us to continue composing parsers at higher and higher levels of abstraction; there’s no reason that the parser we pass to bracketed
can’t be one created by many
, that those we pass to separatedBy
can’t be produced by bracketed
, etc.
The library we’ve implemented here has some serious limitations: it is not capable of parsing context-sensitive grammars, does not provide any mechanism for reporting errors when parses fail, and is far from efficient, in either memory usage or runtime. These are due to the choice of an implementation which favors clarity and succinctness over all else, and are not present in industrial-strength parser combinator libraries such as parsimmon. This said, we’ve implemented a flexible, expressive, type-safe library in around 10 functions and under 100 lines, capable of parsing any context-free language. This is no mean feat, and shows the power that comes with choosing combinators as the basic abstraction for library construction.
All code samples from this post are available here.
Thanks to Lizz Katsnelson for editing this post.