Introduction to Parsers
Parsing is a surprisingly challenging problem. No wonder I often see simple parsing problems as interview questions. In my own projects, I’ve tortured myself trying to find robust and efficient ways to scrape data from websites. I couldn’t find much help online except for people saying that using regular expressions is a bad approach.
In retrospect, this was one of those times where I simply didn’t know the right keywords to search. I finally feel like I’ve figured it all out, but it was a long journey filled with academic jargon that was hard to understand and often misused. The purpose of this article is to make the theory and practice of parsers more accessible.
I’m going to start out with some theory about formal grammars because I found it very useful to understand when people start throwing around fancy words like context-free grammar and the like. In second half of this article, we will build a parser from scratch so you can knock it out of the park in your next interview. This isn’t a quick-read, so make sure you have a nice cup of joe and a fresh mind before proceeding.
Theory of Formal Grammars
The theory behind parsers has its roots in a 1956 seminal paper by Noam Chomsky, Three Models for the Description of Language. In this paper, he describes the Chomsky Hierarchy of four classes of formal grammars. Each is a subset of another distinguished by the complexity of the algorithm required to validate them. We’ll go through them one-by-one.
Type 3 — Regular languages
Type 3 languages are called regular languages. Anything you can match with a regular expression is perfect example of a regular language — hence the name “regular” expression. Any regular language can be solved with a deterministic finite automata (DFA). For those less familiar, a DFA is a program that can be represented as a fixed set of states (nodes) and transitions (edges). There are some pretty awesome visualization tools out there for regular expressions based on this insight. For example, check out this one:
John: Some regular expression engines are actually more powerful than interpreting just regular languages. The Oniguruma engine that Ruby uses is a perfect example.
Me: Whoa!
Type 2 — Context-Free languages
Type 2 languages are called context-free (a.k.a. CFG for context-free grammar) and can be represented as a pushdown automata which is an automata that can maintain some state with a stack. This part of the hierarchy gets the most attention because most programming languages and domain specific languages are context-free. A perfect example of a language that is context-free but not regular is a language defined by “n 0’s followed by n 1’s for any n” . Mathematicians would define this grammar with the following notation:
{ 0^n 1^n | n >= 0 }
If we were to try to write this as a regular expression, we’d start by writing something like 0{n}1{n}
. But that is not a valid regular expression because we'd have to specify exactly what what n
is, hence this is not a regular language. However we can verify this grammar with a pushdown automata using the following procedure:
- If you see a 0, push it onto the stack and go back to (1).
- If you see a 1, pop an item from the stack and go back to (2).
- If the stack is empty, we’ve verified that the input string is “in the language”.
John: I have had many situations where I needed to parse balanced parenthesis and I think this is a more sympathetic case for programmers.
Me: Good point! This is a simple case of the balanced parentheses problem.
A common way of specifying context-free grammars is with with Backus-Naur Form (BNF). This is an excellent article explaining how BNF works, common extensions to BNF (typically called EBNF or ABNF), and explains how top-down (LL) and bottom-up (LR) parsers work. Some other common terms you might hear are SLR, CLR, LALR and this StackOverflow comment does a good job at clarifying those.
You might also hear about something called a parsing expression grammar (PEG). PEG is the same as CFG except it is unambiguous and greedy — if you are a programmer and not a mathematician, PEG is most often what you are thinking.
Ambiguity is painful. Markdown is a good example of an ambiguous grammar and this is the primary reason Markdown parsers do not have a formal BNF grammar definition. The following examples are the expected parse results based on the latest CommonMark specification.
***bold**italic* => <em><strong>bold</strong>italic</em>***italic*bold** => <strong><em>italic</em>bold<strong>*italic**not bold* => <em>italic**not bold</em>
A mathematician might define a CFG for Markdown using BNF like this:
The mathematician would say this is a valid CFG definition and verify that the three examples above are “in the language”. However, if you were a programmer trying to write a markdown parser, this CFG definition is pretty much useless to you. If you were to interpret this definition into an actual parser, it would fail because a program cannot be ambiguous. A PEG will eagerly match tokens as if it were a real program whereas the mathematician doesn’t actually care about that.
John: Markdown parser is great example and the ambiguity of the grammar is a pain. Markdown has become a wishy washy standard.
Me: I think it might actually be the other way around. Check out the latest spec for how bold and italics works. I dare you to spend 15 minutes reading that spec, trying to understand it, and write a simple parser for bold and emphasis that satisfies examples 328–455. Maybe the problem is that CommonMark so over-specified and ridiculous that no one actually conforms to it.
For unambiguous context-free languages like C, there are all kinds of tools. The most popular are ANTLR, Bison, and YACC. They’re actually called compiler-compilers because they don’t just verify the grammar, they provide tools for generating compilers for those grammars as well. There’s a repo with a bunch of ANTLR grammar examples that are pretty cool to check out. In the JavaScript world, you can also check out PEG.js and Jison.
Type 1 — Context-Sensitive languages
Type 1 languages are called context-sensitive. They can be represented by an automata without using more memory than the length of the input. The following grammar can distinguish between Type 2 and Type 1:
{ a^n b^n c^n | n >= 0 }
We can verify that a string is in this language using an automata with the following procedure:
- Read the first a and overwrite it with an x.
- Move to the first b and overwrite it with an x.
- Move to the first c and overwrite it with an x.
- Then move back to the beginning and go back to (1).
- If you make it to the end of the string looking for the first a and they’re all x’s, then we’ve verified the string.
The thing to recognize here compared to the previous example is that this parser has a state maintained by mutating the input memory that isn’t just a stack.
Type 0 — Recursively Enumerable languages
Type 0 languages are called recursively enumerable and is anything that can be solved by a computer — that is, not the halting problem. A simple, yet annoyingly math-heavy proof of a language that is Type 0 but not Type 1 is a language that reads two regular expressions and determines if they represent the same language. This is a well-known EXPSPACE problem so it clearly cannot be a Type 1.
An Unsolved Conjecture
A common question is how can we parse ambiguous context-free grammars. Well if you can prove that you cannot write a PEG for some CFG then you’ll be a decorated mathematician!
“It is conjectured that there exist context-free languages that cannot be parsed by a PEG, but this is not yet proven.” — Wikipedia , ACM
Otherwise, for all of your parsing needs, parser combinators are a one-size-fits-all elegant solution and that’s what we’re going to talk about next.
Parser Combinators
Parser combinators are an elegant and powerful way of defining parsers. Parsec is a popular Haskell library and the def-facto solution for parser combinators. Pretty much all other solutions in other languages are inspired from Parsec — Parsatron for Clojure, Parsec.py for Python, and Parsimmon for JavaScript. Parser combinators may seem a little daunting at first, but they’re actually relatively simple. So let’s break it down and build a little toy example.
If you’re familiar with functional programming and category theory, you’ll really appreciate the elegance of parser combinators. If not, it’s not required to keep reading, but I highly recommend checking out Brian Lonsdorf’s class on Egghead.
A Naive Approach
The goal of parser combinators is to create mini-parsers that combine together into larger and larger parsers. That means, when we parse something, there might be more to be parsed by subsequent parsers. So let’s define a parser that only parses one character.
If we see the character we’re looking for, we consume one character from the input and return a success message with the rest of the input to be parsed.
We can combine these parsers together with some higher-order functions called combinators. Let’s create a combinator called sequence that accepts a list of parsers and will parse the input using these parsers in order.
One thing that’s pretty cool is that when the parser fails, we can see exactly where it failed. Let’s use this sequence combinator to build up a string parser.
You’ll notice that we never even touched the input argument — that’s because these functions are curried! You can already start to see how terse parsers can be using these combinators. Here’s another very useful combinator that tries each of the parsers in a list until one works or the whole thing fails.
We can build these up into more complicated parsers with this approach and I recommend hopping in the CodePen below and implementing a few other combinators like nOrMore and maybe.
This approach is a bit naive though. One crucial piece that’s missing is that these parsers don’t output any values! This is important if we actually want to do something with the string we just parsed. Another problem has to do with performance — every time we create the rest string, we’re chopping off just one character and copying the rest of the string into another variable in memory. This is going to be a problem if we’re parsing a large input string.
Round 2
First, let’s tackle the performance issue of chopping up the input string. We’re going to create a new data type that holds the entire string along with a cursor to keep track of where we are inside the string. I’m going to call this a Stream although that might be a misnomer.
Notice that this Stream can handle Arrays as well as Strings — that’s pretty sweet! Next, we’re going to create the Success and Failure data types that keep track of the rest of the stream as well as a value that can be emitted from the parser. These data types represent a tagged union much like the Maybe or Either types. I’ve also added some prototype methods for map, bimap, chain, and fold. (If you’re not too familiar with functional programming, that’s totally fine and don’t worry about the next sentence.) These methods represent the algebraic structure for well defined categories for the Functor, Bifunctor, Chain, and Foldable based on the FantasyLand specification.
And now for the last piece of the puzzle. A Parser contains a function that parses a Stream but has methods for mapping and chaining over the parse Result.
Now we can create the char function all over again, only this time we’ll emit that character with the Result.
But remember how the Stream doesn’t always have to be a String? Let’s generalize this a little more to work with an Array as well.
So what have we got here? Well let’s parse something and then fold to get the result.
We can use map if we want to change the output value when the parse is successful.
And we can use bimap to provide better error messages.
The last piece of the puzzle is chain which we can use to combine parsers in sequence.
Notice that we only got back the last value from the parser. We can combine all of the values together in the output in the same way we might do this if we were chaining Promises together in sequence:
So now let’s create some of those useful combinators from before. We can start with our either combinator.
We can create our sequence combinator really cleanly, but it’s nice if we define some other helper functions as well.
Much cleaner now! I hope you can really start to appreciate why they’re called combinators. Another interesting example with this new approach we’ve taken is the maybe combinator.
I really like this example because there’s a subtlety that is very explicit: when the parser fails, we return a Success with a null value and the original Stream passed into the parser. That is, we didn’t consume anything from the stream on a failed parse. This is called backtracking which was discussed in the article I linked to in the beginning about LL(1) parsers . If the parser parses out 10 characters and then fails, we want to continue parsing from the where the parser started. However, this has its performance drawbacks if you’re parsing ahead more than one character and wrapping it all in a maybe.
Another fun combinator is lookahead. It’s bad for performance but is really useful when you need it. It basically makes an assertion about what’s up next, but doesn’t actually consume anything from the stream. This is useful for ambiguous cases in Markdown for example. If you want *italic**
to parse into <em>italic*</em>
then when you encounter the second star, you'll want to lookahead to make sure there isn't another star.
One last example: suppose we want to parse out the text between two tokens. Maybe they’re HTML tags or something. We’re going to approach this problem by defining a few combinators so that our parser definition will be really semantic and expressive.
First we’re going to define zeroOrMore which is very similar to the * in regular expressions. We’ll need this combinator to parse out everything in the middle of the two HTML tags.
If we want everything in between <here>
and </here>
, it would be nice to have a string function like we had before. In fact, its the same exact definition!
Now the tricky part: once we parse <here>
, we need an expression for what is considered a valid token is until we get to </here>
. To do this, I've created a not combinator that asserts that some parser will not parse, and then advances the stream by one character.
One last piece and we’re there: we can define a between combinator that takes three parsers and returns the parsed bit in between.
So now we’re ready to define our parser that parses everything between two HTML tags:
And there we go! Check out the CodePen below and play around yourself. And if you’re up for a challenge, create a sepBy function that parses values separated by other values. It’s useful for parsing out the items in a list, for example.
If you want to see some more complicated parser combinators in action, check out the Parsimmon examples — the JSON example should easy enough to follow.
Other Libraries
There are still some things we haven’t handled here yet with this mini-library. One substantial issue is that we’ve defined a lot of functions recursively but JavaScript doesn’t handle tail-call elimination. Thus you’d have to implement trampolining if you were to use this code to parse large files. That said, there are existing parser combinator libraries out there for JavaScript. Parsimmon looks pretty good and Bennu looks great despite being written in some bizarre language.
Conclusion
Let’s recap some of the important take-aways from this article:
- The Chomsky Hierarchy defines four classes of formal languages that can be distinguished by the complexity of the program required to parse and validate a string.
- Most programming languages are context-free and probably has a BNF grammar defined somewhere. A good place to start looking is ANTLR’s extensive list of examples.
- Some languages, like Markdown, have lots of ambiguities. Even if they are context-free, it’s exceedingly difficult to define a parsing expression using BNF.
- Parser combinators are an elegant general-purpose solution for all of your parsing needs.
Thanks for reading and please let me know what you think! If you enjoyed this article, you might be interested in my weekly newsletter where I curate articles that I find interesting.
I also want to mention and thank Elliot Marx for helping me understand everything I wrote in the theory section, and Brian Lonsdorf for telling me that parser combinators were a thing!