Parser Combinator Gotchas
How to avoid trouble when using parser combinators in Go
Tl/dr: Here’s how to solve the Gotchas in the Code of the last post.
- We use function definitions to define recursive parsers (η-conversion).
- We factor out common prefixes of alternatives in the grammar.
- We transform left-recursive grammars into right-recursive ones using repetition. This allows us to parse left-associative operators.
We’re writing a parser for the following grammar with the help of parser combinators in Go.
Digit := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
Number := Digit+
Multiplicand := Number
| "(" ^ Expression ^ ")"
Addend := Multiplicand
| Multiplicand ^ "*" ^ Addend
Expression := Addend
| Addend ^ "+" ^ Expression
In the last post we learned how to use parser combinators. Now we’re going to take a look at what we have to watch out for when using them.
Gotcha: Recursive Functions, No Recursive Values
We need theExpression
parser in order to implement Multiplicand
. For Expression
we need Addend
and for that we need Multiplicand
again.
Multiplicand ~calls~> Expression ~> Addend ~calls~> Multiplicand
This is called recursion. Go only allows for recursion on functions. In this dummy example we try to use recursion on variables and the example fails to compile. In another dummy example we use recursion on functions and it compiles without errors! Luckily our parsers are functions. Converting the definition of Multiplicand
is easy. Here’s the naive version from the last post.
After turning it into a function definition we get:
Recursive parsers, here we come!
Gotcha: First-come, first-served
Finally, here’s our naive parser for the expression grammar.
Surprise! Our parser has a bug. Let’s try it on the Go Playground! If we parse the text "1+2"
, then we’ll see the following result.
result = 1
rest of input = +2
The parser didn’t eat up its input. If the parser could think, it would probably be something along these lines:
I'm the parser.
I try to parse an Expression.
Expression can be an Addend.
Addend can be a Multiplicand.
Multiplicand can be a Number.Oh great, 1 is a Number ... I'm done parsing the Expression.
More sophisticated combinators can handle this grammar better and give us the longest match instead of the first one. We’ll stick to the simple combinators, however and we’re going to deal with the problems by ourselves. One problem is the following rule.
Expression := Addend
| Addend ^ "+" ^ Expression
The single Addend
comes first and the compound expression comes second. If there’s an Addend
in the input, our parser doesn’t bother trying to find a compound expression. There’s an easy fix: We change the order!
Expression := Addend ^ "+" ^ Expression
| Addend
If our parser starts out with this grammar and if it could think, it would probably be thinking:
I'm the parser.
I try to parse an Expression.
Expression can be an Addend and then "+" and then Expression
...
This time it will only consider the alternative with the single Addend
if it can’t find the “+” followed by the Expression.
After applying this idea to every rule of our grammar we get a correct parser.
This Example shows our parser in action.
Gotcha: Bad Complexity
Our parser has a performance problem. It repeats parsing the same input with the same parser. We take “1” as an example input. The parser tries to parse Addend "+" Expression
only to realize that there is no “+” character. After that it starts out again to parse a single Addend
. For this example it’s only a factor of two in performance “loss”. What happens if we try to parse “((..(1)..))” with a lot of parenthesis? Let’s look at another example. Just adding a few more surrounding parentheses to the input text will make it run into a time-out. It’s not just bad performance, it’s bad complexity. We might have exponential run-time here.
Instead of proving the exponential time-bound we’re going to remove it. As before, we can see the problem from the grammar:
Expression := Addend ^ "+" ^ Expression
| Addend
We have two alternatives that start with the same Addend
. Why don’t we factor out the addend and make the rest optional?
Expression := Addend ^ ("+" ^ Expression)?
Through this we make sure that the Addend
will only parse once for the same input. We give the method for ?
the name Optional
and use it in our final parser for this grammar. The return value of Optional
will be Nothing
when the parser successfully parses the empty string ""
. Consequently, the method evaluate
evaluating the terms on the fly must handle Nothing
correctly.
Bonus Gotcha: Left-Associative Operators
We left out subtraction and division for a good reason: We’re parsing and evaluating from right to left. Our parser will evaluate the expression 1+2+3
to 1+5
and then to 6
. If we allowed subtraction, then the expression 1-2-3
would evaluate to 1-(-1)
and then to 2
— which is wrong, because 1-2-3
should be -4
. Let’s look at our current grammar and add subtraction and division where we naively think they should fit in:
Multiplicand := Number
| "(" ^ Expression ^ ")"
Addend := Multiplicand ^ (("*" | "/") ^ Addend)?
Expression := Addend ^ (("+" | "-") ^ Expression)?
If we read it like a parser would do, it will tell us that it firstly parses one addend and then the rest of the expression. Thus 1+2+3
becomes the Addend 1 + the Expression 2 + 3
, which is not what we want. We want the Expression 1 + 2 + the Addend 3
and then to further divide the first expression. Now let’s change the grammar accordingly.
Addend := Addend ^ (("*" | "/") ^ Multiplicand)?
| Multiplicand
Expression := Expression ^ (("+" | "-") ^ Addend)?
| Addend
We’ve left out the first rule because it hasn’t changed anyway. This is a correct context-free grammar that produces the syntax trees suitable for evaluation from left to right. However, we can’t parse this with naive parser combinators. To parse an Expression we first need to parse an Expression. We need another fix for this Gotcha.
The Expression does nothing but glue together several Addends with interspersed “+” or “-” signs. This sounds like repetition can be useful. Let’s expand Expression several times to get a feeling for how it works.
Expression := Addend
| Addend ^ ("+" | "-") ^ Addend
| Addend ^ ("+" | "-") ^ Addend ^ ("+" | "-") ^ Addend
| ...
The expression always starts with an Addend
, followed by any number of ("+" | "*") ^ Addend
, which is easy to express with the repetition operator *
in context-free grammars.
Expression := Addend ^ (("+" | "-") ^ Addend)*
Instead of evaluating from left to right or from right to left we just create a flat list this time. Then we can choose how to evaluate the whole list in a separate step. Here’s what the whole grammar looks like after this “flattening” of the recursive calls:
Multiplicand := Number
| "(" ^ Expression ^ ")"
Adddend := Multiplicand ^ (("*" | "/") ^ Multiplicand)*
Expression := Addend ^ (("+" | "-") ^ Addend)*
We will use the Repeated
method as the star *
operator from the grammar. The result will be a list of the parsed values. We can’t see the result type in the type of the parser because Go doesn’t have type parameters— this is odd. We just keep in mind that the method evaluate
now needs to deal with lists.
In the end we have one more example to play with on the Go Playground.
There’s a way to do the same thing without creating an intermediate list. We can find it on the github page Parser-Gombinators and we won’t discuss it here.
Conclusion
When using parser combinators we’re working with an intersection of usual programming language semantics and context-free grammar semantics. Whenever the two diverge, we run into problems. We can’t just use any BNF and convert it into parser combinators. As we have seen, left-recursion leads to non-termination. Common prefixes in alternatives lead to incomplete parses or bad complexity. We’ve gone through all of this and we’ve seen the bug-fixes.
At last, now that we know what we need to look for we can use our parser combinators to write light-weight parsers that are easy to read and maintain.
Armin Heller is working at QAware GmbH as a software engineer.