Image: MediaProduction — gettyimages.de

Parser Combinator Gotchas

How to avoid trouble when using parser combinators in Go

--

Tl/dr: Here’s how to solve the Gotchas in the Code of the last post.

  • We use function definitions to define recursive parsers (η-conversion).
  • We factor out common prefixes of alternatives in the grammar.
  • We transform left-recursive grammars into right-recursive ones using repetition. This allows us to parse left-associative operators.

We’re writing a parser for the following grammar with the help of parser combinators in Go.

Digit := "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" 
Number
:= Digit+
Multiplicand := Number
| "(" ^ Expression ^ ")"
Addend := Multiplicand
| Multiplicand ^ "*" ^ Addend
Expression
:= Addend
| Addend ^ "+" ^ Expression

In the last post we learned how to use parser combinators. Now we’re going to take a look at what we have to watch out for when using them.

Gotcha: Recursive Functions, No Recursive Values

We need theExpression parser in order to implement Multiplicand. For Expression we need Addend and for that we need Multiplicand again.

Multiplicand ~calls~> Expression ~> Addend ~calls~> Multiplicand

This is called recursion. Go only allows for recursion on functions. In this dummy example we try to use recursion on variables and the example fails to compile. In another dummy example we use recursion on functions and it compiles without errors! Luckily our parsers are functions. Converting the definition of Multiplicand is easy. Here’s the naive version from the last post.

After turning it into a function definition we get:

Recursive parsers, here we come!

Gotcha: First-come, first-served

Finally, here’s our naive parser for the expression grammar.

Surprise! Our parser has a bug. Let’s try it on the Go Playground! If we parse the text "1+2", then we’ll see the following result.

result = 1
rest of input = +2

The parser didn’t eat up its input. If the parser could think, it would probably be something along these lines:

I'm the parser.
I try to parse an Expression.
Expression can be an Addend.
Addend can be a Multiplicand.
Multiplicand can be a Number.
Oh great, 1 is a Number ... I'm done parsing the Expression.

More sophisticated combinators can handle this grammar better and give us the longest match instead of the first one. We’ll stick to the simple combinators, however and we’re going to deal with the problems by ourselves. One problem is the following rule.

Expression   := Addend
| Addend ^ "+" ^ Expression

The single Addend comes first and the compound expression comes second. If there’s an Addend in the input, our parser doesn’t bother trying to find a compound expression. There’s an easy fix: We change the order!

Expression   := Addend ^ "+" ^ Expression
| Addend

If our parser starts out with this grammar and if it could think, it would probably be thinking:

I'm the parser.
I try to parse an Expression.
Expression can be an Addend and then "+" and then Expression
...

This time it will only consider the alternative with the single Addend if it can’t find the “+” followed by the Expression.

After applying this idea to every rule of our grammar we get a correct parser.

This Example shows our parser in action.

Gotcha: Bad Complexity

Our parser has a performance problem. It repeats parsing the same input with the same parser. We take “1” as an example input. The parser tries to parse Addend "+" Expression only to realize that there is no “+” character. After that it starts out again to parse a single Addend. For this example it’s only a factor of two in performance “loss”. What happens if we try to parse “((..(1)..))” with a lot of parenthesis? Let’s look at another example. Just adding a few more surrounding parentheses to the input text will make it run into a time-out. It’s not just bad performance, it’s bad complexity. We might have exponential run-time here.

Instead of proving the exponential time-bound we’re going to remove it. As before, we can see the problem from the grammar:

Expression   := Addend ^ "+" ^ Expression
| Addend

We have two alternatives that start with the same Addend. Why don’t we factor out the addend and make the rest optional?

Expression   := Addend ^ ("+" ^ Expression)?

Through this we make sure that the Addend will only parse once for the same input. We give the method for ? the name Optional and use it in our final parser for this grammar. The return value of Optional will be Nothing when the parser successfully parses the empty string "". Consequently, the method evaluate evaluating the terms on the fly must handle Nothing correctly.

Bonus Gotcha: Left-Associative Operators

We left out subtraction and division for a good reason: We’re parsing and evaluating from right to left. Our parser will evaluate the expression 1+2+3 to 1+5 and then to 6. If we allowed subtraction, then the expression 1-2-3 would evaluate to 1-(-1) and then to 2 — which is wrong, because 1-2-3 should be -4. Let’s look at our current grammar and add subtraction and division where we naively think they should fit in:

Multiplicand := Number
| "(" ^ Expression ^ ")"
Addend := Multiplicand ^ (("*" | "/") ^ Addend)?
Expression
:= Addend ^ (("+" | "-") ^ Expression)?

If we read it like a parser would do, it will tell us that it firstly parses one addend and then the rest of the expression. Thus 1+2+3 becomes the Addend 1 + the Expression 2 + 3, which is not what we want. We want the Expression 1 + 2 + the Addend 3 and then to further divide the first expression. Now let’s change the grammar accordingly.

Addend       := Addend ^ (("*" | "/") ^ Multiplicand)?
| Multiplicand
Expression
:= Expression ^ (("+" | "-") ^ Addend)?
| Addend

We’ve left out the first rule because it hasn’t changed anyway. This is a correct context-free grammar that produces the syntax trees suitable for evaluation from left to right. However, we can’t parse this with naive parser combinators. To parse an Expression we first need to parse an Expression. We need another fix for this Gotcha.

The Expression does nothing but glue together several Addends with interspersed “+” or “-” signs. This sounds like repetition can be useful. Let’s expand Expression several times to get a feeling for how it works.

Expression   := Addend
| Addend ^ ("+" | "-") ^ Addend
| Addend ^ ("+" | "-") ^ Addend ^ ("+" | "-") ^ Addend
| ...

The expression always starts with an Addend, followed by any number of ("+" | "*") ^ Addend, which is easy to express with the repetition operator * in context-free grammars.

Expression := Addend ^ (("+" | "-") ^ Addend)*

Instead of evaluating from left to right or from right to left we just create a flat list this time. Then we can choose how to evaluate the whole list in a separate step. Here’s what the whole grammar looks like after this “flattening” of the recursive calls:

Multiplicand := Number
| "(" ^ Expression ^ ")"
Adddend := Multiplicand ^ (("*" | "/") ^ Multiplicand)*
Expression := Addend ^ (("+" | "-") ^ Addend)*

We will use the Repeated method as the star * operator from the grammar. The result will be a list of the parsed values. We can’t see the result type in the type of the parser because Go doesn’t have type parameters— this is odd. We just keep in mind that the method evaluate now needs to deal with lists.

In the end we have one more example to play with on the Go Playground.

There’s a way to do the same thing without creating an intermediate list. We can find it on the github page Parser-Gombinators and we won’t discuss it here.

Conclusion

When using parser combinators we’re working with an intersection of usual programming language semantics and context-free grammar semantics. Whenever the two diverge, we run into problems. We can’t just use any BNF and convert it into parser combinators. As we have seen, left-recursion leads to non-termination. Common prefixes in alternatives lead to incomplete parses or bad complexity. We’ve gone through all of this and we’ve seen the bug-fixes.

At last, now that we know what we need to look for we can use our parser combinators to write light-weight parsers that are easy to read and maintain.

Armin Heller is working at QAware GmbH as a software engineer.

--

--