# Parser Combinator Gotchas

## How to avoid trouble when using parser combinators in Go

Tl/dr: Here’s how to solve the Gotchas in the Code of the last post.

- We use function definitions to define recursive parsers (η-conversion).
- We factor out common prefixes of alternatives in the grammar.
- We transform left-recursive grammars into right-recursive ones using repetition. This allows us to parse left-associative operators.

We’re writing a parser for the following grammar with the help of parser combinators in Go.

**Digit **:= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Number := **Digit**+

**Multiplicand** := **Number**

| "(" ^ **Expression **^ ")"

**Addend** := **Multiplicand**

| **Multiplicand** ^ "*" ^ **Addend**

Expression := **Addend**

| **Addend **^ "+" ^ **Expression**

In the last post we learned how to use parser combinators. Now we’re going to take a look at what we have to watch out for when using them.

## Gotcha: Recursive Functions, No Recursive Values

We need the`Expression`

parser in order to implement `Multiplicand`

. For `Expression`

we need `Addend`

and for that we need `Multiplicand`

again.

`Multiplicand ~calls~> Expression ~> Addend ~calls~> Multiplicand`

This is called **recursion**. Go only allows for recursion on functions. In this dummy example we try to use recursion on variables and the example fails to compile. In another dummy example we use recursion on functions and it compiles without errors! Luckily our parsers are functions. Converting the definition of `Multiplicand`

is easy. Here’s the naive version from the last post.

After turning it into a function definition we get:

Recursive parsers, here we come!

## Gotcha: First-come, first-served

Finally, here’s our naive parser for the expression grammar.

Surprise! Our parser has a bug. Let’s try it on the Go Playground! If we parse the text `"1+2"`

, then we’ll see the following result.

`result = 1`

rest of input = +2

The parser didn’t eat up its input. If the parser could think, it would probably be something along these lines:

I'm the parser.

I try to parse an Expression.

Expression can be an Addend.

Addend can be a Multiplicand.

Multiplicand can be a Number.Oh great, 1 is a Number ... I'm done parsing the Expression.

More sophisticated combinators can handle this grammar better and give us the longest match instead of the first one. We’ll stick to the simple combinators, however and we’re going to deal with the problems by ourselves. One problem is the following rule.

**Expression **:= **Addend**

| **Addend **^ "+" ^ **Expression**

The single

comes first and the compound expression comes second. If there’s an **Addend**

in the input, our parser doesn’t bother trying to find a compound expression. There’s an easy fix: We change the order!**Addend**

**Expression **:= **Addend **^ "+" ^ **Expression**

| **Addend**

If our parser starts out with this grammar and if it could think, it would probably be thinking:

`I'm the parser.`

I try to parse an Expression.

Expression can be an Addend and then "+" and then Expression

...

This time it will only consider the alternative with the single

if it can’t find the “+” followed by the Expression.**Addend**

After applying this idea to every rule of our grammar we get a correct parser.

This Example shows our parser in action.

## Gotcha: Bad Complexity

Our parser has a performance problem. It repeats parsing the same input with the same parser. We take “1” as an example input. The parser tries to parse

only to realize that there is no “+” character. After that it starts out **Addend** "+" **Expression***again* to parse a single

. For this example it’s only a factor of two in performance “loss”. What happens if we try to parse “((..(1)..))” with a lot of parenthesis? Let’s look at another example. Just adding a few more surrounding parentheses to the input text will make it run into a time-out. It’s not just bad performance, it’s bad complexity. We might have exponential run-time here.**Addend**

Instead of proving the exponential time-bound we’re going to remove it. As before, we can see the problem from the grammar:

**Expression **:= **Addend **^ "+" ^ **Expression**

| **Addend**

We have two alternatives that start with the same

. Why don’t we factor out the addend and make the rest optional?**Addend**

**Expression** := **Addend** ^ ("+" ^ **Expression**)?

Through this we make sure that the `Addend`

will only parse once for the same input. We give the method for `?`

the name `Optional`

and use it in our final parser for this grammar. The return value of `Optional`

will be `Nothing`

when the parser successfully parses the empty string `""`

. Consequently, the method `evaluate`

evaluating the terms on the fly must handle `Nothing`

correctly.

## Bonus Gotcha: Left-Associative Operators

We left out subtraction and division for a good reason: We’re parsing and evaluating from right to left. Our parser will evaluate the expression `1+2+3`

to `1+5`

and then to `6`

. If we allowed subtraction, then the expression `1-2-3`

would evaluate to `1-(-1)`

and then to `2`

— which is wrong, because `1-2-3`

should be `-4`

. Let’s look at our current grammar and add subtraction and division where we naively think they should fit in:

**Multiplicand** := **Number**

| "(" ^ **Expression **^ ")"

**Addend** := **Multiplicand **^ (("*" | "/") ^ **Addend**)?

Expression := **Addend **^ (("+" | "-") ^ **Expression**)?

If we read it like a parser would do, it will tell us that it firstly parses *one* addend and then the rest of the expression. Thus `1+2+3`

becomes `the `

, which is not what we want. We want **Addend** 1 + the **Expression** 2 + 3`the `

and then to further divide the first expression. Now let’s change the grammar accordingly.**Expression** 1 + 2 + the **Addend** 3

**Addend** := **Addend **^ (("*" | "/") ^ **Multiplicand**)?

| **Multiplicand**

Expression := **Expression **^ (("+" | "-") ^ **Addend**)?

| **Addend**

We’ve left out the first rule because it hasn’t changed anyway. This is a correct context-free grammar that produces the syntax trees suitable for evaluation from left to right. However, we can’t parse this with naive parser combinators. To parse an **Expression **we first need to parse an** Expression**. We need another fix for this Gotcha.

The **Expression** does nothing but glue together several **Addend**s with interspersed “+” or “-” signs. This sounds like repetition can be useful. Let’s expand **Expression **several times to get a feeling for how it works.

**Expression **:= **Addend**

| **Addend** ^ ("+" | "-") ^ **Addend**

| **Addend** ^ ("+" | "-") ^ **Addend **^ ("+" | "-") ^ **Addend**

| ...

The expression always starts with an

, followed by any number of **Addend**`("+" | "*") ^ `

, which is easy to express with the repetition operator **Addend**`*`

in context-free grammars.

**Expression** := **Addend** ^ (("+" | "-") ^ **Addend**)*

Instead of evaluating from left to right or from right to left we just create a flat list this time. Then we can choose how to evaluate the whole list in a separate step. Here’s what the whole grammar looks like after this “flattening” of the recursive calls:

**Multiplicand** := **Number**

| "(" ^ **Expression** ^ ")"

**Adddend** := **Multiplicand** ^ (("*" | "/") ^ **Multiplicand**)*

**Expression** := **Addend** ^ (("+" | "-") ^ **Addend**)*

We will use the `Repeated`

method as the star `*`

operator from the grammar. The result will be a list of the parsed values. We can’t see the result type in the type of the parser because Go doesn’t have type parameters— this is odd. We just keep in mind that the method `evaluate`

now needs to deal with lists.

In the end we have one more example to play with on the Go Playground.

There’s a way to do the same thing without creating an intermediate list. We can find it on the github page Parser-Gombinators and we won’t discuss it here.

## Conclusion

When using parser combinators we’re working with an intersection of usual programming language semantics and context-free grammar semantics. Whenever the two diverge, we run into problems. We can’t just use any BNF and convert it into parser combinators. As we have seen, left-recursion leads to non-termination. Common prefixes in alternatives lead to incomplete parses or bad complexity. We’ve gone through all of this and we’ve seen the bug-fixes.

At last, now that we know what we need to look for we can use our parser combinators to write light-weight parsers that are easy to read and maintain.

Armin Heller is working at QAware GmbH as a software engineer.