The Importance of Left-recursion in Parsing Expression Grammars
This article is part of a series about Parsing Expression Grammars (PEGs). Therefore, we will use PEGs for all examples. The examples, however, should be generalizable to other grammars as well.
In this article, we demonstrate why left-recursion is a necessary part of grammar definitions. First, we will have a close look at what left-recursion is and under which circumstances it occurs. Then, we construct a simple math grammar to show why we need left-recursion.
What is Left-recursion?
Let us jump right in and look at an elementary grammar containing only two rules.
A := B "a" | "a"
B := A "b" | "b"
If we want to parse the string “aba” starting with rule A, this should be theoretically possible, and the parse tree should look like this:
Let us assume that we use a naive parser approach that always checks and invokes rules in prioritized order (like a simple packrat parser). A parser like this would invoke rule A, then invoke rule B and, after that, invoke rule A again, resulting in an infinite call loop that quickly leads to a stack overflow.
The parser evaluates the rules in a depth-first manner, having a strict prioritization of rule invocations. With this depth-first approach, rules can get invoked in cycles, which constitutes a significant problem for parsers.
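To make this concrete, here is a minimal sketch of such a naive recursive-descent parser in Python (the function names and helper structure are purely illustrative, not taken from any particular parser library):

# A naive recursive-descent parser for the grammar above:
#   A := B "a" | "a"
#   B := A "b" | "b"
# Each function returns the new position on success, or None on failure.

def parse_A(text, pos):
    # Alternative 1: A := B "a"
    result = parse_B(text, pos)              # recurses before consuming anything
    if result is not None and text[result:result + 1] == "a":
        return result + 1
    # Alternative 2: A := "a"
    if text[pos:pos + 1] == "a":
        return pos + 1
    return None

def parse_B(text, pos):
    # Alternative 1: B := A "b"
    result = parse_A(text, pos)              # invokes A again at the same position
    if result is not None and text[result:result + 1] == "b":
        return result + 1
    # Alternative 2: B := "b"
    if text[pos:pos + 1] == "b":
        return pos + 1
    return None

# parse_A("aba", 0) never gets anywhere: A calls B, B calls A, all at position 0,
# until the call stack overflows (in Python, a RecursionError).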
This problem of infinite rule-invocation cycles is called left-recursion. To give a more formal definition:
A rule is left-recursive if and only if it can derive itself during parsing without the parser making any progress within the parse string.
The Necessity of Left-recursion
We have worked out what left-recursion is and why it can be problematic for parsers. What we still do not know is why grammars nevertheless need to support it, and why we cannot simply ban left-recursive rules from grammar definitions to avoid all parsing issues.
In this section, we will explore this. We will look at a trivial grammar to parse mathematical addition and work out why it needs to be left-recursive.
Let us jump right in with a fairly basic grammar that is not left-recursive and can parse arbitrary chains of addition expressions:
expr := integer "+" expr | integer
integer := [1-9] [0-9]*
If we parse the string “7+5+3” with it, we get the following parse tree:
That worked great! We parsed the string into a tree representing the formula, and if we evaluate the tree from the bottom up, we get the correct result of 15.
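As a small illustration, the right-recursive rule can be implemented directly with recursive descent. The following Python sketch (illustrative names, no error handling, assumes well-formed input) builds the right-leaning tree as nested tuples and evaluates it:

import re

# Sketch of a parser for:
#   expr := integer "+" expr | integer
#   integer := [1-9] [0-9]*
def parse_expr(text, pos=0):
    m = re.match(r"[1-9][0-9]*", text[pos:])
    left = int(m.group())
    pos += m.end()
    if text[pos:pos + 1] == "+":
        right, pos = parse_expr(text, pos + 1)   # recurse on the rest of the input
        return ("+", left, right), pos
    return left, pos

def evaluate(node):
    if isinstance(node, int):
        return node
    _, left, right = node
    return left + evaluate(right)

tree, _ = parse_expr("7+5+3")
print(tree)            # ('+', 7, ('+', 5, 3))  -- the right-leaning parse tree
print(evaluate(tree))  # 15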
Now let us look at a slightly modified grammar that can parse subtraction expressions.
expr := integer "-" expr | integer
integer := [1-9] [0-9]*
If we parse the string “7-5-3” with it, we get the following parse tree:
Unfortunately, if evaluated, the tree would give us the result 7-(5-3)=5 instead of the correct (7-5)-3=-1. We can see that something is off with the order of operations: the tree groups the subtractions from the right, even though they need to be evaluated from left to right.
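To make the discrepancy explicit, here is a quick check of the two groupings in plain Python:

print(7 - (5 - 3))   # 5  -> what the right-leaning parse tree represents
print((7 - 5) - 3)   # -1 -> the result "7-5-3" is supposed to have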
Associativity of Mathematical Operations
Before we fix our grammar, we examine why the order of mathematical operations matters and how this is related to our parse trees. We know that for addition, it makes no difference in which order the operations are evaluated. This property is called associativity. An associative operation can be evaluated in any order, as shown in the following example for addition.
(a + b) + c = a + b + c = a + (b + c)
As we can see, we can subdivide associativity into two separate properties. We call them left associativity and right associativity.
Left associativity means that the operations are grouped from the left.
a + b + c = (a + b) + c
Right associativity means that the operations are grouped from the right.
a + b + c = a + (b + c)
We can now determine that the order in which additions are evaluated is irrelevant because addition is both left- and right-associative. For other operations, however, this does not hold: subtraction is left-associative but not right-associative.
a - b - c = (a - b) - c != a - (b - c)
Associativity transfers directly to recursion in a grammar. Rules that are left-recursive can represent left-associative operations, and rules that are right-recursive can represent right-associative operations.
The reason we need left-recursive rules in grammars is that they represent left-associative operations.
To wrap it up, we construct a left-recursive grammar that can parse additions and subtractions while preserving the parse tree’s order of operations.
expr := expr "+" integer |
expr "-" integer |
integer
integer := [1-9] [0-9]*
If we parse the string “7+5-3” with it, we get the following parse tree:
If evaluated, this parse tree would give us the correct result of 9. That is because it correctly represents the left-associativity of the subtraction.
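Because a naive recursive-descent parser cannot invoke the left-recursive rule directly, here is a sketch of one common way to realize it by hand: the left recursion becomes a loop that folds each new operand into the tree from the left (illustrative Python, no error handling, assumes well-formed input):

import re

# Sketch for:
#   expr := expr "+" integer | expr "-" integer | integer
#   integer := [1-9] [0-9]*
def parse_expr(text):
    m = re.match(r"[1-9][0-9]*", text)
    tree = int(m.group())
    pos = m.end()
    while pos < len(text) and text[pos] in "+-":
        op = text[pos]
        m = re.match(r"[1-9][0-9]*", text[pos + 1:])
        tree = (op, tree, int(m.group()))        # the tree grows to the left
        pos += 1 + m.end()
    return tree

def evaluate(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    left, right = evaluate(left), evaluate(right)
    return left + right if op == "+" else left - right

tree = parse_expr("7+5-3")
print(tree)            # ('-', ('+', 7, 5), 3)  -- the left-leaning parse tree
print(evaluate(tree))  # 9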
Types of Left-recursion
We have learned about left-recursion in general and why it is so essential despite being impractical to parse. We will now look at the different types of left-recursion and how they differ. We can divide left-recursion into three categories: direct left-recursion, indirect left-recursion, and hidden left-recursion.
Direct Left-recursion
Direct left-recursion is the most basic form of left-recursion. It occurs if a rule can directly invoke itself without making progress within the parse string. The following grammar that can parse any number of “a”s with just one rule is directly left-recursive.
A := A "a" | ""
Direct left-recursion is reasonably easy to detect and to handle in a parser, and many parsers do so.
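As a rough illustration, detecting it can be as simple as checking whether any alternative of a rule starts with the rule itself. The following sketch uses a toy dictionary representation of a grammar (purely illustrative, not the data model of any real parser generator):

# Each rule maps to a list of alternatives; each alternative is a list of symbols.
grammar = {
    "A": [["A", "a"], [""]],   # A := A "a" | ""
}

def is_directly_left_recursive(rule, grammar):
    # Directly left-recursive: some alternative begins by invoking the rule itself.
    return any(alt and alt[0] == rule for alt in grammar[rule])

print(is_directly_left_recursive("A", grammar))  # True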
Indirect Left-recursion
In contrast to direct left-recursion, indirect left-recursion occurs if the invocation happens indirectly, meaning that the invocation cycle contains more than one rule. We already looked at an indirectly left-recursive grammar in our first example.
A := B "a" | "a"
B := A "b" | "b"
This grammar can parse any string of alternating “a”s and “b”s that ends with an “a”. Indirect left-recursion is a little bit harder to detect, and most parsers cannot handle it.
Hidden Left-recursion
The last of the three types of left-recursion is the most tricky one. Hidden left-recursion happens when rules invoke themselves after invoking other rules that have not consumed any parse string input. To illustrate this, we slightly modify the grammar from the previous example such that any parsed string must start with “ab”.
A := &"ab" B "a" | "a"
B := A "b" | "b"
We inserted a lookahead expression into the first rule of the grammar that attempts to match the string “ab” and then backtracks to the starting point.
To explain what exactly happens and how this is left-recursive, let us parse the string “ababa”. The parser starts with rule A at the beginning of the string. It successfully matches the first two characters, “ab”, without consuming them. Still at the beginning of the string, the parser then invokes rule B, which invokes rule A again. There we again have an infinite cycle of rule invocations.
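A sketch of the corresponding naive parser shows where the cycle comes from (again illustrative Python, not any specific PEG library): the lookahead succeeds without moving the position, so rule B, and through it rule A, is entered at the very same spot.

# Naive recursive-descent sketch for:
#   A := &"ab" B "a" | "a"
#   B := A "b" | "b"

def lookahead_ab(text, pos):
    # &"ab": succeeds if "ab" follows, but never consumes anything
    return text[pos:pos + 2] == "ab"

def parse_A(text, pos):
    # Alternative 1: &"ab" B "a"
    if lookahead_ab(text, pos):
        result = parse_B(text, pos)          # still at the same position!
        if result is not None and text[result:result + 1] == "a":
            return result + 1
    # Alternative 2: "a"
    if text[pos:pos + 1] == "a":
        return pos + 1
    return None

def parse_B(text, pos):
    # Alternative 1: A "b"
    result = parse_A(text, pos)              # invokes A at the same position again
    if result is not None and text[result:result + 1] == "b":
        return result + 1
    # Alternative 2: "b"
    if text[pos:pos + 1] == "b":
        return pos + 1
    return None

# parse_A("ababa", 0) cycles A -> B -> A -> ... until the recursion limit is hit.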
It is important to note that the lookahead operator (&) is not the only operator by which hidden left-recursion can be constructed. The negative lookahead operator (!), the zero-or-more operator (*), and the optional operator (?) can be used as well. Also, one can use any combination of those that can result in a successful parse without consuming anything.
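For example, a rule such as A := "b"? A "a" (an illustrative rule of our own, not one of the grammars above) is hidden-left-recursive: the optional "b"? can succeed without consuming anything, so A may invoke itself at the very same position.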
Hidden left-recursion is quite hard to detect and even harder to handle.
Conclusion
In this article, we showed what left-recursion is and in which flavors it comes. We established why a grammar must have left-recursive rules if left-associative operations are part of the language the grammar represents. Therefore, we also know why parsers must support left-recursive rules.
We did not show an example of direct left-recursion’s practical necessity; such an example, however, can quite easily be constructed. Also, we did not give an example of hidden left-recursion. It turns out that it is relatively hard to find a real-world example for this use case. However, prohibiting hidden left-recursion could potentially restrict the expressiveness of PEGs, and just because we do not know of an example does not mean there is not one.
Fun Fact: Did you know that in the original paper about PEGs, left-recursion was explicitly forbidden?
If you liked this article or have any remarks, feel free to contact, comment, or clap.
Big thanks to Nico Duldhardt for reviewing and contributing to the article.