Yet Another Arithmetic Parser in Scala
Overview
In this article, we will implement an interpreter that will parse and evaluate arithmetical expressions. We will use the Scala language and Atto library. However most of our considerations will be technologically agnostic, so you should not have problems with implementation in your language of choice. In the beginning, we’ll come up with naive grammar and a corresponding interpreter. Then pinpoint and fix their shortcomings in an iterative manner. At the end of the day, our solution will reach its final form, respecting both operators' precedence and associativity.
Background
Some time ago I encountered a problem, which required one of my services to interpret and process a search query. Such a query could consist of one or more predicates pᵢ
combined with and/or
operators and parentheses (
)
. For example:
I thought that the most robust and elegant solution would be to define a formal grammar describing the query. A parser could then transform input into its AST representation. Dealing with AST in the deeper parts of the application would be much easier than operating on a raw input. The plan was set.
By the time I end up with a working solution, I came up with several incorrect implementations and hit the wall a time or two. With these experiences, I decided to create a step by step guide that will show you how to create and iteratively improve a grammar-based interpreter. However, instead of boolean expressions, we’ll work with arithmetic expressions which can be seen as a superset of logical expressions.
First iteration
I assume you have a general idea of formal grammars and terms like production, terminal, and nonterminal ring a bell.
Let’s start with the grammar of a single addition operation. It’s written as a number, a plus sign, and another number.
Let’s implement it in Atto!
Although it's possible to construct this parser in a more concise way, for readability reasons we will leverage Parser monadic API and use Scalas for comprehension syntax.
Expr.parse("1+2").done // Done(,3.0)
So far so good 🎉 Our parser correctly evaluated 1+2
.
Can it evaluate the sum of three numbers?
Expr.parse("1+2+3").done // Done(+3,3.0) -- expected 6.0
Well, it can’t. Parser didn’t even consume the whole expression. The first parameter of the returned Done
structure shows, that chunk +3
was not parsed. Let’s fix it then.
Sum of many numbers
Intuition tells us, that solution for summing many numbers will somehow involve recursion. And that’s a good guess.
We can handle the sum of numbers similarly to list folding. The first digit is read and added to the result of the evaluated remaining expression. Here is the recursive pattern: the computation of a longer expression will repeatedly evaluate its smaller and smaller subexpressions. It is necessary to define a stop condition because eventually, a parser will reach the last number without a following plus sing.
By introducing described recursive pattern we end up with the following grammar:
Here is corresponding parser:
Let’s run some tests:
Expr.parse("1").done // Done(,1.0)
Expr.parse("1+2").done // Done(,3.0)
Expr.parse("1+2+3+4+5").done // Done(,15.0)
Looks like it’s working. Our parser even handles the evaluation of single digits. Next, we will extend arithmetic expressions with multiplication.
Multiplication
Supporting multiplication doesn’t look like a big deal. When we run into multiplication sign we multiply, and upon plus sign we add, right? Let’s remodel our grammar and interpreter then.
Unfortunately, tests show that this approach doesn’t work as we would like.
Expr.parse("2+2*2+2").done // Done(,10.0) -- expected 8
The correct result of 2+2*2+2
is definitely not 10
. To understand how it happened, let’s simulate the construction of the parsing tree and its evaluation for such expression. We have to follow the same rules that Atto and most of the other parsing libraries are built upon:
- symbols are read from left to right
- leftmost nonterminal is always derived first
Solid lines connect symbols expanded by parser while reading input, and dashed lines indicate terminals used at each step of evaluation. As we can see, the parsing tree expands only on the right side. Our interpreter, consuming terminals staring from the bottom, effectively reduces operators from the rightmost one and going left, regardless of its type. That way we do not respect operators' precedence. They are just evaluated in backward order treating sum and multiplication as operations of the same priority. Let’s change that.
Operators precedence
We want to encode operators order of execution in grammar. Sum operation should be collected only after all adjacent multiplications have been evaluated. This can be achieved by splitting our single production into two: one for addition -Expr
, and another for multiplication — Term
. The principle of multiplication rule doesn’t change. Term
calls itself recursively until there’s no immediate *
after a number. It’s Expr
production, that holds all the magic. Instead of parsing a number it now uses a Term
rule to consume the input.
Such a relation between these two ensures, that every Term
production completes before any Expr
. This order allows the parser to respect operators' priority while evaluating an expression.
Looks like we nailed it 💪
Expr.parse("2+2*2+2").done // Done(,8.0)
Let’s see the interpretation process:
The first difference with the previous approach is that we no longer expand only on the right side. The pattern of derivation holds the key to operators precedence. According to the design of our new grammar, every Term
nonterminal is derived first. This is due to the fact, that Term
is the first element in Expr
production. And, as we said before, Atto always expands the leftmost nonterminal first. The interpreter follows this pattern and reduces multiplication operation before addition. That way we encoded semantics of operators' precedence in grammar structure.
There are two operations that still were not included: subtraction and division. Let's get to it then.
Subtraction and division
Subtraction and division have the same priority as addition and multiplication, respectively. Including them to the grammar should be straightforward:
Here is corresponding parser:
And here are some tests:
Expr.parse("5-2").done, // Done(,3.0)
Expr.parse("6-12/2").done, // Done(,0.0)
Expr.parse("1-2-3").done, // Done(,2.0) -- expected -4.0
Expr.parse("16/4/2").done, // Done(,8.0) -- expected 2.0
Something goes wrong for expressions with more than one subtraction or division. To find roots of the problem, again let's take a closer look at the process of parsing and interpretation for the expression 1-2-3
.
Operators defined in the same grammar production have the same priority. This is what we expect, but there’s a catch. Given operators of the same priority, parser consumes subsequent operators by expanding the right side of the tree. And, as we have seen earlier, in such a scenario interpreter will evaluate operators in backward order. That’s mean our parser is right-associative. And so subexpressions in1-2-3
are grouped from last to first what leads to an invalid result:
This is not how we expect subtraction to behave. While right-associativity works for addition and multiplication, it turns out that subtraction and division are inherently left-associative. Their subsequent operators can only be grouped from first to last. To fix this we will try to modify grammar, so it expands from the leftmost side.
Making parser left-associative
The direction of parsing tree expansion is determined by a place in grammar from which a production calls itself. To expand on the left we have to move recursion call to the left. Therefore, it should be the first action, that happens for a given production.
Here is a new grammar:
And here is the corresponding interpreter:
Let's make a simple test:
Expr.parse("1-2-3").done // it never ends...
No matter how long we will wait, this expression will never evaluate. In fact, it never ends for any expression, even trivial ones:
Expr.parse("1").done // it never ends...
What is the source of the problem? Based on experience, we can suspect some kind of never-ending loop happening under the hood. Normally we would expect StackOverflowException being thrown, but Atto uses stack safe recursion which will never make our stack explode. Again, let's trace parsing and interpretation process:
As said before, Atto always expands the leftmost nonterminal. We start with the initial Expr
symbol. It is the only nonterminal, so it is also the leftmost one. Therefore parser will try to replace it with Expr '+' Term
. Now we have two nonterminals: Expr
and Term
. Again the leftmost one is Expr
and again parser will replace it with Expr '+' Term
, making newExpr
leftmost nonterminal again. That way it falls into the infinite loop it can never escape. This pattern of production having reference to itself as the first element in its replacement is called left recursion and can be formally defined as below:
Parsers that expands leftmost nonterminals cannot deal with left recursion, because they will never advance. We don’t want to modify Atto parsing engine. We’ll have to modify grammar then.
Removing left recursion
Left recursion is often recognized as an undesirable pattern in grammars. Luckily, we can remove left recursion from productions by applying the following formula.
Having left recursive rule as below:
where α is a nonempty sequence of terminals and nonterminals and β is a sequence of terminals and nonterminals without nonterminal A at the beginning, we can transform it into two rules:
where Ɛ is an empty string. This removes left recursion while preserving language generated by the original grammar.
Applying these transformations to ourExpr
production:
we end up with these two:
As expected there is no more left recursion in Expr
. We do the same for Term
production and get final grammar:
Let’s modify interpreter to work with new grammar:
Interpreter get’s a little more complicated. Expr_prim
and Term_prim
returns a function, that expects a number to be used as a lefthand side operand. Upon Ɛ symbol an identity function is returned. This is equivalent to returning lefthand side operand in absence of operator with expected priority.
We run some tests:
Expr.parse("1").done // Done(,1.0)
Expr.parse("1-2-3").done // Done(,-4.0)
Expr.parse("16/4/2").done // Done(,2.0)
Expr.parse("1+2+12/4-3*3-2-1").done // Done(,-6.0)
And looks like it’s working 🎊 On top of grammar we built an interpreter that respects operators' precedence and associativity.
Let’s add support of parentheses as the icing on the cake.
Parentheses
Parentheses have the highest precedence in expression. To respect its priority before other operators, we’ll apply the same pattern we used to introduce multiplication operation.
We need a new production, let’s name it Factor
. Its replacement is either a number or an expression in parentheses. That way we can include within parenthesis an arbitrary expression, that belongs to the language generated by Expr
.
In the next step Term
and Term_prim
productions have to be modified. Each number
symbol within them has to be replaced with aFactor
nonterminal. That way if an expression within parentheses is an operand of either side, it will be evaluated before applying it to its operator.
Below is the final grammar look:
And here is a final form of the interpreter:
Let’s run some tests on expressions with parentheses:
Expr.parse("(10-(2+2)*2)/2").done // Done(,1.0)
Expr.parse("10-(2+2*2)/2").done // Done(,7.0)
Expr.parse("((2+2))*2").done // Done(,8.0)
It works! Our interpreter correctly evaluates parentheses and respects their highest priority. What’s more, it handles nested or even redundant parentheses.
Summary
We did it 🎉 We built an interpreter that evaluates arithmetic expressions consisting of numbers, basic operators and parentheses. We started with a simple grammar to handle the addition of two numbers. Then we added support for expressions consisting of many operators. Next, we dealt with operators precedence and associativity. We saw how to get rid of left recursion, which can be tricky for most parsers. At the final step, we added the support of parentheses.
Although our interpreter already can be used to evaluate quite complex expressions, it can be further developed by supporting power operation or ignoring whitespaces from the input.
I encourage you to consider using parsers next time, you’ll need to interpret some text-based data or queries. Relying on formal grammars and parsers, which theory is well researched, can turn out to be more robust, readable and extensible then in-house built solutions. Although the task will probably look complicated at first glance, you can start small and iteratively add new cases, just like we did in this post. Just avoid left recursion and draw a lot parsing trees 🌲
All source code is available here: https://github.com/bszmit/medium-parsers