Taming The Lexer
Language parsing is traditionally split into two phases: Lexing and Parsing. In this post I want to talk about how a Lexer generally works and what you can do if it doesn’t.
What A Lexer Does
The duty of a lexer is to turn a sequence of single characters into a sequence of so called tokens. A token is a chunk of characters associated with a certain token-type. Most programming languages define individual lexer rules for things like names (identifiers), string literals, numbers, whitespace and comments. The latter two are usually not passed to the parser, or at least sent in a special way, so that we don’t have to deal with whitespace and comments explicitly afterwards.
Within the parser rules we can then use the more coarse grained token types to define the grammar of a language. It’s possible to parse without a lexer (or scanner), which is then called scanner-less parsing. The main reason for having a lexer doing the tokenizing first is performance.
How A Lexer Works
There are a lot of similarities in how a lexer and a parser work; in Xtext and Antlr the lexer and parser rules share most of the concepts and only have a few differences. The most important difference is that a lexer is usually free of context.
Most lexers work by following these two principles:
- Be greedy
That is use the lexer rule that consumes the most characters - First Rule Wins
If there are two or more rules matching the same amount of characters, the first one wins.
Example
In Xtext, the lexer consists of all the lexer rules plus the keywords used in the parser rules. Because we want to give keywords a higher precedence, they are ‘copied’ before the defined lexer rules. Let’s consider the following two rules:
HelloWorld : 'Hello' ID ;terminal ID: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* ;
The input
Hello Xtext
will be lexed into two tokens, ‘Hello’ and ID (if we ignore the whitespace). Although ‘Hello’ would be a perfect match for ID, it is also a match for the ‘hello’ keyword, which is always preferred (declared before any lexer rule). So principle 2 applies here.
For the input
HelloXtext
we end up with one token ID. In this case principle 1 applies, as the sequence ‘HelloXtext’ is longer than what the keyword rule could have matched.
Problems With The Antlr 3 Lexer
Xtext generates an Antlr 3 based lexer, which also follows the two principles above. Unfortunately it only guesses on what rule will possibly match the longest sequence using a statemachine that does look ahead. Based on that outcome, it decides on one of the lexer rules. The problem is that the statemachine doesn’t do the full lexing and sometimes is wrong.
Example:
TheNumber : 'the' 'number' 'is' number=NUMBER '.';terminal NUMBER : '0'..'9'+ ('.' '0'..'9'+)?;
With this grammar it should be possible to write
the number is 24.
and
the number is 23.00.
but not
the number is 23..
The lexer generated by Antlr 3 will however not work as expected, but will always try to consume any dots following numbers as part of the NUMBER rule. As a result the first and the last text will have errors.
Ways To Solve This Issue
Most of the time, you can get around this kind of problems by using parser rules instead. E.g. you could write the following:
TheNumber : 'the' 'number' 'is' number=Decimal '.';
Decimal : NUMBER ->('.' NUMBER)?;
terminal NUMBER : '0'..'9'+;
But sometimes it is more involved and you really need (or want) to solve this on the lexer level. In that case you can use a different implementation. Both Antlr4 and JFlex will work fine with the above example.
I just recently replaced the lexer in a customer’s project which is actually open-source. If you want to see how this can be done, you can find the code here.
Of course I will be happy to do that for you, as well. Happy Lexing! :)