Day 94: Earley parser
Yesterday I implemented an algorithm that directly relates to the implementation of a parser for formal grammars. Today I implemented the parser itself.
However, I have no use for yesterday’s work since the parser is not SLR [as you might have expected]. Instead, I chose the Earley parser for its interesting properties.
The most important one: the Earley parser can handle both left-recursive and right-recursive grammars, and it can even handle ambiguous grammars.
In fact, whatever context-free grammar you have, the parser will work just fine. I should say that is not the case for my implementation, since I do not support ε-production rules, but it would not be difficult to add support for them as well.
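Supporting ε-rules is a well-known extension: the usual first step is to compute which non-terminals are nullable, i.e. can derive the empty string, so the predictor can also advance the dot over them. Here is a minimal sketch of that nullable-set computation; the grammar format and the function name are my own for illustration, not part of the article's code:

```python
def nullable_set(rules):
    """rules: iterable of (lhs, rhs) pairs, where rhs is a string of
    symbols and '' denotes an ε-production.
    Returns the set of nullable non-terminals."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # lhs is nullable if every symbol of rhs is nullable
            # (trivially true for an ε-production, rhs == '')
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

rules = [('A', ''), ('B', 'A'), ('C', 'x'), ('D', 'BC')]
nullable = nullable_set(rules)   # {'A', 'B'}
```

The loop iterates to a fixed point, so chains like B ::= A, A ::= ε are picked up even though B has no ε-rule of its own.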
What is the secret and source of its power?
The Earley parser simulates a non-deterministic finite automaton! It uses a dot symbol inside production rules to mark the current position. Moreover, the parser builds the DFA power-set on the fly by simulating all possible positions at once, which is how you would typically implement an NFA.
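To illustrate the underlying trick, here is a tiny standalone sketch (my own, not part of the article's code) of simulating an NFA by tracking the set of all states it could currently be in, which is exactly the power-set idea:

```python
def simulate_nfa(transitions, start, accepting, text):
    """Run an NFA given as {(state, symbol): {next states}} by keeping
    the set of all states reachable after each input symbol."""
    current = {start}
    for ch in text:
        current = {t for s in current
                     for t in transitions.get((s, ch), set())}
    return bool(current & accepting)

# NFA for the language (a|b)*b: state 0 is non-deterministic on 'b',
# it may stay in 0 or move to the accepting state 1.
nfa = {
    (0, 'a'): {0},
    (0, 'b'): {0, 1},
}
```

No backtracking is needed; the cost is that each step manipulates a whole set of states, and the Earley parser pays a similar price per input position.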
There are three mechanisms used in the parser.
- predictor — constructs new automaton states from the productions of the non-terminal at the dot symbol
- completer — similar to the reduce action in an LR parser; handles reductions of rules with the dot symbol at the end
- scanner — similar to the shift action in an LR parser; advances the dot symbol over terminals in the rules
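The interplay of the three mechanisms can be sketched as a compact standalone recognizer. This is my own minimal illustration with a simplified grammar format (uppercase letters as non-terminals, items as plain tuples), not the article's implementation:

```python
def earley_recognize(grammar, start, text):
    """Minimal Earley recognizer sketch. grammar maps a non-terminal
    to a list of productions (strings of symbols); uppercase letters
    are non-terminals. Items are (lhs, production, dot, origin)."""
    chart = [set() for _ in range(len(text) + 1)]
    for prod in grammar[start]:
        chart[0].add((start, prod, 0, 0))

    for k in range(len(text) + 1):
        work = list(chart[k])
        while work:
            lhs, prod, dot, origin = work.pop()
            if dot == len(prod):
                # completer: advance every item waiting for lhs
                for l2, p2, d2, o2 in list(chart[origin]):
                    if d2 < len(p2) and p2[d2] == lhs:
                        new = (l2, p2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new)
                            work.append(new)
            elif prod[dot].isupper():
                # predictor: expand the non-terminal at the dot
                for p in grammar[prod[dot]]:
                    new = (prod[dot], p, 0, k)
                    if new not in chart[k]:
                        chart[k].add(new)
                        work.append(new)
            elif k < len(text) and prod[dot] == text[k]:
                # scanner: move the dot over a matching terminal
                chart[k + 1].add((lhs, prod, dot + 1, origin))

    return any(l == start and d == len(p) and o == 0
               for l, p, d, o in chart[len(text)])
```

Note that the left-recursive grammar `{'E': ['E+x', 'x']}` poses no problem here, and neither does an ambiguous one like `{'S': ['SS', 'a']}`; the chart simply holds all the competing items at once.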
I think you have already guessed what the problem with the Earley parser is and why it has been replaced by LR parsers — simulating an NFA is slow and memory-consuming.
However, if you have a complicated grammar, you can either run several algorithms to remove recursion, ambiguity, or look-aheads… or you can just run the Earley parser, which handles the grammar as it is.
https://github.com/coells/100days
https://notebooks.azure.com/coells/libraries/100days
algorithm
class EarleyParser:

    def __init__(self, grammar):
        self.grammar = grammar
        self.states = []

    def parse(self, text):
        # one state set per input position; '\u0000' marks the end of input
        self.states = [set() for _ in range(len(text) + 1)]
        self.states[0].add(State(*self.grammar.start))

        for k, token in enumerate(text + '\u0000'):
            extension = list(self.states[k])
            self.states[k].clear()

            while extension:
                state = extension.pop()
                if state in self.states[k]:
                    continue

                self.states[k].add(state)

                if state.finished:
                    self._completer(state, extension)
                elif state.symbol_is_nonterminal:
                    self._predictor(state, k, extension)
                else:
                    self._scanner(state, k, token)

        self._print(text)

    def _predictor(self, state, origin, extension):
        # expand the non-terminal at the dot into all of its rules
        for rule in self.grammar[state.symbol]:
            extension.append(State(*rule, origin=origin))

    def _scanner(self, state, origin, token):
        # advance the dot over a matching terminal
        if state.symbol == token:
            self.states[origin + 1].add(state.shift)

    def _completer(self, state, extension):
        # a finished rule advances every state waiting for its non-terminal
        for reduce in self.states[state.origin]:
            if state.nonterminal == reduce.symbol:
                extension.append(reduce.shift)

    def _print(self, text):
        for k, state in enumerate(self.states):
            accepts = any(s.nonterminal == '^' and s.finished for s in state)
            print('(%d)' % k, end=' ')
            print('"%s.%s"' % (text[:k], text[k:]), end=' ')
            print(accepts and 'ACCEPTS' or '')
            for i in state:
                print('\t', i)
class State:

    def __init__(self, nonterminal, expression, dot=0, origin=0):
        self.nonterminal = nonterminal
        self.expression = expression
        self.dot = dot
        self.origin = origin

    @property
    def finished(self):
        return self.dot >= len(self.expression)

    @property
    def symbol(self):
        return None if self.finished else self.expression[self.dot]

    @property
    def symbol_is_nonterminal(self):
        return self.symbol and self.symbol.isalpha() and self.symbol.isupper()

    @property
    def shift(self):
        return State(self.nonterminal, self.expression,
                     self.dot + 1, self.origin)

    @property
    def tuple(self):
        return (self.nonterminal, self.expression, self.dot, self.origin)

    def __hash__(self):
        return hash(self.tuple)

    def __eq__(self, other):
        return self.tuple == other.tuple

    def __str__(self):
        n, e, d, o = self.tuple
        return '[%d] %s -> %s.%s' % (o, n, e[:d], e[d:])
class Grammar:

    def __init__(self, *rules):
        self.rules = tuple(self._parse(rule) for rule in rules)

    def _parse(self, rule):
        return tuple(rule.replace(' ', '').split('::='))

    @property
    def start(self):
        return next(self['^'])

    def __getitem__(self, nonterminal):
        yield from [rule for rule in self.rules if rule[0] == nonterminal]
grammar: arithmetic expression
grammar = Grammar(
'^ ::= E',
'E ::= E + T',
'E ::= E - T',
'E ::= T',
'T ::= T * F',
'T ::= T / F',
'T ::= F',
'F ::= ( E )',
'F ::= - F',
'F ::= x',
'F ::= y',
'F ::= z',
)

EarleyParser(grammar).parse('x-x*(y+z)')

>>>

(0) ".x-x*(y+z)"
[0] F -> .-F
[0] T -> .T/F
[0] E -> .T
[0] F -> .z
[0] ^ -> .E
[0] F -> .y
[0] T -> .T*F
[0] F -> .(E)
[0] E -> .E+T
[0] F -> .x
[0] E -> .E-T
[0] T -> .F
(1) "x.-x*(y+z)" ACCEPTS
[0] E -> T.
[0] T -> T./F
[0] ^ -> E.
[0] F -> x.
[0] E -> E.-T
[0] T -> F.
[0] T -> T.*F
[0] E -> E.+T
(2) "x-.x*(y+z)"
[2] T -> .T*F
[2] F -> .y
[0] E -> E-.T
[2] F -> .x
[2] F -> .(E)
[2] T -> .F
[2] T -> .T/F
[2] F -> .-F
[2] F -> .z
(3) "x-x.*(y+z)" ACCEPTS
[2] F -> x.
[2] T -> F.
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
(4) "x-x*.(y+z)"
[4] F -> .z
[4] F -> .y
[4] F -> .(E)
[2] T -> T*.F
[4] F -> .x
[4] F -> .-F
(5) "x-x*(.y+z)"
[5] F -> .(E)
[5] E -> .E+T
[5] F -> .y
[5] T -> .T/F
[5] E -> .E-T
[5] T -> .F
[5] F -> .x
[5] F -> .-F
[4] F -> (.E)
[5] T -> .T*F
[5] E -> .T
[5] F -> .z
(6) "x-x*(y.+z)"
[5] E -> E.-T
[4] F -> (E.)
[5] T -> F.
[5] T -> T./F
[5] E -> E.+T
[5] E -> T.
[5] T -> T.*F
[5] F -> y.
(7) "x-x*(y+.z)"
[7] F -> .x
[7] T -> .T*F
[7] F -> .-F
[5] E -> E+.T
[7] F -> .z
[7] F -> .(E)
[7] T -> .T/F
[7] F -> .y
[7] T -> .F
(8) "x-x*(y+z.)"
[7] T -> T.*F
[7] F -> z.
[4] F -> (E.)
[5] E -> E.-T
[5] E -> E.+T
[7] T -> T./F
[7] T -> F.
[5] E -> E+T.
(9) "x-x*(y+z)." ACCEPTS
[4] F -> (E).
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
[2] T -> T*F.