Day 94: Earley parser

Tomáš Bouda
100 days of algorithms
5 min read · Jun 26, 2017


Yesterday I implemented an algorithm that directly relates to the implementation of a parser for formal grammars. Today I have implemented the parser itself.

However, I have no use for yesterday’s work since the parser is not SLR [as you might have expected]. Instead, I have chosen the Earley parser due to its interesting properties.

Most importantly, the Earley parser is able to handle both left-recursive and right-recursive grammars, and it can even handle ambiguous grammars.

In fact, whatever context-free grammar you have, the parser will work just fine. I should say that’s not quite the case for my implementation, since I do not support ε-production rules, but it would not be difficult to add that support as well.

What is the secret and source of its power?

The Earley parser simulates a non-deterministic finite automaton! It uses a dot symbol inside production rules to mark the current position. Rather than precomputing the DFSA power-set in advance, the parser builds it on the fly by simulating all the possible positions at once, which is how you would typically implement an NDFSA.

There are three mechanisms used in the parser.

  • predictor — constructs new automaton states from the possible productions of the non-terminal that follows the dot symbol
  • completer — similar to the reduce action in an LR parser; handles reductions of rules with the dot symbol at the end
  • scanner — similar to the shift action in an LR parser; advances the dot symbol over terminals in the rules
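
To make the dot concrete: an Earley item is just a rule plus a dot position and an origin. Here is a minimal sketch using bare tuples instead of the article’s `State` class, so the names are illustrative only:

```python
# An Earley item as a bare tuple: (lhs, rhs, dot, origin).
# The dot marks how much of the rule's right-hand side has been matched;
# origin is the input position where the match began.
item = ('E', 'E+T', 0, 0)        # represents  E -> .E+T  started at position 0

def shift(item):
    # Both scanner and completer advance an item the same way:
    # by moving the dot one symbol to the right.
    lhs, rhs, dot, origin = item
    return (lhs, rhs, dot + 1, origin)

shifted = shift(item)            # now represents  E -> E.+T
```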

I think you have already guessed what the problem of the Earley parser is and why it has been replaced by LR parsers: simulating an NDFSA is very slow and memory-consuming.

However, if you have a complicated grammar, you can either run several algorithms to remove recursion, ambiguity, or the need for look-ahead… or you can just run the Earley parser, which handles the grammar for you.
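
For comparison, this is the kind of transformation the Earley parser lets you skip: eliminating immediate left recursion by rewriting A → Aα | β into A → βA′, A′ → αA′ | ε. The function name and rule encoding below are my own illustration, not part of the article’s parser:

```python
# Hypothetical sketch of immediate left-recursion removal.
# Rules are (lhs, rhs) pairs where rhs is a tuple of symbols.
def remove_left_recursion(lhs, rules):
    """Rewrite A -> A a | b into A -> b A', A' -> a A' | eps."""
    recursive = [rhs[1:] for l, rhs in rules
                 if l == lhs and rhs and rhs[0] == lhs]
    base = [rhs for l, rhs in rules
            if l == lhs and (not rhs or rhs[0] != lhs)]
    if not recursive:
        return [(lhs, rhs) for rhs in base]
    prime = lhs + "'"
    out = [(lhs, rhs + (prime,)) for rhs in base]          # A  -> b A'
    out += [(prime, rhs + (prime,)) for rhs in recursive]  # A' -> a A'
    out += [(prime, ())]                                   # A' -> eps
    return out

rules = [('E', ('E', '+', 'T')), ('E', ('T',))]
# Produces: E -> T E' ;  E' -> + T E' | eps
```

Note that the result needs ε-productions, which is exactly what the article’s implementation avoids by taking the Earley route instead.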

https://github.com/coells/100days

https://notebooks.azure.com/coells/libraries/100days

algorithm

class EarleyParser:
    def __init__(self, grammar):
        self.grammar = grammar
        self.states = []

    def parse(self, text):
        # states[k] is the set of Earley items ending at input position k
        self.states = [set() for _ in range(len(text) + 1)]
        self.states[0].add(State(*self.grammar.start))

        for k, token in enumerate(text + '\u0000'):
            extension = list(self.states[k])
            self.states[k].clear()

            while extension:
                state = extension.pop()
                if state in self.states[k]:
                    continue

                self.states[k].add(state)

                if state.finished:
                    self._completer(state, extension)
                elif state.symbol_is_nonterminal:
                    self._predictor(state, k, extension)
                else:
                    self._scanner(state, k, token)

        self._print(text)

    def _predictor(self, state, origin, extension):
        # expand the non-terminal after the dot into all of its rules
        for rule in self.grammar[state.symbol]:
            extension.append(State(*rule, origin=origin))

    def _scanner(self, state, origin, token):
        # advance the dot over a matching terminal
        if state.symbol == token:
            self.states[origin + 1].add(state.shift)

    def _completer(self, state, extension):
        # a finished rule advances every item waiting for its non-terminal
        for reduce in self.states[state.origin]:
            if state.nonterminal == reduce.symbol:
                extension.append(reduce.shift)

    def _print(self, text):
        for k, state in enumerate(self.states):
            accepts = any(s.nonterminal == '^' and
                          s.finished for s in state)
            print('(%d)' % k, end=' ')
            print('"%s.%s"' % (text[:k], text[k:]), end=' ')
            print(accepts and 'ACCEPTS' or '')
            for i in state:
                print('\t', i)


class State:
    def __init__(self, nonterminal, expression, dot=0, origin=0):
        self.nonterminal = nonterminal
        self.expression = expression
        self.dot = dot
        self.origin = origin

    @property
    def finished(self):
        return self.dot >= len(self.expression)

    @property
    def symbol(self):
        return None if self.finished else self.expression[self.dot]

    @property
    def symbol_is_nonterminal(self):
        return self.symbol and \
            self.symbol.isalpha() and \
            self.symbol.isupper()

    @property
    def shift(self):
        return State(self.nonterminal,
                     self.expression,
                     self.dot + 1,
                     self.origin)

    @property
    def tuple(self):
        return (self.nonterminal,
                self.expression,
                self.dot,
                self.origin)

    def __hash__(self):
        return hash(self.tuple)

    def __eq__(self, other):
        return self.tuple == other.tuple

    def __str__(self):
        n, e, d, o = self.tuple
        return '[%d] %s -> %s.%s' % (o, n, e[:d], e[d:])


class Grammar:
    def __init__(self, *rules):
        self.rules = tuple(self._parse(rule) for rule in rules)

    def _parse(self, rule):
        return tuple(rule.replace(' ', '').split('::='))

    @property
    def start(self):
        return next(self['^'])

    def __getitem__(self, nonterminal):
        yield from [
            rule
            for rule in self.rules
            if rule[0] == nonterminal
        ]

grammar: arithmetic expression

grammar = Grammar(
'^ ::= E',
'E ::= E + T',
'E ::= E - T',
'E ::= T',
'T ::= T * F',
'T ::= T / F',
'T ::= F',
'F ::= ( E )',
'F ::= - F',
'F ::= x',
'F ::= y',
'F ::= z',
)
EarleyParser(grammar).parse('x-x*(y+z)')

(0) ".x-x*(y+z)"
[0] F -> .-F
[0] T -> .T/F
[0] E -> .T
[0] F -> .z
[0] ^ -> .E
[0] F -> .y
[0] T -> .T*F
[0] F -> .(E)
[0] E -> .E+T
[0] F -> .x
[0] E -> .E-T
[0] T -> .F
(1) "x.-x*(y+z)" ACCEPTS
[0] E -> T.
[0] T -> T./F
[0] ^ -> E.
[0] F -> x.
[0] E -> E.-T
[0] T -> F.
[0] T -> T.*F
[0] E -> E.+T
(2) "x-.x*(y+z)"
[2] T -> .T*F
[2] F -> .y
[0] E -> E-.T
[2] F -> .x
[2] F -> .(E)
[2] T -> .F
[2] T -> .T/F
[2] F -> .-F
[2] F -> .z
(3) "x-x.*(y+z)" ACCEPTS
[2] F -> x.
[2] T -> F.
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
(4) "x-x*.(y+z)"
[4] F -> .z
[4] F -> .y
[4] F -> .(E)
[2] T -> T*.F
[4] F -> .x
[4] F -> .-F
(5) "x-x*(.y+z)"
[5] F -> .(E)
[5] E -> .E+T
[5] F -> .y
[5] T -> .T/F
[5] E -> .E-T
[5] T -> .F
[5] F -> .x
[5] F -> .-F
[4] F -> (.E)
[5] T -> .T*F
[5] E -> .T
[5] F -> .z
(6) "x-x*(y.+z)"
[5] E -> E.-T
[4] F -> (E.)
[5] T -> F.
[5] T -> T./F
[5] E -> E.+T
[5] E -> T.
[5] T -> T.*F
[5] F -> y.
(7) "x-x*(y+.z)"
[7] F -> .x
[7] T -> .T*F
[7] F -> .-F
[5] E -> E+.T
[7] F -> .z
[7] F -> .(E)
[7] T -> .T/F
[7] F -> .y
[7] T -> .F
(8) "x-x*(y+z.)"
[7] T -> T.*F
[7] F -> z.
[4] F -> (E.)
[5] E -> E.-T
[5] E -> E.+T
[7] T -> T./F
[7] T -> F.
[5] E -> E+T.
(9) "x-x*(y+z)." ACCEPTS
[4] F -> (E).
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
[2] T -> T*F.
