Day 94: Earley parser
Yesterday I implemented an algorithm that directly relates to the implementation of a parser for formal grammars. Today I implemented the parser itself.
However, I have no use for yesterday’s work since the parser is not SLR [as you might have expected]. Instead, I chose the Earley parser for its interesting properties.
The most important one: the Earley parser can handle both left-recursive and right-recursive grammars, and it can even handle ambiguous grammars.
In fact, whatever context-free grammar you have, the parser will work just fine. I should say that is not the case for my implementation, since I do not support ε-production rules, but it would not be difficult to add support for them as well.
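Supporting ε-rules is a well-known extension: the usual first step is to compute which non-terminals are nullable, i.e. can derive the empty string, so the predictor can also advance the dot over them. Here is a minimal sketch of that nullable-set computation; the grammar format and the function name are my own for illustration, not part of the article's code:

```python
def nullable_set(rules):
    """rules: iterable of (lhs, rhs) pairs, where rhs is a string of
    symbols and '' denotes an ε-production.
    Returns the set of nullable non-terminals."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # lhs is nullable if every symbol of rhs is nullable
            # (trivially true for an ε-production, rhs == '')
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

rules = [('A', ''), ('B', 'A'), ('C', 'x'), ('D', 'BC')]
nullable = nullable_set(rules)   # {'A', 'B'}
```

The loop iterates to a fixed point, so chains like B ::= A, A ::= ε are picked up even though B has no ε-rule of its own.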
What is the secret and source of its power?
The Earley parser simulates a non-deterministic finite automaton! It uses a dot symbol inside production rules to mark the current position. Moreover, the parser builds the DFA power-set on the fly by simulating all possible positions at once, which is how you would typically implement an NFA.
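To illustrate the underlying trick, here is a tiny standalone sketch (my own, not part of the article's code) of simulating an NFA by tracking the set of all states it could currently be in, which is exactly the power-set idea:

```python
def simulate_nfa(transitions, start, accepting, text):
    """Run an NFA given as {(state, symbol): {next states}} by keeping
    the set of all states reachable after each input symbol."""
    current = {start}
    for ch in text:
        current = {t for s in current
                     for t in transitions.get((s, ch), set())}
    return bool(current & accepting)

# NFA for the language (a|b)*b: state 0 is non-deterministic on 'b',
# it may stay in 0 or move to the accepting state 1.
nfa = {
    (0, 'a'): {0},
    (0, 'b'): {0, 1},
}
```

No backtracking is needed; the cost is that each step manipulates a whole set of states, and the Earley parser pays a similar price per input position.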
There are three mechanisms used in the parser.
- predictor — constructs new automaton states from the productions of the non-terminal at the dot symbol
- completer — similar to the reduce action in an LR parser; handles reductions of rules with the dot symbol at the end
- scanner — similar to the shift action in an LR parser; advances the dot symbol over terminals in the rules
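The interplay of the three mechanisms can be sketched as a compact standalone recognizer. This is my own minimal illustration with a simplified grammar format (uppercase letters as non-terminals, items as plain tuples), not the article's implementation:

```python
def earley_recognize(grammar, start, text):
    """Minimal Earley recognizer sketch. grammar maps a non-terminal
    to a list of productions (strings of symbols); uppercase letters
    are non-terminals. Items are (lhs, production, dot, origin)."""
    chart = [set() for _ in range(len(text) + 1)]
    for prod in grammar[start]:
        chart[0].add((start, prod, 0, 0))

    for k in range(len(text) + 1):
        work = list(chart[k])
        while work:
            lhs, prod, dot, origin = work.pop()
            if dot == len(prod):
                # completer: advance every item waiting for lhs
                for l2, p2, d2, o2 in list(chart[origin]):
                    if d2 < len(p2) and p2[d2] == lhs:
                        new = (l2, p2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new)
                            work.append(new)
            elif prod[dot].isupper():
                # predictor: expand the non-terminal at the dot
                for p in grammar[prod[dot]]:
                    new = (prod[dot], p, 0, k)
                    if new not in chart[k]:
                        chart[k].add(new)
                        work.append(new)
            elif k < len(text) and prod[dot] == text[k]:
                # scanner: move the dot over a matching terminal
                chart[k + 1].add((lhs, prod, dot + 1, origin))

    return any(l == start and d == len(p) and o == 0
               for l, p, d, o in chart[len(text)])
```

Note that the left-recursive grammar `{'E': ['E+x', 'x']}` poses no problem here, and neither does an ambiguous one like `{'S': ['SS', 'a']}`; the chart simply holds all the competing items at once.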
I think you have already guessed what the problem with the Earley parser is and why it has been replaced by LR parsers — simulating an NFA is slow and memory-consuming.
However, if you have a complicated grammar, you can either run several algorithms to remove recursion, ambiguity, or look-aheads… or you can just run the Earley parser, which handles the grammar as it is.
https://github.com/coells/100days
https://notebooks.azure.com/coells/libraries/100days
algorithm
class EarleyParser:

    def __init__(self, grammar):
        self.grammar = grammar
        self.states = []

    def parse(self, text):
        # one state set per input position; '\u0000' marks the end of input
        self.states = [set() for _ in range(len(text) + 1)]
        self.states[0].add(State(*self.grammar.start))

        for k, token in enumerate(text + '\u0000'):
            extension = list(self.states[k])
            self.states[k].clear()

            while extension:
                state = extension.pop()
                if state in self.states[k]:
                    continue

                self.states[k].add(state)

                if state.finished:
                    self._completer(state, extension)
                elif state.symbol_is_nonterminal:
                    self._predictor(state, k, extension)
                else:
                    self._scanner(state, k, token)

        self._print(text)

    def _predictor(self, state, origin, extension):
        # expand the non-terminal at the dot into all of its rules
        for rule in self.grammar[state.symbol]:
            extension.append(State(*rule, origin=origin))

    def _scanner(self, state, origin, token):
        # advance the dot over a matching terminal
        if state.symbol == token:
            self.states[origin + 1].add(state.shift)

    def _completer(self, state, extension):
        # a finished rule advances every state waiting for its non-terminal
        for reduce in self.states[state.origin]:
            if state.nonterminal == reduce.symbol:
                extension.append(reduce.shift)

    def _print(self, text):
        for k, state in enumerate(self.states):
            accepts = any(s.nonterminal == '^' and s.finished for s in state)
            print('(%d)' % k, end=' ')
            print('"%s.%s"' % (text[:k], text[k:]), end=' ')
            print(accepts and 'ACCEPTS' or '')
            for i in state:
                print('\t', i)
class State:

    def __init__(self, nonterminal, expression, dot=0, origin=0):
        self.nonterminal = nonterminal
        self.expression = expression
        self.dot = dot
        self.origin = origin

    @property
    def finished(self):
        return self.dot >= len(self.expression)

    @property
    def symbol(self):
        return None if self.finished else self.expression[self.dot]

    @property
    def symbol_is_nonterminal(self):
        return self.symbol and self.symbol.isalpha() and self.symbol.isupper()

    @property
    def shift(self):
        return State(self.nonterminal, self.expression,
                     self.dot + 1, self.origin)

    @property
    def tuple(self):
        return (self.nonterminal, self.expression, self.dot, self.origin)

    def __hash__(self):
        return hash(self.tuple)

    def __eq__(self, other):
        return self.tuple == other.tuple

    def __str__(self):
        n, e, d, o = self.tuple
        return '[%d] %s -> %s.%s' % (o, n, e[:d], e[d:])
class Grammar:

    def __init__(self, *rules):
        self.rules = tuple(self._parse(rule) for rule in rules)

    def _parse(self, rule):
        return tuple(rule.replace(' ', '').split('::='))

    @property
    def start(self):
        return next(self['^'])

    def __getitem__(self, nonterminal):
        yield from [rule for rule in self.rules if rule[0] == nonterminal]
grammar: arithmetic expression
grammar = Grammar(
'^ ::= E',
'E ::= E + T',
'E ::= E - T',
'E ::= T',
'T ::= T * F',
'T ::= T / F',
'T ::= F',
'F ::= ( E )',
'F ::= - F',
'F ::= x',
'F ::= y',
'F ::= z',
)

EarleyParser(grammar).parse('x-x*(y+z)')

>>>

(0) ".x-x*(y+z)"
[0] F -> .-F
[0] T -> .T/F
[0] E -> .T
[0] F -> .z
[0] ^ -> .E
[0] F -> .y
[0] T -> .T*F
[0] F -> .(E)
[0] E -> .E+T
[0] F -> .x
[0] E -> .E-T
[0] T -> .F
(1) "x.-x*(y+z)" ACCEPTS
[0] E -> T.
[0] T -> T./F
[0] ^ -> E.
[0] F -> x.
[0] E -> E.-T
[0] T -> F.
[0] T -> T.*F
[0] E -> E.+T
(2) "x-.x*(y+z)"
[2] T -> .T*F
[2] F -> .y
[0] E -> E-.T
[2] F -> .x
[2] F -> .(E)
[2] T -> .F
[2] T -> .T/F
[2] F -> .-F
[2] F -> .z
(3) "x-x.*(y+z)" ACCEPTS
[2] F -> x.
[2] T -> F.
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
(4) "x-x*.(y+z)"
[4] F -> .z
[4] F -> .y
[4] F -> .(E)
[2] T -> T*.F
[4] F -> .x
[4] F -> .-F
(5) "x-x*(.y+z)"
[5] F -> .(E)
[5] E -> .E+T
[5] F -> .y
[5] T -> .T/F
[5] E -> .E-T
[5] T -> .F
[5] F -> .x
[5] F -> .-F
[4] F -> (.E)
[5] T -> .T*F
[5] E -> .T
[5] F -> .z
(6) "x-x*(y.+z)"
[5] E -> E.-T
[4] F -> (E.)
[5] T -> F.
[5] T -> T./F
[5] E -> E.+T
[5] E -> T.
[5] T -> T.*F
[5] F -> y.
(7) "x-x*(y+.z)"
[7] F -> .x
[7] T -> .T*F
[7] F -> .-F
[5] E -> E+.T
[7] F -> .z
[7] F -> .(E)
[7] T -> .T/F
[7] F -> .y
[7] T -> .F
(8) "x-x*(y+z.)"
[7] T -> T.*F
[7] F -> z.
[4] F -> (E.)
[5] E -> E.-T
[5] E -> E.+T
[7] T -> T./F
[7] T -> F.
[5] E -> E+T.
(9) "x-x*(y+z)." ACCEPTS
[4] F -> (E).
[2] T -> T.*F
[0] ^ -> E.
[0] E -> E.-T
[2] T -> T./F
[0] E -> E.+T
[0] E -> E-T.
[2] T -> T*F.