Day 93: FIRST & FOLLOW

Published in

100 days of algorithms

4 min readJun 25, 2017

If you plan to implement own parser for a context-free grammar, construction of FIRST and FOLLOW sets will be the first algorithm you will have to spend your time on.

And since definition of formal languages and grammars is much more complicated than we need at the moment, I’ll stick to the intuitive explanation.

Context-free grammar is a set of production rules that consists of terminals and non-terminals. Terminal is a “physical” character that may occur in the parsed text. Non-terminal is a “meta” character that must occur on the left side of production rule.

In the grammar above, S, A are non-terminals, while 1 is terminal.

We start with a starting symbol S and repetitively rewrite non-terminals using production rules. Once the sequence contains only terminals, we get a word that is accepted by the grammar.

Notice that this grammar is different from the first one, yet, it produces the same language.

Due to the second rule in each grammar, the first one is called left-recursive and the second one is called right-recursive.

The third grammar also describes the same language. It is both, left-recursive and right-recursive [which is not a good thing]. And what’s worse, it is ambiguous, since it can produce the same word by a different productions.

Ambiguity is very bad and there are different types of parsers that deal with these kinds of problems in a different ways.

This last grammar is able to generate any finite word with matching parentheses.

If you know regular expressions, you are probably aware that it is not possible to write regex that would match any number of parentheses. This is what makes context-free grammars so important for compilers — it is the simplest grammar that is strong enough to describe syntax of programming language.

Now, back to the algorithm.

The FIRST set enumerates possible terminals that a non-terminal may begin with.

The FOLLOW set enumerates possible terminals that a non-terminal may be followed by.

Check the two examples I have provided at the end of this article.

When you build your parser, either it is SLR, LALR, LR(k) or LL(k), you will need to construct the FIRST and FOLLOW sets. These sets are used to build a parsing table to control a finite state automaton processing the language.

https://github.com/coells/100days

https://notebooks.azure.com/coells/libraries/100days

algorithm

def first_and_follow(grammar):
    # first & follow sets, epsilon-productions
    first = {i: set() for i in grammar.nonterminals}
    first.update((i, {i}) for i in grammar.terminals)
    follow = {i: set() for i in grammar.nonterminals}
    epsilon = set()    while True:
        updated = False
        
        for nt, expression in grammar.rules:
            # FIRST set w.r.t epsilon-productions
            for symbol in expression:
                updated |= union(first[nt], first[symbol])
                if symbol not in epsilon:
                    break
            else:
                updated |= union(epsilon, {nt})
                
            # FOLLOW set w.r.t epsilon-productions
            aux = follow[nt]
            for symbol in reversed(expression):
                if symbol in follow:
                    updated |= union(follow[symbol], aux)
                if symbol in epsilon:
                    aux = aux.union(first[symbol])
                else:
                    aux = first[symbol]
        
        if not updated:
            return first, follow, epsilon
def union(first, begins):
    n = len(first)
    first |= begins
    return len(first) != n
class Grammar:
    
    def __init__(self, *rules):
        self.rules = tuple(self._parse(rule) for rule in rules)    def _parse(self, rule):
        return tuple(rule.replace(' ', '').split('::='))
        
    def __getitem__(self, nonterminal):
        yield from [rule for rule in self.rules 
                    if rule[0] == nonterminal]
        
    @staticmethod
    def is_nonterminal(symbol):
        return symbol.isalpha() and symbol.isupper()
        
    @property
    def nonterminals(self):
        return set(nt for nt, _ in self.rules)
        
    @property
    def terminals(self):
        return set(
            symbol
            for _, expression in self.rules
            for symbol in expression
            if not self.is_nonterminal(symbol)
        )

left-recursive grammar w/ epsilon-production

first, follow, epsilon = first_and_follow(Grammar(
    '^ ::= A $',
    'A ::= ABBC',
    'A ::= B',
    'A ::= 1',
    'B ::= C',
    'B ::= 2',
    'C ::= 3',
    'C ::= ',
))> first{'$': {'$'},
 '1': {'1'},
 '2': {'2'},
 '3': {'3'},
 'A': {'1', '2', '3'},
 'B': {'2', '3'},
 'C': {'3'},
 '^': {'$', '1', '2', '3'}}> follow{'A': {'$', '2', '3'}, 'B': {'$', '2', '3'}, 'C': {'$', '2', '3'}, '^': set()}> epsilon{'A', 'B', 'C'}

arithmetic expressions

first, follow, epsilon = first_and_follow(Grammar(
    '^ ::= E $',
    'E ::= E + T',
    'E ::= T',
    'T ::= T * F',
    'T ::= F',
    'F ::= ( E )',
    'F ::= x',
))> first{'$': {'$'},
 '(': {'('},
 ')': {')'},
 '*': {'*'},
 '+': {'+'},
 'E': {'(', 'x'},
 'F': {'(', 'x'},
 'T': {'(', 'x'},
 '^': {'(', 'x'},
 'x': {'x'}}> follow{'E': {'$', ')', '+'},
 'F': {'$', ')', '*', '+'},
 'T': {'$', ')', '*', '+'},
 '^': set()}> epsilonset()

Day 93: FIRST & FOLLOW

algorithm

left-recursive grammar w/ epsilon-production

arithmetic expressions

Written by Tomáš Bouda