Day 93: FIRST & FOLLOW

Tomáš Bouda
100 days of algorithms
4 min readJun 25, 2017

--

If you plan to implement own parser for a context-free grammar, construction of FIRST and FOLLOW sets will be the first algorithm you will have to spend your time on.

And since definition of formal languages and grammars is much more complicated than we need at the moment, I’ll stick to the intuitive explanation.

Context-free grammar is a set of production rules that consists of terminals and non-terminals. Terminal is a “physical” character that may occur in the parsed text. Non-terminal is a “meta” character that must occur on the left side of production rule.

In the grammar above, S, A are non-terminals, while 1 is terminal.

We start with a starting symbol S and repetitively rewrite non-terminals using production rules. Once the sequence contains only terminals, we get a word that is accepted by the grammar.

Notice that this grammar is different from the first one, yet, it produces the same language.

Due to the second rule in each grammar, the first one is called left-recursive and the second one is called right-recursive.

The third grammar also describes the same language. It is both, left-recursive and right-recursive [which is not a good thing]. And what’s worse, it is ambiguous, since it can produce the same word by a different productions.

Ambiguity is very bad and there are different types of parsers that deal with these kinds of problems in a different ways.

This last grammar is able to generate any finite word with matching parentheses.

If you know regular expressions, you are probably aware that it is not possible to write regex that would match any number of parentheses. This is what makes context-free grammars so important for compilers — it is the simplest grammar that is strong enough to describe syntax of programming language.

Now, back to the algorithm.

The FIRST set enumerates possible terminals that a non-terminal may begin with.

The FOLLOW set enumerates possible terminals that a non-terminal may be followed by.

Check the two examples I have provided at the end of this article.

When you build your parser, either it is SLR, LALR, LR(k) or LL(k), you will need to construct the FIRST and FOLLOW sets. These sets are used to build a parsing table to control a finite state automaton processing the language.

https://github.com/coells/100days

https://notebooks.azure.com/coells/libraries/100days

algorithm

def first_and_follow(grammar):
# first & follow sets, epsilon-productions
first = {i: set() for i in grammar.nonterminals}
first.update((i, {i}) for i in grammar.terminals)
follow = {i: set() for i in grammar.nonterminals}
epsilon = set()
while True:
updated = False

for nt, expression in grammar.rules:
# FIRST set w.r.t epsilon-productions
for symbol in expression:
updated |= union(first[nt], first[symbol])
if symbol not in epsilon:
break
else:
updated |= union(epsilon, {nt})

# FOLLOW set w.r.t epsilon-productions
aux = follow[nt]
for symbol in reversed(expression):
if symbol in follow:
updated |= union(follow[symbol], aux)
if symbol in epsilon:
aux = aux.union(first[symbol])
else:
aux = first[symbol]

if not updated:
return first, follow, epsilon

def union(first, begins):
n = len(first)
first |= begins
return len(first) != n

class Grammar:

def __init__(self, *rules):
self.rules = tuple(self._parse(rule) for rule in rules)
def _parse(self, rule):
return tuple(rule.replace(' ', '').split('::='))

def __getitem__(self, nonterminal):
yield from [rule for rule in self.rules
if rule[0] == nonterminal]

@staticmethod
def is_nonterminal(symbol):
return symbol.isalpha() and symbol.isupper()

@property
def nonterminals(self):
return set(nt for nt, _ in self.rules)

@property
def terminals(self):
return set(
symbol
for _, expression in self.rules
for symbol in expression
if not self.is_nonterminal(symbol)
)

left-recursive grammar w/ epsilon-production

first, follow, epsilon = first_and_follow(Grammar(
'^ ::= A $',
'A ::= ABBC',
'A ::= B',
'A ::= 1',
'B ::= C',
'B ::= 2',
'C ::= 3',
'C ::= ',
))
> first{'$': {'$'},
'1': {'1'},
'2': {'2'},
'3': {'3'},
'A': {'1', '2', '3'},
'B': {'2', '3'},
'C': {'3'},
'^': {'$', '1', '2', '3'}}
> follow{'A': {'$', '2', '3'}, 'B': {'$', '2', '3'}, 'C': {'$', '2', '3'}, '^': set()}> epsilon{'A', 'B', 'C'}

arithmetic expressions

first, follow, epsilon = first_and_follow(Grammar(
'^ ::= E $',
'E ::= E + T',
'E ::= T',
'T ::= T * F',
'T ::= F',
'F ::= ( E )',
'F ::= x',
))
> first{'$': {'$'},
'(': {'('},
')': {')'},
'*': {'*'},
'+': {'+'},
'E': {'(', 'x'},
'F': {'(', 'x'},
'T': {'(', 'x'},
'^': {'(', 'x'},
'x': {'x'}}
> follow{'E': {'$', ')', '+'},
'F': {'$', ')', '*', '+'},
'T': {'$', ')', '*', '+'},
'^': set()}
> epsilonset()

--

--