Notes on Formal Language Theory

Objects, Operations, Regular Expressions and Finite State Automata

Jake Batsuuri

Published in

Computronium Blog

12 min readJun 29, 2020

Introduction

This article is meant to be a gentle introduction into formal language theory. As with everything, we will try to balance mathematical rigour with concrete examples that have linguistically motivated examples.

Topics to touch on:

Elementary formal language theory
Regular languages
Regular expressions
Languages
Finite state automata
Regular relations

Basic Notions

Sets and Elements

Formal languages are defined to have a certain, specific alphabet. The alphabet may be denoted by ∑.

This alphabet set, consists of elements called letters. This set is a finite set. Each letter is a symbol that can be from “abc…”, “123…”. Symbols can even be words.

A sequence of letters may be called a word or a string.

Then these are examples of string:

Cat
Arachnadiscoteka
ASKdnskjf
uuuuuuu
Ouuuuu
ksdjfnskdjfnSOULJABOITELLEMksjdnfksjdnf

Length of a String

The length of a string, or the number of characters in a string, is denoted by |w|.

Empty String

We consider the empty string of 0 length as unique and denote by ϵ.

Concatenation Operation

Concatenation, is the addition of 2 strings together, where order matters. For example, “artificial” + “intelligence” = “artificialintelligence”. Given:

Concatenation is denoted by:

Note that:

For every string w:

Some examples:

“learn”+”s”
“learn”+”ed”
“learn”+”ing”

Exponent Operation

For every string w:

w⁰=ϵ

For all exponents n>0:

wⁿ=wⁿ⁻¹ ⋅ w

For example:

If w=”go”, then:

w⁰=ϵ=””
w¹=w=”go”
w²=w*w=”gogo”
w³=w*w*w=”gogogo”

Reversal Operation

If w is a string, then reverse of w is:

Palindrome is defined by:

Here we mean words like “ana”, “anna”, “racecar” etc. I’m not sure if these are technically called symmetry. But everyone would understand exactly what I mean if I called these words symmetric. If there is such a concept as symmetry in formal language theory, I wonder how it would connect to Group Theory.

Coincidentally, group theory arises out of our need to study symmetry of geometrical objects and spaces. Where we might build out the theory exactly the same way that formal language theory is built, using sets as the objects then defining standard operations.

On this later…

Substring Set, or String Sub Set

If w is a string, then a substring of w is a sequence formed by taking a contiguous symbols of w in the order in which they occur in w.

is a substring of w if and only if there are 2 other substring such that

Also w₁ and wᵣ may be empty strings. These two substring are special cases of substrings called, prefix and suffix.

Formal Language

Given an alphabet ∑, the set of all strings over the alphabet is denoted as ∑*.

Notice that as long as our alphabet has at least 1 symbol, the set of all string over our alphabet is infinite. So given that also a formal language is any subset of ∑*, there are infinite number of formal language over the same alphabet.

A formal language is literally a set of strings. Even still, all natural languages are technically just a set of words.

The English language is just a set of all the English words. Connections to ideas and actions are not part of the language by this definition.

This is the Latin alphabet, and here are some examples of formal languages:

∑*
The set of string consisting of only consonants
The set of strings consisting of only one vowel and one consonant
The set of all single letter words
The set of all words occurring in a certain book

Since some of these sets are infinite, while others are finite, the languages that represent them are also infinite and finite.

Extending String Operations to Languages

In the same way that strings can have operations. Some of these operations also work on languages, which are sets of strings. For example, if L is a language then Lᵣ is the language where all the string are reverses of that in L.

Another operation that works on languages is concatenation. Given language L and M, the concatenated language LM is the language where all its strings are concatenated strings of L and M.

Kleene Closure

Let L={dog, cat}.

Observe that:

L⁰ ={},
L¹ ={dog, cat},
L² = {catcat, catdog, dogcat, dogdog}
etc.

Thus L* contains, among its infinite set of strings, the strings: {ϵ, cat, dog, catcat, catdog, dogcat, dogdog, catcatcat, catdogcat, dogcatcat, dogdogcat, etc}

Kleene Closure as The Set of All Strings over the Alphabet

Consider the alphabet ∑={a,b} and the language L={a,b} defined over ∑. L* is the set of all strings over a and b, which is also the exact definition of ∑*.

The ∑* is the special case of L* where L=∑.

Language Classes and Linguistic Formalisms

Formal languages are sets of strings. And they can be constructed using any method available to sets. This isn’t very useful in general. So for fairly simple languages, we introduce the idea of rules. Rules characterize a language, such that we can classify languages from each other, using these rules. We denote languages with:

Furthermore we denote classes of languages using:

Given a particular language, a property of that language that is of interest is the closure. For example, given a binary operation, such as a union, we wanna know if the language is closed under the union operation.

There’s a way to check whether a class of languages is closed under some operation. Basically if the two languages are independently in the class. And the language resulting from the operation of the two languages is also in the class.

The reason that this is of interest is because, if we know a certain operation is closed, we can process a string knowing that computationally we have not much to worry about.

Algorithms that determine membership are called recognition algorithms. Generally, the more expressive a class is, the harder it is to determine the membership in languages of this class.

If additionally, an algorithm forces a structure on the string, then it is called a parsing algorithm.

Regular Expressions

Regular expression is a type of linguistic formalism. Specifically, they are expressions over some alphabet ∑, augmented by some special characters.

Denotation is a mapping from regular expressions to sets of strings over ∑, such that every well formed regular expression denotes a set of strings or a language.

Remember that a language is simple a set of strings.

DEFINITION 1. Given an alphabet ∑, the set of regular expressions over ∑ is defined as follows:

∅ is a regular expression, {}
ϵ is a regular expression, {“”}
if a ∈ ∑ is a letter, then a is a regular expression
if r₁ and r₂ are regular expressions, then so are (r₁+r₂) and (r₁⋅ r₂)
if r is a regular expression, then so is (r)*.
nothing else is a regular expression over ∑

Let Σ be the alphabet {a, b, c, . . . , y, z}. Some regular expressions over this alphabet are

∅,
a,
((c · a) · t),
(((m · e) · (o)∗) · w),
(a + (e + (i + (o + u)))),
((a + (e + (i + (o + u)))))∗,
etc.

DEFINITION 2. Given a regular expression r, its denotation, ||r||, is a set of strings defined as follows:

∅ is a regular expression, {}, the empty set
ϵ is a regular expression, {“”}, singleton set with an empty string
if a ∈ ∑ is a letter, then a is a regular expression, {a}, singleton set with an a string
if r₁ and r₂ are two regular expressions whose denotations are [[r₁]] and [[r₂]], respectively, then [[(r₁ + r₂)]] = [[r₁]] ∪ [[r₂]] and [[(r₁ · r₂)]] = [[r₁]] · [[r₂]];
if r is a regular expression whose denotation is [[r]] then [[(r) ∗]] = [[r]]∗.

Here normally, we can omit the parentheses so that ((a + (e + (i + (o + u)))))∗ becomes (a + e + i + o + u)∗.

Where regular expressions become very useful is in defining infinite languages or sets of strings. Because finite languages can be enumerated and just written down.

Given the alphabet of all English letters, Σ ={a, b, c, . . . , y, z}, the language Σ∗ is denoted by the regular expression Σ∗.

The set of all strings which contain a vowel is denoted by Σ∗ · (a + e + i + o + u) · Σ∗.

The set of all strings that begin in “un” is denoted by (un)Σ∗. The set of strings that end in either “tion” or “sion” is denoted by Σ∗ · (s + t) · (ion).

Note that all these languages are infinite.

DEFINITION 3. A language L is regular iff there exists a regular expression r such that L=||r||.

The class of languages, L, that can be expressed as the denotation of regular expressions is called the regular languages. Which makes for the existence of non regular languages also.

Properties of Regular Languages

The motivation to study regular languages comes from the fact that they have some pretty neat properties. For example, since these languages are summarized easily by set notation. It’s easy to create some lemmas like their closed-ness under union, concatenation and Kleene closure.

Given two regular languages, L1 and L2, there must exist two regular expressions, r1 and r2, such that [[r1]] = L1 and [[r2]] = L2. It is therefore possible to form new regular expressions based on r1 and r2, such as r1 · r2, r1 + r2 and r∗1.
Given the definition of regular expressions and their denotations, it follows that the denotation of r1 · r2 is L1 · L2: [[r1 · r2]] = L1 · L2. Since r1 · r2 is a regular expression, its denotation is a regular language, and hence L1 ·L2 is a regular language.

Shuly Wintner. Formal language theory for natural language processing. Effective tools and methodologies for teaching natural language processing and computational linguistics, an ACL’02 Workshop, Philadelphia, PA, July 2002.

Regular languages are known to be closed under many operations:

Intersection
Complementation
Exponentiation
Substitution
Homomorphism

Also note that non regular, complex languages don’t exhibit such nice behaviours.

Finite State Automata

We use regular expressions to specify the set summarizing a regular language. We will add another formalism to our study of languages. Here we will define languages as entities generated by a computation.

Just like the dual nature of matter as a particle and a wave. Language has a dual nature of being both a set of string and an output of a computational process.

This computational process is called the finite state automata. There’s a generalized version of a finite state machine, but here we define our automata for language processing.

DEFINITION 4. A finite state automation is a five-tuple (Q, q₀, ∑, δ, F) where:

∑ is a finite state of alphabet symbols
Q is a finite set of states
q₀ ∈ Q is the initial state
F ⊆ Q is the set of final states, a subset of Q
δ : Q × ∑ × Q is a relation from states and alphabet symbols to states

GRAPHICAL REPRESENTATION. Finite state automata are depicted graphically like graphs. Where we have nodes and links. Imagine each node as a state and each link representing a letter from the alphabet.

The finite state automation A defined by (Q, q₀, ∑, δ, F) where:

∑ = {c,a,t,r}
Q={q₀,q₁,q₂,q₃}
F={q₃}
δ={(q₀,c,q₁), (q₁,a,q₂), (q₂,t,q₃),(q₂,r,q₃)}

DEFINITION 5. Given the finite state automation A defined by (Q, q₀, ∑, δ, F) the reflexive transitive closure of the transition relation, δ : Q × ∑ × Q, is δ^, defined by:

for every state q ∈ Q, (q, ϵ, q) ∈ δ^
for every string w ∈ ∑* and letter a ∈ ∑
if (q,w,q’) ∈ δ^ and (q’, a, q’’) ∈ δ then (q, w ⋅ a, q’’) ∈ δ^

A string w is accepted by A if and only if there exists a state q ∈ F such that δ^(q₀,w)=q. The language of A is the set of all strings accepted by the A: L(A)={w| there exists q ∈ F such that δ^ (q₀,w)=q}.

Remember that the finite state automata is defined only for transition relations from node to node. To be able to define a language using FSA, we need to extend this transition relation to node to paths. If the transition relation, is symbolized by δ then the reflexive transitive closure is symbolized by δ^. The reflexive signifies that every element in the nodes can have a self relation, while the transitivity implies instead of another node, it encompasses a string of nodes.

This relation assigns a non empty string to each path. While an empty string is equated to an empty path, meaning its a self relation. Since this is an extension of the FSA, we still are interested in the paths that start at q₀ and end at q.

As long as a string has a related path defined by the FSA, we say the string is part of the language. This means that the paths that end in non official q are not part of the language.

The following examples will demonstrate the difference between definition 4 and 5. Take the following finite state machine. Under the definition 4, we have such strings as c, a, t, r.

Whereas with the extended definition 5, we have strings cat and car. As our δ^ allows for:

Self relations

(q₀,ϵ, q₀), (q₁,ϵ, q₁), (q₂,ϵ, q₂), (q₃,ϵ, q₃)

Paths

(q₀,c, q₁), (q₁, a, q₂), (q₂,t, q₃), (q₂,r, q₃)
(q₀,ca, q₂), (q₁,at, q₃), (q₁,ar, q₃)
(q₀,cat, q₃), (q₀,car, q₃)

Note that not all the paths go from the initial state to the final state. Therefore the language passed by this FSA is {cat, car}.

We now extend our definition a bit more to include ϵ — moves. These basically allow the FSA to move over certain states.

Anywhere there is an ϵ the FSA can move from one state to another without printing a letter or symbol. For this diagram, the language would have {do, undo, undone, done}.

Finite state automata are one of the two ways of defining languages, the other way being regular expressions. It turns out the class of languages that can be generate by an FSA are the regular languages.

THEOREM 1. A language L is regular iff there exists an FSA A such that L=L(A).

Up Next…

In the next article, we will explore Chomsky’s hierarchy of languages as it is one of the formal pillars of computational linguistics, and its results continue to shape modern research and development in NLP.

For the table of contents and more content click here.

References

Clark, Alexander. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, 2013.

Eisenstein, Jacob. Introduction to Natural Language Processing. The MIT Press, 2019.

Bird, Steven, et al. Natural Language Processing with Python. O’Reilly, 2009.

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Pearson, 2014.

Barker-Plummer, Dave, et al. Language, Proof and Logic. CSLI Publ., Center for the Study of Language and Information, 2011.