MSc Data Science in the UK — Applied NLP — Week3

Matt Chang
11 min readOct 15, 2023

--

Hi there everyone, welcome to my week 3 Master’s in the UK session.

For the past week, we’ve looked at some background knowledge and challenges in modern NLP applications. Refer to the link, if you haven’t read it yet.

For this week, we will be looking at chapter 2 — REGULAR EXPRESSIONS AND AUTOMATA.

The professor will be using the book shown below as our main material. I will extract the important points from the book and share them with you here. Doing this will also give me a chance to review what I’ve been taught and will definitely consolidate my understanding of the field.

Hope you guys enjoy and let’s dive in.

Photo by Timothy Eberly on Unsplash

2.1.1 Concept of Regular Expressions

Introduction:

The chapter starts by introducing the real-world problem of searching for text sequences within databases, websites, or documents. The search for words like ‘woodchuck’ with or without an ending ‘s’ and parsing prices like $25, $199, or $24.99 exemplify these challenges.

  • Example: You search for ‘woodchuck’ but you would like to also find ‘woodchucks’ without initiating a separate search.

Utility: Regular expressions are not only practical for text-based search operations but also have important theoretical applications in computational linguistics.

  • Example: In information retrieval systems like web search engines, regular expressions can be employed to filter and present the most relevant documents or web pages.

Implementation

Finite-State Automaton: Regular expressions can be implemented via finite-state automata, which are foundational tools in computational linguistics.

  • Example: Various applications like speech recognition, machine translation, etc., use finite-state automata to interpret regular expressions.

Characteristics of Regular Expressions

Basic Patterns: The simplest kind of regular expression is a sequence of characters. They are case-sensitive.

  • Example: The regular expression /woodchuck/ would match any string containing the substring 'woodchuck'.

Square Braces: To search for multiple possibilities for a single character, square braces can be used.

  • Example: /[wW]oodchuck/ would match either 'Woodchuck' or 'woodchuck'.

Ranges: The dash (-) within square brackets specifies a range of characters to match.

  • Example: /[A-Z]/ would match any uppercase letter.

Negation: The caret (^) negates the pattern when placed as the first symbol after the open square brace.

  • Example: /[^A-Z]/ matches any character that is not an uppercase letter.

Quantifiers: The question mark (?) denotes zero or one instance of the preceding character, and the asterisk (*) represents zero or more instances.

  • Example: woodchucks? would match both 'woodchuck' and 'woodchucks'.

Kleene Plus: The plus (+) symbol indicates one or more instances of the preceding character.

  • Example: /[0-9]+/ would match sequences like '123', '4', '56', etc.

Wildcard Character: The period (.) matches any single character.

  • Example: /beg.n/ matches 'begin', 'begun', 'beg’n', etc.

By understanding and utilizing these regular expression features and their respective syntax, individuals involved in computational tasks, data engineering, and linguistic studies can greatly enhance the accuracy and efficiency of text-based searching and other related operations.

2.1.2 Disjunction, Grouping, and Precedence

Disjunction Operator (|)

The pipe symbol (|) is used for matching either the preceding or the succeeding pattern. For instance, the pattern /cat|dog/ would match either "cat" or "dog".

Operator Precedence

In regular expressions, certain operators take precedence over others.

  • For example, the sequence “guppy” takes precedence over the disjunction operator in the expression /guppy|ies/, leading to incorrect matches.

Grouping ()

Parentheses can be used to group parts of the expression, making them act as a single character. This is especially useful for applying other operators like the disjunction operator (|) or the Kleene star (*).

  • For example, /gupp(y|ies)/ would correctly match either "guppy" or "guppies".

Precedence Hierarchy

The order of precedence from highest to lowest is as follows:

  1. Parenthesis ()
  2. Counters *, +, ?, {}
  3. Sequences and anchors ^, $
  4. Disjunction |

Greediness

Regular expressions are greedy by default, meaning they will match the largest possible string that fits the pattern.

  • For instance, the expression /[a-z]*/ could match "o", "on", "once", etc., when matched against "once upon a time".

2.1.3 A Simple Example

False Positives and False Negatives

While designing regular expressions, one has to address two kinds of errors: false positives and false negatives. False positives are incorrect matches, while false negatives are missed matches.

Example: Finding “the”

  • Initial pattern: /the/ (Fails for "The" and within words like "other")
  • Improved pattern: /[tT]he/ (Still fails for within words like "other")
  • Adding word boundaries: /\b[tT]he\b/ (May miss instances with numbers or underscores)
  • Final pattern: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/ (Accounts for most cases)

Trade-offs in Accuracy and Coverage

Increasing accuracy often involves minimizing false positives, whereas increasing coverage entails minimizing false negatives. Achieving both simultaneously is a complex but essential aspect of speech and language processing systems.

2.1.4 A More Complex Example

Price Matching: The regular expression /[0-9]+\.[0-9][0-9]/ matches prices with decimal points, such as $199.99. This is then extended to /\\b$[0-9]+(\\.[0-9][0-9])?\\b/ to make cents optional and match whole numbers like $200.

Processor Speed: Processor speeds are captured by the expression /\\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\\b/.

Disk Space and Memory: Disk space is described by /\\b[0-9]+ *(Mb|[Mm]egabytes?)\\b/ for megabytes and /\\b[0-9](\\.[0-9]+)? *(Gb|[Gg]igabytes?)\\b/ for gigabytes.

Operating Systems and Vendors: Patterns for operating systems and vendors like Windows and Mac are defined by /\\b(Win95|Win98|WinNT|Windows *(NT|95|98|2000)?)\\b/ and /\\b(Mac|Macintosh|Apple)\\b/.

  • If you want to search for any PC with a speed of “500 MHz” and disk space “32 Gb”, you could combine the processor and disk space regular expressions to search through the text data.

2.1.5 Advanced Operators

Aliases: Symbols like \d, \D, \w, and \W are aliases for common ranges.

\d

Meaning: This operator matches any digit, which is synonymous with the set [0-9].

  • Example: In the string “Party of 5”, the regular expression \d would match the character "5".

\D

Meaning: This is the complement of \d. It matches any non-digit character, equivalent to the set [^0-9].

  • Example: In the string “Blue moon”, \D would match all characters except for digits, so it would match "Blue moon".

\w and \W

\w

  • Meaning: This operator matches any alphanumeric character or underscore, equivalent to the set [a-zA-Z0-9_].
  • Example: In the string “Daiyu”, \w would match each of the characters "D", "a", "i", "y", "u".

\W

Meaning: This is the complement of \w. It matches any non-alphanumeric character, excluding underscores, equivalent to the set [^a-zA-Z0-9_].

  • Example: In the string “!!!!”, \W would match each of the exclamation marks.

Counters: The {n,m} syntax specifies from n to m occurrences of the previous character or expression.

Special Characters: Symbols like \\n and \\t denote newline and tab characters, respectively.

  • Example: Using \d{2,4} would match any sequence of digits that is at least 2 but not more than 4 characters long, e.g., "100" or "9999".

2.1.6 Regular Expression Substitution, Memory, and ELIZA

Perl Substitution Operator s/regexp1/pattern/

The Perl substitution operator allows for the replacement of a substring characterized by a regular expression with another string. The syntax is s/regexp1/pattern/.

Example:

Let’s consider a simple text string: “I enjoy learning colour theory.” Using Perl’s substitution operator, we can Americanize the spelling of “colour” as follows:

$text = "I enjoy learning colour theory.";
$text =~ s/colour/color/;

The $text would now contain: "I enjoy learning color theory."

Memory and Backreferences

Memory in regular expressions pertains to the use of backreferences. Parentheses are used to capture portions of the matching text, and the number operator (\1, \2, etc.) can subsequently refer to these captured groups.

Example:

Suppose we have the string: “My favourite numbers are 23 and 45.” We want to wrap all the numbers in this string with angle brackets.
Perl code could look like this:

$text = "My favourite numbers are 23 and 45.";
$text =~ s/([0-9]+)/<$1>/g;

Here, ([0-9]+) captures one or more digits, and <$1> replaces it by wrapping it in angle brackets. The $text would now be: "My favourite numbers are <23> and <45>."

ELIZA and Regular Expression Substitution

ELIZA was one of the earliest examples of a simple natural-language understanding program. It utilized a series of regular expression substitutions to interact with the user and simulate conversation.

Example:

In ELIZA, a user might input: “I feel sad.”
A typical ELIZA script could look like this in Perl:

$text = "I feel sad.";
$text =~ s/I feel (sad|depressed)/I am sorry to hear you feel $1/;

The $text would then be: "I am sorry to hear you feel sad."

2.2 Finite-State Automata (FSA) and Regular Expressions

Equivalence between Regular Expressions and FSA

The text posits that regular expressions and finite-state automata (FSA) are two sides of the same theoretical coin. Both can be used to describe what is known as a “regular language,” which is a specific type of formal language that can be expressed through regular expressions, FSAs, or regular grammars.

  • Example 1: Imagine a regular expression that identifies email addresses, which might look something like `^[a-zA-Z0–9._%+-]+@[a-zA-Z0–9.-]+\.[a-zA-Z]{2,}$`. An equivalent FSA could be constructed to perform the same validation.

Sheeptalk: An Example of FSA in Action

FSAs Can Recognize Language Patterns

The text provides an example using the “sheep language,” essentially a sequence of ‘baa+’ followed by an exclamation mark. The FSA is represented as a directed graph with states and transitions (arcs) between these states based on the input symbols.

  • Example: In the sheep language, a valid string could be “baaa!” The corresponding FSA will start at the initial state (q0) and transition through states q1, q2, and q3 based on encountering ‘b’, ‘a’, ‘a’, and ‘a’, finally reaching the accept state (q4) upon encountering ‘!’.

State-Transition Table

State-Transition Tables as a Representation of FSAs

FSAs can also be represented in a tabular form known as a state-transition table. This table indicates how to transition between states based on the input symbols.

  • Example 3: For the sheep language, the state-transition table would indicate that starting from state q0 and encountering ‘b’, you should transition to state q1. If at q1 and you encounter ‘a’, you transition to q2, and so on.

Formal Definition of FSA

The Five Parameters of an FSA

The formal definition of an FSA comprises five parameters:

- \( Q \): A finite set of states.
- \( \Sigma \): A finite input alphabet.
- \( q_0 \): The start state.
- \( F \): The set of accepted states.
- \( \delta(q, i) \): The state-transition function.

  • Example: For the sheeptalk FSA, \( Q = \{q0, q1, q2, q3, q4\} \), \( \Sigma = \{a, b, !\} \), \( q_0 = q0 \), \( F = \{q4\} \), and \( \delta \) is defined by the transition table.

Algorithms for Recognition

Algorithms for FSA-based Recognition

The text presents an algorithm, termed D-RECOGNIZE, that uses a state-transition table to determine whether an input string is accepted by an FSA.

  • Example: Using D-RECOGNIZE, the string “baaa!” would be processed starting from state q0. The algorithm would make transitions according to the symbols in the input string, and since it reaches an accepting state (q4) by the end, the input string would be accepted.

Formal Languages and Generation

FSAs Can Generate and Recognize Formal Languages

FSAs are not just for recognition; they can also be used for generating strings in a formal language.

  • Example: If our FSA starts at state q0 for generating the sheep language, it can transition through states to print out valid strings like ‘baaa!’ based on the transitions it takes.

2.2.4 Non-Deterministic FSAs (NFSAs)

NFSAs are distinct from deterministic FSAs (DFSAs) in that they permit multiple paths for the same input from a particular state. The text describes two NFSAs for the fictional language of “sheeptalk” but differs in their non-deterministic aspects.

  • Example: In a DFSA for the alphabet {a, b}, from state \(q_0\), upon receiving ‘a’, you might go to state \(q_1\). In an NFSA, you could have the choice to stay at \(q_0\) or go to \(q_1\) when you receive ‘a’.

Types of Non-Determinism

The text mentions two types of non-determinism in NFSAs:

Standard Non-Determinism: Induced by having multiple outgoing transitions for the same symbol from the same state.
Epsilon (\(\epsilon\)) Transitions: Transitions that don’t require input consumption.

  • Example: In an NFSA, from state \(q_2\), you might have the choice to stay in \(q_2\) or go to \(q_3\) when you receive an ‘a’. This is standard non-determinism. Additionally, from state \(q_3\), you might be allowed to move to \(q_2\) without any input, representing an \(\epsilon\)-transition.

Resolving Non-Determinism

There are three approaches mentioned to deal with the issue of choice in NFSAs:
1. **Backup**: Save the current state and input position to try another path later.
2. **Look-ahead**: Check the next inputs to decide which path to take.
3. **Parallelism**: Examine all possible paths concurrently.

For the purpose of this discussion, the text focuses on the Backup approach.

  • Example**: While at state \(q_2\) with an incoming ‘a’, you could choose to move to \(q_3\). If this leads to a dead-end, you could backtrack to \(q_2\) and try staying at \(q_2\) this time.

NFSA Recognition Algorithm (ND-RECOGNIZE)

The text presents an algorithm that uses a variable “agenda” to manage unexplored choices. The algorithm systematically explores paths, accepting an input string if it leads to an accepting state and rejecting it otherwise.

  • Example**: Consider a string ‘baaa!’ and an NFSA. The algorithm would initiate from the start state \(q_0\) and use an “agenda” to maintain the states and corresponding tape positions to explore. It would systematically explore these and determine whether to accept or reject the string.

Recognition as State-Space Search

The text extends the discussion to consider ND-RECOGNIZE as a state-space search algorithm. The state space consists of pairings of machine states and tape positions, and the objective is to find an accepting state paired with the end of the tape position.

  • Example**: The state-space for ‘baaa!’ in our NFSA example would include pairings like \((q_0, b), (q_1, a), (q_2, a), (q_3, !)\) etc. The algorithm explores these pairings in search of a sequence leading to an accepted state and the end of the tape.

Depth-First and Breadth-First Search in Non-Deterministic Automata

In Non-Deterministic Finite Automata (NFA), there can be moments, called “choice points,” where the algorithm needs to decide between multiple paths. Depth-first search (DFS) selects one and explores it fully before backtracking. Breadth-First Search (BFS) examines all possible choices concurrently. Both have pitfalls: DFS can enter an infinite loop, while BFS may require impractical amounts of memory.

* Example: Consider an NFA where state `q2` can lead to `q3` or `q4` upon receiving input ‘a’. In DFS, it would choose, say `q3`, and explore fully. If this turns out to be a wrong choice, it backtracks to `q2` and tries `q4`. In BFS, both `q3` and `q4` would be considered simultaneously, storing them for future exploration.

Equivalence between Deterministic and Non-Deterministic Automata

Despite their differences, it’s possible to convert any NFA into a Deterministic Finite Automaton (DFA). The conversion algorithm groups states reached by similar input symbols into equivalence classes, which become new states in the DFA.
* Example: If an NFA has states `qa` and `qb` that could both be reached by the same input symbol, the conversion algorithm creates a new state `qab` in the corresponding DFA to represent both states concurrently.

Regular Languages and Finite-State Automata

Regular languages are those that can be described by regular expressions or characterized by finite-state automata. The properties of regular languages are precisely defined, and they are closed under several set operations like union, intersection, and complement.

* Example: If `L1 = {a, aa}` and `L2 = {b, bb}`, then `L1 ∪ L2 = {a, aa, b, bb}` is also a regular language. Similarly, `L1 ∩ L2 = {}` (empty set) is also regular.

Conversion from Regular Expressions to Automata

Regular expressions can be systematically converted into FSAs. The conversion is inductive, meaning it builds upon simpler cases. The basic operations in regular expressions like concatenation, union, and closure can be mirrored in automata.
* Example: Consider the regular expression `a*`. An automaton for this would include a start and end state connected by ε-transitions, with a loop on the start state for the symbol `a`.

Feel free to drop me a question or comment below.

Cheers, happy learning. I will see you in chapter 3.

The data journey is not a sprint but a marathon.

Medium: MattYuChang

LinkedIn: matt-chang

Facebook: Taichung English Meetup

(I created this group four years ago for people who want to hone their English skills. Events are held regularly by our awesome hosts every week. Follow the FB group link for more information!)

--

--