Regular Expressions: An Art of Manipulating Strings.

Dolamu Oludare
Towards Data Engineering
9 min read · Jan 15, 2024

Data is at the core of the internet, ranging from text to images and videos. The internet is estimated to contain several exabytes of textual information. At some point, you will need to extract structured information from a website or a database. Regular expressions (regex) are a powerful tool for manipulating textual data and extracting it from a large pool of information. Regex is supported in most programming languages, although the exact syntax varies between them.

In this article, we will focus on regex in the Python programming language. We will examine some of the methods, wildcards, and regex rules before putting our ideas into code.


The re module is Python’s built-in library for working with regular expressions. Some of the major methods in the re module include:

  1. match(): This method checks whether the regular expression matches at the beginning of the string. It returns a match object if it does, and None otherwise.
  2. search(): This method scans through a string, looking for the first location where the regular expression matches.
  3. findall(): This method finds all substrings where the regular expression matches and returns them as a list.
  4. finditer(): This method finds all substrings where the regular expression matches and returns them as an iterator of match objects.
  5. split(): This is a modification method that returns a list where the string has been split at each match.
  6. sub(): This is a modification method that replaces one or many matches with a string.
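
As a quick illustration of the six methods listed above, here is a minimal sketch. The sample string and patterns are made up for this example.

```python
import re

text = "cat bat rat"
pattern = r"[a-z]at"  # any lowercase letter followed by "at"

# match() only succeeds at the start of the string; it returns a Match object or None.
first = re.match(pattern, text)
print(first.group())                                   # cat

# search() returns the first match found anywhere in the string.
found = re.search(r"rat", text)
print(found.start())                                   # 8

# findall() returns every match as a list of strings.
print(re.findall(pattern, text))                       # ['cat', 'bat', 'rat']

# finditer() yields Match objects one at a time.
print([m.span() for m in re.finditer(pattern, text)])  # [(0, 3), (4, 7), (8, 11)]

# split() cuts the string at each match of the pattern.
print(re.split(r"\s", text))                           # ['cat', 'bat', 'rat']

# sub() replaces each match with the given string.
print(re.sub(r"[cbr]at", "hat", text))                 # hat hat hat
```

Note that match() and search() return a single match object, while findall() and finditer() return every match in the string.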

Regex Wildcards

The regex wildcards are characters that control how the regular expressions work and how they are interpreted. Let’s go ahead and examine them one after the other.

  1. . : The dot (.) symbol matches any single character except a newline.
  2. * : The asterisk (*) symbol means zero (0) or more repetitions of the preceding character or group.
  3. ^ : The caret (^) symbol means starts with. For example “^David” matches a string that starts with David.
  4. $ : The dollar ($) symbol means ends with. For example “world$” matches a string that ends with world.
  5. + : The plus (+) symbol means one (1) or more repetitions of the preceding character or group.
  6. ? : The question mark (?) symbol means zero (0) or one (1) repetition of the preceding character or group.
  7. {m} : This means exactly m repetitions of the preceding character or group.
  8. {m,n} : This means m to n repetitions of the preceding character or group. Write it without a space after the comma.
  9. [] : This means a set of characters, e.g. [a-zA-Z].
  10. \ : The backslash escapes a special character so that it is matched literally. For example “\.” matches a literal dot, and “\\t” matches a literal backslash followed by t.
  11. | : This symbol means either or. For example “fall|stand” matches fall or stand. Spaces around the | are part of the pattern, so leave them out unless you want to match them.
  12. () : This symbol is used to capture or group a set of characters.
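
The wildcards above can be tried out with a short sketch like the one below. All sample strings are invented for illustration.

```python
import re

# "." matches any single character except a newline.
print(re.findall(r"c.t", "cat cot c\nt"))                          # ['cat', 'cot']

# "*", "+" and "?" apply to the element just before them (here the letter "o").
words = "gd god good"
print(re.findall(r"go*d", words))                                  # ['gd', 'god', 'good']
print(re.findall(r"go+d", words))                                  # ['god', 'good']
print(re.findall(r"go?d", words))                                  # ['gd', 'god']

# "^" and "$" anchor the pattern to the start and end of the string.
print(bool(re.search(r"^David", "David said hello")))              # True
print(bool(re.search(r"world$", "hello world")))                   # True

# {m} and {m,n} set an exact count or a range of repetitions.
print(bool(re.fullmatch(r"a{3}", "aaa")))                          # True
print(bool(re.fullmatch(r"a{2,3}", "aaaa")))                       # False

# "|" means either/or; "[]" is a character set; "()" captures part of a match.
print(re.findall(r"fall|stand", "we fall, then we stand"))         # ['fall', 'stand']
print(re.findall(r"([a-z]+)@example", "ada@example bob@example"))  # ['ada', 'bob']

# A backslash escapes a special character so it is matched literally.
print(re.findall(r"\.", "a.b.c"))                                  # ['.', '.']
```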

More Special Characters

  1. \d: Matches any decimal digit; [0-9].
  2. \D: Matches any non-digit character.
  3. \s: Matches any whitespace character (space “ ”, tab “\t”, newline “\n”).
  4. \S: Matches any non-whitespace character.
  5. \w: Matches any alphanumeric (word) character; [a-zA-Z0-9_].
  6. \W: Matches any non-alphanumeric character.
  7. \b: Matches where the specified characters are at the beginning or the end of a word.
  8. \B: Matches where the specified characters are present, but not at the beginning or the end of a word.
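
A minimal sketch of these special characters in action follows. The sample strings are made up for this example.

```python
import re

sample = "Order 66 shipped to Lagos_NG on day 7"

# \d matches digits; \D matches everything that is not a digit.
print(re.findall(r"\d+", sample))                   # ['66', '7']
print(re.sub(r"\D", "", "(123)-567-8912"))          # 1235678912

# \w matches letters, digits and underscore; \W matches everything else.
print(re.findall(r"\w+", sample))
print(re.split(r"\W+", "red, green;blue"))          # ['red', 'green', 'blue']

# \s matches whitespace; \S matches non-whitespace.
print(re.findall(r"\S+", "a  b\tc\nd"))             # ['a', 'b', 'c', 'd']

# \b requires a word boundary; \B requires that there is none.
print(re.findall(r"\bcat", "cat concat catalog"))   # ['cat', 'cat']
print(re.findall(r"\Bcat", "cat concat catalog"))   # ['cat']
```

In the last two lines, \bcat matches the standalone word cat and the start of catalog, while \Bcat only matches the cat inside concat.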

Code Implementation 1

Now that we know the basic wildcards in regex, we will put our ideas into code by extracting particular text patterns from data.

Let’s start with a simple problem of extracting phone numbers from a pool of phone numbers.
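
The listing for this example is not included in this text, so here is a minimal sketch that is consistent with the explanation and output that follow. The contents of the chat string are an assumption made for illustration; only the pattern and the two phone numbers come from the article. The sketch is laid out so that its line numbers roughly follow the explanation.

```python
import re  # Python's built-in regular expression module
chat = """
Hi, you can reach me on (123)-567-8912.
My other number is 1234558494940.
"""  # hypothetical sample text, assumed for this sketch

pattern1 = r"\(?\d+\)?-?\d{3}?-?\d{4}?"
matches = re.findall(pattern1, chat)
print(matches)  # ['(123)-567-8912', '1234558494940']
```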

Code explanation:

Line 1: We import the Python regex library.

Line 2–5: We create a pool of phone numbers encoded in a string and assign it a variable with the name chat.

Line 7: We write the regex pattern for the data we want to extract. Here we are trying to extract all phone numbers in the dataset. We can see that the first phone number has its first three digits enclosed in brackets, and that its groups of digits are separated by dashes (-).

pattern1 = r"\(?\d+\)?-?\d{3}?-?\d{4}?"

\(?\d+\)?: The program looks for zero or one opening bracket, then one or more digits, then zero or one closing bracket. The backslash (\) before each bracket escapes it, so the bracket is matched as a literal character instead of being treated as the start or end of a group.

-?\d{3}?: The program looks for zero or one dash, followed by exactly 3 digits. The question mark after {3} only makes the quantifier lazy, which has no effect on an exact count, so \d{3}? behaves like \d{3}.

-?\d{4}?: The program looks for zero or one dash, followed by exactly 4 digits. Taken together, the three parts match both phone numbers in the text.

Line 8: We provide the pattern and the data as arguments to the findall method in the re module. This returns the matched text in the data as a list, which we assign to the variable matches.

Line 9: We print the variable matches to see its content.

Output:

>> ['(123)-567-8912', '1234558494940']

The code matches the two numbers in the text, as shown in the output above.

Code Implementation 2

In this code example, we will extract the order numbers of product sales from the data.
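
The listing for this example is also missing from this text. The sketch below is one way it could look, assuming a hypothetical content string with three order references. Only the pattern and the two matched order numbers come from the article; the surrounding sentences and the third order number are invented. The layout roughly follows the line numbers used in the explanation.

```python
import re  # Python's built-in regular expression module
content = """
I have an issue with order #56474849303.
There is also a problem with order #56475647483.
Please look into order#56474849304 too.
"""  # hypothetical sample text, assumed for this sketch

pattern = r"(order)\s\#\d+"

matches = re.finditer(pattern, content)

# group() with no argument returns the full text of each match.
for match in matches:
    print(match.group())
```

With this sample, the third reference is skipped because there is no space between order and the hash character.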

Code explanation:

Line 1: We import the Python regex library.

Line 2–6: We create a pool of order numbers of product sales encoded in a string and assign it a variable with the name content.

Line 8: We write the regex pattern for the data we want to extract. Here we are trying to extract all order numbers that have a space between the word order and the number. We can see that the first two order numbers have a space between the order text and the numbers.

(order)\s\#\d+: The regex matches the exact word order, which the parentheses capture in a group. The \s wildcard then requires a single whitespace character. The \# pattern matches the hash character (#); the backslash is optional here, because # has no special meaning outside verbose mode, but it is harmless. Finally, \d+ matches the one or more digits of the order number.

Line 10: In this code example, we used the finditer() method in the re module to match our regex pattern to the variable content. This returns an iterator and assigns it to a variable matches.

Line 13–14: We iterate through the matches iterator and print out each of the order numbers that match our regex pattern.

Output:

>> order #56474849303
>> order #56475647483

Code Implementation 3

Companies in Europe often report their financial numbers on a semi-annual basis, so a document can contain both quarterly and semi-annual periods. A regex can extract both kinds of period.

The text we are extracting from is shown below:

So, we need to extract the financial report year notation from the text above. Let’s write a regex pattern to do this.
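
Neither the source text nor the listing survives in this version of the article, so the sketch below uses a hypothetical report text invented for illustration. The pattern follows the explanation below; the line numbers in that explanation refer to the original listing, which defined the text separately, so they will not line up exactly with this sketch.

```python
import re

# Hypothetical report text, invented for illustration only.
text = """
The company reported higher revenue in FY2021 Q1 than in FY2020 Q4.
Operating costs for FY2021 S1 were in line with guidance.
"""

pattern = r"FY(\d+)\s([SQ][0-9])"

# With two groups in the pattern, findall returns a list of (year, period) tuples.
matches = re.findall(pattern, text)

for match in matches:
    if match is not None:
        print(" ".join(match))  # e.g. "2021 Q1"
```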

Code explanation:

Line 1: We import the Python regex library.

Line 2: We write the regex pattern for the data we want to extract. Here we are trying to extract all the financial year notations in the text. We can see that they all start with FY followed by the year, then a space character, and then S or Q followed by a digit.

FY(\d+)\s: The regex matches the exact text FY, then (\d+) matches one or more digits after it and puts them in a group. The \s wildcard then matches a single whitespace character.

([SQ][0-9]): The regex matches either the character S or Q, followed by any single digit from 0 to 9, and puts this pair in a second group.

Line 3: In this code example, we used the findall() method in the re module to match our regex pattern to the text content. This returns a list and assigns it to a variable matches.

Line 5–7: We iterate through the list, check that each item is not None, and print it as a string.

Output:

The output of our code can be seen below.


Code Implementation 4

In this last code sample, we will extract a set of information from Elon Musk’s Wikipedia biography page. We will extract his name, date of birth, age, and place of birth. Let’s go through the code.

Firstly, let’s extract some information from the Wikipedia page and encode it in a string.
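
The original string is not shown in this text. The sketch below assumes a three-line excerpt, reconstructed to be consistent with the output reported later in this article; the exact spacing after Born is a guess.

```python
# Hypothetical excerpt, reconstructed from the output reported later in the article.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""

# Show the three lines we will be extracting from.
for line in text.strip().splitlines():
    print(line)
```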

From the string above, we will write regex patterns to extract the name, date of birth, age, and place of birth of Elon Musk.

Extracting Name

Looking at the text, there are several ways to extract the name. We can use a pattern that looks for the literal word Born, then extracts every character after it on the same line by grouping them with parentheses.

The program looks for the word Born, and then (.*) matches zero or more characters after the word Born, up to the end of the line, and puts them in a group.
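
A minimal sketch of the name pattern, assuming the same hypothetical three-line excerpt:

```python
import re

# Hypothetical excerpt, assumed for this sketch.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""

# (.*) stops at the end of the line because "." does not match a newline.
name_matches = re.findall(r"Born(.*)", text)
print(name_matches[0].strip())  # Elon Reeve Musk
```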

Extracting Birth Date

To extract the birth date, the program looks for the word Born, and then the .*\n pattern consumes every character after Born up to and including the newline, which moves the match to the next line of the text. Since the date of birth is on that line, we can use (.*) to capture its characters. We then add the pattern \(age after the group so that the capture stops just before the age, leaving only the date of birth.
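
A minimal sketch of the birth date pattern described above, assuming the same hypothetical excerpt:

```python
import re

# Hypothetical excerpt, assumed for this sketch.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""

# Skip the rest of the "Born" line, then capture everything before "(age".
date_matches = re.findall(r"Born.*\n(.*)\(age", text)
print(date_matches[0].strip())  # June 28, 1971
```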

Extracting Age

To extract the age information, we can take advantage of the fact that the word age appears only once in the text and that the age number comes right after it. So we use the pattern age (\d+): the program looks for the word age followed by a space, then captures the digits after it.
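
A minimal sketch of the age pattern, assuming the same hypothetical excerpt:

```python
import re

# Hypothetical excerpt, assumed for this sketch.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""

# Capture the digits that follow the word "age" and a space.
age_matches = re.findall(r"age (\d+)", text)
print(age_matches[0])  # 52
```

The captured age is a string; wrap it in int() if a number is needed.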

Extracting Place of Birth

We use the age text on the second line to guide the extraction of the place of birth on the third line. The pattern looks for the age text, then the .*\n pattern consumes every character after it up to and including the newline. The (.*) expression then captures every character on the third line of the text.
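
A minimal sketch of the place of birth pattern, assuming the same hypothetical excerpt. The exact pattern in the original listing is not shown, so the one below is written from the description above.

```python
import re

# Hypothetical excerpt, assumed for this sketch.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""

# Skip the rest of the "age" line, then capture the whole next line.
place_matches = re.findall(r"age.*\n(.*)", text)
print(place_matches[0].strip())  # Pretoria, Transvaal, South Africa
```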

Writing into a Function

Now that we have all our patterns sorted out, we will write the extraction code into a function.
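
The original functions are not reproduced in this text. The sketch below is one possible version under stated assumptions: the function names get_pattern_match and extract_personal_information are hypothetical, the text is the same assumed excerpt, and the dictionary keys are taken from the output shown later. The line numbers in the explanation refer to the original listing, so they will not line up exactly.

```python
import re

# Hypothetical excerpt, assumed for this sketch.
text = """
Born  Elon Reeve Musk
June 28, 1971 (age 52)
Pretoria, Transvaal, South Africa
"""


def get_pattern_match(pattern, text):
    """Return the first match of pattern in text, stripped, or None if absent."""
    matches = re.findall(pattern, text)
    if matches:
        return matches[0].strip()
    return None


def extract_personal_information(text):
    """Apply the four patterns and collect the results in a dictionary."""
    age = get_pattern_match(r"age (\d+)", text)
    full_name = get_pattern_match(r"Born(.*)", text)
    birth_date = get_pattern_match(r"Born.*\n(.*)\(age", text)
    birth_place = get_pattern_match(r"age.*\n(.*)", text)
    return {
        "name": full_name,
        "age": age,
        "birthdate": birth_date,
        "birth place": birth_place,
    }


info = extract_personal_information(text)
print(info)
```

Returning None from the helper when nothing matches keeps a missing field from raising an IndexError.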

Code explanation:

Line 1–4: We write a function to extract the actual text of each item. We use the findall() method to match the regex pattern to the text and then return the first item in the list. We use the strip() method to remove leading and trailing spaces and tabs from each text.

Line 6–10: We make use of the first function to extract the age, full name, date of birth, and place of birth data.

Line 12–17: We return the data in a dictionary.

Now, we will call the function and see the output.

Output:

>>> {'name': 'Elon Reeve Musk',
'age': '52',
'birthdate': 'June 28, 1971',
'birth place': 'Pretoria, Transvaal, South Africa'}

The code gives us the required information in a dictionary as shown in the code block above.

Conclusion

In conclusion, regular expressions are a useful tool in programming for working with strings. The beauty of regex is that there are diverse patterns and methods for solving a data extraction problem. To practice more and consolidate your understanding, you can visit the regex101 website.

Thank you for reading.
