Mastering Text Analysis: A Guide to Regular Expressions in Python

From Zero to Regex Hero: Building a Strong Foundation in Textual Data Analysis

Arthur Chong
Artificial Corner
10 min read · Aug 25, 2023

By some estimates, the vast majority of the data that exists today was produced within the last 10 years! This is an unprecedented amount, and the volume of data we generate is projected to keep growing rapidly. With this much data, there are a ton of insights that we can retrieve from it. However, we alone are unable to process so much, and thus we need the help of machines.

Machines excel at dealing with numbers and deriving insights from numerical data. However, there is another type of data: textual data. Many insights can be derived from textual data. For example, take a look at this customer review from the airline reviews page for British Airways.

Review taken from https://www.airlinequality.com/airline-reviews/british-airways/page/1/?sortby=post_date%3ADesc&pagesize=100

In the above review, we can see that the customer rated the airline 1/10. However, that does not tell us much about why they rated it as such, right? This is where the textual data comes in. Just look at the amount of information that is contained in that long paragraph of text! If we were able to extract that and analyse it, we would be able to get a lot more insight into the airline service. This is where Regular Expressions (RegEx) can help us. With the help of RegEx, we can pre-process the text data and feed it into Machine Learning models based on Natural Language Processing. By leveraging these technologies, we are able to analyse natural language data and extract far more meaningful insights.

RegEx is used to search for and match specific patterns of text. If a match for the pattern is found, we can perform operations on it and manipulate it. We can work with RegEx through Python’s built-in package called re, simply by running import re.

In this article, I will be going through the basics of RegEx, more specifically, these 3 topics:

  1. RegEx Methods
  2. RegEx Metacharacters
  3. RegEx Character Classes

At the end of this article, you should be equipped with the knowledge of applying RegEx as a starting step to being able to reveal concealed truths within textual data!

RegEx Methods

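I will introduce 5 basic but essential methods. They are:

  1. search(): returns a Match object for the first match of the pattern in the string
  2. finditer(): returns an iterator of Match objects, one for each match
  3. findall(): returns a list of all the matching strings
  4. split(): splits the string at each match and returns the pieces in a list
  5. sub(): replaces each match with a replacement string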

1. search()

The most fundamental RegEx method is search(). It looks for a specified pattern within the input string and returns a Match object if there is a match (and None if there is not). It takes two arguments: the pattern you want to look for and the string that you want to search.

import re
text = "She sells seashell on the seashore"
pattern = "sea"
print(re.search(pattern, text))
Output of the source code:
<re.Match object; span=(10, 13), match='sea'>

The output above is what it means to return a Match object. From the output, you are able to deduce a few pieces of information. For example, the span of the Match object tells you the starting (inclusive) and ending (exclusive) indexes of the string where the match is found. In this case, “sea” in “seashell” is located at text[10:13]. We can also call the group() method to retrieve the text that was matched: re.search(pattern, text).group() returns 'sea'.
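As a small sketch of how the Match object described above can be inspected, the snippet below pulls out the span and the matched text, and also shows what happens when nothing matches. The variable names and the "ocean" pattern are my own, chosen only for illustration.

```python
import re

text = "She sells seashell on the seashore"

# re.search returns a Match object, or None when the pattern is absent.
match = re.search("sea", text)
if match is not None:
    print(match.span())   # (10, 13): start (inclusive) and end (exclusive)
    print(match.start())  # 10
    print(match.end())    # 13
    print(match.group())  # sea

# A pattern that does not occur gives None, so it is worth checking the
# result before calling .group() on it.
missing = re.search("ocean", text)
print(missing)  # None
```

Checking for None first avoids an AttributeError when the pattern is not in the text.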

However, do you notice something? That’s right, there are actually two “sea”s in the string, and the other one is found in “seashore”. This is an important point to take note of, which is that the re.search() method only returns the first instance where the match is found. This is where the finditer() method comes in!

2. finditer()

Using finditer() we are able to retrieve all the matches if there are multiple of them by doing the following:

text = "She sells seashell on the seashore"
matches = re.finditer("sea", text)
for match in matches:
    print(match)
Output of the source code:
<re.Match object; span=(10, 13), match='sea'>
<re.Match object; span=(26, 29), match='sea'>

Now, we can see that both matches are returned here! Moving on to the next method!

3. findall()

Using findall() will return a list that contains all the matches found in the input string. The easiest way to show this is with an example. Going back to our text variable,

text = "She sells seashell on the seashore"
re.findall('sea', text)
Output of the source code:
['sea', 'sea']

As you can see above, a single list is returned that contains one entry for each match of the pattern “sea” found in the text. This may not seem really useful now, but as we go more in-depth into RegEx Metacharacters and Character Classes, you will start to see the practicality of this method. So keep this in mind as we move along!

4. split()

The split() method divides a string into substrings at each occurrence of the specified pattern and places them in a list. Once again, we will use the example that we have defined earlier.

text = "She sells seashell on the seashore"
re.split('sea', text)
Output of the source code:
['She sells ', 'shell on the ', 'shore']

The method is similar to Python’s built-in string split() method, but it splits on a RegEx pattern rather than on a fixed separator, which makes it the better choice when the separator can vary. As seen in the output, the text is split into a list at every instance of “sea”. Thus, the specified pattern itself is not present in the list.

5. sub()

Now we can move on to the last method that we will talk about in this article, the sub() method. Intuitively, it can be seen as replacing a specified pattern with something else. re.sub() takes in 3 parameters.

  1. The pattern to match
  2. The string to substitute in if a match is found
  3. The text to find the pattern in

text = "She sells seashell on the seashore"
re.sub('sea','crab',text)
Output of the source code:
'She sells crabshell on the crabshore'

Forgive me for the sentence not making a whole lot of sense. This is entirely for illustration purposes.

RegEx Metacharacters

We can now move on to the metacharacters of RegEx!

So far, we have only learnt how to match very specific patterns that we know exist in the text. Sorry to burst your bubble, but in the real world this is almost never the case. What if we want to match strings that we know are of a particular format, but are not sure of what the exact string is? This is where metacharacters can help! Metacharacters are special characters within RegEx that change how we specify the pattern we are looking for.

Let’s look at some of the metacharacters!

IMPORTANT NOTE

Before going into metacharacters, I should introduce to you what a raw string is. A raw string is a string that is prefixed with an ‘r’ at the front of the string. For example, string = r'This is a raw string'. It tells Python to treat backslashes (“ \ ”) as literal characters instead of as the start of an escape sequence such as \n. When dealing with RegEx, we usually write the pattern as a raw string so that backslashes reach the RegEx engine unchanged and do not cause any issues when finding the expected pattern. With that out of the way, let’s move on to metacharacters!
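To make the difference concrete, here is a minimal sketch comparing a normal string with a raw string. The sample text is my own, and the \b word-boundary token is not covered in this article; I use it only because it is a case where the two kinds of string behave differently.

```python
import re

# In a normal string, "\n" is a single newline character.
normal = "\n"
# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw = r"\n"

print(len(normal))  # 1
print(len(raw))     # 2

# In a normal string "\b" is a backspace character, whereas the regex
# engine reads r"\b" as a word boundary, so the results differ.
sample = "sea seashore"
without_raw = re.findall("\bsea\b", sample)
with_raw = re.findall(r"\bsea\b", sample)

print(without_raw)  # []
print(with_raw)     # ['sea']
```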

[ ] (square brackets)

The square brackets allow you to match any one character from the set specified within the brackets. For example, let’s look at this string: string = "The quick brown fox jumped over 32412 fences". We could write re.findall('[a-zA-Z]', string). This matches every character in the string that is a letter, lowercase or uppercase. If we want only the digits, we could write re.findall('[0-9]', string). Both of these calls return every individual character that matches the pattern.
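The two calls described above can be run as follows. This is just a sketch; the variable names letters and digits are mine.

```python
import re

string = "The quick brown fox jumped over 32412 fences"

# Every single letter, lowercase or uppercase, as its own list entry.
letters = re.findall(r"[a-zA-Z]", string)
# Every single digit as its own list entry.
digits = re.findall(r"[0-9]", string)

print(letters[:3])  # ['T', 'h', 'e']
print(digits)       # ['3', '2', '4', '1', '2']
```

Notice that each match is one character long, because the brackets on their own stand for exactly one character from the set.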

. (period)

The period acts like a wildcard character. It matches every character (even spaces) except for a new line ("\n"). Because of this special meaning, if we want to find a literal period in the text, we have to put a backslash (“ \ ”) in front of it to tell RegEx to treat the period literally.

Used on its own, the period matches every character in the string. If we want to match only the full stop, we have to add a backslash in front of it, as in \.
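Here is a short sketch of the wildcard period next to the escaped period. The sample text "Hi. Bye." is a made-up example of mine.

```python
import re

text = "Hi. Bye."

# An unescaped period matches every character, including the space.
everything = re.findall(r".", text)
# An escaped period matches only literal full stops.
full_stops = re.findall(r"\.", text)

print(everything)  # ['H', 'i', '.', ' ', 'B', 'y', 'e', '.']
print(full_stops)  # ['.', '.']

# The period does not match a new line by default.
across_lines = re.findall(r".", "a\nb")
print(across_lines)  # ['a', 'b']
```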

{} (Curly brackets)

The curly brackets allow us to specify the number of occurrences of the preceding character. {n} means exactly n occurrences, while {m,n} means at least m and at most n occurrences.

To illustrate the period and curly brackets working together, take a pattern such as s.{1}e. With it, we are trying to match an ‘s’ followed by any character one time, followed by an ‘e’. Hence, in a (nonsensical) sentence containing ‘see’, ‘sue’, and ‘she’, all three are matched.
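The original example sentence is not shown here, so the sketch below uses a stand-in sentence of my own that contains the same three words. The pattern s.{1}e is my reading of the description above, and the second call is an extra illustration using the digits from the earlier string.

```python
import re

# Hypothetical sentence, chosen only because it contains see, sue and she.
sentence = "I see you sue she"

# 's', then any single character exactly once, then 'e'.
three_letter = re.findall(r"s.{1}e", sentence)
print(three_letter)  # ['see', 'sue', 'she']

# Curly brackets also work after a set: exactly five digits in a row.
string = "The quick brown fox jumped over 32412 fences"
five_digits = re.findall(r"[0-9]{5}", string)
print(five_digits)  # ['32412']
```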

^/$ (Caret/ Dollar Sign)

The caret character specifies that the pattern we are searching for appears at the beginning of the string, while the dollar sign character specifies that it appears at the end of the string.

For example, searching the text "She sells seashell on the seashore" for ^seashore gives no match, as "seashore" does not appear at the beginning of the string. However, if we were to use a “$” instead and search for seashore$, we would get a match.

An important thing to notice is the placement of the “^” and the “$”. The “^” has to be in front of the pattern that you want to match while the “$” should be at the back of it.
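A minimal sketch of both anchors, assuming the same text variable used in the methods section:

```python
import re

text = "She sells seashell on the seashore"

# "seashore" is not at the beginning of the string, so nothing is found.
at_start = re.findall(r"^seashore", text)
# "seashore" is at the end of the string, so it is found.
at_end = re.findall(r"seashore$", text)
# "She" is at the beginning of the string, so the caret version matches.
starts_with_she = re.findall(r"^She", text)

print(at_start)         # []
print(at_end)           # ['seashore']
print(starts_with_she)  # ['She']
```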

*/+/? (Asterisk/ Plus sign/ Question Mark)

These 3 characters carry out similar functions. They specify how many times the pattern preceding them occurs.

“ * ”: The asterisk matches the preceding pattern zero or more times.

“ + ”: Probably the most common metacharacter used in RegEx, the + sign matches the preceding pattern one or more times.

For example, the pattern L+ checks that the letter “ L ” occurs at least one time. When it does, the match also takes in every “ L ” that immediately follows it.

“ ? ”: The question mark matches the preceding character zero or one times. Its use case is more specific: it marks a character as optional.
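The three quantifiers can be compared side by side. The test string below is invented for this sketch, so treat it as an assumption rather than the original example.

```python
import re

# Hypothetical text with three, one and zero L's between "HE" and "O".
text = "HELLLO HELO HEO"

# + : one or more L's, taking in every consecutive L.
one_or_more = re.findall(r"L+", text)
# * : zero or more L's between "HE" and "O".
zero_or_more = re.findall(r"HEL*O", text)
# ? : zero or one L between "HE" and "O".
zero_or_one = re.findall(r"HEL?O", text)

print(one_or_more)   # ['LLL', 'L']
print(zero_or_more)  # ['HELLLO', 'HELO', 'HEO']
print(zero_or_one)   # ['HELO', 'HEO']
```

The last call skips "HELLLO" because three L's is more than the single optional L that ? allows.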

RegEx Character Classes

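Finally, let’s look at some character classes! These are the most common character classes that you might use for RegEx:

  1. \d: any digit
  2. \D: any non-digit
  3. \w: any word character
  4. \W: any non-word character
  5. \s: any whitespace character
  6. \S: any non-whitespace character

The backslash (“ \ ”) at the start indicates that the character that follows it has a special meaning, which tells RegEx to treat the pair as a character class rather than as a plain letter.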

\d and \D

\d matches any digit in the string. For ordinary ASCII text, it has the same functionality as matching the pattern [0-9].

\D matches any non-digit in the string. It helps to note that the capitalised version of the character class is basically the negation of the lowercase version. Anything that is not a digit will be matched.

If a + is added after the \D, as in \D+, it will match one or more consecutive non-digit characters. Applied to the string from earlier, RegEx finds the first non-digit character and matches everything until it meets a digit character, which is the start of 32412 in this case.
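As a sketch of \d and \D on the string from the square brackets section (the variable names are mine):

```python
import re

string = "The quick brown fox jumped over 32412 fences"

# \d matches one digit at a time.
single_digits = re.findall(r"\d", string)
# \d+ groups consecutive digits into one match.
digit_runs = re.findall(r"\d+", string)
# \D+ matches runs of non-digits, stopping wherever a digit appears.
non_digit_runs = re.findall(r"\D+", string)

print(single_digits)   # ['3', '2', '4', '1', '2']
print(digit_runs)      # ['32412']
print(non_digit_runs)  # ['The quick brown fox jumped over ', ' fences']
```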

\w and \W

\w matches any word character. However, the term word character may be a little misleading here. A word character in RegEx includes letters, numbers, and the underscore (“ _ ”) character. Therefore, for ASCII text, you can also write it as [a-zA-Z0-9_].

Adding a + sign, as in \w+, helps us match every word in the text. This is one very common way of using the + sign!

Again, \W matches anything that is not one of the above characters.
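A quick sketch of \w+ and \W+ follows. The first string is the one used earlier; the second, "Hello, world_1!", is a made-up example so that there is some punctuation to match.

```python
import re

string = "The quick brown fox jumped over 32412 fences"

# \w+ matches each run of letters, digits and underscores.
words = re.findall(r"\w+", string)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', '32412', 'fences']

# Hypothetical text with punctuation and an underscore.
greeting = "Hello, world_1!"
word_chars = re.findall(r"\w+", greeting)
non_word_chars = re.findall(r"\W+", greeting)

print(word_chars)      # ['Hello', 'world_1']
print(non_word_chars)  # [', ', '!']
```

The number 32412 counts as a "word" here, since digits are word characters too.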

\s and \S

\s matches any whitespace character. This includes spaces " " , new lines \n , and tabs \t .

\S will match anything that is not a whitespace character.
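Here is a small sketch of \s and \S on a made-up string that contains a space, a tab, and a new line:

```python
import re

# Hypothetical text containing three different whitespace characters.
text = "one two\tthree\nfour"

# \s matches each whitespace character individually.
whitespace = re.findall(r"\s", text)
# \S+ matches each run of non-whitespace characters.
chunks = re.findall(r"\S+", text)
# Splitting on \s+ gives the same pieces as matching \S+ here.
pieces = re.split(r"\s+", text)

print(whitespace)  # [' ', '\t', '\n']
print(chunks)      # ['one', 'two', 'three', 'four']
print(pieces)      # ['one', 'two', 'three', 'four']
```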

Matching email addresses

Finally, now that we have covered the fundamentals of RegEx, let us look at how we can use the tools that we have learnt to match email addresses! A pattern for this is:

[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}

The first part, [\w._%+-]+, means that we want to match any word character, or one of the characters “.”, “_”, “%”, “+”, and “-”. This is because those characters may sometimes appear in the username of the email! The + sign after the square brackets means we want to match one or more of these characters. Then, we want to match the “@” sign, followed by [\w.-]+, matching more word characters, periods, and hyphens. We would then want to match a single period using \., followed by any letter in upper or lowercase, matched at least 2 times but not more than 4 times. This may seem very cryptic if you are not familiar with RegEx. However, I hope this makes sense to you now!
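Putting the description above together, the sketch below runs an email pattern over some sample text. The full pattern is my reconstruction from the description (in particular, [a-zA-Z]{2,4} for the last part is an assumption), and the addresses are invented placeholders.

```python
import re

# Username part, "@", domain part, a literal period, then 2 to 4 letters.
# Reconstructed from the prose description; treat it as a sketch.
email_pattern = r"[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}"

# Hypothetical text with made-up addresses.
text = "Contact alice.smith@example.com or bob_99@mail.example.org for details."

emails = re.findall(email_pattern, text)
print(emails)
# ['alice.smith@example.com', 'bob_99@mail.example.org']
```

This is fine for a first pass over text, but it is deliberately simple: for instance, the {2,4} limit would cut off endings longer than four letters, so real-world email validation usually needs more care.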

Conclusion

With those 3 sections of RegEx covered, you’ve probably caught a glimpse of just how powerful regex can be! With these RegEx tricks up your sleeve, you can now manipulate text. This is an essential first step to being able to perform Natural Language Processing (NLP), and uncover insights from textual data! Have fun!

Connect with me!

LinkedIn
Email: arthurchong01@gmail.com

Undergraduate Data Science and Analytics student at The National University of Singapore interested in Machine Learning and AI