QUANTRIUM GUIDES

Extracting Words from a string in Python using the “re” module

Extract words from your text data using Python’s built-in regular expression module

Bharath Sivakumar
Quantrium.ai


Regular Expressions in Python

A regular expression (RegEx) is an extremely powerful tool for processing and extracting character patterns from text. Regular expressions are fast and help you avoid unnecessary loops in your program when matching and extracting the information you want.

In this post, we will show you how you can use regular expressions in Python to solve certain types of problems.

No prior knowledge of regular expressions is required to follow this post.

Let’s understand how you can use RegEx to solve various problems in text processing. In this post we are focusing on extracting words from strings.

Using Regular Expressions in Python

To start using Regular Expressions in Python, you need to import Python’s re module.

import re 

We have divided this post into 3 sections that are not strictly related to each other and you could head to any one of them directly to start working, but if you are not familiar with RegEx, we suggest you follow this post in order.

We will be using the findall function provided in the re module throughout this post to solve our problems. Let’s begin.

Using "|" Operator to Extract All Occurrences of Specific Words

Let’s assume you have the following paragraph of text, which describes various cities, and you want a list of all occurrences of a particular city.

text = "Chennai is a beautiful city. It’s the capital of the state of Tamil Nadu. Chennai has an area close to 430 kilometer squares. Well chennai is not as large as mumbai which has an area of 603.4 kilometer squares. By road, Chennai is about 1500 kilometers away from Mumbai. Whereas, it is about 2200 kilometers away from Delhi, the capital of India."

Now, you want to extract all the occurrences of Chennai, for which, you can do something like this:

cities_record = 'Chennai'
re.findall(cities_record, text)

Here, findall is a function in re that takes two arguments: the first is the pattern to search for, in this case 'Chennai', and the second is the string in which to search for it.

The function returns all non-overlapping matches of the pattern (held in the cities_record variable) found in the string (held in the text variable in our case) as a list of strings.
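To make "non-overlapping" concrete, here is a minimal sketch. The sample string is a made-up one (it is not part of the city paragraph) and was chosen only because its matches could overlap if the scan did not skip past each match.

```python
import re

# Hypothetical sample string, chosen only to illustrate non-overlapping matching.
sample = "aaaaa"

# The scan runs left to right and resumes right after the end of each match,
# so five characters yield two matches of "aa" and one unmatched "a".
pairs = re.findall("aa", sample)
print(pairs)  # ['aa', 'aa']

# When nothing matches, findall returns an empty list rather than None.
missing = re.findall("Delhi", sample)
print(missing)  # []
```

Because the result is always a list, you can safely pass it to len() or loop over it without checking for None first.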

Hence, the above code cell will return a list of all the occurrences of the word 'Chennai' in our string:

['Chennai', 'Chennai', 'Chennai']

But wait a second. Our document mentions Chennai 4 times, yet the list shows only 3. Why?

If you look carefully at the paragraph, you will see that the third time, the name of the city was written as "chennai" with a lowercase 'c'.

By default, regular expressions are case sensitive.

So how do you capture 'chennai' as well, in a single pass? This gives us an opportunity to introduce the third parameter of findall, flags. You can set its value to re.IGNORECASE as follows:

cities_record = 'Chennai'
re.findall(cities_record, text, flags=re.IGNORECASE)

By setting the flags parameter to re.IGNORECASE, you are telling the regex engine to ignore case while performing the search. On running this code, you will get the following output:

['Chennai', 'Chennai', 'chennai', 'Chennai']
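If you would rather not pass flags on every call, the re module offers a few equivalent spellings. The sketch below uses a shortened version of the paragraph so that it runs on its own; which form you pick is mostly a matter of taste.

```python
import re

# Shortened sample text, so this snippet is self-contained.
text = "Chennai is a beautiful city. Well chennai is not as large as mumbai."

# Option 1: re.I is a short alias for re.IGNORECASE.
with_alias = re.findall("Chennai", text, flags=re.I)

# Option 2: compile the pattern once with the flag, then reuse the object.
city_re = re.compile("Chennai", flags=re.IGNORECASE)
with_compiled = city_re.findall(text)

# Option 3: put the inline flag (?i) at the very start of the pattern.
with_inline = re.findall("(?i)Chennai", text)

print(with_alias)     # ['Chennai', 'chennai']
print(with_compiled)  # ['Chennai', 'chennai']
print(with_inline)    # ['Chennai', 'chennai']
```

Compiling is handy when the same pattern is applied to many strings, since the flag then travels with the pattern object.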

Searching Multiple Patterns

Now, along with Chennai, you want to extract all occurrences of the city name “Mumbai” from this paragraph of text. You can simply do this by using | operator to create your pattern:

cities_record = 'Chennai|Mumbai'
re.findall(cities_record, text, flags=re.IGNORECASE)

This will return the following list:

['Chennai', 'Chennai', 'chennai', 'mumbai', 'Chennai', 'Mumbai']

So essentially the | is a ‘special character’ telling regex to search for pattern one 'or' pattern two in the provided text.

What if you want to search for occurrences of '|' itself in your document? Since '|' has a special meaning, you need to write it in your pattern with a backslash, as \|. The backslash \ essentially tells regex to read the next character literally, without applying its special meaning.
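Here is a small sketch of matching a literal '|'. The pipe-delimited record is a hypothetical example, not data from this post. Note the r prefix on the pattern: a raw string passes the backslash through to the regex engine unchanged.

```python
import re

# Hypothetical pipe-delimited record, used only for illustration.
record = "Chennai|Mumbai|Delhi"

# Escaped with a backslash, '|' is matched as an ordinary character.
pipes = re.findall(r"\|", record)
print(pipes)  # ['|', '|']

# re.escape adds the backslash for you, which helps when the text to
# search for comes from a variable.
escaped = re.escape("|")
same_pipes = re.findall(escaped, record)
print(same_pipes)  # ['|', '|']

# The same escaped pattern also works with re.split.
fields = re.split(r"\|", record)
print(fields)  # ['Chennai', 'Mumbai', 'Delhi']
```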

So with this search, it doesn’t matter if the name of the city is written as “mUMBAI”, “MUMBAI”, “CHENNAI” or “cHENNAI” in your document. All these cases would be captured, as long as the spelling of the city is written correctly. If you want to include more cities in your search, you can again include them using the | operator.
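When the list of cities grows, typing the pattern by hand gets tedious. One possible approach, sketched here with an assumed list named cities and a shortened text, is to build the pattern by joining the names with '|'. Passing each name through re.escape is a precaution in case a name ever contains a special character.

```python
import re

# Assumed list of city names to search for (illustrative only).
cities = ["Chennai", "Mumbai", "Delhi"]

# Shortened sample text, so this snippet is self-contained.
text = (
    "By road, Chennai is about 1500 kilometers away from Mumbai. "
    "Whereas, it is about 2200 kilometers away from Delhi, the capital of India."
)

# Join the escaped names with '|' to get 'Chennai|Mumbai|Delhi'.
cities_record = "|".join(re.escape(city) for city in cities)
print(cities_record)  # Chennai|Mumbai|Delhi

found = re.findall(cities_record, text, flags=re.IGNORECASE)
print(found)  # ['Chennai', 'Mumbai', 'Delhi']
```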

Extracting Words Containing only Alphabets

There are times when you want to extract only the words made up entirely of letters. A good example is a text document listing the names of the fruits and vegetables a person bought, along with the quantity in kilograms, in the following format:

text = """
Banana 1.051 48.25
Apple 1.024 180.54
Carrot 0.524 47.20
Radish 0.251 27.14
Tomato 0.508 41.05
"""

To extract only the names of the fruits and vegetables that were bought, you can create a pattern using a character class that contains only letters. The pattern will be as follows:

words_pattern = '[a-z]+'

In this pattern, [a-z] denotes a class of characters from a to z. The + operator denotes one or more occurrences of this character class. Hence, to extract the names of the fruits and vegetables, you can use the pattern as follows:

re.findall(words_pattern, text, flags=re.IGNORECASE)

You will get the following output:

['Banana', 'Apple', 'Carrot', 'Radish', 'Tomato']

The + character is a special character in regex. It is used to match 1 or more repetitions of the preceding regular expression or class, which in our case is [a-z]. On its own, [a-z]+ matches 1 or more lowercase letters; we get the full names in the above list, capital letters included, only because we passed the re.IGNORECASE flag. If you want the pattern itself to match 1 or more repetitions of both lowercase and uppercase letters, you can create it as follows:

words_pattern = '[a-zA-Z]+'

This way, no matter what case our fruits and vegetables are written in, they will be captured by this pattern even without using the re.IGNORECASE flag.
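To see what the flag and the extra range actually change, here is a quick comparison on a shortened, single-line version of the shopping list (an assumption made to keep the snippet short):

```python
import re

# Shortened sample, so this snippet is self-contained.
text = "Banana 1.051 48.25 Apple 1.024 180.54"

# Without the flag, [a-z] skips the capital first letters.
lower_only = re.findall("[a-z]+", text)
print(lower_only)  # ['anana', 'pple']

# With re.IGNORECASE, the same class also matches uppercase letters.
ignore_case = re.findall("[a-z]+", text, flags=re.IGNORECASE)
print(ignore_case)  # ['Banana', 'Apple']

# Listing both ranges in the class gives the same result without any flag.
both_ranges = re.findall("[a-zA-Z]+", text)
print(both_ranges)  # ['Banana', 'Apple']
```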

Understanding Character Classes in Regex

The square brackets are ‘special characters’ in regex used to match a set of characters. For example, [amk] will match 'a', 'm', or 'k'. In our case, we have used [a-z]. The - character, when used inside [], specifies a range of characters that can be matched. It is placed between the two characters that are the lower and upper limits of the range.

The class [a-z] will match any lowercase ASCII letter, [a-g] will match any lowercase letter from a to g, and so on. If you want to match a literal '-' inside square brackets, you need to escape it with a backslash: \-. The backslash character '\' is the escape character that tells regex to treat the following character as a literal and ignore its special meaning.

Regex will also treat '-' as a literal if it is the first or last character inside the square brackets, like this: [g-]. This will match only 'g' and '-'.
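The following sketch runs each of the classes described above against one made-up string, so you can compare what each of them picks up. The sample string is an assumption chosen to contain letters and hyphens.

```python
import re

# Hypothetical sample string containing letters and hyphens.
sample = "a-game-of-keys"

# A set: matches a single 'a', 'm', or 'k' at a time.
set_matches = re.findall("[amk]", sample)
print(set_matches)  # ['a', 'a', 'm', 'k']

# A range with +: runs of one or more letters between a and g.
range_matches = re.findall("[a-g]+", sample)
print(range_matches)  # ['a', 'ga', 'e', 'f', 'e']

# '-' placed last in the class is a literal, so this matches 'g' or '-'.
trailing_dash = re.findall("[g-]", sample)
print(trailing_dash)  # ['-', 'g', '-', '-']

# An escaped '-' is also a literal: this matches 'a', '-', or 'z'.
escaped_dash = re.findall(r"[a\-z]", sample)
print(escaped_dash)  # ['a', '-', 'a', '-', '-']
```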

Extracting Words That Follow a Specific Pattern

You will often come across problems where you have to extract specific words or patterns that follow a specific character. A good example is a comment on an article on a website, from which you want to extract all the usernames or IDs that were tagged in it.

For simplicity, let’s assume that our usernames can contain only letters, and that anything that follows an '@' without any space is a username.

Let’s take the following comment as example text:

comment = "This is a great article @Bharath. You have explained the complex topic in a very simplistic manner. @Yashwant, you might find this article to be useful."

Let’s create a regex pattern that can be used to search all the usernames tagged in the comment.

username_pattern = '@([a-zA-Z]+)'

This regular expression pattern will find and extract all the usernames tagged in the comment, without the '@' part.

re.findall('@([a-zA-Z]+)', comment)

The output for the above regular expression is:

['Bharath', 'Yashwant']

Here, if you examine our pattern carefully, we have put part of it inside parentheses after the '@'. When a pattern contains parentheses, findall returns only the text matched by the expression inside the parentheses, although the parts of the pattern outside the parentheses must still match before or after it.

That means what is searched for in this case is an @ immediately followed by 1 or more repetitions of any lowercase or uppercase letter, but only the part matched inside () is returned as the object of interest. So, if you remove the () operator from our regular expression:

re.findall('@[a-zA-Z]+', comment)

You will get the following output:

['@Bharath', '@Yashwant']

This is one of the ways in which you can use the () operator to extract the patterns you are interested in when they occur alongside some other pattern that you do not want to capture, such as the '@' symbol in our case.
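If you need both the full match and the part inside the parentheses, one option is re.finditer, which yields match objects instead of strings. The sketch below uses a shortened version of the comment. On a match object, group(0) is the whole match and group(1) is the first parenthesized group.

```python
import re

# Shortened sample comment, so this snippet is self-contained.
comment = "Great article @Bharath. @Yashwant, you might find this useful."

# Each match object exposes the whole match and the captured group.
tags = [(m.group(0), m.group(1)) for m in re.finditer(r"@([a-zA-Z]+)", comment)]
print(tags)  # [('@Bharath', 'Bharath'), ('@Yashwant', 'Yashwant')]

# With two groups in the pattern, findall returns a list of tuples,
# one item per group.
two_groups = re.findall(r"(@)([a-zA-Z]+)", comment)
print(two_groups)  # [('@', 'Bharath'), ('@', 'Yashwant')]

# A non-capturing group (?:...) groups without capturing, so findall
# goes back to returning the whole match.
non_capturing = re.findall(r"@(?:[a-zA-Z]+)", comment)
print(non_capturing)  # ['@Bharath', '@Yashwant']
```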

To understand all the fundamental components of regex in Python, the best resource is the official Python 3.8 documentation for the re module.

We hope you followed along and executed the code as you were reading the post.


Bharath Sivakumar
Quantrium.ai

A Machine Learning enthusiast who wants to make Machine Learning tools accessible to everybody