#30DaysOfNLP

NLP-Day 29: How To Manipulate And Preprocess Strings With Regular Expressions

Just express yourself with regular expressions

Marvin Lanhenke
5 min read · May 5, 2022


Express yourself with regular expressions #30DaysOfNLP [Image by Author]

In the last episode, we reviewed the key architectures in the field of deep learning and highlighted the importance of a general workflow. We also stated that most of the challenges lie not in the designing or modeling aspect but in the preparation and preprocessing of the data.

Now, it’s time to take a small detour and learn about regular expressions.

In the following sections, we’re going to cover the basics of regular expressions, allowing us to preprocess and modify strings in a way that helps us to solve the NLP task at hand.

So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: How To Manipulate And Preprocess Strings With Regular Expressions

Introducing regular expressions

Regular expressions (regex) can be viewed as a tiny, highly specialized programming language embedded inside Python and made available through the re module.

Although embedded, regular expressions are actually compiled into a series of bytecodes and executed by a matching engine written in C.

Regular expressions allow us to specify rules for the set of possible strings we want to match, e.g., English sentences, e-mail addresses, or specific characters. Beyond matching patterns, we can also use them to modify or even split strings.

Regex is quite powerful, but it can also get complicated pretty quickly. More sophisticated preprocessing steps should therefore not rely on regex alone but combine it with regular Python code.

Matching operations

Let’s begin with probably the most common task: matching characters.

import re

text = "Natural Language processing is so awesome, isn't it?"
pattern = re.compile(r'\?')
matches = pattern.findall(text)
>>>
['?']

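Before we can do anything at all, we have to import the re module and compile a pattern. In our simple case, we define a pattern that matches a question mark. Since the question mark is itself a metacharacter, we escape it with a backslash so that it is treated as a literal character.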

After defining the matching pattern, we can apply several methods of the compiled pattern object.

1. match() - determines if the pattern matches at the beginning of the string
2. search() - scans through the string and returns the first location where the pattern matches
3. findall() - finds all matching substrings and returns them as a list
4. finditer() - finds all matching substrings and returns an iterator of match objects

We make use of the findall() function that matches all substrings and returns them in a list. In our example, we retrieve the question mark.
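To compare the four methods side by side, here is a minimal sketch that runs all of them against the same sentence. The pattern "is" and the variable names are chosen purely for illustration.

```python
import re

text = "Natural Language processing is so awesome, isn't it?"
pattern = re.compile(r"is")

# match() only succeeds if the pattern matches at position 0
at_start = pattern.match(text)

# search() returns a match object for the first hit anywhere in the string
first = pattern.search(text)
first_span = first.span()

# findall() returns every non-overlapping match as a list of strings
all_matches = pattern.findall(text)

# finditer() yields match objects, which also carry the match positions
positions = [m.start() for m in pattern.finditer(text)]

print(at_start)      # None, because the text starts with "Natural"
print(first_span)    # (28, 30)
print(all_matches)   # ['is', 'is']
print(positions)     # [28, 43]
```

match() and search() return a single match object (or None), which is why they are handy for yes-or-no checks, while findall() and finditer() are the better fit when we need every occurrence.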

Pretty straightforward so far. But what about more sophisticated patterns? What about metacharacters?

Metacharacters

Most letters and characters simply match themselves.

With metacharacters, however, this is a completely different story. We can use metacharacters to signal that some out-of-the-ordinary thing should be matched.

Let’s consider the square brackets '[]', for example, which can be used to specify a set of characters.

import re

text = "Natural Language processing is so awesome, isn't it?"
pattern = re.compile(r'[a-c]')
matches = pattern.findall(text)
>>>
['a', 'a', 'a', 'a', 'c', 'a']

Picking up our simple example from before, we specify a set of characters [a-c], which matches any single lowercase a, b, or c, and find all occurrences in our string.

Other metacharacters to consider are the caret '^' and the dollar sign '$', which can be used to check whether a string starts or ends with a certain pattern, respectively.

import re

text = "Natural Language processing is so awesome, isn't it?"
pattern = re.compile(r'\?$')
matches = pattern.search(text)
if matches:
    print(True)
else:
    print(False)
>>>
True

By making use of the dollar sign, we are able to verify that the string ends with a question mark.
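The caret works the same way at the other end of the string. As a small sketch (the two pattern names are made up for this example), we can check which word our sentence starts with:

```python
import re

text = "Natural Language processing is so awesome, isn't it?"

# '^' anchors the pattern to the beginning of the string
starts_with_natural = re.compile(r"^Natural")
starts_with_language = re.compile(r"^Language")

print(bool(starts_with_natural.search(text)))   # True
print(bool(starts_with_language.search(text)))  # False, "Language" is not at the start
```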

So far so good.

However, things start to get more interesting once we account for the number of occurrences. Using the metacharacters '*' (zero or more), '+' (one or more), and '?' (zero or one), we can specify how many times the preceding character or set has to appear.

import re

text = "abcd"
pattern = re.compile(r'[e-z]+')
matches = pattern.search(text)
if matches:
    print(True)
else:
    print(False)
>>>
False

By using the '+' character, we try to match one or more consecutive characters from the range [e-z]. Since our string doesn’t contain any of those characters, we’re unable to match the pattern.
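To see how the three quantifiers differ, here is a minimal sketch. The word list and the pattern names are invented for illustration, and fullmatch() is used so that the whole word has to fit the pattern.

```python
import re

words = ["ac", "abc", "abbbc"]

zero_or_more = re.compile(r"ab*c")  # "b" may appear any number of times, including none
one_or_more = re.compile(r"ab+c")   # "b" must appear at least once
zero_or_one = re.compile(r"ab?c")   # "b" may appear at most once

star = [w for w in words if zero_or_more.fullmatch(w)]
plus = [w for w in words if one_or_more.fullmatch(w)]
optional = [w for w in words if zero_or_one.fullmatch(w)]

print(star)      # ['ac', 'abc', 'abbbc']
print(plus)      # ['abc', 'abbbc']
print(optional)  # ['ac', 'abc']
```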

For a complete list of metacharacters, you can refer to the table provided by w3schools.com.

Special Sequences

Using the backslash character, we can access several special sequences. For example, \w matches any alphanumeric character as well as the underscore.

Or imagine we want to extract all digits from a sequence.

import re

text = "I am 32 years old."
pattern = re.compile(r'\d')
matches = pattern.findall(text)
>>>
['3', '2']

In this example, we make use of the \d sequence to extract all single digits.
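Special sequences can be combined with the quantifiers from the previous section. The following sketch reuses the same sentence; the variable names are only illustrative, and the module-level re.findall() is used as a shortcut for compiling and matching in one call.

```python
import re

text = "I am 32 years old."

digits = re.findall(r"\d", text)    # every single digit on its own
numbers = re.findall(r"\d+", text)  # runs of consecutive digits
words = re.findall(r"\w+", text)    # runs of letters, digits, and underscores
spaces = re.findall(r"\s", text)    # every whitespace character

print(digits)       # ['3', '2']
print(numbers)      # ['32']
print(words)        # ['I', 'am', '32', 'years', 'old']
print(len(spaces))  # 4
```

Adding '+' to \d is usually what we want when extracting numbers, since it keeps the digits of one number together.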

For a list of all special sequences, we can once again refer to w3schools.com.

String modifications

With regular expressions, we can do more than just matching operations. We can also split and modify strings.

import re

text = "Natural Language Processing is so awesome!"
pattern = re.compile(r'\W+')
result = pattern.split(text)
>>>
['Natural', 'Language', 'Processing', 'is', 'so', 'awesome', '']

Relatively straightforward. We simply make use of the split() function to split the string, in this example on every run of non-alphanumeric characters.

By making use of the sub() function, we can even modify a string by substituting characters based on a certain pattern. Let’s assume we want to replace all whitespace characters with a hyphen.
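Note the empty string at the end of the split() result: the exclamation mark is the last separator, and split() reports the (empty) piece that follows it. One possible way to handle this, sketched below, is to filter out empty tokens afterwards; the variable name tokens is just an illustration.

```python
import re

text = "Natural Language Processing is so awesome!"
pattern = re.compile(r"\W+")

# Keep only the non-empty pieces returned by split()
tokens = [token for token in pattern.split(text) if token]

print(tokens)  # ['Natural', 'Language', 'Processing', 'is', 'so', 'awesome']
```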

import re

text = "Natural Language Processing is so awesome!"
pattern = re.compile(r'\s')
result = re.sub(pattern, '-', text)
>>>
'Natural-Language-Processing-is-so-awesome!'
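Substitution is also a natural building block for text normalization. The following is a minimal sketch of such a step, not a complete preprocessing pipeline; the helper normalize() and the constant NON_WORD are hypothetical names introduced here.

```python
import re

# Any run of characters that are not letters, digits, or underscores
NON_WORD = re.compile(r"\W+")


def normalize(text: str) -> str:
    """Lowercase the text and replace punctuation and extra whitespace with single spaces."""
    lowered = text.lower()
    cleaned = NON_WORD.sub(" ", lowered)
    return cleaned.strip()


print(normalize("Natural   Language Processing is so awesome!"))
# natural language processing is so awesome
```

Keep in mind that this simple version also splits contractions such as "isn't" at the apostrophe, which may or may not be what a given NLP task needs.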

Useful expressions

Now that we have covered most of the basics, let’s finish this article with some more useful examples.

Finding e-mail addresses

import re

text = "Here are some mail addresses \
alice-b@googlemail.com peter@yahoo.com"
pattern = re.compile(r'[-\w]+@[\w]+')
matches = pattern.findall(text)
>>>
['alice-b@googlemail', 'peter@yahoo']
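The pattern above stops at the first dot, so the ".com" part of each address is dropped. One possible variant that keeps the full domain is sketched below. It introduces a non-capturing group (?:...), which is not covered in this article, and it is a simplification that does not validate every legal e-mail address.

```python
import re

text = "Here are some mail addresses alice-b@googlemail.com peter@yahoo.com"

# local part, '@', first domain label, then one or more ".label" parts
pattern = re.compile(r"[-\w.]+@[-\w]+(?:\.[-\w]+)+")
matches = pattern.findall(text)

print(matches)  # ['alice-b@googlemail.com', 'peter@yahoo.com']
```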

Extracting phone numbers

import re

text = "Here are my phone numbers (555) 555-1234, (555) 555-5678"
pattern = re.compile(r'\([-)\d\s]+')
matches = pattern.findall(text)
>>>
['(555) 555-1234', '(555) 555-5678']
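The character set in this pattern is quite permissive: it accepts any mix of digits, hyphens, closing parentheses, and whitespace after an opening parenthesis. If we assume that all numbers follow the exact format (XXX) XXX-XXXX, a stricter pattern using the {n} repetition syntax can be sketched as follows. The second sample sentence is made up to show the difference.

```python
import re

text = "Call (555) 555-1234 or see (2) below"

loose = re.compile(r"\([-)\d\s]+")
strict = re.compile(r"\(\d{3}\) \d{3}-\d{4}")  # exactly 3, 3, and 4 digits

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)   # ['(555) 555-1234 ', '(2) ']
print(strict_matches)  # ['(555) 555-1234']
```

The loose pattern picks up the trailing space and the unrelated "(2)", while the strict one only returns the phone number.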

Working with numbers (including separators)

import re

text = "The numbers are 21.40453, 2,245.43, and 4,506."
pattern = re.compile(r'\b[\d.,]+\b')
matches = pattern.findall(text)
>>>
['21.40453', '2,245.43', '4,506']
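The matches are still strings. If we want to calculate with them, we can strip the separators and convert them, as in the short sketch below. It assumes that the comma is always a thousands separator and the dot is always the decimal point, which does not hold for every locale.

```python
import re

text = "The numbers are 21.40453, 2,245.43, and 4,506."
pattern = re.compile(r"\b[\d.,]+\b")

# Remove thousands separators before converting each match to a float
values = [float(match.replace(",", "")) for match in pattern.findall(text)]

print(values)  # [21.40453, 2245.43, 4506.0]
```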

Extracting dates

import re

text = "Here are some timestamps \
2013-02-20T17:24:33Z, 2016-03-23T11:19:33Z"
pattern = re.compile(r'[-\d]+\d{2}')
matches = pattern.findall(text)
>>>
['2013-02-20', '2016-03-23']
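This pattern works for our timestamps, but it would also match any other run of three or more digits. Assuming the dates always follow the YYYY-MM-DD layout, a more explicit variant can be sketched with capturing groups (parentheses), which are not covered in this article. With groups, findall() returns one tuple per match, which we can turn into date objects.

```python
import re
from datetime import date

text = "Here are some timestamps 2013-02-20T17:24:33Z, 2016-03-23T11:19:33Z"

# Three groups: four-digit year, two-digit month, two-digit day
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

dates = [date(int(y), int(m), int(d)) for y, m, d in pattern.findall(text)]

print(dates)  # [datetime.date(2013, 2, 20), datetime.date(2016, 3, 23)]
```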

Conclusion

In this article, we took a small detour and learned the basics of regular expressions. This basic knowledge comes in handy whenever we have to preprocess and modify strings to fit them into our Natural Language Processing pipeline.

Now, it’s time to wrap up the complete series by looking back, reviewing the work we have done, and providing some useful resources in the last episode.

So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.

