RegEx in Python

Published in Byte-Sized-Code, Jul 18, 2020

You may be familiar with searching for text by pressing Ctrl+F and typing the words you are looking for. Regular expressions (RegEx for short) go a step further and let you match patterns.

What is RegEx?

A regular expression (or RE or RegEx) specifies a set of strings that matches it; the functions in this [python] module let you check if a particular string matches a given regular expression.

Patterns are present everywhere: email IDs have a specific pattern (username@site.com), phone numbers follow a fixed shape (Indian numbers generally start with +91), social media hashtags begin with #and_contain_no_more_spaces, and many more such examples can be cited from our everyday lives. Regular expressions help us match these general patterns without knowing the exact text we are looking for. For example, you can use RegEx to search for an email pattern on this page, and you will find the generic example above.

Knowing [RegEx] can mean the difference between solving a problem in three steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.
— Cory Doctorow

Cory goes on to argue that one should learn regular expressions before learning how to program. Even if you do not go that far, they are useful for many day-to-day tasks.

Matching Simple Patterns

The simplest form of pattern matching is matching characters. Let’s start by looking at that.

Matching Characters

Most letters and digits match themselves when used directly. For example, the pattern text will match the exact string text wherever it occurs in the given paragraph or page. We can also use a case-insensitive mode in which text matches text, Text, TEXT, tEXt, and so on (more on case-insensitivity later).
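
As a minimal sketch of literal matching, the snippet below searches a made-up sentence (the variable names and the sample text are illustrative, not from the article) first with a plain pattern and then with the re.IGNORECASE flag:

```python
import re

sentence = "Plain text, Text in title case, and TEXT in capitals."

# A literal pattern matches only the exact characters, case included.
exact = re.findall("text", sentence)

# re.IGNORECASE makes the same pattern match any mix of cases.
any_case = re.findall("text", sentence, re.IGNORECASE)

print(exact)     # ['text']
print(any_case)  # ['text', 'Text', 'TEXT']
```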

However, there are certain characters (called metacharacters) that do not match themselves and instead signal some out-of-the-ordinary pattern matching. Here is the list of metacharacters used in pattern matching in Python: [ ] $ ^ . * + ? { } \ ( ) |

Group of Characters

The first metacharacters that we’ll discuss are [ and ] . They are used for specifying a character class, which is a set of characters that will be matched. For example, [abc] will match a, b, or c. This is the same as [a-c], which uses a range to express the same characters.

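Most metacharacters are not active inside classes. For example, [a-c$] will match a, b, c, and $. Although $ is a metacharacter, it acts as a simple character inside the square brackets. The exceptions are the backslash, a hyphen placed between two characters (which forms a range), a closing bracket, and a caret in the first position.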
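
The behaviour of character classes can be checked with a short sketch like the one below. The sample string is made up for illustration, and re.findall() is used only because it lists every match at once:

```python
import re

sample = "cab fare: $3"

# [abc] and [a-c] describe the same set of characters.
listed = re.findall("[abc]", sample)
ranged = re.findall("[a-c]", sample)
print(listed)  # ['c', 'a', 'b', 'a']
print(ranged)  # ['c', 'a', 'b', 'a']

# Inside the brackets, $ is an ordinary character.
with_dollar = re.findall("[a-c$]", sample)
print(with_dollar)  # ['c', 'a', 'b', 'a', '$']
```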

The caret (^) is used to "complement" a character set. This is indicated by using the caret as the first character inside the character class. The pattern [^a-c] will match any character that is not a, b, or c. NOTE: anywhere else inside the class, the caret has no special meaning, so [5^] matches either 5 or ^. Outside a character class, ^ is still a metacharacter: it matches at the start of the string.

The all-powerful (and at times very difficult to use) metacharacter is the dot (.). It matches any character except the newline character.
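
A small sketch of complemented classes and the dot follows. The input strings are invented for illustration, and the re.DOTALL flag in the last step is not covered in this article; it is shown only to make the newline behaviour visible:

```python
import re

# [^a-c] matches any single character other than a, b, or c.
not_abc = re.findall("[^a-c]", "abcd-e")
print(not_abc)  # ['d', '-', 'e']

# A caret that is not the first character in the class is a plain character.
five_or_caret = re.findall("[5^]", "5^2")
print(five_or_caret)  # ['5', '^']

# The dot skips the newline by default...
dot_default = re.findall(".", "a\nb")
print(dot_default)  # ['a', 'b']

# ...and also matches it when re.DOTALL is passed.
dot_all = re.findall(".", "a\nb", re.DOTALL)
print(dot_all)  # ['a', '\n', 'b']
```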

The Backslash

Perhaps the most extensively used metacharacter is the backslash. The backslash (\) can be followed by various characters to signal special sequences. It is also frequently used to escape metacharacters so that they match themselves: preceding a metacharacter with \ removes its special meaning. For example, \[ matches a literal [, and \\ matches a literal \.
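
Escaping can be tried out with a sketch like this one. The sample strings are hypothetical, and the patterns are written as raw strings (the r prefix), which the Escaping Backslash section of this article explains:

```python
import re

# \[ matches a literal opening bracket instead of starting a character class.
bracket = re.search(r"\[", "list[0]")
print(bracket.group(), bracket.start())  # [ 4

# \\ in the pattern matches one literal backslash in the text.
path = "C:\\temp"  # the actual text is C:\temp
backslash = re.search(r"\\", path)
print(backslash.start())  # 2
```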

There are certain predefined character sets that can be matched by using the backslash:

\d - Matches any digit in the given text. It is the same as the character class [0-9].
\D - Matches any non-digit character. It is the same as the character class [^0-9].
\s - Matches any whitespace character. It is the same as the character class [ \t\n\r\f\v] (note the leading space).
\S - Matches any non-whitespace character. It is the same as the character class [^ \t\n\r\f\v].
\w - Matches any alphanumeric character or underscore. It is the same as the character class [a-zA-Z0-9_].
\W - Matches any non-alphanumeric character. It is the same as the character class [^a-zA-Z0-9_].

These predefined classes can be used inside other character classes too. For example, [\s\w] matches any alphanumeric or whitespace character.
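
To see the predefined classes side by side, here is a minimal sketch run against a tiny made-up string (the string and variable names are illustrative only):

```python
import re

sample = "a1 _!"

digits = re.findall(r"\d", sample)          # ['1']
non_digits = re.findall(r"\D", sample)      # ['a', ' ', '_', '!']
spaces = re.findall(r"\s", sample)          # [' ']
word_chars = re.findall(r"\w", sample)      # ['a', '1', '_']
non_word = re.findall(r"\W", sample)        # [' ', '!']

# Predefined classes can be combined inside square brackets.
space_or_word = re.findall(r"[\s\w]", sample)  # ['a', '1', ' ', '_']

print(digits, non_digits, spaces, word_chars, non_word, space_or_word)
```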

Repetition

A very useful capability of RegEx is that it can be used to specify if some portions of the string we want to match are to be repeated a certain number of times.

The metacharacters { and } are used to specify repetitions. The curly brackets are used in the following way. {m,n} means that there must be at least m repetitions of the previous character and at most n repetitions. For example, the expression ab{2,3}c will match abbc and abbbc, but not abc (1 b) or abbbbc (4 repetitions of b). Three other metacharacters ( ?, +, *) are basically shortcuts that can be used in place of the long curly bracket syntax.
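
The ab{2,3}c example can be verified with a short sketch. It uses fullmatch(), a method not discussed in this article that succeeds only when the whole string matches, so that longer strings are not accepted on the strength of a partial match:

```python
import re

pattern = re.compile(r"ab{2,3}c")
candidates = ["abc", "abbc", "abbbc", "abbbbc"]

# Map each candidate to whether the entire string matches the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}
print(results)
# {'abc': False, 'abbc': True, 'abbbc': True, 'abbbbc': False}
```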

The question mark character matches either zero or one repetition. You can think of it as marking something as optional. For example, cat-?call will match both catcall and cat-call. It is the same as cat-{0,1}call.
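
As a quick check that ? means "zero or one occurrence", the sketch below tests three hypothetical spellings against cat-?call and its curly-bracket equivalent (fullmatch() is used so the whole word has to match):

```python
import re

hyphen_optional = re.compile("cat-?call")
words = ["catcall", "cat-call", "cat--call"]

# Zero hyphens or one hyphen is accepted; two hyphens are not.
matched = [w for w in words if hyphen_optional.fullmatch(w)]
print(matched)  # ['catcall', 'cat-call']

# The {0,1} form accepts exactly the same words.
same = [w for w in words if re.fullmatch("cat-{0,1}call", w)]
print(same == matched)  # True
```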

The plus sign (+) matches one or more repetitions. For example, ab+c would match abc, abbc, abbbc, and so on (one or more repetitions of b), but it won't match ac. The asterisk (*) matches zero or more repetitions; thus, ab*c will match everything that ab+c matches and also ac.
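
The difference between + and * shows up only on the string with no b at all, as this small sketch illustrates (the candidate list is made up, and re.fullmatch() is used so that whole strings are compared):

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

# + requires at least one b, so "ac" is rejected.
plus_matches = [s for s in candidates if re.fullmatch("ab+c", s)]
print(plus_matches)  # ['abc', 'abbc', 'abbbc']

# * also allows zero b's, so "ac" is accepted as well.
star_matches = [s for s in candidates if re.fullmatch("ab*c", s)]
print(star_matches)  # ['ac', 'abc', 'abbc', 'abbbc']
```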

That is all the raw material we need. Now let's see how to actually use regular expressions in Python.

Using Regular Expressions

To use regular expressions in Python, we use the built-in re module, so every script that works with regular expressions needs to import re.

Compiling RE

Regular expressions are compiled into pattern objects, which have methods for various tasks such as matching, searching, and replacing patterns. Here is a sample use case of re.

>>> import re
>>> p = re.compile('ab*')
>>> p.pattern
'ab*'

Earlier we stated that we can use case-insensitivity to match different cases of text. The case-insensitive behavior can be triggered by using the IGNORECASE flag in re.compile(). The same is illustrated below:

>>> p = re.compile('ab*', re.IGNORECASE)
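
To see what the flag changes in practice, here is a minimal sketch that runs the compiled pattern against a few made-up inputs (match() is described in the Pattern Matching section of this article):

```python
import re

p = re.compile("ab*", re.IGNORECASE)

results = []
for text in ["abbb", "ABBB", "Ab", "ba"]:
    m = p.match(text)  # match() only looks at the start of the string
    results.append(m.group() if m else None)

print(results)  # ['abbb', 'ABBB', 'Ab', None]
```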

Escaping Backslash

As mentioned earlier, one can use a backslash to escape metacharacters, but this leads to a pile-up of backslashes. For example, to match the text \text, the regular expression has to escape the backslash with another backslash, giving \\text. That regular expression is then written as a Python string literal, where each backslash must be escaped once more, so the string you pass to re.compile() is typed as \\\\text. In short, to match one literal backslash ( \ ), you need to type four backslashes ( \\\\ ). This makes the code confusing to read, especially for text that is full of backslashes, such as Windows file paths.

The simple solution is to use raw string notation. Prefixing a string literal with r tells Python to treat it as a raw string literal, in which backslashes have no special meaning. The following code illustrates the difference:

>>> a = re.compile(r'\\text')
>>> a.pattern
'\\\\text'
>>> a = re.compile('\\\\text')
>>> a.pattern
'\\\\text'

The same pattern is achieved by using far fewer backslashes when dealing with a raw string literal.
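
A short sketch can confirm that the two spellings produce the same string and that the resulting pattern matches the text \text (the variable names here are illustrative):

```python
import re

raw = r"\\text"        # raw literal: two backslashes followed by text
escaped = "\\\\text"   # ordinary literal: four typed, two stored

print(raw == escaped)  # True
print(len(raw))        # 6

# The target text is one backslash followed by "text".
target = "\\text"
found = re.search(raw, target)
print(found.group() == target)  # True
```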

Pattern Matching

Now that we have compiled and made a pattern object, what next? We can use the following methods on the pattern object:

match() - Determines if the expression matches at the beginning of the string.
search() - Scans a string to check if the expression matches any substring of the given string.
findall() - Finds all substrings where the pattern matches and returns a list containing the matches.
finditer() - Finds all substrings where the pattern matches and returns an iterator that yields a match object for each one.

The following session shows each of these methods called on a compiled pattern object.

>>> import re
>>> p = re.compile('[a-z]+')
>>> p.match("")
>>> print(p.match(""))
None
>>> m = p.match('tempo')
>>> m
<_sre.SRE_Match object at 0x7fd35a818f10>
>>> n = p.search('.*tempo')
>>> n
<_sre.SRE_Match object at 0x7fd35a818f80>
>>> r = p.findall('tempo.*tempo')
>>> r
['tempo', 'tempo']
>>> s = p.finditer('tempo.*tempo')
>>> s
<callable-iterator object at 0x7fd35a7c8550>

As you can see, the findall method returns a simple list containing all the matches in the text. The finditer method returns an iterable object that can be iterated over by using a simple for loop:

>>> for i in s:
...   print(i)
...
<_sre.SRE_Match object at 0x7fd35a7cc030>
<_sre.SRE_Match object at 0x7fd35a7cc0a0>

Note that the individual items yielded by finditer and the results of the search and match methods are match objects. The natural question here: what is a match object, and what can we do with it?

Match Objects

Match objects returned by re have four basic and important methods:

group() - returns the string matched by the regular expression.
start() - returns the index of the starting position of the match.
end() - returns the index just past the last character of the match.
span() - returns a tuple containing the (start, end) positions of the match.

Here is what the methods return when called upon our match object ‘m’:

>>> m.group()
'tempo'
>>> m.end()
5
>>> m.start()
0
>>> m.span()
(0, 5)
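
These methods combine naturally with finditer(). The sketch below, which reuses the pattern and text from the earlier session, collects the matched string and span of every match; it also shows the None check that is needed because match() and search() return None when nothing is found:

```python
import re

p = re.compile("[a-z]+")

# Each item yielded by finditer() is a match object.
found = [(m.group(), m.span()) for m in p.finditer("tempo.*tempo")]
print(found)  # [('tempo', (0, 5)), ('tempo', (7, 12))]

# search() returns None when there is no match, so check before calling group().
missing = p.search("12345")
print(missing.group() if missing else "no match")  # no match
```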

An Example

So far we have learnt what regular expressions are and the basic methods available in Python's re library. Let's use them to extract a date from a given piece of text:

>>> import re
>>> dateRegex = re.compile(r'\d\d-\d\d-\d\d\d\d')
>>> para = "The date today is 18-07-2020. It's already July!"
>>> date = dateRegex.search(para)
>>> date.group()
'18-07-2020'
>>> date.span()
(18, 28)
>>> para[18:28]
'18-07-2020'

The above code first creates a compiled regex object that recognizes dates. Then it searches for a “date” in a given paragraph. As you can see, we can easily extract information about the location of the match via the span method and directly access the matching substring from the paragraph.
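
The same idea extends to text containing several dates. The following sketch is one possible variation, not the only way to write it: it shortens the pattern with the {n} repetition syntax and uses findall() and finditer() on a made-up sentence (the names date_regex and text are illustrative):

```python
import re

# \d{2} is shorthand for \d\d, and \d{4} for \d\d\d\d.
date_regex = re.compile(r"\d{2}-\d{2}-\d{4}")

text = "Submitted on 18-07-2020, revised on 02-08-2020."

# findall() returns every matching substring.
dates = date_regex.findall(text)
print(dates)  # ['18-07-2020', '02-08-2020']

# finditer() gives match objects, so the positions are available too.
spans = [m.span() for m in date_regex.finditer(text)]
print(spans)  # [(13, 23), (36, 46)]
```

Note that this pattern checks only the shape of the text (two digits, two digits, four digits); it does not verify that the result is a valid calendar date.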

Conclusion

Although not a prerequisite for learning basic programming, regular expressions are worth learning sooner or later. They can shorten your code considerably, particularly if you work in data science or data mining. This article serves as a basic introduction to regular expressions; a later article will cover best practices and fill in some gaps (mostly the metacharacters and flags left out here).
