Understanding Regular Expressions in Python

Pattern matching with Regex

Published in The Startup

4 min read · Jul 12, 2020

The story of regular expressions began in 1951, when the mathematician Stephen Cole Kleene described regular languages. Ken Thompson then used Kleene’s concept for pattern matching in the 1960s.

Since then, regular expressions, or regexes, have boomed in programming. Languages from Python and Java to JavaScript have adopted them, and so have text editors and many Unix tools.

Pythonistas mostly use regular expressions when they are dealing with strings. Let’s say you want to find a substring in a string object. You can either write a dozen lines of code to do that or use a regular expression directly. For this, Python comes with a built-in library, “re”, which deals with pattern matching, string search, and comparison.

Working with strings, I have bumped into regular expressions countless times. That’s why I have decided to write a pair of articles that decode regular expressions and make them easier to understand.

import re

The built-in re module lets you work with regular expressions. It provides many regex functions, such as re.search().

Let’s talk about it.

re.search(pattern, string, flags=0)

It scans the string and looks for the first location where the pattern can produce a match. If it finds one, a match object is returned. If nothing matches, None is returned.

import re

s = "I am looking for a pattern in this string"
search = re.search("pattern", s)

print(search)

In the code above, the string “s” is searched for the text “pattern”. If it is found, the match object is printed.

[Output]: <re.Match object; span=(19, 26), match='pattern'>
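Because re.search() returns None when nothing matches, it helps to check the result before using it. Here is a minimal sketch, assuming a hypothetical helper named find_first (it is not part of the re module) that returns the matched text and its span:

```python
import re


def find_first(pattern, text):
    """Return (matched_text, span) for the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # group() gives the matched text; span() gives its (start, end) indices.
    return match.group(), match.span()


s = "I am looking for a pattern in this string"
found = find_first("pattern", s)
missing = find_first("regex", s)

print(found)    # ('pattern', (19, 26))
print(missing)  # None
```

Checking for None first avoids an AttributeError from calling group() on a failed search.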

But that’s pretty straightforward. Regex matching reveals its real power when metacharacters come into play. A regex can contain special characters, called metacharacters, that stand for whole sets or repetitions of characters instead of matching literally.

While using metacharacters the same code can be written as:

import re

s = "I am looking for a pattern in this string"
search = re.search("p[a-z]+", s)

print(search)

The code behaves exactly like the previous example, but the pattern is now written with regex metacharacters.

“p” means the match starts with the letter p.
“[a-z]” means the next character is a lowercase letter from a to z.
“+” denotes that the preceding [a-z] can repeat one or more times.

Hence, p[a-z]+ says that we are looking for a substring that starts with p and is followed by one or more lowercase letters from a to z. The only p in “s” begins the word “pattern”, so that is the match we get.
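To see each piece of p[a-z]+ at work, the sketch below tries the same pattern on a few made-up strings (the sample strings are illustrative only):

```python
import re

pattern = "p[a-z]+"

# "p" followed by lowercase letters: the match stops at the first
# character that is not in a-z (here, the space).
m1 = re.search(pattern, "a pattern here")
# A lone "p" does not match, because "+" needs at least one letter after it.
m2 = re.search(pattern, "p 123")
# Uppercase letters are outside [a-z], so the match ends before "Y".
m3 = re.search(pattern, "pythonYes")

print(m1.group())  # pattern
print(m2)          # None
print(m3.group())  # python
```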

Regex Metacharacters

I have created a table of the metacharacters you will use most often to define your own regex.

. Matches any single character except a newline
^ Matches at the beginning of the string
$ Matches at the end of the string
* Matches zero or more repetitions of the preceding element
+ Matches one or more repetitions of the preceding element
[] Defines a character class
| Behaves like “or” logic
() Creates a group
\ Escapes a metacharacter or introduces a special sequence such as “\s” for whitespace
{m} Matches exactly m repetitions
{m,n} Matches from m to n repetitions, inclusive
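The anchors, alternation, and groups from the table can be tried on a short string. This is a small sketch; the variable names are chosen here purely for illustration:

```python
import re

s = "patterntolook"

# ^ anchors the match at the start of the string.
starts = re.search("^pattern", s)
# $ anchors the match at the end of the string.
ends = re.search("look$", s)
# "look" is not at the start of the string, so this search fails.
not_at_start = re.search("^look", s)
# | matches either alternative; () groups them and captures what matched.
either = re.search("pattern(to|from)look", s)

print(starts.span())    # (0, 7)
print(ends.span())      # (9, 13)
print(not_at_start)     # None
print(either.group(1))  # to
```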

. Dot

Matches any character except a newline.

import re

s = "patterntolook"
search = re.search("pattern..look", s)

print(search)
[Output]: <re.Match object; span=(0, 13), match='patterntolook'>

The two dots stand in for “to” in the string, so we get a match object back. If the same pattern had only one dot (.), we would have got None, because a single dot matches exactly one character.

import re

s = "patterntolook"
search = re.search("pattern.look", s)

print(search)
[Output]: None
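The dot is not limited to letters. The following sketch, using made-up sample strings, shows that it matches a digit or a space but, by default, not a newline:

```python
import re

# A dot matches a digit or a space just as well as a letter.
digit = re.search("pattern.look", "pattern1look")
space = re.search("pattern.look", "pattern look")
# By default, a dot does not match a newline character.
newline = re.search("pattern.look", "pattern\nlook")

print(digit.group())  # pattern1look
print(space.group())  # pattern look
print(newline)        # None
```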

[ ] character class

Square brackets create a character class: a set of characters, any one of which may match at that position in the string.

import re

s = "patterntolook"
search = re.search("pattern[a-z]", s)

print(search)
[Output]: <re.Match object; span=(0, 8), match='patternt'>

The code above looks for “pattern” followed by exactly one letter from a to z. So, we get patternt.
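A character class can also list characters explicitly or combine several ranges. The sketch below shows a few variations; the patterns are illustrative choices:

```python
import re

s = "patterntolook"

# An explicit set: the next character must be one of t, u, or v.
in_set = re.search("pattern[tuv]", s)
# A digit range: "t" is not a digit, so nothing matches.
digits = re.search("pattern[0-9]", s)
# Ranges can be combined inside one class.
combined = re.search("pattern[a-z0-9]", "pattern7look")

print(in_set.group())    # patternt
print(digits)            # None
print(combined.group())  # pattern7
```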

+ One or more repetitions

If you want the preceding character or class to appear one or more times, use the + symbol.

import re

s = "patterntolook"
search = re.search("pattern[a-z]+", s)

print(search)
[Output]: <re.Match object; span=(0, 13), match='patterntolook'>

Now the pattern asks for one or more lowercase letters after “pattern”. Because + is greedy and takes as many letters as it can, we get the whole string as output.

* Zero or more repetitions

It matches the preceding character repeated zero or more times.

re.search('foo-*bar', 'foo-bar')
[Output]: <re.Match object; span=(0, 7), match='foo-bar'>

If I do the same with zero repetitions of “-”, it still matches, and the match is equal to the whole string.

re.search('foo-*bar', 'foobar')
[Output]: <re.Match object; span=(0, 6), match='foobar'>
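The difference between * and + shows up only when the repeated character is absent. A short side-by-side sketch:

```python
import re

# "*" accepts zero or more hyphens; "+" requires at least one.
star_zero = re.search("foo-*bar", "foobar")
plus_zero = re.search("foo-+bar", "foobar")
star_many = re.search("foo-*bar", "foo---bar")
plus_many = re.search("foo-+bar", "foo---bar")

print(star_zero.group())  # foobar
print(plus_zero)          # None
print(star_many.group())  # foo---bar
print(plus_many.group())  # foo---bar
```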

\ Backslash

Working with the backslash can be tricky at times, but it is easy once you know a few notations for letters, special characters, and digits.

\d matches a digit
\w matches a word character (a letter, a digit, or an underscore)
\s matches a whitespace character (such as a space, tab, or newline)
\t, \n, \r match a tab, newline, and carriage return respectively

re.search(r'foo\dbar', 'foo1bar')
[Output]: <re.Match object; span=(0, 7), match='foo1bar'>

Here ‘\d’ matches the 1 in the original string and returns the desired output.
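The special sequences can be combined with the repetition symbols covered earlier. The sketch below writes the patterns as raw strings (r"..."), so Python passes the backslash through to the regex engine unchanged; the sample string is made up for illustration:

```python
import re

s = "order 42\tshipped"  # contains a real tab character

# One or more digits.
digits = re.search(r"\d+", s)
# One or more word characters (letters, digits, or underscore).
word = re.search(r"\w+", s)
# A digit, then a whitespace character (here the tab), then a word character.
gap = re.search(r"\d\s\w", s)

print(digits.group())     # 42
print(word.group())       # order
print(repr(gap.group()))  # '2\ts'
```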

{m,n} Matches m to n (inclusive) repetitions

When working with { }, two cases can arise: you use either {m} or {m,n}.

re.search('x-{3}x', 'x---x')
[Output]: <re.Match object; span=(0, 5), match='x---x'>

We are looking for “-” repeated exactly three times in the original string. If the number in our pattern were 4, the search would return None.

re.search('x-{4}x', 'x---x')
[Output]: None

To allow a range of repetitions instead of one exact count, we can give both m and n.

re.search('x-{3,5}x', 'x-----x')
[Output]: <re.Match object; span=(0, 7), match='x-----x'>

This time the pattern accepts “-” repeated 3, 4, or 5 times. The string has five hyphens between the x characters, so the whole string matches.
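The bounds of {m,n} can be checked by going just below and just above the range. The sketch below uses made-up test strings and also shows that the quantifier is greedy, taking as many repetitions as it is allowed:

```python
import re

# Three hyphens is within the 3 to 5 range.
three = re.search("x-{3,5}x", "x---x")
# Two hyphens are too few, and six are too many.
two = re.search("x-{3,5}x", "x--x")
six = re.search("x-{3,5}x", "x------x")
# Without the surrounding x's, the greedy quantifier takes the maximum of 5.
greedy = re.search("-{3,5}", "x------x")

print(three.group())   # x---x
print(two)             # None
print(six)             # None
print(greedy.group())  # -----
```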

Summary

This article covered regex, or regular expressions, and how you can use them in Python for string matching problems. There is a lot more to cover on this topic, so I will be writing another article that picks up where this one ends. Till then, happy regexing!

Peace!