Regular Expression in Python

Sohan Lal Gupta
Published in Analytics Vidhya, Jan 5, 2021
https://tutorial.eyehunts.com/python/python-regex-regular-expression-re-example

A regular expression, also called RE or regex for short, is a sequence of characters that forms a search pattern. It is used to check whether that pattern occurs in a given string. At first glance it may seem complicated because of odd-looking symbols like ^$.*+- and so on, but it is a powerful tool that is worth learning. At the end of the article, we will see some useful regex examples from real-world applications.

Python Module (re)

Python has a module named re that lets you work with regular expressions. Most regex functionality lives in this module.

import re

Let us see some important functionalities of the re module.

re.findall()

It returns a list of all non-overlapping matches of the pattern in the string.

re.findall(pattern, string)

For example:
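A minimal sketch of re.findall(); the sample text and variable names are illustrative assumptions, not data from the article.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "The rain in Spain stays mainly in the plain"

# Collect every non-overlapping occurrence of the literal pattern "ain".
matches = re.findall("ain", text)
print(matches)  # ['ain', 'ain', 'ain', 'ain']
```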

re.search()

It returns a match object for the first location where the pattern matches in the string, or None if there is no match.

re.search(pattern, string)

For example:
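A short sketch of re.search(), again with an assumed sample string. The match object exposes the matched text and its position.

```python
import re

# Hypothetical sample text.
text = "The rain in Spain"

# Only the first occurrence is reported, even though "ain" appears twice.
match = re.search("ain", text)
if match:
    print(match.group(), match.span())  # ain (5, 8)

# When nothing matches, re.search() returns None.
missing = re.search("xyz", text)
print(missing)  # None
```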

re.split()

It returns a list where the string has been split at each match.

re.split(pattern, string)

For example:
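A minimal sketch of re.split(), assuming a made-up string whose fields are separated by commas, semicolons, or whitespace.

```python
import re

# Hypothetical input with mixed separators.
text = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or whitespace characters.
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['apple', 'banana', 'cherry', 'date']
```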

re.sub()

It replaces one or many matches with a string.

re.sub(pattern, replace, string, count=0)

Here, the default value of count is zero, which means every match in the string is replaced. If a non-zero count is passed, it is the maximum number of matches that will be replaced, counting from the left.

For example:
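A small sketch of re.sub() with the default count, using an assumed date string; every match is replaced.

```python
import re

# Hypothetical date string.
text = "2021-01-05"

# count defaults to 0, so all hyphens are replaced.
result = re.sub("-", "/", text)
print(result)  # 2021/01/05
```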

Another example:
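The same assumed string with count=1 shows how a non-zero count limits the number of replacements.

```python
import re

# Only the leftmost match is replaced when count=1.
limited = re.sub("-", "/", "2021-01-05", count=1)
print(limited)  # 2021/01-05
```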

Now we will look at special characters and special sequences. Using these, you can define your own regex.

Special Characters

The table below lists the special characters. After that, we will look at each of them in detail.

^ : Matches the start of the string
$ : Matches the end of the string
. : Matches any character except a newline
? : Repeats a character zero or one time
+ : Matches one or more repetitions of the preceding RE
+? : Matches one or more repetitions of the preceding RE (non-greedy)
* : Matches zero or more repetitions of the preceding RE
*? : Matches zero or more repetitions of the preceding RE (non-greedy)
\ : Signals a special sequence
| : Used for alternation
[] : Used to indicate a set of characters
() : Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group
{m} : Specifies that exactly m copies of the previous RE should be matched
{m,n} : Matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible

^ (Caret)

It matches the start of the string. For example:
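A minimal sketch of the caret, with assumed sample strings.

```python
import re

# "Hello" occurs twice, but ^ anchors the match to the start of the string.
at_start = re.findall("^Hello", "Hello world, Hello again")
print(at_start)  # ['Hello']

# "world" is present, but not at the start, so nothing matches.
not_at_start = re.findall("^world", "Hello world")
print(not_at_start)  # []
```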

$ (Dollar)

It matches the end of the string. For example:
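A matching sketch for the dollar sign, with the same kind of assumed input.

```python
import re

# $ anchors the match to the end of the string.
at_end = re.findall("world$", "Hello world")
print(at_end)  # ['world']

# "Hello" is present, but not at the end.
not_at_end = re.findall("Hello$", "Hello world")
print(not_at_end)  # []
```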

. (Dot)

It matches any character except a newline. For example:
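A short sketch of the dot; the input string is an assumption and deliberately contains a newline between 'h' and 't'.

```python
import re

# The dot stands for any single character except a newline,
# so "h\nt" is skipped while "hat", "hot" and "hit" match.
dot_matches = re.findall("h.t", "hat hot h\nt hit")
print(dot_matches)  # ['hat', 'hot', 'hit']
```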

? (Question Mark)

It matches zero or one repetition of the preceding RE. For example:
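One pattern consistent with the discussion that follows is 'ma?n' (an assumption). Here it is applied to a string with no 'a' at all.

```python
import re

# 'a?' allows zero or one 'a' between 'm' and 'n'; here there are zero.
zero_a = re.findall("ma?n", "mn")
print(zero_a)  # ['mn']
```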

In the above example, ‘a’ occurs zero times in the string. Let us see another example where ‘a’ matches with one repetition.
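With the same assumed pattern 'ma?n', a single 'a' also matches.

```python
import re

# Exactly one 'a' between 'm' and 'n' satisfies 'a?'.
one_a = re.findall("ma?n", "man")
print(one_a)  # ['man']
```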

What happens if there is more than one ‘a’ in the string (e.g. ‘maan’)? Let us see one more example:

As the above example shows, the pattern does not match if there is more than one repetition.
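Still assuming the pattern 'ma?n', the string 'maan' gives no match.

```python
import re

# Two 'a' characters exceed what 'a?' allows, so no match is found.
two_a = re.findall("ma?n", "maan")
print(two_a)  # []
```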

+ (Plus)

It matches one or more repetitions of the preceding RE. For example:
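A minimal sketch of the plus sign on an assumed input.

```python
import re

# 'b+' needs at least one 'b', so the lone "a" is not matched.
plus_matches = re.findall("ab+", "a ab abbb")
print(plus_matches)  # ['ab', 'abbb']
```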

Another example:
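A pattern and string consistent with the explanation that follows (both are assumptions) would be:

```python
import re

# Hypothetical input containing two ':' characters.
text = "From : using the : character"

# '^F' anchors at a leading 'F'; '.+' greedily consumes up to the last ':'.
greedy_plus = re.findall("^F.+:", text)
print(greedy_plus)  # ['From : using the :']
```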

In the above example, ‘^F’ means the string starts with ‘F’, and ‘.+’ matches any character one or more times up to the character ‘:’ (i.e., there is at least one character between ‘F’ and ‘:’).

Since ‘+’ is greedy, it matches as much text as possible. So it does not stop at the first ‘:’; it continues to the last ‘:’ in the string. Thus, the output is [‘From : using the :’].

+?

It matches one or more repetitions of the preceding RE (non-greedy). Non-greedy means it matches as little text as possible. For example:
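A sketch using the same assumed string as the greedy case, with only the quantifier changed to +?.

```python
import re

# Hypothetical input containing two ':' characters.
text = "From : using the : character"

# '.+?' expands one character at a time and stops at the first ':'.
lazy_plus = re.findall("^F.+?:", text)
print(lazy_plus)  # ['From :']
```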

In the above example, the match stops at the first ‘:’ because +? is non-greedy.

* (Star)

It matches zero or more repetitions of the preceding RE. It is also greedy. For example:
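A sketch with the star in place of the plus, on the same assumed string.

```python
import re

# Hypothetical input containing two ':' characters.
text = "From : using the : character"

# '.*' is greedy as well, so it also runs to the last ':'.
greedy_star = re.findall("^F.*:", text)
print(greedy_star)  # ['From : using the :']
```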

The above example behaves the same as the + example you have already seen. So what is the difference between + and *? Let us see another example:

In the above example, the pattern also matches ‘Heo’. ‘l*’ means ‘l’ zero or more times, and here the number of ‘l’ characters is zero. That is why it matches. If the pattern used ‘+’ instead of ‘*’, at least one ‘l’ would have to be present in the string.
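A sketch of the difference, assuming the pattern 'Hel*o' and the string 'Heo'.

```python
import re

# 'l*' accepts zero 'l' characters, so "Heo" matches.
star_zero = re.findall("Hel*o", "Heo")
print(star_zero)  # ['Heo']

# 'l+' requires at least one 'l', so the same string does not match.
plus_zero = re.findall("Hel+o", "Heo")
print(plus_zero)  # []
```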

Another example:
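One further illustration, with an assumed input that contains zero, one, and two 'l' characters:

```python
import re

# 'l*' matches any number of 'l' characters, including none.
star_many = re.findall("Hel*o", "Heo Helo Hello")
print(star_many)  # ['Heo', 'Helo', 'Hello']
```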

*?

It matches zero or more repetitions of the preceding RE (non-greedy). For example:
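A pattern and string consistent with the output described next (both are assumptions), with the greedy version shown for comparison:

```python
import re

# Non-greedy: 'l*?' takes as few 'l' characters as possible, which is zero.
lazy_star = re.findall("Hel*?", "Hello")
print(lazy_star)  # ['He']

# Greedy: 'l*' takes as many 'l' characters as it can.
greedy_l = re.findall("Hel*", "Hello")
print(greedy_l)  # ['Hell']
```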

Because it is non-greedy, it matches as little as possible (i.e., zero repetitions). Thus, the output is [‘He’].

\ (Backslash)

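It signals a special sequence. It also escapes a special character so that it matches literally (for example, \. matches a literal dot). Special sequences are covered in detail in the Special Sequences section.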

| (Alternation)

It is used for alternation. In A|B, where A and B can be arbitrary REs, the pattern matches either A or B. For example:
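A minimal sketch of alternation on an assumed sentence.

```python
import re

# Each match is either "cat" or "dog", in the order they occur.
either = re.findall("cat|dog", "I have a cat and a dog")
print(either)  # ['cat', 'dog']
```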

[] (Square Brackets)

It is used to indicate a set of characters. For instance:

[aeiou] : Matches a single character in the listed set
[^xyz] : Matches a single character not in the listed set
[a-z0-9] : Matches a single character between a to z or 0 to 9

For example:
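A sketch of a listed set, using an assumed input word.

```python
import re

# [aeiou] matches one vowel at a time.
vowels = re.findall("[aeiou]", "regex")
print(vowels)  # ['e', 'e']
```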

Another example:
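A sketch of a negated set, again on an assumed input.

```python
import re

# [^xyz] matches any single character that is not x, y, or z.
not_xyz = re.findall("[^xyz]", "xyzabc")
print(not_xyz)  # ['a', 'b', 'c']
```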

One more example:
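A sketch of character ranges, with an assumed input that mixes cases, digits, and punctuation.

```python
import re

# [a-z0-9] matches lowercase letters and digits only,
# so the uppercase 'P' and the '!' are skipped.
lower_or_digit = re.findall("[a-z0-9]", "Py3!")
print(lower_or_digit)  # ['y', '3']
```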

() (Parentheses)

It matches whatever regular expression is inside the parentheses and indicates the start and end of a group. For example:
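A pattern consistent with the explanation that follows; the mail-log style line and the address in it are made-up assumptions.

```python
import re

# Hypothetical line in the style of a mailbox header.
line = "From alice@example.com Sat Jan  5 09:14:16 2021"

# The group ([^ ]*) captures the non-space characters after '@'.
# With one group in the pattern, findall() returns only the captured text.
domains = re.findall(r"^From .*@([^ ]*)", line)
print(domains)  # ['example.com']
```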

In the above example, the pattern matches a string that starts with ‘From ’, then any characters up to ‘@’, and then, according to the RE inside the parentheses, any run of characters other than a space (‘ ’). Only the text matched by the RE inside the parentheses is returned.

{m}

It specifies that exactly m copies of the previous RE should be matched. For example:
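A pattern and string consistent with the output described next (both are assumptions):

```python
import re

# 'i{3}' requires exactly three consecutive 'i' characters after "Mississipp",
# so the ordinary spelling with a single trailing 'i' does not match.
exact = re.findall("Mississippi{3}", "Mississippi Mississippiii")
print(exact)  # ['Mississippiii']
```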

In the above example, ‘i{3}’ means three times ‘i’. So, the output is [‘Mississippiii’].

{m,n}

It matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example:
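A minimal sketch of a bounded repetition on an assumed input.

```python
import re

# 'a{2,3}' needs two or three 'a' characters and prefers three.
# The single "a" is too short; "aaaa" yields "aaa" and leaves one 'a' over.
bounded = re.findall("a{2,3}", "a aa aaa aaaa")
print(bounded)  # ['aa', 'aaa', 'aaa']
```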

Special Sequences

\s : Matches any whitespace character (i.e. equivalent to the class [ \t\n\r\f\v])
\S : Matches any non-whitespace character (i.e. equivalent to the class [^ \t\n\r\f\v])
\d : Matches any decimal digit (i.e. equivalent to the class [0-9])
\D : Matches any non-digit character (i.e. equivalent to the class [^0-9])
\w : Matches any alphanumeric character (i.e. equivalent to the class [a-zA-Z0-9_])
\W : Matches any non-alphanumeric character (i.e. equivalent to the class [^a-zA-Z0-9_])

These equivalent classes hold for ASCII text. With ordinary (Unicode) string patterns, Python also matches non-ASCII whitespace, digits, and word characters unless the re.ASCII flag is used.

Examples:
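A short sketch of three of these sequences on one assumed sentence.

```python
import re

# Hypothetical sample text.
text = "Order 66 shipped on 2021-01-05"

# \d+ : runs of digits.
digits = re.findall(r"\d+", text)
print(digits)  # ['66', '2021', '01', '05']

# \w+ : runs of word characters (the hyphens act as separators).
words = re.findall(r"\w+", text)
print(words)  # ['Order', '66', 'shipped', 'on', '2021', '01', '05']

# \S+ : runs of non-whitespace characters (the date stays in one piece).
chunks = re.findall(r"\S+", text)
print(chunks)  # ['Order', '66', 'shipped', 'on', '2021-01-05']
```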

Some useful regex examples of real-world applications:

Example 1: Extracting date from string
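A minimal sketch, assuming dates are written as YYYY-MM-DD; the sentence is made up, and the pattern checks only the shape of the date, not whether it is a real calendar date.

```python
import re

# Hypothetical text containing two dates.
text = "Published on 2021-01-05 and updated on 2021-02-10."

# Four digits, a hyphen, two digits, a hyphen, two digits.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2021-01-05', '2021-02-10']
```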

Example 2: Check Email is valid or not
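A simplified sketch: the pattern below is an assumption that covers common address shapes only, not the full email standard, and is_valid_email is a hypothetical helper name.

```python
import re

# Simplified pattern: local part, '@', domain, a dot, and a top-level
# domain of at least two letters.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

def is_valid_email(address):
    """Return True if the whole string fits the simplified pattern."""
    return re.fullmatch(EMAIL_PATTERN, address) is not None

print(is_valid_email("alice@example.com"))  # True
print(is_valid_email("alice@example"))      # False
print(is_valid_email("not-an-email"))       # False
```

re.fullmatch() is used so that the entire string, not just a part of it, has to fit the pattern.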

Example 3: Check URL is valid or not
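A simplified sketch for http and https URLs; the pattern and the helper name is_valid_url are assumptions, and many valid URLs (ports, IP addresses, other schemes) fall outside it.

```python
import re

# Simplified pattern: scheme, host name with a dotted top-level domain,
# and an optional path or query made of non-whitespace characters.
URL_PATTERN = r"https?://[A-Za-z0-9.-]+\.[A-Za-z]{2,}(/\S*)?"

def is_valid_url(url):
    """Return True if the whole string fits the simplified pattern."""
    return re.fullmatch(URL_PATTERN, url) is not None

print(is_valid_url("https://www.example.com/path?x=1"))  # True
print(is_valid_url("ftp://example.com"))                 # False
print(is_valid_url("http://example"))                    # False
```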

Summary

Regex is used to check whether a search pattern exists in a given string. Python has a built-in module named re for working with regular expressions. This article has covered most of the core concepts with examples.

If you have any questions or suggestions, feel free to leave them in the comments below.

Thanks for reading and have fun learning!
