Python re Module

Published in CodeX · 7 min read · Aug 5, 2022

A regular expression (regex) is a sequence of characters that defines a search pattern. Python has a built-in ‘re’ module for working with regular expressions.

Before starting with the ‘re’ module, let's understand the difference between a normal string and a raw string, as we are going to use raw strings with the ‘re’ module.

A Python raw string is created by prefixing a string literal with ‘r’ or ‘R’. A raw string treats the backslash (\) as a literal character. This is useful when we want a string that contains a backslash and don’t want it to be treated as an escape character.

Let's see the difference in the below code.

# Print with a normal string
print('\tName')

# Print with a raw string
print(r'\tName')
Output-
    Name
\tName

So in the above example, the normal string treats \t as a tab, but the raw string prints it as the literal characters ‘\t’.
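As a small sketch of the same idea, the lengths of the two strings can be compared. The variable names `normal` and `raw` are only illustrative.

```python
# A normal string turns \t into a single tab character;
# a raw string keeps the backslash and the 't' as two characters.
normal = '\tName'
raw = r'\tName'

print(len(normal))  # 5
print(len(raw))     # 6

# A raw string is equivalent to doubling the backslash by hand.
print(raw == '\\tName')  # True
```

This is why regex patterns such as r'\d' are usually written as raw strings: the backslash reaches the ‘re’ module unchanged.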

To work with regular expressions, we first create a pattern that describes our desired match. Then, using the finditer() method, we iterate over every match in the string. Let's check the below code.

import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'abc')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(39, 42), match='abc'>

In the above example, we have written a pattern that matches the lowercase ‘abc’. In this string, ‘abc’ starts at index 39 and ends just before index 42, so the match object reports span=(39, 42).
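The match object also exposes the matched text and its position directly. The sketch below assumes the same sentence and uses the standard span() and group() methods; the end index is exclusive, so it can be used as a slice.

```python
import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'abc')

for match in pattern.finditer(sentence):
    start, end = match.span()   # (39, 42), end is exclusive
    print(match.group())        # abc
    print(start, end)           # 39 42
    print(sentence[start:end])  # abc
```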

Below is the list of all the patterns.

Let's explore a few patterns one by one. ‘.’ matches any single character except a newline.

import re

sentence = 'school'
pattern = re.compile(r'.')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(1, 2), match='c'>
<re.Match object; span=(2, 3), match='h'>
<re.Match object; span=(3, 4), match='o'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='l'>
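To see the "except a newline" part in action, here is a short sketch. It uses re.findall() and the re.DOTALL flag, neither of which appears elsewhere in this article; both are standard parts of the ‘re’ module.

```python
import re

text = 'ab\ncd'

# By default '.' skips the newline character.
default_matches = re.findall(r'.', text)
print(default_matches)  # ['a', 'b', 'c', 'd']

# With re.DOTALL, '.' matches the newline as well.
dotall_matches = re.findall(r'.', text, flags=re.DOTALL)
print(dotall_matches)   # ['a', 'b', '\n', 'c', 'd']
```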

‘\d’ matches a single digit (0–9). Below is the code for the same.

import re

sentence = 'school123'
pattern = re.compile(r'\d')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)
Output-
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>

Conversely, ‘\D’ matches any character that is not a digit. In regular expressions, the uppercase form of a shorthand such as ‘\d’, ‘\w’, ‘\s’ or ‘\b’ negates what the lowercase form does.

import re

sentence = 'school123'
pattern = re.compile(r'\D')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='s'>
<re.Match object; span=(1, 2), match='c'>
<re.Match object; span=(2, 3), match='h'>
<re.Match object; span=(3, 4), match='o'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='l'>
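The same lowercase/uppercase pairing applies to other shorthands. The sketch below, using re.findall() on a made-up sample string, shows ‘\w’ (word character) against ‘\W’, and ‘\s’ (whitespace) against ‘\S’.

```python
import re

sample = 'a1 -'

words = re.findall(r'\w', sample)       # letters, digits, underscore
non_words = re.findall(r'\W', sample)   # everything else
spaces = re.findall(r'\s', sample)      # whitespace
non_spaces = re.findall(r'\S', sample)  # anything but whitespace

print(words)       # ['a', '1']
print(non_words)   # [' ', '-']
print(spaces)      # [' ']
print(non_spaces)  # ['a', '1', '-']
```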

‘\b’ checks for a word boundary and ‘\B’ for a position that is not a word boundary. Word boundaries are especially handy when you need to match a sequence of letters or digits on its own, or when you want to ensure that it occurs at the start or the end of a word.

import re

sentence = 'school schoolschool'
pattern = re.compile(r'school\b')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 6), match='school'>
<re.Match object; span=(13, 19), match='school'>

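It matches ‘school’ twice. The first ‘school’ is followed by a space and the last one by the end of the string, and both positions are word boundaries. The ‘school’ at index 7 to 13 is followed directly by the letter ‘s’, so there is no boundary after it and it is skipped.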
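As a further sketch on the same string, putting ‘\b’ on both sides keeps only the standalone word, while ‘\B’ in front finds a ‘school’ that starts inside a longer word. The list names are illustrative.

```python
import re

sentence = 'school schoolschool'

# \b on both sides: only the standalone word qualifies.
whole_words = [m.span() for m in re.finditer(r'\bschool\b', sentence)]
print(whole_words)   # [(0, 6)]

# \B in front: 'school' must start in the middle of a word.
inside_words = [m.span() for m in re.finditer(r'\Bschool', sentence)]
print(inside_words)  # [(13, 19)]
```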

‘^’ matches the beginning of the string. Below is the code for the same.

import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'^My')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 2), match='My'>

Similarly ‘$’ is used to match the end of the string.

import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'school$')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(46, 52), match='school'>
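By default ‘^’ and ‘$’ anchor to the whole string. For text with several lines, the standard re.MULTILINE flag (not used elsewhere in this article) makes them anchor to each line instead. A minimal sketch on a made-up two-line string:

```python
import re

text = 'My son\nMy school'

# Without the flag, only the very start of the string counts.
single = re.findall(r'^My', text)
print(single)  # ['My']

# With re.MULTILINE, every line start counts.
multi = re.findall(r'^My', text, flags=re.MULTILINE)
print(multi)   # ['My', 'My']

# The same applies to '$': 'son' ends a line, not the string.
end_default = re.findall(r'son$', text)
end_multi = re.findall(r'son$', text, flags=re.MULTILINE)
print(end_default, end_multi)  # [] ['son']
```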

‘[]’ matches any single character listed in the brackets. For example, ‘[a-zA-Z]’ matches any letter from lowercase a to z or uppercase A to Z.

import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'[A-Z]')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='M'>

There is only one capital letter, ‘M’, in the string, so it returns only one match. Conversely, ‘[^]’ matches any single character that is not listed in the brackets.

import re

sentence = 'My little son now has started learning abc in school'
pattern = re.compile(r'[^a-z\s]')
matches = pattern.finditer(sentence)

for match in matches:
    print(match)
Output-
<re.Match object; span=(0, 1), match='M'>

This will ignore all the small letters and whitespaces, hence returning only the capital ‘M’.
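A compact way to compare a character set with its negation is to run both on the same word. This sketch uses re.findall() and a vowel set chosen purely for illustration.

```python
import re

word = 'school'

# Characters that are in the set.
vowels = re.findall(r'[aeiou]', word)
print(vowels)      # ['o', 'o']

# Characters that are NOT in the set.
non_vowels = re.findall(r'[^aeiou]', word)
print(non_vowels)  # ['s', 'c', 'h', 'l']
```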

‘()’ defines a group. Suppose we have email ids from several domains and we want to match only those from Gmail and Yahoo. We can use ‘()’ together with ‘|’ to list the alternatives.

import re

emails = """
vivek@gmail.com
vivek1@yahoo.com
vivek2@outlook.com
"""

pattern = re.compile(r'.+(gmail|yahoo)+\.(com)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 16), match='vivek@gmail.com'>
<re.Match object; span=(17, 33), match='vivek1@yahoo.com'>

In the same example, if we put ‘*’ instead of ‘+’ after the group, the pattern returns all three emails. ‘*’ means zero or more, so the group becomes optional and ‘.+’ alone can match everything before ‘.com’, including ‘vivek2@outlook’.

import re

emails = """
vivek@gmail.com
vivek1@yahoo.com
vivek2@outlook.com
"""

pattern = re.compile(r'.+(gmail|yahoo)*\.(com)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 16), match='vivek@gmail.com'>
<re.Match object; span=(17, 33), match='vivek1@yahoo.com'>
<re.Match object; span=(34, 52), match='vivek2@outlook.com'>

In the next one, we change the email id from outlook.com to outlook.edu. If we run the same pattern, we get only two matches because ‘com’ at the end is mandatory.

import re

emails = """
vivek@gmail.com
vivek1@yahoo.com
vivek2@outlook.edu
"""

pattern = re.compile(r'.+(gmail|yahoo)*\.(com)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 16), match='vivek@gmail.com'>
<re.Match object; span=(17, 33), match='vivek1@yahoo.com'>

Here we can use ‘?’ to make ‘com’ optional, as ‘?’ matches zero or one occurrence.

import re

emails = """
vivek@gmail.com
vivek1@yahoo.com
vivek2@outlook.edu
"""

pattern = re.compile(r'.+(gmail|yahoo)*\.(com)?')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 16), match='vivek@gmail.com'>
<re.Match object; span=(17, 33), match='vivek1@yahoo.com'>
<re.Match object; span=(34, 49), match='vivek2@outlook.'>

In the above example we now get three matches, but the third one stops at ‘vivek2@outlook.’ because nothing in the pattern matches ‘edu’.
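If the goal is to capture the full address including ‘.edu’, one possible approach is to list the accepted endings as alternatives inside a group. The pattern below is only a sketch for these three sample addresses, not a general email validator; the names `emails` and `found` are illustrative.

```python
import re

emails = """
vivek@gmail.com
vivek1@yahoo.com
vivek2@outlook.edu
"""

# \S+ takes the user name, then one of the listed providers
# and one of the listed endings must follow.
pattern = re.compile(r'\S+@(gmail|yahoo|outlook)\.(com|edu)')

found = [match.group() for match in pattern.finditer(emails)]
print(found)
# ['vivek@gmail.com', 'vivek1@yahoo.com', 'vivek2@outlook.edu']
```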

Now we have one more scenario, where we need to match phone numbers in the below format.

phones = """
999-4568-123
897-3258-325
698-7548-654
"""

Each number contains digits and the ‘-’ symbol, so we can use ‘\d’ to match each digit. Below is the code.

import re

phones = """
999-4568-123
897-3258-325
698-7548-654
"""

pattern = re.compile(r'\d\d\d-\d\d\d\d-\d\d\d')
matches = pattern.finditer(phones)
for match in matches:
    print(match)
Output- 
<re.Match object; span=(1, 13), match='999-4568-123'>
<re.Match object; span=(14, 26), match='897-3258-325'>
<re.Match object; span=(27, 39), match='698-7548-654'>

So it matches the phone numbers, but we need to type ‘\d’ many times, which is error-prone.

To avoid this, we can use ‘{}’ to give the exact number of repetitions. For example, before the first ‘-’, ‘\d’ occurs three times, so we can write ‘\d{3}’.

import re

phones = """
999-4568-123
897-3258-325
698-7548-654
"""

pattern = re.compile(r'\d{3}-\d{4}-\d{3}')
matches = pattern.finditer(phones)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 13), match='999-4568-123'>
<re.Match object; span=(14, 26), match='897-3258-325'>
<re.Match object; span=(27, 39), match='698-7548-654'>

With this change the code becomes much more readable and gives the same result.
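The braces also accept a range. As a small sketch on a made-up string, ‘{2,4}’ means between two and four repetitions, and ‘{2,}’ means two or more.

```python
import re

numbers = '7 42 123 98765'

# Between 2 and 4 digits: '7' is too short, and '98765'
# is cut after four digits, leaving a lone '5' unmatched.
two_to_four = re.findall(r'\d{2,4}', numbers)
print(two_to_four)  # ['42', '123', '9876']

# Two or more digits: no upper limit.
two_or_more = re.findall(r'\d{2,}', numbers)
print(two_or_more)  # ['42', '123', '98765']
```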

Now let's look at an example for group() and sub().

import re

websites = """
www.google.com
www.yahoo.com
www.gmail.com
"""

pattern = re.compile(r'(www.)?(\w+)?(\.\w+)')
matches = pattern.finditer(websites)
for match in matches:
    print(match)
Output-
<re.Match object; span=(1, 15), match='www.google.com'>
<re.Match object; span=(16, 29), match='www.yahoo.com'>
<re.Match object; span=(30, 43), match='www.gmail.com'>

Here we have three websites and our pattern matches all three. But as output, we want only the part such as ‘google.com’, without the ‘www.’ prefix.

In our regular expression we have three groups: the first, (www.), matches the ‘www.’ prefix; the second, (\w+), matches the name, such as google; and the third, (\.\w+), matches the domain suffix.

If we want a specific group in the result, we can use the group() method.

import re

websites = """
www.google.com
www.yahoo.com
www.gmail.com
"""

pattern = re.compile(r'(www.)?(\w+)?(\.\w+)')
matches = pattern.finditer(websites)
for match in matches:
    print(match.group(2))
Output -
google
yahoo
gmail

Group numbering starts at 1 (group(0) is the whole match), so group 2 is the name. We also have the sub() method, which replaces every match with a replacement string. The replacement can refer to groups by number, such as \2 and \3.

import re

websites = """
www.google.com
www.yahoo.com
www.gmail.com
"""

pattern = re.compile(r'(www.)?(\w+)?(\.\w+)')
# Replace each match with the text of groups 2 and 3, dropping 'www.'
result = pattern.sub(r'\2\3', websites)
print(result)
Output-
google.com
yahoo.com
gmail.com

Here we pass groups 2 and 3 in the replacement string and get the expected output.
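To make the group numbering concrete, the sketch below runs the same pattern on a single address and prints each group. It uses the standard groups() method, which is not shown elsewhere in this article, and the variable names are illustrative.

```python
import re

pattern = re.compile(r'(www.)?(\w+)?(\.\w+)')
match = pattern.search('www.google.com')

print(match.group(0))  # www.google.com  (the whole match)
print(match.group(1))  # www.
print(match.group(2))  # google
print(match.group(3))  # .com
print(match.groups())  # ('www.', 'google', '.com')

# sub() builds the replacement from groups 2 and 3 only.
stripped = pattern.sub(r'\2\3', 'www.yahoo.com')
print(stripped)        # yahoo.com
```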

re.match() checks for the pattern only at the beginning of the string. If the string starts with a match, it returns the match object. If the pattern occurs only somewhere later in the string, re.match() returns None.

import re

sentence = "My little son now has started learning abc in"

# Matching at the beginning of the string
result = re.match(r'My', sentence)
print(result)

# Matching in the middle of the string
result = re.match(r'started', sentence)
print(result)
Output-
<re.Match object; span=(0, 2), match='My'>
None

So if we want to find a match anywhere in the string, we can use search() instead of match(). It returns the first occurrence it finds.

import re

sentence = "My little son now has started learning abc in"

# Matching at the beginning of the string
result = re.search(r'My', sentence)
print(result)

# Matching in the middle of the string
result = re.search(r'started', sentence)
print(result)
Output-
<re.Match object; span=(0, 2), match='My'>
<re.Match object; span=(22, 29), match='started'>
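Since match() and search() return None when nothing is found, the result is usually checked before use. The sketch below reuses the phone pattern from earlier on made-up input and also shows fullmatch(), a standard method not covered above that requires the whole string to match.

```python
import re

phone = re.compile(r'\d{3}-\d{4}-\d{3}')
text = 'Call 999-4568-123 now'

at_start = phone.match(text)    # None: the text starts with 'Call'
anywhere = phone.search(text)   # found at span (5, 17)

if anywhere is not None:
    print(anywhere.group())     # 999-4568-123

# fullmatch() succeeds only if the entire string fits the pattern.
whole = phone.fullmatch('999-4568-123')
partial = phone.fullmatch('999-4568-1234')
print(whole is not None, partial is None)  # True True
```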


Written by Vivekawasthi