Python Diaries ….. Day 13

Regular expressions

Nishitha Kalathil
7 min read · Sep 21, 2023

Welcome to Day 13 of Python Diaries! Today, we’ll be delving into the powerful world of regular expressions, often abbreviated as regex. Regular expressions are a versatile tool for pattern matching and manipulation of text data.

What are Regular Expressions?

A regular expression is a sequence of characters that forms a search pattern. It can be used to perform searches, substitutions, and validations on strings, based on certain patterns.

Creating a Regular Expression

Regular expressions are created using a specific syntax that defines patterns you want to match in strings.

In Python, the re module provides functions for working with regular expressions.

Here are some key components and concepts of regular expressions in Python:

Metacharacters:

Metacharacters are special characters with a reserved meaning in regular expressions.
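As a brief illustration, here is a minimal sketch of a few common metacharacters (., ^, $, + and |). The sample strings and variable names are made up for this example.

```python
import re

greeting = 'Hello, world'  # sample text, made up for this example

# '.' matches any single character except a newline
dot_matches = re.findall(r'c.t', 'cat cot cut ct')

# '^' anchors the pattern to the start of the string, '$' to the end
starts_with_hello = bool(re.search(r'^Hello', greeting))
ends_with_world = bool(re.search(r'world$', greeting))

# '+' means one or more repetitions of the preceding item
digit_runs = re.findall(r'[0-9]+', 'a1 b22 c333')

# '|' means either the pattern on its left or the pattern on its right
pets = re.findall(r'cat|dog', 'cat, dog, bird')

print(dot_matches)        # ['cat', 'cot', 'cut']
print(starts_with_hello)  # True
print(ends_with_world)    # True
print(digit_runs)         # ['1', '22', '333']
print(pets)               # ['cat', 'dog']
```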

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

from W3schools
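To make a few of these sequences concrete, the sketch below tries \d, \w, \s and \b on a sample sentence. The sentence is invented for illustration only.

```python
import re

sample = 'Order 66 shipped on 2023-09-21'  # made-up sample text

digits = re.findall(r'\d+', sample)        # \d matches a digit
words = re.findall(r'\w+', sample)         # \w matches a letter, digit, or underscore
spaces = re.findall(r'\s', sample)         # \s matches a whitespace character
whole_word = re.search(r'\bon\b', sample)  # \b matches at a word boundary

print(digits)              # ['66', '2023', '09', '21']
print(words)               # ['Order', '66', 'shipped', 'on', '2023', '09', '21']
print(len(spaces))         # 4
print(whole_word.start())  # 17
```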

Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:
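A short sketch of three kinds of sets follows: an explicit list of characters, a range, and a negated set. The sample string is an assumption chosen for this example.

```python
import re

sample = 'Room 42B, floor 3'  # made-up sample text

vowels = re.findall(r'[aeiou]', sample)      # any one of the listed characters
lowercase = re.findall(r'[a-z]+', sample)    # a range: runs of lowercase letters
non_digits = re.findall(r'[^0-9]+', sample)  # '^' as the first character negates the set

print(vowels)      # ['o', 'o', 'o', 'o']
print(lowercase)   # ['oom', 'floor']
print(non_digits)  # ['Room ', 'B, floor ']
```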

Raw Strings:

Raw strings in regular expressions in Python are strings that are prefixed with an r. They are used to represent regular expression patterns without processing any escape sequences, which can be especially useful since regular expressions often involve backslashes (\) to denote special characters.

For example, consider the pattern \d, which matches a single digit. In a regular (non-raw) string, you have to escape the backslash for Python by writing \\ so that the regex engine receives a literal backslash:

pattern = '\\d'  # Python turns \\ into a single backslash, so the regex engine sees \d

In a raw string, you can write the regular expression more naturally:

pattern = r'\d'  # Matches a single digit
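The difference is easy to check directly. The sketch below compares an escaped regular string with a raw string, and shows why \b is a common trap: in a regular string Python reads it as a backspace character before the regex engine ever sees it.

```python
import re

escaped = '\\d'  # regular string: the backslash must be doubled
raw = r'\d'      # raw string: the backslash is written once

print(escaped == raw)  # True, both hold a backslash followed by 'd'
print(len(raw))        # 2

backspace = '\b'   # regular string: a single backspace character
boundary = r'\b'   # raw string: backslash + 'b', the regex word boundary
print(len(backspace), len(boundary))  # 1 2

found = re.findall(r'\d', 'a1b2')
print(found)  # ['1', '2']
```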

Functions

In Python, the re module provides functions for working with regular expressions. Here are some of the most commonly used functions:

The “flags” parameter

The flags parameter allows you to modify the behavior of regular expression pattern matching. It provides additional options that control aspects like case sensitivity, multi-line matching, and more. Here are some commonly used flags:

re.IGNORECASE or re.I:

  • This flag makes the pattern matching case-insensitive. It allows the regular expression to match characters regardless of whether they are upper or lower case.

import re

pattern = re.compile(r'apple', re.IGNORECASE)
result = pattern.findall('I have an Apple and an apple')
print(result)

# output
['Apple', 'apple']

re.MULTILINE or re.M:

  • This flag allows the ^ and $ anchors to match at the beginning and end of each line in a multi-line string. Without this flag, they only match at the beginning and end of the entire string.

import re

pattern = re.compile(r'^start', re.MULTILINE)
result = pattern.findall('start of line 1\nstart of line 2')
print(result)

# output:
['start', 'start']

re.DOTALL or re.S:

  • This flag allows the . metacharacter to match any character, including newlines (\n). Without it, . matches any character except a newline.

import re

pattern = re.compile(r'.+', re.DOTALL)
result = pattern.findall('Line 1\nLine 2\nLine 3')
print(result)

# output
['Line 1\nLine 2\nLine 3']

re.VERBOSE or re.X:

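  • This flag lets you write more readable regular expressions: whitespace inside the pattern is ignored (except in a character class or when escaped), and everything from # to the end of the line is treated as a comment.

import re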

pattern = re.compile(r'''
\d+ # Match one or more digits
\s* # Match zero or more whitespace characters
[a-zA-Z]+ # Match one or more letters
''', re.VERBOSE)
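To see a verbose pattern in use, here is a minimal sketch that applies a pattern of the same shape to a sample string. The name quantity and the sample text are assumptions made for this example.

```python
import re

# Same shape as the pattern above: digits, optional whitespace, letters
quantity = re.compile(r'''
    \d+        # one or more digits
    \s*        # zero or more whitespace characters
    [a-zA-Z]+  # one or more letters
''', re.VERBOSE)

matches = quantity.findall('10 kg, 250ml and 3   m')
print(matches)  # ['10 kg', '250ml', '3   m']

# The compact form of the same pattern finds the same matches
compact = re.findall(r'\d+\s*[a-zA-Z]+', '10 kg, 250ml and 3   m')
print(compact == matches)  # True
```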

These flags can be used individually or combined using the | (pipe) operator. For example, if you want to use both re.IGNORECASE and re.MULTILINE, you can use re.IGNORECASE | re.MULTILINE.
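A small sketch of combining flags follows, using a made-up log-like string. It compares the combined flags with each flag on its own.

```python
import re

text = 'Error: disk full\nerror: retrying\nok'  # made-up sample text

# Case-insensitive AND line-aware: ^ matches at the start of every line
both = re.findall(r'^error', text, re.IGNORECASE | re.MULTILINE)

# Case-insensitive only: ^ matches only at the start of the whole string
only_ignorecase = re.findall(r'^error', text, re.IGNORECASE)

# Line-aware only: the capitalized 'Error' on the first line is skipped
only_multiline = re.findall(r'^error', text, re.MULTILINE)

print(both)             # ['Error', 'error']
print(only_ignorecase)  # ['Error']
print(only_multiline)   # ['error']
```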

The search() Function

Searches for a match anywhere in the string. It stops at the first occurrence.

import re

# Search for the pattern anywhere in the string.
pattern = 'world'
string = 'Hello, world!'

match = re.search(pattern, string)
if match:
    print('The string matches the pattern.')
else:
    print('The string does not match the pattern.')

# output
The string matches the pattern.

More examples:

txt = "I am beautiful"
x = re.search(r"\s", txt)  # Search for the first white-space character in the string
print(x)
print("The first white-space character is located in position:", x.start())

x = re.search("not", txt)  # Search for the pattern "not" in the string
print(x)

x = re.search("..am", txt)  # Look for any two characters followed by the string "am"
print(x)

output is:

<re.Match object; span=(1, 2), match=' '>
The first white-space character is located in position: 1
None
<re.Match object; span=(0, 4), match='I am'>

Another example uses capturing groups to pull out parts of the match:

text = "Python is simpler than other languages"

searchObj = re.search(r'(.*) is (.*?) .*', text, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")

output:

searchObj.group() : Python is simpler than other languages
searchObj.group(1) : Python
searchObj.group(2) : simpler
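Besides group(), a match object exposes a few other helpers. The sketch below uses a simplified pattern and a made-up sentence to show groups(), span() and start().

```python
import re

m = re.search(r'(\w+) is (\w+)', 'Python is simple')  # made-up sample sentence

if m:  # search() returns None when nothing matches, so check first
    all_groups = m.groups()         # every captured group as a tuple
    whole_span = m.span()           # (start, end) of the whole match
    second_group_at = m.start(2)    # start index of the second group
    print(all_groups)       # ('Python', 'simple')
    print(whole_span)       # (0, 16)
    print(second_group_at)  # 10
```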

The match() Function

Searches for a match at the beginning of the string. Returns a match object if successful, or None if there's no match.

# Match the pattern at the beginning of the string.
pattern = '^Hello, world!'
string = 'Hello, world!'

match = re.match(pattern, string)
if match:
    print('The string matches the pattern.')
else:
    print('The string does not match the pattern.')

# output
The string matches the pattern.

Matching Versus Searching

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

line = "Python is simpler than other languages"

matchObj = re.match(r'languages', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

searchObj = re.search(r'languages', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group() :", searchObj.group())
else:
    print("Nothing found!!")

output:

No match!!
search --> searchObj.group() : languages

The re.match() function tries to find the pattern 'languages' at the beginning of the string. Since it's not at the beginning, it returns None, and the message "No match!!" is printed.

The re.search() function searches for the pattern 'languages' anywhere in the string. It finds a match within the sentence, and the message "search --> searchObj.group() : languages" is printed, indicating that the pattern was found.
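One detail worth checking for yourself: re.MULTILINE does not change where match() starts. The sketch below, with a made-up two-line string, shows that match() still looks only at the very start of the string, while search() with ^ can find the start of a later line.

```python
import re

text = 'first line\nsecond line'  # made-up sample text

# match() only tries position 0, even with re.MULTILINE
match_result = re.match(r'second', text, re.MULTILINE)

# search() with ^ and re.MULTILINE can match at the start of any line
search_result = re.search(r'^second', text, re.MULTILINE)

print(match_result)           # None
print(search_result.start())  # 11
```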

The findall() Function

Finds all matches in the string and returns a list of them.

txt = "You are great "

x = re.findall("re", txt)
print(x)

x = re.findall("funny", txt)
print(x)

output:

['re', 're']
[]

Another example extracts email addresses from a string:

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  # ['alice@google.com', 'bob@abc.com']
for email in emails:
    print(email)

output:

alice@google.com
bob@abc.com
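When the pattern contains more than one capturing group, findall() returns a list of tuples instead of a list of strings. Here is a minimal sketch that splits each address into a user part and a domain part. The group layout is an assumption added for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two groups: the part before @ and the part after it
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

for user, domain in pairs:
    print(user, domain)

print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]
```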

The finditer() Function

Returns an iterator that produces match objects for all non-overlapping matches of the pattern in the string.

statement = "Please contact us at: support@india.com, xyz@hr.com"

# 'addresses' is an iterator that yields a match object for each match
addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

output:

<re.Match object; span=(22, 39), match='support@india.com'>
<re.Match object; span=(41, 51), match='xyz@hr.com'>
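Because finditer() yields match objects rather than plain strings, you can read the matched text and its position from each one. A short sketch using the same statement:

```python
import re

statement = "Please contact us at: support@india.com, xyz@hr.com"

# Collect the matched text plus its start and end index for every match
found = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r'[\w.-]+@[\w.-]+', statement)
]

print(found)  # [('support@india.com', 22, 39), ('xyz@hr.com', 41, 51)]
```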

The split() Function

Splits the string into a list of substrings using the pattern as a delimiter.

txt = "I am beautiful"
x = re.split(r"\s", txt)  # Split at each white-space character
print(x)

x = re.split(r"\s", txt, maxsplit=1)  # Split the string only at the first occurrence
print(x)

output:

['I', 'am', 'beautiful']
['I', 'am beautiful']
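Where re.split() goes beyond str.split() is splitting on several different delimiters at once. The sketch below uses a made-up row of values separated by commas or semicolons with uneven spacing.

```python
import re

row = 'apple, banana;cherry ,  date'  # made-up sample text

# Split on a comma or semicolon, swallowing any whitespace around it
parts = re.split(r'\s*[,;]\s*', row)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# If the pattern has a capturing group, the delimiters are kept in the result
with_delimiters = re.split(r'([,;])', 'a,b;c')
print(with_delimiters)  # ['a', ',', 'b', ';', 'c']
```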

The sub() Function

Substitutes the matched text with a replacement string.

txt = "I am beautiful and smart"
x = re.sub(r"\s", "9", txt)  # Replace every white-space character with "9"
print(x)

x = re.sub(r"\s", "9", txt, count=2)  # Replace only the first two occurrences
print(x)

output:

I9am9beautiful9and9smart
I9am9beautiful and smart
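
With count=1, only the first run of whitespace is removed:

# A string containing a newline; the trailing backslash joins the two source lines
string = 'abc 12\
de 23 \n f45 6'

# Matches one or more whitespace characters
pattern = r'\s+'
replace = ''

new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

output:

abc12de 23 
 f45 6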
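The replacement passed to sub() can also refer back to captured groups, or be a function that computes the new text. A minimal sketch of both follows. The sample date and the helper name double are assumptions made for this example.

```python
import re

# Backreferences: \1, \2, \3 refer to the captured groups
date = '2023-09-21'
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)
print(swapped)  # 21/09/2023

# A function as replacement: it receives each match object
def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, 'abc 12 de 23')
print(doubled)  # abc 24 de 46
```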

Regular expressions provide a powerful and flexible way to search for, extract, and manipulate text based on specific patterns or templates. They are widely used in tasks like text processing, data extraction, and search operations.

You can access the other topics in this tutorial series right here:

On Day 14, we’ll explore object-oriented programming (OOP) in Python. Keep up the great work! Happy coding!
