Introduction to Python’s re Module (Part 1)

Hrishikesh Kumar
Apr 6, 2024

Regular expressions, or regex, are a vast topic in any language, and Python’s re module gives us a lot to learn. I will cover it in a series of posts so readers can take it one piece at a time. Alright, enough chit-chat, let’s dive in.

Regular expressions are text-matching patterns described with a formal syntax. The patterns are interpreted as a set of instructions, which are then executed with a string as input to produce a matching subset or modified version of the original. Expressions can include literal text matching, repetition, pattern composition, branching, and other sophisticated rules. The syntax used in Python’s re module is based on the syntax used for regular expressions in Perl, with a few Python-specific enhancements.

Finding Patterns in Text

The most common use for re is to search for patterns in text. The search() function takes the pattern and the text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.

import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print('Found "{}"\nin "{}"\nfrom {} to {} ("{}")'.format(
    match.re.pattern, match.string, s, e, text[s:e]))

[Figure: the result of the above code]
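
Because search() returns None when nothing matches, calling start() on the result would raise an AttributeError in that case. Here is a minimal sketch of a guarded lookup. The helper name find_span is hypothetical and is not part of the re module.

```python
import re


def find_span(pattern, text):
    """Return (start, end) of the first match, or None if there is none.

    find_span is a hypothetical helper for illustration, not part of re.
    """
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no Match object to inspect
        return None
    return match.start(), match.end()


text = 'Does this text match the pattern?'
print(find_span('this', text))   # (5, 9)
print(find_span('those', text))  # None
```

Checking for None before touching the Match object keeps the happy path and the no-match path explicit.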

Compiling Expressions

Although re includes module-level functions for working with regular expressions as text strings, it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a compiled pattern object (re.Pattern in current Python 3 releases).

The module-level functions maintain a cache of compiled expressions, but the size of the cache is limited, and using compiled expressions directly avoids the overhead associated with cache lookup. Another advantage of using compiled expressions is that by precompiling all of the expressions when the module is loaded, the compilation work is shifted to the application start time, instead of occurring at a point where the program may be responding to a user action.

import re

# Precompile the patterns
regexes = [
    re.compile(p)
    for p in ['this', 'that']
]
text = 'this is really absurd right that code can be copied?'

print('Text: {!r}\n'.format(text))

for regex in regexes:
    print('Seeking "{}" ->'.format(regex.pattern),
          end=' ')

    if regex.search(text):
        print('match!')
    else:
        print('no match')
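
The idea of shifting compilation to load time can be sketched as follows, assuming the patterns live in a module-level constant. The names KEYWORDS and found_keywords are hypothetical and chosen only for this illustration.

```python
import re

# Compiled once, when the module is imported. KEYWORDS is a hypothetical
# name used only for this sketch.
KEYWORDS = {word: re.compile(word) for word in ('this', 'that')}


def found_keywords(text):
    """Return the keywords whose precompiled pattern occurs in text."""
    return [word for word, regex in KEYWORDS.items() if regex.search(text)]


print(found_keywords('is that so?'))             # ['that']
print(isinstance(KEYWORDS['this'], re.Pattern))  # True
```

Every call to found_keywords() reuses the same compiled objects, so no compilation or cache lookup happens while the function runs.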

Multiple matches

The search() function returns only the first match. So how do we find every occurrence of a pattern in a string?

The findall() function of re returns all of the substrings of the input that match the pattern without overlapping.

import re

text = 'abbaaabbbbaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print('Found {!r}'.format(match))
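
To see what "without overlapping" means in practice, here is a small sketch with a pattern that could overlap itself. The input strings are made up for this illustration.

```python
import re

# 'aaaa' contains 'aa' at offsets 0, 1 and 2, but once a match is
# consumed the scan resumes after it, so only offsets 0 and 2 count.
even = re.findall('aa', 'aaaa')
odd = re.findall('aa', 'aaaaa')

print(even)  # ['aa', 'aa']
print(odd)   # ['aa', 'aa'] (the final lone 'a' is left over)
```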

The finditer() function returns an iterator that produces Match instances instead of the strings returned by findall().

import re

text = 'abbaaabbbbaaaaabbbbbabababa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found {!r} at {:d}:{:d}'.format(
        text[s:e], s, e))

This example finds every occurrence of ab in the longer input (six in total), and each Match instance shows where it was found in the original string.
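
When only the positions are needed, the start and end offsets can be collected in one pass. This is a minimal sketch using Match.span(), which returns the (start, end) pair as a tuple.

```python
import re

text = 'abbaaabbbbaaaaabbbbbabababa'

# span() returns the same values as (match.start(), match.end())
spans = [match.span() for match in re.finditer('ab', text)]

print(spans)
# [(0, 2), (5, 7), (14, 16), (20, 22), (22, 24), (24, 26)]
```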

So far we have used only the literal text-matching capability of the re module. Let’s explore some other ways Python’s re module can help us find patterns in a string.

Repetition in a pattern

Here’s a brief overview of the quantifiers (metacharacters) you can use to match repetition in a pattern:

  1. * - Matches zero or more occurrences of the preceding element.
  2. + - Matches one or more occurrences of the preceding element.
  3. ? - Matches zero or one occurrence of the preceding element.
  4. {m} - Matches exactly m repetitions of the preceding element.
  5. {m,n} - Matches between m and n repetitions of the preceding element, inclusive (written without a space after the comma).
  6. {m,} - Matches m or more repetitions of the preceding element.
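
The quantifiers listed above can be compared side by side with a small sketch. It uses re.fullmatch(), which is not covered in this article and succeeds only when the whole string matches the pattern. The sample strings are made up for this illustration.

```python
import re

samples = ['a', 'ab', 'abb', 'abbb', 'abbbb']
patterns = ['ab*', 'ab+', 'ab?', 'ab{3}', 'ab{2,3}', 'ab{2,}']

# For each pattern, keep the samples that it matches in full
matches = {
    p: [s for s in samples if re.fullmatch(p, s)]
    for p in patterns
}

for p in patterns:
    print('{:<8} {}'.format(p, matches[p]))

# ab*      ['a', 'ab', 'abb', 'abbb', 'abbbb']
# ab+      ['ab', 'abb', 'abbb', 'abbbb']
# ab?      ['a', 'ab']
# ab{3}    ['abbb']
# ab{2,3}  ['abb', 'abbb']
# ab{2,}   ['abb', 'abbb', 'abbbb']
```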

To avoid repeating code, we will use the following helper function to run the examples.

import re


def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            # Pad with dots so each match lines up under its position
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()
    return

Let’s try some repetition patterns using the function above and the quantifiers listed earlier.

test_patterns(
    'abbaabbba',
    [('ab*', 'a followed by zero or more b'),
     ('ab+', 'a followed by one or more b'),
     ('ab?', 'a followed by zero or one b'),
     ('ab{3}', 'a followed by three b'),
     ('ab{2,3}', 'a followed by two to three b')],
)

[Figure: output for the text patterns example]

When processing a repetition instruction, re will usually consume as much of the input as possible while matching the pattern. This so-called greedy behavior may result in fewer individual matches, or the matches may include more of the input text than intended. Greediness can be turned off by following the repetition instructions with ?.

test_patterns(
    'abbaabbba',
    [('ab*?', 'a followed by zero or more b'),
     ('ab+?', 'a followed by one or more b'),
     ('ab??', 'a followed by zero or one b'),
     ('ab{3}?', 'a followed by three b'),
     ('ab{2,3}?', 'a followed by two to three b')],
)

Disabling greedy consumption of the input for any of the patterns where zero occurrences of b are allowed means the matched substring does not include any b characters.
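
The difference between greedy and non-greedy matching is easier to see with the results side by side. This is a minimal sketch using findall() on the same input; the variable names are chosen only for this illustration.

```python
import re

text = 'abbaabbba'

# Each pair is (greedy pattern, non-greedy counterpart)
pairs = [
    ('ab*', 'ab*?'),
    ('ab+', 'ab+?'),
    ('ab?', 'ab??'),
    ('ab{2,3}', 'ab{2,3}?'),
]

results = {}
for greedy, lazy in pairs:
    results[greedy] = re.findall(greedy, text)
    results[lazy] = re.findall(lazy, text)
    print('{!r:>10} -> {}'.format(greedy, results[greedy]))
    print('{!r:>10} -> {}'.format(lazy, results[lazy]))

# 'ab*'      -> ['abb', 'a', 'abbb', 'a']
# 'ab*?'     -> ['a', 'a', 'a', 'a']
# 'ab+'      -> ['abb', 'abbb']
# 'ab+?'     -> ['ab', 'ab']
# 'ab?'      -> ['ab', 'a', 'ab', 'a']
# 'ab??'     -> ['a', 'a', 'a', 'a']
# 'ab{2,3}'  -> ['abb', 'abbb']
# 'ab{2,3}?' -> ['abb', 'abb']
```

The non-greedy forms stop at the minimum number of repetitions the quantifier allows, which is why ab*? and ab?? never include a b.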

In the next part, we will continue with character sets and escape codes.
