How to write efficient Regular Expressions in Python

So, a few months ago, I had to find the quickest way to apply a regular expression to a huge file in Python. While doing so, I learned quite a few things that I’d like to share.

War and Peace

A series of important books

When I started searching for a text corpus I could use for this article, my mind wandered off and I remembered being at my grandparents’ house in front of the library, looking at a series of huge leather-backed books.

They were a multi-volume edition of “War and Peace” (wikipedia), the novel telling the story of five aristocratic Russian families during the French invasion of Russia in 1812. It took its author Leo Tolstoy six years to write.

I grabbed a download link from Project Gutenberg and voilà!

Regexp in Python

Regular expressions (also called regexps) are sequences of characters that define a search pattern (thanks wikipedia!). Basically, they’re a tool that allows you to filter, extract or transform strings of characters.

In Python, there is a built-in library dedicated to regular expressions
called re. You simply import it and use the functions it provides (search, match, findall, etc.). Functions like search and match return a Match object with some useful methods to manipulate your results.

>>> import re
>>> my_string = 'I like tortilla.'
# 'like (\w+)' is a regexp that matches 'like' followed by a space followed by
# any number of word characters ('\w' means 'word character' and '+' means '1 or more'),
# putting those word characters in a capture group (the parentheses)
>>> regex_result = re.search(r'like (\w+)', my_string)
>>> print(regex_result.groups())
('tortilla',)

Why did I prefix my regular expression string with an r? This tells Python to treat it as a raw string, disabling the escaping of characters such as \.
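A quick illustration of the difference:

```python
import re

# In a normal string literal, Python itself consumes one level of
# backslashes, so matching a single literal '\' needs four of them:
print(re.search('\\\\', 'a\\b') is not None)  # True
# With a raw string, the backslashes reach re untouched:
print(re.search(r'\\', 'a\\b') is not None)   # True
# Both spell the same two-character pattern:
assert '\\\\' == r'\\'
```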

One cool thing I like about Python regexps is that you can assign names to capture groups and work with dictionaries.

>>> regex_result = re.search(r'like (?P<food>\w+)', my_string)
>>> print(regex_result.groupdict())
{'food': 'tortilla'}

Finding princes: the easy approach

Now, let’s say I want to know which royal heirs of “War and Peace” are the most mentioned. I could try to answer this with Python and regexps.
A first naive approach would be:

import re
from collections import defaultdict

path_war_and_peace = 'war_and_peace.txt'
# My regular expression, matching 'Prince' and 'Princess' plus the next word
regex_prince = r'prince(?:ss)? (\w+)'
# I'm going to store results in a dict which returns 0 when we ask for a non-existing key
result = defaultdict(int)
with open(path_war_and_peace) as f:
    for line in f:
        # Note the re.IGNORECASE flag to be case insensitive
        re_result = re.findall(regex_prince, line, re.IGNORECASE)
        for name in re_result:
            result[name] += 1

Let’s put this in a script called 00_naive.py and time its execution.

To time the execution, I’m using another Python module called timeit
with a command that looks like this:

python3 -m timeit -n 20 -r 5 -s 'import os' 'os.system("python 00_naive.py")'

I’m using Python 3.6 on my work laptop, a Lenovo V110 with 11.5GiB of RAM,
an Intel Core i5 processor and Ubuntu 16.04.

I get:

$ python3 -m timeit -n 20 -r 5 -s 'import os' 'os.system("python 00_naive.py")'
20 loops, best of 5: 597 msec per loop

We launched 20 timing loops, repeating the measurement 5 times, so in total the program ran 100 times. In the end, we got the best time per loop (execution): 597 msec. That seems good.

But we could have gone faster.

Building a list vs. iterating

I used the findall function but the documentation mentions a finditer function. What’s the difference?

findall returns a list containing all matches,
while finditer returns an iterator over matches.

In Python, an iterator represents a stream of data:
iterating over an iterator means asking for the next element
until none are left. When working on big datasets, iterators are generally faster (mostly because of memory allocation matters).
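A quick sketch of the difference, reusing the earlier example string (extended so there are two matches):

```python
import re

my_string = 'I like tortilla. I like paella.'
# findall builds the whole list of captured groups at once:
print(re.findall(r'like (\w+)', my_string))  # ['tortilla', 'paella']
# finditer yields Match objects one at a time, on demand:
for m in re.finditer(r'like (\w+)', my_string):
    print(m.group(1))
```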

Just replacing findall with finditer in the code changes the execution time from 597 msec to 667 msec. Tinkering a bit with timeit and re, I found out that findall is faster than finditer when working on small strings. And while findall is faster at returning no matches (creating an empty list) than matches, it’s the opposite for finditer, which is slower at returning no matches than matches. Since our text has lots of lines that don’t match our regexp, finditer is generally (a bit) slower than findall.

Can we go even faster please?

Compiling the regexp


Smart people compile their Python (Slavic composers — Ilya Repin)

If we read the documentation, it says something about compiling regular expressions.
“Compiling” in Python? What witchcraft is that?

In this precise case, it makes sense: in order to go fast, Python will transform the regexp into bytecode that a C engine can then use to perform the matching at blazingly fast speed (see the Regular Expression HOWTO docs). Python is more linked to C than you would think, and it’s a very interesting subject. If you’re interested in this, you can start your research with keywords like CPython, Cython, PyPy and Python CFFI.

When you use a module-level function like re.findall, under the hood the re library creates a Pattern object, compiles the regular expression to bytecode, keeps it in an attribute and links the .findall method of the Pattern to that compiled code.
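You can see the same Pattern object yourself; a minimal sketch (the module-level functions also keep a small internal cache of recently compiled patterns, but each call still pays the cache lookup and argument handling):

```python
import re

# re.compile returns a Pattern object with the regexp already
# compiled; its bound methods skip the per-call lookup work.
pattern = re.compile(r'prince(?:ss)? (\w+)', re.IGNORECASE)
print(pattern.findall('Princess Mary greeted Prince Andrew'))
# ['Mary', 'Andrew']
```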

Since we want to apply the same regexp to our whole set of novels, let’s compile it:

import re
from collections import defaultdict

path_war_and_peace = 'war_and_peace.txt'
regex_prince = r'prince(?:ss)? (\w+)'
# compiled_regex is a Pattern object
compiled_regex = re.compile(regex_prince, re.IGNORECASE)
result = defaultdict(int)
with open(path_war_and_peace) as f:
    for line in f:
        re_result = compiled_regex.findall(line)
        for name in re_result:
            result[name] += 1

Let’s time this:

$ python3 -m timeit -n 20 -r 5 -s 'import os' 'os.system("python 02_recompile.py")'
20 loops, best of 5: 402 msec per loop

402ms, now that’s some gain!
And the longer the text, the higher the gain.
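To actually answer the original question, we can sort the counts once the loop is done; a minimal sketch with hypothetical numbers (the real counts depend on your edition of the text):

```python
from collections import Counter

# Suppose `result` holds the name counts built in the loop above;
# these values are made up, for illustration only:
result = {'Andrew': 700, 'Vasili': 200, 'Mary': 500}
# Counter.most_common sorts entries by count, highest first
print(Counter(result).most_common(2))  # [('Andrew', 700), ('Mary', 500)]
```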


Notes and conclusion

Now, some could argue that Pandas is better for this kind of operation. But I don’t think it would be as fast as the pure Python version we wrote: we would lose a lot of time loading the dataset into a DataFrame.

Other solutions could give a further speed-up, such as using the regex library or splitting the data into several parts and counting in parallel (a strategy related to map-reduce, a highly relevant algorithm in Big Data). You can find entire articles and books on this subject. However, as this goes beyond the scope of this article, I just want to leave the reader with this takeaway:

By applying just a bit of profiling (and actually taking the time to read the documentation), we can gain performance with only minimal changes in the code.

And also, that we live in an era where it takes less than a second for a mid-range laptop to process what is considered one of the best pieces of literature humanity has produced.