The incredible power of Python’s replace regex

Photo by Ashkan Forouzani on Unsplash

Regular expressions are available in almost every programming language for matching, searching, and extracting intricate text patterns. A regular expression is a pattern written in a small, specialized grammar that describes which sequences of characters in a string should match. They come into play in a wide variety of applications. One example is password and username validation: it’s very likely that a website you created an account on validated your username or password with a regular expression.

One regular expression operation that’s not talked about as much is replacement. Like matching and searching, replacement looks for a pattern to match. However, it then creates a new string in which each matched pattern is replaced with a replacement string. The replacement can be a literal string, such as matching [0-9]+ and replacing it with "abc". Such an operation would replace every digit sequence in a string with the character sequence "abc". Regex replacements can be done globally across the string, or limited to at most n occurrences.
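As a minimal sketch of the literal replacement just described (the sample string is made up for illustration), re.sub replaces every match by default, and its count argument caps the number of replacements:

```python
import re

text = "order 66 shipped 1200 units in 3 days"  # hypothetical sample input

# Replace every run of digits with the literal string "abc".
replaced_all = re.sub(r"[0-9]+", "abc", text)

# count=1 limits the substitution to the first match only.
replaced_first = re.sub(r"[0-9]+", "abc", text, count=1)

print(replaced_all)    # order abc shipped abc units in abc days
print(replaced_first)  # order abc shipped 1200 units in 3 days
```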

The most powerful features of replace regexes, though, are template strings and functions used as replacements. Unlike literal strings, template strings allow captured groups from the original match to be interpolated into the replacement. Functions go further: a replacement function receives the match object as its parameter and returns whatever string is desired as the replacement, in a callback relationship. Function replacers can also call re.sub again on their matched groups, which makes it possible to apply a regex recursively.

In this article, a few techniques and tricks for regex replacement, which Python provides through re.sub, will be discussed and demonstrated.

Template strings and groups

The first and most important concept in regex replacement is the group. Without groups, replacements can’t do much at all. With groups, a replace regex can reorder and reformat patterns in strings. To do that, a template string is used that marks the places where the groups should be inserted. First, let’s take a look at the following example:
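A minimal sketch of such a call, written to be consistent with the output and the description that follow. The input string is an assumption chosen to produce that output, and ordinary (non-raw) strings are used, so every backslash is doubled:

```python
import re

# Assumed input: two unformatted ten-digit phone numbers.
numbers = "3211109245 2212436690"

# Three capture groups: area code, prefix, line number.
# \\1, \\2 and \\3 in the template insert the groups in the new layout.
formatted = re.sub("(\\d{3})(\\d{3})(\\d{4})", "(\\1) \\2-\\3", numbers)

print(formatted)  # (321) 110-9245 (221) 243-6690
```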

The above code, when run, will print (321) 110-9245 (221) 243-6690. In the first argument to the re.sub function, the regex specifies three groups of three, three, and four digits, respectively. This pattern is designed to match 10-digit phone numbers written as a bare run of digits, without the punctuation usually used in contact information to set off the area code. The second argument, the replacement string, is a template that specifies where the captured groups are interpolated. In Python’s regex syntax, the notation \x, where x is an integer from 1 to 99 inclusive, refers to the xth captured group in the regex. Since the code uses ordinary strings and not raw strings, each \ must be escaped as \\.
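For comparison, here is a sketch of the same substitution using raw strings, which avoid the doubled backslashes, and the explicit \g<n> form of a group reference. The input string is again an assumption; \g<n> is handy when a digit must directly follow the group in the template:

```python
import re

numbers = "3211109245 2212436690"  # assumed sample input
pattern = r"(\d{3})(\d{3})(\d{4})"

# Raw strings: a single backslash is enough.
with_raw = re.sub(pattern, r"(\1) \2-\3", numbers)

# \g<1> is the unambiguous spelling of \1.
with_g = re.sub(pattern, r"(\g<1>) \g<2>-\g<3>", numbers)

print(with_raw)  # (321) 110-9245 (221) 243-6690
print(with_g)    # (321) 110-9245 (221) 243-6690
```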

The reason the phone number must be matched as subgroups of digits, as opposed to using a regex such as (\d{10}), is that the number has to be captured as three groups in order to be reformatted in the replacement string. One limitation of using a replace regex with a template string is that captured groups can only be inserted as they are; the template cannot transform them. Additionally, template replacers have no ability to replace recursively. They can only match in a greedy or non-greedy fashion. To demonstrate this limitation, consider the following three cases of replacements:
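The three cases can be sketched as follows. The input strings and the log(...) replacement target are assumptions picked to reproduce the behavior discussed next; only the print\((.+)\) pattern and its non-greedy variant come from the discussion itself:

```python
import re

# Hypothetical task: rewrite print(...) calls as log(...).

# Case 1: greedy .+ runs to the last closing parenthesis.
first = re.sub(r"print\((.+)\)", r"log(\1)", "print('yo') print('u')")

# Case 2: non-greedy .+? stops at the first closing parenthesis.
second = re.sub(r"print\((.+?)\)", r"log(\1)", "print('yo') print('u')")

# Case 3: a nested call; the inner print(...) is left untouched.
third = re.sub(r"print\((.+)\)", r"log(\1)", "print(print('yo'))")

print(first)   # log('yo') print('u')
print(second)  # log('yo') log('u')
print(third)   # log(print('yo'))
```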

In the first call to re.sub, the result makes it seem as if only the first print() pattern was replaced. That’s actually not the case. What happened is a case of greedy capture. When quantifiers such as * or + are used in a regex, they match as many repetitions of the preceding pattern as possible. So when the pattern print\((.+)\) is matched, it actually matches up to the very last closing parenthesis. Thus, the captured group becomes 'yo') print('u' instead of the intended 'yo'.

In the second call, the ? modifier is used to capture in a non-greedy fashion. Placed after a quantifier like * or +, ? means to match as few repetitions of the preceding pattern as possible. In this regex, that permits only 'yo' to be captured in the first occurrence, and 'u' in the second. So this regex essentially works for the intended purpose. However, what if the intention is to replace a nested pattern?

In the third call, a problem similar to the one with the first pattern appears, but in a slightly different scenario. Instead of capturing too many characters, the regex does not perform any matching or replacement inside the captured group. This is perfectly normal behavior for regular expressions. It’s one of the fundamental limitations of a regular language: it cannot recognize recursion or nested expressions. Well, not without a little more work. To replace expressions that are nested or repeated, functions or other callable objects need to be used instead of template strings.

Function replacers

The re.sub function allows a function or any other callable object to be passed in as the replacement instead of a string. Such a function must accept one parameter, the match object for each match of the regex, and is expected to return the replacement string. This means it is free to run whatever code it needs in order to produce that string. Being able to compute the replacement opens up a huge number of possibilities, far beyond just positioning groups in a new string.

First, let’s look at an example that uses a function replacer to evaluate some arithmetic expressions in strings:
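A minimal sketch of such a replacer, matching the description that follows. The name add_replacer comes from that description; the exact pattern and the sample strings are assumptions, and \d+ is used for the operands to keep the sketch simple:

```python
import re

# Two integer operands separated by "+", with optional whitespace.
ADD_PATTERN = r"(\d+)\s*\+\s*(\d+)"

def add_replacer(match):
    # Convert both captured groups to int, add them, return a string.
    return str(int(match.group(1)) + int(match.group(2)))

simple = re.sub(ADD_PATTERN, add_replacer, "3 + 4 and 10+20")
chained = re.sub(ADD_PATTERN, add_replacer, "1 + 2 + 3")

print(simple)   # 7 and 30
print(chained)  # 3 + 3
```

The second call shows the limitation with longer expressions: once "1 + 2" has been consumed, the trailing "+ 3" no longer has a left operand to match against.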

The regex pattern used in the above code matches expressions of the form a + b, where whitespace may exist between a and b, but both a and b must be convertible to base-10 integers. The replacer function, add_replacer, converts both groups to integers, adds them together, then converts the sum back into a string. This will work on any two integers connected by a + in the string the replacer is run on. However, since the pattern covers exactly two operands and matches do not overlap, a single pass will not fully evaluate arithmetic expressions longer than two integers.

To deal with arithmetic expressions of arbitrary length, we need a pattern that can be applied recursively. The straightforward way to write such a regex is to have the left side be a specific pattern that’s definitely desired, and the right side be an open pattern that matches anything. That way, the replacer keeps calling re.sub on the right-side group, and it keeps matching the desired pattern step by step. In the following code, a regex with such an open group is used, and its replacer function makes a call to re.sub on that group:
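A minimal sketch of this recursive approach, using the pattern quoted in the explanation below. The names PATTERN and add_replacer and the sample strings are assumptions made for illustration:

```python
import re

# Left side: one integer and a "+". Right side: everything else on the line.
PATTERN = r"(\d|[1-9]\d+)\s*\+\s*(.*)"

def add_replacer(match):
    # Evaluate the open right-hand group first by recursing into it.
    rest = re.sub(PATTERN, add_replacer, match.group(2))
    try:
        return str(int(match.group(1)) + int(rest))
    except ValueError:
        # The right side was not an integer: fall back to the first group.
        return match.group(1)

short = re.sub(PATTERN, add_replacer, "1 + 2 + 3")
labeled = re.sub(PATTERN, add_replacer, "total: 10 + 20 + 30")

print(short)    # 6
print(labeled)  # total: 60
```

Note that in this sketch (.*) consumes the rest of the line, so any trailing text that is not part of the sum is discarded by the fallback branch.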

Here, (\d|[1-9]\d+)\s*\+\s* is the base pattern and (.*) is the recursive pattern. Since this pattern is only intended to match an arithmetic expression, it’s possible that (.*) matches some string that is not convertible to an integer. Thus, there has to be a guard against the ValueError that the builtin int() function raises when a string is not a valid base-10 integer. If that happens, the replacer catches the exception and instead just returns the first group, which is guaranteed to be an integer.

To extend this example, a third group could be added to match different arithmetic operators, and perform different operations based on that:
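One way this extension could look is sketched below. The names OPS, PATTERN and arith_replacer are hypothetical, and mapping / to floor division is an assumption made so that every result can be parsed again by int(). Because the right side is evaluated first, the sketch works right to left and ignores operator precedence:

```python
import operator
import re

# Assumed mapping from operator character to function.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,  # assumption: keep results as integers
}

# Groups: left operand, operator, open right-hand side.
PATTERN = r"(\d|[1-9]\d+)\s*([\+\-\*\/])\s*(.*)"

def arith_replacer(match):
    left, op = match.group(1), match.group(2)
    rest = re.sub(PATTERN, arith_replacer, match.group(3))
    try:
        return str(OPS[op](int(left), int(rest)))
    except (ValueError, ZeroDivisionError):
        # Not computable: fall back to the first group.
        return left

print(re.sub(PATTERN, arith_replacer, "2 * 3 + 4"))   # 14
print(re.sub(PATTERN, arith_replacer, "10 - 2 - 3"))  # 11
print(re.sub(PATTERN, arith_replacer, "8 / 2"))       # 4
```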

Instead of just matching +, we need to match [\+\-\*\/] for the operator, and apply different operations accordingly. Normally, characters in a regex character set do not need to be escaped, but they are escaped here to prevent issues with -, which in the wrong position can be read as a character range. Also, since + and * are normally quantifiers, escaping them makes the expression clearer when they are used in a character set. This is a great example of recursive replacement because it accounts for expressions of variable length.
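The point about escaping inside a character set can be checked with a short sketch (the sample strings are made up). An unescaped - is literal at the end of the set, but between two characters it forms a range:

```python
import re

text = "1+2-3*4/5"  # hypothetical sample input

escaped = re.findall(r"[\+\-\*\/]", text)
unescaped = re.findall(r"[+*/-]", text)  # "-" placed last is a literal

# Here "+-/" is a range from "+" to "/", which also covers "," and ".".
accidental_range = re.findall(r"[+-/]", "a,b.c")

print(escaped)           # ['+', '-', '*', '/']
print(unescaped)         # ['+', '-', '*', '/']
print(accidental_range)  # [',', '.']
```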

Overall, function replacers offer a wide range of capabilities that can make string processing in Python easier and faster.

