Manage Complex Regular Expressions (Python)
Regular Expressions come in handy in various situations. I find Regular Expressions especially helpful because for the most part, they’re language agnostic and therefore portable.
If you can craft Regular Expressions, you can use them in your JavaScript code, your Python code, your shell scripts… really in any language that supports them. This makes them a highly valuable tool for a programmer.
Many people use regexes as ad hoc solutions for simple string problems. Most of the time it’s string substitution, while other times they’re used for more difficult tasks like content scraping.
No matter what you’re using Regular Expressions for, surely you’ve encountered difficult-to-read regexes that cause you to despair.
This article discusses a few simple tricks to help developers wrangle regex complexity. We should not be deterred from introducing sophistication into our regex patterns out of fear of reducing code maintainability.
We’re going to work through a simple, recognizable example: phone-number matching. We’ll begin with a simple regex…
import re
phone_re = re.compile(r'(\d{3})(\d{3})(\d{4})')
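To see what the three capturing groups hand back, here is a minimal sketch. The sample number is just an illustration, and `fullmatch` is used on the assumption that the whole string should be a phone number.

```python
import re

phone_re = re.compile(r'(\d{3})(\d{3})(\d{4})')

# fullmatch only succeeds when the entire string fits the pattern
match = phone_re.fullmatch('7033217654')
area_code, first3, last4 = match.groups()
print(area_code, first3, last4)  # 703 321 7654
```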
At this point, we have a straightforward regex. The difficulty in managing regex complexity is that there is usually a trade-off between regex sophistication and code readability. Let’s introduce some more sophistication into our regex, making it more flexible, and refactor our code to make it more manageable.
Area codes can sometimes contain parentheses. Let’s extend the first capturing group to handle that. The change is shown in the comment.
# area code (\d{3}) => (\d{3}|\(\d{3}\))
phone_re = re.compile(r'(\d{3}|\(\d{3}\))(\d{3})(\d{4})')
With this extension, we can now match more inputs (see below).
inputs = ['7033217654','(703)3217654']
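A quick way to confirm both inputs match is to loop over them. This is only a sketch: the `results` name is made up for illustration, and `fullmatch` assumes each string holds nothing but the phone number.

```python
import re

phone_re = re.compile(r'(\d{3}|\(\d{3}\))(\d{3})(\d{4})')

inputs = ['7033217654', '(703)3217654']
results = []
for text in inputs:
    match = phone_re.fullmatch(text)
    # keep the captured groups, or None if the input did not match
    results.append(match.groups() if match else None)

print(results)
# [('703', '321', '7654'), ('(703)', '321', '7654')]
```

Note that when the area code is written with parentheses, the parentheses are part of what the first group captures.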
You should immediately notice how our regex went from readable to significantly less readable with that one change. And we are going to introduce more sophistication, so at this rate, we may end up with unmaintainable gibberish by the end of our regex design.
This is usually when most developers shy away from regex-land. But fret not! Regex-land is fertile territory. Let’s introduce some code management techniques to help us tame our regex as it supports more patterns.
Verbose
Verbose mode is a regex option, available in Python and many other regex engines, that lets us use multi-line strings, whitespace, and comments within our regex definition. This is a big win! Now our Regular Expression can be broken up across multiple lines, each line corresponding to a separate component of a phone number.
phone_re = re.compile(r'''
(\d{3}|\(\d{3}\)) # area code
(\d{3}) # first 3
(\d{4}) # last 4
''', re.VERBOSE)
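One caveat: in verbose mode Python ignores whitespace in the pattern unless it sits inside a character class or is escaped, so a literal space has to be written as something like `[ ]` or `\s`. The sketch below checks that the verbose pattern behaves like the one-liner and shows the whitespace rule. The helper `groups_or_none` and the sample strings are made up for illustration.

```python
import re

compact_re = re.compile(r'(\d{3}|\(\d{3}\))(\d{3})(\d{4})')
verbose_re = re.compile(r'''
    (\d{3}|\(\d{3}\))  # area code
    (\d{3})            # first 3
    (\d{4})            # last 4
''', re.VERBOSE)

def groups_or_none(pattern, text):
    """Return the captured groups, or None when the text does not match."""
    match = pattern.fullmatch(text)
    return match.groups() if match else None

samples = ['7033217654', '(703)3217654', '703 321 7654']
agree = all(
    groups_or_none(compact_re, s) == groups_or_none(verbose_re, s)
    for s in samples
)
print(agree)  # True

# Whitespace in a verbose pattern is ignored unless it is in a character class.
ignored_space = re.compile(r'a b', re.VERBOSE)
literal_space = re.compile(r'a[ ]b', re.VERBOSE)
print(bool(ignored_space.fullmatch('ab')))   # True
print(bool(literal_space.fullmatch('a b')))  # True
```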
Now let’s extend our regex to handle separators so that we can recognize phone numbers like (703)-321-7654, 703.321.7654, and (703) 321 7654. We’ve wrapped the parts we care about in capturing groups, but we’re not interested in capturing our separators, so let’s put those in a non-capturing group.
We should also indicate that separators may not always appear in the text we are matching, so we’ll make the separators optional.
# spaces \s+
# dots \.
# dashes -
# non-capturing group (?:)
# optional ?
# altogether (?:\s+|-|\.)?

phone_re = re.compile(r'''
(\d{3}|\(\d{3}\)) # area code
(?:\s+|-|\.)? # separator
(\d{3}) # first 3
(?:\s+|-|\.)? # separator
(\d{4}) # last 4
''', re.VERBOSE)
Take a look at our supported inputs now!
inputs = [
'7033217654', '(703)3217654', '(703) 321 7654',
'703.321.7654', '(703)-321-7654', '703-321-7654'
# et cetera permutations
]
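Here is a small sketch that runs every input above through the pattern. The non-capturing syntax is `(?:...)`, so each match should return exactly three groups, with the separators left out. The `parsed` name is just for illustration.

```python
import re

phone_re = re.compile(r'''
    (\d{3}|\(\d{3}\))  # area code
    (?:\s+|-|\.)?      # separator (non-capturing, optional)
    (\d{3})            # first 3
    (?:\s+|-|\.)?      # separator (non-capturing, optional)
    (\d{4})            # last 4
''', re.VERBOSE)

inputs = [
    '7033217654', '(703)3217654', '(703) 321 7654',
    '703.321.7654', '(703)-321-7654', '703-321-7654',
]

# map each input to the groups it captures
parsed = {text: phone_re.fullmatch(text).groups() for text in inputs}
for text, groups in parsed.items():
    print(f'{text!r:18} -> {groups}')
```

Every input yields the same last two groups, `'321'` and `'7654'`. The first group is `'703'` or `'(703)'`, depending on how the area code was written.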
Lastly, since we’re constructing our regex out of a raw string, we can abstract away portions that are being reused (e.g. our separators).
Reusable Regexes
For our final act, we’re going to pull out our separator regex and inject it into our phone regex through string interpolation.
sep = r'(?:\s+|-|\.)?' # separator

# quantifier braces are doubled so that str.format leaves them alone
phone_re = re.compile(r'''
(\d{{3}}|\(\d{{3}}\)) # area code
{sep} # separator
(\d{{3}}) # first 3
{sep} # separator
(\d{{4}}) # last 4
'''.format(sep=sep), re.VERBOSE)
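One detail worth checking when mixing `str.format` with regex quantifiers: `format` reads `{3}` as a replacement field, so literal quantifier braces need to be doubled (`{{3}}`). The sketch below shows both the working version and the failure. The `undoubled_error` name is made up for illustration.

```python
import re

sep = r'(?:\s+|-|\.)?'  # optional, non-capturing separator

# Quantifier braces are doubled so str.format treats them as literal braces.
phone_re = re.compile(r'''
    (\d{{3}}|\(\d{{3}}\))  # area code
    {sep}                  # separator
    (\d{{3}})              # first 3
    {sep}                  # separator
    (\d{{4}})              # last 4
'''.format(sep=sep), re.VERBOSE)

# Without doubling, str.format reads {3} as a positional field and fails.
try:
    r'(\d{3}){sep}'.format(sep=sep)
    undoubled_error = None
except IndexError as exc:
    undoubled_error = exc

print(phone_re.fullmatch('(703) 321-7654').groups())  # ('(703)', '321', '7654')
print(type(undoubled_error).__name__)                 # IndexError
```

If the doubled braces bother you, plain string concatenation of the pieces is another option that avoids the issue.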
Voila! Clarity. Without any of the tricks we’ve used here, our regex could have grown to look something like this:
(\d{3}|\(\d{3}\))(?:\s+|-|\.)?(\d{3})(?:\s+|-|\.)?(\d{4})
You choose between the two….
Conclusion
Although this example was contrived, hopefully it illustrated how a few useful features of Python can help us break up the logic of our Regular Expressions. Try to find analogous features in your language of choice.
Through the use of inline comments, verbose mode, multi-line strings, variables and string interpolation, you should face fewer headwinds when writing more complex Regular Expressions.
Thank you for reading!