Pattern Matching with Regular Expressions

A series of tutorials on Regular Expressions using Python

Zohaib Shahzad
The Startup
May 4, 2020


This article is the first in a series of tutorials demonstrating how to use regular expressions for text pattern matching, validation, parsing and replacing strings, translating data to other formats, web scraping, and more. The tutorials, in order:

  1. Regular Expressions: Basics
  2. Regular Expressions: Grouping & the Pipe Character
  3. Regular Expressions: Repetition & Greedy/Non-Greedy Matching
  4. Regular Expressions: Character Classes & findall() Method
  5. Regular Expressions: Dot-Star and the Caret/Dollar Characters
  6. Regular Expressions: sub() Method and Verbose Mode

Before I get into what a regular expression (or “regex” for short) is, I think it’s better to start this lesson with an example. I want to create a function that checks whether its input is a phone number. For simplicity, we’ll assume the US/Canada phone number format.
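The original snippet isn’t reproduced here, so here is a minimal sketch of what such a function might look like. It assumes the format ddd-ddd-dddd (three digits, a hyphen, three digits, a hyphen, four digits). The name isPhoneNumber is borrowed from the file name mentioned later in this article; the body is my assumption, not the exact original code.

```python
def isPhoneNumber(text):
    """Return True if text looks like ddd-ddd-dddd (assumed format)."""
    # A number in this format is always exactly 12 characters long.
    if len(text) != 12:
        return False
    for i, ch in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must be the separating hyphens.
            if ch != '-':
                return False
        elif not ch.isdecimal():
            # Every other position must be a digit.
            return False
    return True


print(isPhoneNumber('415-555-1234'))  # True
print(isPhoneNumber('Hello world!'))  # False
```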

This function works and can detect whether the input is a phone number. But suppose we want to search a longer string for a phone number embedded within it and print it out.
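A sketch of that search, under the same assumptions as the function above: the sample text in message is made up for illustration, and isPhoneNumber is repeated in a compact form so the snippet runs on its own.

```python
def isPhoneNumber(text):
    # Assumed ddd-ddd-dddd check: 12 characters, hyphens at index 3 and 7.
    if len(text) != 12 or text[3] != '-' or text[7] != '-':
        return False
    # Everything except the two hyphens must be a digit.
    return (text[:3] + text[4:7] + text[8:]).isdecimal()


# Sample text (an assumption for this example).
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 for my office line.'

found = []
for i in range(len(message)):
    # Take the 12-character window that starts at index i.
    chunk = message[i:i + 12]
    if isPhoneNumber(chunk):
        found.append(chunk)
        print('Phone number found: ' + chunk)
```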

In this snippet of code, we search for phone numbers within a string. We loop over every index of the “message” string, take the 12-character slice that starts at that index, and store it in the “chunk” variable. We then check whether that chunk is a phone number. If not, we shift the window one character to the right and try again, continuing through the string until a 12-character chunk matches.

Now this is a lot of code to just check if an input or a string contains a phone number. There’s always an easier way and that’s where regular expressions or “regex” come into play.

What are Regular Expressions?

Regular expressions are specially encoded text strings used as patterns for matching sets of strings. Some have even called them wildcards on steroids. Knowing how to apply regular expressions is powerful. At some point in your career as a developer, you’ll find yourself working on a program that needs to extract information from text by searching for one or more matches of a specific search pattern (that is, a specific sequence of ASCII or Unicode characters). In those cases regex can save you time and make your code shorter and easier to maintain.

Other applications of regex include validation, parsing and replacing strings, translating data to other formats, and web scraping. In addition, regex syntax is largely the same across programming languages (with some minor distinctions).

We’ll be using Python throughout this series of regex tutorials. Since we’re using Python, we’ll import a built-in module called re (‘re’ stands for ‘regular expressions’). Importing that module lets us start working with regex.

NOTE: Python Modules

Python ships with a standard library of modules that you can import without installing anything. Some of them are written in C and built into the interpreter, typically those providing system-level functionality such as OS management and disk I/O; others, including re, are written in Python on top of a C matching engine.

In Python, each module comes with its own functions and methods. Since we’re importing the regex module (re), it’s important to know what each of the ones we use does.

Regex has a reputation for being cluttered and messy with all its symbols and special characters, but it really depends on how you approach it. There’s a natural progression that starts with something as simple as this:

\d (represents a single digit from 0 to 9)

to something a bit more complicated like this:

^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$

which is where we’ll end up after this series of regex tutorials. It’s a fairly robust regular expression that matches a 10-digit North American telephone number (the area code part is optional in the pattern, so a 7-digit local number matches too).
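To get a feel for what that pattern accepts, here is a small sketch that runs it against a few strings. The pattern is copied verbatim from above; the sample strings and the use of search() to test them are my own choices for illustration.

```python
import re

# Pattern copied verbatim from the article.
pattern = re.compile(r'^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$')

# Sample inputs (assumptions for this example).
samples = [
    '415-555-1234',
    '(415)555-1234',
    '415.555.1234',
    '4155551234',
    '555-1234',
    '415-555-123',
    'not a number',
]

# search() returns a match object on success and None on failure.
results = {s: pattern.search(s) is not None for s in samples}
for s, ok in results.items():
    print(f'{s!r}: {ok}')
```

The first five strings match; the last two do not, because the pattern is anchored at both ends and requires exactly four digits at the end.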

Text Pattern Recognition Using Regex Makes It Easier

Let’s re-write the isPhoneNumber.py function and redo the task using regex:
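The original snippet isn’t reproduced here; the following is a minimal sketch consistent with the walkthrough that follows. The variable names message, phoneNumRegex and mo come from the article; the sample text itself is an assumption.

```python
import re

message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999.'  # sample text
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # compile the pattern
mo = phoneNumRegex.search(message)  # first match object, or None
print(mo.group())  # prints 415-555-1011
```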

Let’s break this snippet of code down one line at a time. First, we imported the re module, which enables us to work with regular expressions. Second, we created a short string containing the phone numbers we want to search for and stored it in a variable called “message”.

Now we’re going to use a function from the re module. re.compile() lets us create the actual regular expression (the special text used as a pattern for matching sets of strings) and returns it as a compiled regex object.

re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

Typically, before we write our regular expression, we start with the following: re.compile(r'')

The r'' is called a raw string, and in Python it’s what we place our regex pattern within. A raw string leaves backslashes exactly as typed, so \d reaches the regex engine untouched. For now, all you need to know is that the regular expression goes inside that raw string.
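If you’re curious what the r prefix actually changes, here is a quick sketch. The variable names plain and raw and the sample text are made up for illustration.

```python
import re

plain = '\\d'  # regular string: the backslash has to be escaped
raw = r'\d'    # raw string: the backslash is kept exactly as typed

print(plain == raw)           # True: both are a backslash followed by d
print(len('\n'), len(r'\n'))  # 1 2: newline character vs. backslash + n

# Either spelling gives the regex engine the same two-character pattern.
print(re.search(raw, 'Room 7').group())  # 7
```

Raw strings simply save you from doubling every backslash, which adds up quickly in longer patterns.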

Within the snippet of code, you’ll see a bunch of \d jumbled together. The \d is a character shorthand that, by itself, matches any single digit from 0 to 9.

You can see that we repeat \d three, three, and then four times in sequence, which mirrors the layout of a North American phone number. The hyphens within the regex are entered as literal characters and will be matched as such.
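A small sketch of how strict that pattern is. It uses fullmatch(), a method not covered in this article, purely as a convenient way to test whole strings; the sample inputs and the looksLikePhoneNumber helper name are assumptions.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def looksLikePhoneNumber(text):
    # Hypothetical helper: True only if the entire string fits the pattern.
    return phoneNumRegex.fullmatch(text) is not None


print(looksLikePhoneNumber('415-555-1011'))  # True
print(looksLikePhoneNumber('415.555.1011'))  # False: hyphens are literal
print(looksLikePhoneNumber('415-555-101'))   # False: last group needs 4 digits
```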

After crafting our regex pattern, we store the compiled pattern in a variable called phoneNumRegex. We then call its search() method on the message and store the result in mo, a variable holding the match object that search() returns (or None if nothing matched). Match objects have a group() method that returns the actual text found. All of these regex methods come from the re module.
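One thing worth guarding against: since search() returns None when nothing matches, calling group() on the result blindly would raise an AttributeError. A minimal sketch of a safe check follows; the helper name firstPhoneNumber and the sample strings are hypothetical.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')


def firstPhoneNumber(text):
    # Hypothetical helper: return the first match as text, or None.
    mo = phoneNumRegex.search(text)
    if mo is None:
        return None
    return mo.group()


print(firstPhoneNumber('Call me at 415-555-1011 tomorrow.'))  # 415-555-1011
print(firstPhoneNumber('No phone number in here.'))           # None
```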

DETOUR:

Take a look at line 5 in the code snippet, where we search the message. If you want to find all occurrences of a phone number within the message string, you can use the findall() method instead. Your code would look like this:
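The original snippet isn’t reproduced here, so this is a minimal sketch. The sample text in message and the variable name numbers are assumptions.

```python
import re

# Sample text (an assumption for this example).
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999.'

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# findall() returns a list of every non-overlapping match as strings,
# so there is no match object and no group() call here.
numbers = phoneNumRegex.findall(message)
print(numbers)  # ['415-555-1011', '415-555-9999']
```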

Credits

First, I’d like to credit this series of RegEx articles to Al Sweigart. I’m essentially basing the series on his online book, Automate the Boring Stuff with Python (link below).

Resources

Automate the Boring Stuff with Python

Big shout out to Al Sweigart. He has a course on Udemy and a free book online called: Automate the Boring Stuff with Python.

http://automatetheboringstuff.com/2e/

In his course, he has a section on Regular Expressions (chapter 7) which I personally found to be a good refresher.

Regex One

Feel free to check out this link as well. It does a decent job of walking you through the different components of regex.

https://regexone.com/
