Everything you need to know about Regular Expressions (RegEx)

Tom Staite
Jan 10, 2023

Just what does that seemingly random sequence of characters mean anyway?

Introduction & Definition

Regular expressions are a widely used technique that grew out of theoretical computer science, and formal language theory in particular. They are sequences of characters that specify search patterns within text. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation (for example, checking that an email address has the correct format, or ensuring a password contains the required characters).

With ordinary strings, we can perform operations such as concatenation, length calculation, and slicing during data exploration and preparation work. Regular expressions let us go one step further and carry out more powerful operations such as pattern finding, fuzzy matching, and string validation. This is all done by codifying our language requirements, as in the example shown in the image above.

It goes without saying that the syntax for regular expressions is not at all pretty, and on first inspection it can seem quite intimidating. Only learning and practice will help you become fully accustomed to the intricacies of this important tool, but everyone has to start somewhere. Even people with years of experience working with regular expressions still find themselves consulting the internet to check the correct way to write one, just as an experienced programmer still often searches for programming tips.

Hopefully after reading this article you’ll have everything you need to know to begin reading and writing your own expressions!

The Anatomy of a Regular Expression

Before we begin, I want to clarify that we will be working with the Python programming language. This is important to note because other programming languages may use slightly different regular expression syntax, so you may need to adapt the patterns in this article accordingly.

I’ve also included a Python Cheat Sheet below which lists all of the available combinations. This is a very handy “go-to” support document for when you begin to write your own expressions.

Python has a very handy module named ‘re’ that enables us to write Regular Expressions. This is what we will use in this article and for our example work. More information on this module can be found in the official documentation here.

Let’s go ahead and import our module.

import re

We will first look at the functions that enable us to search a string for a match (what we are searching for will be encoded as a regular expression). The four covered in this article are search, findall, split, and sub.

For example, we can use the .search function to see if a string starts with “The” and ends with “Spain”. Here is the code that will perform this check.

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

Here is a breakdown of this particular regular expression (I have included all regular expression syntax in a later section of the article).

^ means ‘starts with’. So we are checking if the text starts with ‘The’.

. (full stop) means any character (except a newline character).

* means zero or more occurrences, and since it is next to our full stop, it means zero or more occurrences of any character. So a string that starts with “The” could be followed by 1,000 characters, 100 characters, just 1 character, or none at all, and each would satisfy the RegEx so far.

$ means ‘ends with’. This is preceded by ‘Spain’. So we are checking that the text ends with ‘Spain’.

Based on the logic outlined above, here is a list of texts that would satisfy the RegEx.

  • “The neighbours went to visit Spain”
  • “The Spanish cuisine is amongst the finest. I always feel at home when I visit Spain”
  • “The country Spain”

And here is a list of texts that would not satisfy the RegEx.

  • “The breed of dog was Spanish”
  • “My last holiday was to Spain”
  • “The rain in Spain!”

Notice on the last example there is an exclamation mark after ‘Spain’. This has not been accounted for in our RegEx, and so this would not return a match. It is important to point out here that Regular Expressions are specific — just one incorrect character can completely change the meaning of your expression. Boundary testing can be a good way to check your work and ensure what you have written performs the intended search.
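The two lists above can be checked directly. The sketch below is a minimal boundary test, assuming the pattern is written as ^The.*Spain$ (no space between the full stop and the asterisk); the variable names are illustrative only.

```python
import re

# Pattern from the example: starts with "The", ends with "Spain".
pattern = re.compile(r"^The.*Spain$")

should_match = [
    "The neighbours went to visit Spain",
    "The Spanish cuisine is amongst the finest. I always feel at home when I visit Spain",
    "The country Spain",
]
should_not_match = [
    "The breed of dog was Spanish",
    "My last holiday was to Spain",
    "The rain in Spain!",
]

# Record whether each text satisfies the pattern.
results = {text: bool(pattern.search(text)) for text in should_match + should_not_match}
for text, matched in results.items():
    print(f"{matched!s:5} {text}")
```

Keeping both a "should match" and a "should not match" list makes it easier to spot a pattern that is accidentally too strict or too loose.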

In the above code, we used the .search function. This returns a match object for the first occurrence of the match only (or None if there is no match), so any later matches are ignored.
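As a small illustration of what .search hands back, the sketch below inspects the match object with its group() and span() methods, and shows the None returned when nothing matches. The search terms are chosen purely for demonstration.

```python
import re

txt = "The rain in Spain"

# search() stops at the first occurrence of "ai" (inside "rain").
first = re.search(r"ai", txt)
if first is not None:
    print(first.group())  # the matched text
    print(first.span())   # (start, end) positions within txt

# When nothing matches, search() returns None rather than raising an error.
missing = re.search(r"Portugal", txt)
print(missing)
```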

The .findall function on the other hand will store a list of all matches. Here is an example of the function in use, and the results it returns.

(for clarity, we’re searching for the string ‘ai’ within the sentence ‘The rain in Spain’)

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

>> ['ai', 'ai']

The .split function will return a list where the string has been split at each match.

(for clarity, the RegEx “\s” will split at each white-space character)

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

>> ['The', 'rain', 'in', 'Spain']
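One thing to watch for when splitting on “\s” is text with repeated spaces. The sketch below, using a made-up input string, compares splitting on a single white-space character with splitting on one or more, and also shows the optional maxsplit argument of re.split.

```python
import re

txt = "The  rain in   Spain"  # note the repeated spaces

# Splitting on a single white-space character leaves empty strings behind.
single = re.split(r"\s", txt)
print(single)

# Splitting on one-or-more white-space characters avoids them.
collapsed = re.split(r"\s+", txt)
print(collapsed)

# maxsplit limits how many splits are made.
limited = re.split(r"\s+", txt, maxsplit=1)
print(limited)
```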

Finally, the .sub function (short for ‘substitute’) will replace the matches with the text of your choice.

(for clarity, we’re replacing every white-space character with the number 9)

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

>> The9rain9in9Spain
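If you only want to replace some of the matches, or need to know how many replacements were made, the re module covers both. This is a short sketch using the count argument of re.sub and the related re.subn function on the same sentence.

```python
import re

txt = "The rain in Spain"

# count limits the number of replacements (the default, 0, replaces all).
first_only = re.sub(r"\s", "9", txt, count=1)
print(first_only)

# subn() returns the new string together with the number of replacements.
replaced, n = re.subn(r"\s", "9", txt)
print(replaced, n)
```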

All of the examples above form the very basics of Regular Expressions. We have used simple expressions to search, list, split, and replace characters within bodies of text. More complex expressions can be devised by using a combination of the codes outlined in the ‘Python Regular Expression Syntax & Cheat Sheet’ section below. A very wide range of text patterns can be described this way.

I want to share with you some examples of where they can be used in real-life.

Examples & Use Cases

To help consolidate your newly found knowledge of Regular Expressions, I have provided a couple of specific use cases (all in Python), helping you to see exactly how this technique is used in practice. Please remember to leave any questions you may have in the comment section of the article!

‘Create Password’ Pattern Matching

When you create a new account on a website, you are usually asked to input a password which conforms to given criteria in order for it to be accepted. For example, it might say:

“Please enter a password. Your password must have:

  • at least 8 characters
  • at least 1 upper case letter
  • at least 1 lower case letter
  • at least 1 digit (0 to 9)
  • at least 1 special character”

The developer behind the log-in credentials can use a single Regular Expression to check that every password entered conforms to the criteria. Note that a single pattern only reports pass or fail; highlighting which (if any) of the criteria is missing requires checking each rule separately. Here is the RegEx for this specific example.

"^(?=.*?[A-Z])(?=.*?[a-z])(?=.*[0-9])(?=.*?[#?!@$%^&*-]).{8,}$"

And here is a quick example of this code in action against a list of candidate passwords.

import re

# Each lookahead checks one rule; .{8,} enforces the minimum length.
expression = re.compile(r"^(?=.*?[A-Z])(?=.*?[a-z])(?=.*[0-9])(?=.*?[#?!@$%^&*-]).{8,}$")

passwords = ["Tornado", "t3rn4", "T0rnaDo!", "tornado123!"]

for password in passwords:
    # match() returns None when the password fails any rule.
    if expression.match(password) is None:
        print(password, "INVALID PASSWORD")
    else:
        print(password, "is a valid password!")



>>
Tornado INVALID PASSWORD
t3rn4 INVALID PASSWORD
T0rnaDo! is a valid password!
tornado123! INVALID PASSWORD

Here is a breakdown of the Regular Expression:

  • (?=.*?[A-Z]) ensures there is an uppercase letter within the string
  • (?=.*?[a-z]) ensures there is a lowercase letter within the string
  • (?=.*[0-9]) ensures there is a digit within the string
  • (?=.*?[#?!@$%^&*-]) ensures one of these special characters is within the string
  • .{8,} ensures the string is at least 8 characters long
  • ^ and $ anchor the match to the start and end of the string, i.e. partial matches are not considered.
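Because the combined pattern only answers yes or no, telling the user which rule failed needs one check per rule. The sketch below is one possible way to do that, not the only one; the RULES dictionary, its labels, and the missing_criteria function are hypothetical names introduced for illustration.

```python
import re

# Hypothetical labels; each pattern mirrors one rule from the list above.
RULES = {
    "at least 8 characters": r".{8,}",
    "an upper case letter": r"[A-Z]",
    "a lower case letter": r"[a-z]",
    "a digit": r"[0-9]",
    "a special character": r"[#?!@$%^&*-]",
}

def missing_criteria(password):
    """Return the labels of every rule the password fails."""
    return [label for label, pattern in RULES.items()
            if re.search(pattern, password) is None]

for candidate in ["Tornado", "t3rn4", "T0rnaDo!", "tornado123!"]:
    problems = missing_criteria(candidate)
    print(candidate, "OK" if not problems else "missing: " + ", ".join(problems))
```

Splitting the rules out like this trades a single compact expression for clearer feedback and patterns that are easier to read and test individually.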

Email Address Format Checker

Another example where Regular Expressions can be used is to validate the format of email addresses. It should be said here that a regular expression can only check the format of the email address and not that it is actually correct (i.e. someone could enter a correctly formatted email address that contains a misspelling, and it would still pass).

Again, a single expression can be written to complete this task.

"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+"

And here is a quick example of this code in action against a list of candidate email addresses.

import re

expression = re.compile(r"([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")

emails = ["test@hotmail.co.uk", "test@", "test!", "test@live.com"]

for email in emails:
    # fullmatch() requires the whole string to fit the pattern.
    if expression.fullmatch(email) is None:
        print(email, "INVALID EMAIL")
    else:
        print(email, "is a valid email!")

>>
test@hotmail.co.uk is a valid email!
test@ INVALID EMAIL
test! INVALID EMAIL
test@live.com is a valid email!

Here is the breakdown of the Regular Expression:

  • ([A-Za-z0-9]+[._-])*[A-Za-z0-9]+ checks if the username is valid (the bit before the ‘@’ symbol). We check that only valid characters are used and that at least one of them is present. The hyphen is placed last in [._-] so that it is read as a literal character rather than as a range.
  • @[A-Za-z0-9-]+ checks for the @ character and the host name.
  • (\.[A-Za-z]{2,})+ checks for the top-level domain: a dot followed by at least two letters, repeated one or more times.
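Two details are easy to get wrong in patterns like this one: where the hyphen sits inside a character class, and whether the whole string or only its beginning is being tested. The sketch below illustrates both; the simplified pattern named simple is an assumption made for the demonstration, not a complete email validator.

```python
import re

# Inside [...], a hyphen between two characters defines a range.
# [.-_] therefore spans every character from "." to "_" in ASCII order,
# which includes digits, upper case letters and symbols such as "@".
range_accepts_at = bool(re.fullmatch(r"[.-_]", "@"))
literal_accepts_at = bool(re.fullmatch(r"[._-]", "@"))  # hyphen last: literal only
print(range_accepts_at, literal_accepts_at)

# match() anchors only at the start, so trailing junk is ignored;
# fullmatch() requires the whole string to fit the pattern.
simple = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+")
prefix_ok = bool(simple.match("test@live.com!!!"))
whole_ok = bool(simple.fullmatch("test@live.com!!!"))
print(prefix_ok, whole_ok)
```

For validation tasks, fullmatch (or explicit ^ and $ anchors) is usually the safer choice, since a prefix match alone would accept input with unwanted trailing characters.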

Python Regular Expression Syntax & Cheat Sheet

Use the guides below to help you learn the content within this article and begin to write your own expressions.

Here is a useful cheat sheet that summarises the above information into a neat one-pager, courtesy of Dataquest.

Reach out to me on LinkedIn: https://www.linkedin.com/in/thomas-staite-msc-bsc-ambcs-55474015a/

Bonus Content: History of Regular Expressions

Reference: https://scantopdf.com/blog/the-history-of-regex/

Creation of Regular Expressions

In 1943, Warren S. McCulloch (Neuroscientist) and Walter Pitts (Logician) began to develop models describing how the human nervous system works. Their research focused on trying to understand how the brain could produce complex patterns using simple cells that are bound together. In 1956, mathematician Stephen Kleene described McCulloch-Pitts neural models with an algebra notation that he called ‘regular expressions’. Influenced by Kleene’s work, in 1968, mathematician and Unix pioneer, Ken Thompson, implemented the idea of regular expressions inside the text editor, ‘ed’. His aim was that ed users would be able to do advanced pattern matching in text files. ed soon evolved to have the functionality to search based on regular expressions, and this is when regexes entered the computing world.

Evolution of Regular Expressions

Many people have contributed to the development and promotion of regular expressions since they entered popular usage in ed. Notably, Larry Wall’s Perl programming language from the late 80s helped regular expressions to become mainstream. Perl was originally designed as a flexible text-processing language but grew into a fully-fledged programming language that remains a natural choice for text processing to this day. The language still relies heavily on the use of regexes.

Future of Regular Expressions

Despite being hard to read, hard to validate, hard to document and notoriously hard to master, regexes are still widely used today. Supported by all modern programming languages, text processing programs and advanced text editors, regexes are now used in more than a third of both Python and JavaScript projects. With this in mind, over 50 years since their inception, the use of regexes seems very much here to stay.
