Regular Expressions: Not Your Regular Old Tool

To the uninitiated, regular expressions (regex) may seem like a secret wartime code used to send messages over enemy lines. In reality, with some practice, regular expressions are an approachable way to perform powerful and often very fast pattern searches. Their applications range from search engines and find-and-replace functions to email validation and much more. The purpose of this article is to give a brief overview of regular expressions and to leave the reader with a bit of code to perform basic searches with regular expressions.

First described by mathematician Stephen Cole Kleene in 1951 as a way to describe formal languages, regex gives us a quick and flexible way to find patterns without writing functions to parse the searched text line by line or word by word. While regex has its roots in the 1950s, it is an essential tool in modern Natural Language Processing (NLP). At their base, regular expressions draw on ideas from mathematics, computer science, and linguistics, such as automata theory, the Chomsky hierarchy, and set theory. For our more general purposes, we can move on with the understanding that different regular expression engines work in slightly different ways, giving rise to a variety of “flavors” of regex across programming languages. In this brief overview we will stick strictly to Python syntax, but remember that when you encounter regular expressions in the wild, the syntax may vary.

RAW STRINGS

To begin, we need to import Python's built-in regular expression module, re.
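The import itself is a single line. The quick search underneath it is only a sanity check added here for illustration, with a made-up word as the searched text.

```python
import re  # ships with the standard library, so nothing needs to be installed

# Sanity check: look for the letters "ex" inside the word "regex"
first_match = re.search(r"ex", "regex")
print(first_match)  # <re.Match object; span=(3, 5), match='ex'>
```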

Raw strings are commonly used when writing regex patterns. Let’s take two example strings and compare their outputs. See if you can notice the difference in how the raw string is processed.
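Here is a small sketch of that comparison. The two strings are invented for this example and are not the ones from the original post.

```python
# Hypothetical example strings, identical except for the r prefix
regular = "first\tsecond\nthird"
raw = r"first\tsecond\nthird"

print(regular)  # the \t becomes a tab and the \n becomes a line break
print(raw)      # printed exactly as typed, backslashes included

# The raw string is longer because each backslash is kept as its own character
print(len(regular), len(raw))  # 18 20
```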

By prepending the string with a lowercase “r”, the backslash “\” is treated as an ordinary character instead of the start of an escape sequence, and the string is printed literally as it is. The “r” prefix is a feature of Python string literals, not of the regex engine: it stops Python from converting sequences such as “\n” or “\b” before the pattern ever reaches the re module. Depending on the search criteria, you may retrieve indistinguishable results by keeping or omitting the “r”, but a pattern that contains backslashes can quietly change meaning without it, so it is a good habit to write regex patterns as raw strings.
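A case where the prefix changes the result is the word-boundary token \b. The sketch below uses a made-up sentence to show what happens with and without the “r”.

```python
import re

text = "the cat sat"  # hypothetical sample text

# Without the r prefix, Python converts "\b" into a backspace character
# before the re module receives the pattern, so nothing matches.
without_raw = re.findall("\bcat\b", text)

# With the r prefix, the regex engine receives \b and reads it as a word boundary.
with_raw = re.findall(r"\bcat\b", text)

print(without_raw)  # []
print(with_raw)     # ['cat']
```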

THE BASICS

For a basic pattern match we can use the re.compile() function, which lets us store our patterns in variables. Compile turns our pattern into a pattern object. Then we can call .finditer() on that object and assign the result, an iterator of matches, to another variable. Finally, we can loop over it with a list comprehension to collect the matches:
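A minimal sketch of those three steps could look like this. The sample text, the pattern, and the variable names are placeholders chosen for illustration.

```python
import re

# Hypothetical sample text and pattern
text = "Regular expressions make pattern searches fast."
pattern = re.compile(r"pattern")   # step 1: compile into a pattern object

matches = pattern.finditer(text)   # step 2: an iterator of match objects

results = [match for match in matches]  # step 3: collect with a list comprehension
print(results)  # [<re.Match object; span=(25, 32), match='pattern'>]
```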

You will end up with a list of re.Match objects, each giving you the start and end positions of its match, displayed as “span=(#, #)”:
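The positions can also be read off the match object directly. This sketch uses a made-up sentence, and .search() rather than .finditer(), simply to get a single match to inspect.

```python
import re

text = "Regular expressions make pattern searches fast."  # hypothetical sample text
match = re.compile(r"pattern").search(text)  # first match only

print(match)                       # <re.Match object; span=(25, 32), match='pattern'>
print(match.span())                # (25, 32)
print(match.start(), match.end())  # 25 32
print(match.group())               # pattern
```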

Now we can simply slice our original string with those positions to see our match:
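As a sketch, again with an invented sentence, the span can be unpacked and used as slice indices:

```python
import re

text = "Regular expressions make pattern searches fast."  # hypothetical sample text
match = re.compile(r"pattern").search(text)

start, end = match.span()   # (25, 32)
print(text[start:end])      # pattern
```

The end index in a span is exclusive, which is why it can be passed to the slice unchanged.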

META CHARACTERS

In a regex search, certain characters carry special significance. These characters are referred to as “Special Characters” or “Meta Characters”. To match one of them literally, it needs to be escaped with a backslash “\”.

For example, the dollar sign “$” tells Python to look for whatever precedes the $ specifically at the end of the searched string. If we wanted to search for the literal string “is$” in the text below, we would need to escape the $ with a backslash. Otherwise we are asking Python to look for “is” as the last characters in the string.
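With the dollar sign escaped, the search might look like the sketch below. The text is a stand-in written for this example, not the one from the original post.

```python
import re

# Hypothetical text that contains the literal characters "is$" and also ends in "is"
text = "This text contains is$ and this is what it is"

search = re.compile(r"is\$")  # \$ means a literal dollar sign
matches = [match for match in search.finditer(text)]
print(matches)  # [<re.Match object; span=(19, 22), match='is$'>]
```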

versus
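Leaving the dollar sign unescaped anchors the search to the end of the string. The text here is invented for the sketch.

```python
import re

# Hypothetical text that contains the literal characters "is$" and also ends in "is"
text = "This text contains is$ and this is what it is"

search = re.compile(r"is$")  # unescaped $ means "at the end of the string"
matches = [match for match in search.finditer(text)]
print(matches)  # [<re.Match object; span=(43, 45), match='is'>]
```

Only the final “is” is returned. The “is” inside “This” and the literal “is$” in the middle of the text are both skipped.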

While there are many more special characters, let us take one more example and search for the first word in a string. For this we will need a caret “^”. Using the same text as before, we simply put a caret in front of what we want to find at the beginning of the string.
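A sketch of the caret search, using a placeholder text that starts with “This”:

```python
import re

text = "This text contains is$ and this is what it is"  # hypothetical sample text

search = re.compile(r"^This")  # ^ anchors the pattern to the start of the string
matches = [match for match in search.finditer(text)]
print(matches)  # [<re.Match object; span=(0, 4), match='This'>]
```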

If we were to change “search = re.compile(r'^This')” to “search = re.compile(r'^text')”, we would get back an empty list, even though “text” is in our string, because “text” is not at the very beginning. This specificity and heavy use of meta characters can seem stringent, but it is what gives regular expressions the flexibility and reach to perform a number of very specific tasks.
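The difference is easy to check side by side. The text is a stand-in for this sketch, and the second search drops the caret only for comparison.

```python
import re

text = "This text contains is$ and this is what it is"  # hypothetical sample text

# Anchored: "text" is in the string, but not at the start
anchored = [m for m in re.compile(r"^text").finditer(text)]

# Unanchored: the same word is found wherever it occurs
unanchored = [m for m in re.compile(r"text").finditer(text)]

print(anchored)    # []
print(unanchored)  # [<re.Match object; span=(5, 9), match='text'>]
```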

Character Classes aka Character Sets

Another very commonly used feature of regular expressions is the character class, also known as the character set. These are denoted with brackets “[ ]”. They tell Python to look for a match from the selected characters or from a range of selected characters. We can also use meta characters outside of our brackets to return different results. Take the string below for example:

The code below sets our search criteria for the text string. Notice the different outputs we can get by changing a small piece of code while using character classes and meta characters.
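The sketch below runs four variations of the pattern over one sentence. The sentence is made up for this example (it only needs to contain “running”, “singing”, and “cawing”), and the loop and dictionary are just a compact way to compare the results.

```python
import re

# Made-up sentence, not the original article's text
text = "The boy is running, the bird is singing, then cawing"

patterns = [
    r"awing",      # the exact characters "awing"
    r"[aw]ing",    # "a" or "w", followed by "ing"
    r"[a-w]ing",   # any one letter from a to w, followed by "ing"
    r"[a-w]+ing",  # one or more letters from a to w, followed by "ing"
]

results = {}
for pattern in patterns:
    results[pattern] = [m.group() for m in re.finditer(pattern, text)]
    print(pattern, results[pattern])

# awing ['awing']
# [aw]ing ['wing']
# [a-w]ing ['ning', 'sing', 'wing']
# [a-w]+ing ['running', 'singing', 'cawing']
```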

In our first example without character sets, “awing” tells Python to search for the characters “awing” in succession, which it found in the final word “cawing”. Notice that it did not return the whole word. In our second example brackets are used, which ask Python to look for either “a” or “w” followed by “ing”. It found this again in the word “cawing”, returning only “wing”. In the third search we include a hyphen to denote any single letter between “a” and “w”. This returns the “ning” in “running”, the “sing” in “singing”, and the “wing” in “cawing”. Lastly, we use a hyphen in the brackets and a plus sign outside of the brackets, which tells Python to search for one or more letters from “a” to “w” followed by “ing”. This causes Python to return the full words “running”, “singing”, and “cawing”.

To be even more specific, take a moment to think about what would be returned, given our logic, if we replaced the word “cawing” with the filler word “terawingsel”. Remembering that the character class plus the meta character will retrieve a run of letters from “a” to “w” ending in “ing”, we get [<re.Match object; span=(46, 54), match='terawing'>]. Python returns the word from its beginning up to and including “ing”, leaving off the last few characters “sel”.
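This last case can be tried out with the sketch below. The sentence is invented for the example; it happens to place the filler word at index 46, so the span of the final match lines up with the one quoted above, but it is not the original text.

```python
import re

# Made-up sentence with "cawing" swapped for the filler word "terawingsel"
text = "The boy is running, the bird is singing, then cawing"
text = text.replace("cawing", "terawingsel")

search = re.compile(r"[a-w]+ing")
matches = [match for match in search.finditer(text)]

print([m.group() for m in matches])  # ['running', 'singing', 'terawing']
print(matches[-1])  # <re.Match object; span=(46, 54), match='terawing'>
```

The plus sign is greedy, so the engine first grabs the whole run of letters and then backs up until “ing” fits, which is why the match stops right after “ing” and drops “sel”.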

Conclusion

I hope to have laid out enough of a framework to whet your appetite and show you the possibilities of applying regular expressions in your own code. In many cases they are a faster and more efficient way to perform your text processing tasks once you internalize some of their grammar. In your future ventures creating conversation bots to chat with in these socially isolating times, you can look back on this article as a jumping-off point to the wide world of regex.

Written by Kevin Macias-Matsuura

Former English teacher turned Data Scientist/Analyst interested in data, design, and storytelling.

The Startup
