The Hitchhiker’s Guide to Regular Expressions and Python’s re Library

DON’T PANIC

Hannah Parker
Aug 7 · 9 min read

Last week I found myself trying to find all the numbers in a string without knowing what a regex is, and it was ugly. It was lines upon lines of hyper-specific, repetitive if statements that hurt my soul to type. Figuring that there had to be a better way, I did what any good programmer does; I googled what the better way was. I found many short, simple answers on StackOverflow, all of which used Python’s re library, but none of which explained what was happening. So, I did my due diligence and navigated to the documentation, only to find an incredibly long, dense article that took all my mental energy to parse. I mean, this is an article that states, and I quote, “For further information and a gentler presentation, consult the Regular Expression HOWTO.” If you head to that page, you will find, in my opinion, a presentation that is just as hostile (or whatever the opposite of gentle is) but with slightly more examples!

So, I hope to leave the world in a slightly better state than I found it—with a fledgling data scientist’s simple synthesis of what you need to know about regexes and Python’s re library.


What Are Regular Expressions? What Is Python’s re Library?

There are two parts of this synthesis: regular expressions and Python’s re library. I separate these because regexes are a cross-language tool, and Python’s re library is a very common Python-specific implementation of this tool.

Regular expressions can be summed up mathematically (and beautifully): They tell you whether or not a string is in a regular set, where a regular set is defined as a set containing elements that can be expressed using a regular expression.

In other words, you can use a regular expression to test if a string fits a certain format (like the format of an email) or contains a certain sequence of characters.


The Basics of Regular Expressions

This matches almost every email address that has ever existed or will ever exist (as the current laws of email dictate):

r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

This ghastly-looking thing is a regular expression, otherwise known as a regex, and I hope that by the end of this section, we’ll be able to parse exactly what it’s saying. If you’re already comfortable with regular expressions, feel free to skip this section. Otherwise, let’s begin.

Regular expressions use certain special characters and groupings of characters such as:

.  ^  $  *  +  ?  {m}  {m,n}  \  [...]  |  (...)  \s  \S  \w  \d

Most of these fall into one of three main classifications: character sets, modifiers, and anchors. These classifications are basically just ways of specifying how you want the regular expression to understand your desired string or pattern.

Our goal throughout this section will be to use each classification to build a regex that will match the string galaxy. But, for more practice, as you read through these descriptions and examples, you can come up with examples and test yourself using this website, which I found extremely helpful when I was getting started and still use when I need a sanity check.

  • ^ will match the start of the string (It also has another meaning, but we’ll get there later!). For example, the regex ^a will tell you that abc contains a match or aaaaaaaxyz contains a match (the first a is the match in both). However, ba is not a match, since it does not start with a.
  • $ will match the end of the string. For example, the regex a$ will tell you that cba contains a match but that “aaaaaaah” does not.

Let’s build a regex with the characters we know so far:

^gy$

This will only consider the string gy to be a match, as it is the only string that starts with g, ends with y, and has nothing else in the middle. We will soon see how to modify the regex so that the string galaxy would be a match, but it is not a match for this regex.
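If you want to check the anchor examples yourself, here is a minimal sketch using Python's re.search(), which is covered later in this article. The variable names are just for illustration.

```python
import re

# ^a: the string must start with "a"
starts_with_a = [s for s in ["abc", "aaaaaaaxyz", "ba"] if re.search(r"^a", s)]

# a$: the string must end with "a"
ends_with_a = [s for s in ["cba", "aaaaaaah"] if re.search(r"a$", s)]

# ^gy$: starts with g, ends with y, and has nothing in between
exactly_gy = [s for s in ["gy", "galaxy", "gyy"] if re.search(r"^gy$", s)]

print(starts_with_a)  # ['abc', 'aaaaaaaxyz']
print(ends_with_a)    # ['cba']
print(exactly_gy)     # ['gy']
```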

  • . will match any character (except a new line). For example, the regex a. will consider as to have a match, or a2, or aq, or a followed by any other character you can think of.
  • [a-z] (or [abcd] or [0-9]) will match any one character specified in between the brackets. Importantly, though normally we would think of listing elements with commas, such as [1, 2, 3, 4], regexes recognize [1234] instead (a comma or space inside the brackets would itself become a character to match). You can also string two ranges together, like [a-z1-9]. So, the regex [A-Z1-4] will recognize A as containing a match and 1G3HZ as containing multiple matches, but not z99 (since it is case sensitive, and 9 is outside the range 1-4).

Now we can build on our regex from before:

^g.[xX]y$

This still forces our regex to only match strings beginning with g and ending with y, but now a match must also have any one character, followed by a lowercase or capital x, in between. So, goXy is a match, as is g8xy, or even g?Xy and many other examples. We’re getting closer to galaxy!
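Here is a small sketch, again using re.search() and re.findall() from Python's re library (covered later in this article), that tries out the dot, a character set, and the regex we just built. The test strings are the ones from the text, plus a couple of non-matches that I added for contrast.

```python
import re

# "." matches any single character except a newline
dot_hits = [s for s in ["as", "a2", "aq", "a"] if re.search(r"a.", s)]

# [A-Z1-4] matches one uppercase letter or one digit from 1 to 4
class_hits = re.findall(r"[A-Z1-4]", "1G3HZ")
class_misses = re.findall(r"[A-Z1-4]", "z99")

# g, then any one character, then x or X, then y, and nothing else
pattern = r"^g.[xX]y$"
candidates = ["goXy", "g8xy", "g?Xy", "gy", "galaxy"]
matches = [s for s in candidates if re.search(pattern, s)]

print(dot_hits)      # ['as', 'a2', 'aq']
print(class_hits)    # ['1', 'G', '3', 'H', 'Z']
print(class_misses)  # []
print(matches)       # ['goXy', 'g8xy', 'g?Xy']
```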

  • * will match the directly preceding character 0 or more times. For example, ab* would match a, ab, abb, and so on, since each has 0 or more instances of b.
  • + is like *, but 1 or more times. So, the regex ab+ would match ab, abb, abbb, and so on, but not a, since there are 0 instances of b.
  • ? is like *, but 0 or 1 times, so the regex ab? would only match a or ab.
  • \ is somewhat familiar; it escapes the meaning of special characters, such as *, so you can match these characters as well, without using square brackets.
  • Note: the meanings of +, ?, and * are altered inside square brackets; generally, when inside brackets, they will not have a special meaning but will stand for the characters themselves to be matched. An example will be shown at the end of this section. The exception to remember is ^: when placed right at the beginning of the inside of a square bracket, it means the complement of the set, that is, all characters that aren’t listed. For example, the regex [^abc] matches any single character other than a, b, or c. (Anywhere else inside the brackets, ^ is just a literal caret.)
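The modifiers above can be sketched in Python as follows. The helper full() is my own name, not part of the re library; it wraps re.fullmatch() so that the entire string has to match, which makes the difference between *, +, and ? easy to see.

```python
import re

def full(pattern, s):
    """Return True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

samples = ["a", "ab", "abbb"]
star = [s for s in samples if full(r"ab*", s)]      # 0 or more b's
plus = [s for s in samples if full(r"ab+", s)]      # 1 or more b's
optional = [s for s in samples if full(r"ab?", s)]  # 0 or 1 b

# A backslash removes the special meaning of *
escaped = re.findall(r"\*", "2*3*4")

# Inside square brackets, + ? * stand for themselves
literal_in_set = re.findall(r"[+?*]", "1+1=2? *")

# A leading ^ inside square brackets means "anything except these"
negated = re.findall(r"[^abc]", "abcdef")

print(star)            # ['a', 'ab', 'abbb']
print(plus)            # ['ab', 'abbb']
print(optional)        # ['a', 'ab']
print(escaped)         # ['*', '*']
print(literal_in_set)  # ['+', '?', '*']
print(negated)         # ['d', 'e', 'f']
```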

We can build even further on our regex from before, getting really, really close to matching galaxy.

^g.[alxX]+y$

By adding a and l to our square brackets, we can use these letters in our string, and adding + after the square brackets means that a letter from that selection must show up one or more times. galaxy is finally a match! It’s worth mentioning that gaallaaxxy is also a match, and so is golaxy. There is still room to improve our regex, and the tools below can help. But, having reached our main goal of matching “galaxy”, I’ll leave that to you.
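A quick way to confirm this is to run the regex against a few candidates. This is only a sketch; gy and gxy are extra strings I added to show what still fails.

```python
import re

pattern = r"^g.[alxX]+y$"
candidates = ["galaxy", "gaallaaxxy", "gy", "gxy"]

# Keep only the strings the regex accepts
matches = [s for s in candidates if re.search(pattern, s)]

print(matches)  # ['galaxy', 'gaallaaxxy']
```

gxy fails because after the dot consumes x, there is no letter left from the set a, l, x, X before the final y.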

  • {m}, where m is an integer, will match exactly m repetitions of the preceding character.
  • {m,n} will match as few as m and as many as n repetitions, being as greedy as possible (matching as many repetitions as possible). Write it without a space after the comma; Python’s re treats {m, n} as literal text.
  • {m,n}? will match as few as m and as many as n repetitions, but is not greedy and therefore will match as few repetitions as possible.
  • | is used for or, as is true in many cases, so that you can create a new regex from regexes A and B that matches either A or B, as follows (with no spaces around the |, since a space would be matched literally):
A|B
  • (regex) will treat all characters inside as a big group or enable you to retrieve certain matches (otherwise known as a capture group, which is a bit beyond the scope of this article, but very cool!).
  • \d will match any digit.
  • \D will match any non-digit.
  • \s will match any whitespace character (space, tab, etc.).
  • \S will match any non-whitespace character.
  • \w will match any word character (a letter, a digit, or an underscore).
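The tools in this list can be tried out with a short sketch like the one below. The sample strings are made up for illustration. Note that the counted repetition is written {2,4} with no space, because Python's re does not treat {2, 4} as a repetition.

```python
import re

# {m}: exactly m repetitions
exactly_three = [s for s in ["aa", "aaa", "aaaa"] if re.fullmatch(r"a{3}", s)]

# {m,n} is greedy (takes as many as it can); {m,n}? is lazy (takes as few as it can)
greedy = re.search(r"a{2,4}", "aaaaa").group()
lazy = re.search(r"a{2,4}?", "aaaaa").group()

# A|B matches either A or B
pets = re.findall(r"cat|dog", "cat, dog, bird")

# (...) captures part of the match so it can be retrieved later
page_range = re.search(r"(\d+)-(\d+)", "pages 12-34")
first_page, last_page = page_range.group(1), page_range.group(2)

# \d, \D, \s, \S
digits = re.findall(r"\d", "a1 b22")
non_digits = re.findall(r"\D", "a1 b22")
spaces = re.findall(r"\s", "a1 b22")
non_spaces = re.findall(r"\S", "a1 b")

# Finding all the numbers in a string, the problem from the introduction
numbers = re.findall(r"\d+", "3 apples and 42 pears")

print(exactly_three)          # ['aaa']
print(greedy, lazy)           # aaaa aa
print(pets)                   # ['cat', 'dog']
print(first_page, last_page)  # 12 34
print(numbers)                # ['3', '42']
```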

There are many more ways to group characters that alter the meaning of special characters, but for the sake of time and your sanity, I’ll leave them out for now. Once you’re comfortable with what’s presented here, I highly recommend looking further into regex notation.

Finally, you can also combine regular expressions. In general, if A and B are regexes, and a is a match for A, and b is a match for B, then AB (the concatenation of A and B) is a regex, and ab is a match for AB.

Equipped with the notation we’ve been over so far, as well as the key concepts we’ve learned, we can finally (with considerable effort) decompose the regular expression describing emails from way back in the beginning!

r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"
  • r"..." is a Python raw string, which keeps backslashes such as the one in \. from being interpreted by Python before the regex engine sees them. The outer parentheses () put the whole expression in one group.
  • ^[a-zA-Z0-9_.+-]+@ denotes any combination of letters, numbers, underscores, dots, pluses, or dashes, with the + after the brackets denoting that this must occur 1 or more times. After all, emails must have at least one character preceding the @! The beginning caret (^) only lets strings beginning with this format count as a valid email, so ‘###foo@bar.com’ would not be an email.
  • [a-zA-Z0-9-]+ denotes any combination of alphanumeric characters and dashes, which again must occur at least once before the dot (.).
  • \.[a-zA-Z0-9-.]+$ finishes our email by escaping the special meaning of . with a backslash and allowing any combination of alphanumeric characters, dashes, and dots for the rest of the domain name, which must occur at least once (denoted by +) at the end (denoted by $).
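To see the decomposition in action, here is a minimal sketch that runs the email regex over a few sample addresses. The addresses are invented for illustration, and the name EMAIL is my own.

```python
import re

EMAIL = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

samples = [
    "foo@bar.com",
    "first.last+tag@sub-domain.example.org",
    "###foo@bar.com",  # starts with characters outside the allowed set
    "foo@bar",         # no dot after the domain name
    "@bar.com",        # nothing before the @
]

valid = [s for s in samples if re.search(EMAIL, s)]

print(valid)  # ['foo@bar.com', 'first.last+tag@sub-domain.example.org']
```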

The Basics of Python’s re Library

Finally! In this section, we’ll go over only a few methods in the library. There are others, but we’ll focus on the ones you most urgently need to know. There are also regular expression object types (such as Match), which I won’t discuss in depth, beyond just letting you know that a method returns that type. Unless you’re reusing the same regex over and over in the same file, it’s not much more efficient to use a compiled regular expression object rather than just a string with your regular expression. However, just so you’re aware, you might see some documentation out there that uses methods associated with regex objects (such as the BeautifulSoup documentation, which regularly uses re.compile([regex])). On to the methods!
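For reference, this is roughly what the two styles look like side by side. It is a sketch with made-up sample lines; both approaches give the same results, and the compiled object is mostly a convenience when one regex is reused many times.

```python
import re

lines = ["3 apples", "no digits here", "42 pears and 7 plums"]

# Style 1: compile the regex once and call methods on the resulting object
number = re.compile(r"\d+")
with_object = [number.findall(line) for line in lines]

# Style 2: pass the regex string to the module-level function each time
with_string = [re.findall(r"\d+", line) for line in lines]

print(with_object)                 # [['3'], [], ['42', '7']]
print(with_object == with_string)  # True
```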

  • re.search() finds the first match in the string you pass it and returns a Match object, or None if there is no match. A Match object always has a boolean value of True, so I can see this being really useful in an if statement (if re.search([regex], [string]):), especially if you want to find some particular information within a pattern. You can even turn the Match object into a dictionary by naming the groups in your regex and calling the Match object’s groupdict() method! (For example, if you were given a string of addresses and wanted a dictionary for each one, with the first line, second line, city, state, zip-code, etc.)
  • re.findall() is similar to re.search, but will return all non-overlapping matches as a list of strings (or a list of tuples, if your regex contains more than one group).
  • re.VERBOSE is a flag (not a method) that lets you write your regex with comments across multiple lines, so that
r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

becomes the ever-more-legible

re.compile(r"""
    ^[a-zA-Z0-9_.+-]+   # email user-name
    @                   # the @ symbol
    [a-zA-Z0-9-]+       # the domain name
    \.[a-zA-Z0-9-.]+$   # the top-level domain
""", re.VERBOSE)

With this flag, the regex engine will ignore whitespace in the pattern, as well as everything after the comment sign (‘#’) until the end of the line, so you can comment to your heart’s content and help demystify regexes for others (hooray!). (You pass this in as the flags argument to re.compile(), or to a function like re.search() that finds matches.)

  • re.MULTILINE is a flag used to have ^ work at the beginning of every line in a string with multiple lines, not just the very start of the string (likewise, $ will work at the end of every line). So, rather than having to break apart a multiline string into many individual strings, you can use this flag to find matches at the beginning of each line. (Similar to re.VERBOSE, you pass this in as the flags argument to a function like re.search() that finds matches.)
  • re.split() might be my favorite method of all in this library. It makes parsing through text so. much. easier. If you’re parsing through a news article, for example, and you want a list of all the words used, rather than using the typical .split() method and a bunch of if statements, you can split on the regex [,:;“”!.*\s]+. In other words, split on any run of that punctuation or white space. (Note that \b would not work for this: it marks a word boundary, not white space.) It returns a list, just like the normal str.split() method. How much easier is that?
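Here is a sketch that exercises each of the methods and flags above. The sample text, the address, and the variable names are all made up for illustration. The named-group syntax (?P<name>...) is not covered in this article, but it is what makes groupdict() return a dictionary. For re.split() I use a set of punctuation and whitespace followed by +, and filter out the empty string that appears when the text ends with punctuation.

```python
import re

text = "alice@example.com wrote first.\nbob@example.org replied."

# re.search: a Match object for the first hit, or None when nothing matches
first = re.search(r"\S+@\S+", text)
first_email = first.group() if first else None
no_match = re.search(r"\d", text)  # the text contains no digits

# re.findall: every non-overlapping match, as a list of strings
all_emails = re.findall(r"\S+@\S+", text)

# Named groups let you turn a Match into a dictionary
address = re.search(
    r"(?P<city>[A-Za-z ]+), (?P<state>[A-Z]{2}) (?P<zip>\d{5})",
    "Springfield, IL 62704",
)
address_parts = address.groupdict()

# re.MULTILINE: ^ matches at the start of every line, not only the first
every_line = re.findall(r"^\S+", text, flags=re.MULTILINE)
first_line_only = re.findall(r"^\S+", text)

# re.VERBOSE: whitespace and # comments inside the pattern are ignored
email = re.compile(r"""
    ^[a-zA-Z0-9_.+-]+   # user name
    @                   # the @ symbol
    [a-zA-Z0-9-]+       # the domain name
    \.[a-zA-Z0-9-.]+$   # the rest of the domain
""", re.VERBOSE)
is_email = email.search("foo@bar.com") is not None

# re.split: break the text on runs of punctuation or whitespace
pieces = re.split(r"[,:;“”!.*\s]+", "So long, and thanks: all the fish!")
words = [w for w in pieces if w]  # drop the empty string after the final "!"

print(first_email)      # alice@example.com
print(no_match)         # None
print(all_emails)       # ['alice@example.com', 'bob@example.org']
print(address_parts)    # {'city': 'Springfield', 'state': 'IL', 'zip': '62704'}
print(every_line)       # ['alice@example.com', 'bob@example.org']
print(first_line_only)  # ['alice@example.com']
print(is_email)         # True
print(words)            # ['So', 'long', 'and', 'thanks', 'all', 'the', 'fish']
```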

A final note: Before, I specified that a regular expression can tell you whether a string is in a regular set. You may have asked yourself, “what is a non-regular set?” which is a great question! Certain sets of strings cannot be described by a regular expression, such as “a sequence of as, followed by the same number of bs.” To describe that set, you need a stronger tool, called a context-free grammar! (Requiring a matching number of cs as well would call for something stronger still.)
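In practice, one common workaround is to let a regex check the overall shape and do the counting in ordinary code. The sketch below is my own illustration of that idea, not a context-free grammar; the function name is made up.

```python
import re

def same_number_of_as_and_bs(s):
    """Return True if s is some a's followed by the same number of b's."""
    # The regex can only check the shape: a's first, then b's
    shape = re.fullmatch(r"(a*)(b*)", s)
    # Comparing the two counts is the part a regular expression cannot do
    return shape is not None and len(shape.group(1)) == len(shape.group(2))

results = [same_number_of_as_and_bs(s) for s in ["aabb", "aab", "abab", ""]]
print(results)  # [True, False, False, True]
```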

EDIT: Shout out to @grahamhome333 for pointing out that I’d forgotten the first caret in the email example when explaining it!

Better Programming

Advice for programmers.
