What is Regular Expression (Comprehensive)

Regex Rocks!

According to the definition on Wikipedia,

A regular expression is a sequence of characters that defines a search pattern; the concept comes from theoretical computer science and formal language theory.

So what exactly does that mean? In my understanding, a regular expression is a very simple language that helps with search and find functions. For example, suppose you want to find a movie called “zootopia” saved on your computer, but you have forgotten the name, and all you can remember is “zoo something with ‘ia’ at the end”. What are you going to do? You may start by searching for “zoo”, but you may have hundreds or thousands of files with “zoo” in the name. Is there a better way to find it?

Luckily, regular expressions (regex) are here to help. They let you search for whatever pattern you want. In the example above, you can simply use ^zoo\w*ia$ to match the pattern and find the movie you want. Don’t worry if you do not understand what I just wrote. It was nonsense to me several weeks ago. The sections below walk through each piece to help you understand the awesome regex.

Before I start, I want to point to some documentation and websites that can help you understand and practice regex. There are many other resources you can use, such as Codecademy, but the following three have been extremely useful for me.

  • Regex Documentation provides comprehensive documentation that you can refer to for all aspects of regex.
  • RegExr is an awesome website that summarizes the most important information about regex. It has a cheatsheet you can refer to, and samples that help you check whether your regex is correct.
  • Hackerrank has very good practice problems for checking your knowledge. It also has practice for lots of other languages, such as SQL and Python.

Here is a summary of the basic regex patterns.

Literal Characters

Literal characters in regex work the same as a normal search in Excel. When you want to search for ‘a’, just type ‘a’.

Special Characters (a.k.a. metacharacters)

  • ^: Anchors the pattern to the beginning. e.g. ^zoo to find ‘zootopia’. It also means ‘negated’ when placed right after [.
  • $: Anchors the pattern to the end. e.g. pia$ to find ‘zootopia’.
  • .: Matches any single character except a newline.
  • |: Means ‘or’. e.g. I like (dog|cat) matches ‘I like dog’ or ‘I like cat’.
  • ?: To match 0 or 1 repetitions.
  • *: To match 0 or more repetitions.
  • +: To match 1 or more repetitions.
  • (
  • )
  • [
  • {
  • \: It normally means ‘escape’. If you want to search for one of the characters listed above, you need to add a \ before it. For example, to search for a period ‘.’, type \. instead of a bare dot.

Character Classes (a.k.a. Character Sets)

  • []: Matches one out of several characters. [ab]c will match ‘ac’ or ‘bc’.
  • [^]: A negated character class. [^0-8] can match ‘9’.
  • \d: [0-9]
  • \D: [^\d]
  • \w: [a-zA-Z0-9_]
  • \W: [^\w]
  • \s: [ \t\r\n\f]
  • \S: [^\s]

Anchors

Anchors do not match any characters. They match a position between characters.

  • ^: matches the position before the first character.
  • $: matches the position after the last character.
  • \b: Word boundaries. Including:
  • Before the first character if the first character is a word character
  • After the last character if the last character is a word character
  • Between two characters if one is word character and the other is not

Repeat

  • ?: To match 0 or 1 repetitions.
  • *: To match 0 or more repetitions.
  • +: To match 1 or more repetitions.
  • {m}: To match exactly m repetitions.
  • {m,n}: To match m to n repetitions.
  • {m,}: To match m or more repetitions.

Grouping

  • (group): groups part of the regex together and captures what it matches
  • (?:group): groups part of the regex together, but does not capture and cannot be used in backreferences

Backreferences

  • (regex1)regex2\1: \1 matches the same text that the first group (regex1) matched

Lookahead

  • regex1(?=regex2): match regex1 only if it is followed by regex2
  • regex1(?!regex2): match regex1 only if it is not followed by regex2

Lookbehind

  • (?<=regex2)regex1: match regex1 only if it is preceded by regex2
  • (?<!regex2)regex1: match regex1 only if it is not preceded by regex2

Here are the detailed explanations.

Simple String Pattern

The easiest and simplest regex is just the word or phrase you want to search for. For example, if you want to find ‘medium’ in your website bookmarks, you can simply type medium and ‘https://medium.com’ will be found.

You can use this kind of pattern for both letters (abc…) and digits (123…).

dot(.)

Is the above example any different from a normal search? Not really. How about this one: the dot (.). ‘.’ matches anything except a newline. What does this mean? Well, if you forget whether it is ‘microeconomics’ or ‘macroeconomics’ you are looking for, you can type m.croeconomics and find both. The dot stands for any single letter, digit or special character. But remember, it can only substitute for one character. If you type 1.2.3.4, you may find ‘1+2-3+4’ or ‘1a2b3c4’. If you want to stand in for more than one character, use one dot per character: 1..34 can find ‘12334’ or ‘1ab34’.

If you want to search for a literal dot (.) instead of any character, you can use \. (a backslash followed by a dot). Here, \ is an escape that tells the computer you want to match the dot itself instead of using the dot as a pattern character.
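Here is a minimal sketch of the dot and the escaped dot, using Python's built-in re module (my choice of language for illustration; the word list is made up):

```python
import re

words = ["microeconomics", "macroeconomics", "mcroeconomics"]

# "." stands for exactly one character (anything except a newline).
dot_pattern = re.compile(r"m.croeconomics")
dot_hits = [w for w in words if dot_pattern.fullmatch(w)]
print(dot_hits)  # ['microeconomics', 'macroeconomics']

# An escaped dot matches only a literal period.
literal_dot = re.compile(r"1\.2")
print(bool(literal_dot.search("1.2")))  # True
print(bool(literal_dot.search("1a2")))  # False
```

The third word is skipped because the dot must consume one character, and nothing is left to fill that slot.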

\d vs. \D

You may say: dot is useful, but can I match only numbers when I search for a phone number, instead of everything? You bet. You can use \d for that. When you want to search for a phone number that starts with ‘800-800-’ but you forget the last four digits, you can type 800-800-\d\d\d\d to find the number you need. It will show something like ‘800-800-1234’.

How about the opposite? You have a final essay that needs to be submitted, but you have versions essay_v1, essay_v2, … essay_v10000 saved on your computer, and you just want to find essay_vFinal. What should you do? You cannot use the dot, since it will list everything. If you remember the name, you can search for vFinal. But what if you were too excited when you finished the final version and misspelled ‘Final’ as something else? \D will help you here. \D is exactly the opposite of \d, which means it matches any character that is not a digit. essay_v\D can help you find ‘essay_vp’, ‘essay_vF’ or ‘essay_v@’.

Like the dot, \d and \D match only one character at a time. You can type \d\d\d\d to match several, or use the repetition patterns shown later in this article to match multiple characters at once.
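The phone number and essay examples can be sketched like this in Python (the sample text and file names are invented for illustration):

```python
import re

# Four \d in a row stand for the four forgotten digits.
phone = re.compile(r"800-800-\d\d\d\d")
found_phone = phone.search("Call 800-800-1234 today").group()
print(found_phone)  # 800-800-1234

# \D accepts any single character that is not a digit.
non_digit_version = re.compile(r"essay_v\D")
names = ["essay_v1", "essay_v2", "essay_vFinal", "essay_v@"]
non_numbered = [n for n in names if non_digit_version.match(n)]
print(non_numbered)  # ['essay_vFinal', 'essay_v@']
```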

\w vs. \W

\d is cool when searching for phone numbers, but what about word characters? The dot can find everything, but that includes special characters you may not want. \w solves both problems. \w matches any word character, which includes alphanumeric characters (a-z, A-Z, and 0-9) and the underscore (_). You use \w the same way as \d. \W is exactly the opposite of \w, which means it matches any character other than alphanumeric characters and the underscore. When you want to search for special characters, it helps a lot.
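As a quick check of which characters count as word characters, here is a small Python sketch (the sample string is arbitrary):

```python
import re

sample = "a_1 !"

# \w picks up letters, digits and the underscore.
word_chars = re.findall(r"\w", sample)
print(word_chars)  # ['a', '_', '1']

# \W picks up everything else, including the space.
non_word_chars = re.findall(r"\W", sample)
print(non_word_chars)  # [' ', '!']
```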

\s vs. \S

Another pair of often-used patterns with a backslash is \s and \S. \s matches any whitespace character, which includes the space itself, \r, \n, \t and \f. You do not need to memorize these. All you need to remember is that when you are looking for whitespace, like in ‘A B’, \s can help you find it, no matter whether it is a single space, a new line, or a gap created by hitting tab. \S, as you may have guessed, matches any non-whitespace character. Therefore, \S is more general than \w, since alphanumeric characters, underscores and special characters are all non-whitespace characters.
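A small Python sketch of the same idea, with a made-up string that contains a space, a tab and a newline:

```python
import re

sample = "A B\tC\nD"

# \s finds the space, the tab and the newline.
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

# \S finds every character that is not whitespace.
visible = re.findall(r"\S", sample)
print(visible)  # ['A', 'B', 'C', 'D']
```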

* vs. +

All these patterns are cool, but it will be a disaster if you do not know how many characters you are looking for. Do you remember the earlier ‘zootopia’ example, where you know ‘zoo’ and ‘ia’ but forget the rest? You may want to use . or \w to help you find it, but you do not know how many letters are between ‘zoo’ and ‘ia’. What should you do, try them one by one until you find it? Fortunately, there is a better way. Regex provides * and +. * matches zero or more repetitions and + matches one or more repetitions. For example, a*b can find ‘b’ or ‘aaaaaaaab’, while a+b can find ‘ab’ or ‘aaaaaab’ but not ‘b’ alone. For the ‘zootopia’ example, if you know there are letters between ‘zoo’ and ‘ia’ but do not know how many, you can use zoo\w+ia. If you are not sure whether there are any letters in between, you can use zoo\w*ia. Short and sweet, right?
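The difference between * and + shows up only when there are zero repetitions. A minimal Python sketch (the test strings are my own):

```python
import re

star = re.compile(r"a*b")
plus = re.compile(r"a+b")

# With zero 'a' characters, only the * version matches.
star_matches_b = star.fullmatch("b") is not None
plus_matches_b = plus.fullmatch("b") is not None
print(star_matches_b, plus_matches_b)  # True False

# Both match when there is at least one 'a'.
print(plus.fullmatch("aaaaaab") is not None)  # True

# zoo\w+ia needs at least one character in the middle; zoo\w*ia does not.
print(re.fullmatch(r"zoo\w+ia", "zootopia") is not None)  # True
print(re.fullmatch(r"zoo\w+ia", "zooia") is not None)     # False
print(re.fullmatch(r"zoo\w*ia", "zooia") is not None)     # True
```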

^…$

These two characters are probably my favorites in regular expressions. In short, ^ means the beginning and $ means the end. You can use them separately or combined. For the ‘zootopia’ example above, if you just use zoo\w+ia, you may also find something like ‘my_zootopia_backup’, where the pattern only matches part of the name, which is clearly not what you are looking for. Since you know the name starts with ‘zoo’ and ends with ‘ia’, you can use ^zoo\w+ia$. This way, the computer knows that the whole name must start with ‘zoo’ and end with ‘ia’, with nothing before or after.
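To see what the anchors change, here is a small Python sketch that filters a list of made-up file names with and without ^ and $:

```python
import re

names = ["zootopia", "my_zootopia_backup", "zoo_trip"]

# Without anchors, the pattern may match anywhere inside the name.
loose = re.compile(r"zoo\w+ia")
loose_hits = [n for n in names if loose.search(n)]
print(loose_hits)  # ['zootopia', 'my_zootopia_backup']

# With ^ and $, the whole name must start with 'zoo' and end with 'ia'.
strict = re.compile(r"^zoo\w+ia$")
strict_hits = [n for n in names if strict.search(n)]
print(strict_hits)  # ['zootopia']
```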

\b vs. \B

There is one more special pattern you need to remember: \b and \B. Again, \B is the negated version of \b. \b matches a word boundary: the position between a word character and a non-word character such as whitespace or punctuation, or the start or end of a string next to a word character. By word characters, I mean a-z, A-Z, 0-9 and the underscore, which is \w in regex. There are three different kinds of word boundaries.

  • Before a character: if the first character is a word character, the boundary is the position before it. Putting \b on both sides, as in \bword\b, helps you find whole words. For example, \bword\b can find ‘word’ in ‘a word’.
  • Between two characters: when one is a word character and the other is not. For example, \b1\b can find ‘1’ in ‘Regex-1’.
  • After a character: when the last character is a word character. For example, Welcome\b can find ‘Welcome’ in ‘Welcome!’.

You may wonder what the difference is between \b and \s. It seems that both can find whitespace, so why do we need two? Well, \b is like ^ and $: it matches a position, and the match is zero-length. It only works together with other characters, and it is helpful when you want to find word characters at a boundary. Meanwhile, \s is the one to use when you want to match the actual whitespace.
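The zero-length nature of \b is easier to see in code. A minimal Python sketch, with sample sentences of my own:

```python
import re

# \bword\b finds 'word' only as a whole word.
whole_words = re.findall(r"\bword\b", "a word, a password, wordy")
print(whole_words)  # ['word']

# The boundary itself consumes nothing: the match stops before '!'.
welcome = re.search(r"Welcome\b", "Welcome!")
print(welcome.span())  # (0, 7)

# Every position where \b matches in a short string.
boundaries = [m.start() for m in re.finditer(r"\b", "Hi, you")]
print(boundaries)  # [0, 2, 4, 7]
```

The four boundary positions are the start of ‘Hi’, the end of ‘Hi’, the start of ‘you’ and the end of ‘you’; the comma and the space never match on their own.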

[…] & [^…] & [a-z]

\b might be a little bit tricky, so let’s do something simple. All the patterns above are for general use. Are there any patterns for specific characters? When you want to search for a song that starts with a vowel, \w or the dot will match far too much. To match one out of several characters, you can place them inside square brackets []. For example, [aeiou] is a vowel can find ‘a is a vowel’, ‘e is a vowel’, etc. Just remember, [] matches only one character out of all the characters within it, not all of them.

What if you want to find song names that do not start with a vowel? You can add a ‘^’ inside the square brackets, like [^aeiou]. Inside square brackets, ‘^’ does not mean ‘start’ anymore; it means ‘opposite’ or ‘negated’. [^aeiou] matches any single character other than ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’.

[] is very useful, but listing all the letters or numbers one by one can be exhausting. You can use a dash (-) to speed this up. [a-z] means a, b, c…z. [7-9] means 7, 8, 9. [A-D] means A, B, C, D. Easy? One thing to keep in mind: regex is case sensitive, which means a is different from A. In the example above, if you want to search for a vowel and do not care whether it is a capital letter or not, you can use [aeiouAEIOU].
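The song example can be sketched like this in Python (the song titles are just sample data):

```python
import re

songs = ["Imagine", "Yesterday", "umbrella", "Bad Guy"]

# One character from the set, at the start of the title.
starts_with_vowel = re.compile(r"^[aeiouAEIOU]")
vowel_songs = [s for s in songs if starts_with_vowel.search(s)]
print(vowel_songs)  # ['Imagine', 'umbrella']

# Inside the brackets, ^ negates the set.
starts_with_other = re.compile(r"^[^aeiouAEIOU]")
other_songs = [s for s in songs if starts_with_other.search(s)]
print(other_songs)  # ['Yesterday', 'Bad Guy']

# A dash defines a range.
high_digits = re.findall(r"[7-9]", "0123456789")
print(high_digits)  # ['7', '8', '9']
```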

{x} & {x,} & {x, y}

Do you remember * and +, the two symbols that mean repetition? They are very helpful for ‘zero or more’ and ‘one or more’ repetitions, but chances are high that you will need more control than that: for example, an exact number of repetitions, 10 or more repetitions, or between 1 and 100 repetitions. In these cases, {} is very helpful.

  • {x} can match x repetitions
  • a{5} can match ‘aaaaa’
  • {x,} can match x or more repetitions.
  • a{5,} can match ‘aaaaa’ or any longer run of ‘a’.
  • {x,y} can match repetitions between x and y (both inclusive)
  • a{2,3} can match ‘aa’ or ‘aaa’.
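The three forms in the list above can be checked with a short Python sketch. The helper name full is my own, added only to keep the lines short:

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"a{5}", "aaaaa"))     # True: exactly five
print(full(r"a{5}", "aaaa"))      # False: only four
print(full(r"a{5,}", "aaaaaaa"))  # True: five or more
print(full(r"a{2,3}", "aaa"))     # True: between two and three
print(full(r"a{2,3}", "aaaa"))    # False: four is too many
```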

?

If there are ways to match one or more repetitions, there is definitely a way to match zero or one. ? matches a character that appears zero times or one time. For example, ab?c can match either ‘abc’ or ‘ac’.
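A tiny Python sketch of the optional ‘b’ (the candidate strings are my own):

```python
import re

optional_b = re.compile(r"ab?c")

# 'b' may appear once or not at all, but not twice.
results = {s: optional_b.fullmatch(s) is not None for s in ["ac", "abc", "abbc"]}
print(results)  # {'ac': True, 'abc': True, 'abbc': False}
```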

Groups

Talking about repetitions, all the examples above repeat a single character. Is it possible to repeat whole words? Absolutely. Parentheses () group part of the regex together, so you can repeat that part as a unit. For example, Regular expression is (not )?awesome can match either ‘Regular expression is awesome’ or ‘Regular expression is not awesome’. (We know it is!!!) Note that the space after ‘not’ sits inside the group; otherwise the version without ‘not’ would need two spaces in a row.
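Here is a minimal Python sketch of grouping. Note that in this sketch the space is placed inside the group, so the optional part takes its space along with it:

```python
import re

awesome = re.compile(r"Regular expression is (not )?awesome")

with_not = awesome.fullmatch("Regular expression is not awesome")
without_not = awesome.fullmatch("Regular expression is awesome")
print(with_not is not None, without_not is not None)  # True True

# The group remembers what it matched, or None if it was skipped.
print(repr(with_not.group(1)))     # 'not '
print(repr(without_not.group(1)))  # None

# A quantifier after a group repeats the whole group, not just one character.
print(re.fullmatch(r"(ab)+", "ababab") is not None)  # True
print(re.fullmatch(r"ab+", "ababab") is not None)    # False
```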

|

As I mentioned before, regex is case sensitive unless you tell it not to be. Therefore, ‘a’ is not the same as ‘A’. In the above example about whether regular expression is awesome, if you are not sure whether the ‘R’ is capitalized, you can use | for alternative matching. In many programming languages, | means ‘or’. (Rachel|Amanda) likes regex can match either ‘Rachel likes regex’ or ‘Amanda likes regex’. (R|r)egular can match ‘Regular’ or ‘regular’. Do not put spaces around the |, because a space inside a regex is a literal character that must be matched too.
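A short Python sketch of alternation, which also shows why spaces around | matter (a space in a pattern is a literal character):

```python
import re

likes = re.compile(r"(Rachel|Amanda) likes regex")
print(likes.fullmatch("Rachel likes regex") is not None)  # True
print(likes.fullmatch("Amanda likes regex") is not None)  # True

# With spaces around |, the alternatives become 'Rachel ' and ' Amanda'.
spaced = re.compile(r"(Rachel | Amanda) likes regex")
spaced_result = spaced.fullmatch("Rachel likes regex")
print(spaced_result)  # None

# Alternation on a single letter handles either capitalization.
either_case = re.findall(r"(?:R|r)egular", "Regular and regular")
print(either_case)  # ['Regular', 'regular']
```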

Lookahead

All the examples we talked about before are related to the characters or words you are looking for directly, which means, when you want to search for ‘dog’ in ‘I love dogs’, you just search ‘dog’.

But how about this? Within the word ‘chocolate’, there are two ‘c’s, but you want to get only the ‘c’ right before the letter ‘o’, not the other ‘c’. There is certainly a way to do so: a regex pattern that looks like regex1(?=regex2). In the chocolate example, it looks like c(?=o), which finds the second ‘c’ in ‘chocolate’. This pattern means regex will find a ‘c’ followed by an ‘o’. The regex engine will only return the regex1 you are looking for, not the regex2 that follows.

What about the opposite? There is also a pattern that matches characters not followed by certain other characters. In the above example, say you are looking for a ‘c’ not followed by an ‘o’. The pattern looks like regex1(?!regex2). c(?!o) finds the first ‘c’ in ‘chocolate’, the one followed by ‘h’.

These two patterns are called lookahead.
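The chocolate example can be checked with a small Python sketch. The word has a ‘c’ at index 0 (followed by ‘h’) and another at index 3 (followed by ‘o’):

```python
import re

word = "chocolate"

# Positive lookahead: 'c' only where the next character is 'o'.
before_o = [m.start() for m in re.finditer(r"c(?=o)", word)]
print(before_o)  # [3]

# Negative lookahead: 'c' only where the next character is not 'o'.
not_before_o = [m.start() for m in re.finditer(r"c(?!o)", word)]
print(not_before_o)  # [0]

# The lookahead text is checked but not returned.
matched_text = re.search(r"c(?=o)", word).group()
print(matched_text)  # c
```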

Lookbehind

If there is ‘lookahead’, there will be ‘lookbehind’. Here is a practical example: you are doing research on Facebook to learn whether gmail or hotmail is more popular. You may want to grab all the registered email addresses and extract the provider from each one. You certainly cannot count them one by one, since it would take forever. What will you do? Lookbehind can help, since every email address contains ‘@’.

The lookbehind pattern looks like this: (?<=regex2)regex1. It means: look for and return regex1 only where it is preceded by regex2. In the email example, it looks like this: (?<=@)(gmail|hotmail). The regex engine will return either ‘gmail’ or ‘hotmail’ when it comes right after ‘@’, and you can get all the data you need very easily.

If you are looking for regex1 not preceded by regex2, just change ‘=’ to ‘!’, as with lookahead. The pattern becomes (?<!regex2)regex1.
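A minimal Python sketch of the email count, assuming a small made-up list of addresses in place of real data:

```python
import re
from collections import Counter

emails = ["a@gmail.com", "b@hotmail.com", "c@gmail.com", "gmail_fan@yahoo.com"]

# Match the provider name only when it comes right after '@'.
provider = re.compile(r"(?<=@)(?:gmail|hotmail)")

counts = Counter()
for address in emails:
    match = provider.search(address)
    if match:
        counts[match.group()] += 1
print(dict(counts))  # {'gmail': 2, 'hotmail': 1}

# Negative lookbehind: 'gmail' only where it is NOT right after '@'.
not_after_at = [m.start() for m in re.finditer(r"(?<!@)gmail", "gmail_fan@gmail.com")]
print(not_after_at)  # [0]
```

The last address is not counted as gmail, because its ‘gmail’ is part of the user name and is not preceded by ‘@’.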

BackReferences

As mentioned earlier, parentheses ( ) in regex create capturing groups, which can be used by backreferences. A backreference matches the same text as previously matched by a capturing group. For example, (1)23\1 will match ‘1231’, and 12(3)\1 will match ‘1233’. The number after ‘\’ is the group number. Groups are numbered by the order of their opening parentheses, from left to right. The simplest example to show this: (\d)(\d)(\d)(\d)\4\3\2\1 can match ‘12344321’. A backreference repeats the text that its group actually matched, not the pattern itself, so (\d)\1 can match ‘11’ but not ‘12’. When you want to find repeated text, a backreference will always be helpful!

However, there is one thing to be aware of: a capturing group that matches nothing is not the same as a capturing group that did not participate in the match at all.

  • On the one hand, (a?)b\1 can match ‘b’. The ‘a’ is optional and absent, so the group () participates in the match but captures an empty string. \1 then matches an empty string as well. There is nothing after ‘b’, so ‘b’ is matched.
  • On the other hand, (a)?b\1 cannot match ‘b’ in most regex engines. Why? Here the whole group (a) is optional, and since there is no ‘a’, the group does not participate in the match at all. That is fine for the group itself because of the ? after it. However, \1 now refers to a group that never matched anything, and \1 has no ‘?’ of its own to make it optional, so the backreference fails. Because of that failure, ‘b’ is not matched.
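The two cases above can be reproduced in Python. This sketch shows the behavior of Python's re module only; some other engines (JavaScript, for example) treat a backreference to a non-participating group differently:

```python
import re

# The group participates and captures an empty string, so \1 matches empty.
empty_capture = re.fullmatch(r"(a?)b\1", "b")
print(repr(empty_capture.group(1)))  # ''

# The group is skipped entirely, so \1 has nothing to refer to and fails.
skipped_group = re.fullmatch(r"(a)?b\1", "b")
print(skipped_group)  # None

# When the group does match 'a', the backreference needs another 'a'.
print(re.fullmatch(r"(a)?b\1", "aba") is not None)  # True
```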

If you have many parentheses and you do not want some of the groups to be capturing groups, just put ?: after the opening parenthesis, like (?:aaa), which creates a non-capturing group.

One useful application of backreferences is checking for doubled words or numbers. \b(\w+)\s+\1\b can capture things like ‘123 123’, which helps find duplicates.
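Here is a minimal Python sketch of backreferences, including a doubled-word finder written as \b(\w+)\s+\1\b (the sample sentence is made up):

```python
import re

# \1 must repeat the exact text that group 1 captured.
print(re.fullmatch(r"(1)23\1", "1231") is not None)  # True
print(re.fullmatch(r"(\d)\1", "11") is not None)     # True
print(re.fullmatch(r"(\d)\1", "12") is not None)     # False

# Four groups replayed in reverse order.
mirror = re.fullmatch(r"(\d)(\d)(\d)(\d)\4\3\2\1", "12344321")
print(mirror is not None)  # True

# A word, some whitespace, then the same word again.
doubled = re.compile(r"\b(\w+)\s+\1\b")
duplicates = doubled.findall("this is is a test 123 123 ok")
print(duplicates)  # ['is', '123']
```

The \b at the start keeps the ‘is’ inside ‘this’ from pairing up with the next ‘is’.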

ForwardReferences

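Forward references work just like backreferences, except that backreferences refer to a group that appears earlier in the pattern, while forward references refer to a group that appears later. A forward reference only makes sense inside a repeated group, where the later group has already captured something during a previous repetition. For example, in engines that support forward references, (\2two|(one))+ can match ‘oneonetwo’: the first repetition matches ‘one’ through group 2, and the second repetition uses \2 to match ‘one’ again, followed by ‘two’.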

In practice, forward references are not very helpful and are not used often. Not all languages support them; Python, for example, does not.
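You can see Python's lack of support directly: its re module rejects a pattern that refers to a group before that group is defined. A minimal sketch, where the helper name compiles and the forward-reference pattern (\2two|(one))+ are chosen for illustration:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# An ordinary backreference is accepted.
backreference_ok = compiles(r"(one)\1")
print(backreference_ok)  # True

# \2 appears before group 2 is defined, so Python refuses the pattern.
forward_reference_ok = compiles(r"(\2two|(one))+")
print(forward_reference_ok)  # False
```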

These are all the basics of Regular Expression. Hope you enjoy learning it. :)
