Demystify Regular Expressions Part 1

A pragmatic approach to Regular Expressions

Sagar Suri
Analytics Vidhya
7 min read · Dec 1, 2019


Ever wanted to find every email address in a long paragraph with just one special string? Or every word that starts with a vowel, or maybe every URL? Chances are your answer is yes. The next question is how a single string can match so many different email addresses or URLs. Regular expressions come to the rescue.

Regular expressions are an essential skill for any developer. They let you define text patterns that can be used for a variety of tasks: validating data submitted by users, extracting parts of longer strings, and converting data into a new format. They even make it easier to search your own code.

Photo by Markus Spiske on Unsplash

Goal

This is the first part of a series. The aim of the series is to help you understand how regular expressions work, so that by the end you type less and write more robust search queries. In this part, you will learn the basic concepts of regular expressions and work on a small challenge. Throughout the series, you will use the site RegExr to validate and test all the regular expressions you create.

regexr.com

Table of contents

  1. Metacharacters
  2. Wildcard Metacharacter
  3. Escaping Metacharacters
  4. Special Characters
  5. Challenges

Metacharacters

Before I talk about metacharacters, let’s begin with the simplest match of all: a literal character. If you want to search for the character “a” in a sentence, simply write “a” in the Expression section of RegExr and the sentence in the Text section. You should see the following output:

Here /a/ is a regular expression (in JavaScript, the forward slashes mark the start and end of a regular expression), and it matches the character “a” in the word “want”. But there are two more occurrences of the same character, and they were not matched. To do a global search and find all occurrences, turn on the global flag from the drop-down menu in the right corner. Once the flag is turned on, you will see the following output:

global flag turned on
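
The same behaviour can be reproduced outside RegExr. The following is a minimal sketch in JavaScript using the built-in String.prototype.match method; the sample sentence is an assumption chosen so that “a” appears three times, and it is not the sentence from the screenshot.

```javascript
// Assumed sample text (hypothetical): "a" appears three times.
const sampleText = "I want a pizza";

// Without the global flag, match() stops at the first occurrence.
const firstOnly = sampleText.match(/a/);
console.log(firstOnly[0], firstOnly.index); // a 3

// With the global flag (g), match() returns every occurrence.
const allMatches = sampleText.match(/a/g);
console.log(allMatches.length); // 3
```

The g after the closing slash plays the same role as the global flag in the RegExr menu.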

Let’s try another one. Write a regular expression that will search for the word “play” in this sentence:

I want to buy a PlayStation to play exclusive games and experience smooth gameplay.

Here is the result:

Did you notice something in the above search result? The regex failed to match “Play” in the word PlayStation. Regular expressions are case-sensitive by default; to change this, turn on the case-insensitive flag from the same drop-down menu. Now the result will look like the following:

case-insensitive
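
As a rough illustration, here is the same search written in JavaScript, with the sentence from above. The variable names are only placeholders for this sketch.

```javascript
const sentence =
  "I want to buy a PlayStation to play exclusive games and experience smooth gameplay.";

// Case-sensitive search: the "Play" in PlayStation is skipped.
const caseSensitive = sentence.match(/play/g);
console.log(caseSensitive); // [ 'play', 'play' ]

// Adding the i flag makes the search case-insensitive.
const caseInsensitive = sentence.match(/play/gi);
console.log(caseInsensitive); // [ 'Play', 'play', 'play' ]
```

Note that the second lowercase match comes from inside the word gameplay, since the pattern does not care about word boundaries.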

Now let’s talk about metacharacters. A metacharacter is a character with a special meaning. Metacharacters can be used to transform literal characters into powerful expressions. There are only a few metacharacters for you to learn. Let’s understand some of them with examples.

Wildcard Metacharacter

The wildcard is the most commonly used metacharacter in regular expressions. It is denoted by the . (dot) symbol and matches any single character except a newline. Let’s try it out: put a . (dot) in the Expression section of RegExr and type anything in the Text section. You will see the following result:

wildcard metacharacter
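
To see the “except a newline” part in action, here is a small JavaScript sketch. The two-line sample string is an assumption made for this example.

```javascript
// Assumed sample: two short lines separated by a newline character.
const twoLines = "Hi!\nok";

// The dot matches every character, one at a time, but skips the newline.
const dotMatches = twoLines.match(/./g);
console.log(dotMatches); // [ 'H', 'i', '!', 'o', 'k' ]
console.log(dotMatches.includes("\n")); // false
```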

As you can see above, it matches every character in the sentence. Now write a regular expression that matches get, got and git. Let’s break the solution into steps:

  1. Open RegExr and write those three words in the Text section.
  2. Find the characters that all three words have in common. They are g and t. Write those characters in the Expression section first.
  3. As you have learned, the wildcard metacharacter matches any single character. Put it between g and t in the Expression section. Now all three words are highlighted.
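
The steps above can be sketched in JavaScript as follows, assuming the three words are separated by spaces:

```javascript
const words = "get got git";

// g, then any single character, then t.
const gDotT = words.match(/g.t/g);
console.log(gDotT); // [ 'get', 'got', 'git' ]
```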

Try out

Add the word great to the Text section and see if it gets highlighted. It doesn’t, because the wildcard matches only a single character; that is important to remember. To match the word great, you need /g...t/ as the regex, with one wildcard for each of the three characters between g and t. In an upcoming part of this series, you will learn about repetition and how to avoid adding a wildcard for every character in the word.
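
A quick way to check this is the test method of a JavaScript regular expression, which returns true or false. This is only a sketch of the idea:

```javascript
// One wildcard stands for exactly one character.
const oneDot = /g.t/.test("great");
console.log(oneDot); // false

// Three wildcards cover the three characters r, e and a.
const threeDots = /g...t/.test("great");
console.log(threeDots); // true

// The longer pattern no longer matches the three-letter words.
const shortWord = /g...t/.test("get");
console.log(shortWord); // false
```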

A common mistake with wildcard

The wildcard is probably the metacharacter you will use most often, and it is also behind the most common mistake people make. Open RegExr and write 109, 159 and 1.9 in the Text section. Now write a regular expression that matches only 1.9.

It looks easy, and you will most probably end up writing /1.9/ as the regex. But did you notice that it matched 109 and 159 as well as 1.9? Do you know why? Here the . (dot) is a wildcard metacharacter, so it matches any character between 1 and 9. To solve this problem, escape the . (dot) so that it behaves like a literal character instead of a wildcard. We will talk more about escaping characters in the next section. For now, add a backslash (itself a metacharacter) in front of the . (dot), giving /1\.9/. The regex now matches only 1.9 and none of the other numbers.
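
Here is the mistake and its fix side by side, as a small JavaScript sketch with the three numbers separated by spaces:

```javascript
const numbers = "109 159 1.9";

// Unescaped dot: a wildcard, so 0 and 5 are accepted between 1 and 9.
const unescaped = numbers.match(/1.9/g);
console.log(unescaped); // [ '109', '159', '1.9' ]

// Escaped dot: only a literal "." is accepted between 1 and 9.
const escaped = numbers.match(/1\.9/g);
console.log(escaped); // [ '1.9' ]
```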

https://giphy.com/gifs/jn1QTk0nnlojMjyvqa/html5

Escaping Metacharacters

In the previous section, you converted the wildcard metacharacter into a literal character using another metacharacter, the backslash “\”. A backslash turns the metacharacter that follows it into a literal character, e.g. \. or \+ or \? and so on. Remember to escape only metacharacters and never ordinary literal characters, because a backslash can give a literal character a different meaning: \d and \w are not the literal characters d and w anymore. Instead, \d looks for a digit and \w looks for a word character.
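
The two directions of escaping can be illustrated with a short JavaScript sketch. Both sample strings are assumptions made up for this example.

```javascript
// Escaping a metacharacter makes it literal.
const sum = "2+2=4?";
const plusSigns = sum.match(/\+/g);
const questionMarks = sum.match(/\?/g);
console.log(plusSigns, questionMarks); // [ '+' ] [ '?' ]

// Escaping a literal character gives it a special meaning.
const mixed = "day 42";
const literalD = mixed.match(/d/g); // the letter d itself
const digits = mixed.match(/\d/g); // any digit
const wordChars = mixed.match(/\w/g); // letters and digits, but not the space
console.log(literalD); // [ 'd' ]
console.log(digits); // [ '4', '2' ]
console.log(wordChars); // [ 'd', 'a', 'y', '4', '2' ]
```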

Here is a small challenge for you. Write a regular expression that matches these two filenames: starwar1.mp4 and starwar2.mp4, but not starwar3_mp4.rar. If you got it right on the first attempt, that is really great, and if you didn’t, no problem. Check the following gif for the solution:

If for some technical reason you are not able to see the gif then the correct regex for the above challenge is: starwar.\.mp4
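
To verify that regex without RegExr, here is a minimal JavaScript sketch that runs it against the three filenames:

```javascript
const fileNames = "starwar1.mp4 starwar2.mp4 starwar3_mp4.rar";

// The first dot is a wildcard for the episode number;
// the escaped dot must be a literal "." before mp4.
const videoFiles = fileNames.match(/starwar.\.mp4/g);
console.log(videoFiles); // [ 'starwar1.mp4', 'starwar2.mp4' ]
```

The third filename is rejected because the character after starwar3 is an underscore, not a literal dot.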

Special Characters

In this section, you will learn how to use special characters for spaces, tabs, and newlines. Let’s start with spaces. Remember that a space is a character too. You can match it by escaping the literal character ‘s’, i.e. \s, which matches any whitespace character, including spaces, tabs, and newlines. Make sure that you use the lowercase ‘s’, because the uppercase \S matches the opposite: any character that is not whitespace. Here is a small example:

space character
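
A comparable example in JavaScript, using made-up sample strings, might look like this. It also shows that \s is not limited to the space character:

```javascript
// Two spaces separate the three words.
const spaced = "buy a game";
const spaceCount = spaced.match(/\s/g).length;
console.log(spaceCount); // 2

// \s also matches other whitespace, such as a tab and a newline.
const otherWhitespace = "a\tb\nc".match(/\s/g).length;
console.log(otherWhitespace); // 2
```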

If you want to match a tab, use \t. Here the character t is no longer a literal character, because you escaped it with the backslash metacharacter. \t is often referred to as a control character. Let’s see how this control character is used to search for tabs within strings:

control character

In the above gif, you can see that a tab is denoted with a → symbol. Here is something interesting: if you type several consecutive spaces in place of a tab, \t won’t match them, because a run of spaces is not a tab character.
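
The difference between a real tab and a run of spaces can be checked with a few lines of JavaScript. The two strings below are assumed samples for this sketch.

```javascript
const withTab = "name\tage"; // a real tab character between the words
const withSpaces = "name    age"; // four spaces that only look like a tab

const tabFound = /\t/.test(withTab);
const tabInSpaces = /\t/.test(withSpaces);
console.log(tabFound); // true
console.log(tabInSpaces); // false
```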

Next is matching a new line, for which you will use the special character \n. Again, you are escaping the literal character ‘n’. This is also referred to as the LINE FEED character. Please note that line endings differ between environments: Linux uses \n for a new line, Windows uses \r\n, and old Macs used \r. So there are multiple ways to write a newline, and which one you meet depends on where the text comes from. Here is a small demonstration of how and when to use \n:

In the above gif, you can see that a new line is denoted with a down arrow.
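
As a rough sketch of the same idea in JavaScript, the snippet below builds one string with a Linux-style line ending and one with a Windows-style line ending. Both strings are assumptions made for this example.

```javascript
const unixText = "first line\nsecond line";
const windowsText = "first line\r\nsecond line";

// One line feed separates the two lines.
const lineFeeds = unixText.match(/\n/g).length;
console.log(lineFeeds); // 1

// The Windows-style ending only appears in the second string.
const crlfInWindows = /\r\n/.test(windowsText);
const crlfInUnix = /\r\n/.test(unixText);
console.log(crlfInWindows, crlfInUnix); // true false
```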

Challenges

In this tutorial, you were introduced to many small and interesting concepts about regular expressions. Before you leave, here is a small challenge that touches all of them and helps you make sure the concepts have sunk in.

  1. Write a regular expression that will match words ending with mail.com:
    gmail.com, hotmail.com, yoomail.com, s@rmail_com.com
    Note: Remove the commas before you write the regex.
  2. Write a regular expression that will match the following words:
    SaGaR, sagar, SAGAR
    Note: Remove the commas before you write the regex.
  3. Write a regular expression that will match all the occurrences of the word Alex (case-insensitive):
    His name is Alex. ALeX is a dev! ALEX love writing blogs and alexa is an assistant.
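
Spoiler warning: try the challenges yourself first. If you want something to compare against, the JavaScript sketch below shows one possible set of answers that uses only the concepts from this part. These are not the official solutions, and other patterns may work just as well.

```javascript
// Challenge 1: escape the dot so that "mail_com" is not accepted.
const domains = "gmail.com hotmail.com yoomail.com s@rmail_com.com";
const mailMatches = domains.match(/mail\.com/g);
console.log(mailMatches.length); // 3

// Challenge 2: literal characters plus the case-insensitive flag.
const names = "SaGaR sagar SAGAR";
const nameMatches = names.match(/sagar/gi);
console.log(nameMatches); // [ 'SaGaR', 'sagar', 'SAGAR' ]

// Challenge 3: global and case-insensitive flags together.
const story =
  "His name is Alex. ALeX is a dev! ALEX love writing blogs and alexa is an assistant.";
const alexMatches = story.match(/alex/gi);
console.log(alexMatches); // [ 'Alex', 'ALeX', 'ALEX', 'alex' ]
```

In the first answer, only the mail.com part of each word is highlighted, because matching a whole word of unknown length needs repetition, which has not been covered in this part. In the third answer, the last match is the alex inside alexa.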

What’s next?

I hope you enjoyed the first part of this regular expression series. In the next part, I will cover many more topics that will bring you closer to becoming a regex expert.

You can tweet your answers to the above challenges and mention me in the tweet so that I can review them and discuss whether they can be improved 😄. You can follow me on Twitter and connect through LinkedIn. If you like this article, do appreciate it with a few 👏 👏 😄.

Sagar Suri
Analytics Vidhya

Google certified Android app developer | Flutter Developer | Computer Vision Engineer | Gamer