Regular Expression(s)-Pattern Details

Rakesh Shinde
Published in Globant
Sep 16, 2022

What is a Regular Expression?

A regular expression (also called a rational expression) is a search pattern built from a sequence of characters. It lets you find text that fits the pattern within a string and, if needed, replace each match with a string of your own.

Where are Regular Expressions used?

Regular expressions are used in search engines, in the search-and-replace dialogs of word processors and text editors, in text-processing utilities such as sed and AWK, and in lexical analysis.

Examples:

1. Search and Replace in a Text Editor:

The most straightforward application is searching for a given text in your text editor, for example to replace all occurrences of the customer name 'Max Power' with 'Max Power, Ph.D.'.
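The same search-and-replace idea can be sketched in code. This is a minimal illustration that assumes Python's built-in re module (the article itself does not name a language); the sample sentence is made up for the example.

```python
import re

# Hypothetical sample text; only the customer name comes from the article.
text = "Max Power called today; please send Max Power the invoice."

# re.sub replaces every non-overlapping match of the pattern.
updated = re.sub(r"Max Power", "Max Power, Ph.D.", text)
print(updated)
```

Both occurrences of the name are replaced in a single call.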

2. Validate User Input in Web Applications:

If you’re running a web application, you need to deal with user input. Your application must validate that the input is well formed; otherwise malformed input may crash your backend application or database.

Figure: validating a user input field

3. Regular Expressions for Web Scraping (Data Collection):

Data collection is a very common part of a data scientist’s work, and in the age of the internet it is easier than ever to find data on the web. One can scrape websites such as Wikipedia to collect or generate data.

Figure: web scraping

Regular expressions are particularly useful for defining filters.

How to make patterns?

Before using any library, we have to understand how patterns are built and what each special character means. Here I will show you how to make your own patterns.

Patterns are built from a handful of special characters; I will define each of them one by one.

Characters:

1. Repeaters *, + and { }

These symbols act as repeaters: they tell the regex engine that the preceding character may occur more than once.

1.1 Asterisk * symbol: The asterisk is a repeater symbol meaning that the preceding character can occur 0 or more times.

Example: sho*t
The * symbol is placed after the ‘o’ character, so the pattern matches a string that starts with ‘sh’, continues with ‘o’ repeated 0 or more times, and ends with ‘t’.
The above pattern matches the following strings.

sht (‘o’ appears 0 times)

shoot (‘o’ appears 2 times)

..

shooooooot (‘o’ appears 7 times)

shooo……t (‘o’ appears many times)
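To check the list above, here is a small sketch assuming Python's re module. It uses re.fullmatch so that the whole string must fit the pattern, which mirrors the "starts with ... ends with" wording; the extra candidate "shut" is added only as a counterexample.

```python
import re

pattern = re.compile(r"sho*t")

# Candidates from the text, plus "shut" as a string that should not match.
candidates = ["sht", "shoot", "sh" + "o" * 7 + "t", "shut"]

# fullmatch returns a match object only if the entire string fits the pattern.
star_matches = [s for s in candidates if pattern.fullmatch(s)]
print(star_matches)
```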

1.2 Plus + symbol: The plus symbol is also a repeater symbol, but it means that the preceding character must occur 1 or more times.

Example: co+l

The + symbol is placed after the ‘o’ character, so the pattern matches a string that starts with ‘c’, continues with ‘o’ repeated 1 or more times, and ends with ‘l’.

Let's check the pattern against the following strings.

cl (‘o’ appears 0 times): does not match

col (‘o’ appears 1 time)

cool (‘o’ appears 2 times)

..

coooooool (‘o’ appears 7 times)

cooo……l (‘o’ appears many times)
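The difference from * shows up on the string with zero occurrences. A minimal sketch, again assuming Python's re module and whole-string matching with re.fullmatch:

```python
import re

pattern = re.compile(r"co+l")

candidates = ["cl", "col", "cool", "c" + "o" * 7 + "l"]

# "cl" is rejected because + requires at least one 'o'.
plus_matches = [s for s in candidates if pattern.fullmatch(s)]
print(plus_matches)
```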

1.3 Curly braces {n}, {min,} or {min,max}: The braces tell the regex engine to repeat the preceding character (or group of characters) as many times as the values inside the braces specify.

Example:

{3} means exactly 3 times ({3,} means at least 3 times)

{2,5} means at least 2 times and at most 5 times

Pattern: ba{2,5}d

Here {2,5} is placed after the ‘a’ character. Note that there is no space inside the braces. The pattern matches a string that starts with ‘b’, continues with ‘a’ repeated 2 to 5 times, and ends with ‘d’.

Let's check the pattern against the following strings.

bd (does not match: ‘a’ must appear at least 2 times)

bad (does not match: ‘a’ must appear at least 2 times)

baad (‘a’ occurs 2 times)

baaad (‘a’ occurs 3 times)

baaaad (‘a’ occurs 4 times)

baaaaad (‘a’ occurs 5 times)

baaaaaad (does not match: ‘a’ may appear at most 5 times)
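The bounds can be verified by generating strings with 0 to 7 ‘a’ characters. This is a sketch assuming Python's re module; it also contrasts {3} (exactly three) with {3,} (three or more).

```python
import re

pattern = re.compile(r"ba{2,5}d")

# Collect the repetition counts n for which "b" + n * "a" + "d" matches.
accepted_counts = [n for n in range(8) if pattern.fullmatch("b" + "a" * n + "d")]
print(accepted_counts)

# {3} is an exact count, {3,} is an open-ended minimum.
exactly_three = bool(re.fullmatch(r"a{3}", "aaaa"))
at_least_three = bool(re.fullmatch(r"a{3,}", "aaaa"))
print(exactly_three, at_least_three)
```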

2. Wild Card “ . ”:

The symbol “ . ” (dot) acts as a placeholder for a single character. Since any character (except a newline, by default in most engines) can occupy its position, it is called the wildcard character.

The pattern h.t tells the engine that any character can take the place of the dot. It matches hit, hot, het and other similar strings.

We can combine the wildcard with other characters, such as the repeaters.

Example: b.*d

This pattern matches a string that starts with ‘b’, is followed by any characters 0 or more times, and ends with ‘d’.

So any sequence of characters can appear between ‘b’ and ‘d’.

bd (because of *, the wildcard occurs 0 times)

bad (‘a’ takes the place of ‘.’, which occurs 1 time)

bed (‘e’ takes the place of ‘.’, which occurs 1 time)

baaad (‘a’ takes the place of ‘.’, which occurs 3 times)

... and so on: beed, boood, biiiiiid, bid, etc.

Example: b.+d

This pattern matches a string that starts with ‘b’, is followed by any characters 1 or more times, and ends with ‘d’.

bd (does not match: because of +, at least one character is required between ‘b’ and ‘d’)

bad (‘a’ takes the place of ‘.’, which occurs 1 time)

bed (‘e’ takes the place of ‘.’, which occurs 1 time)

baaad (‘a’ takes the place of ‘.’, which occurs 3 times)

... and so on: beed, boood, biiiiiid, bid, etc.
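The two wildcard patterns differ only on "bd". A short sketch, assuming Python's re module, that also shows the dot refusing a newline unless the re.DOTALL flag is passed:

```python
import re

candidates = ["bd", "bad", "bed", "baaad", "bid"]

# .* allows zero characters between 'b' and 'd'; .+ needs at least one.
star_ok = [s for s in candidates if re.fullmatch(r"b.*d", s)]
plus_ok = [s for s in candidates if re.fullmatch(r"b.+d", s)]
print(star_ok)
print(plus_ok)

# By default the dot does not match a newline character.
newline_default = re.fullmatch(r"h.t", "h\nt")
newline_dotall = re.fullmatch(r"h.t", "h\nt", flags=re.DOTALL)
print(newline_default is None, newline_dotall is not None)
```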

3. Optional Character “ ? ”:

The symbol “ ? ” tells the engine that the preceding character may or may not occur in the string. For example, the pattern goods? says that the ‘s’ character is optional, so both “good” and “goods” match the pattern.

In the example below, let's combine “?” with some other characters for better understanding.

Pattern: ba*ds?

This pattern matches a string that starts with ‘b’, continues with ‘a’ repeated 0 or more times, then has a ‘d’, and ends with an optional ‘s’ that may or may not be present.

The above pattern matches the following strings.

bds (‘a’ occurs 0 times and ‘s’ is present)

bad (‘a’ occurs 1 time and ‘s’ is absent)

baaads (‘a’ occurs 3 times and ‘s’ is present)

... and so on.
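A quick check of the optional character, sketched with Python's re module (an assumption, as the article names no language). The string "badss" is an added counterexample: ? allows at most one ‘s’.

```python
import re

pattern = re.compile(r"ba*ds?")

candidates = ["bds", "bad", "baaads", "bd", "badss"]

# "badss" fails because the optional 's' may appear at most once.
optional_matches = [s for s in candidates if pattern.fullmatch(s)]
print(optional_matches)
```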

4. Caret symbol ( ^ ):

The caret ( ^ ) symbol tells the engine to check whether the pattern is present at the beginning of the string.

For example, the pattern ^a{2}b means ‘a’ occurs 2 times followed by ‘b’, so it checks whether “aab” is present at the start of the string.

Now we will check the pattern against the strings below to see which ones match.

aa (“aab” is not present at the start of the string)

aab (“aab” is present at the start of the string)

aabaac (“aab” is present at the start of the string)

saab (“aab” is not present at the start of the string)
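Because the caret only anchors the start, the rest of the string is free to continue. The sketch below assumes Python's re module and uses re.search, which scans the string instead of requiring a whole-string match.

```python
import re

pattern = re.compile(r"^a{2}b")

candidates = ["aa", "aab", "aabaac", "saab"]

# search succeeds only when "aab" sits at the very start of the string.
start_matches = [s for s in candidates if pattern.search(s)]
print(start_matches)
```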

5. Dollar Sign “ $ ”

The dollar “$” symbol tells the engine to check whether the pattern is present at the end of the string. Unlike the caret, it is written at the end of the pattern.

For example, the pattern a{2}b$ means ‘a’ occurs 2 times followed by ‘b’, so it checks whether “aab” is present at the end of the string.

Now we will check the pattern against the strings below to see which ones match.

aaa (“aab” is not present at the end of the string)

aab (“aab” is present at the end of the string)

aacaab (“aab” is present at the end of the string)

saab (“aab” is present at the end of the string)
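The end anchor can be checked the same way. A minimal sketch assuming Python's re module, with the $ placed at the end of the pattern:

```python
import re

pattern = re.compile(r"a{2}b$")

candidates = ["aaa", "aab", "aacaab", "saab"]

# search succeeds only when the string ends with "aab".
end_matches = [s for s in candidates if pattern.search(s)]
print(end_matches)
```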

6. Square Brackets [ ]

The symbol [ ] tells the engine to match any single character that is listed inside the brackets.

For example, the pattern [xyz] matches only the characters ‘x’, ‘y’ and ‘z’.

Pattern: a[xyz]*

This pattern matches a string that starts with ‘a’, followed by ‘x’, ‘y’ or ‘z’ repeated 0 or more times.

Let's check the pattern against the following strings:

a (a character from [xyz] occurs 0 times)

ay (‘a’ is at the start and ‘y’ is in [xyz])

ayyy (‘a’ is at the start and ‘y’, which is in [xyz], occurs 3 times)
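A short check of the character set, assuming Python's re module. The strings "axzy" and "ab" are added for illustration: the repeated characters may differ from each other, but each must come from the set.

```python
import re

pattern = re.compile(r"a[xyz]*")

candidates = ["a", "ay", "ayyy", "axzy", "ab"]

# "ab" fails because 'b' is not one of x, y, z.
set_matches = [s for s in candidates if pattern.fullmatch(s)]
print(set_matches)
```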

We can also specify a range inside [ ].

7. Square Brackets [ ] with Hyphen Symbol “-”

A hyphen inside the brackets specifies a range. For example, [0-9] matches any single digit from 0 to 9.

[a-z] matches any character from “a” to “z”.

[a-i] matches any character from “a” to “i”.

[a-z1-7] matches any character from “a” to “z” or from 1 to 7.

[a-z0-9] matches any character from “a” to “z” or from 0 to 9.
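Ranges behave like a shorthand for listing every character. The sketch below assumes Python's re module; the test strings are invented for the example, and a + repeater is appended so that whole words can be checked.

```python
import re

lower_or_digit = re.compile(r"[a-z0-9]+")

candidates = ["abc123", "ABC", "a_b", "7up"]

# Uppercase letters and the underscore fall outside both ranges.
range_matches = [s for s in candidates if lower_or_digit.fullmatch(s)]
print(range_matches)

# [a-i] covers 'a' through 'i' only, so 'k' is rejected.
c_in_range = bool(re.fullmatch(r"[a-i]", "c"))
k_in_range = bool(re.fullmatch(r"[a-i]", "k"))
print(c_in_range, k_in_range)
```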

8. Square Bracket with Negation [^]

It is the opposite of the [ ] symbol.

It matches any single character except the ones listed inside the square brackets. For example, the pattern a[^abc]d means that the string should start with ‘a’, followed by one character other than ‘a’, ‘b’ and ‘c’, and end with ‘d’.

Pattern: a[^abc]d

As explained above, let's see which strings match the pattern.

abd (does not match: ‘b’ is listed in [^abc], which excludes ‘a’, ‘b’ and ‘c’)

avd (‘a’ is the starting character, ‘v’ is not listed in [^abc] and ‘d’ is the ending character)
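A negated set still consumes exactly one character, which the added counterexample "ad" makes visible. This is a sketch assuming Python's re module.

```python
import re

pattern = re.compile(r"a[^abc]d")

candidates = ["abd", "avd", "a1d", "ad"]

# "abd" has an excluded middle character; "ad" has no middle character at all.
negated_matches = [s for s in candidates if pattern.fullmatch(s)]
print(negated_matches)
```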

9. Grouping Characters ( )

Parentheses group part of a regular expression into a unit to which we can apply other operators.

Pattern: (a[^abc])*

Here we have applied ( ) to our pattern, which groups it into one unit. We can then apply further operations to it.

For example, the pattern a[^abc] matches the character “a” followed by any one character other than “a”, “b” and “c”.

Then we apply ( ) to group it.

Then we apply * to this group, which says that the group a[^abc] can repeat 0 or more times.

Hence it matches the following strings:

“” (a[^abc] occurs 0 times)

ad (a[^abc] occurs 1 time)

ahay (a[^abc] occurs 2 times, “ah” and “ay”)
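The repeated group can be checked as follows, assuming Python's re module. The strings "ab" and "aha" are added counterexamples: the first contains an excluded character, the second ends with an incomplete group.

```python
import re

pattern = re.compile(r"(a[^abc])*")

candidates = ["", "ad", "ahay", "ab", "aha"]

# Every repetition must be a complete two-character unit.
group_matches = [s for s in candidates if pattern.fullmatch(s)]
print(group_matches)
```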

10. Vertical Bar ( | )

Matches any one of the alternatives separated by the vertical bar (|) character. The vertical bar acts as an OR condition.

Example: th(e|is|at) will match the words the, this and that.
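A small sketch of alternation, assuming Python's re module and whole-string matching; "thus" and "then" are added to show strings that none of the alternatives cover.

```python
import re

pattern = re.compile(r"th(e|is|at)")

candidates = ["the", "this", "that", "thus", "then"]

# Exactly one of the alternatives must follow "th" and complete the string.
alt_matches = [s for s in candidates if pattern.fullmatch(s)]
print(alt_matches)
```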

11. Character Classes

A character class allows you to match any symbol from a certain character set. A character class is also called a character set. The shorthand classes are written with a backslash.

For example:

\s : matches any whitespace character, such as a space or tab
\S : matches any non-whitespace character
\d : matches any digit character
\D : matches any non-digit character
\w : matches any word character (letters, digits and the underscore)
\W : matches any non-word character
\b : matches a word boundary, that is, the position between a word character and a non-word character (such as a space, dash, comma or semicolon) or the start or end of the string; it does not consume any characters

Example

Pattern: a\d+

The pattern matches a string that starts with “a”, followed by 1 or more digits.

So let's check the following strings:

a2 (2 is a digit and occurs 1 time)

a44 (4 is a digit and occurs 2 times)

s11 (does not match: ‘a’ is absent at the start of the string)

a121312 (digits occur 6 times)

Pattern: a\s+d

The pattern matches a string that starts with “a”, followed by 1 or more whitespace characters, and ends with “d”.

So let's check the following strings:

a d (whitespace is present between the letters a and d)

ad (does not match: no whitespace is present between the letters a and d)
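The shorthand classes can be tried out as below. This is a sketch assuming Python's re module; the sentence used for the \b example is made up, and raw strings (r"...") are used so that the backslashes reach the regex engine unchanged.

```python
import re

digit_pattern = re.compile(r"a\d+")
digit_matches = [s for s in ["a2", "a44", "s11", "a121312"] if digit_pattern.fullmatch(s)]
print(digit_matches)

space_pattern = re.compile(r"a\s+d")
# A tab also counts as whitespace for \s.
space_matches = [s for s in ["a d", "a \t d", "ad"] if space_pattern.fullmatch(s)]
print(space_matches)

# \b matches a position, so only the standalone word "cat" is found.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_words)
```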
