Regex - Regular Expression

Making data scraping easier

Published in featurepreneur, Jun 14, 2023

Regular expressions are powerful tools for pattern matching and text manipulation, allowing developers to search, validate, and extract data from strings with ease. Many rule-based chatbots use regular expressions to recognize what a user's message is asking and reply accordingly. They are also used for data scraping, cleaning, and validation.

Getting started with RE

To use regular expressions in a Python script, import the built-in re module.

import re

As mentioned earlier, re is mainly used for pattern matching. You supply the content to search and a pattern describing the data you want. Given these two values, a function such as re.findall scans the whole content and returns every piece of text that fits the pattern.
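As a minimal sketch of that idea, the snippet below runs one pattern over a short string with three functions from re: re.findall returns every match, re.search returns only the first, and re.fullmatch checks whether an entire string fits the pattern. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
content = "Room 12 is on floor 3, room 45 is on floor 7"
pattern = r"[0-9]+"  # one or more digits in a row

# findall scans the whole string and returns every match as a list
all_numbers = re.findall(pattern, content)

# search stops at the first match and returns a match object (or None)
first = re.search(pattern, content)
first_number = first.group() if first else None

# fullmatch succeeds only if the entire string fits the pattern
is_number = re.fullmatch(pattern, "2023") is not None
mixed_is_number = re.fullmatch(pattern, "20a23") is not None

print(all_numbers)                 # ['12', '3', '45', '7']
print(first_number)                # 12
print(is_number, mixed_is_number)  # True False
```

re.findall suits scraping, while re.fullmatch is the more natural choice when validating a single input value.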

How to set the Pattern?

The pattern of the data to be scraped is written using pattern tokens. Pattern tokens are special characters that describe the kind of data we want to scrape.

Commonly used tokens

. – Any character except a newline (by default)
\n – Newline character
\t – Tab character
\s – Any whitespace character (including \t, \n and a few others)
\S – Any non-whitespace character
\w – Any word character (letters, digits 0-9, and _)
\W – Any non-word character (the inverse of the \w token)
\d – Any digit
\D – Any non-digit
\b – Word boundary: the position between a \w character and a \W character (or the start or end of the string); it matches a position, not a character
\B – Non-word boundary: the inverse of \b
^ – The start of the string (or of each line when the re.MULTILINE flag is set)
$ – The end of the string (or of each line when the re.MULTILINE flag is set)
\\ – The literal character “\”
[A-Z] – Match any uppercase character from “A” to “Z”
[a-z] – Match any lowercase character from “a” to “z”
[0-9] – Match any digit
[asdf] – Match any character that’s either “a”, “s”, “d”, or “f”
[^asdf] – Match any character that’s not any of the following: “a”, “s”, “d”, or “f”
[0-9A-Z] – Match any character that’s either a digit or a capital letter from “A” to “Z”
[^a-z] – Match any character that’s not a lowercase letter
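To make a few of these tokens concrete, here is a small sketch that applies them to one invented string. The sample text and variable names are assumptions for illustration; the + after a token means "one or more of the preceding token".

```python
import re

# Hypothetical sample string, used only for illustration (\t is a real tab)
sample = "Order_7 shipped\tto Box 42, Zone B."

digits = re.findall(r"\d", sample)         # every single digit
words = re.findall(r"\w+", sample)         # runs of word characters
capitals = re.findall(r"[A-Z]", sample)    # uppercase letters only
symbols = re.findall(r"[^\w\s]", sample)   # neither word character nor whitespace
standalone_to = re.findall(r"\bto\b", sample)  # "to" as a whole word

print(digits)         # ['7', '4', '2']
print(words)          # ['Order_7', 'shipped', 'to', 'Box', '42', 'Zone', 'B']
print(capitals)       # ['O', 'B', 'Z', 'B']
print(symbols)        # [',', '.']
print(standalone_to)  # ['to']
```

Writing each pattern as a raw string (the r prefix) keeps Python from interpreting the backslashes before re sees them.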

Example patterns

1. Identifying phone numbers

For this example, assume a phone number is always a 10-digit number. From the tokens above, we know that a digit is represented by ‘\d’, therefore the pattern for a phone number is \d\d\d\d\d\d\d\d\d\d. It can be simplified to \d{10}, where {10} means “repeat the previous token exactly 10 times”.

import re
content = "John's phone number is 9856272545 and Dom's number is 8743891002"
# Raw string (r'...') so Python passes the backslash to re unchanged
pattern = r'\d{10}'

matches = re.findall(pattern, content)
print(matches)

output : ['9856272545', '8743891002']
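One caveat: \d{10} also matches the first ten digits of a longer run of digits. A hedged sketch of one way to guard against that, assuming numbers are separated from the surrounding text by non-word characters, is to wrap the pattern in the \b token from the list above. The 12-digit reference number here is an invented example.

```python
import re

# Invented text: the second number has 12 digits and is not a phone number
content = "Call 9856272545 and quote reference 123456789012"

loose = re.findall(r"\d{10}", content)       # also grabs part of the reference
strict = re.findall(r"\b\d{10}\b", content)  # exactly ten digits, no more

print(loose)   # ['9856272545', '1234567890']
print(strict)  # ['9856272545']
```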

What if phone numbers are given in the western format, i.e. (999)-999-9999? ‘(’ is already a token (it opens a group), so to avoid confusion we put a backslash in front of it so that it is treated as a literal character. Therefore the pattern for identifying the phone number is ‘\(\d{3}\)-\d{3}-\d{4}’. Here the brackets have a backslash in front to signal that they are not tokens: the pattern is an open bracket, three digits, and a close bracket, followed by a hyphen, three digits, another hyphen, and four digits. A hyphen is used without a backslash because it is not a token outside square brackets.

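import re

# Adjacent string literals inside parentheses are joined into one string
text = ("This is my phone number 1234567890 find it using regular expression. "
        "What if the number is in the format (123)-456-7890.")

# No spaces around '|': a space inside a pattern matches a literal space
pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

matches = re.findall(pattern, text)
print(matches)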

In the code above, the two patterns are separated by ‘|’. This symbol represents the ‘or’ operation: text is extracted if it matches either \d{10} or \(\d{3}\)-\d{3}-\d{4}. Spaces inside a pattern are matched literally, so no spaces should be placed around ‘|’.
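Once both formats are matched, it is often convenient to store them in one form. The sketch below is one possible follow-up step, not part of the original example: it uses re.sub with the \D token to strip every non-digit from each match. The helper name normalize and the sample text are hypothetical.

```python
import re

# Hypothetical sample text containing both phone number formats
text = "Reach me at 1234567890 or (123)-456-7890."
pattern = r"\d{10}|\(\d{3}\)-\d{3}-\d{4}"

def normalize(number: str) -> str:
    """Remove every non-digit character so both formats look the same."""
    return re.sub(r"\D", "", number)

matches = re.findall(pattern, text)
normalized = [normalize(m) for m in matches]

print(matches)     # ['1234567890', '(123)-456-7890']
print(normalized)  # ['1234567890', '1234567890']
```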

2. Order number

Let's say a specific e-commerce company creates order numbers in the form #AB followed by the day, month, state initial, and a three-digit number, for example: #AB3108TN123. The regular expression for this pattern is #AB\d{4}[A-Z]+\d{3}, where [A-Z]+ means “one or more uppercase letters”.

import re

text = "My order number is #AB3108TN123, please track my order"

pattern = r'#AB\d{4}[A-Z]+\d{3}'

matches = re.findall(pattern, text)
print(matches)

output : ['#AB3108TN123']
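If the individual parts of the order number are needed, capture groups can pull them out in the same pass. This is a sketch under the assumption that the four digits are day then month and that the state initial is exactly two letters; the group names are invented for illustration.

```python
import re

text = "My order number is #AB3108TN123, please track my order"

# (?P<name>...) defines a named group; the names here are illustrative
pattern = (
    r"#AB(?P<day>\d{2})(?P<month>\d{2})"
    r"(?P<state>[A-Z]{2})(?P<serial>\d{3})"
)

match = re.search(pattern, text)
# groupdict() maps each group name to the text it captured
parts = match.groupdict() if match else {}

print(parts)  # {'day': '31', 'month': '08', 'state': 'TN', 'serial': '123'}
```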

But what if the customer quotes it as #ab3108tn123? Here all the letters are in lowercase, so re will not recognize it as data in the given pattern. To handle such situations we use flags. Flags are optional settings that change how a pattern is matched. We can solve this problem using the IGNORECASE flag, which makes matching case-insensitive.

import re

text = "My order number is #ab3108tn123, please track my order"

pattern = r'#AB\d{4}[A-Z]+\d{3}'

matches = re.findall(pattern, text, flags=re.IGNORECASE)
print(matches)
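The same flag can be supplied in other ways. As a sketch: re.compile accepts the flag once and returns a reusable pattern object, and the inline form (?i) at the start of the pattern has the same effect. Converting the result with .upper() is an optional extra step, assumed here to keep stored order numbers consistent.

```python
import re

text = "My order number is #ab3108tn123, please track my order"

# Option 1: compile once with the flag and reuse the pattern object
order_re = re.compile(r"#AB\d{4}[A-Z]+\d{3}", flags=re.IGNORECASE)
compiled_matches = order_re.findall(text)

# Option 2: inline flag at the very start of the pattern
inline_matches = re.findall(r"(?i)#AB\d{4}[A-Z]+\d{3}", text)

# Optional: convert to upper case before storing or comparing
normalized = [m.upper() for m in compiled_matches]

print(compiled_matches)  # ['#ab3108tn123']
print(inline_matches)    # ['#ab3108tn123']
print(normalized)        # ['#AB3108TN123']
```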

Helping Hand

To find the correct pattern for your data, or to test a pattern against sample data, you can use the website https://regex101.com/.

Conclusion

The re module in Python is a powerful tool for working with regular expressions, offering a wide range of functions and methods to handle pattern matching, text search, data extraction, and manipulation. Whether you're validating user input, parsing data, performing text analysis, or developing web scraping applications, the re module provides the necessary functionality to accomplish these tasks efficiently.