How are regular expressions used to get insightful data?

Published in Analytics Vidhya · 7 min read · Nov 5, 2020

Let me start with the short story of how regular expressions caught my interest. Every computer science student has studied, or at least come across, “compiler design”, a subject in which regular expressions play a major role. I studied it too, but it did not catch my interest then. That changed while I was taking the ‘Applied Data Science with Python’ course from the University of Michigan on Coursera. During the lecture on regular expressions, the instructor said that they are one of the most important and underrated skills in data science, and that he could teach a whole course on regular expressions alone. That made me think about how powerful they must be. My interest was piqued, so I went through the documentation mentioned there and played with the code. Here I am going to share what I learned from experimenting with regex using the re module in Python.

Preface to regular expression:

Simply put, a regular expression is a sequence of characters that defines a search pattern. It is used for string matching, searching for substrings, finding patterns, find-and-replace operations, and so on.

But why in data science?
Regular expressions are used mostly in data preparation and data cleaning tasks. They extract meaningful data such as hashtags, websites, email IDs, and phone numbers from messy data using patterns. They are also widely used in natural language processing for data cleaning, for example keeping only the alphabetic words and removing numerals.

Okay, let’s dive headlong into regex using re, Python’s built-in module for working with regular expressions.

So, what really is a regex?

A regular expression looks something like what you see in the picture above. It is all about writing expressions like these to find a pattern. Let’s look at every bit and byte of regex in detail next.

. ^ $ * + ? { } [ ] \ | ( )

Each of these characters has its own meaning in a regular expression. Let’s see the basics first and then examine their usage with the help of code examples.

* - indicates repetition: placed after a character, it says that character can be repeated zero or more times
$ - matches at the end of the string
^ - matches at the start of the string; as the first character inside [ ], it negates the character set (like a not operation)
[ ] - defines a character set like [a-z], [0-9]
+ - is the same as the * operation, but it needs at least one occurrence
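The behaviour of these characters can be checked with a few one-liners. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '*' allows zero or more repetitions, so 'ab*' matches 'a', 'ab' and 'abbb'
star = re.findall(r'ab*', 'a ab abbb')

# '+' needs at least one repetition, so a bare 'a' no longer matches
plus = re.findall(r'ab+', 'a ab abbb')

# '^' outside brackets anchors the match to the start of the string
starts_with_digit = bool(re.search(r'^[0-9]', '42 apples'))

# '^' as the first character inside brackets negates the character set
non_vowels = re.findall(r'[^aeiou]', 'regex')

# '$' anchors the match to the end of the string
ends_with_zero = bool(re.search(r'0$', '110'))

print(star)               # ['a', 'ab', 'abbb']
print(plus)               # ['ab', 'abbb']
print(starts_with_digit)  # True
print(non_vowels)         # ['r', 'g', 'x']
print(ends_with_zero)     # True
```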

Some of the most commonly used special sequences are

\d - matches any decimal digit, 0 to 9
\w - matches any alphanumeric character or underscore
\s - matches whitespace characters like space, \t, \n, \f
\D - matches any character that is not a digit
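To see what each sequence picks up, here is a small sketch on an invented sample string (the string and variable names are assumptions for illustration only).

```python
import re

# Sample text containing words, numbers, spaces and a tab
sample = 'Order 66 shipped\ton 2020-11-05'

digits = re.findall(r'\d+', sample)      # runs of digits
words = re.findall(r'\w+', sample)       # runs of letters, digits or underscore
spaces = re.findall(r'\s', sample)       # single whitespace characters
non_digits = re.findall(r'\D+', sample)  # runs of anything that is not a digit

print(digits)      # ['66', '2020', '11', '05']
print(words)       # ['Order', '66', 'shipped', 'on', '2020', '11', '05']
print(spaces)      # [' ', ' ', '\t', ' ']
print(non_digits)  # ['Order ', ' shipped\ton ', '-', '-']
```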

Implementation using re module:

Let’s move on to the coding part and see how regex is used in Python.

import re
p=re.compile('[a-z]+')
print(p.match('priya'))

Output:
<re.Match object; span=(0, 5), match='priya'>

The expression passed to the compile() method is the regular expression. It matches one or more lower-case letters. Here match() returns a match object that holds the matched text and its start and end index.

print(p.match('Priya'))

Output:
None

It returns None because the string doesn’t start with a lower-case letter. The match() method looks only at the beginning of the string.
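Because match() can return None, it helps to check the result before reading from it. The sketch below wraps that check in a hypothetical helper, describe(), which is not part of the re module; group(), start() and end() are the standard match-object methods.

```python
import re

p = re.compile('[a-z]+')

def describe(word):
    """Return (matched text, start, end), or None if the start does not match."""
    m = p.match(word)
    if m is None:
        return None
    return m.group(), m.start(), m.end()

print(describe('priya'))  # ('priya', 0, 5)
print(describe('Priya'))  # None
```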

Using the search() method:

pat=re.compile('[A-Z][a-z]*')
match=pat.search('Abi is my classmate, Bibi is my roomate')

if match:
    print('Found names are',match.group())

Output:
Found names are Abi

The above pattern finds names in the string by looking for words that start with an upper-case letter followed by zero or more lower-case letters (the * character). This rule does not hold in every real-world situation. The group() method returns the substring matched by the RE.

If you look closely, the search() method returns only one name even though there are two names in the string. That is because search() returns the first occurrence of the pattern and does not look any further.
To find all the occurrences of a pattern, we have to use the findall() method.

pat=re.compile('[A-Z][a-z]*')
match=pat.findall('Abi is my classmate, Bibi is my roomate')

if match:
    print('Found names are',match)

Output:
Found names are ['Abi', 'Bibi']
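When the position of each name matters as well, finditer() is a possible alternative: it yields one match object per occurrence instead of plain strings. A minimal sketch on the same sentence:

```python
import re

pat = re.compile('[A-Z][a-z]*')
text = 'Abi is my classmate, Bibi is my roomate'

# Collect each matched name together with the index where it starts
found = [(m.group(), m.start()) for m in pat.finditer(text)]

print(found)  # [('Abi', 0), ('Bibi', 21)]
```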

More meta characters:

The | character works like an OR operation.

Grade='AABBCDABADACB'

print(re.findall('AB|CB',Grade))

Output:
['AB', 'AB', 'CB']

It returns every occurrence of 'AB' or 'CB', that is, every B that comes right after an A or a C.
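The same result can be written in other ways, and one of them has a catch worth knowing. In this sketch, note that findall() returns only the group contents when the pattern contains a capturing group; a non-capturing group (?:...) avoids that.

```python
import re

Grade = 'AABBCDABADACB'

with_class = re.findall('[AC]B', Grade)         # character set instead of |
capturing = re.findall('(A|C)B', Grade)         # returns only what the group captured
non_capturing = re.findall('(?:A|C)B', Grade)   # returns the whole match

print(with_class)     # ['AB', 'AB', 'CB']
print(capturing)      # ['A', 'A', 'C']
print(non_capturing)  # ['AB', 'AB', 'CB']
```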

When the ^ character is the first character inside [ ], it performs the not operation: the set matches every character except the ones in the specified range.

Characters='abcde9289abdifhoejf'

print(re.findall('[^0-9]',Characters))

Output:
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'd', 'i', 'f', 'h', 'o', 'e', 'j', 'f']

See, the RE returns all the characters except the digits in the string.
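For data cleaning, the negated set is often more useful in runs or together with sub(). A small sketch on the same string (the variable names are illustrative):

```python
import re

characters = 'abcde9289abdifhoejf'

# Adding '+' groups neighbouring non-digits into one string each
letter_runs = re.findall(r'[^0-9]+', characters)

# Replacing every non-digit with '' keeps only the digits
digits_only = re.sub(r'[^0-9]', '', characters)

print(letter_runs)  # ['abcde', 'abdifhoejf']
print(digits_only)  # 9289
```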

The $ character anchors the pattern to the end of the string, so whatever comes just before it must be the last thing in the string.

numbers=[110,322,550,10]

for i in numbers:
    a=str(i)
    print(re.findall('[0-9]*0$',a))

Output:
['110']
[]
['550']
['10']

The above pattern matches only the numbers that end with 0; for 322 nothing matches, so findall() returns an empty list. The re functions work on strings, so I converted the numbers using the str() method.
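If the goal is to keep the matching numbers rather than print a list per number, one option is to use search() as a filter. This is a sketch under that assumption, not the only way to do it:

```python
import re

numbers = [110, 322, 550, 10]
ends_in_zero = re.compile(r'0$')

# Keep a number only if its string form ends with the digit 0
multiples_of_ten = [n for n in numbers if ends_in_zero.search(str(n))]

print(multiples_of_ten)  # [110, 550, 10]
```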

Other useful operations:

Named groups:

We can also name the groups captured from strings that match the RE. This is done with the ‘(?P<name>...)’ syntax, where any name can be given in place of name. It can be done by

names=['Bagavathy Priya','Saravana Kumar']

for i in names:
    res=re.match(r'(?P<first>\w+) (?P<last>\w+)',i)
    print(res.groupdict())

Output:
{'first': 'Bagavathy', 'last': 'Priya'}
{'first': 'Saravana', 'last': 'Kumar'}

The first name is captured with the ‘\w+’ sequence, which takes one or more alphanumeric characters. Note that it is followed by a space, and the word after the space is captured as the last name.
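Named groups can also be read one at a time with group('name'). The sketch below uses a hypothetical helper, split_name(), and guards against strings that do not fit the pattern, since match() returns None for them.

```python
import re

name_pattern = re.compile(r'(?P<first>\w+) (?P<last>\w+)')

def split_name(full_name):
    """Return (first, last), or None when the string is not 'word space word'."""
    m = name_pattern.match(full_name)
    if m is None:
        return None
    return m.group('first'), m.group('last')

print(split_name('Bagavathy Priya'))  # ('Bagavathy', 'Priya')
print(split_name('Madonna'))          # None
```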

Split strings:

We can split strings by any regular expression pattern using the split() method of the re module.

pat=re.compile(',')

text='Raj,Vincy,Zoe'

print(pat.split(text))

Output:
['Raj', 'Vincy', 'Zoe']

Here I specified the ‘,’ character as the pattern, so the RE splits the string at every ‘,’. The same idea can be used to separate columns by their delimiter while working with CSV files.
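The advantage over the plain str.split() shows up when the delimiter varies. A minimal sketch, using an invented string with mixed separators and stray spaces:

```python
import re

messy = 'Raj, Vincy;Zoe ; Emy'

# Split on a comma or semicolon, swallowing any whitespace around it
names = re.split(r'\s*[,;]\s*', messy)

# maxsplit limits the number of splits; the rest stays in the last item
first_and_rest = re.split(',', 'Raj,Vincy,Zoe', maxsplit=1)

print(names)           # ['Raj', 'Vincy', 'Zoe', 'Emy']
print(first_and_rest)  # ['Raj', 'Vincy,Zoe']
```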

Search and replace:

Consider a situation where Emy has left a project and Abi is now working on it. We have to replace every occurrence of Emy with the name Abi. We can do this using the sub() method.

sen='Priya and Emy are working in AB project,Priya is taking the data cleaning part and Emy is doing the modelling part'
pat=re.compile('Emy')

pat.sub('Abi',sen)

Output:
'Priya and Abi are working in AB project,Priya is taking the data cleaning part and Abi is doing the modelling part'

It works fine :) The sub() method replaces every “Emy” found by the compiled pattern with “Abi”.
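One caveat: a bare 'Emy' also matches inside longer words. The sketch below uses an invented sentence to show the difference that the word boundary \b makes, and uses subn() to report how many replacements were made.

```python
import re

sen = 'Emy and Emylia joined; Emy leads the modelling part'

# Without boundaries, the 'Emy' inside 'Emylia' is replaced too
naive = re.sub('Emy', 'Abi', sen)

# \b restricts the match to the whole word; subn also returns the count
careful, count = re.subn(r'\bEmy\b', 'Abi', sen)

print(naive)    # Abi and Abilia joined; Abi leads the modelling part
print(careful)  # Abi and Emylia joined; Abi leads the modelling part
print(count)    # 2
```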

Real world examples:

Suppose I got messy data from Twitter in a text file. How can I get insightful details from it for further analysis? With the help of REs, we can get details like e-mail addresses, hashtags, and profile IDs by pattern matching. It can be done with the following code.

txt="VFD\n483686339687510016|Mon Jun 30 18:59:44 +0000 2014|Hobby Lobby ruling seems likely to make it harder for women to get contraception in the future @aaroncarroll writes http://nyti.ms/1iO0g23\n483683719476432896|Mon Jun 30 18:49:19 +0000 2014|Leprosy, Still Claiming Victims http://nyti.ms/1nZIuG9\n483657174275874817|Mon Jun 30 17:03:50 +0000 2014|Justices Rule in Favor of Hobby Lobby http://nyti.ms/VAgtgX\n483652347860885504|Mon Jun 30 16:44:40 +0000 2014|Supreme Court Declines Case Contesting Ban on Gay ‘Conversion Therapy’ http://nyti.ms/1iNwGd1\n483651450342735872|Mon Jun 30 16:41:06 +0000 2014|RT @UpshotNYT: Evidence-based medicine, or why hardly anyone has tonsils removed these days: http://nyti.ms/1sRDCdc\n483651253374033921|Mon Jun 30 16:40:19 +0000 2014|RT @jessbidgood: I spent 3 days outside Boston’s @PPact clinic to look at how SCOTUS\' buffer-zone ruling is felt there: http://t.co/wyNjQuP…\n483650611528093697|Mon Jun 30 16:37:46 +0000 2014|RT @celiadugger: Scientists: morning after pills don\'t cause abortions in way abortion opponents contend in Hobby Lobby case. http://t.co/k…\n483650393164222465|Mon Jun 30 16:36:54 +0000 2014|#health"

From this messy data, we can get insights like

pattern=r'@[\w\d]*'

re.findall(pattern,txt)

Output:
['@aaroncarroll',
 '@UpshotNYT',
 '@jessbidgood',
 '@PPact',
 '@celiadugger']

The profile IDs are found using the @ character in the pattern followed by the sequences \w and \d. Since \w already covers digits, the \d is redundant here but harmless.
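Hashtags and links can be pulled out in the same way. This is a rough sketch on two lines taken from the data above; the patterns are simple assumptions (a hashtag is # followed by word characters, a link is http or https up to the next whitespace) and not a complete URL grammar.

```python
import re

# Two tweets from the sample data, kept short so the block runs on its own
tweets = (
    "483657174275874817|Mon Jun 30 17:03:50 +0000 2014|"
    "Justices Rule in Favor of Hobby Lobby http://nyti.ms/VAgtgX\n"
    "483650393164222465|Mon Jun 30 16:36:54 +0000 2014|#health"
)

hashtags = re.findall(r'#\w+', tweets)
urls = re.findall(r'https?://\S+', tweets)

print(hashtags)  # ['#health']
print(urls)      # ['http://nyti.ms/VAgtgX']
```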

pattern=re.findall(r"[A-Za-z0-9._%+-]+"
                     r"@[A-Za-z0-9.-]+"
                     r"\.[A-Za-z]{2,4}", txt)

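print(pattern)

Output:
['jessbidgood@ibm.co', 'ppact@gmail.com']

The above pattern is used to extract e-mail addresses from the text. Note that the sample txt shown above contains no e-mail addresses (every @ there follows a space), so this output presumably comes from a fuller version of the data.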
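To try the e-mail pattern on its own, here is a sketch with made-up addresses on example domains. It also shows one limitation of the {2,4} part: a longer top-level domain gets cut off after four letters, and relaxing it to {2,} is one possible adjustment.

```python
import re

# Invented text; the addresses are placeholders, not real accounts
text = 'Write to alice@example.com or carol@news.example.museum today'

original = re.findall(r"[A-Za-z0-9._%+-]+"
                      r"@[A-Za-z0-9.-]+"
                      r"\.[A-Za-z]{2,4}", text)

# Same pattern, but the last part may have two or more letters
relaxed = re.findall(r"[A-Za-z0-9._%+-]+"
                     r"@[A-Za-z0-9.-]+"
                     r"\.[A-Za-z]{2,}", text)

print(original)  # ['alice@example.com', 'carol@news.example.muse']
print(relaxed)   # ['alice@example.com', 'carol@news.example.museum']
```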

Likewise, we can do more complex and robust operations using regular expressions. We can clean the data, collect it into a dictionary in a proper manner, and convert that to a CSV file using a pandas data frame, like the example below.