Day 110(DL) — Regular Expressions for NLP

Nandhini N
May 5 · 5 min read
Photo by Amador Loureiro on Unsplash

We’ve all heard the popular data science saying “Garbage In, Garbage Out”. Data preprocessing plays a vital role in building any model. A simple model can produce accurate results with well-curated data. Conversely, a complex model can miss its target outcomes if the input data fed into it is poorly selected.

Like any other deep learning model, an NLP model requires properly cleaned and processed input data to be effective. One of the first steps in NLP data preprocessing is cleaning the raw text with regular expressions. Let’s build some intuition with an example. Suppose we’re working on a project that must automatically route customer requests to the respective departments based on their content.

One of the mails from a valued customer looks like this,

received from: Tommy@gmail.com

hello helpdesk

Recently, I received a new debit card as a replacement for the old one. But, I am not able to make any online transactions using the new card. Could you please look into the issue and resolve it soon? As I have pending bills that need to be paid.

I am attaching a screenshot of the latest error received.

[cid:error1.jpg]

Thanks & Regards, Tommy

Extracting the Crux: If we observe the mail closely, the actual content resides within a single paragraph. The remaining details, such as “received from”, “hello” and “Thanks”, are formal expressions that can be removed because they do not help in understanding the request. In addition, the main paragraph contains a question mark, commas and full stops. For this routing task, these characters carry little useful information and only take up space that can be freed.

The mail also contains blank lines that can be dropped. In the end, we only need the actual content, which holds the information useful for the assignment process.

Implementing Regular Expression:

Let’s first import the re module for regular expressions (along with string) and define the sample text,

import re, string

text = "received from: Tommy@gmail.com\n\n hello helpdesk \n\nRecently,  I received a new debit card as a replacement for the old one. But, I am not able to make any online transactions using the new card. Could you please look into the issue and resolve it soon? As I have pending bills that need to be paid.I am attaching a screenshot of the latest error received.\n\n [cid:error1.jpg]\n\nThanks & Regards, Tommy"
print(text)

received from: Tommy@gmail.com

hello helpdesk

Recently, I received a new debit card as a replacement for the old one. But, I am not able to make any online transactions using the new card. Could you please look into the issue and resolve it soon? As I have pending bills that need to be paid.I am attaching a screenshot of the latest error received.

[cid:error1.jpg]

Thanks & Regards, Tommy

As we can see, the received text includes punctuation and blank lines. A common recommendation in NLP is to convert the text to lower case before any further processing. This ensures that the same word is counted once rather than separately for each casing (for example, “Card” and “card”).

text = text.lower()

After conversion,

received from: tommy@gmail.com

hello helpdesk

recently, i received a new debit card as a replacement for the old one. but, i am not able to make any online transactions using the new card. could you please look into the issue and resolve it soon? as i have pending bills that need to be paid.i am attaching a screenshot of the latest error received.

[cid:error1.jpg]

thanks & regards, tommy
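To see why lowercasing matters for counting, here is a minimal sketch using a made-up sentence (not taken from the mail above) and Python's collections.Counter. The sample sentence and variable names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical sample sentence, chosen only to show the effect of case.
sample = "Card declined. The card was new. CARD replaced."

# Split on whitespace and strip trailing full stops so that only case differs.
raw_tokens = [w.strip(".") for w in sample.split()]
lower_tokens = [w.strip(".") for w in sample.lower().split()]

raw_counts = Counter(raw_tokens)
lower_counts = Counter(lower_tokens)

# Without lowercasing, each casing is counted as a separate word.
print(raw_counts["card"], raw_counts["Card"], raw_counts["CARD"])  # 1 1 1

# After lowercasing, all three occurrences fall under one entry.
print(lower_counts["card"])  # 3
```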

Step 1 (removing “received from” and the name before the @): Here, we replace “received from:” together with the part of the mail ID before the @ with a single space. In the pattern, \s matches a whitespace character and \w matches a word character (letters, digits and _). Each ‘+’ is a quantifier meaning “one or more” of the preceding element, so \s+ matches the whitespace after the colon and \w+ matches the name up to, but not including, the @.
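To get a feel for these tokens before applying the substitution, here is a small sketch on short strings. The sample strings and variable names are made up for illustration; only the re functions themselves are standard.

```python
import re

# \w+ matches runs of letters, digits and underscores; it stops at '@' and '.'.
tokens = re.findall(r"\w+", "tommy_01@gmail.com")
print(tokens)  # ['tommy_01', 'gmail', 'com']

# \s+ matches one or more whitespace characters (spaces, tabs, newlines).
parts = re.split(r"\s+", "hello   helpdesk\n\nthanks")
print(parts)  # ['hello', 'helpdesk', 'thanks']

# Combined as in the pattern used in this step: the match ends right before '@'.
match = re.search(r"received from:\s+\w+", "received from: tommy@gmail.com")
print(match.group())  # received from: tommy
```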

#let's replace received from with spaces
text = re.sub(r'(received from:\s+\w+)', ' ', text)
print(text)

@gmail.com

hello helpdesk

recently, i received a new debit card as a replacement for the old one. but, i am not able to make any online transactions using the new card. could you please look into the issue and resolve it soon? as i have pending bills that need to be paid.i am attaching a screenshot of the latest error received.

[cid:error1.jpg]

thanks & regards, tommy

We can see that “received from” and the sender’s name have been removed, although the domain part (@gmail.com) remains because \w does not match the @ character. Similarly, we can target other unwanted words and replace them with spaces.
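Following the same idea, here is a minimal sketch of how the remaining formal parts could be stripped. The patterns are assumptions tailored to this one sample mail (the “hello helpdesk” greeting, the “thanks & regards” sign-off, the [cid:...] attachment marker and the leftover mail domain), and the mail body is shortened for readability. It is not a general-purpose cleaner.

```python
import re

# Lowercased sample mail after the "received from" prefix was replaced
# (body shortened for this illustration).
text = (
    " @gmail.com\n\n hello helpdesk \n\n"
    "recently, i received a new debit card. could you please look into the issue?\n\n"
    " [cid:error1.jpg]\n\nthanks & regards, tommy"
)

text = re.sub(r"@\w+(\.\w+)+", " ", text)                  # leftover mail domain
text = re.sub(r"\[cid:[^\]]*\]", " ", text)                # inline attachment marker
text = re.sub(r"\bhello\s+helpdesk\b", " ", text)          # greeting
text = re.sub(r"thanks\s*&\s*regards,?\s*\w*", " ", text)  # sign-off and sender name
text = re.sub(r"\s+", " ", text).strip()                   # blank lines and extra spaces

print(text)
# recently, i received a new debit card. could you please look into the issue?
```

The last substitution is what drops the blank lines: \s+ also matches newline characters, so every run of spaces and line breaks collapses into one space.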

Step 2 (removal of punctuation and single letters):

puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&','/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£','·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›','♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '\xa0', '\t','“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑','±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '\u3000', '\u202f','▒', ':', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫','☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '«','∙', ')', '↓', '、', '│', '(', '»', ',', '♪', '╩', '╚', '³', '・','╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√','☺','•','\u200e','·','…','a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l','m', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w' , 'x', 'y', 'z']

We could also obtain the standard ASCII punctuation characters from string.punctuation in Python (it is a single string, not a list) and add our custom entries to it.
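As a rough sketch of that approach: string.punctuation is one string of ASCII punctuation characters, so it can be turned into a set and extended. The name custom_puncts and the extra symbols are illustrative picks, not the exact list used in this post.

```python
import string

# string.punctuation is a str: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
base = set(string.punctuation)

# Custom additions (illustrative): a few non-ASCII symbols and single letters.
extras = {"•", "…", "“", "”", "’"} | set(string.ascii_lowercase)

custom_puncts = base | extras

print(len(base))                # 32
print("?" in custom_puncts)     # True
print("a" in custom_puncts)     # True
print("card" in custom_puncts)  # False
```

A set is used here because membership checks on a set do not have to scan every element the way they do on a list.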

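import nltk
nltk.download('punkt')  # tokenizer data; recent NLTK releases may need 'punkt_tab' instead

def advance_punct_remove(x):
    # tokenize the text, then keep only the tokens that are not in the puncts list
    list1 = [str(word) for word in nltk.word_tokenize(x) if word not in puncts]
    return ' '.join(list1)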

The nltk package is a Python library that offers many utilities for NLP preprocessing. The word_tokenize function splits the text into tokens, separating words and splitting most punctuation marks off into tokens of their own. Each token is then checked against the puncts list; if it appears there (a punctuation mark or a single letter), it is dropped. Punctuation inside a token, as in gmail.com, is kept because the token as a whole is not in the list.

advance_punct_remove(text)

'gmail.com hello helpdesk recently received new debit card as replacement for the old one but am not able to make any online transactions using the new card could you please look into the issue and resolve it soon as have pending bills that need to be paid.i am attaching screenshot of the latest error received cid error1.jpg thanks regards tommy'
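Since this post is about regular expressions, a similar clean-up can be approximated with re alone. This is a sketch of an alternative, not the nltk-based function above. The function name regex_punct_remove is hypothetical, and its output differs slightly: tokens such as gmail.com or error1.jpg would be split at the dot.

```python
import re

def regex_punct_remove(text):
    """Approximate punctuation and single-letter removal using only `re`."""
    text = re.sub(r"[^\w\s]|_", " ", text)    # drop anything that is not a letter, digit or whitespace
    text = re.sub(r"\b[a-z]\b", " ", text)    # drop stand-alone single lowercase letters
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# Hypothetical shortened sample, already lowercased.
sample = "recently, i received a new debit card. could you please look into it?"
cleaned = regex_punct_remove(sample)
print(cleaned)
# recently received new debit card could you please look into it
```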

The entire code can be found in the GitHub repository.

Recommended Reading:

https://medium.com/analytics-vidhya/regular-expressions-an-excellent-tool-for-text-analysis-or-nlp-d1fa7d666cb9

https://www.geeksforgeeks.org/string-punctuation-in-python/

Nerd For Tech
