NLP Preprocessing: A Useful and Important Step

Dhaval Taunk · Published in Analytics Vidhya · 4 min read · Jul 26, 2020

Introduction

The GPT-3 model has become a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters, compared to GPT-2's roughly 1.5 billion, and is a major breakthrough in the field of NLP. But the preprocessing steps required before training any model are of utmost importance. Therefore, in this article, I will explain all the major steps used in preprocessing data before training any NLP model.

First I will list the preprocessing steps, and then explain each of them in detail:

  1. Removing HTML tags
  2. Removing stopwords
  3. Removing extra spaces
  4. Converting numbers to their textual representations
  5. Lowercasing the text
  6. Tokenization
  7. Stemming
  8. Lemmatization
  9. Spell-checking

Now let’s go through them one by one.

Removing HTML tags

If the data has been scraped from the internet, the text may contain HTML tags along with the normal text. These tags are not of any use, so they should be removed. This can be done with Python’s BeautifulSoup library, or with a regex. The code is shown below:
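A minimal sketch using BeautifulSoup’s get_text() (the sample HTML string is just for illustration):

```python
from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Parse the markup and keep only the visible text
    return BeautifulSoup(text, "html.parser").get_text()

print(remove_html_tags("<p>Hello <b>world</b>!</p>"))
# Output: Hello world!
```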

Removing stop-words

Many times the data contains a large number of stop-words (very common words such as ‘the’, ‘is’, ‘a’). These might not be useful because they won’t make any significant impact on the data. They can be removed using the nltk or spacy library. The code is shown below:
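A minimal sketch with NLTK’s English stopword list (the nltk.download call fetches the list on first use; the sample sentence is just for illustration):

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the stopword list
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
text = "this is a sample sentence with some stop words"
# Keep only the words that are not in the stopword set
filtered = " ".join(w for w in text.split() if w not in stop_words)
print(filtered)
# Output: sample sentence stop words
```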

Removing extra-spaces

There might be situations where the data contains extra spaces within the sentences. These can be removed easily with Python’s split() and join() functions.
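For example:

```python
text = "This  sentence   has    extra   spaces"
# split() with no arguments collapses runs of whitespace; join() rebuilds the sentence
cleaned = " ".join(text.split())
print(cleaned)
# Output: This sentence has extra spaces
```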

Converting numbers to their textual representations

Converting numbers to their textual form can also be a useful NLP preprocessing step. The num2words library can be used for this purpose.
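A minimal sketch that replaces every digit sequence using a regex (the sample sentence is just for illustration):

```python
import re
from num2words import num2words

text = "I bought 3 apples and 20 oranges"
# Replace each run of digits with its textual representation
converted = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
print(converted)
# Output: I bought three apples and twenty oranges
```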

Lowercasing the text

Converting all the words in the data to lowercase is a good practice for removing redundancy: the same word may appear more than once in the text, once in lowercase and once in uppercase, and would otherwise be treated as two different tokens.
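For example:

```python
text = "The Quick Brown Fox jumps over the quick brown fox"
# lower() maps every character to lowercase, merging the duplicate phrases
print(text.lower())
# Output: the quick brown fox jumps over the quick brown fox
```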

Tokenization

Tokenization involves splitting sentences into tokens, i.e., into individual words. It is also useful for separating punctuation from words: in the embedding layer of a model, there may well be no embedding for a word fused with punctuation. For example, ‘thanks.’ is a word with a full stop attached; tokenization will split it into [‘thanks’, ‘.’]. The code for doing this with NLTK’s word_tokenize is shown below:
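A minimal sketch (the punkt tokenizer models are downloaded on first use; newer NLTK versions may name this resource differently):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, one-time download
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello there, thanks."))
# Output: ['Hello', 'there', ',', 'thanks', '.']
```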

Stemming

Stemming is the process of reducing a word in the data to its root form by chopping off suffixes. For example, ‘sitting’ will be converted to ‘sit’, ‘thinking’ to ‘think’, etc. NLTK’s PorterStemmer can be used for this purpose.
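A minimal sketch with PorterStemmer (note that a stem is not always a dictionary word, e.g. ‘studies’ becomes ‘studi’):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["sitting", "thinking", "studies"]:
    # stem() strips suffixes according to the Porter algorithm's rules
    print(word, "->", stemmer.stem(word))
# Output:
# sitting -> sit
# thinking -> think
# studies -> studi
```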

Lemmatization

Many people consider lemmatization to be the same as stemming, but they are actually different: lemmatization does a morphological analysis of words and returns a proper dictionary form (the lemma), which stemming does not do. NLTK’s implementation of lemmatization (WordNetLemmatizer) can be used for this.
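A minimal sketch (the WordNet data is downloaded on first use; pos="a" tells the lemmatizer to treat the word as an adjective):

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data, one-time download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Unlike a stemmer, the lemmatizer uses WordNet's morphology, so irregular
# forms map to real dictionary words
print(lemmatizer.lemmatize("mice"))             # Output: mouse
print(lemmatizer.lemmatize("better", pos="a"))  # Output: good
```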

Spell-checking

There is a good chance that the data being used contains spelling mistakes, so spell-checking becomes an important step in NLP preprocessing. I will use the TextBlob library for this purpose.
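A minimal sketch with TextBlob’s correct() method (the misspelled sample sentence is just for illustration):

```python
from textblob import TextBlob

text = "I havv goood speling"
# correct() replaces each word with its most likely spelling
print(TextBlob(text).correct())
# Output: I have good spelling
```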

Although the above spell-checker may not be perfect, it will still be of good use.

The methods depicted above are just some of the possible techniques for each step; other methods are available as well.

I have also created a GitHub repo accumulating all the above methods in one file. You can check it out through the link below:
