NLP Preprocessing: A Useful and Important Step

Dhaval Taunk · Published in Analytics Vidhya · 4 min read · Jul 26, 2020

Introduction

The GPT-3 model has become a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters, compared to GPT-2's roughly 1.5 billion, and is a major breakthrough in the field of NLP. But the preprocessing steps required before training any model are of utmost importance. Therefore, in this article, I will explain all the major steps used in preprocessing data before training any NLP model.

First I will list the preprocessing steps, and then explain each of them in detail:

  1. Removing HTML tags
  2. Removing stopwords
  3. Removing extra spaces
  4. Converting numbers to their textual representations
  5. Lowercasing the text
  6. Tokenization
  7. Stemming
  8. Lemmatization
  9. Spell-checking

Now let’s go through them one by one.

Removing HTML tags

If the data has been scraped from the internet, the text may contain HTML tags along with the normal text. These tags are not of any use, so they should be removed. This can be done with Python’s BeautifulSoup library, or with a regex. The code is shown below:
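A minimal sketch using BeautifulSoup’s get_text() (the sample HTML string is just for illustration):

```python
from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Parse the markup and keep only the visible text
    return BeautifulSoup(text, "html.parser").get_text()

print(remove_html_tags("<p>Hello <b>world</b>!</p>"))
# Output: Hello world!
```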

Removing stop-words

Many times the data contains a large number of stop-words (very common words such as ‘the’, ‘is’, ‘a’). These might not be useful because they won’t make any significant impact on the data. They can be removed using the nltk or spacy library. The code is shown below:
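A minimal sketch with NLTK’s English stopword list (the nltk.download call fetches the list on first use; the sample sentence is just for illustration):

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the stopword list
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
text = "this is a sample sentence with some stop words"
# Keep only the words that are not in the stopword set
filtered = " ".join(w for w in text.split() if w not in stop_words)
print(filtered)
# Output: sample sentence stop words
```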

Removing extra-spaces

There might be situations where the data contains extra spaces within the sentences. These can be removed easily with Python’s split() and join() functions.
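For example:

```python
text = "This  sentence   has    extra   spaces"
# split() with no arguments collapses runs of whitespace; join() rebuilds the sentence
cleaned = " ".join(text.split())
print(cleaned)
# Output: This sentence has extra spaces
```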

Converting numbers to their textual representations

Converting numbers to their textual form can also be a useful NLP preprocessing step. The num2words library can be used for this purpose.
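A minimal sketch that replaces every digit sequence using a regex (the sample sentence is just for illustration):

```python
import re
from num2words import num2words

text = "I bought 3 apples and 20 oranges"
# Replace each run of digits with its textual representation
converted = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
print(converted)
# Output: I bought three apples and twenty oranges
```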

Lowercasing the text

Converting all the words in the data to lowercase is a good practice for removing redundancy: the same word may appear more than once in the text, once in lowercase and once in uppercase, and would otherwise be treated as two different tokens.
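For example:

```python
text = "The Quick Brown Fox jumps over the quick brown fox"
# lower() maps every character to lowercase, merging the duplicate phrases
print(text.lower())
# Output: the quick brown fox jumps over the quick brown fox
```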

Tokenization

Tokenization involves splitting sentences into tokens, i.e., into individual words. It is also useful for separating punctuation from words: in the embedding layer of a model, there may well be no embedding for a word fused with punctuation. For example, ‘thanks.’ is a word with a full stop attached; tokenization will split it into [‘thanks’, ‘.’]. The code for doing this with NLTK’s word_tokenize is shown below:
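A minimal sketch (the punkt tokenizer models are downloaded on first use; newer NLTK versions may name this resource differently):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, one-time download
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello there, thanks."))
# Output: ['Hello', 'there', ',', 'thanks', '.']
```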

Stemming

Stemming is the process of reducing a word in the data to its root form by chopping off suffixes. For example, ‘sitting’ will be converted to ‘sit’, ‘thinking’ to ‘think’, etc. NLTK’s PorterStemmer can be used for this purpose.
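A minimal sketch with PorterStemmer (note that a stem is not always a dictionary word, e.g. ‘studies’ becomes ‘studi’):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["sitting", "thinking", "studies"]:
    # stem() strips suffixes according to the Porter algorithm's rules
    print(word, "->", stemmer.stem(word))
# Output:
# sitting -> sit
# thinking -> think
# studies -> studi
```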

Lemmatization

Many people consider lemmatization to be the same as stemming, but they are actually different: lemmatization does a morphological analysis of words and returns a proper dictionary form (the lemma), which stemming does not do. NLTK’s implementation of lemmatization (WordNetLemmatizer) can be used for this.
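A minimal sketch (the WordNet data is downloaded on first use; pos="a" tells the lemmatizer to treat the word as an adjective):

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data, one-time download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Unlike a stemmer, the lemmatizer uses WordNet's morphology, so irregular
# forms map to real dictionary words
print(lemmatizer.lemmatize("mice"))             # Output: mouse
print(lemmatizer.lemmatize("better", pos="a"))  # Output: good
```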

Spell-checking

There is a good chance that the data being used contains spelling mistakes, so spell-checking becomes an important step in NLP preprocessing. I will use the TextBlob library for this purpose.
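A minimal sketch with TextBlob’s correct() method (the misspelled sample sentence is just for illustration):

```python
from textblob import TextBlob

text = "I havv goood speling"
# correct() replaces each word with its most likely spelling
print(TextBlob(text).correct())
# Output: I have good spelling
```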

Although the above spell-checker may not be perfect, it will still be of good use.

The methods depicted above are just some of the possible techniques for each step; other methods are available as well.

I have also created a GitHub repo accumulating all the above methods in one file. You can check it out through the link below:
