Text Preprocessing for NLP (Natural Language Processing): Beginner to Master

Ujjawal Verma
Published in Analytics Vidhya
10 min read · Feb 14, 2020

Image credit (CC): https://www.flickr.com/photos/stevensnodgrass/6274372541/

In this blog we will talk about text preprocessing for Natural Language Processing (NLP) problems. NLP is, essentially, the art of extracting information from text. Nowadays many organizations deal with huge amounts of text data, such as customer reviews, tweets, newsletters, emails, etc., and extract much more information from it using NLP and machine learning.

The first step of NLP is text preprocessing, which is what we are going to discuss. Here I am using the Amazon Reviews: Unlocked Mobile Phones dataset for text preprocessing.

So, before starting, we all need to know why text preprocessing is required.

Why text preprocessing?

As we know, machine learning needs data in numeric form. We basically use encoding techniques (Bag of Words, bi-gram, n-gram, TF-IDF, Word2Vec) to encode text into numeric vectors. But before encoding we first need to clean the text data, and this process of preparing (or cleaning) text data before encoding is called text preprocessing. It is the very first step in solving NLP problems.

I am doing text preprocessing step by step for sentiment analysis of the Amazon Reviews: Unlocked Mobile Phones dataset. Let’s play with the data…😊😎

Content

  1. Import the dataset & Libraries.
  2. Dealing with Missing Values.
  3. Labeling the Dataset.
  4. Data Cleaning and text preprocessing.

1. Import the dataset & Libraries

The first step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used.

Let’s look at the dataset we got. It looks as shown below. Here we can see there are six features: ‘Product Name’, ‘Brand Name’, ‘Price’, ‘Rating’, ‘Reviews’ and ‘Review Votes’.
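The loading step might look like the sketch below. The file name `Amazon_Unlocked_Mobile.csv` is an assumption (adjust it to wherever you saved the dataset); for illustration I also build a tiny stand-in frame with the same six features:

```python
import pandas as pd

# With the real dataset on disk you would simply do:
# df = pd.read_csv("Amazon_Unlocked_Mobile.csv")  # hypothetical local path

# Tiny illustrative frame with the same six features:
df = pd.DataFrame({
    "Product Name": ["Nokia 105", "Nokia 105"],
    "Brand Name": ["Nokia", None],
    "Price": [199.99, 199.99],
    "Rating": [5, 1],
    "Reviews": ["Great phone, works well", "Terrible battery life"],
    "Review Votes": [1.0, 0.0],
})
print(df.shape)          # (rows, columns)
print(list(df.columns))  # the six features
```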

2. Dealing With Missing Values

In this step we will check for null values in our dataset and replace or drop them as appropriate for the dataset.

We are doing sentiment analysis on this dataset, so we basically require two features: ‘Rating’ and ‘Reviews’. As shown above, ‘Reviews’ has only 62 null values. We will first trim our dataset to these two features and then remove all 62 records with the help of the code below.
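A minimal sketch of this step, assuming `df` is the DataFrame loaded earlier (illustrated here with a tiny stand-in):

```python
import pandas as pd

# Stand-in for the full dataset; one review is missing
df = pd.DataFrame({
    "Rating": [5, 1, 3],
    "Reviews": ["Great phone", None, "It is okay"],
    "Price": [199.0, 99.0, 149.0],
})

# Keep only the two features we need, then drop rows with a null review
data = df[["Rating", "Reviews"]].dropna(subset=["Reviews"]).reset_index(drop=True)
print(data.isnull().sum())  # both columns now show 0 nulls
```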

As we can see, all null values have been removed from our dataset. Let’s create labels according to the rating given by customers.

3. Labeling The Dataset

As per our dataset, ratings range from 1 to 5. According to the rating we will create three labels: Negative (for ratings 1 & 2), Neutral (for rating 3) and Positive (for ratings 4 & 5).

Labeling

So, with the above code we create the label as per the rating. Let’s look at the dataset.
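A sketch of the labeling step, mapping low ratings to Negative and high ratings to Positive in the usual sentiment convention (the helper name `rating_to_label` is my own):

```python
def rating_to_label(rating):
    # Map a 1-5 star rating to a sentiment label
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"

# Applied to the trimmed DataFrame it would be:
# data["Label"] = data["Rating"].apply(rating_to_label)
```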

4. Data Cleaning And Text Preprocessing.

We are only considering the ‘Reviews’ feature from the dataset for text preprocessing. I will do a few steps here to clean the text data; generally it depends on the text data and the problem requirements. Here I am explaining the process step by step.

Preprocessing the raw text:

This involves the following:

I. Removing URL.

II. Removing all irrelevant characters (Numbers and Punctuation).

III. Convert all characters into lowercase.

IV. Tokenization

V. Removing Stopwords

VI. Stemming and Lemmatization

VII. Remove the words having length <= 2

VIII. Convert the list of tokens back into a string

For better understanding, we take a review from the dataset and see how it changes after each step.

Example

I. Removing URL —

As we can see there is a URL, and we don’t want it to be part of our corpus. Let’s remove it using the line of code below.
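A regex-based sketch of URL removal (the helper name `remove_urls` is my own):

```python
import re

def remove_urls(text):
    # Replace http(s):// and www. URLs with a space
    return re.sub(r"https?://\S+|www\.\S+", " ", text)

print(remove_urls("Loved it, see www.example.com for details"))
```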

Result: As we can see, the URL (highlighted in green) has been removed.

Removing URL

II. Removing all irrelevant characters (Numbers and Punctuation) —

Remove numbers (0–9) if they are not relevant to your analysis. Punctuation will also be removed. Punctuation is basically the set of symbols [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]:
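One way to sketch this step is a single regex that replaces every non-letter character with a space (the helper name is my own):

```python
import re

def remove_numbers_and_punct(text):
    # Replace anything that is not a letter or whitespace with a space
    return re.sub(r"[^a-zA-Z\s]", " ", text)

print(remove_numbers_and_punct("Got 2 phones, both great!!"))
```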

Result: All numbers and punctuation have been replaced with a space ‘ ’.

Removing all irrelevant characters

III. Convert all characters into lowercase —

All words are changed into lowercase (or uppercase) to avoid duplication, because “Phone” and “phone” would be considered two separate words if this step is not done.
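This step is just Python’s built-in `str.lower()`:

```python
text = "This Phone is Great"
text = text.lower()  # lowercase every character
print(text)          # this phone is great
```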

Result: All uppercase characters (highlighted in green) have been replaced with lowercase (highlighted in yellow).

Convert all characters into lowercase

IV. Tokenization —

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered tokens. We will use the Natural Language Toolkit (nltk) library for tokenization.

Note: If we have data in the form of paragraphs, and we want to convert the paragraph into sentences, then we will use nltk.sent_tokenize(paragraph).

Here we will use the line of code below to perform tokenization.

Result: As we can see, the string has been changed into tokens, stored as a list of strings.

Tokenization

Now we have a list of strings for each record (or row). Let’s look at the dataset.

V. Removing Stopwords —

“Stopwords” are the most common words in a language, like “the”, “a”, “me”, “is”, “to”, “all”. These words do not carry important meaning and are usually removed from texts. It is possible to remove stopwords using the Natural Language Toolkit (nltk). You may also check the list of stopwords using the following code.

List of Stopwords

So, these are the stopwords which we need to remove. Let’s remove them.

Result: Now we can see that all highlighted tokens have been removed from the corpus after applying the function clean_stopwords().

Removing Stopwords

VI. Stemming and Lemmatization —

The aim of both processes is the same: reducing the inflectional forms of each word to a common base or root. However, the two processes are different; let’s see what stemming and lemmatization are.

Stemming usually refers to a crude process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes (the obtained element is known as the stem).

Lemmatization, on the other hand, consists of doing things properly, with the use of a vocabulary and morphological analysis of words, to return the base or dictionary form of a word, which is known as the lemma.

If we stem the word “saw” in the sentence “I saw an amazing thing”, we might obtain just ‘s’, but if we lemmatize it we would obtain ‘see’, which is the lemma.

Both techniques can remove important information, but they also help us normalize our corpus (lemmatization is the one that is usually applied). Stemming can actually create words that have no meaning, so we usually use lemmatization.

I will show you the difference between both with the help of code and result.

Let’s look at stemming first.

  • Stemming:
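A sketch with nltk’s PorterStemmer, using the same example words as the result below:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["components", "says", "people", "troubling"]
# Porter stemming chops suffixes by rule, sometimes leaving non-words
print([stemmer.stem(w) for w in words])
```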

Result: Observing the output, some words have been stemmed, like ‘components’ to ‘compon’, ‘says’ to ‘say’, ‘people’ to ‘peopl’ and ‘troubling’ to ‘troubl’.

Stemming

Now we can see that some words were changed into forms that have no meaning, and this is the challenge of using stemming. Let’s move on to lemmatization and see the difference in the output.

  • Lemmatization:

Result: Now we can see it finds the root word, like ‘troubling’ to ‘trouble’, ‘took’ to ‘take’ and ‘payed’ to ‘pay’. So, as opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.

Lemmatization

So, here we can see the difference between stemming and lemmatization. Generally we use lemmatization.

Let’s use the lemmatization output for the further process.

VII. Remove the words having length <= 2 —

Basically, even after performing all the required text-processing steps, some noise remains in our corpus, so I am removing words which have very short length.
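This step is a simple length filter over the token list (the helper name is my own):

```python
def remove_short_words(tokens):
    # Drop tokens of length 2 or less, which are usually leftover noise
    return [t for t in tokens if len(t) > 2]

print(remove_short_words(["go", "phone", "ok", "great"]))
```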

Result: Words with length less than or equal to 2 have been removed.

Remove the words having length <= 2

Now we have the required corpus after text preprocessing. Next we will convert this list back to a string for encoding the text.

VIII. Convert the list of tokens back into a string —
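Rejoining the tokens is a single `str.join` call:

```python
tokens = ["phone", "work", "great"]
review = " ".join(tokens)  # list of tokens -> space-separated string
print(review)              # phone work great
```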

Result: After running the above code, we get a string from the input list.

Converting tokens into string

Let’s look at our dataset after performing text preprocessing.

Final data after text preprocessing

Now we have the dataset we require for encoding the text.

The same thing can be achieved using a single function, which I want to share with you.

Function for Text Preprocessing

So, these are the steps used for text preprocessing for NLP problems. You don’t need to follow the entire process; sometimes fewer steps are enough. It really depends on your dataset as well as your problem.

Let’s see one more interesting thing about text preprocessing, which is text visualization.

Text Visualization

After text preprocessing, let’s look at our corpus and observe whether it is ready to be encoded into numeric vectors or not.

Here I will visualize the tokens in the corpus label-wise, meaning we have to split our dataset according to the labels, which can be done with the following code.

Split data with respect to labels
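A sketch of the split, assuming a DataFrame with the preprocessed ‘Reviews’ and the ‘Label’ column created earlier (illustrated with a tiny stand-in):

```python
import pandas as pd

data = pd.DataFrame({
    "Reviews": ["great phone", "terrible battery", "okay device"],
    "Label": ["Positive", "Negative", "Neutral"],
})

# One Series of reviews per sentiment label
positive = data[data["Label"] == "Positive"]["Reviews"]
neutral = data[data["Label"] == "Neutral"]["Reviews"]
negative = data[data["Label"] == "Negative"]["Reviews"]
```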

Let’s visualize our corpus corresponding to the labels. Here I am taking the top 20 words and checking their frequency in the corpus with the help of the line of code below.
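The frequency count behind these charts can be sketched with `collections.Counter` (the helper name `top_words` is my own; plotting the result with matplotlib is left out):

```python
from collections import Counter

def top_words(corpus, n=20):
    # Count token frequencies across all reviews in one label's corpus
    counts = Counter()
    for review in corpus:
        counts.update(review.split())
    return counts.most_common(n)

print(top_words(["phone good", "phone bad"], n=1))
```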

We will get results like those below for the Positive, Neutral and Negative reviews.

  • For Positive corpus
Top 20 words in Positive corpus
  • For Neutral Corpus
Top 20 words in Neutral corpus
  • For Negative Corpus
Top 20 words in Negative corpus

Observation: Here we can clearly observe that the word ‘phone’ is common to all the labels and has the highest frequency. So it will be a good step to remove the word ‘phone’ from our corpus for better performance.

Note: You can also remove other common words if you want, like ‘would’ and ‘get’, which likewise do not carry important meaning with respect to your problem.

So, here I am only removing the word ‘phone’, and this will be our final step in text preprocessing. Let’s go for it.

Removing ‘phone’ from the corpus —
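A sketch of this final filter over each preprocessed review string (the helper name is my own):

```python
def remove_word(text, word="phone"):
    # Drop every occurrence of the given token from a preprocessed review
    return " ".join(t for t in text.split() if t != word)

print(remove_word("good phone camera phone"))
```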

So, if you visualize your corpus again, you will see there is no ‘phone’ word present, and this is our final corpus, ready for encoding.

Generally we use Bag of Words, bi-gram, n-gram, TF-IDF and Word2Vec techniques to encode text into numeric vectors, and after that apply machine learning for sentiment analysis. I will cover all the encoding techniques in my upcoming blogs.

So this is the end of this blog; I hope you enjoyed it and got something from it. Please let me know if I missed something in text preprocessing.

Happy Learning, Keep growing…😊😊
