Step-by-Step Text Analysis and Sentiment Analysis Using NLTK

Amit Bhardwaj
Analytics Vidhya
Published in
5 min read · Nov 16, 2020

Problem Statement: Suppose you are working as a compliance data science analyst at a firm. One of your tasks is to monitor internal communications in order to better understand employees’ moods and assess any potential risks. You can leverage sentiment analysis, which has become a form of risk management and is emerging as a useful risk-control tool for a variety of businesses to identify and address regulatory risk issues, compliance problems, and potential fraud.

Data Pre-Processing

Let’s have a look at the data set :

data preview

After removing null values, missing values, duplicate rows, and unused columns (ID), we proceed to text analysis.
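As a minimal sketch of this cleaning step (with a made-up mini data set standing in for the real one, and assuming columns named `ID`, `text`, and `label`):

```python
import pandas as pd

# Hypothetical sample mirroring the assumed columns: ID, text, label
df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4],
    "text": ["i feel great", "i feel sad", "i feel sad", None, "happy day"],
    "label": [1, 0, 0, 1, 1],
})

df = df.dropna(subset=["text"])           # drop rows with null/missing text
df = df.drop_duplicates(subset=["text"])  # drop duplicate rows
df = df.drop(columns=["ID"])              # ID column is unused
print(len(df))  # 3 rows remain
```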

a first glimpse of what the data is about

Stop words (is, to, and, the, …) are very frequent and present in abundance in a document, so before moving forward let’s see how they are distributed in our dataset with respect to the positive and negative data (label 1 and label 0 respectively).

Zipf’s distribution:
The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.
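To see what Zipf’s law predicts concretely, here is a toy rank–frequency check (a real analysis would use the full corpus token stream rather than this illustrative sentence):

```python
from collections import Counter

# Toy text standing in for the corpus
tokens = "the cat sat on the mat the cat ran".split()
freqs = sorted(Counter(tokens).values(), reverse=True)

# Under Zipf's law, frequency is roughly proportional to 1/rank,
# so rank * frequency should stay roughly constant.
for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)
```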

Feature Engineering

  1. ‘pos frequency’ means how many times a token has occurred in positive labeled data
  2. ‘neg frequency’ means how many times a token has occurred in negative labeled data
  3. ‘pos rate’ = (pos frequency)/ (pos frequency + neg frequency); Similarly, we can compute ‘neg rate’
  4. ‘pos_freq_pct’ = (pos frequency)/ (Total pos frequency); Similarly, we can compute ‘neg_freq_pct’
  5. harmonic_pos_freq_pct : harmonic mean between pos rate and pos_freq_pct for each token
  6. harmonic_neg_freq_pct : harmonic mean between neg rate and neg_freq_pct for each token
  7. norm_cdf_pos_rate : normal distribution CDF value of pos rate (mu=0, sigma=1)
  8. norm_cdf_neg_rate : normal distribution CDF value of neg rate (mu=0, sigma=1)
  9. norm_cdf_pos_freq_pct : normal distribution CDF value of pos_freq_pct (mu=0, sigma=1)
  10. norm_cdf_neg_freq_pct : normal distribution CDF value of neg_freq_pct (mu=0, sigma=1)
  11. harmonic_pos_cdf : harmonic mean between norm_cdf_pos_rate and norm_cdf_pos_freq_pct for each token
  12. harmonic_neg_cdf : harmonic mean between norm_cdf_neg_rate and norm_cdf_neg_freq_pct for each token
data preview after feature engineering (the remaining columns of the data set above)
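A sketch of how these features can be computed, using made-up per-label token counts and a stdlib standard-normal CDF (the positive-side features are shown; the negative side is symmetric):

```python
import math
from collections import Counter

# Hypothetical per-label token counts standing in for the real data set
pos_counts = Counter({"happy": 4, "feel": 3, "i": 5})
neg_counts = Counter({"sad": 4, "feel": 3, "i": 6})
total_pos = sum(pos_counts.values())  # total pos frequency

def harmonic(x, y):
    """Harmonic mean of two non-negative numbers."""
    return 2 * x * y / (x + y) if x + y else 0.0

def norm_cdf(x):
    """CDF of the standard normal distribution (mu=0, sigma=1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

features = {}
for token in set(pos_counts) | set(neg_counts):
    pf, nf = pos_counts[token], neg_counts[token]
    pos_rate = pf / (pf + nf)
    pos_freq_pct = pf / total_pos
    cdf_rate = norm_cdf(pos_rate)
    cdf_pct = norm_cdf(pos_freq_pct)
    features[token] = {
        "pos_rate": pos_rate,
        "pos_freq_pct": pos_freq_pct,
        "harmonic_pos_freq_pct": harmonic(pos_rate, pos_freq_pct),
        "norm_cdf_pos_rate": cdf_rate,
        "norm_cdf_pos_freq_pct": cdf_pct,
        "harmonic_pos_cdf": harmonic(cdf_rate, cdf_pct),
    }
    # the neg_* features follow the same pattern with pf and nf swapped

print(features["happy"]["pos_rate"])  # 1.0 — "happy" never appears in negative rows
```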


Correlation Analysis

To study the relationship between words used in positive data and words used in negative data, we need to run several correlation analyses:

1. pos frequency vs neg frequency
2. harmonic mean of (pos rate, pos_freq_pct) vs harmonic mean of (neg rate, neg_freq_pct)
3. harmonic mean of (norm_cdf_pos_rate, norm_cdf_pos_freq_pct) vs harmonic mean of (norm_cdf_neg_rate, norm_cdf_neg_freq_pct)
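For a numeric (non-plotted) version of such a comparison, a Pearson correlation between the two frequency vectors can be computed with NumPy; the per-token counts below are invented for illustration:

```python
import numpy as np

# Hypothetical per-token frequencies in positive vs negative rows,
# e.g. for the tokens "feel", "i", "happy", "great", "day"
pos_freq = np.array([30, 25, 2, 1, 15])
neg_freq = np.array([28, 27, 0, 0, 14])

# Pearson correlation between the two frequency vectors
r = np.corrcoef(pos_freq, neg_freq)[0, 1]
print(round(r, 3))
```

A high positive correlation here would mean the same tokens dominate both labels (mostly stop words), which is exactly why the transformed features in plot 3 separate the labels better.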

Which of the three plots above best explains the relationship between words in negative and positive rows, and why?

Of the three graphs, the third explains the relationship between words in negative and positive rows best. The reasons are listed below:

  1. After the normal-CDF transformation, the words used in the positive and negative data sets are clearly different, and word frequencies are negatively correlated between the two labels. We can also see a slight bump in the middle, which tells us that a few highly frequent words like “i”, “feel”, “and”, etc. are present in almost equal amounts under both labels.
  2. It also accounts for highly frequent words other than stop words and shows the true picture: different kinds of words are used in the positive and negative data.

Suppose we want to build a machine learning model to predict whether a given sentence has positive or negative sentiment. The pipeline below highlights the various steps and algorithms for the task.

Let’s read the data in afresh and prepare it for the ML model.

Remove punctuation, spaces, special characters, and duplicates
Remove stop words, then tokenise and lemmatise

Now that the data is cleaned and prepared, let’s split the data set for training and testing.

standard Train/Test split
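A standard split with scikit-learn, shown on a tiny invented corpus (an 80/20 split; `stratify` keeps the label balance in both parts):

```python
from sklearn.model_selection import train_test_split

# Hypothetical cleaned texts and labels
texts = ["i feel great", "this is awful", "so happy today",
         "i am sad", "wonderful news", "terrible day"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)
print(len(X_train), len(X_test))
```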

The ML model needs training data in numeric form, so we use a TF-IDF vectoriser for the conversion and for creating the bag of words (BOW).

train-test conversion to BOW using TF-IDF
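A sketch of that conversion with scikit-learn’s `TfidfVectorizer`, on made-up train/test texts — note the vectoriser is fitted on the training text only, and the same vocabulary is reused for the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned train/test texts
train_texts = ["i feel great", "i am sad", "so happy today"]
test_texts = ["feel sad today"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary + weights
X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary
print(X_train.shape)  # (documents, vocabulary size)
```

(The default tokeniser drops one-character tokens such as “i”, so the vocabulary here has 7 terms.)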

We select a model based on the number and type of classes. In our case the target is nominal with two categories (0 & 1).

Baseline Model Selection: Logistic Regression
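A baseline fit with scikit-learn’s `LogisticRegression` on TF-IDF features, again with a tiny invented corpus standing in for the real training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data
texts = ["love this so much", "feel great today", "totally awful",
         "very sad news", "happy and excited", "bad and sad"]
labels = [1, 1, 0, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)  # baseline binary classifier
clf.fit(X, labels)
print(clf.predict(vec.transform(["sad awful day"])))
```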

After fitting the model on the training set, we need to test its accuracy. In our case, recall, precision, F-score, and the ROC-AUC curve are the parameters on which the model should be evaluated.

Evaluation Metrics for Baseline Model
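These metrics are available in `sklearn.metrics`; the true labels and model outputs below are invented purely to show the calls:

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical true labels, hard predictions, and class-1 probabilities
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))                # harmonic mean of the two
print("roc-auc:", roc_auc_score(y_true, y_score))     # ranking quality of scores
```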

Let’s test the model with some raw/unseen data to make sure it is not overfitting.
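One convenient way to score raw, unseen sentences is to wrap the vectoriser and classifier in a scikit-learn `Pipeline`, so new text goes through the identical transformation; the training sentences here are again hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data; the real model is trained on the cleaned corpus
texts = ["i feel great today", "what a wonderful day", "so happy right now",
         "i feel terrible", "this is awful", "very sad and tired"]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# raw, unseen sentences pass through the same vectoriser + classifier
unseen = ["i feel wonderful", "such an awful sad day"]
print(model.predict(unseen))
```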

Feel free to reach out for the whole code :)

Thanks!
