Step-by-Step Text Analysis and Sentiment Analysis Using NLTK
Problem Statement: Suppose you are working as a compliance data science analyst at a firm. One of your tasks is to monitor internal communications in order to better understand employees’ moods and assess potential risks. You can leverage sentiment analysis, which has become a useful risk-management and control tool for a variety of businesses, to identify and address regulatory risk issues, compliance problems and potential fraud.
Data Pre-Processing
Let’s have a look at the data set:
After removing null values, missing values, duplicate rows and the unused ID column, we can proceed to text analysis.
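A minimal sketch of this cleaning step with pandas; the column names and sample rows are hypothetical stand-ins for the real data set:

```python
import pandas as pd

# Hypothetical sample mirroring the assumed schema: an ID column,
# a free-text column, and a 0/1 sentiment label.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "text": ["i feel great", "i feel sad", "i feel sad", None],
    "label": [1, 0, 0, 1],
})

df = df.drop(columns=["ID"])   # drop the unused ID column
df = df.dropna()               # remove rows with null / missing values
df = df.drop_duplicates()      # remove duplicate rows
df = df.reset_index(drop=True)

print(len(df))  # 2 rows survive cleaning
```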
Stop words (is, to, and, the, …) are very frequent and present in abundance in a document, so before moving forward let’s see how they are distributed in our data set across the positive and negative data (label 1 and label 0 respectively).
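A sketch of counting stop words per label. With NLTK you would call `nltk.download("stopwords")` and use `stopwords.words("english")`; a small inline stop-word set keeps this example self-contained, and the labeled sentences are hypothetical:

```python
from collections import Counter

# Inline subset standing in for NLTK's English stop-word list.
stop_words = {"is", "to", "and", "the", "i", "a"}

pos_texts = ["i feel happy and calm", "the mood is great"]  # label 1
neg_texts = ["i feel sad", "this is awful to read"]          # label 0

def stopword_counts(texts):
    """Count how often each stop word appears across a list of sentences."""
    return Counter(w for t in texts for w in t.split() if w in stop_words)

print(stopword_counts(pos_texts))  # distribution in positive data
print(stopword_counts(neg_texts))  # distribution in negative data
```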
Zipf’s distribution:
The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.
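A quick way to eyeball Zipf’s law on a corpus is to rank tokens by frequency and look at frequency × rank, which should stay roughly constant if the law holds. A minimal sketch on a toy corpus (the sentence is hypothetical):

```python
from collections import Counter

# Toy corpus standing in for the internal-communications data set.
corpus = "i feel good and i feel that the good mood helps and i feel fine"
counts = Counter(corpus.split())

# Rank tokens by frequency; under Zipf's law frequency is roughly
# proportional to 1 / rank, so rank * frequency stays near-constant.
ranked = counts.most_common()
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, rank * freq)
```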
Feature Engineering
- ‘pos frequency’ means how many times a token has occurred in positive labeled data
- ‘neg frequency’ means how many times a token has occurred in negative labeled data
- ‘pos rate’ = (pos frequency)/ (pos frequency + neg frequency); Similarly, we can compute ‘neg rate’
- ‘pos_freq_pct’ = (pos frequency)/ (Total pos frequency); Similarly, we can compute ‘neg_freq_pct’
- harmonic_pos_freq_pct : harmonic mean between pos frequency and pos_freq_pct for each token
- harmonic_neg_freq_pct : harmonic mean between neg frequency and neg_freq_pct for each token
- norm_cdf_pos_rate : normal distribution CDF value of pos rate (mu=0, sigma=1)
- norm_cdf_neg_rate : normal distribution CDF value of neg rate (mu=0, sigma=1)
- norm_cdf_pos_freq_pct : normal distribution CDF value of pos_freq_pct (mu=0, sigma=1)
- norm_cdf_neg_freq_pct : normal distribution CDF value of neg_freq_pct (mu=0, sigma=1)
- harmonic_pos_cdf : harmonic mean between norm_cdf_pos_rate and norm_cdf_pos_freq_pct for each token
- harmonic_neg_cdf : harmonic mean between norm_cdf_neg_rate and norm_cdf_neg_freq_pct for each token
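Following the definitions above, the positive-side features could be computed with plain Python; the standard-normal CDF comes from the standard library’s `statistics.NormalDist`, and the toy labeled sentences are hypothetical:

```python
from collections import Counter
from statistics import NormalDist

# Hypothetical labeled sentences standing in for the real data set.
pos_texts = ["i feel happy", "happy and calm"]   # label 1
neg_texts = ["i feel sad", "sad and angry"]      # label 0

pos_freq = Counter(w for t in pos_texts for w in t.split())
neg_freq = Counter(w for t in neg_texts for w in t.split())
total_pos = sum(pos_freq.values())

def harmonic(a, b):
    """Harmonic mean of two non-negative numbers."""
    return 2 * a * b / (a + b) if (a + b) else 0.0

cdf = NormalDist(mu=0, sigma=1).cdf  # standard-normal CDF

features = {}
for token in set(pos_freq) | set(neg_freq):
    p, n = pos_freq[token], neg_freq[token]
    pos_rate = p / (p + n)
    pos_freq_pct = p / total_pos
    norm_cdf_pos_rate = cdf(pos_rate)
    norm_cdf_pos_freq_pct = cdf(pos_freq_pct)
    features[token] = {
        "pos_rate": pos_rate,
        "pos_freq_pct": pos_freq_pct,
        "harmonic_pos_freq_pct": harmonic(p, pos_freq_pct),
        "norm_cdf_pos_rate": norm_cdf_pos_rate,
        "norm_cdf_pos_freq_pct": norm_cdf_pos_freq_pct,
        "harmonic_pos_cdf": harmonic(norm_cdf_pos_rate, norm_cdf_pos_freq_pct),
    }

# The negative-side features are computed symmetrically from neg_freq.
print(features["happy"]["pos_rate"])  # "happy" occurs only in positive data
```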
Correlation Analysis
To study the correlation between words used in positive data and words used in negative data, we need to run several correlation analyses.
Which of the above three plots best explains the relationship between words in the negative and positive rows, and why?
Of the three graphs, the third one best explains the relationship between words in the negative and positive rows. The reasons are listed below:
- When the rates are normal-CDF transformed, the words used differ considerably between the two labels, and word frequencies are negatively correlated across them. We can also see a slight bump in the middle, which tells us that a few highly frequent words like “i”, “feel” and “and” are present in almost the same amounts in both labels.
- It also accounts for highly frequent words other than stop words, and correctly shows that different kinds of words are used in the positive and negative data.
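As a sketch of the kind of correlation analysis meant here, one can line up each token’s positive and negative frequencies and compute their Pearson correlation by hand (the toy counts are hypothetical):

```python
from collections import Counter

# Toy token counts per label.
pos_counts = Counter("i feel good great good".split())
neg_counts = Counter("i feel bad awful bad".split())

vocab = sorted(set(pos_counts) | set(neg_counts))
x = [pos_counts[w] for w in vocab]   # frequency in positive data
y = [neg_counts[w] for w in vocab]   # frequency in negative data

# Pearson correlation coefficient, computed from first principles.
n = len(vocab)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
r = cov / (sx * sy)
print(round(r, 2))  # -0.76: sentiment-bearing words rarely co-occur
```

The negative coefficient mirrors the observation above: apart from shared stop words like “i” and “feel”, the words frequent in one label are rare in the other.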
Suppose we want to build a machine learning model to predict whether a given sentence has positive or negative sentiment. Below is the entire pipeline, highlighting the various steps and algorithms involved.
After reading in the data afresh, let’s prepare it for the ML model.
Since the data is now cleaned and prepared, let’s split the data set for training and testing purposes.
The ML model needs training data in numeric form, so we use a TF-IDF vectoriser to convert the text into a bag-of-words (BoW) representation.
We select a model based on the number of classes and the type of target. In our case the target is nominal with two categories (0 & 1).
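One possible sketch of these steps with scikit-learn, assuming a TF-IDF vectoriser and logistic regression as the binary classifier; the tiny corpus is a hypothetical stand-in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny hypothetical corpus; the real data set is far larger.
texts = ["i feel happy today", "what a great mood", "i feel sad",
         "this is awful", "feeling fine and calm", "angry and upset"] * 5
labels = [1, 1, 0, 0, 1, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

# TF-IDF turns each sentence into a sparse numeric bag-of-words vector.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_vec = vectorizer.transform(X_test)

# Binary nominal target -> logistic regression is a natural baseline.
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)
print(clf.score(X_test_vec, y_test))
```

Fitting the vectoriser on the training split only (and merely transforming the test split) avoids leaking test vocabulary statistics into training.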
After fitting the model on the training data, we need to test its accuracy. In our case recall, precision, F-score and the ROC-AUC curve are the metrics on which the model should be evaluated.
Let’s test the model on some raw/unseen data to make sure it is not overfitting.
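The evaluation and the raw-sentence sanity check could be sketched like this, again assuming a TF-IDF + logistic regression pipeline on a hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["i feel happy today", "what a great mood", "i feel sad",
         "this is awful", "feeling fine and calm", "angry and upset"] * 5
labels = [1, 1, 0, 0, 1, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

# Vectoriser + classifier in one pipeline, so raw strings go in directly.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("f1-score: ", f1_score(y_test, pred))
print("roc-auc:  ", roc_auc_score(y_test, proba))

# Finally, probe the model with raw, unseen sentences.
print(model.predict(["i feel wonderful", "everything is terrible"]))
```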
Feel free to reach out for the whole code :)
Thanks!