Sentiment Classification Using DistilBERT on a Custom Dataset
Overview
In this article we will work through an end-to-end NLP project: extracting data from a website, cleaning and analyzing the text, visualizing it, building a sentiment classifier with a modern state-of-the-art model, and finally saving the model to the cloud so it can be reused. By the end of this article you should be able to approach almost any kind of sentiment classification task on a noisy dataset.
The dataset I will be building consists of reviews of a specific product scraped from an e-commerce website. We will categorize the reviews into sentiment classes, and at the end we will feed in test data to check how well our model detects sentiment. Let's get started!
Step 1: Building the custom dataset
We will collect the reviews from an e-commerce website and extract them onto our local system using Beautiful Soup. No need to worry, I will show you a step-by-step guide to extracting data with Beautiful Soup.
Importing necessary libraries
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq
import numpy as np
import pandas as pd
uReq opens the requested URL, and BeautifulSoup parses the data out of the HTML tags. Here I am going to automate the product link: you just need to enter the product name of your choice, and the reviews for that specific product will be fetched.
print("Enter the product of your own chice")
product_name=input()
web_link='https://www.flipkart.com/search?q='
url=web_link+product.replace(" ","")
url_clinet=uReq(url) #clinet requesting to access the url from servr
url=url_client.read() #readng the url
url.close() #closing the conncetion
soup=bs(url,'html.parser')
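Note: some sites reject plain urlopen requests that lack a browser-like User-Agent header. If the request above fails, a fallback sketch using the requests library (already imported) could look like this:
headers={"User-Agent":"Mozilla/5.0"} #pretend to be a regular browser
response=requests.get(url,headers=headers)
soup=bs(response.text,'html.parser')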
Now it's time to pick one product out of the listed products and scrape its reviews.
data_class=soup.find_all('div',{"class":"_1AtVbE col-12-12"})
data=data_class[0] #taking the first product from the complete list shown on the UI
#list of fields i am going to scrape from the website
product_name=[] #name of the product
comment_header=[] #heading of the comment
comments=[] #full text of the comment
ratings=[] #rating of the product
user_name=[] #person who commented on the product
likes=[] #people agreeing with the comment
dislikes=[] #people disagreeing with the comment
region=[] #location from where the user commented
Now we will take a product of our own choice, automate its link, and scrape the fields listed above from the website with the help of Beautiful Soup.
import requests
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
search_product=input()
url="https://www.flipkart.com/search?q="+search_product.replace(" ","")
uClient=uReq(url) #opening the url
page_html=uClient.read() #reading the raw html
uClient.close() #closing the connection -- a must when scraping
soup=BeautifulSoup(page_html,"html.parser") #parsing the html of the search results page
ratings=[]
headings=[]
user_comments=[]
names=[]
likes_count=[]
region=[]
n_data=soup.find_all("div",class_="_1AtVbE col-12-12") #finding all the products on the requested page
data=n_data[2] #picking one product from the list (the index may vary with the page layout)
for page in range(1, 45):
    # opening the link of the selected product; .format(page) fills a '{}'
    # page placeholder if the href contains one
    click_data = "https://www.flipkart.com" + data.div.div.div.a["href"].format(page)
    req = requests.get(click_data)
    soup_next = BeautifulSoup(req.text, "html.parser")
    comment_boxes = soup_next.find_all("div", class_="col _2wzgFH")
    for comment in comment_boxes:
        rating = comment.find("div", {"class": "_3LWZlK _1BLPMq"})
        if rating is not None:  # if a rating is present
            ratings.append(float(rating.text))
        else:  # if the rating is missing, add a null value
            ratings.append(np.nan)
        heading = comment.find("p", class_="_2-N8zT")
        if heading is not None:
            headings.append(heading.text)
        else:
            headings.append(np.nan)
        user_comment = comment.find("div", class_="t-ZTKy")
        if user_comment is not None:
            user_comments.append(user_comment.text)
        else:
            user_comments.append(np.nan)
        name = comment.find("p", class_="_2sc7ZR _2V5EHH")
        if name is not None:
            names.append(name.text)
        else:
            names.append(np.nan)
        like_count = comment.find("span", class_="_3c3Px5")
        if like_count is not None:
            likes_count.append(like_count.text)
        else:
            likes_count.append(np.nan)
    region = soup_next.find_all('p', {"class": "_2mcZGG"})  # locations of the reviewers on this page

# saving all the scraped fields into a dictionary
dict1 = {"header": headings, "comments": user_comments, "user name": names,
         "number of likes": likes_count, "ratings": ratings}
df = pd.DataFrame(dict1)
df.to_csv("final_result1.csv")
The dictionary above collects all the scraped fields, which we then convert to a DataFrame and save in CSV format.
Now we have successfully built our own custom dataset. We will convert the ratings into either class 0 or class 1, where class 0 is positive sentiment (4 and 5 star ratings) and class 1 is negative sentiment (1 and 2 stars). We will ignore the neutral (3 star) ratings, because they neither help the business improve the product nor carry much weight as appreciation or criticism.
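Since the mapping code in the next step keeps every row, here is a small sketch of how the neutral reviews could be dropped first (assuming the final_result1.csv written above):
import pandas as pd
df=pd.read_csv("final_result1.csv")
df=df[df["ratings"]!=3] #ignore neutral (3-star) reviews
df["ratings"]=(df["ratings"]<=2).astype(int) #0 = positive (4-5 stars), 1 = negative (1-2 stars)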
Step 2: Let's understand the data
I have made a basic utility file for data preprocessing with the help of spaCy, NLTK, TextBlob and plain Python. If you are interested in using the file, please click this link; by simply importing it you can use all the necessary pre-built preprocessing functions to clean and understand the text.
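The utility file itself is only linked, not shown, but to give a rough idea, a few of its functions might look like this (hypothetical implementations, assuming NLTK's English stop word list):
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

STOP_WORDS = set(stopwords.words('english'))

def word_count(text):
    # number of whitespace-separated tokens in the text
    return len(str(text).split())

def stop_word_count(text):
    # how many tokens are English stop words
    return sum(1 for w in str(text).lower().split() if w in STOP_WORDS)

def remove_mul_space(text):
    # collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', str(text)).strip()

def remove_stop_words(text):
    # drop English stop words from the text
    return ' '.join(w for w in str(text).split() if w.lower() not in STOP_WORDS)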
Getting basic info from the dataset:
data=pd.read_csv('custom_data.csv')
data.head()
data.shape #number of rows and columns in the dataset
[out]>> (1850, 5)
Converting ratings of 4 and 5 into the positive class and everything else into the negative class; note that with this simple mapping, any remaining 3-star reviews land in class 1. This can be done using the apply function:
def rating_into_sentiment(x):
    if x>3:
        return 0 #positive sentiment
    else:
        return 1 #negative sentiment
data['ratings']=data['ratings'].apply(rating_into_sentiment)
data.head(10)
Let's check whether there are any null values in the dataset; if there are, we will remove those rows.
data.isnull().values.any()
[out]>> False
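Had any nulls shown up, a one-line dropna (over the columns we actually use) would remove the affected rows:
data=data.dropna(subset=['comments','ratings']).reset_index(drop=True) #drop rows with missing comments or ratings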
Import all the essential functions that were created and stored inside the utility file.
from utils import char_count
from utils import word_count
from utils import stop_word_count
from utils import email_removal
from utils import mention_count
from utils import numeric_digit_count
from utils import upper_case_count
from utils import lower_case_conversion
from utils import cont_to_exp
from utils import remove_mul_space,spelling_correction,remove_spec_char
from utils import remove_stop_words,remove_ac_char
from utils import base_root_form,remove_common_words,remove_rare_words
Applying the imported functions to learn more about the data:
data['cont_to_exp']=data['comments'].apply(cont_to_exp)
data['upper_case_count']=data['comments'].apply(upper_case_count)
data['stop_word_count']=data['comments'].apply(stop_word_count)
data['word_count']=data['comments'].apply(word_count)
data['char_count']=data['comments'].apply(char_count)
data['comments']=data['comments'].apply(remove_mul_space)
data['comments']=data['comments'].apply(remove_spec_char)
data['comments']=data['comments'].apply(remove_stop_words)
data['comments']=data['comments'].apply(remove_ac_char)
data['comments']=data['comments'].apply(base_root_form)
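Before moving on, a quick spot-check of a few rows (using the feature columns created above) lets us eyeball the cleaned text:
data[['comments','word_count','char_count','stop_word_count']].head() #peek at cleaned text and engineered features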
Here we can see that all the unwanted text has been removed; the dataset is now clean and ready for modeling. Before that, though, we will do some exploratory data analysis to understand the nature of the text and how the comments relate to the ratings.
Step 3: Exploratory Data Analysis
Importing the libraries needed for text visualization.
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Let's see how the number of likes on a comment relates to the sentiment.
sns.set_style('darkgrid')
sns.kdeplot(data[data['ratings']==0]['number of likes'],shade=True,color='red')
sns.kdeplot(data[data['ratings']==1]['number of likes'],shade=True,color='green')
plt.title("like affecting the sentiment")
plt.show()
From the figure above we can see that more people like comments that speak in favor of the product.
plt.figure(figsize=(6,8))
sns.set_style("ticks")
data['ratings'].value_counts().plot.pie(autopct='%0.2f%%')
plt.title("Percentage Contribution")
plt.xlabel("percent contribution")
plt.ylabel("target")
plt.show()
sns.set_style('darkgrid')
sns.kdeplot(data[data['ratings']==0]['char_count'],shade=True,color='red')
sns.kdeplot(data[data['ratings']==1]['char_count'],shade=True,color='green')
plt.title("like affecting the sentiment")
plt.show()
The length of a comment hints at the nature of the rating: the longer the comment, the greater the chance that it is positive feedback about the product.
sns.set_style('darkgrid')
sns.countplot(data['ratings'])
plt.show()
The dataset is skewed towards positive ratings, which means that most people are satisfied with their purchase.
Let's do a word cloud visualization of the positive and negative comments and observe which words commonly occur in each sentiment.
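The word-cloud code itself is not shown above; a minimal sketch along these lines (reusing the comments and ratings columns) would produce such figures:
positive_text=" ".join(data[data['ratings']==0]['comments'].astype(str))
negative_text=" ".join(data[data['ratings']==1]['comments'].astype(str))
fig,axes=plt.subplots(1,2,figsize=(14,6))
for ax,blob,title in [(axes[0],positive_text,"Positive comments"),(axes[1],negative_text,"Negative comments")]:
    wc=WordCloud(width=800,height=500,background_color='white').generate(blob)
    ax.imshow(wc,interpolation='bilinear') #render the word cloud image
    ax.axis('off')
    ax.set_title(title)
plt.show()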
Since I scraped reviews of Realme mobile phones, the word clouds unsurprisingly show words like ram, battery, backup, etc.
Now that we are done with cleaning and visualization, we are ready to apply a SOTA model to the text to get better predictions.
Step 4: Model Building
We will be using ktrain, a wrapper around Keras and TensorFlow, for building SOTA models. We will take 80% of the data for the training set and 20% for the validation set.
import ktrain
from ktrain import text
from sklearn.model_selection import train_test_split

data_train,data_test=train_test_split(data,test_size=0.2,random_state=1)
Let's build the sentiment classifier using ktrain's text module. First we preprocess the text, i.e. tokenize and encode it into numeric values, using the DistilBERT preprocessing mode:
train,test,preprocess=text.texts_from_df(data_train,
                                         text_column='comments',
                                         label_columns='ratings',
                                         preprocess_mode='distilbert',
                                         maxlen=100)
Next we choose which model we would like to use for sentiment analysis and fine-tune DistilBERT on our custom data; the fine-tuned weights will then be used to predict the sentiment of the comments posted on shopping websites.
model=text.text_classifier('distilbert',train_data=train,preproc=preprocess,verbose=1)
Now we wrap the model with ktrain.get_learner; the learner object is used for training and later for prediction.
learner=ktrain.get_learner(model,train_data=train,val_data=test,batch_size=32)
learner.fit_onecycle(lr=2e-5,epochs=4)
Since the dataset is small and was carefully preprocessed, I was able to get 100% accuracy. Don't expect this on a large dataset with a wide variety of comments and messy, uncleaned data.
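To look beyond raw accuracy, ktrain's learner can also print a per-class validation report:
learner.validate(val_data=test) #classification report and confusion matrix on the validation set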
predictor=ktrain.get_predictor(learner.model,preprocess)
Let's save the model to Google Drive so that it can be reused later for prediction.
from google.colab import drive
drive.mount('/content/drive')
predictor.save('/content/drive/MyDrive/distilbert')
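When you come back to this later, the saved predictor can be reloaded with ktrain.load_predictor (a small usage sketch; the sample comment is made up):
import ktrain
predictor=ktrain.load_predictor('/content/drive/MyDrive/distilbert') #reload the fine-tuned model
print(predictor.predict("Battery backup is amazing and the camera is great")) #made-up sample comment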
Let's take some text and predict the sentiment of the comment using the model we fine-tuned on our own custom dataset.
data='''Let me get straight to the point.From day 1 this phone hangs.multitasking performance is very bad.. sometimes wifi gets disconnected automatically..apps like whatsapp take 15 seconds to open.and flipkart 20-25 seconds.In the time of usb type c,we are getting micro usb.charging takes a long time..and 2 gb ram is not sufficient to run apps..phone is good for just to make calls..and do light works.not meant for multi tasking'''
Making a helper function that maps the predicted class label to a readable sentiment.
def prediction(data):
    #with a single binary label column, ktrain names the classes 'not_ratings'
    #(class 0, positive here) and 'ratings' (class 1, negative); confirm the
    #exact names with predictor.get_classes()
    if predictor.predict(data)=='not_ratings':
        return 'Positive Sentiment'
    else:
        return 'Negative Sentiment'

prediction(data)
I took the comment above from a Realme product page on Flipkart, where the user gave a one-star rating. That clearly means the user is not satisfied with the product and wrote an unfavorable comment, and indeed our model predicts the negative sentiment correctly.
Conclusion
That's all from my side. If you have any suggestions for improving this blog, please do share them. Thank you for your precious time in going through this post. Keep learning, keep exploring!