Working on NLP with TextBlob

Prateek Majumder
Published in Analytics Vidhya
4 min read · Mar 22, 2021

TextBlob is a Python library for working with text data. It provides a simple API for diving into various natural language processing (NLP) tasks.

Let us work through some NLP tasks using TextBlob, focusing on sentiment analysis.


We use the Trip Advisor hotel reviews dataset, which has about 20,000 reviews, each with a 1–5 star rating.

# importing the libraries

import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from textblob import TextBlob
import nltk
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
%matplotlib inline
# reading the data
df = pd.read_csv("/kaggle/input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv")

We start by importing the libraries and importing the data.

The data looks like this.

Now, let us look at how many times each rating appears.

print("Star reviews and number of times they occur.")
print(df["Rating"].value_counts())
Hotels mainly have ratings of 5 or 4.
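The counting step itself is plain pandas. Here is a self-contained sketch on a toy ratings column (the numbers are made up, not the real dataset):

```python
import pandas as pd

# toy ratings standing in for the real "Rating" column
toy = pd.DataFrame({"Rating": [5, 4, 5, 3, 5, 4, 1, 5]})

# value_counts() tallies how often each rating occurs, most frequent first
counts = toy["Rating"].value_counts()
print(counts)
```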

The notebook then covers standard NLTK preprocessing: creating a frequency table, a word cloud, and so on. Have a look at the Kaggle notebook for the entire code. (Link will be given below.)

Using TextBlob

We will use the built-in methods in TextBlob to generate review polarity and subjectivity.

df_arr = df.to_numpy()

But before building a classifier, we need to pass the data to TextBlob's functions, so we convert the dataframe into a 2D array, each row holding the review text and its rating.
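Another way to get the same row-wise structure (sketched here on toy data, not the real dataframe) is `itertuples`, which yields each row as a plain `(text, label)` tuple:

```python
import pandas as pd

# toy dataframe with the same (Review, Rating) shape as the real one
toy = pd.DataFrame({
    "Review": ["great stay", "awful room"],
    "Rating": [5, 1],
})

# each row becomes a plain (text, label) tuple
pairs = list(toy.itertuples(index=False, name=None))
print(pairs)  # [('great stay', 5), ('awful room', 1)]
```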

polarity, subjectivity = [], []
for a in df_arr:
    testimonial = TextBlob(a[0])  # a[0] is the review text
    polarity.append(testimonial.sentiment.polarity)
    subjectivity.append(testimonial.sentiment.subjectivity)

This gives us the polarity and subjectivity values for each review. To consolidate the data, we add them to our dataframe as well.

The dataframe with the new columns added.

To understand the features of the new data, we use distribution plots.

Polarity Distribution Plot.
Review Subjectivity distribution.
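The distribution plots can be reproduced with plain matplotlib. A sketch, using randomly generated scores in place of the real `df["Polarity"]` values:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display
import matplotlib.pyplot as plt

# toy polarity scores standing in for df["Polarity"]
polarity_scores = np.random.default_rng(0).uniform(-1, 1, 500)

# histogram of the scores, binned across the polarity range
fig, ax = plt.subplots()
counts, bins, _ = ax.hist(polarity_scores, bins=20)
ax.set_xlabel("Polarity")
ax.set_ylabel("Number of reviews")
ax.set_title("Polarity Distribution")
fig.savefig("polarity_distribution.png")
```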

Now, before we get to the model, a few words. TextBlob has many predefined functions which make these tasks very easy, but training its classifiers demands substantial computational resources. I was not able to train the model on all 20,000 data points, so I went with only 1,000.

from textblob.classifiers import NaiveBayesClassifier
df_model = df_arr[0:1000]
cl = NaiveBayesClassifier(df_model)

Now that the model is trained, we can test it on some sample text.

cl.classify("The hotel is very good. Food was good, housekeeping could have been better. The staff was ok")

This is classified as a 5 star review.

test_text = ".before stay hotel arrange car service price 53 tip reasonable driver waiting arrival.checkin easy downside room picked 2 person jacuzi tub no bath accessories salts bubble bath did n't stay, night got 12/1a checked voucher bottle champagne nice gesture fish waiting room, impression room huge open space felt room big, tv far away bed chore change channel, ipod dock broken morning way asked desk check thermostat said 65f "
cl.classify(test_text)

This is classified as a 3 star review.

So we can say that the classifier is doing well. With better computational resources, the model could be trained on more data and made to perform even better.

Have a look at the entire code.

Thank You.

My Linkedin Profile -

My Github Profile -