Sentiment Analysis Project Using TextBlob

Qudirah
6 min read · Feb 27, 2023


Hi! Tomorrow is the end of February, and at the beginning of this year I made a commitment to posting here twice per month. Best believe I will be posting both articles by tomorrow (the 28th).

What has changed since then?

Well, I am now a data science instructor and I guess I will dedicate a whole new post to talking about that.

Since my career as a machine learning engineer began, I have worked on so many NLP-related projects that my interest in the field grew naturally.

The goal of sentiment analysis is to determine the overall sentiment polarity of a piece of text, which can be positive, negative, or neutral. This can be useful in a variety of applications, such as social media monitoring, brand reputation management, customer feedback analysis, and market research.

Sentiment analysis can be performed using various techniques, such as rule-based systems, lexicon-based methods, machine learning algorithms, and deep learning models. The choice of method depends on the specific use case, the quality and quantity of available data, and the desired level of accuracy and interpretability.

One of the libraries you can use to perform sentiment analysis is TextBlob.

There are others, such as VADER or Flair, and you could even build one from scratch yourself, but today we will focus on the TextBlob library.
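Before diving into the project, here is a minimal sketch of what TextBlob gives you out of the box (the example sentence is my own):

from textblob import TextBlob

# .sentiment returns a namedtuple: polarity in [-1, 1] (negative to positive)
# and subjectivity in [0, 1] (objective to subjective)
blob = TextBlob("The flight was comfortable but the food was terrible.")
print(blob.sentiment)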

Now, for the next couple of months, my projects will be based on the British Airways airline, and after the series of projects, I will tell you why that is. It is a way of applying DS/DA/ML to real-life problems. The projects will be beginner-friendly, so hang on.

So today, the project is centered around scraping reviews from British Airways customers and getting insights from them.

The contents will be:

  1. Scraping data
  2. Cleaning data
  3. Word Cloud Analysis
  4. Sentiment analysis
  5. Count Plot of Sentiments
  6. Summarizing the analysis

1. Scraping data

Of course, the first step in any project is to import the libraries you will be using; if you don’t have them, you can pip-install them.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import preprocessor as p  # the tweet-preprocessor package; used here to strip emojis
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import matplotlib.pyplot as plt
from textblob import TextBlob
import seaborn as sns

# First-time users may need to download the NLTK data used below:
# nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')

These are all the libraries needed for this project.

To scrape, I will be using the BeautifulSoup library, and I won’t be going over the scraping in detail in this post. There are 3,000+ reviews, spanning several pages. The code below helped me scrape the data.

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 5
page_size = 1000

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f" ---> {len(reviews)} total reviews")

2. Cleaning the data

This is the data in question:

There are emojis and unnecessary ‘Trip Verified’ and ‘Not Verified’ prefixes in some of the reviews. There are also punctuation marks to get rid of.

The preprocessor library will get rid of the emojis, and we can write functions to remove the unnecessary prefixes and punctuation marks. Below is the code:

# Remove emojis with the tweet-preprocessor library
df['clean_reviews'] = df['reviews'].apply(p.clean)

def clean_verified(review):
    # Strip the "Trip Verified | " / "Not Verified | " prefixes
    if review[:16] == 'Trip Verified | ':
        review = review[16:]
    elif review[:15] == "Not Verified | ":
        review = review[15:]
    return review

df['clean_reviews'] = df['clean_reviews'].apply(clean_verified)

def punctuation_removal(r):
    punc = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    for ele in r:
        if ele in punc:
            r = r.replace(ele, "")
    return r

df['clean_reviews'] = df['clean_reviews'].apply(punctuation_removal)

3. Word Cloud Analysis

I want to build a word cloud to view the words most mentioned by the customers. For the word cloud, stop words are removed and only alphabetic tokens are kept. The function that generates the word cloud is below, and so is the word cloud itself:

stop_words = set(stopwords.words('english'))

def filter_words(review_list):
    # Keep only alphabetic tokens that are not stop words
    filtered = []
    for r in review_list:
        word_tokens = word_tokenize(r)
        for w in word_tokens:
            if w not in stop_words and w.isalpha():
                filtered.append(w)
    return filtered

reviews = []
for i in df['clean_reviews']:
    reviews.append(i)

def plot_wordcloud(review, title, max_words):
    words_filtered = filter_words(review)
    text = " ".join(words_filtered)
    word_cloud = WordCloud(background_color="white", random_state=1, stopwords=stop_words,
                           max_words=max_words, width=800, height=1500)
    word_cloud.generate(text)
    plt.figure(figsize=[10, 10])
    plt.imshow(word_cloud, interpolation="bilinear")
    plt.axis('off')
    plt.title(title)
    plt.savefig('foo.png')
    plt.show()
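The call that produced the word cloud is not shown above; a plausible invocation (the title and word limit here are my guesses) would be:

plot_wordcloud(reviews, 'Most mentioned words in British Airways reviews', 100)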

Now, just before the sentiment analysis, I split the reviews into tokens, lemmatized them, and then joined them back into texts.

lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in word_tokenize(text)]

df['tokenized_reviews'] = df['clean_reviews'].apply(lemmatize_text)
df['tokenized_reviews'] = df['tokenized_reviews'].apply(lambda x: [item for item in x if item not in stop_words])
df['tokenized_reviews'] = df['tokenized_reviews'].apply(lambda x: ' '.join(x))

Below is the data after the preprocessing:

4. Sentiment analysis

Now, the sentiment analysis. To conduct it, I use the TextBlob library. TextBlob is a popular library but honestly not that accurate; it depends on the use case, and in this case the use case is quite simple, so it will do well. Below is the function used. A polarity score above 0 indicates a positive sentiment, a score below 0 indicates a negative one, and a score of exactly 0 is neutral. There are so many things you can do with TextBlob, and you can read the documentation here.

def sentiment_analyzer(review):
    # Polarity > 0 -> positive, < 0 -> negative, == 0 -> neutral
    sentiment = TextBlob(review)
    score = sentiment.sentiment.polarity
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

5. Count Plot of Sentiments

Finally, visualize the ratio of positive to negative reviews using a simple count plot.

sns.countplot(data=df,x='sentiment')
plt.title('Sentiment Analysis of review')
plt.savefig('foo1.png')

6. Summarizing the analysis

From the count plot, inferences can be drawn about the balance of positive, negative, and neutral reviews, and that brings us to the end of this sentiment analysis project using TextBlob. Thank you.

Appendix

Sentiment Analysis: ChatGPT explains sentiment analysis as a subfield of natural language processing (NLP) that involves using machine learning and statistical techniques to automatically identify and extract subjective information from textual data, such as opinions, emotions, attitudes, and feelings expressed by people towards a particular topic or entity.

Word cloud: A word cloud is a visual representation of a collection of words, where the size of each word indicates its frequency or importance within the text or dataset being analyzed.

Tokens: In natural language processing, a token refers to a sequence of characters that represents a single unit of meaning. Typically, a token corresponds to a word, although it can also be a punctuation mark, a number, a symbol, or a combination of these.
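For a quick illustration (the sentence is my own), NLTK’s word_tokenize splits a string into tokens like this:

from nltk.tokenize import word_tokenize

print(word_tokenize('The crew was friendly, but boarding took 45 minutes!'))
# ['The', 'crew', 'was', 'friendly', ',', 'but', 'boarding', 'took', '45', 'minutes', '!']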

Lemmatize: Lemmatization is a process in natural language processing (NLP) that involves reducing a word to its base or dictionary form, known as its lemma. The lemma of a word is its canonical form that represents its core meaning and is often useful in standardizing variations of words that have the same meaning.
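A small sketch of NLTK’s WordNet lemmatizer in action (the example words are my own):

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('wings'))            # 'wing'
print(lemmatizer.lemmatize('delays'))           # 'delay'
# Words are treated as nouns by default; pass a POS tag for verbs
print(lemmatizer.lemmatize('flying', pos='v'))  # 'fly'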
