Sentiment Analysis on OpinRank Dataset

Muttineni Sai Rohith
Published in Analytics Vidhya
5 min read · May 28, 2021

A large amount of the data generated today is unstructured and requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, reviews for products and places, and search history. The process of analyzing natural language and making sense of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task that involves classifying texts or parts of texts into pre-defined sentiment categories. We will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

The NLTK library contains various utilities that allow you to effectively manipulate and analyze linguistic data. Among its advanced features are text classifiers that you can use for many kinds of classification, including sentiment analysis.

Sentiment analysis is the practice of using algorithms to classify various samples of related text into overall positive and negative categories. With NLTK, you can employ these algorithms through powerful built-in machine learning operations to obtain insights from linguistic data.

Installing and Importing

You’ll begin by installing some prerequisites, including NLTK itself as well as specific resources you’ll need throughout this tutorial.

First, use pip to install NLTK:

$ python3 -m pip install nltk

While this will install the NLTK module, you’ll still need to obtain a few additional resources. Some of them are text samples, and others are data models that certain NLTK functions require.

To get the resources you’ll need, use nltk.download():

import nltk

nltk.download()

As we are performing sentiment analysis, we will be using the vader_lexicon package in NLTK.

  • vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
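If you'd rather skip the interactive downloader, you can fetch just the resource this tutorial needs; a minimal sketch:

import nltk

# download only the VADER lexicon instead of opening the interactive downloader
nltk.download('vader_lexicon')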

OpinRank Dataset:

The OpinRank dataset contains user reviews for two kinds of entities: hotels and cars. It consists of full reviews from TripAdvisor (~259,000 hotel reviews) and Edmunds (~42,230 car reviews). The dataset can be found at https://github.com/kavgan/OpinRank.git

This dataset is well suited for text-mining tasks such as sentiment analysis, summarizing the data, and categorizing the reviews.

Each record in the dataset is wrapped in tags. The snippet below is an illustrative sketch of that format, reconstructed from the tags the parsing code extracts (not an exact excerpt from the file):
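<doc>
<date>review date</date>
<author>reviewer name</author>
<text>the full review text...</text>
<favorite>the reviewer's favorite features...</favorite>
</doc>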

Converting the above tag format into CSV format:

To do so we will use a couple of libraries. The first is Beautiful Soup, a very useful Python library for parsing HTML and XML documents. Execute the following command at the command prompt to install it:

$ python3 -m pip install beautifulsoup4

Another important library we need is lxml, the parser backend that Beautiful Soup will use for XML and HTML. Execute the following command at the command prompt to install lxml:

$ python3 -m pip install lxml

Now let's use some Python code to convert the data from tag format to CSV format. Use a Jupyter notebook for a better experience.

# converting data from tag format into CSV format
# import the BeautifulSoup package and the csv module
from bs4 import BeautifulSoup
import csv

# data file to load the data from
data_file = "2009_audi_/2009_audi_a5"
# CSV file to write the converted data to
csv_file = "2009_audi_/2009_audi_a5.csv"

# loading data from the data file as text
with open(data_file) as txt_file:
    data = txt_file.read()

# using Beautiful Soup to parse the tag-formatted data
soup = BeautifulSoup(data, 'lxml')

# list to collect the rows for the CSV output
csv_data = []
# headers for the CSV file
csv_data.append(["date", "author", "text", "favorite"])

# iterating over every <doc> tag in the file
for doc_tag in soup.find_all("doc"):
    # row holding the values extracted from one <doc> tag
    raw_data = []
    # extracting each field from the respective child tag
    raw_data.append(doc_tag.find("date").text)
    raw_data.append(doc_tag.find("author").text)
    raw_data.append(doc_tag.find("text").text)
    raw_data.append(doc_tag.find("favorite").text)
    csv_data.append(raw_data)

# function to write a list of lists as a CSV file
def write_csv(file, data):
    with open(file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

# writing the collected rows to the CSV file
write_csv(csv_file, csv_data)

We have successfully converted the data from tag format into CSV format. I saved the data as CSV files so I can share them with readers who want to get to the results faster. Alternatively, you can convert the parsed data directly into a Pandas DataFrame.
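A minimal sketch of that alternative, assuming the csv_data list built above (its first row holds the column headers):

import pandas as pd

# build a DataFrame directly from the parsed rows,
# using the first row of csv_data as the column names
df = pd.DataFrame(csv_data[1:], columns=csv_data[0])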

Building the Sentiment Analyzer:

Once the data is ready in CSV format, we need to build a sentiment analyzer to categorize the reviews and calculate a rating score.

So let's use some Python code to do that.

import pandas as pd

# loading the CSV data into a DataFrame
df = pd.read_csv(csv_file)
df.head(5)

# downloading the VADER lexicon used by the sentiment analyzer
import nltk
nltk.download('vader_lexicon')

We will use the SentimentIntensityAnalyzer class from the nltk.sentiment.vader module. It implements VADER, a rule-based sentiment analyzer, so sentiment scores can be generated without complex coding. Before we can use it, we need to instantiate it.

# importing the sentiment analyzer from nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# instantiating the sentiment analyzer
sid = SentimentIntensityAnalyzer()

The analyzer's polarity_scores() method generates four scores for each piece of text: negative, neutral, positive, and compound. We will use the compound score, which ranges from -1 (most negative) to +1 (most positive), as the rating score, since we can categorize the reviews by applying thresholds to it.
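As a quick illustration (the sentence is made up, and the exact numbers depend on the lexicon version), polarity_scores() returns a plain dictionary:

# returns a dict with 'neg', 'neu', 'pos', and 'compound' keys
print(sid.polarity_scores("The car is absolutely wonderful!"))
# e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}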

# storing scores
scores = []

# iterating over every review
for i in df["text"]:
    # calculating the compound sentiment score for the review
    scores.append(sid.polarity_scores(i)["compound"])

# loading the rating score into the dataframe
df["rating_score"] = scores
df.head(10)

Here we can observe that the rating score ranges from negative values up to approximately 0.98. So, by applying certain threshold values, we can extend the above code to categorize each review as positive, negative, or neutral.

#lower and upper thresholds
threshold_lower = 0.4
threshold_upper = 0.85

The above thresholds were chosen after inspecting the distribution of compound scores across the reviews.
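If you want to make a similar inspection yourself, here is a small sketch, assuming the rating_score column computed above:

# summary statistics of the compound scores, useful for picking thresholds
print(df["rating_score"].describe())

# a few quantiles to see where natural cut points might fall
print(df["rating_score"].quantile([0.1, 0.25, 0.5, 0.75, 0.9]))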

# storing scores and ratings
scores = []
rating = []

# iterating over every review
for i in df["text"]:
    # calculating the compound sentiment score once per review
    compound = sid.polarity_scores(i)["compound"]

    # comparing the score against the thresholds
    if compound < threshold_lower:
        rating.append("Negative")
    elif compound < threshold_upper:
        rating.append("Neutral")
    else:
        rating.append("Positive")

    # appending the score
    scores.append(compound)

# loading the rating score and rating into the dataframe
df["rating_score"] = scores
df["rating"] = rating
df.head(10)

So we have achieved our objective and categorized the reviews into Positive, Negative, and Neutral. The complete, structured code can be found at https://github.com/muttinenisairohith/OpinRank.git
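As a quick sanity check, assuming the df built above, you can look at how the reviews are distributed across the three categories:

# count how many reviews fall into each sentiment category
print(df["rating"].value_counts())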

We can further analyze the reviews and add additional functionality, such as:

  • Building a model to summarize the topics that are the key reasons people love or hate a product.
  • Comparing sentiment scores across products by extracting their reviews, and suggesting better alternatives.
  • Generating sentiment scores for specific topics, providing additional information for users based on the reviews.

These topics will be covered in future posts.

Stay tuned.
