Sentiment Analysis of Google Play Store reviews using Python

Nikita Bhole
6 min read · Aug 13, 2021


Sentiment analysis is a technique for identifying whether a piece of text expresses a positive, negative, or neutral emotion. Many companies use it to analyse brand and product reviews, since it helps them understand customer feedback and uncover the underlying sentiment in a text.

There are different types of sentiment analysis and in each type, information can be extracted in different ways. For example,

a) I really love this app and its features. : Positive

This app is very confusing and disappointing. : Negative

b) This app is a nightmare. Totally useless. : Anger

This app was very helpful. It made everything a lot easier. : Happiness

c) Very frustrated right now. I am not able to log in. Can you help? : Request for assistance

d) This application’s dashboard is useless, and it is a waste of time.

From this review, several pieces of information can be extracted:

This application’s dashboard is useless, and it is a waste of time. : Negative

Entity: this application · Aspect: dashboard · Opinion: useless

Natural language processing gives us the tools to process a document and find positive, negative, or neutral sentiments, which is useful for identifying trends and customers’ sentiments towards a product or service. Consequently, business objectives can be created or modified to address customers’ concerns.

Let us dive deeper into an application of sentiment analysis. This exercise focuses on the first part of a project: assigning a sentiment score to each customer review from an application store.

The sentiment score is represented by two numbers: sentiment polarity and subjectivity. Sentiment polarity ranges from -1 to +1, where +1 means a positive statement and -1 means a negative statement. Subjectivity ranges from 0 to 1 and measures how much of the text is personal opinion, emotion, or judgement rather than factual (objective) information.
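To make the polarity number concrete, here is a deliberately tiny, hand-made lexicon sketch of how a lexicon-based scorer can arrive at a polarity in the range -1 to +1. This is a simplification for intuition only, not how TextBlob actually computes its score (TextBlob uses a much larger pattern-based lexicon with handling for negation and intensifiers):

```python
# Toy polarity scorer: average the scores of known opinion words from a
# tiny hand-made lexicon. Real libraries use far larger lexicons plus
# rules for negation ("not good") and intensifiers ("very good").
LEXICON = {"love": 0.8, "great": 0.8, "helpful": 0.6,
           "confusing": -0.6, "disappointing": -0.7, "useless": -0.9}

def toy_polarity(text):
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    # No opinion words found -> neutral (0.0)
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("I really love this app"))       # positive: 0.8
print(toy_polarity("confusing and disappointing"))  # negative
print(toy_polarity("the app opens a window"))       # neutral: 0.0
```

The same intuition carries over to TextBlob: opinion-bearing words pull the score towards the extremes, while neutral text stays near zero.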

Data for this exercise can be found here. The dataset was created by scraping reviews from the application store into a JSON list.

We can use libraries like NLTK, Gensim, TextBlob, spaCy, and CoreNLP for data pre-processing and sentiment analysis. Here we will use NLTK for pre-processing and TextBlob to calculate the sentiment score (sentiment polarity and subjectivity).

import json
import regex
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob
import string
import re
import emoji
from pandas_profiling import ProfileReport
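Note that NLTK’s stopword list and tokenizer models are not bundled with the library; they need a one-time download (this assumes network access, and stores the data under `~/nltk_data` by default):

```python
import nltk

# One-time setup: fetch the corpora that the rest of the script relies on.
nltk.download('stopwords')  # used by nltk.corpus.stopwords
nltk.download('punkt')      # used by word_tokenize / sent_tokenize
```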

As we discussed previously, the dataset we are working with is a JSON file, so here we open it in read-only mode and parse it with the json module. Be aware that with a very large file, loading the whole dataset into memory at once can produce an “out of memory” error.

# open the json file in read-only mode
f = open('google_reviews.json', "r")
# parse the JSON; here it yields a list of review dictionaries
data = json.load(f)

One way of inspecting the complete dataset is to iterate through the JSON list, where all columns and the total number of rows can be seen. The data becomes much more convenient to handle once we convert it into a DataFrame.

# iterating through the json list
for i in data:
    print(i)
print(len(data))

# convert to a DataFrame for easier handling
df = pd.DataFrame(data)

A quick and simple way to do exploratory data analysis is pandas profiling, which generates an HTML report summarizing the dataset:

prof = ProfileReport(df)
prof.to_file(output_file='output.html')

Various other information such as variables, interactions, correlations, missing values, samples and duplicate rows can also be analysed in the report.

Many columns are not useful for our analysis, so we keep only the essential ones: “reviewCreatedVersion”, the version of the application against which the review was written; “score”, the rating given by the customer in the range 1–5, where 1 is the lowest value and denotes a negative review and 5 is the highest value and denotes a positive one; and “content”, the text of the customer’s review.

df=df[["reviewCreatedVersion", "score","content"]]

Let’s not forget to look at missing values. In the reviewCreatedVersion column about 5,012 values are missing. Since this is a small fraction of the dataset, it should not make much difference. The rest of the columns do not have any missing values.

check_total_none=df.isnull().sum()
print(check_total_none)

To get a good feel for the dataset we can run some quick checks, such as how many different versions of the app were reviewed, and which reviews received the highest rating (i.e. 5) and a mid rating (i.e. 3).

score_high = df[df["score"] == 5]
print("score high:", score_high)

score_mid = df[df["score"] == 3]
print("score_mid:", score_mid)

Check how many unique versions of the app are available in the dataset.

print(df.reviewCreatedVersion.unique())
print(df.reviewCreatedVersion.nunique())

Find the average of all ratings for each unique version and check which version received the highest average rating.

x = df.groupby('reviewCreatedVersion')['score'].mean()
print(x)
# version with the highest average rating
print(x.idxmax())

Next, plotting a histogram of the scores shows whether the app received more positive or more negative reviews.

plt.hist(df['score'], bins = 5)
plt.show()

The graph shows the frequency of each rating received: most of the time the app received either a 4 or a 5 rating, so it can be said that the app performs above average.
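The eyeball reading of the histogram can be backed with exact counts. Here is a minimal stdlib sketch on a made-up sample of scores (on the real DataFrame, `df['score'].value_counts()` gives the equivalent answer):

```python
from collections import Counter

# made-up sample of review scores, for illustration only
scores = [5, 4, 5, 3, 5, 4, 1, 5, 2, 4]
counts = Counter(scores)
print(counts.most_common())  # rating frequencies, most common first

# share of reviews rated 4 or 5
high_share = (counts[4] + counts[5]) / len(scores)
print(high_share)
```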

Data Pre-processing

Now let’s start pre-processing the data and perform some basic steps that are necessary to convert the raw text into a form more suitable to work with.

Lowercasing

Lowercasing is an important step to maintain consistency of the data. Different capitalizations of the same input may otherwise lead to different outputs. For example, Canada, canadA and CANADA all map to the same lowercase form, which is why lowercasing is generally considered standard practice.

# Lower casing

# Change the reviews type to string
df['content'] = df['content'].astype(str)
# Before lowercasing
print(df['content'][2])
#Lowercase all reviews
df['content']= df['content'].apply(lambda x: x.lower())
print(df['content'][2]) ## to see the difference

Emojis

It is also important to remove emojis from the data, as they can cause problems in the analysis later on.

# check if any review contains characters outside letters/punctuation
alphabet = string.ascii_letters + string.punctuation
print(df.content.str.strip(alphabet).astype(bool).any())

extracted_emojis = []

def extract_emojis(s):
    # despite the name, this removes emojis: it deletes every character
    # in the Unicode supplementary planes (U+10000 and above)
    expe = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)
    return expe.sub(r'', s)

for y in df['content']:
    extracted_emojis.append(extract_emojis(y))
print(extracted_emojis)

Stopwords

Removing stopwords can improve performance to a great extent. Stopwords are common words such as myself, me, she, he, they, mine and you; when they are removed, only the meaningful tokens are left.

# stop words

stop_words = set(stopwords.words('english'))
df['extracted_emojis'] = extracted_emojis
# filter stopwords word by word, keeping the rest of the review
df['extracted_emojis'] = df['extracted_emojis'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop_words))
print(df['extracted_emojis'][5])

Stemming

Stemming is another important step: it chops off the end of a word and transforms the word into its root form, removing suffixes such as -s, -es, -ed and -ing.

# stemming

st = PorterStemmer()

def stemming(x):
    if x is None:
        return x
    # stem each word and join the stems back into one string
    return ' '.join(st.stem(word) for word in x.split())

# assign the result back, otherwise the stemmed text is discarded
df['extracted_emojis'] = df['extracted_emojis'].apply(stemming)
print(df['extracted_emojis'][100])

Sentiment Score

In this step, each review receives its sentiment score, and the results can be inspected.

#Function to calculate sentiment score for whole data set

def senti_sc(x):
if x is not None:
return TextBlob(x).sentiment

df["Sentiment_score"]= df["extracted_emojis"].apply(senti_sc)
print(df.loc[0:19,['extracted_emojis','Sentiment_score']])
f.close()

In the results above, sentiment polarity and subjectivity have been calculated for each review: polarity denotes whether the review is positive or negative, and subjectivity denotes how opinion-based the text is.
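A common next step is to collapse the continuous polarity into discrete labels. The cut-off of ±0.05 below is an arbitrary choice for illustration, not something TextBlob prescribes:

```python
def polarity_to_label(polarity, threshold=0.05):
    # Treat scores close to zero as neutral; the threshold is a
    # judgement call and can be tuned for a given dataset.
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_to_label(0.8))    # positive
print(polarity_to_label(-0.65))  # negative
print(polarity_to_label(0.0))    # neutral
```

With the DataFrame from this exercise, the labels could be attached with something like `df["label"] = df["Sentiment_score"].apply(lambda s: polarity_to_label(s.polarity) if s is not None else None)`.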

Further, this analysis can feed into various other applications: the aspects of a product or application that are more positively received in the market can be explored for new business opportunities. For example, two newly introduced products can be compared to understand customer behaviour, among other things.
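As a sketch of that comparison idea, once each review has been scored, the average polarity per product can be compared directly. The product names and polarity values below are made up for illustration:

```python
from collections import defaultdict

# hypothetical (product, polarity) pairs from two apps' scored reviews
scored_reviews = [("app_a", 0.6), ("app_a", 0.2), ("app_a", -0.1),
                  ("app_b", -0.4), ("app_b", 0.1), ("app_b", -0.3)]

# group the polarity scores by product
by_product = defaultdict(list)
for product, polarity in scored_reviews:
    by_product[product].append(polarity)

# average polarity per product: a crude but direct comparison
averages = {p: sum(v) / len(v) for p, v in by_product.items()}
print(averages)
```

On the real dataset, the same comparison can be run across app versions instead of products, reusing the `reviewCreatedVersion` grouping from earlier.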

Thanks for taking the time to read this article. I hope it helped in understanding sentiment analysis of Google Play Store reviews using Python. I would be glad to have your feedback, and I always enjoy conversing and having discussions with data folks.
