Data Science Tutorial: Analysis Of The Google Play Store Dataset
Winning submission - December Data Festival 2018
The Internet is a true gold mine of data. E-commerce and review sites are brimming with untapped data that has the potential to be converted into meaningful insights for robust decision making. Here, we explore applying data science and machine learning techniques to data retrieved from one such avenue on the internet: the Google Play Store.
Details of Dataset:
The Play Store apps data has enormous potential to drive app-making businesses to success, and actionable insights can be drawn from it to help developers capture the Android market. The dataset was taken from Kaggle. It is web-scraped data of roughly 10k Play Store apps for analyzing the Android market, consisting of 10,841 rows and 13 columns in total.
The columns of the dataset are as follows:
1) App (Name)
2) Category (App)
3) Rating (App)
4) Reviews (User)
5) Size (App)
6) Installs (App)
7) Type (Free/Paid)
8) Price (App)
9) Content Rating (Everyone/Teenager/Adult)
10) Genres (Detailed Category)
11) Last Updated (App)
12) Current Version (App)
13) Android Version (Support)
Exploratory Data Analysis:
Some key observations at first glance: the performance of an app can potentially be improved using the reviews obtained, and various patterns can be mined from the data to extract more business value.
Now, we will start implementing the procedures by importing libraries:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error # 0.3 error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
The scikit-learn (“sklearn”) library offers implementations of many machine learning algorithms for different problems.
# Reading csv file
df = pd.read_csv('googleplaystore.csv')
The most critical ingredient for finding patterns is data, whether it is a single review or a bundle of them. Whatever data comes in can be used to draw value out of it. Data also arrives with unexpected values, which should be handled before they affect the performance of the trained models that predict the outcome.
Here is the first step of cleaning the data, which will make the results more reliable.
By listing all unique values of each column, inappropriate values can be identified. Different methods can then be used to remove those values or transform them so that they improve the predictions. As the proverb goes:
“The more data we have, the more likely we are to drown in it.” — Nassim Taleb
Not only are we interested in raw data but in the data from which valuable insights can be drawn. To do so, let us take a glimpse at another proverb.
“More data beats clever algorithms, but better data beats more data.” — Peter Norvig
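To spot such inappropriate values, a quick scan of each column's unique values helps. A minimal sketch on a hypothetical three-row sample (the column names match the dataset; the values are made up to illustrate the formats found in the raw CSV):

```python
import pandas as pd

# A made-up sample; values illustrate the formats in the raw data
raw = pd.DataFrame({
    'Size': ['10M', '512k', 'Varies with device'],
    'Installs': ['1,000+', '500,000+', '10+'],
    'Price': ['0', '$4.99', '0'],
})

# Scanning the unique values of each column reveals the suffixes
# and placeholders that must be handled before numeric conversion
for col in raw.columns:
    print(col, '->', sorted(raw[col].unique()))
```

The suffixes (`M`, `k`, `+`, `$`) and placeholders (`Varies with device`) surfaced this way are exactly what the cleaning steps below remove.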
Data Cleaning
With that being said, here are the various steps taken to clean the data:
# Data cleaning for "Size" column
df['Size'] = df['Size'].map(lambda x: x.rstrip('M'))
df['Size'] = df['Size'].map(lambda x: str(round((float(x.rstrip('k'))/1024), 1)) if x[-1]=='k' else x)
df['Size'] = df['Size'].map(lambda x: np.nan if x.startswith('Varies') else x)

# Data cleaning for "Installs" column
df['Installs'] = df['Installs'].map(lambda x: x.rstrip('+'))
df['Installs'] = df['Installs'].map(lambda x: ''.join(x.split(',')))

# Data cleaning for "Price" column
df['Price'] = df['Price'].map(lambda x: x.lstrip('$').rstrip())

# Row 10472 removed due to missing value of Category
df.drop(df.index[10472], inplace=True)

# Rows [7312, 8266] removed due to "Unrated" value in Content Rating
df.drop(df.index[[7312,8266]], inplace=True)
The raw data can have random sorting. To solve this, we will use:
# Sort by "Category"
df.sort_values("Category", inplace = True)
Note that each piece of raw data may contribute to a more accurate result. The current dataset holds values in string format; for solving a regression problem, we should convert the strings to a numerical format. To do so, we proceed as follows:
# Label encoding
lb_make = LabelEncoder()

# Create column for "numeric" Content Rating
df["Content Rating NUM"] = lb_make.fit_transform(df["Content Rating"])

# Form dictionary for Content Rating and numeric values
dict_content_rating = {"Adults only 18+": 0, "Everyone": 1, "Everyone 10+": 2, "Mature 17+": 3, "Teen": 4}

# Numeric values for Content Rating
'''
Adults only 18+ = 0
Everyone = 1
Everyone 10+ = 2
Mature 17+ = 3
Teen = 4
'''

# Create column for "numeric" Category
df["Category NUM"] = lb_make.fit_transform(df["Category"])

# Form dictionary for Category and numeric values
dict_category = {}
val = 0
for i in df["Category"].unique():
    dict_category[i] = val
    val += 1

# Numeric values for Category
'''
ART_AND_DESIGN = 0
AUTO_AND_VEHICLES = 1
BEAUTY = 2
BOOKS_AND_REFERENCE = 3
BUSINESS = 4
COMICS = 5
COMMUNICATION = 6
DATING = 7
EDUCATION = 8
ENTERTAINMENT = 9
EVENTS = 10
FAMILY = 11
FINANCE = 12
FOOD_AND_DRINK = 13
GAME = 14
HEALTH_AND_FITNESS = 15
HOUSE_AND_HOME = 16
LIBRARIES_AND_DEMO = 17
LIFESTYLE = 18
MAPS_AND_NAVIGATION = 19
MEDICAL = 20
NEWS_AND_MAGAZINES = 21
PARENTING = 22
PERSONALIZATION = 23
PHOTOGRAPHY = 24
PRODUCTIVITY = 25
SHOPPING = 26
SOCIAL = 27
SPORTS = 28
TOOLS = 29
TRAVEL_AND_LOCAL = 30
VIDEO_PLAYERS = 31
WEATHER = 32
'''
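As a side note, the hand-written dictionaries above can also be recovered programmatically: `LabelEncoder` stores its labels in `classes_`, in encoded (alphabetical) order. A small sketch with made-up Content Rating values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Content Rating values
s = pd.Series(['Everyone', 'Teen', 'Everyone', 'Mature 17+'])

le = LabelEncoder()
codes = le.fit_transform(s)

# classes_ holds the labels in encoded order, so the mapping
# dictionary can be built directly instead of by hand
mapping = {str(label): i for i, label in enumerate(le.classes_)}
print(mapping)  # {'Everyone': 0, 'Mature 17+': 1, 'Teen': 2}
```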
Next, we handle the missing ratings:
# Replace "NaN" with mean
imputer = SimpleImputer()
df['Rating'] = imputer.fit_transform(df[['Rating']])

# Rounding to 1 decimal place
df['Rating'] = df['Rating'].round(1)

# Drop the remaining rows with null values
df.dropna(axis=0, inplace=True)
Take a look at the last line of code (LoC) in the snippet above. Instead of immediately dropping the rows that contain null values, we made use of them first; only after every feasible transformation of the dataset are the remaining null rows finally dropped.
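To see the imputation step in isolation, here is a minimal sketch on a hypothetical four-value ratings column (`SimpleImputer` defaults to the column mean):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ratings with one missing value
ratings = pd.DataFrame({'Rating': [4.0, np.nan, 3.0, 5.0]})

imputer = SimpleImputer()  # default strategy is the mean
ratings['Rating'] = imputer.fit_transform(ratings[['Rating']])

# The NaN is replaced by the mean of 4.0, 3.0 and 5.0
print(ratings['Rating'].tolist())  # [4.0, 4.0, 3.0, 5.0]
```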
# Change datatype
df['Reviews'] = pd.to_numeric(df['Reviews'])
df['Installs'] = pd.to_numeric(df['Installs'])
df['Price'] = pd.to_numeric(df['Price'])
Though the dataset may seem to have the correct datatype for each column, we still need to verify it. Inconsistent datatypes will create issues when dealing with regression problems.
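A quick way to check is `df.dtypes`. A sketch with two made-up string columns, showing the effect of `pd.to_numeric`:

```python
import pandas as pd

# Two made-up columns stored as strings, as in the raw CSV
df = pd.DataFrame({'Reviews': ['100', '2500'], 'Price': ['0', '4.99']})
print(df.dtypes.tolist())  # both object (string) before conversion

df['Reviews'] = pd.to_numeric(df['Reviews'])
df['Price'] = pd.to_numeric(df['Price'])
print(df.dtypes.tolist())  # integer and float after conversion
```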
Data visualization can be used to get a glimpse of the distribution of the app market. This can help businesses in several ways. Apps could be targeted to a particular market. A business could analyze its approach to entering a market with more/moderate/fewer competitors. If the app holds a feature that may change the future usage of users, a data-driven business venture could launch the app in the market of more competitors to get a better hold of the market relying on that key feature and making further development.
Another strategy could be to build something different from the normal apps and their usage as the data shows to bring in something new to the market.
Visualization can further be used to get finer details of the split in categories. For example, if the category is “Gaming”, it consists of “Arcade”, “Board”, “Racing”, etc. This could be used to get into a more specific domain in “Gaming”. Such insights can enable consultants to get a clearer view for framing a business model while launching a new app.
The “Rating” of an app can be compared against its predicted rating to check whether the app is performing better or worse than other apps on the Play Store.
“Having your own league is great but when it comes to business, you should look at some statistics.”
The null values in the dataset, especially in the “Rating” column, could be replaced by the mean, the median, or another statistic. I used the mean. The replacement value can be influenced by outliers, but there are no outliers left in this column: the one outlier that did exist was removed before the null values were replaced with the mean.
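The sensitivity of the mean to outliers is easy to demonstrate with hypothetical values:

```python
import pandas as pd

# A hypothetical ratings column containing one outlier
s = pd.Series([4.0, 4.2, 4.1, 3.9, 0.1])

print(round(s.mean(), 2))  # 3.26 -- the outlier drags the mean down
print(s.median())          # 4.0  -- the median is robust to it
```

With the outlier removed first, as done above, the mean becomes a safe choice.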
The pictorial representations described below can be generated with the code that follows them.
The above figure consists of two pie charts clubbed into one.
The outer chart shows the category-wise distribution of apps, and the inner chart shows the percentage of free/paid apps for each category.
The above figure consists of a pie chart of the category “GAME” representing different domains.
Similarly, the below figure shows a pie chart of the category “FAMILY”.
NOTE: The chart looks clumsy. A better view can be obtained by enlarging the plot and moving around it. Ways of making improvements in this are discussed in the later part of this report.
The data comes in mixed formats that should be converted into a consistent one before being used to build a machine learning model.
The above charts can be plotted using the following code:
# Pie chart for Category
value_category = np.zeros(33)
labels_category = df['Category'].unique()

for i in range(len(df['Category'])):
    try:
        value = df['Category'][i]
        num = dict_category[value]
        value_category[num] = value_category[num] + 1
    except:
        pass

# Free and paid counts for each category
# 1st value = Free
# 2nd value = Paid
### Alternate values
free_paid_list = []
for j in labels_category:
    free_count = 0
    paid_count = 0
    for i in range(len(df['Type'])):
        try:
            if df['Category'][i] == j:
                if df['Type'][i] == "Free":
                    free_count += 1
                if df['Type'][i] == "Paid":
                    paid_count += 1
        except:
            pass
    free_paid_list.append(free_count)
    free_paid_list.append(paid_count)

colors_free_paid = []
free_color = "#00ff00"  # GREEN color
paid_color = "#0000ff"  # BLUE color
for i in range(int(len(free_paid_list)/2)):
    colors_free_paid.append(free_color)
    colors_free_paid.append(paid_color)

plt.axis("equal")
plt.pie(value_category, labels=labels_category, radius=1.5, autopct='%0.2f%%', rotatelabels=True, pctdistance=1.1, labeldistance=1.2)
plt.pie(free_paid_list, colors=colors_free_paid, radius=1.25, autopct='%0.2f%%', pctdistance=1.0)
centre_circle = plt.Circle((0,0), 1.0, color='black', fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.tight_layout()
plt.show()

# Get possible values for GAME and FAMILY in Genres
list_games_genres = []
list_family_genres = []
for i in range(len(df['Category'])):
    try:
        if df['Category'][i] == 'GAME':
            value = df['Genres'][i]
            if value not in list_games_genres:
                list_games_genres.append(value)
        if df['Category'][i] == 'FAMILY':
            value = df['Genres'][i]
            if value not in list_family_genres:
                list_family_genres.append(value)
    except:
        pass

value_games = np.zeros(len(list_games_genres))
labels_games = sorted(list_games_genres)
value_family = np.zeros(len(list_family_genres))
labels_family = sorted(list_family_genres)

# Dictionary for games
dict_games = {}
for i in range(len(labels_games)):
    dict_games[labels_games[i]] = i

# Dictionary for family
dict_family = {}
for i in range(len(labels_family)):
    dict_family[labels_family[i]] = i

# Pie chart for GAME in Genres
for i in range(len(df['Genres'])):
    try:
        if df['Genres'][i] in labels_games:
            value = df['Genres'][i]
            num = dict_games[value]
            value_games[num] = value_games[num] + 1
    except:
        pass

plt.axis("equal")
plt.pie(value_games, labels=labels_games, radius=1.5, autopct='%0.2f%%', rotatelabels=True, pctdistance=1.1, labeldistance=1.2)
plt.show()

# Pie chart for FAMILY in Genres
for i in range(len(df['Category'])):
    try:
        if df['Genres'][i] in labels_family:
            value = df['Genres'][i]
            num = dict_family[value]
            value_family[num] = value_family[num] + 1
    except:
        pass

plt.axis("equal")
plt.pie(value_family, labels=labels_family, radius=1.5, autopct='%0.2f%%', rotatelabels=True, pctdistance=1.1, labeldistance=1.2)
plt.show()
NOTE: The above code may look messy, but it is really just a combination of loops, functions, and parameters. So take a deep breath and go through it once more; I am sure you will understand the snippet.
Converting our data into appropriate forms
Size: The size of the app is in string format and needs to be converted to a numeric value. If the size is “10M”, the ‘M’ is removed to get the numeric value 10. If the size is “512k”, which gives the app size in kilobytes, the trailing ‘k’ is removed and the size is converted to its megabyte equivalent.
Installs: The install count is in string format, containing numeric values with commas. The commas should be removed, along with the ‘+’ sign at the end of each string.
Category and Content Rating: The Category and Content Rating consists of categorical values that should be converted to numeric values if we need to perform regression. So, these were converted to numeric values.
Price: The price is in “string” format. We should remove the dollar sign from the string to convert it into numeric form.
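The conversions described above can be summarized as small helper functions. This is a sketch equivalent to the lambda-based cleaning used earlier (the function names are illustrative, not from the original code):

```python
import numpy as np

def clean_size(x):
    """'10M' -> '10'; '512k' -> '0.5' (megabytes); 'Varies...' -> NaN."""
    if x.startswith('Varies'):
        return np.nan
    if x.endswith('k'):
        return str(round(float(x.rstrip('k')) / 1024, 1))
    return x.rstrip('M')

def clean_installs(x):
    """'1,000+' -> '1000'."""
    return x.rstrip('+').replace(',', '')

def clean_price(x):
    """'$4.99' -> '4.99'."""
    return x.lstrip('$').rstrip()

print(clean_size('512k'), clean_installs('1,000+'), clean_price('$4.99'))
# 0.5 1000 4.99
```

Wrapping the steps this way makes them easy to reuse via `df['Size'].map(clean_size)` and so on.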
On analyzing the data, the RATING of an app emerges as the most important parameter for depicting how well the app performs compared to other apps in the market. It also hints at how well the company implements the feedback given by users. After all, users are the key to modern software businesses.
RATINGS depend on various factors. The correlation between these will be discussed in the next part of this report.
Problem Statement:
To predict the ratings of the App (before/after launching it on Play Store).
This is clearly a regression problem.
The factors that require attention in solving this problem are,
1) Category
2) Reviews
3) Size
4) Installs
5) Price
6) Content Rating
Here, Category and Content Rating are categorical values, so instead of these we use “Category NUM” and “Content Rating NUM”, which contain their numerical mappings.
By taking the values of these columns into account, we get a prediction for the “Rating” of an app. Comparing the predicted value against the original value gives an overview of whether the app is performing better or worse than expected.
# Feature selection
features = ['Category NUM', 'Reviews', 'Size', 'Installs', 'Price', 'Content Rating NUM']
X = df[features]

# Label selection
y = df.Rating

# For testing purposes
#train_X, test_X, train_y, test_y = train_test_split(X, y)
If we want to predict how well an app may perform before launching it on the Play Store, we can pass in plausible numbers as parameters and compare scenarios. For example, if we get the same rating for the same number of installs but far fewer reviews, we know we should do something to collect more user feedback, as it is necessary for improving the app.
Machine Learning Model:
The model used to train the dataset is the “Random Forest Regressor”.
The dataset was split into training and testing data, and accuracy was measured using the “mean absolute error” function.
The model exposes parameters that can be tuned; this is known as hyperparameter tuning, and it helps the model predict better results.
Different parameter settings were tried to get the least error.
Another approach used to get a better result is to train the model several times and take the median of the predictions. Though it takes a bit more time, it yields a more stable result.
# Loop is used to get a more generalized result
total_sum = []
for i in range(10):
    # Hyperparameter tuning for better prediction
    forest_model = RandomForestRegressor(n_estimators=100, max_features=3, min_samples_leaf=10)
    forest_model.fit(X, y)
    # For testing purposes
    #forest_model.fit(train_X, train_y)

    # Pass values to get a prediction for ratings:
    # 1st value = Category NUM
    # 2nd value = Reviews
    # 3rd value = Size
    # 4th value = Installs
    # 5th value = Price
    # 6th value = Content Rating NUM
    forest_pred = forest_model.predict([[4,100000,20,1000000,5,1]])
    total_sum.append(forest_pred)

    # For testing purposes
    #forest_pred = forest_model.predict(test_X)
    #print(mean_absolute_error(forest_pred, test_y))
NOTE: Every statement below the for loop is a part of it.
Finally, we will print the result:
print(round(np.median(total_sum), 2))
As mentioned earlier, after using mean_absolute_error, the error is about 0.3. For example, if the actual rating of an app is 4.0, then the predicted rating could fall in the range [3.7, 4.3].
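To make the interpretation concrete, here is a toy example of `mean_absolute_error` with made-up ratings (not values from the project):

```python
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted ratings
actual = [4.0, 3.5, 4.5]
predicted = [4.3, 3.3, 4.1]

# Absolute errors are 0.3, 0.2 and 0.4; their average is 0.3
print(round(mean_absolute_error(actual, predicted), 2))  # 0.3
```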
NOTE: To see the result, we need to pass in a few parameters. These parameters are mentioned in the code snippet above. This gives an overall idea beforehand about how the app may perform.
Conclusion and Future Work:
The dataset contains immense possibilities to improve business values and have a positive impact. It is not limited to the problem taken into consideration for this project. Many other interesting possibilities can be explored using this dataset.
Future work can include
- Optimization of the pie charts shown above (e.g., Fig 3). Some slices combine multiple domains; these could be separated and counted individually to get a more detailed version of the chart.
- Prediction of the number of reviews and installs by using the regression model.
- Identifying the categories and stats of the most installed apps.
- Exploring the effect of app size, Android version, etc., on the number of installs.
The ways in which questions can be asked vary, and so do the ways of tackling a problem. Only what has been minutely observed and tested will provide results worth trusting.
Editorial Note:
The event — Data December Festival 2018 — was organized by The Research Nest as a two-month online learning campaign focused on helping beginners learn Data Science. The main aim of this event was to get the participants engaged in a real-time project as they learn various concepts of data science and to complete it by creating a report documenting their insights.
An informative material guide was provided that included several resources to assist in self-learning data science within a month or two.
The guide can be downloaded here: http://bit.ly/self-learning-datascience