My Data Science class project “Google Play Store Analysis”.

Dec 9, 2021

“Anyone who stops learning is old, whether at twenty or eighty. Anyone who keeps learning stays young.”
― Henry Ford

Introduction

In 2020 I started a very exciting journey into the ‘Data Science World’ with a joint program in Software Development by Qwasar and Qwant. Starting with the ABCs of programming languages, we progressed to Machine Learning tools. During this time we have worked through a bunch of interesting projects, including an NBA game analysis and tasks where we had to find solutions with linear regression models. If you ask me: ‘Is it worth it in my 40s?’ the answer is: ‘Definitely! 100%’

In this article, I will show one of our Data Science projects, “My Mobapp Studio”.

The Project

In this project, we are fresh Data Scientists on a start-up team that is working on a new Android-based app, and we should analyze the Google Play Store data. The raw data should be obtained and cleaned, and we should find insights useful for the management and the developers' team.

The main points are to find answers to the following questions:

What is the size of the market, in number of downloads and in USD?
The same question per category, in percentage.
Depending on each category, what is the ratio of downloads per app?
Any additional information that is useful for making the right decision.

So, let’s dive in!

Exploring the Data

Load Data

We will use the “Google Play Store Apps” dataset, which is available on Kaggle. We will work in a Jupyter notebook with Pandas and a few other libraries.

First of all, we import the libraries we need:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px

and load the data with read_csv (the CSV file sits in the same location as the notebook):

apps = pd.read_csv('googleplaystore.csv')

Let’s see what we have just loaded:

apps.head(5).style.set_properties(**{"background-color": "#fffc99", "color": "black", "border-color": "black"})

OK, the data frame is loaded. The head() function shows its first rows: the column headings and one row of information per app in the Play Market.

We get 13 columns of raw data about the apps in the Play Market.

Clean Dataset

This part of the project literally reshapes the data frame into what we want it to be. First, we handled the NaN values with the help of the Pandas isna() function. Secondly, we removed duplicates with the drop_duplicates() function. To keep it short, I will skip these steps.
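Since these steps are skipped here, the following is only a minimal sketch of what they could look like. The toy data frame and the choice to simply drop rows with missing values are assumptions for illustration, not the exact steps from the project notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the Play Store data
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.1, 4.1, np.nan, 3.9],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Drop rows with missing values, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Whether to drop or fill missing values depends on the column; dropping is just the simplest option for a sketch.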

The next step was to correct the data types of the columns.

apps.info()

As you can see, all columns except Rating are of object type, which is not very useful for our later work. So, in the next step, we will clean them as well.

Clean Reviews, Installs, and Price:

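We will start by stripping special characters from the “Reviews”, “Installs”, and “Price” columns and converting them to numbers. Then we take a look at the paid apps, which we will use later on.

special_char = ['+', '$', ',']
col_to_clean = ['Reviews', 'Installs', 'Price']
for col in col_to_clean:
    for char in special_char:
        # regex=False treats '+' and '$' as literal characters
        apps[col] = apps[col].str.replace(char, '', regex=False)
    # Convert the cleaned strings to numbers so that they can be summed later
    apps[col] = pd.to_numeric(apps[col], errors='coerce')

apps[apps.Type == 'Paid']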
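To make the effect of this cleaning visible, here is a small sketch on a hypothetical toy frame. It removes the three characters in one pass with a regular expression and then converts the columns to numeric types, which the later sum() calls rely on. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values in the same raw format as the dataset
toy = pd.DataFrame({
    "Reviews": ["159", "967"],
    "Installs": ["10,000+", "500,000+"],
    "Price": ["0", "$4.99"],
})

for col in ["Reviews", "Installs", "Price"]:
    # Remove '+', '$' and ',' in one pass
    stripped = toy[col].str.replace(r"[+$,]", "", regex=True)
    # Values that still cannot be parsed become NaN instead of raising
    toy[col] = pd.to_numeric(stripped, errors="coerce")

print(toy.dtypes)
print(toy)
```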

Clean Size:

The size values are loaded as strings, which we cannot use in our calculations. So, with some additional code, we will convert them to float values in MB.

# Rows whose size is given in kilobytes, e.g. '201k'
is_kb = apps.Size.str.contains('k', regex=False, na=False)
# Convert those values from KB to MB and write them back into apps
kb_in_mb = apps.loc[is_kb, 'Size'].str.replace('k', '', regex=False).astype(float) / 1024
apps.loc[is_kb, 'Size'] = kb_in_mb.astype(str)

With the KB values converted to MB, we clean the rest of the column: we strip the ‘M’ suffix, replace ‘Varies with device’ with NaN, and cast the column to float.

apps.Size = apps.Size.str.replace('M', '', regex=False)
apps.Size = apps.Size.replace('Varies with device', np.nan)
apps.Size = apps.Size.astype(float)
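The same conversion can also be written as a single helper function. This is only an alternative sketch: the name size_to_mb and the sample values are hypothetical, and it assumes the raw sizes look like ‘19M’, ‘201k’, or ‘Varies with device’.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert a size string such as '19M' or '201k' to megabytes."""
    if not isinstance(size, str):
        return np.nan
    if size.endswith("M"):
        return float(size[:-1])
    if size.endswith("k"):
        return float(size[:-1]) / 1024
    # Anything else, e.g. 'Varies with device', is treated as missing
    return np.nan

sizes = pd.Series(["19M", "512k", "Varies with device"])
sizes_mb = sizes.apply(size_to_mb)
print(sizes_mb)
```

A helper like this keeps all the size rules in one place, at the cost of being slower than the vectorized string methods on large data.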

Clean Content Rating:

Lastly, we looked at the “Content Rating” column, which holds the ‘Everyone’, ‘Teen’, ‘Everyone 10+’, ‘Mature 17+’, ‘Adults only 18+’, and ‘Unrated’ categories:

apps['Content Rating'].unique()

and shortened the longer labels with the code below:

apps['Content Rating'] = apps['Content Rating'].replace(['Everyone 10+', 'Mature 17+', 'Adults only 18+'], ['10+', 'Mature', 'Adults'])

Here is the final version of the data frame, in which we will search for insights:

Data Analysis

Now we have a bright and shiny data frame, and the next part will be the most exciting part of the project. We will answer the questions asked in the project and dig deeper to find more insights.

What is the size of the market? The numbers of downloads and total value in USD?

a) Market Size:

total_downloads = apps.Installs.sum() / 1000000000
print("Market size is:", round(total_downloads, 2), "BN downloads.")

The market size is 146.63 BN downloads.

b) Market Capitalisation:

# Estimated income per app: number of installs times price
app_income = apps.Installs * apps.Price
total_value = app_income.sum()/100000000
print("Total Market Capitalisation is:", round(total_value, 2), "BN USD")

The total Market Capitalisation is 3.67 BN USD.

What is the market share per category, in percentage?

plt.figure(figsize=(15,10))
# Share of apps per category (normalize=True returns fractions instead of counts)
apps['Category'].value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%')
plt.figure(figsize=(15,10))
apps['Category'].value_counts().plot(kind='bar')
plt.xlabel('Category')
plt.ylabel('number of apps')
plt.grid()
plt.show()

The most popular categories among developers are Family, Game, and Tools. Almost 38% of all apps on the Play Market are in these three categories.
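The charts above count apps per category. If market share is measured by downloads instead, one possible sketch looks like this. The toy data is hypothetical; on the real data frame the same groupby would run on apps.

```python
import pandas as pd

# Hypothetical toy data in the shape of the cleaned data frame
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY"],
    "Installs": [500, 300, 100, 100],
})

# Total installs per category, then each category's share in percent
installs_per_category = toy.groupby("Category")["Installs"].sum()
share_pct = (installs_per_category / installs_per_category.sum() * 100).round(1)

print(share_pct.sort_values(ascending=False))
```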

Depending on each category, what is the ratio of downloads per app?

top_downloads = apps.groupby('Category').Installs.sum().sort_values(ascending=False)
px.bar(y=top_downloads.values, x=top_downloads.index)
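The bar chart above shows total downloads per category. To get the ratio of downloads per app, the totals can be divided by the number of apps in each category. A minimal sketch on hypothetical toy data (the column names total_installs, app_count, and installs_per_app are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data in the shape of the cleaned data frame
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "App": ["Game 1", "Game 2", "Tool 1"],
    "Installs": [500, 300, 100],
})

# Total installs and number of apps per category
per_category = toy.groupby("Category").agg(
    total_installs=("Installs", "sum"),
    app_count=("App", "count"),
)
# Average number of downloads per app in each category
per_category["installs_per_app"] = (
    per_category["total_installs"] / per_category["app_count"]
)

print(per_category.sort_values("installs_per_app", ascending=False))
```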

free_apps = apps[apps.Type == 'Free']
# .copy() so that adding a column does not touch a view of apps
paid_apps = apps[apps.Type == 'Paid'].copy()
paid_apps['total_value'] = paid_apps.Price * paid_apps.Installs
px.pie(paid_apps, values='total_value', names='Category')

Free and paid:

sns.countplot(data=apps, x='Type')

# Number of apps per type, with the type kept as a regular column for the labels
type_counts = apps.groupby('Type').App.count().reset_index()
px.pie(type_counts, values='App', names='Type')

As these diagrams show, the lion's share of the apps on the Android market is free. Among paid apps, more than half of the money users spent went to the Family category; Lifestyle holds the second-biggest share, with only 15.7%.

App with the largest number of installs:

res = apps.groupby('App')['Installs'].sum().reset_index()
final_result = res.sort_values(by='Installs', ascending=False).head(10)
plt.bar("App", "Installs", data=final_result, color="blue")
plt.xlabel("Apps")
plt.xticks(rotation=90)
plt.ylabel("Install Counts")
plt.title("Top 10 Apps having Highest Installs")
plt.show()

There is no doubt that the biggest numbers of installs belong to the pre-installed Google apps for Android. Despite this fact, ‘Subway Surfers’ was the top downloaded app. ‘Instagram’ and two other game apps round out the “Top 10 most downloaded apps”.

Any additional information you will find useful for us to make the right decision?

Most Expensive 10 Apps (USD):

most_expensive = apps.groupby('Category').Price.max().sort_values(ascending=True)
# The last 10 entries of the ascending sort are the 10 highest maximum prices
px.bar(x=most_expensive.values[-10:], y=most_expensive.index[-10:])

Top 10 Expensive Apps in Family Category:

paid_apps[paid_apps.Category == 'FAMILY'].sort_values(by=['Price'], ascending=False)[0:10][['App', 'Price', 'Installs']]

The most expensive app is literally named ‘most expensive app’, with a price of ~400 USD.

Average App Size:

px.box(apps, y='Size', labels={'Size': 'MB'})

Category vs Price:

plt.figure(figsize=[15, 10])
sns.set_context('paper')
sns.stripplot(y='Category', x='Price', data=paid_apps,
              jitter=True, size=6, palette='plasma', marker='D')
plt.axvline(paid_apps.Price.mean(), linewidth=1.5, linestyle='-')
plt.text(20, 2, '← Average Price', fontsize=15, color='red')
plt.show()

The average price for paid apps is around 10 USD.

Distribution of paid and unpaid applications rating:

avg_rating = apps.Rating.mean()
print("Average Rating: {}".format(avg_rating))
sns.displot(data=apps, x='Rating', kde=True, hue='Type', height=8, aspect=1.2)
plt.axvline(avg_rating, linestyle='--', color='red', linewidth=2.0)
plt.text(3.3, 800, 'Average Rating →', color='red', fontsize=14)
plt.title('Distribution of Rating with Type', fontsize=18)
plt.show()

The diagram above shows that the average user rating of an app falls between 4.0 and 4.5.

Correlation Matrix:

# numeric_only=True skips text columns such as App and Category
corr = apps.corr(numeric_only=True)
sns.heatmap(corr, cmap="YlGnBu", vmin=-.5, vmax=0.6, annot=True)

As you can see, there is a positive correlation between ‘Reviews’ and ‘Installs’.
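To check a single pair of columns without reading it off the heatmap, the Pearson coefficient can be computed directly. The numbers below are hypothetical and only illustrate the call; on the real data the two columns would come from apps.

```python
import pandas as pd

# Hypothetical toy values: apps with more reviews also have more installs
toy = pd.DataFrame({
    "Reviews": [10, 100, 1000, 10000],
    "Installs": [1000, 12000, 90000, 1100000],
})

# Pearson correlation between the two numeric columns
reviews_installs_corr = toy["Reviews"].corr(toy["Installs"])
print(round(reviews_installs_corr, 3))
```

Keep in mind that a correlation only says the two columns move together; it does not tell us which one drives the other.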

Apps per Content Rating:

fig, ax = plt.subplots(1, 1, figsize=(25,10))
sns.countplot(x='Content Rating', data=apps, ax=ax)
plt.show()

It is clear that apps with the “Everyone” rating are the most common.
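The count plot shows how many apps carry each content rating. To compare downloads instead, the installs can be summed per rating. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data in the shape of the cleaned data frame
toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen"],
    "Installs": [1000, 500, 2000],
})

# Number of apps and total installs for each content rating
apps_per_rating = toy["Content Rating"].value_counts()
installs_per_rating = toy.groupby("Content Rating")["Installs"].sum()

print(apps_per_rating)
print(installs_per_rating)
```

In this toy example the rating with the most apps is not the one with the most downloads, which is why it can be worth looking at both numbers.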

Summary

The Android market is valued at only 3.67 BN USD, which is only 7.4% of the market. This shows that the market has the potential to grow, with the right price range being up to 10 USD. Family, Game, and Tools are more competitive than the other categories. But the right app in these categories, with the “Everyone” content rating, can reach the Break-Even Point faster and shorten the Payback Period. The developers should keep in mind that the preferable size of the app is between 10 and 20 MB. The marketing team should work on increasing the number and rating of user reviews, which, given the correlation between ‘Reviews’ and ‘Installs’, should help to increase the number of downloads.

Conclusion

In this article, we explored how the Data Science tools in Python can help with making the right decisions. The examples and insights shown here are only a small part of what can be found in this data. The right questions will lead you to success. Keep working and you will find your gem.