Running a Freelance Data Science Project. PART-1

Atul Mishra
Published in Analytics Vidhya · 9 min read · Oct 20, 2020

There are a lot of things I have learned in my life, and I look forward to learning more. One of them is being a team player. Nah! This post is not about how to be a team player, but yes, teamwork is essential. Recently I got my hands on Google Play Store data, and there were two main questions on my mind:

  1. The movie The Social Dilemma.
  2. Can I help bring some hard, real facts too?

So, I took a dip and brought some stars into it by roping in two of my buddies! We treated it as a proper project and named it: Google Play Store. We used GitHub as our source code controller, and it really changed the way the team could work. We all had our own Git branches, and we created board columns for To-Do, In-Progress, Review, Sprint 1, Sprint 2, and Done. The dashboard looked like this:

Project Dashboard

We had everything sorted by laying out all the tasks as cards and distributing the work easily. One of my colleagues on this project, Chandan Singh, put it well: “That’s the way any development project runs, and now we have a data science project of our own. Cheers to this team!”

The main objectives we were looking for were:

  1. Solve the tasks associated with the dataset. The tasks involved questions like: What is the most popular category with the largest number of installs? Which application has the largest size? Which app has not been updated for a long time (found by comparing the current date with the date the app was last updated)? Which app has the largest number of installs? For the years 2016, 2017, and 2018, which categories of apps got the most downloads? Which application has the largest number of reviews?
  2. Finalise 10 apps for our NLP activities.
  3. Perform NLP-focused EDA on the reviews for those apps.
  4. Bring the findings out to the front.

I’ll divide this whole project into two parts so that things aren’t stretched too much. Initially, we will focus on answering the questions associated with the tasks, and then on applying ML and NLP techniques to the decisions taken.

So, let’s have a look at the dataset which we have in hand:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings('ignore')

summary_data = pd.read_csv('googleplaystore.csv')
reviews_data = pd.read_csv('googleplaystore_user_reviews.csv')

# Let's do some DATA formalities
summary_data.head()

Let’s run through a data checklist:

print('Shape of summary data:', summary_data.shape)
print('--------------------------------------------')
print('Columns in our summary of data: ',summary_data.columns)
print('--------------------------------------------')
print(summary_data.info())
print('--------------------------------------------')
print('Total Unique APPS in the data: ',summary_data.App.nunique())
print('--------------------------------------------')
print('Total Unique Categories: ',summary_data.Category.nunique())
print('--------------------------------------------')
print('Total Unique Genres: ',summary_data.Genres.nunique())
print('--------------------------------------------')
print('Total Unique Content Ratings: ',summary_data['Content Rating'].nunique())
print('--------------------------------------------')
print("Let's see the distribution of Content Ratings: ")
print(summary_data['Content Rating'].value_counts())
Data Checklist

It’s not always necessary to draw plots to prove your EDA skills; knowing how to drill down and answer a question is more important.

Looking at the checklist, we find that columns such as Rating, Type, Content Rating, Current Ver, and Android Ver have missing values that need to be treated. So we will drop any row with even one missing value in any column.

summary_data = summary_data.dropna(axis='rows')             # drop rows with any missing value
summary_data = summary_data.drop_duplicates(subset=['App']) # keep one row per app
summary_data = summary_data[summary_data.Size != 'Varies with device']  # keep only parseable sizes
summary_data.info()

There are multiple things to process in our data. If you’ve followed along, you might have noticed the + in the Installs column and the M in the Size column. Also, the Content Rating column has values such as Adults Only 18+ and Everyone 10+, which are awkward to work with; we can strip the + and replace the spaces with underscores to clean up our data.

# Installs values like 1,000+ or 50,000,000+ can't be compared logically as strings,
# so we strip the commas and the trailing +. regex=False keeps '+' literal,
# since it is a regex metacharacter.

def remove_plus_add_underscore(summary_data):
    summary_data['Installs'] = summary_data['Installs'].str.replace(',', '', regex=False)
    summary_data['Installs'] = summary_data['Installs'].str.replace('+', '', regex=False)
    summary_data['Content Rating'] = summary_data['Content Rating'].str.replace('+', '', regex=False)
    summary_data['Content Rating'] = summary_data['Content Rating'].str.replace(' ', '_', regex=False)
    summary_data['Installs'] = summary_data['Installs'].astype('int64')
    # Converting the SIZE of apps from KB to MB
    sizes_mb = []
    for size in summary_data.Size:
        if 'k' in size:
            sizes_mb.append(round(float(size.replace('k', '')) * 0.001, 2))
        else:
            sizes_mb.append(float(size.replace('M', '')))
    summary_data['Size'] = sizes_mb
    return summary_data

copy_summary_data = remove_plus_add_underscore(summary_data)
copy_summary_data.head()
Cleaned Up DATA

After some more preprocessing, such as subsetting the data into one dataframe per Content Rating value, we find the top apps with the highest ratings in each group.
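The subsetting code itself isn’t shown here, so below is a minimal sketch of what it might look like, assuming the cleaned Content Rating values produced above. The sort by Rating is an assumption, made so that the “top 3” prints in the next block are meaningful; the helper name subset_by_rating is hypothetical.

# Hypothetical reconstruction of the subsetting step: one dataframe per
# cleaned Content Rating value, sorted by Rating (assumed) so the first
# rows are the top-rated apps.
def subset_by_rating(df, rating_value):
    return (df[df['Content Rating'] == rating_value]
              .sort_values(by='Rating', ascending=False)
              .reset_index(drop=True))

content_rating_Adults_only_18 = subset_by_rating(copy_summary_data, 'Adults_only_18')
content_rating_Everyone = subset_by_rating(copy_summary_data, 'Everyone')
content_rating_Everyone_10 = subset_by_rating(copy_summary_data, 'Everyone_10')
content_rating_Mature_17 = subset_by_rating(copy_summary_data, 'Mature_17')
content_rating_Teen = subset_by_rating(copy_summary_data, 'Teen')
content_rating_Unrated = subset_by_rating(copy_summary_data, 'Unrated')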

content_rating_namelist = ['content_rating_Adults_only_18','content_rating_Everyone','content_rating_Everyone_10',
                           'content_rating_Mature_17','content_rating_Teen','content_rating_Unrated']
content_rating_list = [content_rating_Adults_only_18,content_rating_Everyone,content_rating_Everyone_10,
                       content_rating_Mature_17,content_rating_Teen,content_rating_Unrated]
for i,j in zip(content_rating_namelist,content_rating_list):
    print('Top 3 Apps in :',i)
    print(j.App[:3].values,j.Rating[:3].values)

Question 1: What is the most popular category that has the largest number of installs?

  • > This is an easy question to answer, since there aren’t many rules attached. We just need to group by Category, sum up the number of installs, and sort on Installs. We can do this in one line of code:
category_with_largest_installs = copy_summary_data.groupby('Category').agg({'Installs':'sum'}).sort_values(by = 'Installs',ascending=False).reset_index()
category_with_largest_installs

# Let's try to plot the same
plt.figure(figsize=(12,8))
sns.barplot(y = 'Category', x ='Installs',data = category_with_largest_installs)
plt.title('Sorted order for the CATEGORIES for which the highest number of installs are recorded on Play Store');
plt.show();

And the PLOT looks like this:

Sorted order for the CATEGORIES for which the highest number of installs are recorded on Play Store

GAME is the category with the highest number of installs.

Question 2: Which application has the largest size?

  • > When I did a plain sort on the Size column, there was an ambiguity: several apps were tied at the same size of 100 MB. So, to find that one app, I added a tiebreaker, Rating. We pick the app with the highest rating after sorting on Size.
app_with_largest_size = copy_summary_data.sort_values(by = 'Size',ascending = False)
# We now know that the apps in our dataset max out at 100 MB in size
x = app_with_largest_size.groupby('Size').get_group(100).sort_values(by = 'Rating',ascending=False).reset_index(drop=True).head(1)
x

Navi RadiographyPro, a highly priced paid app, is the app with the largest size.

Question 3: Which app has not been updated for a long time? (We can find this by comparing the current date with the date the app was last updated.)

  • > Here, we only need a column such as Date Diff to hold today’s date minus the Last Updated date, and then we look for the max of Date Diff.
import datetime

# Convert Last Updated to datetime once; later questions reuse it via .dt
copy_summary_data['Last Updated'] = pd.to_datetime(copy_summary_data['Last Updated'], errors='coerce')
# Subtracting in this order gives positive "days since last update" directly,
# so no sign flip is needed
copy_summary_data['Date Diff'] = pd.Timestamp.now().normalize() - copy_summary_data['Last Updated']
copy_summary_data[copy_summary_data['Date Diff'] == max(copy_summary_data['Date Diff'])]

Booyah! The app seems a bit odd, doesn’t it? But it is an app which provides docks and dock clocks, along with maybe a few more widgets. The answer to “which app has not been updated for a long time” is F*ck my life + widget; it has not been updated in the past 10 years.

Question 4: Which app has the largest number of installs?

  • > This is pretty straightforward. We just sort on Installs and pick the first value.
copy_summary_data.sort_values(by = 'Installs',ascending = False).reset_index(drop=True).head(1)

And here it is: SUBWAY SURFERS. Well, I played it a lot on my Samsung phone, and my dad is addicted to it. 😅 Moving on…

Question 5: For the years 2016, 2017, and 2018, which categories of apps got the most downloads?

  • > Since we already converted the Last Updated column to the datetime dtype, we just need to filter on each year (2016, 2017, 2018) and sort on Installs. This gives us the desired result.
year_2016 = copy_summary_data[copy_summary_data['Last Updated'].dt.year == 2016].sort_values(by = 'Installs',ascending=False).reset_index(drop=True)
year_2017 = copy_summary_data[copy_summary_data['Last Updated'].dt.year == 2017].sort_values(by = 'Installs',ascending=False).reset_index(drop=True)
year_2018 = copy_summary_data[copy_summary_data['Last Updated'].dt.year == 2018].sort_values(by = 'Installs',ascending=False).reset_index(drop=True)
# Finding the top row of each year's sorted apps
print('Category in 2016 which got highest Installs is:"', year_2016['Category'][0],'",','has installs count around:"',year_2016['Installs'][0],'".')
print('Category in 2017 which got highest Installs is:"', year_2017['Category'][0],'",','has installs count around:"',year_2017['Installs'][0],'".')
print('Category in 2018 which got highest Installs is:"', year_2018['Category'][0],'",','has installs count around:"',year_2018['Installs'][0],'".')

Finally, the last question.

Question 6: Which application has the largest number of reviews?

# Reviews is read from the CSV as a string column, so cast it to int
# to make sure the sort is numeric rather than lexicographic
copy_summary_data['Reviews'] = copy_summary_data['Reviews'].astype(int)
largest_reviews = copy_summary_data.sort_values(by = 'Reviews',ascending = False).reset_index(drop=True)
print('App with the Maximum Reviews:"',largest_reviews['App'][0],largest_reviews['Reviews'][0],'"')
print('App with the Second Maximum Reviews:"',largest_reviews['App'][1],largest_reviews['Reviews'][1],'"')

It was fun answering these questions and working our way around them. I tried to be precise so that the sorting, dtype conversion, and even the group-bys could each be performed in a single chained expression. This saved a lot of execution time.
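For illustration, here is what that single-expression style looks like, reusing the Question 5 filter. The variable name top_2018 is just for this example.

# One chained expression: filter to 2018, sort by Installs, take the top row
top_2018 = (copy_summary_data[copy_summary_data['Last Updated'].dt.year == 2018]
                .sort_values(by='Installs', ascending=False)
                .head(1))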

One interesting thing: I have the habit of sharing inferences with my peers, friends, and colleagues, and that’s where something positive came up. A senior from my school runs a startup in the digital marketing domain; his team focuses on customer acquisition, and one of their modes of generating revenue is running advertising on multiple platforms. He asked me: can you help identify apps, genres, categories, and so on, to target in order to widen the reach of his market?

This really boosted me, as something I was doing turned out to be helpful. From there we went on to onboard this project, preparing dashboards, drafts, and reports as part of his expansion plan.

Some key inferences we shared were:

  • All products related to healthcare or skin care can be advertised on apps such as Navi RadiographyPro, to name one example.
  • Since apps like Subway Surfers and Vector are highly liked games with high ratings, products like footwear and cool-dude clothing can be advertised there, as the same vibe is reflected in the apps.
  • Since gaming is one of the highest-installed categories, they should focus on an AR/VR gaming concept and, over time, include paid promotions; this way the European market can be targeted.
  • For the Indian market, apps like UC Browser, Turbo VPN, and Picsart can be targeted to reach a greater audience.

The repo for some of this work is public, as I wanted my peers to contribute too; do have a look. The notebook is also available for reference.

Next, we are working towards a sentiment classification model to help them with any R&D they want to do on a specific app released in a beta phase.
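As a teaser, a minimal baseline for that model might look like the sketch below. It assumes the user-reviews CSV loaded earlier has Translated_Review and Sentiment columns (as in the public Kaggle version of this dataset); the actual model in Part 2 may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Drop reviews with no text or no label before training
labelled = reviews_data.dropna(subset=['Translated_Review', 'Sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    labelled['Translated_Review'], labelled['Sentiment'],
    test_size=0.2, random_state=42)

# TF-IDF features + logistic regression: a simple, strong text baseline
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print('Held-out accuracy:', model.score(X_test, y_test))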

I have seen many people posting on LinkedIn that data science is not just about coding or domain knowledge but about explaining the work done with these techniques, yet they never come forward and show how it should be done. I believe storytelling about the work we have done is the key to keeping the client in the loop, and that’s where we hit the mark.

You reached here! Kudos. You learned how to frame your projects and tell their story to others.

Stay tuned for the next part. Happy Learning!🎇
