Art of becoming The Sherlock Holmes: A comprehensive guide to Pandas

Picture Credit: The Indian Express

A patient needs a doctor, a hungry needs food, a victim needs a Sherlock Holmes and so does an organisation needs YOU (a data analyst), period.

The era of mining fossil fuels has been long taken over by the practices of data mining. In just one internet minute, loads of worthy data is generated which if dug carefully with a business objective can bring high profit to an organisation.

Picture Credit: Visual Capitalist

The picture on the left clearly illustrates the amount of huge data generated per minute in 2018. Technology giants (in picture) maintain a record of each and every activity of yours (on your consent), store them as their raw data, convert that data into valuable information and keep earning profit. But, who derives such valuable insights from raw data? Do these firms got chance to hire some kind of Sherlock Holmes!!!

Well, may be YES!

So, let’s learn how to make yourself The Sherlock (at least a confident beginner) and detect hidden answers from raw data.


The In and Out

This tutorial teaches you the necessary skills to deliver valuable insights from data using Python’s data analysis library, Pandas. The data sets used in this tutorial is available and taken from Kaggle.

So what’re the prerequisites?
What will you gain?

Get a complete hands-on guide on Pandas methods and attributes listed below and learn not just syntax, rather how to use them in a scenario to perform extensive exploratory data analysis.

Pandas: Covered methods and attributes

Note In between the tutorial, you can expect hands-on questions and tryout on Pandas functions. Plus use of methods borrowed from above mentioned optional libraries.

Enough comments are provided along the way to understand the meaning of code. But you can always refer documentation for elaborate description.

All codes used in this tutorial are available as a single Jupyter notebook; link is provided under reference section.


Go Panda!

Life of a Panda — Sleep Analyse Repeat. Pic Credit — QZ

Case objective

To deliver what does it takes for an app to get a 5 star rating on Google Play Store.

The tutorial has been partitioned into two sections —

I. Getting the case ready — here you’ll tidy the data by using basic functionalities of Pandas like data cleaning, handling missing data, duplicate data and outliers.

II. Answering the mysteries — here you’ll fetch information and insights using Pandas visualisation (plus visualisation from other lib) to uncover rumours and arrive at a decision.

I. Getting the case ready

Consider a data set named Google Play Store Apps available on Kaggle. Download it and extract the data which will give you two CSV files.

Next, we need to load the file (googleplaystore.csv) into python environment using Pandas. Think of Pandas as an alternative to MS Excel available within Python but with more advanced features and of course with a great community support, not to mention the Stack Overflow community support ;). Pandas consists of two objects namely Series and DataFrame. A DataFrame holds a 2D matrix while a Series holds a 1D matrix. Since we have a CSV file to load, hence we’ll rely on a DataFrame.

1. Loading the data set into a Pandas DataFrame

Open a new Jupyter notebook, import necessary libraries and load the googleplaystore.csv file using read_csv function into a DataFrame df as shown:

# Importing necessary libraries for tutorial
import numpy as np
import scipy as sp

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Loading the data into a dataframe df
df = pd.read_csv('googleplaystore.csv')

Now, Sherlock has a case to resolve. Let’s observe the raw data using df.info(memory_usage='deep') . We used memory_usage='deep' argument to retrieve the space consumed by object (alias to string) datatype columns.

df.info(memory_usage='deep')
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10841 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10841 non-null object
# Category 10841 non-null object
# Rating 9367 non-null float64
# Reviews 10841 non-null object
# Size 10841 non-null object
# Installs 10841 non-null object
# Type 10840 non-null object
# Price 10841 non-null object
# Content Rating 10840 non-null object
# Genres 10841 non-null object
# Last Updated 10841 non-null object
# Current Ver 10833 non-null object
# Android Ver 10838 non-null object
# dtypes: float64(1), object(12)
# memory usage: 8.3 MB

We can observe that there’re 10,841 rows and 13 columns; feature names and datatypes of which are listed above. float64 means 64-bit decimal values and pointing out again, object here means string datatype. Overall, the whole data set consumes 8.3 MB of disk space.

To access only column names, index values, main body of dataframe and datatypes, one can use following commands:

df.columns           #  <- retrieves all column names
df.index # <- retrieves all index values
df.values # <- retrieves main body of df
df.dtypes # <- retrieves datatype of each column

Question 1 — Rename the following columns:

FROM                                TO
Content Rating ContentRating
Last Updated LastUpdated
Current Ver CurrentVer
Android Ver AndroidVer

Hint: Use pandas.DataFrame.rename. You can use inplace argument which helps to perform changes to the dataframe without assigning the operation back to df. Hence, with inplace=True, no need of df=df.rename….


2. Data cleaning

As we have observed from info(), there is just 1 numeric column and rest all are strings. However, columns such as Reviews (i.e. number of reviews), Size, Installs and Priceare expected to be numeric rather than being object. So let us investigate the matter.

Let’s fetch the row which has string elements in all these columns -

# Fetching the entity which has string
df.Reviews[df.Reviews.str.contains('[^0-9]')]
10472 3.0M
Name: Reviews, dtype: object
df.Size[df.Size.str.contains('[^0-9]')].head()
0 19M
1 14M
2 8.7M
3 25M
4 2.8M
Name: Size, dtype: object
df.Installs[df.Installs.str.contains('[a-z]')]
10472 Free
Name: Installs, dtype: object
df.Price[df.Price.str.contains('[a-z]')]
10472 Everyone
Name: Price, dtype: object

Here comes regex as a saviour pointing us to the right direction. Observe that column Size has almost all the elements as strings, however other three columns have just one row with index 10472 which brings problem. So let us drop this row and convert the datatype of these columns back to numeric.

# Dropping row 10472
df.drop(10472, inplace=True)
# Changing the datatype of Reviews column back to numeric
df.Reviews = df.Reviews.astype(int)

Beware!!! columns Installs and Price have a special format which needs to be fixed before you can convert them to numeric. You can observe their current format using loc or iloc. loc is used if you have to access the elements through their label whereas iloc is used to access elements based on their index position.

df.loc[5:10, 'Installs']  
# 5 50,000+
# 6 50,000+
# 7 1,000,000+
# 8 1,000,000+
# 9 10,000+
# 10 1,000,000+
# Name: Installs, dtype: object
df.iloc[1832:1839, 7]  # <- Here, 7 is the index of Price column
# 1832 $1.99
# 1833 $4.99
# 1834 $4.99
# 1835 $4.99
# 1836 $5.99
# 1837 $6.99
# 1838 $9.99
# Name: Price, dtype: object

So, let us convert the format of Installs column from 12,345+ to 12345 and Price from $1.23 to 1.23.

# Install column
df.Installs = df.Installs.str.replace(',','') #Replacing ',' with ''
df.Installs.head()
# 0 10000+
# 1 500000+
# 2 5000000+
# 3 50000000+
# 4 100000+
# Name: Installs, dtype: object
df.Installs = df.Installs.str.replace('+','').astype(int) #Replacing '+' with '' and converting datatype back to numeric
df.Installs.head()
# 0 10000
# 1 500000
# 2 5000000
# 3 50000000
# 4 100000
# Name: Installs, dtype: int32 # <- observe the new datatype

# ---------------------------------------------------------------
# Price column
df.Price = df.Price.str.lstrip('$').astype(np.float32)  # Using left  strip to cut '$' and later converting the column to float 32-bit
df.iloc[1832:1839, 7]
# 1832 1.99
# 1833 4.99
# 1834 4.99
# 1835 4.99
# 1836 5.99
# 1837 6.99
# 1838 9.99
# Name: Price, dtype: float32 # <- observe the new datatype

Question 2 — Reformatting Size column

The Size column has the mix format of values. Convert the values from current format to new format as shown:

FROM                              TO
19M 19000000
23.5k 23500
Varies with device Varies with device (no change)

Note — Values suffixed with M or k can also be present in decimal. E.g. 1.9M. So handle them accordingly.


And now newbie Sherlock knows how to start with a dirty case. Here’s a glimpse of what you have achieved so far (look for the bold text) -

df.info()
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 10840 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10840 non-null object
# Category 10840 non-null object
# Rating 9366 non-null float64
# Reviews 10840 non-null int32
# Size 10840 non-null object
# Installs 10840 non-null int32
# Type 10839 non-null object
# Price 10840 non-null float32
# ContentRating 10840 non-null object
# Genres 10840 non-null object
# LastUpdated 10840 non-null object
# CurrentVer 10832 non-null object
# AndroidVer 10838 non-null object
# dtypes: float32(1), float64(1), int32(2), object(9)
# memory usage: 1.0+ MB

3. Dealing with missing values

Now, let us observe the amount of missing values in each column and how to deal with them. This information can be retrieved textually in percentage of missing value per column. However, if there are more number of columns then visualisation of the same is recommended. Both the ways are shown here.

actual_values = df.count()    # <- counts non-missing values per col
total_values = df.shape[0] # <- shape results into (n_row, n_col)
# Rounding to 2 decimal places using np.round(val_to_round, 2)
miss_perct = np.round((1 - (actual_values/total_values)) * 100, 2)
miss_perct
# App               0.00
# Category 0.00
# Rating 13.60
# Reviews 0.00
# Size 0.00
# Installs 0.00
# Type 0.01
# Price 0.00
# ContentRating 0.00
# Genres 0.00
# LastUpdated 0.00
# CurrentVer 0.07
# AndroidVer 0.02
# dtype: float64

The variable miss_perct holds the percentage of missing values in each column rounded to 2 decimal places. But wait! Do you observe something new? Well try type(miss_perct). Hasn’t our Sherlock self-discovered Pandas Series ;)


A tryout

  1. Converting above series values into a dataframe named df_miss and giving a column name percent.
# Method 1
df_miss = pd.DataFrame(miss_perct.values, index=miss_perct.index, columns=['percent'])
# Method 2
df_miss = pd.DataFrame(miss_perct, columns=['percent'])
df_miss
#                percent
# App 0.00
# Category 0.00
# Rating 13.60
# Reviews 0.00
# Size 0.00
# Installs 0.00
# Type 0.01
# Price 0.00
# ContentRating 0.00
# Genres 0.00
# LastUpdated 0.00
# CurrentVer 0.07
# AndroidVer 0.02

2. Adding another column to dataframe df_miss, named miss_count which represents number of missing values.

# Number of missing values per column
df_miss['miss_count'] = len(df) - df.count() # <- in spite of using df.shape[0], another way of finding total dataframe length is by using len(df)
df_miss.head()     # <- prints top 5 rows if no number if passed
#                percent     miss_count
# App 0.00 0
# Category 0.00 0
# Rating 13.60 1474
# Reviews 0.00 0
# Size 0.00 0
df_miss.tail(8)    # <- prints last 8 rows
#                percent     miss_count
# Installs 0.00 0
# Type 0.01 1
# Price 0.00 0
# ContentRating 0.00 0
# Genres 0.00 0
# LastUpdated 0.00 0
# CurrentVer 0.07 8
# AndroidVer 0.02 2

Now, that we have covered retrieving the percentage of missing values textually, let us gather the same information visually. We will use df_miss dataframe and plot percent column to visualize percentage of missing value per column.

# Adding index values as new column to make it as tidy-dataframe for input to seaborn
df_miss = df_miss.reset_index()  # <- If you add reset_index(drop=True) then index column is dropped
df_miss.head(3)
#     index     percent miss_count
# 0 App 0.0 0
# 1 Category 0.0 0
# 2 Rating 13.6 1474
plt.figure(figsize=(8,5))
sns.barplot(x="percent", y="index", data=df_miss, palette='icefire')
sns.despine()
plt.show()
Visual representation of missing values per column

Since we have seen both ways to fetch percentage of missing values, now let us handle these missing values. One way is to fill them with median if feature (column) is quantitative (numeric) or fill them with mode if feature is qualitative (categorical). Other possible way include dropping the complete feature if missing value percentage is too high, interpolating the values and much more.

Since the only feature Rating has 13.6% of missing value and the rest have only minute percentage, so let us go with the first way viz. filling missing values of a quantitative feature(s) with median values and a qualitative feature(s) with its mode values.

# Filling missing values in Quantitative feature by median
for col in df.select_dtypes('float64'):
df.loc[:, col].fillna(df.loc[:, col].median(), inplace=True)
# Filling missing values in Qualitative feature by mode
for col in df.select_dtypes('object'):
df.loc[:, col].fillna(sp.stats.mode(df.loc[:, col].astype(str))[0][0], inplace=True)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 10840 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10840 non-null object
# Category 10840 non-null object
# Rating 10840 non-null float64
# Reviews 10840 non-null int32
# Size 10840 non-null object
# Installs 10840 non-null int32
# Type 10840 non-null object
# Price 10840 non-null float32
# ContentRating 10840 non-null object
# Genres 10840 non-null object
# LastUpdated 10840 non-null object
# CurrentVer 10840 non-null object
# AndroidVer 10840 non-null object
# dtypes: float32(1), float64(1), int32(2), object(9)
# memory usage: 1.0+ MB

Now, we have a dataframe with no missing values as illustrated above.

4. Checking duplicate entries

Sometimes similar instances fill a data set which can be handled by removing them unless the sample size isn’t too small. So, let us count the number of duplicate instances in our data -

total_duplicates = len(df[df.duplicated()])
total_duplicates
# 483
print('The data has ' + str(np.round(total_duplicates/len(df) * 100, 2)) + '% of duplicate data')
# The data has 4.46% of duplicate data

Let’s proceed with the removal of these instances to arrive at only unique rows.

df.drop_duplicates(inplace=True)
print('New shape of data set: ' + str(df.shape))
# New shape of data set: (10357, 13)

Also, let us check if there are certain App records which are repeated.

app_duplicates = len(df[df.App.duplicated()])
app_duplicates

# 698
print('The data has ' + str(np.round(app_duplicates/len(df) * 100, 2)) + '% of duplicate app records')
# The data has 6.74% of duplicate app records

As can be seen there are several instances of apps which are repeated, so let’s drop the duplicates and keep only unique instances.

df = df[~df.App.duplicated()].reset_index(drop=True)
df.shape

# (9659, 13)

5. Checking for outliers

Outliers are the main concern while dealing with dirty data. They are easy to detect using box plot (boxenplot, for large data). Let us find the outliers in following features -

rating, reviews, installs, price and size (replacing ‘Varies with device’ with 0).

# Selecting required features and storing them into a new dataframe
df_temp = df.loc[:,['Rating', 'Reviews', 'Size', 'Installs', 'Price']]
# Replacing 'Varies with device' values under Size column with 0
df_temp.Size.replace('Varies with device', 0, inplace=True)
# Normalizing values to visualize each feature in one plot
from sklearn.preprocessing import MinMaxScaler
temp = pd.melt(pd.DataFrame(MinMaxScaler().fit_transform(df_temp)))
plt.figure(figsize=(10,6))
sns.boxenplot(x='variable', y='value', data=temp, palette='Blues')
sns.stripplot(x='variable', y='value', data=temp, size=0.5, jitter=True, color="gray")
plt.xticks(np.arange(df_temp.shape[1]), df_temp.columns)
plt.yticks([])
plt.xlabel('')
plt.ylabel('')
sns.despine(left=True,trim=True)
plt.show()
Boxenplot and stripplot

An alternative to boxenplot is boxplot as shown:

plt.figure(figsize=(10,6))
sns.boxplot(x='variable', y='value', data=temp, palette='Blues')
sns.stripplot(x='variable', y='value', data=temp, size=0.5, jitter=True, color="gray")
plt.xticks(np.arange(df_temp.shape[1]), df_temp.columns)
plt.yticks([])
plt.xlabel('')
plt.ylabel('')
sns.despine(left=True,trim=True)
plt.show()
Boxplot and stripplot

Each feature carries multitude of outliers, for this case, let us not break the peace of features by removing outliers.


II. Answering the mysteries

With clean and explored data , let us now bring insights out of a table.

1. Uncovering rumours

A new information always comes with rumours. Suppose your manager gives you an opportunity to transfer permanently to a new country with only 2 available options say, country A and B. Assuming you never visited any of these 2 countries before, your decision will be most probably based upon some people’s recommendations, common statistics and known rumours.

Same goes with a data set. Say we need to find the peak hours when maximum number of working people do chatting on WhatsApp between Monday to Friday. As per a common man understanding (referring to a rumour, hence differ person to person) a random guess would be in the late night hours. Its a random guess, but still for decision making you need a valid analysis to take further action.

Therefore, let us validate common understanding points of following data set:

Good relationship between Reviews and Installs

First, let’s figure out the relationship between available quantitative feature using pairplot or correlation heatmap.

You can observe from the heatmap that there exists a good amount of correlation between number of installs done for an app against number of reviews given to app. Also, if outliers can be removed then we might have good correlation in other features as can be observed in pairplot.

Take away — A diverging C.I. regression plot between Installs and Reviews

Here’s the code for following two figures —

# Heatmap of correlation plot
plt.figure(figsize=(7,5))
sns.heatmap(df_temp.corr(), cmap='Blues')
plt.show()

# Pairplot
plt.figure(figsize=(7, 5))
sns.pairplot(df_temp, kind='reg', diag_kind='kde', markers='+')
plt.show()

Now, let us list down some known information and validate it —

  1. Division of apps rating i.e. extracting number of apps per 0.5 step rating.
  2. Ratio of app type as per app rating i.e. in each 0.5 step rating how many number of free and paid apps are available.
  3. Ratio of rating per content type i.e. for each category what is the ratio of 0.5 step rating.
  4. Ratio of rating per top 6 genres i.e. selecting only top 6 genres based on the maximum count of apps and getting the ratio of 0.5 step rating.
  5. Number of apps per category.
  6. Number of apps updated yearly.
  7. Number of apps updated monthly.
  8. Number of apps updated in each year’s months.
Rumours verified and uncovered!!!

Take-away from above figure —

  1. Maximum number of apps are rated 4.5 on a scale of 5 having step size 0.5.
  2. Number of free apps are more than paid apps on each rating scale.
  3. Apps meant for Everyone take lead in maximum number of apps followed by Teen and Mature 17+. Rating per content type is illustrated above.
  4. Top Genres include Tools, Entertainment and Education, each having maximum number of rating as 4.5.
  5. Top three categories with maximum number of apps include Family, Game and Tools.
  6. Maximum number of apps get updated as years pass by. Hence, an upward pattern.
  7. Combining all years data, maximum number of apps are updated during summer i.e. July.
  8. Each year, high update frequency is observed after month of April.

Here follows above figure’s code.

# 1. Division of apps rating
rate = df.Rating.apply(lambda x: np.round(x * 2)/2) # Rounding to the nearest 0.5
# Retrieving the frequency of each unique rating
val, cnt = np.unique(rate, return_counts=True)
df_rate = pd.DataFrame([val, cnt]).T
df_rate.columns = ['val', 'cnt']
# 2. Ratio of app type as per app rating
df_rating_type = pd.DataFrame([rate, df.Type]).T
# 3. App rating per type of content
df_rating_content = pd.DataFrame([rate, df.ContentRating]).T
# 4. Ratio of rating per top 6 genres
df_rating_genres = pd.DataFrame([rate, df.Genres]).T
# Finding top 6 Genres with highest number of apps
val_g, cnt_g = np.unique(df_rating_genres.Genres, return_counts=True)
df_genres = pd.DataFrame([val_g, cnt_g]).T
df_genres.columns = ['Genres', 'Count']
df_genres.sort_values('Count', ascending=False, inplace=True)
df_genres = df_genres.iloc[:6,:].reset_index(drop=True)
# Subsetting dataframe for rows containing only these 6 generes
df_rating_genres = df_rating_genres[df_rating_genres.Genres.isin(df_genres.Genres)]
# 5. Number of apps per category
ratio_cat_app = df.groupby('Category')['App'].count().reset_index()
ratio_cat_app.sort_values('App', ascending=False, inplace=True)
# 6. Number of updates in each months, years and each year's months
# Converting the LastUpdated column to date column
df.LastUpdated = pd.to_datetime(df.LastUpdated, format='%B %d, %Y')
df.LastUpdated.head(2)
# 0   2018-01-07
# 1 2018-01-15
# Name: LastUpdated, dtype: datetime64[ns]
val_y, cnt_y = np.unique(df.LastUpdated.dt.year, return_counts=True)
val_m, cnt_m = np.unique(df.LastUpdated.dt.month, return_counts=True)
df_date = pd.DataFrame(df.LastUpdated.astype(str).str.split('-').tolist(), columns=list('YMD'))
df_date.head(3)
#      Y  M  D
# 0 2018 01 07
# 1 2018 01 15
# 2 2018 08 01
date_temp = df_date.groupby(['M','Y']).apply(lambda x: len(x)).unstack()                
date_temp.fillna(0, inplace=True)
# Generating sequence of Months
from datetime import datetime
months_seq = [datetime.strftime(datetime.strptime(str(_), "%m"), "%b") for _ in range(1,13)]
# ===============PANDAS PLOTTING===================
# 1 
plt.figure(figsize=(20, 20))
sns.set_style('dark')
row_mask = 6
col_mask = 2
ax1 = plt.subplot2grid((row_mask, col_mask), (0,0))
sns.barplot(x='val', y='cnt', data=df_rate, palette='Blues', ax=ax1)
sns.despine(ax=ax1)
ax1.set_ylabel('Count')
ax1.set_xlabel('Ratings')
ax1.set_title('Division of apps rating', weight='bold')
# 2
ax2 = plt.subplot2grid((row_mask, col_mask), (0,1))
df_rating_type.groupby(['Rating', 'Type']).\
apply(lambda x: len(x)).unstack().\
plot.bar(stacked=True,
rot=0, color= ['lightblue', 'darkblue'], ax=ax2)
sns.despine(ax=ax2)
ax2.legend(loc='upper left')
ax2.set_xlabel('Rating')
ax2.set_ylabel('Count')
ax2.set_title('Ratio of app type as per app rating', weight='bold')
# 3
ax3 = plt.subplot2grid((row_mask, col_mask), (1,0))
df_rating_content.groupby(['ContentRating', 'Rating']).\
apply(lambda x: len(x)).unstack().\
plot(kind='barh', stacked=True,
rot=0, colormap='Blues', ax=ax3)
sns.despine(ax=ax3)
ax3.legend(loc='upper right', ncol=3)
ax3.set_xlabel('Rating')
ax3.set_ylabel('Content Type')
ax3.set_title('Ratio of rating per content type', weight='bold')
# 4
ax4 = plt.subplot2grid((row_mask, col_mask), (1,1))
df_rating_genres.groupby(['Genres', 'Rating']).\
apply(lambda x: len(x)).unstack().\
plot(kind='barh', stacked=True,
rot=0, colormap='Blues', ax=ax4)
sns.despine(ax=ax4)
ax4.legend(loc='lower right', ncol=3)
ax4.set_xlabel('Rating')
ax4.set_title('Ratio of rating per top 6 genres', weight='bold')
# 5
ax5 = plt.subplot2grid((row_mask, col_mask), (2,0), rowspan=2, colspan=2)
sns.barplot(x='App', y='Category', data=ratio_cat_app, palette='Blues_r', ax=ax5)
sns.despine(ax=ax5)
ax5.set_title('Number of apps per category', weight='bold')
# 6a
ax6a = plt.subplot2grid((row_mask, col_mask), (4,0))
ax6a.plot(val_y, cnt_y, '--o', linewidth=2)
sns.despine(trim=True, ax=ax6a)
ax6a.set_title('An increasing trend in yearly app update', weight='bold')
# 6b
ax6b = plt.subplot2grid((row_mask, col_mask), (4,1))
ax6b.plot(months_seq, cnt_m, '--o', linewidth=2)
ax6b.set_title('An upward rising and falling trend in montly app update', weight='bold')
# 6c
ax6c = plt.subplot2grid((row_mask, col_mask), (5,0), colspan=2)
# Normalizing values so that all values can be visualized without any bias
pd.DataFrame(MinMaxScaler().fit_transform(date_temp), index=months_seq, columns=date_temp.columns).\
plot(kind='line', style='--o', colormap='Blues', xticks=np.arange(12), ax=ax6c)
ax6c.set_title('Number of updates (norm) in each year\'s months', weight='bold')
ax6c.legend(loc='upper left', ncol=5)
sns.despine(ax=ax6c)
plt.tight_layout()
plt.show()

Let us also bring a visualisation of top apps as per maximum rating, reviews, size and installs using wordcloud.

Here’s the code for wordcloud subplots —

from wordcloud import WordCloud
import random
# Function to create Gray wordcloud. 
# Credit: https://amueller.github.io/word_cloud/auto_examples/a_new_hope.html
def grey_color_func(word, font_size, position, orientation, random_state=None,
**kwargs):
return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
# Creating dictionary of features against app names
wc_rating = pd.DataFrame(df_temp.Rating).set_index(df.App).to_dict()['Rating']
wc_reviews = pd.DataFrame(df_temp.Reviews).set_index(df.App).to_dict()['Reviews']
wc_size = pd.DataFrame(df_temp.Size).set_index(df.App).to_dict()['Size']
wc_installs = pd.DataFrame(df_temp.Installs).set_index(df.App).to_dict()['Installs']
plt.figure(figsize=(14, 8))
r_c = 8
ax7 = plt.subplot2grid((r_c, r_c), (0,0), colspan=4, rowspan=4)
wc1 = WordCloud()
wc1.generate_from_frequencies(frequencies=wc_rating)
ax7.imshow(wc1.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax7.set_title('Rating', weight='bold', size=20)
ax7.axis("off")
ax8 = plt.subplot2grid((r_c, r_c), (0,4), colspan=4, rowspan=4)
wc2 = WordCloud()
wc2.generate_from_frequencies(frequencies=wc_reviews)
ax8.imshow(wc2.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax8.set_title('Reviews', weight='bold', size=20)
ax8.axis("off")
ax9 = plt.subplot2grid((r_c, r_c), (4,0), colspan=4, rowspan=4)
wc3 = WordCloud()
wc3.generate_from_frequencies(frequencies=wc_size)
ax9.imshow(wc3.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax9.set_title('Size', weight='bold', size=20)
ax9.axis("off")
ax10 = plt.subplot2grid((r_c, r_c), (4,4), colspan=4, rowspan=4)
wc4 = WordCloud()
wc4.generate_from_frequencies(frequencies=wc_installs)
ax10.imshow(wc4.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax10.set_title('Installs', weight='bold', size=20)
ax10.axis("off")
plt.show()

2. Mining insights

So far, we have learnt various attributes and methods available in Pandas and also verified various known information. Now let us fetch the insights i.e. unknown information which adds value in further decision making.

The goal of this case is to understand —

What does it takes for an app to be rated as 5 star on Google Play Store?

So let us try to answer it.

print('Number of unique apps with 5 star rating: ', len(df[df.Rating == 5.0]))
# Number of unique apps with 5 star rating:  271

Out of these 271 apps, let us fetch the top most values for major features to deliver a decision.

df_5star = df[df.Rating == 5.0].reset_index() # Saving index to access Size from df_temp
features = [
df_5star.Category,
df_5star.Reviews,
df_temp.iloc[df_5star.iloc[:, 0],:].Size,
df_5star.Installs,
df_5star.Type,
df_5star.ContentRating,
df_5star.Genres,
df_5star.LastUpdated,
df_5star.AndroidVer
]
df_expect = pd.DataFrame(index=['Category', 'Reviews', 'Size', 'Installs', 
'Type', 'Content Rating', 'Genres', 'Last Updated', 'Android Version'],
columns=['Values'])
cnt = 0
for fea in features:
val_t, cnt_t = np.unique(fea, return_counts=True)
df_expect.iloc[cnt, 0] = pd.DataFrame([val_t, cnt_t]).T.sort_values(1, ascending=False).iloc[0,0]
cnt += 1
df_expect
+-----------------+---------------------+
| | Values |
+-----------------+---------------------+
| Category | FAMILY |
| Reviews | 1 |
| Size | 11000000 |
| Installs | 100 |
| Type | Free |
| Content Rating | Everyone |
| Genres | Education |
| Last Updated | 2018-07-24 00:00:00 |
| Android Version | 4.1 and up |
+-----------------+---------------------+

Therefore, according to our analysis our newbie Sherlock can point out that for an app to have a 5 star rating above values are the keys.

In summary, to be a 5 star app, it should be free of cost, built for everyone and should emphasise on education and family. Also, it should maintain at least 1 review, 100 installs and can consume nearly 11 MB of disk space, hence need to be light weight. Finally, it should be frequently updated and kept for android version 4.1 and up.

Picture Credit: Oliver Emberton

This completes our guide to Pandas. Congratulate yourself for completing this sweet journey. You have learnt the use of major pandas methods and attributes.

For more learning refer Pandas original website, or join any interesting online course like Applied AI course, Coursera, Udemy, Udacity, etc.

References

GitHub — Jupyter notebook (holds complete code used in this tutorial)

Applied AI course, December 15th, 2018


Your comments are always welcomed for any valued suggestion and feedback. Further, you can always reach me out at kumar.vivek@acm.org for any discussion. Happy learning!