Data Science Tutorial: Analysis Of The Google Play Store Dataset

Winning submission: December Data Festival 2018

Abhimanyu Thakre
The Research Nest
12 min read · Mar 24, 2019


Photo by Paweł Czerwiński on Unsplash

The Internet is a true gold mine of data. E-commerce and review sites are brimming with untapped data that can be converted into meaningful insights for robust decision-making. Here, we explore applying data science and machine learning techniques to data retrieved from one such source on the internet: the Google Play Store.

Details of Dataset:

The Play Store apps data has enormous potential to drive app-making businesses to success. Developers can draw actionable insights from it to work on and capture the Android market. The dataset comes from Kaggle and consists of web-scraped data on roughly 10k Play Store apps, collected for analyzing the Android market. In total, it has 10,841 rows and 13 columns.

Just look at the beauty of data… It’s powerful

The columns of the dataset are as follows:

1) App (Name)

2) Category (App)

3) Rating (App)

4) Reviews (User)

5) Size (App)

6) Installs (App)

7) Type (Free/Paid)

8) Price (App)

9) Content Rating (Everyone/Teenager/Adult)

10) Genres (Detailed Category)

11) Last Updated (App)

12) Current Version (App)

13) Android Version (Support)

Exploratory Data Analysis:

At first glance, two questions stand out: how the performance of an app can be improved using the reviews it receives, and which patterns in the data can yield more business value.

Now, we will start implementing the procedures by importing libraries:

The “sklearn” library offers many algorithms for implementing machine learning techniques on different problems.

Data is the raw material from which patterns are drawn. It may be a single review or a whole batch of them; whatever comes in can be used to extract value. Data also arrives with unexpected values, which should be handled before they hurt the performance of the trained models that predict the outcome.

Here is the first step to clean the data that will make the results “more” accurate.
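
The original snippet for this step is not reproduced here. As a minimal sketch, assuming the data sits in a pandas DataFrame, the distinct values of every column can be listed so that odd entries stand out. The tiny `df` below is made-up sample data, not the Kaggle file; only the column names follow the list above.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (assumption for illustration).
df = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma", "Delta"],
    "Category": ["GAME", "FAMILY", "GAME", "1.9"],  # "1.9" is not a category
    "Type": ["Free", "Paid", "Free", "0"],          # "0" is neither Free nor Paid
})

# Collect the distinct values of every column.
unique_values = {col: sorted(df[col].dropna().unique()) for col in df.columns}

for col, values in unique_values.items():
    print(col, "->", values)
```

Scanning the printed lists makes entries such as a number in a text column easy to spot.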

By finding all the unique values in each column, the inappropriate values can be identified. Different methods can then be used to remove them or to change them so that predictions improve. As the saying goes:

“The more data we have, the more likely we are to drown in it.” — Nassim Taleb

We are interested not just in raw data but in data from which valuable insights can be drawn. Another quote makes the point:

“More data beats clever algorithms, but better data beats more data.” — Peter Norvig

Data Cleaning

With that being said, here are the various steps taken to clean the data:

The raw data may come in no particular order. To solve this, we will use:
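
The exact call used in the original is not shown here. One common option in pandas is `sort_values`; the sketch below assumes "Category" as the sort key and uses made-up rows.

```python
import pandas as pd

# Made-up rows in arbitrary order (assumption for illustration).
df = pd.DataFrame({
    "App": ["Gamma", "Alpha", "Beta"],
    "Category": ["GAME", "FAMILY", "ART_AND_DESIGN"],
})

# Sort by Category and rebuild the index so rows are numbered 0..n-1 again.
df_sorted = df.sort_values(by="Category").reset_index(drop=True)
print(df_sorted)
```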

Keep in mind that every piece of raw data may contribute to a more accurate result. The current dataset holds many values in string format. To solve a regression problem, we must convert these strings to a numerical format. To do so, we will proceed as follows:
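
A minimal sketch of one way to do such a conversion, assuming pandas: `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed into NaN instead of raising an error. The values below are made up for illustration.

```python
import pandas as pd

# Made-up "Reviews" values stored as strings, including one malformed entry.
reviews = pd.Series(["159", "967", "3.0M", None])

# Unparseable strings and missing entries both become NaN.
reviews_num = pd.to_numeric(reviews, errors="coerce")
print(reviews_num)
```

Coercing instead of failing keeps the bad rows visible as nulls, so they can be dealt with in a later step.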

Now we will use:
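
The original snippet is not available here, so the following is only a sketch under assumptions: a pandas DataFrame with made-up values in which some cells are missing. It counts the nulls per column and, in its last line, drops the incomplete rows.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values (assumption for illustration).
df = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.5],
    "Size": [19.0, 14.0, np.nan],
    "Reviews": [159, 967, 87510],
})

# How many nulls does each column hold?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Last step: keep only the rows without any null value.
clean = df.dropna().reset_index(drop=True)
```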

Take a look at the last line of code in the above snippet. Rather than dropping the rows that contain null values right away, we kept them through every transformation of the dataset and only dropped the rows that still had null values at the very end.

Though the dataset may seem to have the correct datatypes for each column, we need to check it. Inconsistent datatypes will create issues while dealing with regression problems.
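
As a small sketch of such a check, assuming pandas, the columns that are not yet numeric can be listed with `is_numeric_dtype`. The DataFrame is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame: Rating is already numeric, Installs is still text.
df = pd.DataFrame({
    "Rating": [4.1, 4.5],
    "Installs": ["10,000+", "500,000+"],
})

print(df.dtypes)

# Columns that still need converting before regression.
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
print(non_numeric)
```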

Data visualization can be used to get a glimpse of the distribution of the app market. This can help businesses in several ways. Apps can be targeted at a particular market. A business can analyze how to enter a market with many, moderate, or few competitors. If an app has a feature that may change how users behave, a data-driven venture could launch it in a crowded market, relying on that key feature to gain a foothold and to guide further development.

Another strategy could be to use the data to see what typical apps offer and how they are used, and then build something different that brings something new to the market.

Visualization can further be used to get finer details of the split in categories. For example, if the category is “Gaming”, it consists of “Arcade”, “Board”, “Racing”, etc. This could be used to get into a more specific domain in “Gaming”. Such insights can enable consultants to get a clearer view for framing a business model while launching a new app.

The “Ratings” of an app can be used to check whether its actual rating matches the predicted rating, which indicates whether the app is performing better or worse than other apps on the Play Store.

“Having your own league is great but when it comes to business, you should look at some statistics.”

The null values in the dataset, especially in the “Ratings” column, could be replaced by the mean, the median, or some other value. I used the mean. The mean can be distorted by outliers, so the one outlier in this column was removed before the null values were replaced; after that, no outliers remain in the column.
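
A minimal sketch of this idea, assuming pandas and made-up ratings. The outlier rule used here (anything outside the 1 to 5 rating scale) is an assumption; the text does not say how the outlier was identified.

```python
import numpy as np
import pandas as pd

# Made-up ratings: one missing value and one value outside the 1-5 scale.
ratings = pd.Series([4.0, 5.0, np.nan, 19.0, 3.0])

# Remove out-of-scale values first so they cannot distort the mean.
valid = ratings[ratings.isna() | ratings.between(1, 5)]

# Replace the remaining nulls with the mean of the valid ratings.
filled = valid.fillna(valid.mean())
print(filled.tolist())
```

Doing the steps in the opposite order would pull the mean, and therefore every filled value, toward the outlier.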

A pictorial representation can be produced with code; the resulting figures are described below.

Fig 1

The above figure combines two pie charts into one.

The outer chart shows the distribution of apps by Category, and the inner chart shows the percentage of free and paid apps within each Category.

Fig 2

The above figure consists of a pie chart of the category “GAME” representing different domains.

Similarly, the below figure shows a pie chart of the category “FAMILY”.

NOTE: The chart looks clumsy. A better view can be obtained by enlarging the plot and moving around it. Ways of making improvements in this are discussed in the later part of this report.

Fig 3

The columns come in different formats, which must be converted into a consistent numeric format before the data can be used to build a machine learning model.

The above charts can be plotted using the following code:
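
The original plotting code is not reproduced here. As a sketch under assumptions, a nested pie like Fig 1 can be drawn with matplotlib by placing two ring-shaped pies on the same axes. The category names and counts below are made up and are not taken from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt

# Made-up counts per category, split into Free/Paid (assumption for illustration).
counts = {
    "FAMILY": {"Free": 60, "Paid": 10},
    "GAME": {"Free": 35, "Paid": 5},
    "TOOLS": {"Free": 25, "Paid": 5},
}

outer_sizes = [sum(split.values()) for split in counts.values()]
inner_sizes = [n for split in counts.values() for n in split.values()]
inner_labels = [kind for split in counts.values() for kind in split]

fig, ax = plt.subplots(figsize=(6, 6))
ring = dict(width=0.3, edgecolor="white")

# Outer ring: share of apps per category.
outer_wedges = ax.pie(outer_sizes, labels=list(counts), radius=1.0, wedgeprops=ring)[0]

# Inner ring: Free/Paid split, listed in the same order as the outer ring.
inner_wedges = ax.pie(inner_sizes, labels=inner_labels, radius=0.7,
                      labeldistance=0.6, wedgeprops=ring)[0]

ax.set(aspect="equal", title="Apps per category (outer) and Free/Paid (inner)")
# Use fig.savefig("category_pie.png") to write the chart to a file.
```

Because each category's inner slices add up to its outer slice, the two rings line up without any extra angle calculations.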

NOTE: The above code may look messy, but it is really just a combination of functions and parameters. So take a deep breath and go through it once again. I am sure that you will understand the code snippet.

Converting our data into appropriate forms

Size: The size of the app is in “string” format, and we need to convert it into a numeric value. If the size is “10M”, the ‘M’ is removed to get the numeric value 10. If the size is “512k”, which gives the app size in kilobytes, the ‘k’ is removed and the value is converted to its equivalent in megabytes.
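
A minimal sketch of this conversion in plain Python. The helper name `size_to_mb` is hypothetical, and treating 1 MB as 1024 kB is an assumption (1000 would be an equally defensible choice). Strings that match neither pattern are returned as `None` so they can be handled as nulls later.

```python
def size_to_mb(size):
    """Convert strings like '10M' or '512k' to megabytes; None for anything else."""
    text = size.strip()
    try:
        if text.endswith("M"):
            return float(text[:-1])
        if text.endswith("k"):
            return float(text[:-1]) / 1024  # kilobytes -> megabytes (assumed 1024)
    except ValueError:
        pass  # a suffix was present but the rest was not a number
    return None


print(size_to_mb("10M"))
print(size_to_mb("512k"))
print(size_to_mb("Varies with device"))
```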

Installs: The value of installs is in “string” format. It contains numbers with commas, which should be removed. The ‘+’ sign at the end of each string should be removed as well.
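
In plain Python this could look as follows; the helper name `installs_to_int` is hypothetical.

```python
def installs_to_int(installs):
    """Turn a string like '10,000+' into the integer 10000."""
    return int(installs.replace(",", "").rstrip("+"))


print(installs_to_int("10,000+"))
print(installs_to_int("1,000,000+"))
```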

Category and Content Rating: Category and Content Rating hold categorical values, which must be converted to numeric values before we can perform regression. So, these were mapped to numeric codes.
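
One simple way to build such a mapping, sketched in plain Python with made-up labels, is to number the distinct labels in order of first appearance. The actual codes used in the original are not shown, so this numbering is an assumption.

```python
# Made-up category labels (assumption for illustration).
categories = ["GAME", "FAMILY", "GAME", "TOOLS"]

# dict.fromkeys keeps the first occurrence of each label, in order.
category_codes = {label: code for code, label in enumerate(dict.fromkeys(categories))}

# Replace every label with its integer code.
category_num = [category_codes[label] for label in categories]

print(category_codes)
print(category_num)
```

Note that integer codes suggest an ordering between categories that does not really exist. Tree-based models tend to cope with this; for linear models, one-hot encoding is a common alternative.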

Price: The price is in “string” format. We should remove the dollar sign from the string to convert it into numeric form.
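
As a sketch in plain Python, with the hypothetical helper name `price_to_float`:

```python
def price_to_float(price):
    """Turn a string like '$4.99' into 4.99; a price without '$' is parsed as-is."""
    return float(price.replace("$", ""))


print(price_to_float("$4.99"))
print(price_to_float("0"))
```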

On analyzing the data, the RATINGS of an app emerge as the most important parameter, since they indicate how well the app performs compared to other apps in the market. They also hint at how well the company implements the feedback given by users. After all, users are the key to modern software businesses.

RATINGS depend on various factors. The correlation between these will be discussed in the next part of this report.

Problem Statement:

To predict the ratings of the App (before/after launching it on Play Store).

This is clearly a regression problem.

I see where this is going.

The factors that require attention in solving this problem are:

1) Category

2) Reviews

3) Size

4) Installs

5) Price

6) Content Rating

Here, Category and Content Rating are categorical values. So instead of these, we will use “Category NUM” and “Content Rating NUM”, which contain a numerical mapping.

By taking the values of these columns into account, we get a prediction for the “Rating” of the app. Supplying an existing app’s current values and comparing the predicted rating with its actual rating gives an overview of whether the app is performing better or worse than expected.

If we want to predict how well the app may perform before launching it on the Play Store, we could plug in hypothetical values as parameters and then compare different scenarios. For instance, if we get the same rating for a given number of installs but with far fewer reviews, we learn that we should do more to collect feedback from users, as it is necessary to improve the app.

Machine Learning Model:

The model trained on the dataset is the “Random Forest Regressor”.

The dataset was split into training and testing data, and the prediction error was measured with the “mean absolute error” function.
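
A minimal sketch of these steps with scikit-learn. The data is synthetic (random numbers standing in for the six numeric features and the rating), and the split ratio and `n_estimators` value are assumptions, so the printed error says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 "apps", 6 numeric features.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = np.clip(3.0 + 1.5 * X[:, 0] + rng.normal(0, 0.2, 200), 1, 5)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Mean absolute error: average distance between predicted and actual rating.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: {mae:.2f}")
```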

The model offers parameters to tune. Adjusting them so that the model predicts better is known as hyperparameter tuning.

Different parameters were used to get the least error.
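
Which parameters and values were tried is not stated, so the following is only an illustrative sketch on synthetic data: it loops over a small, assumed grid of `n_estimators` and `max_depth` values and keeps the combination with the lowest validation error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 6 numeric features.
rng = np.random.default_rng(1)
X = rng.random((200, 6))
y = 3.0 + 1.5 * X[:, 0] + rng.normal(0, 0.2, 200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

errors = {}
for n_estimators in (10, 50):
    for max_depth in (2, 8):
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=1
        )
        model.fit(X_train, y_train)
        errors[(n_estimators, max_depth)] = mean_absolute_error(
            y_val, model.predict(X_val)
        )

# The combination with the smallest error wins.
best_params = min(errors, key=errors.get)
print(best_params, round(errors[best_params], 3))
```

For larger grids, scikit-learn's `GridSearchCV` does the same search with cross-validation built in.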

One other approach to getting a better result is to train the model a few times and then take the median of the results. Though it takes a bit more time, more reliable results can be obtained this way.
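
The text does not say which quantity the median is taken over. The sketch below assumes it is the error of each run: the model is trained five times on synthetic data with different seeds, and the median of the mean absolute errors is tracked inside the loop.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 6 numeric features.
rng = np.random.default_rng(2)
X = rng.random((200, 6))
y = 3.0 + 1.5 * X[:, 0] + rng.normal(0, 0.2, 200)

maes = []
for seed in range(5):
    # A different seed gives a different split and a different forest each run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=30, random_state=seed)
    model.fit(X_train, y_train)
    maes.append(mean_absolute_error(y_test, model.predict(X_test)))
    # Running median of the errors so far; final once the last run is done.
    median_mae = float(np.median(maes))
    print(f"run {seed}: MAE={maes[-1]:.3f}, median so far={median_mae:.3f}")
```

The median is less affected by a single unusually lucky or unlucky split than the result of any one run.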

NOTE: Every statement below the for loop is a part of it.

Finally, we will print the result:

Using mean_absolute_error, the error comes out to about 0.3. For example, if the actual rating of the app is 4.0, then the predicted rating would typically fall in the range [3.7, 4.3].

NOTE: To see the result, we need to pass in a few parameters. These parameters are mentioned in the code snippet above. This gives an overall idea beforehand about how the app may perform.

Conclusion and Future Work:

The dataset offers immense possibilities to improve business value and have a positive impact. It is not limited to the problem taken into consideration for this project. Many other interesting possibilities can be explored using this dataset.

Future work can include:

  • Improving the pie charts shown above, in particular Fig 3. Some slices contain multiple domains; these could be separated and added to their respective fields to get a more detailed version of this pie chart.
  • Prediction of the number of reviews and installs by using the regression model.
  • Identifying the categories and stats of the most installed apps.
  • Exploring how the size of the app, the version of Android, etc. correlate with the number of installs.

The ways in which questions can be asked vary, and so do the ways of tackling a problem. Only an approach that has been minutely observed and tested will provide results worth trusting.

Editorial Note:

The event — Data December Festival 2018 — was organized by The Research Nest as a two-month online learning campaign focused on helping beginners learn Data Science. The main aim of this event was to get the participants engaged in a real-time project as they learn various concepts of data science and to complete it by creating a report documenting their insights.

An informative material guide was provided that included several resources to assist in self-learning data science within a month or two.

The guide can be downloaded here: http://bit.ly/self-learning-datascience
