User ratings predictions for a Google Play store data set

Antony Paulson Chazhoor
May 6 · 3 min read

The ability to use services and products on the go has been one of the major leaps of this century, and applications on the Google Play Store aim to deliver exactly that. Owing to its worldwide accessibility and ease of use, the store has become not only the most popular application download destination but also a hotbed for competing services trying to attract and gain customers. This project employs machine learning and visual analytics to gain insight into how applications become successful and achieve high user ratings.

The data set chosen for this project comes from the popular data website Kaggle. It contains data on over 10,000 applications, capturing details such as category, reviews, installs, and size.

The Kaggle dataset can be found at this link:

https://www.kaggle.com/lava18/google-play-store-apps

The aim of my project was first to visualize how the data set is distributed across categories, then to identify correlations among the parameters, and finally to find a machine learning model that could predict user ratings for any app with reasonable accuracy when similar data is available. The Python libraries Seaborn and Matplotlib were used for the visualizations. Subsequently, four different machine learning models were trained on this data.

Visualizations indicated that the apps were broadly distributed across 33 distinct categories and that the Family category was the most popular within this dataset. They also showed that the user ratings in the dataset were either 0 or mostly between 3.0 and 5.0. Within the latter interval, the ratings roughly followed a normal distribution with a peak at a rating of approximately 4.5.

Correlations among some major parameters were also visualized for the data.
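A correlation view of this kind can be sketched roughly as follows. This is an illustration under assumptions rather than the original analysis: the columns `Reviews`, `Installs`, `Size`, and `Rating` are assumed names, the values are synthetic, and Pearson correlation is assumed as the measure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric columns (made-up values, assumed column names).
rng = np.random.default_rng(1)
n = 300
reviews = rng.integers(1, 100_000, size=n)
apps = pd.DataFrame({
    "Reviews": reviews,
    # Installs are loosely tied to reviews so the toy data has some structure.
    "Installs": reviews * rng.integers(5, 50, size=n),
    "Size": rng.uniform(1, 100, size=n),  # app size in MB
    "Rating": np.clip(rng.normal(4.2, 0.5, size=n), 1.0, 5.0),
})

# Pairwise Pearson correlation between the numeric columns.
corr = apps.corr()

# Annotated heatmap of the correlation matrix.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between parameters")
fig.tight_layout()
```

In the real file, columns such as size and installs may be stored as text and would need converting to numbers before `corr()` is meaningful.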

After the initial visualization and data processing, the goal was to create a machine learning model to predict user ratings. Four different models, namely Multiple Linear Regression, Neural Networks, Decision Tree Regression, and a Light Gradient Boosted Tree Model (LightGBM), were created and trained on the available data. The LightGBM model predicted user ratings with the lowest error and much better accuracy than the other machine learning models used.

Error Comparison for the models

Finally, the parameters most important for predicting ratings were identified. It was highly enlightening to see that the size of an application had the greatest influence on user ratings, followed by the more obvious factor, the number of user reviews.

Feature Importance for the data

This project helped answer several questions about the data: how it is distributed, which model to use for rating predictions, and which parameters affect the ratings. The approach adopted in the project can easily be scaled to much larger data sets of a similar kind and, when implemented correctly, can provide an insightful advantage over the competition in the market.

The complete project and my machine learning models can be found by following this link.

Thank you for reading!

Written by Antony Paulson Chazhoor
Data Engineer @ View | Data Scientist | Problem Solver | Solution oriented insight builder
