Google App Store Rating Prediction

Erte Bablu · 8 min read · Dec 16, 2018

Project by Akaash Chikarmane, Erte Bablu, and Nikhil Gaur

In this post, we will show the data preprocessing steps we took and why we took them. We will also discuss the transformations we applied to the dataset and look for relationships while exploring the data through visualization. Lastly, we will cover the motivation behind each model we used (linear regression, XGBRegressor, etc.), briefly summarize what they are, and, where applicable, explain why we tuned their parameters the way we did.

Background

As mobile apps have become prevalent and more developers make their livelihood off of mobile development alone, it has become important for developers to be able to predict the success of their apps. Our goal was to predict the overall rating of an app, because so much of users' trust in an app comes from that one statistic alone. Higher-rated apps are more likely to be recommended and more likely to be trusted by users who find them while browsing the app store.

Outline of Approach and Rationale

The vast majority of this project was about cleaning up and preprocessing the data. Since all of the data was scraped directly from the Google Play Store, there were many transcription errors (NaN values where nothing was scraped, shifted data columns, etc.) and categorical values to translate or encode (more on that later). From there, we applied a number of regression models, such as the gradient-boosting regressor from the XGBoost package, linear regression, and ridge regression.

Data Collection and Analysis

Dataset: https://www.kaggle.com/lava18/google-play-store-apps

The dataset comes in two pieces: objective information (app statistics such as size, number of installs, price, category, number of reviews, type, content rating, genres, last-updated date, current version, minimum required Android version, and the aggregated star rating) and user reviews (the review text, its positive/negative/neutral sentiment, sentiment polarity, and sentiment subjectivity). Since we expect developers to have easier access to their app statistics than to reviews, we suspected that the objective information would contribute much more to the overarching goals of the project. However, we tested models both with and without the review information, and by comparing the results we decided whether or not to include the review data in the final model.

The objective dataset has 12 features, one target variable (the rating), and about 10.8k entries. The user reviews dataset contains the first 100 most relevant reviews for each app, with 5 features and a total of 64.3k entries. All the data was acquired by scraping the Google Play Store directly and was last updated three months before this post.

Data Pre-Processing

The initial data looked like the image below.

The Installs, Rating, Price, and Size features had to be processed so they could be read as numbers, since they were originally all objects. Each feature had its own problems to fix. For Installs, the commas and the '+' appended to the end were removed. Ratings were simply converted to floats. Price needed the '$' removed. Size was the most troublesome: values were written with KB and MB suffixes, so they required stripping the text and multiplying so that everything was under the same unit. The before and after of Installs and Size are shown below.

0        10,000+
1       500,000+
2     5,000,000+
3    50,000,000+
4       100,000+
Name: Installs, dtype: object

0       10000.0
1      500000.0
2     5000000.0
3    50000000.0
4      100000.0
Name: Installs, dtype: float64

0    19M
1    14M
2    8.7M
3    25M
4    2.8M
Name: Size, dtype: object

0    19000000.0
1    14000000.0
2     8700000.0
3    25000000.0
4     2800000.0
Name: Size, dtype: float64
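A minimal sketch of this cleanup, assuming pandas and the dataset's column names (the parse_size helper and the decimal unit multipliers are our own illustration):

import pandas as pd

def clean_numeric_columns(df):
    # Installs: strip commas and the trailing '+', then cast to float
    df["Installs"] = (df["Installs"]
                      .str.replace(",", "", regex=False)
                      .str.replace("+", "", regex=False)
                      .astype(float))
    # Rating: already numeric text, so a plain cast suffices
    df["Rating"] = df["Rating"].astype(float)
    # Price: drop the leading '$' before casting
    df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)

    # Size: strip the 'M'/'k' suffix and multiply to a common unit
    # (treating M as 10^6 to match the output above)
    def parse_size(size):
        if size.endswith("M"):
            return float(size[:-1]) * 1_000_000
        if size.endswith("k"):
            return float(size[:-1]) * 1_000
        return None  # e.g. "Varies with device"
    df["Size"] = df["Size"].apply(parse_size)
    return df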

We also had to write code to make certain features more relevant to what we were doing. For example, Last Updated, as provided, was not very useful because it was simply the date of the last update. To make it more usable, we used the datetime package to transform those values into time since the last update, sorted into bins three months wide. The code for transforming the dates is shown below.

from datetime import datetime
from dateutil.relativedelta import relativedelta

n = 3  # month bin size
last_updated_list = new_google_play_store["Last Updated"].values
last_n_months = list()
for index, last_updated in enumerate(last_updated_list):
    window2 = datetime.today()
    window1 = window2 - relativedelta(months=+n)
    date_bin = 1
    last_update_date = datetime.strptime(last_updated, "%d-%b-%y")
    # Slide the n-month window back until it contains the update date;
    # the number of slides is the bin index
    while not (window1 < last_update_date < window2):
        date_bin = date_bin + 1
        window2 = window2 - relativedelta(months=+n)
        window1 = window1 - relativedelta(months=+n)
    last_n_months.append(date_bin)

new_google_play_store["Updated ({0} month increments)".format(n)] = last_n_months
new_google_play_store = new_google_play_store.drop(labels=["Last Updated"], axis=1)
new_google_play_store.head()

Feature engineering made up a large portion of this project, and one aspect of it was dealing with the categorical variables. The primary methods we used were one-hot encoding and label encoding. In one-hot encoding, a categorical variable with n unique values is expanded into n new binary variables, each indicating whether that value is present in the data point. Label encoding is more straightforward: it maps each unique value of the categorical variable to an integer and applies that mapping to the column in question. Label encoding was a good scheme when each data point belonged to only a single category (e.g., Category), while one-hot encoding was useful when a data point could belong to multiple categories (e.g., Genres). The one-hot and label encoding functions are shown below.

from copy import deepcopy
from sklearn.preprocessing import LabelEncoder

def one_hot_encode_by_label(df, labels):
    df_new = deepcopy(df)
    for label in labels:
        # get_dummies splits multi-valued fields such as "Art;Design" on ';'
        dummies = df_new[label].str.get_dummies(sep=";")
        df_new = df_new.drop(labels=label, axis=1)
        df_new = df_new.join(dummies)
    return df_new

def label_encode_by_label(df, labels):
    df_new = deepcopy(df)
    le = LabelEncoder()
    for label in labels:
        print(label + " is label encoded")
        le.fit(df_new[label])
        # Overwrite the column with its integer encoding
        df_new[label] = le.transform(df_new[label])
    return df_new
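Applied to the dataframe from earlier, usage might look like this (the exact column choices here are illustrative):

new_google_play_store = one_hot_encode_by_label(new_google_play_store, ["Genres"])
new_google_play_store = label_encode_by_label(
    new_google_play_store, ["Category", "Content Rating", "Type"])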

In an effort to normalize the data, we tried applying the log1p transformation to the ratings. The ratings are extremely skewed toward the high end of the range, so the log transformation could not correct the skew as much as it would have if the skew were less extreme.
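The transform itself is a single NumPy call (a sketch on the dataframe from earlier; predicted ratings would be mapped back to the original scale with np.expm1):

import numpy as np

# log1p computes log(1 + x), which stays defined at a rating of 0
new_google_play_store["Rating"] = np.log1p(new_google_play_store["Rating"])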

Before transform:

After log transform:

Data Exploration

We can see that Family and Games have the most apps in our dataset, with all other categories having fewer than 500 entries.

Most of the apps fit under the Everyone category.

The content rating does not appear to affect the overall rating much, despite most apps falling in the Everyone category.

Higher-rated apps have more reviews than lower-rated apps, and some of the higher-rated apps have significantly more reviews than others. This could be caused by in-app pop-ups or in-app incentives prompting users to leave reviews.

Installs and Reviews appear to be moderately correlated (r² = 0.435), so we also created models that used only one of the two. This relationship could explain why more popular categories have both more installs and more reviews.
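As a quick sanity check, the correlation can be computed directly on the cleaned dataframe (a sketch; r² is the square of the Pearson coefficient pandas returns):

r = new_google_play_store["Installs"].corr(new_google_play_store["Reviews"])
print(r ** 2)  # roughly 0.435 on our cleaned data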

Models and Results

We used scikit-learn's train_test_split to divide the data into training and testing sets. Cross-validation with GridSearchCV was used to improve the training score by finding the best alpha for Lasso and ridge regression and by tuning the XGBRegressor.
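A sketch of that setup, assuming X holds our engineered features and y the (log-transformed) ratings; the alpha grid and split parameters are illustrative:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validated search over regularization strengths for ridge;
# the same pattern applies to Lasso and to XGBRegressor hyperparameters
search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      scoring="neg_mean_squared_error",
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_)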

One of the models we used was the XGBRegressor from the XGBoost package. This model is generally very effective, since it uses gradient boosting (an ensemble of weaker models, usually decision trees) to make its predictions, but we had to be wary of overfitting, one of the dangers of models that use this technique. The initial mean-squared error without any lengthy feature engineering (only encoding and cleanup) was about 0.228. After log-transforming the ratings, the mean-squared error dropped to 0.219, a small improvement but a step in the right direction.
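A minimal XGBRegressor fit under the same assumptions (these hyperparameters are illustrative, not the tuned values):

from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb.fit(X_train, y_train)
print(mean_squared_error(y_test, xgb.predict(X_test)))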

We used linear regression after examining the relationship between Reviews, Installs, and Rating, looking at statistics such as the adjusted r-squared and p-values. The first linear regression model, Installs vs. Rating, gave an MSE of 0.2233; the Reviews vs. Rating model gave an MSE of 0.2107; and the combined Reviews and Installs vs. Rating model gave an MSE of 0.214.

We also used a KNeighborsRegressor model because it gave one Kaggle user's kernel a good mean squared error. With Reviews as the predictor, it gave us a mean squared error of 0.19948. The model is pictured below.
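The single-feature models follow the same fit-and-score pattern; a sketch using Reviews as the lone predictor (n_neighbors here is illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Compare the two single-feature models on the same split
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=5)):
    model.fit(X_train[["Reviews"]], y_train)
    preds = model.predict(X_test[["Reviews"]])
    print(type(model).__name__, mean_squared_error(y_test, preds))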

Conclusion

For this project, we took the Google Play Store datasets and analyzed and processed the data. After the data was transformed into a usable set, we used plots and summary statistics to understand the correlations between features. We then used this knowledge to build the best model we could for predicting ratings from the cleaned dataset.

We thought finding a decent model would not be too difficult and that we would spend most of our time refining a very accurate one. Instead, we learned that creating a model to predict the rating was not a simple task, and we came to better understand the challenges that come from using these models with a complex dataset.

We could have tried to:

  • Create a separate model for each genre
  • Create new features from android versions like we did with dates
  • Train a CNN as we had many categorical and numerical data points
  • Parse and clean data from the Google Play Store ourselves
