Board Game Rating Prediction with Linear Regression & Random Forest Regression in Python

S Joel Franklin
Published in Analytics Vidhya
9 min read · Nov 29, 2019

This machine learning project is about predicting the rating of board games. The data set ‘games.csv’ can be downloaded here.

The necessary packages are imported.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

READING THE DATA

The data is loaded into ‘df’ dataframe.

df = pd.read_csv('games.csv')

UNDERSTANDING THE DATA

Understanding the data is important: it gives us an intuitive feel for the data and lets us check for missing values, incorrect entries and inconsistent relationships between variables, all of which help us decide the necessary preprocessing steps. The identified preprocessing steps are summarized at the end of this section.

df.shape # Displays the shape of 'df'

The ‘df’ dataframe is of shape (81312,20) which implies 81312 cases and 20 columns.

df.head() # Displays the first five rows of 'df'.
df.columns # Displays the column names of 'df'.

The column names are [‘id’, ‘type’, ‘name’, ‘yearpublished’, ‘minplayers’, ‘maxplayers’, ‘playingtime’, ‘minplaytime’, ‘maxplaytime’, ‘minage’, ‘users_rated’, ‘average_rating’, ‘bayes_average_rating’, ‘total_owners’, ‘total_traders’, ‘total_wanters’, ‘total_wishers’, ‘total_comments’, ‘total_weights’, ‘average_weight’].

The objective is to predict the ‘average_rating’ of board games. The columns [‘id’, ‘name’] can be dropped as they don’t influence ‘average_rating’. The column ‘bayes_average_rating’ can be dropped too, as it is calculated from ‘average_rating’ itself and would leak the target into the input.

df.nunique() # Displays total number of unique elements in each column.

It can be observed that ‘type’ is the only categorical variable, and it takes on 2 values. A categorical variable is a variable that can take on one of a fixed number of values. The values can be numeric, strings or any other data type. If the values are non-numeric, the variable is ‘one hot encoded’, as only numeric data types are allowed in the input/output of machine learning algorithms. One hot encoding will be covered in a subsequent article; a minimal sketch is given below.
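As an aside, here is a minimal sketch of one hot encoding with pandas. The toy ‘colour’ column and its values are made up purely for illustration; no encoding is needed in this project since the ‘type’ column is dropped.

toy = pd.DataFrame({'colour': ['red','blue','red']}) # Hypothetical non-numeric categorical column.
pd.get_dummies(toy,columns = ['colour'],dtype = int) # Each unique value becomes its own 0/1 indicator column.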

df['type'].unique() # Prints the unique values in the 'type' column.

The 2 unique values in the ‘type’ column are [‘boardgame’, ‘boardgameexpansion’]. These values don’t seem to influence the output ‘average_rating’ and are not of much use. Hence the ‘type’ column can be dropped.

df.describe() # Displays summary statistics of each column.

It can be observed from ‘count’ that not all columns have a row count of 81312, which suggests there are missing values. But the number of missing values is very small compared to the total row count of 81312. Hence all rows which have at least one missing value can be dropped.
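The missing values per column can also be counted directly (a quick sketch, not part of the original output above):

df.isnull().sum() # Number of missing values in each column.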

Note :- If a column had too many missing values, the entire column could be dropped. If a column had only a small number of missing values, just the rows containing them could be dropped. If the number of missing values were moderate (neither very high nor very low), the column would have to be preprocessed to fill in the empty values, for example by replacing them with the average value of the column, or by another strategy depending on the column and the problem at hand. A sketch of the mean-replacement strategy is given below.
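For illustration, a minimal sketch of mean replacement with pandas; the ‘minage’ column is used only as an example, since in this project we simply drop the affected rows.

df['minage'] = df['minage'].fillna(df['minage'].mean()) # Replace missing values with the column mean (hypothetical step).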

It can also be observed that the minimum value of ‘users_rated’ is 0, which suggests there are board games which were not rated by any user. Let us look at those board games.

plt.hist(df['users_rated'],range = (0,5)) # Histogram of the 'users_rated' column.

A significant proportion of board games have ‘users_rated’ = 0. Let us look at the ‘average_rating’ of board games which were not rated by any of the users.


df[df['users_rated'] == 0]['average_rating'].describe()
# Prints the summary statistics of 'average_rating' for board games which were not rated by any user.

24380 board games were not rated by any user. The ‘mean’ and ‘std’ of ‘average_rating’ for those 24380 board games are 0, which indicates that every board game not rated by any user was given an ‘average_rating’ of 0.

This is not right, as board games which have not been rated can’t be classified as poor and given an ‘average_rating’ of 0. Hence the rows with ‘users_rated’ = 0 are to be dropped.

Another observation is that some rows have ‘maxplayers’ or ‘maxplaytime’ equal to 0.

df[(df['maxplayers'] == 0) | (df['maxplaytime'] == 0)] # Displays the rows where either of the mentioned variables is 0.

There are 20675 cases where at least one of [‘maxplayers’, ‘maxplaytime’] is 0. This information should have been given by the board game manufacturer. One preprocessing step would be to scrape the correct values of [‘maxplayers’, ‘maxplaytime’] from other sources and fill them in for each of these 20675 cases. But to avoid complexity, we stick with no preprocessing here.

As we have both minimum and maximum values in the dataframe, it is good to check whether every maximum value is at least as large as its corresponding minimum value.

df[df['minplayers'] > df['maxplayers']].count() # Displays the count of rows where 'minplayers' > 'maxplayers'.

There are 4020 cases where ‘minplayers’ > ‘maxplayers’.

df[df['minplaytime'] > df['maxplaytime']].count() # Displays the count of rows where 'minplaytime' > 'maxplaytime'.

There are 600 cases where ‘minplaytime’ > ‘maxplaytime’.

The minimum values exceed the maximum values due to some error while preparing the data. The preprocessing step is to swap ‘minplayers’ with ‘maxplayers’ and ‘minplaytime’ with ‘maxplaytime’ for these cases.

Summary :- The necessary preprocessing steps have been identified. The columns [‘id’, ‘name’, ‘bayes_average_rating’, ‘type’] are to be dropped. The rows with missing values are to be dropped. The rows with ‘users_rated’ = 0 are to be dropped. Swapping is to be done for the rows with ‘minplayers’ > ‘maxplayers’ and ‘minplaytime’ > ‘maxplaytime’.

PREPROCESSING THE DATA

We carry out the preprocessing steps that were identified in the ‘Understanding the data’ section.

1. The columns [‘id’, ‘type’, ‘name’, ‘bayes_average_rating’] are to be dropped.

df.drop(['id','type','name','bayes_average_rating'],axis = 1,inplace = True)

2. The rows with missing values are to be dropped.

df.dropna(axis = 0,inplace = True)

3. The rows with ‘users_rated’ = 0 are to be dropped.

df.drop(df[df['users_rated'] == 0].index,inplace = True)

4. Swapping is to be done for rows with ‘minplayers’ > ‘maxplayers’ and ‘minplaytime’ > ‘maxplaytime’

a = (df['minplayers'] > df['maxplayers'])
df.loc[a,['minplayers','maxplayers']] = df.loc[a,['maxplayers','minplayers']].values
b = (df['minplaytime'] > df['maxplaytime'])
df.loc[b,['minplaytime','maxplaytime']] = df.loc[b,['maxplaytime','minplaytime']].values

VISUALIZING THE DATA

We check the correlation between variables in ‘df’. Correlation is any statistical relationship (causal or not) between 2 random variables, though it commonly refers to the degree to which a pair of variables are linearly related.

plt.figure(figsize = (10,6)) # Adjusting the figure size.
sns.heatmap(df.corr()) # Displays a heatmap of correlations between variables in 'df'.

The light shaded areas are highly correlated. It can be observed that ‘average_rating’ has a relatively high correlation with ‘minage’, ‘total_wanters’ and ‘average_weight’. The same information can be read numerically, as sketched below.
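A quick sketch of sorting the correlations with the target (not part of the original walkthrough):

df.corr()['average_rating'].sort_values(ascending = False) # Correlation of every variable with 'average_rating', highest first.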

‘total_owners’, ‘total_traders’, ‘total_wanters’, ‘total_wishers’, ‘total_comments’ and ‘total_weights’ correlate well among themselves, which is expected as each of these variables is roughly proportional to the demand for a board game.

‘playingtime’, ‘minplaytime’ and ‘maxplaytime’ correlate well among themselves, which is expected as each of these variables relates to the playing time of a board game.

Since ‘average_rating’ has a relatively high correlation with ‘minage’, ‘total_wanters’ and ‘average_weight’, we focus on these variables.

sns.set(font_scale = 1.5)
sns.pairplot(df[['minage','average_rating','total_wanters','average_weight']],height = 3.5,aspect = 0.9)

In ‘total_wanters’ vs ‘minage’ graph, it can be observed that ‘total_wanters’ is high for board games with ‘minage’ between 10 and 20. This implies many people prefer board games designed for teens.

A new column called ‘new_users_rated’ is defined. ‘new_users_rated’ = 1 if ‘users_rated’ > df[‘users_rated’].mean() and 0 otherwise.

df['new_users_rated'] = df['users_rated'].apply(lambda x: 1 if x > df['users_rated'].mean() else 0)
sns.scatterplot(x = 'average_rating',y = 'total_wanters',data = df,hue = 'new_users_rated',legend = 'full')

‘users_rated’ is high for board games with ‘average_rating’ between 6 and 9. ‘total_wanters’ is high for board games with ‘average_rating’ around 8 which is sensible as people would desire to have board games which have high ratings.

Is it right to compare board games based on ratings alone?

The ‘users_rated’ must also be considered before making a decision. It may be that board games with a high ‘average_rating’ were rated by only a small number of users, which makes the ‘average_rating’ biased.

A new column called ‘new_users_rated’ is defined. ‘new_users_rated’ = 1 if ‘users_rated’ > 5000 and 0 otherwise.

df['new_users_rated'] = df['users_rated'].apply(lambda x: 1 if x > 5000 else 0)
sns.scatterplot(x = 'average_rating',y = 'users_rated',data = df,hue = 'new_users_rated')

The preferred board game would be the one with the highest ‘average_rating’ in the group of orange points (‘users_rated’ > 5000). We can pick those games out directly, as sketched below.
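A minimal sketch of filtering and sorting the dataframe for those games (note that the ‘name’ column was dropped during preprocessing, so only numeric columns remain):

df[df['users_rated'] > 5000].sort_values('average_rating',ascending = False).head() # Highest rated games among those with more than 5000 ratings.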

TRAINING THE MODEL

The data set has been preprocessed and is ready for training. A comparison is made between 2 models :- Linear Regression & Random Forest Regression.

Comparison of Linear Regression and Random Forest Regression

The comparison is made based on the cross validation score. The data set is divided into 2 sets :- ‘Train_set’ and ‘Test_set’. The model is trained using a portion of ‘Train_set’, and the cross validation score is calculated from the performance of the trained model on the remaining portion of ‘Train_set’.

There are various cross validation techniques which will be discussed later. Here K-Fold cross validation technique is used.

The training set is split into ‘Train_set’ and ‘Test_set’.

X = df.drop('average_rating',axis = 1) # X is the input.
y = df['average_rating'] # y is the output.
X_train,X_test,y_train,y_test = model_selection.train_test_split(X,y,test_size = 0.2) # Splitting into 'Train_set' and 'Test_set'.

Cross validation scores are calculated for both models.

validation_type = model_selection.KFold(n_splits = 10) # K-Fold cross validation technique is used.
cross_validation_result1 = model_selection.cross_val_score(LinearRegression(),X_train,y_train,cv = validation_type,scoring = 'neg_mean_squared_error') # Cross validation score of the Linear Regression model.
cross_validation_result2 = model_selection.cross_val_score(RandomForestRegressor(),X_train,y_train,cv = validation_type,scoring = 'neg_mean_squared_error') # Cross validation score of the Random Forest Regressor model.
print(cross_validation_result1.mean(),cross_validation_result2.mean())

Cross validation score of Linear Regression model = -2.094

Cross validation score of Random Forest Regressor model = -1.632

The negative of the mean squared error is calculated above, so the mean squared errors of the Linear Regression model and the Random Forest Regressor model are 2.094 and 1.632 respectively. Taking square roots converts these to errors on the original rating scale, as sketched below.
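A small sketch of converting the scores to root mean squared errors (RMSE), which are easier to interpret on the rating scale; the exact values depend on the random train/test split.

print(np.sqrt(-cross_validation_result1.mean())) # RMSE of Linear Regression, roughly 1.45 rating points.
print(np.sqrt(-cross_validation_result2.mean())) # RMSE of Random Forest, roughly 1.28 rating points.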

The performance of Random Forest Regressor model on given data set is expected to be better than Linear Regression model. Let us verify this by training using ‘Train_set’ and testing using ‘Test_set’.

Performance of Linear Regression model

a = LinearRegression().fit(X_train,y_train) # Fitting the model.
predictions = a.predict(X_test) # Test set is predicted.
print(mean_squared_error(y_test,predictions)) # Mean squared error is calculated.

The mean squared error is 2.083

Performance of Random Forest Regressor model

b = RandomForestRegressor().fit(X_train,y_train) # Fitting the model.
predictions = b.predict(X_test) # Test set is predicted.
print(mean_squared_error(y_test,predictions)) # Mean squared error is calculated.

The mean squared error is 1.609

As expected, the performance of the Random Forest Regressor model is better than that of the Linear Regression model for the given data set. This is because Linear Regression can only capture linear relationships between the input variables and ‘average_rating’, while a Random Forest can also model the non-linear relationships present in the board game data.

Prediction using trained model

The trained Random Forest Regressor model is used to predict the ‘average_rating’ of 2 randomly selected board games from the data set.

prediction = b.predict(df.iloc[[42,97]].drop('average_rating',axis = 1))
print(prediction)

The predicted ‘average_ratings’ of the board games at positions 42 and 97 in the data set are [7.793971 7.683359].

df.iloc[[42,97]]['average_rating'] # Actual average ratings.

The actual ‘average_ratings’ of the board games at positions 42 and 97 are [7.86088 7.67833].

The predicted ‘average_ratings’ are very close to the actual ‘average_ratings’. The Random Forest Regressor model has been trained well.

Happy Reading!
