Using scikit-learn to analyze board game data with linear regression

I’m two weeks into my data science immersive at Metis in Chicago. We’ve just started our foray into machine learning, beginning with scikit-learn and linear regression models. I am a self-confessed board game junkie, so I chose to do my project with data scraped from BoardGameGeek. In this article, I’ll assume you have a basic understanding of what machine learning is, but I will explain some of the specific mechanics I chose to utilize. You can find the entire project on GitHub.

The Challenge

Wouldn’t it be nice to know if a board game is good before you buy it, let alone before you spend several hours with your friends plodding through the rule book, stumbling through the metaphorical dark? For games that already exist, BoardGameGeek (BGG) serves as the light. BGG is an online forum, marketplace, and wiki for all things board game related. With a large, active community and a database of over 80,000 board games, it is the best place to find a new game to play. The goal of this project was to create a model that can accurately predict game ratings.

Data Munging

I scraped my data set from BGG using Scrapy and BeautifulSoup. The scrapers I used were written by Sean Beck. Scrapy was used to interact with the webpage and scrape an ID tag for every game on the website. Then, those IDs were passed to a BeautifulSoup script to parse each individual game’s XML statistics page. The resulting .csv file was loaded into pandas for cleaning. Initially there were over 80,000 games, but that number was reduced to a little over 13,000 when games with fewer than 30 reviews were removed.
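
As a rough illustration of that last filtering step, here is a minimal pandas sketch. The tiny data frame is made up, and I am assuming that the users_rated column is the review count and that the target column is called average_rating; the real column names in the scraped file may differ.

```python
import pandas as pd

# Tiny stand-in for the scraped .csv; the real file has one row per game.
games_raw = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "users_rated": [12, 30, 4500, 29],        # assumed to be the review count
    "average_rating": [6.1, 7.3, 8.0, 5.5],   # hypothetical column name
})

MIN_REVIEWS = 30  # threshold from the text: drop games with fewer than 30 reviews

# Keep only games with at least MIN_REVIEWS reviews.
games_filtered = games_raw[games_raw["users_rated"] >= MIN_REVIEWS].reset_index(drop=True)
print(len(games_raw), "->", len(games_filtered))
```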


Exploratory Data Analysis

Let’s get to know the data, starting with our target.

The target distribution is just slightly skewed left, but otherwise Gaussian. This is explained by the lack of 10/10 games, which are practically non-existent on BoardGameGeek (likewise for extremely low rated games). Giving a 10 out of 10 score would be considered unhelpful and lacking in critical thought. A similar phenomenon can be seen somewhere like IMDb, where the highest rated film of all time, The Shawshank Redemption, is rated 9.2 out of 10. Mean Squared Error will make for a good error metric since the target is continuous and close to Gaussian.
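
If you want a number to go with the histogram, pandas can report the skewness directly. This is only a sketch on a handful of made-up ratings, not the real target column; a negative value means the tail points toward low scores (left skew).

```python
import pandas as pd

# Hypothetical ratings with a tail toward low scores (left skew).
ratings = pd.Series([1.0, 5.0, 6.0, 6.0, 7.0, 7.0, 7.0, 8.0])

skewness = ratings.skew()   # negative => left-skewed, near 0 => roughly symmetric
print(round(ratings.mean(), 3), round(ratings.std(), 3), skewness < 0)
```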

Next, let’s take a look at our features.

This pairplot of the feature variables can help to identify collinear variables. Machine learning favors the application of Occam’s Razor when possible. By removing highly collinear variables we reduce the complexity of our model, but we can still use the remaining feature(s) to find a linear relationship between them and the target. Let’s take a look at the correlation scores of all the features with the target, highlight some notable features, and then make a decision about what to keep and what to remove.
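
The same check can be done numerically with a correlation matrix. Below is a minimal sketch on synthetic data: the column names mirror the ones used in this article, but the values are generated, users_rated is deliberately built to be collinear with total_owners, and the 0.9 cutoff is an arbitrary choice of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
owners = rng.normal(size=n)
df = pd.DataFrame({
    "total_owners": owners,
    "users_rated": owners + 0.05 * rng.normal(size=n),   # built to be nearly collinear
    "average_weight": rng.normal(size=n),
})
df["average_rating"] = 0.5 * df["average_weight"] + 0.2 * owners + 0.1 * rng.normal(size=n)

corr = df.corr()

# Correlation of each feature with the target.
target_corr = corr["average_rating"].drop("average_rating")

# Feature pairs whose absolute correlation exceeds the (hypothetical) 0.9 cutoff.
features = corr.drop(index="average_rating", columns="average_rating")
pairs = [(a, b) for i, a in enumerate(features.columns)
         for b in features.columns[i + 1:]
         if abs(features.loc[a, b]) > 0.9]

print(target_corr.sort_values(ascending=False))
print(pairs)
```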

Average_weight

Users on BGG are invited to submit a weight score for each game. BGG defines weight as “a community rating for how difficult a game is to understand. Lower rating (lighter weight) means easier.” Weight is scored on a scale from 0.0 to 5.0.

Total_wanters

This is the current number of BGG members who are willing to trade for or purchase this game on the marketplace.

Yearpublished

The year a game was published is surprisingly correlated with the game’s rating compared to other feature variables. This could be explained by a revival in board game popularity in recent years. This in turn can be explained by crowdfunding platforms, like Kickstarter (founded in 2009), which make it possible to create small batches of niche board games for a subculture community.

Feature Engineering

Using the pairplot from above, there are five major collinear relationships between features:

  1. total_owners & users_rated
  2. total_owners & total_weights
  3. total_owners & total_comments
  4. users_rated & total_weights
  5. users_rated & total_comments

We will want to retain the features with the strongest correlation with the target. Looking at the correlation table, we can drop total_traders, total_weights, total_comments, and users_rated. We will keep total_owners because, of the collinear features, it is the most correlated with average rating. We will also drop bayes_average_rating since it is nearly a copy of our target.
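
In pandas, that selection could look something like the sketch below. The two-row data frame is a stand-in, and average_rating is my assumed name for the target column.

```python
import pandas as pd

# Stand-in frame with the column names mentioned in the text (values are made up).
games = pd.DataFrame({
    "total_owners": [100, 2500], "users_rated": [80, 2000],
    "total_weights": [10, 300], "total_comments": [20, 450],
    "total_traders": [3, 40], "total_wanters": [5, 120],
    "average_weight": [1.8, 3.2], "yearpublished": [1995, 2015],
    "bayes_average_rating": [5.6, 7.4], "average_rating": [6.0, 7.9],
})

# Collinear or leaky columns identified in the text.
to_drop = ["total_traders", "total_weights", "total_comments",
           "users_rated", "bayes_average_rating"]

y = games["average_rating"]                             # target
X = games.drop(columns=to_drop + ["average_rating"])    # remaining features
print(list(X.columns))
```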


Linear Regression

I wanted to use regularization in this project to help prevent overfitting (and to get the practice). I decided to try both L1 (Lasso) and L2 (Ridge) regularization and use whichever produced the best model. For a given penalty strength λ, L2 has a single closed-form solution, while L1 has to be solved iteratively; in both cases λ is a parameter that can be tuned. In order to find the best λ for the L1 regularization, I passed numerous λ values and used the one that produced the lowest error.
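
For reference, these are the two objectives as scikit-learn documents them for its Lasso and Ridge estimators. Here w is the coefficient vector, n is the number of samples, and α is scikit-learn's name for the λ used in this article.

```latex
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad
\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
```

The L1 penalty can push coefficients to exactly zero, while the L2 penalty only shrinks them toward zero.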

Lasso

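from sklearn import linear_model, preprocessing

# standardize the features (zero mean, unit variance)
games_standardized = preprocessing.scale(games)
# cross-validate over candidate alphas; the old normalize=False argument is omitted
# because it was the default and has been removed from recent scikit-learn releases
lasso_cv = linear_model.LassoCV(cv=5, verbose=True, random_state=8)
lasso_cv.fit(games_standardized, y)
alpha_lasso = lasso_cv.alpha_   # best alpha found by cross-validation
lasso_coef = lasso_cv.coef_
# refit a plain Lasso with the selected alpha
lasso = linear_model.Lasso(alpha=alpha_lasso, random_state=8)
lasso.fit(games_standardized, y)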

LassoCV() has a default parameter of n_alphas=100, where alpha is the λ parameter. After the model is fit, it keeps the λ with the lowest mean cross-validated error out of the 100 that were tested (available as alpha_). This model produced the following scores:
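
To make that selection step concrete, here is a small sketch on synthetic data (not the board game features). It inspects the fitted LassoCV attributes alphas_ and mse_path_ and reproduces the choice of alpha_ by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the standardized board game features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=8)

lasso_cv = LassoCV(cv=5, random_state=8)   # 100 candidate alphas by default
lasso_cv.fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): average over folds, then take the minimum.
mean_mse = lasso_cv.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print(len(lasso_cv.alphas_))                          # number of candidate alphas
print(lasso_cv.alphas_[best_idx], lasso_cv.alpha_)    # hand-picked alpha vs. alpha_
```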

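import numpy as np
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MSE (scikit-learn returns it negated), converted to RMSE
cv_score_lasso = cross_val_score(lasso_cv, games_standardized, y, cv=5, scoring='neg_mean_squared_error', verbose=0)
rmse_lasso = np.mean(-cv_score_lasso) ** (0.5)
print(rmse_lasso)
0.679433492610936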

This model produced a root mean squared error of 0.68. We can interpret this as the model’s predictions typically landing within about 0.68 points of a game’s actual rating. The standard deviation of average rating is 0.95, which is also the RMSE you would get by predicting the mean rating for every game, so this is only a modest improvement over that baseline.
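
The reason the standard deviation is a useful yardstick is that a "model" which always predicts the mean has an RMSE equal to the standard deviation of the target. Here is a quick sketch with generated ratings; the mean of 6.4 is a made-up value, and only the 0.95 and 0.68 figures come from this article.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical ratings; the real target has a standard deviation of about 0.95.
y = rng.normal(loc=6.4, scale=0.95, size=1000)

baseline_pred = np.full_like(y, y.mean())                    # always predict the mean
baseline_rmse = np.sqrt(np.mean((y - baseline_pred) ** 2))

print(baseline_rmse, y.std())   # identical: RMSE of the mean predictor is the std (ddof=0)

model_rmse = 0.68               # RMSE reported in the text
print(1 - model_rmse / 0.95)    # relative reduction in RMSE versus that baseline
```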

Ridge

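# L2 regularization: pick alpha by cross-validation
# (normalize=False dropped; it was the default and newer scikit-learn no longer accepts it)
ridge_cv = linear_model.RidgeCV(cv=5)
ridge_cv.fit(games_standardized, y)
alpha_ridge = ridge_cv.alpha_
ridge_coef = ridge_cv.coef_
# refit a plain Ridge with the selected alpha
ridge = linear_model.Ridge(alpha=alpha_ridge, random_state=8)
ridge.fit(games_standardized, y)
# the scorer name must be 'neg_mean_squared_error'; it returns negated MSE values
cv_score_ridge = cross_val_score(ridge, games_standardized, y, cv=5, scoring='neg_mean_squared_error')
rmse_ridge = np.mean(-cv_score_ridge) ** (0.5)
print(rmse_ridge)
0.6793923373719547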

For a given λ, L2 is robust and provides a unique solution, whereas L1 can have more than one solution when features are strongly collinear. Surprisingly, both types of regularization produced very similar results. We can see how similar the two models are by taking the absolute value of the difference of their coefficients (i.e. the betas/weights for each feature variable).

Lasso:
[ 0.31977862 -0.04933663 0.01243574 0.00374256 0.01611358 0.0012954
0.01437157 0.23118331 0.61730485 -0.63864544 0.41574556]
Ridge:
[ 0.32100285 -0.05039445 0.01375012 0.00286474 0.01674359 0.00286474
0.01529783 0.24030501 0.63912382 -0.66750468 0.41545838]
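
The comparison described above can be sketched with NumPy, using the coefficient values exactly as printed here (the variable names are mine):

```python
import numpy as np

# Coefficients copied from the printed output.
lasso_coef = np.array([0.31977862, -0.04933663, 0.01243574, 0.00374256, 0.01611358,
                       0.0012954, 0.01437157, 0.23118331, 0.61730485, -0.63864544,
                       0.41574556])
ridge_coef = np.array([0.32100285, -0.05039445, 0.01375012, 0.00286474, 0.01674359,
                       0.00286474, 0.01529783, 0.24030501, 0.63912382, -0.66750468,
                       0.41545838])

# Element-wise absolute difference between the two coefficient vectors.
abs_diff = np.abs(ridge_coef - lasso_coef)
print(abs_diff.round(4))
print(abs_diff.max(), int(abs_diff.argmax()))   # largest gap and the index of that feature
```

Every gap is small relative to the coefficients themselves, which is what "almost identical" means here.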

These linear regression models are almost identical! When I saw this, I knew that I had to experiment. My hypothesis was that the Lasso regularization had not found the optimal λ and that, given a larger list of potential values, it would produce the same model as Ridge. Let’s find out!

Optimized Lasso

lassoT_cv = linear_model.LassoCV(cv=5, n_alphas=500, verbose=True, random_state=8)

The code was the same as before, but I explicitly passed n_alphas=500. Let’s look at the new coefficients:

Optimized Lasso:
[ 0.31976778 -0.04933048 0.01242661 0.00373719 0.01610949 0.00129619
0.01436647 0.23109295 0.6170705 -0.63834188 0.41574785]
Ridge - Optimized Lasso:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

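The difference prints as all zeros, which would mean the optimized Lasso has exactly the same coefficients as Ridge. Truly remarkable, if it holds! The two arrays as printed still differ in the second or third decimal place (0.3198 vs. 0.3210 for the first coefficient), so the zeros deserve a second look. Either way, L1 and L2 regularization produced practically the same model, which is supported by the two models having nearly equal root mean squared errors.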


The Results

In order to test the efficiency of the model, I wanted to compare the mean squared error from cross-validation to the error produced by training on the entire data set.

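from sklearn.metrics import mean_squared_error

# alphaT_lasso and lasso_T come from the optimized Lasso step above
# (the tuned alpha and the Lasso model fit with it)
non_cv = linear_model.Lasso(alpha=alphaT_lasso)
non_cv.fit(games_standardized, y)
non_cv_predictions = non_cv.predict(games_standardized)
cv_mse_predictions = lasso_T.predict(games_standardized)
non_cv_mse = mean_squared_error(y, non_cv_predictions)
cv_mse = mean_squared_error(y, cv_mse_predictions)
print('Training error:', non_cv_mse)
print('Cross-validated error:', cv_mse)
Training error: 0.45942496793575494
Cross-validated error: 0.45942496793575494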

Another mystery has arisen. In a good example of a linear regression model, we would expect the model that trains on the entire data set to overfit and have a lower in-sample MSE.
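
One way to set up this kind of comparison so that the second number really is out-of-sample is to score held-out folds with cross_val_score. This is a sketch on synthetic data; alpha=0.1 is an arbitrary illustrative value, not the tuned one from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; alpha=0.1 is an arbitrary illustrative value.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=8)
model = Lasso(alpha=0.1)

# Training (in-sample) error: fit on everything, predict the same rows.
train_mse = mean_squared_error(y, model.fit(X, y).predict(X))

# Cross-validated error: each fold is scored by a model that never saw it.
cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

print(train_mse, cv_mse)
```

A small gap between the two numbers together with a high error points toward underfitting, while a large gap points toward overfitting.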

What does it mean?

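The model is underfit: the error is high, and there is essentially no gap between the training error and the cross-validated error. Note that both numbers above come from predicting on the same full data set the models were fit on, so it is no surprise that they match; the held-out RMSE of 0.68 reported earlier corresponds to an MSE of about 0.46, so the real gap is still very small. There is a lack of complexity, perhaps not in the number of features, but definitely in their richness or utility. Replacing features with more meaningful or indicative ones is a starting place for improving the model. Above all, I learned an important lesson called GIGO: Garbage In = Garbage Out.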

Looking back, the model was fundamentally flawed. Many of the features that described the game itself (e.g. playing time) had little correlation with whether or not it was a good game. The more important features were those that were external to the game. For example, total_wanters had a strong correlation, but it is almost self-fulfilling. Using total_wanters as a feature is akin to saying “Tell me how popular (in demand) a game is and I will tell you its average rating.” There is not a lot of utility in that kind of predictive model.


Future work

This project was good practice, but not very exciting as far as results go. Ultimately, my features were not very rich. Most trends in average rating are explained by a few things:

  1. More recent games are rated highly due to an increased user population on BoardGameGeek and recent proliferation of board games
  2. Average rating being correlated with the number of people who are interested in the game in any capacity is basically self-explanatory

The major problem with the model is that the features tell us very little useful information about the game itself.

Solution

The XML pages that are being scraped have more nuanced information about the games.

<poll name="suggested_playerage" title="User Suggested Player Age" totalvotes="0"></poll>
<poll name="language_dependence" title="Language Dependence" totalvotes="0"></poll>
<link type="boardgamecategory" id="1002" value="Card Game"/>
<link type="boardgamecategory" id="1031" value="Racing"/>
<link type="boardgamemechanic" id="2040" value="Hand Management"/>
<link type="boardgamefamily" id="17438" value="Mille Bornes"/>
<link type="boardgamefamily" id="18055" value="Pixar Cars"/>
<link type="boardgamedesigner" id="3" value="(Uncredited)"/>
<link type="boardgamepublisher" id="3082" value="Dujardin"/>
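
The category and mechanic links in this XML could be turned into 0/1 indicator columns. Here is a minimal sketch using the standard library XML parser and pandas. The wrapping item element and its id are made up so that the fragment parses on its own; the link elements are copied from the page.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Fragment modeled on the XML above, wrapped in a hypothetical <item> root.
xml_text = """
<item id="1">
  <link type="boardgamecategory" id="1002" value="Card Game"/>
  <link type="boardgamecategory" id="1031" value="Racing"/>
  <link type="boardgamemechanic" id="2040" value="Hand Management"/>
</item>
"""

item = ET.fromstring(xml_text)
wanted = {"boardgamecategory", "boardgamemechanic"}

# Build one "type=value" label per category or mechanic link.
labels = [f'{link.get("type")}={link.get("value")}'
          for link in item.iter("link") if link.get("type") in wanted]

# One row per game, one 0/1 column per category or mechanic.
dummies = pd.DataFrame([{label: 1 for label in labels}],
                       index=[item.get("id")]).fillna(0).astype(int)
print(dummies.T)
```

With many games, each game contributes one row, and fillna(0) fills in the categories a game does not have.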

Creating dummy variables for things like game category and mechanics could be much more indicative of average score. Almost more importantly, this data could be used for unsupervised learning and clustered by average ratings and publishing years. This could be used to identify trends in categories and mechanics and make accurate predictions about games that will be released in the near future. When I have some free time, I’ll go back and try again with a revised data set. Be on the lookout for Board Games and Machine Learning Part II!