Ratings on BoardGameGeek have long been viewed by hobbyists as one of the most authoritative measures of a board game’s quality. But what does the rating really tell us? Certainly the rating measures how people decide to rate the game, but what can we conclude from that?
This article is based on a dataset available on Kaggle that contains the 4,999 ranked games on BoardGameGeek as of June 2018. Our primary task will be to examine how objective features like mechanics, genre, year of release, and play time impact the average rating, and how much “magic sauce” games have in surpassing their predicted ratings. We’ll also look at some of the most important variables and explore their relationship to the average rating. Finally, we’ll cluster games based on mechanics and genres, and examine which games are similar to each other.
Generating the Model
We’re going to be using a technique called gradient boosting to build a regression tree model for the data. Gradient boosting is a relatively new technique that frequently provides additional predictive accuracy over older methods. A regression tree can be thought of as a form of regression that accepts both categorical and continuous variables. Additionally, regression trees have innate advantages in capturing variable interactions and dealing with non-linearity. The downside of complex models of this nature is that they tend to become “black boxes”: they produce output, but not in an easily explainable way. Luckily, an approach for explaining the output of complex models called SHAP (SHapley Additive exPlanations) has been developed, and we’ll be using SHAP extensively to explain what our model really means.
We’re going to use only objective variables: genres, mechanics, year of release, average length of game, and the minimum/maximum player counts. The model’s root mean square error was about 0.40, which is surprisingly good, since it means we can usually predict the rating of a game to within 0.4 points on a 10-point scale!
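As a rough sketch of this setup, here’s how a gradient-boosted regression tree can be fit and scored with scikit-learn. The features and their effects below are synthetic stand-ins, not the actual Kaggle columns, so only the shape of the workflow carries over:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the BGG features: binary mechanic/genre flags
# plus continuous year-of-release and play-time columns.
X = np.column_stack([
    rng.integers(0, 2, n),           # has the "worker placement" mechanic
    rng.integers(0, 2, n),           # is a card game
    rng.uniform(1990.0, 2018.0, n),  # year of release
    rng.uniform(15.0, 240.0, n),     # average play time (minutes)
])
# A made-up rating that depends on the features, plus noise.
y = (6.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1]
     + 0.02 * (X[:, 2] - 1990.0) + 0.002 * X[:, 3]
     + rng.normal(0.0, 0.4, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                  learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)

# Root mean square error on held-out games.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"test RMSE: {rmse:.2f}")
```

Because the tree model handles the binary mechanic flags and the continuous columns in one pass, no separate encoding step is needed here.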
To interpret the graph of SHAP values, note that each row represents one of the features of the model, with red indicating a high value, and blue indicating a low value. The SHAP value is essentially a measure of how much that feature’s value is impacting the total predicted rating, and the rows are ranked based on average impact. Most of the variables are binary, so red is games that have the property, blue is games that don’t, and the horizontal axis is the importance of that feature in determining the predicted rating.
For example, we see that the year of release is very important, and recent games (shown in red) have a predicted rating of up to 0.6 points higher. Conversely, older games (shown in blue) are rated up to about 0.3 rating points lower. The average time it takes to play the game is also a big factor, with longer games generally being rated higher by up to 0.4 points. This phenomenon is perhaps unsurprising, since BGG users are probably hobbyists who like longer and heavier games. The remainder of the most important features are genres or mechanics, with most providing up to a 0.2 point increase in rating. Notably, card games are in general rated slightly lower, while humor games are rated up to 0.3 points lower, and take-that games up to 0.2 points lower.
We can see here that on average the model does a good job of predicting the rating, with both popular games and highly rated games tending to perform better than expected.
Our analysis is essentially a regression, so it’s good to check that it does a good job of modeling the data, without any unexplained behavior in the residuals. As it happens, the straight QQ plot tells us that assuming normally-distributed error was reasonably accurate, and our model does not have any major structural problems. It’s worth noting that while the total variance of the average rating was about 0.3, our residuals had a variance of 0.16, suggesting that our model captured just under half of the total variance, which is as much as we could expect from including no subjective appraisals at all!
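The residual checks above can be sketched as follows. The residuals here are simulated from the variance figures quoted in the text, and scipy’s `probplot` supplies the QQ points; a correlation coefficient very close to 1 corresponds to the straight QQ plot described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated residuals with the variance reported in the article (0.16).
residuals = rng.normal(0.0, np.sqrt(0.16), 4999)

# probplot returns the ordered QQ points plus a least-squares line fit;
# r near 1 means the residuals are close to normally distributed.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"QQ correlation: {r:.4f}")

# Share of rating variance captured by the model, using the article's figures.
total_var, resid_var = 0.30, 0.16
print(f"variance explained: {1 - resid_var / total_var:.0%}")
```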
As we will see in the next part of this article, it looks like on average games can do about half a point better or worse than expected.
Where’s the Magic?
So now that we have a model that does a reasonable job of predicting the score, a natural question to ask is: what about the factors that can’t be represented as mechanics or categories? In other words, how much ‘magic sauce’ does a game have that makes its rating better or worse than expected? To measure this, we take each game’s actual score, subtract its predicted score, and graph the result against the rank of the game. As one would expect, higher-ranked games have more magic sauce than lower-ranked games. Notice that roughly the top 1000 games tend to do better than expected, with the top 200 or so doing even better.
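The computation itself is just a subtraction, sketched here on a tiny hand-made table. Only the Tiny Epic Kingdoms numbers (6.67 actual, 7.19 predicted) come from this article; the other names and figures are invented for illustration:

```python
import pandas as pd

# Miniature stand-in for the full table of 4,999 games with model
# predictions attached; column names are assumptions.
games = pd.DataFrame({
    "name":      ["Gloomhaven", "Codenames", "XCOM", "Tiny Epic Kingdoms"],
    "rating":    [8.9, 7.8, 6.9, 6.67],
    "predicted": [8.1, 7.3, 7.2, 7.19],
})

# "Magic sauce": how far a game's actual rating exceeds what its mechanics,
# genres, and year alone would predict. Negative = worse than expected.
games["magic_sauce"] = games["rating"] - games["predicted"]
print(games.sort_values("magic_sauce", ascending=False))
```

Plotting `magic_sauce` against rank on the full dataset reproduces the figure discussed above.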
Once we get to the top 500, games show a strong tendency to be about half a point better than expected. Somewhat interestingly, newer games actually tend to fall below this mark, indicating that while newer games choose popular categories and mechanics, many fall slightly short of what their fundamentals would suggest.
Older games display the opposite phenomenon, but it’s notable that older games which fail to be exceptional tend to fall down in rank over time as they’re replaced by the latest and greatest. This phenomenon has led some to call board gaming as a hobby “the cult of the new”.
It’s worth noting, however, that because rankings are based on the geek rating, which applies Bayesian ballasting, a new game requires a higher average rating to achieve the same rank as an old game, simply because it will generally not have as many votes yet.
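Bayesian ballasting can be sketched as a weighted average between a game’s raw mean and a fixed number of dummy votes at a prior mean. BGG’s actual constants are not public, so the prior values below are purely illustrative:

```python
def geek_rating(avg_rating, n_votes, prior_mean=5.5, prior_votes=1000):
    """Bayesian average: shrink a game's raw mean rating toward a prior.
    BGG's real constants are undisclosed; these defaults are illustrative."""
    return (avg_rating * n_votes + prior_mean * prior_votes) / (n_votes + prior_votes)

# A new game with few votes is pulled hard toward the prior...
print(round(geek_rating(8.5, 300), 2))    # 6.19
# ...while a long-established game with the same average barely moves.
print(round(geek_rating(8.5, 30000), 2))  # 8.4
```

This is why two games with identical average ratings can sit hundreds of ranks apart.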
We see here that the magic sauce has a noticeably smaller variance than the average score does, since we captured about half the variance in our model.
While the amount of ‘magic sauce’ rises from about -0.7 at rank 5000 to about +0.8 at rank 1 (a difference of 1.5 points), the average rating increases from about 6.0 to about 8.5, a difference of 2.5 points. This indicates that quite a bit of the average rating reflects not the quality of each individual game, but the popularity of its categories and mechanics.
When we sort by the amount of magic sauce each top 500 game has, and take the top 50 entries, we end up with the list on the left. Many of the games on the list are also top 100 games in BGG, but many of them are not.
Some of the choices definitely stand out as games that have something going for them that can’t be explained merely via categories and mechanics. For example, Gloomhaven is widely renowned as one of the best, if not the best, dungeon crawlers ever made, and yet looking only at its dry characteristics, it wouldn’t appear very different from other entries like Descent. In reality, its popularity speaks for itself. If we look at games that aren’t ranked as highly but still make the list, we see some abstracts like YINSH, which continues to have a following even 16 years after release, as well as perennial favorites like Codenames and Resistance.
Of course, no discussion of this list would be complete without talking about Pandemic Legacy. The curious thing about Pandemic Legacy is that its rating is much higher than Pandemic or any of its other variants, probably because of the legacy factor adding something intangible but favorable to the game. Of course, there may be some sampling bias here, since it’s likely that Pandemic Legacy players tend to be those who enjoyed Pandemic. But this effect likely affects many games, as you wouldn’t expect someone to buy a heavy euro as one of the first board games they own.
We also see the emergence of some classics like Crokinole, Go, and MTG.
On the other end of the spectrum, what about overrated games? I actually own a couple of these, namely XCOM and Tiny Epic Kingdoms, and can confirm that they collect dust on the shelf. Tiny Epic Kingdoms, for example, would be expected to achieve a decent rating (7.19) based on being an action-point, area-control, area-movement, variable-player-powers worker-placement game, but it only achieves an average rating of 6.67. Having played it, I find this unsurprising: I think I’ve played it once and have never since had a desire to take it off the shelf.
So what does this all mean?
So what’s the takeaway? BGG ratings have long been used as an invaluable tool in evaluating the merit of board games, and the effect a game’s BGG ranking has on its sales can’t be denied. And yet, it seems that a lot of a game’s rating can be predicted without knowing anything about the game except what mechanisms it uses.
Hobbyists have long argued about the usefulness of BGG ratings, with opinions spanning the entire spectrum, from those who view the rating as everything to those who view it as totally useless. As it turns out, both camps have a point. While a lot of the rating is priced into the mechanisms of the game, the really good games do have an extra bit of magic that you have to play them to experience.
If we account for the predictable, we end up with a list of top games that is somehow both familiar and novel. I’m definitely looking forward to trying some of them, whether finding inspiration in a totally new game or recapturing the magic in an old but forgotten favorite.
As one final thought, here are some histograms of average score that compare games with and without each of the properties.
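A comparison like this can be sketched with matplotlib; the property and both rating distributions below are invented for illustration, with the overlaid, semi-transparent histograms making the shift between the two groups easy to see:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Invented average-rating distributions for games with/without some property.
with_prop = rng.normal(6.8, 0.55, 800)
without_prop = rng.normal(6.4, 0.55, 3000)

fig, ax = plt.subplots()
bins = np.linspace(4.0, 9.0, 40)
ax.hist(without_prop, bins=bins, alpha=0.5, density=True, label="without property")
ax.hist(with_prop, bins=bins, alpha=0.5, density=True, label="with property")
ax.set_xlabel("average rating")
ax.set_ylabel("density")
ax.legend()
fig.savefig("rating_histograms.png")

print(f"difference in means: {with_prop.mean() - without_prop.mean():.2f}")
```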