Data selection

Agnes Smans
Project Data Visualization UHasselt
4 min read · Mar 27, 2021



The data we decided to use covers the 2020 ratings of board games from the BoardGameGeek website, along with general characteristics of each game, such as category, mechanics, weight (difficulty), and number of players. We found the data on Kaggle; it was scraped from the website on 8 July 2020.

Overview of BoardGameGeek top 7 games

Technical description of the data:

The dataset contains numerical variables (average rating, play time, difficulty, minimum age, number of players, wishing, wanting, year of publication) and categorical variables (category, mechanics, designer, artist, name of the game, publisher). We will decide how many observations to use after the design phase. We aim for approximately 500–1000 data points based on the average rank of the board game. Alternatively, we could use the 500 best- and worst-rated games, or include selected sections across the whole dataset.
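The selection options mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field name `rank`, the helper names `select_by_rank` and `select_evenly`, and the toy list of 20 games are hypothetical, and "best and worst" is read here as taking the same number of games from each end of the ranking.

```python
def select_by_rank(games, n, both_ends=False):
    """Return the n best-ranked games, or the n best plus the n worst."""
    ranked = sorted(games, key=lambda g: g["rank"])  # rank 1 is the best game
    if not both_ends:
        return ranked[:n]
    if len(ranked) <= 2 * n:
        return ranked  # too few games to split, so keep everything
    return ranked[:n] + ranked[-n:]


def select_evenly(games, target):
    """Return about `target` games spread evenly over the whole ranking."""
    ranked = sorted(games, key=lambda g: g["rank"])
    step = max(1, len(ranked) // target)
    return ranked[::step][:target]


# Toy data standing in for the real dataset (hypothetical values).
games = [{"name": f"game{i}", "rank": i} for i in range(1, 21)]

top = select_by_rank(games, 3)
extremes = select_by_rank(games, 3, both_ends=True)
spread = select_evenly(games, 5)
```

With the real data, `n` would be around 500 and `target` somewhere between 500 and 1000.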

The original dataset contains 125537 observations and 163 variables. The dimensions of the final dataset are subject to change. Some of the variables that we might use are listed below.

Numerical variables:

  • Average: Average rating over all ratings for the game
  • Usersrated: Number of users that rated the game
  • Weight: Difficulty of the game rated by the users
  • Minplayers, maxplayers: Minimum and maximum number of players who can play the game together
  • Minplaytime, maxplaytime: Minimum and maximum playtime of the game as stated by the publisher
  • Owned: Number of people stating they own the game
  • Wishing: Number of users that have this game in their wishlist
  • Wanting: Number of users who want to receive this game by trading
  • Trading: Number of users that made this game available for trading
  • Yearpublished: The publishing year
  • Abstractrank: The overall rank given to the game by user ratings and superseding all other ranks
  • Different ranks: different ranking categories like boardgamerank, family rank, partyrank, customizable rank

Categorical variables:

  • Category: the main categories by which the game can be described (up to 10 different variables), such as economic, civil war, city building etc.
  • Domain: The genres they are listed in (up to 3 different variables), such as strategy game, thematic game etc.
  • Mechanic: The main mechanics in the game (up to 10 different variables), such as dice rolling, drafting, income, cooperative game etc.
  • Family: The thematic family (up to 10 different variables), such as Star Wars, Game of Thrones etc.
  • Publisher: different publishers operating in different countries (up to 10 variables), such as Ares Games, Roxley etc.
  • Honor: received honors (up to 10 different variables)
  • Expansion: available expansions for this game (up to 10 variables)
  • Version: different versions (up to 10 variables), such as country-specific editions, version 1.0 etc.
  • Accessory: available accessories (up to 10 variables), such as props or special cards
  • Designer: name of the designers (up to 10 variables)
  • Artist: names of the artists (up to 10 variables)
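Because each categorical attribute is spread over up to 10 separate variables, it can be convenient to gather them into one list per game before counting or plotting. The sketch below assumes hypothetical column names of the form `category1` to `category10`; the actual column names in the Kaggle file may differ, and `collect_values` is a helper made up for this example.

```python
def collect_values(row, prefix, max_columns=10):
    """Gather the non-empty values of prefix1..prefixN from one record."""
    values = []
    for i in range(1, max_columns + 1):
        value = row.get(f"{prefix}{i}", "")
        if value:  # skip missing or empty cells
            values.append(value)
    return values


# One made-up record in the assumed wide layout.
row = {
    "name": "Example Game",
    "category1": "Economic",
    "category2": "City Building",
    "category3": "",
    "mechanic1": "Dice Rolling",
}

categories = collect_values(row, "category")
mechanics = collect_values(row, "mechanic")
```

The same helper would work for the other multi-column attributes (family, publisher, designer and so on) by changing the prefix.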

Distributions of some interesting variables / interactions between some variables:

We can see a positive relationship between the weight and the average rating: games that users rate as more difficult tend to receive higher average ratings.
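One way to put a number on such a relationship is the Pearson correlation coefficient. The sketch below is not the analysis behind the graph; the `pearson` helper is written for this example and the weight and rating values are invented purely to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values, not taken from the dataset.
weights = [1.2, 1.8, 2.5, 3.1, 3.9]
averages = [5.9, 6.4, 6.8, 7.3, 7.9]

r = pearson(weights, averages)  # a value close to +1 means a strong positive relationship
```

A coefficient near 0 would indicate no linear relationship, and a negative value would mean heavier games tend to be rated lower.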

The distribution in this graph resembles a Gaussian, slightly skewed towards the higher values. We also note that the ratings are concentrated around whole numbers, likely because users enter the ratings themselves and tend to pick whole numbers.

From this histogram of the categories, we can see that some categories are very common and others are rare. Note that every category appears at least once in the dataset, but the rarest ones are not visible in this histogram because the most frequent categories dominate the scale. Similar distributions were observed for the mechanics.
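The counts behind such a histogram can be obtained with a simple frequency table. The following sketch uses a handful of made-up games to show how a few categories dominate while others occur only once; the category lists are illustrative and not taken from the dataset.

```python
from collections import Counter

# Made-up category lists, one per game.
games_categories = [
    ["Card Game", "Fantasy"],
    ["Card Game", "Economic"],
    ["Card Game", "Wargame"],
    ["Wargame", "Civil War"],
    ["Card Game"],
]

# Count how often each category occurs across all games.
counts = Counter(cat for cats in games_categories for cat in cats)

most_common = counts.most_common(2)  # the categories that dominate the plot
rare = sorted(cat for cat, c in counts.items() if c == 1)  # easy to miss in a histogram
```

Listing the rare categories separately, as in `rare`, is one way to keep them visible when the frequent ones set the scale of the plot.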