Web scraping for board game analysis

Aug 3, 2021

Tokaido is one of my favorite light board games. It is easy to pick up, looks great, and makes for a relaxing play. Like many modern board games, Tokaido has asymmetric powers. Each player starts the game by selecting a unique traveler, which grants an ability and starting money. Among experienced players, it is well known that certain travelers are much stronger than others. Since Tokaido is a game where making the right decisions means gaining small advantages, it is significant when a particular traveler may lead to gaining a couple of extra points or coins. It is popular to rank the travelers, such as a great ranking from a poll in this Tokaido group. However, it remains unclear how traveler choice translates to actual win rates.

To quantify the traveler balance in competitive Tokaido games, we can turn to the wealth of replay data available on the popular online board game site Board Game Arena (bga). We aim to collect results from the best players, recording each player’s traveler, starting position, points scored, achievements won, and final placement. Tokaido enthusiasts just interested in the traveler rankings should feel free to skip to the results section.

Please note that web scraping is not allowed by bga’s terms of service, so users may be banned for scraping. Scraping can incur significant costs for the website. Contact a bga admin before attempting anything similar.

Finding the requests

Our first steps involve finding the http requests relevant for the desired data. Browsing the bga website, we note that players’ traveler information is available in the game log for a table replay, which appears at a url like this.

We don’t care about the rest of the page’s content, so we are looking for the specific request whose response gives the replay data. A browser’s developer tools give an easy way to do this. On Chrome, going to the network tab of developer tools and then reloading the above url, we see that the replay information comes from this request.

Using developer tools to find the request url and parameters

The data from that url allows us to retrieve traveler info and results for any table specified by its id. Browsing bga further shows that there are pages that allow you to search a player’s game history filtered by game, opponent, date, and whether the game is finished using a url such as this, which will let us get the desired table ids. Additionally, bga has pages displaying the top players for each arena season, for instance here. We can again look at the network log to find the relevant requests for accessing the displayed data.

Data flow

Now, let us come up with some functions to break up the process of getting our data. We can use the simple python module requests to send the requests.

Get the ids of top players from /halloffame/halloffame/getRanking.html with get_player_by_rank(rank, season).
Find table ids for those players’ finished games from /gamestats/gamestats/getGames.html with get_arena_tables(player_id, start_date, end_date).
Save the raw replay logs of those tables from /archive/archive/logs.html with save_table_replay(table_id, file_name).
Later, process the replay logs and organize the results.
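
One way to wire these steps together is sketched below. The three step functions are passed in as arguments, so the loop can be shown without guessing at bga’s request parameters or response formats. The name collect_replays, the out_dir folder, the .json file naming, and the use of strings for dates are all hypothetical choices made for illustration, not the code from the repository. Table ids are de-duplicated because two top players can appear at the same table.

```python
from typing import Callable, Iterable, List


def collect_replays(
    ranks: Iterable[int],
    season: int,
    start_date: str,
    end_date: str,
    get_player_by_rank: Callable[[int, int], int],
    get_arena_tables: Callable[[int, str, str], Iterable[int]],
    save_table_replay: Callable[[int, str], None],
    out_dir: str = "replays",  # hypothetical output folder
) -> List[int]:
    """Run the three fetch steps for each rank; return the saved table ids."""
    seen = set()
    for rank in ranks:
        player_id = get_player_by_rank(rank, season)
        for table_id in get_arena_tables(player_id, start_date, end_date):
            if table_id in seen:
                continue  # this table was already saved via another player
            seen.add(table_id)
            save_table_replay(table_id, f"{out_dir}/{table_id}.json")
    return sorted(seen)
```

Passing the step functions in also makes it easy to try the loop with stand-in functions before sending any real requests.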

Logging in

If you tried visiting some of the above urls, you may have already noticed that bga restricts access to replay logs. Only validated accounts that have finished at least two games and are a day old can view replays. We will need to make use of requests Session objects, which have persistent cookies, so that replays can be accessed after logging in with a valid account.

To log in on a Session, we must locate the request used to log in to bga. Looking at the network tab when logging in, we see that a post request to /account/account/login.html with the username, password, and something known as a csrf token is required. A csrf token is a value provided to the client in order to ensure that a request is indeed intended by the client and not constructed by a malicious third party. On bga, the csrf token is included in an html element on the login page. We can use Beautiful Soup to parse the content of the login page and locate the token.

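import requests
from bs4 import BeautifulSoup

# A Session keeps cookies between requests, so a login persists.
sess = requests.Session()
resp = sess.get('https://boardgamearena.com/account')
soup = BeautifulSoup(resp.content, 'html.parser')
csrf_token = soup.find(id='csrf_token')['value']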

Bga caps the number of replays each account may access due to the cost of saving and serving the logs. For this project, I ended up making many bga accounts and playing two short games on each in order to access the replays. It is possible to create a new Session, log in, and continue as soon as the current account runs out of replay access. Note that I no longer endorse or recommend this method. If you are interested in collecting many replays, reach out to bga staff first.

Parsing the replays

After performing the previous steps, we have collected many replay logs. Each replay contains a list of moves, and the formats of the moves are quite messy. Luckily, we are only interested in extracting a few key pieces of information. Sifting through a few logs reveals that each player’s traveler and start position can be found in a move of type ‘travelerChosen’, and the final results such as points, achievements, and placement appear in the ‘results’ section of the last move.
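
The extraction can be sketched roughly as follows. Only the ‘travelerChosen’ move type and the ‘results’ section come from the logs described above. Everything else here is an assumption made for illustration: that a replay is a list of dicts, that each move has a "type" key, and the "player", "traveler", and "position" field names. The real logs are messier and will need different lookups.

```python
from typing import Any, Dict, List


def parse_replay(moves: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per player from a list of move dicts.

    Assumed (hypothetical) layout: a 'travelerChosen' move carries
    "player", "traveler" and "position"; the last move carries
    "results", a dict keyed by player.
    """
    rows: Dict[Any, Dict[str, Any]] = {}
    for move in moves:
        if move.get("type") == "travelerChosen":
            rows[move["player"]] = {
                "player": move["player"],
                "traveler": move["traveler"],
                "position": move["position"],
            }
    # Final results are read from the last move only.
    last_move = moves[-1] if moves else {}
    for player, result in last_move.get("results", {}).items():
        rows.setdefault(player, {"player": player}).update(result)
    return list(rows.values())
```

Each returned dict is one player’s result for one game, which is a convenient shape for building a table later.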

A natural way to organize the data is a pandas dataframe with each row as one player’s result for a single game. This allows us to easily view basic statistics for a particular traveler or starting position.
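
As a minimal sketch of that layout, the snippet below builds a small dataframe and summarizes it by traveler and by starting position. The column names and the summarize helper are hypothetical, and the four rows are made up for illustration; they are not values from the collected data.

```python
import pandas as pd

# Made-up rows for illustration only; not values from the dataset.
results = pd.DataFrame(
    [
        {"traveler": "Hiroshige", "position": 4, "points": 70, "placement": 3.0},
        {"traveler": "Hiroshige", "position": 1, "points": 66, "placement": 1.0},
        {"traveler": "Satsuki", "position": 2, "points": 72, "placement": 2.0},
        {"traveler": "Satsuki", "position": 3, "points": 68, "placement": 0.0},
    ]
)


def summarize(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Mean and standard deviation of points and placement per group."""
    return df.groupby(by)[["points", "placement"]].agg(["mean", "std"])


by_traveler = summarize(results, "traveler")
by_position = summarize(results, "position")
```

Because every row is a single player’s result, the same helper works for any grouping column without reshaping the data.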

Results

Arena mode is a standardized competitive format on bga, which makes arena mode games ideal choices for data analysis. I collected all ranked season 5 arena mode games for the top 10 players for that season, which totaled 1508 results over 1219 games. The season 5 format was standard four-player Tokaido with gastronomy, which means that there is one meal per player available at each inn. That rule set is likely the most popular way to play Tokaido online.

First, take a look at how starting positions impacted the players’ results. ‘Placement’ is the number of players defeated in the game, so a sole first place is 3 while two players tied for first is 2.5 in a four-player game. ‘Position’ is the spot in the initial inn, so the player with position 4 moved first and position 1 moved last.
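
To make the ‘placement’ definition concrete, here is a small sketch that derives it from final ranks (1 for first place, with tied players sharing a rank). The function name is hypothetical, and counting each tie as half a defeat is an assumption that matches the two-way example above; bga’s own results may handle larger ties differently.

```python
from typing import List


def placements_from_ranks(ranks: List[int]) -> List[float]:
    """Players defeated by each player; a tie counts as half a defeat."""
    out = []
    for i, rank in enumerate(ranks):
        beaten = sum(1 for j, other in enumerate(ranks) if j != i and other > rank)
        tied = sum(1 for j, other in enumerate(ranks) if j != i and other == rank)
        out.append(beaten + 0.5 * tied)
    return out
```

With four distinct ranks, the winner gets 3 and last place gets 0. With two players tied for first, each of them gets 2.5.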

Averages with standard deviations by starting position

Note that going last was about a 2 point disadvantage and led to placing about a quarter spot lower. The difference between going first, second, or third was small. Preparations, the rule aimed at balancing the starting positions, is rarely used since it way overcompensates for the disadvantage of going later. The data suggests that giving the last to depart a single additional coin may be the best way to improve balance.

Next, check out the rankings by traveler. The ‘rate’ indicates the approximate fraction of the time the players chose that traveler when given the opportunity. This value is 5 times the frequency, since each player was given the choice of two unique, random travelers out of the ten.
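
The factor of 5 comes from the offer probability: with two of ten travelers offered, any given traveler is among a player’s options 2/10 = 1/5 of the time, so dividing the observed frequency by 1/5 estimates how often it was picked when available. A small sketch, with a hypothetical function name and made-up counts in the usage note:

```python
def pick_rate(
    times_chosen: int,
    total_results: int,
    n_travelers: int = 10,
    n_offered: int = 2,
) -> float:
    """Approximate fraction of offers in which a traveler was chosen."""
    frequency = times_chosen / total_results
    offer_probability = n_offered / n_travelers  # 2/10 = 1/5 in this format
    return frequency / offer_probability
```

For example, a traveler chosen in 100 of 1000 results has a frequency of 0.1 and a rate of about 0.5, meaning it was taken roughly half the time it was offered.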

Averages with standard deviations by traveler

The traveler pick rates matched their performance with some deviations. Compared to always choosing the better placing of their two options, the players chose Chuubei 0.2 less and Hiroshige 0.3 less often than their performances would indicate, which is probably related to the fact that Chuubei’s and Hiroshige’s performances were better than other rankings suggest. Hiroshige is interesting in that he scored the second lowest on average, but was fifth in terms of average placement. This is likely because Hiroshige’s ability allows him to get many achievements, which prevents other players from getting those achievements and thus reduces their expected scores. Yoshiyasu was quite a bit more popular than he should be. Satsuki was not as bad as often suggested. Like Hiroshige, her placement may benefit from blocking others from achievements, since she often wins the gourmet achievement.
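
One possible reading of the “always choosing the better placing option” baseline is sketched below. It assumes the travelers are strictly ranked by average placement and that the other traveler offered is equally likely to be any of the remaining nine. Under those assumptions, a traveler ranked k out of 10 would be picked whenever the alternative ranks worse. The function name is hypothetical, and this may not be exactly how the baseline in the comparison above was computed.

```python
def ideal_pick_rate(rank: int, n_travelers: int = 10) -> float:
    """Pick rate for the traveler at `rank` (1 = best placing) if players
    always took the better placing of their two offered travelers."""
    worse_alternatives = n_travelers - rank
    return worse_alternatives / (n_travelers - 1)
```

Subtracting this baseline from an observed rate gives a deviation of the kind quoted above: a negative value means the traveler was picked less often than its placement would justify.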

The most surprising finding may be the absolutely dismal performance of Sasayakko. Playing as Sasayakko was such a massive disadvantage that among top players, who averaged near second place before assigning travelers, starting with Sasayakko meant averaging third. Keep in mind that the sample size is relatively small because she was chosen the least often.

Traveler strength can be nicely summarized in a tier list.

Difference between tiers is approximately 1 point, except for F tier

Future Work

This kind of analysis could be applied to similar projects. Simply changing season 5 to season 4 will give traveler rankings for Tokaido Crossroads. Also, filtering the tables more carefully, such as looking at tables with all highly-ranked players, may yield some interesting differences. With some additional effort digging through replay files, the same analysis could be applied to games like 7 Wonders or Terra Mystica.

Thanks for reading! The code and data I used are available on GitHub.

Edits: Updated to mention that scraping can negatively impact bga and to encourage those considering similar projects to first seek approval from bga admins.