Discovering Chess Openings in Grandmasters’ Games using Python and DuckDB

Octavian Zarzu
7 min readApr 11, 2023

--

Do today’s grandmasters rely on conventional chess openings? Let’s find out by analyzing the top 100 live blitz players’ games as ranked by chess.com using Python and DuckDB.

The process consists of these steps:

  • Retrieve the top 100 players’ list.
  • Obtain the game records for each player from the past three months.
  • Process individual game moves and store each player’s (black and white) moves in separate arrays within a Python dataframe.
  • Construct a static DuckDB table featuring well-known chess openings.
  • Join the Python dataframe and the DuckDB table based on the chess moves arrays while slicing the arrays to ensure they have similar lengths.

This should be fun, so let’s dive right into it!

Setup

We need the following Python packages:

  • chessdotcom — for fetching the list of games
  • chess — for processing each game moves
  • pandas — for storing the games’ moves as a Dataframe
  • duckdb — for storing popular chess openings and doing array lookups between a DuckDB table and a Python dataframe
  • datetime, io — minor processing
import chessdotcom as cdc
import datetime
import chess.pgn
import io
import pandas
import re
import duckdb

Processing games’ moves

(To follow along, you can open this Google Collab file that contains all the cells executed in this article)

To kick off our analysis, we’ll begin by fetching the top players. We’ll utilize the get_leaderboards method from the chessdotcom package.

num_players=100
leaderboards = cdc.get_leaderboards().json
top_blitz_players = leaderboards['leaderboards']['live_blitz'][:num_players]

All we require are the player names. Let’s encapsulate the above code in a function and have it return a list containing these names.

def get_top_players(num_players=100):
leaderboards = cdc.get_leaderboards().json
top_blitz_players = leaderboards['leaderboards']['live_blitz'][:num_players]
return [player['username'] for player in top_blitz_players]

Moving forward, our goal is to determine the games each player participated in during the past three months. To achieve this, we can use the get_player_games_by_month method in the chessdotcom package. This method takes a username and a specific month as input, and returns an object containing all games for that month, including the PGN (Portable Game Notation) for each game. We will want to control how many months we want to go back, so let’s define a variable called months_ago to control the timeframe:

def get_games_for_player(username, months_ago=3):
now = datetime.datetime.now()
games = []

for i in range(months_ago):
month = (now - datetime.timedelta(days=(i * 30)))
games_data = cdc.get_player_games_by_month(username, datetime_obj=month).json
games.extend(games_data.get('games', []))

return games

Finding each game moves

Unfortunately, here is where we cannot use the chess.com package anymore, and we need to switch to the chess package. This package contains a read_game function that takes the game PGN (Portable Game Notation) as input and can produce a human-readable representation of each move.

We will store each player’s moves in two separate lists for black and white pieces:

def process_moves(pgn_text):
pgn_text = io.StringIO(pgn_text)
pgn = chess.pgn.read_game(pgn_text)

white_moves = []
black_moves = []
turn = 1 # 1 for white, -1 for black

board = pgn.board()

for move in pgn.mainline_moves():
san_move = board.san(move)
if turn == 1:
white_moves.append(san_move)
else:
black_moves.append(san_move)

board.push(move)
turn *= -1

return white_moves, black_moves

Great! Now, let’s combine all the pieces we’ve created so far and populate a dataframe with the moves from all the games:

num_players = 1
num_months = 3

top_players = get_top_players(num_players)

all_games = []

for player in top_players:
games = get_games_for_player(player, num_months)
for game in games:
game_id = game['url'].split('/')[-1]
pgn_text = game['pgn']
white_moves, black_moves = process_moves(pgn_text)

all_games.append({
'game_id': game_id,
'white': game['white']['username'],
'black': game['black']['username'],
'white_moves': white_moves,
'black_moves': black_moves,
})

all_games_df = pd.DataFrame(all_games)

Lookup popular chess openings

We now have a dataframe containing all the player’s moves for the past three months. We are going to transition to DuckDB to build a static list of popular. Afterward, we’ll perform a join between the above dataframe and the DuckDB table:

We’ll utilize some of the openings listed here to populate the table.

and split the openings’ moves between white and black.

con = duckdb.connect()

ctas_popular_openings_query = """
CREATE TABLE popular_openings AS
SELECT * FROM (
VALUES
('Italian Game', ARRAY['e4', 'Nf3', 'Bc4'], ARRAY['e5', 'Nc6']),
('Sicilian Defense', ARRAY['e4'], ARRAY['c5']),
('French Defense', ARRAY['e4'], ARRAY['e6']),
('Caro-Kann Defense', ARRAY['e4'], ARRAY['c6']),
('Ruy Lopez (Spanish Opening)', ARRAY['e4', 'Nf3', 'Bb5'], ARRAY['e5', 'Nc6']),
('Slav Defense', ARRAY['d4', 'c4'], ARRAY['d5', 'c6']),
('Queens Gambit Declined', ARRAY['d4', 'c4'], ARRAY['d5', 'e6']),
('Kings Indian Defense', ARRAY['d4', 'c4', 'Nc3'], ARRAY['Nf6', 'g6', 'Bg7']),
('Nimzo-Indian Defense', ARRAY['d4', 'c4', 'Nc3'], ARRAY['Nf6', 'e6', 'Bb4']),
('Grunfeld Defense', ARRAY['d4', 'c4', 'Nc3'], ARRAY['Nf6', 'g6', 'd5']),
('English Opening', ARRAY['c4'], ARRAY[]),
('Reti Opening', ARRAY['Nf3'], ARRAY[]),
('Dutch Defense', ARRAY['d4'], ARRAY['f5']),
('Scandinavian Defense (Center Counter Defense)', ARRAY['e4'], ARRAY['d5']),
('Pirc Defense', ARRAY['e4', 'd4'], ARRAY['d6', 'Nf6']),
('Alekhines Defense', ARRAY['e4'], ARRAY['Nf6']),
('Modern Defense', ARRAY['e4', 'd4'], ARRAY['g6', 'Bg7']),
('Kings Gambit', ARRAY['e4', 'f4'], ARRAY['e5']),
('Vienna Game', ARRAY['e4', 'Nc3'], ARRAY['e5']),
('Philidor Defense', ARRAY['e4', 'Nf3'], ARRAY['e5', 'd6'])
) AS t (opening, white_moves, black_moves);
"""

con.execute(ctas_popular_openings_query)

Awesome! We are only a step away from answering our initial question: Do top players leverage common chess openings? To answer the question we need to join the two objects we just built.

But how can we perform the join between the two arrays?

We cannot slice the all_mvoes_df.white_moves (and black_moves) by a static value — 2 or 3 — because some of the openings have a length of 2 moves, whereas others have a length of 3 or even more.

One option would be to leverage the length of the popular_openings.white_moves (and black_moves) arrays and slice the all_moves_df dataframe black_ and white_moves arrays by the length of the chess_openings table black_ and white_moves arrays.

join_query = """
SELECT *
FROM all_games_df agdf
INNER JOIN popular_openings po
ON agdf.white_moves[1:len(po.white_moves)] = po.white_moves
AND agdf.black_moves[1:len(po.black_moves)] = po.black_moves
"""

con.execute(join_query).df()

🤯 So let’s just recap what happened:

  • We joined a Python dataframe (all_games_df) with a duckdb table (popular_openings) in the same query!
  • We sliced the Dataframe array columns, white_moves, and black_moves by the length of the similarly named arrays from the duckdb table. And then perform the join on the sliced arrays!

LGTM! So let’s now run the script with the num_players variable set to 100.

Notes

Two important notes:

  • To handle games without associated or valid PGNs, we can add a check in the process_moves() function:
main(): pgn_text = game['pgn'] -> pgn_text = game.get('pgn', '') # replace
process_moves(): if pgn is None: return [], [] # add
  • Some of the rows are duplicated in the all_games_df dataframe — a game shows up twice if the two players are in the top 100:

In order to remove the duplicates we will use the QUALIFY clause, which allows us to filter WINDOW functions results:

Our dataset has been reduced from 22,377 games to 18,786. Now, let’s merge with dataframe with the popular openings table and perform a LEFT JOIN — this will allow us to examine the number of games that utilize a common opening and those that do not::

join_query = """
SELECT ifnull(po.opening, 'No common opening') as opening, COUNT(*) as count
FROM (
SELECT *
FROM all_games_df
QUALIFY ROW_NUMBER() OVER (PARTITION BY game_id) = 1
) agdf
LEFT JOIN popular_openings po ON agdf.white_moves[1:len(po.white_moves)] = po.white_moves AND agdf.black_moves[1:len(po.black_moves)] = po.black_moves
GROUP BY ALL
ORDER BY COUNT(*) DESC
"""

con.execute(join_query).df()

Of the 18,876 games, 10,605 (or approximately 56%) rely on one of the 20 chess openings mentioned above.

Plotting only those games (56%) that started with one of the 20 openings (switching back from left to inner join), we can visualize which ones are the most frequently used openings among them:

con.execute(join_query).df().plot(x='opening', y='count', kind='barh')

Conclusions

We’ve done a lot! We utilized two chess-related public Python packages to build a dataframe containing the moves of all games for a specific query— the games played by the first 100 players ranked on chess.com in live blitz over the past 3 months. We then created a table in DuckDB, joined it with the dataframe, slice arrays, and used the QUALIFY clause to eliminate duplicates.

I have created a Google Collab notebook with all the code from above if you want to run it yourself.

You can look up individual games from the dataset by replacing {game_id} with the actual game_id in the following URL and replay the moves:

https://www.chess.com/game/live/{game_id}

For example: https://www.chess.com/game/live/74367799555

Have fun and happy SQL’ing!

--

--