Analyzing Chess with Pandas to Learn from the Best and Raise My Rating.

Crawford Collins
Nov 28, 2018

Every time I play a game of chess I feel that I am behind after one move. So, I have decided to do some chess openings data analysis and improve my game.

A screenshot of a game I lost as white, analysis by

The line chart along the bottom shows the likelihood of winning. If it is shaded white and above the middle, white has the better chance of winning. If the line drops below the middle and the black background shows, black has the better chance. On the right is a list of moves, and the same box shows the advantage after each one: positive favors white, negative favors black. Also included is the notation for inaccuracies ‘?!’, mistakes ‘?’, and blunders ‘??’.

As you can see my second move ‘d3’ earned a ‘?!’. Let’s explore this line and see how I can get off to a better start.


Unfortunately, I could not find a nice dataset to use in Python, so I had to make my own. The source I used is a list of openings from high-Elo games. By default, the database shows only openings that contain over 100 recorded games, for a total of 1884. I wanted to get this data into Python so I could use it. I did not see any way to download it as one file, so I copied and pasted it into a Google Sheet and saved it as a CSV. If you want to follow along with the data wrangling, look here (rework the data.ipynb). I had to infer the color field and calculate the wins for each color.
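The wrangling notebook is not reproduced here, but one step it plausibly needs is splitting each opening's move text into the per-move columns used later (move1w, move1b, move2w, and so on). Here is a minimal sketch, assuming the raw sheet stores moves as text like 1.e4 Nf6 2.e5 Nd5. The helper name split_moves and the max_moves limit are hypothetical, not taken from the notebook.

```python
import re
import pandas as pd

def split_moves(line, max_moves=3):
    """Turn '1.e4 Nf6 2.e5 Nd5' into {'move1w': 'e4', 'move1b': 'Nf6', ...}."""
    # Replace move numbers such as '1.' or '12.' with a space, then split on whitespace.
    tokens = re.sub(r'\d+\.', ' ', line).split()
    columns = {}
    for i, token in enumerate(tokens[:2 * max_moves]):
        move_number = i // 2 + 1           # 1, 1, 2, 2, 3, 3, ...
        side = 'w' if i % 2 == 0 else 'b'  # white moves first
        columns[f'move{move_number}{side}'] = token
    return columns

# The two move sequences quoted in this article for the Alekhine Defense.
lines = ['1.e4 Nf6', '1.e4 Nf6 2.e5 Nd5 3.d4 d6 4.c4 Nb6 5.exd6']
moves = pd.DataFrame([split_moves(line) for line in lines])
print(moves)
```

Shorter lines simply get missing values in the later move columns, which is convenient when filtering on the first move or two.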

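This data does not double count opening moves: each game is listed only under its most specific opening. For example, “Alekhine Defense, General” (1.e4 Nf6) has 524 games listed, while the “Alekhine Defense, Exchange Variation” (1.e4 Nf6 2.e5 Nd5 3.d4 d6 4.c4 Nb6 5.exd6) has 6,485. The longer line starts with the same moves yet has more games, so its games cannot also be counted under the general entry.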
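Because games are not double counted, the total for a position such as 1.e4 Nf6 has to be built by summing every row that starts with those moves. A small sketch using only the two Alekhine rows quoted above (the real data has more variations, so the real total would be larger):

```python
import pandas as pd

# Only the two rows mentioned in the text; the other Alekhine variations are omitted.
toy = pd.DataFrame({
    'move1w': ['e4', 'e4'],
    'move1b': ['Nf6', 'Nf6'],
    'move2w': [None, 'e5'],
    'Num Games': [524, 6485],
})

# Sum the game counts of every row whose first moves are 1.e4 Nf6.
starts_alekhine = (toy['move1w'] == 'e4') & (toy['move1b'] == 'Nf6')
alekhine_total = toy.loc[starts_alekhine, 'Num Games'].sum()
print(alekhine_total)  # 7009
```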

After working on the data I had 25 columns and 1884 rows.


Now it is time to get started on the analysis. Our goal is to find the strongest move for white after 1. e4 d5.

Since we already saved the data as a CSV, we load it using read_csv. I used the index_col option so the saved index is not read back in as an extra column.

import numpy as np
import pandas as pd

df = pd.read_csv('chessdb0-1', index_col=0)
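To see what index_col=0 changes, here is a small sketch that reads an in-memory CSV both ways. The three-column sample text is made up for illustration and stands in for the real file.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV that was saved together with its index column.
csv_text = ",move1w,move1b,Num Games\n0,e4,d5,100\n1,e4,c5,200\n"

with_default = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_default.columns.tolist())  # ['Unnamed: 0', 'move1w', 'move1b', 'Num Games']
print(with_index.columns.tolist())    # ['move1w', 'move1b', 'Num Games']
```

Without the option, the old index comes back as a column named 'Unnamed: 0' next to the new default index.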

To get the line we want to analyze, we create a new DataFrame by slicing the original. The code keeps all the columns of the original DataFrame but only the rows where the first white move is e4 and the first black move is d5.

e4d5 = df[ (df['move1w']=='e4') & (df['move1b']=='d5') ] 
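The filter combines two boolean Series with &, which is why each comparison needs its own parentheses. A short sketch on made-up rows shows the intermediate masks:

```python
import pandas as pd

# Invented rows, only to show how the mask works.
toy = pd.DataFrame({
    'move1w': ['e4', 'e4', 'd4', 'e4'],
    'move1b': ['d5', 'c5', 'd5', 'd5'],
    'move2w': ['exd5', 'Nf3', 'c4', 'exd5'],
})

is_e4 = toy['move1w'] == 'e4'   # True where white opened with e4
is_d5 = toy['move1b'] == 'd5'   # True where black replied d5
both = is_e4 & is_d5            # element-wise AND of the two masks
subset = toy[both]

print(both.tolist())  # [True, False, False, True]
print(subset.shape)   # (2, 3)
```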

Checking the shape attribute shows we have a 22-row object. A twenty-two row table is going to be hard to decipher, and I cannot think of a visualization that will help. One problem is that there are many rows with the same value for ‘move2w’. We need a way to summarize these values. The way that comes naturally to me is the groupby method. We simply attach it to the end of our filtered DataFrame.

e4d5 = df[(df['move1w']=='e4') & (df['move1b']=='d5')]\
    .groupby('move2w')

This returns a DataFrameGroupBy object which is not easy to use. To turn the object back into a DataFrame we can aggregate some of the values. To find the strongest move for white, we aggregate the ‘Num Games’, ‘White_Wins’, and ‘Black_Wins’ columns.

e4d5 = df[(df['move1w']=='e4') & (df['move1b']=='d5')] \
    .groupby('move2w') \
    .agg({'Num Games': np.sum, 'White_Wins': np.sum, \
          'Black_Wins': np.sum})

Together, these calls return a DataFrame indexed by ‘move2w’ that contains the columns named in the agg call. The statement started to run a little long, so I used the “\” symbol, the line continuation character, to split it across lines.
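An alternative to backslashes is to wrap the whole chain in parentheses, which lets Python continue the statement across lines on its own. The sketch below runs on invented numbers (including a second reply, Nc3, that exists here only to make the grouping visible) and passes the string 'sum' to agg, which pandas also accepts.

```python
import pandas as pd

# Made-up counts for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    'move1w': ['e4', 'e4', 'e4'],
    'move1b': ['d5', 'd5', 'd5'],
    'move2w': ['exd5', 'exd5', 'Nc3'],
    'Num Games': [100, 50, 10],
    'White_Wins': [40, 20, 3],
    'Black_Wins': [30, 15, 4],
})

# groupby alone returns a lazy GroupBy object, not a table.
grouped = toy[(toy['move1w'] == 'e4') & (toy['move1b'] == 'd5')].groupby('move2w')
print(type(grouped).__name__)  # DataFrameGroupBy

# Aggregating turns it back into a DataFrame with one row per second move.
summary = (
    toy[(toy['move1w'] == 'e4') & (toy['move1b'] == 'd5')]
    .groupby('move2w')
    .agg({'Num Games': 'sum', 'White_Wins': 'sum', 'Black_Wins': 'sum'})
)
print(summary)
```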

Let’s visualize our data.

e4d5.plot(y= ['White_Wins', 'Black_Wins'],kind='bar')
Number of games won for white and black by move.

There seems to be only one move after the 1. e4 d5 sequence at high Elo, and that is exd5, capturing the pawn. Our DataFrame shows that there were 49,541 games in our set that started 1. e4 d5, and in every single game white responded the same way. It should be automatic to play 2. exd5 next time.

Bonus Analysis

Thanks for making it through all the code stuff. If you want more chess analysis, it’s included below.

What are the strongest opening moves for white?

w1 = df.groupby('move1w').agg({'Num Games': np.sum, 'White_Wins': np.sum, 'Black_Wins': np.sum})
w1['White_odds'] = w1['White_Wins'] / w1['Black_Wins']
w1.sort_values('White_odds', ascending=False)

White_odds is White_Wins divided by Black_Wins. It looks like g3 is one of the strongest for white but is rarely played. d4 is the second strongest and second most likely to be played. e4 and d4 overshadow the stronger openings c4 and g3.

The strongest opening moves for white. Showing only those with win ratios greater than 1.0.
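The "greater than 1.0" filter in the caption can be sketched as a boolean mask on the White_odds column. The numbers below are invented stand-ins for the real aggregated table, and the names w1_toy and favorable are hypothetical.

```python
import pandas as pd

# Invented aggregate counts, indexed by white's first move.
w1_toy = pd.DataFrame(
    {
        'Num Games': [1000, 800, 50],
        'White_Wins': [400, 330, 15],
        'Black_Wins': [300, 250, 20],
    },
    index=pd.Index(['e4', 'd4', 'a4'], name='move1w'),
)
w1_toy['White_odds'] = w1_toy['White_Wins'] / w1_toy['Black_Wins']

# Keep only openings where white wins more often than black, strongest first.
favorable = w1_toy[w1_toy['White_odds'] > 1.0].sort_values('White_odds', ascending=False)
print(favorable)
```

A ratio above 1.0 means white won more games than black with that opening; draws are not part of this ratio.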

What is the strongest move for black after e4?

If white starts e4, the odds are in white’s favor. But playing c5 gives black the best chance of avoiding a loss. c5 is also the most common reply in these high-rated games. The second most common is e5, even though it is statistically one of the weakest moves.

     White_odds  Num Games
c5     1.080567     737822
g6     1.111637      56518
Nc6    1.206246       9318
Nf6    1.209108      34710
e6     1.303615     228091
c6     1.308556     129053
d6     1.337409      68605
e5     1.399075     370196
d5     1.399592      51575
a6     1.458711       1283
b6     1.474684       4594
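The code behind this table is not shown, but a table of this shape could be produced by the same filter, group, and aggregate pattern used earlier, this time grouping on black's first move. Here is a sketch on invented rows, assuming the same column names as the main DataFrame; the name b1 is hypothetical.

```python
import pandas as pd

# Invented rows; the real data has many more openings per reply.
toy = pd.DataFrame({
    'move1w': ['e4', 'e4', 'e4', 'd4'],
    'move1b': ['c5', 'c5', 'e5', 'd5'],
    'Num Games': [300, 200, 400, 100],
    'White_Wins': [100, 62, 140, 35],
    'Black_Wins': [95, 55, 100, 30],
})

b1 = (
    toy[toy['move1w'] == 'e4']   # only games that start 1. e4
    .groupby('move1b')           # one row per black reply
    .agg({'Num Games': 'sum', 'White_Wins': 'sum', 'Black_Wins': 'sum'})
)
b1['White_odds'] = b1['White_Wins'] / b1['Black_Wins']

# Lowest White_odds first: the replies that hold up best for black.
b1 = b1.sort_values('White_odds')[['White_odds', 'Num Games']]
print(b1)
```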

