# Analyzing Chess with Pandas to Learn from the Best and Raise My Rating

Every time I play a game of chess I feel that I am behind after one move. So, I have decided to do some chess openings data analysis and improve my game.

The line chart along the bottom shows the likelihood of winning. When the white area rises above the midline, White has the better chance of winning; when the line drops below the midline and the black background shows, Black does. On the right is the list of moves, and each move's box also shows the advantage: positive favors White, negative favors Black. Also included is the notation for inaccuracies ‘?!’, mistakes ‘?’, and blunders ‘??’.

As you can see, my second move, ‘d3’, earned a ‘?!’. Let’s explore this line and see how I can get off to a better start.

# Data

Unfortunately, I could not find a ready-made dataset to use in Python, so I had to make my own. https://chesstempo.com/chess-openings.html is a list of openings from high-Elo games. By default, the database shows only openings with over 100 recorded games, for a total of 1884 openings. I did not see any way to download the list as one file, so I copied and pasted it into a Google Sheet and saved it as a CSV. If you want to follow along with the data wrangling, look here (https://github.com/crawftv/chess_openings/blob/master/chess_db rework the data.ipynb). I had to infer the color field and calculate the wins for each color.

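This data does not double-count games across openings. For example, “Alekhine Defense, General” (1.e4 Nf6) has 524 games listed, while the “Alekhine Defense, Exchange Variation” (1.e4 Nf6 2.e5 Nd5 3.d4 d6 4.c4 Nb6 5.exd6), which starts with the same two moves, has 6,485 games listed. The general line would have at least as many games as its variation if games were counted under both.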

After working on the data I had 25 columns and 1884 rows.
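To give a feel for the wrangling, here is a minimal sketch of how a move string such as the ones quoted above could be split into one column per move. The column names ‘Opening’ and ‘Moves’ and the helper `split_moves` are my own placeholders, and this is not necessarily how the linked notebook does it; only the ‘move1w’/‘move1b’ naming follows the columns used later in this post.

```python
import re

import pandas as pd


def split_moves(line: str) -> dict:
    """Split '1.e4 Nf6 2.e5 Nd5' into {'move1w': 'e4', 'move1b': 'Nf6', ...}."""
    # Drop move numbers such as '1.' or '12.' and keep only the moves.
    plies = re.sub(r"\d+\.", " ", line).split()
    columns = {}
    for i, ply in enumerate(plies):
        move_number = i // 2 + 1
        side = "w" if i % 2 == 0 else "b"  # White moves first in each pair
        columns[f"move{move_number}{side}"] = ply
    return columns


# Two rows built from the openings quoted above (column names are hypothetical).
openings = pd.DataFrame(
    {
        "Opening": [
            "Alekhine Defense, General",
            "Alekhine Defense, Exchange Variation",
        ],
        "Moves": ["1.e4 Nf6", "1.e4 Nf6 2.e5 Nd5 3.d4 d6 4.c4 Nb6 5.exd6"],
    }
)

# One dict per row; shorter lines get NaN in the later move columns.
move_columns = pd.DataFrame(
    [split_moves(m) for m in openings["Moves"]], index=openings.index
)
openings = openings.join(move_columns)
print(openings[["move1w", "move1b", "move2w"]])
```

Shorter openings simply end up with missing values in the later move columns, which is convenient for filtering on the first move or two.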

# Analysis

Now it is time to get started on the analysis. Our goal is to find the strongest move for white after 1. e4 d5.

Since we already saved the data as a CSV, we load it using read_csv. I used the index_col option so that the saved index is read back as the index rather than as an extra column.

`import numpy as np`

`import pandas as pd`

`df = pd.read_csv('chessdb0-1', index_col=0)`
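To see what index_col=0 changes, here is a small round trip on a toy DataFrame. The frame and its values are made up for illustration; only the read_csv behaviour is the point.

```python
import io

import pandas as pd

# Toy frame standing in for the wrangled openings data (values are invented).
toy = pd.DataFrame({"move1w": ["e4", "d4"], "Num Games": [10, 20]})

# to_csv writes the index as an unnamed first column by default.
csv_text = toy.to_csv()

without_index_col = pd.read_csv(io.StringIO(csv_text))
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index_col.columns.tolist())  # includes an extra 'Unnamed: 0' column
print(with_index_col.columns.tolist())     # only the original columns
```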

To get the line we want to analyze, we create a new DataFrame by slicing the original. The code uses all the columns of the original DataFrame where the first white move is e4 and the first black move is d5.

`e4d5 = df[ (df['move1w']=='e4') & (df['move1b']=='d5') ] `

`e4d5.shape`
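If the filtering step is unfamiliar, the same pattern on a toy DataFrame looks like this. The rows and counts are invented; the column names match the ones used above.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame(
    {
        "move1w": ["e4", "e4", "d4", "e4"],
        "move1b": ["d5", "e5", "d5", "d5"],
        "move2w": ["exd5", "Nf3", "c4", "exd5"],
        "Num Games": [120, 300, 250, 80],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["move1w"] == "e4") & (toy["move1b"] == "d5")
toy_e4d5 = toy[mask]

print(mask.tolist())
print(toy_e4d5.shape)  # (rows, columns)
```

Only the rows where both conditions hold survive, and all columns are kept.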

Checking the shape attribute, we see that the result has 22 rows. A twenty-two-row table is going to be hard to decipher, and I cannot think of a visualization that will help. One problem is that many rows have the same value for ‘move2w’, so we need a way to summarize these values. The way that comes naturally to me is the groupby method. We simply chain it onto the end of our filtered DataFrame.

`e4d5 = df[(df['move1w'] == 'e4') & (df['move1b'] == 'd5')] \`

`    .groupby('move2w')`

This returns a DataFrameGroupBy object which is not easy to use. To turn the object back into a DataFrame we can aggregate some of the values. To find the strongest move for white, we aggregate the ‘Num Games’, ‘White_Wins’, and ‘Black_Wins’ columns.

`e4d5 = df[(df['move1w'] == 'e4') & (df['move1b'] == 'd5')] \`

`    .groupby('move2w') \`

`    .agg({'Num Games': np.sum, 'White_Wins': np.sum, 'Black_Wins': np.sum})`

Together, these calls return a DataFrame indexed by ‘move2w’, with one column for each entry in the dictionary passed to agg. The statement started to run a little long, so I used the “\” symbol, Python's line continuation character, to split it across lines.
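Here is the same groupby and agg pattern on a toy DataFrame, so the shape of the result is easier to see. The rows are invented (including the second ‘move2w’ value, which is there only so the grouping has something to do), and the string 'sum' is used as the aggregation name. Wrapping the chain in parentheses is another way to split it across lines without backslashes.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame(
    {
        "move2w": ["exd5", "exd5", "Nc3", "exd5"],
        "Num Games": [100, 50, 10, 40],
        "White_Wins": [40, 20, 3, 15],
        "Black_Wins": [30, 15, 4, 10],
    }
)

# Parentheses let the method chain span several lines.
summary = (
    toy.groupby("move2w")
    .agg({"Num Games": "sum", "White_Wins": "sum", "Black_Wins": "sum"})
)

print(summary.index.name)     # the grouping column becomes the index
print(summary.reset_index())  # reset_index() turns it back into a column
```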

Let’s visualize our data.

`e4d5.plot(y=['White_Wins', 'Black_Wins'], kind='bar')`

There seems to be only one move played after 1. e4 d5 at high Elo ratings, and that is exd5, capturing the pawn. Our DataFrame shows that there were 49,541 games in our set that started 1. e4 d5, and in every single game white responded the same way. Next time, playing 2. exd5 should be automatic.

# Bonus Analysis

Thanks for making it through all the code stuff. If you wanted more chess analysis, it’s included below.

## What are the strongest opening moves for white?

`w1 = df.groupby('move1w').agg({'Num Games': np.sum, 'White_Wins': np.sum, 'Black_Wins': np.sum})`

`w1['White_odds'] = w1['White_Wins'] / w1['Black_Wins']`

`w1.sort_values('White_odds', ascending=False)`

‘White_odds’ is ‘White_Wins’ divided by ‘Black_Wins’, so a value above 1 means white wins more often than black. It looks like g3 is one of the strongest first moves for white but is rarely played. d4 is the second strongest and second most likely to be played. e4 and d4 overshadow the stronger openings c4 and g3.
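To make the ratio concrete, here is a small worked example with invented counts (they are not the chesstempo numbers). The ‘Share’ column is my own addition, included only to set strength and popularity side by side.

```python
import pandas as pd

# Invented counts, for illustration only; not taken from the real dataset.
toy_w1 = pd.DataFrame(
    {
        "Num Games": [1000, 800, 150, 50],
        "White_Wins": [380, 320, 63, 22],
        "Black_Wins": [300, 240, 45, 15],
    },
    index=pd.Index(["e4", "d4", "c4", "g3"], name="move1w"),
)

# Odds above 1 mean white wins more often than black.
toy_w1["White_odds"] = toy_w1["White_Wins"] / toy_w1["Black_Wins"]
# Hypothetical extra column: each move's share of all games.
toy_w1["Share"] = toy_w1["Num Games"] / toy_w1["Num Games"].sum()

ranked = toy_w1.sort_values("White_odds", ascending=False)
print(ranked[["White_odds", "Share"]])
```

Note that the ratio ignores draws, so it measures how decisive games split between the two sides rather than the overall win rate.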

## What is the strongest move for black after e4?

If white starts e4, the odds are in white’s favor against every reply. But playing c5 gives black the best chance of avoiding a loss, with the lowest ‘White_odds’ in the table. c5 is also the most common reply in these high-rated games. The second most common is e5, even though it is statistically one of the weakest moves.

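| move1b | White_odds | Num Games |
|--------|------------|-----------|
| c5     | 1.080567   | 737822    |
| g6     | 1.111637   | 56518     |
| Nc6    | 1.206246   | 9318      |
| Nf6    | 1.209108   | 34710     |
| e6     | 1.303615   | 228091    |
| c6     | 1.308556   | 129053    |
| d6     | 1.337409   | 68605     |
| e5     | 1.399075   | 370196    |
| d5     | 1.399592   | 51575     |
| a6     | 1.458711   | 1283      |
| b6     | 1.474684   | 4594      |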
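A table like the one above could be produced by combining the earlier steps: filter on white's first move, group by black's reply, and sort by ‘White_odds’. The sketch below does that on invented rows; the helper name `black_replies` is my own, and the exact code used for the table is not shown in this post.

```python
import pandas as pd


def black_replies(df: pd.DataFrame, white_first: str = "e4") -> pd.DataFrame:
    """Summarise black's first moves after a given white first move."""
    after = df[df["move1w"] == white_first]
    out = after.groupby("move1b").agg(
        {"Num Games": "sum", "White_Wins": "sum", "Black_Wins": "sum"}
    )
    out["White_odds"] = out["White_Wins"] / out["Black_Wins"]
    # Lowest White_odds first: the replies that hold up best for black.
    return out.sort_values("White_odds")[["White_odds", "Num Games"]]


# Invented rows, for illustration only.
toy = pd.DataFrame(
    {
        "move1w": ["e4", "e4", "e4", "d4"],
        "move1b": ["c5", "e5", "c5", "d5"],
        "Num Games": [100, 80, 50, 70],
        "White_Wins": [35, 35, 15, 25],
        "Black_Wins": [33, 25, 17, 20],
    }
)

result = black_replies(toy)
print(result)
```

Calling `black_replies(df, "d4")` on the real data would give the same kind of table for the other main first move.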