Exploring Chess Games Using Data

Published in CodeX · 25 min read · Mar 22, 2021

Chess is a board game that has been around for almost 1500 years, but it has reached peak popularity since the release of the famous Netflix show “The Queen’s Gambit”. Following the release of the show, sales of chessboards in the US increased by 87%. Not only that, monthly searches for Chess rose by 189% in November compared to the average for the 11 months prior. Alongside “The Queen’s Gambit”, average Chess viewership on Twitch and YouTube has increased tremendously, thanks to events like “PogChamps” and to chess content creators who have gathered huge communities on those platforms.

With the popularity Chess has gained, more people have started to learn, play and enjoy the game. The average number of games played on websites like Chess.com and Lichess.org has reached its all-time peak. People have started to delve into the intricacies of the game, and are exploring the beauty Chess embraces within its history, old games, and literature.

Chess is a game whose basic principles are easy to learn but which is hard to master. Most chess grandmasters start as child prodigies and spend their whole lives playing and studying chess to reach and maintain the status of Grandmaster. Chess players learn tactics, strategies, positions, endgames and other concepts, and try to master them to become better players. One of the things players study is chess openings, which are the initial moves made by the players with the white and black pieces. There is even an Encyclopedia of Chess Openings, dedicated to the analysis of popular opening moves. So, the question arises: why are chess openings so important? This is what we are going to investigate in this blog.

Interest over time for the term ‘Chess Opening’. October is the month when “The Queen’s Gambit” was released. Source: trends.google.com

With the introductory phase of the article concluded, we are going to formally state the core purpose and the questions we are going to investigate in our blog. Firstly, we are going to prepare our dataset of chess games. Next, we are going to clean it. Finally, we are going to analyze and visualize our data.

Our thesis question and sub-questions are:

Do openings really matter?

What are some of the most popular openings played?
What has been the success rate of the openings from either side (black and white)?
Do openings make any difference depending on the game format (like fast chess or slower one)?
Can ratings of the players playing be taken under consideration when analyzing openings?
What are some of the most common end results of the game in different game formats?

Preparing our dataset:

The first thing we are going to do is prepare our own data set of chess games. Chess games are recorded in PGN (Portable Game Notation). A PGN includes all the necessary information about a game, such as the players, the moves and the end result. We are going to read those chess games in their raw PGN form, process them, and finally export them as a .csv file, which will be easier to read and process in the data cleaning and visualization sections.

Source of Dataset:

The source of our raw chess games is the FICS Games Database. FICS allows us to download the PGNs of games based on different filters. I am going to download games from June and September 2019 of all ratings and game formats. For the sake of our blog, this many games will be sufficient.

Our Unit of Data:

The file downloaded has the extension of .pgn. We can change the extension to .txt and open it as a text file. Let’s have a quick look at our raw data.

[Event "FICS rated blitz game"]
[Site "FICS freechess.org"]
[FICSGamesDBGameNo "453184368"]
[White "wellthoughtplan"]
[Black "thirtythree"]
[WhiteElo "1524"]
[BlackElo "1580"]
[WhiteRD "32.2"]
[BlackRD "29.4"]
[TimeControl "180+0"]
[Date "2019.06.30"]
[Time "23:56:00"]
[WhiteClock "0:03:00.000"]
[BlackClock "0:03:00.000"]
[ECO "B22"]
[PlyCount "49"]
[Result "1-0"]

1. e4 c5 2. d4 cxd4 3. c3 d5 4. exd5 Qxd5 5. cxd4 g6 6. Nc3 Qa5 7. Be3 Bg7 8. Bc4 Nf6 9. Nf3 O-O 10. O-O Bf5 11. Qd2 Nc6 12. b4 Nxb4 13. Rab1 Bxb1 14. Rxb1 Rac8 15. Ne5 Nc6 16. Rb5 Qc7 17. Bf4 Nh5 18. Nxf7 Ne5 19. Nxe5+ Kh8 20. Nxg6+ hxg6 21. Bxc7 Rxc7 22. Rxh5+ gxh5 23. Be6 Rd8 24. Ne2 Bxd4 25. Nxd4 {Black resigns} 1-0

This is a single game. One unit of our data is one game. Now, I am going to explain the attributes we are most interested in:

WhiteElo: Elo is a method of calculating relative skill levels of chess players. WhiteElo is the elo of the player playing with the white pieces.
BlackElo: Elo of the player playing with the black pieces.
WhiteRD: RD stands for Rating Deviation. If a player has an elo of 1500 and an RD of 50, the strength of the player most likely lies between 1400 and 1600 (two RDs on either side of the rating). WhiteRD is the RD of the player playing with the white pieces.
BlackRD: RD of the player playing with the black pieces.
TimeControl: This is the game format. Time Control establishes the game clock. Here, the time control is written in seconds. 180+0 means the starting time was 180 seconds (3 minutes) with 0 second increment per move. 600 + 10 would mean that the starting time was 600 seconds with 10 second increment per move.
ECO: This is the code of the chess opening in the chess encyclopedia. For example, here B22 stands for “Sicilian Defence: Alapin Variation” (we are going to look at a lot of openings in the upcoming sections of the article).
PlyCount: The total number of plies the game lasted. A ply is a single move by one player, so the 49 plies here correspond to 25 moves by white and 24 by black.
Result: The end result of the game. 1–0 means a win for white, 0–1 a win for black, and 1/2–1/2 a draw.
Moves: The last lines are the moves of the game. The moves are written in the form of board coordinates and the game notation of Chess. In the end, we have the status of the game.
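To make the attribute list above concrete, here is a minimal sketch of how the header of one game could be turned into a dictionary. The regular expression and the helper names (parse_headers, split_time_control) are my own assumptions for illustration, not the exact code used to build the data set.

```python
import re

# Tag pairs in a PGN header look like: [Name "value"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def parse_headers(pgn_text):
    """Collect the tag pairs of a single game into a dict of strings."""
    headers = {}
    for line in pgn_text.splitlines():
        match = TAG_PAIR.match(line.strip())
        if match:
            name, value = match.groups()
            headers[name] = value
    return headers

def split_time_control(time_control):
    """Split '180+0' into (starting seconds, increment seconds)."""
    start, increment = time_control.split('+', 1)
    return int(start), int(increment)

sample = '''[Event "FICS rated blitz game"]
[WhiteElo "1524"]
[BlackElo "1580"]
[TimeControl "180+0"]
[ECO "B22"]
[PlyCount "49"]
[Result "1-0"]'''

headers = parse_headers(sample)
print(headers["WhiteElo"], headers["ECO"])         # 1524 B22
print(split_time_control(headers["TimeControl"]))  # (180, 0)
```

Every value comes back as a string, which is why the numeric attributes still need to be converted afterwards.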

Now, we are going to process the games in our PGN.

Processing and Exporting our data:

The approach to reading the files is going to be simple. As these are just text files, we are going to read them and process them line by line. An example of how we process a line is below:

white_elo = []  # raw header lines, one per game

def read_white_elo(line):
    # Keep only the lines that carry the WhiteElo tag
    if line.startswith('[WhiteElo "'):
        white_elo.append(line)

In the example, simply enough, we make an empty list that will hold the WhiteElo lines of all the games. Whenever the passed line starts with ‘[WhiteElo "’, the line is appended to our list. Next, we split up each line and extract the value out of it. Below is an instance of the process:

def extract_elo(elo_list):
    elos = []
    for l in elo_list:
        elos.append(l.rsplit('"', 2)[1])
    
    return elos

So, by this process of splitting, we extract the elos of the players from our lists. Extracting information from the comments (like the status of the game) is slightly more complex, but it still essentially involves splitting the strings at certain characters, just as in the function above. Also, everything we have read comes from a text file, so every value has the string data type. We need to convert the numeric attributes, which is easily done by applying int() to each value of the number-based lists like WhiteElo.
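As a rough illustration of the comment extraction mentioned above, the sketch below pulls the last word out of the closing {...} comment of the moves line. The function name extract_comment_keyword is hypothetical, and the real processing code may handle more edge cases.

```python
def extract_comment_keyword(moves_line):
    """Return the last word inside the final {...} comment, lowercased."""
    if '{' not in moves_line or '}' not in moves_line:
        return None
    # Take the text between the last '{' and the '}' that follows it.
    comment = moves_line.rsplit('{', 1)[1].split('}', 1)[0]
    words = comment.split()
    return words[-1].lower() if words else None

line = "24. Ne2 Bxd4 25. Nxd4 {Black resigns} 1-0"
print(extract_comment_keyword(line))  # resigns
```

Keeping only the last word is simple, but it throws away context, which matters later when we clean the comments.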

After processing all the lines, we are ready to export all of our lists as a .csv file using pandas. The code snippet below gives the example of exporting two lists as a .csv dataset.

import pandas as pd

def to_dataframe(white_elo, black_elo):
    # Pair up the two lists row by row
    zipped = list(zip(white_elo, black_elo))
    cols = ['White Elo', 'Black Elo']
    df = pd.DataFrame(zipped, columns=cols)
    return df

df = to_dataframe(white_elo, black_elo)
# Forward slash avoids the invalid "\c" escape; the folder must already exist
df.to_csv("chess_games_csv/chess_games.csv", index=False)

The resultant dataset looks like this:

We have successfully created our own dataset. We managed to extract 935,362 games. With this simple procedure, we can process as many games as we like. Next, we are going to clean our data.

Data Cleaning

Data Cleaning is the process of replacing, modifying, and deleting the coarse and dirty data. This not only helps us to visualize and analyze our data effectively, but helps us build a strong machine learning model as well. For our dataset, we are going to get through a number of steps to cleanse it.

Overall Look

First of all, we are going to have a general look at the data set. We have already seen what each data frame cell looks like and what the columns are named in the picture above. Now, let’s start by looking at the number of rows and columns. Our data frame is named df.

So, we currently have 935,362 rows and 19 columns. Now, we are going to tackle the null values in some of our columns.

Tackling Null Values

Next, we are going to manage the null values that exist in our columns.

ECO is not a column of interest, but Time Control is. Luckily, there are only three null values in our Time Control column. That number is negligible and our data set is big, so we can afford to remove those three entries. The code snippet below does this for us: it removes the rows where ‘Time Control’ is null and stores the new data frame in df.

df = df.dropna(subset=['Time Control'])
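For readers who want to see where such null counts come from, here is a small sketch on a toy data frame. The values are made up; only the column names follow the article.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data set (values are made up).
toy = pd.DataFrame({
    'White Elo': [1524, 1710, 1433, 1890],
    'Time Control': ['180+0', np.nan, '600+10', '60+0'],
    'ECO': ['B22', None, 'C50', 'A00'],
})

# Number of missing values per column.
null_counts = toy.isna().sum()
print(null_counts)

# Drop only the rows whose 'Time Control' is missing.
cleaned = toy.dropna(subset=['Time Control'])
print(len(toy), '->', len(cleaned))  # 4 -> 3
```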

Checking data types

In the data preparation section, we converted the values to their respective data type so that they are properly encoded. To verify that, we can check data types of our data columns. (Here objects are simply strings)

Removing unnecessary columns

There are a lot of columns that we are not going to use in any manner, so we can safely drop them. After dropping those columns, we are left with:

The remaining columns may look sparse, but we will add more columns in the upcoming sections of data cleaning and visualization to help us categorize and visualize the data.

Standardizing games by rating difference

There will be some games where the elo difference between the players is extremely high. Opening analysis is of little use in such cases, because the higher-rated player is already far stronger, and keeping these games would bias our data. To handle such cases, we first visualize the distribution of rating differences in our data.

There are some games where the elo difference is very high (around 600+). We want to remove the entries where the difference between the elos of the players is more than 600 points.

The code below will do this for us:

df['rating_difference'] = abs(df['White Elo'] - df['Black Elo'])
df = df.drop(df[df.rating_difference > 600].index)
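Before dropping anything, it can help to check how many games the cut-off actually removes. The sketch below does this on a made-up toy frame; the 600-point threshold is the one used in the article, everything else is illustrative.

```python
import pandas as pd

# Made-up ratings, only to illustrate the calculation.
toy = pd.DataFrame({
    'White Elo': [1500, 2100, 1320, 1800, 1650],
    'Black Elo': [1580, 1400, 1350, 1150, 1700],
})

toy['rating_difference'] = (toy['White Elo'] - toy['Black Elo']).abs()

threshold = 600  # cut-off used in the article
too_lopsided = toy['rating_difference'] > threshold

# Summary statistics of the differences and the share above the cut-off.
print(toy['rating_difference'].describe())
print(f"{too_lopsided.mean():.0%} of games exceed {threshold} points")

# Boolean indexing is an equivalent way to keep the remaining games.
kept = toy[~too_lopsided]
```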

Regulate Data by PlyCount

As explained earlier in the data preparation, PlyCount is the total number of plies (single moves by either player) in the game. There might be some games that ended after just 3 or 4 moves. Games like that are not useful in our opening analysis, so we would like to remove them from our data frame. Let’s visualize the PlyCount of all the games first.

It is clear that some games did end that early. We will remove the games with a PlyCount of less than 6, that is, games in which fewer than three full moves were completed. The code snippet below does this for us:

df = df.drop(df[df.PlyCount < 6].index)

Making Columns Consistent

Time Control:

The source of our dataset is mostly computerized, not handwritten or typed. So, we can expect most of the attribute values to be consistent. Still, it is always a good idea to see how many different values each column has. If some values have very little representation in our data, we might just remove them.

First of all, we are going to take a look at the Time Control column, which has some issues. There are 398 different game formats in it, and analyzing each of them would be a disaster. Also, some game formats are custom (not standard) and have little to no representation. To address this, we are going to create our own column derived from Time Control, which we will call ‘Game Format’. The column will have three possible values: Bullet, Blitz and Rapid. Bullet covers games with a starting time of at most 3 minutes, Blitz covers more than 3 and up to 10 minutes, and anything above 10 minutes falls under Rapid. (Classical is not a category in our data because such games have negligible representation. Moreover, proper classical games are rarely held on a website because of how long they are; they are usually played over a physical board.) The code that achieves the desired output is written below:

import numpy as np

def game_format(tc):  # tc: time control string such as "180+0"
    # The part before '+' is the starting time in seconds
    starting_time = int(tc.split('+', 1)[0])

    if starting_time <= 180:
        return 'Bullet'
    elif starting_time <= 600:
        return 'Blitz'
    else:
        return 'Rapid'

df['game_format'] = np.vectorize(game_format)(df['Time Control'])

Now, our game format has 3 values only:

Comments

The comments describe the ending status of the game, that is, how the game came to an end: a stalemate, a resignation from either side, or a checkmate on the board. We extracted those comments by splitting and keeping only the very last word of the sentence. This approach is ambiguous, because the last word alone does not always tell us what it maps to.

Let me elaborate on this. Below are the different types of comments in our data:

Let me explain some of the comments:

Resigns: One side (black or white) resigned in the game.
Time: One side’s time ran out.
Checkmate: The game ended in a checkmate.
Disconnection: One side disconnected.
Repetition: The game was drawn by threefold repetition (the same position occurred three times).
Mate: Since we only extracted the last word of the comment, this ‘mate’ maps to “one side ran out of time but the other had insufficient material to mate”. Such a scenario is a draw.
Material: Both sides had insufficient material to checkmate the other. So, the result is a draw.
Agreement: The game ended in a draw by agreement.
Stalemate: This is a position where one player is not in check but has no legal moves left. So, the end result is a draw.
Rule: The game ended due to a specific rule (can be anything like fair play etc.).
Adjudication: Mostly seen in longer games, where the game is unfinished and the result is declared as a win or draw based on the final position.
Length: The game lasted too long and was declared a draw.

We can see that some comments have little to no representation in our data, so we are going to remove such rows. Also, some comments like ‘mate’ are confusing as they are not described clearly. We are going to relabel the ‘mate’ entries as ‘material’. The reasoning is that if the player with time left had even a single pawn, they would have won when the opponent’s clock ran out. The game was drawn only because that player had insufficient material to mate, so ‘material’ describes the outcome well.
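The normalization described above could look roughly like the sketch below. The toy labels and the min_count threshold are assumptions for illustration; the article does not state the exact cut-off used for rare comments.

```python
import pandas as pd

# Made-up comment labels standing in for the real column.
toy = pd.DataFrame({'Comments': [
    'resigns', 'time', 'checkmate', 'mate', 'material', 'resigns',
    'rule', 'time', 'resigns', 'checkmate',
]})

# Fold the ambiguous 'mate' label into 'material'.
toy['Comments'] = toy['Comments'].replace({'mate': 'material'})

# Drop labels that appear fewer than min_count times (threshold is an assumption).
min_count = 2
counts = toy['Comments'].value_counts()
rare = counts[counts < min_count].index
toy = toy[~toy['Comments'].isin(rare)]

print(toy['Comments'].value_counts())
```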

After normalizing the ‘comments’ column, we have a much cleaner version of the comments:

Finishing touches

We are almost done with our data cleansing. Lastly, we are going to make the column names uniform. We are going to use the snake case (stylized as snake_case) to write the names of our columns and make them consistent. Also, we are going to reset the index of our data frame.

df.columns = [i.replace(' ', '_').lower() for i in df.columns]
df = df.reset_index(drop=True)

After data cleaning, our data looks like this:

Following data cleaning, we are next going to visualize our data on different parameters like skill level, game format, and the opening played in the game.

Visualization and Analysis

Given that a typical chess game has a branching factor of about 35 and lasts about 80 plies (40 moves per side), the number of possible games is vast, about 35⁸⁰ (roughly 10¹²³), which is comparable to Shannon’s famous estimate of 10¹²⁰, known as the “Shannon number”. We are not going to analyze each and every move; we are just going to look at the opening moves of the game. Even analyzing all moves at this small depth would be too much, so instead we are going to look at some of the most popular moves at a certain depth.
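The size of 35⁸⁰ is easy to check with a couple of lines of arithmetic. This is just a sanity check of the estimate quoted above, using logarithms so that we get the power of ten directly.

```python
import math

branching_factor = 35  # average number of legal moves per position
plies = 80             # length of a typical game in single moves

# log10(35^80) = 80 * log10(35)
exponent = plies * math.log10(branching_factor)
print(f"35^80 is about 10^{exponent:.1f}")  # 35^80 is about 10^123.5
```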

Tools needed

The most important tool is the data set which we are going to use for all the investigation.
One more column that I would like to add is ‘skill_level’. This column classifies the games based on the average rating of the two players, which will help us categorize the data more easily. The skill levels we are going to use are Beginner (< 1350), Intermediate-1 (1350–1650), Intermediate-2 (1650–1800), Advanced-1 (1800–2025) and Advanced-2 (> 2025). The categorization looks like:
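One way to build such a column is sketched here. The rating bands come from the text above, but which band a boundary value (say exactly 1650) falls into is my assumption, and the column names white_elo, black_elo and average_elo are illustrative.

```python
import pandas as pd

def skill_level(average_elo):
    """Map an average rating to one of the five skill levels."""
    if average_elo < 1350:
        return 'Beginner'
    elif average_elo < 1650:
        return 'Intermediate-1'
    elif average_elo < 1800:
        return 'Intermediate-2'
    elif average_elo <= 2025:
        return 'Advanced-1'
    return 'Advanced-2'

# Made-up ratings, one game per row.
toy = pd.DataFrame({
    'white_elo': [1200, 1524, 1700, 1900, 2200],
    'black_elo': [1300, 1580, 1750, 2000, 2100],
})
toy['average_elo'] = (toy['white_elo'] + toy['black_elo']) / 2
toy['skill_level'] = toy['average_elo'].apply(skill_level)
print(toy[['average_elo', 'skill_level']])
```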

Another tool we need is something that calculates most common moves at a certain depth. These two functions will do the job for us:

from collections import Counter
import operator

def move_array_maker(move, depth):
    # Everything after "<depth>. " starts with white's and black's moves at that depth
    depth_str = str(depth) + '. '
    (_, needed_moves) = move.split(depth_str, 1)
    (white_move, black_move, _) = needed_moves.split(' ', 2)

    return white_move, black_move

def reverse_sort_dict(move_list):
    # Count each move, then sort by count in descending order
    d = Counter(move_list)
    sorted_d = dict(sorted(d.items(), key=operator.itemgetter(1), reverse=True))
    return sorted_d

The first function takes all the moves played in the game and the depth at which we are looking for moves and returns the moves played by the players playing with the white and black pieces. The second function takes in the list and returns a dictionary with keys as moves and values as the number of times that move was played. The dictionary is sorted descending. These functions are going to be very helpful in our analysis.
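Here is a short usage sketch of the two helpers on a handful of made-up move strings. The functions are repeated so that the snippet runs on its own; the games list is purely illustrative.

```python
from collections import Counter
import operator

def move_array_maker(move, depth):
    # Split off everything before "<depth>. ", then take the next two tokens.
    depth_str = str(depth) + '. '
    (_, needed_moves) = move.split(depth_str, 1)
    (white_move, black_move, _) = needed_moves.split(' ', 2)
    return white_move, black_move

def reverse_sort_dict(move_list):
    # Count the moves and sort them from most to least frequent.
    d = Counter(move_list)
    return dict(sorted(d.items(), key=operator.itemgetter(1), reverse=True))

games = [
    "1. e4 c5 2. d4 cxd4 3. c3 d5",
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6",
    "1. d4 d5 2. c4 e6 3. Nc3 Nf6",
    "1. e4 c5 2. Nf3 d6 3. d4 cxd4",
]

first_moves = [move_array_maker(g, 1) for g in games]
white_first = [white for white, _ in first_moves]

print(reverse_sort_dict(white_first))  # {'e4': 3, 'd4': 1}
print(move_array_maker(games[0], 2))   # ('d4', 'cxd4')
```

Note that move_array_maker expects at least one more token after black's move at the requested depth, so it only works for games that continue past that depth.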

One last tool we would like to have is something that groups our data set by skill level and by game format.

def game_grouper(df):
    skill_level_branch_df = df.groupby(by = ['winner','skill_level']).size()
    total_games_played = df.groupby(by = ['skill_level']).size()
    
    percentages_win_df = (skill_level_branch_df/total_games_played)*100
    temp_df = pd.DataFrame(percentages_win_df)
    temp_df.columns = ['Result percentages']
    
    game_format_branch_df = df.groupby(by = ['winner','game_format']).size()
    total_games_played2 = df.groupby(by = ['game_format']).size()
    
    percentages_win_df2 = (game_format_branch_df/total_games_played2)*100
    temp_df2 = pd.DataFrame(percentages_win_df2)
    temp_df2.columns = ['Result percentages']
    
    return temp_df, temp_df2

The function above returns two data frames: the first one has the results grouped by skill level, and the second one has them grouped by game format. Each gives the result percentages (for example, wins for black and white) within every group of the input data frame.
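To show what these percentages look like, here is a condensed sketch of the same calculation for a single grouping column. The helper result_percentages is hypothetical, the toy rows are made up, and the ‘draw’ label in the winner column is an assumption about how draws are stored.

```python
import pandas as pd

def result_percentages(df, by):
    """Percentage of each result within every group of column `by`."""
    counts = df.groupby(['winner', by]).size()
    totals = df.groupby(by).size()
    # Divide each (winner, group) count by the total of its group.
    return counts.div(totals, level=by) * 100

toy = pd.DataFrame({
    'winner': ['white', 'white', 'black', 'draw', 'black', 'white'],
    'skill_level': ['Beginner'] * 4 + ['Advanced-1'] * 2,
    'game_format': ['Blitz', 'Bullet', 'Blitz', 'Blitz', 'Rapid', 'Rapid'],
})

by_skill = result_percentages(toy, 'skill_level')
by_format = result_percentages(toy, 'game_format')

# Results among the four Beginner games: 2 white wins, 1 black win, 1 draw.
print(by_skill.xs('Beginner', level='skill_level'))
```

Within each group the percentages add up to 100, which is a handy check when plotting them.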

Approach

The procedure that we will follow most of the time is as follows:

Extract the moves at a certain depth for black and white using the move_array_maker function.
Append those moves as new columns in the existing data frame.
Check the most popular moves using reverse_sort_dict.
Filter the data set down to the popular move you are about to analyze.
Use the game_grouper function to get the resulting data frames based on skill levels and game formats.
Plot those data frames.
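The steps above can be sketched end to end on a toy data frame. To keep the snippet short it swaps in pandas built-ins (str.extract and value_counts) for the helper functions, so treat it as an equivalent outline rather than the exact code behind the charts. The column names and rows are made up, and the plotting step is left out.

```python
import pandas as pd

toy = pd.DataFrame({
    'moves': [
        "1. e4 c5 2. Nf3 d6 3. d4 cxd4",
        "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6",
        "1. d4 d5 2. c4 e6 3. Nc3 Nf6",
        "1. e4 c5 2. c3 d5 3. exd5 Qxd5",
        "1. e4 e5 2. Bc4 Nf6 3. d3 c6",
    ],
    'winner': ['black', 'white', 'white', 'black', 'draw'],
    'game_format': ['Blitz', 'Blitz', 'Rapid', 'Bullet', 'Blitz'],
})

# Steps 1-2: extract both first moves and store them as new columns.
first = toy['moves'].str.extract(r'^1\. (\S+) (\S+)')
toy['white_move_1'] = first[0]
toy['black_move_1'] = first[1]

# Step 3: most popular first moves for white.
print(toy['white_move_1'].value_counts())

# Step 4: keep only the games that started with the most popular move.
top_move = toy['white_move_1'].value_counts().idxmax()
subset = toy[toy['white_move_1'] == top_move]

# Step 5: result percentages per game format within that subset.
percentages = (
    subset.groupby('game_format')['winner']
    .value_counts(normalize=True)
    .mul(100)
)
print(percentages)
```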

You can use this approach to analyze any depth. Here, we will be using this approach for some of the most popular openings.

With our dataset and all the tools finalized, we can start our analysis.

Common First Moves

In this section, we are going to take a look at some of the most common moves at the first depth. Below are the results after applying the functions mentioned above:

We can see that e4 (King’s Pawn) and d4 (Queen’s Pawn) are the most common opening moves, so we are going to examine those games more deeply than the others. Nf3 (Réti Opening) and c4 (English Opening) are also fairly common. Other moves are not very common, so we are not going to analyze them. You are free to investigate them on your own using the approach mentioned above. Let’s start our analysis with e4.

The One — e4: King’s Pawn

The most popular opening and Fischer’s favorite, e4 is one of the most played opening moves in chess. Why is it played so often? Because it takes control of the center of the board and opens the diagonal for the light-squared bishop. Most of the time the positions that arise from e4 are explosive and open. In contrast, d4 is more of a quiet move, while e4 is a lot more aggressive. Overall, it is a very solid choice for white. Now, we are going to take a look at the results that this opening has yielded overall.

e4 has yielded quite solid results for both black and white, with a small edge of 1.4% to white. As this is only one move, we will branch out our analysis further according to the move black makes. Now, let’s group our data by skill level and game format.

Result percentages with e4 by different skill levels

If we categorize our results by skill level, we can see that e4 has yielded really good results for beginners playing white. So, we can say that e4 is a beginner-friendly move, which is why it is the go-to move for someone starting to play chess. Results at the other skill levels are also pretty consistent, with a winning rate of around 47% for both black and white. At the most advanced level, however, black has a slightly lower winning rate and the percentage of draws is higher than at other skill levels. We cannot say much about this right now because we have to analyze black’s responses as well. But one thing is for sure: it is a really beginner-friendly move and a pretty solid choice for white. Now, let’s categorize our results by game format.

Result percentages with e4 by different game formats

In our game format analysis of e4, we can see that it is a viable option in all game formats. In the rapid format it gives white a slight edge, but it is a small one. Overall, e4 is the move that has been played the most over 1500 years of chess and has stood the test of time. If you are a beginner, e4 is a very good choice. Let’s see what the most common responses to e4 have been from black’s side. Below are some of the most common responses to e4.

These are some of the most common responses to e4 from black’s side. e5 is the Open Game, c5 is the Sicilian Defense, e6 is the French Defense, d5 is the Scandinavian Defense, and c6 is the Caro-Kann Defense. Now, we are going to take a deeper look at e5 and c5.

e4-e5: Open Game

This is called the Open Game (or Double King’s Pawn Opening). The idea of the move is to counter white’s control of the center by pushing a pawn two squares ahead. It also opens the diagonal of the dark-squared bishop. This response to e4 is a very standard and solid approach to the game.

Results percentages of e4-e5 by skill level

Broken down by skill level, the winning percentage for black is lower at the very advanced level, where there are also a lot more draws. The opening also favors white at all skill levels. Looking at this graph, we can conclude that if you want to play solid chess with the black pieces and are content with not necessarily winning, you might opt for e4-e5. But if you are looking for better results, you might want to look for other weapons that score better for black.

Result Percentages of e4-e5 by game format.

If we categorize the results by game format, they are fairly even across formats and do not favor any specific one. One observation we can make is that the slower the game format, the worse black’s results are. So, all in all, e4-e5 is just a simple, solid opening in all time formats. It is played at all levels and is a very standard approach to the game.

The most common variations that arise from e4-e5 are the Ruy Lopez (Spanish Game), the Italian Game and the Scotch Game. All of the mentioned variations have their own merits for each side. Let’s move on to our next common response to e4, the Sicilian Defense.

Sicilian Defense: e4-c5

Now we have the Sicilian Defense on the board. Here, instead of contesting white in the center of the board as e5 does, black gains space a little to the side while still making its presence felt in the center. This is one of the most aggressive responses to e4: unlike e4-e5, it is an asymmetrical opening, so the positions that arise are exciting and chaotic. Most of the time, white tries to attack on the kingside (the right side of the board from white’s point of view), while black tries to produce counterplay on the queenside (the left side). This was the second most popular response to e4 in our data set. Now, let’s proceed to analyze it.

The pie chart indicates black as the favorite in the Sicilian Defense. Historically speaking, in the early days of chess, people almost always played e4 and the Sicilian Defense was not deeply analyzed. But as time passed, it came to be seen as a more confrontational response to white, and since then it has yielded more successful results for black, as is evident from the pie chart. Because of the success of the Sicilian Defense, top-level chess players are starting to play d4 (Queen’s Pawn) a little more in order to avoid it. Overall, of the responses we analyzed, the Sicilian Defense has yielded the most successful results against e4.

Across skill levels, the Sicilian Defense has favored black everywhere except at Advanced-2, where the results are more balanced. One thing we notice is that, except for Advanced-2, all skill levels tend to lead to similar conclusions. This suggests that if you are a beginner or intermediate player, it is not a good idea to blindly follow advanced-level practice: what works for advanced players is not guaranteed to work for you. Also, the percentage of draws is lower than in e4-e5 openings, which reflects the aggressive nature of the Sicilian Defense.

By game format, the results in the faster format (Bullet) are more balanced than in the slower formats, where black is the clear favorite. So far, the trend across game formats has been reasonably consistent: the slower the time format, the worse things go for the side that has the worse results overall. Let’s see whether this trend continues in the next sections.

The most common variations that arise from Sicilian Defense are Open Sicilian and Closed Sicilian. In Open Sicilian, white chooses to open the center by playing d4, while in Closed Sicilian, white does not open the center. We have seen the most common responses to e4. Now, it’s time to check how d4 positions come out to be.

d4-Queen’s Pawn

With our analysis of e4 positions complete, let’s dive into the opening move d4.

We have d4 on the board. This move was considered unusual in the past, but it is becoming a lot more common now. It is more of a quiet move, and the games tend to be quieter and more closed. Our data set is consistent with this: the average PlyCount (number of plies in the game) is 70 for e4 games and 74 for d4 games.
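An average like this takes a single groupby once the first move is stored in its own column. The sketch below uses made-up numbers and assumed column names (white_move_1, plycount), so the output will not match the figures quoted above.

```python
import pandas as pd

# Made-up ply counts, only to show the calculation.
toy = pd.DataFrame({
    'white_move_1': ['e4', 'e4', 'd4', 'd4', 'e4', 'c4'],
    'plycount': [48, 92, 70, 82, 64, 66],
})

# Mean number of plies for every first move.
average_plies = toy.groupby('white_move_1')['plycount'].mean()
print(average_plies.loc[['e4', 'd4']])
```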

These are the results grouped by skill level. Beginners and Advanced-2 have almost balanced results (black and white have almost equal winning percentages). At the intermediate levels, white’s winning percentage is quite high, which tells us that black is uncomfortable against d4 most of the time. What has been the best response against d4? We shall find out next. Now, we are going to see the results grouped by game format.

These are the results categorized by game format. There is one small observation we can make. In e4 games, the slower time formats favored the color that had the better results overall. Here, it is the opposite: in the slower format (Rapid), white and black have almost equal winning percentages, while in the faster formats black has worse results. Let’s investigate some of the common responses to d4. Maybe we will find something interesting for black, just as we found the Sicilian Defense.

These are the common responses from black when white plays d4. The first one is symmetrical d5. The second one is Nf6 (Indian Defense), and the third one is e6 (Horwitz Defense). The first two are a lot more common, so we are going to have a look at those.

d4-d5: Double Queen’s Pawn Opening

Now we have d4-d5 on the board. This is also called the Closed Game (as opposed to e4-e5, which is called the Open Game). The name comes from the nature of the positions that arise from this opening. After e4, white’s pawn is undefended and it is easier to break things open, whereas after d4 the pawn is already protected by the queen, so it is harder for either side to break things open.

These are the result percentages for d4-d5. At the Beginner level, the winning percentages for black and white are pretty much equal. But things get really rough for black as we climb the skill ladder, with black winning a lower percentage of games at the Intermediate and Advanced levels. In conclusion, if you are a beginner, feel free to play d4-d5, but as you climb to higher skill levels, you might want to try different tools against d4 that have yielded more success for black.

These are the results grouped by game format for d4-d5. Faster formats (Blitz and Bullet) have highly favored white. This is the opposite of the trend in e4-e5, where results were a lot more balanced across the formats. On further investigation, I found that 24.5% of the d4-d5 games black lost were lost on time, compared with 23% in e4-e5. So, perhaps, because of the closed nature of d4-d5 positions, black has difficulty navigating the game in faster formats and loses on time. The gap is small and there might be other factors, but this was the most obvious one to me.
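The share of losses on time can be computed by filtering the games black lost and checking the comment. This is a sketch on made-up rows; the column names winner and comments follow the cleaned data set, while the label values are assumptions.

```python
import pandas as pd

# Made-up games, only to show the calculation.
toy = pd.DataFrame({
    'winner': ['white', 'white', 'black', 'white', 'draw', 'white'],
    'comments': ['time', 'resigns', 'checkmate', 'checkmate', 'agreement', 'time'],
})

# Games that black lost are the ones white won.
black_losses = toy[toy['winner'] == 'white']

# Share of those losses where the comment says the clock ran out.
lost_on_time = (black_losses['comments'] == 'time').mean() * 100
print(f"{lost_on_time:.1f}% of black's losses were on time")  # 50.0% of black's losses were on time
```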

The variations that mostly arise from d4-d5 are the QGD (Queen’s Gambit Declined), the QGA (Queen’s Gambit Accepted) and the Slav Defense. All variations have their own pros and cons and you are free to discover them on your own. We will move on to the Indian Defense now.

d4-Nf6: Indian Defense

Here we have the Indian Defense on the board. In the past, it was considered really bad for black not to contest the center of the board with pawns, so d4-d5 and e4-e5 were simply the most played moves. But a group of players belonging to the school of ‘hypermodernism’ introduced another idea for black: let white take the center with pawns, contest it with pieces instead, and unleash them later in the game to counter white’s extended center. This type of play has stood the test of time and remains very successful today. The Indian Defense is one of the hypermodern defenses for black.

Here we have the overall results for the Indian Defense. We have got another weapon where black has been slightly more successful than white. So far, the second most popular response from black has resulted in better statistics for black than for white. Just as we had the Sicilian Defense in e4 games, the Indian Defense has been really successful for black in d4 games. Now, let’s categorize the results by skill level and game format.

Here we have the Indian Defense categorized by skill level. The results are pretty balanced at the Advanced-2 level. Other than that, it has yielded really good results for black at the Beginner, Intermediate and Advanced-1 levels. All in all, of the responses we analyzed, the Indian Defense has been the most successful against d4 at almost all skill levels.

The Indian Defense has resulted in very good statistics for black across the different game formats as well, especially in Blitz games. This suggests that it is a lot more comfortable for black to play with little time on the clock. Due to this finding, I have personally decided to play the Indian Defense more often with the black pieces, because of how good it has been across skill levels and game formats.

The most common variations that arise from Indian Defense are King’s Indian Defense, Nimzo-Indian Defense, Grünfeld Defense, and Catalan Opening. Feel free to do research on these variations.

Further analysis

I would have absolutely loved to investigate more openings like the French Defense, the Caro-Kann Defense, the English Opening and so on. But because the article is already very long, I leave these openings for you to explore on your own using the tools and approach described above. So, here I would like to conclude my article.

Observations and Conclusions:

King’s Pawn and Queen’s Pawn are easily the most popular opening moves played by white. In response to King’s Pawn, e5 and c5 are most popular, and against Queen’s Pawn, d5 and Nf6 are most popular from black.
Open Game (e4-e5) has been favorable for white, while Sicilian Defense (e4-c5) has better results for black. Closed Game (d4-d5) has yielded better statistics for white but Indian Defense (d4-Nf6) has better results for black.
Depending on the opening, there is definitely some difference associated with skill levels. For example, d4-d5 has almost equal winning percentages for both colors among beginner players, while black had worse results in d4-d5 among Intermediate players.
The effect of game format depends on the nature of the opening. For example, in e4 games, no color had a clear edge across the different game formats, while in d4 games the effect of game format was a lot more evident.
Other interesting observations included the average length of a game (d4 games lasted 74 plies on average, while e4 games lasted 70) and the comments (black lost on time somewhat more often in d4-d5 games).