My first full stack data science project (Part 1)…

Bhavya Rajdev
5 min read · Feb 20, 2022


Link to website: Here
Link to source code: Here
Link to the next article: Here

The objective of this project is to learn to work with real data that updates in real time: using an API to collect the data, cleaning the JSON-like objects that come back, and storing the newly collected records in an existing database such as SQL. Then to learn HTML, CSS, JS, and Django to build a good responsive website, and finally to upload the database to the cloud and host the application there.

Using the API provided by chess.com, I built a chess openings explorer, hosted on Azure, that lets a user compare the openings they have played on chess.com against the openings played, on average, by players in a chosen rating range. This helps the user focus on the specific openings where they can improve their rating.

I'm attaching a video of the website here, in case the cloud server is overloaded and you can't open the site.

The main advantage of this website is that it's free: chess.com requires a paid membership to analyze your openings data, and it doesn't offer the rating-range analysis at all. The site covers the last 30 days of games, and the data is updated daily.

In this report we'll look at the game itself, why this website is useful for a chess player, each component of the project one by one, and how building this website fits into the data science process.

Chess is a two-player strategic board game in which the players alternate moves; whoever checkmates the opponent's king first wins. chess.com is the most popular website for playing online, with over 10 million games played every day, and you earn rating points for winning a game. There are three main phases of a game: the opening, the middlegame, and the endgame. A player who gains an advantage in the opening phase is more likely to win, so studying openings is worthwhile.

There are five major parts to this project:

1> Working with the chess.com API and writing functions to clean the data.
2> Writing functions to convert data frames to SQL tables, making sure that adding new data for existing users and adding new users is very convenient.
3> Writing the model of the website (i.e., how the site should respond to user actions such as searching for data, clicking on moves, and so on).
4> Building the website. This includes designing the responsive webpage and the chess-specific work: encoding the rules, decoding move notation, and encoding moves into strings (the toughest part; it took the longest and the most lines of code).
5> Learning Django, connecting the frontend to the backend via Ajax requests, and learning Azure: uploading the database to the cloud and hosting our website there.

Each part is discussed in its own article. You can follow along with the source code on GitHub while reading.

Part 1:

In this article we'll design, collect, and clean all the data required for the project. These functions live in the 'work_with_chesscom_api.py' file in the source code.

Let's start by exploring the chess.com API. Our goal is to get as many games as possible. The key point is that we can collect all the games played by a specific user. A game object contains the usernames of both players, who had which color, the start time, the moves played, and the result; some of this information is in the PGN string and some is directly in the JSON. Since we need games from more than one user, we need more usernames. The API can return a list of users by country, but that request wasn't working, at least not for me.
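
For reference, chess.com's published-data API exposes each player's games as monthly archives. A minimal sketch of pulling one month (the "openings-explorer" User-Agent string is my own placeholder; chess.com may reject requests that send no User-Agent at all):

```python
import json
from urllib.request import Request, urlopen

def archive_url(username, year, month):
    # Monthly game archive endpoint of chess.com's published-data API
    return f"https://api.chess.com/pub/player/{username}/games/{year}/{month:02d}"

def fetch_games(username, year, month):
    # Each archive is a JSON object with a "games" list; every game carries
    # the players, the result, the end time, and the full PGN string.
    req = Request(archive_url(username, year, month),
                  headers={"User-Agent": "openings-explorer"})  # placeholder UA
    with urlopen(req) as resp:
        return json.load(resp)["games"]
```

For example, `fetch_games("hikaru", 2022, 1)` would return the games from January 2022.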

There's another way. Since I can request the data for my own games, I have the list of all users who have played against me: each game object contains the opponent's username. Think of this process as a DFS over a graph. Here's the function for building a list of chess.com usernames. Not all of them, because the process is slow and I don't have the resources for heavy computation; I stopped once I had searched 25 users in users_set and had 80K-90K users in to_visit_set.
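
The crawl can be sketched like this; `fetch_opponents` is a hypothetical stand-in for the API calls above, returning the opponents found in one user's recent games:

```python
def crawl_usernames(seed, fetch_opponents, limit=25):
    """Walk the 'played against' graph starting from `seed`.

    fetch_opponents(user) -> iterable of usernames that `user` has played;
    in the real script this would wrap the monthly-archive API calls.
    """
    visited, to_visit = set(), [seed]
    while to_visit and len(visited) < limit:
        user = to_visit.pop()          # DFS: take the most recently found user
        if user in visited:
            continue
        visited.add(user)
        for opp in fetch_opponents(user):
            if opp not in visited:
                to_visit.append(opp)
    return visited
```

The `limit` cap is why the crawl stops long before covering the whole site: every visited user can add dozens of new names to the frontier, so `to_visit` grows much faster than `visited`.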

There's a problem. This gives us a significant number of users, but the density of higher-rated users is very low, and we want more high-rated users. A small change to the function fixes this: we track each user's rating in a priority queue and pop the highest-rated user from to_visit_set first. We also change the starting user to 'Hikaru', who has played many high-rated players. (By high-rated I mean above roughly 2000; my own rating is around 1950.) I stopped at around 110K users.
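
A sketch of the rating-aware variant, using Python's `heapq` as the priority queue (ratings are negated because `heapq` is a min-heap; `fetch_opponents` is again a hypothetical helper, here returning `(username, rating)` pairs):

```python
import heapq

def crawl_by_rating(seed, seed_rating, fetch_opponents, limit=25):
    # Max-heap via negated ratings: the highest-rated unvisited user pops first.
    visited, heap, queued = set(), [(-seed_rating, seed)], {seed}
    while heap and len(visited) < limit:
        _, user = heapq.heappop(heap)
        if user in visited:
            continue
        visited.add(user)
        for opp, rating in fetch_opponents(user):
            if opp not in visited and opp not in queued:
                heapq.heappush(heap, (-rating, opp))
                queued.add(opp)
    return visited
```

Seeded with a strong player, the frontier stays dominated by high-rated opponents, which is exactly the bias we want in the sample.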

Now we start collecting the games. Each game has the attributes: user_name, user_white (true or false), datetime, time_control, moves, white_won (1 = white, 0 = black, -1 = draw). Every attribute will be used sooner or later in the project. We have to take care to collect only rated games whose class is rapid, blitz, or bullet. We also have to write functions to clean the PGN (a string that holds all the data of a game), because many of the required attributes are not directly in the JSON.
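
One way to pull the move list out of a PGN string (a sketch; the real cleaning code in 'work_with_chesscom_api.py' may differ): drop the bracketed header lines, strip the `{[%clk ...]}` comments and move numbers, and discard the result token.

```python
import re

def moves_from_pgn(pgn):
    # Keep only the movetext: header lines start with '[' and are dropped.
    body = "\n".join(l for l in pgn.splitlines() if not l.startswith("["))
    body = re.sub(r"\{[^}]*\}", " ", body)    # remove {...} comments (clocks)
    body = re.sub(r"\d+\.(\.\.)?", " ", body)  # remove "1." and "1..." markers
    results = {"1-0", "0-1", "1/2-1/2", "*"}
    return [t for t in body.split() if t not in results]
```

This yields the bare SAN moves (e.g. `['e4', 'e5', 'Nf3', ...]`), which is the format shown in the example data below.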

The process below is actually very slow, and I had to run parallel Jupyter servers executing userlist_all_games to get around 1M games from the last 30 days, played by about 9K users in total.
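
As an alternative to parallel notebooks, the same fan-out could be done with a thread pool, since the downloads are I/O-bound; `fetch_user_games` here is a hypothetical per-user download function:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_all(usernames, fetch_user_games, workers=8):
    # Threads overlap the network waits; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [game
                for games in pool.map(fetch_user_games, usernames)
                for game in games]
```

Note that chess.com rate-limits aggressive clients, so `workers` would need to stay modest in practice.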

Example of data:

user           user_white  date_time            time_control  moves                                               white_won
bhavya1238955  0           2022-01-17 05:25:51  600           e4, e5, d4, exd4, Bc4, Bc5, Bxf7+, Kxf7, Qh5+,...   1
bhavya1238955  0           2022-01-16 12:34:52  600           c4, e5, Nc3, Nf6, g3, Bc5, e3, O-O, a3, a6, b4...   1
bhavya1238955  1           2022-01-15 04:02:21  600           e4, e6, d4, d5, e5, c5, c3, Nc6, Nf3, Qb6, Be2...   0
bhavya1238955  1           2022-01-13 13:56:33  600           e4, e5, Nf3, d6, Bc4, Be6, Bxe6, fxe6, d4, exd...   0
bhavya1238955  1           2022-01-13 13:32:44  600           e4, e5, Nf3, Nc6, Bc4, Bc5, c3, Nf6, d4, exd4,...   1

We also need to write a function that fetches each user's current ratings (rapid, blitz, and bullet).
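
The same published-data API has a per-player stats endpoint (`https://api.chess.com/pub/player/{username}/stats`). A sketch of extracting the three ratings from its JSON, assuming the `chess_rapid`/`chess_blitz`/`chess_bullet` layout; a key is absent when the user has never played that game class:

```python
def ratings_from_stats(stats):
    # stats: parsed JSON from the /stats endpoint. Each present class holds
    # its most recent rating under ["last"]["rating"].
    out = {}
    for cls in ("rapid", "blitz", "bullet"):
        entry = stats.get(f"chess_{cls}")
        out[cls] = entry["last"]["rating"] if entry else None
    return out
```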

We save all this data in CSV format, and we're done collecting everything the project needs.
In the next article we'll see how to convert the data into SQL tables and write functions that make it easy to add new users or new data for existing users. Link to part 2: Here
