My first full stack data science project (Part 2)…

Bhavya Rajdev
3 min readFeb 20, 2022

--

Link to previous article: Here
Link to next article: Here

In the previous article we collected all the data required for the project in CSV format. In this article we'll see how to convert data stored in CSV to SQL database table. We'll also see function to add new users into database and function to insert newly generated data of existing users. These functions are written in 'df_to_sql.py' file in the source code.

I started with MySQL . I learned the language, did many practice problems and came back to project.
First step is to establish a connection in python with MySQL which was done by 'sqlalchemy' module's create engine method.

Initially converting CSV to SQL this seemed easy because pandas library has a method called to_sql. I tried a lot to make it work but that never worked for me. I was able to query a database via pandas for 'select' statements but not insert statements. Another method was to use the execute method in the connection object returned by create_engine to execute 'insert' statement which worked.
Now we create a pipeline which will manipulate each row stored stored in dataframe and convert it into insert statement we can execute.

Same process is repeated for converting ratings dataframe to sql table.

Later on I found better way to directly convert CSV format to SQL which we’ll discuss while migrating MySQL database to Microsoft SQL Server.

Add new users in the database:

Get games and ratings data of all new users by userlist_all_games and userlist_ratings functions described in the previous article, convert data in arrays to data-frames and execute games_table_to_sql and ratings_table_to_sql methods to get the data to SQL.

This is the function I used to collect the data. I used 3-4 parallel Jupyter servers for 500-500 users. Cutting the last step of migrating the data to SQL. Instead store the data into CSV in live Jupyter server, download the CSVs and write code to combine those CSVs using python code.

Add new coming data of existing users.

The goal is to get the last date at which user played and get all the games which has date greater than last played date. This is complicated because there are users in the ratings list who haven't played any games last month. All of theirs last date would be fixed which is 30 days back. because we only want data for 30 days. Algorithm: First step is easy. max(date_time) is the last date at which a user has played(user whose data is there in the games SQL table), group by user and get the data to a dataframe. Ratings table will also have users whose data is not there in the dames table. We take all those usernames and get another dataframe. Merge both dataframes. Second dataframe doesn't have last_date column so their values will be null. Fill those null values with 30 days back date.

Now we have last time at which user played we can collect all the games whose time is greater than last time and get all the ratings of each user and get that data to SQL tables.

We also want to delete old data. The games which are more than 30 days old, we delete those.

In next article we'll write functions that be used to handle user clicks on a website like move clicks, user data search, rating data search and more. Link to next article: Here

--

--