Cleaning and Injecting Data into Microsoft SQL Server from Google Colab

Muhammad Saad Uddin
7 min read · Mar 12, 2022


Has it ever happened to you that you were trying to upload a large CSV into Microsoft SQL Server (or any other database) and ran out of RAM?

Maybe?

Maybe not?

If you have established pipelines at your workplace or in your cloud infrastructure, then you might think I am crazy, but I am not 😊. Imagine you somehow have to do it manually on a local machine. What would you do?

You will end up in my scenario of running out of RAM. Solutions?
1. Divide the large CSV into smaller ones and upload them sequentially. (Too time-consuming :/)
2. Convert the CSV to HDF5 format. (Yup, but not the topic of this article)
3. Something creative? (Let’s try)

Since I had this issue and had time to try something creative, I decided to try it out with Google Colab. The reason was that I also had the opportunity to clean the data in a Python environment before injecting it into SQL Server. In this article, I will discuss all the problems I faced, from importing the data into Colab, to cleaning it, to uploading it to Microsoft SQL Server, so you can do it in peace.

Let’s Begin

First, we need to open a Google Colab session (you need a Google account to access Colab resources). Then we will install and import all the necessary modules. Since Google Colab doesn’t have pymysql pre-installed, we will install it on the runtime with !pip, as shown below.

installing pymysql
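The cell itself is a one-liner; in Colab the leading ! runs the command on the underlying shell:

!pip install pymysql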

pymysql is a database connector, which in our case will work with PyODBC and SQLAlchemy to establish a database connection and inject data into SQL Server. After installing pymysql, we will import the following modules, as shown below.
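A minimal version of that import cell would look like this (the pd and np aliases are simply the usual convention, nothing in the article depends on them):

import pandas as pd                    # data reading, cleaning and reshaping
import numpy as np                     # numeric helpers used alongside pandas
from sqlalchemy import create_engine   # builds the remote connection to the database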

Pandas and NumPy will be used for reading, cleaning and reshaping the data as per our needs. SQLAlchemy will play a significant role in the communication between Python and the database. SQLAlchemy is usually used as an Object Relational Mapper (ORM), a tool that translates Python classes into tables in relational databases. create_engine() from SQLAlchemy will be used to create the remote connection to the database.

If you are familiar with connecting databases to Python, at this point you will surely be asking why I didn’t install or import the PyODBC module (I will keep you waiting for a few more seconds 😊). For those who don’t know what PyODBC is: it creates a connection to an ODBC database using the standard DB API 2.0. It is similar to SQLAlchemy, but SQLAlchemy is one step ahead since it provides many more features, like the ORM.

So, the reason for not installing or importing PyODBC till now is that I was getting an error when connecting my SQLAlchemy engine to the database. The error was:

Error: (‘01000’, “[01000] [unixODBC][Driver Manager]Can’t open lib ‘SQL Server Native Client 17.0’ : file not found (0) (SQLDriverConnect)”)

Colab was not able to locate the SQL driver. After doing some research I found that many people were facing similar issues and were unable to connect to a database from Colab remotely. So I decided to dig further and started looking through different documentation, Microsoft’s and others, to find a solution.

After several rounds of trial and error, I found that I can install the ODBC package directly from Microsoft into my Colab workspace using the following commands:

%%sh
# add Microsoft's signing key and package list, then install the ODBC driver for SQL Server
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get -q -y install msodbcsql17

This is the major key: it installs the required SQL driver we need for running the DB operations, and whose absence caused the error above. In this way we install the msodbcsql17 driver directly into our Colab environment. Now we will install the following modules, as sketched below.
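My original install cell isn’t reproduced here, but based on the packages discussed next it was roughly the following (these are the standard unixODBC and pyodbc package names):

# ODBC driver manager that pyodbc relies on
!sudo apt-get install -y unixodbc-dev
# Python bindings for ODBC
!pip install pyodbc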

unixODBC is installed because it is an open-source project that implements the ODBC API we will be using with PyODBC. Now we import PyODBC.

importing Pyodbc
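Once the driver and unixODBC are in place, this is a single line:

import pyodbc   # ODBC connections via the freshly installed msodbcsql17 driver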

Now we will import the dataset we are planning to inject into the database; in my case I will be using the MovieLens 25M ratings dataset [1]. The reason is that this dataset is big enough to make the cleaning and transformation work challenging. I have downloaded the dataset to my Colab workspace as:
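The cell below is a sketch of that step; the URL is the publicly documented GroupLens download location for the 25M dataset, so double-check it in case the link has moved:

!wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip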

Since the dataset is a zip file, we will use Python’s zipfile module to unzip the data:
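Roughly like this (the archive path in local_zip is an assumption based on where wget put the file):

import zipfile

local_zip = 'ml-25m.zip'                 # path to the downloaded archive (assumed)
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('test_1')             # extract everything into the base directory test_1
zip_ref.close()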

Here, I have assigned the location of the zip file to the local_zip variable and used zipfile.ZipFile to read the archive. Then we save the extracted files (in my case they are CSV files) into a base directory named test_1.

Now that our files are extracted from the zip archive and sit in the folder we specified, we will read the ratings.csv file, which contains 25 million ratings, into a pandas dataframe:

Reading the data
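Something along these lines, assuming the archive extracted into test_1/ml-25m/ as it does by default:

ratings = pd.read_csv('test_1/ml-25m/ratings.csv')   # ~25 million rows: userId, movieId, rating, timestamp
movies = pd.read_csv('test_1/ml-25m/movies.csv')     # movieId to title mapping

print(ratings.shape)   # get familiar with the size before doing anything else
ratings.head()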

After reading, it is always a good idea to check the head() and shape of the dataframe you are going to work on, just to get familiar with it. I’ve decided to put two dataframes into SQL Server as they are, the original movie ratings by user and the movie titles with IDs, and to create two more dataframes from the original dataset so that we have more valuable information at hand in SQL Server (keep in mind we could do these operations in SQL too). So, we do some analysis on our dataframe and create two new dataframes: df_user, which contains the number of ratings and the average rating given by each user, and df_movies, which contains the number of ratings and the average rating given to each movie.

cleaning and wrangling data from original frame
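A compact way to build both summary frames is a groupby over the ratings; the column names num_ratings and avg_rating are my own naming, not something fixed by the dataset:

# per-user statistics: how many ratings each user gave and their average rating
df_user = (ratings.groupby('userId')['rating']
                  .agg(num_ratings='count', avg_rating='mean')
                  .reset_index())

# per-movie statistics: how many ratings each movie received and its average rating
df_movies = (ratings.groupby('movieId')['rating']
                    .agg(num_ratings='count', avg_rating='mean')
                    .reset_index())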

In this way we will have analytical information for each user and each movie at hand, which can be injected into SQL Server, so we don’t need to run SQL queries to create them again and again. Similarly, if your dataset has some predefined operations in the database, it is wise to keep some transformed tables in SQL to avoid repetitive work.

So now we have all the tables we need to inject into our SQL database. We will now initiate the connection to our database:

establishing connection to database
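The connection cell looks roughly like this; every credential below is a placeholder, and the mssql+pyodbc dialect string pointing at the ODBC Driver 17 we installed earlier is my assumption about how the engine was built:

server = 'xxx.xxx.xxx.xxx'       # IP or hostname of your SQL Server instance (placeholder)
database = 'movielens'           # name of the target database (placeholder)
username = 'my_user'             # placeholder
password = 'my_password'         # placeholder

engine = create_engine(
    f'mssql+pyodbc://{username}:{password}@{server}/{database}'
    '?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True        # send rows in bulk instead of one by one
)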

As discussed above, we will have a combination of pymysql and PyODBC. The remaining parts of create_engine() are the username and password of your database, the database server (the IP or address of your server) and the name of the database you want to access. After running this, our connection to the Microsoft SQL Server is established, and we can query or inject data to/from our DB. In create_engine(), notice that I’ve set fast_executemany=True [2]. Because we will be dealing with a lot of data, we need faster execution to push it all into the database from the Colab environment. Now that our connection is established, let’s start by injecting the transformed data we created into SQL Server.

We will use the pandas to_sql() function, which supports SQLAlchemy and writes dataframe records into a SQL database [3]:
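For the two summary frames that is a single call each; the table names and the chunk size below are illustrative choices, not fixed by anything above:

# write the per-user and per-movie summaries into SQL Server
df_user.to_sql('user_stats', engine, method='multi',
               chunksize=1000, if_exists='append', index=False)
df_movies.to_sql('movie_stats', engine, method='multi',
                 chunksize=1000, if_exists='append', index=False)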

The first string is the name of our table in the database; next comes engine, the instance of the connection we have established. method is set to 'multi' to pass multiple values in a single INSERT clause (by default it is not set). I’ve set chunksize to avoid out-of-memory errors: if chunksize is not defined, pandas to_sql() will try to write all rows at once, which can fail, so specifying a chunk size writes one chunk at a time. if_exists defines how to deal with a table that already exists in the database; I’ve appended the new data, but you can replace it too.

Now I will push the other two original dataframes, the movie IDs and the ratings, to the database. The movie ID dataframe is as easy to inject as the previous ones, but the problem occurred when I tried to inject the ratings dataframe of 25 million rows. So, I had to be creative, divide the dataframe into chunks and apply a for loop to inject the data into the database:

Looping over rows
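A sketch of that loop; the batch size of one million rows is an assumption you should tune to your own RAM and network:

chunk_size = 1_000_000                                  # rows pushed per iteration (assumed)
n_chunks = int(np.ceil(len(ratings) / chunk_size))

for i in range(n_chunks):
    chunk = ratings.iloc[i * chunk_size:(i + 1) * chunk_size]
    chunk.to_sql('ratings', engine, method='multi',
                 chunksize=1000, if_exists='append', index=False)
    print(f'chunk {i + 1}/{n_chunks} written to the database')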

Just be cautious, as each loop iteration took around 5–6 minutes to inject its data into the database for this dataframe. So, if you have a better trick for this part, do share it in the comments for me and others.

After injecting all of this into the database, we can query it from Microsoft SQL Server and perform routine tasks in SQL. Here are some snippets of these tables from my server.

query of data injected into SQL server
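If you want to run the same sanity check from Colab instead of SQL Server Management Studio, a quick read-back through the same engine works; the table name is the one assumed in the loop above:

pd.read_sql('SELECT TOP 5 * FROM ratings', engine)   # TOP is the T-SQL way to limit rows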

That’s all for this part. If you are interested in how to leverage this SQL database to create a web app, this article will be helpful: Deploying a Web app using MySQL via Streamlit.

