Data Analysis Project: Using SQL to Clean and Analyse Data

Armonia
10 min read · Jun 11, 2022


In this project we use a dataset from ‘Real World Fake Data’ , Import it to MySQL Workbench, Clean it a bit then analyse it. At last, as always, I like to use Tableau to visualize the data.

Photo by Pavan Trikutam on Unsplash

Welcome back everyone 🖤, and to those who are new to my page here, ermmm… hello? *stares judgementally*

Today we will go through how to import CSV files into MySQL Workbench, clean the data, analyse it and then visualize it using Tableau. Let's start!

Step 1: Import our data.

The dataset we are using is from my all time favourite, master above masters, Real World Fake Data.

They have A M A Z I N G datasets that are very interesting and span a variety of fields, so I am sure you will find one that suits you! The one I chose is the ‘call center’ dataset, but you can easily choose whichever one you want and just follow my steps here.

The call-center dataset has over 32,900 records describing calls made to various call centres. It includes the ID of the call, the duration of the call in minutes, the name of the person who called, their satisfaction score and many other attributes that you will see as we go.

Anywayyyy, we downloaded our dataset; now it’s time to import it into MySQL Workbench.

Really reallyyyyy important note: I imported the data into MySQL by following this brilliant YouTube video. If you can spare 10 minutes, watching it and following its steps is wayyyyy better than my explanation here, so watch it and then jump to Step 2! If you trust me, continue, with only yourself to blame..

Another, not so important, note: The code is on my GitHub profile. I recommend opening it in a separate tab and following step by step from there, since it’s clearer. That’s where I post all of my projects. Link here.

First off, we need a database to import our data into. We can either create a new one or use an existing one. I chose the former.

Create a database.

Here I just went ahead and created a database called Project. Note that the two dashes at the beginning of a line make it a comment, so it doesn’t affect our code. Next, to work on a specific database we need to select it, and we do that with the keyword ‘use’.

Selecting the database we want to work with.
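Roughly, those two statements look like this (I called my database Project, but any name works):

```sql
-- Create a database for this project (the two dashes make this line a comment)
CREATE DATABASE Project;

-- Select the database we want to work with
USE Project;
```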

After that, we need to create a table that will fit our data and match it. Let me explain that.

Here is a snapshot from the CSV file we have:

our dataset in Excel.

You can see that we have 12 columns: ‘id’, containing the, well, ID of the record; then a customer name column; the sentiment that describes the call; and so on.

To load that data into a table in the database we need to create a table that will match it. Like this:

creating a table.
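Something along these lines. The column names below are my reconstruction from the columns described above, and the sizes are guesses, so match them to whatever is actually in your CSV:

```sql
CREATE TABLE calls (
    id VARCHAR(50),
    customer_name VARCHAR(50),
    sentiment VARCHAR(20),
    csat_score INT,
    call_timestamp CHAR(10),       -- kept as a string for now; we fix it in Step 2
    reason VARCHAR(30),
    city VARCHAR(20),              -- 20 turned out to be too small for a few rows...
    state VARCHAR(20),
    channel VARCHAR(30),
    response_time VARCHAR(20),     -- within / below / above SLA
    call_duration_in_minutes INT,
    call_center VARCHAR(30)
);
```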

Here we are creating a table called ‘calls’, and we are designing it to fit our data by matching the columns and data types. We go column by column through our CSV file and build our calls table in the same order. It is not that critical, since we can change the mapping while we import our data.

We create columns in this way (at least in this example): first we specify the column name, then the data type of that column, and in the parentheses we choose the maximum size of the variable. Check this guide here.

Specifying the size of the variables without checking the CSV file first proved to be a mistake. Declaring that city, or state, or whatever it is, has a max size of 20 meant that rows longer than 20 characters would not be imported. It wasn’t that critical though; I lost almost 20 rows out of 32,900+ records and was too lazy to fix it, no biggie. But I hope you will double check everything before creating a table.

OK, now we have our table calls waiting to be fed with data. Here is where our magic works, quite literally. We go to the left panel of MySQL Workbench, then to the database we are using > the table we just created:

If you can’t find the database or the table, make sure to refresh by pressing the little arrows button next to Schemas.

When you right click, there is an option called ‘Table Data Import Wizard’. That is the wizard that will help us! Click on it and it will open a window for you to browse for the CSV file you want to import; search for it and when you finish press Next:

specifying the path to our dataset.

After you press next, it will ask you for the destination where you want the data to go:

We want the data to go to the table we created, so choose ‘Use existing table’ and then press Next.

Now we need to keep our eyes wide open, because here we are finalizing the process and making sure everything is good:

As you can see, it shows us the CSV columns and their destination columns in our table. But lo and behold, in the highlighted areas it shows that csat_score will go to the sentiment column in our table, and that the reason column will go to the city column *face palms*. So make sure you select the right column from the dropdown menu! Ah right, at the end we can see a preview of our beloved CSV file. After making sure the columns match, press Next and continue. I can’t do that since I already imported my data, so, continue :)

To make sure everything went smoothly, let’s take a look at our table:

The table is way too big to see all of it, so let’s pull just the first 10 rows:
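The query is as simple as it gets:

```sql
-- Peek at the first 10 rows of the calls table
SELECT * FROM calls LIMIT 10;
```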

Looks like everything is good… you thought!! Anyway, let’s go to Step 2 and explain there.

My last note, I swear! There is another, faster and neater, way to import files, but I didn’t vibe with it at the time. Maybe in the future. Explanation here.

Step 2: Clean the data.

Yep, minor inconveniences happen. Back to where we were at the end of Step 1. Let me introduce you to our two small bugs:

  1. When we created the table, you might’ve wondered: Armonia, Your Highness, why is call_timestamp, a date, using a char datatype? And here I will bless you with my answer: back in the CSV file, the call timestamp was in this format: mm-dd-yyyy, which in human words means two digits for the month, two digits for the day and four digits for the year. This is not acceptable in MySQL, whose default format is yyyy-mm-dd. That is why we made it a string for now and will fix it up later in MySQL.
  2. You might’ve noticed that all the empty values which were in the original/csv dataset that were in the csat_score (ah forgot to say, it means customer satisfaction score), got converted to zero’s Instead of null values, which will mess up our aggregations because the minimum score is 1 and not 0.

Let’s fix them:

  1. It’s pretty straightforward, to be honest: we just call the function str_to_date() and give it our column and the way the date is currently formatted in it (pay attention here: not the format *we* want, but the format the string is already in):

We need to set SQL_SAFE_UPDATES off before we change the column. The reason is that we don’t specify a WHERE clause that uses a KEY column. That is why we set it off before the query, and then set it back on after.
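Put together, the fix is more or less this (same calls table and column as above; ‘%m-%d-%Y’ is the format the CSV dates are already in):

```sql
-- Safe updates off, since our UPDATE has no WHERE clause on a key column
SET SQL_SAFE_UPDATES = 0;

-- Convert the mm-dd-yyyy strings into proper yyyy-mm-dd dates
UPDATE calls
SET call_timestamp = STR_TO_DATE(call_timestamp, '%m-%d-%Y');

-- And back on again
SET SQL_SAFE_UPDATES = 1;
```

(If you also want the column itself to become a DATE type instead of a string holding dates, an ALTER TABLE calls MODIFY COLUMN call_timestamp DATE; afterwards would do it; I’m keeping the sketch minimal.)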

And the result? The call_timestamp is finally in date format! Yay.

Now to our next problem: the zeros in csat_score. There are two options: either we set them to NULL, or we leave them be and, whenever we query the table, add the clause WHERE csat_score != 0. But I will set them to NULLs, in this way:

should be calls and not calls2, made a typo.

(Make sure to set the safe updates off and then back on, just like we did earlier.)
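The gist of it, with the corrected table name:

```sql
SET SQL_SAFE_UPDATES = 0;

-- Turn the placeholder zeros into proper NULLs so they don't drag the averages down
UPDATE calls
SET csat_score = NULL
WHERE csat_score = 0;

SET SQL_SAFE_UPDATES = 1;
```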

Now let’s see the table:

Sweet. Now that we finished cleaning, let’s go to Step 3.

Step 3: EDA.

I wonder what the shape of our table is, i.e., the number of columns and rows. Let’s see:
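One way to do it (the column count comes from the information schema, assuming the database is called Project as before):

```sql
-- First line: number of rows
SELECT COUNT(*) AS num_rows FROM calls;

-- Second line: number of columns
SELECT COUNT(*) AS num_columns
FROM information_schema.columns
WHERE table_schema = 'Project' AND table_name = 'calls';
```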

Running the first line gives us:

And the second line gives us:

So we have 32,918 records and 12 columns.

Next up: Distinct values.

We are checking the different values possible in the columns we selected. Let’s run a random one of these, since they are all similar:
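For example, for the call centres (swap in any other column for the rest):

```sql
-- The distinct call centres that appear in the data
SELECT DISTINCT call_center FROM calls;
```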

So it looks like we have only 4 call centres. Let’s continue:

Next, we look at the distribution of our calls across different columns. Let’s see the reason column:
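A sketch of that kind of query, with the same assumed column names as before:

```sql
-- How the calls are distributed across the different call reasons
SELECT reason,
       COUNT(*) AS num_calls,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM calls), 1) AS pct_of_calls
FROM calls
GROUP BY reason
ORDER BY num_calls DESC;
```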

Here we can see that Billing Questions amount to a whopping 71% of all calls, while service outage and payment related calls each make up 14.4% of all calls.

Moving on, which day has the most calls?
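Something like this, grouping by the day name of the (now cleaned-up) call_timestamp:

```sql
-- Number of calls per day of the week
SELECT DAYNAME(call_timestamp) AS day_of_week,
       COUNT(*) AS num_calls
FROM calls
GROUP BY day_of_week
ORDER BY num_calls DESC;
```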

Friday has the highest number of calls while Sunday has the least.

Now let’s move to some aggregations:

Again, most of them follow the same logic with minor changes. Let’s run a few of them, starting with the one at line 106, which queries the min, max and average call duration in minutes:
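That query is essentially this (with the call-duration column name as I guessed it above):

```sql
-- Shortest, longest and average call duration in minutes
SELECT MIN(call_duration_in_minutes) AS min_duration,
       MAX(call_duration_in_minutes) AS max_duration,
       ROUND(AVG(call_duration_in_minutes), 1) AS avg_duration
FROM calls;
```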

Then let’s check lines 108 and 109:

Here we are checking how many calls are within, below or above the Service-Level Agreement (SLA) time. For example, we see that the Chicago/IL call centre has around 3,359 calls within SLA, and Denver/CO has 692 calls below SLA. You get it.
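Those queries boil down to something like this, assuming the SLA status lives in a response_time column:

```sql
-- How many calls are within, below or above SLA, per call centre
SELECT call_center,
       response_time,
       COUNT(*) AS num_calls
FROM calls
GROUP BY call_center, response_time
ORDER BY call_center, num_calls DESC;
```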

At the end I just added one window function, to query the maximum call duration each day and then sort by it.
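Using the same assumed column names, that window-function query might look like:

```sql
-- Maximum call duration on each day, using a window function, sorted by it
SELECT DISTINCT
       call_timestamp,
       MAX(call_duration_in_minutes) OVER (PARTITION BY call_timestamp) AS max_call_duration
FROM calls
ORDER BY max_call_duration DESC;
```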

Here we see that, for example, on Oct 4th the maximum call duration was 45 minutes, while on Oct 8th it was 27 minutes.

That’s it for SQL! You can write as many queries as you want and adapt them to suit your dataset. One thing to be said, though: I wish we had a database with multiple related tables so we could run some joins and more window functions, but I guess we can do that next time!

Before wrapping up, I made a dashboard using Tableau ! Link to it here.

Sneak peek:

Hope you all have a fruitful data analysis journey! Have a nice week ~

Follow me for more projects coming up! Till next time ❤ Peace out
