Engineering, Analyzing, and Visualizing Your Data
In this blog of the Quick-Code (Data Science) series, we are going to convert raw data into working data, load it into a database, and then use it for modeling and visualization.
On our sample dataset, we combine the working principles of a:
Data engineer (cleaning the data and loading it into a database)
Data scientist (deriving insights from the dataset)
Data analyst (analyzing and visualizing the results)
There are five steps to this process:
1. Ad-hoc Excel analysis (Cleaning)
2. Pushing the data through an ETL pipeline using Microsoft SSIS (Exporting-I)
3. Storing the output data from the ETL pipeline in a SQL database (Exporting-II)
4. Modeling the data from the SQL database using Gretl (Analysis)
5. Visualizing the data using Tableau (Visualization)
1. Ad-hoc Excel analysis (Cleaning):
Microsoft Excel can be used for high-level data analysis and cleaning. If you don’t have Excel installed on your PC/Mac, you can use Google Sheets, which is available for free.
The main goal of step 1 is a thorough analysis of the data given to us directly by a client. By thorough analysis I don’t mean applying some crazy algorithms and writing hundreds of lines of code; simply looking at what the data is telling us is far more important than applying models and algorithms.
Understand the data just by looking at the column names, and identify the dependent variable (a variable whose value depends on the other variables) and the independent variables.
The above data is given to us by an XYZ bank. They’ve asked us to analyze the information and give them valuable insights about their customers. (Note: a normal dataset may have thousands of rows. Since this is just an example, I’ve selected a dataset with only 20 rows for easier understanding.)
Dependent variable: Balance (balance depends on the other variables)
Independent variables: CustomerID, Name, Surname, Gender, Age, DateJoined
We can do some quick cleaning here in Excel, like changing the date format in the DateJoined column and removing the dollar sign and decimal places (just for the sake of our example) from the Balance column.
To change the date, right-click the DateJoined column -> Format Cells -> enter the custom date format (yyyy-mm-dd).
To change the balance, right-click the Balance column -> Format Cells -> Number, and set the decimal places to zero.
If you save and re-open the Excel file, you’ll find that the dates haven’t stayed changed; that’s because Excel adjusts the date format according to your machine’s locale. To get around this problem, install Notepad++, open the Excel file in it, and use Save As to store it as a .txt file.
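(If you’d rather script this cleanup and sidestep Excel’s locale quirks entirely, here’s a minimal pandas sketch. The file names and the $-prefixed Balance format are assumptions for illustration.)

```python
import pandas as pd

# Hypothetical raw export from the client; adjust the file name to yours
df = pd.read_csv("raw_bank_data.csv")

# Reformat DateJoined as yyyy-mm-dd; unlike Excel, pandas won't silently
# rewrite the format based on your machine's locale
df["DateJoined"] = pd.to_datetime(df["DateJoined"]).dt.strftime("%Y-%m-%d")

# Strip the dollar sign (and any thousands separators), then drop the decimals
df["Balance"] = (
    df["Balance"].astype(str)
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float).round().astype(int)
)

# Save as a tab-separated text file, ready for the ETL step below
df.to_csv("bank_customers.txt", sep="\t", index=False)
```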
That’s the end of the first step: we’ve done an ad-hoc Excel analysis. The data we have after step 1 is shown in the image below.
2. Pushing the data through an ETL pipeline using Microsoft SSIS (Exporting-I)
Before pushing the data through the ETL pipeline, we first need to create a SQL database, so that we can establish a connection between our flat file (Excel) and our OLE DB destination (the SQL database).
Install Microsoft SQL Server and then connect to your server
Create a new database by right-clicking Databases, and name your database (I’ve named it AV_Competition)
Press Ctrl + N or select New Query on the toolbar to create our first SQL query (note: this is not the actual query for our database; it’s just for understanding the environment)
Click Execute on the toolbar to run the code. Then expand AV_Competition, right-click the Tables folder, and click Refresh to see our newly created table.
To see the contents of our table, right-click dbo.SampleTable and click Select Top 1000 Rows.
We’ve successfully created our first sample table using SQL.
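For reference, roughly the same round trip can be scripted in Python with pyodbc. The driver name, server name, and table layout here are assumptions, so adjust them to your machine:

```python
import pyodbc

# Connect to the local SQL Server instance (assumed driver and server name)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost\\SQLEXPRESS;"
    "DATABASE=AV_Competition;"
    "Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Create a small sample table (hypothetical columns, just to test the environment)
cursor.execute("""
    CREATE TABLE dbo.SampleTable (
        CustomerID INT PRIMARY KEY,
        Name NVARCHAR(100),
        Balance INT
    )
""")
cursor.execute("INSERT INTO dbo.SampleTable VALUES (1, 'Alice', 4500)")
conn.commit()

# The scripted version of right-click -> Select Top 1000 Rows
for row in cursor.execute("SELECT TOP 1000 * FROM dbo.SampleTable"):
    print(row)
conn.close()
```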
Now, let's load our actual data into SQL using Microsoft SSIS.
Install Microsoft SSIS and create a new Integration Services project for our work
The idea is to create a Data Flow Task inside which our data transfer from a flat file (Excel) to an OLE DB destination (our database) happens.
Drag and drop the Data Flow Task from the left pane to create our first data flow task
Double-click the Data Flow Task; this is where our transfer happens. Drag and drop the Flat File Source and the OLE DB Destination from the left pane. Then drag the blue arrow from the Flat File Source box onto the OLE DB Destination box.
Now, let’s add our data to the Flat File Source. Double-click the Flat File Source and select New connection manager. Browse to your file (now stored as a text file via Notepad++) and set the text qualifier to “ (a double quote).
Click the Columns tab and check that our DateJoined and Balance columns have changed.
Click Advanced, select all the column names by holding Shift + Down Arrow, and set the OutputColumnWidth to 1000 (string values can sometimes be long, e.g., reviews or dialogues; we do this to accommodate them)
Preview it and click Ok. Our flat file source has been successfully created.
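(Under the hood, the Flat File Source is just parsing a delimited, quote-qualified text file. Here’s a pandas sketch of the same parse, assuming the tab-separated file from step 1; the file name and delimiter are assumptions:)

```python
import pandas as pd

# Parse the flat file the way the SSIS source does: delimited text
# with " as the text qualifier
df = pd.read_csv(
    "bank_customers.txt",  # hypothetical file from step 1
    sep="\t",
    quotechar='"',
    parse_dates=["DateJoined"],
)
print(df.dtypes)   # confirm DateJoined and Balance came through correctly
print(df.head())
```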
3. Storing the output data from the ETL pipeline in a SQL database (Exporting-II)
Now, to set up the OLE DB destination (our database in SQL), we’ll double-click the OLE DB Destination box. Click New on the connection manager and type in your SQL Server name, which you can find in Microsoft SQL Server (my server name is SHREEDHAR\SQLEXPRESS), then select the database we’ve already created, AV_Competition. Click Ok.
Now, to create the destination table, click New next to the name of the table or view. The magic of SSIS is that the SQL code gets filled in automatically for us.
Click Ok, go to Mappings, and check that the columns are mapped correctly.
That’s it; our setup is ready for the data transfer. Press the Start button on the toolbar to begin the transfer.
Once it succeeds, you’ll see green tick marks on the source and destination boxes.
Now, let’s check our database. Go back to SQL Server and refresh Tables to see our new table, dbo.OLE DB Destination (note: we can always change the table name in SSIS). Right-click the table and click Select Top 1000 Rows.
We’ve successfully set up the transfer between Excel and the database. Now all that’s left is to analyze and visualize the data.
4. Modeling the data from the SQL database using Gretl (Analysis)
Now let’s analyze the data. There are various ways to do this step: one could use Python to code regression (if the output is continuous) or classification (if the output is discrete) models and predict the results. But here we’ll keep it simple and use the Gretl software to do that for us. We’ll apply multiple linear regression so that the software can predict the Balance for us given the independent variables.
Install Gretl and open your dataset in it. Then click Model -> Ordinary Least Squares (the regression model). Select your dependent and independent variables.
We’re not choosing Name, Surname, or CustomerID as independent variables because they don’t bring any value to our model: there is no correlation between your name and the amount of balance you could have.
Once selected, click Ok.
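(If you’d rather script this step than use Gretl’s GUI, here’s a minimal sketch with Python’s statsmodels. The file name and the Male/Female labels for Gender are assumptions:)

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical export of the cleaned data from step 1
df = pd.read_csv("bank_customers.txt", sep="\t", parse_dates=["DateJoined"])

# Name, Surname and CustomerID are left out: they carry no predictive signal.
# Gender is text, so encode it as 0/1; DateJoined becomes a plain number.
X = pd.DataFrame({
    "IsMale": (df["Gender"] == "Male").astype(int),  # assumes Male/Female labels
    "Age": df["Age"],
    "JoinedOrdinal": df["DateJoined"].map(pd.Timestamp.toordinal),
})
X = sm.add_constant(X)   # intercept term, like the const Gretl includes
y = df["Balance"]

model = sm.OLS(y, X).fit()
print(model.summary())   # p-values, standard errors, R-squared, etc.
```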
You’ll get a detailed statistical analysis, such as the p-value, standard error, and r-squared value, which you can use to evaluate your model.
Click Analysis -> Forecast -> Ok. You’ll see the values predicted by our linear regression model under the prediction column.
We’ve successfully modeled our data. Note that our model is not statistically strong (there’s a huge difference between the actual Balance and the predicted Balance), and that’s because the model was given only 20 samples; it’s hard to predict well from such a small dataset.
If you add more samples to the data and re-run the model, it will give you much better results. There are also various elimination methods we could apply to improve the model (one of them is sketched below); check them out if you’re interested in learning more about model tuning.
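(One common approach is backward elimination: fit the model, drop the least significant predictor, and repeat. A minimal sketch, assuming statsmodels and a pandas DataFrame of predictors:)

```python
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Repeatedly drop the predictor with the highest p-value above threshold."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")  # keep the intercept
        worst = pvalues.idxmax()
        # Stop when every remaining predictor is significant,
        # or when only one predictor is left
        if pvalues[worst] <= threshold or len(pvalues) == 1:
            return model
        X = X.drop(columns=[worst])
```

With only 20 rows this won’t rescue our model, but it’s a useful habit on larger datasets.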
5. Visualizing the data using Tableau (Visualization)
Tableau is a very useful tool for building interactive visualizations that we can present to the team, so we can discuss the insights we’re providing through charts. A popular alternative to Tableau is Power BI.
Install Tableau Public (the free version) and connect to your file.
We are going to visualize the different Names and their Balance. Drag and drop Name to Columns and Balance to Rows.
Let’s see which gender has more balance than the other. Drag and drop Gender to Columns and Balance to Rows.
We can clearly see from the chart below that male customers hold more balance in our bank. Based on this insight, the bank might focus more on its female customers to find out what’s going wrong.
Let’s check which age group has the highest balance. To do that, we’ll first create bins with an interval of 10 (groups of 10 years). Right-click Age (in the left pane) -> Create -> Bins (set the interval to 10).
Now drag and drop Age (bin) (in the left pane) to Columns and Balance to Rows.
We can quickly see that customers aged 30 to 40 hold more balance than the other age groups.
We can do the same with a five-year interval, just by changing the bin interval to 5, to get a clearer picture. (Note: the empty space in the 25–30 age group indicates that there is no data available for that group.)
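(The same charts can be reproduced in a few lines of pandas/matplotlib if you want a scripted version; the file name is again an assumption:)

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bank_customers.txt", sep="\t")  # hypothetical cleaned file

# Balance by gender: the "which gender holds more balance" chart
df.groupby("Gender")["Balance"].sum().plot(kind="bar", title="Balance by Gender")
plt.show()

# Balance by 10-year age bins, mirroring Tableau's Age (bin) with interval 10;
# change the step to 5 for the five-year version. observed=False keeps empty
# bins visible, like the 25-30 gap noted above.
df["AgeBin"] = pd.cut(df["Age"], bins=range(0, 101, 10), right=False)
df.groupby("AgeBin", observed=False)["Balance"].sum().plot(
    kind="bar", title="Balance by Age Group"
)
plt.show()
```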
We’ve successfully created our visualization for the dataset.
To recap our blog:
We took our raw dataset, did some ad-hoc cleaning (formatting the date and removing the dollar sign) in Microsoft Excel, moved the data into a SQL database through an ETL pipeline, analyzed it with Gretl using linear regression, and finally visualized the data to provide more meaningful insights to our clients.
You can apply the same steps to your own dataset to quickly clean, export, analyze, and visualize your data.
This blog is part of the Quick-Code (Data Science) series; check out more of the series on my Medium page.