One thing you need to know before doing data analysis and model building

thi_thinker
7 min read · Oct 18, 2021

--

  • What’s the one thing you need to know before doing data analysis?
  • Let’s understand that practically.
  • I have read a few blogs on this data analysis topic, and I observed that some of them use datasets like Titanic or Iris, while others don’t answer a few “whys” that need to be answered during the process.
  • I haven’t done a completely different job here. Rather, I have explained things in my own way, using a complex dataset and walking through a step-by-step approach to analyze it and build a model.
  • This blog isn’t meant to teach you the techniques, and you don’t need to use the same techniques for your own data analysis. Rather, read, observe and think: how is the analysis performed? How is each step carried out?

The way you think leads you to a solution.

  • I haven’t included any data visualization here, as it would lengthen the blog.
  • So let’s get started.
  • The problem statement is to predict how much a customer will spend, using the GStore data. GStore is Google’s online merchandise store, where you can buy Google products like swag t-shirts. In other words, we need to predict the revenue per customer, so it’s a regression problem.
  • First, let’s load the data and see what it looks like.
  • Here you can see that the raw file has only a dozen or so columns. These aren’t the complete features: some of the columns contain JSON data, and we need to normalize (flatten) them to see how many actual features we have.
  • So let’s create a function to flatten the JSON data.
  • I will be using the train data for the analysis; a sketch of this loading step is shown below.
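
A minimal sketch of the loading-and-flattening step. The JSON column names (`device`, `geoNetwork`, `totals`, `trafficSource`) are assumptions based on the Kaggle GStore dataset; adjust them to your file.

```python
import json
import pandas as pd

# Columns that hold JSON blobs in the raw GStore CSV (assumed names).
JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_flat(csv_path, nrows=None):
    """Read the CSV and flatten each JSON column into 'parent.child' columns."""
    df = pd.read_csv(
        csv_path,
        converters={col: json.loads for col in JSON_COLUMNS},
        dtype={'fullVisitorId': str},   # keep the long id intact while loading
        nrows=nrows,
    )
    for col in JSON_COLUMNS:
        flat = pd.json_normalize(df[col])
        flat.columns = [f'{col}.{sub}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df

train = load_flat('train.csv')
print(train.shape)
```
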
  • Instead of just using df.info() to describe the data, we will create a function that gives a better description of it.
  • Using this function we can read the shape, data type, missing values, unique values, the first 3 values and the entropy of each feature; a sketch of such a helper follows.
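
Here is one way such a helper could look (a sketch, not the exact function from the original post):

```python
import pandas as pd
from scipy.stats import entropy

def describe(df):
    """A richer alternative to df.info(): dtype, missing counts, unique counts,
    the first three values and the entropy of each column."""
    print(f'shape: {df.shape}')
    summary = pd.DataFrame({'dtype': df.dtypes})
    summary['missing'] = df.isnull().sum()
    summary['missing %'] = (summary['missing'] / len(df) * 100).round(2)
    summary['unique'] = df.nunique()
    summary['first 3 values'] = [df[c].dropna().head(3).tolist() for c in df.columns]
    summary['entropy'] = [round(entropy(df[c].value_counts(normalize=True), base=2), 2)
                          for c in df.columns]
    return summary

describe(train)
```
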
  • Now observe the data. It seems a bit complex, right? The majority of the features are categorical, and many columns contain null values. We can’t simply remove those columns, so we need to think before treating them.
  • Likewise, we can’t simply impute the missing values with the mean or some other default. We need to think before we do that.
  • We have null values in the page views, right? But practically, a page view count can’t be zero here: each record represents a customer visit to the website, which means at least one page view must have happened. So let’s impute it as 1.
  • For new visits, bounces and transaction revenue we keep the missing values as zero and convert the columns to integers.
  • For ‘trafficSource.isTrueDirect’ impute NA with ‘false’, and for ‘trafficSource.adwordsClickInfo.isVideoAd’ impute with ‘true’. These are Boolean columns, so we are imputing the null values with the alternate boolean, and we will label encode these values later.
  • Now let’s convert ‘fullVisitorId’ and ‘sessionId’ to numeric, as they are stored as object data types. A sketch of these imputations follows.
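
A sketch of those imputations. The column names follow the flattened GStore schema and are assumptions; note that ‘sessionId’ in this dataset is a string like ‘visitorId_visitId’, so converting it to a number may need extra handling.

```python
import pandas as pd

# A recorded session implies at least one page view, so impute 1 for missing page views.
train['totals.pageviews'] = train['totals.pageviews'].fillna(1).astype('int64')

# Missing here effectively means "didn't happen", so impute 0 and make the columns integer.
for col in ['totals.newVisits', 'totals.bounces', 'totals.transactionRevenue']:
    train[col] = train[col].fillna(0).astype('int64')

# Boolean flags: fill the missing values with the alternate boolean
# (they will be label encoded later).
train['trafficSource.isTrueDirect'] = train['trafficSource.isTrueDirect'].fillna(False)
train['trafficSource.adwordsClickInfo.isVideoAd'] = \
    train['trafficSource.adwordsClickInfo.isVideoAd'].fillna(True)

# fullVisitorId is an object column; cast it to numeric.
# sessionId is 'fullVisitorId_visitId', so encoding or hashing it is one simple option.
train['fullVisitorId'] = pd.to_numeric(train['fullVisitorId'], errors='coerce')
```
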
  • We also saw that some features have just one unique value, and that value is ‘not available in demo dataset’. There is no point in using such features in our model.
  • So let’s remove them, as sketched below.
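
A quick way to find and drop those single-value columns (sketch):

```python
# Columns with only one distinct value (e.g. 'not available in demo dataset') carry no signal.
const_cols = [c for c in train.columns if train[c].nunique(dropna=False) == 1]
train = train.drop(columns=const_cols)
print(f'dropped {len(const_cols)} constant columns')
```
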
  • We have a ‘date’ feature. Should we keep the date as it is, or can we do something more with it?
  • What we can do is use this one feature to extract several features such as week, day, month and year.
  • Note: whenever we get a chance to extract many features from one feature, we should do it. It adds more value to the model.
  • Date-time features give us exactly that opportunity.
  • ‘sessionId’ and ‘trafficSource.campaignCode’ don’t add much value to the model, so we remove them; a sketch of both steps follows.
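
A sketch of the date-feature extraction and the drops just mentioned (the raw date is assumed to be stored as a YYYYMMDD integer, as in the Kaggle files):

```python
import pandas as pd

# Parse the YYYYMMDD value and derive several calendar features from it.
train['date'] = pd.to_datetime(train['date'].astype(str), format='%Y%m%d')
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
train['day'] = train['date'].dt.day
train['weekday'] = train['date'].dt.weekday
train['weekofyear'] = train['date'].dt.isocalendar().week.astype(int)

# These two columns don't add much value to the model, so drop them.
train = train.drop(columns=['sessionId', 'trafficSource.campaignCode'], errors='ignore')
```
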
  • Impute the ‘totals.transactionRevenue’ column with 0 for NaN entries and convert it to numeric values. Can we not impute the revenue column with some other value? No, we can’t: a null value means no revenue, so keep it as zero.
  • Also convert fullVisitorId to numeric values.
  • Now we will take the log of the transaction revenue.
  • But why do we need to take the log of the revenue column? I asked my teammates the same question, and no one was able to answer.
  • It’s because the customers who generated revenue are very few out of all the customers/visitors: more than 90% of the records (892,138 of them) in the transaction revenue column are null.
  • So the target is highly skewed, and the raw values won’t give a good prediction. To avoid this we use the log.
  • The log normalizes the values, pulling the huge revenues onto a much smaller scale so the target is far less skewed; see the sketch below.
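
In code this is a one-liner; np.log1p is used here so the zero-revenue rows map cleanly to 0 (a sketch):

```python
import numpy as np

# Revenue is 0 for the vast majority of visits and heavily right-skewed,
# so model log(1 + revenue) instead of the raw value.
train['totals.transactionRevenue'] = train['totals.transactionRevenue'].fillna(0).astype(float)
train['log_revenue'] = np.log1p(train['totals.transactionRevenue'])
```
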
  • There are many categorical features in our data, so let’s label encode them.
  • But why label encoding? Why can’t we one-hot encode these features?
  • Because if we did, it would create a huge number of extra features. One column alone has around 200 categories; one-hot encoding it would add roughly 200 features from that column alone, leading to the curse of dimensionality, and there are several other categorical features too.
  • So the better option here is to label encode these features, as sketched below.
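
A sketch of the label-encoding pass over the remaining object columns:

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining object/boolean column instead of one-hot encoding,
# to avoid exploding the feature count.
cat_cols = train.select_dtypes(include=['object', 'bool']).columns
for col in cat_cols:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))
```
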
  • Some features in the data have a huge number of null values, around 892,707 or 903,652 nulls, which is more than 90% of the rows. These won’t add much value to the model, so it’s better to remove such features and convert the rest of the columns to float, for example as follows.
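
One way to express that rule (the 90% threshold follows the text; a sketch):

```python
# Drop any column where more than 90% of the values are missing, then cast what's left to float.
null_frac = train.isnull().mean()
train = train.drop(columns=null_frac[null_frac > 0.9].index)

for col in train.columns:
    try:
        train[col] = train[col].astype(float)
    except (TypeError, ValueError):
        pass   # leave non-numeric columns (e.g. the datetime 'date') as they are
```
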
  • Here our target feature ‘y’ is the (log-transformed) “totals.transactionRevenue”, and we will keep the rest of the features as X, then do a train test split.
  • Let’s drop the transaction revenue from X, since we have kept it as ‘y’, and do a train test split to split the data.

Let’s also drop some more columns that don’t add value; a sketch of this step follows.
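
A sketch of the split. Exactly which extra columns get dropped (the raw id and date are used here) is an assumption, not the original column list:

```python
from sklearn.model_selection import train_test_split

# Target: log-transformed revenue. Features: everything except the target
# and a few columns that don't add value (assumed here: the raw id and date).
y = train['log_revenue']
X = train.drop(columns=['log_revenue', 'totals.transactionRevenue',
                        'fullVisitorId', 'date'], errors='ignore')

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape)
```
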

  • Now we are done with the analysis. But which algorithm should we use?
  • Let’s use XGBoost. But why? Because it’s very popular? No, don’t decide like that.
  • Study the maths behind the algorithms, and research their advantages and disadvantages.
  • You may ask: why do I need to spend time researching? I know the algorithm, so I can jump straight into it.
  • See, the XGBoost model gave a good score. Yes, it did, but don’t just look at the score; we can’t decide based on that alone.
  • We use XGBoost here because the data is large and complex, and XGBoost handles that well.
  • But one disadvantage of XGBoost is that it doesn’t work that well with heavily categorical data, and we have many categorical features.
  • For that, LightGBM works pretty well.
  • Also, XGBoost works by building and splitting trees, and one problem with trees is overfitting: if the trees are very deep, the model will overfit and we will get a poor score on the test data.
  • That’s why we need to research the concepts and the maths behind the algorithms; a minimal training sketch follows.
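
For completeness, here is what a minimal XGBoost baseline could look like. The hyperparameters below are illustrative assumptions, not the ones from the original experiment:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=8,            # keep depth moderate; very deep trees tend to overfit
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)
model.fit(X_train, y_train)

pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, pred))
print(f'validation RMSE on log revenue: {rmse:.4f}')
```
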
  • I am not showing every single step or algorithm here to teach you.
  • Rather, what I did here is question every step.
  • That’s what I want to tell you: spend quality time asking questions, researching concepts, and thinking about how they can solve the problem.
  • It will make you think better and solve the problem better.
  • Before model building, did you question why I did not use feature scaling/transformation?
  • It’s because XGBoost is a decision-tree-based model, and tree-based models don’t require scaling: they are invariant to feature scaling, meaning the splits they learn don’t change even if we scale the features.
  • Feature scaling plays an important role in distance-based algorithms like KNN and clustering; a small check of the tree invariance follows.
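
A quick sanity check of that claim (not from the original post): a decision tree trained on scaled and unscaled copies of the same synthetic data gives identical predictions, because tree splits depend only on the ordering of feature values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])   # very different scales
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=200)

scaler = StandardScaler().fit(X_demo)
tree_raw = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X_demo), y_demo)

# Same predictions with and without scaling: tree splits are scale-invariant.
print(np.allclose(tree_raw.predict(X_demo),
                  tree_scaled.predict(scaler.transform(X_demo))))
```
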
  • See, this is how you get the answers: by questioning each step.
  • Don’t just focus on delivering a better-performing model; focus on delivering a better solution to the problem.
