Ad Click Prediction

Gourav Bais
6 min read · Jul 19, 2019


Image credit : Falcon.io

Machine Learning now has a variety of real-world applications, whether for predictive analysis, automation, or business decision making.

“Computers are able to see, hear and learn. Welcome to the future.” — Dave Waters

Knowing Machine Learning is one thing, but the main motive is applying it to a business purpose, to make a profit and to learn something new from the data. With this in mind, I recently worked on a project that was part of a Machine Learning competition. The project was about predicting whether a customer will click on an ad, based on certain features. I found this problem very interesting and therefore built a Machine Learning model to predict ad clicks.

Why are ad clicks important?

A company wants to know its CTR (Click-Through Rate), the share of ad impressions that result in a click, in order to judge whether spending money on digital advertising is worthwhile.

A higher CTR represents more interest in that specific campaign, whereas a lower CTR can show that your ad may not be as relevant. High CTRs are important because they show that more people are clicking through to your website. Along with this, high CTRs can also help you get a better ad position for less money on online platforms like Google, Bing, etc.
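To make the metric concrete, here is a minimal sketch of how CTR is usually computed: clicks divided by impressions, expressed as a percentage. The function name and the numbers are made up for illustration and do not come from the competition data.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage: clicks / impressions * 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return 100.0 * clicks / impressions


# Hypothetical campaign numbers, for illustration only.
ctr = click_through_rate(clicks=50, impressions=1000)
print(f"CTR: {ctr:.1f}%")
```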

Now we will discuss how to implement ad click prediction using a Machine Learning algorithm.

Overview :

The whole project is divided into 6 steps :

  1. Importing dependencies and loading Data set
  2. Exploratory Analysis
  3. Data Cleaning
  4. Train Test Split
  5. Training the Model
  6. Testing the model accuracy

The data set was provided by the competition host; you can find the data sets here.

Step 1 : Importing Dependencies and Loading Data set

To do the predictive analysis, we need to import some Python libraries that help with data visualization, handle the data set, and provide pre-implemented Machine Learning models.

a. Pandas is one of the most popular Python libraries for data analysis. It provides high-performance data frames and can read CSV (Comma-Separated Values) files directly.

b. Matplotlib is used for making some useful graphs for exploratory analysis.

c. Seaborn is used for more detailed plots like count plots and heatmaps.
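A minimal sketch of this step could look like the following. The column names (DateTime, user_group_id, gender, age_level, product_category_2, is_click) and the file name train.csv are assumptions for illustration; the small inline CSV only stands in for the real training file so that the snippet runs on its own.

```python
import io

import matplotlib.pyplot as plt  # plotting, used in the exploratory step
import pandas as pd              # data frames and CSV loading
import seaborn as sns            # count plots and heatmaps

# Tiny stand-in for the competition file. With the real data you would
# call something like pd.read_csv("train.csv") (hypothetical file name).
SAMPLE_CSV = """DateTime,user_group_id,gender,age_level,product_category_2,is_click
2017-07-02 00:00,10,Female,4,,0
2017-07-02 00:01,3,Male,3,82527,0
2017-07-02 00:02,,,,,1
2017-07-02 00:05,2,Male,2,146115,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows, to sanity-check the load
```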

Step 2 : Exploratory Analysis

Exploring the data through different plots gives you a deeper understanding of the data and also helps in choosing a suitable Machine Learning model.

Generally, count plots and heat maps are used to get visual insight into garbage values in the data set and to identify relations between multiple features. Code for plotting the count plots is as follows :
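Here is a small sketch of such a count plot with Seaborn. The data frame is a toy example and the column names gender and is_click are assumptions; swap in the real data frame and column names from your data set.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so this also runs without a display

import pandas as pd
import seaborn as sns

# Toy data standing in for the real training set (assumed column names).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "is_click": [1, 0, 0, 1, 0, 0],
})

# One bar per gender, split by whether the ad was clicked.
ax = sns.countplot(data=df, x="gender", hue="is_click")
ax.set_title("Click result by gender")
ax.figure.savefig("count_plot_gender.png")
```

In a notebook you can drop the backend line and the savefig call; the plot is displayed inline.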

Fig 01 : Count plot showing the click result based on gender

It can be observed from Fig 01 that males are more likely to click on a particular ad than females.

Fig 02 : Count plot showing the click result based on age

It is clear from Fig 02 that people in the middle age levels are more inclined to click on an ad than others.

To check for NaN values in the data set, we will use heat maps :
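A minimal sketch of this check, on a toy data frame with assumed column names: isnull() marks every missing cell, sum() counts them per column, and the heatmap draws each missing cell in a contrasting colour.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, so this also runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Toy data with a few missing values (assumed column names).
df = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "age_level": [2.0, 3.0, np.nan, 1.0],
    "is_click": [0, 1, 0, 0],
})

missing = df.isnull()  # True where a value is NaN
print(missing.sum())   # number of NaN values per column

# Rows are samples, columns are features; missing cells stand out.
ax = sns.heatmap(missing.astype(int), cbar=False, yticklabels=False)
ax.figure.savefig("nan_heatmap.png")
```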

Fig 03 : Heatmap showing the NaN values contained in the data set

The white lines in Fig 03 show the NaN values, which must be removed or replaced in order to train our machine learning model.

Step 3 : Data Cleaning

This is the most important step of any Machine Learning or Data Science project, and it is often said to take about 80% of the overall work. For this project I cleaned the data manually by identifying the relations between multiple columns. Tools and standard procedures are available, but I found the manual approach gave better accuracy.

The modules (functions) used to clean the features of the data set are as follows :
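To give an idea of what cleaning through related columns can look like, here is a hedged sketch. It is not the code used in the project: the helper fill_from_group, the column names, and the rule itself (fill a missing value with the most frequent value among rows that share the same group) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: gender is missing in two rows but the group id is known.
df = pd.DataFrame({
    "user_group_id": [1, 1, 1, 7, 7, 7],
    "gender": ["Male", "Male", np.nan, "Female", np.nan, "Female"],
})


def fill_from_group(frame, target, group):
    """Fill NaN in `target` with the most frequent value within each `group`."""
    def fill(series):
        mode = series.mode()  # mode() ignores NaN by default
        return series.fillna(mode.iloc[0]) if not mode.empty else series

    out = frame.copy()
    out[target] = out.groupby(group)[target].transform(fill)
    return out


cleaned = fill_from_group(df, target="gender", group="user_group_id")
print(cleaned)
```

A rule like this only makes sense when the two columns really are related, so it is worth checking the relation on the rows where both values are present first.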

  1. Product category feature : the ProductCatagory2() module deals with the product category feature and removes all its NaN values based on the conditions defined in the code linked below.
Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3

2. Gender feature : similarly, the Gender() module deals with the gender feature, and userGroupId() cleans the group id data.

Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3

3. Date Time feature : since the Date Time column contains all the time parameters (year, month, date, hour and minute), it must be split into separate columns to make the data set more useful.

Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3
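The Date Time split from point 3 can be sketched with pandas as follows. The column name DateTime and the sample values are assumptions; the linked notebook may do this differently.

```python
import pandas as pd

# Toy data with an assumed "DateTime" column.
df = pd.DataFrame({"DateTime": ["2017-07-02 00:05", "2017-07-03 14:30"]})

# Parse the strings once, then pull each component into its own column.
parsed = pd.to_datetime(df["DateTime"])
df["year"] = parsed.dt.year
df["month"] = parsed.dt.month
df["day"] = parsed.dt.day
df["hour"] = parsed.dt.hour
df["minute"] = parsed.dt.minute

# The raw string column is no longer needed for training.
df = df.drop(columns=["DateTime"])
print(df)
```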

Now that data cleaning is done and all the NaN values have been replaced with meaningful values, it's time to split the data set into two parts: a training part and a testing part.

Don’t be confused by the separate training and testing files: at this point we use only the training file for the train test split.

Step 4 : Train Test Split

Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3

X_train and y_train are used to train the Machine Learning model, while X_test is used as input for making predictions, which are then validated against the y_test values.
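A minimal sketch of the split with scikit-learn's train_test_split, on synthetic data with assumed column names. The 25% test size and the stratify option are my choices for this example, not necessarily what the notebook uses; stratifying keeps the share of clicks similar in both parts, which helps when clicks are rare.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned training file (assumed column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_level": rng.integers(0, 7, size=100),
    "hour": rng.integers(0, 24, size=100),
    "is_click": rng.integers(0, 2, size=100),
})

X = df.drop(columns=["is_click"])  # features
y = df["is_click"]                 # target: 1 = click, 0 = no click

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```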

Step 5 : Training the Machine Learning Model

Since there are two categories in output data which are :

  1. Either Customer will click on the ad (i.e. 1) or
  2. Customer won’t click on the ad (i.e. 0)

This makes it a classification problem. Visualizing the data also gave the intuition that there are decision boundaries, which can guide the choice of Machine Learning model. We will therefore use a Decision Tree Classifier; the code for training the model is as follows:

Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3

The scikit-learn (sklearn) library provides implementations of almost all kinds of Machine Learning models. Import the classifier, create an instance of the model, fit it to the training data, and finally make predictions with the model.
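Those four steps (import, instantiate, fit, predict) can be sketched like this. The features and labels are synthetic, and the default hyperparameters are an assumption; the linked notebook may configure the tree differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and a synthetic click label, for illustration only.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 3))
y = (X[:, 0] > 6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = DecisionTreeClassifier(random_state=42)  # create an instance
model.fit(X_train, y_train)                      # fit to the training data
predictions = model.predict(X_test)              # predict on unseen rows

print(predictions[:10])
```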

Step 6 : Checking Model Accuracy

The final step is to check the accuracy of the Machine Learning model we created for ad click prediction :

Find complete code here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3
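As a sketch, the three checks shown next can be produced with scikit-learn's metrics module. The labels and predictions here are a hand-made toy example, not the project's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy ground truth and predictions (1 = click, 0 = no click).
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
predictions = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(y_test, predictions))

# Share of predictions that match the true label.
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

Looking at precision and recall for class 1 is useful here, because accuracy alone can look good even when the model rarely predicts a click.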

Confusion Matrix :

depicting false positive and true negative

Classification Report :

Accuracy Score :

As you can see, the accuracy of the model is 93%. The problem is that most of our predictions belong to class 0, which indicates that the model is biased toward the majority class, so the high accuracy is misleading. The reason is that our data is unbalanced, i.e. it contains many more samples for 0 than for 1. To tackle this, we can use sampling techniques to balance the data and then retrain the model for better results. You can read about the types of sampling at the following links and choose a suitable one:

https://www.analyticsvidhya.com/blog/2019/09/data-scientists-guide-8-types-of-sampling-techniques/

https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c
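As one possible starting point, here is a sketch of random oversampling of the minority class with sklearn.utils.resample. The data and column names are made up, and this is only one of the techniques covered in the links above, not necessarily the best choice for this data set.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training data: 8 rows without a click, 2 rows with a click.
train = pd.DataFrame({
    "age_level": [1, 2, 3, 4, 5, 6, 2, 3, 1, 4],
    "is_click": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["is_click"] == 0]
minority = train[train["is_click"] == 1]

# Draw minority rows with replacement until both classes have equal size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced["is_click"].value_counts())
```

Apply the resampling to the training part only, after the train test split, so that copies of the same row do not end up in both the training and the testing data.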

But for a beginning this is good, and our Machine Learning model for ad click prediction is ready.

You can download the Python notebook and the data sets from here : https://drive.google.com/open?id=1Gv0jtk73SVfXBEhjV4sX9WU_hP_w5vA3

This is my first blog post, so if you have any feedback, please leave a comment.

Let’s connect on LinkedIn.

Thanks !
