Predict the Click-Through Rate for Keywords Using Machine Learning Methodologies

Vyshnav M T
Published in Analytics Vidhya
6 min read · Oct 21, 2019

In this post I discuss a data science task that I completed recently. I first describe the problem, the goal, and the data set. I then visualize the data and select the important features used to train the machine learning models.

Problem Description:

An advertising company sells a service of buying keywords in search engines on behalf of its customers. The company is trying to optimize its keyword and funds allocation. The first step towards the optimal solution is to predict performance by keyword and fund.

Goal:

For any keyword (not necessarily one in the dataset file), CPC, and market (US/UK), predict the traffic a website would receive (i.e., the clicks).

Data set Description:

Description of attributes in data set
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import nltk
nltk.download('stopwords')

Reading the data set and sorting it by the ‘Date’ attribute.

# Load the data into `data`, the name used by all later snippets
data = pd.read_csv('Dataset.csv')
data.sort_values(by='Date', inplace=True)
data.head()

I first used the describe method to get an intuition for the number of non-missing values, the mean, standard deviation, range, median, and the 0.25 and 0.75 quantiles.

data.describe()

Next, the Date attribute is converted to date-time format (YYYY-MM-DD) so that features such as year, month, week, day of week, and day of year can be extracted.

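# Parse YYYYMMDD values; entries that cannot be parsed become NaT
data['Date'] = pd.to_datetime(data['Date'].astype(str), format='%Y%m%d', errors='coerce')
date_data = data['Date']
date = pd.DataFrame({'year': date_data.dt.year,
                     'month': date_data.dt.month,
                     # .dt.isocalendar().week replaces .dt.week, which was removed in pandas 2.0
                     'week': date_data.dt.isocalendar().week,
                     'dayofweek': date_data.dt.dayofweek,
                     'dayofyear': date_data.dt.dayofyear})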

The categorical variables are converted to dummy variables:

Market = pd.get_dummies(data['Market'],drop_first=True)
train_data = pd.concat([date,Market,data],axis=1)
train_data.drop(['Date','Market'],axis=1,inplace=True)

Then I plotted a histogram of each feature to see its distribution. Zero was by far the most frequent number of clicks. There were more clicks in 2012 than in 2013. Month-wise, most of the clicks occurred from August to December. By day of week, the highest number of clicks occurred on Tuesdays. Most of the clicks came from the US market.

train_data.hist(figsize=(15, 15), bins=50, xlabelsize=8, ylabelsize=8)

To identify the important numerical features, I made pair plots between the target variable ‘Clicks’ and all other features. Features such as Cost, Impressions, CTR, CPC, week, and day of year showed some linear relationship with the target variable.

# Plot the features five at a time against the target
for i in range(0, len(train_data.columns), 5):
    sns.pairplot(data=train_data,
                 x_vars=train_data.columns[i:i+5],
                 y_vars=['Clicks'])

To further verify the important features, I plotted the correlation matrix as a heatmap. Only two features, Impressions and Cost, were highly correlated with the target variable. The heatmap also shows some strong negative correlations.

# numeric_only=True (pandas >= 1.5) skips text columns such as the keyword
corrmat = train_data.corr(numeric_only=True)
top_corr_features = corrmat.index
plt.figure(figsize=(8,8))
# plot heat map
g = sns.heatmap(train_data[top_corr_features].corr(), annot=True, cmap="RdYlGn")

Text preprocessing/cleaning was done on the Keywords:

First, all numbers, punctuation marks, and symbols were removed from the keywords using regular expressions. Then the keywords were lowercased and tokenized. After tokenization, stop words were removed using the NLTK stop-word list.
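The cleaning steps above can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `clean_keyword` is hypothetical, whitespace splitting stands in for NLTK's `word_tokenize`, and the tiny `STOP_WORDS` set stands in for NLTK's full English stop-word list so that the snippet runs without any downloads.

```python
import re

# Illustrative subset only (assumption); the post uses NLTK's English stop words.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def clean_keyword(keyword):
    """Remove non-letters, lowercase, tokenize, and drop stop words."""
    # Replace digits, punctuation, and symbols with spaces.
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", keyword)
    # Lowercase and split on whitespace (a simple stand-in for word_tokenize).
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_keyword("Cheap Flights to London 2013!"))
# ['cheap', 'flights', 'london']
```

Swapping in `stopwords.words('english')` and `word_tokenize` from NLTK should give the behaviour described in the post, provided the required NLTK resources have been downloaded.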

Word embeddings applied to the cleaned keywords:

After cleaning, each word in a keyword needs to be converted into a word vector before a machine learning model can be trained, since the algorithms cannot process plain text or strings in raw form. Word embeddings are vectors that capture the semantic and contextual information of words. Here I used a FastText-based word embedding, which gives a 300-dimensional vector for each word in the keyword. To generate a single vector for the complete keyword, the word vectors in the keyword were averaged.
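The averaging step can be sketched as below. The names `keyword_vector` and `toy_embeddings` are hypothetical, and a small dictionary of 3-dimensional vectors stands in for the 300-dimensional FastText model. Skipping unknown words and returning a zero vector when nothing matches are assumptions of this sketch, not details taken from the post.

```python
import numpy as np


def keyword_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that are present in `embeddings`."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector (assumption).
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy 3-dimensional lookup standing in for the FastText vectors.
toy_embeddings = {
    "cheap": np.array([1.0, 0.0, 2.0]),
    "flights": np.array([3.0, 2.0, 0.0]),
}

vec = keyword_vector(["cheap", "flights", "london"], toy_embeddings, dim=3)
print(vec)  # [2. 1. 1.]
```

Averaging gives every keyword a fixed-length vector regardless of how many words it contains, at the cost of discarding word order.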

In the given problem, the number of clicks is the target variable to be predicted by the trained model, with date, market, keyword, and CPC as the given features. Since the number of clicks takes non-negative continuous values, I modeled the problem as a regression problem. The data contained many NaN values, which were dropped rather than imputed: for some keywords all of the features were NaN, so we cannot tell what the clicks, impressions, or cost would have been for these keywords. Also, since the data set is large, dropping these comparatively few rows has little effect. Because the date on which a keyword was searched could carry useful information, I extracted year, month, week, dayofweek, and dayofyear from the date and appended them to the data. Categorical values were converted to dummy variables.
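Dropping the rows whose metrics are all missing could look like the sketch below. The toy values are made up for illustration, and the exact `dropna` arguments used in the original notebook are not stated in the post, so `subset` and `how="all"` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the middle keyword has no recorded metrics at all.
toy = pd.DataFrame({
    "Keyword": ["cheap flights", "car hire", "hotel deals"],
    "Impressions": [120.0, np.nan, 300.0],
    "Cost": [4.5, np.nan, 9.0],
    "Clicks": [10.0, np.nan, 25.0],
})

metric_cols = ["Impressions", "Cost", "Clicks"]
# Drop only the rows in which every metric column is NaN.
cleaned = toy.dropna(subset=metric_cols, how="all").reset_index(drop=True)
print(cleaned["Keyword"].tolist())  # ['cheap flights', 'hotel deals']
```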

The problem statement asks the model to predict the number of clicks given the keywords and CPC for each market (US/UK) on the date 14/2/2013. I therefore modelled the problem with two different sets of features:

a) X = keywords, market, and cpc

b) X = year, month, week, dayofweek, dayofyear, keywords, market, and cpc

The target variable was set as Y = number of clicks.

I used classical machine learning and deep learning models, namely Random Forest regression, XGBoost regression, 1D-CNN, and GRU, to train on and evaluate the given data. The data was sorted by date and split into train and test sets: all data from 2012 was used for training and all data from 2013 for testing.
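The year-based split described above can be sketched as follows. The toy rows are made up for illustration, and only `CPC` is used as a feature to keep the example short.

```python
import pandas as pd

# Toy rows; column names follow the post, values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2012-11-03", "2013-01-15", "2012-06-21", "2013-02-14"]),
    "CPC": [0.5, 1.2, 0.8, 0.3],
    "Clicks": [10, 4, 25, 0],
})

toy = toy.sort_values("Date")
is_train = toy["Date"].dt.year == 2012

# 2012 rows train the model; 2013 rows are held out for testing.
train, test = toy[is_train], toy[~is_train]
X_train, y_train = train[["CPC"]], train["Clicks"]
X_test, y_test = test[["CPC"]], test["Clicks"]

print(len(train), len(test))  # 2 2
```

Splitting by time instead of at random keeps every test row later than every training row, so the model is evaluated on dates it has never seen, which matches the goal of predicting clicks for a future date.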

The code is available on my GitHub profile: https://github.com/Vyshnavmt94/Predict-the-Click-Through-Rate-for-Keywords-Using-Machine-Learning-Methodologies

Results:

a) X = keywords, market, and cpc

Random forest Regression:

Coefficient of determination R² of the prediction: 0.914177804049537
Mean squared error: 297606.09
Test variance score: 0.80
Mean absolute error: 74.95

XGBoost Regression:

Coefficient of determination R² of the prediction: 0.5800091699348091
Mean squared error: 393330.94
Test variance score: 0.74
Mean absolute error: 213.09

1D-CNN: (70 epochs)

Mean squared error: 183220.71
Test variance score: 0.88
Mean absolute error: 61.83

GRU: (20 epochs)

Mean squared error: 125502.27
R² score: 0.92
Mean absolute error: 70.50

b) X = year, month, week, dayofweek, dayofyear, keywords, market, and cpc

Random forest Regression:

Coefficient of determination R² of the prediction: 0.9976723650491951
Mean squared error: 1008241.56
Test variance score: 0.33
Mean absolute error: 175.60

XGBoost Regression:

Coefficient of determination R² of the prediction: 0.6653407930753578
Mean squared error: 519430.67
Test variance score: 0.65
Mean absolute error: 224.09

1D-CNN: (30 epochs)

Mean squared error: 9360270.80
R² score: -5.25
Mean absolute error: 2336.26

GRU: (30 epochs)

Mean squared error: 1535167.11
R² score: -0.02
Mean absolute error: 193.50
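For reference, the metrics reported above can be computed as in the sketch below. It assumes that the "variance score" is the explained variance score as defined in scikit-learn; the post does not say which function produced it. The helper name `regression_metrics` and the toy numbers are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, R², and explained variance for two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": float(np.mean(residuals ** 2)),
        "mae": float(np.mean(np.abs(residuals))),
        # R² penalizes any error, including a constant offset.
        "r2": float(1.0 - ss_res / ss_tot),
        # Explained variance ignores a constant offset in the predictions.
        "explained_variance": float(1.0 - np.var(residuals) / np.var(y_true)),
    }


y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [1.0, 3.0, 5.0, 7.0]  # every prediction is off by exactly +1
m = regression_metrics(y_true, y_pred)
print(f"R2={m['r2']:.2f}, explained variance={m['explained_variance']:.2f}")
# R2=0.80, explained variance=1.00
```

R² becomes negative when a model's squared error is larger than that of always predicting the mean of the test targets, which is how scores such as -5.25 can arise.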

I hope this post helps in understanding how to visualize the data, extract the best features, and train the machine learning models.
