Predicting Conversion Rates With Bayesian Statistics

Greg Werner
IllumiDesk

--

by Samuel Noriega

What is the importance of predicting Conversion Rates?

You may have asked yourself what the best channel to advertise a product is, what the impact of a digital campaign will be, or whether you will earn your advertising money back through product sales. Trust me, we have all been there, so in this post I will show you the key factors and the importance of predicting conversion rates using Naive Bayes and Logistic Regression.

Choosing the correct advertisement directly impacts the revenue of e-commerce sites and user satisfaction. For long-running advertising campaigns it can be very easy to predict conversion rates or click-through rates (CTR) by simply using historical statistical data. But for new campaigns without enough data, such an approach is not robust or reliable.

In this paper we suggest a model for predicting conversions on a new e-commerce site using Naïve Bayes and Logistic Regression.

Naive Bayes

What is Naïve Bayes?

Traditional predictive modeling approaches typically analyze customer activity using variables or features that represent the profile of the customer (e.g. age, sex, location) and an aggregated representation of their prior behavior, for instance, the number and type of products the customer has purchased in a prior period.

Customer intent is modeled using a supervised learning algorithm, which uses historical data to learn patterns of user behavior based on the variables mentioned above. The Naïve Bayesian algorithm, which uses a statistical approach based on Bayes’ theorem on the probability of event occurrence, has gained popularity because it is suitable for classifying or predicting multiple outcomes, each associated with a propensity score. The Bayesian approach is robust and less likely to find false patterns in noisy data (“over-fit” the data).

How to Implement Bayesian event-based predictive models?

In probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

The Bayesian approach gives us the probabilities of events and conditional probabilities, which are the probabilities of events that depend upon other events.

In the equation below, P(A) denotes the probability of an event A occurring, and P(A|B) denotes the conditional probability of event A occurring given that we have observed an event B.

The Bayes rule is expressed as follows:

P(A|B) = P(B|A) · P(A) / P(B)

The events B can be constructed from combinations of customer profile data and event sequences that occur reasonably frequently. The event A represents a desired outcome, what we can call a “Conversion”: for example, a product purchase or a click on a specific link.
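As a quick illustration, here is the rule computed in Python; all of the numbers below are made up for the example:

# Toy Bayes rule computation with made-up numbers.
# A = "visitor converts", B = "visitor matches a given profile/behavior".
p_a = 0.05          # P(A): overall conversion rate (hypothetical)
p_b_given_a = 0.60  # P(B|A): share of converters matching the profile (hypothetical)
p_b = 0.20          # P(B): share of all visitors matching the profile (hypothetical)
p_a_given_b = p_b_given_a * p_a / p_b
print("P(conversion | profile) = %.3f" % p_a_given_b)  # 0.150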

Imagine we have a visitor of a given demographic, with a history of having viewed some products on a website you are tracking. For this particular visitor, you can now build models that answer questions such as:

  • What is the probability that a returning visitor will add a product to the shopping cart?
  • What is the probability that a visitor will purchase a product given their recent on-site behavior?
  • What is the probability that a visitor will purchase any product at all (which can help us separate browsers from buyers)?
  • What is the probability that a visitor will purchase a specific product?
  • What is the probability that a visitor will click on the support link?
  • What is the probability that a visitor will sign up for a newsletter?
  • What is the probability that a visitor will click on any ads on the website?
  • Etc.

Let’s focus on the first option: what is the probability that a returning visitor adds a product to the shopping cart?

The Bayes rule for “Add to Cart” might look like this:
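Using “Visit” for the site-visit event and “Gender” for the shopper’s gender (the factors discussed below):

P(AddToCart | Visit, Gender) = P(Visit, Gender | AddToCart) · P(AddToCart) / P(Visit, Gender)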

Since a lot of factors are involved, or may be involved, computing the probability that events occur together can quickly become unwieldy. So, in order to simplify, we need to make a very strong assumption: the “Naïve Bayes” assumption.

What is the Naïve Bayes assumption for this particular case?

The naïve approach is an assumption that the act of visiting the site is conditionally independent of the shopper’s gender, given that the action of adding to the cart took place. In other words, conditional independence implies that the event representing the site visit and the variable representing the shopper’s gender are independent of one another, given the variable corresponding to adding to the cart.

The prior statement can be expressed mathematically as follows:
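P(Visit, Gender | AddToCart) = P(Visit | AddToCart) · P(Gender | AddToCart)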

Thus converting our overall formula to:
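P(AddToCart | Visit, Gender) = P(Visit | AddToCart) · P(Gender | AddToCart) · P(AddToCart) / P(Visit, Gender)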

Since our last formula contains individual probabilities of its constituents rather than their dependence on each other, it is computationally tractable. If we take a closer look at our formula, we might discover a flaw that can be remedied by the Laplace estimator. If one of the multipliers in the numerator happens to be zero, the whole probability will be zero, which might not be accurate.

This is called the “zero frequency” problem; it’s how you estimate the probability of an event that you have not observed. Setting the probability of such events to zero is not correct, and would give rise to serious problems in the Bayesian approach. Instead, an estimator like the Laplace estimator, which ensures that all probability estimates are non-zero, is used to ensure that the Bayesian formula does not produce a zero probability.
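A minimal sketch of the idea, with toy numbers, might look like this:

# Laplace (add-one) smoothing: add 1 to every outcome's count so that
# an event never observed still gets a small non-zero probability.
def laplace_estimate(event_count, total_count, num_outcomes):
    return (event_count + 1.0) / (total_count + num_outcomes)

# An outcome never seen in 50 observations, with 2 possible outcomes:
print(laplace_estimate(0, 50, 2))  # 0.019... instead of 0.0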

We will cover an implementation of the Naïve Bayes approach applied to anonymized sample data sets in section 2 of this paper.

Logistic Regression with Python

What is Logistic Regression?

Logistic Regression is a common supervised machine learning classification algorithm that attempts to estimate the probability that a given example belongs to a particular category.
The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage.

With Logistic Regression we can answer the same questions proposed in the previous Naïve Bayes approach.

  • What is the probability that a visitor will purchase a product given their recent on-site behavior?
  • What is the probability that a visitor will purchase a specific product?
  • Etc.

Logistic Regression answers these questions by setting a classification threshold for the probability, say 0.80 (or 80%). All examples with an estimated probability of being a positive example above 80% will be classified as positive examples. Where to set the threshold depends on the problem or goal of the classification, and is not inherent to Logistic Regression.
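As a quick sketch of that last step (assuming a fitted scikit-learn classifier model and a feature matrix X_test like the ones used later in this post):

# Sketch: applying a custom 0.80 threshold to predicted probabilities.
# `model` and `X_test` are assumed to exist, as in the code further below.
probs = model.predict_proba(X_test)[:, 1]  # estimated P(y=1|x) per example
labels = (probs >= 0.80).astype(int)       # 1 = positive class, 0 = negative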

Logistic Regression: The math behind it.

Logistic Regression is an algorithm that attempts to estimate the probability (P) that given data (x) is an example of a given classification (y=1). This can be written as:

P(y=1|x)

Where

P(y=1|x) + P(y=0|x)= 1

The odds for a given example are:

odds = P(y=1|x) / (1 − P(y=1|x))

The Logistic Regression algorithm requires or assumes that the natural log of the odds is a linear function of the variables:

ln(P(y=1|x) / (1 − P(y=1|x))) = w0 + w1·x1 + w2·x2 + … + wn·xn

Notice that the weights (w) were added into the equation. The index i represents the i-th variable in the example set (x, y). The probability is now a function of both the data and the chosen weights, because we have decided to model the probability with a specific function.

Rearranging the formula above shows that the probability must be represented as a sigmoid function:

P(y=1|x) = 1 / (1 + e^−(w0 + w1·x1 + … + wn·xn))

We now have a set of examples (x, y) and a model to represent the probability of each example. The weights or coefficients in the model are not yet specified. The goal now is to choose the weights that best reflect the truth presented in the data, which is typically done by maximizing the likelihood of the observed data.

Implementing Naïve Bayes and Logistic Regression to our sample data.

We will implement both algorithms on our data set and compare their results. We are using a 90-day log of website data; we will keep the source anonymous for privacy purposes. In the attached data set we can observe the following columns:

  • Day
  • Total visits per day
  • Total visitors per day
  • Total of new visitors per day
  • Total of returning visitors per day
  • Male visitors per day
  • Female visitors per day
  • Male returning visitors
  • Female returning visitors
  • Conversions

Implementing Logistic Regression to our dataset

We are going to work mostly with the scikit-learn library in Python, so the first step in our process is to import the libraries needed to implement our code.

# First we need to import all the libraries.
# Note: sklearn.cross_validation was removed in newer scikit-learn
# versions; sklearn.model_selection is its replacement.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

The next step is to read and load our dataset. It is a good idea to visualize it.

# load our dataset
mydata = pd.read_csv("website-data.csv")
# visualize the first rows of our dataset
mydata.head()

Let’s take a look at the head of our dataset to verify that the data loaded correctly.

Dataset

We can also obtain the full description of each column (with mydata.describe()), which allows us to visualize the mean, standard deviation, etc.

Dataset full description

For this particular example, we are going to delete the date column, since the date of the visit is not a factor in achieving a conversion on the website. Once the information is updated, we need to put it in the form of arrays.
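A minimal sketch of that step might look like this; the column names “Day” and “Conversions” are assumptions based on the list above, so adjust them to match the actual CSV headers:

# Drop the date column and split features from the target.
# 'Day' and 'Conversions' are assumed column names; adjust to your CSV.
mydata = mydata.drop('Day', axis=1)
X = mydata.drop('Conversions', axis=1).values  # feature array
y = mydata['Conversions'].values               # target array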

Now we can implement the Logistic Regression model from the scikit-learn library. We are going to evaluate the model with four different approaches so we can compare their accuracy estimates.

Evaluate using a train and a test set:

# Evaluate using a train and a test set
test_size = 0.30
seed = 7
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print("Accuracy: %.3f%%" % (result * 100.0))

Our result: 14.286%

Evaluate using k-fold Cross Validation:

# Evaluate using k-fold Cross Validation
num_folds = 10
seed = 1
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Our result: 21.111% (15.275%)

Evaluate using Leave One Out Cross Validation:

# Evaluate using Leave One Out Cross Validation
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Our result: 23.077% (42.133%)

Evaluate using Shuffle Split Cross Validation:

# Evaluate using Shuffle Split Cross Validation
num_samples = 10
test_size = 0.3
seed = 7
kfold = model_selection.ShuffleSplit(n_splits=num_samples, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Our result: 20.714% (5.487%)

  • K-fold cross validation is the most commonly used method to estimate performance on unseen data, with k set to 3, 5, or 10.
  • Using a train/test split is good for speed when using a slow algorithm, and produces performance estimates with lower bias when using large datasets.
  • Leave-one-out cross validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

Let’s now implement Naïve Bayes to the same dataset

As we stated before in this paper, Naïve Bayes classifies instances based on the assumption that the features are conditionally independent of each other given the class.

A Gaussian distribution is assumed to estimate the probabilities for input variables using the Gaussian Probability Density Function.
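For reference, that density can be written as a small Python function (a sketch of the formula, not scikit-learn’s internal implementation):

import math

# Gaussian probability density: the likelihood GaussianNB assigns to a
# feature value x, given the mean and standard deviation estimated for a class.
def gaussian_pdf(x, mean, std):
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

print(gaussian_pdf(1.0, 0.0, 1.0))  # ~0.242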

For our example, we will use the GaussianNB class and report the mean estimated accuracy from 10-fold cross validation.

Gaussian Naïve Bayes Classification

# Gaussian Naive Bayes Classification
num_folds = 10
seed = 7
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = GaussianNB()
results = model_selection.cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))

Our result: 19.778% (8.282%)

Now that we have results for both Naïve Bayes and Logistic Regression, let’s compare their outputs.

First we need to prepare the configuration for cross validation test harness and prepare both models.

# prepare configuration for cross validation test harness
num_folds = 10
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('NB', GaussianNB()))

Then we need to evaluate each model:

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

This gives us a score for each model and allows us to plot the results.

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

LR: 0.211111 (0.152753)
NB: 0.197778 (0.082821)

Conclusion:
In this paper we focused on using existing libraries, which saves time, is robust, and reflects common industry practice. However, there are cases where one might not want to use the default models provided by a package and would rather implement the code from scratch.

The dataset used in this example is very limited and small due to restrictions imposed by the owner of the site, but the same procedure can be applied to much larger datasets with similar results.
