Using Decision Trees to Predict Conversion Rate

Deandra Alvear · Published in The Startup · Jul 14, 2020 · 8 min read

A case study in predicting and optimizing customer conversion rate

Image: Deandra Alvear

Introduction

Vandelay Industries has collected some data about people that visit their online store. The data includes basic information about shoppers such as their country, age, how many pages they visited during a session, if they are a new or returning user, which marketing channel they entered the site through, and whether or not they made a purchase (converted).

Goal

Given this information, our task is to predict conversion rate, and make recommendations to the product team and the marketing team on ways to improve conversion rate.

Exploratory Data Analysis

The data set has been pre-cleaned and does not contain missing values. So we can dive right into EDA and plotting the feature distributions.
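
The plotting code itself lives in the notebook; as a rough sketch of the idea (assuming the pre-cleaned data is loaded into a pandas DataFrame named data from a hypothetical conversion_data.csv), the distributions could be drawn like this:

import pandas as pd
import matplotlib.pyplot as plt

# Assumption: the pre-cleaned data set is stored in a CSV with the six columns described above
data = pd.read_csv('conversion_data.csv')

# One panel per column: bar charts for categorical features, histograms for numeric ones
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), data.columns):
    if data[col].dtype == 'object':
        data[col].value_counts().plot(kind='bar', ax=ax)
    else:
        data[col].plot(kind='hist', bins=30, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()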

Image: Deandra Alvear

By plotting the distribution of each feature, we gain several new insights about the shoppers that visit Vandelay Industries’s website.

1. More than half the shoppers in the data set are located in the US: Vandelay Industries is most likely a US-based company.

2. age ranges from 17 to 123: upon closer examination, there are two shoppers between 100 and 123 years old. There's no way to verify these records, but in a data set with 300,000+ shoppers, it is unlikely these two records will affect our results, so I'll keep them in.

3. There are twice as many new shoppers as returning shoppers: even so, roughly a third of sessions come from returning visitors, so Vandelay Industries is doing reasonably well at getting shoppers to come back to its website.

4. Most shoppers enter the site by clicking a search engine result: as opposed to entering via an advertisement or typing the website address directly.

5. Most shoppers visit fewer than 10 pages during a session.

6. The classes in the converted column are imbalanced: a quick calculation shows the site's current conversion rate is around 3%. A quick search suggests the average conversion rate for an e-commerce platform is 1–2%, so the company isn't performing poorly by any means. This will be our target attribute.
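
For reference, the ~3% figure in point 6 comes from a one-line calculation along these lines (a sketch, assuming the DataFrame is named data and converted holds 0/1 values):

# Conversion rate = share of shoppers with converted == 1
conversion_rate = data['converted'].mean()
print(f"Conversion rate: {conversion_rate:.2%}")  # roughly 3% for this data set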

At this point I know quite a bit about this data set; now I can explore the relationships between the features. Since there is a mix of numerical and categorical features, some feature engineering will need to be done. This step is outside the scope of this article, but can be found in this Jupyter Notebook.
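
The exact steps are in the notebook; a minimal sketch of the idea, assuming the categorical columns are label-encoded into a new DataFrame called data_new (the name the later snippets use), might look like this:

from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: encode the categorical columns as integers so every column is numeric
data_new = data.copy()
for col in ['country', 'source']:
    data_new[col] = LabelEncoder().fit_transform(data_new[col])

# Pairwise correlations across the now-numeric columns, shown as a heatmap
sns.heatmap(data_new.corr(), annot=True, cmap='coolwarm')
plt.show()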

Image: Deandra Alvear

It appears that none of the columns are correlated with each other except total_pages_visited and converted, which are positively correlated. This suggests that as shoppers view more pages, the likelihood of them converting increases. Conversely, a negative correlation would imply that as shoppers visit more pages in a session, the likelihood of them converting decreases. Typically we see this when a feature on a website isn't functioning correctly.

Now I’ll implement a baseline model to predict conversion rate.

Baseline Decision Tree Model

It’s important to implement a baseline model to confirm our EDA findings, and to see if a basic model or approach is “good enough” to solve the problem at hand. This will also inform what steps are next in the model building process.

We want to use an interpretable, out-of-the-box model (no hyper-parameter tuning), such as logistic regression or a decision tree, on a data set that hasn't been feature engineered yet. Since this is a classification task (predicting a binary response) and our features are a mix of numerical and categorical, I'll train a decision tree classifier and only let it grow to two levels. Here's my code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# data_new comes from the feature engineering step in the notebook:
# the first five columns are the features, 'converted' is the target
X = data_new.iloc[:, :5]
y = data_new.loc[:, 'converted']

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=22)

# Limit the tree to two levels to keep it interpretable
dtree = DecisionTreeClassifier(max_depth=2, random_state=22)
dtree.fit(X_train, y_train)

# Using accuracy as our performance metric
score = dtree.score(X_test, y_test)
print(score)
>>> 0.9818469323213156

Our model accuracy is ~98%, which seems great except we know that decision trees have a tendency to overfit. We also know that we trained this model on an imbalanced data set.

Accuracy is also not the best performance metric to use because it measures the proportion of correct classifications, thus only telling us half the story. Let’s look at the confusion matrix:
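
A minimal sketch of how the matrix could be produced with scikit-learn (assuming a reasonably recent version):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Rows are the actual classes, columns the predicted classes
y_pred = dtree.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Normalizing by the true class counts shows the per-class hit rate directly
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, normalize='true')
plt.show()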

Image: Deandra Alvear

From the confusion matrix above, we see that the decision tree is correctly predicting shoppers that do not convert, but has a 50/50 chance of predicting shoppers that will convert. We can have a high accuracy with a poor model because accuracy measures the proportion of correct classifications. Since this model is correctly classifying shoppers that do not convert, and this is also the dominant class, our model accuracy is being driven up.

We have an imbalanced-classes problem that we must correct for. There are a few ways to do this: I could change my performance metric, change my algorithm, or resample the data set. For this particular problem, I'll under-sample the dominant class and then retrain the model. The full steps are outside the scope of this article but can be found in the accompanying Jupyter Notebook here.
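
As a rough sketch of the idea (the down-sampling ratio here is illustrative, and I'm reusing the name dtree_new for the retrained model to match the snippets below; the notebook has the actual steps):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.metrics import classification_report

# Recombine the training features and labels so we can resample by class
train = X_train.copy()
train['converted'] = y_train
majority = train[train['converted'] == 0]
minority = train[train['converted'] == 1]

# Down-sample the dominant class; the 3:1 ratio is an illustrative choice
majority_down = resample(majority, replace=False,
                         n_samples=len(minority) * 3, random_state=22)
train_bal = pd.concat([majority_down, minority]).sample(frac=1, random_state=22)

# Retrain the same shallow tree on the balanced training set,
# but keep the untouched test set for evaluation
dtree_new = DecisionTreeClassifier(max_depth=2, random_state=22)
dtree_new.fit(train_bal.drop(columns='converted'), train_bal['converted'])
print(classification_report(y_test, dtree_new.predict(X_test)))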

Retrained model results:

Decision Tree Train Accuracy: 0.925721977484092
Decision Tree Test Accuracy:  0.954427577482606

              precision    recall  f1-score   support

           0       1.00      0.96      0.98     61212
           1       0.40      0.88      0.55      2028

    accuracy                           0.95     63240
   macro avg       0.70      0.92      0.76     63240
weighted avg       0.98      0.95      0.96     63240

Our model accuracy went down a few percentage points, and now the minority class has a low precision and high recall, whereas before it had the opposite.

This means that:

  • 40% of the shoppers classified as belonging to the minority class (converted) actually belong to that class (precision)
  • 88% of the shoppers that are in the minority class were correctly classified as belonging to it (recall)

Now I’ll look at the confusion matrix:

Image: Deandra Alvear

The confusion matrix indicates that our classifier performs much better than the first one. 96% of shoppers that did not convert, and 88% of shoppers that did convert were correctly classified.

At this point, our classifier is “good enough” and we can use it to predict the probability that a shopper will convert given their demographics. The following code snippet will give us class probabilities:

# To get the probability of conversion
# ['country', 'age', 'new_user', 'source', 'total_pages_visited']

w = [[0, 24, 0, 2, 5]]

dtree_new.predict_proba(w)
>>> array([[0.95968863, 0.04031137]])
# 96% probability they will not convert, 4% they will convert

Typically, we would want to weigh the pros and cons of our model in terms of how much it will cost Vandelay Industries to incorrectly classify a shopper. I’ll explore this more when I make my recommendations.

Making Recommendations

We want to make recommendations to the product and marketing teams on how to improve shopper conversion. To do this, we should examine the patterns and trends in the data we have. It is important to do this after training a model so that what we notice in the data doesn't bias us during the model-building process.

Total Pages Visited

We know from the EDA step that total_pages_visited is positively correlated to whether or not a shopper will convert. After taking a closer look at this feature for both classes, it appears that users that convert view an average of 10 more total pages in a session before converting. This could be due to comparison shopping or an indication that a user that spends more time on the site is more likely to buy something.
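
A quick way to see this difference is to compare the average page count per class (a sketch, assuming the original DataFrame data):

# Average pages viewed per session, split by whether the shopper converted
print(data.groupby('converted')['total_pages_visited'].mean())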

If we examine the feature importances of our retrained classifier:

country importance: 1.7146843233999107%
age importance: 0.0%
new_user importance: 0.0%
source importance: 0.0%
total_pages_visited importance: 98.28531567660009%
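
The printed importances above can be obtained from the retrained tree's feature_importances_ attribute; a sketch, assuming the retrained model is named dtree_new:

# Pair each feature name with the share of the tree's impurity reduction attributed to it
for name, importance in zip(X.columns, dtree_new.feature_importances_):
    print(f"{name} importance: {importance * 100}%")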

We see that total_pages_visited is the most important feature for predicting whether or not a user will convert.

Shopper Demographics

Country

We already know that a majority of shoppers are from the US, so it makes sense that US shoppers account for the largest counts of both converting and non-converting sessions. The second-largest group of shoppers is from China. A quick calculation reveals that less than 1% of users from China end up converting, despite China being the second-largest demographic.
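
A sketch of that per-country calculation, again assuming the original DataFrame data:

# Conversion rate by country, highest first
print(data.groupby('country')['converted'].mean().sort_values(ascending=False))

# Shopper counts per country, to confirm China is the second-largest group
print(data['country'].value_counts())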

Age

The age distribution of shoppers that did convert is right-skewed: most converting shoppers are young, with a long tail toward older ages. Binning converting shoppers into age groups shows that the site is most successful at converting shoppers under 30, and still does reasonably well with those under 40.

(16.893, 27.6]     6175
(27.6, 38.2]       3365
(38.2, 48.8]        615
(48.8, 59.4]         40
(59.4, 70.0]          3
(112.4, 123.0]        1
(101.8, 112.4]        1
(91.2, 101.8]         0
(80.6, 91.2]          0
(70.0, 80.6]          0
Name: age, dtype: int64
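
These counts can be reproduced along these lines (a sketch; the choice of 10 equal-width bins is inferred from the output above):

import pandas as pd

# Bin the ages of converting shoppers into 10 equal-width groups and count each bin
converters = data[data['converted'] == 1]
print(pd.cut(converters['age'], bins=10).value_counts())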

Final Insights

  • Less than 1% of shoppers in China are converting
  • The site works well at getting those under 40 to convert
  • Returning shoppers convert more than new shoppers
  • Most shoppers, whether they convert or not, enter the site through a search engine result
  • Shoppers that convert view about 10 more pages per session than shoppers that do not; more time on the site is strongly associated with conversion

Recommendations

We want to base our recommendations on areas where the site is successful and areas where the site is performing poorly that we can directly change.

  1. We know that shoppers in China are converting at very low rates despite being the second-largest demographic. The first priority is to recommend that the product team look into the Chinese version of the site to make sure that translations are correct, payments are in the correct currency, and the content fits the local culture.
  2. Since the site is successful at getting shoppers under 40 to convert, we could recommend the marketing team target this demographic via ads and other marketing channels. The marketing team could also send promotional/reminder emails to shoppers that spent a lot of time on the site (viewed a lot of pages), but haven’t converted yet, since we know that time spent on the site is positively correlated to whether or not a shopper will convert.

We are not given the cost of incorrectly classifying a shopper, so we cannot put a dollar amount on it, but we can still talk in terms of marketing resources. The cost of an incorrect classification is the cost of sending a promotional/reminder email to a shopper that does not end up converting. However, conversion is tied to the number of pages a shopper views in one session, so a shopper with low conversion potential can still convert if they spend more time on the website.

This post is based on a sample problem from A Collection of Data Science Take-Home Challenges. All images and work are my own; view the Jupyter Notebook here.
