Predicting the Amount Spent By Customers Using ML

Built a machine learning model that uses linear regression to predict the yearly amount spent by customers of an ecommerce store

Published in

Geek Culture

4 min read · May 4, 2021

After experimenting with the Titanic dataset, I decided to work on a more current theme. The dataset I chose describes the customers of an ecommerce store. It contains data about many customers, such as average session length, time spent on the site, length of membership, and yearly amount spent. I really enjoyed working with this dataset, since ecommerce has grown a lot in recent years. Now let’s take a look at the dataset:

This dataset contains 8 columns and 500 rows. After some analysis, I found that length of membership is the feature with the biggest impact on the yearly amount spent by customers. I then plotted some graphs to see this relationship more clearly.

Graph 2 : Length of membership x Yearly Amount Spent

The jointplot showed a positive relationship between length of membership and yearly amount spent: the longer the membership, the higher the amount spent. Since the relationship looks roughly linear, linear regression is a reasonable choice for this dataset!
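As a rough sketch of how such an analysis could be done, the snippet below ranks the features by their correlation with the target. It runs on synthetic stand-in data, not the real dataset: the column names are assumed from the description in this post, and the numbers used to generate the data are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (NOT the real data): 500 rows, with
# column names assumed from the description in the article.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, n),
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
# Assumed relationship, chosen only so the example has some structure to find.
df["Yearly Amount Spent"] = (
    25 * df["Avg. Session Length"]
    + 38 * df["Time on App"]
    + 60 * df["Length of Membership"]
    + rng.normal(0, 10, n)
    - 1000
)

# Rank the features by their Pearson correlation with the target.
ranking = (
    df.corr()["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(ranking)

# A jointplot like the one described above could be drawn with seaborn,
# if it is installed:
# import seaborn as sns
# sns.jointplot(x="Length of Membership", y="Yearly Amount Spent", data=df)
```

Correlation only measures linear association one feature at a time, so it is a quick first look rather than a full analysis.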

Splitting Data

I started by importing some new libraries to help split the data, and assigned the features and the target (yearly amount spent) to two variables

Now I was ready to split the data into “train” and “test” sets, to train my model and then see whether it could predict well on data it had not seen. For this part I used the train_test_split function from scikit-learn:
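A minimal sketch of this step might look like the following. The data is a random placeholder, the column names are assumed from the description in this post, and the 70/30 split and the random_state value are my own illustrative choices, not necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the assumed column names (NOT the real dataset).
rng = np.random.default_rng(1)
features = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
df["Yearly Amount Spent"] = rng.normal(500, 80, 500)

# X holds the numeric features, y holds the target to predict.
X = df[features]
y = df["Yearly Amount Spent"]

# Hold out 30% of the rows for testing (an assumed ratio); random_state
# makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Keeping the test rows out of training is what makes the later evaluation an honest check of how the model handles unseen customers.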

Creating the model

The basic linear regression formula is Y = A + B*X. The purpose of the model is to figure out the values of A (the intercept) and B (the coefficient) so that it can predict Y (yearly amount spent) whenever I give it X (the other features). With several features, there is one B coefficient per feature. Now let’s build the model and train it:
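To make the formula concrete, here is a small sketch with a single feature. The data is hypothetical and noise-free, generated from Y = A + B*X with values of A and B that I picked for illustration, so the fitted model should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from Y = A + B*X.
A_true, B_true = 100.0, 60.0
X = np.linspace(0, 7, 50).reshape(-1, 1)  # one feature, 2-D as sklearn expects
y = A_true + B_true * X.ravel()

# Fitting the model estimates A (intercept_) and B (coef_).
model = LinearRegression()
model.fit(X, y)

A_hat = model.intercept_
B_hat = model.coef_[0]
print(A_hat, B_hat)
```

On real data the points do not fall exactly on a line, so the fitted A and B are the values that minimize the sum of squared residuals rather than an exact recovery.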

Using the test data and evaluating the results

After running the model on the test data, I decided to plot a graph to see the relationship between the predicted values and the true values. Let’s see if they look similar!
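The prediction step could be sketched as follows. It uses synthetic data with made-up coefficients (not the real dataset) and measures how closely predictions track the true values with a correlation coefficient; the plotting call is left as a comment because it assumes matplotlib is installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with assumed coefficients and some noise (NOT the real data).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = 500 + X @ np.array([25.0, 38.0, 0.5, 60.0]) + rng.normal(0, 10, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Correlation between true and predicted values on the test set.
corr = np.corrcoef(y_test, predictions)[0, 1]
print(round(corr, 3))

# A predicted-vs-true plot could be drawn like this (if matplotlib is installed):
# import matplotlib.pyplot as plt
# plt.scatter(y_test, predictions)
# plt.xlabel("True values"); plt.ylabel("Predicted values"); plt.show()
```

If the model predicts well, the points in such a plot fall close to the diagonal.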

Graph 3 : Lineplot Predicted values x True values

That’s nice! The predicted and true values are strongly correlated, which means the model predicted the Y value well!

It’s time to import some metrics to evaluate the model. I decided to use these:

Sum of squared errors (SSE): square each residual (true value minus predicted value), then sum them
Mean squared error (MSE): SSE divided by the number of observations
Root mean squared error (RMSE): the square root of MSE
R2 score: the proportion of the variance in Y that the model explains using X
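These definitions can be checked by hand on a tiny example. The four true and predicted values below are made up for illustration; the last lines compare the hand-computed numbers with scikit-learn's mean_squared_error and r2_score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values, small enough to verify by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_pred          # 0.5, -0.5, 0.0, 1.0
sse = np.sum(residuals ** 2)         # square each residual, then sum
mse = sse / len(y_true)              # average squared error
rmse = np.sqrt(mse)                  # back in the units of y

# R2 compares the model's error with that of always predicting the mean.
sst = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - sse / sst

print(sse, mse, rmse, r2)

# The library functions should agree with the manual calculation.
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

RMSE is often the most convenient error to report because it is expressed in the same units as the target, here dollars per year.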

R2 is the easiest to understand. The value of R2 for this model was 0.98, which means that our linear regression explains 98% of the variance in Yearly Amount Spent using the other features. As a rule of thumb, I take 0.7 as the benchmark for a good value. A really high R2, like 0.98, can have several causes, such as artificial or biased data.

How can this model help?

At the beginning I said that “Length of Membership” was the feature that most affected the Yearly Amount Spent, so the ecommerce store could focus on actions that extend memberships or turn more customers into members. But what about the other features? Is it worth working on them?

I printed out the coefficients of the model to better understand these relationships:

And what does this mean? Basically, for each one-unit increase in “Time on App”, with the other features held constant, the “Yearly Amount Spent” increases by about 38 dollars. Again, membership length drives the largest increase, but focusing on the app could also give a good result; it may be cheaper to improve the app than to increase membership length!
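The "one unit more, one coefficient more" reading can be illustrated with a short sketch. The data here is synthetic and noise-free, generated with coefficients I chose myself (38 for Time on App, mirroring the figure quoted above), and the column names are assumed; it is not the output of the real notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with assumed column names and coefficients.
rng = np.random.default_rng(3)
features = [
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
true_coefs = np.array([25.0, 38.0, 0.5, 60.0])
y = 500 + X.to_numpy() @ true_coefs

model = LinearRegression().fit(X, y)

# One coefficient per feature, labeled for readability.
coef_table = pd.DataFrame(model.coef_, index=features, columns=["Coefficient"])
print(coef_table.round(2))

# Raise "Time on App" by one unit for a single customer, keeping the rest fixed.
customer = X.iloc[[0]].copy()
bumped = customer.copy()
bumped["Time on App"] += 1
delta = model.predict(bumped)[0] - model.predict(customer)[0]
print(round(delta, 2))
```

The change in the prediction equals the coefficient of the feature that was increased, which is exactly what "holding the other features constant" means for a linear model.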

You can check the complete notebook on GitHub.