How to Forecast Customer Sales Volume Using Machine Learning?

Aug 16, 2021 · 6 min read




  1. Executive Summary
  2. Introduction
  3. Methodology
  4. Results
  5. Discussion
  6. Conclusion
  7. Appendices


1. Executive Summary

In this article, we predict customer sales volume using transaction history and demographics datasets. We conclude that a simple linear regression with annual income as the predictor is a strong model for forecasting future sales.


2. Introduction

This case study is part of the “Introduction to Predictive Analytics using Python” course in the University of Edinburgh Predictive Analytics using Python MicroMasters program. The project’s scope is to build a well-performing model that predicts customer sales volume from demographic and transaction data.


3. Methodology

3.1 Analytic Approach and Data Requirements

The predictive analytics approach is selected because we need to predict a continuous number. Along the way, we follow the classic KDD (Knowledge Discovery in Databases) cycle: data cleaning, data selection and transformation, and data mining, resulting in knowledge. The process culminates in a simple linear regression model leading to conclusions. Since we use a simple linear regression, the data has to be in a numeric format.

There were two possible ways to approach the sales forecasting problem given the dataset: aggregate sales volume by the customer or by the product. In this case study, we take the former approach.

3.2 Data

We work with two datasets imported in a CSV format: CS_Purchase_data and CS_Customer_data.

Dataset 1: CS_Purchase_data

The first dataset contains 50,000 transactions combining customer demographics and product data, with the “Purchase” feature recording the purchase amount. The variables are as follows:

  • User_ID
  • Product_ID
  • Gender
  • Age
  • Occupation
  • City_Category
  • Stay_In_Current_City_Years
  • Marital_Status
  • Product_Category_1
  • Product_Category_2
  • Product_Category_3
  • Purchase

Feature Selection:
Since the approach is to predict transactions by customer, we drop the product features 'Product_ID', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3'.

Aggregating the observations:
Next, we aggregate the transactions by customer using User_ID and create a new feature, ‘Purchase_Sum’, summing the transaction values for each user. The dataset covers 5424 customers.

Figure 3.1 Purchase Dataset Aggregated by Customer
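The drop-and-aggregate step can be sketched as follows on toy data (pandas assumed; the column names follow the article, but the values and the reduced column set are made up for illustration):

```python
import pandas as pd

# Toy stand-in for CS_Purchase_data (the real file has 50,000 rows).
purchases = pd.DataFrame({
    "User_ID": [1, 1, 2],
    "Gender": ["F", "F", "M"],
    "Product_ID": ["P1", "P2", "P1"],
    "Product_Category_1": [3, 5, 3],
    "Purchase": [100, 250, 80],
})

# Drop product-level features; we forecast per customer, not per product.
purchases = purchases.drop(columns=["Product_ID", "Product_Category_1"])

# Sum purchases per customer, keeping demographics (constant per user).
by_customer = (
    purchases.groupby("User_ID", as_index=False)
             .agg(Gender=("Gender", "first"), Purchase_Sum=("Purchase", "sum"))
)
```

With the real dataset, the same `groupby("User_ID")` collapses the 50,000 transactions into one row per customer.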

Dataset 2: CS_Customer_data

Then we import the second dataset CS_Customer_data containing 5424 customer records with five features:

  • User_ID
  • annual_income
  • number_of_children
  • proximity_town
  • sum
Figure 3.2 Customer dataset

Merge datasets
We combine the two datasets by right-merging them on “User_ID”.

Figure 3.3 Merged Dataset
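A minimal sketch of the right merge, on made-up rows (pandas assumed):

```python
import pandas as pd

# Toy stand-ins: the aggregated purchases and the customer dataset.
by_customer = pd.DataFrame({"User_ID": [1, 2], "Purchase_Sum": [350, 80]})
customers = pd.DataFrame({
    "User_ID": [1, 2],
    "annual_income": [52_000, 31_000],
    "proximity_town": [3.2, 8.7],
})

# A right merge keeps every customer record and attaches the purchase total.
merged = by_customer.merge(customers, on="User_ID", how="right")
```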

Appendix 1 Feature Histograms displays the variable histogram distributions.

Variables format
Next, we bring the categorical features (Gender, City_Category), binary features (Marital_Status, Occupation), and binned variables (Stay_In_Current_City_Years, Age) into a numeric format by creating dummy variables. We also drop User_ID, as it is not useful for regression.

Figure 3.4 Dataset Before Format Correction
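The dummy-variable step might look like this on toy data; `pd.get_dummies` with `drop_first=True` is one common choice, not necessarily the course's exact call:

```python
import pandas as pd

# Toy frame with categorical / binned columns named as in the article.
df = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "City_Category": ["A", "B", "C"],
    "Stay_In_Current_City_Years": ["1", "4+", "2"],
    "Purchase_Sum": [350, 80, 120],
})

# User_ID carries no predictive signal, so it is dropped before modelling.
df = df.drop(columns=["User_ID"])

# One-hot encode the non-numeric columns; drop_first avoids collinearity.
df = pd.get_dummies(
    df,
    columns=["Gender", "City_Category", "Stay_In_Current_City_Years"],
    drop_first=True,
)
```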

Final Preprocessing
The independent variable proximity_town had 158 missing values, which we replace with the mean. We then identified outliers with the LocalOutlierFactor algorithm, using 20 neighbours and 5% contamination as the parameters.
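A sketch of the imputation and outlier step with the stated parameters, on synthetic data (scikit-learn assumed):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic frame with some missing proximity values
# (the real data had 158 of them).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "annual_income": rng.normal(50_000, 10_000, 200),
    "proximity_town": rng.normal(5, 2, 200),
})
df.loc[df.sample(10, random_state=0).index, "proximity_town"] = np.nan

# Mean imputation, as in the case study.
df["proximity_town"] = df["proximity_town"].fillna(df["proximity_town"].mean())

# Flag outliers with the article's parameters: 20 neighbours, 5% contamination.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(df)   # -1 = outlier, 1 = inlier
df = df[labels == 1]           # keep inliers only
```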

3.3 Transformation

We normalise ‘Occupation’, ‘Purchase_Sum’, ‘annual_income’, ‘number_of_children’, ‘proximity_town’, and ‘sum’.

Figure 3.5 Normalised Dataset
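A sketch of the normalisation, assuming min-max scaling to [0, 1] (the article does not name the exact scaler, so this choice is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with three of the columns named in the article.
df = pd.DataFrame({
    "annual_income": [31_000, 52_000, 88_000],
    "proximity_town": [8.7, 3.2, 1.1],
    "Purchase_Sum": [80, 350, 910],
})

# Rescale each listed feature to the [0, 1] range.
cols = ["annual_income", "proximity_town", "Purchase_Sum"]
df[cols] = MinMaxScaler().fit_transform(df[cols])
```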

3.4 Modeling Methodology


We used a plain-vanilla simple linear regression model, which predicts a continuous variable, fitting the sales volume separately against proximity to town and against annual income.
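A minimal simple-linear-regression fit on synthetic income/sales data (scikit-learn assumed; the numbers are illustrative, not the case study's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: income with a roughly linear relationship to sales.
rng = np.random.default_rng(42)
income = rng.uniform(20_000, 100_000, 300).reshape(-1, 1)
sales = 0.02 * income.ravel() + rng.normal(0, 100, 300)

# Simple linear regression: one predictor, one continuous target.
model = LinearRegression().fit(income, sales)
r2 = model.score(income, sales)
```

The same `fit`/`score` pattern applies to each single-feature model in the case study; a flat fitted line and a low R² are the signatures of a weak predictor like proximity_town.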


4. Results

Proximity to town was a poor predictor. The predicted line stayed almost horizontal across different proximity values, and the errors were especially large for customers with lower proximity values.

Figure 4.1 proximity_town SLR

As we can see from Figure 4.2, annual income appears to have a strong linear relationship with sales volume, even without additional features or further transformation.

Figure 4.2 annual income SLR


5. Discussion

Without further transformation, proximity on its own is not a good predictor. A logarithmic transform, a polynomial transformation, or new features built from the variable (e.g. neural-network nodes or manual multiplication with another variable) could potentially make it more useful.

Annual income is a strong predictor for forecasting a customer's sales volume.
Besides using the SLR model for prediction, the company could use this insight to dig deeper into the following questions:


  • Apart from disposable income, why do lower-income customers have lower sales volume? Could price-sensitive customers be underserved? Is there an opportunity to introduce more budget-friendly niche products? How can we use the insights to craft our marketing campaigns?
  • Is the income variable correlated with other demographic variables that could better explain customer behaviour? A multiple linear regression, a highly interpretable model, could help answer this: we could interpret its coefficients and gain further insight into segments to devise new product offerings.


6. Conclusion

We combined the transaction and customer datasets, then wrangled and transformed the data. We then answered the question of whether we can build a well-performing model to predict customer sales, and the answer was yes: with a simple linear regression using annual income as the predictor, we can visually see the regression line following the data points closely.

In an age when deep learning is all the rage, it was refreshing to see how effective a linear regression with only one variable can be: computationally inexpensive and highly interpretable. Before building a more data-hungry, sophisticated algorithm and gathering more data to feed it, stop and think: could a simple model form a strong starting point and do the job for this application? If not, we can at least use it to explore and interpret relationships between the variables and the predicted value, which assists in feature selection, helps us understand the real-world relation between different dimensions, and, worst come to worst, gives us a benchmark. Let’s not write off the good old linear regression yet!

Future Directions
We need to split the dataset into training, validation, and test sets so that we can fine-tune hyperparameters and report predictive performance on unseen data. We can also expand the model with more features to capture more of the variance present in the data.
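A 60/20/20 split could be sketched as follows (scikit-learn assumed; the ratios are an illustrative choice, not specified in the case study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target of 100 samples.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Hold out the test set first, then carve validation out of the remainder:
# 0.25 of the remaining 80% gives a 60/20/20 split overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0
)
```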


Appendix 1: Feature Distribution Histograms

Appendix 2: References

Introduction to Predictive Analytics using Python (EdinburghX, PA4.1x_MM), part of the University of Edinburgh Predictive Analytics using Python MicroMasters program.

Copyright © 2021 Schwarzwald_AI



