# eCommerce: How to use statistics to increase sales?

A case study of regression analysis and clustering for sales analysis.

In the previous part we looked at the performance of an eCommerce store. We started with a combination of a vanity KPI (sessions) and a reasonably good business KPI (transactions). The two have quite an interesting relationship.

How can we make sense of the data and understand the relationship? In the previous part we explored correlation, which led to even more questions. The next step involves two statistical methods: regression analysis and clustering.

## Predicting sales: regression analysis

In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables. This is exactly what we are looking for.

We will use Python and the statsmodels library to build a model that predicts the number of transactions from the number of sessions.

The outcome of the model can be represented as a straight line (linear regression).
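To make the mechanics concrete, here is a minimal sketch of the ordinary least squares arithmetic behind such a straight-line model. It is written in plain Python rather than statsmodels so that each step is visible. The function names `fit_line` and `r_squared` and the sample figures are assumptions made for illustration; they are not the dataset or the code behind the numbers quoted in this article.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variability in y that the fitted line explains."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up daily figures, for illustration only (not the article's dataset).
sessions = [120, 340, 560, 410, 980, 150, 720]
transactions = [3, 6, 5, 9, 12, 4, 7]

intercept, slope = fit_line(sessions, transactions)
print(f"intercept: {intercept:.2f}")  # predicted transactions at zero sessions
print(f"slope:     {slope:.4f}")      # extra transactions per extra session
print(f"R-squared: {r_squared(sessions, transactions, intercept, slope):.3f}")
```

The intercept is the model's prediction for a day with zero sessions, and R-squared is the share of variability the line accounts for. Those are the two figures to inspect when judging a model like this one.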

There are two immediate challenges with our model:
1. It doesn’t make sense logically. It fails to explain the scenario where there are no sessions or very few. If sessions equals zero, the model still predicts four transactions (the intercept is 3.99).
2. The R-squared value is 0.232. Our model explains only 23% of the variability in the data. It leaves 77% of the variability unexplained, which is not acceptable from a business perspective.

Let’s try to tweak our model to accommodate scenarios where the number of sessions is low. We will force the intercept to equal zero (zero sessions means zero transactions).
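Forcing the intercept to zero changes the least squares formula: the slope becomes the sum of x·y divided by the sum of x². The sketch below shows this in plain Python, again with made-up figures and hypothetical function names. It also computes R-squared in two ways, because tools differ in which one they report for a model without an intercept. statsmodels, for instance, documents that its `rsquared` uses the uncentered total sum of squares when the constant is omitted, so it is worth checking which definition you are reading before comparing the two models.

```python
from statistics import mean


def fit_through_origin(x, y):
    """Least squares slope for y = slope * x, with the intercept fixed at zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def residual_sum_of_squares(x, y, slope):
    """Sum of squared gaps between the observed and the predicted values."""
    return sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))


def r_squared_centered(x, y, slope):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y."""
    y_bar = mean(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - residual_sum_of_squares(x, y, slope) / ss_tot


def r_squared_uncentered(x, y, slope):
    """1 - SS_res / sum(y^2), with SS_tot measured around zero."""
    ss_tot = sum(yi * yi for yi in y)
    return 1 - residual_sum_of_squares(x, y, slope) / ss_tot


# Made-up daily figures, for illustration only (not the article's dataset).
sessions = [120, 340, 560, 410, 980, 150, 720]
transactions = [3, 6, 5, 9, 12, 4, 7]

slope = fit_through_origin(sessions, transactions)
print(f"slope:                 {slope:.4f}")
print(f"R-squared (centered):   {r_squared_centered(sessions, transactions, slope):.3f}")
print(f"R-squared (uncentered): {r_squared_uncentered(sessions, transactions, slope):.3f}")
```

The centered version uses the same yardstick as the model with an intercept, so it is the one to use for a like-for-like comparison. It can never exceed the R-squared of the unconstrained line fitted to the same data.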

The new model makes more sense logically but fits the data significantly worse. It explains only 13% of the variability in the data, which is disappointing. We could try to improve R-squared with other regression methods (for example lasso instead of ordinary least squares), but that is outside the scope of this article. We will explore another interesting method instead.

## Understanding audience: cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Why is this important for eCommerce? There is no average user or customer. Instead, we can try to distinguish different groups (clusters) whose members have some things in common.

We can illustrate the idea of clusters by dividing our dataset into a few different groups.

We can use a statistical package like SPSS or BigML to do the number crunching for us.

In this example we took two variables: the number of sessions and the number of transactions. The number of clusters was set to four. Each circle represents a single cluster, and its size represents the number of instances within it.

In this simplified example we ended up with four clusters. While it won’t provide a wealth of insights, it helps to illustrate the idea of clustering.

These four clusters may explain our audience better than regression analysis does. For example, cluster 2 (a very high number of transactions) is made up of outliers not captured by the regression.

Each cluster is described by a number of parameters: the number of instances (observations) and the central point for transactions and sessions. All instances in a cluster relate to its central point. The distance of each instance from this point is shown in the histogram.
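The parameters described above (instance count, central point, distance from the centre) can be reproduced with a small k-means sketch. The article does not name the algorithm its tool ran, so k-means is an assumption here, as are the `kmeans` function and the sample observations. The code is plain Python so that the assignment and update steps are visible; in practice a package such as SPSS or BigML does this work.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain k-means: returns (centroids, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the assignment is stable
            break
        centroids = updated
    return centroids, clusters


# Made-up (sessions, transactions) observations, for illustration only.
observations = [
    (100, 2), (120, 3), (90, 1),
    (900, 4), (950, 5), (880, 3),
    (500, 10), (520, 12), (480, 11),
    (300, 40), (320, 45),
]

centroids, clusters = kmeans(observations, k=4)
for number, (centre, members) in enumerate(zip(centroids, clusters), start=1):
    distances = [round(math.dist(m, centre), 1) for m in members]
    print(f"cluster {number}: {len(members)} instances, "
          f"centre (sessions={centre[0]:.0f}, transactions={centre[1]:.1f}), "
          f"distances from centre {distances}")
```

Two caveats apply to this sketch. The result depends on the random starting points, so different seeds can produce different groupings. Distances are also computed on raw values, which lets sessions (in the hundreds) outweigh transactions (in single or double digits), so rescaling the variables first is common practice.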

## Getting insights from cluster analysis: real-life scenarios

Having industry knowledge helps. Good practice dictates including more relevant data in the analysis. Where should we start? We may want to consider the following variables:
- Customer acquisition method (paid, email, social, etc.)
- Transaction details (such as product category)

Cluster analysis is a fine balancing act. Our aim is to get a better understanding of the eCommerce audience while keeping the number of clusters low and making sure each cluster makes business sense.

Next: Can we predict future sales based on the past performance?

Back to part 1 — Demystifying Vanity KPIs