Data Science in eCommerce — Part 3
Summary Statistics
Let’s take a closer look at the summary statistics of our transformed data set. They reveal some interesting information:
In this article, we will focus on the two variables: conversions and path length.
1. Conversions
The number of conversions has a mean of 9.8 but a standard deviation of 125.8, which looks quite suspicious. The box plot confirms how the values are distributed: the majority of observations have only 1 or 2 conversions (read more about quartiles and box plots).
The number of observations (1,851) and the maximum value in the set (4,062) provide some clues. Let’s visualise this data:
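To see why a standard deviation far above the mean is a warning sign, here is a minimal sketch using Python's standard library. The `conversions` list is made-up toy data, not the data set from this article; it only mimics the shape described above, where most paths convert once or twice and one path converts very often.

```python
import statistics

# Hypothetical conversion counts per path (toy data, not the article's data set):
# most paths convert once or twice, one path converts very often.
conversions = [1, 1, 1, 1, 2, 1, 2, 1, 1, 400]

mean = statistics.mean(conversions)
stdev = statistics.stdev(conversions)  # sample standard deviation

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(conversions, n=4, method="inclusive")

print(f"mean={mean:.1f}, stdev={stdev:.1f}")
print(f"Q1={q1}, median={median}, Q3={q3}")
```

A single extreme value is enough to pull the mean and the standard deviation far away from the quartiles, which stay at 1 or 2. That is the same pattern the box plot shows.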
The distribution of the values resembles a Pareto distribution. This hints at the 80/20 rule, known as the Pareto principle. In this case it translates into the following statement: ‘The majority of conversions come from a limited number of customer paths’.
Definitely worth further exploration.
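One quick way to check the 80/20 intuition is to rank the paths by conversions and measure how much of the total the top 20% contribute. The sketch below does that; `top_share` is a hypothetical helper name and `conversions_per_path` is toy data, both introduced only for illustration.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ranked = sorted(values, reverse=True)
    # Number of items in the top group, at least one.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical conversions per customer path (toy data).
conversions_per_path = [400, 50, 5, 2, 2, 1, 1, 1, 1, 1]

share = top_share(conversions_per_path)
print(f"Top 20% of paths account for {share:.0%} of conversions")
```

If the share for the real data comes out close to or above 80%, the Pareto principle is a reasonable description of how conversions are spread across paths.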
2. Path Length
Path length has a mean of 7.6 and a standard deviation of 6.1. The maximum value of 161 may indicate outliers. A box plot and a distribution plot will help us understand the distribution.
Again, the values appear to follow a Pareto distribution. More properties of this type of distribution will be shown in the next part.
Some business takeaways:
- Disparity in the number of conversions: some paths had up to 4,062 conversions, while 75% of all paths had no more than two. We may want to take a closer look at those outliers to understand where the bulk of conversions takes place.
- 50% of observations have a path length of 6 touchpoints or fewer.
- 75% of observations have a path length of 9 touchpoints or fewer.
- There are some outliers skewing the summary statistics: observations with up to 161 touchpoints before conversion.
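The percentile statements and the outlier check above can be reproduced with a few lines of code. This is only a sketch: `path_lengths` is toy data, and the 1.5 × IQR fence is the common box plot convention (Tukey's rule), which is an assumption here since the article does not say how outliers were identified.

```python
import statistics

# Hypothetical path lengths in touchpoints (toy data, not the article's data set).
path_lengths = [2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 12, 161]

# 25th, 50th and 75th percentiles.
q1, median, q3 = statistics.quantiles(path_lengths, n=4, method="inclusive")

# Assumed outlier rule: anything above Q3 + 1.5 * IQR is flagged.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in path_lengths if x > upper_fence]

print(f"50% of paths have at most {median} touchpoints")
print(f"75% of paths have at most {q3} touchpoints")
print(f"Paths above {upper_fence} touchpoints: {outliers}")
```

Comparing the mean of the full list with the mean after dropping the flagged values gives a feel for how much a handful of very long paths skews the summary statistics.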