Data Science in eCommerce — Part 3

Summary Statistics

Let’s take a closer look at the summary statistics of our transformed data set. They reveal some interesting information:

Summary statistics
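The article does not show how the table above was produced, so here is a minimal sketch of how such figures could be computed with Python's standard library. The `summarize` helper and the `toy_conversions` list are made up for illustration; they are not the article's code or data.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    # Quartiles with linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical conversions per path, NOT the article's data.
toy_conversions = [1, 1, 1, 2, 2, 3, 5, 40]

for name, value in summarize(toy_conversions).items():
    print(f"{name:>5}: {value:.2f}")
```

Even in this tiny example, one large value pushes the standard deviation above the mean, which is the same warning sign discussed below.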

In this article, we will focus on two variables: conversions and path length.

  1. Conversions
    It has a mean of 9.8 but a standard deviation of 125.8, which looks quite suspicious. The box plot confirms how skewed the values are: the majority of observations have only 1 or 2 conversions (read more about quartiles and box plot).
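A box plot typically marks a point as an outlier when it falls more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that common rule; whether the article's plot used the same threshold is an assumption, and the helper names and data are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical conversions per path, NOT the article's data.
toy_conversions = [1, 1, 1, 2, 2, 3, 5, 40]

print("fences:", tukey_fences(toy_conversions))
print("outliers:", find_outliers(toy_conversions))
```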

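The number of observations (1,851) and the maximum value in the set (4,062) provide some clues. With a mean of 9.8, there are roughly 18,000 conversions in total, so the single largest value alone accounts for more than a fifth of them. Let’s visualise this data: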

Number of conversions (x axis — conversions, y axis — path to conversion)

The distribution of the values points to a Pareto distribution. This hints at the 80/20 rule, known as the Pareto principle. In this case, it translates into the following statement: ‘The majority of conversions come from a limited number of customer paths’.
This is definitely worth further exploration.
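One simple way to test the 80/20 intuition is to rank the paths by conversions and measure what share of the total the top 20% contribute. This is a rough concentration check rather than a formal fit of a Pareto distribution; `top_share` and the sample values are illustrative assumptions.

```python
def top_share(values, top_percent=20):
    """Share of the total contributed by the largest `top_percent` % of items."""
    ranked = sorted(values, reverse=True)
    # Integer arithmetic avoids float rounding; always keep at least one item.
    k = max(1, len(ranked) * top_percent // 100)
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical conversions per path, NOT the article's data.
toy_conversions = [40, 5, 3, 2, 2, 1, 1, 1, 1, 1]

share = top_share(toy_conversions, top_percent=20)
print(f"Top 20% of paths hold {share:.0%} of conversions")
```

A share close to or above 80% would support the Pareto reading; a share close to 20% would mean conversions are spread evenly across paths.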

  2. Path Length
    It has a mean of 7.6 and a standard deviation of 6.1. The maximum value of 161 may indicate outliers. A box plot and a distribution plot will help us understand the distribution.

Again, we have a case of a Pareto distribution. More properties of this type of distribution will be shown in the next part.

(x axis — number of touchpoints to conversion, y axis — path to conversion)
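When no plotting library is at hand, a quick text histogram gives a first impression of how path lengths are distributed. The sketch below is only a stand-in for the article's chart, and the path lengths in it are invented.

```python
from collections import Counter


def text_histogram(values, bar="#"):
    """Return one line per distinct value, with a bar as long as its frequency."""
    counts = Counter(values)
    return [f"{value:>3} | {bar * counts[value]}" for value in sorted(counts)]


# Hypothetical touchpoints per path, NOT the article's data.
toy_path_lengths = [2, 3, 3, 4, 4, 4, 5, 6, 6, 9, 30]

print("\n".join(text_histogram(toy_path_lengths)))
```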

Some business takeaways:

  • There is a large disparity in the number of conversions: the largest paths had up to 4,062 conversions, while 75% of all paths had two or fewer. We may want to take a closer look at those outliers to understand where the bulk of conversions takes place.
  • 50% of observations have a path length of 6 touchpoints or fewer.
  • 75% of observations have a path length of 9 touchpoints or fewer.
  • There are some outliers skewing the summary statistics: observations with up to 161 touchpoints to conversion.
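Statements such as "75% of observations have a path length of 9 touchpoints or fewer" can be checked directly by counting. Here is a minimal sketch, with a hypothetical helper and made-up path lengths:

```python
def share_at_or_below(values, threshold):
    """Fraction of observations whose value is less than or equal to threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical touchpoints per path, NOT the article's data.
toy_path_lengths = [2, 3, 4, 5, 6, 6, 7, 9, 12, 30]

for limit in (6, 9):
    share = share_at_or_below(toy_path_lengths, limit)
    print(f"<= {limit} touchpoints: {share:.0%}")
```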

Deep dive into customer journey analysis in Part 4