Product Segmentation: Pareto Law Using Databricks

Agrim Rustagi
Brillio Data Science
6 min read · Nov 29, 2022

80–20 rule — simple but very powerful

Credits: rushcuttershealth.com.au

INTRODUCTION

In the current scenario, AI/ML is playing a huge role in transforming different industries. But as data scientists, we often forget to start with simpler, rule-based solutions that are easier to understand and explain. Today, I want to discuss one such simple approach that can be used for product segmentation instead of starting with clustering or other complex techniques.

As the name suggests, the Pareto law traces back to the Italian economist Vilfredo Pareto, who wrote about the 80–20 rule while at university. He observed that approximately 80% of the land in Italy was owned by 20% of the population. Formally:

The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes (the “vital few”).

Empirical observations have shown that this 80–20 distribution fits a wide range of cases, including natural phenomena and human activities.

Examples –

1) In economics, we use the Gini coefficient to measure wealth inequality in a population. It is often found that roughly the top 20% of the richest people hold more than 80% of the wealth.

2) In computing, Microsoft noted that by fixing the top 20% of the most-reported bugs, 80% of the related errors and crashes in each system would be eliminated.

3) In sports, for example in cricket, the top 3 batsmen (roughly 20–30% of the team) contribute 70–80% of the total runs scored by the team. (I can say it’s true for India in the past few years: Shikhar Dhawan, Rohit Sharma, Virat Kohli.)

There are many other applications in our day-to-day life as well. For example, when we prepare for an examination, most of the questions repeated from past years are marked as important. They cover only 15–20% of the syllabus but account for 65–75% of the total marks. So, don’t just study hard, study smart to fetch good marks😊.

Let’s look at its application in inventory management, and then we will test the Pareto law hypothesis on a dataset…

ABC ANALYSIS

ABC analysis is an inventory categorization technique. The ABC analysis suggests that inventories of an organization are not of equal value. Thus, the inventory is grouped into three categories (A, B, and C) in order of their estimated importance.

A-items:

These items have the highest annual consumption value i.e., 70%-80% of the annual consumption value of the company.

They account for only 10%-20% of the total inventory items.

B-items:

They are the interclass items, having medium consumption value i.e., 15%-25% of annual consumption value.

They account for 30%-40% of the total inventory items.

C-items:

These items have the lowest consumption value i.e., 5%-10% of annual consumption value.

They account for around 50% of the total inventory items.
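Before we move to Databricks, here is a minimal plain-Python sketch of the idea: sort items by revenue, accumulate each item's share of the total, and assign A, B or C using the cut-offs above. The SKU names and revenue figures below are made up purely for illustration.

# Toy example: SKU names and revenues are hypothetical
items = {"SKU_1": 500000, "SKU_2": 300000, "SKU_3": 90000,
         "SKU_4": 60000, "SKU_5": 30000, "SKU_6": 20000}

total = sum(items.values())
cumulative = 0.0
for sku, revenue in sorted(items.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += revenue / total  # running share of total revenue
    category = "A" if cumulative <= 0.80 else ("B" if cumulative <= 0.95 else "C")
    print(sku, round(cumulative, 2), category)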

Let’s fire up Databricks and test the ABC analysis hypothesis on a dataset…

DATABRICKS

The dataset contains transactional data covering all transactions between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The dataset can be found at the UCI Machine Learning Repository: Online Retail Data Set.

If you are new to Databricks, you can sign up for a free account on Databricks Community Edition (Login — Databricks Community Edition).

After creating an account, we can create a cluster and give it any name of our choice. I have named mine ABC_analysis.

Credits: Databricks

Once the cluster is created, we upload our retail dataset and create a table, either through the UI or in a notebook. I have done it using the UI as shown below (a notebook-based alternative is sketched a little further down):

Credits: Databricks

The sample data would look something like this:
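If you would rather take the notebook route mentioned above instead of the UI, a minimal sketch would look like the following (the spark object is available by default in a Databricks notebook). The file path and table name are assumptions based on my upload; adjust them to wherever your CSV landed.

# Read the uploaded CSV (path is an assumption; check your own /FileStore location)
retail_df = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/FileStore/tables/online_retail_1.csv"))

# Expose it under the same name the SQL query below expects
retail_df.createOrReplaceTempView("online_retail_1_csv")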

Now, we create a notebook under our cluster with Python as the default language. Let’s start our ABC analysis using the revenue generated by each SKU.

First, we start the Spark session with the command below (in a Databricks notebook, a spark session is already available, so this is mostly a formality):

# start spark session #

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("abc_segmentation").getOrCreate()

We load the table that we created earlier into the notebook:

# load the table and order revenue per SKU in descending order #

new_df = spark.sql("select Description, sum(Quantity*UnitPrice) as total_revenue from online_retail_1_csv group by Description order by total_revenue desc")
new_df.show()
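As a side note, the same aggregation can be written with the DataFrame API instead of SQL; the sketch below assumes the table name used above and should give identical results.

import pyspark.sql.functions as f

# equivalent of the SQL query above, using the DataFrame API
new_df = (spark.table("online_retail_1_csv")
          .groupBy("Description")
          .agg(f.sum(f.col("Quantity") * f.col("UnitPrice")).alias("total_revenue"))
          .orderBy(f.col("total_revenue").desc()))
new_df.show()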

Now, we can find the contribution of each SKU as a percentage of total revenue and create a cumulative revenue percentage column — cumsum.

# contribution of each SKU to the total revenue, expressed as a fraction of 1 #

total_revenue = new_df.agg({'total_revenue': 'sum'}).collect()[0][0]

# the alias names the share column directly
new_df = new_df.select(new_df['Description'], (new_df['total_revenue'] / total_revenue).alias('percentage_revenue'))
new_df.show()

# Create cumulative revenue percentage column (cumsum)

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# running total of revenue share, ordered by share in descending order
win = Window.orderBy(f.col('percentage_revenue').desc()).rowsBetween(Window.unboundedPreceding, Window.currentRow)

cumulative_summation = new_df.withColumn('cumsum', f.sum('percentage_revenue').over(win))
cumulative_summation.show()

We now create a function to segment our products according to the ABC analysis and add a new column, “product_quality”, which assigns each product to category A, B or C.

# Segmentation of SKUs using the cumulative percentage of revenue

def segmentation(x):
    if x <= 0.80:
        return "A"
    elif x <= 0.95:
        return "B"
    else:
        return "C"

# Determining the product importance or quality in terms of revenue

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

udf_quality = udf(segmentation, StringType())
new_df = cumulative_summation.withColumn("product_quality", udf_quality(col("cumsum")))
new_df.show()
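A small design note: the same segmentation can also be expressed with Spark's built-in when/otherwise expressions instead of a Python UDF, which keeps the computation in the JVM and avoids the Python serialization overhead of a UDF. A sketch equivalent to the function above:

from pyspark.sql.functions import when, col

# built-in alternative to the UDF above; same A/B/C cut-offs
new_df = cumulative_summation.withColumn(
    "product_quality",
    when(col("cumsum") <= 0.80, "A")
    .when(col("cumsum") <= 0.95, "B")
    .otherwise("C")
)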

After segmentation, we can have a look at the distribution of products A, B and C.

new_df.groupBy('product_quality').count().orderBy('count').display()

We can also look at how the cumulative revenue builds up as we move from category A to B to C.
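To put concrete numbers on this, one way (a sketch using the columns created above) is to aggregate the SKU count and revenue share per category:

import pyspark.sql.functions as f

# SKU count and revenue share for each of the A/B/C categories
(new_df.groupBy("product_quality")
       .agg(f.count("*").alias("sku_count"),
            f.round(f.sum("percentage_revenue") * 100, 1).alias("revenue_share_pct"))
       .orderBy("product_quality")
       .show())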

We can see that the top 18% of SKUs (category A) contribute about 80% of the revenue. Another 18% of SKUs fall in category B and contribute about 15% of the total revenue, while the remaining 64% of SKUs in category C contribute only about 5% of the total revenue.

The dataset clearly follows the Pareto law we hypothesized. If we take a closer look at the shape of the cumulative revenue distribution, it is a concave curve, which makes intuitive sense: the incremental revenue contribution keeps shrinking as we move from category A to B to C.

I hope you liked the analysis. Stay tuned for more!!!!!
