Clustering Analysis using Spark ML and Scikit-learn — Part 4

Suhaib Arshad
22 min read · May 18, 2024


Introduction:

Olist is a Brazilian company that helps sellers advertise their products on the main internet marketplaces (Amazon, Mercado Livre, Americanas, Carrefour, Submarino, Via Varejo, Casas Bahia, B2W Digital, Extra, Shoptime, Ponto Frio, Madeira Madeira and Zoom).

Olist offers an e-commerce marketplace solution to shopkeepers of all sizes (and in most segments) to increase their sales, whether or not they already have an online presence.

In the previous parts we covered exploratory data analysis, data visualization and customer segmentation. In the last part, we tried out various combinations of algorithms for customer segmentation (using Scikit-learn and Spark ML). In this final part we will select the best model and present our recommendations and findings.

Don’t forget to check out:

Part-1: EDA and Data Transformation using PySpark

Part-2: Advanced Data visualization with PowerBI and Python

Part-3: Customer Segmentation using SparkML and Scikit-learn

GitHub repository

To get the PySpark notebooks, data dictionary and final marketplace report, go to this repo:

Olist-Segmentation-Dashboarding/ at master · Suhaib-88/Olist-Segmentation-Dashboarding (github.com)

As discussed in the previous post, here are the finalized optimal k values for each clustering algorithm.

We will continue to drill down until we get the best k value and clustering algorithm.

1. Analysis of K-means Clusters with Spark ML (k=3)

# File location and type
file_location = "/user/hive/warehouse/olist.db/predictor_category3"
file_type = "delta"

# CSV options
infer_schema = "false"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df3 = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)

display(df3)

PCA

Let’s reduce the dimensionality to two principal components so we can visualize the separation of the clusters graphically.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(df3)
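# Optional check (an addition, not in the original notebook): the fitted
# PCAModel reports the proportion of variance each of the two components captures.
print(model.explainedVariance)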

result = model.transform(df3).select('SELLER_ID', 'prediction', "pcaFeatures")
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df_temp = result.withColumn("xs", vector_to_array("pcaFeatures")).select(["SELLER_ID"] + ['prediction'] + [col("xs")[i] for i in range(2)]).withColumnRenamed('prediction', 'cluster')
df_temp = df_temp.toPandas()
df_temp['cluster'] = df_temp['cluster'].astype('category')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")


f, ax = plt.subplots(figsize=(10, 10))
sns.despine(f, left=True, bottom=True)

sns.scatterplot(x='xs[1]', y='xs[0]',
                hue="cluster",
                style="cluster",
                linewidth=0,
                s=100,
                data=df_temp, ax=ax)

Cluster Profiling

df_temp = df3.toPandas()
df_temp.groupby('prediction')[['QTD_ORDER', 'NUMBER_OF_PRODUCTS', 'QTD_PRODUCT_ORDER', 'NUMBER_OF_BUYERS', 'SALE_VALUE', 'TICKET_SIZE',
'RECENCY', 'DAYS_DELAY_DELIVERY', 'FREQUENCY', 'REVIEW_SCORE']].describe()

Using k=3 with Spark ML’s native K-means clustering, the number of orders has a strong influence on separating the majority group (prediction=0) from the most qualified group (prediction=2). Below are the detailed group profiles:

Prediction=0: in the most populous group (2,990 sellers), the average seller has 20 orders, sells 23 products and earns just under R$3,500. Since this seller has, on average, 20 different buyers, we can conclude that there is essentially no repurchase, i.e., no customer retention. The average ticket (calculated by dividing the average sales value by the average number of orders, since the “average of averages” produced by describe distorts the number; R$171.44 for prediction=0) and the average price (calculated by dividing the average sales value by the average quantity of products; R$151.03 for prediction=0) sit in the middle of the other groups, i.e., these sellers offer products with average added value. The last sale was almost 5 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 3 sales every 2 months. Some sellers in this group tend to delay delivery. This is the profile of most Olist sellers: they make few sales and are interesting to Olist only in volume.

Prediction=1: in the least populated group (33 sellers), the average seller has 26 orders, sells 31 products (just over 1 product per sale) and earns just over R$6,000. Since this seller has, on average, 25 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (average sales value divided by average number of orders; R$228.41 for prediction=1) and the average price (average sales value divided by average quantity of products; R$188.64 for prediction=1) are higher than those of the other groups, i.e., these sellers offer products with greater added value. The last sale was just over 1 month ago (after the recency adjustment of minus 44 days), and these sellers typically close almost 5 sales every 2 months. No seller in this group tends to be late with delivery. This is the profile of aspiring premium sellers on Olist: they make more sales than most sellers, though still at a level well below that of premium sellers, and compensate by selling more expensive products.

Prediction=2: in the 3rd group (72 sellers), the average seller has 531 orders, sells 599 products (just over 1 product per sale) and earns just over R$73,300. Since this seller has, on average, 525 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (average sales value divided by average number of orders; R$138.05 for prediction=2) and the average price (average sales value divided by average quantity of products; R$122.37 for prediction=2) are lower than those of the other groups, i.e., these sellers offer products with lower added value. The last sale was just over 16 days ago (after the recency adjustment of minus 44 days), and these sellers typically close almost 30 sales per month, nearly 1 per day. No seller in this group tends to be late with delivery. This is the profile of Olist’s premium sellers: they make a lot of sales and guarantee good revenue.
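These ticket and price figures can be reproduced directly from group-level means rather than from describe’s “average of averages” (a minimal sketch, assuming the df_temp pandas frame created above; the two commented values come from the prediction=0 profile):

means = df_temp.groupby('prediction')[['SALE_VALUE', 'QTD_ORDER', 'NUMBER_OF_PRODUCTS']].mean()
means['AVG_TICKET'] = means['SALE_VALUE'] / means['QTD_ORDER']           # ~R$171.44 for prediction=0
means['AVG_PRICE'] = means['SALE_VALUE'] / means['NUMBER_OF_PRODUCTS']   # ~R$151.03 for prediction=0
print(means[['AVG_TICKET', 'AVG_PRICE']].round(2))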

In general, profiles can be ordered as follows, considering the general quality of the profile and the importance of the seller to Olist:

Prediction=2 > Prediction=1 > Prediction=0

display(df3[['prediction','QTD_ORDER']])
display(df3[['prediction','NUMBER_OF_PRODUCTS']])
display(df3[['prediction','QTD_PRODUCT_ORDER']])
display(df3[['prediction','NUMBER_OF_BUYERS']])
display(df3[['prediction','SALE_VALUE']])
display(df3[['prediction','TICKET_SIZE']])
display(df3[['prediction','RECENCY']])
display(df3[['prediction','DAYS_DELAY_DELIVERY']])
display(df3[['prediction','FREQUENCY']])
display(df3[['prediction','REVIEW_SCORE']])
df_temp = df3.groupby('prediction','SELLER_STATE').count()
display(df_temp.orderBy('SELLER_STATE','prediction'))
df3.createOrReplaceTempView("df3")
df_seller_state = spark.sql("SELECT SELLER_STATE \
,SELLER_STATE_index \
FROM df3")

Results

[Charts omitted: per-cluster distributions for Prediction=0, Prediction=1 and Prediction=2]

2. Analysis of K-means Clusters with Scikit-learn (k=3)

# File location and type
file_location = "/user/hive/warehouse/olist.db/predictor_category1"
file_type = "delta"

# CSV options
infer_schema = "false"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df3 = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)

display(df3)
from pyspark.ml.feature import VectorAssembler

df3 = VectorAssembler().setInputCols(["QTD_ORDER","NUMBER_OF_PRODUCTS","QTD_PRODUCT_ORDER","NUMBER_OF_BUYERS","SALE_VALUE","TICKET_SIZE","RECENCY","DAYS_DELAY_DELIVERY","FREQUENCY","DELIVERED","SHIPPED","APPROVED","CANCELED","INVOICED","CREATED","UNAVAILABLE","PROCESSING","REVIEW_SCORE",'art', 'flowers', 'home_construction', 'fashion_male_clothing', 'kitchen_dining_laundry_garden_furniture', 'small_appliances', 'la_cuisine', 'bed_bath_table', 'signaling_and_security', 'office_furniture', 'computers', 'watches_gifts', 'auto', 'fashion_bags_accessories','construction_tools_lights', 'cool_stuff', 'cds_dvds_musicals', 'food', 'computers_accessories', 'perfumery', 'pet_shop', 'stationery', 'furniture_decor', 'drinks', 'construction_tools_safety', 'telephony', 'home_confort', 'furniture_mattress_and_upholstery', 'agro_industry_and_commerce', 'party_supplies', 'cine_photo', 'books_technical', 'home_appliances_2', 'music', 'costruction_tools_garden', 'christmas_supplies', 'audio', 'fashio_female_clothing', 'air_conditioning', 'market_place', 'health_beauty', 'fashion_shoes', 'fashion_underwear_beach', 'housewares', 'home_appliances', 'costruction_tools_tools', 'furniture_living_room', 'arts_and_craftmanship', 'fixed_telephony', 'electronics', 'diapers_and_hygiene', 'books_imported', 'baby', 'home_comfort_2', 'fashion_childrens_clothes', 'luggage_accessories', 'small_appliances_home_oven_and_coffee', 'food_drink', 'furniture_bedroom', 'tablets_printing_image', 'fashion_sport', 'toys', 'consoles_games', 'garden_tools', 'sports_leisure', 'industry_commerce_and_business', 'books_general_interest', 'construction_tools_construction', 'dvds_blu_ray', 'musical_instruments', 'security_and_services',
'SELLER_STATE_Index']).setOutputCol('features').transform(df3)

from pyspark.ml.feature import StandardScaler

# Spark's StandardScaler defaults to withStd=True and withMean=False, so
# features are scaled to unit standard deviation but not mean-centered.
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')
df3 = scaler.fit(df3).transform(df3)
display(df3)
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# This table has no SELLER_ID column, so synthesize a surrogate key with a
# row number (ordered by QTD_ORDER so the numbering is deterministic).
w = Window.orderBy("QTD_ORDER")
df3 = df3.withColumn("SELLER_ID", row_number().over(w))
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(df3)

result = model.transform(df3).select('SELLER_ID', 'prediction', "pcaFeatures")
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df_temp = result.withColumn("xs", vector_to_array("pcaFeatures")).select(["SELLER_ID"] + ['prediction'] + [col("xs")[i] for i in range(2)]).withColumnRenamed('prediction', 'cluster')
df_temp = df_temp.toPandas()
df_temp['cluster'] = df_temp['cluster'].astype('category')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")


f, ax = plt.subplots(figsize=(10, 10))
sns.despine(f, left=True, bottom=True)

sns.scatterplot(x='xs[1]', y='xs[0]',
                hue="cluster",
                style="cluster",
                linewidth=0,
                s=100,
                data=df_temp, ax=ax)

Cluster Profiling

df_temp = df3.toPandas()
df_temp.groupby('prediction')[["QTD_ORDER", "NUMBER_OF_PRODUCTS", "QTD_PRODUCT_ORDER", "NUMBER_OF_BUYERS", "SALE_VALUE", "TICKET_SIZE", "RECENCY", "DAYS_DELAY_DELIVERY", "FREQUENCY"]].describe()

Using k=3 with Scikit-learn’s K-means clustering, the number of orders has a strong influence on separating the majority group (prediction=0) from the most qualified group (prediction=1). Below are the detailed group profiles:

Prediction=0: in the most populous group (2,814 sellers), the average seller has 14 orders, sells 16 products and earns almost R$2,700. Since this seller has, on average, 14 different buyers, we can conclude that there is no repurchase, i.e., no customer retention. The average ticket (average sales value divided by average number of orders, since the “average of averages” produced by describe distorts the number; R$182.52 for prediction=0) and the average price (average sales value divided by average quantity of products; R$159.18 for prediction=0) are higher than those of the other groups, i.e., these sellers offer products with greater added value. The last sale was 5 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 4 sales every 3 months. Some sellers in this group tend to delay delivery. This is the profile of most Olist sellers: they make few sales and are interesting to Olist only in volume.

Prediction=1: in the least populated group (18 sellers), the average seller has 1,118 orders, sells 1,284 products (just over 1 product per sale) and earns just over R$143,000. Since this seller has, on average, 1,104 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (average sales value divided by average number of orders; R$127.98 for prediction=1) and the average price (average sales value divided by average quantity of products; R$111.39 for prediction=1) are lower than those of the other groups, i.e., these sellers offer products with lower added value. The last sale was almost 10 days ago (after the recency adjustment of minus 44 days), and these sellers typically close 61 sales per month, practically 2 per day. No seller in this group tends to be late with delivery. This is the profile of Olist’s premium sellers: they make a lot of sales and guarantee good revenue.

Prediction=2: in the 3rd group (263 sellers), the average seller has 146 orders, sells 162 products (just over 1 product per sale) and earns just over R$22,000. Since this seller has, on average, 144 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (average sales value divided by average number of orders; R$150.36 for prediction=2) and the average price (average sales value divided by average quantity of products; R$135.41 for prediction=2) are intermediate relative to the other groups, i.e., these sellers offer products with average added value. The last sale was almost 2 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 9 sales per month. Very few sellers in this group tend to delay delivery. This is the profile of aspiring premium sellers on Olist: they make more sales than most sellers, though still at a level well below that of premium sellers.

In general, profiles can be ordered as follows, considering the general quality of the profile and the importance of the seller to Olist:

Prediction=1 > Prediction=2 > Prediction=0

display(df3[['prediction','QTD_ORDER']])
display(df3[['prediction','NUMBER_OF_PRODUCTS']])
display(df3[['prediction','QTD_PRODUCT_ORDER']])
display(df3[['prediction','NUMBER_OF_BUYERS']])
display(df3[['prediction','SALE_VALUE']])
display(df3[['prediction','TICKET_SIZE']])
display(df3[['prediction','RECENCY']])
display(df3[['prediction','DAYS_DELAY_DELIVERY']])
display(df3[['prediction','FREQUENCY']])
display(df3[['prediction','REVIEW_SCORE']])
df_seller_state.createOrReplaceTempView("df_seller_state")
df_temp = spark.sql("SELECT SELLER_STATE, SELLER_STATE_INDEX FROM DF_SELLER_STATE GROUP BY SELLER_STATE, SELLER_STATE_INDEX")
df_temp.createOrReplaceTempView("df_temp")
df3.createOrReplaceTempView("df3")
df3 = spark.sql("SELECT prediction, A.SELLER_STATE FROM df3 A LEFT JOIN DF_TEMP B ON A.SELLER_STATE_INDEX = B.SELLER_STATE_INDEX")
df_temp = df3.groupby('prediction','SELLER_STATE').count()
display(df_temp.orderBy('SELLER_STATE','prediction'))

3. Analysis of K-means Clusters with Scikit-learn (k=6)

# File location and type
file_location = "/user/hive/warehouse/olist.db/predicoes_categoriaskl"
file_type = "delta"

# CSV options
infer_schema = "false"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df3 = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)

display(df3)
# The k=6 labels were stored in the 'km6' column; rename it to 'prediction'
# so the rest of the analysis matches the earlier sections
df3 = df3.withColumnRenamed('km6', 'prediction')

PCA

Let’s reduce the dimensionality to two principal components so we can visualize the separation of the clusters graphically.

from pyspark.ml.feature import VectorAssembler

df3 = VectorAssembler().setInputCols(["QTD_ORDER","NUMBER_OF_PRODUCTS","QTD_PRODUCT_ORDER","NUMBER_OF_BUYERS","SALE_VALUE","TICKET_SIZE","RECENCY","DAYS_DELAY_DELIVERY","FREQUENCY","DELIVERED","SHIPPED","APPROVED","CANCELED","INVOICED","CREATED","UNAVAILABLE","PROCESSING","REVIEW_SCORE",'art', 'flowers', 'home_construction', 'fashion_male_clothing', 'kitchen_dining_laundry_garden_furniture', 'small_appliances', 'la_cuisine', 'bed_bath_table', 'signaling_and_security', 'office_furniture', 'computers', 'watches_gifts', 'auto', 'fashion_bags_accessories','construction_tools_lights', 'cool_stuff', 'cds_dvds_musicals', 'food', 'computers_accessories', 'perfumery', 'pet_shop', 'stationery', 'furniture_decor', 'drinks', 'construction_tools_safety', 'telephony', 'home_confort', 'furniture_mattress_and_upholstery', 'agro_industry_and_commerce', 'party_supplies', 'cine_photo', 'books_technical', 'home_appliances_2', 'music', 'costruction_tools_garden', 'christmas_supplies', 'audio', 'fashio_female_clothing', 'air_conditioning', 'market_place', 'health_beauty', 'fashion_shoes', 'fashion_underwear_beach', 'housewares', 'home_appliances', 'costruction_tools_tools', 'furniture_living_room', 'arts_and_craftmanship', 'fixed_telephony', 'electronics', 'diapers_and_hygiene', 'books_imported', 'baby', 'home_comfort_2', 'fashion_childrens_clothes', 'luggage_accessories', 'small_appliances_home_oven_and_coffee', 'food_drink', 'furniture_bedroom', 'tablets_printing_image', 'fashion_sport', 'toys', 'consoles_games', 'garden_tools', 'sports_leisure', 'industry_commerce_and_business', 'books_general_interest', 'construction_tools_construction', 'dvds_blu_ray', 'musical_instruments', 'security_and_services',
'SELLER_STATE_Index']).setOutputCol('features').transform(df3)

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')
df3 = scaler.fit(df3).transform(df3)
display(df3)
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# This table also lacks a SELLER_ID column, so synthesize a surrogate key
# with a deterministic row number ordered by QTD_ORDER.
w = Window.orderBy("QTD_ORDER")
df3 = df3.withColumn("SELLER_ID", row_number().over(w))
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(df3)

result = model.transform(df3).select('SELLER_ID', 'prediction', "pcaFeatures")
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df_temp = result.withColumn("xs", vector_to_array("pcaFeatures")).select(["SELLER_ID"] + ['prediction'] + [col("xs")[i] for i in range(2)]).withColumnRenamed('prediction', 'cluster')
df_temp = df_temp.toPandas()
df_temp['cluster'] = df_temp['cluster'].astype('category')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")


f, ax = plt.subplots(figsize=(10, 10))
sns.despine(f, left=True, bottom=True)

sns.scatterplot(x='xs[1]', y='xs[0]',
                hue="cluster",
                style="cluster",
                linewidth=0,
                s=100,
                data=df_temp, ax=ax)

Cluster Profiling

df_temp = df3.toPandas()
df_temp.groupby('prediction')[['QTD_ORDER', 'NUMBER_OF_PRODUCTS', 'QTD_PRODUCT_ORDER', 'NUMBER_OF_BUYERS', 'SALE_VALUE', 'TICKET_SIZE',
'RECENCY', 'DAYS_DELAY_DELIVERY', 'FREQUENCY', 'REVIEW_SCORE']].describe()

Using k=6 with Scikit-learn’s K-means clustering, we find subgroups that are quite distinct from those in the other analyses, despite the low silhouette score. This can be an interesting grouping from a business point of view, as it supports up to 6 different, scalable seller policies and creates an incentive for sellers to work their way up from group to group. Below are the detailed group profiles:
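For reference, the silhouette score mentioned above can be recomputed from this table (a sketch, not one of the original notebook cells; it assumes df3 still holds the scaledFeatures vectors and the renamed prediction column):

import numpy as np
from sklearn.metrics import silhouette_score

pdf = df3.select('scaledFeatures', 'prediction').toPandas()
X = np.vstack(pdf['scaledFeatures'].apply(lambda v: v.toArray()))  # vectors -> 2-D array
print(silhouette_score(X, pdf['prediction']))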

Prediction=0: in the 3rd most populous group (194 sellers), the average seller has 26 orders, sells 29 products (just over 1 product per sale) and earns just over R$7,100. Since this seller has, on average, 26 different buyers, we can conclude that there is no repurchase, i.e., no customer retention. The average ticket (average sales value divided by average number of orders, since the “average of averages” produced by describe distorts the number; R$271.18 for prediction=0) and the average price (average sales value divided by average quantity of products; R$246.18 for prediction=0) are much higher than those of the other groups, i.e., these sellers offer products with high added value. The last sale was more than 3 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 2 sales per month. Some sellers in this group tend to delay delivery. This is the profile of sellers of more expensive products.

Prediction=1: in the 2nd most populous group (542 sellers), the average seller has 5 orders, sells 6 products (just over 1 product per sale) and earns just over R$1,100. Since this seller has, on average, 5 different buyers, we can conclude that there is no repurchase, i.e., no customer retention. The average ticket (R$212.36 for prediction=1) and the average price (R$175.57 for prediction=1) are the second highest among the groups, i.e., these sellers offer products with greater added value. The last sale was more than 11 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 1 sale every 3 months. Some sellers in this group tend to delay delivery, and the average review score is 2.5. This is the profile of sellers who didn’t work out and gave up on the platform.

Prediction=2: in the 4th most populous group (156 sellers), the average seller has 188 orders, sells 208 products (just over 1 product per sale) and earns around R$26,600. Since this seller has, on average, 186 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (R$141.71 for prediction=2) and the average price (R$127.78 for prediction=2) are intermediate relative to the other groups, i.e., these sellers offer products with average added value. The last sale was almost a month and a half ago (after the recency adjustment of minus 44 days), and these sellers typically close 11 sales per month. Very few sellers in this group tend to delay delivery. This is the profile of aspiring premium sellers on Olist: they make more sales than most sellers, though still at a level well below that of premium sellers.

Prediction=3: in the least populated group (14 sellers), the average seller has 45 orders, sells 50 products (just over 1 product per sale) and earns around R$4,500. Since this seller has, on average, 45 different buyers, we can conclude that there is no repurchase, i.e., no customer retention. The average ticket (R$100.04 for prediction=3) and the average price (R$90.26 for prediction=3) are much lower than those of the other groups, i.e., these sellers offer products with very low added value. The last sale was just over 1 month ago (after the recency adjustment of minus 44 days), and these sellers typically close almost 5 sales per month. No seller in this group delays delivery. This is the profile of a subgroup of rank-and-file sellers: low sales numbers, working with cheaper products, but compensating a little in volume.

Prediction=4: in the most populous group (2,170 sellers), the average seller has 19 orders, sells 21 products (just over 1 product per sale) and earns almost R$3,200. Since this seller has, on average, 18 different buyers, we can conclude that the repurchase rate is very low, i.e., there is almost no customer retention. The average ticket (R$167.43 for prediction=4) and the average price (R$146.40 for prediction=4) are the third highest among the groups, i.e., these sellers offer products with moderate added value. The last sale was just over 3 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 5 sales every 3 months. Few sellers in this group delay delivery. This is the profile of most Olist sellers: they make few sales and are interesting to Olist only in volume.

Prediction=5: in the 2nd least populated group (19 sellers), the average seller has 1,090 orders, sells 1,248 products (just over 1 product per sale) and earns around R$146,300. Since this seller has, on average, 1,076 different buyers, we can conclude that the repurchase rate is low, i.e., customer retention is low. The average ticket (R$134.22 for prediction=5) and the average price (R$117.25 for prediction=5) are the second lowest among the groups, i.e., these sellers offer products with lower added value. The last sale was almost 10 days ago (after the recency adjustment of minus 44 days), and these sellers typically close almost 60 sales per month, about 2 per day. No seller in this group delays delivery. This is the profile of Olist’s premium sellers: they make a lot of sales and guarantee good revenue.

In general, profiles can be ordered as follows, considering the general quality of the profile and the importance of the seller to Olist:

Prediction=5 > Prediction=2 > Prediction=0 > Prediction=3 > Prediction=4 > Prediction=1

display(df3[['prediction','QTD_ORDER']])
display(df3[['prediction','NUMBER_OF_PRODUCTS']])
display(df3[['prediction','QTD_PRODUCT_ORDER']])
display(df3[['prediction','NUMBER_OF_BUYERS']])
display(df3[['prediction','SALE_VALUE']])
display(df3[['prediction','TICKET_SIZE']])
display(df3[['prediction','RECENCY']])
display(df3[['prediction','DAYS_DELAY_DELIVERY']])
display(df3[['prediction','FREQUENCY']])
display(df3[['prediction','REVIEW_SCORE']])
df_seller_state.createOrReplaceTempView("df_seller_state")
df_temp = spark.sql("SELECT SELLER_STATE, SELLER_STATE_INDEX FROM DF_SELLER_STATE GROUP BY SELLER_STATE, SELLER_STATE_INDEX")
df_temp.createOrReplaceTempView("df_temp")
df3.createOrReplaceTempView("df3")
df3 = spark.sql("SELECT prediction, SELLER_STATE FROM df3 A LEFT JOIN DF_TEMP B ON A.SELLER_STATE_INDEX = B.SELLER_STATE_INDEX")
df_temp = df3.groupby('prediction','SELLER_STATE').count()
display(df_temp.orderBy('SELLER_STATE','prediction'))

Results

[Charts omitted: per-cluster distributions for Prediction=0 through Prediction=5]

4. Analysis of Bisecting K-means Clusters (k=3)

# File location and type
file_location = "/user/hive/warehouse/olist.db/predictor_category3"
file_type = "delta"

# CSV options
infer_schema = "false"
first_row_is_header = "True"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df3 = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)

display(df3)

PCA

Let’s reduce the dimensionality to two principal components so we can visualize the separation of the clusters graphically.

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(df3)

result = model.transform(df3).select('SELLER_ID', 'prediction', "pcaFeatures")
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df_temp = result.withColumn("xs", vector_to_array("pcaFeatures")).select(["SELLER_ID"] + ['prediction'] + [col("xs")[i] for i in range(2)]).withColumnRenamed('prediction', 'cluster')
df_temp = df_temp.toPandas()
df_temp['cluster'] = df_temp['cluster'].astype('category')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")


f, ax = plt.subplots(figsize=(10, 10))
sns.despine(f, left=True, bottom=True)

sns.scatterplot(x='xs[1]', y='xs[0]',
                hue="cluster",
                style="cluster",
                linewidth=0,
                s=100,
                data=df_temp, ax=ax)

Cluster Profiling

df_temp = df3.toPandas()
df_temp.groupby('prediction')[['QTD_ORDER', 'NUMBER_OF_PRODUCTS', 'QTD_PRODUCT_ORDER', 'NUMBER_OF_BUYERS', 'SALE_VALUE', 'TICKET_SIZE',
'RECENCY', 'DAYS_DELAY_DELIVERY', 'FREQUENCY', 'REVIEW_SCORE']].describe()

It is possible to see that the number of orders and the sales value have a strong influence on the segmentation, classifying sellers into small, medium and large.
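One rough way to sanity-check this is to inspect the PCA loadings (a sketch, not in the original notebook; model is the PCAModel fitted above, and input_cols is assumed to be the list of feature names used when the features vector was originally assembled, which is not defined in this cell):

import numpy as np

loadings = model.pc.toArray()                  # shape: (n_features, 2)
top = np.argsort(-np.abs(loadings[:, 0]))[:5]  # strongest features on PC1
for i in top:
    print(input_cols[i], round(float(loadings[i, 0]), 3))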

Prediction=0: in the most populous group (2,957 sellers), the average seller has 18 orders, sells 20 products and earns just under R$2,700. Since this seller has, on average, 18 different buyers, we can conclude that there is no repurchase, i.e., no customer retention. The average ticket (average sales value divided by average number of orders, since the “average of averages” produced by describe distorts the number; R$145.36 for prediction=0) and the average price (average sales value divided by average quantity of products; R$127.90 for prediction=0) are the lowest among the groups, i.e., these sellers offer products with low added value. The last sale was almost 5 months ago (after the recency adjustment of minus 44 days), and these sellers typically close 3 sales every 2 months. Some sellers in this group tend to delay delivery. This is the profile of most Olist sellers: they make few sales and are interesting to Olist only in volume.

Prediction=1: in the 2nd most populous group (119 sellers), the average seller has 226 orders, sells 252 products (more than 1 product per order) and earns just over R$40,200. Since this seller has, on average, 223 different buyers, we can conclude that the repurchase rate is low, i.e., there is some, but little, customer retention. The average ticket (R$387.74 for prediction=1) and the average price (R$177.91 for prediction=1) are the best among the groups, indicating products with higher added value (higher average price) and more products per sale (higher average ticket). The last sale was just over 1 month ago (after the recency adjustment of minus 44 days), and these sellers typically close almost 14 sales per month. Few sellers in this group tend to delay delivery. This is the profile of Olist’s intermediate sellers: they have a considerable number of more qualified sales.

Prediction=2: in the least populated group (19 sellers), the average seller has 988 orders, sells 1,134 products (more than 1 product per sale) and earns just over R$170,200. Since this seller has, on average, 976 different buyers, we can conclude that the repurchase rate is low, i.e., there is some, but little, customer retention. The average ticket (R$172.17 for prediction=2) and the average price (R$150.07 for prediction=2) are intermediate relative to the other groups, i.e., these sellers offer products with average added value. The last sale was 15 days ago (after the recency adjustment of minus 44 days), and these sellers typically close 52 sales per month. No seller in this group tends to delay delivery. This is the profile of Olist’s elite sellers: they make a lot of sales and guarantee good revenue.

In general, profiles can be ordered as follows, considering the general quality of the profile and the importance of the seller to Olist:

Prediction=2 > Prediction=1 > Prediction=0

display(df3[['prediction','QTD_ORDER']])
display(df3[['prediction','NUMBER_OF_PRODUCTS']])
display(df3[['prediction','QTD_PRODUCT_ORDER']])
display(df3[['prediction','NUMBER_OF_BUYERS']])
display(df3[['prediction','SALE_VALUE']])
display(df3[['prediction','TICKET_SIZE']])
display(df3[['prediction','RECENCY']])
display(df3[['prediction','DAYS_DELAY_DELIVERY']])
display(df3[['prediction','FREQUENCY']])
display(df3[['prediction','REVIEW_SCORE']])
df_temp = df3.groupby('prediction','SELLER_STATE').count()
display(df_temp.orderBy('SELLER_STATE','prediction'))

Results

[Charts omitted: per-cluster distributions for Prediction=0, Prediction=1 and Prediction=2]

5. Final Assessment of Clustering Techniques

  • The separation into 6 groups using Scikit-learn’s K-means, despite having the lowest silhouette score among the selected techniques, yields 6 profiles that are very interesting from a business point of view:
  • sellers of more expensive products (194 sellers, 6.3% of the base);
  • sellers who didn’t work out and gave up on the platform (542, 17.5%);
  • aspiring Olist premium sellers: they make more sales than most sellers, but still at a level well below that of premium sellers (156, 5%);
  • a subgroup of rank-and-file sellers: low number of sales, working with cheaper products, but compensating a little in volume (14, 0.5%);
  • the majority of Olist sellers: they make few sales and are interesting to Olist only in volume (2,170, 70.1%);
  • Olist’s premium sellers: they make a lot of sales and guarantee good revenue (19, 0.6%).

Therefore, our recommendation for Olist is twofold: use Spark’s native bisecting K-means clustering to classify sellers on an ABC curve enhanced by artificial intelligence, defining a different commercial policy for each of the 3 groups of sellers according to the value each group brings to the company; and use Scikit-learn’s K-means clustering with 6 clusters to classify sellers by their usage profile of the Olist platform, creating different direct marketing actions for each profile (e.g., win-back campaigns for sellers who abandoned the platform, and bonuses that encourage aspiring premium sellers to strive for the premium profile).
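As an illustration of the first recommendation (a hypothetical sketch; the tier letters simply follow the Prediction=2 > Prediction=1 > Prediction=0 ordering from section 4), the bisecting K-means labels could be mapped to an ABC curve like this:

from pyspark.sql.functions import when, col

df_abc = df3.withColumn(
    'ABC_TIER',
    when(col('prediction') == 2, 'A')   # elite sellers
    .when(col('prediction') == 1, 'B')  # intermediate sellers
    .otherwise('C')                     # long-tail sellers
)
display(df_abc.groupBy('ABC_TIER').count())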


Suhaib Arshad

A sales executive turned data scientist, with 3+ years of experience solving problems in the e-commerce, finance and recruitment domains.