This article summarizes the content of the Data Science Essential class by BOTNOI on the topic of customer segmentation. We will explore the Supermarket dataset from michelecoscia.com and apply an unsupervised machine learning technique: K-means clustering.
Unsupervised learning is a machine learning approach in which the user does not need to supervise the model. Instead, the model works on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.¹
Before we begin, let’s review the K-means Algorithm.
How the K-Means algorithm works:
Step 1− Pick the number of clusters, K.
Step 2− Select K random points from the data as centroids.
Step 3− Next, the cluster assignment step. Assign each data point to the cluster centroid it is closest to.
Step 4− Centroids are moved to the average positions of the data associated with them.
Step 5− Repeat steps 3 and 4 until
- Centroids of newly formed clusters do not change
- Points remain in the same cluster
- Maximum number of iterations is reached
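The steps above can be sketched as a minimal from-scratch loop in NumPy (an illustrative version only; in practice a library implementation such as scikit-learn's `KMeans` adds smarter initialization and optimizations):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: select K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 3)` returns one cluster label per row of `X` plus the final centroids.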
Now let’s get into the dataset.
Our Machine Learning Pipeline:
- Get Data
- Data Preprocessing
- Feature Engineering
- Modeling (K-Means Clustering)
- Data Visualization
1. Get Data
First, we need to download the dataset from michelecoscia.com. In this case, we can use the !gdown command in Google Colab to download the dataset.
2. Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
Import the necessary libraries, then read the pickle files.
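Loading a pickle into a DataFrame is a one-liner with pandas. The snippet below builds a tiny mock purchases table (the column names are assumptions based on the descriptions that follow) and round-trips it through a pickle, just to show the pattern; in the notebook you would simply call `pd.read_pickle` on each downloaded file:

```python
import pandas as pd

# Tiny mock frame mirroring the purchases schema (names are assumptions).
mock = pd.DataFrame({'customer_id': [1, 1, 2],
                     'shop_id': [1, 2, 1],
                     'product_id': [10, 11, 10],
                     'quantity': [2, 1, 3]})
mock.to_pickle('purchases.pkl')

# Reading a pickle back into a DataFrame:
df_purchases = pd.read_pickle('purchases.pkl')
print(df_purchases.head())
```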
Let’s take a look at the dataset! Select 5 samples for inspection.
- df_purchases has records of each customer's transactions: product_id (what they bought), shop_id (which shop), and quantity
- df_distances has records of the distance between each customer and each shop
- df_prices has records of the price for each product
We then check the info of each of them.
As we can see, this dataset is quite organized and well-prepared, and has no null values, so we don’t need to apply many preprocessing steps.
We can also see that this dataset is huge. The df_purchases dataframe has 24,638,724 entries.
Now let’s merge all three dataframes together for the next steps.
We also add two new columns: ‘amount’ for the total price and ‘cnt_txn’ to keep track of the number of transactions.
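The merge and the two derived columns might look like this (a sketch on tiny mock frames; the column names are assumptions based on the descriptions above):

```python
import pandas as pd

# Mock frames with the columns described above.
df_purchases = pd.DataFrame({'customer_id': [1, 1, 2],
                             'shop_id': [1, 2, 1],
                             'product_id': [10, 11, 10],
                             'quantity': [2, 1, 3]})
df_prices = pd.DataFrame({'product_id': [10, 11], 'price': [5.0, 3.0]})
df_distances = pd.DataFrame({'customer_id': [1, 1, 2],
                             'shop_id': [1, 2, 1],
                             'distance': [0.5, 1.2, 0.8]})

# Merge purchases with prices (on product) and distances (on customer+shop).
df = (df_purchases
      .merge(df_prices, on='product_id', how='left')
      .merge(df_distances, on=['customer_id', 'shop_id'], how='left'))

# 'amount' = total price of the line item; 'cnt_txn' = 1 per row, so a
# later groupby-sum yields the number of transactions.
df['amount'] = df['price'] * df['quantity']
df['cnt_txn'] = 1
```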
Let’s check the new dataframe.
And check the basic stats.
Now, we check the number of unique customers, products, and shops.
3. Feature Engineering
After we have understood the dataset, merged the dataframes together, and created new columns, it’s time for feature engineering.
The goal of feature engineering is simply to make your data better suited to the problem at hand.
Consider “apparent temperature” measures like the heat index and the wind chill. These quantities attempt to measure the perceived temperature to humans based on air temperature, humidity, and wind speed, things which we can measure directly. You could think of an apparent temperature as the result of a kind of feature engineering, an attempt to make the observed data more relevant to what we actually care about: how it actually feels outside!²
You might perform feature engineering to:
- improve a model’s predictive performance
- reduce computational or data needs
- improve interpretability of the results
The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron
We will create a new feature using a pivot table to see the amount of money each customer paid at each shop.
And, for the sake of simplicity, we will rename the columns so they are easier to read and to merge with other dataframes.
We will replace nan values with 0 to be able to create other important features.
Create another feature: the average amount of money each customer paid across all the shops.
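The pivot, rename, NaN-fill, and per-customer average described above could look like this (a sketch on a mock frame; the column-naming scheme is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'customer_id': [1, 1, 2, 3],
                   'shop_id': [1, 2, 1, 2],
                   'amount': [10.0, 3.0, 15.0, 7.0]})

# Amount each customer spent at each shop; missing combinations become NaN.
pivot = df.pivot_table(index='customer_id', columns='shop_id',
                       values='amount', aggfunc='sum')

# Rename columns so they read well and merge cleanly later.
pivot.columns = [f'amount_shop_{c}' for c in pivot.columns]

# A NaN means the customer never bought at that shop, so 0 is the right fill.
pivot = pivot.fillna(0)

# Average amount each customer paid across all the shops.
pivot['amount_mean'] = pivot.mean(axis=1)
```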
Now, we could keep adding new features like this one by one, but it’s better to create a function for that purpose.
Create new correlated features.
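One way such a helper might look (the name, signature, and naming scheme are assumptions, not the article's exact code): it pivots any value column by customer × shop with a given aggregation, adds the per-customer mean, and joins the result onto a running feature table.

```python
import pandas as pd

def add_pivot_features(features, df, value, aggfunc, prefix):
    """Pivot `value` by customer x shop with `aggfunc`, prefix the new
    columns, and join them onto the running feature table."""
    p = df.pivot_table(index='customer_id', columns='shop_id',
                       values=value, aggfunc=aggfunc).fillna(0)
    p.columns = [f'{prefix}_shop_{c}' for c in p.columns]
    p[f'{prefix}_mean'] = p.mean(axis=1)
    return features.join(p, how='left').fillna(0)

# Demo on a mock frame:
df = pd.DataFrame({'customer_id': [1, 1, 2],
                   'shop_id': [1, 2, 1],
                   'amount': [10.0, 3.0, 15.0],
                   'cnt_txn': [1, 1, 1]})
features = pd.DataFrame(
    index=pd.Index(df['customer_id'].unique(), name='customer_id'))
features = add_pivot_features(features, df, 'amount', 'sum', 'amount')
features = add_pivot_features(features, df, 'cnt_txn', 'sum', 'cnt')
```

Each call adds one family of correlated features, which is how a handful of raw columns can grow into the 44-feature table used below.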
We can now call the function and check if there are any null values.
Let’s inspect some columns of the dataframe.
Originally, there were 6 features. Now there are 44 in total, enough to improve the performance of the machine learning model. The next step is to normalize the data, since the ranges of the numbers vary widely.
Standardization: scales each feature so that its distribution is centered around 0 with a standard deviation of 1.
new_data = (data - data.mean()) / data.std()
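In pandas the formula above applies column-wise in one line (scikit-learn's `StandardScaler` does the same job, differing only in that it divides by the population standard deviation):

```python
import pandas as pd

features = pd.DataFrame({'amount_shop_1': [10.0, 0.0, 20.0],
                         'amount_shop_2': [3.0, 7.0, 2.0]})

# Standardize every column: zero mean, unit standard deviation.
scaled = (features - features.mean()) / features.std()
```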
4. Modeling
Our dataset is now ready for machine learning. In this case, we are going to use the K-Means algorithm. The question is: how do we decide how many clusters to use?
In K-Means clustering, we can apply the elbow method to select the number of clusters, K.
The following code runs a loop from k=2 to k=16, appending the inertia score, or Within-Cluster Sum of Squares (WCSS), for each k, and then plots inertia against the number of clusters.
WCSS or inertia is the sum of squares of the distances of each data point in all clusters to their respective centroids.
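The loop might look like this (sketched on a small synthetic feature matrix standing in for our 44-column table):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Mock standardized feature matrix: three well-separated groups.
X = np.vstack([rng.normal(c, 0.3, size=(30, 4)) for c in (0, 3, 6)])

inertias = []
ks = range(2, 17)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # WCSS for this k
```

Plotting `inertias` against `list(ks)` with matplotlib (`plt.plot(list(ks), inertias, marker='o')`) then reveals the elbow.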
We can choose the number of clusters by locating the biggest turning point (the “elbow”) of the graph, either by eye or programmatically.
One programmatic approach: for each point on the curve, compute its perpendicular distance to the straight line joining the curve’s first and last points; the point with the largest distance is the elbow.
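A small sketch of that point-to-line computation (using the standard perpendicular-distance formula; the function name is an assumption):

```python
import numpy as np

def elbow_point(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the
    straight line joining the first and last points of the curve."""
    xs = np.asarray(list(ks), dtype=float)
    ys = np.asarray(inertias, dtype=float)
    x1, y1, x2, y2 = xs[0], ys[0], xs[-1], ys[-1]
    # Perpendicular distance from each point to the line (x1,y1)-(x2,y2).
    num = np.abs((y2 - y1) * xs - (x2 - x1) * ys + x2 * y1 - y2 * x1)
    den = np.hypot(y2 - y1, x2 - x1)
    return int(xs[np.argmax(num / den)])
```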
Another method is to calculate the silhouette score.
Plot the graph of silhouette score.
Some facts about the silhouette score:
- A score of 1 means the clusters are very dense and nicely separated. A score of 0 means the clusters overlap. A score below 0 means data points may have been assigned to the wrong cluster.
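Computing the score for a range of k values is a short loop (again sketched on a synthetic stand-in for our feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic groups, so the best k should be 3 here.
X = np.vstack([rng.normal(c, 0.3, size=(30, 4)) for c in (0, 3, 6)])

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```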
Hence, a number of clusters between 6 and 9 is our best bet!
It’s time to run the K-Means algorithm!
Check out the result!
Now our dataframe has one more column: the predicted cluster for each row.
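Fitting the final model and attaching the labels takes two lines (sketched here with k=2 on mock features; the article's run uses the k chosen above):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = pd.DataFrame(
    np.vstack([rng.normal(c, 0.3, size=(30, 4)) for c in (0, 5)]),
    columns=[f'f{i}' for i in range(4)])

# Fit with the chosen k and attach the predicted cluster to each row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
features['cluster'] = kmeans.fit_predict(features)
```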
Plot a bar chart to find the number of customers in each cluster.
It turns out that the majority of customers are in the first cluster. Next, we move on to data visualization.
5. Data Visualization
Since our dataframe has 44 features, it is not possible for a human being to see the data in 44 dimensions, so we need to apply PCA (Principal Component Analysis), a technique for reducing the number of dimensions.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
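Projecting the clustered features onto the first two principal components might look like this (on a synthetic stand-in for the real table):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Mock high-dimensional features with two separated groups.
X = np.vstack([rng.normal(c, 0.3, size=(30, 10)) for c in (0, 4)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Keep only the first two principal components for a 2-D view.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
```

A scatter plot colored by cluster, e.g. `plt.scatter(X2[:, 0], X2[:, 1], c=labels)`, then shows how well the clusters separate.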
Next, we calculate the mean of each column within every cluster to see how the clusters differ.
Then, by looking at how much the customers in each cluster spend at each shop, and at their distances, we may be able to see differences in customer behaviour.
Stack cluster, shop_id, and total money for visualization.
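The per-cluster means and the long-form stacking might look like this (on a mock frame with hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'cluster': [0, 0, 1, 1],
                   'amount_shop_1': [10.0, 12.0, 1.0, 0.0],
                   'amount_shop_2': [0.0, 2.0, 8.0, 9.0]})

# Mean of every feature per cluster.
cluster_means = df.groupby('cluster').mean()

# Stack to long form (cluster, shop column, mean amount) -- handy for
# stacked-bar or radar plotting.
stacked = cluster_means.stack().reset_index()
stacked.columns = ['cluster', 'shop', 'mean_amount']
```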
What do we know from this graph?
- Cluster 0 : Majority of customers bought products from shop_id5 > shop_id1 > shop_id2 > shop_id3 > shop_id4
- Cluster 1 : Not many money transactions in this group of customers
- Cluster 2 : Majority of customers bought products from shop_id1 > shop_id2 > shop_id3 > shop_id5 > shop_id4
- Cluster 3 : Majority of customers bought products from shop_id1 and shop_id2
- Cluster 4 : Majority of customers bought products from shop_id2 and shop_id1
- Cluster 5 : Majority of customers bought products from shop_id3 > shop_id1 > shop_id2 > shop_id5 > shop_id4
- Cluster 6 : Not many money transactions in this group of customers
- Cluster 7 : Majority of customers bought products from shop_id4 > shop_id1 > shop_id2
Plot a radar chart for a more detailed visualization.
The K-Means algorithm is used for unsupervised learning on unlabelled data, and it is suitable for clustering small to large datasets.
We gained insight into the data by clustering the customers into different groups and examining the spending behaviour of each cluster. This is very valuable if we are planning to add a new shop or to apply what we learned from the dataset for other business purposes.
Google Colab link
Follow me on Github