Case Study: Clustering with a Supermarket Dataset

Irfan Fadhullah
Published in The Startup
6 min read, Sep 26, 2020

Okay guys, today I want to share one of my previous data science projects. Before we start on the case, let me explain a little bit about machine learning.

What is Machine Learning?
Machine Learning (ML) is the field of computer science that lets computer systems make sense of data in much the same way as human beings do.

Machine Learning is the science (and art) of programming computers so they can learn from data.

Types of Machine Learning

Supervised Learning
Supervised learning is typically done in the context of classification, when we want to map input to output labels, or regression, when we want to map input to a continuous output.

Supervised Machine Learning

Unsupervised Learning
Unsupervised Learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.

Unsupervised Machine Learning

Clustering Case Study
Context: The number of supermarkets in the most populated cities is growing, and market competition is high. The dataset is a historical sales record of a supermarket company, collected from 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.

Attribute information

  1. Invoice id: Computer generated sales slip invoice identification number
  2. Branch: Branch of supercenter (3 branches are available identified by A, B and C).
  3. City: Location of supercenters
  4. Customer type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
  5. Gender: Gender type of customer
  6. Product line: General item categorization groups — Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
  7. Unit price: Price of each product in $
  8. Quantity: Number of products purchased by customer
  9. Tax: 5% tax fee on the customer's purchase
  10. Total: Total price including tax
  11. Date: Date of purchase (Record available from January 2019 to March 2019)
  12. Time: Purchase time (10am to 9pm)
  13. Payment: Payment method used by the customer for the purchase (3 methods are available: Cash, Credit card and Ewallet)
  14. COGS: Cost of goods sold
  15. Gross margin percentage: Gross margin percentage
  16. Gross income: Gross income
  17. Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)

Acknowledgements: Thanks to everyone who takes the time and energy to build Kernels with this dataset, and to the reviewers.

Purpose: This dataset can be used for predictive data analytics.

That covers the dataset and the problem. Now that we understand the business problem, let’s move on to the second step.

Data Understanding
In this step, I want to know more about the data. We need to read and process the data using libraries such as pandas, numpy, matplotlib, and so on.

First, I need to import the libraries and packages into the Jupyter notebook and read the dataset. I store the dataset in a variable called data.

Read the Dataset
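A minimal sketch of this step, not the exact notebook code. The file name `supermarket_sales.csv`, the helper `load_data`, and the column spellings are assumptions, and the three inline rows are made up so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical path: point this at wherever the CSV is saved locally.
CSV_PATH = "supermarket_sales.csv"

# Tiny made-up sample with a few columns from the attribute list,
# used only so the snippet runs without the real file.
SAMPLE_CSV = io.StringIO(
    "Invoice ID,Branch,City,Customer type,Quantity,Rating\n"
    "000-00-0001,A,City 1,Member,7,9.1\n"
    "000-00-0002,C,City 3,Normal,5,9.6\n"
    "000-00-0003,B,City 2,Normal,2,7.4\n"
)


def load_data(source):
    """Read the sales data from a file path or a file-like object."""
    return pd.read_csv(source)


data = load_data(SAMPLE_CSV)  # with the real file: load_data(CSV_PATH)
print(data.head())            # first rows, as a quick sanity check
```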

Next, I want to check for missing values and the general condition of the dataset, such as the data types and the number of rows.

Data Information
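This check can be sketched with standard pandas calls. The small DataFrame below is a made-up stand-in for `data`, with one missing value added on purpose so the output shows something.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset (one NaN added deliberately).
data = pd.DataFrame({
    "Branch": ["A", "B", "C", "A"],
    "Quantity": [7, 5, 2, 8],
    "Rating": [9.1, np.nan, 7.4, 8.4],
})

data.info()                     # column dtypes and non-null counts
n_rows, n_cols = data.shape     # number of rows and columns
missing = data.isnull().sum()   # missing values per column

print(n_rows, n_cols)
print(missing)
```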

Next, I want to see the statistical summary of the data and recheck for missing values.

Summary Statistics
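A short sketch of the summary step, again on made-up values. `describe()` reports the count, mean, standard deviation, min, quartiles, and max for numeric columns, and the `50%` row is the median.

```python
import pandas as pd

# Made-up stand-in for the real dataset.
data = pd.DataFrame({
    "Unit price": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [1, 2, 3, 4],
    "Customer type": ["Member", "Normal", "Member", "Normal"],
})

summary = data.describe()       # numeric columns only by default
print(summary)
print(data.isnull().sum())      # recheck the missing values
```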

From that result, we know the mean, median, and other descriptive statistics of the data.
Okay, now that we have some summary statistics, let’s look at the distribution of each variable that is possible to plot.

Distribution Shape of Data
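One way to draw these distributions is a histogram per numeric column. This is a sketch on randomly generated stand-in data; the column names and value ranges are assumptions, not values taken from the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data (assumed columns and ranges).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Unit price": rng.uniform(10, 100, 200),
    "Quantity": rng.integers(1, 11, 200),
    "Rating": rng.uniform(4, 10, 200),
    "Branch": rng.choice(["A", "B", "C"], 200),
})

# Only numeric columns can be drawn as histograms.
numeric = data.select_dtypes(include="number")
axes = numeric.hist(bins=20, figsize=(10, 6))
plt.tight_layout()

# Skewness gives a numeric hint about each distribution's shape.
print(numeric.skew())
```

In a notebook the figure shows up inline, so the `matplotlib.use("Agg")` line can be dropped.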

Next, I want to do some exploratory analysis (univariate, bivariate, and multivariate) to see the relationships between the variables.

In this image, I compare the customer types.

Customer Type
Customer Type by Type

I also compare the branch column against the customer type column.

Branch by Type
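The same two comparisons can be checked as plain counts. This is a sketch on made-up rows, and it assumes the "type column" is `Customer type`.

```python
import pandas as pd

# Made-up rows; the real data has one row per invoice.
data = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C", "C", "C"],
    "Customer type": ["Member", "Normal", "Member", "Member",
                      "Normal", "Normal", "Member"],
})

# Univariate: how many invoices per customer type.
type_counts = data["Customer type"].value_counts()

# Bivariate: customer type counts within each branch.
branch_by_type = pd.crosstab(data["Branch"], data["Customer type"])

print(type_counts)
print(branch_by_type)
```

Calling `branch_by_type.plot(kind="bar")` would turn the table into a grouped bar chart.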

Then, let’s compare the gross income column grouped by city.

Next, I want to look at the rating by city: I group by city, take the mean, and store the result in a new variable called rating.

From that image, we can see that the mean rating is almost the same in every city.
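Both group-by steps can be sketched like this on made-up numbers. The city names are placeholders, and summing the gross income is an assumption, since the text does not say which aggregation was used.

```python
import pandas as pd

# Made-up rows with placeholder city names.
data = pd.DataFrame({
    "City": ["City 1", "City 1", "City 2", "City 2", "City 3", "City 3"],
    "gross income": [10.0, 20.0, 15.0, 25.0, 30.0, 10.0],
    "Rating": [7.0, 8.0, 6.0, 8.0, 7.0, 7.0],
})

# Total gross income per city (sum is an assumed aggregation).
income_by_city = data.groupby("City")["gross income"].sum()

# Mean rating per city, stored in a variable called rating.
rating = data.groupby("City")["Rating"].mean()

print(income_by_city)
print(rating)
```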

Data Preprocessing
In this step, I do some data preprocessing: scaling the data with MinMaxScaler from scikit-learn, converting the categorical data to numeric with a one-hot encoder, and looking at the correlation matrix of the variables. We need to scale the data because K-Means clustering relies on distances, so its result is strongly affected by the range of each variable.

Correlation Matrix

Modeling
Based on the data understanding and preprocessing, I chose the gross income, cost (COGS), and rating variables for the clustering model.
For the model, I chose K-Means clustering. Next, let’s find an appropriate number of clusters using the Elbow Method, which shows how the model's error changes as K grows and so points to the best value of K.

Elbow Method
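The Elbow Method fits K-Means for a range of K values and records the inertia (the within-cluster sum of squared distances) for each. Here is a sketch on random stand-in features; `n_init=10` and `random_state=42` are arbitrary choices, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the three scaled features
# (gross income, cost, rating), each already in [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))

# Fit K-Means for K = 1..10 and keep the inertia of each fit.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# The "elbow" is where the drop in inertia starts to flatten out.
for k in range(2, 11):
    print(k, round(inertias[k - 1] - inertias[k], 3))
```

Plotting `list(inertias)` against `list(inertias.values())` with matplotlib gives the usual elbow curve.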

As you can see from the result above, the best value of K for the model is 4, because the drop from K = 4 to K = 5 is smaller than the drop from K = 3 to K = 4. Next, I plug this value of K into the model.

K-Means Clustering Result
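Fitting the model with K = 4 can be sketched as follows. The features are random stand-ins with assumed column names, so the clusters printed here say nothing about the real supermarket data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for the three chosen variables.
rng = np.random.default_rng(0)
cogs = rng.uniform(10, 1000, 300)
features = pd.DataFrame({
    "gross income": 0.05 * cogs,   # made-up relation for the demo
    "cogs": cogs,
    "Rating": rng.uniform(4, 10, 300),
})

# Scale first, then cluster with K = 4.
X = MinMaxScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
features["cluster"] = kmeans.fit_predict(X)

# Cluster sizes and the average profile of each cluster.
print(features["cluster"].value_counts().sort_index())
print(features.groupby("cluster").mean())
```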

Finally, you can see the result of the clustering above.
Once the modeling is done, we need to evaluate it. In this step, I look at the Silhouette coefficient of the K-Means clustering model, which lets us pick the best value of K based on the Silhouette evaluation.

Silhouette Evaluation

This is the result for K = 4: the Silhouette coefficient is about 0.79. Compared with K = 3 and K = 5, K = 4 gives the best result at 0.79.
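The comparison across K = 3, 4, and 5 can be sketched with `silhouette_score` from scikit-learn. The data is a random stand-in, so the scores will not match the 0.79 reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for gross income, cost, and rating.
rng = np.random.default_rng(0)
cogs = rng.uniform(10, 1000, 300)
raw = np.column_stack([0.05 * cogs, cogs, rng.uniform(4, 10, 300)])
X = MinMaxScaler().fit_transform(raw)

# Silhouette coefficient (between -1 and 1, higher is better) for each K.
scores = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```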

Conclusion
From the results above, the best K-Means clustering model for the supermarket data uses K = 4, because it performs best on both the Silhouette coefficient and the Elbow Method. The dataset is very clean, so it only needed a little cleaning. The dataset has no label column, so I decided to use clustering, specifically K-Means, because it is a simple and widely used clustering model in machine learning. That is all from me. Feel free to leave a comment if you spot any mistakes I made.
