Customer Segmentation using K-Means Clustering Algorithm

using Numpy, Pandas, Matplotlib, Seaborn and Scikit-learn libraries in Python

Dhiraj Prakash
CodeX
7 min read · Sep 16, 2021


In this blog post, we will perform customer segmentation using an unsupervised machine learning algorithm known as K-Means, with the help of Python libraries such as numpy, pandas, matplotlib, seaborn and scikit-learn.

Customer segmentation is the practice of categorizing consumers into groups based on shared qualities so that businesses may sell to each group effectively and efficiently.

Unsupervised learning is a kind of machine learning in which the training data is supplied to the algorithm without any pre-assigned labels or scores. Unsupervised learning algorithms must, as a result, first self-discover any naturally existing patterns in the training data set.

K-means clustering is a method that aims to partition the n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

To learn more about how the k-means algorithm works, view this post.

Here is a step-by-step outline of the project:

  1. Import the required libraries.
  2. Import the data.
  3. Data Pre-Processing and analysis.
  4. Identifying the number of clusters in the data.
  5. Train the model.
  6. Visualize the different clusters in the data.

The data for this project can be found here.

The complete code for this project can be found here.

Import the required libraries

Let us import all the required libraries for this project. Please note that all the packages have to be installed before you can import them.

numpy is a library that has an extensive collection of high-level mathematical functions which operate on arrays.

pandas is a library with data manipulation tools built on top of the existing functionality of numpy.

matplotlib is a library that has plotting tools.

seaborn is a library built on top of matplotlib that provides a high-level interface to draw compelling plots.

sklearn is a library that has machine-learning tools.
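
A minimal set of imports along these lines should cover everything used in this walkthrough (the aliases np, pd, plt and sns are common conventions, not requirements):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.cluster import KMeans  # k-means implementation used later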

Import the data

Let us read the data and check whether it has been imported successfully.

The read_csv function from pandas reads a CSV file into a DataFrame. Let’s assign the resulting DataFrame to a variable df.

Let’s check if the data has been imported correctly. We can do this using the head method. The head method lets us have a look at the first five records in the DataFrame.
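
A short sketch of these two steps, assuming the dataset has been downloaded locally as Mall_Customers.csv (the file name is an assumption; adjust the path to your copy):

    # Read the CSV file into a DataFrame; the file name is assumed.
    df = pd.read_csv("Mall_Customers.csv")

    # Peek at the first five records to confirm the import worked.
    df.head()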

Now that the data has been imported correctly, we can move on to the next step: data analysis.

Data Pre-Processing and Analysis

Let’s now explore the data we have gathered and check if any pre-processing has to be done.

Number of records and columns

Let’s check the number of rows and columns in the data set. We can use the shape attribute.
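
In code this is a one-liner (shape is an attribute, so no parentheses are needed):

    # Returns a (rows, columns) tuple for the DataFrame
    df.shape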

We can see from the output that there are 200 records and 5 columns in the dataset.

Information about the columns

Let’s try to understand the columns we are working with and also check their data types. We can use the info method.
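
The call below prints each column’s name, non-null count and data type:

    # Column names, non-null counts and dtypes
    df.info()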

We can see there are 5 columns:

1) CustomerID: a unique integer ID assigned to each record.

2) Gender: a string variable denoting the gender of the customer.

3) Age: an integer variable denoting the age of the customer.

4) Annual Income (k$): an integer variable denoting the customer’s annual income in thousands of dollars.

5) Spending Score (1–100): an integer variable denoting how much the customer spends at the mall, with 1 being the lowest and 100 the highest.

Missing Values check

Let’s see if there are any missing values in the data set.
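
One common way to check this is to count the nulls in each column:

    # Number of missing values per column; all zeros means nothing is missing
    df.isnull().sum()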

No values are missing in our data set.

Outlier Detection

An outlier is a data point that is significantly different from the other data points. It may cause severe problems during statistical analysis. Hence it is required to clean the data if any outliers are present.

We can use the boxplot method to check for outliers.
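
A possible way to draw the box plots with seaborn, assuming the standard column labels of this dataset:

    # One box plot per numeric feature; column labels are assumed.
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    for ax, col in zip(axes, ["Age", "Annual Income (k$)", "Spending Score (1-100)"]):
        sns.boxplot(y=df[col], ax=ax)
    plt.tight_layout()
    plt.show()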

The dots on the plots indicate outliers when present. We see just one dot in the annual income box plot. As the number of outliers is low, we can move forward with the analysis.

Statistical Information

Next, let’s use the describe method to view some basic statistical information of the integer columns present in our dataset.
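
The call looks like this; it reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column:

    # Summary statistics of the numeric columns
    df.describe()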

The CustomerID column has a minimum value of 1 and a maximum of 200, which matches the 200 records, so this column looks correct.

The mean Age of the people in our data set is 38.85.

The mean Annual Income is 60.56 k$.

The Spending Score ranges from a minimum of 1 to a maximum of 99, which is consistent with the stated 1–100 scale and suggests the data was collected without errors.

Count of Males vs Females visiting the mall

Let’s plot a count plot to visualize the count of males and females. We can use the countplot method from seaborn.
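
A minimal version of that plot:

    # Bar chart of the number of customers per gender
    sns.countplot(x="Gender", data=df)
    plt.show()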

We see from the visualization that we have more females than males visiting this mall.

Selecting only Annual Income and Spending Score for further analysis

Of the 5 columns available, we select only 2 for training the model. If we used all 5 columns, it would be difficult to visualize the result, since we cannot plot a 5-D chart; by selecting only 2 columns we can draw a 2-D plot and interpret it easily. Note that dimensionality-reduction techniques such as PCA could be used to work with all 5 columns and still visualize the result.
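
A sketch of this selection step; the exact column labels are an assumption based on how this dataset is usually distributed:

    # Keep only the two features used for clustering (column labels assumed)
    X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values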

Identifying the number of clusters in the data

We are going to use the elbow method to determine the right k size for the algorithm.

The elbow method is one of the most popular ways to select the optimum value of k for the model.

It works by calculating the Within-Cluster Sum of Squared Errors (WSS) for a range of values of k and choosing the value of k beyond which increasing k no longer reduces the WSS appreciably.

Let’s break down WSS:

  1. The squared error for each point is the square of the distance between the point and the predicted cluster center.
  2. Any distance metric such as the Euclidean distance or Hamming distance can be used.
  3. The WSS score is the sum of these squared errors for all data points.
  4. The value of k at the “elbow”, where the WSS curve bends and further increases in k yield only small reductions in WSS, is selected.

To read more about WSS, view this post.

Let’s plot the Elbow graph and identify the optimum number of clusters for our problem.
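
One way to produce the elbow curve with scikit-learn, which exposes the WSS of a fitted model as the inertia_ attribute (the range of k and the random_state value are arbitrary choices; X holds the two selected columns):

    # Fit k-means for k = 1..10 and record the WSS (inertia) of each fit
    wss = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, init="k-means++", random_state=42)
        km.fit(X)
        wss.append(km.inertia_)

    # Plot WSS against k and look for the "elbow"
    plt.plot(range(1, 11), wss, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("WSS (inertia)")
    plt.title("Elbow method")
    plt.show()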

From the elbow graph, we can see an elbow when the number of clusters = 5. Hence we can choose the value of k to be 5.

Train the model

As discussed above, we will use the k-means algorithm with the number of clusters k = 5 for the task at hand. We will use the scikit-learn library for this.

We initialize the k-means model with n_clusters=5 (the number of clusters) and init='k-means++'. k-means++ ensures a smarter initialization of the centroids and improves the quality of the clustering. random_state sets the seed of the random number generator so that the results can be reproduced.
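
A minimal sketch of this initialization and fit, with random_state=42 as an arbitrary seed:

    # Fit k-means with 5 clusters; fit_predict returns each point's cluster label
    kmeans = KMeans(n_clusters=5, init="k-means++", random_state=42)
    labels = kmeans.fit_predict(X)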

To read more about k-means implementation in scikit-learn use the following link.

Visualize the different clusters in the data

Let us draw a scatter plot to visualize the data points and their assigned clusters.
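
One possible way to draw this plot, reusing X and the labels from fit_predict; note that the colour map here will not necessarily reproduce the exact colours described below:

    # Scatter plot coloured by cluster label, with centroids marked in black
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="rainbow", s=40)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c="black", s=200, marker="X", label="Centroids")
    plt.xlabel("Annual Income (k$)")
    plt.ylabel("Spending Score (1-100)")
    plt.legend()
    plt.show()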

The scatter plot shows that the data points are divided clearly into 5 distinct clusters.

The cyan cluster earns an annual income in the range of 0 to 40 K $. Their spending score is low. Hence they become potential opportunities for the mall and can be targeted by giving special discounts.

The yellow cluster earns roughly the same annual income as the cyan group, but their spending score is high, so they can be thought of as the ideal group.

The green cluster earns a high annual income, but their spending scores are low. This suggests that they visit the mall but do not spend the money to buy anything. They are also potential opportunities for the mall to target.

The red cluster earns a high annual income, and they also have a high spending score. The mall must ensure that their interests are met as time moves on.

Summary

Here’s a summary of the step-by-step process we followed to implement the k-means algorithm for the problem of Customer segmentation:

  1. We imported the required libraries.
  2. We imported the data.
  3. We pre-processed and analysed the data.
  4. We identified the number of clusters in the data.
  5. We trained the model.
  6. We visualized the different clusters in the data.
  7. We gave insights on how the mall can improve its revenue.

Future Work

Here are some ways in which the project can be extended.

  • We used only 2 columns to train the model. Combinations of other columns could be used to gain further insights.
  • A different clustering algorithm, such as DBSCAN or BIRCH, could be used to train the model.
  • This algorithm can be used on different data and other problems like developing a recommender system.
