Customer Segmentation using Machine Learning | Why? | How?

Yashashri Shiral
8 min read · Jun 27, 2022


After reading this article, you will understand how to do ‘Customer Segmentation’ using machine learning and why it’s important from a business perspective.

This is my first article, so give it a clap if you find it informative.

Concept:

Businesses often struggle to provide a personalized experience to their customers. Understanding your customers is crucial if you want to grow your business: it helps you create targeted campaigns and ads, increase customer loyalty, improve customer service, and identify new opportunities. In marketing, personalization can be used to convert potential customers and boost sales. So, what is customer segmentation? It is the method of dividing customers into groups that share similar characteristics.

Types of segmentation:

  1. Demographic Segmentation — based on age, gender, income, etc.
  2. Geographic Segmentation — based on country, city, state, town, etc.
  3. Psychographic Segmentation — based on social class, personality, attitude, values, interests, etc.
  4. Behavioral Segmentation — based on customer behavior, activities, and frequent actions.
  5. Value-based Segmentation — based on the economic value of specific customer groups to the business.

To create a customer segmentation strategy, you first need to determine your team’s goals. Then segment customers into groups and target them based on their related characteristics. For the most effective results, you should analyze your marketing efforts and fine-tune your messaging as you learn more about each segment.

Data Science Project Lifecycle:

Now that we understand our business problem, the next step is to follow the data science project lifecycle: collecting, cleaning, exploring, preprocessing, modeling, and evaluating data. In the real world, you would work through this process iteratively, selecting different features and models until the business goals are satisfied.

Data:

In this post, I’m using data from Kaggle. In the real world, you would have internal data such as customer profiles, purchase history, and customer activity, plus external data such as media browsing, surveys, and income.

I’m using the customer segmentation dataset from Kaggle. The segmentation we are doing here is demographic.

In the real world, you would have to coordinate with stakeholders for data collection, and you might have to do substantial data manipulation. Here our job is a little easier because we have ready-made data.
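
As a minimal sketch (the file name is a placeholder; point it at wherever you saved the Kaggle download), loading and taking a first look at the data could look like this:

```python
import pandas as pd

# The file name is a placeholder -- adjust it to your Kaggle download.
df = pd.read_csv("customer_segmentation.csv")

print(df.shape)           # rows x columns
df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # missing values per column
```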

Exploratory Data Analysis:

Why EDA is important:

  1. It helps in uncovering the underlying structure, trends, and patterns of the dataset.
  2. It helps in understanding null values, outliers, and duplicates in the dataset.
  3. It helps in understanding the distribution of variables and the relationship between them.

First, I want to know the distribution of the categorical variables. I am using pie plots to show each distribution in percentage format.
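
A sketch of such pie plots, assuming column names along the lines of the Kaggle dataset (Gender, Ever_Married, Graduated, Profession):

```python
import matplotlib.pyplot as plt

# Column names are assumptions based on the dataset described above.
categorical_cols = ["Gender", "Ever_Married", "Graduated", "Profession"]

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
for ax, col in zip(axes.flatten(), categorical_cols):
    # value_counts gives the frequency of each level;
    # autopct annotates each slice with its percentage.
    df[col].value_counts().plot.pie(ax=ax, autopct="%1.1f%%")
    ax.set_ylabel("")
    ax.set_title(col)
plt.tight_layout()
plt.show()
```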

You can also print each column’s distribution as plain text.
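
For example, a short sketch that prints the same distributions as percentages:

```python
# Print each column's distribution as percentages.
for col in categorical_cols:
    print(df[col].value_counts(normalize=True).mul(100).round(1), "\n")
```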

From the analysis above, we know the data has many categorical variables and also contains null values. There are no extreme outliers that need to be removed.

Data Cleaning:

Why Data Cleaning is important:

  1. Missing/null values in the data reduce the efficiency of your ML model.
  2. They also affect the overall distribution of the data.
  3. They can also bias the ML model’s estimates.

There are a couple of ways to handle missing values in data.

  1. Fill missing values with a placeholder such as ‘NULL’ or 0 if you don’t want to change the distribution of the data.
  2. Impute missing data with the median or mean.
  3. Backward fill — propagates the next observed non-null value backward.
  4. Forward fill — propagates the last observed non-null value forward.

In this scenario, I’m imputing missing values with the median for numeric variables; for categorical variables, the most frequent value plays the same role, since a median isn’t defined for unordered categories. You need to understand your data and choose the method you want to use. You can always come back to this step if your ML model isn’t performing as expected.
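
A minimal imputation sketch, assuming `df` is the DataFrame loaded earlier:

```python
# Impute numeric columns with the median and categorical (object)
# columns with their most frequent value.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum())  # verify no missing values remain
```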

Data Preprocessing — Encoding:

Machine learning models don’t understand categorical variables; they need numerical data to perform mathematical computations.

Types of Encoding:

  1. Ordinal encoding: Categories have an inherent order.
  2. Nominal encoding: Categories don’t have any inherent order.

There are different techniques to perform these encoding methods on given data.

  1. One-Hot Encoding/Dummy Encoding: For each level of a categorical variable, we create a new numerical variable.
  2. Label Encoding: It directly converts categorical variables into numbers; it’s used when the categorical variable doesn’t have an order (it’s a nominal encoding technique).
  3. Hash Encoding: It is similar to one-hot encoding in that it converts the levels of a categorical variable into new numerical variables. The main advantage of hash encoding is that you can control how many columns the process produces: a variable that one-hot encoding would expand into many columns can be represented with fewer (or more) hashed columns, as you choose.

For this data, I’m using one-hot encoding to convert the categorical variables.
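
A sketch using pandas’ `get_dummies` (the column list is an assumption based on the dataset described above):

```python
# One new 0/1 column per category level; drop_first avoids perfectly
# redundant dummy columns.
df_encoded = pd.get_dummies(
    df,
    columns=["Gender", "Ever_Married", "Graduated", "Profession"],
    drop_first=True,
)
print(df_encoded.head())
```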

Standardizing Data:

Why Standardizing is important:

  1. Standardizing data is important because it makes sure your data is internally consistent, e.g. your data could have values with different ranges and measurement units, which can cause trouble for machine learning models.
  2. It is a very important step for models that are based on distance computation (such as K-means clustering, which we will use in our analysis): if any feature has a broad range of values, the distance will be governed by that particular feature. A minimal sketch follows this list.
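
A minimal standardization sketch with scikit-learn’s `StandardScaler`, assuming `df_encoded` is the one-hot-encoded DataFrame from the previous step:

```python
from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance so no single
# wide-ranged column dominates the K-means distance computation.
scaler = StandardScaler()
X = scaler.fit_transform(df_encoded)
```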

Modeling and Evaluation:

For this dataset, I’m assuming that we don’t have labels, which is the usual situation with real-world data. A good fit for this scenario is K-means clustering.

Working of K-means clustering:

K-means clustering is an unsupervised machine learning method used to identify clusters of data in a dataset. The algorithm randomly selects k centroids (where k is a number we provide) as the starting point for each cluster, then iteratively optimizes the positions of the centroids.
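
A minimal sketch of fitting K-means with scikit-learn, assuming `X` is the standardized feature matrix from the previous step (the choice of `n_clusters=4` anticipates the elbow analysis below):

```python
from sklearn.cluster import KMeans

# n_clusters is provisional here; the elbow method below helps choose it.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)  # cluster assignment for every customer
```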

Elbow Method:

As mentioned above, the k-means algorithm initializes k centroids randomly, so how do we decide the optimal number of clusters for a given dataset? The elbow method to the rescue. Mathematically, the elbow of the curve is the point where the curve visibly bends. The idea is that each added cluster contributes meaningful information as long as the data actually consists of at least that many groups, but once the number of clusters exceeds the actual number of groups in the data, the added information drops sharply, because we are merely subdividing the real groups. Assuming this happens, there will be an elbow in the graph.
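
A sketch of the elbow method: fit K-means for a range of k values and plot the inertia (within-cluster sum of squared distances) for each:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Record the inertia for k = 1..10.
inertias = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

# The "elbow" where the curve bends marks a good cluster count.
plt.plot(k_range, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```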

Another important concept you need to learn before using any clustering algorithm is the ‘Silhouette Score’. It is used for model evaluation.

Silhouette Score:

The silhouette score is a method for interpreting and validating consistency within clusters of data. It measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. So our aim is a high silhouette score.
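
Computing it with scikit-learn is a one-liner, assuming `X` and `labels` from the modeling step:

```python
from sklearn.metrics import silhouette_score

# Closer to +1 means tighter, better-separated clusters.
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.2f}")
```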

My score is 0.18, which is decent enough, and the elbow graph shows that the optimal number of clusters could be 4.

If the clusters are not making sense to you, you can try other techniques, such as reducing the dimensionality with PCA (Principal Component Analysis) before clustering, and compare the results. We will learn how to make sense of the clusters in the following section.

Understanding Clusters | Building a Persona:

Building a persona around each cluster is a very important step once you have performed clustering. Business people will want to know what a typical person in a particular segment looks like.

You can do some exploratory analysis to understand each cluster.
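
A sketch of such an analysis: attach the cluster labels back to the original (un-encoded) data and profile each cluster (column names are assumptions based on the dataset described earlier):

```python
# Attach cluster labels to the original DataFrame.
df["Cluster"] = labels

# Numeric profile: average age per cluster.
print(df.groupby("Cluster")["Age"].mean())

# Categorical profile: most common value per cluster.
for col in ["Gender", "Ever_Married", "Profession", "Spending_Score"]:
    print(df.groupby("Cluster")[col].agg(lambda s: s.mode().iloc[0]))
```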

Based on the mean values and distributions of the variables, I can define each cluster as follows.

Occasional Spenders: These customers are generally male, aged 38–55, married, with a graduate degree, working in an artistic profession, and their spending is about average: neither too much nor too little.

Male emerging spenders: These customers are generally male, aged 22–31, may or may not be married, hold a graduate degree, work in healthcare or as doctors, and are very frugal spenders.

Female emerging spenders: These customers are generally female, aged 35–50, unmarried, with a graduate degree, working in an artistic profession, and spend very little.

Elite Spenders: These customers can be male or female, aged 46–75, married, with a graduate degree, working as lawyers or executives, and they spend a lot of money.

Conclusion:

You now understand why segmentation is important for a business, how to achieve it with machine learning, and, finally, how to explain your algorithmic findings to business people.

Let me know in the comments section if this was helpful or if there are any other methods I can use to improve this process.
