Step By Step Customer Segmentation Analysis With Python

Rahmat Taufiq Sigit
10 min read · May 30, 2022


Machine Learning Clustering Using the K-Means Algorithm

We will analyze the customer segmentation dataset shown below:

The picture shows the structure of the dataset to be processed: there are 8 columns, consisting of ID, Sex, Marital Status, Age, Education, Income, Occupation and Settlement Size.

Next we start analyzing the dataset in Python with the following steps:

1. Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
import scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pickle

2. Access the dataset and save the result into a variable

data = pd.read_csv("data/segmentation data.csv", index_col = 0)

The code shows that the dataset to be processed is in the data folder under the name segmentation data.csv. The index_col = 0 parameter is added to make the ID column the index of the dataset.
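As a quick sanity check (a sketch, not in the original notebook), the effect of index_col = 0 can be verified right after loading:

print(data.index.name)  # 'ID': the first column became the index
print(data.shape)       # (2000, 7): 2000 rows, 7 remaining feature columns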

3. See the results of accessing the dataset with the following code:

data.head()

4. View descriptive statistics of the dataset with the command:

data.describe()

The table of descriptive statistics can be read as follows:
• The Count row shows that there are 2,000 records in the dataset.
• The Mean row shows that the mean age is 35 years and the mean income is 120,954; for columns with categorical values the mean is not meaningful.
• The Min, Max and 25%, 50% and 75% percentile rows can also be used as a reference when it is necessary to understand the distribution of the data.

5. Check the correlation of the data using the Pearson method with the following command:

data.corr()

The table shows correlations ranging from -1 to 1, where -1 means a strong negative correlation, 1 means a strong positive correlation, and 0 means the two variables are not related.
The table shows a fairly strong relationship between age and education, with a correlation of 0.65. This makes sense because older people tend to have higher education.
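As a side note, data.corr() uses the Pearson formula r = cov(x, y) / (std(x) * std(y)). A minimal manual check for the Age and Income pair (a sketch, assuming the column names shown in the dataset) could look like this:

x = data['Age'].to_numpy()
y = data['Income'].to_numpy()
# population covariance divided by the product of population standard deviations
r = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())
print(r)  # should match data.corr().loc['Age', 'Income']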

6. Show the correlations of the data at a glance using a heatmap with the following commands:

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

The commands above were already typed at the beginning, in the library-import section; their function is to import the packages needed for data visualization, matplotlib and seaborn.
Show the heatmap with the following command:

plt.figure(figsize=(12,9))
s = sns.heatmap(data.corr(), annot=True, cmap="RdBu", vmin=-1, vmax=1)
s.set_yticklabels(s.get_yticklabels(), rotation = 0, fontsize = 12)
s.set_xticklabels(s.get_xticklabels(), rotation = 90, fontsize = 12)
plt.title('Correlation Heatmap')
plt.show()

The picture shows that the bluer a cell, the stronger the positive correlation, and the redder a cell, the stronger the negative correlation.
The heatmap carries the same information as the table in the previous step, but it is easier to read.

7. Visualize the data using a scatter plot.

Now we can make sense of the correlations between age and education and between income and occupation. Next, let's draw a scatter plot for the correlation between age and income. Use the following commands:

plt.figure(figsize=(12,9))
plt.scatter(data.iloc[:,2], data.iloc[:,4])
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Scatter Plot of the Correlation Between Age and Income")

The scatter plot shows the age distribution: most customers are between 20 and 50 years old, and income varies widely at every age.

8. Data Cleaning / Data Standardization

Clustering looks for similarities in the data and assesses those similarities, so features whose scales differ too widely make the subsequent clustering difficult.
To keep the gaps from being too large, we standardize the data using the sklearn module imported in the library section:

from sklearn.preprocessing import StandardScaler

Standardize the data with the following commands:

scaler = StandardScaler()
data_std = scaler.fit_transform(data)

This command not only standardizes the data, but also saves the standardized result into a variable named data_std.
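StandardScaler applies z = (x - mean) / std to every column, so each column of data_std should now have mean close to 0 and standard deviation close to 1. A quick check (a sketch, not in the original):

print(data_std.mean(axis=0).round(6))  # approximately 0 for every column
print(data_std.std(axis=0).round(6))   # approximately 1 for every column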

9. Clustering Data with Hierarchical Clustering method

Hierarchical clustering is a grouping method with two variants: divisive and agglomerative. Top-down clustering is called the divisive method, where clustering begins with the data as a whole and then decomposes it into more specific groups. The bottom-up variant, which works from specific characteristics toward general ones, is called agglomerative.

import scipy
from scipy.cluster.hierarchy import dendrogram, linkage

Next, continue the clustering code using Ward linkage, an agglomerative method that merges the pair of clusters yielding the smallest increase in within-cluster variance:

hierarchy_cluster = linkage(data_std, method='ward')

10. Visualization on Hierarchical Clustering

plt.figure(figsize=(12,9))
plt.title("Hierarchical Clustering")
plt.xlabel("Observations")
plt.ylabel("Distance")
dendrogram(hierarchy_cluster, truncate_mode='level', p=5, show_leaf_counts=False, no_labels=True)
plt.show()

The results are as follows:

The dendrogram suggests that there are 4 clusters.
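The notebook continues with K-means, but for completeness, the 4 clusters suggested by the dendrogram could also be extracted as flat labels with scipy's fcluster (a sketch, not part of the original code):

from scipy.cluster.hierarchy import fcluster
hier_labels = fcluster(hierarchy_cluster, t=4, criterion='maxclust')  # labels 1..4
print(np.unique(hier_labels, return_counts=True))  # size of each hierarchical cluster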

11. Performing Flat Clustering using K-means Clustering

Import the required library with the following command:

from sklearn.cluster import KMeans

Next, search for the number of clusters that best fits the data using the within-cluster sum of squares (WCSS):

wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data_std)
    wcss.append(kmeans.inertia_)

The code above runs K-means ten times, with the number of clusters ranging from 1 to 10, using k-means++ initialization and a fixed random seed of 42, and stores each run's inertia in the wcss list.
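Inertia here is the within-cluster sum of squared distances from each point to its assigned centroid. A minimal manual check against the last fitted model (a sketch, not in the original):

assigned_centers = kmeans.cluster_centers_[kmeans.labels_]  # centroid of each point's cluster
manual_wcss = ((data_std - assigned_centers) ** 2).sum()
print(manual_wcss, kmeans.inertia_)  # the two values should agree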

12. Showing the WCSS results

View the results stored in the wcss variable with the following commands:

plt.figure(figsize=(12,9))
plt.plot(range(1,11), wcss, marker = 'o', linestyle = '--')
plt.ylabel('WCSS')
plt.title('K-means Clustering')
plt.show()

The results are as follows:

13. Performing Clustering with 4 Clusters

The elbow of the WCSS curve above appears at 4 clusters, so we fit K-means with n_clusters=4:

kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(data_std)

Next, back up the clustered data with the following commands:

data_kmeans = data.copy()
data_kmeans['segment'] = kmeans.labels_

In the data_kmeans variable, a new column named "segment" is created to store the cluster assigned to each row.
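A quick way (not in the original notebook) to check how many customers landed in each cluster before the per-segment analysis below:

print(data_kmeans['segment'].value_counts().sort_index())  # rows per segment 0..3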

14. Analyze the cluster results by looking at the mean values

data_kmeans_analisis = data_kmeans.groupby('segment').mean()
data_kmeans_analisis

The results are as follows:

Segment 0: Sex is 0.50, which means the numbers of men and women in this segment are almost equal; a Marital Status of 0.69 indicates that most of this segment is married; Age (55) and Education (2.1) are higher than in the other clusters; and a Settlement Size of 1.11 indicates that this cluster mostly lives in big cities. Being the oldest and most educated cluster, this segment is named the Well-Off cluster.

Segment 1: 35% are men, and almost all are unmarried, indicated by a Marital Status of 0.019. The mean age is 35 years, the education level of 0.74 is the lowest among the clusters, and the income of 97,859 is also the lowest, showing that this cluster is the poorest. This cluster also lives in small towns, judging by the Settlement Size of 0.04. For these reasons it is named the Fewer-Opportunities cluster.

Segment 2: Marital Status is 0.99 and the mean age is 28, indicating that this cluster consists mostly of young new families. This segment has a secondary education level (Education 1.0), a standard income of 105,759 and middle-management positions (Occupation 0.63), and almost 40% live in big cities. For this reason, this cluster is named Standard.

The last segment is segment 3: mostly male, with less than 20% married and a low education level (Education 0.7). The income is 141,218, Occupation is 1.27, and most of this segment lives in big cities, marked by a Settlement Size of 1.5. For this reason, this cluster is called Career-Focus.

15. Calculating the number of members of each cluster

data_kmeans_analisis['jumlah_customer'] = data_kmeans[['segment','Sex']].groupby(['segment']).count()
data_kmeans_analisis

Output:

16. Calculating each segment's proportion of the total data

data_kmeans_analisis['rata2'] = data_kmeans_analisis['jumlah_customer'] / data_kmeans_analisis['jumlah_customer'].sum()
data_kmeans_analisis

Output:

17. Rename the segments with the names defined above

data_kmeans_analisis.rename({0:'Well-Off', 1:'Fewer-Opportunities', 2:'Standard', 3:'Career-Focus'})

Output:

The Well-Off segment is the smallest at 13% (263 customers), and the largest is the Standard segment at 35% (705 customers).

18. Performing visualization

Visualize the data using the original features in the data_kmeans variable, mapping the segment numbers to the names defined above and storing them in a "Labels" column:

data_kmeans['Labels'] = data_kmeans['segment'].map({0:'Well-Off', 1:'Fewer-Opportunities', 2:'Standard', 3:'Career-Focus'})
x_axis = data_kmeans['Age']
y_axis = data_kmeans['Income']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = data_kmeans['Labels'], palette = ['g', 'r', 'c', 'm'])
plt.title('Segmentation K-means')
plt.show()

Output:

The Well-Off segment, in green, is the one most clearly separated from the other clusters. However, the other 3 segments are still difficult to distinguish, so this scatter plot is not ideal.

19. Combining K-Means with Principal Component Analysis (PCA)

Import the required libraries:

from sklearn.decomposition import PCA

and continue with the following command:

pca = PCA()
pca.fit(data_std)

20. See the explained variance ratio of the components

pca.explained_variance_ratio_

Output:

The results show that there are 7 components. Component 1 explains 35% (0.35) of the variance in the data, component 2 explains 26% (0.26), and so on.
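Since PCA() was fitted without n_components, all 7 ratios are returned and sum to 1; their cumulative sum shows how much variance a truncated PCA would keep (a quick check, not in the original):

print(pca.explained_variance_ratio_.sum())     # 1.0 across all 7 components
print(pca.explained_variance_ratio_.cumsum())  # e.g. 0.35, 0.61, ... as plotted next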

21. Plotting the cumulative explained variance

plt.figure(figsize=(12,9))
plt.plot(range(1,8), pca.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Component Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

Output:

22. Refit and display the PCA with 3 components

From the cumulative plot above, the first 3 components already capture most of the variance, so we keep 3:

pca = PCA(n_components=3)
pca.fit(data_std)
pca.components_

Output:

The result is a 3x7 array: one row of loadings per component, one column per original feature.
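As an illustration (a sketch, not in the original), the feature with the largest absolute loading on each component can be listed like this:

strongest = data.columns.values[np.abs(pca.components_).argmax(axis=1)]
print(strongest)  # one feature name per component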

23. Moving the 3x7 PCA component array into a DataFrame

data_std_pca = pd.DataFrame(data=pca.components_, columns=data.columns.values, index=['Component 1', 'Component 2', 'Component 3'])
data_std_pca

Output:

24. Display data_std_pca table in visual form

sns.heatmap(data_std_pca, vmin=-1, vmax=1, cmap='RdBu', annot=True)
plt.yticks([0, 1, 2], ['Component 1', 'Component 2', 'Component 3'], rotation = 45, fontsize = 9)

Output:

Analysis of the Component:

Component 1: age, income, occupation and settlement size are the important factors. This relates to career characteristics, so we label it Career.

Component 2: sex, marital status and education have a large influence, while all the work-related features (income, occupation and settlement size) matter little. This means Component 2 is not related to career but rather to education and lifestyle (Education & Lifestyle).

Component 3: marital status, age and occupation have fairly strong loadings. Marital status and occupation load negatively but still need to be considered. This suggests that Component 3 reflects the customer's experience, whether work experience or life experience (Experience).

25. Performing transformations on data

pca.transform(data_std)

Output:

Save the result of the transformation in the skor_pca variable:

skor_pca = pca.transform(data_std)
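Under the hood, transform subtracts the fitted mean and projects the rows onto the 3 component directions. A minimal equivalence check (a sketch, not in the original):

manual_scores = (data_std - pca.mean_) @ pca.components_.T
print(np.allclose(manual_scores, skor_pca))  # True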

26. Re-segmenting the data using K-Means and PCA

We re-segment the data using K-Means on the PCA features. The first step is to determine the number of clusters using the data in the skor_pca variable:

wcss = []
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(skor_pca)
    wcss.append(kmeans_pca.inertia_)

27. Visualize the number of clusters

plt.figure(figsize = (10,8))
plt.plot(range(1, 11), wcss, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means with PCA Clustering')
plt.show()

Output:

Based on the elbow in the plot above, the number of clusters can again be set to 4.

28. Perform K-means clustering on the PCA score data and display the results

kmeans_pca = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
kmeans_pca.fit(skor_pca)
data_pca_kmeans = pd.concat([data.reset_index(drop = True), pd.DataFrame(skor_pca)], axis = 1)
data_pca_kmeans.columns.values[-3:] = ['Component 1', 'Component 2', 'Component 3']
data_pca_kmeans['segment'] = kmeans_pca.labels_

Output:

29. Analyze the cluster results by looking at the mean values

data_pca_kmeans_freq = data_pca_kmeans.groupby(['segment']).mean()
data_pca_kmeans_freq

Output:

It can be seen from the table that segment 0 scores high on Education & Lifestyle, so it can be called the Standard segment. Segment 1 scores high on Career and can be called Career-Focus. Segment 2 scores high on Experience but low on Career and on Education & Lifestyle, so it is called Fewer Opportunities. Finally, segment 3 scores high on all components and is called Well-Off.

30. Rename segments and calculate sums and averages

data_pca_kmeans_freq['jumlah'] = data_pca_kmeans[['segment','Sex']].groupby(['segment']).count()
data_pca_kmeans_freq['rata2'] = data_pca_kmeans_freq['jumlah'] / data_pca_kmeans_freq['jumlah'].sum()
data_pca_kmeans_freq = data_pca_kmeans_freq.rename({0:'Standard', 1:'Career-Focus', 2:'Fewer-Opportunity', 3:'Well-Off'})
data_pca_kmeans_freq

Output:

31. Add the segment names in a Legend column

data_pca_kmeans['Legend'] = data_pca_kmeans['segment'].map({0:'Standard', 1:'Career-Focus', 2:'Fewer-Opportunity', 3:'Well-Off'})
data_pca_kmeans

Output:

32. Displaying the data on a scatter plot

x_axis = data_pca_kmeans['Component 2']
y_axis = data_pca_kmeans['Component 1']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = data_pca_kmeans['Legend'], palette = ['g', 'r', 'c', 'm'])
plt.title('Clusters After PCA')
plt.show()

Output:

Now the characteristics of each segment are clear. In conclusion, plotting the first 2 components is enough to see the differences between the segments.

33. Export model

Export the model with the pickle library using the following command:

import pickle

Save the model, pca and kmeans_pca with the following command:

pickle.dump(scaler, open('skalar.pickel','wb'))
pickle.dump(pca, open('pca.pickel','wb'))
pickle.dump(kmeans_pca, open('kmeans_pca.pickel','wb'))
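To illustrate how the exported artifacts could be reused, here is a hedged sketch (not in the original notebook) that loads them back and scores one new customer; the feature values below are made-up examples, in the dataset's column order (Sex, Marital status, Age, Education, Income, Occupation, Settlement size):

scaler_loaded = pickle.load(open('skalar.pickel', 'rb'))
pca_loaded = pickle.load(open('pca.pickel', 'rb'))
kmeans_loaded = pickle.load(open('kmeans_pca.pickel', 'rb'))
new_customer = [[0, 1, 30, 1, 120000, 1, 1]]  # hypothetical values
segment = kmeans_loaded.predict(pca_loaded.transform(scaler_loaded.transform(new_customer)))
print(segment)  # predicted segment number, e.g. 0 = Standard (per the mapping above)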

Source : https://github.com/miradzji/customer_segmentation_da
