Step By Step Customer Segmentation Analysis With Python
Machine Learning Classification Using the K-Means Algorithm
Analyzing the segmentation dataset below:
The picture shows the structure of the dataset to be processed: there are 8 columns, consisting of ID, Sex, Marital Status, Age, Education, Income, Occupation and Settlement Size.
Next, we start programming in Python to analyze the dataset, with the following steps:
1. Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
import scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pickle
2. Access the database and save the result into a variable
data = pd.read_csv("data/segmentation data.csv", index_col = 0)
The code shows that the dataset to be processed is in the data folder, with the name segmentation data.csv. The index_col = 0 parameter makes the ID column the index of the DataFrame.
3. See the results of accessing the dataset with the following code:
data.head()
4. View descriptive statistics of the dataset with the command:
data.describe()
The table of descriptive statistics can be read as follows:
• The Count row shows that there are 2,000 records in the dataset.
• The Mean row shows that the mean age is 35 years and the mean income is 120,954; for columns with categorical values, the mean is not meaningful.
• The Min and Max rows, and the 25%, 50% and 75% percentiles, can also be used as a reference when needed to understand the distribution of the data.
5. Check the correlation of the data using the Pearson method with the following command:
data.corr()
The picture shows correlations ranging from -1 to 1, where -1 means a strong negative correlation and 1 means a strong positive correlation; a correlation of 0 means the two variables are not related.
The table shows a fairly strong relationship between age and education, with a correlation of 0.65. This makes sense because older people tend to have more years of education.
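To see how such a pairwise correlation is obtained, the same Pearson metric that data.corr() computes for every pair of columns can be called for two specific columns. The sketch below uses a small hypothetical DataFrame (not the actual segmentation data) in which age and education rise together:

import pandas as pd

# Hypothetical toy data: older customers have more years of education
df = pd.DataFrame({
    "Age":       [22, 30, 38, 46, 54, 62],
    "Education": [0,  1,  1,  2,  2,  3],
})

# Pearson correlation between two specific columns,
# the same metric data.corr() reports for every pair
r = df["Age"].corr(df["Education"])
print(round(r, 2))  # 0.97: a strong positive correlation

A value this close to 1 is what the heatmap in the next step renders as a deep blue cell.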
6. Show correlation of data easily using heatmap with the following command:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
The commands above were already typed in the import section at the beginning; their function is to import the packages needed for data visualization, matplotlib and seaborn.
Show the heatmap with the following command:
plt.figure(figsize=(12,9))
s = sns.heatmap(data.corr(), annot=True, cmap="RdBu", vmin=-1, vmax=1)
s.set_yticklabels(s.get_yticklabels(), rotation = 0, fontsize = 12)
s.set_xticklabels(s.get_xticklabels(), rotation = 90, fontsize = 12)
plt.title('Correlation Heatmap')
plt.show()
The picture shows that the bluer a cell, the stronger the positive correlation, and the redder a cell, the stronger the negative correlation.
The image above has the same results as the table in the previous step, but the view of the data with a heat map is easier to understand.
7. Visualize the data using a scatter plot.
Now, we can logically understand the correlation between age and education, income and occupation. Next, let’s move on to using a scatter plot graph for the correlation between age and income. Use the following command:
plt.figure(figsize=(12,9))
plt.scatter(data.iloc[:,2], data.iloc[:,4])
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Scatter Plot of Correlation Between Age and Income")
The scatter plot shows the age distribution, roughly from 20 to 50 years, and how income is spread across that range.
8. Data Cleaning / Data Standardization
Clustering works by looking for similarities between data points and measuring those similarities, so a scale gap that is too large between features causes difficulties in the subsequent clustering.
To keep the gap from being too big, we standardize the data using the sklearn module imported in the library section:
from sklearn.preprocessing import StandardScaler
Standardize the data with the following command:
scaler = StandardScaler()
data_std = scaler.fit_transform(data)
The command not only standardizes the data, but also saves the standardized data into a variable named data_std.
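What fit_transform does is rescale each column to a z-score, z = (x - mean) / std, so that features measured in very different units (years of age vs. income in the hundreds of thousands) contribute comparably to distance calculations. A minimal sketch on synthetic stand-in data (the real CSV is not needed to see the effect):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the segmentation data: two columns on
# wildly different scales, like Age and Income
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(35, 10, 500),           # an "Age"-like column
    rng.normal(120_000, 40_000, 500),  # an "Income"-like column
])

scaler = StandardScaler()
data_std = scaler.fit_transform(raw)   # z = (x - mean) / std, per column

# Each column now has mean ~0 and standard deviation ~1,
# so no single feature dominates the clustering distances
print(data_std.mean(axis=0).round(6))
print(data_std.std(axis=0).round(6))

Without this step, the Income column alone would dominate every Euclidean distance the clustering algorithms compute.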
9. Clustering Data with Hierarchical Clustering method
Hierarchical clustering is a classification method with two approaches: Divisive and Agglomerative. Top-down classification is called the Divisive method, where clustering begins with the whole dataset as one cluster and then splits it into smaller clusters by their characteristics. The bottom-up method works the other way, merging from specific characteristics to general ones (Agglomerative).
import scipy
from scipy.cluster.hierarchy import dendrogram, linkage
Next, continue the classification code with the following command:
hierarchy_cluster = linkage(data_std, method='ward')
10. Visualization on Hierarchical Clustering
plt.figure(figsize=(12,9))
plt.title("Hierarchical Clustering")
plt.xlabel("Observations")
plt.ylabel("Distance")
dendrogram(hierarchy_cluster, truncate_mode='level', p=5, show_leaf_counts=False, no_labels=True)
plt.show()
The results are as follows:
The picture shows that there are 4 clusters.
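Besides reading the dendrogram by eye, scipy can cut the same Ward linkage into a chosen number of flat clusters with fcluster. A sketch on synthetic data (four well-separated blobs standing in for data_std):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic standardized data: four well-separated blobs,
# a stand-in for data_std
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
data_std = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centers])

hierarchy_cluster = linkage(data_std, method='ward')

# Cut the tree so that exactly 4 flat clusters remain
labels = fcluster(hierarchy_cluster, t=4, criterion='maxclust')
print(np.unique(labels))  # cluster ids 1..4

This gives each observation a cluster label directly from the hierarchy, which can be compared against the K-means labels produced in the next steps.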
11. Performing Flat Clustering using K-means Clustering
Import the required library with the following command:
from sklearn.cluster import KMeans
Next, look for the number of clusters that best fits the data using WCSS (within-cluster sum of squares):
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data_std)
    wcss.append(kmeans.inertia_)
The code above runs the clustering experiment 10 times, with the number of clusters ranging from 1 to 10 and a fixed random seed (random_state=42), using the k-means++ initialization, and stores each run's inertia (WCSS) in the wcss list.
12. Showing wcss result
View the results on the wcss variables using the following command:
plt.figure(figsize=(12,9))
plt.plot(range(1,11), wcss, marker = 'o', linestyle = '--')
plt.ylabel('WCSS')
plt.title('K-means Clustering')
plt.show()
The results are as follows:
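Reading the "elbow" off such a plot can also be sketched numerically: look for the first k after which adding another cluster barely reduces WCSS. The values below are hypothetical, shaped like a typical elbow curve (the real numbers depend on the dataset):

# Hypothetical WCSS values like those produced by the loop above
wcss = [14000, 9500, 6800, 4800, 4300, 3950, 3700, 3500, 3350, 3250]

# Relative improvement when going from k to k+1 clusters
drops = [(wcss[i] - wcss[i + 1]) / wcss[i] for i in range(len(wcss) - 1)]

# A simple heuristic: stop at the first k where the next
# improvement falls below, say, 15%
elbow = next(k for k, d in zip(range(1, len(wcss)), drops) if d < 0.15)
print(elbow)  # 4 for these illustrative numbers

The 15% threshold is an arbitrary illustration; in practice the elbow is usually confirmed visually, as done in the next step.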
13. Performing Clustering with the number of Clusters 4
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(data_std)
Next, backup the data that has been clustered with the following command:
data_kmeans = data.copy()
data_kmeans['segment'] = kmeans.labels_
In the data_kmeans variable, a new column named "segment" is created to store the cluster label of each row.
14. Analyze the Cluster results to see the average value
data_kmeans_analisis = data_kmeans.groupby('segment').mean()
data_kmeans_analisis
The results are as follows:
Segment 0: Sex is 0.50, meaning the numbers of men and women are almost equal; Marital Status 0.69 indicates that most of this segment are married; Age 55 and Education 2.1 are higher than in the other clusters; and Settlement Size 1.11 indicates that this cluster mostly lives in big cities. This is the oldest and most educated cluster, so this segment is named the Well-Off cluster.
Segment 1: 35% are men, and almost all are unmarried, indicated by a Marital Status of 0.019; the mean age is 35 years; the education level of 0.74 is the lowest among the clusters; and the income of 97,859 is the lowest as well, showing this is the poorest cluster. It also mostly lives in small towns, judging by the Settlement Size of 0.04; for that reason, it is named the Fewer-Opportunities cluster.
Segment 2: Marital Status is 0.99 and the mean age is 28, indicating a cluster consisting mostly of young families. This segment has a secondary education level (Education 1.0), a standard income of 105,759, middle-management positions (Occupation 0.63), and almost 40% live in big cities. For this reason, this cluster is named Standard.
The last segment, segment 3, is mostly male and less than 20% married, with a low education level (Education 0.7). The income of 141,218 and Occupation of 1.27 are high, and most of this segment lives in big cities, marked by a Settlement Size of 1.5. For this reason, this cluster is called Career-Focus.
15. Performing the calculation of the number of members from each cluster
data_kmeans_analisis['jumlah_customer'] = data_kmeans[['segment','Sex']].groupby(['segment']).count()
data_kmeans_analisis
Output:
16. Calculating the proportion of each segment relative to the total data
data_kmeans_analisis['rata2'] = data_kmeans_analisis['jumlah_customer'] / data_kmeans_analisis['jumlah_customer'].sum()
data_kmeans_analisis
Output:
17. Make changes to the predefined segment names
data_kmeans_analisis.rename({0:'Well-Off', 1:'Fewer-Opportunities', 2:'Standard', 3:'Career-Focus'})
Output:
The Well-Off segment is the smallest with 13% (263 customers), and the largest segment is Standard with 35% (705 customers).
18. Performing visualization
Visualize the data using the initial data in the data_kmeans variable, changing the segment names to the ones defined above and storing them in a "Labels" column:
data_kmeans['Labels'] = data_kmeans['segment'].map({0:'Well-Off', 1:'Fewer-Opportunities', 2:'Standard', 3:'Career-Focus'})
x_axis = data_kmeans['Age']
y_axis = data_kmeans['Income']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = data_kmeans['Labels'], palette = ['g', 'r', 'c', 'm'])
plt.title('Segmentation K-means')
plt.show()
Output:
The Well-Off segment, shown in green, is the most clearly separated from the other clusters. However, the other 3 segments are still difficult to distinguish, so the scatter plot is not ideal.
19. Combining K-Means with Principal Component Analysis (PCA)
Import the required libraries:
from sklearn.decomposition import PCA
and continue with the following command:
pca = PCA()
pca.fit(data_std)
20. See the ratio variation from the data
pca.explained_variance_ratio_
Output:
The results show 7 components. Component 1 explains 35% (0.35) of the variability in the data, component 2 explains 26% (0.26), and so on.
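A common way to choose how many components to keep is to cumulate these ratios and keep the smallest number covering roughly 80% of the variance. The ratios below are hypothetical values in the spirit of the output above (the exact numbers depend on the dataset):

import numpy as np

# Hypothetical explained-variance ratios, summing to 1.0
ratios = np.array([0.36, 0.26, 0.19, 0.08, 0.05, 0.04, 0.02])

cumulative = np.cumsum(ratios)
# Smallest number of components covering at least 80% of the variance
n_components = int(np.argmax(cumulative >= 0.80) + 1)
print(cumulative.round(2))
print(n_components)  # 3 for these illustrative ratios

This is the same reasoning the cumulative-variance plot in the next step supports visually, and it is why 3 components are kept in step 22.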
21. Finding sets of data and visualizing data
plt.figure(figsize=(12,9))
plt.plot(range(1,8), pca.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Component Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
Output:
22. Refitting PCA with 3 components and displaying them
pca = PCA(n_components=3)
pca.fit(data_std)
pca.components_
Output:
The result is a 3×7 array of component loadings.
23. Moving 3x7 array component PCA values into dataset
data_std_pca = pd.DataFrame(data=pca.components_, columns=data.columns.values, index=['Component 1', 'Component 2', 'Component 3'])
data_std_pca
Output:
24. Display data_std_pca table in visual form
sns.heatmap(data_std_pca, vmin=-1, vmax=1, cmap='RdBu', annot=True)
plt.yticks([0, 1, 2], ['Component 1', 'Component 2', 'Component 3'], rotation = 45, fontsize = 9)
Output:
Analysis of the Component:
Component 1: age, income, occupation and settlement size are the important factors. This relates to the characteristics of a Career.
Component 2: gender, marital status and education have a large influence, while the work-related features (income, occupation and settlement size) matter little. This means Component 2 is not related to career but rather to education and lifestyle (Education & Lifestyle).
Component 3: marital status, age and occupation have fairly strong loadings. Marital status and occupation load negatively but still need to be considered. This indicates that Component 3 reflects the customer's experience, whether work experience or life experience (Experience).
25. Performing transformations on data
pca.transform(data_std)
Output:
Save the result of the transformation in the skor_pca variable:
skor_pca = pca.transform(data_std)
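Under the hood, transform() is just a projection: the mean-centered data multiplied by the component vectors, giving one score per kept component. A sketch on synthetic stand-in data verifying this equivalence:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data, a stand-in for data_std (2000 x 7 in the article)
rng = np.random.default_rng(0)
data_std = rng.normal(size=(100, 7))

pca = PCA(n_components=3)
pca.fit(data_std)
skor_pca = pca.transform(data_std)

# transform() is a matrix product with the component vectors
manual = (data_std - pca.mean_) @ pca.components_.T
print(skor_pca.shape)                 # (100, 3): one score per component
print(np.allclose(skor_pca, manual))  # True

So skor_pca has one row per customer and one column per principal component, which is exactly the representation K-means will cluster in the next step.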
26. Re-segmentation the data using K-Means and PCA
Re-segment the data using K-Means on the PCA features. The first step is to determine the number of clusters using the data in the skor_pca variable:
wcss = []
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(skor_pca)
    wcss.append(kmeans_pca.inertia_)
27. Visualize the number of clusters
plt.figure(figsize = (10,8))
plt.plot(range(1, 11), wcss, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means with PCA Clustering')
plt.show()
Output:
The number of Clusters can be set to 4 Clusters according to the result above.
28. Perform K-means clustering on the PCA score data and display the results
kmeans_pca = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
kmeans_pca.fit(skor_pca)
data_pca_kmeans = pd.concat([data.reset_index(drop = True), pd.DataFrame(skor_pca)], axis = 1)
data_pca_kmeans.columns.values[-3:] = ['Component 1', 'Component 2', 'Component 3']
data_pca_kmeans['segment'] = kmeans_pca.labels_
Output:
29. Analyze the Cluster results to see the average value
data_pca_kmeans_freq = data_pca_kmeans.groupby(['segment']).mean()
data_pca_kmeans_freq
Output:
It can be seen from the table that segment 0 loads highly on Education & Lifestyle, so it can be called the Standard segment; segment 1 loads highly on Career, so it can be called Career-Focus; segment 2 loads highly on Experience but low on Career and Education & Lifestyle, so it is called Fewer-Opportunities; and finally segment 3, which loads highly on all components, is called Well-Off.
30. Rename segments and calculate sums and averages
data_pca_kmeans_freq['jumlah'] = data_pca_kmeans[['segment','Sex']].groupby(['segment']).count()
data_pca_kmeans_freq['rata2'] = data_pca_kmeans_freq['jumlah'] / data_pca_kmeans_freq['jumlah'].sum()
data_pca_kmeans_freq = data_pca_kmeans_freq.rename({0:'Standard', 1:'Career-Focus', 2:'Fewer-Opportunity', 3:'Well-Off'})
data_pca_kmeans_freq
Output:
31. Add segment name in legend column
data_pca_kmeans['Legend'] = data_pca_kmeans['segment'].map({0:'Standard', 1:'Career-Focus', 2:'Fewer-Opportunity', 3:'Well-Off'})
data_pca_kmeans
Output:
32. Displaying data on scatter plot
x_axis = data_pca_kmeans['Component 2']
y_axis = data_pca_kmeans['Component 1']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = data_pca_kmeans['Legend'], palette = ['g', 'r', 'c', 'm'])
plt.title('Clusters After PCA')
plt.show()
Output:
The characteristics of each segment are now clearly separated. In conclusion, 2 components are enough to see the difference between the segments on a scatter plot.
33. Export model
Export the model with the pickle library using the following command:
import pickle
Save the scaler, pca and kmeans_pca models with the following command:
pickle.dump(scaler, open('skalar.pickel','wb'))
pickle.dump(pca, open('pca.pickel','wb'))
pickle.dump(kmeans_pca, open('kmeans_pca.pickel','wb'))
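The point of saving these files is that they can be loaded later to score new customers without refitting anything. A minimal round-trip sketch, using a small scaler fitted on toy data as a stand-in for the objects saved above (the file name is the one used above):

import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy data as a stand-in for the objects saved above
scaler = StandardScaler().fit(np.array([[20., 100.], [40., 200.], [60., 300.]]))

# Save, then load the model back: the same pattern works for
# scaler, pca and kmeans_pca
with open('skalar.pickel', 'wb') as f:
    pickle.dump(scaler, f)
with open('skalar.pickel', 'rb') as f:
    scaler_loaded = pickle.load(f)

# The reloaded object behaves identically to the original
new_customer = np.array([[30., 150.]])
print(np.allclose(scaler.transform(new_customer),
                  scaler_loaded.transform(new_customer)))  # True

In production, a new customer row would be passed through the reloaded scaler, then pca.transform, then kmeans_pca.predict to obtain its segment.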
Source : https://github.com/miradzji/customer_segmentation_da