Customer Segmentation with PCA

Oct 8, 2019

This is a project from the Udacity Data Scientist Nanodegree. You can find the details of the project on GitHub.

We use two unsupervised learning techniques, Principal Component Analysis (PCA) and k-means clustering, to identify segments of the German population that are more or less popular with a mail-order sales company in Germany.

The Dataset

The data is provided by Bertelsmann partners AZ Direct and Arvato Financial Solution. Two datasets are provided: the first is demographic data for the general population of Germany, and the second is demographic data for customers of a mail-order company. Both datasets consist of 85 different features.

Data Wrangling

The data was quite messy and some of it was irrelevant. We need to convert and map the data into a format suitable for analysis.

First, we deal with missing values by dropping the variables and rows that have an extremely high share of missing values.
Next, we convert categorical variables to binary variables using one-hot encoding.
Then, we identify mixed-type features and engineer new features from them, because some variables encode several different pieces of information at once.
Finally, we use StandardScaler to scale each feature to mean 0 and standard deviation 1.
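The wrangling steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy table, not the project's actual code: the function name `clean_and_scale`, the 0.5 thresholds, and the column names are all hypothetical, and filling the remaining gaps with the most frequent value is an assumption (the steps above do not say how leftover missing values were handled). Re-engineering the mixed-type features is left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def clean_and_scale(df, categorical_cols, col_missing_max=0.5, row_missing_max=0.5):
    """Drop sparse columns and rows, one-hot encode, impute, and standardize."""
    # Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean(axis=0) <= col_missing_max]
    # Drop rows whose share of missing values exceeds the threshold.
    df = df.loc[df.isna().mean(axis=1) <= row_missing_max]
    # One-hot encode the categorical columns that survived the column filter.
    keep = [c for c in categorical_cols if c in df.columns]
    df = pd.get_dummies(df, columns=keep, dtype=float)
    # Assumption: fill the remaining gaps with the most frequent value per column.
    imputed = SimpleImputer(strategy="most_frequent").fit_transform(df)
    # Scale every feature to mean 0 and standard deviation 1.
    scaled = StandardScaler().fit_transform(imputed)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)


# Toy data with hypothetical column names.
demo = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 60.0, 33.0],
    "income": [1.0, 3.0, 2.0, 5.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "region": ["east", "west", "west", "east", "west"],
})
prepared = clean_and_scale(demo, categorical_cols=["region"])
print(prepared.columns.tolist())
```

In a real run, the same fitted imputer and scaler would be reused on the customer data so that both datasets end up on the same scale.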

PCA — Dimensionality Reduction

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables correlated with each other, while retaining as much of the variation present in the dataset as possible.

From the results of the PCA, we can observe how much of the variance in the data is captured along each component.

The PCA shows that the first component (the first blue bar) explains 16.2% of the variance, the highest share of any component. Each subsequent component explains the next highest percentage of variance. In total, 71 components are needed to capture all of the variance in the data. However, we will not use all 71 components to retain 100% of the variance; instead, we will use 30 components. Let's look at the plot below:

This plot tells us that by selecting 30 components we can preserve around 80% of the total variance of the data.
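To make the choice of component count concrete, here is a small sketch of how the cumulative explained variance can be computed with scikit-learn's PCA. The data is synthetic (a random stand-in for the scaled features), so the numbers it prints have nothing to do with the 30 components and 80% reported above; the 0.80 target is only used to mirror that figure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled data: 500 rows, 20 correlated features.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.5 * rng.normal(size=(500, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit with all components first to see how the variance is distributed.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.80
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(n_keep, round(float(cumulative[n_keep - 1]), 3))

# Refit with only the components we decided to keep.
X_reduced = PCA(n_components=n_keep).fit_transform(X)
print(X_reduced.shape)
```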

Interpret Principal Components

We'll look at the five largest positive and the five largest negative feature weights on the first component to interpret how the features relate to it, and what a positive or negative value indicates.

Here’s the result:

Top 5 Positive Features
CAMEO_DEUG_2015 (urban working class)
PLZ8_ANTG3 (Number of 6–10 family houses in the PLZ8 region)
CAMEO_INTL_2015_WEALTH (High positive: more likely poor)
EWDICHTE (Density of households per square kilometer)
ORTSGR_KLS9 (Size of community)
Top 5 Negative Features
FINANZ_MINIMALIST (Low financial interest)
KBA05_ANTG1 (Number of 1–2 family houses in the microcell)
MOBI_REGIO (High movement patterns)
LP_STATUS_FEIN (Low social status, fine scale)
LP_STATUS_GROB (Low social status, rough scale)

A positive weight means that the feature tends to be high when the component score is high, whereas a negative weight means that the feature tends to be low.

The result suggests that the first component describes urban people with lower financial interest and income.
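For readers who want to reproduce this kind of interpretation, the sketch below shows one way to pull the largest positive and negative weights out of a fitted scikit-learn PCA. The helper `top_weights` and the feature names are hypothetical, and the data is synthetic, built so that three features move together (one of them in the opposite direction).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def top_weights(pca, feature_names, component=0, n=5):
    """Return the n most positive and n most negative weights of a component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    weights = weights.sort_values(ascending=False)
    return weights.head(n), weights.tail(n)


rng = np.random.default_rng(1)
names = [f"feature_{i}" for i in range(12)]  # hypothetical feature names
X = rng.normal(size=(300, 12))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)   # moves with feature_0
X[:, 2] = -X[:, 0] + 0.1 * rng.normal(size=300)  # moves against feature_0
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=3).fit(X)
positive, negative = top_weights(pca, names, component=0, n=3)
print(positive)
print(negative)
```

Note that the overall sign of a principal component is arbitrary, so what matters is which features share a sign and which have opposite signs.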

K-Means Analysis

We apply k-means clustering to the dataset and use the average within-cluster distance from each point to its assigned cluster's centroid to decide on the number of clusters to keep.

This plot shows the k-means score associated with each number of clusters. We chose 14 clusters because beyond 14 the score no longer decreases as much as before.

Then, we compared the relative proportion of each cluster between the general demographics data and the customer data.

As we can see, clusters 1, 3 and 10 are significantly over-represented, whereas clusters 2, 5 and 8 are significantly underrepresented.
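A rough sketch of this kind of "elbow" search is shown below. It uses scikit-learn's KMeans on synthetic data with three obvious blobs, so the elbow lands at 3 rather than 14. The helper `mean_centroid_distance` is a hypothetical name, and the exact score used in the project may differ (for example, it could be based on squared distances).

```python
import numpy as np
from sklearn.cluster import KMeans


def mean_centroid_distance(X, k, seed=0):
    """Fit k-means and return the average distance from points to their centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    assigned = km.cluster_centers_[km.labels_]
    return float(np.linalg.norm(X - assigned, axis=1).mean())


rng = np.random.default_rng(2)
# Synthetic stand-in for the PCA-transformed data: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

scores = {k: mean_centroid_distance(X, k) for k in range(1, 7)}
for k, score in scores.items():
    print(k, round(score, 3))
```

The number of clusters is then picked where the curve flattens, that is, where adding another cluster stops reducing the score by much.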
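The comparison itself is simple arithmetic: compute the share of rows in each cluster for both datasets and subtract. Here is a minimal sketch with made-up labels; in practice the labels would come from the same fitted k-means model predicting on the general population and on the customers. The helper name `cluster_shares` is hypothetical.

```python
import numpy as np


def cluster_shares(labels, n_clusters):
    """Share of rows assigned to each cluster (the shares sum to 1)."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()


# Made-up cluster labels for illustration only.
general_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
customer_labels = np.array([0, 0, 0, 0, 0, 1, 2, 2])

general = cluster_shares(general_labels, 3)
customers = cluster_shares(customer_labels, 3)
# Positive difference: the cluster is over-represented among customers.
diff = customers - general

for k in range(3):
    print(k, round(general[k], 3), round(customers[k], 3), round(diff[k], 3))
```

Working with shares rather than raw counts matters here because the two datasets have very different sizes.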

For cluster 1 (over-represented):

Estimated age around 30-60 years
High for dutiful
High for traditional
Higher for sparer and investor
Low for ‘be prepared’
Low for sensual minded

For cluster 2 (underrepresented):

Estimated age between 20 and 30 years
High for financial minimalist
Living in big cities
More households in community
Low interest in money saving
Low household net income
Low wealth

Cluster 1 is an over-represented segment, so its members seem more attracted to purchasing through mail-order sales. It also seems reasonable that the older generation is more likely than younger people to shop in this traditional way.