Customer Clustering Analysis and Prediction Study

Can Yeşilkaya
Jul 13, 2020


This is a project done for Udacity’s Data Science Nanodegree program. Data from a mail-order company in Germany was provided by Udacity’s partners at Bertelsmann Arvato Analytics. The datasets provided are:

  • Customers data
  • Germany population data with the same features as the customers data
  • Mailout training data. This data shares the common features with the customers data, but has an additional label indicating whether the person receiving the mail became a customer or not.
  • Mailout test data. This is unlabeled data to be predicted.

Goals of The Study

The goals of the study can be summarized in three items:

  1. Cleaning the data (the hardest part)
  2. Running an unsupervised ML analysis (clustering) and analysing the output to see how the customers and the general population differ.
  3. Training a supervised ML model on the Mailout training data and predicting potential customers in the Mailout test data.

Note that all of the following transformations are fitted on the population data only; the fitted transformations are then applied to the customers data.

Cleaning the Data

The encodings in the data are checked against the provided encoding list, and any values that do not appear in the list are noted. These values are then re-encoded as unknown.

All values encoded as unknown are then converted to np.NaN so that they can be processed further.
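As a rough sketch of this step, the unknown codes can be mapped to np.NaN with pandas. The file path and the code mapping below are placeholders; in the project they come from the attribute description files shipped with the data.

import numpy as np
import pandas as pd

population_df = pd.read_csv('population.csv')  # placeholder path for the population data

# Hypothetical feature -> "unknown" codes mapping; the real one is built
# from the attribute description files.
unknown_codes = {
    'HH_EINKOMMEN_SCORE': [-1, 0],
    'PLZ8_BAUMAX': [-1],
}

def decode_unknowns(df, unknown_codes):
    """Replace each feature's 'unknown' codes with np.NaN."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df

population_df = decode_unknowns(population_df, unknown_codes)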

The features are checked for the ratio of unknown values they contain. Features exceeding a given threshold are removed from the data. Below is the plot of the ratio of unknowns per feature.

Distribution of missing value ratio in features
Features with missing values above 20%

In my case I used 20% as the threshold for removing features. The features that were above this threshold are shown in the second plot.
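Continuing the sketch above (the customers file path is again a placeholder), the filtering can look like this:

import pandas as pd

customers_df = pd.read_csv('customers.csv')  # placeholder path for the customers data

# Ratio of missing values per feature, computed on the population data
missing_ratio = population_df.isna().mean()

# Keep only the features at or below the 20% threshold,
# and apply the same selection to the customers data
keep_cols = missing_ratio[missing_ratio <= 0.20].index
population_df = population_df[keep_cols]
customers_df = customers_df[keep_cols]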

Looking further into the data, if we were to remove the rows containing unknown values, we would lose 31% of the data. For this reason the unknowns are imputed instead: the most frequent value is used for qualitative features and the median for quantitative features.
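A sketch of the imputation, assuming the qualitative and quantitative columns have already been identified (the column lists below are placeholders):

from sklearn.impute import SimpleImputer

# Placeholder column split; in the project it follows the attribute descriptions
qualitative_cols = ['PLZ8_BAUMAX', 'FINANZ_ANLEGER']
quantitative_cols = ['HH_EINKOMMEN_SCORE', 'KBA13_KW_0_60']

mode_imputer = SimpleImputer(strategy='most_frequent')
median_imputer = SimpleImputer(strategy='median')

population_df[qualitative_cols] = mode_imputer.fit_transform(population_df[qualitative_cols])
population_df[quantitative_cols] = median_imputer.fit_transform(population_df[quantitative_cols])

# Imputers fitted on the population data are reused on the customers data
customers_df[qualitative_cols] = mode_imputer.transform(customers_df[qualitative_cols])
customers_df[quantitative_cols] = median_imputer.transform(customers_df[quantitative_cols])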

The data is then encoded with a one-hot encoder. However, this encoding step is applied only to the ordinal features in the data.

As a last step, the data is scaled using standardization.
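These last two steps can be sketched as follows (the list of columns to encode is a placeholder; as with the imputers, the scaler is fitted on the population data only):

import pandas as pd
from sklearn.preprocessing import StandardScaler

encode_cols = ['PLZ8_BAUMAX']  # placeholder list of features to one-hot encode

population_df = pd.get_dummies(population_df, columns=encode_cols)
customers_df = pd.get_dummies(customers_df, columns=encode_cols)
# Align the customers columns with the population columns
customers_df = customers_df.reindex(columns=population_df.columns, fill_value=0)

scaler = StandardScaler()
population_scaled = scaler.fit_transform(population_df)
customers_scaled = scaler.transform(customers_df)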

Clustering and Demographic Analysis

To obtain better results, dimensionality reduction was required. Initially the cleaned data had 308 features; with PCA this was reduced to 120.

Number of PCA components against explained variance in the data

To choose the number of components, the explained variance was plotted against the number of PCA components. It was observed that 120 components explain about 90% of the variance in the data, which was accepted for the rest of the analysis.
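A sketch of how the number of components can be chosen from the cumulative explained variance (the 90% cutoff is the one used above):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA()  # first fit with all components
pca.fit(population_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.90)) + 1  # ~120 in this study

pca = PCA(n_components=n_components)
population_pca = pca.fit_transform(population_scaled)
customers_pca = pca.transform(customers_scaled)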

With PCA trained to keep only 120 components, the analysis continued with clustering. Here the MiniBatchKMeans algorithm from the scikit-learn library was used due to its reduced hardware requirements. The number of clusters was chosen using the elbow method; this required running the clustering for 1 to 30 clusters and plotting the sum of squared errors.

Sum of squared errors for each trained clustering model (top); rate of change in score as the number of clusters increases (bottom)

Looking at the bottom plot, it was noticed that after 15 clusters the rate of change in the cluster score is very low and stable. Therefore 15 was selected as the number of clusters.
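A sketch of the elbow search and the final model (the random_state is my own choice):

from sklearn.cluster import MiniBatchKMeans

sse = []
for k in range(1, 31):
    model = MiniBatchKMeans(n_clusters=k, random_state=42)
    model.fit(population_pca)
    sse.append(model.inertia_)  # sum of squared distances to the closest centroid

# 15 clusters sit at the elbow, so the final model uses 15
kmeans = MiniBatchKMeans(n_clusters=15, random_state=42)
population_clusters = kmeans.fit_predict(population_pca)
customers_clusters = kmeans.predict(customers_pca)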

After clustering, the clusters are compared to one another, and the clusters that represent the customers and the population respectively are selected. In this case these were cluster 0 for the customers data and cluster 3 for the population data.

The top 10 features that differ most between the defining clusters of the population and the customers are plotted below.

Top 10 features that are different between customers and population data

By investigating these features in more depth, I have come to the conclusions listed below:

  • HH_EINKOMMEN_SCORE (estimated household net income): Customers are dominated by the very high income level, while the population is dominated by the very low income level.
  • PLZ8_BAUMAX (most common building type within the PLZ8): Customers live mainly in 1–2 family homes, while there is no obviously dominant level in the population data.
  • FINANZ_UNAUFFAELLIGER (financial typology: unremarkable): Customers are defined as mainly very low, and the population is right-skewed towards very high.
  • KBA13_KW_0_60 (share of cars up to 60 KW engine power — PLZ8): Customers are defined as average, while the population is defined as high.
  • FINANZ_ANLEGER (financial typology: investor): Customers are dominated by the very low level, while there is no obviously dominant level for the population.
  • KBA05_ANTG3 (number of 6–10 family houses in the cell): Both customers and population are defined by no 6–10 family homes. However, the customers are almost strictly at this level, while the population is more distributed among the levels.
  • INNENSTADT (distance to the city centre): The customers are almost all at a distance of 10–20 km to the city centre, while the population is right-skewed with a mode at 3 km.
  • ORTSGR_KLS9 (size of the community): Almost all customers live in communities of 20,001 to 50,000 inhabitants, while the population is left-skewed across the levels with a mode at 100,001 to 300,000 inhabitants.
  • KBA13_BJ_2000 (share of cars built between 2000 and 2003 within the PLZ8): The customers are strictly at the average level, while the population is left-skewed across all levels with a mode at high.
  • KBA13_KMH_250 (share of cars with max speed between 210 and 250 km/h within the PLZ8): The customers are strictly at the average level, while the population is mainly distributed among low, very low and average.

Supervised Learning to Predict Customers

In this section a supervised learning model is trained using the Mailout training data.

Distribution of labels in training and testing data sets

The data is first cleaned (without scaling) and split into training and test sets with an 80%–20% ratio. The preprocessing then goes through an extra step, because the data shows a strong imbalance towards one of the target labels: the training data is resampled so that both target labels have equal counts, which removes the imbalance in the training data.
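A minimal sketch of the split and the balancing step, assuming the cleaned Mailout training data sits in a DataFrame with a RESPONSE label column (the names and path are placeholders; the balancing is plain undersampling of the majority class):

import pandas as pd
from sklearn.model_selection import train_test_split

mailout_train = pd.read_csv('mailout_train.csv')  # placeholder path
X = mailout_train.drop(columns='RESPONSE')
y = mailout_train['RESPONSE']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Undersample the dominant label (0) down to the size of label 1
train = pd.concat([X_train, y_train], axis=1)
minority = train[train['RESPONSE'] == 1]
majority = train[train['RESPONSE'] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle

X_train_bal = balanced.drop(columns='RESPONSE')
y_train_bal = balanced['RESPONSE']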

The next step is the selection of a model to train. A pipeline is created for each of the 4 models to be tested: KNN, RandomForestClassifier, GradientBoostingClassifier and GaussianNB (naive Bayes). The pipelines are set up as follows, with a sketch after the list:

  • Standardization
  • PCA
  • classifier
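A sketch of one such pipeline, here with the KNN variant (the step names and the n_components value are my placeholders; the other three classifiers slot into the 'clf' step the same way):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),    # standardization
    ('pca', PCA(n_components=100)),  # feature reduction
    ('clf', KNeighborsClassifier()), # swap in any of the four classifiers
])

pipeline.fit(X_train_bal, y_train_bal)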

While RandomForestClassifier and GradientBoostingClassifier are decision-tree-based ensemble classifiers, GaussianNB and KNN are not. To keep the models comparable to one another, the number of neighbours for KNN and the n_estimators parameter for RandomForestClassifier and GradientBoostingClassifier are tuned with grid search.

The models are evaluated using the ROC AUC metric, because this metric works with the final prediction labels. Another advantage is that it is more sensitive on imbalanced data, which is the case for our test set.

The resulting scores of the tested models are as follows:

Model  Training Score  Test Score
KNN    0.54            0.53
RF     0.54            0.53
GB     0.53            0.51
NB     0.52            0.51

The results above show that KNN and RF give higher ROC AUC scores than the other models on both the training and test sets. KNN is selected for further optimization because it is faster to train.

With the final form of the training data, grid search is used to train and improve the model. Below is the search space I used when searching for the optimal model:

param_grid = {
    'pca__n_components': [100, 150, 200],
    'clf__leaf_size': list(range(20, 101, 10)),
    'clf__n_neighbors': list(range(2, 30, 3)),
    'clf__p': [1, 2, 3],
}
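A sketch of how this search space can be wired up with scikit-learn's GridSearchCV (the scoring and cross-validation settings are my assumptions):

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    pipeline,           # the StandardScaler -> PCA -> KNN pipeline from above
    param_grid,
    scoring='roc_auc',  # assumed to match the evaluation metric used here
    cv=5,
    n_jobs=-1,
)
search.fit(X_train_bal, y_train_bal)
print(search.best_params_)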

The grid search converged on the following parameters:

'clf__leaf_size': 20, 'clf__n_neighbors': 5, 'clf__p': 1, 'pca__n_components': 15

Validation of the Supervised Learning Model

The trained supervised learning model is validated on the test set using the ROC AUC score and the confusion matrix.

Training Score: 0.55
Test Score: 0.51
Confusion matrix output using the test set
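A sketch of this validation step, reusing the fitted grid search from above (as elsewhere in the article, the ROC AUC is computed on the final predicted labels):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = search.best_estimator_.predict(X_test)
print('Test ROC AUC:', roc_auc_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))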

It can be seen that grid search improved the model’s training score but decreased its test score. These changes are about 1–2%, which is fairly negligible. The scores obtained at this point are quite poor, even though grid search performed an exhaustive search over the given hyperparameter ranges.

The trained model is also used to predict the labels of the Mailout test data, and the predictions are then submitted to the Kaggle competition. The score obtained from Kaggle was 0.49118.

Conclusion and Further Work

The population and customers data provided are cleaned thoroughly. During the cleaning process, features are filtered out according to the amount of missing data they contain and our knowledge of the features. The data is then scaled, and PCA is applied to reduce the number of features. A clustering model is trained on the population data and used to predict clusters for the customers data. The two cluster outputs are then used for a demographic comparison between the datasets: the clusters representing the population and the customers are compared by investigating the features that diverge most between the two.

A supervised learning model is then created using the Mailout train data. Several models are compared to one another, and KNN is chosen as the most suitable. Grid search is used to optimize the hyperparameters of the KNN model in order to improve the predictions. The ROC AUC score is selected for evaluating the model because it relies on the final label outputs of the model. However, it was found that the tuning done by grid search did not improve the model any further. The model is then used to predict on the Mailout test dataset, and the results are submitted to a Kaggle competition. The resulting score from Kaggle was 0.49118.

As for future work, the evaluation scores of the tuned supervised model did not meet expectations. This raises the question of how other models would have performed after proper tuning. During model selection only basic hyperparameters were tuned; to compare the models fully, all of their hyperparameters could be tuned so that the full potential of each model is seen.

Also, in order to eliminate the imbalance in the training data, the data was trimmed so that the dominant label was reduced to the size of the other. However, this removal of bias comes at the significant cost of losing data with label 0. To mitigate this problem, the customers data could be included in the training data to increase its size significantly.

My work can be found in the GitHub repo below:

https://github.com/yesilkayacan/capstone_arvato/tree/master
