Customer Segmentation Project

Alexandre Rosseto Lemos
Published in Geek Culture
Sep 20, 2021

Introduction

As part of my final project in Udacity's Data Scientist Nanodegree program, I chose to tackle the Bertelsmann/Arvato Project.

The idea behind this project is to deal with a real-life problem that Arvato's Data Science group has to handle: the search for new customers.

The main idea is to find similarities between people who are currently customers and people who are not, and then use this information to find groups of potential new customers (people who are not currently customers but are highly similar to people who are).

This project was divided into three parts:

  • Part One: Data Analysis and Data Cleaning
  • Part Two: Customer Segmentation
  • Part Three: Supervised Learning Model

Objectives

This project has two main goals:

The first is to analyze demographic data for customers of a mail-order sales company in Germany, comparing it against demographic information for the general population. The purpose is to find similar characteristics in both groups, signaling good candidates, among the general population, for a marketing campaign.

The second is to develop a machine learning model that can classify new samples as good or bad candidates for a marketing campaign using each individual's demographic information.

Part One: Data Analysis


The first step in any Data Science project is to analyse the data that will be used.

For this project, four datasets were made available with the following characteristics:

  1. azdias: Demographics data for the general population of Germany. 891,211 persons (rows) x 366 features (columns).
  2. customers: Demographics data for customers of a mail-order company. 191,652 persons (rows) x 369 features (columns).
  3. mailout_train: Demographics data for individuals who were targets of a marketing campaign. 42,982 persons (rows) x 367 features (columns).
  4. mailout_test: Demographics data for individuals who were targets of a marketing campaign. 42,833 persons (rows) x 366 features (columns).

Two Excel files containing information about the data were also used:

  1. DIAS Information Levels — Attributes 2017: Information about the features present in the datasets.
  2. DIAS Attributes — Values 2017: Information about the values and what they represent in each feature of the dataset.

Since little background information about the data is provided outside the Excel files, they were heavily used to guide which features to keep.

Data Exploration

With the datasets loaded, it’s time to explore the data provided.

The observations made were:

  • There were no duplicated samples in the datasets.
  • The data contained null values that required attention.
  • The datasets had more columns than the Excel metadata files described. Only the documented columns were kept, since there was no other way to identify what the undocumented columns represented.
  • Some of the columns present in the Excel files were absent from the datasets and were removed from consideration.
  • The removed columns were mostly transactional data.

After this first process, approximately 100 columns were removed.
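A minimal sketch of this filtering step, using hypothetical column names in place of the real attributes, would keep only the columns listed in the DIAS attributes sheet:

```python
import pandas as pd

# Hypothetical stand-ins for the azdias dataset and the DIAS attributes sheet.
azdias = pd.DataFrame({
    "AGER_TYP": [1, 2, 3],
    "ALTERSKATEGORIE_GROB": [2, 3, 1],
    "UNKNOWN_COL": [0, 1, 0],   # present in the data, absent from the metadata
})
attributes = pd.DataFrame({"Attribute": ["AGER_TYP", "ALTERSKATEGORIE_GROB", "CAMEO_DEU_2015"]})

# Keep only the columns that are documented in the Excel metadata.
documented = set(attributes["Attribute"])
azdias = azdias[[c for c in azdias.columns if c in documented]]
kept = azdias.columns.tolist()  # UNKNOWN_COL is dropped
```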

Data Cleaning


In this step, the null values detected previously are treated.

As seen in the Data Exploration, many columns contained null values, some in large numbers. The approach used was to remove the samples (rows) containing nulls from the dataset.

Since none of the columns had more than 30% null values, I didn't remove any column (except the ones already removed in the exploration step).
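This cleaning logic can be sketched with pandas on a toy DataFrame (the 30% threshold comes from the text; the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "feature_b": [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of nulls per column: drop any column above the 30% threshold.
null_frac = df.isnull().mean()
df = df.drop(columns=null_frac[null_frac > 0.30].index)

# Then drop the remaining rows that still contain null values.
df = df.dropna()
```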

Part Two: Customer Segmentation

Once the data analysis and cleaning are done, it's time to find which people among the German population present in the azdias dataset have characteristics similar to the people in the customers dataset.

To calculate this similarity between different people, the unsupervised K-means algorithm was used.

K-means algorithm

The K-means algorithm clusters data by trying to separate samples in k groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.

The K-means algorithm was applied to each group of features (the groups were defined in the Excel files), and the clusters that had the most proportional data distribution (between the azdias and customers datasets) were chosen as good clusters for finding similar characteristics between the two datasets.
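A sketch of this procedure with scikit-learn, using random data in place of one real feature group (the cluster count and data are assumptions): fit K-means on the general population, assign both datasets, and compare the per-cluster shares.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
azdias_group = rng.normal(size=(300, 4))     # stand-in for one feature group
customers_group = rng.normal(size=(100, 4))

# Fit K-means on the general population, then assign both datasets.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(azdias_group)
azdias_labels = km.labels_
customers_labels = km.predict(customers_group)

# Share of each dataset falling into every cluster; clusters where the two
# shares are similar are "good" clusters (customers resemble the population).
azdias_share = np.bincount(azdias_labels, minlength=5) / len(azdias_labels)
cust_share = np.bincount(customers_labels, minlength=5) / len(customers_labels)
```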

Example of a result obtained: cluster 9 is a good cluster, while clusters 0 and 5 are bad ones.

The groups of features were:

  • Person
  • Household
  • Building
  • Microcell (RR4_ID)
  • Microcell (RR3_ID)
  • Postcode
  • RR1_ID
  • PLZ8
  • Community

Then, the boxplots of all the features contained in each group were plotted to see how the good clusters differed from the bad ones.

Results

After this procedure was done with all the groups, I came up with the following characteristics of the good candidates to become customers:

The portion of the German population that most resembles the customers sample tends to:

  • be more passive elderly or cultural elderly;
  • be 46 years old or older;
  • have high financial interest;
  • be avid money savers;
  • be from multiperson households;
  • be from multi-generational households;
  • have a high income;
  • have had the economic miracle or the milk bar/individualisation as the dominating movement in their youth;
  • be very religious;
  • not be very sensual minded;
  • be very rational;
  • be more dutiful traditional minded;
  • be more traditional;
  • have a more gourmet and versatile consumption type;
  • have very low transaction activity in the last 12 and 24 months;
  • have a higher share of cars with more than 2499 ccm;
  • have a lower share of female car owners;
  • have a higher share of top German manufacturers (Mercedes, BMW);
  • have a lower share of small cars (relative to the county average);
  • have a higher share of upper class cars (relative to the county average);
  • have a lower share of cars with less than 59 kW of engine power;
  • have a higher share of cars with engine power above 119 kW;
  • have Top-German or VW-Audi as the most common car manufacturer in the microcell;
  • have a higher share of upper class cars (in an AZ specific definition);
  • have bigger-sized engines in the microcell;
  • have a lower share of small and very small cars (Ford Fiesta, Ford Ka etc.) in the microcell;
  • have a higher share of upper middle class and upper class cars (BMW 5er, BMW 7er etc.);
  • have a higher share of upper class cars (BMW 7er etc.) in the microcell;
  • have a higher share of roadsters and convertibles in the microcell.

With this information, the team coordinating the marketing campaign can narrow down the list of people they intend to send advertisements to, optimizing results and reducing costs.

Part Three: Supervised Learning Model

For this final part, a new dataset is provided containing a label column that indicates whether the person became a customer after the marketing campaign. This new dataset is highly unbalanced, with 42,430 samples belonging to the class of people that didn't respond to the campaign and 532 belonging to the class that did.

Class distribution

Metrics

It is important to point out that, because of the imbalance between the classes, using the model's accuracy as an evaluation metric is not a good idea.

That's because if, for instance, the model predicts every sample as belonging to the majority class (in this case class 0), its accuracy will be 98.7%, which looks like a really good value.

Accuracy = (correct predictions) / (total predictions)

This high accuracy doesn't mean the model is performing well, because it misses every sample belonging to class 1. To evaluate the model in this project, the ROC AUC score will be used as the performance metric, since it is built on the False Positive Rate and True Positive Rate, which are better suited for problems with unbalanced data.
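The pitfall can be reproduced with scikit-learn, using the class counts from the mailout data and a majority-class predictor:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Labels mimicking the campaign data imbalance (42,430 negatives, 532 positives).
y_true = np.array([0] * 42430 + [1] * 532)

# A "model" that predicts the majority class for every sample.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # ~0.987, looks great
auc = roc_auc_score(y_true, y_pred)        # 0.5, no better than chance
```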

The development of the model is presented as follows:

  1. Analysis of the dataset. Training and testing using all features and using only the selected features found in the Customer Segmentation part. A Logistic Regression classifier will be used to determine which one is best. The Stratified K-Fold cross validation algorithm will also be used to help deal with the unbalanced data.
  2. Sampling redistribution techniques. Resampling techniques (Random Undersampling and Random Oversampling) will be used to deal with the unbalanced data. The same classifier will be trained and tested again with the resampled data, and the results will be compared with the ones previously obtained.
  3. Definition of the best model. Several machine learning models will be tested to find the one that delivers the best result.
  4. Hyperparameter tuning. The hyperparameters of the best model will be tuned to find the combination that optimizes the results.
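Step 1 could be sketched like this with scikit-learn, using synthetic imbalanced data as a stand-in for the mailout dataset (sample counts and features are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the mailout data (~5% positives).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)

# Stratified folds keep the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
```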

Dataset Analysis Results

Comparing the ROC AUC scores obtained using both datasets, it is clear that using all the available features was better than using only the selected ones. This may be because a lot of information is being lost when excluding the columns.

Dataset comparison result

Sampling Redistribution Results

Using Random Undersampling followed by Random Oversampling, the Logistic Regression obtained an average ROC AUC score of 0.69. Comparing this result with the one obtained in the previous step shows that the resampling techniques improved the overall results. From now on, the resampled dataset will be used for further analysis.
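A minimal sketch of the combined under/oversampling, implemented here with plain NumPy rather than the imbalanced-learn classes (RandomUnderSampler, RandomOverSampler) the technique names refer to; the sampling ratios and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 3))

majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)

# Random undersampling: keep a random subset of the majority class.
keep_majority = rng.choice(majority, size=len(minority) * 4, replace=False)

# Random oversampling: duplicate minority samples with replacement.
extra_minority = rng.choice(minority, size=len(minority) * 3, replace=True)

# Combined resampled dataset: 200 samples of each class.
idx = np.concatenate([keep_majority, minority, extra_minority])
X_res, y_res = X[idx], y[idx]
```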

Results comparison

Definition of the Best Model

The next step in the model development part is to find the best model to work with. There are a lot of possible models to choose from, and in this analysis only a few were tested to see which one performed best. The models tested were:

  • Multi-Layer Perceptron (scikit-learn)
  • Logistic Regression
  • K-Nearest Neighbors
  • AdaBoost

For this part of the project, all the models above were initialized with their default parameters. I used the Repeated Stratified K-Fold method to apply cross validation to each model.

The data used to evaluate each model was the data with the resampling techniques applied. The scores obtained on each fold were saved, and the overall score for each model was obtained by averaging them.
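The comparison loop could look roughly like this (the MLP is omitted for brevity; the dataset and fold counts are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)

# Candidate models with default parameters.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "adaboost": AdaBoostClassifier(random_state=0),
}

# Mean ROC AUC across all folds and repeats, per model.
results = {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
           for name, m in models.items()}
```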

The results obtained for each model are shown in the table below.

From the table, the model with the best performance was the K-Nearest Neighbors model, so it will be the one used in the next step.

Hyperparameter Tuning

The final step in the model development part is to find the best hyperparameters for the model chosen. To do this, the Bayes Search algorithm will be used to find the best values for the hyperparameters of the model.

For this part, the parameters tuned were:

  • n_neighbors: Number of neighbors (values ranging from 3 to 21)
  • weights: Weight function used in prediction (uniform and distance were tested)
  • leaf_size: Leaf size passed to BallTree or KDTree (values ranging from 20 to 100)
  • p: Power parameter for the Minkowski metric (values ranging from 2 to 20)

The best values for the hyperparameters were:

leaf_size = 20, n_neighbors = 3, weights = 'distance'
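As a self-contained illustration of the search, the sketch below uses scikit-learn's GridSearchCV over the same parameters instead of the Bayesian search (scikit-optimize's BayesSearchCV) the post describes; the grid values are a small subset of the ranges above, and p is left at its default:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Subset of the search space described in the text.
param_grid = {
    "n_neighbors": [3, 11, 21],
    "weights": ["uniform", "distance"],
    "leaf_size": [20, 60, 100],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=StratifiedKFold(n_splits=3),
                      scoring="roc_auc")
search.fit(X, y)
best = search.best_params_
```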

Model Evaluation and Validation

With the new parameters, the model was evaluated again using the same cross validation as before, now obtaining a ROC AUC score of 0.951.

Calculating the standard deviation of all the scores obtained shows that the model is stable and that the results of each fold don't vary much.

Scores standard deviation = 0.007875

After the best hyperparameters were found, the model was trained using all the available data and then saved using the pickle library. Now the model is ready to be used to predict new data.
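Saving and reloading the trained model with pickle can be sketched as follows (an in-memory buffer is used here; in practice you would write to a file on disk):

```python
import io
import pickle

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=3, weights="distance",
                             leaf_size=20).fit(X, y)

# Persist the trained model so it can predict new data later.
buf = io.BytesIO()
pickle.dump(model, buf)

# Restore it and check it behaves identically.
buf.seek(0)
restored = pickle.load(buf)
assert (restored.predict(X) == model.predict(X)).all()
```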

Conclusion


This project was very interesting to do!

This was an example of a real-life project developed in the customer acquisition sector of a company. All the comments made were my actual thoughts, and the steps and analyses are similar to the ones I would take if I worked at this company.

In the Customer Segmentation part of this project, several characteristics were detected that could help narrow down the possible new customers and optimize the marketing campaigns, reducing costs while improving the overall results.

In the Supervised Machine Learning Model part, a model was developed while dealing with some real-life problems like missing data and unbalanced class distribution. Several analyses were made to improve the final results, like finding the best dataset, testing resampling techniques and different machine learning models, and optimizing the final model's hyperparameters to achieve the best result.

Some improvements could be made here: different machine learning models could be tested to see if one performs better, and custom ensemble models could be built from other simpler models. Also, feature selection and dimensionality reduction techniques (like PCA) could be tested to see if they positively affect the results. These are just some of the tests that could be made to see if the results improve.

Justification

The solutions found in this project are deemed adequate.

For the Customer Segmentation part of the project, several similar characteristics were found between the two datasets analysed, and these could be used to guide the marketing team.

For the Supervised Machine Learning Model part, several machine learning models were tested, different datasets were analysed, and the best model obtained was then tuned, improving the overall results. In the end, the validation step showed that the model was able to correctly assign the labels for most of the test samples.

It is important to say that a Data Science project rarely follows a straight-line path from beginning to end. Usually what happens is similar to a PDCA loop: you make a plan of action (Plan), you execute the plan (Do), you check the results obtained and look for possible improvements (Check), and finally you execute these improvements (Act). Then the cycle goes on until a satisfying result is obtained.

Thank you so much for reading!

I hope I was able to clearly show an example of a Data Science project, step by step, with all the analyses and thoughts.

Any comments and suggestions are more than welcome!

The full code is available on my GitHub page.


Feel free to reach me on my LinkedIn page.

