Customer Segmentation Report for Arvato Financial Services


1. Project Overview

1.1 Introduction

This is a capstone project for the Udacity data science nanodegree program.

In this project, I analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. I use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, I use a supervised model to predict which individuals are most likely to become customers of the company.

1.2 Data sets

The data is provided by Bertelsmann Arvato Analytics and represents a real-life data science task. There are four data files associated with this project:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

There are also two Excel spreadsheets, providing more information about the columns depicted in the data files.

  • DIAS Information Levels — Attributes 2017.xlsx is a top-level list of attributes and descriptions, organized by the informational category.
  • DIAS Attributes — Values 2017.xlsx is a detailed mapping of data values for each feature in alphabetical order.

1.3 Problem and Approach

There are four parts in this project:

1. Get to know the data

In this part, I will explore the data and then process it, handling missing values, converting data types, imputing missing entries, and scaling features. The cleaned data will be used in the subsequent analyses.

2. Customer segmentation report

In this part, I will compare the demographics data for customers against the information for the general population, to identify the core customer base of the company. I will use unsupervised learning techniques (k-means) to perform customer segmentation. Principal component analysis (PCA) will be used to reduce dimensions.

3. Supervised learning model

Here, I will use supervised learning methods to predict which individuals are most likely to become customers of the company. I will compare four different models and optimize the best one using GridSearchCV.

4. Kaggle competition

The predictions will be submitted to the Kaggle competition.

1.4 Metrics

I will use the area under the receiver operating characteristic curve (ROC_AUC) for model selection. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at all possible thresholds; the ideal curve lies close to the top left corner. The area under the ROC curve (AUC) summarizes the curve in a single number, which makes it convenient for selecting the best model. I use ROC_AUC because this is a classification problem with imbalanced classes, and ROC_AUC is often much more meaningful than accuracy for this kind of problem.
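
For reference, the metric can be computed with scikit-learn. The snippet below is a minimal illustration with toy labels and scores, not data from the project:

from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75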

2. Analysis, Methodology and Results

2.1 Data processing

I first explored the four data files associated with this project (AZDIAS, CUSTOMERS, MAILOUT_TRAIN, and MAILOUT_TEST; see Section 1.2 for their descriptions). The figures below show the first few rows of the AZDIAS and CUSTOMERS data sets.

The first few rows of the AZDIAS data set

The first few rows of the CUSTOMERS data set

Each row of the demographics files represents a single person, including information about their household, building, and neighborhood. I will use the information from the first two files to figure out how customers (“CUSTOMERS”) are similar to or differ from the general population at large (“AZDIAS”), then make predictions on the other two files (“MAILOUT”), predicting which recipients are most likely to become a customer for the mail-order company.

2.1.1 Process missing data

Some encoded attribute values mean “unknown” or “no information available”, so I need to convert them to NaNs before processing the missing data. I created a data frame in which one column is the attribute name and another column lists the values that indicate unknown or missing information. Based on this data frame, I converted the corresponding encoded values in the azdias dataset to NaNs.

Attributes and Values that mean unknown
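
A minimal sketch of this conversion. The mapping data frame is assumed to be called unknowns, with columns attribute and unknown_values (these names are mine); it is built from DIAS Attributes — Values 2017.xlsx:

import numpy as np

def replace_unknowns_with_nan(df, unknowns):
    # unknowns: one row per attribute, with a comma-separated string of codes meaning "unknown"
    for _, row in unknowns.iterrows():
        attribute = row['attribute']
        if attribute not in df.columns:
            continue
        codes = [int(v) for v in str(row['unknown_values']).split(',')]  # e.g. "-1, 9" -> [-1, 9]
        df[attribute] = df[attribute].replace(codes, np.nan)
    return df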

Then I studied the proportion of missing values in each column and row. The following figure illustrates the distribution of missing values per column. According to this figure, the proportions of missing values in most of the columns are less than 0.3. So, I dropped the columns with the proportion of missing values greater than 0.3.

The following figure illustrates the distribution of missing values per row. According to this figure, the proportions of missing values in most of the rows are less than 0.1. I dropped the rows with the proportion of missing values greater than 0.1.
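
A sketch of the thresholding step, assuming azdias is the loaded AZDIAS data frame:

# Proportion of missing values per column; drop columns with more than 30% missing
col_missing = azdias.isnull().mean()
azdias = azdias.loc[:, col_missing <= 0.3]

# Proportion of missing values per row; drop rows with more than 10% missing
row_missing = azdias.isnull().mean(axis=1)
azdias = azdias.loc[row_missing <= 0.1]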

2.1.2 Process data type

There are 6 columns whose data type is object. We need to convert these before transforming the data.

The approach is to re-encode ‘X’ in CAMEO_DEUG_2015 and ‘XX’ in CAMEO_INTL_2015 as NaNs, convert both columns to float, re-encode ‘W’ and ‘O’ in OST_WEST_KZ as 1 and 0, and drop the columns CAMEO_DEU_2015, D19_LETZTER_KAUF_BRANCHE, and EINGEFUEGT_AM.
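
A sketch of this re-encoding, assuming azdias is the data frame after the missing-value cleaning and numpy is imported as np:

# Re-encode the placeholder codes as NaN and convert to numeric
azdias['CAMEO_DEUG_2015'] = azdias['CAMEO_DEUG_2015'].replace('X', np.nan).astype(float)
azdias['CAMEO_INTL_2015'] = azdias['CAMEO_INTL_2015'].replace('XX', np.nan).astype(float)
# Binary re-encoding of West/East Germany
azdias['OST_WEST_KZ'] = azdias['OST_WEST_KZ'].replace({'W': 1, 'O': 0})
# Drop the remaining object columns
azdias = azdias.drop(columns=['CAMEO_DEU_2015', 'D19_LETZTER_KAUF_BRANCHE', 'EINGEFUEGT_AM'])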

After this process, the azdias dataset has 737,241 rows and 322 columns.

2.1.3 Cleaning Customer dataset

I wrapped the previous cleaning steps into functions and used them to clean the Customer dataset. I also dropped the extra columns in the Customer dataset (i.e., ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’, ‘PRODUCT_GROUP’). After the cleaning, the Customer dataset has 134,245 rows and 322 columns.

2.1.4 Data imputation and feature scaling

The data are imputed and then scaled before applying unsupervised learning techniques. Missing values are imputed with the column mean, and StandardScaler is used for feature scaling. The following figure shows the azdias dataset after data imputation and feature scaling.

azdias dataset after data imputation and feature scaling
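
A sketch of the imputation and scaling step, assuming azdias is the cleaned data frame:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fill missing values with the column mean, then standardize each feature
imputer = SimpleImputer(strategy='mean')
azdias_imputed = imputer.fit_transform(azdias)

scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_imputed)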

2.2 Customer Segmentation Report

2.2.1 PCA

Since there are 322 columns in the datasets, I used Principal Component Analysis (PCA) to reduce the dimensionality. I plotted the cumulative explained variance against the number of components, as shown below. After around 200 components, the gain in cumulative explained variance becomes less significant, so I chose to retain 200 components.
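
A sketch of the PCA step, assuming azdias_scaled is the imputed and scaled array from the previous step:

from sklearn.decomposition import PCA

# Fit PCA, keep 200 components, and transform the data
pca = PCA(n_components=200)
azdias_pca_200_transf = pca.fit_transform(azdias_scaled)

# Cumulative variance explained by the 200 retained components
print(pca.explained_variance_ratio_.cumsum()[-1])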

2.2.2 Clustering

I used the k-means clustering method on the PCA-transformed data. Before applying it, I needed to find a suitable number of clusters, so I plotted how the k-means score changes with the number of clusters. The plot shows that the score drops rapidly at first and then levels off after about 9 clusters, so I selected 9 clusters for the analysis.
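
A sketch of this search, assuming azdias_pca_200_transf is the PCA-transformed data (the exact range of cluster counts tried is my assumption):

from sklearn.cluster import KMeans

# Fit k-means for a range of cluster counts and record the score for the elbow plot
scores = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(azdias_pca_200_transf)
    scores.append(-kmeans.score(azdias_pca_200_transf))  # negative score = sum of squared distances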

2.2.3 K-means

I used the k-means method for unsupervised learning. The model is fitted on the cleaned azdias dataset and then used to predict cluster assignments for both the azdias and customers datasets.

from sklearn.cluster import KMeans

# Fit 9-cluster k-means on the PCA-transformed general population data
azdias_kmeans = KMeans(9)
azdias_model = azdias_kmeans.fit(azdias_pca_200_transf)
# Assign clusters to the general population and to the customers
azdias_predicted = azdias_model.predict(azdias_pca_200_transf)
customers_predicted = azdias_model.predict(customers_pca_200_transf)

Then I compared the proportion of each cluster in the azdias and customers datasets (a minimal sketch of this comparison is shown below). Clusters 8 and 5 are the most over-represented among customers compared to the general population, while clusters 3 and 4 are the most under-represented. The next step is to find the most important attributes in those clusters.
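
A minimal sketch of the proportion comparison, assuming pandas is imported as pd and using the cluster assignments from the code above:

# Share of each cluster in the general population and among customers
azdias_props = pd.Series(azdias_predicted).value_counts(normalize=True).sort_index()
customers_props = pd.Series(customers_predicted).value_counts(normalize=True).sort_index()

# Positive differences mean the cluster is over-represented among customers
print((customers_props - azdias_props).sort_values(ascending=False))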

To find the top attributes in a cluster, I defined two functions: get_top_component(model, n) finds the top component in cluster n, and get_top_attributes(component_num, top_num) returns the top_num highest-weighted attributes of a given component.
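
A rough sketch of how these two helpers could look; this is my reconstruction, assuming pca is the fitted PCA object, azdias_model is the fitted k-means model, and azdias is the cleaned data frame with the original column names:

import numpy as np
import pandas as pd

def get_top_component(model, n):
    # Index of the PCA component with the largest absolute weight in cluster n's centroid
    return int(np.argmax(np.abs(model.cluster_centers_[n])))

def get_top_attributes(component_num, top_num):
    # The top_num original attributes with the largest absolute loadings on the component
    loadings = pd.Series(pca.components_[component_num], index=azdias.columns)
    return loadings.abs().sort_values(ascending=False).head(top_num)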

I found that the top 10 attributes of the most over-represented cluster mainly concern the share of middle- and upper-class cars. This indicates that people who own middle- or upper-class cars are more likely to become customers.

Similarly, the top 10 attributes of the second most over-represented cluster mainly concern financial status and age.

The top 10 attributes in the two most under-represented clusters are the same; they mainly concern the number of family houses, household income, and the density of inhabitants.

In summary, the major attributes of the over-represented population are the share of middle- and upper-class cars, financial status, and age. The major attributes of the under-represented population are the number of family houses, household income, and the density of inhabitants.

2.3 Supervised Learning Model

Now I will build a supervised learning model to predict whether or not an individual will become a customer. The “MAILOUT” data has been split into two approximately equal parts, each with almost 43 000 data rows. Each of the rows in the “MAILOUT” data files represents an individual that was targeted for a mailout campaign. I will verify the model with the “TRAIN” partition, which includes a column, “RESPONSE”, that states whether or not a person became a customer. Then I will create predictions on the “TEST” partition, where the “RESPONSE” column has been withheld.

The training data is cleaned using the previously defined cleaning functions, followed by imputation and feature scaling. I tested four different methods to find the best classifier: Logistic Regression, Random Forest Classifier, AdaBoostClassifier, and Gradient Boosting Classifier. I used 5-fold cross-validation, and ROC_AUC was used as the score to evaluate performance since this is a problem with very imbalanced classes. The classifier function is defined below:

from sklearn.model_selection import GridSearchCV

def classifier(estimator, param_grid, X=X, y=y):
    # 5-fold grid search, scored by ROC AUC
    grid = GridSearchCV(estimator=estimator, param_grid=param_grid, scoring='roc_auc', cv=5)
    grid.fit(X, y)
    print('Estimator:', grid.best_estimator_)
    print('Score:', grid.best_score_)
    return grid.best_estimator_
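
As a usage sketch (the exact instantiations are my assumption; an empty parameter grid simply evaluates each model with its default settings):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Compare the four candidate models with 5-fold cross-validated ROC_AUC
for estimator in [LogisticRegression(max_iter=1000),
                  RandomForestClassifier(random_state=42),
                  AdaBoostClassifier(random_state=42),
                  GradientBoostingClassifier(random_state=42)]:
    classifier(estimator, param_grid={})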

The results show ROC_AUC scores of 0.68, 0.52, 0.76, and 0.78 for Logistic Regression, Random Forest Classifier, AdaBoostClassifier, and Gradient Boosting Classifier, respectively. A higher score indicates a better model, one that achieves higher recall while keeping the false positive rate low. So I selected the Gradient Boosting Classifier as the estimator and optimized its parameters using GridSearchCV, testing different learning rates and numbers of estimators:

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [10, 100, 200]
}
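
A usage sketch of the tuning step with the classifier helper defined above (the variable name best_gbc is mine):

from sklearn.ensemble import GradientBoostingClassifier

# Grid search over learning_rate and n_estimators; returns the best fitted estimator
best_gbc = classifier(GradientBoostingClassifier(random_state=42), param_grid)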

The optimized parameters are learning_rate = 0.1 and n_estimators=100, and the score is 0.78. This model is used as the final estimator.

The final estimator is:

GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto', random_state=42,
subsample=1.0, tol=0.0001, validation_fraction=0.1,
verbose=0, warm_start=False)

I checked the top 10 most important features in the final model. D19_SOZIALES is the most important one, but there is no description available for it. Other important features include the number of cars, the share of Ford cars, the year the building was built, the number of professional title holders, etc. This finding largely agrees with the results of the customer segmentation study.

Top 10 most important features
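
A sketch of how the importances can be extracted, assuming best_gbc is the fitted final model and X is the training data frame with the original column names:

import pandas as pd

# Pair each feature with its importance in the fitted model and show the top 10
importances = pd.Series(best_gbc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))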

2.4 Kaggle Competition

Now it is time to test the model on the TEST dataset and submit the predictions to the Kaggle competition. The competition entry is a CSV file with two columns. The first column is “LNR”, which acts as an ID number for each individual in the “TEST” partition. The second column, “RESPONSE”, gives the predicted probability that each individual becomes a customer.
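
A sketch of building the submission file, assuming mailout_test is the cleaned, imputed, and scaled TEST data, mailout_test_lnr holds the LNR column set aside before cleaning, and best_gbc is the tuned model (these variable names are mine):

import pandas as pd

# Predicted probability of becoming a customer for each individual in the TEST partition
submission = pd.DataFrame({
    'LNR': mailout_test_lnr,
    'RESPONSE': best_gbc.predict_proba(mailout_test)[:, 1],
})
submission.to_csv('submission.csv', index=False)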

I submitted it to Kaggle, and the final score is 0.68.

3. Conclusions

In this project, I analyzed demographics data for the customers of a mail-order sales company in Germany, and there are some interesting findings.

  • I used unsupervised learning techniques (k-means) to perform customer segmentation. It turns out that the major features of the over-represented population are the share of middle- and upper-class cars, financial status, and age. On the other hand, the major features of the under-represented population are the number of family houses, household income, and the density of inhabitants.
  • I compared four different supervised learning methods to predict which individuals are most likely to become customers of the company. The Gradient Boosting Classifier gave a better result than Logistic Regression, Random Forest Classifier, and AdaBoostClassifier. The most important features include the number of cars, the share of Ford cars, the year the building was built, the number of professional title holders, etc. The optimized model was submitted to Kaggle, and the score is 0.68.

The project could be improved in several aspects. For data processing, MinMaxScaler could be tried instead of StandardScaler, and the data could be imputed using different strategies. For unsupervised learning, exploring more attributes from more components and clusters would give a better understanding of the customer segmentation. For supervised learning, the model could be further optimized by testing more parameters.

The main findings and code are also available in my GitHub repo (https://github.com/tyuion/Customer-Segmentation).
