Customer Segmentation Analysis and Campaign Response Prediction
Introduction
Customer segmentation plays an important role in making business decisions. Businesses need to know who their customers are. Having this information allows businesses to target individuals who are likely to become customers while also retaining current customers. Targeted campaigns save money and have a high likelihood of success.
Project Overview and Problem Statement
In this project, I analyzed demographic data for the general population of Germany and for customers of a German mail-order company. This data was provided by Bertelsmann Arvato Analytics and cannot be made public. The purpose of this analysis was customer segmentation, so that individuals in the general population with a high probability of becoming customers can be targeted for a marketing campaign. Another goal of this project was to build a machine learning model to predict the response to a mailout campaign, i.e., whether an individual will convert to a customer or not.
Data Cleaning/Pre-Processing
Four datasets were provided as csv files, along with two additional files that contained information about some of the features:
- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns)
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns)
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 367 features (columns)
- Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 features (columns)
- DIAS Attributes — Values 2017.xlsx: Feature information
- DIAS Information Levels — Attributes 2017.xlsx: Feature information
The DIAS Attributes — Values 2017.xlsx file listed some of the features and their possible values. I took the following steps to clean and standardize the data:
- Updated feature values to NaN in the AZDIAS dataset when they fell outside the possible values defined in the DIAS Attributes file. A significant number of features had a high number of NaNs after this update.
- Deleted all features in the azdias dataset that contained 50% or more NaNs, as these features would not have helped much in analysis or modeling.
- Deleted all rows from the azdias dataset where 100 or more features contained NaN.
- Dropped features that had a correlation of 0.7 or higher with other features.
- Used one-hot encoding to convert categorical features to numerical features.
- Used an imputer to fill missing/NaN values. I used the median as it is less susceptible to outliers than the mean.
- Used sklearn's StandardScaler to scale the azdias dataset. Scaling gives every feature an equal footing; otherwise, features with larger values would play a more important role than features with smaller values.
- After pre-processing, the azdias dataset had 785,410 rows and 332 features (columns).
- The same pre-processing was done for the customers dataset. A rough sketch of these steps is shown below.
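Below is a minimal sketch of this pipeline, not the exact project code. It assumes azdias is the raw DataFrame and valid_values is a dictionary (built from the DIAS Attributes file) mapping each documented feature to its allowed codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def clean(df, valid_values):
    df = df.copy()
    # 1. Replace codes outside the documented value range with NaN
    for col, allowed in valid_values.items():
        if col in df.columns:
            df.loc[~df[col].isin(allowed), col] = np.nan
    # 2. Drop features with 50% or more missing values
    df = df.loc[:, df.isnull().mean() < 0.5]
    # 3. Drop rows where 100 or more features are missing
    df = df[df.isnull().sum(axis=1) < 100]
    # 4. Drop one feature from each pair with an absolute correlation >= 0.7
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    df = df.drop(columns=[c for c in upper.columns if (upper[c] >= 0.7).any()])
    # 5. One-hot encode the remaining categorical features
    df = pd.get_dummies(df)
    return df

azdias_clean = clean(azdias, valid_values)

# 6. Impute with the median (less sensitive to outliers than the mean), then scale
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(imputer.fit_transform(azdias_clean))
```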
Data Analysis and Customer Segmentation
I had 332 features in the azdias dataset and needed a way to reduce dimensionality. I chose Principal Component Analysis (PCA), as it reduces the number of dimensions without losing much of the variability inherent in the data. I chose 200 components, which captured a little over 80% of the variability; 200 is still large, but much smaller than 332 nonetheless.
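A short sketch of this step, assuming azdias_scaled is the scaled array produced by the pre-processing above:

```python
from sklearn.decomposition import PCA

# Keep 200 components, which in this project captured a little over 80% of the variance
pca = PCA(n_components=200, random_state=42)
azdias_pca = pca.fit_transform(azdias_scaled)

print(pca.explained_variance_ratio_.sum())  # roughly 0.8
```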
Here are the top 20 features of some of the principal components:
Now I could run clustering to find clusters of the population where individuals are similar to one another within a cluster and different from individuals in other clusters. I chose MiniBatchKMeans for clustering to speed up processing. MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
I chose 12 clusters, as there was no significant improvement with more clusters and the error actually appears to rise again around 17.
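The cluster count came from an elbow-style look at the inertia (within-cluster sum of squared distances) over a range of k values. A rough sketch, assuming azdias_pca from the PCA step:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# Fit MiniBatchKMeans for a range of cluster counts and record the inertia
inertias = []
ks = range(2, 21)
for k in ks:
    km = MiniBatchKMeans(n_clusters=k, batch_size=1024, random_state=42)
    km.fit(azdias_pca)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()

# Final model with the chosen 12 clusters
kmeans = MiniBatchKMeans(n_clusters=12, batch_size=1024, random_state=42)
azdias_labels = kmeans.fit_predict(azdias_pca)
```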
Now we can map the customers onto these population clusters and see which clusters have a higher percentage of customers, where the customer-to-population ratio is higher, and so on.
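A sketch of that mapping, assuming customers_scaled is the customers data passed through the same fitted imputer and scaler, with columns aligned to azdias_clean:

```python
import pandas as pd

# Transform customers with the PCA fitted on the general population,
# then assign each customer to the nearest population cluster
customers_pca = pca.transform(customers_scaled)
customer_labels = kmeans.predict(customers_pca)

population_counts = pd.Series(azdias_labels).value_counts().sort_index()
customer_counts = pd.Series(customer_labels).value_counts().sort_index()

summary = pd.DataFrame({'population': population_counts, 'customers': customer_counts})
summary['pct_of_all_customers'] = 100 * summary['customers'] / summary['customers'].sum()
summary['customer_to_population_ratio'] = summary['customers'] / summary['population']
print(summary.sort_values('customer_to_population_ratio', ascending=False))
```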
We can see that cluster 3 has the highest percentage of total customers. Clusters 1, 7, 9, and 10 are some of the other clusters with significant customer concentration.
Cluster 9 has the highest customer-to-population ratio, at roughly 40%. I would recommend targeting the population in clusters 3, 7, 9, and 10. At the same time, it would be advisable to investigate why the customer-to-population ratio is so low in some of the clusters, such as 4, 6, and 8. Perhaps a special offer might entice those people to become customers.
One of the challenging aspects of using principal components as features is that it is hard to interpret the results in terms of the original features, since principal components are transformed features. We know cluster 9 is the one we need to target, but what makes cluster 9 so different from the others that its individuals become customers? I will try to explain using the weights the clustering algorithm assigned to these principal components.
These are the top 4 principal components by absolute weight in cluster 9
I took the weights of the original features in these principal components from PCA and sorted them by absolute weight to get a list of important features. It is probably not the most accurate approach, but it still gives us a list of important features in a cluster. These were the important features in cluster 9:
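A rough sketch of that approach; the helper below is hypothetical, and azdias_clean.columns is assumed to hold the feature names of the pre-processed data:

```python
import numpy as np
import pandas as pd

def top_features_for_cluster(kmeans, pca, feature_names, cluster, n_components=4, n_features=10):
    """List the highest-weighted original features for a cluster's top principal components."""
    center = kmeans.cluster_centers_[cluster]
    # Principal components with the largest absolute weight in this cluster center
    top_pcs = np.argsort(np.abs(center))[::-1][:n_components]
    result = {}
    for pc in top_pcs:
        # pca.components_[pc] holds each original feature's weight in this component
        weights = pd.Series(pca.components_[pc], index=feature_names)
        order = weights.abs().sort_values(ascending=False).index
        result[f'PC{pc}'] = weights.loc[order][:n_features]
    return result

for pc, weights in top_features_for_cluster(kmeans, pca, azdias_clean.columns, cluster=9).items():
    print(pc, weights, sep='\n')
```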
Let’s look at some of these feature values in cluster 9
We can see from the plots above that most customers in cluster 9 are older than 60 years. Their financial preparedness appears to be low, and they have a high affinity for a rational mindset. We can use similar techniques for the other clusters as well.
Supervised Learning Model
This part of the project involves building a machine learning model that can be used to predict the outcome of a mailout campaign for individuals, based on the same set of features used earlier.
Arvato provided two sets of files:
- One for training the machine learning model, which included the response variable, i.e., whether an individual converted to a customer or not. This file had 42,982 persons (rows) x 367 columns.
- A second file with 42,833 rows and 366 columns. We did not have the response variable for this dataset; it was to be used for generating predictions for the Kaggle competition.
Training Data Analysis
The training dataset had the same features as the dataset used for customer segmentation, so I used the same pre-processing steps: cleaning, imputing, and scaling.
Our training dataset was highly imbalanced.
There was a small number of positive outcomes (success) and a very large number of negative outcomes (people who were targeted by the mailing campaign but did not become customers, i.e., failure). Any model would do well predicting a negative outcome, since we have so many of those to train on; the real challenge is to predict a positive outcome. There are a couple of ways to work with such imbalanced data. We could under- or over-sample to get a balanced dataset. However, we will instead use AUROC (Area Under the Receiver Operating Characteristic curve) as the evaluation metric.
Most classification models calculate a probability for a positive or negative outcome. The threshold is 0.5 by default, so if the calculated probability is more than 0.5 a positive outcome is predicted; otherwise a negative outcome is predicted. The ROC curve allows us to visualize model performance at varying thresholds, and a higher area under the curve translates to better performance. Business knowledge is usually needed to choose a threshold; it generally depends on what is more costly to the business: losing a potential customer or losing money on someone who has a low probability of becoming a customer.
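As a small illustration of evaluating with probabilities rather than hard 0/1 predictions, where model, X_val, and y_val stand in for a fitted classifier and a held-out validation split:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Probability of the positive class for each validation example
proba = model.predict_proba(X_val)[:, 1]

print('AUROC:', roc_auc_score(y_val, proba))

# fpr/tpr pairs across all candidate thresholds, useful for plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_val, proba)
```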
My opinion in this case is that losing a potential customer is more expensive than losing money on a mailing to someone who is unlikely to become a customer. In other words, we would prefer false positives over false negatives.
I considered the following classifiers for my model:
- RandomForestClassifier
- LogisticRegression
- AdaBoostClassifier
- GradientBoostingClassifier
I used GridSearchCV with roc_auc as my performance metric and 3-fold cross-validation. I created a function and called it for each of my classifiers with a set of hyperparameters.
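A minimal sketch of that helper; the hyperparameter grids below are illustrative rather than the exact ones used, and X_train/y_train are assumed to be the pre-processed training features and response:

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def tune(clf, param_grid, X, y):
    """Grid search with ROC AUC scoring and 3-fold cross-validation."""
    grid = GridSearchCV(clf, param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
    grid.fit(X, y)
    print(type(clf).__name__, grid.best_score_, grid.best_params_)
    return grid.best_estimator_

best_ada = tune(AdaBoostClassifier(random_state=42),
                {'n_estimators': [100, 200], 'learning_rate': [0.5, 1.0]},
                X_train, y_train)
best_gb = tune(GradientBoostingClassifier(random_state=42),
               {'n_estimators': [100, 200], 'learning_rate': [0.05, 0.1]},
               X_train, y_train)
```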
AdaBoostClassifier and GradientBoostingClassifier both scored close to 0.76; however, AdaBoostClassifier was a bit faster, so I chose it for prediction.
Kaggle Submission
I applied the same pre-processing steps to the test data and then used my model to predict the outcome. My submission scored 0.79123 on the leaderboard; the top score was 0.81063 at the time of my submission.
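A sketch of how the submission file can be generated; test_scaled, test_ids, and the column names 'LNR' and 'RESPONSE' are assumptions about the expected format:

```python
import pandas as pd

# Probability of a positive response for each individual in the test set
test_proba = best_ada.predict_proba(test_scaled)[:, 1]

# Hypothetical column names; use whatever the competition actually expects
submission = pd.DataFrame({'LNR': test_ids, 'RESPONSE': test_proba})
submission.to_csv('submission.csv', index=False)
```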
Conclusion & Next Steps
Data pre-processing and analysis was the most time-consuming part of this project; I now believe the saying that data scientists spend 80% of their time on data munging and analysis. Working with features named in a different language complicated things a bit, but I believe it did me a favor: I could not rely on hunches, so my decisions were not biased and were guided completely by the data.
I was able to use PCA and a variant of KMeans clustering to identify clusters with a high customer-to-population ratio. These clusters could be targeted for a marketing campaign, as they have a high probability of success. We also attempted to identify some of the important original features in these clusters.
For the machine learning portion of the project, I tried RandomForestClassifier, LogisticRegression, AdaBoostClassifier, and GradientBoostingClassifier with GridSearchCV and 3-fold cross-validation. AdaBoostClassifier and GradientBoostingClassifier scored almost the same, but AdaBoostClassifier was faster, so I chose it for prediction on the test data.
I believe I got a decent score, but I also know there is room for improvement. Model stacking could be used to improve the score. GridSearchCV could be run over a wider range of hyperparameters, which would take significantly longer but may produce better results. Some feature engineering could also be attempted to improve the score.
Acknowledgement
This was an interesting project, and I would like to thank Udacity and Arvato for making this dataset available.