Creating a Prediction Model for Customer Acquisition

Hritam Dutta
Published in The Startup · Sep 17, 2020

Given a large real-life data set containing information on the general population and an existing customer base, a major challenge is to target the most promising section of the population for customer acquisition. In this article, a data set containing personal, household, and building information on a representative German population and the existing customer base of a mail-order company is analyzed [1]. The goal is to create a model that predicts whether a person is likely to respond to a customer acquisition campaign.


The aim of the project is to create models for answering the following questions:

  1. What are the demographic features of a typical mail order customer?
  2. What models can be used for identifying a typical customer based on the features provided in a data set?
  3. Which models perform the best?

The strategy to answer the above questions is

  1. to study the data and the nature of the problem: Is it a binary classification problem? Is it an imbalanced data set? What metric should be chosen? Is there a lot of missing data, and what would be the best strategy to handle it? Are the features continuous or categorical?
  2. to draw inferences from two big datasets without labeled responses (i.e. the general population and the customers of the mail-order company) using unsupervised machine learning algorithms (principal component analysis for dimensionality reduction, followed by cluster analysis). The inferences can identify clusters where customers are over-represented or under-represented, hinting at the important features for identifying a potential customer.
  3. to create a prediction model from the labeled training and test datasets of a mail-order campaign, using a popular newer machine learning algorithm (XGBoost) as well as a classical, proven-in-use algorithm (gradient boosted machine).

Data Exploration and Wrangling

The data provided in several spreadsheets contains information on

  • almost a million persons (891,221) in Germany. This large dataset may take a long time to load on a machine with limited memory.
  • almost 200,000 customers of a mail-order company.
  • Each dataset had at least 366 features (e.g. age group, home data, car ownership information, and others).
  • In all the datasets (general population, customers, campaign targets), many features had missing or unknown data. Therefore, the datasets had to be updated to replace the various missing or unknown codes (e.g. ‘9’, ‘X’, ‘XX’) with a uniform code (NaN). The plot after initial data wrangling shows the features within the customer group with a high percentage of unknown and missing data.
  • As an imputation strategy could introduce significant false assumptions, rows with more than 13 missing feature values and columns with more than 30% missing values were removed from the customer data set (a sketch of this cleaning step follows the list). This data wrangling left the customer dataset with 356 instead of 366 features and 125,912 instead of 191,652 data points.
  • Thereafter, the customer and general population datasets had 34% and 39% fewer data points, respectively. Due to the sheer size of the dataset, the general population was further sampled down to 50% of the data to keep the execution time of the learning algorithms reasonable.
  • Furthermore, the following figure shows that the customer dataset now only has columns with at most 14% missing data. Similarly, the general population now has columns with at most 19% missing data. Apart from 4 features in each dataset, at most 2.5% of the data is missing for most features.
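The cleaning step mentioned above can be sketched roughly as follows. This is a minimal sketch: the unknown-code mapping and the customers DataFrame are placeholders, and the real code-to-NaN mapping comes from the attribute description file.

import numpy as np

# Hypothetical mapping of columns to their 'unknown' codes (placeholder values).
unknown_codes = {'CAMEO_DEUG_2015': ['X', 'XX', -1], 'AGER_TYP': [-1, 9]}

def clean_dataframe(df, unknown_codes, max_row_missing=13, max_col_missing=0.3):
    """Replace coded unknowns with NaN, then drop sparse columns and rows."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    # Drop columns with more than 30% missing values.
    df = df.loc[:, df.isna().mean() <= max_col_missing]
    # Drop rows with more than 13 missing feature values.
    df = df[df.isna().sum(axis=1) <= max_row_missing]
    return df

# customers_clean = clean_dataframe(customers, unknown_codes)  # 'customers' is the raw spreadsheet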

The data preprocessing includes the following steps:

  1. Imputation: Different strategies for filling in the missing values were tried out (e.g. most frequent, median, mean, fill with zero). The experiments later during supervised learning showed that filling missing values with zero produced the best results with XGBoost; any other imputation strategy would be a strong assumption.
  2. Encoding: The categorical features need to be identified, since they should be encoded as dummy variables before applying the learning algorithms. Because the dataset contains many categorical variables and PCA is designed for continuous variables, one needs to identify the categorical variables that are not ordinal in nature in order to obtain meaningful results. This information is available in ‘DIAS Attributes — Values 2017.xlsx’, which describes the features. There are 41 categorical features, but they are very often ordinal in nature; 18 categorical variables (e.g. ‘FINANZTYP’) were selected after studying this file. It must be noted that many of the features were not described, and these were not classified as categorical variables.
  3. Scaling: The standard scaler is used. After encoding and scaling, we have 441 features instead of 355. A sketch of the resulting preprocessing pipeline is shown below.
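A rough sketch of the preprocessing pipeline using scikit-learn follows. The column lists are placeholders; the actual 18 non-ordinal categorical columns and the remaining numeric/ordinal columns come from the attribute description file.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ['FINANZTYP']   # placeholder for the 18 non-ordinal categorical features
numeric_cols = ['ANZ_PERSONEN']    # placeholder for the remaining ordinal/continuous features

numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value=0)),  # fill missing values with zero
    ('scale', StandardScaler())])

categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value=0)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])           # dummy variables

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols)])

# X = preprocessor.fit_transform(customers_clean)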

Unsupervised Learning: What are the demographic features of a typical mail order customer?

Unsupervised learning is used for customer segmentation, i.e. to find the relationship between the existing customer base and the general population of Germany. This is done by:

  • reducing dimensionality using principal component analysis (PCA) to ease the interpretation of the data by creating new, uncorrelated variables (principal components). In PCA, we can select the number of components according to how much variability needs to be preserved; for example, with 280 components we can explain 96% of the variance.
  • interpreting the components by looking into the weights of the principal component features. In the first principal component, the number of family houses in the PLZ8 region and mobility are very important features (e.g. PLZ8_ANTG3 with a positive weight and MOBI_REGIO with a negative weight). In the second principal component, financial character is important (e.g. FINANZ_SPARER). In the third principal component, the type of car is important (e.g. KBA13_HERST_BMW_BENZ with a positive weight and KBA13_SITZE_5 with a negative weight).
  • Clustering: In this step, we use k-means clustering to partition the general population dataset into k clusters. A major task is to find the optimal number of clusters; different values of k were tried out and, using the elbow method, k=8 was chosen (see the sketch after this list).
  • comparing the customer and general population distributions using the clusters obtained from k-means clustering: after fitting and transforming the customer data with the PCA model used for the general population, the following figure shows that persons in clusters 3 and 7 have a higher proclivity to become customers of the mail-order company. Therefore, a targeted campaign could address only the population falling in these clusters. The persons in clusters 1 and 5 are the most under-represented among the customers.
  • By comparing the cluster centers of the over-represented clusters (3 and 7) with those of the under-represented clusters (1 and 5), one can identify the features that differ significantly. We found 50 such important features, which can be focused on when building a prediction model. The following figure shows the distribution of one such feature in both the customer and general population datasets.
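A minimal sketch of this segmentation step is shown below; X_population and X_customers stand for the preprocessed, scaled feature matrices, and the exact settings are illustrative.

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Keep 280 components, which explain roughly 96% of the variance.
pca = PCA(n_components=280)
population_pca = pca.fit_transform(X_population)
print(pca.explained_variance_ratio_.sum())

# Elbow method: fit k-means for several values of k and inspect the inertia curve.
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(population_pca).inertia_
            for k in range(2, 16)]

# Fit the chosen model (k=8) and project the customers into the same space.
kmeans = KMeans(n_clusters=8, random_state=42, n_init=10).fit(population_pca)
customer_clusters = kmeans.predict(pca.transform(X_customers))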

Supervised Learning: Building a prediction model

The task here is to create a supervised machine learning model for predicting whether a person would respond to a marketing campaign and become a customer. The interesting part of this exercise was finding out how the model compares to other teams’ models in a Kaggle competition. The training and test sets contain more than 80,000 persons who were targets of a customer acquisition campaign. Only half of the data (i.e. 42,962 persons) contains information on whether the person responded or not. The major challenge was to create a model that predicts the responses of the other half.

  1. ETL: In the first step, we applied the same data wrangling methods of dropping features with a high amount of missing data. However, here we did not drop the rows with missing data. Furthermore, we applied the same imputation, encoding, and scaling strategy as in the preprocessing for unsupervised learning.
  2. Training: The labeled data was then split into a training and a test part for fitting the prediction model and cross-validating it, respectively.
  3. Metric: There are many metrics for binary classification, such as F1 score, precision, recall, AUC-ROC, and accuracy. The given dataset is imbalanced (only 1.2% of the contacted persons in the training set responded and almost 99% did not), so one could obtain 99% accuracy simply by labeling all data points as ‘not responded’; accuracy should therefore not be used as the metric. AUC-ROC (Area Under the Receiver Operating Characteristic curve) is an evaluation metric for binary classification problems that measures how well the model can distinguish between the ‘responded’ and ‘not responded’ classes. An AUC-ROC of 0.8 means the classifier ranks a randomly chosen ‘responded’ case above a randomly chosen ‘not responded’ case 80% of the time. AUC-PR could also have been an alternative.
  4. Learning Model: After several submissions to Kaggle, it was noted that, among the many supervised learning classifiers, the XGBoost classifier performed much better than the gradient boosting classifier and a simple linear classifier after tuning. The Gradient Boosted Machine (GBM) achieved a better training score at the cost of longer execution time, but it was prone to overfitting: it showed a better score on the training set, while its ROC-AUC score on the test set was low, as seen during several Kaggle submissions. Due to parallel processing in XGBoost, training is much faster than with GBM; this matters during hyperparameter tuning, when hundreds of parameter settings need to be tried. XGBoost is also robust in handling missing data, and the ‘fill with zero’ imputation strategy worked better than ‘most frequent’ for it. A minimal training and evaluation sketch follows.
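As a rough illustration of the training and evaluation step, a minimal sketch is shown below; X, y and the hyperparameter values are placeholders rather than the exact configuration used for the submissions.

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# A stratified split preserves the ~1.2% positive rate in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

# ROC-AUC is computed on predicted probabilities, not hard class labels.
print(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))

The search space explored during the subsequent hyperparameter tuning is shown next.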
search_spaces = {'learning_rate': (0.01, 0.2, 'log-uniform'),
                 'max_depth': (2, 5),
                 'min_child_weight': (2, 8),
                 'subsample': [0.7, 0.8, 0.9, 1.0],
                 'colsample_bytree': (0.5, 1.0, 'uniform'),
                 'n_estimators': (40, 100),
                 'colsample_bylevel': (0.1, 1.0, 'uniform'),
                 'scale_pos_weight': [1]}
Code snippet: search space for tuning the XGBoost hyperparameters. For XGBoost, increasing the number of estimators beyond 100 leads to overfitting.

5. Optimization: The focus thereafter was on tuning XGBoost classifier parameters such as ‘learning_rate’, ‘max_depth’, ‘n_estimators’, and ‘min_child_weight’. An initial, time-consuming optimization using GridSearchCV and tuning one parameter at a time helped fix the ranges of values for the parameters. Bayesian optimization over the hyperparameters was then performed using BayesSearchCV from the skopt library (see the sketch below).
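A minimal sketch of how the search space above might be plugged into BayesSearchCV is shown below; the cross-validation and iteration settings are illustrative, not the exact values used in the project.

from skopt import BayesSearchCV
from xgboost import XGBClassifier

opt = BayesSearchCV(
    estimator=XGBClassifier(objective='binary:logistic'),
    search_spaces=search_spaces,   # the dictionary defined above
    scoring='roc_auc',             # the competition metric
    cv=3,
    n_iter=50,
    random_state=42)

opt.fit(X_train, y_train)
print(opt.best_params_, opt.best_score_)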

6. Results: Using the XGBoost classifier, the following score was obtained in Kaggle.

A nice feature of the XGBoost classifier is that it can output a sorted list of features according to their importance (sketched below). The most important feature found by almost all classifiers was an undocumented feature, ‘D19_SOZIALES’, which may indicate the social status of a person. On visualizing this feature for the general population and the customers, one can see in the following figure that it has a major influence in predicting customer acquisition.
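Extracting that ranking can be sketched as follows; feature_names stands for the column order of the processed training matrix and is an assumption, not a variable from the project code.

import pandas as pd

# feature_importances_ is available after fitting; pairing it with the column
# names gives the sorted ranking that surfaced 'D19_SOZIALES' at the top.
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))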

Conclusion

The project was a real-life project, where one got to know the problem better over time. The fun part was the addictive nature of the Kaggle competition. Over time, the important breakthrough moments were: the selection of a classifier that is more conducive to hyperparameter tuning (i.e. XGBoost); the realization that a wrong imputation strategy could lead to results that look either too good or too bad, depending on the amount of missing values in the dataset (we found that the simple ‘fill with zero’ strategy was good enough, as XGBoost is quite robust to missing values); and, finally, the selection of an appropriate range of parameters for tuning. Very often, the optimization search found parameters that led to an overfitted solution. For example, setting the upper bound of ‘n_estimators’ to 100 instead of 500 not only reduced the execution time of the optimization search, but also found a less overfitted classifier whose score led to an improvement of 30 places in the ranking, from 52 to 22.

Finally, one can say that effort in feature engineering (e.g. handling missing data, removing undocumented features) pays off in developing better models. Parameter tuning of supervised learning models (e.g. XGBoost, gradient boosting classification, …) leads to better models only after good feature engineering (e.g. identifying categorical variables for one-hot encoding). An investment in a powerful machine is a must for a data science project. As future work, one could try methods like iterative imputation, better feature engineering (e.g. featuretools), and other models such as Keras deep learning.

[1] Data set is provided by Arvato Financial Services for an Udacity capstone project.

[2] The notebooks and python files are available in Github repo: https://github.com/kross11480/courses/blob/master/capstone/Arvato%20Project%20Workbook.ipynb
