Identifying Potential Customers Using Machine Learning Techniques

Ozkan Oztork
Published in The Startup · Jan 31, 2021 · 16 min read

1. Introduction

This post is about my Capstone Project for Udacity's Data Scientist Nanodegree Program. I analyzed demographics data for customers of a mail-order sales company in Germany and compared it against demographics information for the general population.

2. Problem Statement

The analysis is composed of four major parts: Part0-Get to Know the Data, Part1-Customer Segmentation Report, Part2-Supervised Learning Model, and Part3-Kaggle Competition.

  1. Part0-Get to Know the Data: First, I performed the necessary work for Data Understanding and Data Preparation.
  2. Part1-Customer Segmentation Report: In this part, I used unsupervised machine learning techniques to perform customer segmentation, identifying the parts of the population that best fit the customer profile of the company so they can be targeted in a marketing campaign.
  3. Part2-Supervised Learning Model: As a next step, I built a supervised machine learning model that predicts whether or not each individual will respond to the campaign.
  4. Part3-Kaggle Competition: As a final step, after choosing the best model in the previous step, I used it to make predictions on the campaign data as part of a Kaggle competition.

3. Part0-Get to Know the Data:

3.1 Data Understanding

Four main datasets have been provided by Arvato Financial Solutions:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 367 features (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 features (columns).

In addition to the above datasets, two Excel files were provided as references for more information about the columns of the datasets:

  • DIAS Information Levels — Attributes 2017.xlsx: Top-level list of attributes and descriptions, organized by informational category
  • DIAS Attributes — Values 2017.xlsx: Detailed mapping of data values for each feature in alphabetical order

Each row of the demographics data corresponds to a unique person, and the columns cover a wide variety of information such as age, gender, year of birth, social status, family status, nationality, financial status, energy consumption type, income, transaction activity, and household-, building- and car-related information.

I went through the "DIAS Attributes — Values" Excel file to get a better understanding of the columns. The file contains "Attribute", "Description", "Value" and "Meaning" columns. I added a "Dtype" column and categorized every attribute as categorical (nominal), ordinal or numerical based on my reading of the "Value" and "Meaning" columns, for later use in data preparation.

Head of Dias Attributes-Values file

I checked the columns in the Azdias and Customers data frames and dropped the three extra columns (‘PRODUCT_GROUP’, ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’) from the Customers data frame so that both data frames have the same features.
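A minimal sketch of this step could look like the following (the data frame names azdias and customers are my own shorthand, not necessarily the variable names used in the project):

```python
# Drop the three columns that exist only in the customers data.
extra_cols = ['PRODUCT_GROUP', 'CUSTOMER_GROUP', 'ONLINE_PURCHASE']
customers = customers.drop(columns=extra_cols)

# Both data frames should now expose the same set of features.
assert set(customers.columns) == set(azdias.columns)
```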

I cross-checked the demographics data and the attributes file to see whether each column is covered in the attributes file (a sketch of this comparison follows the list below):

  • There were 42 columns that exist in the Excel file but not in the Azdias data. I dropped these attributes from the Excel file.
  • There were 94 columns that exist in the Azdias data but not in the Excel file.
  • There were 272 columns that exist in both the Excel file and the Azdias data.
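A sketch of how this comparison could be done, assuming the attribute descriptions from the Excel file are loaded into a data frame called attributes with an "Attribute" column:

```python
# Compare attribute names in the Excel file against the Azdias columns.
excel_attrs = set(attributes['Attribute'])
azdias_cols = set(azdias.columns)

only_in_excel  = excel_attrs - azdias_cols   # 42 attributes, dropped from the Excel file
only_in_azdias = azdias_cols - excel_attrs   # 94 columns, explored manually below
in_both        = excel_attrs & azdias_cols   # 272 common columns

# Drop the attributes that have no matching column in the data.
attributes = attributes[~attributes['Attribute'].isin(only_in_excel)]
```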

Exploring the 94 columns that exist in the Azdias data but not in the Excel file:

I chose not to drop all of these columns so as not to lose any information. Instead I went over the data and tried to guess the data types: I first explored the columns specified as non-numerical, then the ones specified as numerical.

The columns specified as non-numerical:

Non-numeric columns in Azdias but not in the Excel file

I figured out that the EINGEFUEGT_AM column should be a date-time object; I will make the data type conversion later. To understand the other two columns, I looked at their distributions:

Distributions of the non-numeric columns in Azdias but not in the Excel file

I decided to categorize both columns as categorical (nominal) and added this information to the "Dtype" column in the Excel file. The "XX" values in the CAMEO_INTL_2015 column need to be replaced with NaN.

The columns specified as numerical:

Numeric columns in Azdias but not in the Excel file

I explored those 92 columns with a similar approach: I looked at the distribution plots of all features, categorized them according to my understanding, and updated the "Dtype" column in the Excel file. Whenever I saw values coinciding in a distribution plot, I went deeper into that feature; for example, for the two columns above and the LNR column, I decided to keep them as numerical. LNR contains a unique ID-type number for each customer. The rest were categorized as categorical (nominal) or ordinal.

So, finally, I had updated the "Dtype" column in the Excel file with the categories of the Azdias columns.

Exploring the 272 columns that exist in both the Azdias data and the Excel file:

I explored those columns with a similar approach as above.

The columns specified as non-numerical:

Non-numerical columns common to Azdias and the Excel file

Because I had already categorized those columns in the Excel file at the beginning, I did not do any further categorization here. The "X" and "XX" values will be replaced with NaN in the CAMEO_DEUG_2015 and CAMEO_DEU_2015 columns, all values in CAMEO_DEU_2015 will be replaced with numbers (0, 1, 2, 3, ...), and the values W and O will be replaced with 0 and 1 in OST_WEST_KZ. A sketch of these replacements is shown below.
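The planned replacements could look roughly like this; the integer codes assigned to CAMEO_DEU_2015 are illustrative, not necessarily the exact mapping used in the project:

```python
import numpy as np

# Turn the placeholder codes into missing values.
azdias['CAMEO_DEUG_2015'] = azdias['CAMEO_DEUG_2015'].replace('X', np.nan)
azdias['CAMEO_INTL_2015'] = azdias['CAMEO_INTL_2015'].replace('XX', np.nan)
azdias['CAMEO_DEU_2015']  = azdias['CAMEO_DEU_2015'].replace('XX', np.nan)

# Map the alphanumeric CAMEO_DEU_2015 codes to consecutive integers.
codes = sorted(azdias['CAMEO_DEU_2015'].dropna().unique())
azdias['CAMEO_DEU_2015'] = azdias['CAMEO_DEU_2015'].map({c: i for i, c in enumerate(codes)})

# West / East indicator to 0 / 1.
azdias['OST_WEST_KZ'] = azdias['OST_WEST_KZ'].replace({'W': 0, 'O': 1})
```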

Exploring the whole Azdias data to check for meaningless values such as -1 and "unknown":

Column contains -1 value

There were some meaningless values listed under the Value column of the Excel file, such as -1 for all columns and 0 for some columns, both meaning "unknown". It also shows that 9 means "unknown" for some columns.

-1, 0 and other unknowns in the Excel file

In the Excel file, I filtered the rows containing the words "unknown", "no classification", "no transaction", "no transactions" and "no online", and created a data frame named "unknowns" containing each attribute and its meaningless values.

Head of unknowns data frame
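A hypothetical sketch of how the unknowns data frame could be built, assuming the attribute-values sheet is loaded as attr_values with "Attribute", "Meaning" and "Value" columns:

```python
# Keep only the rows whose Meaning text marks the value as meaningless.
keywords = ['unknown', 'no classification', 'no transaction',
            'no transactions', 'no online']
mask = attr_values['Meaning'].str.contains('|'.join(keywords), case=False, na=False)

unknowns = attr_values.loc[mask, ['Attribute', 'Value']]
```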

3.2 Data Preparation

I tried to understand the data by going through the Excel file and doing all the explorations above; now it is time to prepare the data.

Cleaning:

I performed the cleaning operations that I decided on while exploring the data in the previous part, such as replacing the "X" and "XX" values with NaN, replacing "W" and "O" with 0 and 1 in OST_WEST_KZ, dropping the LNR column, and replacing the values with numbers in CAMEO_DEU_2015.

Replacing Unknowns with NaN:

I replaced all the unknown values with NaN by looping over the Azdias and unknowns data frames.
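A sketch of that replacement loop, assuming each "Value" cell holds one or more comma-separated integer codes (e.g. "-1, 9"); the exact cell format is an assumption on my side:

```python
import numpy as np

# For every attribute listed in `unknowns`, turn its "unknown" codes into NaN.
for _, row in unknowns.iterrows():
    col = row['Attribute']
    if col not in azdias.columns:
        continue
    codes = [int(v) for v in str(row['Value']).split(',')]
    azdias[col] = azdias[col].replace(codes, np.nan)
```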

Handling missing data:

Most of the columns (roughly 320) have less than 20 percent missing values, so I dropped the columns with more than 20 percent missing values (40 columns).

I also dropped the rows with more than 15 percent missing values (139,890 rows).
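In code, these two thresholds could be applied roughly as follows:

```python
# Fraction of NaNs per column: keep columns with at most 20% missing.
col_missing = azdias.isnull().mean()
azdias = azdias.loc[:, col_missing <= 0.20]

# Fraction of NaNs per row: keep rows with at most 15% missing.
row_missing = azdias.isnull().mean(axis=1)
azdias = azdias.loc[row_missing <= 0.15]
```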

Imputing rest of the missing values:

I imputed the rest of the missing values based on the following assumption: for categorical (ordinal or nominal) columns, I filled the missing values with the most frequent value; for numerical columns, I filled them with the mean of the column.
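A sketch of this imputation using sklearn's SimpleImputer, assuming categoricals, ordinals and numericals are the column lists derived from the Excel file:

```python
from sklearn.impute import SimpleImputer

# Restrict the lists to columns that survived the previous cleaning steps.
cat_cols = [c for c in categoricals + ordinals if c in azdias.columns]
num_cols = [c for c in numericals if c in azdias.columns]

# Most frequent value for categorical/ordinal columns, mean for numerical ones.
azdias[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(azdias[cat_cols])
azdias[num_cols] = SimpleImputer(strategy='mean').fit_transform(azdias[num_cols])
```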

Handling outliers:

To detect the columns that have outliers, I used box plots and saw that the columns ANZ_HAUSHALTE_AKTIV, ANZ_STATISTISCHE_HAUSHALTE and KBA13_ANZAHL_PKW contain outliers. I used the IQR rule to flag outlying values and dropped a row if it had more than 2 outliers across those 3 columns (1,047 rows dropped).
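A sketch of the IQR rule on those three columns (the 1.5 x IQR multiplier is the usual convention and an assumption on my side):

```python
cols = ['ANZ_HAUSHALTE_AKTIV', 'ANZ_STATISTISCHE_HAUSHALTE', 'KBA13_ANZAHL_PKW']

# Interquartile range per column.
q1, q3 = azdias[cols].quantile(0.25), azdias[cols].quantile(0.75)
iqr = q3 - q1

# Flag values outside the 1.5 * IQR fences, then drop rows with more than 2 flags.
is_outlier = (azdias[cols] < q1 - 1.5 * iqr) | (azdias[cols] > q3 + 1.5 * iqr)
azdias = azdias.loc[is_outlier.sum(axis=1) <= 2]
```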

Feature Encoding:

I applied feature encoding based on the following assumption: numerical columns were kept as they are, nominal columns were dummy encoded, and ordinal columns were kept as they are. I used the lists of columns named categoricals, ordinals and numericals that I derived from the Excel file earlier.
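A minimal sketch of the encoding step with pandas, assuming categoricals holds the nominal column names:

```python
import pandas as pd

# Dummy-encode only the nominal columns; ordinal and numerical columns stay as they are.
nominal_cols = [c for c in categoricals if c in azdias.columns]
azdias = pd.get_dummies(azdias, columns=nominal_cols)
```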

Cleaning of Customers Data:

I defined a cleaning function including all the steps up to now, and cleaned the customer data with the help of this function.

Cleaning of Train Data:

I cleaned the train data using the cleaning function.

Cleaning of Test Data:

I defined a new cleaning function for the test data. It uses the same steps, but I updated the necessary parts so that no rows are dropped, because I will need all rows for the Kaggle submission.

Feature Scaling:

The Azdias and Customers data were scaled (standardized) using StandardScaler to achieve a mean of 0 and a standard deviation of 1. I did not scale the train and test data at this step.
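A sketch of the scaling step; whether a single scaler fitted on the Azdias data was reused for the Customers data is my assumption:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit on the general population, reuse for the customers so both share one scale.
scaler = StandardScaler()
azdias_scaled = pd.DataFrame(scaler.fit_transform(azdias_clean),
                             columns=azdias_clean.columns)
customers_scaled = pd.DataFrame(scaler.transform(customers_clean),
                                columns=customers_clean.columns)
```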

Saving Data:

I saved all four data frames as pickle files for later use.

4. Customer Segmentation

4.1 Dimensionality Reduction Using PCA

To find the directions of maximum variance in the data, I applied principal component analysis using sklearn's PCA class. At the beginning I did not set any parameters, so that all components were computed. I then checked the variability of each component as well as the cumulative variability of the components.

There are 613 columns in both the Customers and Azdias data. We can see the number of components and their effect on the variability: 90% of the variance of the Azdias data is represented by the first 323 principal components, and the remaining components account for only 10%. Most components explain less than 1% of the variance each, and the first 5 components together explain 18%. I decided to use the first 320 principal components, which cover almost 90% of the variance.
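A sketch of how the cumulative explained variance can be inspected to pick that cut-off:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components first, then look at the cumulative explained variance.
pca = PCA()
pca.fit(azdias_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum_var >= 0.90) + 1   # first index reaching 90%
print(n_components, cum_var[n_components - 1])
```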

4.2 Interpretation of Principal Components

I transformed the Azdias data setting the number of components to 320, so now we have 320 components. Each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components).

  • The larger the feature weight (in absolute value), the more the principal component points in the direction of that feature.
  • If two features have large weights of the same sign (both positive or both negative), then an increase in one feature will yield an increase in the other. They show a positive correlation.
  • Similarly, two features with weights of different signs will be expected to show a negative correlation.

We can see the top 4 principal components, and the top 5 positively and negatively weighted features for each, in the charts below.

Let us investigate PC0:

MOBI_REGIO, PLZ8_ANTG1 and LP_STATUS_FEIN are the most negatively weighted features, so those features are positively correlated among themselves. Thus low-income earners have low mobility and live in small family houses.

PLZ8_ANTG3, KBA13_ANTG3, PLZ8_ANTG4 and KBA13_ANTG4 are the most positively weighted features, so those features are also positively correlated among themselves. Thus, there are more car owners in regions where more crowded houses exist.

So PC0 is negatively correlated with MOBI_REGIO, PLZ8_ANTG1 and LP_STATUS_FEIN, and positively correlated with PLZ8_ANTG3, KBA13_ANTG3, PLZ8_ANTG4 and KBA13_ANTG4.

We can conclude that PC0 is related to people having high mobility, crowded family houses, more cars and high income.

  • PLZ8_ANTG1 is the number of 1–2 family houses in the PLZ8 region.
  • PLZ8_ANTG3 is the number of 6–10 family houses in the PLZ8 region.
  • PLZ8_ANTG4 is the number of >10 family houses in the PLZ8 region.
  • LP_STATUS_FEIN has low values for low-income earners and high values for top earners.

Let us now investigate PC2:

PC2 is related to people who are avant-garde, not money savers, at a low level in the financial typology, and not traditional-minded.

  • PRAEGENDE_JUGENDJAHRE: dominating movement in the person's youth (high value: avant-garde, low value: mainstream)
  • FINANZ_SPARER: financial typology, money saver (a high value means a low saver)
  • FINANZ_UNAUFFAELLIGER: financial typology, unremarkable (a high value means low)
  • SEMIO_PFLICHT: affinity indicating in what way the person is dutiful and traditional-minded (a high value means low affinity)
  • SEMIO_TRADV: affinity indicating in what way the person is traditional-minded (a high value means low affinity)

4.3 Perform K-Means Clustering

Now, let us see how the demographics data (Azdias) clusters in the principal components' space. I applied k-means clustering on the data.

To tune the parameters of the KMeans model, I started by setting random_state to 42 in order to reproduce the exact same clusters over and over again. I focused only on the n_clusters parameter rather than tuning all the other parameters. Standard literature suggests using the elbow method to determine the required number of clusters.

Parameters of KMeans

So, I created the elbow plot below to decide how many clusters I should use. I took the number of clusters as 8 because beyond roughly this point the decrease in error (inertia) slows down.
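A sketch of how such an elbow curve can be produced, assuming azdias_pca holds the data projected onto the 320 components:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of cluster counts and record the inertia.
inertias = []
ks = range(2, 16)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(azdias_pca)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
```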

Finally, I created the model with the parameters below and performed k-means clustering: I fitted the model on the transformed population data and created the cluster predictions (labels) for the general population, then created the cluster predictions (labels) for the customers using the same fitted model on the transformed customer data.

Final Parameters of KMeans Model

Afterwards, I calculated the percentage of each cluster label in both datasets, as well as the ratio of customer percentage to Azdias percentage. A sketch of these steps is shown below.
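A sketch of the final clustering and the ratio computation, again assuming azdias_pca and customers_pca hold the PCA-transformed data:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Final model with the chosen number of clusters.
kmeans = KMeans(n_clusters=8, random_state=42).fit(azdias_pca)
azdias_labels = kmeans.predict(azdias_pca)
customer_labels = kmeans.predict(customers_pca)

# Share of each cluster in both datasets and the over/under-representation ratio.
azdias_pct = pd.Series(azdias_labels).value_counts(normalize=True)
customer_pct = pd.Series(customer_labels).value_counts(normalize=True)
ratio = (customer_pct / azdias_pct).sort_values(ascending=False)
```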

According to the chart above, Cluster-0 and Cluster-3 are at the top with the highest Customer/Azdias ratios, meaning customers are over-represented in these clusters relative to the general German population. Those clusters should therefore be selected and targeted as potential customers. Cluster-2 and Cluster-7 are at the bottom with the lowest Customer/Azdias ratios; customers are under-represented there, so these clusters should not be targeted.

By analyzing the weights of the principal components on Cluster-0, we can get a better understanding of the attributes of potential customers.

According to the charts below, the principal components contributing most to Cluster-0 are PC3 and PC5. PC3's weight is about 6 times larger than PC5's, so PC3 has the stronger effect on Cluster-0. We can say that Cluster-0 is strongly and negatively correlated with PC3.

The features contributing most to PC3 are the negatively weighted ones: while the positive weights mostly lie between 0.10 and 0.15, the negative weights lie between 0.15 and 0.20 in absolute value.

The most positively weighted features in PC3 are:

  • KBA13_SITZE_5 : number of cars with 5 seats in the PLZ8
  • KBA13_KMH_140_210 : share of cars with max speed between 140 and 210 km/h within the PLZ8
  • KBA13_SEG_KLEINWAGEN : share of small and very small cars (Ford Fiesta, Ford Ka etc.) in the PLZ8
  • KBA13_HALTER_25 : share of car owners between 21 and 25 within the PLZ8

The most negatively weighted features in PC3 are:

  • KBA13_HERST_BMW_BENZ : share of BMW & Mercedes Benz within the PLZ8
  • KBA13_MERCEDES : share of Mercedes within the PLZ8
  • KBA13_SEG_OBEREMITTELKLASSE : share of upper middle class and upper class cars (BMW 5er, BMW 7er etc.)
  • KBA13_BMW : share of BMW within the PLZ8
  • KBA13_SITZE_4 : number of cars with less than 5 seats in the PLZ8

So PC3 is related to people who have small and very small cars and live in regions where there are few car owners.

To sum up for Cluster-0: it is strongly and negatively correlated with PC3, and PC3 is strongly and positively correlated with few and small car owners.

We can conclude that the target Cluster-0 consists of rather wealthy people who own upper-middle-class cars and live in regions with a high share of BMW and Mercedes.

I wanted to analyze the under-represented group, Cluster-2:

According to the charts below, the principal components contributing most to Cluster-2 are PC1 and PC0. PC0's weight is almost 2.5 times larger than PC1's, so PC0 has the stronger effect on Cluster-2. We can say that Cluster-2 is strongly and positively correlated with PC0.

In the interpretations above, I concluded that PC0, and hence Cluster-2, is related to people having high mobility, crowded family houses, more cars and high income.

And the features of PC1 are as below:

  • KBA05_SEG6 : share of upper class cars (BMW 7er etc.) in the microcell
  • KBA05_KRSOBER : share of upper class cars (referred to the county average)
  • KBA05_KRSVAN : share of vans (referred to the county average)
  • PRAEGENDE_JUGENDJAHRE : dominating movement in the person's youth (high value: avant-garde, low value: mainstream)
  • FINANZ_SPARER : financial typology, money saver (a high value means a low saver)
  • FINANZ_ANLEGER : financial typology, investor (a high value means low investment)
  • KBA05_HERSTTEMP_3 : development of the most common car manufacturers in the neighbourhood: stayed at a lower/average level

5. Supervised Learning Model

It's time to build a prediction model. I had already prepared the Train and Test data in the Data Preparation part by cleaning them, but not scaling.

After I prepared the y (labels) and X (features) data from the train data, I scaled all the data (X, y and test) using StandardScaler.

I saw that the ratio of positive labels to all labels was 0.012 in the train data, so our data was heavily imbalanced. Because of that, I did not want to split the train data into train and validation sets; I decided to train the model with the whole data.

Evaluation Metric:

Since our data is heavily imbalanced (only about 1% positive labels in the train data), I chose the ROC AUC score as the metric to evaluate model performance. The score ranges from 0.0 to 1.0, with 1.0 corresponding to a perfect classifier.

Building Classification Model:

I created my classification model by comparing 4 models through grid search cross-validation. The best model was a Gradient Boosting Classifier, which gave the best ROC AUC score of 0.704.
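A hedged sketch of such a comparison; the text only names the winning Gradient Boosting Classifier, so the other three candidates below are my assumption, cross_val_score is used as a simplification of the grid search, and the raw 0/1 labels are used rather than scaled ones:

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Candidate models compared with cross-validated ROC AUC.
candidates = {
    'logreg': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(random_state=42),
    'ada': AdaBoostClassifier(random_state=42),
    'gbc': GradientBoostingClassifier(random_state=42),
}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, scoring='roc_auc', cv=5).mean()
    print(name, round(score, 4))
```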

Hyper Parameter Tuning:

I tuned the parameters in order to increase the score of the model, focusing on the tree-specific parameters and the boosting parameters.

Boosting parameters: learning_rate, n_estimators, subsample

Tree-specific parameters: min_samples_split, min_samples_leaf, max_depth, max_features, subsample

My tuning approach was as below:

  • I tuned the important and interdependent boosting parameters (learning_rate and n_estimators) first. Decreasing the learning rate requires increasing n_estimators; this makes the model more robust, so they should be tuned together, but it requires more computation time.
  • After initially tuning those two parameters, I tuned the tree-specific parameters. The important point here is the tuning order: the parameters with the higher impact should be tuned first. So I tuned the most influential parameters (max_depth, min_samples_split) together, and afterwards tuned max_features, subsample and min_samples_leaf separately.

Step-1:

learning_rate : [0.05], # best parameter is 0.05 of [0.05, 0.1]

n_estimators : [40], # best parameter is 40 of [30,40,50]

Score : 0.7188

Step-2:

max_depth : [4], # best parameter is 4 of [3,4,5]

min_samples_split : [0.015], # best parameter is 0.015 of [0.013,0.015,0.016]

Score : 0.7160

Step-3:

max_features : ['auto'], # best parameter is 'auto' of ['auto', 'sqrt', 'log2']

Score : 0.7160

Step-4:

subsample :[0.68], # best parameter is 0.68 of [0.68,0.69,0.70]

Score : 0.7180

Step-5:

min_samples_leaf : [0.02], # best parameter is 0.02 of [0.020,0.022,0.024]

Score : 0.7276

I started with the first step: I fit the model and found the best values via model.best_estimator_. I then stored those values, fit the model together with the second step's parameters, and so on. A hedged sketch of one such stage is shown below.
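One stage of this procedure could look roughly like this (Step-2 shown, with the Step-1 results held fixed); the variable names are my own:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Best values found in Step-1 are fixed, the next group of parameters is searched.
fixed = {'learning_rate': 0.05, 'n_estimators': 40}
param_grid = {'max_depth': [3, 4, 5],
              'min_samples_split': [0.013, 0.015, 0.016]}

grid = GridSearchCV(GradientBoostingClassifier(random_state=42, **fixed),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```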

Finally, by tuning the hyperparameters of the selected model, I increased the ROC AUC score from 0.7046 to 0.7276.

For further reading on tuning the Gradient Boosting Classifier, Aarshay Jain's excellent work can be reached here.

6. Submitting Predictions into Kaggle Competition

Finally, using my tuned model, I created predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
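A sketch of the submission step; best_model, X_test and the kept-aside LNR identifiers (test_lnr) are my assumptions about the variable names:

```python
import pandas as pd

# Predicted probability of responding, for each individual in the test partition.
test_pred = best_model.predict_proba(X_test)[:, 1]

submission = pd.DataFrame({'LNR': test_lnr, 'RESPONSE': test_pred})
submission.to_csv('kaggle_submission.csv', index=False)
```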

After submitting the results to Kaggle, I saw that the model did not perform as well as on the training data: it scored 0.727 on the training data but 0.606 on the submission. So it needs improvement.

7. Conclusion

Reflections:

I found the Data Understanding and Data Preparation parts of the project compelling. There were lots of columns and rows to handle and understand. The interpretation of the principal components was also challenging for me: I could not deeply understand the explanations of the data in some columns, and I could not interpret the columns for which there is no explanation in the Excel file.

I described the target cluster, Cluster-0, as containing rather wealthy people who own upper-middle-class cars and live in regions with a high share of BMW and Mercedes.

I described the under-represented cluster, which should not be targeted, as Cluster-2. My expectation was to see the features of Cluster-2 mostly in the opposite direction from Cluster-0, yet it covers people having high mobility, crowded family houses, more cars and high income. So the features of the targeted and untargeted groups sound similar, which is very interesting to me.

There should be some clear differences between them; all principal components of those clusters should be investigated in depth, and before that the explanations of the columns should be clarified.

Improvements:

I could gain a better understanding of the whole dataset and analyze the features and their interactions more deeply.

I tried to keep all the columns in the data so as not to lose any information; even the columns without any explanation in the Excel file were kept. I could have dropped them, which would have made the results easier to interpret.

I treated the data types of columns as nominal, ordinal or numerical, but some columns contain mixed types of data (both nominal and ordinal, etc.). Those columns could have been handled in more effective ways.

I chose 320 components in the customer segmentation section to cover 90% of the variance. I could have chosen fewer, higher-variance components (for example only the components with an explained variance ratio greater than 0.005), which could reduce noise and increase accuracy.

For the ML part, I could have done ensemble modelling and tried other classification models, and I could have done a more thorough benchmark study to understand which classification models perform better in this area.

Thank you for reading. If you are interested in the code and other details, please refer to the repository here.
