Engaging customers through unsupervised and supervised learning

Using unsupervised and supervised approaches on demographic data to engage new and existing customers.

Jannis Jehmlich
Analytics Vidhya
11 min read · Dec 22, 2020


Overview

This article will cover two approaches to selecting customers for a mail campaign: an unsupervised approach and a supervised approach. The analysis is conducted in a Jupyter notebook using Python.

For the unsupervised approach, I have two datasets available: general demographic data of Germany (891,221 rows, 366 columns) and a dataset containing demographic data of customers (191,652 rows, 368 columns). The customer dataset has all the columns of the demographic dataset. I will train a k-means model on the general population of Germany to identify clusters within it. Afterwards, I will use the trained model to identify clusters within the customer data. Ideally, the customer data will be dominated by a few clusters, which can then be targeted among the general population.

For the supervised approach, I have the customer dataset available with an additional column indicating whether the customer responded to a campaign in the past.

Problem Statement

The problem is twofold: On the one hand, the dataset contains many columns with categorical values and many NaNs. The categorical values need to be one-hot encoded and, in a later step, reduced through principal component analysis. The NaNs need to be removed, replaced, or imputed since the algorithms cannot deal with them.

On the other hand, the analytical goal is to find a demographic group that is overrepresented among the customers and to predict responses to a campaign among existing customers.

I expect good results for the unsupervised clustering of customers because the dataset contains detailed demographic information about them.
However, for the supervised prediction of customer responses, I only expect mediocre scores since only a few responses are present in the training data.

Metrics

I will concentrate on two metrics to evaluate whether the unsupervised and supervised learning was a success. For the unsupervised clustering, I want a significantly different distribution of clusters in the demographic data than in the customer data. This would show that certain demographic aspects are related to an interest in the financial products and, consequently, that it is possible to target new customers more accurately.

In the supervised prediction of customer responses, I will measure the area under the Receiver Operating Characteristic curve (ROC AUC), a useful metric when predicting probabilities and recall matters more than precision. I chose this metric because I am interested in predicting as many responses correctly (true positives) as possible and don't mind false positives. I don't mind false positives because the cost of mailing a few people who won't answer is relatively low. This helps to catch more of the rare responses.
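As a minimal illustration with made-up values (not from the project data), the score only depends on how well the predicted probabilities rank responders above non-responders:

from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities, illustrative values only
y_true = [0, 0, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.2, 0.8, 0.3, 0.6]
print(roc_auc_score(y_true, y_prob))  # 1.0 -- every responder is ranked above every non-responder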

Libraries

Libraries I used for the project are:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVR
%matplotlib inline

Since every data-science project is 80% preparation, let’s start with the data inspection and then wrangling.


Data inspection

After loading the datasets, it is good practice to get a thorough understanding of your data with functions like:

df.head()
df.tail()
df.shape # Not a function but an attribute!
df.describe()
# if you have a lot of columns you want to pass the following arguments to get a detailed report
df.info(verbose = True, null_counts = True)

Look into single columns with functions like:

# shows the unique values and how often they appear
df.column_name.value_counts()

I inspected all columns for imbalances with the following function:

def balance(df):
    data = []
    print("Column balances printed in percent:")
    for column in df.columns:
        bar_data = 100 * df[column].value_counts() / df.shape[0]
        print("{}: {}".format(column, bar_data))
        data.append(bar_data)
    return data

Most columns have a roughly balanced distribution of classes, but some are heavily skewed: “AGER_TYP”, for example, has 30% of its values in class 9.

Further, some visual inspection is advised, especially to determine how many NaN values are present in the dataset.

Calculate the percentage of NaN values per column by subtracting the count of non-missing values from the column’s length and dividing by that length. Plotting the list of percentages against the column names gives the graph below.

nan_values = []
for column in df.columns:
    n_nan = (len(df[column]) - df[column].count()) / len(df[column])
    n_nan *= 100  # show values in percent
    nan_values.append(n_nan)

plt.figure(figsize = (30,10))
plt.bar(df.columns[:], nan_values[:])
plt.xticks(rotation = 70);

This is a plot for investigating the state of the data, so it needs to be informative, not pretty. The point is that I want to know how many NaN values I am dealing with. The bar chart nicely shows that there are a few columns with mostly NaNs that can be dropped.

More thought needs to go into the many columns with 15%-20% missing data. There are three possible choices:

  • Impute the missing values: check out this article for different strategies (a minimal sketch follows this list).
  • Inspect the rows for missing values and drop them if the NaN values are concentrated in a small share (around 10%) of the rows.
  • Replace them with 0 for numeric columns, and one-hot encode them for categorical columns.
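As a minimal sketch of the first option, median imputation with scikit-learn (the column selection and the choice of strategy are only illustrative):

from sklearn.impute import SimpleImputer

# impute numeric columns with their median value (illustrative choice of strategy)
imputer = SimpleImputer(strategy="median")
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = imputer.fit_transform(df[num_cols])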

In my case, I chose a hybrid: dropping rows with more than 10% missing values and one-hot encoding the categorical NaNs. I will cover this in the next part.

Data Wrangling


Dealing with NaNs

Data wrangling is all about organizing the mess and bringing it into a format ready for our models to train on.

First, I began to deal with the NaN values I identified in the inspection part.

I created a column with the count of missing values per row and then queried the data frame on that column. Since there are over 200 columns, a threshold of 20 missing values per row is roughly less than 10%.

# count how many values are missing per row
df["missing"] = df.apply(lambda x: (df.shape[1] - x.count()), axis = 1)
# keep only rows with less than 20 values missing
df_full = df.query("missing < 20")
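The apply/lambda version works but is slow on a data frame of this size; an equivalent and typically much faster variant (a small sketch using pandas' vectorized isnull) is:

# count missing values per row with a vectorized call instead of apply
df["missing"] = df.isnull().sum(axis=1)
df_full = df.query("missing < 20")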

One-hot encode

Most machine learning algorithms can’t deal with categorical values directly. Therefore, I need to create dummy variables, which is also called one-hot encoding. Pandas makes one-hot encoding very easy:

df_dum = pd.get_dummies(df_full, columns = dummy_cols, drop_first = True, dummy_na = True)
  • dummy_cols is a list of columns with categorical values
  • drop_first means that if there are four unique categories in a column, only three dummy columns are created, so the dummies are not perfectly collinear
  • dummy_na = True creates a dummy column for all NaNs
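A toy example (a made-up column, not from the project data) shows what this produces:

import pandas as pd

# made-up example column with one missing value
toy = pd.DataFrame({"color": ["red", "blue", None, "green"]})
dummies = pd.get_dummies(toy, columns=["color"], drop_first=True, dummy_na=True)
print(dummies)
# 'blue' is dropped as the first category (drop_first=True) and the missing
# value gets its own indicator column (dummy_na=True)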

Feature scaling

The few numeric columns need to be rescaled so that columns with large values don’t influence the algorithms disproportionately. This is done with:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[scale_cols] = scaler.fit_transform(df[scale_cols])

Here, scale_cols is a list of the numerical columns.

Dimensionality reduction

Having a dataset with many categorical columns and one-hot encoding them results in a massive number of columns. A principal component analysis (PCA) can help to reduce the number of columns significantly. The goal is to keep enough components to explain >90% of the variability in the data.

The following code will help to determine the number of principal components to do just that:

from sklearn.decomposition import PCA

pca = PCA()
df_pca = pca.fit_transform(df)

cumm_values = []
former = 0
# variable that makes sure the success print is only printed once
first = True
for idx, value in enumerate(pca.explained_variance_ratio_):
    cumm_values.append(value + former)
    former += value
    if former >= 0.9 and first:
        print("{:.2f} % variance explained with {} components".format(former*100, idx))
        first = False
plt.plot(cumm_values);

Along with the information that 728 of the 2458 columns are sufficient for 90% of the variability, we get the following graph, which shows the decreasing utility of additional components.
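To actually reduce the data to that size, the PCA can be refitted; scikit-learn also accepts a float for n_components, in which case it keeps just enough components to reach that share of explained variance (a small sketch reusing the df from above):

# keep just enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
df_pca = pca.fit_transform(df)
print(df_pca.shape)  # roughly (n_rows, 728) for this dataset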

Now the dataset is ready to be used for machine learning algorithms! It is good practice to put all the data wrangling steps in a cleaning function since you will have to do the same steps for the other two datasets.
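A hedged sketch of such a cleaning function, assuming dummy_cols and scale_cols are the column lists used above; the fitted scaler and PCA can be passed in so they are fit only on the general population and merely applied to the customer data:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def clean_data(df, dummy_cols, scale_cols, scaler=None, pca=None):
    """Apply the wrangling steps from above; fit scaler/PCA only if none are passed in."""
    df = df.copy()
    # drop rows with 20 or more missing values
    df["missing"] = df.isnull().sum(axis=1)
    df = df.query("missing < 20").drop(columns="missing")
    # one-hot encode the categorical columns, including NaN indicator columns
    df = pd.get_dummies(df, columns=dummy_cols, drop_first=True, dummy_na=True)
    # scale the numeric columns
    if scaler is None:
        scaler = StandardScaler().fit(df[scale_cols])
    df[scale_cols] = scaler.transform(df[scale_cols])
    # reduce dimensionality
    if pca is None:
        pca = PCA(n_components=0.9).fit(df)
    return pca.transform(df), scaler, pca

In practice, the dummy columns of the customer data may not match those of the general population exactly (some categories may be missing), so the encoded customer frame might need to be reindexed to the population’s columns before applying the PCA.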

Unsupervised machine learning


To identify clusters in the demographics dataset, I used KMeans. An important decision is how many clusters to use. This can be decided with:

from sklearn.cluster import KMeans

# Over a number of different cluster counts...
clusters = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
kmeans_score = []
for n_cluster in clusters:
    print("Fitting kmeans with {} clusters".format(n_cluster))
    # run k-means clustering on the data and...
    kmeans = KMeans(n_clusters=n_cluster, random_state=0).fit(df_pca)
    print("Calculating the score...")
    # store the absolute within-cluster sum of squared distances (negative of the KMeans score)
    kmeans_score.append(np.abs(kmeans.score(df_pca)))

This results in a plot that shows the error over the number of clusters. The ideal number of clusters is at the “elbow,” where the curve flattens significantly. However, as you can see below, there isn’t an obvious “elbow” in this case. I chose 12 clusters, as the gradient is relatively flat at this point.
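The plot itself is not produced by the loop above; a minimal way to draw it from the two lists is:

# visualize the within-cluster distance score over the number of clusters
plt.plot(clusters, kmeans_score, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("abs(KMeans score)")
plt.show()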

The resulting clusters for the demographics data can be determined by first training again on 12 clusters and then predicting on the data:

kmeans = KMeans(n_clusters=12, random_state=0).fit(df_pca)
preds = kmeans.predict(df_pca)
df["cluster"] = preds

And visualized with

sns.countplot(x = "cluster", data = df, color = "grey")

The trained KMeans model can then be used on the customer data to predict the clusters within. Be sure to apply the same data wrangling steps on this dataset!
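Assuming customer_df already contains the cleaned and one-hot encoded customer data with the same columns as the general population, the fitted scaler and PCA are reused with transform (not fit_transform), for example:

# project the customer data onto the components learned from the general population
customer_df[scale_cols] = scaler.transform(customer_df[scale_cols])
customer_pca = pca.transform(customer_df)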

customer_preds = kmeans.predict(customer_pca)
customer_df["cluster"] = customer_preds
# Visualize the clusters in the customer dataset
sns.countplot(x = "cluster", data = customer_df, color = "grey")

This results in the following cluster distribution:

In contrast to the general population of Germany, the customers fall mainly into cluster 5, with a medium share in cluster 11 and small shares in clusters 2, 4, and 8. For acquiring new customers, it is recommended to focus on these clusters.

Meaning of the clusters

After all the data transformations, it is impossible at this point to tell who to target for customer acquisition. However, it is possible to reverse the transformations and gain insights into the clusters:

cluster_centers = kmeans.cluster_centers_
cc = pca.inverse_transform(cluster_centers)
# -1 because we do not want the last "cluster column"
cc_org = pd.DataFrame(cc, columns = customer_df.columns[:-1])

Now, with the PCA inverted, the 10 most impactful columns for each cluster can be retrieved with:

cc_org.iloc[11:12,:].T.sort_values(by = 11, ascending = False)[:10]

Here, 11 refers to cluster 11.
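Note that the numeric columns in cc_org are still on the standardized scale; to read them in their original units, the scaler can be inverted as well (a sketch reusing scale_cols and the fitted scaler from above):

# bring the standardized numeric columns back to their original units
cc_org[scale_cols] = scaler.inverse_transform(cc_org[scale_cols])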

Supervised learning


Predicting customer responses

For the last part, I will use the third dataset with customer demographics, which includes a feature indicating whether the customer responded to a previous campaign.

The response feature is heavily skewed: 42,430 customers did not respond and only 532 did.

The same data wrangling steps as in the unsupervised learning task can be applied. To start training a model, we need to split the data into training and testing sets:

# splitting for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
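Given how rare the positive responses are, it can be worth stratifying the split so that both sets contain roughly the same share of responders; train_test_split supports this directly (a small variation on the line above):

# keep the responder ratio roughly equal in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)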

Since I will be using GridSearchCV, which performs cross-validation internally, I need only a small proportion of the data to test the model. The first model I train is a random forest regressor because I want a probability for an answer and not just a classification of 1 or 0.

Random Forest Regressor

A random forest regressor takes quite some time to train, so at this point I only searched across four parameter combinations.

# set up model
reg = RandomForestRegressor()
# set up parameters for grid search
parameters = {'n_estimators': [20, 80],
              'max_depth': [20, 30]}
reg = GridSearchCV(reg, parameters, verbose = 10, scoring = "roc_auc", cv = 5)
reg.fit(X_train, y_train)

The results are judged not by accuracy but by how well the rare responders are caught. It is more valuable to identify every customer who will answer a campaign (true positives), even at the cost of a few who will not respond (false positives). As we are predicting probabilities, the roc_auc_score is used to measure performance.

rfg_test_preds = reg.predict(X_test)
roc_auc_score(y_test, rfg_test_preds)
>>>0.5895173685828465

ROC score before tuning: 0.5563

ROC score after tuning: 0.5895

The best hyperparameters were:

{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': 20,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 20,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}

The very small number of responses among the customers makes the prediction quite tricky and results in low scores.

Support Vector Regressor

The support vector regressor has the advantage of training significantly faster, so more parameters can be included in the search. It is also set up with a GridSearchCV:

# set up model
svc = SVR()
# set up parameters for grid search
parameters = {'C': [0.1, 0.5, 1.5, 5],
              'degree': [3, 4, 5],
              'kernel': ["poly", "rbf"]}
reg_svc = GridSearchCV(svc, parameters, verbose = 10, scoring = "roc_auc", cv = 5)
reg_svc.fit(X_train, y_train)
# evaluating performance
svc_test_preds = reg_svc.predict(X_test)
roc_auc_score(y_test, svc_test_preds)
>>> 0.6117327595040477

ROC score before tuning: 0.59885

ROC score after tuning: 0.6248

(average cross-validation value, since the testing set is very small)

The best parameters were:

{'C': 5,
'cache_size': 200,
'coef0': 0.0,
'degree': 3,
'epsilon': 0.1,
'gamma': 'scale',
'kernel': 'rbf',
'max_iter': -1,
'shrinking': True,
'tol': 0.001,
'verbose': False}

The support vector regressor not only trains faster but also yields better results. However, the results are still not in an ideal range.

Model evaluation and validation

The large value for C shows that the best results are achieved when errors are penalized heavily, and the rbf kernel usually performs well for complex problems with many dimensions.

During training, cross-validation was used to make sure the results are stable and the model doesn’t overfit:

[CV] C=5, degree=3, kernel=rbf 
[CV] ...... C=5, degree=3, kernel=rbf, score=0.643, total= 46.6s
[CV] C=5, degree=3, kernel=rbf
[CV] ...... C=5, degree=3, kernel=rbf, score=0.584, total= 47.0s
[CV] C=5, degree=3, kernel=rbf
[CV] ...... C=5, degree=3, kernel=rbf, score=0.627, total= 47.6s
[CV] C=5, degree=3, kernel=rbf
[CV] ...... C=5, degree=3, kernel=rbf, score=0.609, total= 43.7s
[CV] C=5, degree=3, kernel=rbf
[CV] ...... C=5, degree=3, kernel=rbf, score=0.613, total= 47.5s

Challenges

The training takes an extensive amount of time, which makes iterating over different approaches very difficult. The low number of total responses makes it hard to identify a pattern in the data and achieve good results with the supervised learning algorithms.

Justification

The unsupervised clustering of customers worked very well, as I was able to extract one main cluster that the customers belong to compared to the general population. This will enable the company to target new customers much more precisely and drive conversion up.

The supervised prediction of which customers will respond to a campaign yielded mediocre results, but it should already improve conversion considerably compared to mailing all existing customers. It is therefore a good step in the right direction.

Conclusion

As expected, the data wrangling took an extensive amount of time. Especially difficult was dealing with the large number of categorical columns and the many NaN values. Through one-hot encoding and subsequent dimensionality reduction with PCA, it was possible to end up with a moderate number of columns.

Unsupervised clustering of the population data and identifying those clusters in the customer data was a great way to find potential new customers for acquisition. It can significantly improve conversion and reduce costs in the sales department.

The supervised prediction of customer responses is excellent for leveraging potential within the existing customer base; however, it only delivers mediocre results.

Improvements

More time could be spent on the parameter search for the two supervised models, or on implementing a neural network that might deliver better results. Because the two grid searches for the random forest regressor and the support vector regressor already took a very long time, I did not include more parameters at this point.
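As a starting point for that last idea, a minimal sketch with scikit-learn's MLPRegressor; the layer sizes and learning rate are arbitrary choices, not tuned values from the project:

from sklearn.neural_network import MLPRegressor
from sklearn.metrics import roc_auc_score

# a small feed-forward network predicting the response probability;
# hidden layer sizes and learning rate are illustrative only
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), learning_rate_init=1e-3,
                   max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(roc_auc_score(y_test, mlp.predict(X_test)))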
