All You Need to Know About Handling High Dimensional Data

Customer Segmentation using Unsupervised and Supervised Learning

Rachneet Sachdeva
The Startup
11 min read · Jan 29, 2021


Source: https://www.vectorstock.com/

Targeting potential customers is an essential task for organizations. It helps boost revenue and tailor offerings to the needs of the right customers. Moreover, it helps organizations understand why particular segments of people do not use their services.

In this post, we will study ways of preprocessing a high-dimensional dataset and preparing it for analysis with machine learning algorithms. We will use machine learning to segment customers from a mail-order campaign, understand their demographics, and predict potential future customers.

The code for this project is available at GitHub.

Datasets

The data for this task is provided by Arvato Financial Services as a part of the Udacity Data Scientist Nanodegree. For confidentiality reasons, this data is not available publicly and can only be accessed via Udacity. The following files are provided:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

The provided features can be grouped into various levels:

  • Person: Attributes of a person relating to lifestyle, social status, family, finance, and behavior.
  • Household: Contains information about a person’s age, background, household, and transactional activities.
  • Building: Information relating to the type, year, and neighborhood of the building where a person lives.
  • Microcell 1: Contains features that give a detailed explanation about a person’s wealth, family, and lifestyle.
  • Microcell 2: Contains features that relate to the vehicular belongings of the person.
  • Grid: Person’s transactional activity based on different product groups.
  • Postcode: Information about the area of residence.
  • PLZ8: Information on the share of vehicular brands in a certain area.
  • Community: General information about inhabitants of a community and their work.

Data Preprocessing

Our data is high-dimensional, with 366 features. We need to filter out the important features, so a substantial amount of preprocessing is required. We will go through the data cleaning steps one by one:

  1. Check Columns with Nulls

Our data contains many missing values, which render the affected features meaningless for further evaluation. The bar plot below shows the percentage of nulls in each feature. After analyzing the plot, I decided to drop features with more than 30% null values.

We were able to drop 6 columns (‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’, ‘ALTER_KIND1’, ‘EXTSEL992’, ‘KK_KUNDENTYP’), most of them relating to the ages of the person’s children.
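A minimal sketch of this step; the function name and threshold handling are illustrative, assuming the data is already loaded into a pandas DataFrame:

```python
import pandas as pd

def drop_high_null_columns(df: pd.DataFrame, threshold: float = 30.0) -> pd.DataFrame:
    """Drop columns whose percentage of missing values exceeds `threshold`."""
    null_pct = df.isnull().mean() * 100              # percent of nulls per column
    cols_to_drop = null_pct[null_pct > threshold].index
    return df.drop(columns=cols_to_drop)

# usage (assuming the demographics data is loaded into a DataFrame `azdias`):
# azdias = drop_high_null_columns(azdias, threshold=30)
```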

2. Check Columns with Similar Values

Next, we identify features where most values are identical. These features will not help us differentiate between customers, so we drop those where more than 90% of the values are the same.

This approach helped us drop 24 columns from our data.
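A minimal sketch of this check; the function name and the way the 90% threshold is applied are illustrative:

```python
import pandas as pd

def drop_dominated_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop columns where a single value accounts for more than `threshold` of all rows."""
    to_drop = [
        col for col in df.columns
        if df[col].value_counts(normalize=True, dropna=False).iloc[0] > threshold
    ]
    return df.drop(columns=to_drop)
```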

3. Convert Unknowns to NaNs

The data provided to us has some peculiar issues. Some features use multiple codes to denote an 'unknown' value; for example, both 0 and 9 may represent unknown in the same feature. This distorts our data because the same meaning is represented by different values. We convert these codes to NaNs.
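A minimal sketch, assuming a mapping from feature names to their 'unknown' codes; the mapping shown is hypothetical and would in practice come from the data dictionary provided with the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from feature name to the codes that mean "unknown";
# in practice this comes from the data dictionary shipped with the dataset.
UNKNOWN_CODES = {
    "ALTERSKATEGORIE_GROB": [-1, 0, 9],
    "ANREDE_KZ": [-1, 0],
}

def unknowns_to_nan(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Replace every code listed in `mapping` with NaN in the corresponding column."""
    df = df.copy()
    for col, codes in mapping.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df
```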

4. Encode Categorical Data

Next, we encode our categorical features using one-hot encoding. One-hot encoding converts the unique categories to binary features. The following code snippet can be used for this purpose.
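A minimal sketch with pandas; the column names in the usage line are only examples:

```python
import pandas as pd

def one_hot_encode(df: pd.DataFrame, categorical_cols: list) -> pd.DataFrame:
    """Convert each categorical column into a set of binary indicator columns."""
    return pd.get_dummies(df, columns=categorical_cols)

# usage: df = one_hot_encode(df, ["CAMEO_DEU_2015", "D19_LETZTER_KAUF_BRANCHE"])
```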

5. Handle Mixed Data types

Some features in our dataset mix string and integer values and need to be converted to a consistent format. The code snippet below shows an example of mixed data types and how to handle them.

Handling mixed data types in our data
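A minimal sketch, assuming the mixed column stores integer codes alongside string placeholders such as 'X'; the column name in the usage line is an example:

```python
import numpy as np
import pandas as pd

def clean_mixed_column(df: pd.DataFrame, col: str, placeholders=("X", "XX")) -> pd.DataFrame:
    """Replace string placeholders with NaN and cast the column to a numeric type."""
    df = df.copy()
    df[col] = df[col].replace(list(placeholders), np.nan)
    df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

# usage, e.g. for a column that mixes integer codes with the string "X":
# df = clean_mixed_column(df, "CAMEO_DEUG_2015")
```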

6. Handle Numeric Data

Once we have encoded the categorical features and handled the mixed types, we can move on to analyzing the numerical data. For numerical data, we should always check the distribution and then decide on a processing strategy. In our case, we check the skewness of our numeric columns.

Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean.

Computing the skewed columns in our dataframe

We transform the skewed features with a log transformation so that they conform more closely to normality. This is one of the most popular transformation methods, but you should also try other approaches such as binning, depending on your data.
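A minimal sketch of both steps, assuming a pandas DataFrame and a list of numeric columns; the skewness threshold of 1.0 is an assumption:

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df: pd.DataFrame, numeric_cols: list,
                         skew_threshold: float = 1.0) -> pd.DataFrame:
    """Log-transform the numeric columns whose absolute skewness exceeds the threshold."""
    df = df.copy()
    skewness = df[numeric_cols].skew()
    skewed_cols = skewness[skewness.abs() > skew_threshold].index
    # log1p handles zeros; assumes the skewed features are non-negative
    df[skewed_cols] = np.log1p(df[skewed_cols])
    return df
```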

Numeric features before (left) and after (right) log transformation

7. Detect and Remove Outliers

Outliers are observations that highly differ from the general observations in the data. These observations can take extreme values that distort the distribution of our data. Hence, it is pertinent to deal with them while preparing your data for further meaningful analysis.

Since our data does not strictly follow a normal distribution, we use Tukey's method to remove outliers. Tukey's method flags as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

Code for processing the outliers from a dataframe
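A minimal sketch of Tukey's rule applied column-wise to a pandas DataFrame, dropping any row that is an outlier in at least one numeric column:

```python
import pandas as pd

def remove_outliers_tukey(df: pd.DataFrame, numeric_cols: list, k: float = 1.5) -> pd.DataFrame:
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any numeric column."""
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    outlier_mask = ((df[numeric_cols] < (q1 - k * iqr)) |
                    (df[numeric_cols] > (q3 + k * iqr))).any(axis=1)
    return df[~outlier_mask]
```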

8. Impute Data

Missing values can be handled in multiple ways. Firstly, if you have very few missing values compared to the size of your dataset, you may simply drop the rows with missing data. Secondly, if dropping rows would discard too much data, you can impute the missing values instead.

A very simple imputation technique is provided by sklearn's SimpleImputer, which lets you impute feature values with the mean, median, or most frequent value. More advanced techniques such as KNNImputer or the MICE algorithm (sklearn's IterativeImputer) can be used to achieve better results.
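For example, a minimal sketch with SimpleImputer; the DataFrame name and the choice of strategy are assumptions:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Impute each column with its most frequent value; "mean" or "median" are
# alternatives for continuous features (assumes `df` is the preprocessed DataFrame).
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```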

9. Remove Correlated Features

An additional step we can take is to remove features that are highly correlated with other features in our data. Since highly correlated features do not provide much new information, we can drop them. A minimal snippet, adapted from Chris Albon's work, is sketched below.
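A possible version of that snippet; the 0.95 correlation threshold is an assumption:

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # keep only the upper triangle so every pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```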

Dimensionality Reduction

After preprocessing our dataset, we still have around 300 features in our set. We need to further reduce the dimensionality of our data. Let’s take a look at certain approaches.

  1. Shapley Values

We will use the concept of Shapley values to determine the most important features in our dataset. In game theory, Shapley values assign payouts to players based on their contribution to the total payout. In machine learning, the players can be thought of as features that contribute to the model's prediction. You can read more about Shapley values here.

Using the SHAP package and a base XGBoost classifier, we were able to select the 50 most important features from our data. These features are shown in the plot below.
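A minimal sketch of how such a ranking can be computed; the feature matrix X, the label definition y, the model hyperparameters, and the cutoff of 50 features are all assumptions:

```python
import numpy as np
import shap
import xgboost as xgb

# Assumes X (DataFrame of features) and y (binary labels, e.g. customer vs.
# general population) are already prepared; the cutoff of 50 is illustrative.
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# rank features by mean absolute SHAP value and keep the top 50
mean_abs_shap = np.abs(shap_values).mean(axis=0)
top_features = X.columns[np.argsort(mean_abs_shap)[::-1][:50]]

shap.summary_plot(shap_values, X, max_display=50)
```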

2. Principal Component Analysis

Now that we have reduced our feature set, we can do further analysis with principal component analysis (PCA). PCA is a dimensionality reduction technique that combines the input variables into components that explain the maximum variance of the data, so that the least important components can be dropped. Note that PCA retains most of the information from all the variables, but the interpretability of individual features is lost.

Computing the cumulative variance of components
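A minimal sketch, assuming X is the imputed feature matrix from the previous steps:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumes `X` is the imputed feature matrix; features are scaled before PCA.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"{n_components_95} components explain 95% of the variance")
```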

For our data, the scree plot shows that 35 components explain 95% of the variability. You can lower this threshold depending on your use case.

Scree plot for PCA

Further, we explore the features that are most strongly correlated with our top 2 principal components, which together explain about 35% of the variability in our data.

Features with the highest correlation in component 1 (left) and component 2 (right)

We notice that attributes related to finance and age are highly correlated in the first principal component whereas attributes related to lifestyle and family are representative of the second principal component.

Clustering

Once we have reduced our dimensions using PCA, we segment the general population into clusters using the KMeans algorithm. KMeans is a relatively simple algorithm that groups similar points based on a distance metric, usually Euclidean distance.

We fit KMeans on the general population data and then use it to assign the customer data to clusters, so we can identify the clusters with the largest share of potential customers.

Note that we need to find an optimal value of ‘k’ i.e. the number of clusters we need. An elbow plot is used for this purpose.

Code for finding out the optimal k for the KMeans algorithm
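A minimal sketch, assuming X_pca holds the PCA-transformed general population data; the range of k values is illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumes `X_pca` is the PCA-transformed general population data.
k_values = range(2, 15)
inertias = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_pca)
    inertias.append(kmeans.inertia_)

# plot k against the within-cluster sum of squares and look for the "elbow"
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster SSE)")
plt.show()
```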

From the elbow plot below, we observe that the optimal value of ‘k’ for our data should be 6 as the curve begins to straighten thereafter.

Elbow plot to identify the optimal value of k

Using this value of k, we cluster our general and customer population into k distinct clusters and the results are shown below.

We clearly notice that customers are overrepresented in cluster 1 and underrepresented in clusters 2 and 5. Let's dive into the characteristics that differentiate these customers from the general population.

Understanding the Demographics

Based on our findings, the customer population in cluster 1 differs from the general population in the following categories:

  1. LP_LEBENSPHASE_FEIN (Lifestage): Customers are ‘average earners of higher age from multiperson households’.
  2. LP_FAMILIE_FEIN (Family type): Customers live in a multiperson household.
  3. LP_STATUS_FEIN (Social status): Customers are house owners.
  4. FINANZ_MINIMALIST (Low financial interest): Very low for the customers.
  5. D19_KONSUMTYP_MAX (consumption type): Customers have a ‘versatile’ consumption type.
  6. GEBURTSJAHR (Year of birth): Most of the customers are aged over 60, which suggests this age group should be targeted.
  7. PRAEGENDE_JUGENDJAHRE_DECADE (Youth decade): The customers spent their youth in the 1960s. This again indicates that the customers are older.

Since customers are almost non-existent in cluster 5, we examine the characteristics of this cluster to identify our non-target population. We observe the following from our analysis:

  1. PRAEGENDE_JUGENDJAHRE_DECADE (Youth decade): Individuals in this cluster spent their youth in the 1990s.
  2. GEBURTSJAHR (Birth year): They are under the age of 50.
  3. D19_KONSUMTYP (Consumption type): They fall into the 'informed' consumption category. These consumers may not be our targets.
  4. LP_LEBENSPHASE_FEIN (Lifestage): They are 'single low-income earners of middle age'.
  5. SEMIO_VERT (affinity for being dreamy): Very high for this group.

Supervised Learning

We have now explored our data and understood the characteristics of our target customers. Let’s move on to the most interesting part, modeling. We are going to use the MAILOUT datasets described at the beginning of this article, for the supervised learning task.

Note: Our training set is highly imbalanced, with almost 99% of samples belonging to one class (non-customer).

We undertake the following steps:

  1. Preprocess the train and test sets using the cleaning steps explained earlier.
  2. Extract the most important features using SHAP values.
  3. Model the data using a base XGBoost classifier. XGBoost is our classifier of choice since gradient-boosted trees tend to perform very well on tabular data, and it can handle missing values natively.
  4. Tune the parameters of the classifier. You can use a plethora of methods for this purpose: GridSearchCV or RandomizedSearchCV from sklearn, or BayesSearchCV from scikit-optimize. We opt to use Optuna for our hyperparameter optimization task.

Optuna is a hyperparameter tuning framework that works seamlessly with Python and other machine learning frameworks. You can read more about using Optuna with XGBoost in this blog post.

Code snippet for tuning an XGBoost classifier
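A minimal sketch of such a tuning loop; the training arrays, the search space, and the number of trials are assumptions rather than the exact setup used in the project:

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Assumes X_train, y_train are the preprocessed MAILOUT features and labels;
# the search space below is illustrative.
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "scale_pos_weight": trial.suggest_float("scale_pos_weight", 1.0, 100.0, log=True),
    }
    model = xgb.XGBClassifier(**params)
    # optimize the same metric we evaluate on: area under the ROC curve
    return cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```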

5. Once we have found our best model parameters, we evaluate the model on the test data. The choice of evaluation metric is highly important for any data science project. For our project, we use the AUC (area under the ROC curve) score.

A ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification model at all classification thresholds. It is a plot between the TPR (True Positive Rate) and FPR (False Positive Rate).

TPR is a synonym for recall and tells us what proportion of the positive class was classified correctly. FPR tells us what proportion of the negative class was incorrectly classified by the model.

AUC is the measure of the ability of a classifier to distinguish between the classes and is used as the summary of a ROC curve.

The value of AUC ranges between 0 and 1, where 0 means every prediction is ranked incorrectly, 1 means every prediction is ranked correctly, and 0.5 corresponds to random guessing.

Why do we use AUC as our evaluation metric?

  1. AUC is scale-invariant: It measures how well the predictions are ranked, rather than their absolute value. For example, in our case, the company can target customers in the order of their conversion ranking.
  2. AUC is classification-threshold invariant: It measures the quality of the model's predictions irrespective of the classification threshold. Since our data is highly imbalanced, it is good to assess model predictions without committing to a single threshold.
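For reference, a minimal sketch of computing the AUC with scikit-learn, assuming a fitted model and held-out test data:

```python
from sklearn.metrics import roc_auc_score

# AUC is computed from the predicted probabilities of the positive class,
# not from hard 0/1 predictions (model, X_test, y_test assumed from earlier steps).
y_proba = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")
```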

Results Summary

With the methodology discussed above, we could achieve an AUC score of 0.815 and 0.804 on the train and test sets respectively. This supervised modeling task is a part of a Kaggle competition. So, I urge the readers to go ahead and improve the model performance and challenge for the top spot.

Future Improvements

We found the following avenues to improve our model predictions:

  1. Feature Selection: Our analysis showed that only a small set of features (fewer than 50) was needed for good model performance. So it is important to remove features that distort the data without adding meaningful information for prediction. Our best model used only 48 features extracted using SHAP values.
  2. Feature Engineering: Although we did some basic feature engineering before selecting the features, we did not create new features after our feature selection. This would potentially help in improving our model performance.
  3. Size of Dataset: Our dataset had only 42k samples, which reduced to 34k after preprocessing. This is not much data for making very good predictions. One approach I tried was to combine the general and customer population datasets to create a larger training set. This did not improve performance; a better feature selection methodology would likely be needed to make it work.

Closing Remarks

In this blog post, we discussed ways of preprocessing high-dimensional data. We then used unsupervised learning to segment customers into groups and understand their demographics. Lastly, we looked at the supervised task of predicting target customers from our data. Hopefully, this analysis adds to the wealth of machine learning knowledge available online.

Rachneet Sachdeva
The Startup

Passionate software engineer with 2+ years of experience leveraging machine learning for real-world text and audio processing tasks.