How to Know Your Customers through Data.

Huachen Chen
Published in Analytics Vidhya · 6 min read · Apr 26, 2021
Know Your Customers. Image source: convergehub.com.

Understanding customers is the key to successful business strategies and decision-making. With the ever-growing amount of customer data these days, data science and machine learning have become indispensable tools to extract information about customers.

So, how can we use data to know our customers better and gain business insights?

In this post, I will present two questions about customers that data can help us answer. We analyze and model a credit card customer dataset and explore the following:

(1). Customer Segmentation: Is there a clustering structure hidden in the customer dataset?

(2). Customer Churn Prediction: Can we predict which customers will stop using our product in the future?

A Snapshot of the Dataset.

The dataset we analyze contains data on 10,000+ customers of a credit card company.

It contains roughly two categories of information:

a). Demographic information about each customer, such as their age, gender, marital status, education, and income level.

b). Credit card usage information, such as their credit limit, transaction amounts, utilization ratio, and attrition (that is, whether or not they have closed their card).

Acknowledgment. The dataset is available on Kaggle thanks to Sakshi Goyal.

Part 1: Customer Segmentation.

Customer Segmentation. Image Source: eventsair.com

Nowadays, news apps on our phones are smart enough to group different news articles with similar topics into a cluster. The idea behind customer segmentation is the same. We would like to group different individuals with similar traits, such as purchasing habits or preferences, into clusters. Good customer segmentation is vital for precision marketing.

Traditionally, segmentation can be done by manually examining customer information and hand-picking grouping criteria. With machine learning techniques, we now have another option. We can feed the data to some algorithm(s) and let the machine explore the clustering structure in the data.

And that is the first problem we can tackle using data.

There are mainly two challenges in this task:

(i). Preparing data: what patterns the machine will learn largely depends on what data we feed it. How to properly prepare our data is the first question we wrestle with (a minimal preparation sketch follows after this list).

(ii). Interpreting the outcome: even if the machine does report back some clustering structure it finds, it may not be easy to translate that into human-understandable language.
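As a concrete starting point, here is a minimal data-preparation sketch. It assumes the Kaggle CSV ("BankChurners.csv") and its column names (CLIENTNUM, Gender, Credit_Limit, etc.); the actual preprocessing used in the analysis may differ and is detailed in the GitHub repo.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("BankChurners.csv")          # Kaggle file name (assumed)

# Drop the customer ID column; it carries no behavioral information.
df = df.drop(columns=["CLIENTNUM"])

# Encode gender as in the tables below (male = 0, female = 1) and turn the
# remaining categorical columns into integer codes so the clustering
# algorithm receives a purely numeric matrix.
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Standardize all features: without scaling, large-valued columns such as
# Credit_Limit would dominate the distance computations.
X = StandardScaler().fit_transform(df)
```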

What do we find?

Our analysis reveals two clustering structures with interesting insights. One separates our customers into two groups. The other presents three groups.

CASE I: Two Groups.

When we choose the number of groups to be 2, the model is able to learn the customers’ financial status and group them based on that!
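The post does not name the specific clustering algorithm, so purely as an illustration, here is how a two-group k-means clustering could be fitted on the scaled matrix X from the preparation sketch above:

```python
from sklearn.cluster import KMeans

# Fit a two-cluster k-means model on the standardized features.
# (Illustrative choice; the algorithm actually used in the analysis may differ.)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X) + 1   # label clusters 1 and 2 as in the tables
print(df["cluster"].value_counts())
```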

scatter plot 1: two clusters

In the scatter plot above, each point represents a customer and the color of a point indicates which group our model thinks they belong to.
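Such a plot can be produced by coloring customers by their cluster label. The axes of the actual figure are not specified, so the two features chosen below are purely illustrative:

```python
import matplotlib.pyplot as plt

# Illustrative axes only; the original plot may use different features
# or projected components.
plt.scatter(df["Credit_Limit"], df["Total_Trans_Amt"], c=df["cluster"], s=8)
plt.xlabel("Credit_Limit")
plt.ylabel("Total_Trans_Amt")
plt.title("Customers colored by cluster")
plt.show()
```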

To translate that into human language, we look at the group means of each variable for both clusters and compare them, as shown in the table below.

For example, the twelfth row, "Credit_Limit", indicates that the average credit limit for cluster 1 is about 20,414, while that for cluster 2 is about 4,187, which is a big gap! The red bar next to it visualizes the relative difference (difference divided by sum) and gives a clear sign of the difference between the two groups.

Table of means for the two clusters.
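A table like this can be computed directly from the cluster labels; the relative-difference column (difference divided by sum) is what drives the red bars. A sketch, continuing from the k-means example above:

```python
# Per-cluster means of every variable, plus the relative difference
# (difference divided by sum) that the red bars visualize.
means = df.groupby("cluster").mean().T          # rows = variables, columns = clusters
means.columns = ["cluster_1", "cluster_2"]
means["relative_diff"] = (means["cluster_1"] - means["cluster_2"]) / (
    means["cluster_1"] + means["cluster_2"]
)
print(means.round(2))
```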

An Insight:

From this table, we understand that customers in cluster 1 are those who have higher income, credit limits, and buying capacity, but a lower utilization ratio, and are more likely to be male (for gender, male=0, female=1).

Our model seems to indicate a financial inequality between different genders among our customers!

CASE II: Three Groups.

When we choose the number of groups to be 3, our model basically subdivides cluster 2 from the previous case into two groups based on their marital status.

Table of means for the three clusters.

In other words, in case II, we have three clusters. Cluster 1 here is roughly the same as cluster 1 in case I. And cluster 2 and cluster 3 here basically add up to cluster 2 in case I. Most of the customers in cluster 2 are single, and all married customers belong to cluster 3.

(Note that the grouping structure here has a lot to do with how we prepare our data.)

The relations between the case I and case II clusterings can be visualized with the following two additional scatter plots.

Comparing the one below to scatter plot 1 (the green-blue plot) above, we see that cluster 2 (purple) and cluster 3 (black) blend together and roughly take up the space of the green points.

From a different angle (or say a different slice of the data cloud), we see a clear separation of cluster 2 (purple) and cluster 3 (black).

Another insight:

Between cluster 2 (single customers) and cluster 3 (married customers), there is a noticeable difference in the average total transaction amount.

Is that a significant difference or just some random fluctuation?
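One way to check (a sketch; the post does not say which test was used, so a two-sample Welch t-test on the total transaction amounts is assumed here, after re-fitting the model with three clusters):

```python
from scipy import stats
from sklearn.cluster import KMeans

# Re-fit with three clusters and compare the total transaction amounts of
# cluster 2 (mostly single) and cluster 3 (married). Note that k-means
# numbers its clusters arbitrarily, so the labels may need re-mapping to
# match the tables above. Welch's t-test is an assumption here; the original
# analysis may have used a different test.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X) + 1
single = df.loc[df["cluster"] == 2, "Total_Trans_Amt"]
married = df.loc[df["cluster"] == 3, "Total_Trans_Amt"]
t_stat, p_value = stats.ttest_ind(single, married, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```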

A statistical test tells us that the difference is indeed significant. So perhaps another insight we learn is:

“Getting Married Saves You Money !”

which is totally counter-intuitive to me.

Warning: we should point out that the above conclusion deserves more careful thought about causation, so it should be taken with a grain of salt!

For more details about the analysis and result, see this GitHub repo.

Part 2. Customer Churn Prediction.

Customer Churn. Image Source: nextommerce.com

Customer churn rate is the annual percentage of customers who stop using a service. Being able to predict which customers will leave is a much sought-after superpower in business, because we can then come up with effective strategies to retain our customers before they leave and identify problems before they cause irreversible damage.

For example, in our dataset, about 16% of customers are no longer with us!
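This figure comes straight from the attrition flag in the data; a quick sketch (assuming the Kaggle column "Attrition_Flag" with values "Existing Customer" / "Attrited Customer"):

```python
import pandas as pd

df = pd.read_csv("BankChurners.csv")

# Share of attrited customers; "Attrited Customer" is the label assumed
# to be used in the Kaggle dataset.
churn_rate = (df["Attrition_Flag"] == "Attrited Customer").mean()
print(f"Churn rate: {churn_rate:.1%}")   # roughly 16% for this dataset
```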

A barplot illustrating the churn rate.

So, this is the second problem we can solve using data:

We can train machine learning models to predict which customers are more likely to leave, based on their demographic and other card usage information.
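As a hedged sketch of how such a model could be trained and evaluated, here is a minimal pipeline using scikit-learn's gradient boosting as a stand-in for the tree-based boosting model discussed below (the actual feature engineering, model, and tuning live in the GitHub repo and may differ):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Target: 1 if the customer has closed their card, 0 otherwise.
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)

# Features: everything else, with categoricals encoded as integer codes.
X = df.drop(columns=["CLIENTNUM", "Attrition_Flag"])
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Recall:  ", round(recall_score(y_test, pred), 3))
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
```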

How good is our model?

After a long and iterative process of human learning (mining the data) and model tuning, we selected a tree-based boosting model that has the following scores on a test set:

  • 90.9% Recall, which means among those customers that indeed left, the model can detect about 9 out of 10.

In a business context, being able to identify as many of those who will leave as possible is often the most valuable ability. This recall score captures exactly that. We customized our model selection criterion with an emphasis on recall to get this result (see the sketch after this list).

  • 97% Accuracy, which means we correctly predict if a customer will leave or not for 97% of them.
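One way to encode that emphasis on recall during model selection is to tune hyperparameters with recall as the scoring metric. This is a sketch only; the exact selection criterion used in the repo is not specified, and the grid below is illustrative:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter search that selects the model maximizing cross-validated
# recall, mirroring the emphasis described above (illustrative grid only).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```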

For more details about our models, see this GitHub repo.

Summary

In this post, we saw two example questions about customers that can be answered using data. They are:

(1). Customer Segmentation: we can use data and machine learning methods to detect meaningful clustering structures among our customers.

(2). Customer Churn Prediction: we can predict which customers will stop using our product with great accuracy.

For more details about our analysis, see the GitHub repo.

Thank you for reading! Your comments are most welcome and appreciated!
