Why Do Customers Stop Doing Business With a Bank?

This article describes how we explored and clustered a banking dataset through visualization, statistical analysis, and principal component analysis to build a classification model that predicts whether a bank’s customers will churn.

Noah Mukhtar
Feb 9 · 13 min read

A Brief Overview Of The Banking Sector


The 2008 financial crisis transformed the banking sector’s strategy toward its customers. Before the crisis, banks were fixated on acquiring more and more customers. After the market collapsed, banks quickly learned that acquiring a new customer costs 7 times more than retaining an existing one, which means losing customers can be financially very damaging.

Fast-forward to today: the global banking industry has a market cap of $7.6 trillion, and technology and legislation make it easier than ever to move assets and money between banks. These changes have also brought new forms of competition, such as open banking, neo-banks, and fin-tech firms. Altogether, this wealth of choice makes it easier than ever for today’s consumers to switch or leave banks.

Research also indicates that returning customers are likely to spend 67% more on a bank’s products and services, further highlighting the importance of understanding why customers churn and how churn differs across countries, ages, credit scores, etc.

Solution

Data is an invaluable asset for banks looking to navigate today’s volatile environment. Analytics can help a bank differentiate itself and regain its competitive edge by gaining a better understanding of its customers.

Dataset

We will be using a bank’s dataset that consists of 10,000 customers from France, Spain, and Germany, and includes the following variables:

Banking Dataset’s Variables: CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited

Exploratory Data Analysis

Our end goal here is not only to better understand our bank’s dataset through visualization, but also to select and create new predictors that will improve our classification model.

Customer Churn in Banking

“When a client ends their relationship with us by switching to another bank.”

Predictors

Age

Younger customers seem less likely to leave a bank. They are typically less financially experienced and do not have the same wealth of choice as middle-aged customers, since they have not yet had the time to build up their credit score.

Moreover, younger customers have less need to switch, as most of the amenities banks offer this age group tend to be the same across the industry. Older customers, by contrast, may start thinking about pensions, inheritance, taxes, etc., which means they will focus on finding the best deal regardless of the bank.

Finally, the older generation may value customer service and real-life interaction more than the younger demographic, who prefer as little real-life interaction as possible. Today’s banks are heading toward the latter with digitization, so older customers may leave their bank for one that better offers those traditional values and services.

All in all, churn rates peak in the 50–60 age group, followed by 60–70 and then 40–50.

Credit Score & Credit Score Given Age

Credit scores act as an indicator of a consumer’s creditworthiness. Those with lower credit scores are usually charged higher interest rates and insurance premiums and are more likely to be denied mortgage, loan, and credit card applications, which in theory should give them more reason to switch banks.

Therefore, we grouped customers into three credit score bands based on FICO’s rating system: “Poor/Fair” [350–669], “Good” [670–739], and “Excellent” [740–850].

As expected, customers with “Poor/Fair” credit scores saw a substantially higher churn rate at 26.67%, while those with “Good” and “Excellent” scores trailed at 20.59% and 15.38% respectively.
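The banding and per-band churn rate described above could be sketched roughly as follows. This is a minimal illustration rather than the code used for the analysis: the function names and the toy rows are assumptions, and only the band edges come from the article.

```python
from collections import defaultdict

# Band edges from the article: Poor/Fair 350-669, Good 670-739, Excellent 740-850.
def credit_band(score):
    """Map a credit score to one of the article's three bands."""
    if score < 670:
        return "Poor/Fair"
    if score < 740:
        return "Good"
    return "Excellent"

def churn_rate_by_band(rows):
    """rows: iterable of (credit_score, exited) pairs, with exited being 0 or 1."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for score, exited in rows:
        band = credit_band(score)
        totals[band] += 1
        churned[band] += exited
    return {band: churned[band] / totals[band] for band in totals}

# Hypothetical toy rows, not taken from the bank's dataset.
sample = [(600, 1), (640, 0), (700, 0), (720, 1), (800, 0), (760, 0)]
rates = churn_rate_by_band(sample)
```

On the real data, `rows` would be built from the CreditScore and Exited columns.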

Credit Score Per Age

However, unlike the US, Europe has no standard practice for calculating a credit score, and since our dataset spans 3 different European countries, we needed a workaround. Luckily, according to research, 15% of a credit score is based on the age of a customer’s credit history.

As a result, we created a new predictor, the credit score per age, and it successfully explains the pattern of churn probability per age group: the higher probability of churn is not solely due to customers’ age, but to their credit score given their age group, i.e. how financially responsible they are for their age demographic.
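As a rough sketch, the new predictor can be read as a simple ratio. The article only says "credit score per age", so dividing CreditScore by Age is an assumption here, and the function name is hypothetical.

```python
def credit_score_given_age(credit_score, age):
    """Credit score divided by age (assumed reading of 'credit score per age')."""
    if age <= 0:
        raise ValueError("age must be positive")
    return credit_score / age

# Two hypothetical customers with the same score but different ages.
younger = credit_score_given_age(700, 25)
older = credit_score_given_age(700, 50)
```

Under this reading, the same credit score counts for less as a customer gets older, which is the sense in which the predictor measures financial responsibility relative to age.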

Is Active Member

This is the most critical variable: a binary indicator of whether a customer is “active”, meaning involved in the bank’s services and using its channels, whether online, by phone, or through appointments.

Non-active members were 30% more likely than active ones to leave a bank. Moreover, non-active members also seemed to be less financially responsible as their credit scores were lower.

Balance Salary Ratio

Because of this large difference, we decided to incorporate activity into a new ratio called the Balance Salary Ratio, which gives a more accurate representation of customers’ balance relative to their salary. As a worldwide benchmark we used the 80/20 rule, which is to put 80% of your salary into your bank account; by our calculation, that gives a balance salary ratio of 5.

Our dataset’s active members had a ratio of 4.54 compared to 3.17 for non-active members. This suggests that non-active members’ money may have been flowing to more than one bank because their low creditworthiness kept them from getting certain products at a single bank, or that they simply had poor money management skills.
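The comparison above can be sketched as follows, assuming the ratio is simply Balance divided by EstimatedSalary, which the article implies but does not spell out. The helper names and toy customers are hypothetical.

```python
from statistics import mean

def balance_salary_ratio(balance, estimated_salary):
    """Balance relative to salary (assumed definition: Balance / EstimatedSalary)."""
    if estimated_salary <= 0:
        raise ValueError("estimated_salary must be positive")
    return balance / estimated_salary

def mean_ratio_by_activity(customers):
    """Average the ratio separately for active (1) and non-active (0) members."""
    groups = {0: [], 1: []}
    for c in customers:
        ratio = balance_salary_ratio(c["Balance"], c["EstimatedSalary"])
        groups[c["IsActiveMember"]].append(ratio)
    return {flag: mean(values) for flag, values in groups.items() if values}

# Hypothetical toy customers, not rows from the bank's dataset.
toy = [
    {"Balance": 120000.0, "EstimatedSalary": 40000.0, "IsActiveMember": 1},
    {"Balance": 50000.0, "EstimatedSalary": 50000.0, "IsActiveMember": 1},
    {"Balance": 30000.0, "EstimatedSalary": 60000.0, "IsActiveMember": 0},
]
ratio_by_activity = mean_ratio_by_activity(toy)
```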

Number of Products

Additionally, those with “excellent” and “good” credit scores had 15.36% and 7.14% higher chances, respectively, of having more than one product with the bank. Customers with 2 products appeared to be the ideal customers, as they had the highest retained-to-churned ratio. Even though customers with 1 product were relatively riskier than those with 2, the majority would still be a safe bet in terms of customer retention.

However, anything more than 2 products almost always meant customers were biting off more than they could chew, and as a result they were more likely to leave the bank.

Tenure, Balance, and Salary

We created a new variable called Tenure by Age, and visualized Tenure by Age and Balance against the Number of Products. Our findings indicate that the more products a customer had, the less variance there was in their tenure by age and balance, building on our earlier understanding of the interaction between these variables. In general, the higher the tenure by age, the fewer products a customer had.

Additionally, the more products a customer owned, the more likely their balance was to fall within less extreme values relative to the average customer. The same applies to Estimated Salary and Balance Salary Ratio, but to a lesser extent.

Creating Our Prediction Model

To confirm what we found while exploring the data, we ran a Random Forest algorithm to see which predictors are really the most significant in predicting churn.

According to the variable importance plot, the number of products is by far the best predictor, followed closely by a predictor we created: credit score given age. Additionally, geography and gender seemed like two predictors worth analyzing in more depth.

Backwards Elimination: Creating The Logistic Regression Model

We created dummy variables for all our categorical predictors and excluded confidential variables such as the surname and customer IDs. We then ran the model iteratively, removing the most insignificant predictor and rebuilding the model each time, for a total of 4 iterations until we found the ideal mix.
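The preprocessing step above might look roughly like this. It is a sketch under stated assumptions: the column names `Surname` and `CustomerId` and the dummy name `geo_Spa` are hypothetical, while `geo_Ger` and `Gender_female` are the names used in the final model formula. France is treated as the baseline category.

```python
# Assumed column names for the confidential identifiers that were excluded.
DROPPED = ("Surname", "CustomerId")
CATEGORICAL = ("Geography", "Gender")

def encode_row(row):
    """Drop identifier columns and replace categorical columns with 0/1 dummies."""
    encoded = {
        key: value
        for key, value in row.items()
        if key not in DROPPED and key not in CATEGORICAL
    }
    encoded["geo_Ger"] = int(row["Geography"] == "Germany")
    encoded["geo_Spa"] = int(row["Geography"] == "Spain")  # hypothetical name
    encoded["Gender_female"] = int(row["Gender"] == "Female")
    return encoded

# Hypothetical customer record for illustration only.
row = {"CustomerId": 1, "Surname": "Doe", "Geography": "Germany",
       "Gender": "Female", "Age": 42}
encoded = encode_row(row)
```

Leaving out one category per variable (France, male) avoids perfectly collinear dummies in the logistic regression.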

However, we decided to stop at the balance salary ratio, as the model had a higher AIC and lower residual deviance compared to any of the subsequent iterations. Also, industry standard treats this ratio as a crucial indicator of a customer’s loyalty to the bank: the better the relationship, the more of their salary they would entrust to their bank.

Correlation Matrix

Clustering

Now that we have a solid understanding of our data and the way it interacts, we use principal component analysis to visually represent and confirm the correlation amongst our variables. It shows that balance and the number of products are negatively correlated, which supports our initial suspicion that anything more than 2 products is making up for something critically missing in the consumer’s creditworthiness.

Additionally, the consumer’s engagement, credit score, and credit score given age are negatively correlated with the probability of churning, which is consistent with our initial findings.

However, it is still vital to contextualize everything from a geographical and gender-oriented viewpoint. Principal component analysis tells us that while France & Spain behave similarly, Germany is clustered in an entirely different manner, which is why we added geo_Ger as one of our key predictors.

Geography

Compared to their French & Spanish counterparts, German customers tend to have higher credit scores, are older, have a longer tenure, have substantially higher balance salary ratios, and tend not to fall in the “danger zone” of 3–4 products.

However, they are relatively less active members, and their probability of churn is twice as high: 32% as opposed to 16% for France/Spain.
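Because "twice as high" and "16 points higher" describe the same gap in different ways, a small sketch of the two measures may help. The rates are the ones quoted above; the function names are hypothetical.

```python
def relative_risk(rate_a, rate_b):
    """How many times larger rate_a is than rate_b."""
    if rate_b <= 0:
        raise ValueError("rate_b must be positive")
    return rate_a / rate_b

def percentage_point_gap(rate_a, rate_b):
    """Absolute difference between two rates, in percentage points."""
    return (rate_a - rate_b) * 100

germany = 0.32        # churn probability quoted for Germany
france_spain = 0.16   # churn probability quoted for France/Spain

ratio = relative_risk(germany, france_spain)
gap = percentage_point_gap(germany, france_spain)
```

Here the ratio is about 2 while the gap is about 16 percentage points. It is worth keeping the two apart when reading the "X% more likely" figures elsewhere in the article.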

This may be explained by Germany’s GDP per inhabitant of 123, compared to France’s 104 and Spain’s 91. Additionally, German banking differs quite drastically from the rest of the world: German banks offer the lowest prices for banking products and services, operate in a low interest rate environment, and face recently tightened regulations that expand consumer protection.

Even so, fin-tech firms are threatening to take a staggering 15% of the German banking industry’s income over the next 5 years, putting competition at an all-time high.

It is also important to understand that Germany is the only one of the 3 countries in our dataset with a current account surplus, at 1.9 compared to France and Spain’s balances of -2.5, meaning that German customers are more likely to buy foreign goods than German ones. Furthermore, deficit economies are still fuelling demand for German products, creating a capital misallocation in countries such as Spain. This seems to be creating an unsustainable continental imbalance that is now starting to fall apart because of the US’s trade war.

This is evident in falling consumer confidence indicators, surveys, and expectations. German consumers’ waning confidence is reflected in their higher churn rates. In addition, according to the World Bank’s dataset, Germany has a higher education expenditure per capita and ranks higher in both the corruption index and the competitiveness ranking, which suggests German customers are more knowledgeable and thus more likely to churn in search of better opportunities than customers in France and Spain.

Gender

On average, women are more likely to churn regardless of the region. The disparity between genders was most evident in Spain and France, with women being 60% more likely to churn. German women had the highest churn rate of all gender/geography combinations at a staggering 38%, but it was only 10% higher than that of their male counterparts.

This may be attributable to women being more conservative than men when it comes to risk and reward, with 70% of women identifying as savers as opposed to investors. With the German economy and banks subject to intense speculation, it is sensible for them to seek alternatives and banks that are less dependent on the German economy.

Secondly, women across all regions may be more relationship-oriented, placing greater emphasis on the customer service banks offer, and more likely to switch based on recommendations.

Thirdly, given women’s substantially lower balance salary ratio of 2.9 compared to men’s 4.7, they could be bigger spenders who seek banks that offer more attractive credit terms. Therefore, we added gender as a predictor.

Testing Our Prediction Model

Boosting & Logistic Regression Analysis

Using a mixture of business acumen, principal component analysis, statistical analysis, and backwards selection, we produced a model that predicts churn based on the following formula, which we fit as both a boosted regression model and a logistic regression model:

Exited ~ CreditScore + Tenure + Balance + NumOfProducts + IsActiveMember + EstimatedSalary + Gender_female + geo_Ger + CreditScoreGivenAge
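For readers unfamiliar with how a logistic regression turns such a formula into a churn probability, here is a minimal sketch. The coefficients and intercept are purely hypothetical placeholders, not the fitted values from this analysis, and only two of the predictors are shown.

```python
import math

def churn_probability(features, coefficients, intercept):
    """Logistic model: sigmoid of the intercept plus the weighted sum of features."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two predictors; NOT fitted values from the article.
coefficients = {"IsActiveMember": -1.0, "geo_Ger": 0.8}
customer = {"IsActiveMember": 1, "geo_Ger": 1}
probability = churn_probability(customer, coefficients, intercept=-1.0)
```

A negative coefficient pulls the predicted probability down and a positive one pushes it up, which is how the signs of the fitted model would be read.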

The boosted regression model was trained on the first 5,000 rows of our dataset and used to predict the last 5,000 rows. We obtained a mean squared error of 0.2889. It is possible to lower the error to 0.2839 by removing our least significant predictors, geo_Ger and Gender_female. However, we believe that if we were to test our model on external data, the relative influence of those predictors would probably be the difference between a correct and a wrong classification.
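The evaluation setup described above, splitting the rows in order and scoring predictions with mean squared error, can be sketched like this. The boosted model itself is not reproduced; the predicted values below are hypothetical stand-ins for its output.

```python
def split_in_half(rows):
    """First half for training, second half for testing, preserving row order."""
    midpoint = len(rows) // 2
    return rows[:midpoint], rows[midpoint:]

def mean_squared_error(actual, predicted):
    """Average squared difference between actual labels and predictions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Stand-in for the 10,000 customer rows.
rows = list(range(10000))
train_rows, test_rows = split_in_half(rows)

# Hypothetical labels and predicted churn probabilities for four test customers.
actual = [1, 0, 0, 1]
predicted = [0.8, 0.1, 0.4, 0.5]
mse = mean_squared_error(actual, predicted)
```

A split by row order assumes the rows are not sorted in any meaningful way. If they were, a shuffled split would be the safer choice.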

Regardless, it is safe to conclude that credit score given age was our best predictor, and that the combined influence of EstimatedSalary, CreditScore, Balance, NumOfProducts, and the balance salary ratio far outweighed their individual influence.

Likewise, our logistic regression model also proved successful, with an accuracy of 82%. However, it was far better at classifying customers who stayed than those who churned, probably because of the imbalanced training dataset. We saw some improvement in classifying churners after rebalancing the training dataset to include an equally distributed share of gender, geography, and churn.
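The gap between overall accuracy and the ability to catch churners can be illustrated with a small sketch. The labels below are hypothetical and chosen only to show how an imbalanced sample can produce high accuracy alongside weak recall on the minority class.

```python
def classification_summary(actual, predicted):
    """Return accuracy plus recall for churners (1) and for stayers (0)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "churn_recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "stay_recall": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical sample: 8 customers who stayed, 2 who churned.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
summary = classification_summary(actual, predicted)
```

In this toy case the model is right 9 times out of 10 yet finds only half of the churners, which is why per-class recall is worth reporting next to accuracy.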

Principal Component Analysis

The numerical results of the PCA indicate that most of the variability found amongst customers who churned can be explained by two factors: their credit score given their age and their credit score. After accounting for the first principal component, most of the variability came from their balance and the number of products they had signed up for.

Conclusion & Recommendations

After creating our model and understanding the data, we have much better insights for banks trying to retain their customers.

Our findings are summarized as follows:

  • Banks should develop loyalty programs and retention campaigns targeted at customers who can still be saved, i.e. not those of an extremely old age, those with poor credit scores given their age, or anyone with a credit score below 475.
  • The age groups most likely to churn were 40 to 70 year olds. However, for the sake of customer lifetime value, trying to retain the 40–50 and 50–60 demographics seems to offer a higher return.
  • It is essential for banks to understand their customers. They need to build more predictors, similar to what we did with this dataset, to identify patterns that end up being more accurate than the original variables.
  • Customers in Germany have the highest churn probability, men and women included. However, Spanish and French women are 60% more likely to churn than their male counterparts, which means that French & Spanish banks need to allocate more resources to female-oriented promotions, for example by offering rewards cards, as our data suggests they are bigger spenders.
  • We have included a classification tree that any manager, regardless of region, can use to obtain the probability of churn. The tree is a simplified version that is not detailed enough to analyze an existing customer, so it acts more as a pre-requisite for approving a new one. A survey would be required to find out whether the customer is an active member with their pre-existing bank.
Classification Tree — Probability of Churn
  • The most important predictor was engagement, and our research indicated that women are more susceptible to leaving for better customer service or advisory services.

Overall, our prediction model should not act as a binary classification model, but more of an early warning system and explanatory tool that can aid a bank in refining its marketing efforts to target customers with more relevant campaigns, offers, etc. while building a personal connection.


Finally, even though the banking industry has one of the lowest churn rates compared to other sectors, the impact on profitability of losing a single customer is far higher than in most industries.


Data Driven Investor

from confusion to clarity, not insanity

Thanks to Justin Chan

Written by Noah Mukhtar

Masters in Analytics — McGill University | Analyst at Business Development Bank of Canada | CFA Level 2 Candidate | https://www.linkedin.com/in/nmukhtar/
