Improving FinTech customer retention through data science

Published in Analytics in Action @ Columbia Business School, May 22, 2023

A look at how we leverage data to predict the probability of a user churning at a FinTech company, enabling them to improve retention

Retention is one of the most important goals for any tech company. A strong customer retention rate directly impacts the company’s financials. Customer retention data provides clues to how to serve users better.

Over the last few months, we, a group of MBAs and engineering students from Columbia Business School, worked with a leading FinTech to help predict churn.

The company we worked with is Brigit. Brigit helps everyday Americans build a brighter financial future. Its app features a suite of products geared towards this common goal. Some of these features are available free of cost, while others are accessible through a $9.99 monthly subscription.

Since their launch, Brigit has helped more than 4 million people relieve their financial stress. As they keep growing, they don’t lose focus on their current user base and their retention.

Brigit partnered with us through Columbia Business School’s Analytics in Action course to delve deep into user retention and to help them predict whether and when a particular user will churn.

We first conducted analyses to understand the problem at a high level

Before starting to predict, we wanted to understand the size of the problem. We did a cohort analysis to find out what percentage of users churned month over month.
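For readers who want to try a cohort analysis themselves, here is a minimal sketch in pandas. The table layout, the column names (`signup_month`, `churn_month`), and the toy rows are all assumptions made for illustration; they are not Brigit's data or schema.

```python
import pandas as pd

# Hypothetical records: one row per user; churn_month is None if still active.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "signup_month": ["2022-01", "2022-01", "2022-01",
                     "2022-02", "2022-02", "2022-02"],
    "churn_month": ["2022-02", None, "2022-03", "2022-03", None, None],
})


def months_between(start, end):
    """Number of calendar months from start to end (both 'YYYY-MM' strings)."""
    start, end = pd.Period(start, "M"), pd.Period(end, "M")
    return (end.year - start.year) * 12 + (end.month - start.month)


def cohort_churn(df):
    """Share of each signup cohort that churned N months after signing up."""
    cohort_sizes = df.groupby("signup_month").size()
    churned = df.dropna(subset=["churn_month"]).copy()
    churned["months_to_churn"] = [
        months_between(s, c)
        for s, c in zip(churned["signup_month"], churned["churn_month"])
    ]
    counts = (
        churned.groupby(["signup_month", "months_to_churn"])
        .size()
        .unstack(fill_value=0)
        # Keep cohorts with no churn at all as rows of zeros.
        .reindex(cohort_sizes.index, fill_value=0)
    )
    return counts.div(cohort_sizes, axis=0)


table = cohort_churn(users)
print(table)
```

Each row is a signup cohort and each column is the number of months since signup, so reading across a row shows how quickly that cohort loses users.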

A Brigit user can churn in two ways. One, they can cancel their subscription. Two, they can miss their subscription payments. From the data, we learned that the majority of churned users churn via the latter.

We therefore define a churned user as one who (i) canceled their subscription or (ii) missed two subscription payments in a row.
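The churn definition translates directly into a labeling rule. The sketch below is one way to write it; the function name, its arguments, and the representation of payment history as a list of booleans are assumptions for illustration, not the project's actual code.

```python
def is_churned(canceled, payment_history, max_missed=2):
    """Label a user as churned under the two-part definition.

    canceled: True if the user canceled their subscription.
    payment_history: booleans ordered oldest to newest, True = payment made.
    max_missed: consecutive missed payments that count as churn.
    """
    if canceled:
        return True
    streak = 0
    for paid in payment_history:
        # Reset the streak on a successful payment, extend it on a miss.
        streak = 0 if paid else streak + 1
        if streak >= max_missed:
            return True
    return False


print(is_churned(False, [True, False, False]))  # two misses in a row
print(is_churned(False, [False, True, False]))  # misses are not consecutive
```

Keeping track of which branch fired (cancellation or missed payments) is what makes it possible to train a separate model for each type of churn.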

Getting the data ready for modeling

The next step was pre-processing the data. We created churn labels, did data cleaning, dealt with missing values, corrected variable types, and removed highly correlated features.
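As an illustration of two of these steps, the sketch below fills missing values and drops one feature from each highly correlated pair. The median imputation, the 0.9 correlation threshold, and the feature names are assumptions chosen for the example; the article does not say which choices the team made.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic features with hypothetical names.
rng = np.random.default_rng(0)
inflow = rng.normal(size=200)
features = pd.DataFrame({
    "monthly_inflow": inflow,
    "monthly_inflow_scaled": inflow * 2 + rng.normal(scale=0.01, size=200),
    "tenure_months": rng.normal(size=200),
})
features.loc[0, "tenure_months"] = np.nan  # simulate a missing value

# Assumed imputation strategy: fill numeric gaps with the column median.
features = features.fillna(features.median(numeric_only=True))

reduced, dropped = drop_correlated(features)
print(dropped)
```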

With the data in the desired shape, we were ready to build the model. The first thing we decided, together with Brigit, was that we needed two separate models, one for each type of churn.

Since the majority of users churned by missing their payments, the model to predict that was our primary focus.

The missed payments model performed great

We started with the basics: logistic regression and kNN models. Both worked well for the non-payment type of churn; in technical terms, they achieved high precision and recall (~0.80).
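To make the baseline step concrete, here is a minimal scikit-learn sketch that fits both model types and reports precision and recall on a held-out set. It runs on synthetic data from `make_classification`, so the scores it prints say nothing about the results above; the feature scaling and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn dataset (label 1 = churned).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Both models are sensitive to feature scale, so each gets a scaler.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }
    print(name, scores[name])
```

Precision answers "of the users flagged as churners, how many actually churned?", while recall answers "of the users who churned, how many did we flag?".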

Later on, we tried two more advanced models, Random Forest and XGBoost, with great results. The models were very robust, performing well on every metric.
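One way to check a model against several metrics at once is cross-validation with multiple scorers. The sketch below does this for a random forest in scikit-learn on synthetic data; the metric list and hyperparameters are assumptions. XGBoost is left out to keep the example dependency-free, although its `XGBClassifier` follows the same scikit-learn fit/predict convention and could be swapped in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the churn dataset (label 1 = churned).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

metrics = ["precision", "recall", "f1", "roc_auc"]
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation, scoring every fold on all four metrics.
cv = cross_validate(forest, X, y, cv=5, scoring=metrics)
summary = {m: cv[f"test_{m}"].mean() for m in metrics}

for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```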

The other model proved to be trickier

As mentioned before, the large majority of users do not cancel their subscriptions. This is good news for the business, but it presents a challenge when building a predictive model: data imbalance.

To deal with the imbalance, we downsampled the majority class, upsampled the minority class, and tried the SMOTE method, but none of these approaches worked. For every model we trained, precision and recall were poor.
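For reference, random downsampling and upsampling can be sketched with `sklearn.utils.resample`, as below. The 5% minority share, the helper name `rebalance`, and the convention that label 1 is the minority class are assumptions for the example. SMOTE is not shown; an implementation is available in the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced data: 50 positives (cancellations) out of 1000 users.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)


def rebalance(X, y, strategy="down", random_state=0):
    """Return a class-balanced copy of (X, y); label 1 is the minority."""
    X_min, X_maj = X[y == 1], X[y == 0]
    if strategy == "down":
        # Shrink the majority class to the minority size, without replacement.
        X_maj = resample(X_maj, replace=False, n_samples=len(X_min),
                         random_state=random_state)
    elif strategy == "up":
        # Grow the minority class to the majority size, with replacement.
        X_min = resample(X_min, replace=True, n_samples=len(X_maj),
                         random_state=random_state)
    else:
        raise ValueError("strategy must be 'down' or 'up'")
    X_bal = np.vstack([X_min, X_maj])
    y_bal = np.concatenate([np.ones(len(X_min), dtype=int),
                            np.zeros(len(X_maj), dtype=int)])
    return X_bal, y_bal


X_down, y_down = rebalance(X, y, "down")
X_up, y_up = rebalance(X, y, "up")
print(len(y_down), len(y_up))
```

Resampling should be applied to the training split only, so that the test set keeps the real class proportions.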

Finally, in coordination with Brigit, we decided to deprioritize this model and focus on understanding the most relevant features behind the non-payment model.

We dug deeper to better understand the most important features

We found that the month of the year is a key variable and that the churn rate accelerated around mid-year. We also confirmed the intuitive relationship between missing subscription payments and missing loan payments, as well as the relationship between a user’s monthly inflow and their ability to pay.
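One common way to rank features is permutation importance: shuffle one feature at a time and measure how much the model's test score drops. The article does not say which technique the team used, so the sketch below is only an illustration. The data is synthetic, the feature names are hypothetical, and the label is deliberately constructed from two of the features, so the ranking it prints does not reproduce the findings above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "missed_loan_payments": rng.integers(0, 4, size=n),
    "monthly_inflow": rng.normal(2000, 500, size=n),
    "noise": rng.normal(size=n),  # control feature unrelated to the label
})
# Synthetic label, arbitrarily tied to missed loan payments and low inflow.
y = ((X["missed_loan_payments"] >= 2) | (X["monthly_inflow"] < 1500)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and average the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```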

Final thoughts

We learned a great deal from this project. Perhaps the most important lesson is that such a modeling exercise is highly iterative. Arriving at the right model entailed multiple rounds of data pre-processing and trying out various modeling techniques.