Churn Bank Predict

Remaking my first hackathon challenge

Murilo Eziliano
5 min read · Aug 23, 2022

A few months ago I published one of my first articles, right here on Medium, about my first contact with Data Science. The project, which I built with a few teammates during a hackathon offered by a course I took last semester, had the goal of predicting whether a bank customer would leave the service or not. In other words, the challenge was to create a Machine Learning model and deploy it for a banking institution.

It's funny to look back at that article and see how I dealt with my limited knowledge of data science, programming, and many other things. I tried to impress my teammates with Python libraries such as PyCaret for AutoML and Sweetviz for the EDA part.

Today I understand the need to get the best insights from the data through my own effort, not by trying to extract them with a single line of code. That's why I decided to remake this challenge with other interesting libraries and modules that Python offers.

This time I chose to focus on different libraries, such as SHAP, Streamlit, and Plotly, and to apply the loop structures that come built into Python.


Getting started with the data

The dataset is quite famous (ok, not Titanic or Iris famous), but there are many notebooks and very interesting studies about it, and it was very easy to find on Kaggle.

The project is divided into a few files. I decided to do this so I wouldn't lose focus at each step of the development and could better organize my own code. To begin, I set out to understand a few patterns in the dataset, and there were very interesting insights to look at.

First of all, the dataset has ten thousand records and fourteen columns: clients spread across three European countries, age, credit score, up to four products in use, and a few binary columns such as HasCrCard, IsActiveMember, and the target, Exited.

One of my greatest concerns with any dataset comes down to simple questions: Is there duplicated data? Are there null values? How many outliers are there? Are there too many zeroed values?

Maybe these questions belong further along in the data process; it's possible to begin with something easier, like: Which values are numeric? Which are text? To start answering these questions I always like to run a simple command, df.info(). Then it's possible to begin addressing my concerns. One of the biggest problems with Kaggle datasets is that most of the data available is already uniform enough to run Machine Learning scripts, so you don't have to care about much beyond understanding the data and running the models.
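As a minimal sketch of that first look (assuming the Kaggle file is named Churn_Modelling.csv, as in the usual upload of this dataset):

```python
import pandas as pd

# Load the Kaggle churn dataset (file name assumed here).
df = pd.read_csv("Churn_Modelling.csv")

# One call answers the first round of questions: dtypes,
# non-null counts per column, and memory usage.
df.info()

# Quick follow-ups for the other concerns above.
print(df.duplicated().sum())  # duplicated rows
print(df.isna().sum())        # null values per column
```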

Anyway, one of my newest discoveries about pandas is that there is a way to make tables more interesting and visually appealing; after all, charts are not the only way to show data, right? The pandas Styler is a very interesting way to display data. The downside of this method is that all the data gets rendered, so if you're dealing with ten thousand rows it's probably not a good idea to show every row and column, but it's very nice for checking small tables packed with information.

With the code below, it's possible to check how many zeroed values there are per column in the dataset.
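The original snippet was published as an image, so here is a minimal sketch of one way to get that result, counting zeros per column and handing the table to the pandas Styler (the gradient colormap is my own choice):

```python
# Count zeroed values per column of the dataset.
zeros = (df == 0).sum().to_frame(name="zero_count")

# Styler: a background gradient makes the columns with the
# most zeros stand out without plotting a chart.
zeros.style.background_gradient(cmap="Reds")
```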

And then I thought: why not apply this to one of my favorite pandas methods, df.describe()? Of course, it takes a few more lines of code to get a result like the one below…
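Something along these lines (the exact formatting choices below are mine, not necessarily the original article's):

```python
# Style df.describe(): transpose so each feature is a row,
# format the numbers, and color each statistic column.
(
    df.describe()
      .T
      .style
      .format("{:,.2f}")
      .background_gradient(cmap="Blues")
)
```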

Enjoying it? Because at this point it's time to plot some charts and go deeper into the exploratory data analysis.

So, thanks to Python's built-in functions combined with some data viz libraries like seaborn, it's possible to figure this out.

All clients with four products, without exception, had left the bank. To leave no doubt about this note, I executed df.groupby('NumOfProducts')['Exited'].mean() * 100 just to get a quick view of it. So if I had to prepare a report for the C-level, that information would certainly be in it.
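A sketch of that quick check, plus the seaborn plot that goes with it (column names follow the Kaggle dataset):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Churn rate (%) by number of products; the 4-product group hits 100%.
print(df.groupby("NumOfProducts")["Exited"].mean() * 100)

# Visual confirmation: churned vs. retained clients per product count.
sns.countplot(data=df, x="NumOfProducts", hue="Exited")
plt.show()
```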

Through the Exploratory Data Analysis (EDA), it was possible to see that the great majority of clients were in France and Spain, but the proportion of German customers who left is more worrying than in the other two countries. In numerical terms, 16.2% of French customers and 16.7% of Spanish customers had left the bank, while 32.4% of German customers had left. The other columns don't provide results as telling as these.
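The same one-liner pattern reproduces those percentages (a sketch, using the dataset's Geography column):

```python
# How the clients are distributed across the three countries...
print(df["Geography"].value_counts())

# ...and the churn rate (%) for each one.
print(df.groupby("Geography")["Exited"].mean() * 100)
```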

If I weren't going to develop a Machine Learning model, my job would end here, but it wasn't finished yet. When I made this project I was convinced that age was an important feature and that it was necessary to split the column to get better performance from the model, so I kept a few analyses of the age column.
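One way to do that split, as a sketch (the bin edges and labels here are illustrative, not the ones from the original project):

```python
# Bin the continuous Age column into ranges.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[18, 30, 40, 50, 60, 100],
    labels=["18-30", "30-40", "40-50", "50-60", "60+"],
)

# Churn rate (%) per age group.
print(df.groupby("AgeGroup", observed=True)["Exited"].mean() * 100)
```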

The Machine Learning

In the Machine Learning part, several models were trained, from simple linear models up to complex ones such as Neural Networks. But the model with the best performance among them all was the Gradient Boosting Classifier.

But to find the best hyperparameters without scikit-learn's GridSearchCV method, "an exhaustive search over specified parameter values for an estimator" as its own documentation puts it, I decided to apply one of the best sides of Python: a loop.

I was wondering how to do this, and the answer turned out to be a simple for loop: train the model once for each candidate value, record the score, and compare the results (there is a sketch of this idea right after the list below). I did this for a few Gradient Boosting parameters, like:

  • n_estimators
  • max_depth
  • min_samples_split
  • min_samples_leaf
  • max_features

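The original loop was published as an image, so here is a sketch of the idea for a single parameter, n_estimators (the feature list, the split, and the candidate values are assumptions of mine); the same pattern repeats for each parameter in the list above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Numeric features only, to keep the sketch short.
features = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts",
            "HasCrCard", "IsActiveMember", "EstimatedSalary"]
X, y = df[features], df["Exited"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One loop per hyperparameter: train, score, compare.
for n in [50, 100, 200, 300, 500]:
    model = GradientBoostingClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    print(f"n_estimators={n}: accuracy={score:.4f}")
```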
Having finished this project, and this article, I would like to comment on one topic that has simply blown my mind in recent months: after all this process, I really wanted to understand why the model, whose parameters I had spent months tuning, said that one particular client would leave while another would not. Model explainability and interpretability was a surprising topic for me. That's why I decided to use the SHAP library to figure out which features matter most for the output, and the result was stunning.
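A minimal SHAP sketch along those lines, reusing the model and validation set from the loop above:

```python
import shap

# TreeExplainer works directly with tree ensembles like Gradient Boosting.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Summary plot: global feature importance and the direction of each effect.
shap.summary_plot(shap_values, X_valid)
```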

And that's it!


Murilo Eziliano

Data analyst — pandas, dataviz and EDA library enthusiast