My First Contact with Data Science — Bank Customer Churn

Murilo Eziliano
5 min read · Sep 29, 2021


Photo by Franki Chamaki on Unsplash

Recently, I started reading a book that tells the story of a small bank in Virginia called Signet Bank. You have probably never heard of this bank, but you probably have heard of one of its offshoots: Capital One. Before Capital One became a great company, its founders had a great insight about the market. In the 80s, data science started to change the credit market, and computers became powerful enough to handle large amounts of data. Yet in the early 90s the market was still restricted: at that time, the credit card was the same for every client in the bank.

Fortunately, a few weeks after I started reading that story about Capital One, I had the opportunity to participate in a hackathon with the same challenge: Bank Customer Churn.

In the hackathon, my team and I received a file with the data. The dataset holds some information about clients who exited the company, and it contains 12 features, including the target, ‘Exited’.

With Python, my team and I read the file, which was available in .csv format, and started to work on the dataset. To manipulate the file and the data, we used the Python library Pandas. So we imported the dataset and looked at the first rows and columns. There are a few interesting Pandas commands for a first look at a dataset, commands which I used in this project, like:

df.info()

df.describe()

df.head()

df.tail()
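
Putting it together, a minimal sketch of this first look (the file name below is an assumption, not the original hackathon file):

import pandas as pd

# Load the dataset (file name is an assumption; replace with your own path)
df = pd.read_csv('bank_customer_churn.csv')

df.info()      # column types and non-null counts
df.describe()  # summary statistics for the numeric columns
df.head()      # first five rows
df.tail()      # last five rows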

The first question about any dataset must be about the dataset itself. Questions like: is the dataset complete? I mean, how much missing data do we have to deal with? In how many features? So I used another command, df.isnull().sum(), to check how many values are missing, and I complemented it with another command to check the percentage of missing values. This second command was:

pd.DataFrame(df.isnull().sum(axis=0)).sort_values(by=0, ascending=False) / df.shape[0] * 100

But why use df.describe() and df.info() after df.head() and df.tail()? There is no rule about which one to use first. But in this case the values were not null: they were filled with zero. Almost 40% of the data was filled with zero values.
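
A minimal sketch of how this zero check can be done (this exact snippet is my reconstruction, not the code we ran in the hackathon):

# Percentage of zero-valued entries per column, highest first
((df == 0).sum() / df.shape[0] * 100).sort_values(ascending=False)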

Then, after checking the dataset, the team and I started to look at the features and how they behave in relation to each other. In the Age column, I decided to slice the feature into categories.

One thing that made me think about the cause of the churning customers was the Age column in the dataset. To better understand this feature, I plotted a histogram. Let’s see it!
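
A minimal sketch of how such a histogram can be plotted with matplotlib (the chart image itself is not reproduced here):

import matplotlib.pyplot as plt

# Distribution of the Age feature
df['Age'].hist(bins=30)
plt.xlabel('Age')
plt.ylabel('Number of clients')
plt.title('Age distribution')
plt.show()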

So I created a new column and ran some code to split Age into a few categories: Senior, Young, and Adult. For this attempt, I wrote the code:

df['AgeCategory'] = 'Adult'
df.loc[df["Age"]>= 55, "AgeCategory"] = 'Senior'
df.loc[df["Age"]<= 25, "AgeCategory"] = 'Young'

display(df)

But the result wasn’t good enough, even when I tried another division, that time using a lambda function:

df['AgeGroup'] = df['Age'].apply(lambda x: 'Senior' if x >= 55 else 'Adult')

df.head()

When I finished this part of the code, a colleague told me the feature NumOfProducts had something interesting. So, to check this information as a percentage, I executed the code:

df.groupby('NumOfProducts')['Exited'].mean() * 100

The result was amazing: 100% of the customers with the maximum number of products had exited the bank, and almost 85% of those with 3 products had exited too.

EDA — Sweetviz

Time was passing and the answers had not been found. So, to save some time, I used a library named Sweetviz, which does EDA — Exploratory Data Analysis. That library plots graphics for all the features and shows their associations. If you don’t know this library, you should try it at least once. It has helped me a lot, not just in this hackathon, but in other projects too.

If you want to check out this library, I’ll leave the link below.

After executing the code, the output is a report with charts and associations for every feature.
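
For reference, a minimal sketch of how the Sweetviz report can be generated (the output file name is an assumption):

import sweetviz as sv

# Build the EDA report for the whole dataframe and save it as an HTML page
report = sv.analyze(df)
report.show_html('churn_report.html')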

Plotting some charts

After the hackathon finished, I decided to continue improving the analysis, and with the help of libraries like matplotlib and seaborn I could plot some charts to see how these features were distributed.

Charts like histograms, pie charts, a heat map, and a pair plot too.

With the last two charts, in other circumstances, it is possible to check the correlation between the features.

The pair plot has more detail than the heat map, which is more intuitive but contains fewer details.
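
A minimal sketch of how these two charts can be drawn with seaborn (my reconstruction, not the original plotting code):

import seaborn as sns
import matplotlib.pyplot as plt

# Heat map of the pairwise correlations between the numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Pair plot of the features, colored by the target
sns.pairplot(df, hue='Exited')
plt.show()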

In this dataset, with all these features, unfortunately, no strong correlation is visible in the graphs, so they don’t directly help to build a machine learning model. But they are powerful tools to drive any analysis.

Final Considerations

In the code, I wasn’t responsible for the machine learning part, but I could still help with an AutoML library named PyCaret, which helped the team a lot.

If you want to check out this library, I’ll leave the link below.
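
A minimal sketch of how PyCaret can be applied to this kind of classification problem (the exact setup my team used may have differed; session_id is just a hypothetical seed):

from pycaret.classification import setup, compare_models

# Initialize the experiment with 'Exited' as the target
exp = setup(data=df, target='Exited', session_id=42)

# Train several models and rank them by cross-validated performance
best_model = compare_models()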

Unfortunately, we didn’t win the challenge, but this experience showed me how, in a few hours of coding, I could help other people.

Thanks for reading!



Murilo Eziliano

Data analyst — pandas, dataviz and EDA library enthusiast