
A simple guide to creating Predictive Models in Python, Part-1

“If you torture the data long enough, it will confess” — Ronald Coase, Economist

Zahash Z
Nov 23, 2018 · 11 min read

This guide is the first part of a two-part series: this part covers Preprocessing and Exploration of the data, and the second covers the actual Modelling. The data set used here comes from superdatascience.com. Huge shout out to them for providing amazing courses and content on their website, which motivates people like me to pursue a career in Data Science.

Don’t focus too much on the code throughout this article; rather, get the general idea of what happens during the Preprocessing stage.

Also, this is a looong article, so don’t forget to grab some coffee before you dive in.

Part 1: Preprocessing and Exploration:


Preprocessing is a crucial part to be done at the very beginning of any data science project (unless someone has already done that for you). It includes dealing with NULL values, detecting outliers, removing irrelevant columns through analysis, and cleaning the data in general.

Let's look at how to do that in Python.

First, let's make the necessary imports. We will make a few more when required down the line.
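The original import cell isn't shown here, so this is a minimal sketch of what the rest of the walkthrough needs:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline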

The “%matplotlib inline” is a magic function in IPython and it will make your plot outputs appear and be stored within the notebook.

Now, let's load the data into Python as a pandas DataFrame and print its info along with a few rows to get a feel for the data.
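Something along these lines, assuming the file is named Churn_Modelling.csv (the usual name of the superdatascience.com churn data set):

# load the data set into a pandas DataFrame
df = pd.read_csv('Churn_Modelling.csv')

# column names, data types and non-null counts
df.info()

# first few rows to get a feel for the data
df.head()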

Output:


Note that this is a relatively clean data set. Many real-life data sets contain a ton of missing values, outliers and completely irrelevant columns. If you want to specifically take a look at how to clean a data set then check out my other notebook | post | article which is dedicated to that topic.

As we can see from the above summary, the data has 10,000 rows and 14 columns, where the last column ‘Exited’ is the label we have to predict.

No column has NULL values, but the data can still contain outliers such as -99999. We will look at some techniques to check for outliers down the line.

But first, let's remove the irrelevant columns like ‘RowNumber’, ‘CustomerId’ and ‘Surname’ because they are not used anywhere in modelling or analysis
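A sketch of that step (variable names here are illustrative):

# these three columns carry no predictive information, so drop them
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
df.head()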

Output:


Now we only have the data (columns) that we actually care about

Let's make a copy of the DataFrame that we will use in deep learning models. The reason for this will be explained in the ‘Artificial Neural Network’ part in ‘Modelling’ which is in Part-2.
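A single line does the job (the variable name is my own):

# untouched copy, kept aside for the deep learning model in Part-2
df_deep_learning = df.copy()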

Now, make a list of the numerical columns so that we can check them for outliers. Keep in mind that not all of them are continuous: ‘NumOfProducts’, ‘HasCrCard’ and ‘IsActiveMember’ are really categorical columns.
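The exact list used in the original gist isn't shown; the column names below follow the superdatascience churn data set:

# numerical columns to inspect for outliers; 'NumOfProducts', 'HasCrCard'
# and 'IsActiveMember' are really categorical, so the min/max/mean/median
# checks below matter mainly for the continuous ones
num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts',
            'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
print(num_cols)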

Output:

Detecting outliers can be done in many ways, one of which is plotting boxplots. But we will look at another approach: checking some general statistics about the columns, like the ‘min’, ‘max’, ‘mean’ and ‘median’. Be on the lookout for anything absurd in the ‘min’ and ‘max’ values (like ‘min’ = -99999), and also check whether the ‘mean’ and ‘median’ are close enough (in most cases they shouldn’t be too far apart).

Some domain knowledge also helps here: for example, the age should be at least 18 years because we are dealing with a bank, and no value in any continuous column should be negative.
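One way to get those statistics in pandas (a sketch, not necessarily the original code):

# min, max, mean and median of every numerical column;
# look for absurd values (e.g. -99999) and for a large mean/median gap
df[num_cols].agg(['min', 'max', 'mean', 'median'])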

Output:


We can observe that the ‘min’ EstimatedSalary is just 11.58 which is quite suspicious. Maybe it is just a broke college student without any job.
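To dig into that record, we can pull up the row with the minimum EstimatedSalary (a sketch):

# full record of the customer with the lowest estimated salary
df[df['EstimatedSalary'] == df['EstimatedSalary'].min()]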

Output:


The above small analysis shows that the person is actually a 45-year-old male who already has a credit card, a high credit score and a balance of almost 123 K. But he has an estimated salary of only 11.58, which is pretty weird. Maybe it is just an error in data collection, or maybe he just lost his job or possibly retired. We could consider it an outlier and delete that row from the Data Frame, but that is a judgement you have to make as a Data Scientist/Analyst.

Now get ready for some heavy analysis and visualizations ahead


Deciding which columns are relevant is a huge part of Feature Engineering. It requires some amount of domain knowledge, and doing it well increases the predictive power of any machine learning algorithm. Feature Engineering is an art in itself.

Let's decide if the ‘Gender’ column is relevant.

That is, whether either gender (M|F) has a greater impact on the label (‘Exited’). For example, if 70% of males choose to quit the bank (‘Exited’ = 1) while for females it is only 30%, then we can conclude that males are more likely to leave the bank and the ‘Gender’ column plays an important role in deciding the churn rates. Whereas, if it is around 30% for males and 33% for females, then the ‘Gender’ column doesn’t play much of a role because the churn rates are barely distinguishable by gender.
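Since ‘Exited’ is a 0/1 column, its mean within each gender group is exactly that churn fraction. A sketch of the groupby:

# fraction of customers who exited, per gender
df.groupby('Gender')['Exited'].mean()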

Output:

Let's visualize them
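For example, as a simple bar plot (the original may have used a different chart type):

# churn rate per gender as a bar chart
df.groupby('Gender')['Exited'].mean().plot(kind='bar')
plt.ylabel('Fraction of customers who exited')
plt.show()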

Output:


Around 25% of females and 16% of males chose to exit. The difference is not that big, but let's just include this column in the Data Frame. Again, this is a judgement you have to make as a Data Scientist/Analyst.

Let's now decide if the ‘Geography’ column is relevant by doing a similar analysis as above.
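The same groupby trick, swapping in ‘Geography’ (a sketch):

# fraction of customers who exited, per country
df.groupby('Geography')['Exited'].mean()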

Output:

Visualizing the results
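Again as a bar plot (the chart type is my choice, not necessarily the original's):

# churn rate per country as a bar chart
df.groupby('Geography')['Exited'].mean().plot(kind='bar')
plt.ylabel('Fraction of customers who exited')
plt.show()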

Output:


It can be clearly seen that customers from Germany left at roughly twice the rate of the other countries. So we should definitely keep this column in our Data Frame.

Let's now decide if the ‘Age’ column is relevant. This might be a bit overwhelming at first, but it is simple once you understand it.

Let's start off by plotting a histogram of the ‘Age’ column
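A sketch of the histogram (the bin count here is arbitrary):

# distribution of customer ages
df['Age'].plot(kind='hist', bins=20)
plt.xlabel('Age')
plt.show()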

Output:


It can be seen that most of the customers are from the age group ’30 to 40'

The below scatter plot shows the age of customers (y-axis) who did not quit (‘Exited’ = 0).
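The original plot's x-axis isn't stated, so this sketch simply plots age against the row index:

# ages of the customers who stayed ('Exited' = 0)
stayed = df[df['Exited'] == 0]
plt.scatter(stayed.index, stayed['Age'], s=2)
plt.ylabel('Age')
plt.show()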

Output:


We can see that most of the customers who did not quit the bank are in the age group ’20 to 45'

The below scatter plot shows the age of customers (y-axis) who quit the bank (‘Exited’ = 1).
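Same sketch, filtered the other way:

# ages of the customers who left ('Exited' = 1)
left = df[df['Exited'] == 1]
plt.scatter(left.index, left['Age'], s=2)
plt.ylabel('Age')
plt.show()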

Output:


We can see that most of the customers who exited are in the age group ’30 to 60'

!!! Do not let the above plots fool you !!!

We cannot be sure, because customers from the age group ‘30 to 40’ are more numerous and therefore have a greater impact on the plots. If every age group had the same number of people, then we could have trusted the above plots.

The following code gives the “percentage” of people who quit (‘Exited’ = 1) in each age group

First, bucketize the ‘Age’ column and use ‘groupby’ to create a group for each age bucket; then calculate the percentage of people who exited in each bucket, rounded off to 2 decimal places.
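A sketch of those two steps with pandas (the exact bucket edges used in the original aren't shown; 10-year bins from 10 to 100 match the buckets discussed below):

# bucketize the ages into 10-year bins: (10, 20], (20, 30], ..., (90, 100]
age_buckets = pd.cut(df['Age'], bins=range(10, 101, 10))

# percentage of people who exited in each bucket, rounded to 2 decimal places
exit_percentage = (df.groupby(age_buckets)['Exited'].mean() * 100).round(2)
print(exit_percentage)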

Output:

Let's just plot the above data to get a good sense
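Using the exit_percentage Series from the previous snippet:

# churn percentage per age bucket
exit_percentage.plot(kind='bar')
plt.ylabel('% of customers who exited')
plt.show()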

Output:


It can be clearly seen from the above plot that the age group ‘50 to 60’ has the highest percentage of people who quit the bank. From the earlier plots we might have been misled into thinking that ‘30 to 60’ has the most exited people, with the bulk at ‘40 to 50’.

The above analysis makes it clear that the ‘Age’ column is definitely relevant, but in the form of buckets. So let's convert the continuous ‘Age’ column into a categorical, bucketized one. Bucketizing basically means putting each value into its corresponding range.

For example, 33 belongs in the range 30–40, 67 belongs in the range 60–70 and so on. The ranges are decided by us: for example, we can have ranges 10–20, 20–40, 40–60, … or 10–35, 35–60, 60–85, …

Now, let's replace the ‘Age’ column with the bucketized column
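A sketch using pd.cut with the same 10-year bins as before:

# overwrite the continuous ages with their buckets
df['Age'] = pd.cut(df['Age'], bins=range(10, 101, 10))
df.head()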

Output:


One major part of Preprocessing is to encode the non-numeric and categorical columns, which in our case are ‘Geography’, ‘Gender’ and ‘Age’. You may ask, why ‘Age’? Well, ‘Age’ was a numeric column, but we bucketized it and turned it into a categorical column. The encoding process is called ‘dummy encoding’, where every unique value in a column gets a separate column by itself. You will understand this by looking at the below table.
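In pandas this is typically done with pd.get_dummies; whether the original used this or another encoder isn't shown:

# one 0/1 column per unique value of each categorical column
df = pd.get_dummies(df, columns=['Geography', 'Gender', 'Age'])
df.head()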

[Table: the DataFrame after dummy encoding, with new 0/1 columns such as ‘Geography_France’, ‘Geography_Germany’, ‘Geography_Spain’, ‘Gender_Female’, ‘Gender_Male’ and one column per ‘Age’ bucket]

Take a moment to notice that the categorical columns ‘Geography’, ‘Gender’ and ‘Age’ no longer exist in the table. The unique values in those columns got their own columns, like ‘Geography_France’, ‘Gender_Male’, ….. and they replaced their corresponding parent columns. These columns are also called Dummy Variables.

There are only two possible values in those new columns, ‘0’ and ‘1’. Let's look at the Geography section in the first row. We can see ‘Geography_France’ = 1, ‘Geography_Germany’ = 0 and ‘Geography_Spain’ = 0. It means the first row in the previous table (before dummy encoding) had ‘France’ in the ‘Geography’ column. You can verify that by scrolling up and looking at the older table. The same is true for all the encoded columns.

Remove one column from each set of created dummy variables to avoid the ‘dummy variable trap’.

“But what is Dummy Variable Trap ?”

It is a bit hard to explain the actual meaning of it in this article, but to keep it simple, let's take the example of ‘Gender_Male’ and ‘Gender_Female’ from the above table. We know that only one of the two columns has the value ‘1’ in a row. That is, a person cannot be both male and female. So, the possible values for the two columns in any given row are (1, 0) or (0, 1). Clearly, one column is the complement of the other. That is, if we are given the value of ‘Gender_Male’ then we can guess the value of ‘Gender_Female’ in any given row, and vice versa. So, having the second column is quite redundant. Therefore removing a column is fine and we MUST do it!!

The same is true for ‘Geography_France’, ‘Geography_Germany’ and ‘Geography_Spain’. The possible values for the three columns in any given row are (1, 0, 0), (0, 1, 0) or (0, 0, 1). Here, if we remove any one column, say ‘Geography_Spain’, we are still not losing any information because we can guess the value of the third column based on the first two. For example, if ‘Geography_France’ and ‘Geography_Germany’ are (1, 0) then ‘Geography_Spain’ is ‘0’, because only one of the three will have the value ‘1’ in any given row. Again, if ‘Geography_France’ and ‘Geography_Germany’ are (0, 0) then ‘Geography_Spain’ must be ‘1’.

Therefore we MUST remove any ‘one’ encoded column to avoid falling into the Dummy Variable Trap

Now, let's remove one of the encoded columns from each parent column. In our case, we choose ‘Geography_France’ from ‘Geography’ and ‘Gender_Female’ from ‘Gender’, but you can choose different columns if you want.

Don’t forget to also remove one of the bucketized ‘Age’ dummy columns, ‘Age_(90,100]’, by selecting all the columns except the last one.
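A sketch of the whole clean-up (the exact label of the last ‘Age’ dummy depends on how the buckets were created):

# drop one dummy per parent column to avoid the dummy variable trap
df = df.drop(columns=['Geography_France', 'Gender_Female'])

# the last column is the last 'Age' bucket, e.g. 'Age_(90, 100]';
# keep every column except that one
df = df.iloc[:, :-1]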

Let's take a look at the first five rows of the modified Data Frame
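A quick look, as before:

df.head()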


Save the Data Frame as a .csv file to use it later for modelling
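For example (the file name here is my own):

# persist the preprocessed data for the modelling stage in Part-2
df.to_csv('churn_preprocessed.csv', index=False)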

We have successfully processed the data and it is ready for modelling which we will see in Part-2.
