A simple guide to creating Predictive Models in Python, Part-1
“If you torture the data long enough, it will confess” — Ronald Coase, Economist
This guide is the first part of a two-part series: this part covers Preprocessing and Exploration of the data, and the second covers the actual Modelling. The data set used here comes from superdatascience.com. Huge shout out to them for providing amazing courses and content on their website, which motivates people like me to pursue a career in Data Science.
Don’t focus too much on the code throughout this article; rather, try to get the general idea of what happens during the Preprocessing stage.
Also, this is a looong article so don’t forget to grab some coffee with you.
Part 1: Preprocessing and Exploration:
Preprocessing is a crucial part to be done at the very beginning of any data science project (unless someone has already done that for you). It includes dealing with NULL values, detecting outliers, removing irrelevant columns through analysis, and cleaning the data in general.
Let's look at how to do that in Python.
First, let's make the necessary imports. We will make a few more when required down the line.
The “%matplotlib inline” is a magic function in IPython and it will make your plot outputs appear and be stored within the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Now, let's load the data into Python as a pandas DataFrame and print its info along with a few rows to get a feel for the data.
df = pd.read_csv("Churn_Modelling.csv")
df.info()
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber 10000 non-null int64
CustomerId 10000 non-null int64
Surname 10000 non-null object
CreditScore 10000 non-null int64
Geography 10000 non-null object
Gender 10000 non-null object
Age 10000 non-null int64
Tenure 10000 non-null int64
Balance 10000 non-null float64
NumOfProducts 10000 non-null int64
HasCrCard 10000 non-null int64
IsActiveMember 10000 non-null int64
EstimatedSalary 10000 non-null float64
Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
Note that this is a relatively clean data set. Many real-life data sets contain a ton of missing values, outliers and completely irrelevant columns. If you want to take a closer look at how to clean a data set, check out my other article, which is dedicated to that topic.
As we can see from the above summary, the data has 10000 rows and 14 columns, where the last column ‘Exited’ is the label we have to predict.
No column has NULL values, but the columns can still contain outliers like -99999. We will look at some techniques to check for outliers down the line.
But first, let's remove the irrelevant columns ‘RowNumber’, ‘CustomerId’ and ‘Surname’, because they are not used anywhere in modelling or analysis.
# removing the irrelevant columns
cols_to_drop = ["RowNumber","CustomerId","Surname"]
df = df.drop(columns=cols_to_drop)

# first five rows of the dataframe after removing the columns
df.head()
Now we only have the data (columns) that we actually care about
Let's make a copy of the DataFrame that we will use in deep learning models. The reason for this will be explained in the ‘Artificial Neural Network’ part in ‘Modelling’ which is in Part-2.
deep_df = df.copy(deep = True)
Now, make a list of the numerical columns so we can check them for outliers. Note that not all of them are truly continuous (‘NumOfProducts’, ‘HasCrCard’ and ‘IsActiveMember’ are really categorical columns stored as numbers).
numerical_columns = [col for col in df.columns if (df[col].dtype=='int64' or df[col].dtype=='float64') and col != 'Exited']
Detecting outliers can be done in many ways, one of which is plotting Boxplots. But we will look at another approach: checking some general statistics about the columns, like ‘min’, ‘max’, ‘mean’ and ‘median’. Be on the lookout for anything absurd in the ‘min’ and ‘max’ values (like ‘min’ = -99999), and also check whether the ‘mean’ and ‘median’ are close enough (in most cases they shouldn't be too far apart).
Some domain knowledge also helps here: the age should be at least 18 years because we are dealing with a bank, and no value in any continuous column should be negative.
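These checks boil down to a single call: `describe()` reports min, max, mean and the median (the ‘50%’ row) for every numeric column, so on our data it would simply be `df[numerical_columns].describe()`. A minimal sketch on made-up toy data (the column values here are invented purely for illustration):

```python
import pandas as pd

# Made-up stand-in for the churn data, with a deliberately planted -99999 outlier
toy = pd.DataFrame({
    "Age": [25, 40, 38, -99999],
    "Balance": [0.0, 1200.5, 98000.0, 45000.0],
})

# min/max/mean/50% (median) for every numeric column in one table;
# the absurd min of Age jumps out immediately
stats = toy.describe()
print(stats.loc[["min", "max", "mean", "50%"]])
```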
We can observe that the ‘min’ EstimatedSalary is just 11.58 which is quite suspicious. Maybe it is just a broke college student without any job.
df[df['EstimatedSalary'] == df['EstimatedSalary'].min()]
The above small analysis shows that the person is actually a 45-year-old male who already has a credit card, a high credit score and a balance of almost 123 K. Yet he has an estimated salary of only 11.58, which is pretty weird. Maybe it is just an error in data collection, or maybe he lost his job or retired. We could treat it as an outlier and delete that row from the DataFrame, but that is a judgement you have to make as a Data Scientist/Analyst.
Now get ready for some heavy analysis and visualizations ahead
Deciding which columns are relevant is a huge part in Feature Engineering. It requires some amount of Domain Knowledge and by doing so it increases the predictive power of any machine learning algorithm. Feature Engineering is an art by itself.
Let's decide if the ‘Gender’ column is relevant.
That is, whether either gender (M|F) has a greater impact on the label (‘Exited’). For example, if 70% of males choose to quit the bank (‘Exited’ = 1) while only 30% of females do, we can conclude that males are more likely to leave the bank and that the ‘Gender’ column plays an important role in deciding churn rates. If instead it is 30% for males and 33% for females, the ‘Gender’ column doesn't play much of a role, because the churn rates are barely distinguishable by gender.
percentages = []
for gen in df["Gender"].unique():
    # fraction of customers of this gender with Exited == 1, as a percentage
    p = round(df["Exited"][df["Gender"] == gen].mean() * 100, 2)
    percentages.append(p)
    print(gen, "(% to exit) : ", p)
Female (% to exit) : 25.07
Male (% to exit) : 16.46
Let's visualize them
plt.bar(range(len(percentages)), percentages)
plt.xticks((0, 1), ('Female', 'Male'))
plt.title("Percentage of gender to Exit")
plt.show()
Around 25% of females and 16% of males chose to exit. The difference is not huge, but let's include this column in the DataFrame anyway. Again, this is a judgement you have to make as a Data Scientist/Analyst.
Let's now decide if the ‘Geography’ column is relevant, by doing a similar analysis as above.
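As an aside, the same percentages can be computed without a loop: the mean of a 0/1 column is the fraction of ones, so a groupby-mean gives the churn rate per category in one line. A sketch on made-up toy data:

```python
import pandas as pd

# Made-up mini version of the churn data
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Exited": [0, 1, 1, 1],
})

# mean of the 0/1 'Exited' column per gender = fraction that exited
pct = (toy.groupby("Gender")["Exited"].mean() * 100).round(2)
print(pct)  # Female 100.0, Male 50.0
```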
percentages = []
for country in df["Geography"].unique():
    # fraction of customers from this country with Exited == 1, as a percentage
    p = round(df["Exited"][df["Geography"] == country].mean() * 100, 2)
    percentages.append(p)
    print(country, "(% to exit) : ", p)
France (% to exit) : 16.15
Spain (% to exit) : 16.67
Germany (% to exit) : 32.44
Visualizing the results
for i in range(len(percentages)):
    plt.bar(i, percentages[i])
plt.xticks((0, 1, 2), ('France', 'Spain', 'Germany'))
plt.title("Percentage of Country to Exit")
plt.show()
It can be clearly seen that customers from Germany left at roughly twice the rate of the other countries, so we should definitely keep this column in our DataFrame.
Let's now decide if the ‘Age’ column is relevant. This might be a bit overwhelming at first, but it is simple once you understand it.
Let's start off by plotting a histogram of the ‘Age’ column
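The histogram itself appears as an image in the original post; a sketch of how it could be produced (the ages here are made up, and the `Agg` backend is used only so the snippet runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up ages standing in for df["Age"]
ages = np.array([22, 35, 38, 33, 41, 58, 36, 29])

# one bar per decade: 10-20, 20-30, ..., 90-100
counts, edges, _ = plt.hist(ages, bins=range(10, 101, 10))
plt.title("Age distribution")
```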
It can be seen that most of the customers are from the age group ’30 to 40'
The below scatter plot shows the age of customers(y-axis) who did not quit (‘Exited’ = 0)
plt.title("People who did not Exit (Exited = 0)")
We can see that most of the customers who did not quit the bank are in the age group ’20 to 45'
The below scatter plot shows the age of customers(y-axis) who quit the bank (‘Exited’ = 1)
plt.title("People who Exited (Exited = 1)")
We can see that most of the customers who exited are in the age group ’30 to 60'
!!! Do not let the above plots fool you !!!
We cannot be sure, because the ‘30 to 40’ age group simply contains more customers and thus dominates the plots. Only if every age group had the same number of people could we trust the raw counts above.
The following code gives the “percentage” of people who quit (‘Exited’ = 1) in each age group
# bucketizing the age column and using 'groupby' to create a group for each age range
age_bucket = df.groupby(pd.cut(df["Age"], bins=[10,20,30,40,50,60,70,80,90,100]))

# calculating the percentage of people who exited and rounding the result to 2 decimal places
age_bucket = round((age_bucket.sum()["Exited"] / age_bucket.size()) * 100, 2)
age_bucket
(10, 20] 5.62
(20, 30] 7.61
(30, 40] 12.09
(40, 50] 33.97
(50, 60] 56.21
(60, 70] 31.42
(70, 80] 8.26
(80, 90] 10.00
(90, 100] 0.00
Let's plot the above data to get a better sense of it
x = [str(i) + "-" + str(i+10) for i in range(10, 91, 10)]
plt.bar(x, age_bucket.values)
plt.title("Percentage of people in different Age Groups that exited")
plt.show()
It can be clearly seen from the above plot that the ‘50 to 60’ age group has the highest percentage of people who quit the bank. The earlier scatter plots might have misled us into thinking that ‘30 to 60’ contained the most exits, with the bulk at ‘40 to 50’.
By the above analysis, it is clear that the age column is definitely relevant but in the form of buckets. So let's convert the continuous Age column into buckets (categorical). Bucketizing basically means aggregating/putting the values in the corresponding ranges.
For example, 33 belongs in the range 30–40, 67 belongs in the range 60–70 and so on. The ranges are decided by us, like, for example, we can have ranges 10–20, 20–40, 40–60, ….. or like, 10–35, 35–60, 60–85, …..
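This is exactly what pandas' pd.cut does; a quick sketch on a few made-up ages:

```python
import pandas as pd

ages = pd.Series([33, 67, 18])

# each age lands in its decade-wide interval (right edge inclusive)
buckets = pd.cut(ages, bins=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

labels = [str(b) for b in buckets]
print(labels)  # ['(30, 40]', '(60, 70]', '(10, 20]']
```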
Now, let's replace the ‘Age’ column with the bucketized column
df["Age"] = pd.cut(df["Age"], bins=[10,20,30,40,50,60,70,80,90,100])

df.head()
One major part of Preprocessing is to encode the non-numeric and categorical columns, ‘Geography’, ‘Gender’ and ‘Age’ in our case. You may ask why ‘Age’? Well, ‘Age’ was a numeric column, but we bucketized it into a categorical one. This encoding is called ‘dummy encoding’, where every unique value in a column gets a separate column of its own. You will understand this by looking at the table below.
df = pd.get_dummies(df)

df.head()
Take a moment to notice that the categorical columns ‘Geography’, ‘Gender’ and ‘Age’ no longer exist in the table. The unique values in those columns got columns of their own, like ‘Geography_France’, ‘Gender_Male’, ….., replacing their corresponding parent columns. These columns are also called Dummy Variables.
There are only two possible values in those new columns, ‘0’ and ‘1’. Let's look at the Geography section of the first row: ‘Geography_France’ = 1, ‘Geography_Germany’ = 0 and ‘Geography_Spain’ = 0. It means the first row of the previous table (before dummy encoding) had ‘France’ in the ‘Geography’ column. You can verify that by scrolling up to the older table. The same holds for all the encoded columns.
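The behaviour is easy to see on a tiny made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"Geography": ["France", "Germany", "Spain"]})

# one new 0/1 column per unique value; the parent column disappears
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['Geography_France', 'Geography_Germany', 'Geography_Spain']
```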
Remove one column each from the created dummy variables to avoid ‘dummy variable trap’
“But what is Dummy Variable Trap ?”
It is a bit hard to explain its full meaning in this article, but to keep it simple, let's take the example of ‘Gender_Male’ and ‘Gender_Female’ from the above table. We know that exactly one of the two columns has the value ‘1’ in any row; a person cannot be both male and female. So the possible values for the two columns in any given row are (1, 0) or (0, 1). Clearly, one column is the complement of the other: given the value of ‘Gender_Male’, we can deduce the value of ‘Gender_Female’ in any given row, and vice versa. Having the second column is therefore redundant, so removing one column is fine and we MUST do it !!
The same is true for ‘Geography_France’, ‘Geography_Germany’ and ‘Geography_Spain’. The possible values for the three columns in any given row are (1, 0, 0), (0, 1, 0) or (0, 0, 1). If we remove any one column, say ‘Geography_Spain’, we still lose no information, because we can deduce its value from the other two. For example, if (‘Geography_France’, ‘Geography_Germany’) is (1, 0), then ‘Geography_Spain’ is ‘0’, because only one of the three can be ‘1’ in any given row. And if they are (0, 0), then ‘Geography_Spain’ must be ‘1’.
Therefore we MUST remove any ‘one’ encoded column to avoid falling into the Dummy Variable Trap
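As a side note, get_dummies can do this dropping for you via drop_first=True, which removes the first dummy of each parent column; the manual drop used in this article gives you control over which one goes. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# drop_first=True keeps only 'Gender_Male'; 'Gender_Female' (first
# alphabetically) is dropped, since it is implied by the other column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['Gender_Male']
```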
Now, let's remove (any) one encoded column for each parent column. In our case we choose ‘Geography_France’ from ‘Geography’ and ‘Gender_Female’ from ‘Gender’, but you can choose different columns if you want.
df = df.drop(columns=["Geography_France", "Gender_Female"])
Don’t forget to also remove one dummy Age column; we drop ‘Age_(90, 100]’ by selecting all the columns except the last one.
df = df.iloc[:,:-1]
Let's take a look at the first five rows of the modified Data Frame
Save the Data Frame as a .csv file to use it later for modelling
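A sketch of the save-and-reload round trip (the filename here is a made-up example; index=False stops pandas from writing the row index as an extra column):

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the preprocessed frame
toy = pd.DataFrame({"CreditScore": [619, 608], "Exited": [1, 0]})

path = os.path.join(tempfile.mkdtemp(), "churn_preprocessed.csv")
toy.to_csv(path, index=False)  # index=False: don't write the row index

# reading it back gives an identical frame
reloaded = pd.read_csv(path)
print(reloaded.equals(toy))  # True
```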