Building Classification Model with Python

Rafi Atha · Published in Analytics Vidhya · 14 min read · Jan 29, 2021


Hi! In this article I will cover the basics of creating your own classification model with Python. I will explain and demonstrate, step by step, how to prepare your data, train your model, optimise the model, and save it for later use. This article is the second part of a mini-series I have been working on; if you haven’t read my previous article on ‘Multi-Linear Regression Using Python’, be sure to check it out.

You can try to follow along using the notebook here. Good luck and have fun!

Introduction

In machine learning, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, based on a training set of data containing observations (or instances) whose category membership is known. A couple of examples of classification problems are: (a) deciding whether a received email is spam or an organic email; (b) assigning a diagnosis to a patient based on observed characteristics of the patient (age, blood pressure, presence or absence of certain symptoms, etc.).

In this article we will use the Bank Marketing Dataset from Kaggle to build a model that predicts whether someone is going to make a deposit or not, depending on a number of attributes. We will build 4 different models using different algorithms: Decision Tree, Random Forest, Naive Bayes, and K-Nearest Neighbours. After building each model we will evaluate them and compare which one works best for our case. We will then try to optimise the winning model by tuning its hyperparameters using GridSearchCV. Lastly, we will save the prediction results from our dataset and then save the model itself for reusability.

To start, we will load some basic libraries such as Pandas and NumPy and apply some configuration to them.
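The setup cell is not reproduced here, so below is a minimal sketch of what it might look like; the specific display options are my assumption, not necessarily the author's.

import pandas as pd
import numpy as np

# Show every column when inspecting the dataframe, and silence pandas'
# chained-assignment warning: a typical notebook configuration.
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None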

Data Pre-Processing

Before we can begin to create our first model, we first need to load and pre-process the data. This step ensures that our model receives good data to learn from; as they say, “a model is only as good as its data”. The pre-processing is divided into a few steps, as explained below.

Loading Data

In this first step we will load our dataset, which has been uploaded to my GitHub for easier access. From the dataset documentation found here, below is the list of columns we have in our data:

Input variables:

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ’housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
  3. marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’)
  5. default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  8. contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
  9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  10. day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ’fri’)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’)

Output variable (desired target):

  • y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

According to the dataset documentation, we need to remove the ‘duration’ column, because in a real case the duration is only known after the label is known. This is an example of ‘data leakage’, where predictors include information that will not be available at the time you make predictions.
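As a minimal sketch, loading the data and dropping the leaky column might look like this; the URL is a placeholder for the raw file on the author's GitHub, not the real link.

# Hypothetical raw-file URL; substitute the actual path from the repository.
url = 'https://raw.githubusercontent.com/<user>/<repo>/main/bank.csv'
df = pd.read_csv(url)

# Drop 'duration' to avoid data leakage, as discussed above.
df = df.drop(columns='duration')
df.head()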

Sample of the data we will be working on

Class Distribution

Another important thing to check before feeding our data into the model is the class distribution. In our case, where the expected classes are divided into two outcomes, ‘yes’ and ‘no’, a class distribution of 50:50 can be considered ideal.
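A one-liner like the following (assuming the dataframe is named df, as in the sketches above) produces the counts shown below:

# Count how many rows fall into each class of the target column.
df['deposit'].value_counts()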

no     5873
yes    5289
Name: deposit, dtype: int64

As we can see, our class distribution is more or less balanced; not exactly 50:50, but still good enough.

Missing Values

The last thing to check before moving on is missing values. In some cases our data might have missing values in some columns; this can be caused by, among other reasons, human error. We can use the isnull() function from Pandas to check for missing data and then use the sum() function to see the total number of missing values in each column.
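Put together, the check is a single chained call:

# Count missing values per column (note the method is isnull(), not is_null()).
df.isnull().sum()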

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
campaign     0
pdays        0
previous     0
poutcome     0
deposit      0
dtype: int64

From the result we can be assured that our data has no missing values and is good to go. If you do have missing values in your data, you can handle them by imputation or by removing the affected column altogether, depending on your case. Here is a link to a good Kaggle course on how to handle missing values in a dataset.

Scale Numeric Data

Next up, we will scale our numerical data so that differences in magnitude (and the presence of outliers) do not disproportionately affect our model. Using the StandardScaler() class from sklearn we can scale each column that contains numerical data. The scaling is done using the standard score formula below:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the column.
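As a minimal sketch, assuming the numeric columns of this dataset are the ones listed below (taken from the missing-value check above):

from sklearn.preprocessing import StandardScaler

# Numeric columns of the bank dataset; adjust if your copy differs.
num_cols = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])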

Our data after scaling the numeric columns

Encoding Categorical Data

As with the numerical data, we also need to pre-process our categorical data from words to numbers, to make it easier for the computer to understand. To do this we will use the OneHotEncoder() provided by sklearn. Basically, it transforms a single categorical column (for example, contact with the values ‘cellular’ and ‘telephone’) into one binary 0/1 column per category (contact_cellular and contact_telephone).

In this code cell we will also encode our label column by replacing ‘yes’ and ‘no’ with 1 and 0 respectively. We can do this by applying a simple lambda (in-line) function to the deposit column.
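A sketch of both steps, assuming scikit-learn ≥ 1.2 (on older versions use sparse=False and get_feature_names instead):

from sklearn.preprocessing import OneHotEncoder

# Treat every remaining text column except the label as categorical.
cat_cols = df.drop(columns='deposit').select_dtypes(include='object').columns

encoder = OneHotEncoder(sparse_output=False)
encoded = pd.DataFrame(
    encoder.fit_transform(df[cat_cols]),
    columns=encoder.get_feature_names_out(cat_cols),
    index=df.index,
)
df = pd.concat([df.drop(columns=cat_cols), encoded], axis=1)

# Encode the label: 'yes' -> 1, 'no' -> 0.
df['deposit'] = df['deposit'].apply(lambda x: 1 if x == 'yes' else 0)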

Split Dataset for Training and Testing

To finish up our data pre-processing, we will split our data into two datasets: training and testing. Because we have enough data, we will split it with a ratio of 80:20 for training and testing respectively. This results in our training data having 8,929 rows and our testing data having 2,233 rows.
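A minimal sketch using train_test_split (the random_state is my choice for reproducibility; the notebook may use a different seed):

from sklearn.model_selection import train_test_split

X = df.drop(columns='deposit')
y = df['deposit']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)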

Shape of training feature: (8929, 50)
Shape of testing feature: (2233, 50)
Shape of training label: (8929,)
Shape of testing label: (2233,)

Modelling

After making sure our data is good and ready, we can continue to building our models. In this notebook we will build 4 different models with different algorithms. In this step we will create a baseline model for each algorithm using the default parameters set by sklearn; after building all 4 models we will compare them to see which works best for our case.

To evaluate our models we will use the confusion matrix as the basis for the evaluation.

Source: Confusion Matrix for Your Multi-Class Machine Learning Model (Towards Data Science)

where: TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative.

We will use the 6 metrics below to evaluate the models:

  • Accuracy: the proportion of true results among the total number of cases examined.
  • Precision: the proportion of predicted positives that are actually positive.
  • Recall: the proportion of actual positives that are correctly classified.
  • F1 score: a number between 0 and 1 and is the harmonic mean of precision and recall.
  • Cohen Kappa Score: Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. It is computed as:

κ = (Po − Pe) / (1 − Pe)

where Po is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and Pe is the expected agreement when both annotators assign labels randomly. Pe is estimated using a per-annotator empirical prior over the class labels.

  • Area Under Curve (AUC): indicates how well the probabilities of the positive class are separated from those of the negative class.

In this case we want to focus on the recall value of our model, because in our problem we should try to capture as many actual positives as we can: misclassifying a customer who actually wanted to make a deposit means a lost opportunity and lost revenue.

Below we will define a helper function that evaluates each trained model with the metrics mentioned above and saves the scores to a variable.
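Since the helper's code is not reproduced above, here is a minimal sketch of such a function, assuming the model exposes predict and predict_proba; the name evaluate_model and the dictionary keys are my own, not necessarily the notebook's.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score,
                             confusion_matrix)

def evaluate_model(model, x_test, y_test):
    # Hard class predictions for the threshold-based metrics.
    y_pred = model.predict(x_test)
    # Probability of the positive class, used for the AUC score.
    y_prob = model.predict_proba(x_test)[:, 1]
    return {
        'acc': accuracy_score(y_test, y_pred),
        'prec': precision_score(y_test, y_pred),
        'rec': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'kappa': cohen_kappa_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_prob),
        'cm': confusion_matrix(y_test, y_pred),
    }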

As I said in the earlier part of this article, I will build 4 different models: Decision Tree, Random Forest, Naive Bayes, and K-Nearest Neighbours. Before we start, below is a simple definition of each algorithm and how it works. If you aren’t familiar with any of these algorithms, you should definitely read a more in-depth explanation of them before you continue.

Decision Tree

A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.

Source: Telkom Digital Talent Incubator — Data Scientist Module 5 (Classification)

Advantages:

  • It is simple to understand and interpret, and the resulting tree can be visualised, which makes the model easy to explain.
  • It requires relatively little data preparation, and its default hyper-parameters often produce a reasonable prediction result.

Disadvantages:

  • It is prone to overfitting, producing over-complex trees that do not generalise well to unseen data.
  • It can be unstable: small variations in the data can result in a completely different tree being generated.

Random Forest

Random Forest (or Random Decision Forest) is a method that operates by constructing multiple decision trees during the training phase. The decision of the majority of the trees is chosen as the final decision.

Source: Random Forest Algorithm (Simplilearn)

Advantages:

  • It can be used for both regression and classification tasks, and it is easy to view the relative importance it assigns to the input features.
  • It is also considered a very handy and easy-to-use algorithm, because its default hyper-parameters often produce a good prediction result.

Disadvantages:

  • Many trees can make the algorithm too slow and ineffective for real-time predictions: a more accurate prediction requires more trees, which results in a slower model.
  • It is a predictive modelling tool and not a descriptive tool.

Naive Bayes

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. Below is the Bayes theorem formula:

P(A|B) = P(B|A) · P(A) / P(B)

For example, given:

  • A doctor knows that meningitis causes stiff neck 50% of the time
  • Prior probability of any patient having meningitis is 1/50,000
  • Prior probability of any patient having stiff neck is 1/20

Then the probability that a patient who has a stiff neck also has meningitis is:

P(M|S) = P(S|M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

K-Nearest Neighbours

K-Nearest Neighbours (KNN) classifies new data by finding the k closest neighbours in the training data and then deciding the class based on the majority of those neighbours. For example, in the image below, when k=3 the majority of the neighbours are classified as B, but when k=7 the majority changes to A.

Source: Telkom Digital Talent Incubator — Data Scientist Module 5 (Classification)

Advantages:

  • Simple technique that is easily implemented
  • Building the model is cheap
  • Extremely flexible classification scheme

Disadvantages:

  • Classifying unknown records is relatively expensive
  • Requires distance computation of k-nearest neighbours
  • Computationally intensive, especially when the size of the training set grows
  • Accuracy can be severely degraded by the presence of noisy or irrelevant features

Building the Models

After understanding how each model works, let’s train our models using the training dataset we prepared earlier. Below is sample code to fit a Decision Tree model and evaluate it with the helper function we created before. The full code for each algorithm can be found in the notebook here.
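A minimal sketch, reusing the evaluate_model helper and the variable names assumed above:

from sklearn.tree import DecisionTreeClassifier

# Baseline Decision Tree with sklearn's default parameters.
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)

dtc_eval = evaluate_model(dtc, X_test, y_test)
print('Accuracy:', dtc_eval['acc'])
print('Precision:', dtc_eval['prec'])
print('Recall:', dtc_eval['rec'])
print('F1 Score:', dtc_eval['f1'])
print('Cohens Kappa Score:', dtc_eval['kappa'])
print('Area Under Curve:', dtc_eval['auc'])
print('Confusion Matrix:\n', dtc_eval['cm'])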

Accuracy: 0.6336766681594268
Precision: 0.6215953307392996
Recall: 0.598314606741573
F1 Score: 0.6097328244274809
Cohens Kappa Score: 0.2648219403033133
Area Under Curve: 0.6322045136712157
Confusion Matrix:
[[776 389]
 [429 639]]

Model Comparison

After building all of our models, we can now compare how well each one performs. To do this we will create two charts: first, a grouped bar chart to display the accuracy, precision, recall, F1, and kappa scores of our models; and second, a line (ROC) chart to show the AUC of all our models.
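A sketch of the grouped bar chart, assuming each *_eval dictionary comes from the evaluate_model helper above; the rf_eval, nb_eval, and knn_eval names are my assumption for the other three baseline models.

import matplotlib.pyplot as plt

evals = {'Decision Tree': dtc_eval, 'Random Forest': rf_eval,
         'Naive Bayes': nb_eval, 'KNN': knn_eval}
metrics = ['acc', 'prec', 'rec', 'f1', 'kappa']

x = np.arange(len(metrics))
width = 0.2
fig, ax = plt.subplots(figsize=(10, 5))
for i, (name, ev) in enumerate(evals.items()):
    # One group of bars per metric, one bar per model within each group.
    ax.bar(x + i * width, [ev[m] for m in metrics], width, label=name)
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels(metrics)
ax.set_ylabel('score')
ax.legend()
plt.show()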

From the figures above we can see that our Random Forest model tops the other models in 5 of the 6 metrics we evaluated, all except precision. So we can conclude that Random Forest is the right choice for our problem.

Model Optimisation

In the next part of this notebook, we will try to optimise our Random Forest model by tuning the hyperparameters available in the scikit-learn library. After finding the optimal parameters, we will evaluate the new model by comparing it against our baseline model from before.

Tuning Hyperparameter with GridSearchCV

We will use the GridSearchCV functionality from sklearn to find the optimal parameters for our model. We will provide our baseline model, the scoring method (in our case recall, as explained before), and the various parameter values we want to try; the resulting grid search object is named rf_grids. GridSearchCV will then iterate through every parameter combination to find the best-scoring one.

This function also allows us to use cross-validation when training: on each iteration the data is divided into 5 folds (the number is adjustable via the cv parameter). The model is trained on 4 of the 5 folds, leaving the final fold as validation data, and this process is repeated 5 times until every fold has been used as validation data.

Source: Cross-Validation (Kaggle)
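A minimal sketch, under the assumption that the parameter grid below simply brackets the values reported by best_params_ later; the exact grid in the notebook may be wider.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [10, 50],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 5],
    'min_samples_split': [8, 12],
}

rf_grids = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='recall',  # optimise for recall, as discussed above
    cv=5,              # 5-fold cross-validation
    n_jobs=-1,         # use every available CPU core
)
rf_grids.fit(X_train, y_train)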

To see which parameter combination works best, we can access the best_params_ attribute of our grid search object.

Note: the more combinations provided, the longer the process will take. Alternatively, you can try RandomizedSearchCV, which only samples a specified number of parameter combinations at random and can therefore run much faster.

{'max_depth': 50,
 'max_features': 2,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'n_estimators': 100}

Evaluating Optimised Model

After finding the best parameters for the model, we can access the best_estimator_ attribute of the GridSearchCV object to save our optimised model into a variable called best_grid. We will then calculate the 6 evaluation metrics using our helper function, to compare against the base model in the next step.
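A short sketch, reusing the names from the grid search snippet above:

# Extract the refitted best model and evaluate it with the same helper.
best_grid = rf_grids.best_estimator_
best_grid_eval = evaluate_model(best_grid, X_test, y_test)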

Accuracy: 0.7174205105239588
Precision: 0.7635705669481303
Recall: 0.5926966292134831
F1 Score: 0.6673695308381655
Cohens Kappa Score: 0.42844782511519086
Area Under Curve: 0.7785737249039559
Confusion Matrix:
[[969 196]
 [435 633]]

Model Comparison

The code below will draw the same plot as before, only with our original Random Forest model and its optimised version. It will also print the change in each evaluation metric, to help us see whether the optimised model works better than the original one.
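The change calculation itself is only a few lines; a sketch, where rf_eval is the assumed name for the baseline Random Forest's evaluation dictionary:

# Percentage-point change of the optimised model relative to the baseline.
labels = {'acc': 'accuracy', 'prec': 'precision', 'rec': 'recall',
          'f1': 'F1 score', 'kappa': 'Kappa score', 'auc': 'AUC'}
for key, label in labels.items():
    change = (best_grid_eval[key] - rf_eval[key]) * 100
    print(f'Change of {change:.2f}% on {label}.')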

Change of 0.37% on accuracy.
Change of 2.79% on precision.
Change of -3.89% on recall.
Change of -0.96% on F1 score.
Change of 0.95% on Kappa score.
Change of 0.84% on AUC.

The results show that our optimised model performed a little better than the original: it improves in 4 out of the 6 metrics but performs worse in the other two, most notably recall, with a 3.89% decrease. Because we want to focus on predicting as many actual positives as possible, we should stick with our original model, since it has the higher recall score.

Output

We have our model; what next? As a data scientist it’s important to be able to develop a model with good reusability. In this final part I will explain how to make predictions on new data and how to save (and load) your model using joblib, so you can use it in production or keep it for later use without having to repeat the whole training process.

Making Predictions

In this step we will predict the expected outcome of every row in our original dataset using the Random Forest model, and then save the results into a CSV file for easier access in the future.
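A sketch of this step, where rf is the assumed name of the baseline Random Forest model and the output filename is my choice:

# Predict on the full pre-processed feature matrix and store the results.
df['prediction'] = rf.predict(X)
df.to_csv('prediction_result.csv', index=False)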

New dataset with prediction result

Saving Model

We can save our model for reusability. The saved model can then be loaded on another machine to make new predictions without going through the whole training process again.
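A minimal joblib round trip (the filename is my choice):

import joblib

# Persist the trained model to disk...
joblib.dump(rf, 'rf_model.joblib')

# ...and later, load it back to predict without retraining.
loaded_model = joblib.load('rf_model.joblib')
loaded_model.predict(X_test[:5])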

Conclusion

For a simple model, ours did decently at classifying the data. But there are still some weaknesses, especially in the recall metric, where we only reach about 60%. This means that our model is only able to detect 60% of potential customers and misses the other 40%. The result is not much different after optimising the model using GridSearchCV, which may mean we have hit the limit of this model. To improve performance, we can look into other algorithms such as GradientBoostingClassifier.

Thank you for reading, I hope you found it helpful! If you have any suggestions or questions, feel free to leave a comment (a clap will definitely be appreciated!)
