Machine Learning Development Life Cycle

Vishal Sinha
Published in Analytics Vidhya
6 min read · Dec 13, 2019

In this article, I will cover the life cycle of a Machine Learning project, walking through the model development workflow stage by stage. As a running example, we will work on a classification task on the famous Titanic dataset: predicting the survival of Titanic passengers. The dataset contains information about passengers who were aboard during the fatal accident known as the “Sinking of the RMS Titanic”.

[Image: stages of the ML development life cycle. Credit: Ignite, TCS]

The above image covers almost all the stages required for the systematic development of an ML model, in sequential order. Let us cover all the stages (except deployment) in detail.

Problem Definition

As already mentioned, the problem is to predict the survival of the Titanic passengers. This is essentially a binary classification task with two classes: Survived or Not Survived.

Data Selection

The dataset is the famous Titanic dataset, which can be downloaded from Kaggle after enrolling in the competition.

Exploratory Data Analysis(EDA)

This is one of the crucial stages: it provides an initial investigation of the data and surfaces important insights and characteristics of the data using summary statistics and graphical representations.

The training dataset contains 891 rows and 12 columns, of which 1 is the dependent variable and the other 11 are independent variables. A sample from the dataset is shown below:

A small description of each column is given below:

[Table: column descriptions. Source: Kaggle]
  • The data has integer, float and object values
  • Age, Cabin and Embarked have null values, since their non-null counts are less than the number of rows
  • The numerical summary statistics are returned as NaN for the object-type columns
  • The unique value counts of the target variable, Survived, are shown
  • Out of the 891 rows in the dataset, 342 passengers survived and 549 did not (the inspection sketch below reproduces these checks)
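
The observations above come from standard pandas calls. A minimal inspection sketch, assuming the Kaggle training file is saved as train.csv (the filename is my assumption, not something stated in the article):

```python
import pandas as pd

# Load the Kaggle training file (the filename "train.csv" is an assumption)
df = pd.read_csv("train.csv")

print(df.shape)                         # (891, 12): 891 rows, 12 columns
df.info()                               # dtypes and non-null counts reveal the missing values
print(df.describe(include="all"))       # summary statistics; numerical stats show up as NaN for object columns
print(df["Survived"].value_counts())    # 0 -> 549 (did not survive), 1 -> 342 (survived)
```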

This is the correlation matrix for the numerical columns of the data; a short sketch for computing it follows the observations below.

  • It can be inferred that Parch has a good correlation with SibSp
  • Survived has a good correlation with Fare
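
A minimal sketch for producing such a matrix; the seaborn heatmap is my choice of visualization, not necessarily the one used in the article:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Correlation matrix restricted to the numerical columns
corr = df.select_dtypes(include="number").corr()

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```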

Data Pre-Processing

Data transformation is often clubbed together with data pre-processing, but here we will treat the two as separate steps.

The first thing we will do is find out which columns have missing values.

Age, Cabin and Embarked have null values; in Cabin, almost 80% of the data is null (let us see whether we can extract some insight from it, or else drop it).

Now, let us fill in the missing values with appropriate ones, a process called imputation.
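
A short sketch of both steps; the exact imputation strategy (median age, most frequent embarkation port) is an assumption on my part, so check the GitHub repo for the author's choices:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Which columns have missing values, and how many?
print(df.isnull().sum())   # non-zero for Age, Cabin and Embarked

# Imputation: fill Age with the median and Embarked with the most frequent port.
# (These particular strategies are assumptions for illustration.)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```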

The second step here is dealing with the continuous and noisy variables, Age and Fare. We will use binning to convert these continuous variables into categorical ones.

After analyzing Age and Fare, bins are created as Age_bin and Fare_bin respectively. 5 bins are created for each, such that every bin contains at least 5% of the total data (for details, visit the GitHub repo).
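
A sketch of the binning step using quantile-based bins, which keeps the bins roughly equal in size and therefore comfortably above the 5% threshold; using pd.qcut is my assumption, as the article only states that 5 bins were created per feature:

```python
import pandas as pd

df = pd.read_csv("train.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())   # Age must be imputed before binning

# duplicates="drop" guards against repeated quantile edges (e.g. caused by the many imputed median ages)
df["Age_bin"] = pd.qcut(df["Age"], q=5, duplicates="drop")
df["Fare_bin"] = pd.qcut(df["Fare"], q=5)

print(df["Age_bin"].value_counts(normalize=True))    # share of passengers per bin
print(df["Fare_bin"].value_counts(normalize=True))
```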

Data Transformation

In this step, we check whether we can extract useful information from the existing columns and transform them into forms that are more useful for the model.

Every value in the Name column is unique, but all names include an English honorific (Mr, Mrs, Master, etc.), also called a salutation, which can serve as a categorical variable. From Cabin, we can extract the deck of the ship. We keep the most frequent salutations (Mr, Mrs, Master, Miss) and group the remaining ones (Lady, Doctor, Captain, etc.) as Others. Similarly, for the deck, missing values are assigned an arbitrary value of Z.
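
A sketch of these two transformations; the column names Salutation and Deck are my labels for the new features, not necessarily the ones used in the repo:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# "Braund, Mr. Owen Harris" -> "Mr": grab the honorific between the comma and the period
df["Salutation"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the frequent salutations and group everything else as "Others"
common = ["Mr", "Mrs", "Master", "Miss"]
df["Salutation"] = df["Salutation"].where(df["Salutation"].isin(common), "Others")

# The deck is the first letter of the Cabin value; missing cabins get the arbitrary label "Z"
df["Deck"] = df["Cabin"].str[0].fillna("Z")
```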

Feature Selection

For feature selection, we will use Attribute Relevance Analysis (ARA).

The attribute relevance analysis phase has the task of recognizing the attributes (characteristics) with the strongest impact on churn. Attributes that show the greatest segregation power with respect to churn (churn = “Yes” or “No”) are selected as the best candidates for building a predictive churn model.

Attribute Relevance Analysis is typically used for churn model development, but it can also be used in any classification task to find out which features are best for the model. In ARA, we will use Weight of Evidence (WOE) and Information Value (IV).

Weight of Evidence(WOE)

According to www.listendata.com, WOE is defined as:

The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers. “Bad customers” refers to the customers who defaulted on a loan, and “good customers” refers to the customers who paid back the loan.

and IV is defined as:

Information value is one of the most useful techniques to select important variables in a predictive model. It helps to rank variables on the basis of their importance.

The formulas for calculating WOE and IV are:
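
The formula image did not carry over; for reference, the standard definitions (consistent with the listendata.com description quoted above, treating one class as the "event" and the other as the "non-event") are:

```latex
\mathrm{WOE}_i = \ln\left(\frac{\%\ \text{of non-events in bin } i}{\%\ \text{of events in bin } i}\right)
\qquad
\mathrm{IV} = \sum_i \left(\%\ \text{of non-events}_i - \%\ \text{of events}_i\right)\times \mathrm{WOE}_i
```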

There are some prerequisites for calculating WOE and IV:

  • There should not be any missing values in the data
  • There should not be any continuous columns; if a continuous feature is present, it should be converted into a categorical variable using binning
  • Each created bin should have at least 5% of the total data

For calculating WOE and IV, instead of using a package, let us do it on our own.
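
A minimal sketch of a hand-rolled WOE/IV calculation, assuming the engineered df from the earlier sketches; treating Survived == 1 as the "event" and adding a small epsilon to avoid division by zero are my choices, not necessarily the author's:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target="Survived"):
    """Return the WOE per category of `feature` and the feature's total IV."""
    eps = 1e-6  # avoids log(0) / division by zero for empty categories

    grouped = df.groupby(feature)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]

    grouped["pct_events"] = grouped["events"] / grouped["events"].sum()
    grouped["pct_non_events"] = grouped["non_events"] / grouped["non_events"].sum()

    grouped["woe"] = np.log((grouped["pct_non_events"] + eps) / (grouped["pct_events"] + eps))
    grouped["iv"] = (grouped["pct_non_events"] - grouped["pct_events"]) * grouped["woe"]

    return grouped[["woe", "iv"]], grouped["iv"].sum()

# Example: IV of the (hypothetical) Salutation feature created earlier
# woe_table, iv = woe_iv(df, "Salutation")
```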

Let us calculate WOE and IV for all the features present after the transformation step. The IV values are:

You might be wondering what the eligibility criterion is for a feature's IV to be selected for modelling. The rules related to IV are:

IV rule
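
The rule-of-thumb table in the image did not carry over; the commonly cited thresholds from the credit-scoring literature are roughly:

  • IV < 0.02: not useful for prediction
  • 0.02 to 0.1: weak predictive power
  • 0.1 to 0.3: medium predictive power
  • 0.3 to 0.5: strong predictive power
  • IV > 0.5: suspicious, likely too good to be true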

Based on these rules, we have selected the columns which have an IV greater than 0.1.

After selecting the features, we can encode them as categorical variables, which is convenient for model training.
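
One simple way to do this, continuing from the df built in the earlier sketches; the list of selected columns below is hypothetical, since the actual IV-based selection comes from the table above:

```python
# Hypothetical selection of columns with IV > 0.1; replace with the actual result
selected = ["Salutation", "Deck", "Age_bin", "Fare_bin", "Sex", "Embarked"]

# Convert each selected feature to integer category codes for model training
for col in selected:
    df[col] = df[col].astype("category").cat.codes
```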

Model Selection, Model Training and Model Evaluation

For model selection, we take various tree-based and ensemble classifiers such as a decision tree, random forest, bagging and XGBoost, train each model, and compare their evaluation metrics. For the evaluation metrics, we use accuracy, precision and recall.
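
A sketch of the comparison loop, continuing from the encoded df and selected columns above; the train/test split ratio and default hyperparameters are my assumptions and will not reproduce the exact numbers reported below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from xgboost import XGBClassifier

X = df[selected]          # encoded features from the earlier sketches
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    })

print(pd.DataFrame(results))
```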

From the above comparison, we found that the XGBoost random forest classifier works best on the Titanic dataset here, with an accuracy of 82.7%, a precision of 84.0% and a recall of 80.0%.

For the code, kindly visit the GitHub repo.

Kindly go through https://neptune.ai/blog/life-cycle-of-a-machine-learning-project for more insights on ML Lifecycle.
