Machine Learning Project: Titanic Problem Statement

Sameer Kumar
Published in The Startup
6 min read · Sep 6, 2020

Introduction

Everyone knows about the Titanic, as many people have seen the Titanic movie and how the ship met its tragic end on the night of 15th April 1912. A ship that had been termed 'unsinkable' struck disaster by colliding with an iceberg, and within a few hours it was at the bottom of the ocean.

The lifeboats could not accommodate all the passengers, so only about 700 people survived the shipwreck and around 1,500 passengers and crew lost their lives.

Problem Statement

So Kaggle hosted an open online competition where the competitors had to design a model, based on a given training data set, that predicted the survival of passengers during the shipwreck. I was very excited to attempt this problem, as it was my first ML project after learning the basic concepts of algorithms and data processing.

Approach

The life cycle of any Machine learning or Data Science project consists of 3 basic workflows:

  1. Data collection: The very first step is the data collection process. The data can be obtained from many sources such as the company itself, Kaggle, an ML repository, surveys, third-party APIs, etc. We import that data set, usually as a comma-separated (CSV) file, along with the required Python modules (a sketch of these imports follows below).
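A minimal sketch of the modules typically needed for this workflow (the exact list used in the post may differ):

```python
# Core libraries for data handling, visualization, and modelling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
```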
  2. Exploratory Data Analysis: The data set we import using the read_csv function of pandas comes in the form of rows and columns called a DataFrame, and each column is a Series, i.e. a feature. We then try to analyze the relationships between different features by drawing various plots and understanding the correlation between them. We should first check whether there are any null values in the data (a sketch of these steps follows below).
[Figure: heatmap where the yellow spots show the null values in each column]
[Figure: bar chart of the number of survivors]
[Figure: preview of the Titanic data set]
[Figure: statistical summary of all features]
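A minimal sketch of these EDA steps, assuming the Kaggle training file is saved as train.csv (the filename and plot choices are assumptions, not taken from the post):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic training data into a DataFrame
df = pd.read_csv('train.csv')

# Heatmap of nulls: bright spots mark missing values in each column
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Bar chart of survivors vs. non-survivors
sns.countplot(x='Survived', data=df)
plt.show()

# Statistical summary of the numerical features
print(df.describe())
```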

3. Feature Engineering: The most important step in any ML project is feature engineering, which deals with three points:

a) Handling missing values in the data: Many times, while filling out surveys, people do not tend to give their personal information, or for some reason the data for an individual cannot be obtained. These are a few of the reasons that explain the missing data in a data set.

Our job is to fill those null values in a specific feature by a suitable method so that they do not affect the accuracy of our model.

There are 2 types of data: 1) Numerical data 2) Categorical data

For the numerical data we use the mean/median imputation method and replace the null values with the median of that column. By using the median and not the mean, we reduce the impact of any outliers that may be present in the data.

[Code: replacing null values of the 'Age' column with the median]
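A minimal sketch of median imputation for the 'Age' column (assuming the DataFrame is named df):

```python
# Replace missing 'Age' values with the column median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())
```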

b) Encoding the categorical data: The thing with categorical data is that the algorithm does not understand the categories, so we convert the categorical feature into 1s and 0s, called dummy variables. This process of converting a categorical feature into numeric indicator columns is called One Hot Encoding.

Based on the number of categories in a feature, that many columns are created, but only n-1 of those columns are kept, since the dropped column can be represented with the help of the other columns. Keeping all n columns would introduce the redundancy known as the dummy variable trap.

[Code: the categorical feature 'Sex' converted into dummy variables]
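A minimal sketch using pandas' get_dummies, dropping the first level to avoid the dummy variable trap (only 'Sex' is encoded here; other categorical columns such as 'Embarked' can be handled the same way):

```python
# One-hot encode 'Sex' into a single 0/1 column, dropping the first level
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
```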

By analyzing the data we can observe that a few columns can be dropped, as they do not affect the accuracy of our model. The process of removing such irrelevant data is called data cleaning.
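As an illustration, a few columns that are commonly dropped from this data set (the exact columns to remove are a judgment call):

```python
# Drop identifier-like columns that carry little predictive signal
# (illustrative choice, not necessarily the columns dropped in the post)
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
```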

There are some other methods to handle missing values as well:

  1. Random Sample Imputation (a sketch follows below)
  2. Capturing NaN values with a new feature
  3. End-of-distribution imputation, and many more…
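As a rough sketch of the first method, random sample imputation fills each missing value with a random draw from the observed values of the same column (shown here for 'Age', as an alternative to the median fill above):

```python
# Random sample imputation for 'Age': draw random observed ages for the NaNs
missing = df['Age'].isnull()
random_sample = df['Age'].dropna().sample(missing.sum(), random_state=0)
random_sample.index = df[missing].index  # align the draws with the missing rows
df.loc[missing, 'Age'] = random_sample
```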

Train Test split operation

After feature engineering is completed and we have a data set with no null values and with dummy variables, we split the data into training and testing data sets.

20% of the data becomes the test data set and the remaining 80% is the training data set, which we feed to our algorithm; after learning from it, the algorithm can make predictions on the testing data.

We use the scikit-learn module to import the train_test_split function:
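A minimal sketch of the split, assuming the cleaned DataFrame is named df and that 'Survived' is the target column (the random_state value is my own choice):

```python
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target label (y)
X = df.drop('Survived', axis=1)
y = df['Survived']

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```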

Algorithm used

The algorithm I used here is Logistic Regression, which is generally used for binary classification problem statements. It suits problems where the data is linearly separable: the algorithm learns a line (a decision boundary) that best separates the two classes. It essentially predicts the probability that a particular event will happen or not.

[Code: classifier to predict survival]

We then create the classifier, fit the training data to it, and then use the predict function to get the output.
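A minimal sketch of those three steps (the max_iter setting is my own addition to help the solver converge; it is not taken from the post):

```python
from sklearn.linear_model import LogisticRegression

# Create the classifier, fit it on the training data, then predict on the test set
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(y_pred)  # array of 0s and 1s
```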

[Output: 0 = did not survive, 1 = survived]

Since the model is now ready, it is time to evaluate the performance of our classifier.

There are three metrics which I have used for evaluation:

  1. Accuracy score
  2. K Fold cross validation score
  3. ROC and AUC score

The accuracy score I got is 82%, which is decent.
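A minimal sketch of how the accuracy and ROC-AUC scores can be computed with scikit-learn (using predicted probabilities for the AUC, which is standard practice rather than something shown in the post):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Accuracy: fraction of correct predictions on the held-out test set
print(accuracy_score(y_test, y_pred))

# ROC AUC: based on the predicted probability of the positive class
y_prob = classifier.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob))
```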

K-Fold Cross Validation: Accuracy is not the only way to judge the performance of a model, because if the data set is imbalanced it might give us misleading results; but since this is not an imbalanced set, accuracy is a reasonable classification metric here. I have used the other metrics just to analyze the different scores, and K-fold CV is one of them.

In K-fold CV, we divide the original data set into k subsets (folds). In each of the k iterations, one fold is held out as the test data and the remaining folds are used for training. We then calculate the mean of those k accuracy scores, and that is our final accuracy.
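A minimal sketch using scikit-learn's cross_val_score (the choice of ten folds here is an assumption, not necessarily the value used in the post):

```python
from sklearn.model_selection import cross_val_score

# Train and evaluate on 10 different train/test splits, then average the accuracies
scores = cross_val_score(classifier, X, y, cv=10, scoring='accuracy')
print(scores.mean())
```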

Conclusion

So, all in all, this was a very exciting project which gave me a comprehensive idea of the life cycle of any ML or data science project. It has given me wonderful insight into how basic math and statistics concepts are used to solve real-world problems. The more I read about Machine Learning, the more it fascinates me, and I am looking forward to applying other ML algorithms like Decision Trees, Random Forests, and Support Vector Machines.
