In the early hours of 15 April 1912, the RMS Titanic sank after colliding with an iceberg on its maiden voyage from Southampton to New York City. An estimated 2,224 passengers and crew were on board, and more than 1,500 died, making it one of the deadliest passenger-ship disasters in history.
Have you ever wondered what would have happened if you had been caught in this man-made tragedy? What would your chances of survival have been? Let’s find out using machine learning.
Let’s pose this as a classification problem: predicting the survival of the passengers aboard the Titanic. Thanks to Kaggle for sharing the Titanic dataset, which contains information about 891 passengers on board.
A major concern with any dataset is the proportion of missing values. Let’s drop the Cabin feature, since it has the highest share of missing values, and fill the remaining gaps during preprocessing. After these steps, let’s look at the survival count.
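The preprocessing described above can be sketched as follows. The `preprocess` function and the tiny three-row frame are illustrative stand-ins for the real Kaggle training frame; the fill strategies (median for numeric `Age`, mode for categorical `Embarked`) are common defaults, not necessarily the exact ones used in the original analysis.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the sparse Cabin column and fill the remaining gaps."""
    out = df.drop(columns=["Cabin"])
    # Numeric gap: fill Age with the median, which is robust to outliers.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Categorical gap: fill Embarked with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out

# Hypothetical mini-frame standing in for the Kaggle training data.
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "S"],
})
clean = preprocess(df)
```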
In our dataset, about 550 people lost their lives and roughly 340 survived the tragedy.
Let’s look at the pair plots between the features to draw some conclusions.
Analyzing the pair plots is important because it lets us focus on the features that correlate well with the hue (Survived, in our case). For example, look at this FamilySize graph.
The survival rate increases with family size and then drops, but the drop can be attributed to the small number of passengers who traveled on the Titanic with a family size greater than 4.
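The FamilySize feature above is derived, not present in the raw data. A common construction, sketched here on hypothetical rows, is to sum the sibling/spouse count (`SibSp`) and parent/child count (`Parch`) and add one for the passenger themselves:

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic data.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# FamilySize = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
```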
There is a good correlation between sex and survival rate; beyond that, the other features do not correlate strongly with survival. There are also correlations among the features themselves, but those are self-explanatory. Let’s remove categorical features that do not correlate well with the survival rate, such as the passenger’s name, which by itself adds no predictive value. But there is an important catch here. Look at this.
The survival rate for the title Mr. is very poor, while the titles Miss and Mrs have good survival rates compared to the others.
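So the name itself is useless, but the title buried inside it is predictive. One way to extract it, sketched on hypothetical rows, exploits the fact that names in this dataset follow the pattern "Last, Title. First" — the title sits between the comma and the period:

```python
import pandas as pd

# Hypothetical rows in the dataset's "Last, Title. First" name format.
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

# Capture everything between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
```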
Now let’s build a machine learning model with the information we have gathered. This is a classification problem: the model should return zero or one based on the input, and some interpretability would be a plus.
Let’s use a RandomForestClassifier for this problem. A random forest is a classification algorithm consisting of a large number of decision trees, where each tree is built on a random sample of the rows and a random subset of the columns.
Each tree then produces its own prediction, and a majority vote (or the mean of the trees’ outputs) gives the final result. After choosing the algorithm, we need to tune its hyperparameters.
In our random forest, we use 500 decision trees, which yields 100 percent accuracy on the training data (the trees fit the training set perfectly, so the cross-validation scores are what matter). All the hyperparameters were tuned with randomized search using the cross-validation dataset. Let’s apply the algorithm to the test data and look at the result.
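A minimal sketch of this tuning step with scikit-learn's `RandomizedSearchCV` is shown below. The synthetic `make_classification` data stands in for the preprocessed Titanic features, and the parameter grid is illustrative — the original analysis may have searched different ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the preprocessed Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative search space; n_estimators=500 is among the candidates.
param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=5, cv=3, random_state=0,
)
search.fit(X_train, y_train)            # cross-validated random search
test_acc = search.score(X_test, y_test)  # accuracy on held-out data
```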
After building and testing our model on the unseen data, we get an accuracy score of 78.7%. This is because the features don’t correlate strongly enough with the survival rate. Let’s plot the Receiver Operating Characteristic (ROC) curve of the model.
We get an Area Under the Curve (AUC) score of about 75.37%. Now our model is ready to make predictions. Let’s also measure the model’s performance on unseen data using the confusion matrix.
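Computing the AUC and the confusion matrix with scikit-learn looks roughly like this. Again, the synthetic data is a hypothetical stand-in for the Titanic features, so the scores below will not match the 78.7% / 75.37% figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# AUC needs the probability of the positive class, not the hard labels.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

# Confusion matrix: rows are actual classes, columns are predicted.
cm = confusion_matrix(y_test, clf.predict(X_test))
```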
This is a rough model for predicting survival, built mainly to learn the concepts more clearly. It still needs optimization, and various techniques can be applied to improve the accuracy and AUC score.
After performing those optimizations, it will be submitted to the Kaggle competition.
Thanks for your time!
P.S.: My chance of survival was close to 0.45%. Want to know yours? Click on the link.