Factors Affecting Survival Prediction in the Titanic Disaster

Umar Farooq
Published in Analytics Vidhya
3 min read, May 11, 2020
Photo Credit: https://gallery.azure.ai/Experiment/Tutorial-Building-a-classification-model-in-Azure-ML-18

Introduction

In this blog post, I walk through the process of finding the factors that affect survival prediction in the Titanic disaster. I use the well-known Titanic dataset that is available on Kaggle.

The main goal is to predict passenger survival, that is, to find out from the given data which types of passengers survived the disaster.

Which features contribute to a higher survival rate?

After reviewing the data, I found that every feature carries some importance and affects model accuracy, except the passenger ID, Name, and Ticket columns.

The training set contains 891 examples with 11 features plus the target variable (Survived). Several features have missing values.

About 77% of the values are missing in the Cabin feature, 19% in the Age feature, and 0.2% in the Embarked feature.
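The missing-value percentages above can be computed along these lines. This is a minimal sketch, assuming pandas; the helper name `missing_percentages` is hypothetical and the small table is made-up toy data, not the real Kaggle rows.

```python
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy stand-in for the Kaggle training file: made-up rows, not real passengers.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0],
        "Cabin": [None, "C85", None, None],
        "Embarked": ["S", "C", "S", "S"],
    }
)

missing = missing_percentages(toy)
print(missing.round(1))
```

On the real data, the same helper could be applied to the frame returned by `pd.read_csv`, assuming the training file has been downloaded locally.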

During data wrangling, I dropped the Cabin and Age columns because their many missing values affect model performance.
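Dropping those columns might look like the following. It is only a sketch, assuming pandas; `drop_sparse_columns` is a hypothetical helper and the two rows are made-up toy data.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, columns=("Cabin", "Age")) -> pd.DataFrame:
    """Return a copy of df without the given columns; absent names are ignored."""
    return df.drop(columns=list(columns), errors="ignore")


# Toy rows for illustration only (not real passengers).
toy = pd.DataFrame(
    {
        "Survived": [1, 0],
        "Sex": ["female", "male"],
        "Age": [29.0, None],
        "Cabin": [None, "E46"],
    }
)

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

The helper returns a new frame, so the original data stays available for the exploratory questions that still use Age.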

How does survival probability vary with gender and age?

In the data, men have a higher probability of survival between the ages of 18 and 33, and women have a higher probability of survival between the ages of 14 and 40.
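One way to check an age-and-gender pattern like this is to group passengers into age bands and compare survival rates. The sketch below assumes pandas; the band edges, the helper name `survival_by_sex_and_age`, and the toy rows are all illustrative assumptions rather than the article's actual procedure.

```python
import pandas as pd

# Hypothetical age bands; the edges are an illustrative choice, not from the article.
AGE_BINS = [0, 18, 33, 80]
AGE_LABELS = ["(0, 18]", "(18, 33]", "(33, 80]"]


def survival_by_sex_and_age(df: pd.DataFrame) -> pd.Series:
    """Mean of Survived per (Sex, AgeBand); rows with unknown Age are skipped."""
    known = df.dropna(subset=["Age"]).copy()
    known["AgeBand"] = pd.cut(
        known["Age"], bins=AGE_BINS, labels=AGE_LABELS
    ).astype(str)
    return known.groupby(["Sex", "AgeBand"])["Survived"].mean()


# Made-up toy rows, not real passengers.
toy = pd.DataFrame(
    {
        "Sex": ["male", "male", "male", "female", "female", "female", "male"],
        "Age": [25.0, 30.0, 50.0, 20.0, 16.0, None, 22.0],
        "Survived": [1, 0, 0, 1, 1, 1, 1],
    }
)

rates = survival_by_sex_and_age(toy)
print(rates)
```

Because `Survived` is coded 0/1, the group mean is the share of survivors in each band.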

What is the correlation between Embarked and Gender?

The Embarked feature seems to be correlated with survival, depending on gender.

Men who embarked at port C have a higher chance of survival, and a lower chance at ports S and Q. Women have a higher probability of surviving at ports Q and S, and the inverse holds at port C.
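A port-by-gender comparison of this kind can be tabulated as follows. This is a sketch assuming pandas; `survival_by_port_and_sex` is a hypothetical helper, and the toy rows are invented, so their rates do not reproduce the pattern described above.

```python
import pandas as pd


def survival_by_port_and_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Survival rate with one row per embarkation port and one column per sex."""
    return df.groupby(["Embarked", "Sex"])["Survived"].mean().unstack("Sex")


# Made-up toy rows, not real passengers.
toy = pd.DataFrame(
    {
        "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q", "Q"],
        "Sex": ["male", "male", "female", "male", "male",
                "female", "female", "female", "male"],
        "Survived": [1, 0, 1, 0, 0, 1, 0, 1, 0],
    }
)

table = survival_by_port_and_sex(toy)
print(table)
```

Rows with a missing Embarked value are left out, since `groupby` skips missing keys by default.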

What is the correlation of Pclass with survival?

Pclass seems to be correlated with survival. I visualize the data to understand it better.

We clearly see that “Class 1” has a higher probability of survival than the other two classes. For a better understanding, I’ll show another Pclass plot.

Our assumption about class 1 holds, and we also see that Pclass 3 has a lower chance of survival.
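The numbers behind a per-class plot can be summarized with a simple group-by. The sketch below assumes pandas; `survival_by_class` is a hypothetical helper and the rows are made-up toy data chosen only to show the shape of the output.

```python
import pandas as pd


def survival_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Survival rate and passenger count for each ticket class."""
    return df.groupby("Pclass")["Survived"].agg(rate="mean", passengers="count")


# Made-up toy rows, not real passengers.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
    }
)

summary = survival_by_class(toy)
print(summary)
```

Reporting the passenger count next to the rate helps show how many rows each rate is based on.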

Visualizing the data in this way reveals the features that have a large impact on the survival rate in the Titanic disaster.

Conclusion

In this article, we looked at which features contribute to the survival rate.

We looked at multiple visualizations to see how different features correlate with survival probability. We also saw that several features have missing values.

We have seen that the Sex, Pclass, and Embarked features improve model performance. I am investigating the other features further and will share the repository link along with the improved model performance.
