Would you survive the Titanic?

Ayan Banerjee
5 min read · Oct 16, 2019


This is my guide to the famous Kaggle Titanic competition. Here's how my outcome looks:

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable," sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The aim is to build a predictive model using passenger data (i.e. name, age, gender, socio-economic class, etc.) to predict whether a particular passenger with the given details was likely to survive or not.

Data Understanding

Survived is the Binary Dependent Variable in this context

A first look at the data suggests that there are not a huge number of columns, but some of them contain missing values and outliers.

Variable Analysis

Correlation Plot with Heatmap
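
A minimal sketch of how such a correlation heatmap can be produced with pandas and seaborn; the column list and styling here are assumptions, not necessarily the exact plot shown above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')  # Kaggle Titanic training file; path assumed

# Correlations between the numeric columns, including the target
corr = df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation of numeric variables')
plt.show()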

  • Survived, the dependent variable, has a strong negative correlation with Pclass
  • Survived is positively correlated with Fare
  • Fare and Pclass are negatively correlated, which explains the two points above
  • Pclass also has a negative correlation with Age, suggesting that older passengers preferred 1st class
  • Age is inversely related to Siblings/Spouse: younger passengers travelled with more siblings/spouses
  • Parents/Children is also directly correlated with Siblings/Spouse, suggesting passengers preferred travelling with their whole family

  • Pclass vs Age: we can see that most of the older passengers travelled in 1st class
  • There is no clear distinction in survival based on the port of embarkation
  • There is, however, quite a clear distinction in survival if the passenger travelled in 1st class

Data Preparation

Note that data preparation is an important step, sometimes even more important than modeling: proper data manipulation can quickly turn a bad model into a good one.

Missing Values

Finding missing values, and deciding how to handle them, is an important part of feature engineering. In this case, let's see the number of missing values in the data:

Age          256
Fare           1
Cabin       1006
Embarked       2
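
A quick sketch of how these counts can be obtained with pandas (the exact numbers above come from the combined dataset used here):

import pandas as pd

df = pd.read_csv('train.csv')  # Kaggle Titanic training file; path assumed

# Number of missing values per column, largest first
missing = df.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))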

Age : I decided to impute it using a Random Forest Regressor. This builds a mini model that predicts Age, keeping the other variables as independent variables.

from sklearn.ensemble import RandomForestRegressor
# Regressor used to predict the missing Age values from the other features
impute_age = RandomForestRegressor(n_estimators=1000)
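
A sketch of how this mini model could be wired up; the predictor columns below are an assumption, not necessarily the exact feature set used here:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('train.csv')  # path assumed

# Numeric predictors for the Age mini model (assumed set)
features = ['Pclass', 'SibSp', 'Parch', 'Fare']
known = df[df['Age'].notnull()]
unknown = df[df['Age'].isnull()]

impute_age = RandomForestRegressor(n_estimators=1000, random_state=42)
impute_age.fit(known[features].fillna(0), known['Age'])

# Predict the missing ages and write them back into the DataFrame
df.loc[df['Age'].isnull(), 'Age'] = impute_age.predict(unknown[features].fillna(0))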

Fare : Only one value is missing here. I chose to impute it with the mean Fare, conditioned on related factors such as Sex, Pclass and Embarked.
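
A sketch of that kind of conditional imputation, assuming df is the DataFrame from the earlier snippets (the grouping columns are an assumption based on the description above):

# Mean Fare of passengers sharing the same Pclass, Sex and Embarked
group_mean_fare = df.groupby(['Pclass', 'Sex', 'Embarked'])['Fare'].transform('mean')
df['Fare'] = df['Fare'].fillna(group_mean_fare)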

Cabin : This is a tricky field. It has a lot of missing values and it contains strings. I extracted a CabinID from it and created a new category (X) for all the missing values.

X    1006
C      94
B      63
D      46
E      41
A      22
F      21
G       5
T       1
Name: CabinID, dtype: int64
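
A sketch of that extraction: the first letter of Cabin serves as the CabinID, with 'X' for passengers whose cabin is unknown (assuming df as before):

# First letter of the cabin string as a coarse deck indicator,
# 'X' for all missing cabins
df['CabinID'] = df['Cabin'].fillna('X').str[0]
print(df['CabinID'].value_counts())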

Embarked : I used the most common port of embarkation.
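
A one-line sketch of that fill, again assuming df from above:

# Replace the two missing ports with the most frequent one
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])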

Feature Engineering

Feature Engineering is the process of deriving new variables from existing ones. The extraction of CabinID above is a good example of feature engineering. Some of the engineered variables are listed below:

Title from Name

df['Title'] = df.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

Familia (Total Family) = Parents + Children + Siblings + Spouse (a short sketch follows this list)

Cabin ID
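
Title and CabinID were sketched above; Familia is a simple sum of the two family-size columns. A sketch (whether to also count the passenger themselves is a modelling choice; here the raw sum is used):

# Total family members aboard = siblings/spouses + parents/children
df['Familia'] = df['SibSp'] + df['Parch']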

Model Building

I have chosen multiple classifier models in order to compare how they perform. Below are some of the outputs:
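
A sketch of how such a comparison could be run, assuming the prepared feature matrix is in X and the Survived labels in y (the split, preprocessing and model settings below are assumptions, not the exact setup behind the scores that follow):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier

# X: prepared feature matrix, y: the Survived labels (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(n_estimators=500, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss'),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name)
    print('Accuracy score  :', round(accuracy_score(y_test, pred) * 100, 3))
    print('Precision score :', round(precision_score(y_test, pred) * 100, 3))
    print('Recall score    :', round(recall_score(y_test, pred) * 100, 3))
    print('F1 score        :', round(f1_score(y_test, pred) * 100, 3))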

Logistic Regression

Accuracy score  : 83.051
Precision score : 80.0
Recall score    : 82.051
F1 score        : 81.013

Naive Bayes

Accuracy score  : 83.051
Precision score : 76.667
Recall score    : 88.462
F1 score        : 82.143

Random Forest

Accuracy score  : 84.181
Precision score : 85.714
Recall score    : 76.923
F1 score        : 81.081

eXtreme Gradient Boosting (XGBoost)

Accuracy score  : 87.006
Precision score : 88.732
Recall score    : 80.769
F1 score        : 84.564

Thus, so far, XGBoost seems to be the best bet for accurately predicting the outcome variable 'Survived'.

Feature Importance of the XGB Classifier

You may use Grid Search to fine-tune the hyper-parameters and get a better model.
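
A sketch of what that tuning might look like with scikit-learn's GridSearchCV; the parameter grid here is purely illustrative, not the exact search space behind the optimized scores below:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative search space (an assumption)
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
}

grid = GridSearchCV(XGBClassifier(eval_metric='logloss'), param_grid, scoring='f1', cv=5)
grid.fit(X_train, y_train)  # X_train, y_train from the earlier sketch

print(grid.best_params_)
print(grid.best_score_)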

Optimized Model
------
Final accuracy score on the testing data: 0.8701
Final F-score on the testing data: 0.8702

Summary and Conclusion

Interesting Finds

  • One of the key findings in the feature engineering part is the name length. The length of a passenger's name turns out to be correlated with the 'Survived' variable.
  • Sex is one of the most important variables, but only for Random Forest. This matches the fact that women and children were given preference.
  • Having a family aboard (siblings/spouse/parents/children) does not seem to have any significant impact on the outcome variable.

Conclusion

Using the XGBoost algorithm yields better results here; however, simple models can yield good results as well. It all depends on the features (variables) selected and on tuning the hyper-parameters. In earlier attempts I have also had strong models with Decision Trees and Logistic Regression. Feel free to use your own algorithm and explore. That's the key to building better models: exploration! Happy Coding :)
