Titanic Dataset Analysis (~80% accuracy)

Aishani Basu
11 min read · Nov 25, 2021


Introduction

The Titanic or, in full, RMS Titanic was at the centre of one of the most iconic tragedies of all time. RMS Titanic was a British passenger ship that hit an iceberg on its voyage from Southampton to New York City and sank in the North Atlantic Ocean, leaving hundreds of passengers to die in the aftermath of the deadly incident. Some of the passengers who stayed afloat until help arrived were rescued, while many others lost their lives helplessly waiting.

The legendary Kaggle problem, Titanic, is based on the tragic sinking of the RMS Titanic. It records data about 891 passengers of the ship, and we are required to predict whether each of them survived, based on the information available about the passengers and the outcome after the sinking.

Note: This notebook is my analysis of the Titanic dataset, aimed at drawing meaningful insights from the data, and it scores an accuracy of ~80 percent (top 5 percent of ~14k entries on Kaggle).

Let’s get started!

Contents:

1. About the data

2. Problem definition and metrics

3. EDA

4. Baseline model performance

5. Model Selection

6. Results and conclusion

About the data:

First, we have the target variable: whether the passenger survived the sinking or not. Then there are numeric variables such as passenger IDs and ages, categorical variables such as the ticket class, and string variables such as the name.

Example RMS Titanic ticket : Source

Printing the first 5 rows of the given dataset:
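Loading the data and printing the first rows can be done with pandas. This is a minimal sketch, assuming the standard Kaggle train.csv/test.csv file names.

```python
import pandas as pd

# Standard Kaggle file names assumed; adjust the paths if they differ
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)   # (891, 12) for the training split described above
print(train.head())  # PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
```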

Problem definition and metrics

The problem is a binary classification problem. We can use binary cross-entropy (logistic loss) as the loss function, and metrics such as accuracy and/or the ROC AUC score to evaluate the results.

EDA

Correlation between features:
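A correlation heatmap like the one discussed here can be produced with pandas and seaborn. A minimal sketch, assuming the `train` DataFrame loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation over the numeric columns only
corr = train.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features")
plt.show()
```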

Drawing insights from the correlation between features:

  • Fare and Pclass are negatively correlated (Pearson’s correlation coefficient of -0.55); naturally, higher fares imply better ticket classes (lower class numbers) and vice versa.
  • Pclass and the target feature are moderately negatively correlated (-0.34), implying that the better the ticket class, the higher the chance of survival.
  • A similar pattern can be observed with the features Parch, SibSp and Age; Parch and SibSp are only very slightly correlated with the other features, but both have a negative correlation with Age (-0.19 and -0.31 respectively): the lower the age, the more family members accompanying the passenger.
  • SibSp and Parch are positively correlated, since both indicate the number of family members accompanying a passenger.
  • Pclass and Age are negatively correlated (-0.37), implying that the higher the age, the better the ticket class.

Missing data

Age: contains 177 NaN values out of 891 entries. Imputing with the median gave the best results.

Embarked: contains 2 NaN values. Imputed with the mode of the existing data.

Cabin: 687 out of 891 entries are NaNs, i.e. more than 50 percent of the values are missing, so it is better to derive other features from this column (see Feature Engineering below).
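A minimal sketch of the imputation described above (median for Age, mode for Embarked); Cabin is handled later through derived features.

```python
# Impute Age with the median and Embarked with the mode of the existing data
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
```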

Now, printing the correlation heatmap after handling the missing data and converting the categorical strings to numeric encodings (0, 1, 2, …):

  • Embarked also has a slight correlation with the target variable (-0.17): the port at which a passenger boarded the ship did affect their chance of survival.
  • Sex is highly correlated with the target variable (-0.54), indicating that a passenger’s gender had a strong effect on their chance of survival.
  • Embarked and Fare are negatively correlated (-0.22); naturally, the fare depends on the port at which a passenger boards the ship.
  • Embarked and Sex also seem slightly correlated (0.11), indicating that the boarding port varied somewhat with gender; Embarked and Pclass are also correlated (0.16), indicating that a 1st class passenger probably boarded at a different port than a 3rd class passenger.
  • Sex and Fare, Sex and Parch, Sex and SibSp, and Sex and Pclass all seem slightly correlated (-0.18, -0.25, -0.11 and 0.13 respectively); i.e. the fare a passenger paid, the family aboard, and the class of the ticket all seem slightly dependent on the passenger’s gender.

Question 1: How do the features ‘Age’, ‘Sex’, ‘Fare’, ‘Embarked’ affect the chance of a passenger’s survival?

Age :

  • Both distributions, of passengers who did not survive and of those who did, are roughly normal, with a spike around age 30 for passengers who did not survive, i.e. people around the age of 30 had a higher chance of not surviving.
  • Ages above ~51 are outliers in the distribution of people who did not survive, simply because very few passengers were older than ~51; ages above ~55 are outliers in the distribution of people who did survive, i.e. very few people above ~55 actually survived.

Sex:

  • From the stacked plots above, it is clear that a larger percentage of female passengers survived (74.2%), while only 18.9% of male passengers survived, even though more male passengers were aboard (see the groupby sketch below).
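The survival percentages by gender can be checked with a simple groupby; a sketch assuming the `train` DataFrame from earlier.

```python
# Survival rate and passenger count per gender
print(train.groupby("Sex")["Survived"].agg(["mean", "count"]))
# Expected pattern from the plots above: ~74% of women vs ~19% of men survived
```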

Fare :

  • The distribution of fares of survivors clearly has a higher median value, hence passengers who survived had paid higher fares on average.
  • There are lots of outliers in both distributions.
  • There is a huge spike in the distribution of people who did not survive around a fare of ~10 units of currency, i.e. the probability that a person who did not survive had paid a fare of ~10 is very high.

Embarked:

  • Most passengers boarded the ship at Southampton, and only 33 percent of them survived; a similar pattern was observed for the passengers who boarded at Queenstown.
  • However, of the passengers who boarded at Cherbourg, a higher percentage (55.4%) survived; thus, a passenger who boarded at Cherbourg had a higher chance of survival.

Question 2: How do we determine the other features that affected a passenger’s chance of survival?

Feature Engineering

Three kinds of visualizations are used, primarily box plots and density plots (a plotting sketch follows the list):

  1. Boxplots: show the spread of a feature’s distribution (1st quartile, median, 3rd quartile) for each output class and give an idea of the outliers in the data.
  2. Density plots: show the shape of a feature’s distribution (whether it resembles a normal or some other distribution).
  3. Stacked countplots: show the count of each category of a feature, along with the percentage belonging to each output class.
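As an illustration of these three plot types, here is a minimal seaborn sketch for the Age and Sex features; a stacked histogram stands in for the stacked countplot.

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.boxplot(data=train, x="Survived", y="Age", ax=axes[0])                        # spread and outliers per class
sns.kdeplot(data=train, x="Age", hue="Survived", common_norm=False, ax=axes[1])   # shape of the distribution
sns.histplot(data=train, x="Sex", hue="Survived", multiple="stack", ax=axes[2])   # stacked counts per category
plt.tight_layout()
plt.show()
```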

SibSp and Parch

Since both features are highly positively correlated (0.41), we can derive a single feature indicative of the number of family members accompanying the passenger by adding the number of siblings/spouses (SibSp) and the number of parents/children (Parch) aboard, as sketched below.
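A minimal sketch of this derived feature; the column name `family` is illustrative and may differ from the one in the notebook.

```python
# Total number of family members accompanying the passenger
train["family"] = train["SibSp"] + train["Parch"]
```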

  • From the boxplot, those who survived had a median of around 1 family member, while those who did not survive had fewer or no family members.
  • From the probability distribution, both distributions resemble normal distributions with a slight right skew.
  • The fewer the family members accompanying a passenger, the higher the chance that the passenger did not survive.

Name

Since the name of a passenger cannot be used directly as a feature, we need to derive features from it. Multiple features can be derived from each passenger’s name (a sketch follows the list), such as:

  • Length of the name
  • Title in the name (Mr/Mrs/etc.)
  • Number of words (discarded since it seemed strongly correlated with the length of the name)
  • First letter of the surname
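A sketch of how these name-derived features might be computed; the regular expression and the grouping of rare titles are assumptions, not necessarily identical to the notebook.

```python
# Length of the full name
train["name_length"] = train["Name"].str.len()

# Title between the comma and the period, e.g. "Braund, Mr. Owen Harris" -> "Mr"
train["name_title"] = train["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
rare = ~train["name_title"].isin(["Mr", "Mrs", "Miss", "Master"])
train.loc[rare, "name_title"] = "Other"     # fifth class for all minority titles

# First letter of the surname (the part before the comma)
train["surname_letter"] = train["Name"].str.split(",").str[0].str[0]
```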

Name length

  • Passengers who did not survive tend to have names with a median length of ~25 letters, while those who did survive have names with a median length of ~30 letters; there is a higher chance that a passenger who did not survive had a shorter name (~25 letters).

Name title

  • There are 5 classes of name_title: ‘Miss’, ‘Mrs’, ‘Master’, ‘Mr’, and a fifth class containing all the remaining minority titles such as ‘Dr’ and ‘Rev’.
  • A steep spike in the probability distribution of passengers who did not survive for the ‘Master’ class implies that passengers with the title ‘Master’, i.e. young boys, had a lower chance of surviving.

Letter of surname

  • The most common surnames started with the following letters:
Maximum occurrence of letters at the beginning of surnames: [('S', 86), ('M', 74), ('B', 72), ('C', 69), ('H', 69)]
  • These are treated as Classes 1 to 5 for the surname feature, while the remaining surnames form Class 6.
  • A similar probability distribution is observed for both groups of passengers, those who survived and those who didn’t, with a spike for passengers whose surname begins with ‘H’.

Cabin

Due to the presence of a huge number of NaN values, we derive other features from this column (sketched after the list):

  • The letter the cabin number starts with (the location of the cabin may determine a passenger’s survival, since it determines their distance from the lifeboats)
  • Whether the cabin value is NaN or not (in case the data being missing signifies a lower chance of survival or a correlation with some other factor)
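A minimal sketch of the two Cabin-derived features; the column names are illustrative.

```python
# Deck letter, i.e. the first character of the cabin value; NaNs get a placeholder
train["cabin_letter"] = train["Cabin"].str[0].fillna("Unknown")

# Binary flag marking whether the Cabin value is missing
train["cabin_nan"] = train["Cabin"].isna().astype(int)
```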

Cabin number binning

  • Cabin is binned into 6 classes: cabins starting with ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, and a 6th class with the remaining cabin numbers.
  • A similar pattern is observed for both probability distributions, with a spike for cabin numbers starting with ‘E’.

Cabin nans

Inspecting random rows from the dataset, we can see that many 3rd class passengers have NaNs in their Cabin values.

Random rows from the data with ‘Cabin’ as NaN for ‘Pclass’ value 3

  • A spike is seen in the distribution of passengers who did not survive among those with NaN cabin values; there is a higher chance that a passenger did not survive if their Cabin value was NaN.
  • 1st class passengers were put up in cabins on decks A, B, C, D and E, 2nd class passengers primarily in cabins on decks D and E, while the lower decks had cabins for 3rd class passengers.

Ticket:

  • The ticket number refers to the value printed on the ticket (as in the image above), provided by the booking agent, which could signify the date of booking. (This did not produce good results, so it was not used.)
  • Tickets were binned based on the contents of the ticket string: first by whether they contain only digits, then by the first letter of each string; due to better performance, two bins were kept, one with ticket numbers containing only digits and one with both digits and letters (see the sketch after this list).
  • We check the significance of the first letters in the ticket numbers and bin on that basis.
  • A similar pattern is observed in both probability distributions, with similar averages, although the distribution is slightly steeper for passengers who did not survive.
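A sketch of the two-bin version that was kept (digits-only tickets vs. tickets containing letters); the column name is illustrative.

```python
# 1 if the ticket string contains anything other than digits, 0 if it is purely numeric
train["ticket_has_letters"] = (~train["Ticket"].str.isdigit()).astype(int)
```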

Feature Encoding:

I have used one-hot encoding via Pandas’ get_dummies() function. For multiple categories in the data, pd.get_dummies() creates a column for each label with values of either 0 or 1. This avoids label encoding, which produces a single ordinal feature (0, 1, 2, 3, … up to the number of classes) that might mislead the model into assuming an ordering. A sketch follows.
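A minimal sketch of the one-hot encoding step; the list of categorical columns is an assumption based on the features derived above.

```python
import pandas as pd

categorical_cols = ["Sex", "Embarked", "name_title", "cabin_letter"]
train_encoded = pd.get_dummies(train, columns=categorical_cols)
print(train_encoded.columns.tolist())
```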

Correlation between features after feature engineering

Combining highly correlated features

We combine highly correlated features to create single new features, namely:

  • Sex and name_title
  • P_class, fare_bin, and cabin_bin

Imbalanced dataset

‘Survived’ feature countplot

  • The data is clearly imbalanced: the count of people who survived is close to half the count of people who didn’t.
  • We could use stratified sampling, oversampling, SMOTE or another technique to handle the imbalance, but I have stuck with stratified sampling, with reference to a few notebooks referenced here. A sketch of the stratified split follows.
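A sketch of a stratified train/validation split with scikit-learn; the dropped columns and the split ratio are assumptions.

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the Survived class ratio identical in both splits
X = train_encoded.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")
y = train_encoded["Survived"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```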

Feature importances

Training an ExtraTreesClassifier to view the most important features from X_train (sketched below).
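A minimal sketch, assuming the X_train/y_train split from above; the number of trees is illustrative.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Rank features by impurity-based importance
et = ExtraTreesClassifier(n_estimators=200, random_state=42)
et.fit(X_train, y_train)

importances = pd.Series(et.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```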

Baseline Model

For a baseline model, we use an optimized Random Forest, since it is extremely robust: the algorithm creates decorrelated trees, which helps keep the model from overfitting to a few features.
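A sketch of fitting such a baseline and visualizing one of its trees; the hyperparameters here are placeholders, not the tuned values from the notebook.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42)
rf.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, rf.predict(X_val)))

# Draw one of the trees from the forest (truncated to the top levels for readability)
plt.figure(figsize=(20, 10))
tree.plot_tree(rf.estimators_[0], feature_names=list(X_train.columns), filled=True, max_depth=3)
plt.show()
```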

Following is a tree from the trained Random Forest model:

A single decision tree from the trained Random Forest model

At the top we have our first split, on the feature ‘Sex_name_title’, seemingly the most important feature for telling us about a passenger’s chance of survival. The Gini impurity at this root node is 0.469, and splitting X_train on this feature gives the largest reduction in impurity, i.e. the highest information gain.

‘cabin_nan’ and ‘Age’ are the features used for the next splits, having the next-best information gain values.

Zoomed in image of above

We use the hyperopt library, which implements Bayesian optimization, to obtain optimized hyperparameters for the models.

The term Bayesian refers to a probability rule that combines a prior, which is the probability distribution of an event before observations are made, determined from past knowledge, with a likelihood function, which provides information about the probability distribution of the event from the observations recorded.

Bayesian optimization is an algorithm for finding the global minimum of a function using this prior. In simple words, unlike grid search and random search, which pick candidate points without using the results of previous evaluations, Bayesian optimization uses the information from prior evaluations to decide which point to try next, which makes it more efficient at finding the minimum error point in the optimization problem. This is a great blog explaining Bayesian optimization in very simple terms. A hyperopt sketch follows.
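A minimal hyperopt sketch for tuning a Random Forest; the search space and the number of evaluations are illustrative, not the values used in the notebook.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative search space over a few Random Forest hyperparameters
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 600, 50),
    "max_depth": hp.quniform("max_depth", 3, 12, 1),
    "min_samples_split": hp.quniform("min_samples_split", 2, 10, 1),
}

def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        min_samples_split=int(params["min_samples_split"]),
        random_state=42,
    )
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    return {"loss": -score, "status": STATUS_OK}   # hyperopt minimizes, so negate accuracy

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```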

Model selection

We experiment with simple binary classification models like Logistic Regression, SVM and KNN, followed by tree-based classifiers like Random Forest and XGBoost. A quick comparison sketch follows.
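A quick cross-validation comparison sketch with untuned defaults (XGBoost is omitted here since it needs the separate xgboost package); the notebook tunes each model with hyperopt before comparing.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```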

The plot above presents the results of classification: SVC and Random Forest predicted with the fewest errors among these classifiers, while XGBoost’s predictions had the most errors.

Results and conclusion

  • We get a final accuracy score of 0.8334 on the test set using Random Forest and KNN, and a score of 0.855 using SVC.

Thanks for reading!

Also, if there’s any suggestion to improve this article, please let me know!

Here is the link to the Kaggle notebook for reference.

