The Perils of Human Hubris: Predicting Titanic Survivors using Machine Learning
One of the most infamous shipwrecks in history is the Titanic.
This prestigious launch attracted a crowd of around 100,000 spectators and a throng of excited reporters. The ship was built in Belfast and sailed from that port on April 2, 1912, though the maiden voyage is considered to have begun at Southampton, where the ship rested for six days and was briefly opened for public viewing.
The Titanic, considered “unsinkable”, was a luxury British steamship that set sail on April 10, 1912 with 2,224 passengers and crew on board, only to sink during its maiden voyage. Many of its passengers were emigrating to America. Calamity struck in the early hours of April 15, 1912, off the coast of Newfoundland in the North Atlantic: after colliding with an iceberg, the ship went down in ice-strewn waters, and roughly 1,500 passengers and crew members died, in large part because there were too few lifeboats.
Labeled as one of the worst disasters at sea, this tragic event led to the creation of numerous safety regulations and policies to prevent future catastrophes.
- The ship had three classes of fares. The largest number of passengers traveled in 3rd class, some of whom paid less than $20.
- While luck played some part in who survived, certain groups, such as young children, women, and upper-class passengers, were more likely to survive than others.
- Overall, the survival rate was about 32%.
- The captain went down with the ship while trying to save passengers.
Explore the factors that influenced a person’s likelihood of surviving the Titanic disaster.
Based on passenger data and characteristics such as age, gender, and socio-economic class, build a predictive model that identifies the people most likely to survive, and compare its predictions against the known ground-truth labels to validate model performance.
About the Dataset
The Titanic dataset from Kaggle was used for this analysis. It consists of 891 cases with the following 11 features and 1 response variable (Survived).
Categorical/Nominal Features: Features that have two or more categories.
- survived: Whether the passenger survived, with 0 for did not survive and 1 for survived.
- sex: Gender of the passenger.
- embarked: The port of embarkment; C for Cherbourg, Q for Queenstown and S for Southampton.
Ordinal Features: Features that have an ordering/rank.
- pclass: Ticket class, with 1 for 1st class/upper, 2 for 2nd class/middle and 3 for 3rd class/lower. A proxy for socio-economic status (SES).
Continuous Features: Numerical features that can have infinite values.
- age: Age in years; fractional if less than 1.
- fare: Passenger fare.
Discrete Features: Features that have definite boundaries, are discontinuous.
- passengerid: unique row identifier ranging from 1 to 891.
- sibsp: # of siblings or spouses aboard the Titanic. The dataset defines family relations as
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: # of parents/children of the passenger. The dataset defines family relations as
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
If children travelled only with a nanny then parch=0
Text based features:
- ticket: Alphanumeric ticket number.
- name: Name of the passenger including title, first, last name, nickname.
- cabin: Alphanumeric cabin number.
Data Science Workflow Goals
The workflow addresses the following goals. Broad categories include data wrangling, feature extraction/engineering, feature scaling/selection, and model development. Some of the steps below can be combined or done in parallel:
- Collect: Gather data needed for analysis.
- Classify: Classify or categorize samples, understand the implications or correlation of different classes with our solution goal.
- Correlate: Determine which features (categorical vs numerical) significantly contribute to prediction of the response variable — Survived. This may also help in imputing features. A best practice is to perform feature correlation analysis early in the project and compare the results later against our modeled correlations.
- Convert/Clean/Correct: For modeling, categorical text features need to be converted to numeric equivalents. We may also need to estimate missing values so that the model works with complete data. Outliers may need to be corrected so that they do not skew the results, and features that do not contribute to the analysis can be discarded.
- Create: We may also create new features based on existing features that align with correlation, conversion and completeness goals.
- Chart: We will select apt charts to visualize our data. For data analysis, a best practice is to use multiple plots instead of overlays for readability.
- Contextualize, Conclude and Communicate: Perform in-depth analysis via statistical models, machine learning and algorithms to model, infer, predict, recommend, and visualize. Model development involves evaluation, tuning, assessment, deployment and monitoring.
Data Science Tools
RStudio, R libraries for statistical computing
Exploratory Data Analysis (EDA)
EDA is an approach to analyze datasets with the goal to summarize key characteristics, often visually. A statistical model may be used but the primary purpose is to explore what the data can tell us beyond formal modeling or hypothesis testing.
The data columns are converted to either numeric or factors as applicable.
Summary statistics show the lowest and highest values along with the mean and median for each numeric feature.
We see the following titles in the name feature.
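The article’s analysis was done in R; as an illustrative sketch of the same type-conversion and summary step, here is a Python/pandas equivalent over a small synthetic sample (real work would load Kaggle’s train.csv):

```python
import pandas as pd

# Small synthetic sample standing in for the Kaggle train.csv file
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, None],
    "Fare":     [7.25, 71.28, 7.92, 13.00],
    "Embarked": ["S", "C", "S", "Q"],
})

# Convert categorical columns to pandas categoricals (the analogue of R factors)
for col in ["Survived", "Pclass", "Sex", "Embarked"]:
    df[col] = df[col].astype("category")

# Summary statistics: min, max, mean, and median of the numeric columns
print(df[["Age", "Fare"]].describe().loc[["min", "max", "mean", "50%"]])
```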
Creating New Features
- We may want to create a new feature that combines Parch and SibSp to get the total family-member count on board. We could further collapse this into discrete family-size categories: singletons, small families, and large families.
- We may want to extract the title, first name, and last name from the Name feature for exploratory analysis; however, this column is not needed for our model.
- We may want to convert the continuous age feature to an ordinal categorical feature.
- We may want to create fare ranges.
- We could infer ethnicity based on last name, but this is not needed for the predictive model.
- Based on age, we could create a new feature identifying whether the passenger is a child (<18) or adult (≥18).
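As a sketch, the derived features above could be built as follows (a Python/pandas stand-in for the article’s R code; the bin edges and labels are illustrative assumptions, not the article’s exact cut points):

```python
import pandas as pd

# Synthetic rows with the raw Titanic columns used for feature creation
df = pd.DataFrame({
    "SibSp": [1, 0, 4],
    "Parch": [0, 0, 2],
    "Age":   [22.0, 38.0, 9.0],
    "Fare":  [7.25, 71.28, 31.27],
})

# Total family members on board (including the passenger)
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Collapse into singleton / small / large family categories
df["FamilyGroup"] = pd.cut(df["FamilySize"], bins=[0, 1, 4, 11],
                           labels=["singleton", "small", "large"])

# Ordinal age bands and fare ranges (assumed illustrative edges)
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 18, 35, 60, 80],
                       labels=["child", "young", "middle", "senior"])
df["FareRange"] = pd.qcut(df["Fare"], q=3, labels=["low", "mid", "high"])

# Child (<18) vs adult (>=18) flag
df["IsChild"] = df["Age"] < 18
```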
Correcting & Completing/Imputing Data
- Age: The age column has 177 missing values (NA’s). These were imputed with the column mean of 29.70 so that the overall mean of the distribution is preserved.
- Embarkment: There were two missing values, which were imputed as ‘S’ based on the $80 fare paid by those passengers and their 1st class tickets.
- Ticket and PassengerID: These features do not contribute to the analysis and were removed from the dataset.
- Cabin: This feature has 148 unique values (too many levels) and 687 blank values, making it highly incomplete, so it is dropped.
- Name: This feature is non-standard and does not contribute to predicting survival.
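A minimal sketch of the imputation and column-dropping steps above, again in Python/pandas over synthetic rows (the article’s actual work was done in R):

```python
import pandas as pd

# Synthetic rows standing in for the real dataset
df = pd.DataFrame({
    "Age":      [22.0, None, 38.0, None],
    "Embarked": ["S", "C", None, "Q"],
    "Cabin":    [None, "C85", None, None],
    "Ticket":   ["A/5 21171", "PC 17599", "113803", "373450"],
})

# Age: fill missing values with the column mean so the overall mean is unchanged
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Embarked: the missing ports were imputed as 'S' (based on fare and class)
df["Embarked"] = df["Embarked"].fillna("S")

# Cabin and Ticket: mostly missing / non-predictive, so drop them
df = df.drop(columns=["Cabin", "Ticket"])
```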
What are the titles of the passengers?
Titles such as ‘Dona’, ‘Lady’, ‘the Countess’, ‘Capt’, ‘Col’, ‘Don’, ‘Dr’, ‘Major’, ‘Rev’, ‘Sir’, and ‘Jonkheer’ were consolidated under “Rare Title”. This leaves Mr, Master, Miss, and Mrs. About 65% of the passengers were male vs 35% female, and 667 unique last names were identified.
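The title extraction and consolidation can be sketched as follows (a Python stand-in for the article’s R code; the regex assumes the Kaggle `Last, Title. First` name format):

```python
import pandas as pd

# A few sample names in the Kaggle "Last, Title. First ..." format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Behr, Capt. Karl Howell",
])

# The title sits between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.")[0]

# Consolidate uncommon titles under a single "Rare Title" level
rare = {"Dona", "Lady", "the Countess", "Capt", "Col", "Don",
        "Dr", "Major", "Rev", "Sir", "Jonkheer"}
titles = titles.where(~titles.isin(rare), "Rare Title")
```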
What is the age of passengers by gender?
Most passengers are in the 15–35 age group. The graph shows more females under 10. Above age 15, most of the females are younger than the males, and few passengers are above 65. Overall there were few elderly passengers, the oldest being 80 years old. Passengers in the 28–35 age group had a higher rate of survival.
Did more Males or Females survive?
A total of 577 Males vs 314 Females passengers were aboard. From the graph, we see a higher number of females who survived.
What were the ages of those who died vs those who survived?
The graph shows the distribution of passenger ages for the survived and deceased groups. No strong trend is visible, although the mean age of the survived group is slightly lower than that of the deceased group.
Is there any difference in survival rates based on passenger ticket class?
We see that passengers who paid a higher fare, i.e., 1st class passengers, were more likely to survive.
What is the survival rate for passengers with siblings or spouse?
The graph shows that a large number of deceased passengers had no siblings or spouses traveling with them.
What is the survival rate for families vs sole travelers?
The graph shows that a large number of deceased passengers had no children or parents traveling with them.
How does the fare compare for those who survived vs deceased?
We see that those who survived paid a wide range of fares, overall higher than those who did not survive. Some passengers paid as much as $512, accounting for <1% of the total passengers.
Does the survival rate differ based on embarkment location?
Most passengers boarded at Southampton, England, which was also the primary port, followed by Cherbourg, France, with the fewest boarding at Queenstown, Ireland.
Do families sink or swim together?
A new feature, family size, is created from the number of siblings/spouses and the number of parents/children aboard, plus one for the passenger. While the median family size is 1 and the mean 2, the maximum is 11. Mostly upper-class passengers from larger families survived.
What are the interrelationships between features?
Survival likelihood was lowest for third-class passengers, though chances improved for females. It is alarming that 50% of toddlers and adolescents died. Based on the conditional inference tree diagram below, a plausible explanation is that they came from larger families.
Model Fitting & Assessment
Based on the problem statement, we need a binary classification model. Data in the training set is used to train the model, and the test set is used to measure prediction accuracy. The following machine learning algorithms were compared to choose the best-performing one; 10-fold cross-validation can further be applied. The R programming language was used for model fitting and assessment.
- Random Forest, 84.03% accuracy
- Decision Trees, 83% accuracy
- Lasso/Ridge Regression, 82% accuracy
- Radial Support Vector Machine, 82% accuracy
- Logistic Regression, 81% accuracy
- Linear Support Vector Machine, 81% accuracy
The Random Forest model was selected, as it had the highest accuracy of 84% (lowest misclassification rate in the confusion matrix) for predicting whether a passenger survived.
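A hedged sketch of such a comparison loop, using scikit-learn in place of the article’s R libraries and synthetic data in place of the engineered features (so the accuracies printed here will not match the figures above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the Titanic features
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# 10-fold cross-validated accuracy for each candidate model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```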