Predicting Survival in the Titanic Disaster

Ernest Owojori
Published in devcareers
Sep 12, 2019

The Titanic was a luxury British steamship that sank in the early hours of April 15, 1912, after striking an iceberg, leading to the deaths of more than 1,500 passengers and crew. Visit history.com for details.

In this blog post, I'll use Logistic Regression on the famous Titanic dataset, downloaded from the DSN Pre-qualification competition on Kaggle, to predict whether a passenger survived the deadly disaster or not.

Name of variables

Data Dictionary

Variable | Definition | Key
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

Import Packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Import Datasets

Titanic_train = pd.read_csv('train.csv')
Titanic_test = pd.read_csv('test.csv')

Titanic_train.head()

Titanic_test.head()

#check the shape of train and test data
print(Titanic_test.shape)
print(Titanic_train.shape)

Exploratory Data Analysis

Let’s explore the data to better understand the features and target.

#Missing Data
sns.heatmap(Titanic_train.isnull(), yticklabels=False, cbar=False, cmap='Spectral_r')

As shown in the heatmap, Cabin, Age and Embarked have missing values, in that order of severity.
Beyond the visual aid, 77.1044%, 19.8653% and 0.2245% of the values in Cabin, Age, and Embarked respectively are missing. Cabin might eventually be dropped because so much of it is missing.
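The percentages above can be computed directly rather than read off the heatmap. Here is a minimal sketch of one way to do it, assuming a small made-up frame (toy, not the real Titanic_train) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Titanic_train (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives a True/False frame; the mean of each column is the
# fraction of missing values, so multiplying by 100 gives a percentage
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(4))
```

Swapping toy for Titanic_train should give the figures quoted above, sorted from the most to the least affected column.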

Titanic_train['Survived'].value_counts()

#Checking Survived
sns.set_style('darkgrid')
sns.countplot(x='Survived', data=Titanic_train, palette='RdBu_r')

549 passengers did not survive and 342 survived.

#Checking Survived with Sex
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=Titanic_train, palette='RdBu_r')

Among those who survived, more were female; among those who did not, the greater number were male.

#Checking Survived with Pclass
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=Titanic_train, palette='rainbow')

As the plot shows, far more lower-class passengers died than middle-class or first-class passengers. First-class passengers, however, had slightly more survivors than either of the other two classes.
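Count plots show raw counts, which can be hard to compare when the groups differ in size. A survival rate per group is a useful companion. The sketch below is an assumption on my part rather than something from the original notebook, and it uses a made-up toy frame so it runs without train.csv:

```python
import pandas as pd

# Toy data standing in for Titanic_train (made up for illustration)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
    "Sex": ["female", "male", "female", "male",
            "male", "female", "female", "male"],
    "Pclass": [1, 3, 2, 3, 3, 1, 3, 2],
})

# Survived is 0/1, so its mean within a group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two groupby lines on Titanic_train would give the actual rates behind the plots above.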

Titanic_train['Age'].hist()

Most of the passengers were between 20 and 30 years old, and the counts fall off at older ages. That is, fewer elderly passengers were involved in the disaster.

#Survived and SibSp
sns.countplot(x='SibSp', hue='Survived', data=Titanic_train)

Among passengers with no siblings or spouses aboard, the greater share died. Among those with one, the numbers who died and survived were close (the variation is not too large). For those with 2 to 8 aboard, there was no extreme difference between the numbers that survived and died.

Filling Missing values with bfill and ffill

#Check for missing data and find a way of fixing them
Titanic_train.info()

#Backward fill (bfill) Age and Embarked in the train data; Cabin will be dropped later
#Assigning the result back avoids inplace=True on a single column, which may not update the DataFrame in recent pandas
Titanic_train['Age'] = Titanic_train['Age'].bfill()
Titanic_train['Embarked'] = Titanic_train['Embarked'].bfill()
Titanic_train.info()

Cabin to be dropped

#Filling missing values in test data using ffill
Titanic_test['Age'] = Titanic_test['Age'].ffill()
Titanic_test['Fare'] = Titanic_test['Fare'].ffill()
Titanic_test.info()

Cabin to be dropped too
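One thing worth knowing about these two methods: bfill copies the next valid value backwards and ffill copies the previous valid value forwards, so each can leave a gap at one end of a column. The following is a small sketch on a made-up Series (not the Titanic data) that shows the behaviour:

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps at the start, in the middle and at the end
ages = pd.Series([np.nan, 22.0, np.nan, 30.0, np.nan])

back = ages.bfill()          # the last value stays NaN: nothing after it to copy
fwd = ages.ffill()           # the first value stays NaN: nothing before it to copy
both = ages.bfill().ffill()  # chaining the two closes the gaps at both ends

print(back.tolist())
print(fwd.tolist())
print(both.tolist())
```

Calling info() after filling, as done above, is a quick way to confirm that no missing values remain in the filled columns.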

Dropping features that are not to be used

#Drop Cabin, Name, PassengerId and Ticket from both train and test data
Titanic_train_ = Titanic_train.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
Titanic_test_ = Titanic_test.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)

View of the data after dropping

Converting categorical features to dummy

sex = pd.get_dummies(Titanic_train_['Sex'], drop_first=True)
embark = pd.get_dummies(Titanic_train_['Embarked'], drop_first=True)
Titanic_train_.drop(['Sex', 'Embarked'], axis=1, inplace=True)

#Converting test data
sex_test = pd.get_dummies(Titanic_test_['Sex'], drop_first=True)
embark_test = pd.get_dummies(Titanic_test_['Embarked'], drop_first=True)
Titanic_test_.drop(['Sex', 'Embarked'], axis=1, inplace=True)

Titanic_train_ = pd.concat([Titanic_train_,sex,embark],axis=1)

Titanic_test_ = pd.concat([Titanic_test_,sex_test,embark_test],axis=1)
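Encoding train and test separately works when both contain the same categories, which I believe is the case for this dataset. If a category were missing from one of them, the dummy columns would no longer match. The sketch below uses made-up frames to show one way to guard against that by reindexing the test dummies to the train columns; the reindex step is my own suggestion, not part of the original notebook:

```python
import pandas as pd

# Made-up data: the toy test set has no passenger who embarked at "Q"
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_d = pd.get_dummies(train["Embarked"], drop_first=True)  # columns: Q, S
test_d = pd.get_dummies(test["Embarked"], drop_first=True)    # column: S only

# Give the test dummies the same columns as train, filling absent ones with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_aligned.columns))
```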

Standardize dataset using StandardScaler from sklearn

from sklearn.model_selection import train_test_split

predictors = Titanic_train_.drop(['Survived'], axis=1)
target = Titanic_train_['Survived']
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size =0.20, random_state = 0)
from sklearn.preprocessing import StandardScaler
standard = StandardScaler()

X_train_s = standard.fit_transform(x_train)
X_val_s = standard.transform(x_val)
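Note that the scaler is fitted on the training split only and then reused on the validation split. The sketch below, on made-up numbers, shows that transform applies the mean and standard deviation learned from the training data instead of recomputing them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits
train = np.array([[10.0], [20.0], [30.0]])
val = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_s = scaler.fit_transform(train)  # learns mean and std from train only
val_s = scaler.transform(val)          # reuses the training mean and std

# The same result by hand, using the training statistics
manual = (val - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(val_s, manual))
```

Fitting on the training split alone keeps information about the validation rows out of the preprocessing step.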

Model Building with Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report
logReg = LogisticRegression()
logReg.fit(X_train_s, y_train)
y_pred = logReg.predict(X_val_s)

print(y_pred)

The predicted values
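The 0 and 1 values returned by predict come from a probability under the hood. As a rough sketch on made-up data (not the Titanic features), a binary logistic regression passes a weighted sum of the inputs through the sigmoid function and predicts 1 when the result is above 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data, separable around zero
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

proba = model.predict_proba(X)[:, 1]  # probability of class 1
labels = model.predict(X)             # hard 0/1 predictions

# The same probability by hand: sigmoid of the linear score
z = X @ model.coef_.ravel() + model.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(proba, manual))
print(labels)
```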

Model Evaluation

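accuracy = round(accuracy_score(y_val, y_pred)*100, 2)
print(accuracy)

#True labels go first: rows are actual classes, columns are predicted classes
Cm = confusion_matrix(y_val, y_pred)
print(Cm)
print(classification_report(y_val, y_pred))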

The model reaches 80.45% accuracy on the validation split, with precision, recall, and f1-score of about 80%!
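To see how these metrics relate to the confusion matrix, here is a small sketch on made-up labels (not the model's actual predictions). scikit-learn expects the true labels as the first argument, which matters for the layout of the matrix:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Made-up true labels and predictions for ten passengers
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_hat)    # (tp + tn) / total
precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two

print(cm)
print(accuracy, precision, recall, round(f1, 4))
```

Accuracy is the same whichever way round the arguments go, but swapping them transposes the confusion matrix and exchanges precision with recall.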
