Predicting the Survival of Titanic Disaster

Published in devcareers, Sep 12, 2019 (4 min read)

The Titanic was a luxury British steamship that sank in the early hours of April 15, 1912, after striking an iceberg, leading to the deaths of more than 1,500 passengers and crew. Visit history.com for details.

In this blog post, I'll use logistic regression on the famous Titanic dataset, downloaded from the DSN Pre-qualification competition on Kaggle, to predict whether a passenger survived the deadly disaster.

Name of variables

Data Dictionary (variable: definition; key)
survival: Survival; 0 = No, 1 = Yes
pclass: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
sex: Sex
age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Import Packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Import Datasets

Titanic_train = pd.read_csv('train.csv')
Titanic_test = pd.read_csv('test.csv')

Titanic_train.head()

Titanic_test.head()

# Check the shape of the train and test data
print(Titanic_test.shape)
print(Titanic_train.shape)

Exploratory Data Analysis

Let’s explore the data to better understand the features and target.

# Missing data
sns.heatmap(Titanic_train.isnull(), yticklabels=False, cbar=False, cmap='Spectral_r')

As the heatmap shows, Cabin, Age, and Embarked have missing values, in decreasing order of severity.

Beyond the visual check, 77.1044%, 19.8653%, and 0.2245% of the values in Cabin, Age, and Embarked are missing, respectively. Cabin will likely be dropped because so much of it is missing.
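
The missing-value percentages can be computed directly from the frame. The sketch below wraps the calculation in a helper called `missing_percent` (a hypothetical name, not from the original post) and runs it on a small made-up frame so that it is self-contained; on the real data you would pass `Titanic_train` instead.

```python
import numpy as np
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    # isnull() gives a boolean frame; the column mean is the missing fraction.
    return (df.isnull().mean() * 100).sort_values(ascending=False)


# Toy stand-in for the Titanic training data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

pct = missing_percent(toy)
print(pct.round(2))
```

Sorting the result puts the worst-affected column first, which mirrors the "order of severity" reading of the heatmap.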

Titanic_train['Survived'].value_counts()

# Checking Survived
sns.set_style('darkgrid')
sns.countplot(x='Survived', data=Titanic_train, palette='RdBu_r')

Of the 891 passengers in the training data, 549 did not survive and 342 survived.

# Checking Survived with Sex
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=Titanic_train, palette='RdBu_r')

Among those who survived, females outnumber males, while among those who died, males are the clear majority.

# Checking Survived with Pclass
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=Titanic_train, palette='rainbow')

As the plot shows, far more third-class passengers died than second-class or first-class passengers. First-class passengers, however, survived in somewhat larger numbers than the other two classes.

Titanic_train['Age'].hist()

Most passengers were between 20 and 30 years old, and the number of passengers falls off at older ages. That is, relatively few elderly passengers were caught in the disaster.

# Survived and SibSp
sns.countplot(x='SibSp', hue='Survived', data=Titanic_train)

Among passengers with no siblings or spouses aboard, a greater share died. Among those with one, the numbers who died and who survived are close. For those with two to eight, there is no extreme difference between the numbers who survived and died.
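
Count plots show raw counts, which can be hard to compare when the groups differ in size. Because Survived is coded 0/1, its mean within a group is that group's survival rate. Here is a minimal sketch on made-up rows, using a hypothetical helper named `survival_rate`; on the real data you would pass `Titanic_train`.

```python
import pandas as pd


def survival_rate(df: pd.DataFrame, by: str) -> pd.Series:
    """Share of survivors within each level of column `by`."""
    # Survived is 0/1, so the group mean equals the survival rate.
    return df.groupby(by)["Survived"].mean()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "SibSp": [0, 1, 0, 0, 1, 0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

rate_by_sex = survival_rate(toy, "Sex")
rate_by_sibsp = survival_rate(toy, "SibSp")
print(rate_by_sex)
print(rate_by_sibsp)
```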

Filling Missing values with bfill and ffill

# Check for missing data and find a way of fixing it
Titanic_train.info()

# Backward fill (bfill) fills Age in the train data; Embarked is filled the same way, and Cabin will be dropped
Titanic_train['Age'] = Titanic_train['Age'].bfill()
Titanic_train['Embarked'] = Titanic_train['Embarked'].bfill()
Titanic_train.info()

# Fill missing values in the test data using forward fill (ffill)
Titanic_test['Age'] = Titanic_test['Age'].ffill()
Titanic_test['Fare'] = Titanic_test['Fare'].ffill()
Titanic_test.info()
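
It helps to see what the two fills do at the edges of a column. Backward fill copies the next valid value upward, so a missing value in the last row stays missing. Forward fill copies the previous valid value downward, so a missing value in the first row stays missing. Below is a small sketch on a made-up series; the median fallback at the end is my own addition for illustration and is not part of the workflow in this post.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps at the start, middle, and end.
ages = pd.Series([np.nan, 22.0, np.nan, 38.0, np.nan])

back = ages.bfill()     # next valid value is copied upward; trailing gap remains
forward = ages.ffill()  # previous valid value is copied downward; leading gap remains

# Assumed fallback: fill whatever is left with the median of the observed ages.
filled = back.fillna(ages.median())

print(back.tolist())
print(forward.tolist())
print(filled.tolist())
```

After either fill, calling `.isnull().sum()` on the column is a quick way to confirm that nothing was left behind.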

Dropping features that are not to be used

# Drop Cabin and the other features that are not used, in both the train and test data
Titanic_train_ = Titanic_train.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
Titanic_test_ = Titanic_test.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)

Converting categorical features to dummy

sex = pd.get_dummies(Titanic_train_['Sex'], drop_first=True)
embark = pd.get_dummies(Titanic_train_['Embarked'], drop_first=True)
Titanic_train_.drop(['Sex', 'Embarked'], axis=1, inplace=True)

# Converting the test data
sex_test = pd.get_dummies(Titanic_test_['Sex'], drop_first=True)
embark_test = pd.get_dummies(Titanic_test_['Embarked'], drop_first=True)
Titanic_test_.drop(['Sex', 'Embarked'], axis=1, inplace=True)

Titanic_train_ = pd.concat([Titanic_train_,sex,embark],axis=1)

Titanic_test_ = pd.concat([Titanic_test_,sex_test,embark_test],axis=1)
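
With `drop_first=True`, `get_dummies` drops the first category in sorted order, so Sex becomes a single `male` column and Embarked becomes `Q` and `S`, with `C` as the implied baseline. One thing to watch: if a category is absent from the test set, the test frame ends up with fewer columns than the train frame. The sketch below uses made-up values and aligns the test columns to the train columns with `reindex`. That alignment step is an assumption on my part, not something the code above does.

```python
import pandas as pd

# Made-up values; the test frame deliberately has no "Q".
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_d = pd.get_dummies(train["Embarked"], drop_first=True)
test_d = pd.get_dummies(test["Embarked"], drop_first=True)

print(list(train_d.columns))  # ['Q', 'S']
print(list(test_d.columns))   # ['S']

# Give the test frame the train frame's columns, filling absent ones with 0.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))  # ['Q', 'S']
```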

Standardize dataset using StandardScaler from sklearn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

predictors = Titanic_train_.drop(['Survived'], axis=1)
target = Titanic_train_['Survived']
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size=0.20, random_state=0)

standard = StandardScaler()
X_train_s = standard.fit_transform(x_train)
X_val_s = standard.transform(x_val)
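
The scaler is fitted on the training split only and then reused on the validation split, so the validation data does not leak into the mean and standard deviation. The sketch below uses made-up numbers to check that `StandardScaler` amounts to subtracting the training mean and dividing by the training (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature rows (think Age and Fare), not real Titanic data.
x_train = np.array([[20.0, 7.25], [30.0, 71.28], [40.0, 8.05], [50.0, 53.10]])
x_val = np.array([[25.0, 10.0], [60.0, 80.0]])

scaler = StandardScaler()
x_train_s = scaler.fit_transform(x_train)  # learns mean and std from train only
x_val_s = scaler.transform(x_val)          # reuses the train statistics

# Same computation by hand (np.std defaults to the population std, ddof=0).
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
manual_val = (x_val - mean) / std

print(np.allclose(x_train_s.mean(axis=0), 0.0))
print(np.allclose(x_val_s, manual_val))
```

Note that the scaled validation columns will generally not have mean 0 and standard deviation 1 themselves, since they are scaled with the training statistics.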

Model Building with Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report

logReg = LogisticRegression()
logReg.fit(X_train_s, y_train)
y_pred = logReg.predict(X_val_s)

print(y_pred)
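
As an alternative way to organize the same two steps, scikit-learn's `make_pipeline` can bundle the scaler and the classifier so that scaling is applied consistently at prediction time. This is a sketch on synthetic data from `make_classification`, not the workflow used in this post; the variable names (`X_tr`, `X_va`, `model`, and so on) are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared Titanic features (not real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=0)

# fit() fits the scaler on the training split, then the model on the scaled data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

val_pred = model.predict(X_va)         # scaling is applied automatically
val_proba = model.predict_proba(X_va)  # class probabilities; each row sums to 1
print(val_pred[:10])
```

With this layout, predicting on a new frame is a single `model.predict(...)` call, with no separate `transform` step to remember.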

Model Evaluation

# scikit-learn metrics expect the true labels first, then the predictions
accuracy = round(accuracy_score(y_val, y_pred) * 100, 2)
print(accuracy)

Cm = confusion_matrix(y_val, y_pred)
print(Cm)
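
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy comes out the same either way, but swapping the arguments transposes the confusion matrix, which flips false positives and false negatives. A small check on made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)          # rows = true, columns = predicted
cm_swapped = confusion_matrix(y_hat, y_true)  # arguments swapped

print(cm)          # [[1 2]
                   #  [1 1]]
print(cm_swapped)  # the transpose of cm
print(accuracy_score(y_true, y_hat))  # 0.4
```

In the first matrix, the top-right cell counts true 0s predicted as 1 (false positives), and the bottom-left cell counts true 1s predicted as 0 (false negatives).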

Written by Ernest Owojori