Predicting Survival in the Titanic Disaster
The Titanic was a luxury British steamship that sank in the early hours of April 15, 1912, after striking an iceberg, leading to the deaths of more than 1,500 passengers and crew. Visit history.com for details.
In this blog post, I’ll use Logistic Regression on the famous Titanic dataset, downloaded from the DSN Pre-qualification competition on Kaggle, to predict whether a passenger survived the deadly disaster.
Data Dictionary

Variable | Definition | Key
--- | --- | ---
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
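The sibsp and parch notes above together count a passenger’s relatives aboard, and a derived family-size feature is a common addition in Titanic tutorials. The `FamilySize` column below is a hypothetical illustration on toy rows, not part of the original dataset:

```python
import pandas as pd

# Toy rows mirroring the data dictionary: SibSp + Parch + 1 (the passenger
# themselves) gives the size of the travelling family group.
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1
print(toy['FamilySize'].tolist())  # → [2, 1, 6]
```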
Import Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Import Datasets
Titanic_train = pd.read_csv('train.csv')
Titanic_test = pd.read_csv('test.csv')
Titanic_train.head()
Titanic_test.head()
#check the shape of train and test data
print(Titanic_test.shape)
print(Titanic_train.shape)
Exploratory Data Analysis
Let’s explore the data to better understand the features and target.
Missing Data
sns.heatmap(Titanic_train.isnull(),yticklabels=False,cbar=False,cmap='Spectral_r')
Titanic_train['Survived'].value_counts()
#Checking Survived
sns.set_style('darkgrid')
sns.countplot(x='Survived',data=Titanic_train,palette='RdBu_r')
#Checking Survived with Sex
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=Titanic_train,palette='RdBu_r')
#Checking Survived with Pclass
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=Titanic_train,palette='rainbow')
Titanic_train['Age'].hist()
#Survived and SibSp
sns.countplot(x='SibSp',hue='Survived',data=Titanic_train)
Filling Missing values with bfill and ffill
#Check for missing data and find a way of fixing them
Titanic_train.info()
Titanic_train['Age'] = Titanic_train['Age'].bfill() #backward fill (bfill) for Age in train data; Cabin will be dropped and Embarked filled next
Titanic_train['Embarked'] = Titanic_train['Embarked'].bfill()
print(Titanic_train.info())
#Filling missing values in test data using ffill
Titanic_test['Age'] = Titanic_test['Age'].ffill()
Titanic_test['Fare'] = Titanic_test['Fare'].ffill()
print(Titanic_test.info())
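As a quick sanity check of the two strategies used above, here is a minimal sketch of how ffill and bfill behave on a toy Series (the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# ffill copies the last valid value forward; bfill pulls the next valid value back.
print(s.ffill().tolist())  # → [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())  # → [1.0, 4.0, 4.0, 4.0]
```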
Dropping features that are not to be used
Titanic_train_ = Titanic_train.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis = 1) #drop columns that won't help the model, in both train and test data
Titanic_test_ = Titanic_test.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis = 1)
Converting categorical features to dummy variables
sex = pd.get_dummies(Titanic_train_['Sex'],drop_first=True)
embark = pd.get_dummies(Titanic_train_['Embarked'],drop_first=True)
Titanic_train_.drop(['Sex','Embarked'],axis=1,inplace=True)
#Converting test data
sex_test = pd.get_dummies(Titanic_test_['Sex'],drop_first=True)
embark_test = pd.get_dummies(Titanic_test_['Embarked'],drop_first=True)
Titanic_test_.drop(['Sex','Embarked'],axis=1,inplace=True)
Titanic_train_ = pd.concat([Titanic_train_,sex,embark],axis=1)
Titanic_test_ = pd.concat([Titanic_test_,sex_test,embark_test],axis=1)
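To see what get_dummies with drop_first=True produces, here is a small sketch on a toy Embarked-like Series (the values are made up):

```python
import pandas as pd

emb = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')
dummies = pd.get_dummies(emb, drop_first=True)
# drop_first=True drops the first alphabetical level ('C') to avoid the
# dummy-variable trap; 'C' is then encoded implicitly as Q=0, S=0.
print(list(dummies.columns))  # → ['Q', 'S']
```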
Standardize dataset using StandardScaler from sklearn
from sklearn.model_selection import train_test_split
predictors = Titanic_train_.drop(['Survived'], axis = 1)
target = Titanic_train_['Survived']
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size =0.20, random_state = 0)
from sklearn.preprocessing import StandardScaler
standard = StandardScaler()
X_train_s = standard.fit_transform(x_train)
X_val_s = standard.transform(x_val)
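A minimal sketch of what StandardScaler does to a single toy column: each feature is shifted to mean 0 and scaled to unit variance. Note that, as above, the scaler is fitted on the training data only and then reused (transform only) on the validation data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
X_s = scaler.fit_transform(X)
# The column now has mean 0 and standard deviation 1.
print(X_s.round(3).ravel().tolist())
```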
Model Building with Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report
logReg = LogisticRegression()
logReg.fit(X_train_s, y_train)
y_pred = logReg.predict(X_val_s)
print(y_pred)
Model Evaluation
accuracy = round(accuracy_score(y_val, y_pred)*100, 2)
print(accuracy)
cm = confusion_matrix(y_val, y_pred) #rows: actual classes, columns: predicted classes
print(cm)
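Since scikit-learn’s confusion_matrix expects (y_true, y_pred), a toy sketch makes the orientation concrete (the labels below are made up):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]
# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_hat))  # → [[1 1] [1 2]]
print(accuracy_score(y_true, y_hat))  # 3 of 5 correct → 0.6
```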