How to Achieve More Than 98% Accuracy on the Titanic Dataset

Rutvij Bhutaiya · Published in Analytics Vidhya · Nov 6, 2019

The Titanic dataset is a common starting point for machine learning enthusiasts.

To predict passenger survival across classes in the Titanic disaster, I began by searching for the dataset on Kaggle, and decided to combine the Kaggle dataset with Wikipedia data for this study.

The study was done in RStudio.

Before loading the data, please read the ‘Data Cleaning and Edit’ section in the Readme.md carefully.

The structure of the variables in the dataset is as follows:

$ PassengerId       : int  1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 156 287 531 430 23 826 775 922 613 855 ...
$ NameLength : int 23 27 22 29 24 16 23 30 22 22 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ SibSpouse : int 1 1 0 1 0 0 0 3 0 1 ...
$ ParentsChild : int 0 0 0 0 0 0 0 1 2 0 ...
$ TicketNumber : Factor w/ 929 levels "110152","110413",..: 721 817 915 66 650 374 110 542 478 175 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A11",..: 1 108 1 72 1 1 165 1 1 1 ...
$ Age_wiki : num 22 35 26 35 35 22 54 2 26 14 ...
$ Age_Months : int 264 420 312 420 420 264 648 24 312 168 ...
$ HometownCountry : Factor w/ 47 levels "","Argentina",..: 18 47 19 47 18 24 47 41 47 27 ...
$ Boarded : Factor w/ 5 levels "","Belfast","Cherbourg",..: 5 3 5 5 5 4 5 5 5 3 ...
$ Destination : Factor w/ 292 levels "","Aberdeen, South Dakota, US",..: 217 187 186 242 186 186 73 57 250 60 ...
$ DestinationCountry: Factor w/ 10 levels "","Brazil","Canada",..: 3 10 10 10 10 10 10 10 10 10 ...
$ Lifeboat : Factor w/ 21 levels "","1","10","11",..: 1 12 7 21 1 1 1 1 8 1 ...
$ LifeboatSupport : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 1 ...

We began the data analysis with EDA (Exploratory Data Analysis). We also used Tableau, including its chart analysis, to guide the treatment of missing values.
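As an illustration of the missing-value treatment, a simple approach is median imputation for age and mode imputation for the boarding port. This is a hypothetical sketch, not the exact Tableau-guided procedure used in the study; the variable names follow the dataset structure shown above:

```r
# Hypothetical sketch: impute missing ages with the median age
# (the actual study used Tableau chart analysis to guide this step)
TitanicCleanData$Age_wiki[is.na(TitanicCleanData$Age_wiki)] <-
  median(TitanicCleanData$Age_wiki, na.rm = TRUE)

# Replace empty Boarded entries with the most frequent port
port.counts <- table(TitanicCleanData$Boarded[TitanicCleanData$Boarded != ""])
TitanicCleanData$Boarded[TitanicCleanData$Boarded == ""] <-
  names(which.max(port.counts))
```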

The analysis showed that 1st-class passengers had priority for lifeboats: 62% of them survived, while only 24% of 3rd-class passengers survived. The total passenger counts for 1st, 2nd, and 3rd class were 323, 277, and 709 respectively.
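These class-wise survival ratios can be reproduced with a quick aggregation. A sketch, assuming the merged dataset is loaded as TitanicCleanData with the structure shown earlier:

```r
# Survival rate by passenger class, using rows where Survived is known
known <- na.omit(TitanicCleanData[, c("Survived", "Pclass")])
round(tapply(known$Survived, known$Pclass, mean) * 100, 1)

# Total passenger counts per class
table(TitanicCleanData$Pclass)
```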

The best part of the study was the EDA, which covered:

# Treating Missing Values
# Pivot table Analysis
# Fare Variable Analysis for Outliers
# Boxplot Analysis for Outlier - Used Hypothesis
# Other Variables Outliers Study
# Correlation
# Data Normalization
## Finally, we stored the clean data in the TitanicCleanData.csv file.
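Writing out the cleaned dataset for the modelling stage is a one-liner (assuming the cleaned data frame is named TitanicCleanData):

```r
# Store the cleaned dataset for the modelling stage
write.csv(TitanicCleanData, "TitanicCleanData.csv", row.names = FALSE)
```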

For the study, we applied three ML techniques to the same dataset (TitanicCleanData.csv). The complete study was done in R.

Before model building, we separated the data into three parts: 1. Rows with missing values in the Survived variable, for the actual prediction. 2. Training dataset, for model training. 3. Testing dataset, for testing model performance on unseen data.

## Development (Study) dataset and Validation (Test) dataset
Prediction = subset(TitanicCleanData, is.na(TitanicCleanData$Survived))
train = na.omit(TitanicCleanData)
attach(train)

# Make a 70%/30% split into the train1 and test1 datasets
ind = sample(2, nrow(train), replace = TRUE, prob = c(0.7, 0.3))
train1 = train[ind == 1,]
test1 = train[ind == 2,]

First, we applied the Random Forest technique to predict the survival of passengers. We used the ‘Boruta’ feature selection technique and eliminated less significant features from the study.

# Feature selection with Boruta
library(Boruta)
attach(train1)
set.seed(123)
boruta.train <- Boruta(Survived ~ ., data = train1, doTrace = 2)
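Once the Boruta run finishes, the confirmed features can be listed and the less significant ones dropped. A sketch using Boruta’s standard helpers:

```r
# Inspect the decision for each attribute
print(boruta.train)

# Keep only the confirmed (non-tentative) predictors plus the response
selected <- getSelectedAttributes(boruta.train, withTentative = FALSE)
train1 <- train1[, c("Survived", selected)]
```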

We tuned the random forest model and selected the right number of variables and trees based on the error rate, to avoid overfitting.

library(randomForest)
train.tune = tuneRF(x = train1[, -c(1)],
                    y = as.factor(train1$Survived),
                    mtryStart = 6,
                    ntreeTry = 41,
                    stepFactor = 1.2,
                    improve = 0.0001,
                    trace = TRUE,
                    plot = TRUE,
                    nodesize = 20,
                    doBest = TRUE,
                    importance = TRUE)

train.rf = randomForest(as.factor(train1$Survived) ~ .,
                        data = train1,
                        ntree = 31, mtry = 5, nodesize = 20, importance = TRUE)

We then applied the same model, train.rf, to the test dataset to predict which passengers survived the Titanic disaster and to check the model’s performance.

For performance measurement on unseen data, we used three techniques: 1. Confusion Matrix, 2. F1 Score, 3. AUC score.

# Confusion Matrix
library(caret)
library(e1071)

# Predict on the held-out split with the fitted random forest
test1$predict.class = predict(train.rf, newdata = test1)
confusionMatrix(as.factor(test1$Survived), test1$predict.class)

## F1 Score
precision.test1 = precision(as.factor(test1$Survived), test1$predict.class)
# [1] 0.9940828
recall.test1 = recall(as.factor(test1$Survived), test1$predict.class)
# [1] 0.9767442
test1.F1 = (2 * precision.test1 * recall.test1) / sum(precision.test1, recall.test1)
# [1] 0.9853372

## ROC Curve
library(pROC)
roc(test1$Survived, as.numeric(test1$predict.class), plot = TRUE,
    main = 'ROC Curve for test1', col = 'darkseagreen')

Based on the TitanicCleanData.csv file, we also built Logistic Regression and K-Nearest Neighbour models to predict which passengers survived the Titanic disaster.
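The logistic regression model is not shown in this post; a minimal sketch on the same train1/test1 split might look like this (the chosen predictors are illustrative, not the exact feature set used in the study):

```r
# Illustrative logistic regression on the same training split
train.glm <- glm(Survived ~ Pclass + Sex + Age_wiki + Fare,
                 family = binomial(link = "logit"), data = train1)

# Predicted survival probabilities on the held-out split,
# thresholded at 0.5 to obtain class labels
glm.prob  <- predict(train.glm, newdata = test1, type = "response")
glm.class <- ifelse(glm.prob > 0.5, 1, 0)
```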

For project implementation, kindly visit… Github
