Titanic Survival Prediction

Retty George
6 min read · Feb 26, 2022


When it was first launched in 1912, the British steamer was the largest ship in the world. An incredible 882 feet long and 175 feet high, the Titanic was proclaimed the most expensive, most luxurious ship ever built. It was said to be “unsinkable.” On April 10, 1912, the Titanic set sail for New York. Four days after setting out, disaster struck: the Titanic sank to the bottom of the North Atlantic Ocean.

After 109 years, there are still unanswered questions about the Titanic disaster. While those answers remain elusive, let’s try to predict the survival chances of each passenger using machine learning techniques.

Find the Kaggle notebook link here

Learn more about the problem here

Now let’s start with the coding.

We have two datasets:

Train Dataset: This dataset is used to train ML models (train.csv)
Test Dataset: This dataset is used to test ML models (test.csv)

Column Description

  1. PassengerId : Unique ID number for each passenger
  2. Survived : 1 = survived, 0 = died
  3. Pclass : Passenger class
  4. Name : Name of passenger
  5. Sex : Gender of passenger
  6. Age : Age of passenger
  7. SibSp : Number of siblings/spouses aboard
  8. Parch : Number of parents/children aboard
  9. Ticket : Ticket number
  10. Fare : Ticket price
  11. Cabin : Cabin category
  12. Embarked : Port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

First, let’s load and check both datasets.

Loading the train dataset (train.csv)
Loading the test dataset (test.csv)
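The post shows these steps as notebook screenshots; a minimal equivalent with pandas (assuming the standard Kaggle competition file names) looks like this:

import pandas as pd

# Load the training and test sets from the Kaggle competition files
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

train_data.head()
test_data.head()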

After loading the data, let’s try to understand it.

Checking the different columns in the dataset
Count of unique values in each column
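A sketch of these two checks (the exact notebook cells aren’t reproduced in the post):

# Different columns in the dataset
print(train_data.columns.tolist())

# Count of unique values in each column
print(train_data.nunique())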

The next step is basic data analysis.

We will check how the values in different columns are affecting the values in the survived column.

The Name and Ticket columns have a lot of unique values, so we assume they do not make a huge difference to how the Survived column is predicted.

Pclass vs Survived

From the above data it is evident that passengers with ticket class 1 are the most likely to survive, while passengers with ticket class 3 are the least likely to survive.
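A comparison like this can be produced with a groupby; this is a sketch rather than the notebook’s exact code, and the same pattern works for the Sex, SibSp, Parch, and Embarked columns below:

# Survival rate per ticket class (mean of the 0/1 Survived column)
train_data[['Pclass', 'Survived']].groupby('Pclass', as_index=False).mean()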

Sex vs Survived

Females have a higher chance of survival than males.

SibSp vs Survived

When SibSp is 0, 1, or 2, passengers have a higher chance of survival.

Parch vs Survived

When Parch is 0, 1, 2, or 3, passengers have a higher chance of survival.

Embarked vs Survived

When Embarked is C, passengers have a higher chance of survival than when it is Q or S.

After loading and analyzing the data, I followed the tutorial here, which helped me achieve an accuracy of 0.77511.
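For reference, that baseline is a random forest along these lines (a sketch in the starter tutorial’s style; the feature list and hyperparameters are assumptions, not taken from this post):

from sklearn.ensemble import RandomForestClassifier

# Baseline features; the categorical Sex column is one-hot encoded
features = ['Pclass', 'Sex', 'SibSp', 'Parch']
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
y = train_data['Survived']

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Kaggle submission file
output = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                       'Survived': predictions})
output.to_csv('submission.csv', index=False)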

So my goal was to push the accuracy beyond 0.77511, and after hours of coding and 40+ versions I was able to improve it to 0.78708.

Here is my progress:

  1. The first step was to find and fill the missing values
Missing values

Here, Age, Cabin, and Embarked have missing values. Missing values in Age can be filled with the mean, missing values in Embarked can be filled with C (since passengers who embarked at C had the highest chances of survival), and the Cabin column can be dropped.
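A sketch of these fills with plain pandas operations (the notebook’s exact cells aren’t shown in the post):

# Fill missing ages with the mean age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())

# Fill missing embarkation ports with 'C', the port with the highest survival rate
train_data['Embarked'] = train_data['Embarked'].fillna('C')

# Drop the mostly-empty Cabin column
train_data = train_data.drop(columns=['Cabin'])
test_data = test_data.drop(columns=['Cabin'])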

This helped me to achieve an accuracy of 0.78468

2. The next step was to derive useful information from all the available columns.

From the data analysis it is evident that passengers with SibSp of 0, 1, or 2 and Parch of 0, 1, 2, or 3 are more likely to survive. So we will add a new column that records whether the person is alone (column name: is_Alone), assuming that a person without family members aboard has a better chance of survival than a person travelling with family.
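A sketch of the new column (assuming “alone” means no siblings/spouses and no parents/children aboard):

# is_Alone = 1 when the passenger has no family aboard, else 0
train_data['is_Alone'] = ((train_data['SibSp'] == 0) &
                          (train_data['Parch'] == 0)).astype(int)
test_data['is_Alone'] = ((test_data['SibSp'] == 0) &
                         (test_data['Parch'] == 0)).astype(int)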

After adding is_Alone to the Random Forest Classifier, the accuracy improved to 0.78708.

3. From the data analysis it was also clear that people with more family members aboard are less likely to survive than people with fewer family members.

So we will add a new column named “Size” for the size of the family, which is the sum of the SibSp and Parch columns.
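As a sketch:

# Family size = siblings/spouses + parents/children aboard
train_data['Size'] = train_data['SibSp'] + train_data['Parch']
test_data['Size'] = test_data['SibSp'] + test_data['Parch']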

After adding the Size feature to the random forest classifier, we got an accuracy of 0.77751, which is higher than the initial accuracy but not the best one.

4. We also tried dividing Fare into 5 different ranges and feeding this fare range into the random forest classifier.

# Label each fare with its range: 0 = most expensive, 4 = cheapest
train_data['Fare_range'] = [0 if 39.688 < x <= 512.329 else
                            1 if 21.679 < x <= 39.688 else
                            2 if 10.5 < x <= 21.679 else
                            3 if 7.854 < x <= 10.5 else 4
                            for x in train_data['Fare']]
train_data.head()

test_data['Fare_range'] = [0 if 39.688 < x <= 512.329 else
                           1 if 21.679 < x <= 39.688 else
                           2 if 10.5 < x <= 21.679 else
                           3 if 7.854 < x <= 10.5 else 4
                           for x in test_data['Fare']]
test_data.head()
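The cut points above (7.854, 10.5, 21.679, 39.688, 512.329) appear to be fare quintile edges, so an equivalent and less error-prone way to get the same bins (with labels running cheapest to most expensive, the reverse of the hand-written version) would be:

# Quintile-based fare bins; labels 0 (cheapest) .. 4 (most expensive)
train_data['Fare_range'] = pd.qcut(train_data['Fare'], 5, labels=False)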

After including Fare_range in the Random Forest Classifier, we got an accuracy of 0.77900.

5. We also tried dividing Age into 5 different ranges and feeding this age range into the random forest classifier.
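The post doesn’t show this cell, but a sketch along the same lines (equal-width bins via pandas; the exact cut points are an assumption) would be:

# Bin Age into 5 equal-width ranges, labelled 0..4
train_data['Age_range'] = pd.cut(train_data['Age'], 5, labels=False)
test_data['Age_range'] = pd.cut(test_data['Age'], 5, labels=False)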

After including the age range in the Random Forest Classifier, we got an accuracy of 0.78229.

My Best Score

Contribution

The initial model was trained on the whole dataset, so the first step was to understand the data, identify and fill the missing values, and remove the unwanted columns. The Age and Embarked columns had missing values: the missing Age values were filled with the mean, the missing Embarked values were filled with C (since passengers who embarked at C had the highest chances of survival), and the Cabin column was dropped.

I spent hours establishing the relationships between the different columns, tried to create valuable information out of each column, and applied it to the random forest classifier to get better accuracy. Only the changes that resulted in better accuracy than the initial score are included in this blog. The final accuracy was reached after hours of coding and 40+ versions and submissions.

I classified Age and Fare into 5 different ranges and included them in the random forest classifier, which helped increase the accuracy over the initial score.
