Titanic Survival Prediction
When it was launched in 1912, the British steamer was the largest ship in the world. At an incredible 882 feet long and 175 feet high, the Titanic was proclaimed the most expensive, most luxurious ship ever built. It was said to be "unsinkable." On April 10, 1912, the Titanic set sail for New York. Four days into the voyage, disaster struck: the Titanic sank to the bottom of the North Atlantic Ocean.
After 109 years, there are still unanswered questions about the Titanic disaster. While those remain unresolved, let's try to predict the survival chances of each passenger using machine learning techniques.
Find the Kaggle notebook link here
Learn more about the problem here
Now let's start with the coding.
We have two data sets
Train Dataset: This dataset is used to train ML models (train.csv)
Test Dataset: This dataset is used to test ML models (test.csv)
Column Description
- PassengerId : Unique ID number for each passenger
- Survived : 1-survived and 0-died
- Pclass : Passenger class
- Name : Name of passenger
- Sex : Gender of passenger
- Age : Age of passenger
- SibSp : Number of siblings/spouses aboard
- Parch : Number of parents/children aboard
- Ticket : Ticket number
- Fare : Ticket Price
- Cabin : Cabin category
- Embarked : Port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
First, let's load and check both datasets.
After loading the data, let's try to understand it.
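The loading and first-look step could be sketched as follows. In the notebook this would read the real train.csv and test.csv with `pd.read_csv`; here a tiny inline CSV snippet (with made-up rows) stands in for the file so the sketch is self-contained.

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("train.csv"); the rows below are illustrative only.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked\n"
    "1,0,3,male,22,1,0,7.25,S\n"
    "2,1,1,female,38,1,0,71.2833,C\n"
    "3,1,3,female,26,0,0,7.925,S\n"
)
train_data = pd.read_csv(sample_csv)

print(train_data.head())          # first rows of the dataset
print(train_data.shape)           # (number of rows, number of columns)
print(train_data.isnull().sum())  # count of missing values per column
```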
The next step is basic data analysis.
We will check how the values in different columns are affecting the values in the survived column.
The Name and Ticket columns have a lot of unique values, so we assume they do not make a meaningful difference to whether a passenger survived.
Pclass vs Survived
From the data above, it is evident that passengers in ticket class 1 are most likely to survive, while passengers in ticket class 3 are least likely to survive.
Sex vs Survived
Females have a higher chance of survival than males.
SibSp vs Survived
When SibSp is 0, 1, or 2, passengers have a higher chance of survival.
Parch vs Survived
When Parch is 0, 1, 2, or 3, passengers have a higher chance of survival.
Embarked vs Survived
When Embarked is C, passengers have a higher chance of survival than with Q or S.
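Each of the comparisons above boils down to grouping the data by a column and averaging Survived, since the mean of a 0/1 column is the survival rate for that group. A minimal sketch on a hypothetical mini-sample (the real notebook would run this on the full train.csv):

```python
import pandas as pd

# Toy stand-in for train.csv; values are illustrative only.
train_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
})

# Mean of Survived per group = survival rate for that group.
for col in ["Pclass", "Sex", "Embarked"]:
    print(train_data.groupby(col)["Survived"].mean())
```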
After loading and analyzing the data, I followed the tutorial here, which helped me achieve an accuracy of 0.77511.
So my goal was to improve on 0.77511, and after hours of coding and 40+ versions I was able to raise the accuracy to 0.78708.
Here is my progress:
1. The first step was to find and fill missing values.
Here, Age, Cabin, and Embarked have missing values. Missing values in Age can be filled with the mean, missing values in Embarked can be filled with C (since passengers who embarked at C have the highest survival rate), and the Cabin column can be dropped.
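This missing-value handling could look like the sketch below; the inline frame is a toy stand-in for the real train.csv, with the same kinds of gaps (Age, Embarked, Cabin).

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same missing-value pattern as train.csv.
train_data = pd.DataFrame({
    "Age":      [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin":    [None, "C85", None, "C123"],
})

# Fill Age with the column mean, Embarked with "C", and drop Cabin entirely.
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].mean())
train_data["Embarked"] = train_data["Embarked"].fillna("C")
train_data = train_data.drop(columns=["Cabin"])
```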
This helped me achieve an accuracy of 0.78468.
2. The next step was to derive useful information from the available columns.
From the data analysis it is evident that passengers with SibSp of 0, 1, or 2 and Parch of 0, 1, 2, or 3 are more likely to survive. So we add a new column that checks whether the passenger is alone (column name: is_Alone), assuming that a person without family aboard has a higher chance of survival than one travelling with family members.
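The is_Alone flag described above could be derived like this (toy rows for illustration; the notebook would apply the same rule to both train.csv and test.csv):

```python
import pandas as pd

# Illustrative rows; in the notebook this runs on train_data and test_data.
train_data = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# A passenger is alone when they have no siblings/spouses
# and no parents/children aboard.
train_data["is_Alone"] = (
    (train_data["SibSp"] == 0) & (train_data["Parch"] == 0)
).astype(int)
```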
After adding is_Alone to the Random Forest Classifier, the accuracy improved to 0.78708.
3. From the data analysis it was also clear that passengers with more family members aboard are less likely to survive than passengers with fewer family members.
So we add a new column named "Size" for the size of the family, which is the sum of the SibSp and Parch columns.
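The family-size feature is a one-line column sum; a minimal sketch on illustrative rows:

```python
import pandas as pd

# Toy rows; the notebook applies this to train_data and test_data alike.
train_data = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})

# Family size = siblings/spouses + parents/children aboard.
train_data["Size"] = train_data["SibSp"] + train_data["Parch"]
```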
After adding Size to the Random Forest Classifier, we got an accuracy of 0.77751, which is higher than the initial accuracy but not the best one.
4. We also tried dividing Fare into 5 ranges and feeding this fare range to the Random Forest Classifier.
train_data['Fare_range'] = [0 if 39.688 < x <= 512.329 else 1 if 21.679 < x <= 39.688 else 2 if 10.5 < x <= 21.679 else 3 if 7.854 < x <= 10.5 else 4 for x in train_data['Fare']]
train_data.head()
test_data['Fare_range'] = [0 if 39.688 < x <= 512.329 else 1 if 21.679 < x <= 39.688 else 2 if 10.5 < x <= 21.679 else 3 if 7.854 < x <= 10.5 else 4 for x in test_data['Fare']]
test_data.head()
After including Fare_range in the Random Forest Classifier, we got an accuracy of 0.77900.
5. We also tried dividing Age into 5 ranges and feeding this age range to the Random Forest Classifier.
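The age binning could be done with `pd.cut`, along the lines of the sketch below. The bin edges here are illustrative assumptions; the notebook's exact cut points may differ.

```python
import pandas as pd

# Toy ages; the notebook runs this on train_data and test_data.
train_data = pd.DataFrame({"Age": [4, 15, 25, 40, 70]})

# Bin Age into 5 ranges; these edges are assumed for illustration.
bins = [0, 12, 20, 40, 60, 120]
train_data["Age_range"] = pd.cut(train_data["Age"], bins=bins,
                                 labels=[0, 1, 2, 3, 4])
```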
After including Age_range in the Random Forest Classifier, we got an accuracy of 0.78229.
My Best Score
Contribution
The initial model worked on the whole dataset, so the first step was to understand the data, identify and fill missing values, and remove unwanted columns. The Age and Embarked columns had missing values: missing Age values were filled with the mean, missing Embarked values were filled with C (since passengers who embarked at C had the highest survival rate), and the Cabin column was dropped.
I spent hours establishing the relationships between the columns, tried to create valuable information out of each one, and fed this to the Random Forest Classifier to get better accuracy. Only those changes that improved accuracy over the initial model are included in this blog. The final accuracy was reached after hours of coding across 40+ versions and submissions.
I classified Age and Fare into 5 ranges each and included those in the Random Forest Classifier, which helped increase the accuracy over the initial score.