A Machine Learning Approach to Predicting the Survival of RMS Titanic Passengers

Photo by Nick Hawkes on Unsplash

In the middle of my learning journey, I decided to take on a Kaggle competition. The problem is to predict which passengers survived the sinking of the RMS Titanic. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we have to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.). The training set consists of 891 unique examples with 12 attributes. Because this is my first attempt at ML, I only used several of the attributes as features. Here is the link to the challenge.

The competition page

Work on the Data

As I mentioned before, the dataset consists of 891 unique examples for training and 418 for testing. I did this problem as part of Kaggle's Intro to Machine Learning micro-course, so I got to see how they approached it with a model first.

It started with importing the dataset from Kaggle. I used a notebook on Kaggle so it would be easier to submit the results. The submission should be a CSV file consisting of two columns: the first is the passenger ID, and the second is the predicted survival outcome, 1 if the passenger survived and 0 if they didn't.

We import the dataset using the Pandas library and store it in a variable called train_dataset

We will also load the test dataset into the notebook.
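For reference, here is a minimal sketch of that loading step, assuming the standard input paths of a Kaggle notebook for this competition:

```python
import pandas as pd

# Standard input paths inside a Kaggle notebook for this competition
train_dataset = pd.read_csv("/kaggle/input/titanic/train.csv")
test_dataset = pd.read_csv("/kaggle/input/titanic/test.csv")

print(train_dataset.shape)  # (891, 12)
print(test_dataset.shape)   # (418, 11) -- the test set has no Survived column
```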

The interesting part is that Kaggle provides a CSV file called gender_submission.csv in the list of datasets, together with the train and test datasets. It is basically just an example of what the submission should look like. But the idea behind it is that, instead of actually predicting which passengers survived, it assumes all female passengers survived and all male passengers did not.
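Reproducing that baseline takes only a few lines; this is a sketch, assuming test_dataset has been loaded as above:

```python
import pandas as pd

# Rebuild the gender-based baseline: every female passenger
# is predicted to survive, every male passenger is not
baseline = pd.DataFrame({
    "PassengerId": test_dataset["PassengerId"],
    "Survived": (test_dataset["Sex"] == "female").astype(int),
})
baseline.to_csv("baseline_submission.csv", index=False)
```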

It is interesting because there is a story from the sinking of the Titanic that women and children were asked to board the lifeboats first during the evacuation. If you have watched the movie, there is a scene like that. So I tried to verify it from the training dataset, and it shows:

female passengers were more likely to survive

Yes, female passengers were more likely to survive if we look at the training dataset. When I submitted the CSV file, it gave 76.555% accuracy, which is not bad but also not very good. But it is enough to give us some intuition about the data.
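A quick way to check this on the training data (a sketch, assuming train_dataset is loaded as above):

```python
# Survival rate by sex in the training data
survival_by_sex = train_dataset.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
# On the standard training set this prints roughly
# female ~0.74, male ~0.19
```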

Build the Model

So I tried to build the model just as Kaggle asked me to. I used 4 features (you can see them in the picture below). I used the get_dummies function from the pandas library to handle the categorical features, converting each one into new binary (0/1) indicator columns. I used Random Forest as the algorithm and got 77.551% accuracy.

Using Random Forest and 4 features with minimal data cleaning
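For reference, here is a sketch of that first model, assuming the four features are the ones from Kaggle's Intro to ML tutorial (Pclass, Sex, SibSp, Parch):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed feature set, matching the Kaggle tutorial
features = ["Pclass", "Sex", "SibSp", "Parch"]

# get_dummies turns the categorical Sex column into
# binary 0/1 indicator columns (Sex_female, Sex_male)
X = pd.get_dummies(train_dataset[features])
X_test = pd.get_dummies(test_dataset[features])
y = train_dataset["Survived"]

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
```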

Improve the Model

It was an improvement over the previous submission, but I still had room to improve, since the score only increased by about 1%. I decided to add more features, split the dataset into training and validation data, and work on data cleaning, while still using Random Forest.

I will not write all of the code here. Instead, I will highlight some of the features I used to build my model that I found interesting. I still used the same dataset, and because I added several new features, I had to deal with missing data by replacing missing values with the mean of their column. I also still used the get_dummies function to handle categorical data.
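As a sketch of that imputation step (assuming Age and Fare are among the new numeric features):

```python
# Fill missing numeric values with the training-set mean
for col in ["Age", "Fare"]:
    mean_value = train_dataset[col].mean()
    train_dataset[col] = train_dataset[col].fillna(mean_value)
    test_dataset[col] = test_dataset[col].fillna(mean_value)
```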

The interesting part is that, even once we are done with the categorical data, some features still need scaling because their ranges are far larger than those of the other features (such as age and fare). So I used the MinMaxScaler function from the scikit-learn library. The same transformation also needs to be applied to the test data.
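A minimal sketch of that scaling step, fitting the scaler on the training data and reusing the same fitted scaler on the test data:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
wide_cols = ["Age", "Fare"]  # the wide-range features mentioned above

# Fit on the training data only, then apply the same
# transformation to the test data
train_dataset[wide_cols] = scaler.fit_transform(train_dataset[wide_cols])
test_dataset[wide_cols] = scaler.transform(test_dataset[wide_cols])
```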

As I mentioned before, I used train_test_split from scikit-learn to split the training dataset. By doing so, I got validation data that I could use to check the accuracy of my model before trying it on the test dataset.
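Something like this, assuming X and y hold the prepared features and labels (the 80/20 ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1
)
```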

After building the model, which also used Random Forest, and using the validation data to check its error, I successfully submitted it with an improved score of 78.947% accuracy. I see this as a great improvement, but I know there is still a lot to learn. I tried to apply a neural network to this problem using Keras, just like I learned on Coursera for image classification, but it ran into errors, so I will figure that out later. For now, I think it's a good improvement. You can also find my whole code here. Here are the datasets for training and testing. The code might not work right away in Google Colab because I copied it from Kaggle, but you can find a way to make it run using those datasets. Good luck!
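Putting it together, the final fit-and-evaluate step looked roughly like this (a sketch with placeholder hyperparameters, not my exact settings):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Check the model on the held-out validation data first
valid_acc = accuracy_score(y_valid, model.predict(X_valid))
print(f"validation accuracy: {valid_acc:.3f}")

# Then predict on the prepared test features and write the submission
submission = pd.DataFrame({
    "PassengerId": test_dataset["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```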

This was my first journey solving an ML problem on my own. I realize there are a lot of things that can be improved. I will try to learn more about various methods and algorithms so I can make the model even better. Thank you for reading this post!

A fresh graduate who always wants to learn new things. Actively self-learning Data Science and Data Analytics. Interested in Dutch as a foreign language.