RMS Titanic 1912: Machine Learning From Disaster

Rifah Maulidya
ILLUMINATION’S MIRROR
5 min read · Jun 22, 2024

Here, we will predict survival on Titanic and get familiar with ML basics.

Image by ArtStation on Pinterest

If you were asked to name the biggest tragedies in human history, you might answer almost at random: the war in Ukraine, the Holocaust, the great floods in China, or even the Black Death. All of those are fair answers, but one disaster is so well known that researchers and enthusiasts still explore its legacy in depth more than a century later. Yes, it is the RMS Titanic — a British ocean liner and, at the time, the world's largest ship.

In 1912, the luxurious RMS Titanic sailed to her doom in the cold waters of the North Atlantic. The disaster has fascinated people for more than a hundred years, giving rise to numerous narratives, films, and research projects. But what if we could use today's technology to predict who was likely to survive? It is possible. Many analyses and forecasts have been built on the Titanic dataset hosted at Kaggle, which records details about the passengers — such as their age, gender, and class — that analysts can use to develop predictive models.

Many researchers have tackled this challenge with methods such as Logistic Regression and Support Vector Machines:

  • Logistic Regression: One group reported a 79% accuracy rate with this method, showing how effective it can be for binary classification problems.
  • Support Vector Machines (SVM): Another group reported 83% accuracy using SVM, demonstrating its strength on high-dimensional data.

In this article, we will build a survival predictor from the Kaggle Titanic dataset and try to make it as accurate as possible using the Random Forest method.

Download the dataset

Before we start, make sure you have the dataset downloaded to your local file explorer. Go to the Kaggle dataset page and download three files: 'train.csv' for training, 'test.csv' for testing, and 'gender_submission.csv', an example submission that pairs each test passenger ID with a predicted Survived value. We will use the Kaggle notebook to run the code.

This is what the ‘train.csv’ dataset looks like. It includes name, gender, age, etc.

Load the data

After downloading the datasets, we have to load them from the file explorer into the notebook with the following code. Here we only use Python, so it's easy for beginners to follow along!

# Import the pandas library for data handling
import pandas as pd

# Load the training set, test set, and example submission file
# (on Kaggle, prefix the paths with '/kaggle/input/titanic/')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
gender_submission = pd.read_csv('gender_submission.csv')

Study the dataset

The Titanic dataset is a collection of information about the passengers on the RMS Titanic. It contains several characteristics that can help determine whether a passenger was likely to survive the sinking. Here is what each column means:

  • PassengerId: Identifier for each passenger.
  • Survived: Binary indicator if the passenger survived (1) or not (0).
  • Pclass: Passenger class (1st, 2nd, or 3rd).
  • Name: Name of the passenger.
  • Sex: Gender of the passenger.
  • Age: Age of the passenger.
  • SibSp: Number of siblings/spouses aboard.
  • Parch: Number of parents/children aboard.
  • Ticket: Ticket number.
  • Fare: Fare paid by the passenger.
  • Cabin: Cabin number.
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

With these key features, we can predict each passenger's chance of survival.
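Before cleaning anything, it helps to check the column types and count the missing values per column. A minimal sketch — the tiny DataFrame below stands in for the real `train` loaded earlier with `pd.read_csv('train.csv')`:

```python
import pandas as pd

# Small stand-in for the loaded train DataFrame (real code uses the full CSV)
train = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, None],
    'Cabin': [None, 'C85', None],
})

print(train.dtypes)          # column types
print(train.isnull().sum())  # count of missing values per column
```

On the full dataset, this kind of check is what reveals that 'Age', 'Cabin', and 'Embarked' have gaps that need handling next.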

Explore and clean data

Data cleaning is an important step to enhance the accuracy of prediction. We will handle missing values and convert the categorical data into numerical data. Here's how we code:

Handle missing values

# Fill missing values for 'Age' with the median age
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())

# Fill missing 'Embarked' (port of embarkation) with the most common value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])

# Fill missing 'Fare' in the test set with the median fare
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

Convert categorical variables

# Convert 'Sex' to numerical values: 0 for male, 1 for female
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
test['Sex'] = test['Sex'].map({'male': 0, 'female': 1})

# Convert 'Embarked' to numerical values using dummy variables
train = pd.get_dummies(train, columns=['Embarked'], drop_first=True)
test = pd.get_dummies(test, columns=['Embarked'], drop_first=True)
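One caveat with `get_dummies`: if the train and test sets happen to contain different category values, their dummy columns can diverge, and the model will reject the mismatched test frame. A sketch of guarding against that with `reindex` (standard pandas; the column names assume the Embarked encoding above):

```python
import pandas as pd

# Toy frames: the test set is missing the 'Q' embarkation value entirely
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['S', 'C', 'S']})

train = pd.get_dummies(train, columns=['Embarked'], drop_first=True)
test = pd.get_dummies(test, columns=['Embarked'], drop_first=True)

# Align test to the train columns, filling any absent dummy column with 0
test = test.reindex(columns=train.columns, fill_value=0)
print(list(test.columns))  # now matches train: ['Embarked_Q', 'Embarked_S']
```

On the real Titanic files both sets contain all three ports, so this is purely defensive — but it is a cheap habit that prevents a confusing error later.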

Feature selection

This process helps us identify which columns are most relevant and contribute most to predicting the target variable. For the Titanic dataset, features such as 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' are natural choices, since they plausibly have a major influence on survival.

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']
X_train = train[features]
y_train = train['Survived']
X_test = test[features]

Train the model

For this classification task, we will employ a Random Forest classifier, an effective ensemble machine-learning algorithm. First, we reload the raw Kaggle data and preview it:

# Load the raw training data and preview the first rows
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

# Load the raw test data and preview the first rows
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

Running this lets us inspect the data the model will read before training it. This is what we get:

(Picture 1): The first rows of the train data, as read by the model. (Picture 2): The first rows of the test data.
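The code above loads the data but does not yet fit the Random Forest itself. A minimal sketch of the training step with scikit-learn's `RandomForestClassifier`, run here on a toy frame with the same columns as our cleaned features (the real notebook would pass the `X_train` and `y_train` built earlier and predict on `X_test`):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the cleaned training features built earlier
X_train = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Sex':    [0, 1, 1, 1, 0, 0],   # 0 = male, 1 = female, as encoded above
    'Age':    [22.0, 38.0, 26.0, 35.0, 35.0, 28.0],
    'Fare':   [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})
y_train = pd.Series([0, 1, 1, 1, 0, 0], name='Survived')

# Fit a Random Forest: an ensemble of decision trees that votes on each passenger
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)

# Predict survival (0 or 1) for each row; real code would predict on X_test
predictions = model.predict(X_train)
```

The hyperparameters here (100 trees, depth 5) are just reasonable defaults, not tuned values.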

Make the prediction

Before turning to model predictions, we can compute a simple baseline from the training data: the ratio of people who survived, broken down by gender.

# Survival rate among women in the training data
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_survived = sum(women) / len(women)

print("Fraction of women who survived:", rate_survived)

After we run this code, the output shows that about 74.2% of women survived (0.742 as a fraction).

# Survival rate among men in the training data
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_survived = sum(men) / len(men)

print("Fraction of men who survived:", rate_survived)

Then, the output of this code shows that only about 18.9% of men survived the tragedy.

Rescue in this tragedy prioritized women and children, so the percentages reflect that far more women survived than men, who were given little opportunity to board the lifeboats.
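To actually enter the Kaggle competition, the customary final step is writing the model's test-set predictions to a CSV in the same shape as gender_submission.csv. A sketch, with toy values standing in for `test_data['PassengerId']` and `model.predict(X_test)`:

```python
import pandas as pd

# Toy values; in the real notebook these come from the test set and the model
output = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Survived': [0, 1, 0],
})
output.to_csv('submission.csv', index=False)

# Read the file back to verify the two-column format Kaggle expects
check = pd.read_csv('submission.csv')
print(check.columns.tolist())  # ['PassengerId', 'Survived']
```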

Final word

In this study, we developed a model for forecasting survival on the Titanic using data pre-processing, feature selection, and machine-learning techniques. The exercise not only deepens the way we read historical data but also shows how powerful data analysis can be at identifying patterns and making predictions supported by facts.

Find the tutorial here (by Kaggle)!


A person who is interested in AI, robotics, and CS. Learning 1% lessons every day for 99% good results in the days ahead.