Predicting Titanic Survivors Is a Reality

Jonas Prado
Published in The Startup
5 min read · Nov 2, 2020

Is it possible to “predict” which passengers survived or died in the sinking of the Titanic in 1912? No peeking at the answer on the internet!
Sounds difficult?
What if I offered you some information collected when these passengers boarded? Could you predict it then?

Before answering, let’s remember what that wreck was. On April 10, 1912, the RMS Titanic set sail with more than 2,200 people on board, considered the most luxurious and safest ship of its time. Its first (and last) voyage was from the United Kingdom to New York.

Unfortunately, after traveling about 3,600 kilometers, four days after sailing, the ship struck an iceberg and was wrecked.
After the collision, lifeboats were launched, and in less than three hours the ship was fully submerged. Just under a third of the people on board survived.

Were the survivors of that accident determined at random, or was there some kind of priority in boarding the lifeboats? Could this priority be related to age? Sex? Class? Cabin?

The Kaggle competition and challenge platform provides a dataset with Titanic passenger information, collected when the passengers boarded the ship. In addition, the platform provides the response variable for some of the passengers, and expects us to “predict” it for the rest.

Let’s do this challenge!

As for the earlier question, the answer is YES! We can “predict” which passengers survived or died in the wreck. Next, I will present how I approached the problem and highlight my hypotheses.

Let’s start with the technical part, using the Python language, but rest assured: each step will be explained.

The first step is to import the libraries to be used.
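The original import cell is not reproduced in the text; given the steps that follow, a typical set would look like this (the exact list is my assumption, not the author’s original cell):

```python
# Standard stack for a tabular-classification challenge like Titanic.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
```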

Within the Kaggle platform there is a dictionary for this dataset, containing the description of each column in the file.

The next step is to see the quality of each column, starting with the amount of missing information.

The dataset has 891 passengers/records. The Cabin variable has 77% of its records blank; the Age variable has 20% blank; the Embarked variable has less than 1% blank. Since the modeling assumptions do not allow us to proceed with missing data, I chose to exclude the Cabin variable and keep Age and Embarked, in order to apply some kind of treatment to them afterwards.
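A sketch of how those missing-value percentages can be computed with pandas (the four-row frame below is a stand-in for the real 891-row train.csv):

```python
import pandas as pd

# Tiny illustrative frame; the real data comes from Kaggle's train.csv.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# Percentage of missing values per column, worst first.
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)
```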

The PassengerId variable is a unique ID for each passenger; it only serves to identify the passenger and brings no information gain to the model.

Regarding the Ticket variable, I understand that we could use it to engineer other variables. However, as that process would be laborious, for this first pass I also chose to remove it from the model.
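Dropping the three discarded columns is a one-liner in pandas; the toy frame here only illustrates the call:

```python
import pandas as pd

# Stand-in frame with the columns we are discarding plus one we keep.
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

# Remove the ID, Ticket and Cabin columns, as discussed above.
df = df.drop(columns=["PassengerId", "Ticket", "Cabin"])
```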

When looking at the information in the Name column, we notice that the titles “Mr.”, “Mrs.”, “Miss”, “Ms” and “Master” appear. Looking up the meaning of each term, we have:

Miss: single women
Mrs.: married women
Mr.: men
Master: children

After building this new title variable, we can delete the Name column.
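One way to build a Title variable from Name before deleting it, sketched on three example names (the regex is my assumption about how the titles were extracted, not the author’s code):

```python
import pandas as pd

df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

# Capture the honorific between the comma and the first period, e.g. "Mr".
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

# The Name column has served its purpose and can be removed.
df = df.drop(columns=["Name"])
```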

The columns Pclass, Sex and Embarked are categorical dimensions, not measures, so we need to transform them into dummy variables.
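A sketch of the dummy-variable step with pandas `get_dummies` (using `drop_first` to avoid redundant columns is my choice; the original may have kept all levels):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 3],
    "Sex": ["female", "male"],
    "Embarked": ["C", "S"],
})

# One-hot encode the categorical columns; drop_first removes one
# redundant level per variable.
df = pd.get_dummies(df, columns=["Pclass", "Sex", "Embarked"], drop_first=True)
print(df.columns.tolist())
```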

Remember that the Age variable had 20% of its data blank? Now let’s treat it and fill it in. For that, we will use the average age per title category obtained from the Name variable (Mrs, Mr, Master and Miss).
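The per-title mean imputation can be sketched like this (assuming a Title column built earlier from Name; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, None, 20.0, None],
})

# Fill each missing age with the mean age of its title group.
df["Age"] = df.groupby("Title")["Age"].transform(lambda s: s.fillna(s.mean()))
```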

How about looking at the correlation of the base variables with Survived (the response variable)?

According to the images above, we can analyze the correlation of the explanatory variables with the response variable; with that, we already get an idea of which variables we should prioritize in the model.
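The numbers behind a correlation chart like that can be computed directly with pandas (tiny illustrative frame below, not the real data, so the values are not the article’s):

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Sex_male": [1, 0, 0, 1, 0],
    "Fare":     [7.3, 71.3, 8.1, 8.5, 53.1],
})

# Correlation of every numeric column with the response variable.
corr = df.corr()["Survived"].sort_values(ascending=False)
print(corr)
```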

Finally, we are close to our model; let’s start the preparations.

We need to tell the model which is our target variable (Y) and which variables will help in our prediction (X). Then we do the split, dividing our data into training and test sets so we can evaluate the model’s results afterwards. In our case, we separated 70% for training the model and 30% to perform the test later.
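The split described above maps directly onto scikit-learn’s `train_test_split` (the random data and `random_state` are my additions for a self-contained, reproducible sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and target.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, 100)

# 70% for training, 30% held out for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```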

The model chosen is Logistic Regression. In summary, logistic regression models the probability of Y belonging to a particular category; in our case, whether the passenger survived or not.

Logistic regression always produces a prediction between 0 and 1, so we can interpret its results as a valid probability.
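A minimal sketch of fitting a logistic regression and reading its probabilities (toy one-feature data, not the Titanic features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature where higher values correspond to class 1.
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns the probability of class 1, always in (0, 1).
proba = model.predict_proba(X)[:, 1]
```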

We have our model trained with the data we worked on during this challenge.
The time has come to evaluate it: make the prediction and compare it with the data we set aside (those 30% of the split, remember?).

The evaluation metrics are based on the confusion matrix; in case you have forgotten it, I will leave an image to help.

The table shows the classification frequencies for each class of the model: true positive (TP), false positive (FP), true negative (TN) and false negative (FN).
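Those four frequencies can be read straight off scikit-learn’s confusion matrix (the labels below are hypothetical, not the article’s actual test results):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions on a held-out set.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# classification_report summarizes precision, recall and f1-score per class.
print(classification_report(y_true, y_pred))
```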

We did it!!
In our first challenge we achieved an accuracy of 87%, and looking at the f1-score (the harmonic mean of precision and recall) we reached 80% for survivors and 90% for non-survivors.

The work is not over yet; it is possible to improve this score even more. While writing this for you, I already found several opportunities we can work on. This is just the beginning.

For more details, below is the complete code link:
https://github.com/joonaspp/kaggle-titanic

Thanks for reading, see you next time.
