Would Jack and Rose Have Survived the Disaster???

What does my computer say?

Ainish Lodaya
The Startup
5 min read · Dec 13, 2020

Below is my take on the Titanic challenge hosted on Kaggle. The goal is to predict the survival of a passenger given various data points about the passenger and their journey.

My approach is to build a neural network for this challenge and then check how well the model agrees with the movie.

Data, data, data. It’s always data first.

The challenge provides a training set of 891 passengers to learn from. Let’s dive in.

First look at the data
Code 1
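
A first look along these lines can be sketched with nothing but Python's standard library. This is a rough sketch, not necessarily the code used for this post: the three sample rows are made up for illustration (they are not real passengers), and `missing_counts` is a hypothetical helper name. It reads the rows and counts blank values per column, which is what the notes below react to.

```python
import csv
import io

# Made-up sample in the shape of Kaggle's train.csv (not real passengers).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,T2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,T3,7.92,,
"""

def missing_counts(rows):
    """Count blank values per column across a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                counts[column] = counts.get(column, 0) + 1
    return counts

# For the real data, swap io.StringIO(SAMPLE_CSV) for open("train.csv", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(len(rows), "rows;", "columns:", list(rows[0].keys()))
print("missing:", missing_counts(rows))
```

Run against the real file, the same helper should surface the gaps in Age, Cabin and Embarked.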

Mental notes

  • Passenger id: looks unimportant, let’s drop it
  • Survived: The most important piece of information
  • Pclass: Class of ticket bought. Useful information. (Think airport lines: first class first, followed by business, followed by partner programs, followed by priority, followed by you, yes you, plebeian!)
  • Name: Not interesting for now, but maybe in a more sophisticated model later.
  • Sex: Gender. Of course this information is useful. It’s 1912; chivalry was a thing.
  • Age: Maybe younger people would survive better. Maybe older people would be rescued first. Hey! What about the babies?? Let’s keep this column
  • SibSp: (# siblings/spouses on board) Yes, my sister is annoying, but there’s no chance I’m leaving the boat without her. Nada, not happening.
  • Parch: (# parents/children on board)
  • Ticket: Ticket number. Useful for a refund if they offered it, but to predict survival — not so much
  • Fare: “Did you buy a ticket with priority pass on a lifeboat? £100 wasted. What do you think will happen? Reckon we’ll hit an iceberg?” I’m surely keeping this for further analysis.
  • Cabin: What if we get our own cabin? Interesting!
  • Embarked: Are people from one port better swimmers than those from another? I don’t know, let’s see.

More Mental notes:

  • Not everyone gave their age. We’ll need to fix that. For simplicity, let’s fill the gaps with the average age.
  • Two folks do not have their port of embarkation marked. Let’s just use the most frequent one.
  • Cabin: only 204 cabin numbers are recorded. Let’s use whether a cabin was allocated or not. Nothing fancy.
  • We will need to encode and scale features to fit our network.
  • Sex and Cabin can be encoded with LabelEncoder; Pclass and Embarked with OneHotEncoder.
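
The notes above can be sketched in plain Python. Treat this as an illustration under assumptions rather than the actual pipeline: the rows are made up, `preprocess` is a hypothetical helper, the label and one-hot encodings are written by hand instead of using scikit-learn's encoders, and min-max scaling of Age is my guess at what "scale" means here.

```python
from statistics import mean, mode

# Made-up rows in the shape of the Kaggle columns (not real passengers).
passengers = [
    {"Pclass": 3, "Sex": "male", "Age": 22.0, "Cabin": "", "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Pclass": 2, "Sex": "female", "Age": 30.0, "Cabin": "", "Embarked": ""},
    {"Pclass": 3, "Sex": "male", "Age": 8.0, "Cabin": "", "Embarked": "S"},
]

def preprocess(rows):
    """Impute, encode and scale rows into numeric feature vectors."""
    # Missing ages get the average age; missing ports get the most frequent port.
    mean_age = mean(r["Age"] for r in rows if r["Age"] is not None)
    top_port = mode(r["Embarked"] for r in rows if r["Embarked"])
    ages = [r["Age"] if r["Age"] is not None else mean_age for r in rows]
    lo, hi = min(ages), max(ages)

    features = []
    for r, age in zip(rows, ages):
        port = r["Embarked"] or top_port
        vec = [
            1.0 if r["Sex"] == "female" else 0.0,        # label-encoded Sex
            1.0 if r["Cabin"] else 0.0,                  # cabin allocated or not
            (age - lo) / (hi - lo) if hi > lo else 0.0,  # min-max scaled Age (assumed scaler)
        ]
        vec += [1.0 if r["Pclass"] == c else 0.0 for c in (1, 2, 3)]  # one-hot Pclass
        vec += [1.0 if port == p else 0.0 for p in ("C", "Q", "S")]   # one-hot Embarked
        features.append(vec)
    return features

for vec in preprocess(passengers):
    print(vec)
```

Each passenger ends up as nine numbers between 0 and 1, which is a shape a small network can digest.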

Ok. Now let’s really look at the Data!

(P.S. While the challenge itself was implemented in Python, I’ve taken this opportunity to brush up my Excel skills.)

table 1

Pclass — Class of ticket bought.
As suspected, the first-class folks fared best. The observed survival rate drops with each step down in ticket class.

Sex — Ladies first?

Yes, if the article below is to be believed.

Source: http://www.paperlessarchives.com/titanic_newspaper_archive.html
chart 1

In the data set, the survival rate among women was ~74%, while among men it was ~19%. Both differ markedly from the overall ~38%, so sex should be an input to the model. Also notable are the totals: 314 women to 577 men.
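
As a sanity check on those numbers, here is a small worked calculation using only the counts and rounded rates quoted above. It confirms that the overall rate is the weighted average of the two groups, and it computes a two-proportion z-statistic as one rough way to judge the gap. The z-test is my addition, not part of the original analysis.

```python
from math import sqrt

# Counts and rounded rates quoted in the text above.
n_women, n_men = 314, 577
rate_women, rate_men = 0.74, 0.19

# Overall rate is the head-count-weighted average of the group rates.
n_total = n_women + n_men
overall = (n_women * rate_women + n_men * rate_men) / n_total

# Two-proportion z-statistic with a pooled standard error.
std_err = sqrt(overall * (1 - overall) * (1 / n_women + 1 / n_men))
z = (rate_women - rate_men) / std_err

print(f"passengers: {n_total}, overall survival: {overall:.1%}")
print(f"z-statistic for women vs men: {z:.1f}")
```

The weighted average lands on the ~38% quoted above, and the z-statistic comes out far beyond the usual 1.96 cutoff, so the gap is very unlikely to be noise.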

chart 2

Age
As the graph shows, survival rates in this data differ noticeably across age groups, and the gap is more pronounced for males than for females. Age will also go into the model.

chart 3

Siblings, spouses, parents and children
Combining the above variables, we know the number of family members traveling together. We can try to infer whether this number made a difference to survival rates. Did families do better than individuals? Did smaller family groups board lifeboats more easily than larger ones?
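
Combining the two columns might look something like this. It is a sketch with made-up rows: `family_size` and `survival_by_family_size` are hypothetical helper names, and counting the passenger themselves (the `+ 1`) is an assumption.

```python
from collections import defaultdict

# Made-up rows (SibSp, Parch, Survived), not taken from the Kaggle file.
rows = [
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 1, "Parch": 2, "Survived": 1},
    {"SibSp": 0, "Parch": 0, "Survived": 1},
    {"SibSp": 4, "Parch": 2, "Survived": 0},
]

def family_size(row):
    """Relatives on board plus the passenger themselves (assumed convention)."""
    return row["SibSp"] + row["Parch"] + 1

def survival_by_family_size(rows):
    """Map each family size to the share of its passengers who survived."""
    totals = defaultdict(lambda: [0, 0])  # size -> [survivors, passengers]
    for row in rows:
        bucket = totals[family_size(row)]
        bucket[0] += row["Survived"]
        bucket[1] += 1
    return {size: s / n for size, (s, n) in sorted(totals.items())}

print(survival_by_family_size(rows))
```

On the real training set, the same grouping gives the per-family-size rates that the chart plots.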

chart 4

Cabin
There isn’t much usable information about cabins themselves. We can however see that some passengers did have cabins and others didn’t. Assuming that the cabin information is correct, we can only infer whether a passenger chose to book a cabin or not.

Following from that: the survival rate among those who did have their own cabin was significantly higher than among those who didn’t.

A second observation: while fewer than 25% of all passengers had their own cabin, they made up roughly 40% of all survivors.

And the rest of the independent variables
I’m guessing you’ve got the gist of my thought process by now. For the last two variables, Fare and port of embarkation, I’m going to skip the commentary and leave them as thought experiments. (P.S. I didn’t think Fare was as obvious as it looked.)

chart 5
table 3

And that concludes the data commentary for this model. (Mostly because I want to get to the fun part of building the model.)

Photo by Nastya Dulhiier on Unsplash

Neural Networks

Neural networks are computing systems loosely modeled on our best guess of how the human brain works. We feed the network information and let its neurons adjust to it, then we feed back the results and let the neurons adjust again (and again), until it is able to ‘think’ like humans, but at a much larger scale.
For this exercise, let’s build one with one input layer, two hidden layers and one output layer. Given we’re trying to predict survival, which fortunately (or unfortunately, and for the course of this article) is binary, we can use a sigmoid activation function for the output layer.
The network will train on the training set we’ve cleaned up above and then output the survival probability for a given set of input parameters.
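
To make the architecture concrete, here is a minimal, untrained forward pass in plain Python. It is a sketch of the shape described above (two hidden layers, sigmoid output), not the model behind the results in this post. The layer widths, the ReLU activation on the hidden layers, the nine-feature input and the random weights are all assumptions, and the training loop, which a library such as Keras would normally handle, is left out.

```python
import math
import random

def sigmoid(x):
    """Squash any real number into (0, 1), read here as a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, layer, activation):
    """Apply one dense layer: activation(weights . inputs + bias) per unit."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def predict(features, layers):
    """Forward pass: ReLU on both hidden layers, sigmoid on the output."""
    hidden1, hidden2, output = layers
    h = dense(features, hidden1, relu)
    h = dense(h, hidden2, relu)
    return dense(h, output, sigmoid)[0]

rng = random.Random(0)
n_features = 9  # assumed input width; depends on the encoding used
layers = [
    make_layer(n_features, 8, rng),  # hidden layer 1 (8 units is a guess)
    make_layer(8, 4, rng),           # hidden layer 2 (4 units is a guess)
    make_layer(4, 1, rng),           # output layer: one survival probability
]

example = [0.0, 0.0, 0.6, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]  # made-up encoded passenger
print(f"untrained output: {predict(example, layers):.3f}")
```

Until the weights are trained, the output is just a number between 0 and 1 with no meaning. Training is what nudges it toward actual survival odds.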

code 2

So does the movie hold?

Picking four characters from the movie (Jack, Rose, Caledon and Ruth), plugging in the values we know (Pclass, Sex, SibSp, Parch), using a wikia for age, and making guesses for port and cabin, we get the following chances of survival:

Jack: 13%
Rose: 93%
Ruth: 98%
Caledon: 48%

Or as James Cameron said — “very simple because it says on page 147 [of the script] that Jack dies.”
