PyTorch: Examining the Titanic Sinking with Ridge Regression

We examine the data of more than 800 Titanic passengers and train a Machine Learning Model on it

Falcon · 11 min read · Jun 28, 2020

In this notebook, we use this dataset containing data about passengers from the Titanic. Based on this data, we will train what the title calls a Ridge Regression model (here simply a Logistic Regression model that uses L2 Regularization) to predict whether a person survived the sinking based on their passenger class, sex, age, the number of their siblings/spouses aboard, the number of their parents/children aboard and the fare they paid.

First, we import everything we need for plotting data and creating a great model to make predictions on the data.
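The exact import cell is not shown here, but a minimal set that covers everything used below could look like this sketch (the file name titanic.csv is an assumption):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, random_split

# Load the passenger data into a dataframe (file name is an assumption).
df = pd.read_csv('titanic.csv')
```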

Data Exploration

Here, we can see what the data actually looks like. The first column indicates whether the person survived, where a 1 stands for survival and a 0 for death. The rest of the columns are our input columns used to predict survival. We will, however, drop the Name column, as it does not hold information useful for predicting survival. You can also see below that we have data for 887 persons and 8 columns in total, of which 6 will be the input values and 1 (the Survived column) the corresponding label.

To get a little more familiar with the data, we can do some computations with it and plot it. First, we print the share of passengers who survived, which can also be read as the overall probability of survival, and then break that probability down by class and sex.
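Assuming the label column is called Survived (1 for survival, 0 for death) and the class and sex columns are called Pclass and Sex, these numbers are simple means and groupbys:

```python
# Share of survivors = mean of the 0/1 Survived column.
survival_rate = df['Survived'].mean()
print(f"Overall probability of survival: {survival_rate:.2%}")

# Survival probability per class, per sex, and per class/sex combination.
print(df.groupby('Pclass')['Survived'].mean())
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby(['Pclass', 'Sex'])['Survived'].mean())
```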

When we look at how likely people from different classes and of different sexes were to survive, we can see a clear trend: the higher the class, the higher the survival probability, and women were far more likely to survive than men. Ever wondered why this is the case? The answer is quite simple.

When the Titanic began to sink, women and children were put into the lifeboats first, before the men. The lower-class passengers were not treated equally at the time of the sinking: there were so many people in the lower class that not all of them could be informed by the stewardesses. Subsequently, it took much longer for them to reach the deck for rescue, while first- and second-class passengers were already boarding the lifeboats. Also, the sailors fastened down the hatchways leading to the third-class section, saying they wanted to keep the air down there so the vessel could stay up longer. It meant all hope was gone for the passengers still down there.

Another reason why so many people died was the lack of safety measures onboard the Titanic. For example, there were not enough boats for the passengers to escape the ship: the lifeboats would have only been sufficient for half the people onboard, and due to bad organization not all of them were completely filled. More than half of the passengers were left behind. One good aspect, however, is that the laws for a ship’s safety have become much stricter since this disaster. If you want to read about the sinking in detail, have a look at this: https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic

Looking at the prices, which are all measured in pounds, we can see the overall average fare and then the averages for the different classes. Note that due to inflation, the equivalent amounts in today’s pounds would be a lot higher.
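The corresponding computation is again a one-liner per statistic:

```python
# Average fare overall and per class (in 1912 pounds).
print(f"Average fare: {df['Fare'].mean():.2f}")
print(df.groupby('Pclass')['Fare'].mean())
```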

There are passengers of all ages on board, with an average age of about 30 years.

To see the differences in survival probability mentioned and explained above more visually, we can plot them as the following plots show. Here, the gap between the classes and between the sexes stands out very clearly.
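A sketch of how such bar plots can be produced with seaborn (not necessarily the exact plotting code from the notebook):

```python
# Bar plots of survival probability per class and per sex.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.barplot(x='Pclass', y='Survived', data=df, ax=axes[0])
sns.barplot(x='Sex', y='Survived', data=df, ax=axes[1])
axes[0].set_ylabel('Survival probability')
axes[1].set_ylabel('Survival probability')
plt.show()
```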

Let’s now look at the fare distribution and the costs from the different classes.

In the following we can clearly see that most passengers did not have any siblings/spouses or parents/children aboard.

Lastly we look at the distribution of the ages of the passengers.
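The distribution plots can be produced along these lines (a sketch; the family-related column names are assumptions about this CSV):

```python
# Histograms of fare, family counts aboard and age.
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(df['Fare'], bins=40)
axes[0, 0].set_title('Fare distribution')
axes[0, 1].hist(df['Siblings/Spouses Aboard'], bins=range(0, 10))
axes[0, 1].set_title('Siblings/spouses aboard')
axes[1, 0].hist(df['Parents/Children Aboard'], bins=range(0, 10))
axes[1, 0].set_title('Parents/children aboard')
axes[1, 1].hist(df['Age'], bins=20)
axes[1, 1].set_title('Age distribution')
plt.tight_layout()
plt.show()
```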

Data Preparation

Now, to be able to train our model, we want to convert our pandas dataframe into PyTorch tensors. To do this, we define the dataframe_to_arrays function, which performs the conversion to NumPy arrays. To use the function, we need to specify three kinds of columns: input columns, categorical columns (columns that contain not numbers but a string standing for a category) and output columns. The function then builds one NumPy array with all the input data from the input columns (first converting the categorical columns to numerical ones) and one array with the labels from the output columns. Afterwards, we can easily convert both arrays to PyTorch tensors with the desired data types, so that we are ready to define the model and train it on the data. Note also that the normalize parameter is set to True, which makes the function normalize the input data by squashing all values into the range between 0 and 1 with min-max normalization; this more uniform scale helps the model learn from the data.
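A sketch of what dataframe_to_arrays and the tensor conversion could look like under the column-name assumptions made above (the original notebook’s version may differ in details):

```python
input_cols = ['Pclass', 'Sex', 'Age',
              'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
categorical_cols = ['Sex']
output_cols = ['Survived']

def dataframe_to_arrays(dataframe, normalize=True):
    df1 = dataframe.copy(deep=True)
    # Convert categorical columns ('female'/'male') to numeric codes (0/1).
    for col in categorical_cols:
        df1[col] = df1[col].astype('category').cat.codes
    inputs = df1[input_cols].to_numpy(dtype='float32')
    targets = df1[output_cols].to_numpy(dtype='float32')
    if normalize:
        # Min-max normalization: squash every input column into [0, 1].
        mins, maxs = inputs.min(axis=0), inputs.max(axis=0)
        inputs = (inputs - mins) / (maxs - mins)
    return inputs, targets

inputs_array, targets_array = dataframe_to_arrays(df, normalize=True)
inputs = torch.from_numpy(inputs_array)
targets = torch.from_numpy(targets_array)
```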

Now that we have the PyTorch Tensors for the input data and the labels we put them into a PyTorch TensorDataset which contains pairs of inputs and labels.
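With the tensors in place, the dataset is a one-liner:

```python
# Each element of the dataset is an (input, label) pair of tensors.
dataset = TensorDataset(inputs, targets)
```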

Another thing we have to do is split the original dataset into one part for training the model and another for validating that the model is actually learning something. The validation dataset contains data the model has never seen before, so by making predictions on it we can see how well the model performs on unknown data. The accuracy and loss on this validation data will be used as metrics for every training epoch.
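random_split does this for us; the validation size of 87 below is an assumption, any reasonable holdout size works:

```python
# Hold out a validation set the model never trains on.
val_size = 87
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])
```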

The last thing we do with our data is put it into DataLoaders (one for the training data and one for the validation data), which will feed the model shuffled, batched data during training.
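For example (the batch size is an assumption):

```python
# Shuffle only the training data; validation order does not matter.
batch_size = 64
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
```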

Defining Model Structure

Now we can create our model, which is just a simple Logistic Regression model: a single linear layer that accepts 6 inputs and outputs 1 value, which is passed through the sigmoid activation function so that the model’s forward method returns a predicted probability of survival between 0 and 1. This lets us train the model to output values close to 1 when it thinks the person would survive and close to 0 when it thinks the person would not; in practice it will probably never return exactly 1 or 0, but after some training its predictions move towards those extremes. We also define some additional methods in our model for the training and validation steps and for computing and printing accuracies. One more thing to note is that as the loss function in training_step we use Binary Cross Entropy. Lastly, we create an instance of TitanicModel called model, which we will train on the training data.
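A sketch of such a model, in the style of the Jovian course notebooks; the exact helper methods in the original may differ:

```python
def accuracy(outputs, targets):
    # Threshold the predicted probabilities at 0.5 and compare to the labels.
    preds = (outputs >= 0.5).float()
    return (preds == targets).float().mean()

class TitanicModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(6, 1)          # 6 inputs -> 1 output

    def forward(self, xb):
        return torch.sigmoid(self.linear(xb))  # predicted survival probability

    def training_step(self, batch):
        inputs, targets = batch
        return F.binary_cross_entropy(self(inputs), targets)

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return {'val_loss': F.binary_cross_entropy(out, targets).detach(),
                'val_acc': accuracy(out, targets)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.stack([x['val_loss'] for x in outputs]).mean().item(),
                'val_acc': torch.stack([x['val_acc'] for x in outputs]).mean().item()}

    def epoch_end(self, epoch, result):
        print(f"Epoch {epoch}: val_loss {result['val_loss']:.4f}, "
              f"val_acc {result['val_acc']:.4f}")

model = TitanicModel()
```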

Training the Model

Now comes the cool part: the actual training! For this, we need a fit_one_cycle function that does the training of our model. Like any usual fit function, it uses an optimizer to adjust the model’s parameters with a certain learning rate according to the gradients (the gradients are just the partial derivatives of the loss with respect to each parameter), which are obtained by backpropagating the loss through the model. There are, however, a few things about this fit function beyond the usual compute-the-loss-then-adjust-weights-and-biases routine that I want to point out; a sketch of the function follows the list below.

  • Learning rate scheduling: This technique replaces the fixed learning rate, usually by changing the learning rate after every batch of training. This can be done in several ways, but the way we will do it is with the “One Cycle Learning Policy”, which starts with a smaller learning rate, gradually increases it over roughly the first 30% of the training iterations and then decreases it again for optimal learning. For this scheduler we just need to set the maximum learning rate to which it will increase over time. If you want to go deeper into this topic, I suggest you read this: https://sgugger.github.io/the-1cycle-policy.html
  • Weight decay / L2 Regularization: Another thing we use is weight decay, which adds the sum of the squared weights to the loss function so that bigger weights are penalized, since bigger weights are usually a sign of overfitting. Keeping the weights small helps the model generalize better and achieve better results on unknown data. See this for more information about weight decay: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
  • Gradient clipping: Lastly, there is gradient clipping. This is quite simple but still very useful: it limits the gradients to a certain maximum value, so that a batch with unusually large gradients cannot push the parameters far in the wrong direction and hurt the model. Here is an interesting post about it: https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48
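Putting the three techniques together, a fit_one_cycle sketch could look like this, using PyTorch’s built-in OneCycleLR scheduler and clip_grad_value_ (the original notebook’s implementation may differ in details):

```python
@torch.no_grad()
def evaluate(model, val_loader):
    # Validation loss and accuracy, averaged over all validation batches.
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.SGD):
    history = []
    # Weight decay (L2 regularization) is handled by the optimizer itself.
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # One Cycle policy: ramp the learning rate up, then anneal it back down.
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))

    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            if grad_clip is not None:
                # Gradient clipping: cap every gradient value at +/- grad_clip.
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            sched.step()           # adjust the learning rate after every batch
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history
```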

Now, we should be all set to start the training. First, we compute the accuracy and loss on the validation data to see how the untrained model performs compared to how it performs after training. For the training itself, we define the hyperparameters: the maximum learning rate, the number of epochs to train for, the weight decay value, the gradient clipping value and the optimizer, which will be the Adam optimizer.
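The concrete values below are placeholders, not necessarily the ones used in the original notebook:

```python
# Performance of the untrained model as a baseline.
history = [evaluate(model, val_loader)]
print(history)

# Hyperparameters (all values are assumptions).
epochs = 100
max_lr = 0.01
weight_decay = 1e-4
grad_clip = 0.1
opt_func = torch.optim.Adam

history += fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                         weight_decay=weight_decay, grad_clip=grad_clip,
                         opt_func=opt_func)
```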

After this brief training, our model should be very good at predicting survival for Titanic passengers (note that it can never be perfect, as there is always a component of luck involved in any individual’s survival). You can also see this when you compare the accuracy and loss on the validation data to the values computed before training. To verify this further, let’s plot the accuracy on the validation data over the course of training, as well as the loss on both the training and the validation data. The loss curves also reveal overfitting: we would be overfitting if the training loss kept decreasing while the validation loss increased or stagnated, meaning the model gets better and better on the training data but worse on unseen data, which is definitely not what we want. However, the model does not seem to be overfitting, which is great!
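Plotting the recorded history is straightforward (this sketch only tracks the validation loss; the training loss would have to be recorded inside fit_one_cycle as well):

```python
# Validation accuracy over the epochs.
accuracies = [r['val_acc'] for r in history]
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. epochs')
plt.show()

# Validation loss over the epochs.
val_losses = [r['val_loss'] for r in history]
plt.plot(val_losses, '-o')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title('Validation loss vs. epochs')
plt.show()
```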

Save the Model

Now we save our model by writing its state, meaning all of its parameters, to a file, and we log our hyperparameters, final accuracy and final loss so we can later easily compare different model architectures and choices of hyperparameters to see how well they perform.
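Saving the weights is a single call (the file name is an assumption):

```python
# Persist all learned parameters (the weights and bias of the linear layer).
torch.save(model.state_dict(), 'titanic-logistic.pth')
print(model.state_dict())
```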

Note that when looking at the weights in the model.state_dict() output, we can see how important each of the input values is. For example, the class is associated with a negative weight, which makes sense since people from a class described by a higher number, like class 3, were less likely to survive. The next weight shows the extreme importance of sex for the prediction, as it carries the largest negative value; this is understandable if you know that a man is represented by 1 and a woman by 0. From the last weight we can also deduce that the larger the fare paid, the higher the survival probability, which makes sense too.

Test the Model on Samples

Having the training phase behind us, we can do some testing on single examples from the validation data to get a feeling for how well the model performs. For this, we need a function that returns the model’s prediction for a given dataset element, along with the person’s data and whether the person actually survived. As you can see, in order to display the data we have to denormalize it again, mapping all values from the range between 0 and 1 back to their initial ranges and converting the categorical column Sex back from the numbers 0 and 1, as which it was represented in the dataset, to the strings female and male.
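One possible shape for such a helper, reusing input_cols from above and recomputing the per-column minimums and maximums from the raw dataframe to undo the normalization (all of this is a sketch under the earlier assumptions):

```python
# Min/max of the raw inputs, needed to reverse the min-max normalization.
raw = df.copy()
raw['Sex'] = raw['Sex'].astype('category').cat.codes
raw_inputs = raw[input_cols].to_numpy(dtype='float32')
mins, maxs = raw_inputs.min(axis=0), raw_inputs.max(axis=0)

def predict_single(inp, target, model):
    # Prediction on the (already normalized) input tensor.
    prob = model(inp.unsqueeze(0)).item()
    # Map the values back to their original ranges for display.
    original = inp.numpy() * (maxs - mins) + mins
    data = {col: round(float(v), 2) for col, v in zip(input_cols, original)}
    data['Sex'] = 'male' if data['Sex'] >= 0.5 else 'female'
    print('Passenger data:', data)
    print(f'Predicted survival probability: {prob:.2f}')
    print('Actually survived:', bool(target.item()))

# Example: first element of the validation set.
x, y = val_ds[0]
predict_single(x, y, model)
```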

As expected, the model gets most predictions right, with survival probabilities that make sense when looking at the input data. Even though it was wrong on the first prediction, this is not a bad sign, since there is always a component of luck involved that makes an individual case not perfectly predictable. If we recall the survival probabilities for persons of different sexes and classes, we can see that the prediction is actually pretty close to those, which I think is a good sign.

Don’t you want to find out as well whether you would have survived the Titanic disaster? For this, we have a nice function that asks you to input your data and then returns its prediction after converting the categorical values to numbers and normalizing the input data. Just think of a fare that is reasonable for your chosen class (or don’t, and try to break the predictions). You can, of course, completely make up data to test the model and see which people would have survived.
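A hypothetical version of that interactive helper, normalizing the entered values with the same mins and maxs as in the previous sketch:

```python
def would_you_survive(model):
    # Ask for the same six inputs the model was trained on.
    pclass = float(input('Passenger class (1, 2 or 3): '))
    sex = 1.0 if input('Sex (male/female): ').strip().lower() == 'male' else 0.0
    age = float(input('Age: '))
    sibsp = float(input('Siblings/spouses aboard: '))
    parch = float(input('Parents/children aboard: '))
    fare = float(input('Fare in 1912 pounds: '))
    raw_x = np.array([pclass, sex, age, sibsp, parch, fare], dtype='float32')
    # Apply the same min-max normalization as for the training data.
    x = torch.from_numpy((raw_x - mins) / (maxs - mins))
    prob = model(x.unsqueeze(0)).item()
    print(f'Predicted probability of survival: {prob:.2%}')

would_you_survive(model)
```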

Lastly, we can make a submission .csv file for the Titanic competition on Kaggle to claim first place ;).

See the entire notebook here.

Summary and Opportunities for Future Work

Lastly, I want to summarize the things I learned from this nice project at Jovian. The first major takeaway was how to deal with Pandas dataframes and with data in general, which had usually been done for me whenever I was provided a starter notebook. Since I did this project from scratch, I read up on the pandas library and its various functions and was able to make good use of the data. I also learned quite a bit about data normalization.

Another thing I took away from this was a lot of knowledge about Logistic Regression, as I read quite a lot about the various approaches. For example, I read about why you would use 1 output neuron versus 2 output neurons for binary classification and came to the conclusion that using 1 output neuron is less prone to overfitting, as it has fewer parameters, which makes total sense. This is also why I used a single output with Binary Cross Entropy for my model. Moreover, I learned the math behind regularization to understand and implement it better, which helped a lot when implementing it and choosing the weight decay hyperparameter.

Not to be forgotten are the things I learned about the disaster itself, from examining the data and from additional research, which was very interesting.

To sum up, I cannot stress enough how great such projects are for learning: by doing everything yourself, you learn much more. I feel more comfortable with the PyTorch library and with Machine Learning now.

I can’t wait to work on more challenging projects with other datasets in the future and to compete in various interesting Kaggle challenges with all the newly learned things, to deepen my knowledge in the world of AI and have fun. I am really looking forward to the thousands of projects that I have in my mind!
