Machine Learning Model to Predict Survival in Titanic [Pt. 1]

ML Basics: Investigating the sinking of the Titanic

Vinicius Nala
7 min read · Jan 22, 2023

Kaggle is an online community platform for data scientists and machine learning enthusiasts. Kaggle allows users to publish datasets, use GPU-enabled notebooks, and compete with other data scientists to solve data science challenges. The platform's aim is to help professionals and learners reach their goals in their data science journey with the powerful tools and resources it provides. Today (2023), there are over 10 million registered users on Kaggle.
Users can compete with each other to see who builds the most accurate machine learning model in a competition. In this article, we will construct one for the "Titanic — Machine Learning From Disaster" competition. The objective is very simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. I chose this contest because it is well known as the "Hello World!" of data science, so there was no better option for an article showing the machine learning basics.

Introduction

Every data science project requires, before importing the dataset and starting to code, following a few steps:

  1. Understand the problem
  2. Obtain data
  3. Explore the data
  4. Prepare the dataset
  5. Modeling
  6. Evaluate

When you plan the route you will take, by making a flowchart or a checklist of the tasks that need to be done, the path becomes much easier to follow.

Source Code

Notebook on Kaggle:

Notebook on Github:

Understand the problem

Although luck played an important role in the survival of some passengers, some people were more likely to survive than others. And this is our first task: understand why some people had a better chance of surviving than others. Then we build a machine learning model to predict whether a person survived or not, based on the data given by Kaggle.

The complete description of the competition is available on the Kaggle website:

Obtain data

All the necessary data can be accessed on the competition site; it has been split into two groups:

Training set (train.csv)

  • should be used to build the machine learning model
  • provides the outcome for each passenger: survived or not

Test set (test.csv)

  • should be used to see how well your model performs on unseen data
  • does not provide whether the passenger survived or not

To download the files, it's necessary to be registered on Kaggle.

Explore the data

Undoubtedly, this is the most important part of the project; here we will spend 70% to 80% of our time. The quality of our analysis is directly related to the performance of our model.

In this step, the objective is to identify the variables that tell us the most about the target variable. A good way to do this is by measuring the correlation between the informative variables and the target variable.

First, let's look at all the variables and briefly think about how each one relates to the problem:

Data Dictionary

  • PassengerId: Each passenger has a unique id number, so it does not affect anything in our problem.
  • Survived: Contains 1 when the passenger survived and 0 when they died (target variable).
  • Pclass: The class of the passenger (1 = 1st, 2 = 2nd, 3 = 3rd).
  • Name: Like PassengerId, it has no relation with Survived, since it is unique for each passenger.
  • Sex: Whether the person is male or female.
  • Age: The age of each passenger.
  • SibSp: Total number of siblings and spouses aboard the ship.
  • Parch: Total number of parents and children aboard the ship.
  • Ticket: The ticket number; each passenger has a unique value.
  • Fare: The price of the passage.
  • Cabin: The cabin number of each passenger.
  • Embarked: The port each person embarked from (C = Cherbourg, Q = Queenstown, S = Southampton).

Overview

Now it's time to get our hands on the code. Let's begin by looking at the characteristics of the dataset:
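
As a minimal sketch, assuming the competition files were saved as train.csv and test.csv next to the notebook (the variable names train and test are our own choices), we can load and inspect the data with pandas:

    import pandas as pd

    # Load the competition files (paths assume they sit next to the notebook)
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # A first look: dimensions, column types, non-null counts, sample rows
    print(train.shape)   # (891, 12) for the original competition data
    train.info()
    train.head()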

*Usually, real-world data does not come as well processed as in this dataset

Percentage of missing values

Cabin, with more than 77% of its values missing, is the variable with the highest percentage of missing values in the dataset, followed by Age with 19% missing.
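
A short sketch of how these percentages can be computed, reusing the train DataFrame from the previous snippet:

    # Percentage of missing values per column, from highest to lowest
    missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
    print(missing_pct.head())   # Cabin ~77%, Age ~20%, Embarked ~0.2%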

Statistical Distribution

Using the .describe() method, we can see measures of central tendency and some information about the statistical distribution of the dataset.
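
For example, on the training set:

    # Count, mean, standard deviation, min, quartiles, and max
    # for every numeric column
    train.describe()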

Plotting histograms is a better way to see the statistical distribution of the dataset. They are very useful because we can extract several pieces of information with just a quick look, as the points below show (a plotting sketch follows the list):

  • The majority of people died
  • There are fewer people in the 2nd class than in the 1st class
  • Most people aboard the ship were around 20 years old
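
A minimal sketch of how these histograms can be drawn (the number of bins is an arbitrary choice):

    import matplotlib.pyplot as plt

    # One histogram per numeric column of the training set
    train.hist(bins=20, figsize=(12, 8))
    plt.tight_layout()
    plt.show()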

Which groups of people had a better chance of surviving?

Now we can start identifying correlations between the variables. To do this, let's plot some graphs comparing survival with sex, passenger class, and port of embarkation:
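
Here is one possible sketch of such plots; it compares the survival rate per group, which is one reasonable way to make the comparison:

    import matplotlib.pyplot as plt

    # Survival rate per group for sex, passenger class, and embarkation port
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    for ax, col in zip(axes, ["Sex", "Pclass", "Embarked"]):
        train.groupby(col)["Survived"].mean().plot.bar(ax=ax)
        ax.set_ylabel("Survival rate")
    plt.show()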

Look how interesting plotting a graph can be; from it, the following information can be extracted:

  • Women had a better chance of surviving than men
  • A higher class means a higher chance of survival
  • Survival by port of embarkation: Cherbourg (C) > Queenstown (Q) > Southampton (S)

Understanding a graph can be very insightful, isn't it wonderful? Reading a graph and seeing what it shows is an essential skill if you aim to become a data scientist, and it can be improved by studying data analysis.

Let’s see another example:

Analyzing the age distribution of the survivors, we can see, very subtly, that there is a peak in the right graph between 0 and 5 years. Although the charts are very similar, a more attentive eye can see that kids had a better chance of surviving: "Ladies and children first!".
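
A sketch of how these two age distributions can be drawn side by side (the bin count is arbitrary):

    import matplotlib.pyplot as plt

    # Age histograms side by side: victims on the left, survivors on the right
    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    for ax, outcome, title in zip(axes, [0, 1], ["Died", "Survived"]):
        train.loc[train["Survived"] == outcome, "Age"].hist(bins=25, ax=ax)
        ax.set_title(title)
        ax.set_xlabel("Age")
    plt.show()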

A very interesting type of graph that is worth mentioning is pandas' scatter matrix:

The advantage of this kind of graph is how much information it packs into a single figure. The diagonal shows the histogram of each variable, so we can see its distribution, while the scatter plots show the pairwise relationships between all the variables.
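
A minimal sketch, using pandas.plotting.scatter_matrix on a few numeric columns of our own choosing:

    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # Pairwise scatter plots of a few numeric columns,
    # with each variable's own histogram on the diagonal
    cols = ["Survived", "Pclass", "Age", "Fare"]
    scatter_matrix(train[cols], figsize=(10, 10))
    plt.show()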

After analyzing it, we can notice that the older passengers were concentrated in 1st class, while the younger ones were concentrated in 3rd class. Admittedly, it is not easy to spot; it is a very subtle distinction that requires a keener eye.

Here, on the left graph, this distinction is easier to see: in 3rd and 2nd class, a 60-year-old is considered an outlier, whereas in 1st class they are not. The right graph shows the same plot, but grouped by Survived (0 = died, 1 = alive), and we can see that, in all classes, the older passengers died while the younger ones stayed alive, which supports what was mentioned above: younger people were more likely to survive.
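
A sketch of box plots along these lines, assuming seaborn is available:

    import seaborn as sns
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

    # Left: age distribution per class; right: the same, split by outcome
    sns.boxplot(data=train, x="Pclass", y="Age", ax=axes[0])
    sns.boxplot(data=train, x="Pclass", y="Age", hue="Survived", ax=axes[1])
    plt.show()
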
Finally, let’s look at the heatmap of the variables to see the correlation between them.

Correlation can be described as how much one variable moves with another; this value can be calculated and varies between -1 and 1. A value of -1 is a perfect negative correlation, which means that one variable decreases as the other increases; 0 means no correlation at all; +1 is a perfect positive correlation, which means that as one variable increases, the other increases too.
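
A sketch of how such a heatmap can be produced, again assuming seaborn is available:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pearson correlation between the numeric columns, drawn as a heatmap
    # (numeric_only=True requires pandas >= 1.5; drop it on older versions)
    corr = train.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()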

That's all for Part 1. In the next part, we will prepare the data frame, handle the missing values, and start modeling.

Conclusion

In this article, we got acquainted with the situation, understood the problem, plotted the main graphs, and learned how to identify relevant variables through visualization.

Remember: neglecting this initial phase, going directly to the modeling, and choosing informative variables without any criteria will result in poor performance.

If you want to succeed in your analyses, learn to document everything, detailing each stage in the notebook as much as possible.

In the next part, we will prepare the dataset for the machine learning model: handling the missing data, outliers, and categorical variables. Then we will start modeling and, at the end, evaluate the performance of the model.
