Image taken from K. Mitch Hodge

Random Forest on Titanic Dataset ⛵.

Here we will explore the features of the Titanic dataset available on Kaggle and build a Random Forest classifier.

Many times I have gone to Kaggle looking for solutions or different datasets. I have taken several machine learning courses, and all of them, at one point or another, use a dataset from Kaggle. It makes sense, since the datasets are well described, already split into training and testing sets, and come with many features for you to explore. So I decided to jump into Kaggle and try my first competition, and the best starting point is the Titanic dataset, which is Kaggle's getting-started competition. For those who don't know, RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912. You can read more on Wikipedia, and there is also a beautiful movie called Titanic.

The idea is to use the Titanic passenger data (name, age, price of ticket, etc.) to predict who will survive and who will die. Kind of creepy, but it is a valid approach. So let's start by loading the dataset. In my case, I downloaded it as a .zip file from Kaggle.

We will use Python and Jupyter Notebook. Let's start with our imports and extracting the .zip file. I'm also setting a Seaborn style that I like.

Import of libraries
Head view of the DataFrame
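The setup above might look roughly like the following minimal sketch; the archive and CSV file names are assumptions based on how Kaggle packages the competition data.

import zipfile

import pandas as pd
import seaborn as sns

# Seaborn style for the plots (any built-in style works)
sns.set_style("whitegrid")

# Extract the archive downloaded from Kaggle (file name assumed)
with zipfile.ZipFile("titanic.zip", "r") as archive:
    archive.extractall("data")

# Kaggle provides a labeled training set and an unlabeled test set
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape, test.shape)  # (891, 12) and (418, 11)
train.head()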

We have our training and testing data loaded. The training set contains 891 examples and 12 columns, including the label, and the test set contains 418 rows and 11 columns, with no label. Next is the description of the features.

  • PassengerId is a numerical ID that identifies each passenger.
  • Survived is our label; it is a binary feature, 1 if the passenger survived and 0 otherwise.
  • Pclass is the ticket class (1 = 1st/upper, 2 = 2nd/middle, 3 = 3rd/lower).
  • Name is the passenger's name, including a title such as Mr or Mrs.
  • Sex is the passenger's sex.
  • Age is the age in years.
  • SibSp is the number of siblings/spouses aboard the Titanic.
  • Parch is the number of parents/children aboard the Titanic.
  • Ticket is the ticket number.
  • Fare is the passenger fare.
  • Cabin is the cabin number.
  • Embarked is the port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton.

Let’s see a description of the data.

Description of dataset

From the description of the data we can see that we have many missing values. We have 891 passengers but only 714 confirmed ages, 204 cabin numbers and 889 embarkation ports. Now, if you saw the movie you will agree with me that one of the missing ones is Jack Dawson (Leonardo DiCaprio in Titanic). Let's see if he is in this dataset 😅.
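For reference, the summary above and the (tongue-in-cheek) search for Jack Dawson could be reproduced like this, reusing the train DataFrame loaded earlier.

# Summary statistics; the count row reveals missing values per column
train.describe(include="all")

# Spoiler: Jack Dawson is a fictional character, so this comes back empty
train[train["Name"].str.contains("Dawson", na=False)]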

Processing missing data and duplicates

Before training a model to do any kind of classification or regression, we first have to make sure our data is understandable and that there is no garbage in it. Although cleaning data is not as entertaining as training algorithms, this step is of critical importance in every machine learning project. First, let's check for missing and duplicated values.

Missing and duplicated values

Filling Embarked and Fare

We can see that we have many missing values in Age and Cabin, as we suspected from the description above. Most of the missing data comes from the Cabin feature; this could be because not everyone had a cabin. I remember from the movie that many were stowaways and many slept together in the same cabin. There are two missing values in Embarked and one in Fare. We can fill the missing Fare with the average fare, that's not a problem, and fill Embarked with the most common port of embarkation.
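A minimal sketch of the checks and the two fills described above; in the Kaggle data the single missing Fare happens to be in the test set.

# Count missing and duplicated values in both sets
print(train.isnull().sum())
print(test.isnull().sum())
print(train.duplicated().sum())

# Fill the single missing Fare (test set) with the average fare
test["Fare"] = test["Fare"].fillna(train["Fare"].mean())

# Fill the two missing Embarked values with the most common port
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])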

Filling Age

We have filled Embarked and Fare; now what can we do with Age?

To fill the age, we can extract the titles from the names (Miss, Mr, Mrs, Master, Dr), take the average age for each one, and then fill each missing age according to the passenger's title. Yes, Master is one of the titles used on the Titanic; it was used for boys and young men, mostly by English people.
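One way to implement that idea is sketched below; the regular expression and the fallback to the overall mean are my assumptions, and the original notebook may group the titles differently.

# Extract the title from names like "Braund, Mr. Owen Harris"
for df in (train, test):
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Average age per title, computed on the training set
age_by_title = train.groupby("Title")["Age"].mean()

# Fill missing ages by title; fall back to the overall mean for rare titles
for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Title"].map(age_by_title))
    df["Age"] = df["Age"].fillna(train["Age"].mean())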

Heat Map of missing data
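A heat map like the one above is usually drawn straight from the boolean missing-value mask, for example:

import matplotlib.pyplot as plt
import seaborn as sns

# Light cells mark missing values; Cabin is almost entirely empty
sns.heatmap(train.isnull(), cbar=False, yticklabels=False)
plt.show()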

Almost all the data in Cabin is missing. I think we can make some assumptions to figure out a way to fill it: for example, let's keep only the first letter of each cabin and fill the missing ones with X.
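A small sketch of that transformation, assuming the train and test DataFrames from before:

# Keep only the deck letter of each cabin; missing cabins become "X"
for df in (train, test):
    df["Cabin"] = df["Cabin"].str[0].fillna("X")

# Compare the average fare of class X against the real cabin letters
train.groupby("Cabin")["Fare"].mean().sort_values()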

We can see that the mean fare of the X class is very low, which means that people without an assigned cabin had, in almost every case, a lower fare, but there are some outliers that we can handle. We could take those outliers and assign them to class C or B, since they have a higher fare, so let's do that. I'm going to assume that people without an assigned cabin paid a low fare, so everyone with a relatively high fare in class X is an outlier, and I will reassign them using the mean fare of the other classes.
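One possible sketch of that reassignment; the 95th-percentile cut-off and the closest-mean rule are assumptions on my part, not necessarily what the original notebook does.

# Mean fare per real cabin letter (everything except X)
fare_by_cabin = train[train["Cabin"] != "X"].groupby("Cabin")["Fare"].mean()

# Treat unusually expensive X tickets as outliers (threshold assumed)
threshold = train.loc[train["Cabin"] == "X", "Fare"].quantile(0.95)
outliers = (train["Cabin"] == "X") & (train["Fare"] > threshold)

# Reassign each outlier to the cabin letter with the closest mean fare
train.loc[outliers, "Cabin"] = train.loc[outliers, "Fare"].apply(
    lambda fare: (fare_by_cabin - fare).abs().idxmin()
)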

Now cabin class X has a low fare, as it should, and there are no more missing values in our dataset.

Feature engineering

Feature engineering involves analyzing the features and extracting useful information from them, as well as creating new features out of existing ones. Let's start by doing some visualization.

According to the graphics, we can see that most people were traveling alone and most belonged to 3rd class (lower). This matches what we saw earlier with the cabins and the fare: most people without an assigned cabin had a small fare, so it makes sense that they belong to 3rd class. We can create a new feature that specifies whether a person was traveling alone or with family, based on the SibSp (siblings/spouses) and Parch (parents/children) attributes, as well as a family-size feature; those attributes could be of interest. Also, let's plot the data in relation to the label.
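The two family-related features could be built like this; the column names FamilySize and Alone are my choices.

import seaborn as sns

# Family size counts the passenger plus siblings/spouses and parents/children
for df in (train, test):
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["Alone"] = (df["FamilySize"] == 1).astype(int)

# Example of plotting a feature against the label
sns.countplot(data=train, x="Pclass", hue="Survived")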

From the graphics we can see that most people died in the incident. Although most of the passengers were male, most of the survivors were women; of course, ship workers and staff were probably mostly male, so it makes sense. We can also see that the majority of those who died belonged to 3rd class; they were probably evacuated last and were possibly located in parts of the ship that were harder to reach. Most of those who died were alone, which makes sense because people in 3rd class were mostly traveling alone and were the ones who died the most. Now let's look at the age and the fare of the survivors.

Clearly the fares of those who survived were higher; we can see that in the distribution of each group. Those who survived were a little bit younger than those who died. We can also see that people in 1st class were older than the rest and people in 3rd class were younger. Next we can look at the correlation between the features. Fare and Survived have some correlation, but correlation doesn't take categorical features into account, so it is better to map features like Sex and Embarked to numbers.

Correlation matrix of features
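A correlation matrix like the one above can be drawn from the numeric columns only, for example:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation is only defined for numeric columns at this point
corr = train.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()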

Mapping categorical features

Machine learning algorithms deal with numbers, not categories, so we need to find a way to map these categories to numbers. This is easy to do in Pandas using the map() method. I will map female to 1 and male to 0, to get a positive correlation with the label since most of the survivors were female.

OK, now let's drop the unimportant features like the name and the ticket number.
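A sketch of the mapping and the dropping; the numeric codes for Embarked are an assumption.

# Map categoricals to numbers; female -> 1 so Sex correlates positively with Survived
sex_map = {"female": 1, "male": 0}
embarked_map = {"S": 0, "C": 1, "Q": 2}
for df in (train, test):
    df["Sex"] = df["Sex"].map(sex_map)
    df["Embarked"] = df["Embarked"].map(embarked_map)

# Drop the columns that carry little signal (plus the helper Title column from earlier)
train = train.drop(columns=["Name", "Ticket", "Title"])
test = test.drop(columns=["Name", "Ticket", "Title"])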

Normalize the data

Many machine learning algorithms, such as regression and distance-based ones, converge faster when the data is normalized; this is a key step in almost every machine learning project. To do so I will use the MinMaxScaler class from Scikit-learn, but first we need to drop the label. When scaling, we fit the scaler only on the training dataset.

Scaled training data set
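Roughly, the scaling step could look like this; only the training features are used to fit the scaler, and dropping PassengerId here is my choice since it is just an identifier.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Separate the label (and the identifier) from the features before scaling
X = train.drop(columns=["Survived", "PassengerId"])
y = train["Survived"]

# Fit on the training features only, then transform
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

The Kaggle test set would then be transformed with the same fitted scaler, without refitting it.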

Classification

I will use a Random Forest classifier. Random forest is a supervised learning algorithm that can be used for both classification and regression, and it is also one of the most flexible and easy-to-use algorithms. You can read more about it here. Let's split the data so we have a test set with labels.

Random Forest fit
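A minimal version of the split and the fit; the split ratio, the number of trees and the random seed are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out part of the labeled data so we can measure accuracy
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_val, y_val))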

An interesting thing about the RandomForestClassifier from Scikit-learn is that it provides a very easy way of reviewing how important each feature was for the classification.

Reviewing feature importance
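The importances come straight from the fitted estimator, for example:

import pandas as pd

# feature_importances_ is an attribute of the fitted forest
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))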

We can remove the least important features and see if the accuracy improves. With all the features, the accuracy was 82%. Let's see what happens when we remove the least important ones, like 'Alone', 'Parch' and 'Embarked'.

After removing the least important features the accuracy improves, and our model is better at identifying those who died than the survivors; after all, there were more examples of them. Finally, we have the confusion matrix and the predictions on our original test dataset.
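A sketch of the reduced model and its evaluation, reusing the names from the previous snippets; the real notebook would predict on Kaggle's unlabeled test set in the same way.

from sklearn.metrics import classification_report, confusion_matrix

# Retrain without the least important features
keep = [c for c in X_train.columns if c not in ("Alone", "Parch", "Embarked")]
model.fit(X_train[keep], y_train)

# Confusion matrix and per-class metrics on the held-out set
y_pred = model.predict(X_val[keep])
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))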

Summary

In this article you have seen how to explore the features of the Titanic dataset available on Kaggle. We achieved an accuracy of 85% in classifying the survivors, which is not great, but this is a small dataset. The accuracy could be improved by tuning the hyperparameters of the classifier, adding new features or maybe trying a different classifier; there is a good article about tuning Random Forest hyperparameters here. I will explore that in a later publication. The complete notebook is available on GitHub.

References

Kaggle: https://www.kaggle.com/c/titanic

About Random Forests: https://www.datacamp.com/community/tutorials/random-forests-classifier-python

There is also this good publication about Random Forests: https://towardsdatascience.com/random-forest-classification-and-its-implementation-d5d840dbead0

And of course Scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
