The chronicles of an aspiring Data Scientist faced with the titanic task of classifying the Titanic dataset.

HELLO WORLD! Oh yes, I am indeed a programmer. Well, trying to be one anyway.

Life’s too short to learn from one’s own mistakes alone, right? So in this blog and in the ones to come, I intend to share the mistakes I make and the lessons I learn from them (well, at least I hope I learn from them). And rest assured, I WILL be making tons of errors, so you will surely have lots to learn from. So do enjoy the journey!

Just to give you a brief overview: in the Titanic problem, we need to predict, for each person in the test dataset, whether they survived or not, using the existing training dataset. The coding has been done in Python, and the algorithm I’ve used here is Random Forest. Not because it’s the best algorithm for this problem, but because it’s the one I began my journey with. So what you are about to see is a nice way to get started with the algorithm, and with data analysis as a whole too. I’ve used the sklearn module for the implementation. No feature engineering has been done, and I won’t be going into the EDA or the data cleaning parts either. This blog is meant to introduce you to Random Forests and to show the Titanic dataset being trained with this algorithm. I shall begin by explaining decision trees in brief. If you are familiar with these basics, please feel free to skip ahead.

What are Decision Trees?

Let’s start with something small. Say I have a question bugging me: do I order a pizza? There are two ways to go about it, right? If I want to eat one, I’ll order it, and if I don’t, well, I won’t. Let’s extrapolate this a bit further. Say I have an assortment of delicacies to choose from. The first question I’ll ask here is “Am I hungry in the first place?” If yes, I’ll order some food, and if not, I won’t. Then I’ll take a look at my wallet, and based on the amount of money I’m willing to spend, my choice of order becomes narrower. Next I decide whether I like my food spicy or sweet, and so on, until I reach the food I want to order. This is exactly what a decision tree is: a sequence of questions where each answer narrows down the possible outcomes. Of course, the core concepts behind constructing one go way beyond this simple example, but this was just to give you a gist of what they are.

Decision trees are classified under supervised machine learning models. For those of you who are unfamiliar with the term, supervised machine learning models are those where the training data, i.e. the data on which we build our model, consists of a precise set of input and output values. Thus we know exactly which data to train our model with and which values we need to predict. In machine learning, similar decision trees can be constructed wherein the decisions are made on the various input features of the data.
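To make this concrete, here is a minimal sketch of a decision tree in sklearn. The tiny feature matrix and labels below are made up purely for illustration; they are not from the Titanic data.

```python
# A minimal sketch of a supervised decision tree in sklearn,
# using a toy feature matrix X and known labels y (made up for illustration).
from sklearn.tree import DecisionTreeClassifier

X = [[25, 1], [40, 0], [19, 1], [52, 0]]  # toy inputs, e.g. [age, is_hungry]
y = [1, 0, 1, 0]                          # known outputs (this is what makes it supervised)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)                   # the tree learns decision rules from the features
print(tree.predict([[30, 1]]))   # predict the output for a new, unseen input
```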

What are Random Forests?

What if there is some noise (unwanted junk data or outliers) in our data? What if we have features which aren’t of any use in predicting our output? Well, a random forest takes care of this by taking multiple samples of equal size from the data, letting each tree look at a random subset of the input features, training a model on each of these samples, and aggregating the results we get from them to predict the output. As a result, the effect of noise is reduced considerably.
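Here is what that looks like in sklearn, again as a minimal sketch with the same made-up toy data as before.

```python
# A minimal sketch of a Random Forest: many trees, each trained on a
# bootstrap sample of the data and a random subset of features, then aggregated.
from sklearn.ensemble import RandomForestClassifier

# The same made-up toy data as in the decision tree sketch.
X = [[25, 1], [40, 0], [19, 1], [52, 0]]
y = [1, 0, 1, 0]

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # each split considers only a random subset of features
    random_state=0,
)
forest.fit(X, y)           # each tree sees its own bootstrap sample of the data
print(forest.predict([[30, 1]]))  # the trees' votes are aggregated into one prediction
```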

The Titanic dataset:

This is the data from the training dataset. The first column, “Survived”, is what we’ll need to predict on the test set. Note that this is a cleaned dataset, in that the missing values have been taken care of and the necessary encoding has been done, amongst other processes. If you aren’t familiar with data cleaning or preprocessing, don’t worry about it for now; I will probably cover them in future blogs. While it may seem impossible to predict survival from the data available, it turns out that we actually can! For example, the number of females who survived was a lot higher than the number of males. Information like this can be used to perform the necessary prediction.
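If you want to follow along, loading the cleaned data looks something like this. The file names below are placeholders I’ve assumed; use whatever your own cleaned train and test files are called.

```python
# A sketch of loading the cleaned data with pandas.
# "train_cleaned.csv" / "test_cleaned.csv" are assumed file names.
import pandas as pd

train = pd.read_csv("train_cleaned.csv")
test = pd.read_csv("test_cleaned.csv")

X_train = train.drop("Survived", axis=1)  # all input features
y_train = train["Survived"]               # the column we want to predict
print(X_train.shape, y_train.shape)
```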

So, now that I’ve given a brief explanation of the algorithm and the dataset, let’s take a look at the implementation, shall we?

While trying to fit the model initially, I kept getting the error ‘Could not convert string to float’, and it bugged me like anything. That’s when I realized that these algorithms can handle only numeric data, and my ‘Embarked’ column was a string with values ‘C’, ‘Q’ and ‘S’. I performed one hot encoding to tackle this problem. What is that, you ask? Well, as you can see in the dataset above, there are three columns for ‘Embarked’ instead of one, one for each value ‘C’, ‘Q’ and ‘S’. That is one hot encoding. A ‘1’ in one of those columns means that, for that particular row, the value of the original ‘Embarked’ column was the value that column represents.
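If you’re using pandas, get_dummies does this in one line. A small sketch with a made-up column:

```python
# A sketch of one hot encoding the 'Embarked' column with pandas.
# get_dummies creates one indicator column per distinct value ('C', 'Q', 'S').
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
encoded = pd.get_dummies(df, columns=["Embarked"])
print(encoded)  # columns Embarked_C, Embarked_Q, Embarked_S, one flag set per row
```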

With hopes of getting a decent accuracy, I went on to run the algorithm, only to get an accuracy of around 78% while my peers boasted accuracies in the 80s. That’s when I realized that I had forgotten one of the most important steps: ‘hyper parameter tuning’. Think of this as a fan’s regulator. You don’t need it to get the fan running, but you certainly need it to adjust the fan’s speed to suit your needs. Hyper parameters are something similar. They are parameters you set in a model before the learning process. There are no fixed ‘perfect values’ for these; they vary according to your data. It turns out that there are algorithms (you can search for Grid Search and Random Search for further information) to which we can give a range of values that we think are suitable for our model. The algorithm searches over all possible combinations of these hyper parameters and finds the combination which yields the best result for our model. So I implemented it. Only to find out that the search took more than 10 minutes to execute. That’s not sustainable for the day to day problems you may solve in the future. Moral of the story? Don’t blindly try out tons of values. Go with a small range of values, go through the hyper parameters one by one, and follow a hill climbing approach.
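For reference, a grid search over a deliberately small grid looks something like this. X_train and y_train here are assumed to be the cleaned training features and the ‘Survived’ column from the loading sketch earlier.

```python
# A sketch of hyper parameter tuning with GridSearchCV, keeping the grid small.
# X_train, y_train are assumed from the earlier loading sketch.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],  # a small, deliberate range of values
    "max_depth": [4, 6, 8],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                  # 5-fold cross validation for every combination
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)  # the best combination found
```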

So far so good. I then went on to evaluate my model and ended up with a decent accuracy of 84%. Good, right? Not necessarily. And this is easily the most common mistake you can end up falling prey to.

The model is trained by forming cross validation sets within the train data. Do look up cross validation if you aren’t familiar with the concept.

Anyway, if there is a massive gap between the error on the train set and the error on the cross validation set, your model is bound to perform badly on the test set.

Moral of the story? While checking for accuracy, always check whether the train set accuracy matches the cross validation set accuracy.
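A quick sketch of that sanity check, reusing the X_train and y_train assumed earlier:

```python
# Compare accuracy on the training data with cross validated accuracy.
# A large gap between the two is a sign of overfitting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                  # X_train, y_train as before

train_acc = model.score(X_train, y_train)    # accuracy on the data the model saw
cv_acc = cross_val_score(model, X_train, y_train,
                         cv=5, scoring="accuracy").mean()
print(f"train accuracy: {train_acc:.3f}, CV accuracy: {cv_acc:.3f}")
```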

Red lines represent the accuracy on the train data and blue lines represent the accuracy on the CV set
