Journey of the RMS Titanic through Data Science

Saahil Sharma
Published in GreyAtom
Jul 24, 2017 · 6 min read

Introduction

The RMS Titanic was one of the largest passenger liners ever built at the time. Completed in 1912, the ship was so enormous that it could carry around 3,500 people (including the crew) without any trouble.

But fate had something different in store. On her very first voyage from Southampton to New York City, the crew could not see the trouble ahead in time, and the ship struck an iceberg so hard that it eventually broke apart and sank; by most estimates, almost 1,500 people died.

This disaster is considered one of the most infamous of all time, and many adaptations of it have been made in the form of movies and plays.

Approach

Now the main questions are: How do I start? How am I supposed to approach this problem? These questions come to every beginner data scientist’s mind. Kaggle provides help here as well: KAGGLE TUTORIALS AND KERNELS.

I came across a beautiful tutorial by Mr. Manav Sehgal.

PS: All the cleaning and wrangling of the data follows his work. My main motive is to apply some machine learning algorithms and test their accuracy in the Kaggle competition.

About the Dataset

Admittedly, the Titanic hardly needs an introduction, as everyone on this planet is aware of the ship (thanks to James Cameron), but we still need to understand the recorded dataset and all the information it gives us.

  • On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. That translates to a survival rate of roughly 32% (722 of 2,224 survived).
  • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
  • Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.

Notebook

Let’s import all the packages we will be working with.
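A minimal sketch of those imports, assuming pandas and NumPy for data handling and scikit-learn for the classifier used later in the post:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier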

After successfully importing the packages, we will load the ‘Train’ and ‘Test’ datasets downloaded from the Kaggle website.

Since the loading code is wrapped in a function, the file path is supplied during the function call, as sketched below.
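A sketch of such a loader, assuming the standard train.csv and test.csv files from the Kaggle competition page (the function name load_data is mine):

    def load_data(path):
        # Read one of the Kaggle CSV files into a DataFrame
        return pd.read_csv(path)

    train_df = load_data('train.csv')
    test_df = load_data('test.csv')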

Since Ticket and Cabin do not contribute much to our predictions, we will eliminate both features.
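Dropping the two columns from both DataFrames might look like this:

    # Ticket and Cabin add little predictive signal, so drop them from both sets
    train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
    test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)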

Now, using pandas’ built-in crosstab function, we will create a cross-tabulation of the features ‘Title’ and ‘Sex’, from which we can see which titles correspond to which sex. But before that, we need to extract the titles from the Name column.
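A sketch of the extraction and cross-tabulation, assuming every title in the Name column ends with a period (e.g. ‘Braund, Mr. Owen Harris’):

    # Pull out the word ending in '.' from each name as the passenger's title
    for df in (train_df, test_df):
        df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Tabulate titles against sex
    print(pd.crosstab(train_df['Title'], train_df['Sex']))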

Now we need to deal with all the different titles in the Title column. Most of them are too rare to contribute to our machine learning model, so we will replace each of those with the word ‘Rare’, and wherever there is an uncommon prefix, we will replace it with the appropriate common word.
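One way to do this, following the common approach from the tutorial (the exact list of rare titles is my assumption); I also drop the raw Name column here, since it is no longer needed once the title is extracted:

    # Titles too infrequent to learn from are grouped under a single label
    rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                   'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
    for df in (train_df, test_df):
        df['Title'] = df['Title'].replace(rare_titles, 'Rare')
        # Map uncommon prefixes onto their usual English forms
        df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

    # The raw Name column has served its purpose
    train_df = train_df.drop('Name', axis=1)
    test_df = test_df.drop('Name', axis=1)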

Since we will need ‘PassengerId’ only from the ‘Test’ dataset, we will remove that attribute from the ‘Train’ dataset. We will also assign 1 to female and 0 to male, so that this categorical variable becomes a numerical one.
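A sketch of both steps:

    # PassengerId is only needed from the test set, for the submission file
    train_df = train_df.drop('PassengerId', axis=1)

    # Encode sex numerically: female -> 1, male -> 0
    for df in (train_df, test_df):
        df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)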

At last, we will create a new column, IsAlone, marking those passengers who embarked alone, without any family members. We will also fill the null values in the Fare column with the median of the column.
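One common way to derive IsAlone is through a family-size count built from the SibSp and Parch columns; the Fare fill targets the single missing value in the Kaggle test set:

    for df in (train_df, test_df):
        # Family size = siblings/spouses + parents/children + the passenger
        family_size = df['SibSp'] + df['Parch'] + 1
        df['IsAlone'] = (family_size == 1).astype(int)

    # Fill missing fares with the column median
    test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())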

This ends our data cleaning process. We are ready with a genuinely workable dataset to predict outcomes.

‘X’ holds the features of the Train dataset
‘y’ holds the target column (‘Survived’)

With the help of pandas’ get_dummies function, we will encode the remaining categorical features into 0/1 indicator columns.
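A sketch using get_dummies; I am assuming the remaining categorical columns at this point are Title and Embarked:

    # One-hot encode the categorical columns into 0/1 indicator columns
    train_df = pd.get_dummies(train_df, columns=['Title', 'Embarked'])
    test_df = pd.get_dummies(test_df, columns=['Title', 'Embarked'])

Since the two frames are encoded separately, it is worth checking that both end up with the same columns; with this dataset the categories match once the rare titles are merged.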

Now we are ready to build our model and check the accuracy on the test dataset.

For classification, we will use a RandomForestClassifier with essentially no hyperparameter tuning, to check how our base model performs.

Scikit-learn’s API provides an easy way to instantiate a model and use it on the dataset. We just assign the model object to a variable and we are good to go.

As you can see, it is easy to call. All the algorithmic work behind the scenes is taken care of by scikit-learn.
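A sketch of the instantiation; the specific parameter values are my assumptions, and the next few lines describe what each of them does:

    # A mostly-default random forest; only a few arguments are set explicitly
    model = RandomForestClassifier(oob_score=True, max_depth=None, random_state=42)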

oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy.

max_depth: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

random_state: The seed used by the random number generator. If a RandomState instance is given, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Now that we have our model ready, let’s fit it to our training dataset, with the ‘Survived’ column as the target and all the remaining features as inputs.

In our case, the slicing returns all rows, with columns 1 through the last as the features and column 0 as the target.
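In code, with ‘Survived’ sitting in column 0 of the cleaned training frame (this assumes the tutorial’s cleaning has already filled remaining gaps such as missing Age values):

    # Features: every column after the first; target: the 'Survived' column
    X = train_df.iloc[:, 1:].values
    y = train_df.iloc[:, 0].values

    model.fit(X, y)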

We then use the predict method to run the fitted model on the Test dataset.
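Prediction runs on the test features, with PassengerId set aside for the submission file:

    X_test = test_df.drop('PassengerId', axis=1).values
    predictions = model.predict(X_test)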

In the end, we will save the returned array into a DataFrame, with the ‘PassengerId’ column from the Test dataset and the predicted values as the ‘Survived’ column.
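A sketch of building and saving the submission file (the output filename is my choice):

    submission = pd.DataFrame({
        'PassengerId': test_df['PassengerId'],
        'Survived': predictions,
    })
    submission.to_csv('submission.csv', index=False)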

After submitting to the Kaggle competition, the result I got is this:

I know it is not that good, but for a first-timer, I think I achieved quite a decent score.

In the next post, we will try to tune our model’s hyperparameters and see if the accuracy goes up. We will also do some cross-validation and feature engineering.

For the complete code, please check out my GitHub repository:

Till next time, keep reading the blogs. See you around.
