How I scored in the top 9% of Kaggle’s Titanic Machine Learning Challenge

Peter [REDACTED]
Sep 25, 2017 · 10 min read

For those aspiring to be Data Scientists, or simply those wanting to get their feet wet with machine learning, Kaggle is a great site to try. Kaggle is a website that hosts a ton of machine learning competitions, presented either by Kaggle itself or by major companies such as Google, Intel, and Facebook. They have a few beginner competitions for newbies, including their most popular one: the Titanic Machine Learning Challenge.

I just recently started “Kaggling” and I must say, the challenges can be quite addictive as you try to improve your predictions and watch your name soar up the leaderboard. I decided to take a shot at the Titanic challenge and was able to crack the top 9% with one of my submissions.

This post is an abbreviated walkthrough of some of the data wrangling, feature engineering, and modeling I tried in order to achieve that score. For the full code, feel free to check out my Jupyter Notebook on my GitHub account.

Disclaimer: You will probably notice that some of the features I used are borrowed from other people that have posted their results in other blogs. My work is an amalgamation of what I have read on other blogs, a quick course I took on datacamp.com, and my own machine learning knowledge through my current Master’s in Data Science program.

The Titanic challenge is of course based on the infamous sinking of the behemoth passenger ship in the North Atlantic on April 15, 1912. A combination of icy waters, too few lifeboats for everyone onboard, and other systemic failures led to the deaths of 1,502 of the 2,224 passengers and crew. The Kaggle challenge provides data on 891 passengers (the training data), including whether they survived or not, and the goal is to use that data to predict the fate of 418 passengers (the test data) whose outcome is unknown.

Here are the modules I used for this project:

# classifier models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# modules to handle data
import pandas as pd
import numpy as np

I. Data Wrangling and Preprocessing

Kaggle provides a test and a train dataset. The training data provides a Survived column which shows a 1 if the passenger survived and a 0 if they did not. This is ultimately the feature we are trying to predict so the test set will not have this column.

Because I’m lazy and don’t like doing things twice, I first loaded the data into train and test variables and then created a titanic variable where I appended the test set to the train set, so that I could create new features for both data sets at the same time. I also saved indexes for the train and test portions so that I can separate them back out later.

# load data 
train = pd.read_csv('./Data/train.csv')
test = pd.read_csv('./Data/test.csv')
# save PassengerId for final submission
passengerId = test.PassengerId

# merge train and test (pd.concat, since DataFrame.append was removed in newer pandas)
titanic = pd.concat([train, test], ignore_index=True)
# create indexes to separate data later on
train_idx = len(train)
test_idx = len(titanic) - len(test)

Let’s quickly peek at the data and see what it looks like:

# view head of data 
titanic.head()

With the full data in the titanic variable, we can use the .info() method to get a description of the columns in the dataframe.

# get info on features
titanic.info()

This shows us all the features (or columns) in the data frame along with the count of non-null values. Looking at the RangeIndex we see that there are 1309 total entries, but Age, Cabin, Embarked, Fare, and Survived each have fewer non-null values than that, which means those columns contain some null, or NaN, values. This is a dirty dataset, and we either need to drop the rows with NaN values or fill in the gaps by leveraging the rest of the data to estimate what those values could have been. We will choose the latter and estimate the missing values rather than lose observations. One thing to note, however, is that the Survived feature will not require us to fill in any gaps: the count of 891 represents the labels from the training data, and since Survived is the column we are trying to predict, the test set does not have it at all.
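If you want the missing counts spelled out explicitly, an optional check like this lists the number of NaN values per column (the 418 “missing” Survived values are simply the test labels we are trying to predict):

# count missing values per column
titanic.isnull().sum()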

Even though this is technically the “Data Wrangling” section, before we address any of the missing values I first want to create a Title feature, which simply extracts the honorific from the Name feature. Simply put, an honorific is the title or rank of a given person, such as “Mrs” or “Miss”. The following code takes a value like “Braund, Mr. Owen Harris” from the Name column and extracts “Mr”:

# create a new feature to extract title names from the Name column
titanic['Title'] = titanic.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

Viewing the unique titles that were pulled, we see that there are 18 different ones, but we will want to normalize these so that we can generalize a bit more. To do this, we will create a dictionary that maps the 18 titles to 6 broader categories and then map that dictionary back onto the Title feature.

# normalize the titles
normalized_titles = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Dona": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty"
}
# map the normalized titles to the current titles
titanic.Title = titanic.Title.map(normalized_titles)
# view value counts for the normalized titles
print(titanic.Title.value_counts())

Printing the result gives us the following counts:

The reason I wanted to create the Title feature first was so that I could use it to estimate the missing ages a little more accurately. Estimating the missing Age values is our next step: to do this, we will group the dataset by Sex, Pclass (passenger class), and Title.

# group by Sex, Pclass, and Title 
grouped = titanic.groupby(['Sex','Pclass', 'Title'])
# view the median Age by the grouped features
grouped.Age.median()

Which gives us:

Instead of simply filling in the missing Age values with the mean or median age of the whole dataset, grouping the data by a passenger’s sex, class, and title lets us drill down a bit deeper and get a closer approximation of what each passenger’s age might have been. Using the grouped object, we can fill in the missing Age values.

# apply the grouped median value on the Age NaN
titanic.Age = grouped.Age.apply(lambda x: x.fillna(x.median()))
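As a side note, the same fill can be written with transform, which always returns a Series aligned to titanic’s index; this is just an equivalent alternative, shown here as a sketch:

# alternative sketch: broadcast each group's median Age to its rows, then fill the NaNs
titanic.Age = titanic.Age.fillna(grouped.Age.transform('median'))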

Next, we move on to the remaining features with missing values: Cabin, Embarked, and Fare. For these, we won’t be doing anything too fancy. We will fill Cabin with “U” for unknown, fill Embarked with the most frequent port of embarkation, and since Fare has only one missing value, we will fill it with the median fare of the dataset:

# fill Cabin NaN with U for unknown
titanic.Cabin = titanic.Cabin.fillna('U')
# find most frequent Embarked value and store in variable
most_embarked = titanic.Embarked.value_counts().index[0]

# fill NaN with most_embarked value
titanic.Embarked = titanic.Embarked.fillna(most_embarked)
# fill NaN with median fare
titanic.Fare = titanic.Fare.fillna(titanic.Fare.median())

# view changes
titanic.info()

And now we view the data with .info() again:

Everything looks good now. As expected, Survived still has missing values but since we are going to eventually be splitting the data back to train and test, we can ignore that for now.

II. Feature Engineering

We will quickly create two more features before we begin our modeling. The first is family size per passenger, since a larger family may have had a harder time securing spots on a lifeboat than an individual or a small family. We can leverage the SibSp and Parch features to determine family size, since they count each passenger’s siblings/spouses and parents/children respectively.

# size of families (including the passenger)
titanic['FamilySize'] = titanic.Parch + titanic.SibSp + 1

The last feature we will create leverages the Cabin feature and simply extracts the first letter of the cabin, which indicates the deck where the room was located. This is potentially relevant since some cabins were closer to the lifeboats, and passengers in those cabins may have had a greater chance at securing a spot.

# replace each Cabin value with its first letter (the deck)
titanic.Cabin = titanic.Cabin.map(lambda x: x[0])

If you view the head of your data, it should look like this now:

The last step before we can begin modeling is to convert all of our categorical features to numbers, as our algorithms can only take arrays of numbers as input, not names or letters. As you noticed from the previous screenshot, we have a few columns to convert. We use the pd.get_dummies() function from pandas, which converts categorical features into dummy variables.

# Convert the male and female groups to integer form
titanic.Sex = titanic.Sex.map({"male": 0, "female":1})
# create dummy variables for categorical features
pclass_dummies = pd.get_dummies(titanic.Pclass, prefix="Pclass")
title_dummies = pd.get_dummies(titanic.Title, prefix="Title")
cabin_dummies = pd.get_dummies(titanic.Cabin, prefix="Cabin")
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix="Embarked")
# concatenate dummy columns with main dataset
titanic_dummies = pd.concat([titanic, pclass_dummies, title_dummies, cabin_dummies, embarked_dummies], axis=1)

# drop categorical fields
titanic_dummies.drop(['Pclass', 'Title', 'Cabin', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)

titanic_dummies.head()

Viewing the head again we get:

Perfect! Our data is now in the format we need to perform some modeling. Let’s separate it back into train and test data frames using the train_idx and test_idx we created at the beginning of the exercise. We will also split our training data into X for the predictor variables and y for our response variable, which in this case is the Survived labels.

# create train and test data (.copy() avoids pandas' SettingWithCopyWarning below)
train = titanic_dummies[:train_idx].copy()
test = titanic_dummies[test_idx:].copy()

# convert Survived back to int
train.Survived = train.Survived.astype(int)
# create X and y for data and target values
X = train.drop('Survived', axis=1).values
y = train.Survived.values
# create array for test set
X_test = test.drop('Survived', axis=1).values
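An optional shape check at this point confirms that the split lines up with the 891 training passengers and 418 test passengers:

# sanity check: X should have 891 rows and X_test should have 418
print(X.shape, y.shape, X_test.shape)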

III. Modeling

I tested both a logistic regression model, which is a binary classifier, and a random forest classifier, which fits an ensemble of decision tree classifiers on sub-samples of the data. I used GridSearchCV to pass in a range of parameters and have it return the best score and the associated parameters.

The logistic regression model returned a best score of ~82% while the random forest model got a best score of ~84%, so the random forest is the model I ended up using for my predictions. As a result, I will only cover the random forest model in the rest of this section.
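For anyone curious about the logistic regression baseline, it can be set up with the same GridSearchCV pattern using the modules imported at the top of the post. The sketch below is illustrative only: the parameter grid over the regularization strength C is an assumption rather than the exact grid I used.

# illustrative logistic regression baseline (the C grid is an assumption)
logreg_params = dict(C=[0.01, 0.1, 1, 10, 100])
logreg_cv = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                         param_grid=logreg_params, cv=5)
logreg_cv.fit(X, y)
print("Logistic regression best score: {}".format(logreg_cv.best_score_))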

GridSearchCV needs an estimator argument, which in this case is the random forest model, and a param_grid, which is a dictionary of parameters for the estimator. To keep this post from being longer than it needs to be, I will let you look up the documentation for RandomForestClassifier to find out what each parameter does.

First, I created my dictionary of parameters with different ranges:

# create param grid object
forrest_params = dict(
    max_depth=[n for n in range(9, 14)],
    min_samples_split=[n for n in range(4, 11)],
    min_samples_leaf=[n for n in range(2, 5)],
    n_estimators=[n for n in range(10, 60, 10)],
)

Next, I instantiate the random forest classifier:

# instantiate Random Forest model
forrest = RandomForestClassifier()

Lastly, we build the GridSearchCV and fit the model:

# build and fit model 
forest_cv = GridSearchCV(estimator=forrest, param_grid=forrest_params, cv=5)
forest_cv.fit(X, y)

Once this finishes (and it will take quite a few minutes depending on your computer’s speed), you can use the best_score_ and best_estimator_ attributes to retrieve the best score and the fitted estimator, including the parameters that led to that score:

print("Best score: {}".format(forest_cv.best_score_))
print("Optimal params: {}".format(forest_cv.best_estimator_))

Now we are ready to predict and submit! Remember that we saved the test set under X_test, so we can simply do the following:

# random forest prediction on the test set
forrest_pred = forest_cv.predict(X_test)

forrest_pred returns a 418 x 1 array of predictions for the Survived values. In the very first step, I saved the PassengerId column from the original test data into its own variable named passengerId. For our final submission, all we have to do is combine passengerId with forrest_pred into a data frame and write it out to a CSV. The following code does this:

# dataframe with predictions
kaggle = pd.DataFrame({'PassengerId': passengerId, 'Survived': forrest_pred})
# save to csv
kaggle.to_csv('./Data/titanic_pred.csv', index=False)

This is what it looked like when I submitted it:

This got me a score of 0.80861, which was good enough for the top 9% at the time of submission.

Hopefully this was informative for anyone who made it all the way to this point, and I hope you can build on my submission with your own ideas to achieve an even higher score!
