My Journey to Kaggle Competitions Grandmaster Status: A Look Into Feature Engineering (Titanic Dataset)

Daniel Benson
Published in The Startup · 5 min read · Sep 30, 2020

We are back at it again, this time seeing if we can get a closer look at some of the data and possibly engineer some features that will increase our accuracy score. For this installment I began by dropping the same four columns as before, “PassengerId”, “Name”, “Ticket”, and “Embarked”, as I still found little value in them. For a refresher, my dataframe looks like so:

A look at the first five rows of my dataframe
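Since the original screenshots aren’t reproduced here, a minimal sketch of that setup might look like the following (assuming the standard Kaggle train.csv loaded into a pandas DataFrame called df):

```python
import pandas as pd

# Load the Kaggle Titanic training data (file path is an assumption)
df = pd.read_csv("train.csv")

# Drop the four columns that showed little value in the last installment
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Embarked"])

# Preview the first five rows
print(df.head())
```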

From the last exploration we determined that 75% of the values in the “Cabin” column were NaN (empty). I wanted to get a closer look at why this might be, and whether it might be important in some way not yet apparent. I created a new feature called “cabin_listed” that returned a value of 0 if the passenger did not have a listed cabin number and 1 if the passenger did.

The code I used to engineer the “cabin_listed” feature (top) and the result of that compared to the “Cabin” feature (bottom)
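The code itself appears only as an image in the original post; a rough equivalent (assuming the same df as above) could be:

```python
# 1 if a cabin number is listed, 0 if the value is NaN
df["cabin_listed"] = df["Cabin"].notna().astype(int)

# Compare the new feature against the original "Cabin" column
print(df[["Cabin", "cabin_listed"]].head(10))
```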

From here I decided that dropping the “Cabin” column was still in my best interest. As in the last installment, I filled the “Age” column’s NaN values with the mean value, as I determined that this was still the best approach.
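A sketch of those two steps, again assuming the same DataFrame:

```python
# Drop the original "Cabin" column now that "cabin_listed" captures its signal
df = df.drop(columns=["Cabin"])

# Fill missing ages with the column mean, as in the previous installment
df["Age"] = df["Age"].fillna(df["Age"].mean())
```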

With this newly engineered feature added, I wanted to look more closely at its possible correlations with the rest of the features in the dataset, especially our target feature “Survived”. I put together six graphs to visually represent those comparisons, seen below.

The code used to create the graphs (above) and the graphs themselves (below) comparing “cabin_listed” to all other features of the dataframe.
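The plotting code is shown only as a screenshot in the original; a hedged reconstruction using matplotlib subplots (which six columns were plotted is my assumption) might be:

```python
import matplotlib.pyplot as plt

# Features to compare against the engineered "cabin_listed" flag (assumed selection)
compare_cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Fare"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), compare_cols):
    # Scatter each feature against cabin_listed to eyeball any relationship
    ax.scatter(df["cabin_listed"], df[col], alpha=0.3)
    ax.set_xlabel("cabin_listed")
    ax.set_ylabel(col)

plt.tight_layout()
plt.show()
```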

These graphs led me to a few interesting observations. Comparing “cabin_listed” with our target feature “Survived” reveals a hidden correlation, suggesting that having a listed cabin is associated with higher survivability. Using the other comparisons we can begin to see a story unfolding. Among passengers whose cabin was listed, the greatest portion were 1st class passengers, whereas most 2nd and 3rd class passengers had no cabin listed. We know from the last installment that a passenger’s class was directly correlated with survivability, so this makes sense. After doing some research (https://www.ultimatetitanic.com/interior-fittings) I found that 2nd and 3rd class cabins (especially 3rd class) generally held more people to a room than 1st class, which helps explain the observation above. This can also explain the correlation we see between “cabin_listed” and “Fare”, since upper class passengers paid considerably more for their passage. Finally, we also see a correlation between “cabin_listed” and the number of family members on the ship, with more family members correlating to a higher chance of having an unlisted cabin.

I made one final feature engineering decision, which was to combine the “SibSp” and “Parch” columns into a single feature, “family_aboard”, storing the total number of family members (sibling + spouse + parent + child) traveling on the ship with each passenger. This decision was based on the fact that both original features show correlations important to our model, but neither appears more important than the other; in fact, they show very similar correlations across the board.

The code used to feature engineer “family_aboard” (above) and its resulting comparison with “SibSp” and “Parch” features
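A minimal sketch of that combination, assuming the same df:

```python
# Total family members aboard: siblings/spouses plus parents/children
df["family_aboard"] = df["SibSp"] + df["Parch"]

# Sanity-check the new feature against its two source columns
print(df[["SibSp", "Parch", "family_aboard"]].head(10))
```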

I then looked at the comparison between “family_aboard” and “Survived” using a scatter plot and a bar chart to determine any visual form of correlation.

The code used for creating the scatter plot shown (top) and a scatter plot comparing “family_aboard” with “Survived”
The code for the bar chart shown (top) and a bar chart comparing “Survived” with “Family Aboard”
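Again, the original plotting code is an image; one plausible reconstruction of both plots (exact styling is my assumption) is:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Scatter plot: family_aboard against survival outcome
ax1.scatter(df["family_aboard"], df["Survived"], alpha=0.3)
ax1.set_xlabel("Family Aboard")
ax1.set_ylabel("Survived")

# Bar chart: mean survival rate for each family size
df.groupby("family_aboard")["Survived"].mean().plot(kind="bar", ax=ax2)
ax2.set_xlabel("Family Aboard")
ax2.set_ylabel("Survival Rate")

plt.tight_layout()
plt.show()
```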

I finished up my second go at cleaning and feature engineering by dropping the “SibSp” and “Parch” features and taking a closer look at the “Fare” feature to ensure there weren’t any other correlations hiding from sight. I created six graphs to compare “Fare” with all remaining features.

Code used to create a canvas of six graphs (top) and those six graphs comparing the “Fare” column with all other remaining columns in the dataframe
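A hedged sketch of that canvas, comparing “Fare” against the six remaining features (the grid layout is my assumption):

```python
import matplotlib.pyplot as plt

# Remaining features to compare against "Fare"
compare_cols = ["Survived", "Pclass", "Sex", "Age", "cabin_listed", "family_aboard"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.ravel(), compare_cols):
    ax.scatter(df[col], df["Fare"], alpha=0.3)
    ax.set_xlabel(col)
    ax.set_ylabel("Fare")

plt.tight_layout()
plt.show()
```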

Having now determined that “Fare” showed no significant visual correlation, I dropped it from the final dataframe, leaving me with the following dataframe for use in my model:

Final features that will be used for the model and their first five rows of data
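In code, that last step might simply be:

```python
# Drop "Fare" after finding no clear visual correlation
df = df.drop(columns=["Fare"])

# Final feature set going into the model
print(df.head())
```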

Finally, I followed the same preprocessing steps as before, splitting the data into train and validation subsets (80% and 20%, respectively) and using a LabelEncoder to convert the “Sex” feature values to numeric binary values. Running the same RandomForestClassifier as before, with the same default parameters, I achieved a new accuracy score of…
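A sketch of that preprocessing and training pipeline, assuming the df built above (the random_state and variable names are my assumptions, not the original code):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode "Sex" as numeric binary values
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

# Split features and target, holding out 20% for validation
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random forest with default parameters, as in the last installment
model = RandomForestClassifier()
model.fit(X_train, y_train)

val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_acc:.3f}")
```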

Drum roll.

Very little change. In fact, we saw a minor decrease in our validation accuracy score, from 82.7% down to 79.9%, and our test accuracy decreased from 72.248% to 72.009%. This minimal change is understandable considering the relatively minor changes we made to the features and the fact that we were dealing with a significantly skewed feature in “cabin_listed”.

The baseline model’s validation accuracy score after some feature engineering (top) and the test prediction accuracy score (bottom)

Stay tuned for next time when we go over some hyper-parameter tuning in the hopes of increasing our accuracy score! As usual, brave reader, thank you for following on my journey. Rep it up and happy coding.
