My Journey to Kaggle Competitions Grandmaster Status: A Look Into Feature Engineering (Titanic Dataset)
We are back at it again, this time taking a closer look at the data to see whether we can engineer some features that will increase our accuracy score. For this installment I began by dropping the same four columns as before (“PassengerId”, “Name”, “Ticket”, and “Embarked”), as I still found little value in them. For a refresher, my dataframe looks like so:
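If you want to follow along, the drop looks something like this, a minimal sketch with a couple of toy rows standing in for the full Kaggle train.csv:

```python
import pandas as pd

# Toy rows standing in for the Kaggle train.csv (columns abbreviated)
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John"],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Embarked": ["S", "C"],
    "Survived": [0, 1],
    "Pclass": [3, 1],
})

# Drop the four columns that carry little value for this model
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Embarked"])
print(list(df.columns))  # ['Survived', 'Pclass']
```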
From the last exploration we determined that 75% of the values in the “Cabin” column were NaN (empty). I wanted to get a closer look at why this might be, and whether it might be important in some as-yet-unknown way. I created a new feature called “cabin_listed” that holds 0 if the passenger had no listed cabin number and 1 if they did.
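One simple way to build this flag, sketched here on a few toy values rather than the real column:

```python
import numpy as np
import pandas as pd

# Toy "Cabin" values; roughly 75% of the real column is NaN
df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 where a cabin number is listed, 0 where the entry is NaN
df["cabin_listed"] = df["Cabin"].notna().astype(int)
print(df["cabin_listed"].tolist())  # [1, 0, 1, 0]
```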
From here I decided that dropping the “Cabin” column itself was still in my best interest. As in the last installment, I filled the NaN values in the “Age” column with the mean value, as I determined this was still the best approach.
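Those two cleanup steps might look like this (again on toy data, assuming the flag from above has already been created):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cabin": ["C85", np.nan],
    "Age": [20.0, np.nan],
    "cabin_listed": [1, 0],
})

df = df.drop(columns=["Cabin"])                  # the mostly-empty Cabin column goes
df["Age"] = df["Age"].fillna(df["Age"].mean())   # impute missing ages with the mean
print(df["Age"].tolist())  # [20.0, 20.0]
```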
With this newly engineered feature added, I wanted to look closer at its possible correlations with the rest of the features in the dataset, especially our target feature “Survived”. I put together six graphs to visually represent those comparisons, seen below.
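One way to lay out a six-panel comparison like this is a 2×3 subplot grid. The sketch below uses synthetic rows in place of the real dataframe and picks a plot type per column (bars for categorical-ish features, boxplots for continuous ones); the actual post's charts may differ:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Synthetic rows standing in for the cleaned Titanic dataframe
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "cabin_listed": rng.integers(0, 2, n),
    "Survived": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "SibSp": rng.integers(0, 5, n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
})

# Six panels: cabin_listed against each of six other features
compare = ["Survived", "Pclass", "Sex", "SibSp", "Age", "Fare"]
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.flatten(), compare):
    if df[col].dtype == object or df[col].nunique() <= 5:
        # categorical-ish: counts per category, split by cabin_listed
        pd.crosstab(df[col], df["cabin_listed"]).plot(kind="bar", ax=ax)
    else:
        # continuous: distribution per cabin_listed value
        df.boxplot(column=col, by="cabin_listed", ax=ax)
    ax.set_title(f"{col} vs cabin_listed")
fig.tight_layout()
```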
These graphs led me to a few interesting observations. Comparing “cabin_listed” with our target feature “Survived” reveals a hidden correlation: passengers whose cabin was listed appear to have had a higher chance of survival. The other comparisons begin to tell a story. Among passengers with a listed cabin, the greatest portion were 1st class, whereas most 2nd and 3rd class passengers had no cabin listed. We know from the last installment that a passenger’s class was directly correlated with survivability, so this makes sense. After doing some research (https://www.ultimatetitanic.com/interior-fittings) I found that 2nd and 3rd class cabins (especially 3rd class) generally held more people per room than 1st class, which helps explain the observation above. The same reasoning explains the correlation between “cabin_listed” and “Fare”, since upper-class passengers paid substantially more for their passage. Finally, we also see a correlation between “cabin_listed” and the number of family members on the ship, with more family members correlating to a higher chance of an unlisted cabin.
I made one final feature engineering decision: combining the “SibSp” and “Parch” columns into a single feature, “family_aboard”, storing the total number of family members (siblings + spouse + parents + children) traveling on the ship with each passenger. I made this decision because both original features show correlations important to our model, but neither appears more important than the other; in fact, they show very similar correlations across the board.
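The combination itself is just an element-wise sum, sketched here on toy values:

```python
import pandas as pd

# Toy counts: SibSp = siblings/spouses aboard, Parch = parents/children aboard
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})

# Total family members aboard for each passenger
df["family_aboard"] = df["SibSp"] + df["Parch"]
print(df["family_aboard"].tolist())  # [1, 2, 4]
```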
I then compared “family_aboard” and “Survived” using a scatter plot and a bar chart to look for any visual form of correlation.
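A side-by-side scatter and bar chart for that comparison could be sketched like so, with the bar chart showing survival rate per family size (synthetic rows again; the real charts may be laid out differently):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Toy rows standing in for the engineered dataframe
df = pd.DataFrame({
    "family_aboard": [0, 0, 1, 1, 2, 4, 6],
    "Survived":      [0, 1, 1, 1, 0, 0, 0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Raw scatter of survival outcome against family size
ax1.scatter(df["family_aboard"], df["Survived"], alpha=0.3)
ax1.set_xlabel("family_aboard")
ax1.set_ylabel("Survived")

# Mean survival rate for each family size
df.groupby("family_aboard")["Survived"].mean().plot(kind="bar", ax=ax2)
ax2.set_ylabel("survival rate")
fig.tight_layout()
```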
I finished up my second go at cleaning and feature engineering by dropping the “SibSp” and “Parch” features and taking a closer look at the “Fare” feature to ensure no other correlations were hiding from sight. I created six graphs comparing “Fare” with all remaining features.
Having now determined that “Fare” showed no significant visual correlation, I dropped it from the final dataframe, giving me the following dataframe for use in my model:
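Putting the final cleanup together, the last round of drops might look like this (toy rows, with the family_aboard step from earlier repeated so the snippet stands alone):

```python
import pandas as pd

# Toy rows with the columns remaining at this point
df = pd.DataFrame({
    "Survived": [0, 1], "Pclass": [3, 1], "Sex": ["male", "female"],
    "Age": [22.0, 38.0], "SibSp": [1, 1], "Parch": [0, 0],
    "Fare": [7.25, 71.28], "cabin_listed": [0, 1],
})
df["family_aboard"] = df["SibSp"] + df["Parch"]

# The combined feature replaces SibSp/Parch, and Fare goes too
df = df.drop(columns=["SibSp", "Parch", "Fare"])
print(sorted(df.columns))
```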
Finally, I followed the same preprocessing steps as before, splitting the data into train and validation subsets (80% and 20%, respectively) and using a LabelEncoder to convert the “Sex” feature values to numeric binary values. Running the same RandomForestClassifier as before, with the same default parameters, I achieved a new accuracy score of…
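For reference, the whole preprocessing-and-model pipeline looks roughly like this. This is a sketch on ten made-up rows, so the accuracy it prints is meaningless; the real run uses the full cleaned train.csv:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the cleaned dataframe
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1, 2, 3, 2, 3, 1],
    "Sex": ["male", "female", "female", "male", "female",
            "male", "male", "female", "male", "female"],
    "Age": [22, 38, 26, 35, 35, 27, 54, 14, 20, 58],
    "cabin_listed":  [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    "family_aboard": [1, 1, 0, 0, 2, 0, 0, 1, 0, 0],
})

# Encode Sex as numeric binary values (alphabetical: female -> 0, male -> 1)
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

# 80/20 train/validation split
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Random forest with default parameters
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {acc:.3f}")
```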
…
Drum roll.
…
Very little change. In fact, we saw a minor decrease in our validation accuracy score, from 82.7% down to 79.9%, and our test accuracy fell from 72.248% to 72.009%. This minimal change is actually understandable considering the relatively minor changes we made to the features, and the fact that “cabin_listed” is a significantly skewed feature.
Stay tuned for next time when we go over some hyper-parameter tuning in the hopes of increasing our accuracy score! As usual, brave reader, thank you for following on my journey. Rep it up and happy coding.