Published in Analytics Vidhya
ML24: Top 4% in Titanic on Kaggle

Adding interactions into Random Forest

Check the repository on GitHub for complete details.

  • Top 4% (833/22219) in Titanic: Machine Learning from Disaster, an iconic entry-level competition on Kaggle, in 2020/05. This project was conducted in R.
  • This was originally an assignment for the graduate-level course “Data Science” in the Department of Computer Science at NCCU, in which I earned a 96 (A+).
Figure 1 & 2: Top 4% ranking on public leaderboard of Titanic on Kaggle.

I got 0.89 accuracy on the test data using Random Forest with 10-fold cross-validation under a 3-way split.
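The 3-way split and 10-fold setup mentioned above can be sketched in base R as follows; the split proportions and seed are assumptions for illustration, not the exact setup used in the project.

```r
set.seed(188)
n   <- 891          # rows in the Titanic training file
idx <- sample(n)    # shuffle row indices

# 3-way split: 60% train / 20% validation / 20% test (assumed proportions)
train_idx <- idx[1:floor(0.6 * n)]
valid_idx <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
test_idx  <- idx[(floor(0.8 * n) + 1):n]

# 10-fold assignment over the (already shuffled) training portion
folds <- cut(seq_along(train_idx), breaks = 10, labels = FALSE)
table(folds)   # roughly equal fold sizes
```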

Figure 3: Scores on the whole dataset given by Kaggle.

(1) Introduction to Features
(2) Missing Value Imputation
(3) Feature Engineering
(4) Feature Extraction
(5) Model Selection

(1) Introduction to Features

Figure 4: Details of the features.

This snapshot was taken in 2021/03; the feature “Name” had already been removed by then. Among these 10 variables, “survival” is clearly the target and the remaining 9 are features. Counting “Name”, I therefore had 10 features at the time I did this project (2020/05).

(2) Missing Value Imputation

Imputed <- mice(Raw,
                m = 1,          # number of imputed datasets
                maxit = 50,     # max iterations
                method = "rf",  # Random Forest imputation
                seed = 188)

Missing value imputation is a prominent preprocessing step, yet it is often set aside. The reader may check ML23: Handling Missing Values for why addressing missing values properly can be very helpful.

Of course we can impute missing values with the mean, median, or mode; however, advanced ML-based imputation methods may yield better outcomes. So I chose Random Forest imputation via mice() in R.
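For contrast with the model-based approach, here is the simple median-fill baseline in base R on a toy column (values are illustrative, not from the dataset):

```r
# Toy fare column with missing values
fare <- c(7.25, 71.28, NA, 8.05, NA, 13.00)

# Baseline: replace each NA with the median of the observed values
fare_imputed <- ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)
fare_imputed
```

Model-based methods like mice() with method = "rf" condition each imputed value on the other features, whereas this baseline ignores them entirely.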

(3) Feature Engineering

The reader may check the repository on GitHub for complete details. By inspecting the cross table of the target “survival” against each feature in turn, I was able to see how to split the non-numeric data into categories that distinguish “survival”.
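The cross-table inspection above is a one-liner in base R; the toy values below are illustrative, not the actual data:

```r
# Toy slice of a Titanic-like training set
survived <- factor(c(0, 1, 1, 0, 0, 1, 0, 1))
sex      <- factor(c("male", "female", "female", "male",
                     "male", "female", "male", "female"))

# Cross table of the target against one feature
table(survived, sex)
```

Features whose cross tables show strongly skewed cells are the ones worth splitting into categories.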

As for numeric data, I tried log transformation and converting numeric data into categorical data, but neither worked.
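The two numeric transformations mentioned can be sketched as follows; the break points for the binning are assumptions, not the ones tried in the project:

```r
fare <- c(7.25, 71.28, 7.93, 53.10, 8.05, 26.00)

# Log transformation (log1p handles zero fares safely)
log_fare <- log1p(fare)

# Converting numeric to categorical via binning at assumed break points
fare_band <- cut(fare, breaks = c(0, 10, 30, Inf),
                 labels = c("low", "mid", "high"))
fare_band
```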

(4) Feature Extraction

Leveraging stepwise linear regression with higher-degree terms and interactions (using stepwise()), I was able to select a few influential features.
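A sketch of this selection idea using base R's step() with glm() on toy data (the post's stepwise() is presumably from an add-on package; step() from the stats package is substituted here, and all variable names and the simulated target are assumptions):

```r
set.seed(1)
n <- 200
d <- data.frame(
  age  = rnorm(n, 30, 10),
  fare = rexp(n, 1 / 30),
  sex  = factor(sample(c("male", "female"), n, replace = TRUE))
)
# Toy target containing a genuine sex-by-fare interaction
p <- plogis(-1 + 1.5 * (d$sex == "female") +
            0.02 * d$fare * (d$sex == "female"))
d$survived <- rbinom(n, 1, p)

# Upper scope: all pairwise interactions plus a higher-degree term
full <- glm(survived ~ (age + fare + sex)^2 + I(age^2),
            data = d, family = binomial)

# Stepwise search from the intercept-only model up to the full scope
sel <- step(glm(survived ~ 1, data = d, family = binomial),
            scope = formula(full), direction = "both", trace = 0)
formula(sel)   # selected terms, possibly including interactions
```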

I figure this part is probably why I reached the top 4% with just one model, Random Forest, without resorting to stacking. Adding the interactions into Random Forest, a rather unusual move, may have driven the result.
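To see what an interaction term like Family_size:Sex_Survival (from the model formulas later in the post) actually feeds a model, one can expand it with model.matrix(); the toy values and factor levels below are assumptions:

```r
# Toy frame using two feature names from the post's model formulas
d <- data.frame(
  Family_size  = c(1, 3, 2, 5),
  Sex_Survival = factor(c("male_low", "female_high",
                          "female_high", "male_low"))
)

# In a formula, Family_size:Sex_Survival expands into one
# numeric-times-dummy column per factor level
X <- model.matrix(~ Family_size:Sex_Survival - 1, data = d)
colnames(X)
```

Each expanded column lets the forest split on the family size of one sex/survival group independently of the others.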

(5) Model Selection

Then I fed those influential features into a range of models, trying combinations of the features in each: Naive Bayes, Linear Regression, SVM, Random Forest, XGBoost, and Neural Network. Ultimately, Random Forest yielded the best results.

Here are a couple of the best models I came up with. Note that even without stacking I reached a satisfactory top 4% (833/22219) ranking.

fold1_rf   = randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare + Embarked, data = Titanic_train, ntree = 1000, importance = F)
fold1_rf01 = randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare + Embarked, data = Titanic_train, ntree = 1000, importance = F)
fold1_rf02 = randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare:Age + Embarked + Ticket_02, data = Titanic_train, ntree = 1000, importance = F)



Yu-Cheng (Morton) Kuo

ML/DS using Python & R. A Taiwanese who earned an MBA from NCCU and a BS from NTHU with a MATH major & ECON minor. Email: