ML24: Top 4% in Titanic on Kaggle
Adding interactions into Random Forest
Check the repository on GitHub for complete details.
- Top 4% (833/22219) in Titanic: Machine Learning from Disaster, an iconic entry-level competition on Kaggle, in 2020/05. The project was conducted in R.
- This was originally an assignment for the graduate-level course “Data Science” in the Department of Computer Science at NCCU, in which I received a 96 (A+).


I obtained 0.89 accuracy on the test data using Random Forest with 10-fold cross-validation under a 3-way (train/validation/test) split.
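A rough sketch of that evaluation setup (not the project’s exact code): the snippet below makes a 3-way split and runs a plain 10-fold cross-validation. The object Titanic_all, the 70/15/15 ratio, and the assumption that the data are already preprocessed (no missing values, Survived stored as a factor) are all illustrative.

library(randomForest)

set.seed(188)

# Hypothetical 3-way split; the 70/15/15 ratio is an assumption
idx <- sample(c("train", "valid", "test"), nrow(Titanic_all),
              replace = TRUE, prob = c(0.70, 0.15, 0.15))
Titanic_train   <- Titanic_all[idx == "train", ]
Titanic_valid   <- Titanic_all[idx == "valid", ]
Titanic_holdout <- Titanic_all[idx == "test",  ]

# 10-fold cross-validation on the training part
folds  <- sample(rep(1:10, length.out = nrow(Titanic_train)))
cv_acc <- sapply(1:10, function(k) {
  fit  <- randomForest(Survived ~ ., data = Titanic_train[folds != k, ], ntree = 1000)
  pred <- predict(fit, newdata = Titanic_train[folds == k, ])
  mean(pred == Titanic_train$Survived[folds == k])
})
mean(cv_acc)  # average fold accuracy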

Outline
(1) Introduction to Features
(2) Missing Value Imputation
(3) Feature Engineering
(4) Feature Extraction
(5) Model Selection
(1) Introduction to Features

This snapshot of the data description was taken in 2021/03, by which time the feature “Name” had already been removed. Among the 10 variables shown, “survival” is clearly the target and the remaining 9 are features; since “Name” was still available in 2020/05 when I did this project, I had 10 features at the time.
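Since the snapshot itself is an image, a quick way to list the same variables in R is shown below; the file name train.csv follows the standard Kaggle layout, and the object name Raw matches the imputation code in the next section.

# Load the Kaggle training file and inspect its variables
Raw <- read.csv("train.csv", stringsAsFactors = TRUE)

str(Raw)      # type of every column
summary(Raw)  # ranges, factor levels and missing-value counts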
(2) Missing Value Imputation
library(mice)  # multivariate imputation by chained equations

mice.data <- mice(Raw,
                  m      = 1,       # number of imputed data sets
                  maxit  = 50,      # max iterations
                  method = "rf",    # random-forest imputation
                  seed   = 188,
                  print  = FALSE)

Raw <- complete(mice.data, 1)  # assumed follow-up: extract the completed data set
Although missing value imputation is a prominent preprocessing step, it is often put aside. The reader may check ML23: Handling Missing Values for why addressing missing values properly can be very helpful.
Of course we can impute missing values with the mean, median or mode; however, advanced ML-based imputation methods may yield better outcomes, so I chose Random Forest imputation via mice() in R.
(3) Feature Engineering
The reader may check the repository on GitHub for complete details. By inspecting the cross table of the target “survival” against each feature, I was able to see how to split the non-numeric data into categories that separate “survival” well.
As for the numeric data, I tried log transformation and binning it into categories, but neither worked.
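As a hedged illustration of that cross-table inspection (not the project’s actual code), the snippet below tabulates the target against one non-numeric feature and collapses it into coarser categories; the choice of Embarked and the grouping are examples only.

# Cross table of the target against a categorical feature
table(Raw$Survived, Raw$Embarked)

# Column-wise proportions make survival differences between categories easier to see
prop.table(table(Raw$Survived, Raw$Embarked), margin = 2)

# Illustrative grouping into categories the target separates well
Raw$Embarked_02 <- ifelse(Raw$Embarked == "C", "C", "Q_or_S")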
(4) Feature Extraction
Leveraging stepwise linear regression with higher-degree terms and interactions (via stepwise( )), I was able to select a few influential features.
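A minimal sketch of that selection step, using base R’s step() rather than whichever package supplies stepwise(); the squared term and the candidate interactions below reuse feature names from this post and are assumptions about the search space, not the exact one used.

# Stepwise linear regression over main effects, a squared term and pairwise interactions
lm_base <- lm(as.numeric(as.character(Survived)) ~ 1, data = Titanic_train)

lm_step <- step(lm_base,
                scope     = ~ Title + Family_size + Sex_Survival + Fare + Age + Embarked +
                              I(Fare^2) + Family_size:Sex_Survival + Fare:Age,
                direction = "both", trace = FALSE)

summary(lm_step)  # the surviving terms, especially interactions, become candidate features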
I figure this part is probably the reason I reached the top 4% using just one model, Random Forest, without leveraging stacking. Feeding the selected interactions into the Random Forest, a quite creative move, might have led to the success.
(5) Model Selection
Then I fed those influential features into the models and tried combinations of the features in every model. The models I tried ranged from Naive Bayes, Linear Regression and SVM to Random Forest, XGBoost and a Neural Network. Ultimately, I found that Random Forest yielded the best results.
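One hedged way to organize such a comparison is to loop over candidate feature sets and compare out-of-bag error; the two formulas below simply reuse ones quoted at the end of this post, and Survived is assumed to be a factor (classification).

library(randomForest)

candidate_formulas <- list(
  Survived ~ Title + Family_size:Sex_Survival + Fare + Embarked,
  Survived ~ Title + Family_size:Sex_Survival + Fare:Age + Embarked + Ticket_02
)

# Out-of-bag error of the final forest for each candidate feature set
oob_err <- sapply(candidate_formulas, function(f) {
  fit <- randomForest(f, data = Titanic_train, ntree = 1000)
  fit$err.rate[fit$ntree, "OOB"]
})
oob_err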
Here are a couple of the best models I came across. Note that I did not even adopt stacking, yet still reached a satisfactory top 4% (833/22219) ranking.
fold1_rf   <- randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare + Embarked,
                           data = Titanic_train, ntree = 1000, importance = FALSE)
fold1_rf01 <- randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare + Embarked,
                           data = Titanic_train, ntree = 1000, importance = FALSE)
fold1_rf02 <- randomForest(Survived ~ Title + Family_size:Sex_Survival + Fare:Age + Embarked + Ticket_02,
                           data = Titanic_train, ntree = 1000, importance = FALSE)
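For completeness, a short sketch of turning one of these models into a submission file; Titanic_test (the preprocessed Kaggle test set, carrying the same engineered features as Titanic_train) and its PassengerId column are assumptions following the usual Kaggle format.

# Predict on the preprocessed Kaggle test set and write the submission file
pred <- predict(fold1_rf02, newdata = Titanic_test)

submission <- data.frame(PassengerId = Titanic_test$PassengerId,
                         Survived    = pred)
write.csv(submission, "submission.csv", row.names = FALSE)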