Kaggle Titanic Competition

My first attempt at a Machine Learning competition.

I’ve cleaned the data and performed an initial analysis. I then experimented with creating new features, evaluating each candidate with scikit-learn’s SelectKBest, which scores features according to their relevance to the target variable (in this case, survival).
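As a minimal sketch of that scoring step (the score function, file path and encoding here are my simplifications, not necessarily exactly what I ran):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Load the cleaned training data (path and preprocessing simplified here).
train = pd.read_csv("train.csv")
train["Sex"] = (train["Sex"] == "female").astype(int)  # encode Sex as 0/1

X = train[["Fare", "Pclass", "Sex"]].fillna(0)
y = train["Survived"]

# Score every feature against survival; higher scores mean stronger relevance.
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: {score:.2f}")
```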

I ended up with 3 original features (Fare, Pclass, Sex) and 3 new features:

  • fem_1st_fareover29_under65 — true if female, in 1st class, paying a fare over 29, and aged under 65
  • fare_age_combo — true if aged under 65 and paying a fare over 29
  • female1st2nd — true if female and in 1st or 2nd class
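Concretely, these can be built as boolean columns along these lines (the thresholds come from the descriptions above; the exact construction is a sketch):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # cleaned data assumed

# Female, in 1st class, fare over 29, aged under 65
df["fem_1st_fareover29_under65"] = (
    (df["Sex"] == "female")
    & (df["Pclass"] == 1)
    & (df["Fare"] > 29)
    & (df["Age"] < 65)
).astype(int)

# Aged under 65 and paying a fare over 29
df["fare_age_combo"] = ((df["Age"] < 65) & (df["Fare"] > 29)).astype(int)

# Female and travelling in 1st or 2nd class
df["female1st2nd"] = (
    (df["Sex"] == "female") & df["Pclass"].isin([1, 2])
).astype(int)
```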

I also tried various combinations of Parch and SibSp to create small and large family-group features, but their SelectKBest scores were not significant.
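For reference, those family-group features looked roughly like this (the feature names and size thresholds are illustrative):

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Family size = siblings/spouses + parents/children + the passenger themselves
df["family_size"] = df["SibSp"] + df["Parch"] + 1
df["small_family"] = df["family_size"].between(2, 4).astype(int)
df["large_family"] = (df["family_size"] >= 5).astype(int)
```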

I tried these out with several scikit-learn algorithms (GaussianNB, LinearSVC, DecisionTree, LogisticRegression and RandomForest), using a combination of GridSearchCV and intuition to tune the parameters.
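Tuning the Random Forest with GridSearchCV looked something like the following (the grid values are illustrative, not the exact search space I used):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("train.csv")  # cleaned data assumed
df["Sex"] = (df["Sex"] == "female").astype(int)
X = df[["Fare", "Pclass", "Sex"]].fillna(0)
y = df["Survived"]

# Illustrative grid; intuition narrowed the ranges before searching.
param_grid = {
    "n_estimators": [100, 500, 1000],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6, 8],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```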

My final algorithm was a Random Forest with n_estimators=1000, criterion='entropy', warm_start=True, bootstrap=True, and min_samples_leaf=6. The StratifiedShuffleSplit test results were:

Precision: 85% (of the passengers flagged as survivors, 85% actually survived)
Recall: 71% (of the passengers who actually survived, 71% were correctly identified)
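The evaluation loop was roughly the following (the split count, test size and reduced feature set here are simplifications; the model parameters match the ones above):

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.read_csv("train.csv")  # cleaned data; engineered features added the same way
df["Sex"] = (df["Sex"] == "female").astype(int)
X = df[["Fare", "Pclass", "Sex"]].fillna(0).to_numpy()
y = df["Survived"].to_numpy()

clf = RandomForestClassifier(
    n_estimators=1000,
    criterion="entropy",
    warm_start=True,
    bootstrap=True,
    min_samples_leaf=6,
)

# Average precision and recall over stratified random train/test splits.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
precisions, recalls = [], []
for train_idx, test_idx in splitter.split(X, y):
    model = clone(clf)  # fresh copy so warm_start never reuses trees across splits
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], preds))
    recalls.append(recall_score(y[test_idx], preds))

print(f"Precision: {np.mean(precisions):.0%}  Recall: {np.mean(recalls):.0%}")
```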

This scored 0.78469 on the Titanic leaderboard, placing 2063rd, tied with 400+ others and inside the top 30%. Not bad for a second-ever machine learning project.