Machine Learning and Data Analysis with Python, Titanic Dataset: Part 4

Make Improvements and Resubmit

Quinn Wang
Analytics Vidhya
6 min read · Mar 3, 2020


We made our first submission to Kaggle in Part 3. In this final part of the series we are going to try to improve our predictions and resubmit to see the results. A link to a video version of this tutorial is at the bottom.

Let’s get started!

When it comes to improving our predictions, there are generally two things we want to consider:

  • Further feature engineering
  • Model tuning

But before we go any further, let’s look back at our submission score with the baseline model:

Baseline model score

You may have noticed a discrepancy here. Why did the test set from our 80–20 split score 82.68% accuracy, while predictions on the Kaggle test set only reached 76%?

One of the reasons is that there is some randomness in how these models are trained and evaluated. The 80% of the data we trained on might just happen to look more similar to the 20% we held out as a simulated test set than to the test set Kaggle provided, and as a result the model trained on that 80% performs better on our simulated test set. Especially when the training set is small, it is very likely that predictions will score slightly better on one randomly selected test set than on another.

Let me show you what I mean by running some tests on a different random selection of the 20% test set.

We were using random state 42 in our train_test_split call:
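Something along these lines (a sketch; X and y as the feature matrix and Survived labels are names assumed from the earlier parts):

    from sklearn.model_selection import train_test_split

    # 80-20 split; random_state=42 fixes which rows end up in the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )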

This number essentially determines which 20% of the data gets picked, so every time we run this line we get the same split. It is called a random seed, and it is useful because it gives us some control over the randomness, which lets us make fair comparisons between different trials.

Let’s see what happens when we change this random state to 0:

Results with random state 0 at train_test_split

Compare this with the result when random state is 42:

Results with random state 42 at train_test_split

Our accuracy score drops by over 2%.

Therefore, before we try to create more features or tune the model, we should compare accuracies between any two setups using some sort of average, to minimize the effect of this randomness.

To get the mean score:

Import the numpy library and use .arange to create a list of 50 numbers to serve as random states. Use each of these 50 numbers as the random state, and take the average of the resulting accuracy scores. This averages out to 78.8% accuracy.
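Here is a minimal sketch of that loop, wrapped in a helper function (the name average_accuracy is mine, not from the original notebook), assuming X and y hold the current features and the Survived labels:

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def average_accuracy(X, y, **model_params):
        """Average test accuracy over 50 different random train/test splits."""
        scores = []
        for seed in np.arange(50):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=seed
            )
            model = ExtraTreesClassifier(n_estimators=100, **model_params)
            model.fit(X_train, y_train)
            scores.append(accuracy_score(y_test, model.predict(X_test)))
        return np.mean(scores)

    print(average_accuracy(X, y))  # roughly 0.788 with the baseline features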

Now we can go back to the dataframe and see what additional features we can use to improve this performance. In the baseline model we dropped the columns Name, Ticket, and Cabin, so maybe there is useful information we can extract from these three columns and feed back into the model.

We know there are a lot of missing values in the Cabin column, so it’s probably not worth trying to fill all of them like we did with the Age column. However, there might be a reason why a certain passenger is missing a Cabin entry, and that reason could be correlated with whether they survived. We can check whether this is plausible by printing the survival rate of passengers with missing Cabin entries and of those without:

Survival rate comparisons between passengers with missing vs. non-missing Cabin entries
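A sketch of that check, assuming the training data is in a dataframe called df that still has the raw Cabin column:

    # Compare survival rates for passengers with and without a Cabin entry
    cabin_missing = df['Cabin'].isnull()
    print(df.loc[cabin_missing, 'Survived'].mean())   # survival rate, Cabin missing
    print(df.loc[~cabin_missing, 'Survived'].mean())  # survival rate, Cabin present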

The survival rates are indeed very different, so this looks like a feature we want. To use it as a feature:
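One way to encode it (the column name CabinMissing is an assumption, not from the original code):

    # Boolean flag: True when the Cabin entry is missing
    df['CabinMissing'] = df['Cabin'].isnull()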

Note that boolean features are also accepted as model input.

Then run the loop again. This time our average score comes out to 78.7%, which suggests that adding this feature does not improve the model.

Let’s try adding something else.

In the Kaggle discussion section, people often share code that was useful in improving their models. You can find inspiration in these discussions and add features to your own model. For example, I found a piece of code that handles the Name column here:

Essentially, every entry in the Name column contains a title. Some common titles are Mr., Mrs., and Miss., and these titles are always followed by a “.” character. To extract the title:
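A sketch of the extraction along the lines of the shared kernel code, using a pandas string method (the exact regex is an assumption):

    # Pull out the word that precedes the '.' character, e.g.
    # 'Braund, Mr. Owen Harris' -> 'Mr'
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    print(df['Title'].value_counts())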

And our Title column will look like:

Title column and its distribution

We can see some common titles such as Mr., Miss., and Mrs., and also some uncommon titles such as Major., Sir., and Capt. These less common titles resemble high-cardinality features in the sense that training on them will not provide generalizable information. This is why, in the discussion kernel, the author mapped all the uncommon titles to one shared value, and the common ones to their own unique values:

Mapping Title to integers
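A sketch of that mapping; the exact grouping of rare titles and the integer codes follow common Kaggle kernels and should be treated as assumptions:

    # Collapse the uncommon titles into one shared 'Rare' bucket,
    # normalize a few variants, then map each title to an integer code
    rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                   'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
    df['Title'] = df['Title'].replace(rare_titles, 'Rare')
    df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
    df['Title'] = df['Title'].map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5})
    df['Title'] = df['Title'].fillna(0).astype(int)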

If we run the loop again with the Title feature, we get an accuracy of 79.16%, which is slightly better than our baseline model.

We have now experimented with two additional features. One of them doesn’t improve our score at all, and the other improves it by less than 0.5%. Don’t get discouraged here. This can indicate that the model setup is not well suited to the problem, which takes us to the second way of improving performance: adjusting the model. We have been using the default hyperparameters for our ExtraTreesClassifier, where a parameter called max_depth is set to None. This parameter limits the maximum depth each tree in the forest can grow to, and the default setting None lets every tree grow as deep as possible, which can make the model prone to overfitting.

Let’s try giving a maximum depth limit of 10:

Set max_depth = 10
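With the hypothetical average_accuracy helper sketched earlier, this is just one extra keyword argument:

    # Same averaging loop, but each tree is now limited to depth 10
    print(average_accuracy(X, y, max_depth=10))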

As expected, there is a more significant increase in the average accuracy score.

Knowing the effect of limiting max_depth, let’s go back to our baseline model and see what the average accuracy score would be with the baseline features and max_depth=10. If you run the baseline setup again with this new model configuration, you should get an average score of 81.7%, which is even higher.
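Using the same helper on the baseline feature set (X_baseline is an assumed name for the features without CabinMissing and Title):

    # Baseline features with the depth limit; average reported above is about 81.7%
    print(average_accuracy(X_baseline, y, max_depth=10))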

Unfortunately, the conclusion at this point is that our baseline features paired with the parameters {n_estimators=100, max_depth=10} give the best performance. Does this mean the two additional features we explored were completely useless? It is possible that adding these two features doesn’t help, but in practice we would usually do a lot more optimization before concluding that features are not useful (that would require another article to explain). A set of features only reaches its optimal performance when the model’s hyperparameters are also well suited to those features. We have already seen an example of this: the baseline features gave a higher average accuracy score when we set max_depth to 10 instead of the default None. I will write an article about common approaches to hyperparameter tuning I’ve used, with the advantages and caveats of each. Stay tuned if you are interested!

Because we didn’t end up using the additional features, we don’t have to do any more preprocessing on the Kaggle test set. Simply predict with the new model (max_depth=10), save the predictions to submission.csv, and resubmit. You should see your ranking go up from this one change!
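A sketch of that final step, assuming the preprocessed Kaggle test features are in X_kaggle and the raw test file is loaded as test_df with its PassengerId column (names carried over from Part 3 as assumptions); here the model is refit on all of the training data, which is one reasonable choice:

    import pandas as pd
    from sklearn.ensemble import ExtraTreesClassifier

    # Refit on the full training data with the tuned depth limit
    model = ExtraTreesClassifier(n_estimators=100, max_depth=10)
    model.fit(X, y)

    # Predict on the Kaggle test set and write the submission file
    submission = pd.DataFrame({
        'PassengerId': test_df['PassengerId'],
        'Survived': model.predict(X_kaggle)
    })
    submission.to_csv('submission.csv', index=False)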

And now, the video promised at the beginning…

Video tutorial of this article
