Pump it up — How to build a high-ranking model
This is the final article in a series of four describing my workflow in the DrivenData Pump it Up competition. Click here for the first article on EDA, here if you want to learn more about dealing with missing data, or here if you are curious how I approached feature selection and feature engineering. In this final article, we will cover modelling and model evaluation.
Creating baseline models
After spending a good amount of time on EDA, data cleaning, feature engineering and feature selection, the modelling part of the project is actually pretty easy. I like to start by running some cross-validated baseline models to see what models give the best performance.
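In code, this step does not have to be more than a simple loop of cross-validated scores. The sketch below shows the idea (a minimal version, not my exact notebook; X and y stand for the cleaned feature matrix and the status labels from the previous articles):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Candidate baselines with (mostly) default settings
baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "extra_trees": ExtraTreesClassifier(random_state=42),
}

for name, model in baselines.items():
    # 5-fold cross-validated accuracy, the metric used by the competition
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```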
Hyperparameter tuning
The next step is to tune the top-ranking models a bit further using Grid Search. Finding the right set of hyperparameters is always a bit of a puzzle. If you have (a lot of) time, you can simply run your Grid Search on a wide range of hyperparameters.
I usually start with some plotting or a Randomized Search to get a feel for the optimal ranges. As you can see, there is not much point in using more than 200 trees in a random forest for this dataset.
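A quick way to create such a plot is scikit-learn's validation_curve, which computes cross-validated scores over a range of values for a single hyperparameter. A rough sketch, again assuming the X and y from the baseline step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Cross-validated accuracy for a range of tree counts; the curve flattens
# out quickly, which is how you spot that more than ~200 trees adds little
n_trees = [50, 100, 200, 400, 800]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="n_estimators", param_range=n_trees,
    cv=5, scoring="accuracy", n_jobs=-1,
)
for n, score in zip(n_trees, val_scores.mean(axis=1)):
    print(f"{n} trees: {score:.3f}")
```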
I’m sure you, like me, have searched online for the best hyperparameters to tune for each model. I don’t think there is a single best solution here, but I can show you which hyperparameters and what ranges I used in my final Grid Search and the resulting score on the public leaderboard.
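For reference, a Grid Search over a Random Forest looks roughly like the snippet below. Note that the grid shown here is only illustrative; it is not the exact set of ranges from my final notebooks.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; narrow the ranges based on your Randomized Search
param_grid = {
    "n_estimators": [200],
    "max_depth": [15, 20, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", 0.3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```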
Model ensembles
After the hyperparameter tuning step, I combined the best-performing models in an ensemble. For the ensemble I used a regular voting classifier, a weighted voting classifier and a stacking classifier. The weighted voting classifier resulted in a top 4% score (rank 504).
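Putting such an ensemble together with scikit-learn is straightforward. In the sketch below, rf, xgb and cat stand in for the tuned models from the previous step, and the voting weights are purely illustrative:

```python
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# rf, xgb and cat are the tuned models from the previous step
estimators = [("rf", rf), ("xgb", xgb), ("cat", cat)]

# Plain soft-voting ensemble: average the predicted class probabilities
voting = VotingClassifier(estimators, voting="soft")

# Weighted variant: give the stronger models a larger say (weights are illustrative)
weighted = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1])

# Stacking: a meta-model learns how to combine the base model predictions
stacking = StackingClassifier(
    estimators, final_estimator=LogisticRegression(max_iter=1000), cv=5
)
```

Each of these can then be fitted, cross-validated and submitted like any other scikit-learn estimator.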
Have you tried stacking specialized models yet?
Before I move on to model evaluation, I want to share one final modelling approach you might want to explore a bit further. I’m sure you have noticed from your classification reports that most models are better at identifying the functional and non-functional classes than the rarer repair class.
What if you could train three models, each specializing in one of the three classes? Well, using the stacking classifier algorithm, you can!
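One way to wire this up by hand is to train three binary "specialists" (one class versus the rest), generate out-of-fold probabilities for each, and let a meta-model combine them, which is essentially what a stacking classifier does under the hood. The sketch below assumes the three status labels used by the competition and is meant as an illustration rather than my exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

classes = ["functional", "functional needs repair", "non functional"]

# One binary specialist per class: out-of-fold probabilities produced with
# cross_val_predict so the meta-model is not trained on leaked predictions
meta_features = []
for cls in classes:
    y_binary = (y == cls).astype(int)
    specialist = RandomForestClassifier(random_state=42)
    proba = cross_val_predict(
        specialist, X, y_binary, cv=5, method="predict_proba"
    )[:, 1]
    meta_features.append(proba)

# The stacking layer: a meta-model turns the three specialist scores
# into a final three-class prediction
meta_X = np.column_stack(meta_features)
meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y)
```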
Interpreting the results
This competition scores model performance using the classification rate (accuracy). From the classification report and confusion matrix, it is clear that my model performs better on the functional and non-functional classes than on the rarer repair class.
A large percentage of the water points that need repairing are misclassified as being functional or non-functional, resulting in a low recall score for this class.
The precision score for the repair class is quite a bit better. Relatively few pumps were mislabeled as needing repairs whilst actually being functional or broken.
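These numbers come straight from scikit-learn's standard evaluation tools, for example (assuming a held-out validation set with true labels y_val and model predictions y_pred):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall and f1, plus overall accuracy
print(classification_report(y_val, y_pred))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_val, y_pred))
```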
Despite numerous attempts to increase the frequency of the repair class using different oversampling and undersampling techniques, I was not able to improve the performance on this rare class.
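If you want to experiment with this yourself, the imbalanced-learn package lets you slot resampling into a cross-validated pipeline so that it is only applied to the training folds. A sketch, assuming the categorical features have already been numerically encoded:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Resampling lives inside the pipeline, so it is applied to the training
# folds only and never to the validation fold
resampled_rf = Pipeline([
    ("oversample", SMOTE(random_state=42)),                # inflate the rare repair class
    ("undersample", RandomUnderSampler(random_state=42)),  # trim the majority classes
    ("model", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(resampled_rf, X, y, cv=5, scoring="accuracy")
```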
Feature importances
You may not always be interested in knowing why your model came to a given prediction. In our case, however, it would certainly be interesting to know why some water points are more likely to fail than others, because these insights are useful for maintenance operations. So which features are the most important when predicting the status of a water point?
A feature importance chart, which plots the relative importance of each feature, is a popular tool to answer this question. It is good to note here that models can use different methods to calculate feature importance, making it hard to compare the feature importance charts of different models.
Let’s explore the feature importances of the tuned Random Forest, XGBoost and CatBoost models.
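All three libraries expose their importances through the same feature_importances_ attribute after fitting, so collecting them side by side takes only a few lines (a sketch; rf, xgb and cat are the tuned models and X is assumed to be a DataFrame):

```python
import pandas as pd

# All three fitted models expose feature_importances_, but each library
# computes (and scales) them differently, so compare within a model
# rather than across models
importances = pd.DataFrame({
    "random_forest": rf.feature_importances_,
    "xgboost": xgb.feature_importances_,
    "catboost": cat.feature_importances_,
}, index=X.columns)

importances.sort_values("random_forest", ascending=False).plot.barh(
    subplots=True, figsize=(8, 12), legend=False
)
```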
All three models identify the quantity type as an important feature. Water quantity appears to be a very useful feature for identifying non-functional pumps.
The permit variable ranks pretty low in all three models. Given that the distribution of pump classes doesn’t differ much for water points with and without a permit, this is hardly surprising.
It is also interesting to see that the Random Forest attributes high importance to longitude and latitude, whereas XGBoost attributes little importance to these features. This may be caused by the way the Random Forest calculates feature importance. Gini importance, or ‘Mean Decrease in Impurity’, is used by the Random Forest to calculate the contribution of a feature to the model. This method is known to inflate the importance of high-cardinality features and can therefore be misleading.
In the end, feature importance charts give you just a summary of what your model has learned during training, but don’t say much about how well this holds up for new data.
SHAP
In my search for a more consistent and accurate way of calculating feature contributions, I stumbled upon SHAP values. For now, all you need to know is that SHAP values quantify the impact of each feature on the final prediction made by the model.
SHAP values can be calculated using a dedicated Python package, which comes with a variety of options to visualize the contribution of features for a single prediction as well as for groups of predictions.
I have calculated the SHAP values for my tuned Random Forest model. Because calculating SHAP values can take quite some time, I only calculated them for a random sample of 40% of my data, keeping the distribution of the three status classes the same as in the original dataset.
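In code, this boils down to drawing a stratified sample and handing the fitted model to the shap package (a sketch; rf is the tuned Random Forest from earlier):

```python
import shap
from sklearn.model_selection import train_test_split

# Stratified 40% sample so the class distribution matches the full dataset
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.4, stratify=y, random_state=42
)

# TreeExplainer works for tree ensembles such as the tuned Random Forest
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)

# Global importance plot: mean absolute SHAP value per feature and class
shap.summary_plot(shap_values, X_sample)
```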
Remember how the Random Forest’s Gini importance attributed great importance to longitude and latitude? Well, SHAP does not. Instead, SHAP attributes higher importance to the features that intuitively should have high importance, like the amount of water available at the water point, the age of the water point and the method used to extract the water.
Final tips and tricks
I hope this series has given you enough inspiration to start or continue this competition. Here are some final tips and tricks:
1. Spend a good amount of time on EDA and create data quality reports for some essential insights into your data.
2. Try out different imputation techniques and see how they affect the distribution and variance of your data.
3. Don’t immediately disregard high cardinality features but try grouping rare classes together into a single new class.
4. Try out different ways of encoding your categorical variables.
5. Play around with different test sizes while training your model.
6. When making a submission, retrain your model on the full training set.
7. Combine your best models in an ensemble.
8. Don’t spend hours on tuning your models. This time is better spent on EDA, cleaning and feature engineering.
9. Try organizing your work. Start each new approach in a new notebook, make regular submissions, and keep track of the results.
10. Don’t underestimate the power of common sense. There are a lot of fancy packages out there, but following your gut can go a long way.
The code from this article can be viewed on GitHub.
References and further reading
· Tran, K. 2021. SHAP: explain any machine learning model in Python. https://towardsdatascience.com/shap-explain-any-machine-learning-model-in-python-24207127cad7
· Mazzanti, S. 2021. Which of your features are overfitting? https://towardsdatascience.com/which-of-your-features-are-overfitting-c46d0762e769
· Ngai, A. 2019. Analytics snippet — feature importance and the SHAP approach to machine learning models. https://www.actuaries.digital/2019/06/18/analytics-snippet-feature-importance-and-the-shap-approach-to-machine-learning-models/
· Scikit-Learn, 2021. Feature importances with a forest of trees. https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
· Abu-Rmileh, A. 2019. The multiple faces of ‘feature importance’ in XGBoost. https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7