Pump it up — Which features should you include in your model?
This is the third article in a series of four describing my workflow in the Driven Data Pump it Up competition. Click here for the first article on EDA or here if you want to learn more about dealing with missing data. In this third article, we cover feature selection and feature engineering.
Multicollinearity
During EDA, I identified features that appeared to overlap strongly with other features. Multicollinearity occurs when two or more variables are highly correlated with one another. In regression tasks, multicollinearity is especially problematic because it becomes difficult to quantify the effect of the individual variables. Tree-based models handle this issue a lot better, but your model may still be hard to interpret because multicollinearity affects the stability of the feature importance scores. Either way, carrying redundant features makes your model needlessly complex and may result in overfitting.
Pairwise Pearson correlations
You could calculate the pairwise Pearson correlation between features to see how strongly they are correlated. I have done this for a subset of the variables, and it is clear that some of them are highly correlated. Quantity and quantity group have a correlation of 1, indicating that they contain exactly the same information. As expected, the three extraction variables are also highly correlated…
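For a quick look, the correlation matrix can be computed with pandas and visualised with seaborn. This is a minimal sketch: df is assumed to be the training DataFrame, the column list is illustrative, and categorical columns such as quantity would first need a numeric encoding before they show up in a Pearson correlation.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative subset of columns to inspect
cols = ["amount_tsh", "gps_height", "longitude", "latitude",
        "population", "construction_year"]

# Pairwise Pearson correlations
corr_matrix = df[cols].corr(method="pearson")

# Heatmap with the correlation value annotated in each cell
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```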
But what if a variable can be calculated from a group of other variables? For example, couldn’t you calculate the GPS height if you knew both the longitude and latitude?
Detecting multicollinearity with VIF
This is where VIF (Variance Inflation Factor) comes in. VIF is one of the methods commonly used to detect multicollinearity and is expressed on a scale from 1 to infinity. Variables with a VIF above 5–10 are said to show strong multicollinearity with the other variables in your dataset. The VIF of a feature is calculated as 1 / (1 − R²), where R² comes from regressing that feature on all the other features. A VIF of 10 therefore corresponds to an R² of 0.9, meaning that 90% of the variation in that feature can be explained by the other features, and a VIF of 100 corresponds to an R² of 0.99.
VIF can easily be calculated with the variance_inflation_factor function from the Statsmodels package.
After several iterations in which I drop one or two variables at a time based on their VIF values, I am left with 21 uncorrelated variables.
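A minimal sketch of this procedure, assuming X is a DataFrame containing only the numeric (or numerically encoded) candidate features:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return the VIF for every column of a numeric DataFrame."""
    X_const = add_constant(X)  # add an intercept column so the VIFs are not artificially inflated
    vifs = {
        col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)
        if col != "const"
    }
    return pd.Series(vifs).sort_values(ascending=False)

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while True:
        vifs = vif_table(X)
        if vifs.iloc[0] < threshold:
            return X
        X = X.drop(columns=vifs.index[0])
```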
Comparing feature selection strategies
I wonder what impact dropping this many variables has on model performance, so I decide to compare the performance of a Random Forest on the full feature set, the VIF feature set, and a set of features that I manually selected after EDA.
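Setting the comparison up takes only a few lines of scikit-learn. In this sketch, full_features, vif_features and manual_features are placeholders for the three column lists, and X and y hold the encoded features and the target:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The three candidate feature sets (placeholder lists)
feature_sets = {
    "full": full_features,
    "vif": vif_features,
    "manual": manual_features,
}

for name, cols in feature_sets.items():
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    scores = cross_val_score(rf, X[cols], y, cv=5, scoring="accuracy")
    print(f"{name:>7}: {scores.mean():.3f} accuracy (+/- {scores.std():.3f})")
```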
I find the highest accuracy score for the full feature set. This is not completely surprising: the more information your model receives, the more patterns it can pick up. However, a big risk of including this many features is overfitting on rare categories, and that certainly seems to be the case here.
Why do I think this? Well, let’s have a look at the feature importance plots of each of the three models.
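A fitted Random Forest exposes its importance scores through the feature_importances_ attribute; continuing the sketch above, ranking them for one of the feature sets could look like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Refit on the full training data, then rank the importance scores
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X[manual_features], y)

importances = pd.Series(rf.feature_importances_, index=manual_features)
print(importances.sort_values(ascending=False).head(10))
```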
In the full feature set, some high cardinality features such as subvillage and waterpoint name — that intuitively should have low feature importance — rank pretty high.
The model is giving too much weight to features that only contain pump-level information. What would happen if we used this model on a dataset with different water point names? Right, its accuracy would be a lot lower.
Interestingly, the VIF model also attributes great importance to these high cardinality features.
I am pretty happy with the results of my manually selected feature set, where I dropped some (but not all) of the high cardinality and correlated features. This model has fewer features to work with, yet the accuracy is only slightly lower when compared to the full feature set model.
What about wrapper-based methods?
So what about wrapper-type feature selection techniques like RFECV? RFECV wraps around your model, iteratively eliminates the least important variables, and uses cross-validation to decide how many features to keep. Using RFECV I was able to drop 3 low-ranking features, not nearly as many as I would have liked. Of course, there are many other feature selection techniques out there; this article by Zixuan Zhang gives a nice overview.
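For completeness, here is what a minimal RFECV setup with scikit-learn could look like, again using the X and y from the earlier sketches:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Recursively eliminate the least important feature and use 3-fold
# cross-validation to decide how many features to keep
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=3,
    scoring="accuracy",
)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Kept:", list(X.columns[selector.support_]))
```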
What is the take-away message here?
In a perfect world you would not have a model that overfits, and VIF (combined with a bit of common sense) can go a long way towards avoiding this. However, my experience from this particular competition is that to reach a high ranking, you will probably end up with a model that overfits to some extent.
Feature Engineering
Feature selection and feature engineering go hand in hand. You might decide to drop a feature because you know you will combine it with another one. During EDA, I already came up with some ideas for feature engineering.
Creating new variables
I decided to create an age feature by subtracting the construction year from the recorded date variable. I also used the recorded date variable to determine whether the recording was performed in the rainy or the dry season. Region and district code were combined into a region-district feature, after which I dropped both region and district code.
Then I created a feature reflecting the missingness of the amount tsh feature: in pumps where amount tsh was not missing, the functionality rate was much higher than in pumps where amount tsh was unknown. The final feature I created was authority scheme, which groups scheme management into four new classes (private, autonomous, non-autonomous and other).
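In pandas, these steps could look roughly like this. The column names are the original Pump it Up ones; the month-to-season mapping and the assumption that missing amount_tsh values were already converted to NaN in the previous article are mine:

```python
import numpy as np
import pandas as pd

df["date_recorded"] = pd.to_datetime(df["date_recorded"])

# Pump age: recording year minus construction year
# (a construction_year of 0 marks an unknown value in this dataset)
df["age"] = df["date_recorded"].dt.year - df["construction_year"]
df.loc[df["construction_year"] == 0, "age"] = np.nan

# Rainy vs. dry season, based on the month of the recording
# (this month-to-season mapping is an assumption)
rainy_months = [3, 4, 5, 11, 12]
df["rainy_season"] = df["date_recorded"].dt.month.isin(rainy_months).astype(int)

# Combine region and district code, then drop the originals
df["region_district"] = df["region"].astype(str) + "_" + df["district_code"].astype(str)
df = df.drop(columns=["region", "district_code"])

# Flag whether amount_tsh was missing (assumes missing values were set to NaN earlier)
df["amount_tsh_missing"] = df["amount_tsh"].isna().astype(int)
```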
Dealing with high cardinality variables
After experimenting for a bit with the funder and installer variables, I decided to keep the 500 most common categories and group the rare categories together. Before grouping, I cleaned up the installer variable using the Python Dirty Cat package. You can use this package to encode categorical features based on how similar their string values are, but I used it to detect potential typos. Do you agree that danid and danida, or commu and community, probably reflect the same categories?
Next, I decided to round longitude and latitude to 2 decimal places. The coordinates are expressed with 6 decimal places, which corresponds to an accuracy of about 0.11 meters. That level of precision pinpoints the exact position of each water point but tells the model little about the general area the pump is in. By using 2 decimal places, I capture the location of the water point to an accuracy of roughly 1.1 km and lower the cardinality of these variables tremendously.
To prevent my model from overfitting on rare classes, I decided to group some of the rare classes in the extraction type and source type features.
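Put together, the cardinality reduction steps could look something like this sketch; the minimum count used to define a rare class is an illustrative choice, not a number from the article:

```python
# Keep the 500 most common funder and installer categories, group the rest as 'other'
for col in ["funder", "installer"]:
    top_categories = df[col].value_counts().nlargest(500).index
    df[col] = df[col].where(df[col].isin(top_categories), "other")

# Round coordinates to 2 decimal places (~1.1 km accuracy)
df["longitude"] = df["longitude"].round(2)
df["latitude"] = df["latitude"].round(2)

# Group rare classes in extraction_type and source_type
for col in ["extraction_type", "source_type"]:
    counts = df[col].value_counts()
    rare_classes = counts[counts < 100].index  # illustrative threshold
    df[col] = df[col].replace(list(rare_classes), "other")
```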
Encoding my categorical variables
I played around with different encoding strategies, including one-hot encoding, frequency encoding and target encoding, and finally landed on label encoding. I was a bit hesitant about using label encoding because the model might falsely assume some sort of order in the encoded classes, but I consistently got the best results with the label encoder. Curious about what other encoding techniques are out there? Have a look at this great overview by Baijayanta Roy.
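A simple label-encoding loop with scikit-learn might look like this; the fitted encoders are kept so the test set can be transformed with the same mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining categorical (object) column
encoders = {}
for col in df.select_dtypes(include="object").columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le  # reuse on the test set for a consistent mapping
```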
All the code used in this article can of course be found on GitHub.
In the next article we will cover the machine learning component of the competition.
References and further reading
· Bhandari, A. 2020. What is multicollinearity? Here’s everything you need to know. https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
· Frost, J. 2021. Variance Inflation Factors (VIFs). https://statisticsbyjim.com/regression/variance-inflation-factors/
· Quora, 2021. Is multicollinearity a problem in decision trees? https://www.quora.com/Is-multicollinearity-a-problem-in-decision-trees
· Roy, B. 2019. All about categorical variable encoding. https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
· Zhang, Z. 2019. Feature selection: why & how explained. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e