How to Maximize ML Models and Improve Predictions

Customizing sklearn’s VotingClass to Optimize Features and Models

Anissa Mike
When I Work Data
6 min read · Nov 22, 2021


This article discusses customizing sklearn’s Voting Class so that it is more flexible and the features within the Voting Ensemble can be tailored to each model. It covers why we took this approach, the benefits and limitations of Random Forest and Logistic Regression, and why the two are stronger together.

Ensemble Learning

Ensemble learning aggregates the predictions of multiple machine learning models into a single, final prediction. Combining information from multiple models should produce a stronger prediction than any single model on its own. I like to think of it as giving many judges the same portfolio of information. Each judge brings their own background to the table and uses their own methods to come to a decision. When we take all of the judges’ decisions together, we arrive at a more well-rounded and informed prediction than any single perspective alone.

At When I Work, we had built a Random Forest model and relied on it for quite a while. The Random Forest is itself an example of an ensemble method, as it aggregates multiple decision trees together. Adding models alongside the Random Forest should help us improve our overall predictions and make them more well-rounded. More information on the benefits of ensemble learning can be found here.

I’ve always relied heavily on regression, both for exploratory and insight analyses. Even when establishing our Random Forest model, I ran regressions with each of the predictors first to ensure they had a baseline relationship with our outcome and were worth putting into the final model. Regression also offers some flexibility in features that the Random Forest lacks. Thus a Logistic Regression seemed like a natural layer to add on to our model.

The Random Forest

The Random Forest offers the benefits of randomness and complexity. It randomizes both the subset of data used to build each decision tree (a process called bagging) and the subset of features considered at each split within a tree. This means the Random Forest can try out many combinations of features while offering some protection against overfitting, thanks to the random subsets of data. It is also not sensitive to multicollinearity, because it evaluates a single feature at a time as it builds out a branch: if two features are highly related, the second feature simply adds no new information rather than interfering with the feature before it.
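To make that concrete, here is a minimal sketch of those knobs in scikit-learn. The data and parameter values are placeholders for illustration, not what we run in production:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy data standing in for our real features and outcome.
    X, y = make_classification(n_samples=1_000, n_features=9, random_state=0)

    # Each tree sees a bootstrap sample of the rows (bagging) and a random
    # subset of the features at every split.
    rf = RandomForestClassifier(
        n_estimators=500,     # trees to aggregate
        max_features="sqrt",  # features considered at each split
        bootstrap=True,       # sample rows with replacement for each tree
        random_state=0,
    )
    rf.fit(X, y)
    print(rf.predict_proba(X[:5]))  # probability of class 0 and class 1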

The Logistic Regression

The Logistic Regression examines the data as a whole. It looks at the features simultaneously and accounts for their relationships with one another as well as their relationships with the outcome. It gives us the ability to model interactions between features and to include polynomial terms (useful if we think the relationship between a feature and the outcome is curvilinear rather than linear). The Logistic Regression utilizes all the data at its disposal to come up with its estimates. Unfortunately, it is sensitive to multicollinearity: if two variables are highly correlated, they end up accounting for the variance in one another rather than in the outcome, making our estimates unreliable and our model weaker.
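As a rough illustration (again with toy data, not our actual features), interactions and squared terms can be added before the regression is fit:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_classification(n_samples=1_000, n_features=3, n_informative=3,
                               n_redundant=0, random_state=0)

    # degree=2 adds pairwise interactions and squared terms alongside the
    # original features, letting the regression capture curvilinear effects.
    model = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LogisticRegression(max_iter=1_000),
    )
    model.fit(X, y)
    print(model.named_steps["logisticregression"].coef_)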

Given these benefits and drawbacks, the Random Forest and Logistic Regression seem to be good complements to each other, and we were interested in combining them to form a more robust prediction.

The Voting Method

scikit-learn offers a way to easily combine classifiers using the Voting method. However, we found the Voting method as built did not sufficiently meet our needs: it was inflexible about which features were run with which models, expecting every feature in the dataset to be run with every model.
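For reference, stock usage looks roughly like this; note that every estimator is fit on the same feature matrix:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1_000, n_features=9, random_state=0)

    # Soft voting averages the predicted probabilities of the estimators.
    # The class as shipped fits every estimator on the same X.
    voter = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1_000)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        voting="soft",
    )
    voter.fit(X, y)
    print(voter.predict_proba(X[:5]))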

Because of the above-mentioned issues with multicollinearity, we wanted to run different features through the different models. With the Logistic Regression model we wanted to be more parsimonious, including only our biggest predictors and minimizing overlap between features. With the Random Forest, we wanted to include a broader range of features, since its iteration through combinations of features and its freedom from multicollinearity could help identify features that added even minor incremental prediction (which would have been obscured in the Logistic Regression model). Indeed, we found nine features with reasonable relationships to our outcome, and five of those features showed no issues with multicollinearity when evaluated together. In addition, the Logistic Regression would allow us to add an interaction term, something that cannot be specified appropriately in the Random Forest model, since an interaction needs to be evaluated in tandem with its original, non-interacted parent features.

While we could have chosen to simply limit the features that went into the Random Forest and Logistic Regression to match each other, we instead decided to modify scikit-learn’s Voting Class to our needs.

The Modified Voting Class

The key elements we needed our version of the Voting Class to provide were:

  • fit(), which fits each model with the features specified for that model.
  • predict_probas(), which returns the probabilities from each individual model.
  • predict_proba(), which returns the single, aggregated probability from all models.
  • get_feature_importance(), which returns the feature importances from the Random Forest and the coefficients from the Logistic Regression in a special class. This isn’t necessary for the model to run, but it is important for our understanding of the various features.

We replicated these elements in our own VotingClassifier. Instead of accepting any list of models, as the original VotingClassifier does, we built in three models we believed complemented each other well: a Logistic Regression, a Random Forest, and an SVM.* The class is instantiated with a list of features that each model can accept; these can be modified in a configs.py file or when instantiating the VotingClassifier. It functions much like the original VotingClassifier: it is fit by calling fit() and outputs the final probability using predict_proba(). predict_probas() can also be called so that the predictions across models can be examined and compared if desired.
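The full implementation lives in the repo linked below; the skeleton here is only a rough sketch of its shape, with names and internals approximated from the description above rather than copied from our code:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    class VotingClassifier:
        """Sketch of a voting ensemble where each model gets its own feature list."""

        def __init__(self, features, regression_features):
            # features: columns handed to the random forest and the SVM
            # regression_features: the leaner column list for the logistic regression
            self.features = features
            self.regression_features = regression_features
            self.models = {
                "logistic_regression": LogisticRegression(max_iter=1_000),
                "random_forest": RandomForestClassifier(),
                "svm": SVC(probability=True),  # probability=True enables predict_proba
            }

        def _columns_for(self, name):
            return self.regression_features if name == "logistic_regression" else self.features

        def fit(self, X, y):
            # Fit each model on its own subset of the columns in X (a DataFrame).
            for name, model in self.models.items():
                model.fit(X[self._columns_for(name)], y)
            return self

        def predict_probas(self, X):
            # One column of class-1 probabilities per model, for side-by-side comparison.
            return pd.DataFrame({
                name: model.predict_proba(X[self._columns_for(name)])[:, 1]
                for name, model in self.models.items()
            })

        def predict_proba(self, X):
            # Average the models' class-1 probabilities, then return both columns.
            endorsed = self.predict_probas(X).mean(axis=1)
            return pd.DataFrame({"prob_not_endorsed": 1 - endorsed, "prob_endorsed": endorsed})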

Example of how to run the VotingClassifier using features training_features, outcome data labels, and specified features for each model (features and regression_features).
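Using the sketch above, that call might look something like this (the variable names follow the caption; the toy data is purely illustrative):

    import numpy as np
    import pandas as pd

    # Toy stand-ins for our real data; the column names are illustrative only.
    rng = np.random.default_rng(0)
    training_features = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
    labels = rng.integers(0, 2, size=500)

    features = ["a", "b", "c"]        # broader feature list for the forest and the SVM
    regression_features = ["a", "b"]  # leaner list for the logistic regression

    clf = VotingClassifier(features=features, regression_features=regression_features)
    clf.fit(training_features, labels)

    print(clf.predict_proba(training_features).head())   # aggregated probabilities
    print(clf.predict_probas(training_features).head())  # one column per model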

(A repo representing our full VotingClassifier and its supporting elements can be found here.)

get_feature_importance() mimics the Random Forest’s feature importance output, but returns a different format: a specific class built to house the feature importances from the Random Forest and the coefficients from the regression. (Unfortunately, the SVM does not offer an equivalent and could not be included.)

Calling feature importances and coefficients from the random forest and logistic regression, respectively.
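Under the hood, those are the fitted attributes scikit-learn already exposes; assuming the models dictionary from the sketch above and a clf fit as in the previous example, a rough equivalent would be:

    # The fitted attributes a get_feature_importance()-style method can package up.
    rf_importances = clf.models["random_forest"].feature_importances_   # one value per forest feature
    lr_coefficients = clf.models["logistic_regression"].coef_           # one coefficient per regression feature
    print(rf_importances, lr_coefficients)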

Our outcome was the probability of an activity occurring within our app, coded as 0 or 1. The returned probabilities had two columns, since models return both the probability of a 0 and the probability of a 1 (even though one is the inverse of the other). Because this was built specifically for our use case and a specific project, we returned results as pandas DataFrames instead of numpy arrays, which worked better with our output. For our use case, we labeled the columns prob_not_endorsed and prob_endorsed, though these can be modified within the VotingClassifier.

Conclusion

Modifying the Voting Class from sklearn gave us additional flexibility that allowed us to capitalize on the strengths and minimize the weaknesses of each model while utilizing the most data. In our update from the old model to the new one, which also included some feature tweaking, we improved our prediction by 72%.

Notes

*Note: The SVM is also insensitive to multicollinearity and used all of the same features as the Random Forest. It offers a different perspective on top of the Logistic Regression and Random Forest, but for time and space I won’t expound on it here.

The When I Work app helps streamline scheduling for employees and managers in shift-based workplaces. Check us out here.
