Simple Weighted Average Ensemble | Machine Learning

Jinhang Jiang · Published in Analytics Vidhya · Oct 14, 2020

This is a walk-through of how to apply a weighted average ensemble to improve your prediction scores.

Purpose of This Blog

Last week, one of my machine learning class assignments asked us to produce ensemble predictions by combining the outputs of several different algorithms. I found the topic fascinating; however, there are not many sources online that explain the concept simply. So I decided to write this blog for people who are looking for a straightforward solution.

While researching this topic, I found that online articles take many different approaches to applying the weighted average to prediction models in practice. If you would like to learn more and dig deeper into the algorithms and mathematical principles behind these demonstrations, I encourage you to do more research and read the published papers on applications of the Weighted Average Ensemble.

One paper I found online that does a good job of explaining how to apply the Weighted Average Ensemble to a real case was written by Nivedhitha Mahendran and collaborators from multiple universities. You may find the article through this link.

Content Table

Introduction & General Usage

Case Walk-through

Conclusion

INTRODUCTION & GENERAL USAGE

Don’t put your eggs in one basket

The basic idea behind the Average Ensemble and the Weighted Average Ensemble is to reduce the total error by aggregating the predictions of multiple different classifiers. The rationale is that we assume each classifier makes different mistakes during training and prediction. We then generate a diverse set of classifiers and combine their outputs, so the total error of the final prediction decreases after aggregation.

It is much like the concept of diversification: you do not put all your eggs in one basket. When you invest in the stock market, you want to build a portfolio of holdings from different industries. If one sector performs poorly, the loss is balanced by gains in other sectors, and the risk of suffering a substantial financial loss is reduced.

The simple math behind it

Suppose you have a set of five classifiers, each with an error rate of 0.2, and assume the classifiers are all independent. With majority voting, the ensemble classifier is wrong on an instance only when at least 3 of the 5 classifiers are wrong at the same time. Therefore, the probability of a wrong prediction is calculated as follows:

The number of combinations where 3 out of 5 go wrong is 10;

the number of combinations where 4 out of 5 go wrong is 5;

and the number of combinations where all 5 go wrong is 1.

Weighting each count by its probability (0.2 raised to the number of wrong classifiers, 0.8 to the number of correct ones) and summing gives a total error of about 0.058, down from 0.2 for a single classifier. However, as you add more classifiers to the combination, you should expect diminishing returns, and the reduction in error will plateau.
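To make that arithmetic concrete, here is a minimal sketch of the binomial calculation in Python, assuming five independent classifiers that each have an error rate of 0.2, as stated above:

```python
from math import comb

n, p = 5, 0.2  # five independent classifiers, each with an error rate of 0.2

# The majority vote is wrong only when at least 3 of the 5 classifiers are wrong:
# P(error) = sum over k = 3..5 of C(5, k) * p^k * (1 - p)^(5 - k)
ensemble_error = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))

print(round(ensemble_error, 3))  # 0.058, down from 0.2 for a single classifier
```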

How to implement one intuitively

Usually, there are two ways to do it.

First, you train the same classifier (e.g., Decision Tree) on multiple different subsets of the training data, which yields multiple different models (DT1, DT2, DT3, …). Then, you predict the test data with those models and average the results (Figure 1.1).

Figure 1.1 Drawn by Jinhang Jiang
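As a rough sketch of this first approach, assuming `X_train`, `y_train`, and `X_test` are NumPy arrays you have already prepared (the names are placeholders, not from the original code), you could train one Decision Tree per bootstrap sample and average the predicted probabilities:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
n_models = 3
probas = []

for i in range(n_models):
    # Each tree sees a different bootstrap sample of the training data
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_test)[:, 1])

# Average the probability of class 1 across the trees (DT1, DT2, DT3)
avg_proba = np.mean(probas, axis=0)
```

This is essentially what bagging does, and scikit-learn's BaggingClassifier wraps the same idea for you.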

Second, you can train multiple different (the more diverse, the better) classifiers on the whole training set, and average the results (Figure 1.2).

Figure 1.2 Drawn by Jinhang Jiang
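A minimal sketch of the second approach, under the same placeholder names: train a few diverse classifiers on the full training set and average (or weight) their predicted probabilities:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = [
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    KNeighborsClassifier(),
]

# Every model is fitted on the full training set
probas = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]

# Equal weights give a simple average; change the weights for a weighted average
weights = [1 / 3, 1 / 3, 1 / 3]
ensemble_proba = np.average(probas, axis=0, weights=weights)
```

scikit-learn's VotingClassifier with voting="soft" and a weights argument implements the same idea in a single estimator.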


CASE WALK-THROUGH

Data

I am using a Kaggle case competition from five years ago as an example; you may find all the data and the case description here.

Basically, we need to predict which of Homesite's customers will purchase a given quote. The target variable is binary: 0 (did not purchase) or 1 (quote converted).

I will skip the data cleaning since it is not the focus of this blog. My final training data has 65,000 rows × 596 columns, and my final testing data has 173,836 rows × 595 columns. All categorical variables were converted to numeric data. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.

Code

I implemented five models in this case:

KNN, MLPC, XGBoost, Decision Tree, and Random Forest

I will compare the results of the individual classifiers and the ensemble classifiers without tuning any parameters.

First step: Load all the libraries and packages (use “pip install package_name” to install any classifier package you do not have)
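A sketch of the imports this walk-through needs, assuming scikit-learn and xgboost provide the five classifiers:

```python
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
```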

Second step: Fit each model with our training data
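A sketch of the fitting step, assuming the cleaned data lives in `X_train`, `y_train`, and `X_test` (the names are mine, not from the original code):

```python
models = {
    "knn": KNeighborsClassifier(),
    "mlpc": MLPClassifier(random_state=0),
    "xgb": XGBClassifier(random_state=0),
    "dt": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}

# Fit each base model on the full training set and keep its predicted
# probability that a quote converts (class 1) for the test set
test_probas = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_probas[name] = model.predict_proba(X_test)[:, 1]
```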

Figure 2.1 shows the Kaggle scores for each classifier base performance.

Figure 2.1 Individual Model’s Performance

Third step: Test the average and the weighted average methods
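A sketch of both ensembles, reusing the `test_probas` dictionary above; the weights below are illustrative stand-ins, not the exact ones behind the Kaggle submission:

```python
import numpy as np

# Stack the five models' probabilities into one (n_test_rows, 5) array
probas = np.column_stack(list(test_probas.values()))

# Simple average: every model gets the same weight
avg_pred = probas.mean(axis=1)

# Weighted average: more weight for the stronger models (e.g., XGBoost),
# less for the weaker ones (e.g., KNN); the weights should sum to 1
weights = np.array([0.05, 0.15, 0.45, 0.10, 0.25])  # knn, mlpc, xgb, dt, rf
weighted_pred = probas @ weights

# avg_pred and weighted_pred are the probabilities written to the submission file
```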

Figure 2.2 shows the Kaggle scores for the ensemble methods:

Figure 2.2 Kaggle scores for The Ensemble Methods

As you can see, the weighted average ensemble outperforms all the other classifiers in this case competition. The ROC AUC score increased by roughly 0.38 over the worst performer (KNN), which is a huge jump. The weighted average ensemble even outperformed our best individual model (the XGB Classifier) by 0.045. That is a significant increase in any case competition, especially since Kagglers are always fighting for the last few decimal points.

Also, in this code I manually assigned the weights to the models based on experience. If you want to find the optimal weights, here is a guide to hyperparameter tuning your ensemble: Hyperparameter Tuning the Weighted Average Ensemble in Python.
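If you would rather search for the weights instead of assigning them by hand, one simple (if brute-force) option is a grid search over candidate weight vectors, scored on a held-out validation set. This is only a sketch, assuming `val_probas` is an (n_rows, n_models) array of validation-set probabilities and `y_val` holds the validation labels:

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

best_auc, best_weights = -1.0, None
grid = np.arange(0.0, 1.01, 0.1)  # candidate weights in steps of 0.1

for w in itertools.product(grid, repeat=val_probas.shape[1]):
    if not np.isclose(sum(w), 1.0):
        continue  # only keep weight vectors that sum to 1
    auc = roc_auc_score(y_val, val_probas @ np.array(w))
    if auc > best_auc:
        best_auc, best_weights = auc, w
```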


Conclusion

This is just a straightforward walk-through of performing Weighted Average Ensemble predictions with multiple different models. As I said above, the more diverse the combination, the better the result can be. For example, you can add a linear model to the mix, which gives the ensemble a chance to capture linear relationships among the variables.

However, sometimes it just does not work out this way. This is only an example from five years ago, and many problems that seemed tough back then look much easier today because more challenging problems have since emerged. In some of the cases I have worked on, this technique could not outperform my best individual model no matter how many combinations of weights I tried. Nonetheless, it has always been a good way to help me avoid overfitting.

Please feel free to connect with me on LinkedIn.

Related Reading

Hyperparameter Tuning the Weighted Average Ensemble in Python
