Stacking and Blending — An Intuitive Explanation

Steven Yu
Sep 30, 2019


This post is a brief explanation of two very powerful ensemble learning methods. It serves as an easy-to-read memo to remind you how stacking and blending work, so I assume you have a basic understanding of what ensemble learning is. That said, the article should be intuitive enough to give you an idea of how the two methods work.

Introduction

Stacking and Blending are two powerful and popular ensemble methods. The two are very similar; the main difference is how the training data is allocated when training the meta-model. They are most notable for their popularity and strong performance in Kaggle competitions. In fact, in the later stages of most recent Kaggle competitions, stacking or blending is almost certainly being used to boost the final performance.

Stacking

Stacking, or stacked generalisation, was introduced by Wolpert. In essence, stacking makes predictions using a meta-model trained on top of a pool of base models: the base models are first trained on the training data and asked to give their predictions; a separate meta-model is then trained to turn the outputs of the base models into the final prediction. The process is actually quite simple. To train a base model, the K-fold cross validation technique is used.

Step 1: You have train_data and test_data. Assuming we use 4-fold cross validation to train the base models, train_data is divided into 4 parts.

[Figure: Training data (4-fold) and testing data]

Step 2: Using the 4-part train_data, the 1st base model (say, a decision tree) is fitted on 3 parts and predictions are made for the 4th part. This is done for each part of the training data, so in the end every instance of the training data has a prediction. This creates a new feature for train_data; call it pred_m1 (predictions of model 1).

[Figure: Model 1 training and prediction using 4-fold cross validation]

Step 3: Model 1 (the decision tree) is then fitted on the whole training data; no folding is needed this time. The trained model is used to predict on test_data, so test_data also gets a pred_m1 feature.

Step 4: Steps 2 and 3 are repeated for the 2nd model (e.g. KNN) and the 3rd model (e.g. SVM). This gives both train_data and test_data two more prediction features, pred_m2 and pred_m3.

Step 5: Now, to train the meta model (assume it’s a logistic regression), we use only the newly added features from the base models, which are [pred_m1, pred_m2, pred_m3]. Fit this meta model on train_data.

Step 6: The final prediction for test_data is given by the trained meta model.

Sample Code

First, define a function that performs stacking for a given base model. The function uses n_fold cross validation to train the input model and returns its predictions on train_data and test_data, which are used as new features.
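Below is a minimal sketch of such a function, assuming scikit-learn and pandas-style inputs; the function and variable names (Stacking, train, y, test, n_fold) are illustrative, not the original gist.

```python
import numpy as np
from sklearn.model_selection import KFold

def Stacking(model, train, y, test, n_fold):
    """Return out-of-fold predictions on train and predictions on test for one base model."""
    folds = KFold(n_splits=n_fold, shuffle=True, random_state=42)
    train_pred = np.zeros(len(train))

    # Steps 1-2: out-of-fold predictions so every training instance gets a prediction
    for train_idx, val_idx in folds.split(train):
        x_tr, x_val = train.iloc[train_idx], train.iloc[val_idx]
        y_tr = y.iloc[train_idx]
        model.fit(x_tr, y_tr)
        train_pred[val_idx] = model.predict(x_val)

    # Step 3: refit on the whole training data, then predict on test_data
    model.fit(train, y)
    test_pred = model.predict(test)
    return train_pred, test_pred
```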

Next, we train two base models using the defined function, taking a Decision Tree and KNN as examples.
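A hedged usage sketch, assuming x_train, y_train, and x_test are hypothetical pandas objects holding the original features and labels:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base model 1: decision tree
model1 = DecisionTreeClassifier(random_state=1)
train_pred_m1, test_pred_m1 = Stacking(model1, x_train, y_train, x_test, n_fold=4)

# Base model 2: KNN
model2 = KNeighborsClassifier()
train_pred_m2, test_pred_m2 = Stacking(model2, x_train, y_train, x_test, n_fold=4)
```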

Finally, we create a meta-model, a logistic regression, that uses the predictions from the base models (the new features) to give the final prediction.
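A minimal sketch of the meta-model step, continuing with the hypothetical variable names from above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Combine the base-model predictions into the new feature set [pred_m1, pred_m2]
meta_train = pd.DataFrame({'pred_m1': train_pred_m1, 'pred_m2': train_pred_m2})
meta_test = pd.DataFrame({'pred_m1': test_pred_m1, 'pred_m2': test_pred_m2})

# Fit the meta-model on train_data's new features and predict on test_data
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_train)
final_pred = meta_model.predict(meta_test)
```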

Blending

Blending is very similar to Stacking. It also uses base models to provide base predictions as new features, and a new meta-model is trained on those features to give the final prediction. The only difference is that the meta-model is trained on a separate holdout set (e.g. 10% of train_data) rather than on the full, folded training set.

Step 1: train_data is split into base_train_data and holdout_set.

Step 2: Base models are fitted on base_train_data, and predictions are made on holdout_set and test_data. These will create new prediction features.

Step 3: A new meta-model is then fit on holdout_set with new prediction features. Both original and meta features from holdout_set will be used.

Step 4: The trained meta-model is used to make final predictions on the test data using both original and new meta features.

Sample Code

Two base models, a Decision Tree and KNN, are trained on base_train_data and make predictions on holdout_set and test_data.
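A minimal sketch of this step, again assuming x_train, y_train, and x_test are hypothetical pandas objects:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Step 1: split train_data into base_train_data and a 10% holdout_set
x_base, x_holdout, y_base, y_holdout = train_test_split(
    x_train, y_train, test_size=0.1, random_state=42)

# Step 2: fit the base models on base_train_data, predict on holdout_set and test_data
model1 = DecisionTreeClassifier(random_state=1)
model1.fit(x_base, y_base)
holdout_pred_m1 = model1.predict(x_holdout)
test_pred_m1 = model1.predict(x_test)

model2 = KNeighborsClassifier()
model2.fit(x_base, y_base)
holdout_pred_m2 = model2.predict(x_holdout)
test_pred_m2 = model2.predict(x_test)
```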

The original and meta features are then combined for holdout_set and test_data, and a logistic regression meta-model is trained on holdout_set to make the final predictions on test_data.
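A sketch of the final step, assuming x_holdout and x_test are pandas DataFrames so the prediction columns can be appended:

```python
from sklearn.linear_model import LogisticRegression

# Steps 3-4: add the prediction features to the original features
holdout_meta = x_holdout.copy()
holdout_meta['pred_m1'] = holdout_pred_m1
holdout_meta['pred_m2'] = holdout_pred_m2

test_meta = x_test.copy()
test_meta['pred_m1'] = test_pred_m1
test_meta['pred_m2'] = test_pred_m2

# Fit the meta-model on the holdout set and make the final predictions on test_data
meta_model = LogisticRegression()
meta_model.fit(holdout_meta, y_holdout)
final_pred = meta_model.predict(test_meta)
```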

Summary

Stacking and Blending are very powerful ensemble methods. They can effectively boost your model performance and in many cases can be a deciding factor in winning competitions. They have become so successful in competitions that there is a growing number of implementations of extremely complex multi-layer stacking methods. Building a robust and fast stacking or blending infrastructure is normally challenging. However, thanks to popular modules such as ML-Ensemble, the cost of building it has dropped dramatically.
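For a flavour of what such a module offers, here is a rough sketch of a two-layer stack with ML-Ensemble's SuperLearner; treat the exact method names and arguments as assumptions and check the library's documentation before relying on them.

```python
# Rough sketch using ML-Ensemble's SuperLearner; API details are an assumption,
# so verify against the current mlens documentation.
from mlens.ensemble import SuperLearner
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

ensemble = SuperLearner()
ensemble.add([DecisionTreeClassifier(), KNeighborsClassifier()])  # base layer
ensemble.add_meta(LogisticRegression())                           # meta layer

ensemble.fit(x_train, y_train)          # x_train, y_train, x_test are hypothetical
final_pred = ensemble.predict(x_test)
```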

The popularity of stacking and blending, along with their growing implementation complexity, has sparked much discussion about their commercial value versus their cost. That is a topic for another post; the links at the end provide some good insights. I found Marios' discussion of StackNet particularly interesting: in his Coursera talk, he relates StackNet to neural networks and gives a good comparison.

In my opinion, despite the questions about their complexity and cost, Stacking and Blending offer a good opportunity to increase model performance even before digging too deep into any specific model. They can serve as a good benchmark for data scientists to test the feasibility of a machine learning task. Plus, what seems computationally expensive today might not be so in the future, just as cloud computing and big data have shown for Machine Learning and Deep Learning.

Disclaimer

This post is a summary of existing resources and benefited from the following posts.

Below is an in-depth summary of the ensemble techniques used in Kaggle competitions. It is written by a top-ranking Kaggle team and summarises virtually all notable ensemble techniques used in competitions.

The link below gives a comprehensive explanation of Ensemble Learning using Python. The code in this post also benefited from it.

Finally, another top-ranking Kaggle participant contributed to a Coursera course and gives a very intuitive guide on how stacking works.
