Stacking — A Super Learning Technique

Guru Charan
5 min read · Oct 18, 2017


In our childhood we all heard the story about a bundle of sticks that teaches the principle "Unity is Strength." Ensemble Learning is a learning mechanism that follows this principle. Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Many popular modern machine learning algorithms are actually ensembles. In short, this is a technique that takes a collection of weak learners and forms a single, strong learner. For example, Random Forests (bagging) and Gradient Boosting (boosting) are both ensemble learners. Stacking is a technique that belongs to this family of learning. Let's dive into it.

Stacking

Stacking, a meta-modeling technique, was introduced by Wolpert in 1992. In stacking there are two types of learners: Base Learners and a Meta Learner. Base learners and meta learners are ordinary machine learning algorithms such as Random Forests, SVMs, Perceptrons, etc. The base learners are fit on the original data set, whereas the meta learner is fit on the predictions of the base learners.

The stacking technique involves the following steps (a minimal code sketch follows this list):

  1. Split the training data into two disjoint sets.
  2. Train several base learners on the first part.
  3. Test the base learners on the second part and make predictions.
  4. Using the predictions from step 3 as inputs and the correct responses as outputs, train the higher-level learner, i.e. the meta learner.
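
Here is a minimal sketch of these four steps using scikit-learn; the synthetic data set, the particular base learners (Logistic Regression, SVM, KNN), the Random Forest meta learner and all parameters are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy 4-class data set (purely illustrative).
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                           random_state=42)

# Step 1: split the training data into two disjoint sets.
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=42)

# Step 2: train several base learners on the first part.
base_learners = [LogisticRegression(max_iter=1000),
                 SVC(probability=True),
                 KNeighborsClassifier()]
for learner in base_learners:
    learner.fit(X_a, y_a)

# Step 3: let the base learners predict on the second part.
level_one = np.column_stack(
    [learner.predict_proba(X_b) for learner in base_learners])

# Step 4: train the meta learner on those predictions,
# using the true labels of the second part as the targets.
meta_learner = RandomForestClassifier(random_state=42)
meta_learner.fit(level_one, y_b)
```

Note that the meta learner only ever sees predictions made on data the base learners were not trained on, which keeps the level-one data from simply echoing the base learners' training fit.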

The meta learner is, in a sense, trying to find the optimal combination of the base learners. Take a classification problem with 4 classes: following the traditional paradigm we test various models and find that Logistic Regression makes better predictions on class 1, SVM does better on classes 2 and 4, and KNN does better on classes 2 and 3. This behaviour is expected because, in general, no model is perfect; each has its own advantages and disadvantages. So, if we train a model on the predictions of these models, can we get better results? That is the idea this entire concept is built upon: if we train a Random Forest classifier on the predictions of LR, SVM and KNN, we can get better results.
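
For instance, the LR/SVM/KNN example above could be wired up with scikit-learn's built-in StackingClassifier, which handles the out-of-fold prediction plumbing internally; the synthetic data and hyperparameters below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("knn", KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=0),  # the meta learner
    cv=5,  # out-of-fold predictions are used to train the meta learner
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))
```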

A GIF Explaining the Stacking Procedure

Let's try to give shape to this technique:

Set up the ensemble.

  1. Specify a list of L base algorithms (with a specific set of model parameters).
  2. Specify a meta learning algorithm.

Train the ensemble.

  1. Train each of the L base algorithms on the training set.
  2. Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
  3. The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix. This matrix, along with the original response vector, is called the "level-one" data. (N = number of rows in the training set.)
Generation of Data For the Meta Learner Using K-fold Cross Validation

  4. Train the meta learning algorithm on the level-one data. The "ensemble model" consists of the L base learning models and the meta learning model, which can then be used to generate predictions on a test set. (A sketch covering both training and prediction follows the next list.)

Predict on new data.

  1. To generate ensemble predictions, first generate predictions from the base learners.
  2. Feed those predictions into the meta learner to generate the ensemble prediction.
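
A rough sketch of the "train the ensemble" and "predict on new data" steps: it builds the N x L level-one matrix with k-fold cross-validated predictions (here via scikit-learn's cross_val_predict), fits the meta learner on it, and then generates ensemble predictions. All model and data choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                           random_state=1)

base_learners = {"lr": LogisticRegression(max_iter=1000),
                 "svm": SVC(),
                 "knn": KNeighborsClassifier()}

# k-fold cross-validated predictions from each of the L base learners,
# one column per learner -> an N x L matrix of level-one data.
level_one = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_learners.values()])

# Train the meta learner on the level-one data and the original response vector.
meta_learner = RandomForestClassifier(random_state=1)
meta_learner.fit(level_one, y)

# Refit the base learners on the full training set; at prediction time their
# outputs on new data become the meta learner's inputs.
for model in base_learners.values():
    model.fit(X, y)

# Predict on new data: base predictions first, then feed them to the meta learner.
X_new = X[:5]  # stand-in for unseen rows
new_level_one = np.column_stack(
    [model.predict(X_new) for model in base_learners.values()])
print(meta_learner.predict(new_level_one))
```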

Questions like "what is the perfect combination of base learners?" are mostly answered by experimentation; there is no exact method to determine the set, so we have to try different combinations. You will get more benefit if your first-layer models are fairly distinct from each other, for example an SVM and a decision tree, which are quite different. Generally, ensemble learners like Random Forests, XGBoost, Gradient Boosting, etc. are used as the meta learner, but any algorithm can be used; there is no rule that we have to use only ensemble learners.

When to Use Stacking?

A definite answer: in almost any Kaggle or Analytics Vidhya competition. The performance gain from this technique in competitions is remarkable; you can climb the leaderboard quickly if you use it. In machine learning there is a principle called Occam's Razor, which states that a simple hypothesis is better than a complex one. So plot the evaluation metric of the base learners alongside that of the stacked model, and only go for stacking if it gives better results. If not, drop the idea, because you run the risk of overfitting the data set.
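
One way to run that sanity check is to compare the cross-validated score of each base learner against the stacked model, along these lines (the models, data set and metric here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                           random_state=2)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("knn", KNeighborsClassifier())]
stacked = StackingClassifier(estimators=base,
                             final_estimator=RandomForestClassifier(random_state=2))

# Compare each base learner against the stacked model under the same CV scheme.
for name, model in base + [("stacked", stacked)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
# Only keep the stack if it clearly beats the best base learner.
```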

N-Level Stacking

The concept of Stacking can be extended to many Levels.

At each level you use the predictions of the previous level as the data set and apply the same procedure of dividing the data set that we used for 2 levels.
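
As a sketch of how two levels might be nested with scikit-learn's StackingClassifier (the particular models at each level and the data set are placeholder assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

X, y = make_classification(n_samples=600, n_classes=4, n_informative=8,
                           random_state=3)

# Level 1: stack the first layer of base learners.
level_one_stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC()),
                ("knn", KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=3),
    cv=5,
)

# Level 2: treat the level-one stack as just another model and stack again.
level_two_stack = StackingClassifier(
    estimators=[("stack1", level_one_stack),
                ("tree", DecisionTreeClassifier(random_state=3)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
level_two_stack.fit(X, y)  # trains the whole two-level ensemble
```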

The main disadvantage of increasing the number of levels is the scarcity of data at the higher levels. As we add levels, the data set keeps getting divided, which may leave too little data for the higher levels and prevent those models from getting enough data to learn. But by maintaining a proper trade-off between the number of levels and the number of models per level, we can reduce this problem.
