Stacking to Improve Model Performance: A Comprehensive Guide on Ensemble Learning in Python

Brijesh Soni
9 min read · May 1, 2023



Stacking is a powerful ensemble learning strategy in machine learning that combines the predictions of multiple base models to produce a final prediction with better performance. It is also known as stacked generalization or stacked ensembles. This Medium post discusses stacking in detail, addressing its concept, benefits, implementation, and best practices.

What exactly is stacking?


Stacking is a machine learning strategy that combines the predictions of numerous base models, also known as first-level models or base learners, to obtain a final prediction. It entails training several base models on the same training dataset, then feeding their predictions into a higher-level model, known as the meta-model or second-level model, which makes the final prediction. The main idea behind stacking is to combine the predictions of different base models to achieve better predictive performance than any single model alone.

What is the process of Stacking?

Stacking, also known as "Stacked Generalization," is an ensemble strategy that combines multiple models to improve overall predictive performance. The primary idea is to feed the predictions of several base models into a higher-level model, known as the meta-model or blender, which combines them to produce the final prediction.

Here’s a detailed description of how stacking works:

  1. Preparing the Data: The first step is to prepare the data for modeling. This entails identifying the relevant features, cleaning the data, and dividing it into training and validation sets.
  2. Model Selection: The following step is to choose the base models that will be used in the stacking ensemble. A broad selection of models is typically chosen to guarantee that they produce different types of errors and complement one another.
  3. Training the Base Models: After selecting the base models, they are trained on the training set. To ensure diversity, each model is trained using a different algorithm or set of hyperparameters.
  4. Predictions on the Validation Set: Once the base models have been trained, they are used to make predictions on the validation set.
  5. Developing a Meta-Model: The next stage is to develop a meta-model, also known as a meta-learner, which takes the predictions of the base models as input and makes the final prediction. Any algorithm, such as linear regression, logistic regression, or even a neural network, can serve as this model.
  6. Training the Meta Model: The meta-model is then trained using the predictions given by the base models on the validation set. The base models’ predictions serve as features for the meta-model.
  7. Making Test Set Predictions: Finally, the meta-model is used to produce test set predictions. The base models’ predictions on the test set are fed into the meta-model, which then makes the final prediction.
  8. Model Evaluation: The final stage is to assess the stacking ensemble’s performance. This is accomplished by comparing the stacking ensemble’s predictions to the actual values on the test set using evaluation measures such as accuracy, precision, recall, F1 score, and so on.

In the end, the goal of stacking is to combine the strengths of various base models by feeding their predictions into a meta-model, which learns how to weigh and combine them to generate the final prediction. This can frequently result in better performance than using any single model alone.
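If you would rather not wire this workflow up by hand, scikit-learn packages it in ready-made stacking estimators. Below is a minimal sketch using StackingRegressor on a synthetic dataset; the class and its parameters are real scikit-learn API, while the dataset and model choices are purely illustrative.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Level-0 base models plus a linear level-1 meta-model. With cv=5, the
# meta-model is trained on out-of-fold base predictions, which guards
# against leakage from the base models' training data.
stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(random_state=42)),
        ("rf", RandomForestRegressor(random_state=42)),
        ("ridge", Ridge()),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print("Test R^2:", stack.score(X_test, y_test))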

A simple way to understand the stacking process👇

Step 1: Split the training dataset into two parts: one part to train the base models and a hold-out part for the meta-model.

Step 2: Train several base models on the first part of the training data.

Step 3: Make predictions with the base models on the hold-out validation data.

Step 4: Train the meta-model on the hold-out validation data, using the base models’ predictions as input features.

Step 5: To make a prediction for new data, run the new data through the base models and feed their outputs into the meta-model.

Final step: Evaluate the performance of the stacked model on a separate test dataset that was not used during training.
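The hold-out split above is the simplest way to obtain "clean" inputs for the meta-model. A common alternative, sketched below on synthetic data, uses scikit-learn's cross_val_predict so that every training example receives an out-of-fold base-model prediction; the meta-model can then train on the full dataset without leakage.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5, random_state=0)

base_models = [DecisionTreeRegressor(random_state=0),
               RandomForestRegressor(random_state=0)]

# Out-of-fold predictions: each sample is predicted by a model that never saw it
oof_preds = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models])

# The meta-model trains on the out-of-fold predictions, avoiding leakage
meta_model = LinearRegression().fit(oof_preds, y)

# For inference, the base models are refit on all of the training data
for model in base_models:
    model.fit(X, y)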

The Advantages of Stacking


Stacking has various advantages in machine learning:

  1. Improved Predictive Performance: Stacking can reduce bias and variance in the final prediction by merging the outputs of multiple base models, resulting in better predictive performance. The meta-model learns from the strengths of several base models and can capture intricate patterns that no individual model captures on its own.
  2. Model Diversity: Stacking encourages the use of varied base models trained with different algorithms, architectures, and hyperparameter settings. This diversity can reduce the risk of overfitting and make the stacked ensemble more robust to different types of data.
  3. Flexibility: Stacking is a versatile strategy that can be applied to a variety of machine learning problems, including classification, regression, and even time series forecasting. It works with many kinds of base models, including decision trees, support vector machines, neural networks, and others.
  4. Interpretability: Stacking can also reveal how much each base model contributes to the final prediction. By studying the weights or contributions of each base model in the meta-model, we gain a better grasp of their relative importance (see the sketch after this list).
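To make the interpretability point concrete, here is a tiny self-contained sketch: a linear meta-model is fitted on three hypothetical base models’ validation-set predictions, and its learned coefficients give a rough view of how much each base model contributes. All numbers below are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical base-model predictions on a 5-sample validation set
# (columns: decision tree, random forest, gradient boosting)
base_preds = np.array([
    [2.1, 2.3, 2.0],
    [3.0, 2.9, 3.2],
    [1.5, 1.7, 1.4],
    [2.8, 2.6, 2.9],
    [3.5, 3.4, 3.6],
])
y_val = np.array([2.2, 3.1, 1.5, 2.7, 3.5])

# The meta-model's coefficients act as learned weights over the base models
meta_model = LinearRegression().fit(base_preds, y_val)
for name, weight in zip(["decision_tree", "random_forest", "gradient_boosting"],
                        meta_model.coef_):
    print(f"{name}: weight {weight:.3f}")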

Stacking Best Practices


Here are some best practices to remember when using stacking in machine learning:

  1. Model Diversity: The diversity of the base models is the strength of stacking. It is critical to use various base models that are likely to make different sorts of errors since this can help limit the danger of overfitting and increase the stacked ensemble’s performance. As base models, avoid utilizing comparable models or models with similar hyperparameter settings, as this may not result in a significant gain in performance.
  2. Data Leakage: When adopting stacking, be mindful of data leakage. When information from the validation or test set is used during the training of base models or meta-models, data leakage occurs. As a result, performance estimations may be too optimistic, resulting in poor generalization performance. Make sure the data is properly divided into training, validation, and test sets, and that the training data is only used to train the base models and meta-models.
  3. Performance Evaluation: Using proper evaluation measures, assess the performance of the stacked ensemble. To obtain credible estimates of the stacked ensemble’s performance, use cross-validation or hold-out validation techniques. To ensure that the stacked ensemble is truly enhancing predictive performance, compare its performance to that of the individual base models.
  4. Meta-Model Selection: Experiment with many meta-models to determine which one is best for your problem. The meta-model chosen can have a major impact on the stacked ensemble’s performance. Logistic regression, decision trees, and even more complicated models like neural networks or gradient boosting machines are commonly used meta-models. When making a decision, consider the meta-model’s interpretability, complexity, and generalization performance.
  5. Tuning Hyperparameters: Optimize the hyperparameters of both the base models and the meta-model, since hyperparameter tuning can greatly influence the stacked ensemble’s performance. Experiment with various settings at both levels to find the combination that maximizes predictive performance (see the sketch after this list).
  6. Ensemble Size: Overfitting and increased computing complexity may result from utilizing too many base models in the stacked ensemble. Experiment with different ensemble sizes to determine the appropriate number of base models for optimal performance without losing model interpretability or computational efficiency.
  7. Consider Domain-Specific Aspects: When adopting stacking, keep domain-specific aspects in mind. Different domains may have distinct characteristics that influence the stacked ensemble’s performance. In image identification tasks, for example, employing base models that capture several types of characteristics (e.g., color, texture, and shape) can lead to improved performance.
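For the hyperparameter-tuning practice above, one convenient pattern is to tune the base models and the meta-model jointly by wrapping a stacking estimator in GridSearchCV. Below is a minimal sketch on synthetic data; the nested "rf__..." / "final_estimator__..." parameter naming is standard scikit-learn convention, and the grid values are arbitrary.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=5, random_state=0)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0))],
    final_estimator=Ridge(),
)

# Reach into the base model ("rf__...") and the meta-model ("final_estimator__...")
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [3, None],
    "final_estimator__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(stack, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)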

Implementing stacking in machine learning with a real-life example

Example: House price prediction

Step 1: Load the Dataset

We’ll use the Boston Housing dataset, fetched from OpenML. It describes neighborhoods around Boston through 13 features, such as the per-capita crime rate, the average number of rooms per dwelling, and accessibility to highways. The task is to predict the median value of owner-occupied homes in thousands of dollars. (This dataset is no longer bundled with Scikit-learn, which is why we fetch it from OpenML.)

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset from OpenML
# (version=1 pins the dataset; as_frame=False returns plain NumPy arrays)
boston = fetch_openml(name='boston', version=1, as_frame=False)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=42)

Step 2: Train the Base Models

We’ll train three base models using three different algorithms: Decision Tree, Random Forest, and Gradient Boosting.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Train the base models
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)

Step 3: Make Predictions on the Validation Set

We’ll use the trained models to make predictions on the validation set.

# Make predictions on the validation set
dt_pred = dt.predict(X_val)
rf_pred = rf.predict(X_val)
gb_pred = gb.predict(X_val)

Step 4: Train the Meta-Model

We’ll use the predictions from the base models as input to the meta-model. We’ll use a linear regression model as the meta-model.

import numpy as np
from sklearn.linear_model import LinearRegression

# Combine the predictions of the base models into a single feature matrix
X_val_meta = np.column_stack((dt_pred, rf_pred, gb_pred))

# Train the meta-model on the combined feature matrix and the target values
meta_model = LinearRegression()
meta_model.fit(X_val_meta, y_val)

Step 5: Make Predictions on New Data

# Make predictions on new data (these 13 feature values are arbitrary, for illustration only)
X_new = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]])
dt_pred_new = dt.predict(X_new)
rf_pred_new = rf.predict(X_new)
gb_pred_new = gb.predict(X_new)

# Combine the predictions of the base models into a single feature matrix
X_new_meta = np.column_stack((dt_pred_new, rf_pred_new, gb_pred_new))

# Make a prediction using the meta-model
y_new_pred = meta_model.predict(X_new_meta)

print("Predicted median value of owner-occupied homes: ${:.2f} thousand".format(y_new_pred[0]))


##### Result #####
Predicted median value of owner-occupied homes: $49.75 thousand

This example demonstrates how to make predictions on a real-world dataset using stacking, an ensemble learning technique. We trained three base models (Decision Tree, Random Forest, and Gradient Boosting) and used a linear regression meta-model to aggregate their predictions. The trained models and meta-model were then used to make predictions on new, previously unseen data. By combining the strengths of different algorithms, stacking can help improve the accuracy of machine learning models.
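One step from the walkthrough that the code above does not show is the final evaluation on a separate test set (step 8 in the process description). Here is a minimal sketch, assuming hypothetical arrays X_test and y_test that were held out at the very first split and never used to train the base models or the meta-model.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Base-model predictions on the held-out test set (X_test is hypothetical)
X_test_meta = np.column_stack((
    dt.predict(X_test),
    rf.predict(X_test),
    gb.predict(X_test),
))

# Final stacked predictions and their quality
y_test_pred = meta_model.predict(X_test_meta)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Test RMSE: {rmse:.2f}  R^2: {r2_score(y_test, y_test_pred):.2f}")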

The most commonly used stacking method

Simple stacking, also known as basic or vanilla stacking, is the most common form of stacking in machine learning, because it is relatively simple to implement and produces good results in many situations. In simple stacking, the base models’ predictions are combined with a simple aggregation function, such as averaging or summing, and the aggregated predictions are then used to train the meta-model, which makes the final prediction.

Stacking with meta-features, or stacking with distinct input features, can be more effective in some cases. The kind of stacking you use will depend on the specific problem you’re trying to solve, the characteristics of your data, and the performance of the models you’re using as base models.
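For comparison, the simple-aggregation variant can be realized in scikit-learn with VotingRegressor, which averages the base models’ predictions instead of learning weights for them; full stacking replaces that fixed average with a trained meta-model. The sketch below uses synthetic data purely for illustration.

from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5, random_state=0)

# VotingRegressor simply averages the base models' predictions
avg_ensemble = VotingRegressor(estimators=[
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("rf", RandomForestRegressor(random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
])
avg_ensemble.fit(X, y)
print(avg_ensemble.predict(X[:3]))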

Conclusion

Stacking is a powerful ensemble learning strategy that can dramatically increase the predictive performance of machine learning models. By merging the predictions of many base models, stacking can reduce bias and variance, increase model diversity, and improve the interpretability of the final prediction.

It does, however, require careful implementation: selecting diverse base models, avoiding data leakage, measuring performance accurately, tuning hyperparameters, and taking domain-specific aspects into account. If these best practices are followed, stacking can be a valuable tool in the machine learning practitioner’s arsenal for building more accurate and robust predictive models.

Complete tutorial for Ensemble Learning👇

Introduction: https://medium.com/@brijeshsoni121272/improving-machine-learning-predictions-with-ensemble-learning-a8646e00be1c

Bagging: https://medium.com/@brijeshsoni121272/boost-your-machine-learning-models-with-bagging-a-powerful-ensemble-learning-technique-692bfc4d1a51

Boosting: https://medium.com/@brijeshsoni121272/understanding-boosting-in-machine-learning-a-comprehensive-guide-bdeaa1167a6

Stacking: https://medium.com/@brijeshsoni121272/stacking-to-improve-model-performance-a-comprehensive-guide-on-ensemble-learning-in-python-9ed53c93ce28

If you find my notes to be of value, I would appreciate your support in creating additional content of a similar caliber.

👋👋Stay tuned and Happy learning!!👋👋

Find me here👇

GitHub || Linkedin || Profile Summary
