Ensemble Models
A guide to learning ensemble techniques in a simple walkthrough
In everyday life, when we make an important decision such as applying for a university program or signing a job contract, we tend to seek advice. We collect as much information as possible and reach out to multiple experts. Since such decisions can affect our future, we place more trust in a decision reached through this process. Machine learning predictions follow a similar approach: a model processes the given inputs and produces an outcome, a prediction based on the patterns the model has seen during training.
In many cases, one model is not enough, and this article sheds light on why. When and why do we need multiple models? How do we train those models? What kind of diversity do they provide? So, let us jump right into the subject, starting with a quick overview.
Overview
Ensemble modeling is a machine learning approach that combines multiple models in the prediction process. Those models are referred to as base estimators. Ensembling is a solution to the following technical challenges of building a single estimator:
- High variance: The model is very sensitive to the inputs provided to the learned features.
- Low accuracy: One model or one algorithm to fit the entire training data might not be good enough to meet expectations.
- Feature noise and bias: The model relies heavily on a single feature, or a small set of features, when making a prediction.
Ensemble Algorithm
A single algorithm may not make the perfect prediction for a given dataset. Machine learning algorithms have their limitations, and producing a model with high accuracy is challenging. If we build and combine multiple models, the overall accuracy can be boosted. The combination is implemented by aggregating the output of each model, with two objectives: reducing the model error and maintaining its generalization. Several techniques, covered later in this article, can be used to implement such aggregation. Some textbooks refer to this architecture as meta-algorithms.
Ensemble Learning
Building ensemble models is not only about the variance of the algorithm used. For instance, we could build multiple C4.5 decision-tree models where each model learns a specific pattern, specializing in predicting one aspect. Such models are called weak learners and can be used to obtain a meta-model. In this architecture of ensemble learners, the inputs are passed to each weak learner while their predictions are collected. The combined predictions can then be used to build a final ensemble model.
One important aspect to mention is that those weak learners can map the features in different ways, with varying decision boundaries.
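As an illustration only, here is a minimal sketch of this idea using shallow scikit-learn decision trees as the weak learners; the synthetic dataset and the depth values are placeholders, not part of the original example.
# A few shallow decision trees acting as weak learners,
# each capturing a different, limited decision boundary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

weak_learners = [DecisionTreeClassifier(max_depth=d, random_state=0) for d in (1, 2, 3)]
predictions = []
for learner in weak_learners:
    learner.fit(X, y)
    predictions.append(learner.predict(X))

# The collected predictions can later feed a meta-model (see Stacking below).
meta_features = np.column_stack(predictions)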
Ensemble Techniques
Bagging
The idea of bagging is to make slightly different subsets of the training data available to several models that are trained independently, and then aggregate their predictions. Bagging reduces variance and minimizes overfitting. One example of such a technique is the Random Forest algorithm. A minimal sketch follows the list below.
- Bootstrapping: Bagging is based on the bootstrap sampling technique. Bootstrapping creates multiple sets from the original training data by sampling with replacement. Replacement enables the duplication of sample instances within a set. Each subset has the same size and can be used to train models in parallel.
- Random Forest: Uses a subset of the training samples as well as a subset of the features to build each split in the trees. Multiple decision trees are built, each fitting a different training subset. The sampling of instances and features is typically done at random.
- Extra-Trees Ensemble: Another ensemble technique where the predictions of many decision trees are combined. Similar to Random Forest, it combines a large number of decision trees. However, Extra-Trees uses the whole sample while choosing the splits at random.
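As a minimal, illustrative sketch (not part of the original example), bagging can be expressed with scikit-learn's BaggingClassifier, which performs the bootstrap sampling internally; the parameter values are arbitrary.
# Bagging: each tree is trained on a bootstrap sample drawn with replacement.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator trained on each bootstrap sample
    n_estimators=100,          # number of bootstrap samples / trees
    bootstrap=True,            # sample with replacement
    random_state=0,
).fit(X, y)

# Random Forest adds random feature selection at each split on top of bagging.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)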
Boosting
- Adaptive Boosting (AdaBoost): An ensemble technique where we build a model on top of several weak learners [1]. As mentioned earlier, those learners are called weak because they are typically simple with limited prediction capabilities. The adaptation capability of AdaBoost made this technique one of the earliest successful binary classifiers. Sequential decision trees are the core of this adaptability: each tree adjusts the sample weights based on the accuracy of the previous trees. Hence, training in this technique is performed sequentially rather than in parallel. The process of training and measuring the estimation error is repeated for a given number of iterations or until the error rate stops changing significantly. A minimal sketch appears after this list.
- Gradient Boosting: Gradient boosting algorithms are powerful techniques with high predictive performance. XGBoost [2], LightGBM [3], and CatBoost are popular gradient boosting implementations that can be used for regression and classification problems. Their popularity increased significantly after their proven ability to win several Kaggle competitions.
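As another illustrative sketch, both boosting flavours are available in scikit-learn; the parameter values below are placeholders rather than tuned settings.
# AdaBoost: weak learners are trained sequentially, each one reweighting the
# samples that the previous learners misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0).fit(X, y)

# Gradient boosting fits each new tree to the errors of the ensemble built so far.
gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0).fit(X, y)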
Stacking
We have seen earlier that combining models can be implemented with an aggregating method (for example, voting for classification or averaging for regression models). Like boosting, stacking produces more robust predictors: it is a process of learning how to build a stronger model from all of the weak learners' predictions.
Please note that what is being learned here (as features) is the prediction from each model.
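For reference, scikit-learn ships a ready-made StackingClassifier; the sketch below is illustrative only, and the choice of base estimators and parameters is arbitrary.
# Stacking: the base learners' predictions become the features of a final estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
                ('svc', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # learns from the base predictions
    cv=5,                                  # out-of-fold predictions avoid leakage
).fit(X, y)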
Blending
Blending is very similar to the stacking approach, except that the final model learns from the base models' predictions on a held-out validation set rather than from out-of-fold predictions on the training set. Hence, the features used by the meta-model come from the validation set.
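A minimal blending sketch, assuming a simple train/validation split and two arbitrary base models; none of these choices come from the original example.
# Blending: base models are trained on the training split; the meta-model learns
# from their predictions on a held-out validation split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               AdaBoostClassifier(n_estimators=100, random_state=0)]
val_preds = np.column_stack(
    [m.fit(X_train, y_train).predict_proba(X_val)[:, 1] for m in base_models])

# The meta-model (blender) is trained on the validation-set predictions.
blender = LogisticRegression().fit(val_preds, y_val)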
Classification Problems
Classification is simply a categorization process. If we have multiple labels, we need to decide: shall we build a single multi-label classifier, or shall we build multiple binary classifiers? If we decide to build several binary classifiers, we need to interpret each model's prediction. For instance, if we want to recognize four objects, each model tells us whether the input data belongs to its category; hence, each model provides a probability of membership. Similarly, we can build a final ensemble model that combines those classifiers.
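A small sketch of the multiple-binary-classifiers idea, using scikit-learn's one-vs-rest wrapper on a synthetic four-class problem; this is purely illustrative.
# One binary classifier per class: each reports the probability of membership,
# and the class with the highest probability wins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
membership = ovr.predict_proba(X)            # one probability column per class
final_class = np.argmax(membership, axis=1)  # pick the most likely category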
Regression Problems
In the previous case, we determined the best-fitting class membership using the resulting probabilities. In regression problems, we are not dealing with yes-or-no questions; we need to find the best predicted numerical values, and we can simply average the collected predictions.
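A minimal sketch of averaging regression predictions, with arbitrary regressors on synthetic data:
# Average the numeric predictions of several regressors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

regressors = [LinearRegression(),
              RandomForestRegressor(n_estimators=100, random_state=0),
              GradientBoostingRegressor(n_estimators=100, random_state=0)]
predictions = np.column_stack([r.fit(X, y).predict(X) for r in regressors])
ensemble_prediction = predictions.mean(axis=1)  # simple average of the models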
Aggregating Predictions
When we ensemble multiple algorithms, we need a method to aggregate their predictions into a single one. Three main techniques can be used (a small sketch after the list illustrates all three):
- Max Voting: The final prediction in this technique is made based on majority voting for classification problems.
- Averaging: Typically used for regression problems, where the predictions are averaged. Averaged probabilities can be used as well, for instance, to make the final classification.
- Weighted Average: Sometimes, we need to give more weight to certain models/algorithms when producing the final prediction.
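Here is a small sketch of all three aggregation methods on toy predictions; the numbers and weights are made up purely for illustration.
# Toy predictions from three classifiers for five samples (0/1 labels).
import numpy as np
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0]])
probas = np.array([[0.9, 0.4, 0.8, 0.7, 0.2],
                   [0.6, 0.7, 0.9, 0.4, 0.3],
                   [0.4, 0.3, 0.6, 0.8, 0.1]])

# Max voting: the label chosen by the majority of classifiers wins.
max_vote = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

# Averaging: mean probability across classifiers, thresholded at 0.5.
averaged = (probas.mean(axis=0) >= 0.5).astype(int)

# Weighted average: stronger models receive larger weights (weights are illustrative).
weights = np.array([0.5, 0.3, 0.2])
weighted = (np.average(probas, axis=0, weights=weights) >= 0.5).astype(int)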
Example
We will use the following example to illustrate how to build an ensemble model. The Titanic dataset is used, where we try to predict the survival of Titanic passengers using different techniques. A sample of the dataset and the distribution of the target with respect to the passengers' age are shown in Figure 6 and Figure 7.
The Titanic dataset is a classification problem that needs extensive feature engineering. Figure 8 shows the strong correlation between some features, such as Parch (parents and children) and Family Size. Here, we will focus only on the model building and how an ensemble model can be applied to this use case.
We will use different algorithms and techniques; therefore, we will create a model class to increase code reusability.
# Imports needed by the examples below
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import KFold, cross_val_score

# Model class to be used as a thin wrapper around different ML algorithms
class ClassifierModel(object):
    def __init__(self, clf, params=None):
        self.clf = clf(**(params or {}))
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
    def fit(self, x, y):
        return self.clf.fit(x, y)
    def feature_importances(self, x, y):
        return self.clf.fit(x, y).feature_importances_
    def predict(self, x):
        return self.clf.predict(x)

# Cross-validated accuracy of a wrapped classifier
# (x_test is kept in the signature for consistency but is not used here)
def trainModel(model, x_train, y_train, x_test, n_folds, seed):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(model.clf, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
Random Forest Classifier
# Random Forest parameters
rf_params = {
'n_estimators': 400,
'max_depth': 5,
'min_samples_leaf': 3,
'max_features' : 'sqrt',
}
rfc_model = ClassifierModel(clf=RandomForestClassifier, params=rf_params)
rfc_scores = trainModel(rfc_model,x_train, y_train, x_test, 5, 0)
rfc_scores
Extra Trees Classifier
# Extra Trees Parameters
et_params = {
'n_jobs': -1,
'n_estimators':400,
'max_depth': 5,
'min_samples_leaf': 2,
}
etc_model = ClassifierModel(clf=ExtraTreesClassifier, params=et_params)
etc_scores = trainModel(etc_model, x_train, y_train, x_test, 5, 0) # Extra Trees
etc_scores
AdaBoost Classifier
# AdaBoost parameters
ada_params = {
'n_estimators': 400,
'learning_rate' : 0.65
}
ada_model = ClassifierModel(clf=AdaBoostClassifier, params=ada_params)
ada_scores = trainModel(ada_model, x_train, y_train, x_test, 5, 0) # AdaBoost
ada_scores
Gradient Boosting Classifier
# Gradient Boosting parameters
gb_params = {
'n_estimators': 400,
'max_depth': 6,
}
gbc_model = ClassifierModel(clf=GradientBoostingClassifier, params=gb_params)
gbc_scores = trainModel(gbc_model, x_train, y_train, x_test, 5, 0) # Gradient Boosting
gbc_scores
Let's combine the cross-validation accuracies of all the models over the five folds.
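One possible way to collect those scores into a summary table, assuming pandas is available; this snippet is a sketch rather than the article's original code.
# Mean five-fold cross-validation accuracy of each base classifier.
import pandas as pd
cv_summary = pd.DataFrame({
    'RandomForest': rfc_scores,
    'ExtraTrees': etc_scores,
    'AdaBoost': ada_scores,
    'GradientBoosting': gbc_scores,
}).mean().to_frame('mean_cv_accuracy')
print(cv_summary)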
Now let's build a stacking model where a new, stronger model learns from the predictions of all these weak learners. The label vector used to train the previous models remains the same, while the features are the predictions collected from each classifier.
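The prediction vectors used in the next line are not defined in this excerpt; one common way to obtain them without leakage is out-of-fold prediction, for example with cross_val_predict as sketched below. The svc_train_pred referenced afterwards would come from a support-vector classifier that is not shown here.
# Out-of-fold predictions of each base model on the training data,
# to be used as features for the level-1 (stacking) model.
from sklearn.model_selection import cross_val_predict
rfc_train_pred = cross_val_predict(rfc_model.clf, x_train, y_train, cv=5)
etc_train_pred = cross_val_predict(etc_model.clf, x_train, y_train, cv=5)
ada_train_pred = cross_val_predict(ada_model.clf, x_train, y_train, cv=5)
gbc_train_pred = cross_val_predict(gbc_model.clf, x_train, y_train, cv=5)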
x_train = np.column_stack(( etc_train_pred, rfc_train_pred, ada_train_pred, gbc_train_pred, svc_train_pred))
Now let's see whether an XGBoost model that learns only from the resulting predictions performs better. But first, let's take a quick peek at the correlations between the classifiers' predictions.
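One quick way to take that peek, assuming the out-of-fold prediction vectors from the sketch above and pandas; this is illustrative only.
# Correlation between the base classifiers' predictions.
import pandas as pd
base_predictions = pd.DataFrame({
    'RandomForest': rfc_train_pred,
    'ExtraTrees': etc_train_pred,
    'AdaBoost': ada_train_pred,
    'GradientBoosting': gbc_train_pred,
})
print(base_predictions.corr())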
We will now build a model to combine the predictions from multiple contributing classifiers.
# Level-1 (stacking) model: an XGBoost classifier trained on the base classifiers' predictions
import xgboost as xgb

def trainStackModel(x_train, y_train, x_test, n_folds, seed):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    gbm = xgb.XGBClassifier(
        n_estimators=2000,
        max_depth=4,
        min_child_weight=2,
        gamma=0.9,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        scale_pos_weight=1).fit(x_train, y_train)
    scores = cross_val_score(gbm, x_train, y_train, scoring='accuracy', cv=cv)
    return scores
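A possible way to call it, assuming x_train now holds the stacked predictions built above; this usage line is a sketch, not the original notebook's code.
# Cross-validated accuracy of the level-1 stacking model on the stacked features.
stack_scores = trainStackModel(x_train, y_train, x_test, 5, 0)
stack_scores.mean()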
The previously created base classifiers represent the Level-0 models, and the new XGBoost model represents the Level-1 model. The combination illustrates a meta-model trained on the predictions of the sample data. A quick comparison between the accuracy of the new stacking model and the base classifiers is shown below:
Careful Considerations
- Noise, bias, and variance: Combining the decisions of multiple models can help improve the overall performance. Hence, one of the key reasons to use ensemble models is to overcome noise, bias, and variance. If the ensemble does not deliver this collective improvement in accuracy, then a careful rethinking of whether it is needed is necessary.
- Simplicity and explainability: Machine learning models, especially those put into production environments, should preferably be simple rather than complicated. The ability to explain the final model decision is, in practice, reduced with an ensemble.
- Generalization: It is often claimed that ensemble models generalize better, yet some reported use cases have shown higher generalization errors. Therefore, an ensemble model built without a careful training process can quickly overfit.
- Inference time: Although we might accept a longer time for model training, inference time is still critical. When deploying ensemble models into production, the time needed to pass the input through multiple models increases and could slow down the prediction throughput.
Summary
Ensemble modeling is an excellent approach in machine learning, with a variety of techniques for classification and regression problems. We have covered the types of such models, how to build a simple ensemble model, and how they boost model accuracy. A complete example of the code can be found on my GitHub.
Thank you!
References
[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[2] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[3] Ke, Guolin, et al. "LightGBM: A highly efficient gradient boosting decision tree." Advances in Neural Information Processing Systems 30 (2017): 3146–3154.