For taking steps to know about Data Science and Machine Learning, till now in my blogs, I have covered briefly an introduction to Data Science, Python, Statistics, Machine Learning, Regression, Linear Regression, Logistic Regression and Decision Trees. In this sixth of the series, I shall cover Boosting, an ensemble method.
Introduction to Boosting :
“Alone we can do so little and together we can do much” — Helen Keller
A road to success is incomplete without any failures in life. Each failure teaches you something new and makes you stronger at each phase. Each time you make a mistake, it’s important to learn from it and try not to repeat it again. Just as we sometimes develop life skills by learning from our mistakes, we can train our model to learn from the errors predicted and improvise the model’s prediction and overall performance. This is the most basic intuition of Boosting algorithm in Machine Learning.
Bagging (stands for Bootstrap Aggregating): It is an approach where you take random samples of data, build learning algorithms and take simple means to find bagging probabilities.
Boosting: Boosting is similar, however the selection of sample is made more intelligently. We subsequently give more and more weight to hard and classify observations.
Ensemble is a machine learning concept in which multiple models are trained using the same learning algorithm. Bagging is a way to decrease the variance in the prediction by generating additional data for training from dataset using combinations with repetitions to produce multi-sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation. Boosting in general builds strong predictive models. Ensemble methods combine several decision trees classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.
The term Boosting refers to a family of algorithms which converts weak learner to strong learners. Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea of boosting is to train weak learners sequentially, each trying to correct its predecessor. A weak learner is defined to be a classifier that is only slightly correlated with the true classification. In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
To understand this definition, consider the following example of solving a problem of spam email identification:
How would you classify an email as SPAM or not? Our initial approach would be to identify ‘spam’ and ‘not spam’ emails using following criteria. If:
- Email has only one image file (promotional image), It’s a SPAM
- Email has only link(s), It’s a SPAM
- Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a SPAM
- Email from any official domain say ril.com, Not a SPAM
- Email from known source, Not a SPAM
Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, these rules individually are not strong enough to successfully classify an email i.e. individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules are called as weak learner.
To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:
· Using average/ weighted average
· Considering prediction has higher vote
For example: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because we have higher (3) vote for ‘SPAM’.
How Boosting Algorithms works?
Boosting combines weak learner (base learner) to form a strong rule. To find weak rule, apply base learning (ML) algorithms with a different distribution. Each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
For choosing the right distribution, here are the following steps:
Step 1: The base learner takes all the distributions and assign equal weight or attention to each observation.
Step 2: If there is any prediction error caused by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting pays higher focus on examples which are mis-classiﬁed or have higher errors by preceding weak rules.
Suppose we have a binary classification task. A weak learner has an error rate that is slightly lesser than 0.5 in classifying the object, i.e. the weak learner is slightly better than deciding from a coin toss. A strong learner has an error rate closer to 0. To convert a weak learner into strong learner, we take a family of weak learners, combine them and vote. This turns this family of weak learners into strong learners. The idea here is that the family of weak learners should have a minimum correlation between them.
Types of Boosting
The accuracy of a predictive model can be boosted in two ways: Either by embracing feature engineering or by applying boosting algorithms straight away. It is preferred to work with boosting algorithms as it takes less time and produces similar results.
There are multiple boosting algorithms like AdaBoost, Gradient Boosting, XGBoost, etc. Every algorithm has its own underlying mathematics and a slight variation is observed while applying them.
- AdaBoost (Adaptive Boosting)
AdaBoost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous error, the observations that were incorrectly classified now carry more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problem.
The diagram shown below, aptly explains AdaBoost.
Box 1: You can see that we have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or — (minus). The decision stump (D1) has generated vertical line at left side to classify the data points. We see that, this vertical line has incorrectly predicted three + (plus) as — (minus). In such case, we’ll assign higher weights to these three + (plus) and apply another decision stump.
(Box 1) (Box 2) (Box 3) (Box 4)
Box 2: Here, you can see that the size of three incorrectly predicted + (plus) is bigger as compared to rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) at right side of this box has classified three mis-classified + (plus) correctly. But again, it has caused mis-classification errors. This time with three -(minus). Again, we will assign higher weight to three — (minus) and apply another decision stump.
Box 3: Here, three — (minus) are given higher weights. A decision stump (D3) is applied to predict these mis-classified observation correctly. This time a horizontal line is generated to classify + (plus) and — (minus) based on higher weight of mis-classified observation.
Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction having complex rule as compared to individual weak learner. You can see that this algorithm has classified these observation quite well as compared to any of individual weak learner.
AdaBoost works on similar method as discussed above. It fits a sequence of weak learners on different weighted training data. It starts by predicting original data set and gives equal weight to each observation. If prediction is incorrect using the first learner, then it gives higher weight to observation which have been predicted incorrectly. Being an iterative process, it continues to add learner(s) until a limit is reached in the number of models or accuracy.
Mostly, we use decision stamps with AdaBoost. But, we can use any machine learning algorithms as base learner if it accepts weight on training data set. We can use AdaBoost algorithms for both classification and regression problem.
Advantages of AdaBoost :
- Can be used with many different classifiers
- Improves classification accuracy
- Commonly used in many areas
- Simple to implement
- Does feature selection resulting in relatively simple classifier
- Not prone to overfitting
- Fairly good generalization
Disadvantages of AdaBoost :
Ø The drawback of AdaBoost is that it is easily defeated by noisy data, the efficiency of the algorithm is highly affected by outliers as the algorithm tries to fit every point perfectly.
Ø Even though this algorithm tries to fit every point, it doesn’t overfit.
Ø Suboptimal solution
from sklearn.ensemble import AdaBoostClassifier # For Classification
from sklearn.ensemble import AdaBoostRegressor # For Regression
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier() clf = AdaBoostClassifier(n_estimators=100, base_estimator=dt,learning_rate=1)
# We have used decision tree as a base estimator. We can use any ML learner as base estimator, if it accepts sample weight
You can tune the parameters to optimize the performance of algorithms, The key parameters for tuning:
- n_estimators : It controls the number of weak learners.
- learning_rate :Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
- base_estimators : It helps to specify different ML algorithm.
You can also tune the parameters of base learners to optimize its performance.
- Gradient Boosting
Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights for every incorrect classified observation at every iteration like AdaBoost, Gradient Boosting method tries to fit the new predictor to the residual errors made by the previous predictor. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The objective of any supervised learning algorithm is to define a loss function and minimize it.
Gradient Boosting Machine (GBM) uses Gradient Descent to find the shortcomings in the previous learner’s predictions and are an extremely popular machine learning algorithm. In GBM, we take up a weak learner and at each step, we add another weak learner to increase the performance and build a strong learner. This reduces the loss of the loss function. We iteratively add each model and compute the loss. The loss represents the error residuals (the difference between actual value and predicted value) and using this loss value the predictions are updated to minimise the residuals.
GBM algorithm can be given by following steps.
- Fit a model to the data, F1(x) = y
- Fit a model to the residuals, h1(x) = y−F1(x)
- Create a new model, F2(x) = F1(x) + h1(x)
- By combining weak learner after weak learner, our final model is able to account for a lot of the error from the original model and reduces this error over time.
GBM is the most widely used algorithm.
The intuition behind gradient boosting algorithm is to repetitively leverage the patterns in residuals and strengthen a model with weak predictions and make it better. Once we reach a stage that residuals do not have any pattern that could be modeled, we can stop modeling residuals (otherwise it might lead to overfitting). Algorithmically, we are minimizing our loss function, such that test loss reach its minima.
Advantages of GBM:
- Often provides predictive accuracy that cannot be beat.
- Lots of flexibility — can optimize on different loss functions and provides several hyper-parameter tuning options that make the function fit very flexible.
- No data pre-processing required — often works great with categorical and numerical values as is.
- Handles missing data — imputation not required.
Disadvantages of GBM:
- GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting. Must use cross-validation to neutralize.
- Computationally expensive — GBMs often require many trees (>1000) which can be time and memory exhaustive.
- The high flexibility results in many parameters that interact and influence heavily the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
- Less interpretable although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).
from sklearn.ensemble import GradientBoostingClassifier # For Classification
from sklearn.ensemble import GradientBoostingRegressor # For Regression
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
- n_estimators: It controls the number of weak learners.
- learning_rate:Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
- max_depth: maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
You can tune loss function for better performance.
XGBoost stands for eXtreme Gradient Boosting and is another faster version of boosting learner. XGBoost is an implementation of gradient boosted decision trees designed for peed and performance. Gradient boosting machines are generally very slow in implementation because of sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance. XGBoost provides:
- Parallelization of tree construction using all of your CPU cores during training.
- Distributed Computing for training very large models using a cluster of machines.
- Out-of-Core Computing for very large datasets that don’t fit into memory.
- Cache Optimization of data structures and algorithm to make the best use of hardware.
XGBoost is similar to gradient boosting algorithm but it has a few tricks up its sleeve which makes it stand out from the rest.
Features of XGBoost are:
- Clever Penalisation of Trees
- A Proportional shrinking of leaf nodes
- Newton Boosting
- Extra Randomisation Parameter
In XGBoost the trees can have a varying number of terminal nodes and left weights of the trees that are calculated with less evidence is shrunk more heavily. Newton Boosting uses Newton-Raphson method of approximations which provides a direct route to the minima than gradient descent. The extra randomisation parameter can be used to reduce the correlation between the trees, as seen in the previous article, the lesser the correlation among classifiers, the better our ensemble of classifiers will turn out. Generally, XGBoost is faster than gradient boosting but gradient boosting has a wide range of application.
Advantages of XGBoost :
- Standard GBM implementation has no regularization like XGBoost, therefore it also helps to reduce overfitting.
- In fact, XGBoost is also known as regularized boosting technique.
- Parallel Processing:
- XGBoost implements parallel processing and is blazingly faster as compared to GBM.
- But hang on, we know that boosting is sequential process so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I’m coming from. Check this link out to explore further.
- XGBoost also supports implementation on Hadoop.
- High Flexibility
- XGBoost allow users to define custom optimization objectives and evaluation criteria.
- This adds a whole new dimension to the model and there is no limit to what we can do.
- Handling Missing Values
- XGBoost has an in-built routine to handle missing values.
- User is required to supply a different value than other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in future.
- Tree Pruning:
- A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
- XGBoost on the other hand make splits upto the max_depth specified and then start pruning the tree backwards and remove splits beyond which there is no positive gain.
- Another advantage is that sometimes a split of negative loss say -2 may be followed by a split of positive loss +10. GBM would stop as it encounters -2. But XGBoost will go deeper and it will see a combined effect of +8 of the split and keep both.
- Built-in Cross-Validation
- XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
- This is unlike GBM where we have to run a grid-search and only a limited values can be tested.
- Continue on Existing Model
- User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
- GBM implementation of sklearn also has this feature so they are even on this point.
Ensemble methods are learning models that achieve performance by combining the opinions of multiple learners. Typically, an ensemble model is a supervised learning technique for combining multiple weak learners or models to produce a strong learner with the concept of Bagging and Boosting for data sampling. Ensemble method is a combination of multiple models, that helps to improve the generalization errors which might not be handled by a single modeling approach.
The term Boosting refers to a family of algorithms which converts weak learner to strong learners. The main idea of boosting is to modify a weak learner to become better. Boosting is an ensemble technique in which the predictors are not made independently, but sequentially. This technique employs the logic in which the subsequent predictors learn from the mistakes of the previous predictors. Therefore, the observations have an unequal probability of appearing in subsequent models and ones with the highest error appear most. (So the observations are not chosen based on the bootstrap process, but based on the error). The predictors can be chosen from a range of models like decision trees, regressors, classifiers etc. Because new predictors are learning from mistakes committed by previous predictors, it takes less time/iterations to reach close to actual predictions. But we have to choose the stopping criteria carefully or it could lead to overfitting on training data.
Boosting can be used primarily for reducing bias and also variance. It is a weighted average approach, i.e. by iteratively learning and forming multiple weak learner and combining them to make a final strong learner. Adding of each weak learner to make a strong learner is related to the accuracy of each learner. The weight is recalculated after each weak learner is added to overcome the problem of misclassified gain weight.
Multiple boosting algorithms are available to use such as AdaBoost, Gradient Boost, XGBoost, etc. Different boosting algorithm has its own advantages over different types of dataset. By tuning multiple algorithms with a wider range of input data, classification or good predictive modeling can be build.
So in this series, we looked at Boosting, one of the method of ensemble modeling to enhance the prediction power.
“Failure is the key to success; each mistake teaches us something.” — Morihei Ueshiba