Bagging and Boosting in Machine Learning

Saurya Pande
7 min read · Aug 16, 2020


Bagging is a technique used in ML to make “weak classifiers” strong enough to make good predictions. The idea is to train many weak classifiers on our *data and then combine their results, either by averaging them or by picking the prediction made by the majority of these weak models. The main catch here is that all the weak classifiers are independent, i.e. they are not influenced by the errors or predictions of the other classifiers; each one predicts on its own.

Boosting is an ensemble technique in which the weak classifier models are not made to work independently, but sequentially. The logic behind this is that each subsequent weak classifier learns from the mistakes of the previous predictors: examples with high error are given more weight in the next model, while examples that are already predicted well contribute less. The weak classifier can be chosen from a range of models such as decision trees, regressors, classifiers, etc. The stopping criterion should be chosen carefully or it could lead to overfitting on the training data.

*data: The data we feed to these weak classifiers is the training data. Randomness is introduced into the data (for example, by taking a different bootstrap sample for each classifier) before feeding it to them, to prevent overfitting.
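To make the contrast concrete, here is a minimal sketch using scikit-learn's BaggingClassifier and AdaBoostClassifier; the synthetic dataset and all parameter values are illustrative choices, not from the article. Both estimators default to decision trees as their weak learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many decision trees trained independently on random bootstrap
# samples of the training data; predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: shallow decision trees trained sequentially, each one paying
# more attention to the examples the previous trees got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

print("bagging accuracy :", bagging.score(X_test, y_test))
print("boosting accuracy:", boosting.score(X_test, y_test))
```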

What are weak classifiers/strong classifiers?

Fig 1

If our model makes predictions around 0.5, the classifier is “weak”. To see why, suppose we need to identify from the data whether an image is a “dog” or a “cat”, where “0” denotes dog and “1” denotes cat. If our classifier outputs values close to 0.5, it cannot tell with any confidence whether the image is a cat or a dog.

A “strong classifier”, on the other hand, will predict values close to either 0 or 1, which shows that the model is confident enough to identify whether the image is a cat or a dog.
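As a toy illustration of this idea (the numbers below are made up), the average distance of the predicted probabilities from 0.5 can be read as a rough measure of confidence:

```python
import numpy as np

# Hypothetical predicted probabilities of "cat" (1) vs. "dog" (0).
weak_preds = np.array([0.52, 0.48, 0.55, 0.47])    # hovers around 0.5
strong_preds = np.array([0.97, 0.03, 0.98, 0.02])  # close to 0 or 1

# Distance from 0.5: near 0 means unsure, near 0.5 means confident.
print("weak classifier confidence  :", np.abs(weak_preds - 0.5).mean())
print("strong classifier confidence:", np.abs(strong_preds - 0.5).mean())
```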

Boosting and Bagging Algorithms:

We need to select the weak classifier models that the boosting algorithm will combine to make them work effectively. Decision trees are a common choice of weak learner. Although decision trees are fast and easy to understand, they have very high variance: they tend to overfit the data and completely memorize it rather than generalize from it. That is why they are considered weak learners.
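The variance problem is easy to reproduce; in this small sketch (synthetic data, illustrative settings) an unrestricted decision tree scores perfectly on the training set but noticeably worse on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so the memorization gap is visible.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree keeps splitting until it memorizes the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy :", tree.score(X_test, y_test))    # noticeably lower
```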

Some common boosting algorithms are:

  1. AdaBoost
  2. XGBoost
  3. Gradient Tree Boosting

A common bagging algorithm is:

  1. Random Forest

BAGGING ALGORITHM

Random Forest:

Fig 2.1

TASK: The table represents the data we have, with each column representing a feature (Gender, Age, Location, Job, Hobby, App). We are building an app recommendation system, i.e. predicting which app is more likely to be downloaded by which category of people. For example, we can see from the data that a person with [Gender: F, Age: 15, Location: US and so on…] is more likely to download Pokemon Go.

FEATURES: Here the output feature is “App” while the other features are input features. We feed the input features to our model and it predicts which app will be downloaded by that person.

WEAK LEARNER CLASSIFIER: For the model, we are using “Decision Trees”. We can clearly see in the flowchart in Fig 2.1 that the decision tree has memorized every input record, i.e. overfitting has occurred.

Fig 2.2

CONCEPT: We select a random subset of the input features, such as (Gender, Job, Hobby), and train a decision tree on it. We repeat this procedure several times, so we end up with many decision trees, each with its own predicted output. The final prediction is the output given by the majority of the decision trees.
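A minimal sketch of this idea with scikit-learn's RandomForestClassifier is shown below; the rows are invented stand-ins for the table in Fig 2.1, and the parameter values are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented rows mirroring the columns of Fig 2.1.
data = pd.DataFrame({
    "Gender":   ["F", "M", "F", "M", "F", "M"],
    "Age":      [15, 25, 32, 40, 12, 28],
    "Location": ["US", "US", "IN", "UK", "US", "IN"],
    "Job":      ["Student", "Engineer", "Doctor", "Engineer", "Student", "Artist"],
    "Hobby":    ["Gaming", "Chess", "Reading", "Gaming", "Gaming", "Music"],
    "App":      ["Pokemon Go", "Chess.com", "Kindle", "Pokemon Go", "Pokemon Go", "Spotify"],
})

X = pd.get_dummies(data.drop(columns="App"))  # one-hot encode the input features
y = data["App"]                               # output feature

# max_features="sqrt" means each split looks at only a random subset of the
# features, and each tree is trained on a bootstrap sample of the rows;
# this is the "random features, many trees, majority vote" idea above.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict(X.iloc[[0]]))  # majority vote across all 100 trees
importances = sorted(zip(X.columns, forest.feature_importances_), key=lambda t: -t[1])
print(importances[:3])              # the three most informative (one-hot) features
```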

Random Forest can be used both for classification and regression. A forest is composed of trees, i.e. “Decision Trees”, and generally the more trees it has, the more robust the forest is. It also provides a pretty good indicator of feature importance.

APPLICATION: Random forests have a variety of applications, such as recommendation engines, image classification, and feature selection. They can be used to classify loyal loan applicants, identify fraudulent activity, and predict diseases.

PROS OF RANDOM FOREST:

  1. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
  2. Random forests work well for a wider range of data than a single decision tree does.
  3. A random forest has lower variance than a single decision tree.
  4. Random forests are very flexible and can achieve very high accuracy.
  5. The random forest algorithm does not require data scaling; it maintains good accuracy even when the data is not scaled.
  6. Random forests maintain good accuracy even when a large proportion of the data is missing.

CONS OF RANDOM FOREST:

  1. Complexity is the main disadvantage of Random forest algorithms.
  2. Building a random forest is much harder and more time-consuming than building a single decision tree.
  3. More computational resources are required to implement the random forest algorithm.
  4. It is less intuitive when we have a large collection of decision trees.
  5. Making predictions with a random forest is time-consuming in comparison with other algorithms.

BOOSTING ALGORITHM

1. Gradient Boosting Algorithm:

Fig 3.1 Loss Function — Mean Square Error

The logic behind gradient boosting is to minimize the loss function using gradient descent, updating our prediction at each step according to the learning rate.
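To make that concrete, here is the squared-error loss from Fig 3.1 written out, together with its gradient with respect to each prediction (the notation is ours). A gradient-descent step on the predictions moves each one toward its residual, which is why the next weak learner is fitted to the residuals:

```latex
% Mean squared error over n training points, with predictions \hat{y}_i
L = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

% Gradient of the loss with respect to each prediction
\frac{\partial L}{\partial \hat{y}_i} = -\frac{2}{n}\left( y_i - \hat{y}_i \right)

% Gradient-descent update with learning rate \eta: a step toward the residual
\hat{y}_i \leftarrow \hat{y}_i - \eta \, \frac{\partial L}{\partial \hat{y}_i}
          = \hat{y}_i + \frac{2\eta}{n}\left( y_i - \hat{y}_i \right)
```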

Fig 3.2

CONCEPT:

  1. Fit a weak classifier model, such as linear regression or a decision tree, on the data and get the predicted output (Output-1).
  2. Calculate the error using any loss function you like, e.g. MSE (Mean Square Error). [Error = Actual_Output − Output-1]
  3. Fit another weak classifier model on this loss/error and get its output (Output-2).
  4. Add both outputs (Output-1 + Output-2). We can relate this to Fig 3.2, and the learning rate is introduced here. **Remember that Output-2 is the outcome of fitting a model to the error.
  5. Compute the new error between the actual output and the combined prediction (Output-1 + Output-2), fit yet another weak model to it, and repeat the above procedure until the combined prediction stops improving. Also, make sure not to overfit the data; a minimal code sketch of these steps follows.
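Here is a bare-bones sketch of these steps using decision-tree regressors as the weak learners; the data is synthetic and the names (learning_rate, n_rounds) are ours, not from the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data, just for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

learning_rate = 0.1
n_rounds = 100

prediction = np.full_like(y, y.mean())  # start from a constant prediction
trees = []

for _ in range(n_rounds):
    residual = y - prediction                      # step 2: error of the current model
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                          # step 3: fit a weak model to the error
    prediction += learning_rate * tree.predict(X)  # steps 4-5: add its scaled output
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```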

2. XGBoost Algorithm:

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. Although it is built on top of gradient boosting, it is a major enhancement over it: through system optimizations and algorithmic enhancements, it outperforms plain gradient boosting. A short usage sketch follows the two lists below.

System Optimization:

  1. Parallelization: XGBoost approaches the process of sequential tree building with a parallelized implementation. This is possible due to the interchangeable nature of the loops used for building base learners: the outer loop that enumerates the leaf nodes of a tree, and the inner loop that calculates the features. This nesting of loops limits parallelization because the outer loop cannot start until the inner loop (the more computationally demanding of the two) has completed. Therefore, to improve run time, the order of the loops is interchanged using initialization through a global scan of all instances and sorting using parallel threads. This switch improves algorithmic performance by offsetting the parallelization overheads in computation.
  2. Tree Pruning: The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of split. XGBoost instead grows trees up to the specified ‘max_depth’ and then prunes them backward. This ‘depth-first’ approach improves computational performance significantly.
  3. Hardware Optimization: The algorithm has been designed to make efficient use of hardware resources. This is accomplished through cache awareness, by allocating internal buffers in each thread to store gradient statistics. Further enhancements such as ‘out-of-core’ computing optimize the available disk space when handling big data frames that do not fit into memory.

Algorithmic Enhancements:

  1. Regularization: It penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.
  2. Sparsity Awareness: XGBoost naturally admits sparse features as inputs by automatically ‘learning’ the best way to handle missing values based on the training loss, and it handles different types of sparsity patterns in the data more efficiently.
  3. Weighted Quantile Sketch: XGBoost employs the distributed weighted Quantile Sketch algorithm to effectively find the optimal split points among weighted datasets.
  4. Cross-validation: The algorithm comes with a built-in cross-validation method at each iteration, removing the need to explicitly program this search and to specify the exact number of boosting iterations required in a single run.
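The sketch below shows how these pieces surface in the xgboost Python package; it assumes the package is installed, and the parameter values are illustrative rather than tuned.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,      # depth-first growth, pruned backward (Tree Pruning above)
    learning_rate=0.1,
    reg_alpha=0.1,    # L1 (LASSO) regularization (Regularization above)
    reg_lambda=1.0,   # L2 (Ridge) regularization
    n_jobs=-1,        # parallelized tree construction
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Built-in cross-validation via the native API (Cross-validation above).
dtrain = xgb.DMatrix(X_train, label=y_train)
cv_results = xgb.cv(
    params={"max_depth": 4, "eta": 0.1, "objective": "binary:logistic"},
    dtrain=dtrain,
    num_boost_round=200,
    nfold=5,
    early_stopping_rounds=10,
)
print("boosting rounds kept after early stopping:", len(cv_results))
```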
