A visual guide
If you’re following Kaggle competitions, you must have noticed that boosting algorithms win most of the competitions in “structured data” class. In this article, I will try to explain how the boosting algorithms work and where their power comes from.
In general, boosting algorithms attempt to obtain a strong learner by combining weak learners created in each iteration. I will try to show what this means first for a classification problem using Adaboost algorithm. Then we will look at a regression problem with Gradient Boosting.
If you want to examine the shared graphics or codes in detail and make experiments, you can reach all of them via my GitHub account.
Adaboost, developed by Robert Schapire and Yoav Freund in 1996, won the prestigious Gödel award as the first successful boosting algorithm.
Let’s assume that we have a classification problem that we try to classify the blue and red dots as on the left.
The solution seems to be very easy when we can visualize it, but most of the time we don’t have the chance to deal with data that can be visualized in this way.
The algorithm will draw the best solution in the first iteration where all points are equal. In this case, a distinction will be created on the left. As you can see, the 3 points I marked with yellow are on the wrong side. For this reason, we need to increase their weight for the 2nd iteration. But how?
In the 1st iteration, we have 7 correctly and 3 incorrectly classified points. Let’s assume we want to bring our solution into a 50/50 balance situation. Then we need to multiply the weight of incorrectly classified points with (correct/incorrect) which is (7/3 ≈ 2.33). If we increase the weight of the incorrectly classified 3 points to 2.33, our model will be 50–50%. We keep the results of 1st classification in our minds and go to the 2nd iteration.
In the 2nd iteration, the best solution is as on the left. Correctly classified points have a weight of 11, whereas incorrectly classified points’ weight is 3.
To bring the model back to a 50/50 balance, we need to multiply the incorrectly classified points’ weight by (11/3 ≈ 3.66).
With new weights, we can take our model to the 3rd iteration.
The best solution for the 3rd iteration is as on the left. The weight of the correctly classified points is 19, while the weight of the incorrectly classified points is 3 (once again).
We can continue the iterations, but let’s assume that we end it here.
We have now reached the stage of combining the 3 weak learners. But how do we do that?
ln(correct/incorrect) seems to give us the coefficients we want.
If we think a little about it, for example, a solution with an equal number of correct and incorrect weight will not help us. Thus, ln (1) → 0 meets our expectations.
As the correct number increases, the coefficient of the formula will gradually increase towards infinite, and the weight of a solution that gives all the results correct will be infinite — ln(∞)→∞.
A model that gives a lot of incorrect information also helps us. We just need to reverse the model. A model that makes incorrect assumptions at all will have a negative infinite weight — ln(0)→-∞
If we consider the blue region as positive and the red region as negative; we can combine the result of 3 iterations like the picture on the left.
As we sum the weight of each section in each iteration, the result is slowly starting to emerge. We can see the result of our algorithm if we make the positive parts in blue and the negatives in red.
Adaboost solved our problem in a wonderful way.
While creating this example, I was inspired by Udacity Machine Learning Engineer Nanodegree Boosting Algorithms section. In the solution, we use an AdaBoostClassifier with 3 iterations using a DecisionTreeClassifier with a maximum depth of 1. Our code is as follows.
This algorithm is based on the studies of Leo Breiman.
Let’s suppose we have a regression problem in which we try to predict target values for the graph on the left.
At the first iteration, Gradient Boosting creates a function “F” to calculate the predictions with the given data. Then it calculates the difference (residual) between the real values (target) and the predictions and creates a function “h” for the residuals. In the 2nd iteration, the algorithm combines the functions “F” and “h” and makes new predictions and calculates the difference between predictions and real values. In this way, by continuously adding over function “F”, it tries to increase its success and minimize the residuals.
In the following image, you can see the prediction of the model after the first iteration as the red line in the graph on the left. I have shown the residual between the predictions and target values for each x value on the right graph.
As the iterations progress, the success of the model will increase. Below you can see the results of the 10th iteration.
In the 25th and 50th iteration results, you can now see that the prediction and target difference of the model approaching zero. In fact, the 50th iteration results show that the model is starting to overfit a little. To avoid overfitting, it is useful to compare the results with a separate validation dataset and find the appropriate number of iterations for you.
XGBoost, which is created by Tianqi Chen and Carlos Guestrin with the addition of some advanced features to the logic of Gradient Boosting, is now renowned as the most widely used algorithm by the Kaggle competitors. The algorithm is used in some products of companies such as Uber, Airbnb, Amazon and Google Cloud.