Ensemble Learning Methods
I want to start this article with a question: how many jelly beans do you think are in this jar? Give a wild guess…
A finance professor named Jack Treynor ran this exact experiment with 56 of his students and asked them to write down the number of jelly beans in a jar. Individually, the students came nowhere close to the correct answer; most guesses were far too high or far too low.
However, the average of those guesses, 871, came remarkably close to the actual number of jelly beans in the jar, 850, an error of only about 2%. Similar experiments have been repeated over and over with the same result: the wisdom of the crowd beats the wisdom of the few.
This is known as emergence: the idea that a large group of entities (a whole class of students) can have properties that none of the individual entities (a single student’s guess) have on their own.
Emergence: https://www.youtube.com/watch?v=16W7c0mb-rE
Emergence has found applications in machine learning as well, specifically in ensemble learning. Ensemble learning is a technique that combines multiple ‘dumb’ algorithms that individually perform poorly but together produce accurate predictions. It’s actually quite simple: you force each algorithm to compensate for the mistakes of the others, and you end up with an ensemble that performs better than the best individual algorithm.
Nowadays ensemble learning techniques are used for:
- Everything that fits a classical algorithm approach (they just work better)
- Search systems
- Computer vision
- Object detection
Popular algorithms: Random Forest, Gradient Boosting, AdaBoost
From research and experience, after neural networks, ensembles are the most used and most promising algorithms in the industry. They’re usually designed in three general ways: stacking, bagging, and boosting.
Stacking
Stacking is an ensemble learning method where multiple different ‘dumb’ algorithms are stacked (hence the name) on top of each other, each asked to learn the data as well as it can. All of these algorithms are fed the same data.
Just like elections in a utopian world, the final decision is democratic: each algorithm casts a vote.
If the algorithms predict a class, the problem is a classification problem, and the final output is simply the class with the most votes. If they predict a value, it is a regression problem, and the final output is the mean of the individual predictions.
The ensemble also generalizes better. Because we use different algorithms learning from the same data, and different algorithms have different inherent biases, they fill in each other’s gaps, resulting in a model that generalizes better than any one model individually.
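To make the voting concrete, here is a minimal sketch using scikit-learn’s VotingClassifier, which does exactly the vote counting described above (scikit-learn also offers a StackingClassifier that trains an extra meta-model on top of the base predictions, a slightly fancier take on the same idea). The toy dataset and hyperparameters below are placeholders of mine, not anything from the article:

```python
# A minimal voting-ensemble sketch; assumes scikit-learn and a toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different 'dumb' algorithms, all fed the same data.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
    n_jobs=-1,      # the base models are independent, so they can be fitted in parallel
)
voter.fit(X_train, y_train)
print("ensemble accuracy:", voter.score(X_test, y_test))
```

For regression, the equivalent VotingRegressor averages the predicted values instead of counting votes, matching the classification/regression split above.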
An advantage of the stacking method is that the individual algorithms can be run as parallel processes on the processor, saving time at a small cost in accuracy; in real-time implementations this trade-off is worth it.
Stacking is not as popular, simply because the other two methods work better.
Bagging
In Bagging (or Bootstrap AGGregatING) you have copies of the same simple algorithm working together to understand the data. Compared to stacking, bagging uses a single algorithm, but trains each copy on a different random subset of the data.
If Decision Trees are used, the whole bagging algorithm ends up using a bunch of trees to make the final decision; this particular combination of Decision Trees is known as a Random Forest.
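As a minimal sketch (the toy data and hyperparameters are assumptions of mine, not from the article), this is roughly how bagging Decision Trees looks in scikit-learn; RandomForestClassifier is essentially the same pattern packaged up, with extra feature randomness at every split:

```python
# A minimal bagging sketch; the `estimator` argument assumes scikit-learn >= 1.2
# (older versions call it `base_estimator`).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Same algorithm (a Decision Tree), each copy trained on a random bootstrap subset of the data.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    n_jobs=-1,       # the trees are independent, so training runs in parallel
    random_state=0,
)

# Random Forest: the same idea, plus random feature sampling at every split.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

print("bagged trees: ", cross_val_score(bagged_trees, X, y).mean())
print("random forest:", cross_val_score(forest, X, y).mean())
```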
They’re pretty famous in the world of tech. Why, you ask? Because they work pretty well.
The Xbox 360 Kinect used Random Forests for real-time pose estimation from a single depth image. Even though the Kinect eventually failed as a product, the technology was novel and innovative for its time (it was released literally more than 10 years ago).
Similar to stacking, bagging can also be parallelized due to the structure of the algorithm: the data is randomly split at the beginning, so the learning processes can run independently and simultaneously.
Boosting
Boosting works similarly to bagging: multiple algorithms are run to reach a final decision. However, instead of running in parallel, they run in succession. The data that is incorrectly classified by the first algorithm is passed down to the second one with increased importance, forcing that algorithm to focus on the examples the previous one got wrong. This process continues until all of the data is classified correctly, or until we’re satisfied with the accuracy of the model.
Compared with stacking and bagging, boosting cannot parallelize its computations, because the examples passed to the second classifier, and their weights, depend on what the first classifier got wrong.
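Here is a minimal AdaBoost sketch (again, the dataset and parameters are placeholders of mine): each round fits a tiny one-level tree, then up-weights the examples the earlier rounds misclassified, which is exactly why the rounds have to run one after another.

```python
# A minimal AdaBoost sketch; the `estimator` argument assumes scikit-learn >= 1.2
# (older versions call it `base_estimator`).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each weak learner is a one-level tree (a "stump"). Rounds are trained sequentially:
# examples misclassified in one round get a larger weight in the next.
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=0,
)
booster.fit(X_train, y_train)
print("AdaBoost accuracy:", booster.score(X_test, y_test))
```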
Boosting is used in one of the most famous face detection algorithms, Viola-Jones (it’s been around for quite a while, since 2001 according to Wikipedia). In short, the algorithm looks for faces in an image using very simple features, which are selected and combined with AdaBoost (Adaptive Boosting).
Summing up
I hope this article helped you get a clear general idea of ensemble learning algorithms and how they are used across different industries. I’d love to hear what you’d like to read next; let me know down below in the comment section!
You can always connect with me via LinkedIn.