Using multiple AIs to create smarter AI (an overview of ensemble learning)

Masaya Mori 森正弥
7 min read · May 31, 2020


This article is about “ensemble learning”.

(If the title made you think of interactive loops between AIs, like the Back Translation Model published by Facebook AI Research in 2018, I touch on that at the end of this article.)

At CEATEC 2016, the largest exhibition in Japan for IoT and cyber-physical systems, I gave a talk titled Business and People Change in the Age of Artificial Intelligence. (Sorry, the linked article is in Japanese.)

In that talk, I used the example of a horse racing hackathon Rakuten held in 2015, where one of the participating teams built an app that picked winning horses. This is the story of how that app made some amazing predictions on the outing after the hackathon.

After the hackathon, we all went to Oi Racecourse to watch the races, and we took the time to run predictions with the team's winning-horse app. Incidentally, some horse racing professionals also made their own predictions, which unfortunately were off target.

The app successfully picked the winner of the first race. In the second race, the horse it picked didn't come in first, but it finished second by a short head. In the third race, the horse the app picked came in first again.

As you can see in the article, the members of this team knew nothing about horse racing before they participated in the hackathon. Yet they said they had carefully tried various machine learning methods and improved the accuracy. What they used was an approach called “ensemble learning”.

Although this technique is very widely used in data science and AI service development, it is not well known outside the field. That's why I would like to write a little about it.

What is ensemble learning? In simple terms, it is “a method that combines multiple machine learning models to obtain better prediction or classification performance than any of the individual models on its own.” I used the word performance, but what you gain is sometimes accuracy, sometimes stability, sometimes the removal of bias. There are many approaches, depending on what you want to improve.

For example, the most basic way to do it is this. First, you secure training data and test data, and pick one supervised learning method; let's say you choose k-nearest neighbors (KNN). You use the training data to train KNN and build a model. Then you do the same with other supervised learning methods: prepare a linear regression, a decision tree, or an SVM and train each of them on the training data to create more models. Now, when you put data in, you get a result from each model. If prediction is what you want, you average the results of the individual models (KNN, linear regression, decision tree, SVM). The combined output thus reflects the behavior of every model in the set, which reduces the impact of large errors (outliers) from any single model and gives you more stable predictions. (If classification is what you want, you can take a majority vote over the models' outputs instead, and you get the same kind of stable, accurate classification.)
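As a concrete illustration, here is a minimal sketch of this averaging ensemble in Python with scikit-learn; the synthetic dataset and the default model settings are my own illustrative assumptions, not anything from the hackathon.

```python
# Averaging ensemble sketch: train several different models on the same
# training data and average their predictions on the test data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train each base model separately on the same training data.
models = [KNeighborsRegressor(), LinearRegression(), DecisionTreeRegressor(), SVR()]
for m in models:
    m.fit(X_train, y_train)

# For prediction, simply average the outputs of the individual models.
ensemble_pred = np.mean([m.predict(X_test) for m in models], axis=0)
```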

You can also give each model a weight rather than treating them all equally, which can give you more accuracy. The strategy here is in line with the cross-validation approach I wrote about the other day: you adjust the weights using validation data.
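A rough sketch of that weighting idea, assuming (as one illustrative choice) that each model's weight comes from its score on held-out validation data:

```python
# Weighted averaging sketch: weights are derived from validation scores.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=0)

models = [KNeighborsRegressor(), LinearRegression(), DecisionTreeRegressor()]
for m in models:
    m.fit(X_fit, y_fit)

# Turn validation R^2 scores into normalized weights (clipped at zero).
scores = np.array([max(r2_score(y_val, m.predict(X_val)), 0.0) for m in models])
weights = scores / scores.sum()

# Weighted average of each model's test-set predictions.
weighted_pred = np.average([m.predict(X_test) for m in models], axis=0, weights=weights)
```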

As I wrote earlier, ensemble learning is “a method for using multiple machine learning models in combination to achieve better performance than the prediction or classification performance of any single model on its own” — but the multiple models don't have to come from multiple algorithms. You can also build multiple models from the same algorithm, for instance linear regression only or SVM only, and combine them. This is called Bootstrap Aggregating, or Bagging. In this case the training data is resampled into different training sets, and the models trained on those different sets are combined.
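A minimal bagging sketch using scikit-learn's BaggingClassifier, which handles the bootstrap resampling and the combination by vote; the dataset and parameter values are illustrative:

```python
# Bagging sketch: many models of the same algorithm (decision trees), each
# trained on a different bootstrap sample, combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # same algorithm for every model in the ensemble
    n_estimators=50,           # number of bootstrap-trained models
    random_state=0,
)
bagging.fit(X_train, y_train)
print("bagging accuracy:", bagging.score(X_test, y_test))
```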

Here the technique called Boosting comes into play. Within the split training data, you focus on the examples that one model could not answer correctly and feed them to the next models so that they can learn from them. In doing so, you increase the whole ensemble's adaptability to training data that is difficult to learn. Specific methods include AdaBoost and XGBoost. Rakuten Institute of Technology has used AdaBoost in the past to select particularly beautiful images among product images, and has applied XGBoost at large scale to improve the accuracy of product data classification and organization. With this variety of techniques for working with the training data, ensemble learning can be said to have a real advantage when it comes to increasing accuracy.
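For instance, here is a small AdaBoost sketch using scikit-learn, again on an illustrative synthetic dataset:

```python
# AdaBoost sketch: each new weak learner puts more weight on the training
# examples the previous learners got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default, AdaBoostClassifier boosts shallow decision trees (stumps).
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```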

Now, one of the most commonly used methods in ensemble learning is Random Forest. In the case of the aforementioned horse racing hackathon, the team that created the winning horse prediction app was using Random Forest.

Random Forest is an ensemble learning method, based on bagging, that improves classification accuracy by combining decision tree models. Since each decision tree is built from “randomly” sampled data and those “trees” are assembled together, it is named “Random Forest”.
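A minimal Random Forest sketch with scikit-learn, using an illustrative synthetic dataset and hyperparameters:

```python
# Random Forest sketch: bagged decision trees, each grown on a bootstrap
# sample (and random feature subsets), combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```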

Random Forest is such a powerful method that Rakuten Institute of Technology made full use of it for interest rate and market forecasting in FinTech. Random Forest requires relatively little tuning, but it depends heavily on the training data and can be prone to overfitting. If the decision trees are kept shallow, the model will underfit the data due to bias; if the trees are deep, the model becomes more sensitive to small changes in the data. Also, because of this mechanism, a Random Forest model as a whole tends to be a black box that is difficult to inspect or adjust. Given these characteristics, some applications require careful adaptation.

Finally, let's take a look at XGBoost, which I briefly touched on in the explanation of boosting. It is an ensemble learning method that combines the Gradient Boosting and Random Forest approaches.

I wrote that XGBoost is a combination of Gradient Boosting and Random Forest. Gradient Boosting is a method that applies Gradient Descent (an algorithm for finding the minimum of a function) to boosting, and it has attracted a lot of attention in recent years because it has marked high scores in various data competitions. It's almost impossible to explain the details of Gradient Boosting simply, but recall that in boosting you emphasize the training data that one model can't get right and feed it to the next models to train them. Gradient Boosting uses gradient descent so that the error (the loss function) becomes as small as possible each time a model is added.
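To make the idea a bit more tangible, here is a toy sketch of gradient boosting with squared error loss, where each new tree is fit to the current residuals (the negative gradient of the loss) and added with a small learning rate; this is a simplified illustration, not a production implementation:

```python
# Toy gradient boosting: repeatedly fit a small tree to the residuals and
# add its scaled prediction to the running model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.zeros_like(y, dtype=float)  # start from a constant (zero) model
trees = []

for _ in range(100):
    residual = y - prediction            # negative gradient of squared error loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                # fit the next model to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```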

The XG in XGBoost stands for eXtreme Gradient (Boosting). With its parameters tuned and optimized by cross-validation, XGBoost is often considerably more accurate than Random Forest, which is why it is so popular. XGBoost is also mentioned in the following material presented by members of the Rakuten Institute of Technology Boston.
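As a sketch of what that cross-validation tuning might look like, here is a small example wrapping XGBoost's scikit-learn interface in GridSearchCV (requires the xgboost package; the parameter grid is purely illustrative):

```python
# XGBoost tuning sketch: pick a few key parameters by cross-validated search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

search = GridSearchCV(XGBClassifier(eval_metric="logloss"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```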

That is the overview of ensemble learning. It is widely used wherever accuracy matters in AI, so it has many applications, and the potential of Gradient Boosting, which is well suited to the big data era, is still opening new paths. I look forward to its continued achievements.

And one thing I didn't mention: ensemble learning can boost performance toward an objective even in the absence of business or domain knowledge. That is a very significant point. The horse racing case at the beginning of this article is one such example. The team that created the winning-horse prediction app had no knowledge of horse racing, yet they created a highly accurate prediction AI. This is somewhat taboo in data science, but it has an important meaning in today's world, where the Internet is widespread, consumer behavior has been transformed, and analysis using big data has become commonplace. I would like to write about that separately in the near future.

On a side note:

By the way, I have just talked about ensemble learning, but it is not the only way to improve accuracy with multiple AIs. Recently there have been attempts across many fields to improve a system's accuracy toward a goal by having multiple AIs interact and form loop structures.

Facebook AI Research announced a cutting-edge model for machine translation in 2018, called the Back Translation Model. It is a method that combines a translation model from language X to language Y with one in the reverse direction to improve accuracy, and it achieved an unprecedentedly high figure on the BLEU score, the standard benchmark for machine translation.

There's more: Dentsu, the largest ad agency in Japan, has been developing an AI system called ACM (Advanced Creative Maker) that automatically generates advertising banners, using a combination of an AI that generates banners and an AI that predicts the CTR of the generated banners to increase accuracy.

These approaches, in which AIs interact with each other and form loop structures, show how AI systems can evolve from a single deep learning model into more complex ones, and they are a trend worth watching closely in the future.


Masaya Mori 森正弥

Deloitte Digital, Partner | Visiting Professor in Tohoku University | Mercari R4D Advisor | Board Chair on AI in Japan Institute of IT | Project Advisor of APEC