Ensemble Reinforcement Learning

Ngoc Minh Tu Nguyen
Apr 22, 2020

This article is part of a four-part series on ensemble reinforcement learning.

Motivation

Ensemble learning is a method of combining multiple learning models, such as a logistic regression and a naive Bayes classifier, into a single learner that performs inference on the data. Aggregating predictions from multiple models is most popular in classification, and entire schemes of classifiers have been developed around this idea. The most notable examples of ensemble learning concern tree-based models (Hoeting, J. A., et al., 1999), such as random forests (growing many classification trees and aggregating their predictions into a single inference), bagging (repeatedly drawing bootstrap samples from the training set, fitting a model to each, and averaging their predictions), and boosting (iteratively reweighting the training samples over multiple rounds so that each new model concentrates on the examples the previous models predicted poorly).

Random Forest. Source: Dimitriadis, S. I. & Liparas, D. (2018)

In practice, an ensemble usually scores better on comparison metrics than any of its component models (Ghimire, B., et al., 2012). Also, ensembles that are diverse in their constituents, meaning that their models differ in nature (for example, a logistic regression classifier is considered quite different from a regression tree), tend to perform better than ensembles of homogeneous components. This is probably because all models have biases, and models with different working mechanisms tend to have different biases, so when they are aggregated under an ensemble these biases can partially cancel out, strengthening the ensemble as a whole. The ensemble also has more modeling power, since it usually combines its models in non-linear ways (such as majority voting), which again tends to improve performance. Another interesting point is that "random" models, when combined under an ensemble, will likely perform better than deliberately constructed ones. For example, a collection of random (unpruned) trees forming a forest will likely outperform a collection of logistic regressors.

The motivation to explore ensemble learning in reinforcement learning is two-fold. First, ensemble learning with neural networks has been investigated and shows promising results, and the most powerful reinforcement learning methods to date, such as deep Q-learning or deep policy networks, are built on neural networks. The results of Opitz, D. & Maclin, R. (2011) indicate that since neural networks are powerful non-linear function approximators, applying ensemble learning paradigms to them yields positive gains in performance. Second, ensemble learning in reinforcement learning has been relatively underexplored, apart from a few scattered articles such as Wiering, M. A. & Hasselt, H. (2008) or Chen, X. et al. (2018). Both of these articles indicate that ensemble learning in these problems is a fruitful direction to explore.

Motivated by these results, this article investigates the effect of ensemble learning in a simple Gym environment with three algorithms that are relatively different in nature: REINFORCE, SARSA, and Q-learning.

The Ensemble

RL Ensemble Learning With Three Trained Agents

The ensemble implemented here has three different voting mechanisms: majority voting, average probability, and Boltzmann-normalized probability. All mechanisms take three probability vectors (one output by each component model) indicating that model's preferences over the action space, and return a single action sampled from the aggregation of these vectors.

Majority vote
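The gist that originally accompanied this section is not reproduced here, so below is a minimal sketch of how such a majority-voting rule could look, assuming each agent exposes its action preferences as a NumPy probability vector; the function name majority_vote and its signature are illustrative rather than the exact code from the repository.

```python
import numpy as np

def majority_vote(prob_vectors):
    """Sample one action from each agent's probability vector and
    return the most frequently chosen action (ties broken by lowest index)."""
    n_actions = len(prob_vectors[0])
    # Sample from each agent's distribution instead of taking its argmax,
    # so that the ensemble keeps exploring the action space.
    sampled = [np.random.choice(n_actions, p=p) for p in prob_vectors]
    # Count the votes and return the most popular action.
    counts = np.bincount(sampled, minlength=n_actions)
    return int(np.argmax(counts))
```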

First, the majority voting rule is implemented above. It randomly samples an action from each of the probability vectors and takes the most popular action among the three. The reason we sample rather than simply take each model's most probable action is that, unlike in ordinary supervised learning, in reinforcement learning we have to promote exploration of the action space as well as exploitation of the model predictions. Next, the average-probability and Boltzmann-normalized-probability rules are implemented below.

Average Vote & Boltzmann
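Again, these are minimal sketches rather than the original gist: average_vote samples from the plain average of the agents' probability vectors, while boltzmann_vote exponentiates and renormalizes that average with a temperature parameter before sampling (both function names are illustrative).

```python
import numpy as np

def average_vote(prob_vectors):
    """Sample an action from the arithmetic mean of the agents' distributions."""
    avg = np.mean(prob_vectors, axis=0)
    avg = avg / avg.sum()  # guard against numerical drift
    return int(np.random.choice(len(avg), p=avg))

def boltzmann_vote(prob_vectors, temperature=1.0):
    """Boltzmann-normalize the averaged distribution before sampling.
    A low temperature sharpens the distribution (more exploitation),
    a high temperature flattens it (more exploration)."""
    avg = np.mean(prob_vectors, axis=0)
    logits = avg / temperature
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    probs = exp / exp.sum()
    return int(np.random.choice(len(probs), p=probs))
```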

Both of these methods sample an action from the average of the input probability vectors; the latter additionally exponentiates and renormalizes the averaged probabilities according to a given temperature parameter, which controls the spread of the distribution (i.e. how much exploration we want).

The ensembles investigated in this article are built from the major algorithms in RL: REINFORCE, Q-Learning, and SARSA. You can check the earlier articles in this series for a detailed description of these algorithms and how they are implemented, and for a description of the well-known OpenAI Gym environment we use in this article: CartPole.

Results

As mentioned earlier, ensemble learning works best when the components are as diverse as possible, which means they learn using mechanisms different from each other (Chen, X. et al., 2018). To illustrate this point, let's start with a simple DQN and investigate whether an ensemble of three DQNs can perform better than its constituents.

We implemented a shallow deep Q-network (DQN) trained to solve the CartPole environment. We then trained three separate DQNs, each of which reached around 140 consecutive steps of balancing the pole on the cart after 100 epochs of training. Once these agents were trained, we let each of them play 20 games and recorded how many consecutive steps it managed to balance the pole; the results are plotted in figure 2.

Figure 2: Histogram of average consecutive steps for a DQN
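As a rough sketch of this evaluation protocol (not the exact code linked later in the article), each trained agent can be scored by letting it play a number of CartPole episodes and recording how many steps it survives in each. The agent.act(state) method assumed here, returning the agent's chosen action for a state, and the use of the classic Gym API are assumptions for illustration.

```python
import gym

def evaluate(agent, n_episodes=20, max_steps=500):
    """Return the number of consecutive steps survived in each episode."""
    env = gym.make("CartPole-v1")  # assuming the v1 variant of CartPole
    steps_per_episode = []
    for _ in range(n_episodes):
        state = env.reset()
        for t in range(1, max_steps + 1):
            action = agent.act(state)  # hypothetical interface of the trained agent
            state, _, done, _ = env.step(action)
            if done:
                break
        steps_per_episode.append(t)
    env.close()
    return steps_per_episode
```

Averaging the returned list over the 20 episodes gives the kind of per-agent score plotted in the histograms.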

An ensemble of these three DQNs under each of the voting mechanisms then played 20 games of CartPole, and the average consecutive time steps for these ensembles are presented in figure 3. As we can see, the average reward for these ensembles is almost the same as for a single DQN, which means there was not much improvement gained from ensemble learning. This is probably because the components are very similar in nature, so combining them did not cancel out the biases inherent in each model. This could probably be fixed by training hundreds or thousands of DQNs separately and ensembling them, taking a lesson from random forest classifiers, which likewise aggregate hundreds of similar decision trees. With that many components, the weaknesses due to random fluctuations in the training of each model might cancel out, leaving us with a stronger ensemble. The code for implementing the DQNs and comparing the ensemble results can be found here.

Figure 3: Histogram of average consecutive steps for ensembles of DQNs
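To score the ensembles the same way, the three trained DQNs and a voting rule can be wrapped into an object that exposes the same act(state) interface as a single agent, so the evaluate helper sketched above can be reused unchanged. The prob(state) method assumed here, returning an agent's action-probability vector, is again an illustrative interface rather than the original code.

```python
class VotingEnsemble:
    """Wraps several trained agents and a voting rule so that the
    ensemble can be evaluated exactly like a single agent."""

    def __init__(self, agents, vote_fn):
        self.agents = agents
        self.vote_fn = vote_fn

    def act(self, state):
        # Collect each agent's action-probability vector and aggregate
        # them with the chosen voting rule (e.g. majority_vote above).
        prob_vectors = [agent.prob(state) for agent in self.agents]
        return self.vote_fn(prob_vectors)

# e.g. evaluate(VotingEnsemble([dqn1, dqn2, dqn3], majority_vote))
```

The same wrapper applies unchanged to the ensemble of diverse learners discussed next.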

Now let us turn to the ensemble of three different models. The average reward of each model, after being trained only lightly to play CartPole (which explains the low reward of only around 12 consecutive time steps of balancing the pole), is presented in figure 4.

Figure 4: Histogram of average consecutive steps for each learner

Lastly, an ensemble of these diverse learners was made to play the CartPole game, and the results are plotted in figure 5. As we can see, there was a significant increase in the average reward, from around 12 when each learner plays separately to around 16-17 under ensemble learning. This confirms our initial hypothesis that ensemble learning can have positive effects on RL agents, at least for simple games like CartPole. You can find the implementation of these ensembles here.

Figure 5: Histogram of average consecutive steps for ensembles of learners

Conclusion

In this series of articles, we have gone through three major algorithms in RL: SARSA, Q-Learning, and REINFORCE. We have seen them in action both in their classical forms (as with the traditional implementation of Q-Learning) and in more recent, neural-network-based implementations (as with SARSA and REINFORCE). These are very powerful algorithms, with state-of-the-art performance on various video games and real-world problems.

For this article specifically, we have seen that 1) ensemble learning can be effectively applied in RL and yields significant performance increases compared to individual learners; and 2) the constituent learners need to be diverse in their learning mechanisms for ensemble learning to work.

It would be interesting to further investigate the effect of ensemble learning on larger ensembles, and its effects across a variety of environments.

References:

Ghimire, B. et al. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS. Retrieved April 4th, 2020 from https://doi.org/10.1016/j.isprsjprs.2011.11.002

Hoeting, J. A., et al. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science. Retrieved April 4th, 2020 from https://www.jstor.org/stable/2676803

Opitz, D. & Maclin, R. (2011). Popular Ensemble Methods: An Empirical Study. Retrieved April 4th, 2020 from https://arxiv.org/abs/1106.0257

Chen, X. et al (2018). Ensemble Network Architecture for Deep Reinforcement Learning. Retrieved April 4th, 2020 from https://www.hindawi.com/journals/mpe/2018/2129393/

Wiering, M. A. & Hasselt, H. (2008). Ensemble Algorithms in Reinforcement Learning. IEEE. Retrieved April 4th, 2020 from https://ieeexplore.ieee.org/document/4509588
