Multiple Classifier Systems — a brief introduction

Luís Gonçalves
luisfredgs
Apr 25, 2021

Classification is an important task in Pattern Recognition, which is why the past few decades have seen a large number of research projects devoted to classification methods. Even though the methods available in the literature differ in many aspects, most recent research results point to a consensus: building a single monolithic classifier to cover all the variability inherent in most pattern recognition problems is, to some extent, problematic.

For that reason, Multiple Classifier Systems are an important direction in machine learning and pattern recognition. Indeed, combining classifiers is now a respected and established research area known under different names in the literature, such as Mixture of Experts, committee-based learning, or Ensemble methods.

This post is intended as a brief introduction to ensemble learning, covering core concepts and pointing the reader to further directions in this vibrant research area.

The core idea behind the ensemble methodology

A major task of pattern recognition, data mining, and machine learning is to construct good models from data sets. In contrast to ordinary learning approaches, which try to construct a single learner from the training data, ensemble methods construct a set of learners and combine them.

The core idea behind the ensemble methodology is to aggregate multiple models to obtain a combined model that outperforms every single model in it. For that reason, an ensemble is often significantly more accurate than a single learner.

An ensemble contains a number of learners called base learners, or individual learners. In the classification context, we refer to these base learners as base classifiers or individual classifiers. Base learners are usually generated from training data by a base learning algorithm, which can be a decision tree, a neural network, or any other kind of learning algorithm.

Ensemble methods are appealing mainly because they can boost weak learners, which are only slightly better than random guessing, into strong learners.

In general, we expect the ensemble to exploit the strengths of its base classifiers to produce a high-quality pattern recognition system that surpasses the performance of the individual classifiers.

The conventional ensemble methods include bagging, boosting, and stacking-based methods. These methods have been well studied in recent years and widely applied. Bagging, for example, is widely used in both classification and regression, while Random Forest (a variant of bagging), Random Subspace, and AdaBoost (a sequential ensemble method) are among the most widely studied methods.

Random Forest is a well-known algorithm that combines the concepts of bagging and random subspace. Random Forest is similar in spirit to AdaBoost, but it trains its base learners in parallel, whereas AdaBoost trains them sequentially.

Moreover, more recent approaches to ensemble learning, such as XGBoost, LightGBM, and CatBoost, have emerged in the literature and are frequently used in machine learning competitions on the Kaggle platform.

Ensemble methods commonly used in classification

Ensemble learning involves three main types of approaches: bagging, boosting, and stacking. In bagging, the ensemble is made of classifiers built on top of bootstrap replicates of the training set. Boosting trains multiple models in sequence, where the error function used to train a particular model depends on the performance of the previous models. Stacking combines several learners by training a meta-learner on their predictions. I will explain these methods in more detail in the next sections.

Ensemble learning process

The ensemble learning process can be divided into three phases, according to the literature. The first one is Generation, whose goal is to obtain a pool of models.

The second phase is Selection, in which a single classifier, or a subset containing the best classifiers, is selected.

The last phase is Integration, where a strategy to combine the base models is defined. This strategy is then used to obtain the prediction of the ensemble for new instances, based on the predictions of the individual models.

A number of redundant models are often produced during the generation phase. For that reason, generation is commonly followed by an ensemble pruning step, in which some of the models generated earlier are eliminated; in some cases, ensemble pruning can reduce the size of the ensemble without greatly degrading its accuracy.

Generation

When models are generated using the same induction algorithm, the ensemble is called homogeneous, otherwise, it is called heterogeneous.

A major task in ensemble learning is the generation of a set of mutually complementary classifiers whose combination provides optimal accuracy. Indeed, one can argue that a successful ensemble is one whose predictors are accurate and make errors in different parts of the input space.

The Bagging algorithm has been widely used in the generation phase of multiple classifier systems. Bagging is an acronym for Bootstrap AGGregatING, an ensemble meta-algorithm that was designed to improve the accuracy and stability of supervised machine learning models. In such a method, the ensemble is made of classifiers built on top of bootstrap replicates of the training set.

It is worth mentioning that the diversity of the classifier outputs is a vital requirement for the success of the ensemble strategy. In bagging, the diversity necessary to make the ensemble work is created by using different training sets and unstable algorithms. For this reason, bagging implementations are usually based on unstable algorithms such as neural networks and decision trees. We will take a closer look at the diversity topic further in this post.

Bagging takes advantage of this instability by training the models on random samples (with replacement) of the dataset, called bootstrap samples.

The bootstrap sampling process in bagging generates a different training set for each classifier, which naturally increases the ensemble diversity. The underlying idea is that small changes in the training set should lead to large changes in the classifier output. Otherwise, the resultant ensemble will be a pool of similar classifiers, therefore unlikely to improve on a single classifier’s performance.

In the last phase of the algorithm, the classifier outputs are combined by using the plurality vote. If the individual outputs are continuous-valued, the combination is done by averaging.

Bagging classifiers — Bagging is created by drawing bootstrap samples from the training data sets, training a classifier on each sample, and combining the label outputs through majority voting. If the individual outputs are continuous-valued, the combination is done by averaging. Bagging is a parallel algorithm in both its training and operational phases. Besides, it can dramatically reduce the variance of unstable procedures like trees and perceptrons, leading to improved prediction.
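
To make this concrete, here is a minimal bagging sketch using scikit-learn; the synthetic dataset and the hyperparameters are illustrative assumptions, not part of any particular study:

```python
# A minimal bagging sketch with scikit-learn; dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 50 trees is trained on a bootstrap replicate of the training
# set; their label outputs are combined by majority voting at prediction time.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy: %.3f" % bagging.score(X_test, y_test))
```

Decision trees are a natural base learner here precisely because of their instability: small perturbations of the training set yield noticeably different trees.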

Boosting has also been widely used in the generation process. Boosting algorithms are well-known methods due to their accuracy, robustness, and wide applicability.

The main idea in boosting is that a weak learning algorithm, i.e., one that performs only slightly better than random prediction, can be converted into one that achieves arbitrarily high accuracy. This conversion is performed by combining the estimations of several predictors.

“The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction.” — Hastie, 2003

“Boosting involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. This can produce substantial improvements in performance compared to the use of a single model. Boosting can give good results even if the base classifiers have a performance that is only slightly better than random, and hence sometimes the base classifiers are known as weak learners.” — Bishop, 2006

Boosting — with Boosting, the aim is to produce a very accurate prediction rule by combining moderately inaccurate learners. According to Hastie (2003), boosting is similar to bagging, but fits a model that is additive in the models of each individual base learner, which are learned using non i.i.d. samples.

A very popular boosting algorithm is AdaBoost (Adaptive Boosting). As in bagging, the examples are randomly selected with replacement, but in AdaBoost each example has a different probability of being selected.

The general idea is to develop the classifier ensemble incrementally, adding one classifier at a time. The classifier that joins the ensemble at a given step is trained on a data set selectively sampled from the training data. The sampling distribution starts uniform and is updated for each new classifier: the sampling probability of objects misclassified at a given step is increased, so that they have a higher chance of entering the training sample of the next classifier. The learning process is therefore sequential.
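
A minimal AdaBoost sketch with scikit-learn follows; the use of decision stumps (depth-1 trees) as weak learners is a classic but still illustrative choice:

```python
# A minimal AdaBoost sketch with scikit-learn; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new stump is trained on a reweighted view of the data in which the
# examples misclassified by the previous stumps carry more weight.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy: %.3f" % ada.score(X_test, y_test))
```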

Another important boosting method is the Gradient Boosting Machine (GBM), an improved version of AdaBoost. Gradient boosting stands for gradient descent plus boosting. It is a powerful machine learning algorithm that produces a prediction model in the form of an ensemble of weak prediction models. XGBoost, a frequent winner of machine learning competitions on Kaggle, is a variant of GBM.

GBM builds several models in an additive and gradual fashion. The major distinction between gradient boosting and AdaBoost is how the two algorithms identify the shortcomings of the weak learners: while AdaBoost does so by assigning higher weights to misclassified data points, gradient boosting does so through the gradients of the loss function.
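
Again as an illustration, here is a short gradient boosting sketch with scikit-learn; the hyperparameter values are arbitrary example settings:

```python
# A minimal gradient boosting sketch with scikit-learn; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new tree is fitted to the negative gradient of the loss with respect
# to the current ensemble's predictions and added with a small learning rate.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print("GBM accuracy: %.3f" % gbm.score(X_test, y_test))
```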

Another well-known ensemble learning method is Stacking, an effective and generic framework that combines several learners. Stacking has two levels of learning: 1) base learning and 2) meta-learning. In the first level, the base learners are trained on the training data set. Once trained, the base learners create a new data set for a meta-learner. The meta-learner is then trained on that new data set. Finally, the trained meta-learner is used to classify new instances.

Stacking is an ensemble method in which the predictions generated by various learners are used as inputs to a second-level learning algorithm, called the meta-learner. The trained meta-learner is used to classify new instances.

A key difference between stacking and other ensembles is that in stacking a meta-classifier is applied as the final predictor. Stacking has been applied, for example, to anomaly-based intrusion detection in imbalanced network traffic.
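
As a sketch of the two-level structure described above, the following example uses scikit-learn's StackingClassifier; the choice of base learners and of the logistic-regression meta-learner is purely illustrative:

```python
# A minimal stacking sketch with scikit-learn; learner choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Level 0: heterogeneous base learners. Level 1: a meta-learner trained on
# their out-of-fold predictions (scikit-learn uses internal cross-validation
# to build the meta-level training set).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print("Stacking accuracy: %.3f" % stack.score(X_test, y_test))
```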

Although ensemble methods are effective, one significant downside is that both the required memory and the processing time increase substantially with the number of individual learners in the ensemble. Therefore, the generation phase can be followed by a pruning phase, added to reduce the ensemble size without meaningfully reducing the accuracy of the ensemble predictions. In general, the aim of pruning is to improve the predictive ability of the ensemble or to reduce computational costs by selecting a subset of its individual learners.

A code example in Python for ensemble generation can be found here. Another code example, regarding ensemble pruning, is available here.
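
Since the linked notebooks are external, here is also a self-contained sketch of one simple pruning strategy, greedy forward selection by validation accuracy; the criterion and the helper names are assumptions made for this example:

```python
# A sketch of greedy forward-selection pruning; the validation-accuracy
# criterion and the helper names are assumptions made for this example.
import numpy as np

def majority_vote(classifiers, X):
    # Label predictions of each classifier, shape (n_classifiers, n_samples).
    votes = np.stack([clf.predict(X) for clf in classifiers])
    # Most frequent label per sample (assumes non-negative integer labels).
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def greedy_prune(pool, X_val, y_val, max_size=5):
    """Repeatedly add the classifier that most improves the majority-vote
    accuracy of the current sub-ensemble on a held-out validation set."""
    selected = []
    for _ in range(max_size):
        best_clf, best_acc = None, -1.0
        for clf in pool:
            if any(clf is s for s in selected):
                continue
            acc = np.mean(majority_vote(selected + [clf], X_val) == y_val)
            if acc > best_acc:
                best_clf, best_acc = clf, acc
        if best_clf is None:
            break
        selected.append(best_clf)
    return selected
```

Here, `pool` could be, for instance, the `estimators_` list of a fitted BaggingClassifier.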

Combining

In the combination step, the aim is to combine the predictions of the models from the ensemble in order to obtain the final ensemble prediction.

A fusion strategy is necessary when all classifiers are used (with no selection) or when a subset of them is selected. The combination of the base classifiers can be achieved using the class labels, as in the Majority Voting scheme, or using the scores (posterior probabilities) produced by the individual classifiers for each class of the classification problem.
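
Both fusion styles can be sketched in a few lines of NumPy; the classifier outputs below are made-up numbers used only to illustrate the two rules:

```python
# Hard (majority) voting on class labels versus soft voting on posterior
# probabilities; all numbers below are made up for illustration.
import numpy as np

# Label outputs of three classifiers for four test samples.
labels = np.array([[0, 1, 1, 0],   # classifier 1
                   [0, 1, 0, 0],   # classifier 2
                   [1, 1, 1, 0]])  # classifier 3
hard_vote = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, labels)
print(hard_vote)  # [0 1 1 0]

# Posterior probabilities, shape (n_classifiers, n_samples, n_classes):
# averaging the scores and taking the argmax gives the soft-voting decision.
probas = np.random.default_rng(0).dirichlet([1.0, 1.0], size=(3, 4))
soft_vote = probas.mean(axis=0).argmax(axis=1)
print(soft_vote)
```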

In particular, different models can be competent, or "experts", in different local regions of the feature space. For that reason, a convenient approach is to select the most promising model(s) from the ensemble, so that the prediction of the ensemble is based only on the selected model(s). In the literature, such an approach is widely known as Dynamic Selection of classifiers, a vibrant research area in ensemble learning.

Regarding this point, the main concepts are the type of selection (static or dynamic) and the notion of classifier competence. In static selection, the classifiers are selected during the training phase. Conversely, in dynamic selection, the classifiers are selected for each new test sample during the classification phase.

The literature has reported interesting results obtained by selecting specific classifiers for each test pattern, instead of using the same classifier for all newly arrived patterns during generalization. Hence, dynamic selection has become a preferred solution.

Furthermore, dynamic selection can either select one single classifier (Dynamic Classifier Selection, or DCS) or a subset of classifiers, which is referred to as Dynamic Ensemble Selection (DES). In the latter case, the outputs of the selected classifiers must be combined to classify the pattern.

Dynamic Classifier Selection (DCS)

The rationale behind the preference for dynamic selection is to select the most locally competent classifiers for each new pattern. Traditionally, the selection is done by estimating the competence of the base classifiers on local regions of the feature space.

Dynamic Ensemble Selection (DES)

The major issue in dynamic selection is how to estimate the competence of a base classifier for the classification of a new test pattern. Traditionally, kNN has been used to compute the accuracy of the individual classifiers in the neighborhood of the test sample within a validation set; the set of K nearest neighbors of the test instance is called the region of competence of the test pattern. The expectation is that the better the region of competence, the higher the precision of the dynamic selection system.

Nonetheless, it is important to emphasize that current dynamic selection techniques do not take into account the existence of different scenarios when estimating the competence of a classifier from the region of competence of a test sample. For example, given a test sample and a classifier, the test sample can be located in a safe region, where almost all samples belong to the same class, or in an indecision region, where samples belong to more than one class; moreover, the classifier's decision boundary may or may not cross the region of competence of the test sample. Such issues can result in unexpected behavior of the system, as pointed out by Oliveira, Dayvid V. R., and colleagues.

Finally, it is important to remember that dynamic selection, one of the most promising ensemble approaches, is currently a hot topic in the multiple classifier system literature, with several important contributions.

A code example in Python for dynamic selection can be found here.
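
For readers who want to experiment, the DESlib library implements many DCS and DES techniques. A minimal sketch, assuming deslib and scikit-learn are installed; the dataset and the chosen methods are illustrative:

```python
# A minimal dynamic selection sketch using the DESlib library
# (pip install deslib); dataset and settings are illustrative.
from deslib.dcs.ola import OLA          # selects one classifier (DCS)
from deslib.des.knora_e import KNORAE   # selects a subset (DES)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Hold out part of the training data as the dynamic selection (DSEL) set,
# on which the kNN region of competence of each test sample is computed.
X_train, X_dsel, y_train, y_dsel = train_test_split(X_train, y_train, random_state=42)

pool = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

for method in (OLA(pool), KNORAE(pool)):
    method.fit(X_dsel, y_dsel)
    print(type(method).__name__, "accuracy: %.3f" % method.score(X_test, y_test))
```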

Diversity in ensembles of classifiers

In any good ensemble strategy, the classifiers should be as accurate as possible and, at the same time, should not make coincident errors. We do not need an ensemble if we have a perfect classifier that makes no errors. Conversely, if a classifier does make errors, it should be complemented by another classifier that makes errors on different instances. The diversity of the classifier outputs is hence a critical requirement for the success of the ensemble, and it has long been recognized as a very important characteristic in classifier combination systems.

To build a good ensemble, it is generally believed that the base learners should be as accurate as possible and as diverse as possible. A key component of the success of these algorithms is that, intuitively at least, they build a set of diverse classifiers.

Diversity emerges as a guiding measure of the design process of the multiple classifier systems. Thus, the design of the ensemble aims to include mutually complementary individual classifiers which are characterized by high diversity and accuracy.

Diversity is easier to achieve in heterogeneous systems, where several algorithms are used. When working with homogeneous systems, however, it may not be so easy to control the diversity between models generated by a single algorithm. Nonetheless, more accurate and diverse predictors can be obtained by manipulating the data or the hyperparameters of the individual algorithm.

It is worth emphasizing that, although it is intuitive that increasing diversity should lead to better system accuracy, there is no formal proof of such a dependency. Nevertheless, the literature proposes several diversity measures; a good starting point can be found here.

To ensure the diversity of a classifier pool, common strategies manipulate either the inputs of the individual classifiers or their hyperparameters. In the former, we assume that classifiers trained on different input subspaces, using different data partitions or different sets of features, become complementary. In the latter, we assume that different hyperparameter settings of the same algorithm lead to diverse models. Diversity therefore includes both data diversity and parameter diversity.
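
As a small illustration of how diversity can be quantified, the sketch below computes the pairwise disagreement measure, one of the measures surveyed in the diversity literature, from raw classifier outputs; the prediction matrix is made up for the example:

```python
# A sketch of the pairwise disagreement measure: the fraction of samples on
# which two classifiers output different labels, averaged over all pairs.
import itertools
import numpy as np

def mean_disagreement(predictions):
    """predictions: array of shape (n_classifiers, n_samples) with labels."""
    pairs = itertools.combinations(range(len(predictions)), 2)
    return np.mean([np.mean(predictions[i] != predictions[j]) for i, j in pairs])

preds = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 1, 0, 0]])
print("Mean pairwise disagreement: %.2f" % mean_disagreement(preds))  # 0.40
```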

Conclusions

Multiple Classifier Systems (or ensembles) have been widely studied as an alternative to monolithic classifiers for increasing accuracy in pattern recognition. The term typically refers to methods that generate several models and combine them to make a significantly better prediction.

Moreover, ensembles of classifiers emerge as an alternative for increasing classification accuracy in many pattern recognition tasks, such as image classification, handwriting recognition, recommendation systems, fraud detection, and face recognition. Consequently, this is a strategic research area in machine learning, with important applications in industry as well.

This post is a brief introduction to the topic, and I hope you have found this useful. Leave a comment if you have any questions or ideas for new posts.

Citation

For attribution in academic contexts or books, please cite this work as

@misc{luisfredgs2021,
title = "Multiple Classifier Systems — a brief introduction",
author = "Luís Gonçalves",
year = "2021",
howpublished = {https://medium.com/luisfredgs/multiple-classifier-systems-a-brief-introduction-71238d9c794f},
}

References

  • Michał Woźniak, Manuel Graña, Emilio Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion, Volume 16, 2014, Pages 3–17.
  • Y. Ren, L. Zhang and P. N. Suganthan, Ensemble Classification and Regression-Recent Developments, Applications and Future Directions. IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 41–53, 2016.
  • João Mendes-Moreira, Carlos Soares, Alípio Mário Jorge, and Jorge Freire De Sousa. Ensemble approaches for regression: A survey. ACM Computing Surveys, 2012.

Generation

  • F. N. Walmsley, G. D. C. Cavalcanti, Dayvid V. R. Oliveira, R. M. O Cruz, Robert Sabourin. An Ensemble Generation Method Based on Instance Hardness. International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, 2018.
  • D. V. R. Oliveira, T. N. Porpino, G. D. C. Cavalcanti and T. I. Ren, A bootstrap-based iterative selection for ensemble generation, International Joint Conference on Neural Networks (IJCNN), Killarney, 2015.
  • M. A. Souza, G. D. C. Cavalcanti, R. M. O. Cruz and R. Sabourin, On the characterization of the Oracle for dynamic classifier selection, International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, pp. 332–339, 2017.
  • J. J. Rodriguez, L. I. Kuncheva and C. J. Alonso, Rotation Forest: A New Classifier Ensemble Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619–1630, Oct. 2006.
  • Kuncheva L.I. and J.J. Rodriguez, Classifier ensembles with a random linear oracle, IEEE Transactions on Knowledge and Data Engineering, 19 (4), 500–508, 2007.

Diversity

  • Ludmila Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 2003.
  • E. K. Tang, P. N. Suganthan, X. Yao. An analysis of diversity measures. Machine Learning, 2006.
  • George D. C. Cavalcanti, Luiz S. Oliveira, Thiago J. M. Moura and Guilherme V. Carvalho. Combining Diversity Measures for Ensemble Pruning. Pattern Recognition Letters, 2016.

Pruning

  • George D. C. Cavalcanti, Luiz S. Oliveira, Thiago J. M. Moura and Guilherme V. Carvalho. Combining Diversity Measures for Ensemble Pruning. Pattern Recognition Letters, 2016.

Dynamic Selection

  • Rafael M.O. Cruz, Robert Sabourin, George D.C. Cavalcanti. Dynamic classifier selection: recent advances and perspectives. Information Fusion, 2018.
  • Alceu S. Britto, Robert Sabourin, Luiz E.S. Oliveira, Dynamic selection of classifiers — A comprehensive review, Pattern Recognition, Volume 47, Issue 11, 2014, Pages 3665–3680.
  • Rafael M. O. Cruz, Robert Sabourin, George D. C. Cavalcanti and Tsang Ing Ren. META-DES: A Dynamic Ensemble Selection Framework using Meta-Learning. Pattern Recognition, v. 48, 1925–1935, 2015.
  • Rafael M.O. Cruz, Robert Sabourin, George D.C. Cavalcanti. META-DES.Oracle: Meta-learning and feature selection for dynamic ensemble selection. Information Fusion, v.38, p. 84–103, 2017.
  • Anandarup Roy, Rafael M. O. Cruz, Robert Sabourin, George D. C. Cavalcanti. Meta-learning Recommendation of Default Size of Classifier Pool for META-DES, Neurocomputing, pp. 351–362, 2016.

Luís Gonçalves
luisfredgs

Machine Learning Researcher and PhD student in Computer Science at Universidade Federal de Pernambuco, Brazil.