Intuitions Behind Mixture of Experts Ensemble Learning

Bakary Badjie
10 min read · Sep 4, 2023


Mixture of Experts in Autonomous Driving

The capacity of a neural network to absorb and process information is limited by the number of its parameters. Therefore, finding more effective ways to increase model parameters without adding extra computational overhead has become a trend in deep learning research, especially when dealing with complex real-world applications such as autonomous driving systems.

A mixture of experts is an ensemble learning technique that combines several neural networks, each specializing in a different part of the training data, with a managing or gating network that examines the entire input and decides which experts to trust in a particular input scenario.

It involves decomposing the input space into different subsets of the data based on some domain knowledge of the problem, training an expert model on each subset of the training data, developing a gating model that learns which expert to trust for the input to be predicted, and combining the experts' predictions.

In a typical machine learning model, all of the parameters are activated for every input, which drives up training costs as the complexity of the model and of the training data increases. Selectively activating only the relevant part of a neural network can therefore reduce computational demands and the interdependencies between the layers of the network, improve the generalization capability of the network, and increase the diversity of the model.
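
As a rough illustration of this idea, the sketch below (my own assumption, using PyTorch with made-up layer sizes, not code from any specific system) routes each input to a single expert sub-network: the total parameter count grows with the number of experts, but only the small gating layer plus one expert are evaluated per input.

```python
# Minimal sketch: top-1 routing over a set of hypothetical expert sub-networks.
# Total parameters grow with the number of experts, but each input only
# activates the gating layer plus one expert.
import torch
import torch.nn as nn

num_experts, d_in, d_hidden, d_out = 4, 16, 64, 3

experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
    for _ in range(num_experts)
)
gate = nn.Linear(d_in, num_experts)  # scores each expert for a given input

x = torch.randn(1, d_in)                    # a single input pattern
expert_idx = gate(x).argmax(dim=-1).item()  # pick the highest-scoring expert
y = experts[expert_idx](x)                  # only this expert's parameters are used

total_params = sum(p.numel() for p in experts.parameters())
active_params = sum(p.numel() for p in experts[expert_idx].parameters())
print(f"expert parameters: {total_params}, active for this input: {active_params}")
```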

Although the technique was initially described using neural network experts and gating models, it can be generalized to use models of any type. As such, it shows a substantial similarity to stacked generalization and belongs to the class of ensemble learning methods referred to as meta-learning.

In this tutorial, you will discover the mixture of experts approach to ensemble learning.

After completing this tutorial, you will know:

  • An intuitive approach to ensemble learning involves dividing a task into subtasks and developing an expert on each subtask.
  • A mixture of experts is an ensemble learning method that seeks to explicitly address a predictive modeling problem in terms of subtasks using expert models.
  • The divide-and-conquer approach is related to the construction of decision trees, and the meta-learner approach is associated with the stacked generalization ensemble method.

An Intuitive Overview

In the course of this tutorial, we will discuss and help you understand the following:

  1. Subtasks and Experts
  2. Mixture of Experts
     1. Subtasks
     2. Expert Models
     3. Gating Model
     4. Pooling or Aggregating Method
  3. Relationship With Other Techniques
     1. Mixture of Experts and Decision Trees
     2. Mixture of Experts and Stacking

Subtasks and Experts

Some predictive modeling tasks are remarkably complex and effectively comprise many smaller tasks. Dividing the parent task into multiple subtasks can make suitable and efficient predictive modeling feasible without demanding extra computational resources.

For example, consider an autonomous driving system in which an image classification model has to analyze and classify the behaviors of different dynamic and static objects across different locations and scenarios. Such a model would need to be complex and scalable, with an extremely large number of parameters, to cover every driving scenario in the environment. Dividing this complex task into subtasks, with a managing or gating network deciding which expert to rely on for a given input, can enhance the scalability of the image classification model in autonomous driving. In this way, the model can easily and effectively learn which expert to call upon during inference for a new input.

This is a divide-and-conquer approach to problem-solving and underlies many automated approaches to predictive modeling, as well as problem-solving more broadly.

This approach can also be explored as the basis for developing an ensemble learning method.

The subproblems may or may not overlap, and experts from similar or related subproblems may be able to contribute to the examples that are technically outside of their expertise.

This approach to ensemble learning underlies a technique referred to as a mixture of experts.

Mixture of Experts

A mixture of experts is an ensemble learning technique that implements the idea of training different expert neural networks on subtasks of a predictive modeling problem.

In the neural network community, several researchers have examined this decomposition methodology. The mixture-of-experts methodology decomposes the input space so that each expert examines a different part of the space, while a gating network is responsible for selecting an expert for a particular scenario and for combining the outputs of the various experts.

There are four elements to the approach, which are as follows:

  • Division of a task into subtasks.
  • Development of an expert for each subtask.
  • Use of a gating model to decide which expert to use.
  • Pooling of predictions and gating model output to make a prediction.

The figure below illustrates a high-level concept of a mixture of experts in machine learning.

Example of a Mixture of Experts Model with Expert Members and a Gating Network
Taken from: Ensemble Methods
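
To make these four elements concrete, here is a hedged, minimal PyTorch sketch of my own (the layer sizes and names are illustrative assumptions, not the model from the figure): a handful of expert networks, a gating network with a softmax output, and a weighted-sum pooling of the experts' predictions, all trained jointly from a single loss.

```python
# A minimal sketch of the four elements: experts, a softmax gating network,
# and weighted-sum pooling of the expert predictions, trained jointly.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_in, d_out, num_experts=3, d_hidden=32):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_in, num_experts)  # gating network

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)           # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], 2)  # (batch, d_out, num_experts)
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)     # weighted pooling

# Joint training: the gate and the experts learn together from the same loss.
model = MixtureOfExperts(d_in=8, d_out=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 8), torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the experts would typically be larger networks and the gate might be regularized to encourage specialization, but the wiring is essentially the same.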

Subtasks

The first step is to divide the predictive modeling problem into subtasks. This often involves splitting the input space into different subspaces using domain knowledge. For example, an autonomous driving dataset might be divided into different driving scenarios, such as urban driving, highway driving, parking and valet, platooning, and autonomous delivery.

For those problems where the division of the task into subtasks is not obvious, a simpler and more generic approach can be used. For example, one could imagine an approach that divides the input feature space by groups of columns, or that separates examples in the feature space based on distance measures, on whether they are inliers or outliers of a standard distribution, and so on.
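
As a sketch of such a generic division (my own illustration, using scikit-learn's k-means clustering as just one of many possible splitting rules), the feature space can be partitioned without any domain knowledge and each partition treated as a subtask:

```python
# A sketch of a generic, data-driven division of the input space into
# subtasks using k-means clustering (one of many possible choices).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=0)

# Each cluster becomes the training subset for one expert.
subtask_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
subsets = [(X[subtask_labels == k], y[subtask_labels == k]) for k in range(3)]
for k, (Xk, yk) in enumerate(subsets):
    print(f"subtask {k}: {len(Xk)} examples")
```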

In a mixture of experts framework, one of the key problems is how to find the natural division of the task and then derive the overall solution from sub-solutions.

Expert Models

The mixture of experts approach was initially developed and explored within the field of artificial neural networks, so traditionally, experts themselves are neural network models used to predict a numerical value in the case of regression or a class label in the case of classification.

It should be clear that we can “plug in” any other model for the expert. For example, we can use neural networks to represent both the gating functions and the experts. The result is known as a mixture density network.

Experts each receive the same input pattern (row) and make a prediction.
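
To illustrate the point that any model type can be plugged in as an expert, the following sketch (an assumption for illustration, using scikit-learn models in place of neural networks) has three different experts each receive the same input row and produce their own prediction. In a full mixture of experts, each expert would typically be trained on its own subset of the data rather than on the whole dataset as done here for brevity.

```python
# Experts need not be neural networks: any model type can be "plugged in".
# Here three different scikit-learn models act as experts; each one
# receives the same input pattern (row) and produces its own prediction.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=1)

experts = [
    LinearRegression().fit(X, y),
    DecisionTreeRegressor(max_depth=5, random_state=1).fit(X, y),
    KNeighborsRegressor(n_neighbors=10).fit(X, y),
]

x_new = X[:1]  # the same input row goes to every expert
print([float(e.predict(x_new)[0]) for e in experts])
```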

Gating Model

A gating model is a separate neural network trained alongside the experts. It is responsible for interpreting the predictions made by each expert and for helping to decide which expert to trust for a given input. It is called the gating model, or the gating network, given that it is traditionally a neural network model.

The gating network takes as input the same full input that was divided among the experts, and based on that information it learns how much each expert contributes for a given input and decides which of the experts to trust.

The gating network will pick, or rely more heavily on, an expert for a particular input if that expert's squared error is lower than the weighted average of the squared errors of the other experts; it then raises the probability of picking that expert. Conversely, if the gating network assigns a very low probability to an expert for an input, that expert receives a very small gradient, and its parameters are barely affected by that input.

… the weights determined by the gating network are dynamically assigned based on the given input, as the MoE effectively learns which portion of the feature space is learned by each ensemble member.

A mixture-of-experts can also be seen as a classifier selection algorithm, where individual classifiers are trained to become experts in some portion of the feature space.

When neural network models are used, the gating network and the experts are trained together such that the gating network learns when to trust each expert to make a prediction. This training procedure was traditionally implemented using expectation-maximization. The gating network might have a softmax output that gives a probability-like confidence score for each expert.

In general, the training procedure tries to achieve two goals: for given experts, to find the optimal gating function; and for a given gating function, to train the experts on the distribution specified by the gating function.
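
The sketch below shows one common way this joint objective is written, following the early mixture-of-experts literature (the exact formulation, names, and sizes here are my own assumptions): the gate's softmax assigns a per-input probability to each expert, and experts whose squared error is low for that input receive most of the gradient.

```python
# A hedged sketch of a joint mixture-of-experts training objective:
# the softmax gate weights each expert per input, and experts that fit
# an input well (low squared error) dominate the mixture likelihood.
import torch
import torch.nn as nn

num_experts, d_in = 3, 8
experts = nn.ModuleList(nn.Linear(d_in, 1) for _ in range(num_experts))
gate = nn.Linear(d_in, num_experts)

x, y = torch.randn(32, d_in), torch.randn(32, 1)

gate_probs = torch.softmax(gate(x), dim=-1)           # (32, num_experts)
preds = torch.stack([e(x) for e in experts], dim=-1)  # (32, 1, num_experts)
sq_err = (y.unsqueeze(-1) - preds).pow(2).sum(dim=1)  # (32, num_experts)

# Mixture negative log-likelihood: experts with lower error and higher gate
# probability for an input get most of the gradient; poorly matched experts
# receive a very small gradient for that input.
loss = -torch.log((gate_probs * torch.exp(-0.5 * sq_err)).sum(dim=-1)).mean()
loss.backward()
```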

Pooling or Aggregating Method

Finally, the mixture-of-experts model must make a prediction, and this is achieved using a pooling or aggregation mechanism. This might be as simple as selecting the expert with the largest output or confidence provided by the gating network.

Alternatively, a weighted sum prediction could be made that explicitly combines the predictions made by each expert and the confidence estimated by the gating network. You might imagine other approaches to making effective use of the predictions and gating network output.

The pooling/combining system may then choose a single classifier with the highest weight or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.
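
A small NumPy sketch of the two pooling strategies just described, with made-up gate weights and per-expert class scores:

```python
# Two pooling strategies, given hypothetical gate weights and expert outputs.
import numpy as np

gate_weights = np.array([0.1, 0.7, 0.2])  # gating network output for one input
expert_outputs = np.array([               # rows: experts, columns: class scores
    [0.8, 0.2],
    [0.3, 0.7],
    [0.6, 0.4],
])

# Option 1: trust only the expert the gate is most confident in.
winner = expert_outputs[gate_weights.argmax()]

# Option 2: weighted sum of all expert outputs, then pick the top class.
weighted = gate_weights @ expert_outputs

print("winner-takes-all:", winner.argmax(), "weighted sum:", weighted.argmax())
```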

Relationship With Other Techniques

The mixture of experts method is perhaps less popular than other ensemble methods, possibly because it was originally described within the field of neural networks.

Nevertheless, the technique has seen more than 25 years of advancement and exploration, and you can find an excellent summary in the 2012 paper “Twenty Years of Mixture of Experts.”

Importantly, I’d recommend considering the broader intent of the technique and exploring how you might use it on your predictive modeling problems.

For example:

  • Are there obvious or systematic ways that you can divide your predictive modeling problem into subtasks?
  • Are there specialized methods that you can train on each subtask?
  • Consider developing a model that predicts the confidence of each expert model.

Mixture of Experts and Decision Trees

We can also see a relationship between a mixture of experts and Classification And Regression Trees, often referred to as CART.

Decision trees are fit using a divide and conquer approach to the feature space. Each split is chosen as a constant value for an input feature, and each sub-tree can be considered a sub-model.

Mixture of experts was mostly studied in the neural networks community. In this thread, researchers generally consider a divide-and-conquer strategy, try to learn a mixture of parametric models jointly and use combining rules to get an overall solution.

We could take a similar recursive decomposition approach to decompose the predictive modeling task into subproblems when designing the mixture of experts. This is generally referred to as a hierarchical mixture of experts.

The hierarchical mixtures of experts (HME) procedure can be viewed as a variant of tree-based methods. The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones.

Unlike decision trees, the division of the task into subtasks is often explicit and top-down. Also, unlike a decision tree, the mixture of experts attempts to survey all of the expert submodels rather than a single model.

There are other differences between HMEs and the CART implementation of trees. In an HME, a linear (or logistic regression) model is fit in each terminal node instead of a constant, as in CART. The splits can be multiway, not just binary, and the splits are probabilistic functions of a linear combination of inputs rather than a single input, as in the standard use of CART.
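
The contrast between a hard, single-feature CART split and a soft, probabilistic HME split can be sketched as follows (a toy illustration under my own assumptions, not an implementation of either method):

```python
# Toy sketch: a CART-style hard split versus an HME-style soft split.
import numpy as np

def hard_split(x, feature=0, threshold=0.0):
    """CART: an input goes entirely to the left or right sub-model,
    based on a single feature and a constant threshold."""
    return x[feature] <= threshold  # True -> left branch, False -> right branch

def soft_split(x, w, b=0.0):
    """HME: a logistic gate gives a probability of taking each branch,
    based on a linear combination of all inputs rather than one feature."""
    p_left = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return p_left, 1.0 - p_left

x = np.array([0.3, -1.2, 0.8])
print(hard_split(x))                                 # hard, single-feature decision
print(soft_split(x, w=np.array([1.0, 0.5, -0.2])))   # soft, multi-feature decision
```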

Nevertheless, these differences might inspire variations in the approach for a given predictive modeling problem.

For example:

  • Consider automatic or general approaches to dividing the feature space or problem into subtasks to help broaden the suitability of the method.
  • Consider exploring both combination methods that trust the best expert, as well as methods that seek a weighted consensus across experts.

Mixture of Experts and Stacking

The application of the technique does not have to be limited to neural network models, and a range of standard machine learning techniques can be used in their place to achieve a similar end.

In this way, the mixture of experts method belongs to a broader class of ensemble learning methods that would also include stacked generalization, known as stacking. Like a mixture of experts, stacking trains a diverse ensemble of machine learning models and then learns a higher-order model to best combine the predictions.

We might refer to this class of ensemble learning methods as meta-learning models. These are models that attempt to learn from the output or learn how to best combine the output of other lower-level models.

Meta-learning is a process of learning from learners (classifiers). In order to induce a metaclassifier, first the base classifiers are trained (stage one), and then the metaclassifier (stage two).

Unlike a mixture of experts, stacking models are often all fit on the same training dataset; e.g., there is no decomposition of the task into subtasks. Also, unlike a mixture of experts, the higher-level model that combines the predictions from the lower-level models typically does not receive the input pattern provided to the lower-level models and instead takes the predictions from each lower-level model as input.

Meta-learning methods are best suited for cases in which certain classifiers consistently correctly classify, or consistently misclassify, certain instances.

Nevertheless, there is no reason why a hybrid of stacking and mixture-of-experts models cannot be developed that may perform better than either approach in isolation on a given predictive modeling problem.

For example:

  • Consider treating the lower-level models in stacking as experts trained on different perspectives of the training data. Perhaps this could involve using a softer approach to decomposing the problem into subproblems where different data transformations or feature selection methods are used for each model.
  • Consider providing the input pattern to the meta model in stacking in an effort to make the weighting or contribution of lower-level models conditional on the specific context of the prediction.
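
As one hedged sketch of the second idea, scikit-learn's stacking ensembles can pass the original input features through to the meta-model alongside the base predictions via passthrough=True, which makes the combination conditional on the input in the spirit of a mixture of experts (the particular models chosen here are arbitrary placeholders):

```python
# A stacking ensemble whose meta-model also sees the original input pattern.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=2)

stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=2)),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
    ],
    final_estimator=Ridge(),
    passthrough=True,  # meta-model receives base predictions AND the input features
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```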

By Bakary Badjie

I am a PhD student in Computer Science. My research focuses on enhancing the robustness of machine learning models for autonomous driving systems.