Committee of Intelligent Machines — Unity in Diversity of #NeuralNetworks


Have you noticed that one of the best fitness functions most creatures adopt for survival is to work in collectives? A school of fish, a hive of bees, a nest of ants, a herd of wildebeest or a flock of birds all have something in common: they co-operate to survive. This is an example of Unity in Uniformity (members of the same species acting as a collective to form a fitness function).

What is even more perplexing about nature is the ecological interdependence of different species, collectively surviving to see a better day. This fitness function is a sum of averages of sorts, enabling a different form of collective strength. It's called Unity in Diversity. In essence, it signifies 'unity without uniformity and diversity without fragmentation' when it comes to ecological fitness at any given time.

How do we apply this learning to AI? Specifically Neural Networks?

In the previous posts, we discovered how Neural Networks learn, and also observed that models can begin overfitting the data. We learnt that L1/L2 weight penalties, early stopping and weight constraints provide mechanisms to improve prediction accuracy while reducing overfitting.

In this post, we shall learn about better generalization functions based on how nature works: unity, or collaborative prediction. This is the first post in the series and will establish the fundamentals of the 'Committee of Machines' model.

Note that the true definition of a Committee of Machines is machines co-operating to perform a complex task. I use the term in this post to whet the appetite.

One way to reduce overfitting and improve prediction accuracy is to have a collection of Neural Networks (as in nature) predict the target value. The idea is to average the predictions so that individual errors tend to cancel out, yielding a better overall prediction.

As shown in the illustration, if each model predicts an output 'y' at an error distance d1 from the target (think of the absolute prediction value, or the weight of a ball, as an analogy), then it is possible to combine the variance in these distances to arrive at a better prediction.

Just like in nature,

  • We can use the same type of Neural Net model (unity in uniformity), all trained on the same dataset but predicting different values, and average their predictions.
  • We can use different Machine Learning models (unity in diversity), all trained on the same dataset and predicting different results, and average them. These models need not be Neural Networks; as long as each uses the same dataset to predict the same target value, it qualifies.
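As a minimal sketch of the first option (all numbers here are hypothetical), averaging the outputs of several models in numpy; by convexity, the committee's squared error never exceeds the members' mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
target = 10.0

# Hypothetical outputs of five models trained on the same dataset,
# each off from the target by independent noise.
predictions = target + rng.normal(0.0, 2.0, size=5)

# The committee's output is simply the mean of its members' outputs.
committee = predictions.mean()

# Mean squared error of the individual models vs. the committee's error.
individual_err = np.mean((predictions - target) ** 2)
committee_err = (committee - target) ** 2
```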

The idea behind combining various models is to reduce the variance of the resulting prediction. Recall that 'variance' is a function of the squared error distance from the mean prediction. We use the squared error distance because we want predictions farther from the target to contribute a higher standard deviation and predictions closer to it to contribute a lower one.

Squaring moves all prediction errors into the realm of positive numbers, regardless of which side of the target they fall on.

squared error distance

The illustration conveys that when more than one model trained on the same dataset is used to predict the outcome, the averaged prediction produces, on average, a lower squared error distance than any individual model.
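A small simulation illustrates the claim: with N independent, unbiased models of unit error variance, the averaged prediction's squared error shrinks by roughly a factor of N (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_trials = 10, 5000
target = 0.0

# Each model's prediction is the target plus independent unit-variance noise.
preds = rng.normal(target, 1.0, size=(n_trials, n_models))

# Average squared error of the individual models: close to 1.
mean_individual_mse = np.mean((preds - target) ** 2)

# Squared error of the committee average: close to 1 / n_models.
committee_mse = np.mean((preds.mean(axis=1) - target) ** 2)
```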

You might wonder why we do not simply keep the model that lands closest to the target and throw the rest away. Note that the same model can produce different errors for different inputs: it may predict some inputs more accurately than others, and other models will behave differently on the same input. The goal is to obtain, on average, good predictions for all inputs.

The prediction vectors for all inputs of a given model may point in different directions in the resulting vector sub-space. Combining the directions of the prediction vectors across all models achieves an equilibrium through the averaged prediction, as illustrated:

To find the average prediction, we apply the variance function as follows:
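In symbols, for N models with outputs y_i and target t, this error can be written as:

```latex
E = \left( t - \frac{1}{N} \sum_{i=1}^{N} y_i \right)^{2}
```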

The above equation can be used as an error function to train the models. This concept is called co-operative error optimization.

Note that we are comparing the average of all predictions with the target value to improve accuracy of prediction. This helps reduce variance and improve model accuracy.

But this has a bad side-effect. As the models co-operate to improve the average prediction, they also end up overfitting more strongly on average, because we are taking the combined squared error over all models collectively. So this is not a good outcome for reducing overfitting.

Mixture of Experts

Instead, there is a better approach to improve generalization while improving accuracy. The concept is called a “Mixture of Experts”.

Citation: Geoffrey Hinton, along with others, has published two papers on Mixture of Experts, which can be found here.

A better approach for improving prediction accuracy while improving model generalization is based on the concept of specialization. Instead of comparing the average of all predictors against the target value as a cost function, Hinton et al. proposed a model in which each predictor is compared individually and separately with the target value, and the 'probability' of selecting an expert is averaged out based on the 'class' of the input.

  • This requires an ability to identify the input class, and
  • Requires a gating function to assign the input to the right predictor based on the class of the input.

One way to classify an input before prediction is to use the internal representation of the input vector and cluster it via unsupervised learning. Several techniques exist to cluster the input data:

  • Self Organizing Maps (Using Neural Net models)
  • K-means clusters
  • Expectation Maximization
  • Graph based models
  • Softmax Network
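As a toy sketch of one such technique, a minimal k-means clustering in numpy (data and seeding are hypothetical); each resulting cluster would then be routed to its own expert:

```python
import numpy as np

# Toy data: two well-separated blobs of input vectors.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])

# Minimal k-means with k = 2, seeded with one point from each half
# of the data for reproducibility.
centers = x[[0, 20]].copy()
for _ in range(20):
    # Assign each point to its nearest centre.
    labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    # Move each centre to the mean of its assigned points.
    for j in range(2):
        centers[j] = x[labels == j].mean(axis=0)
```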

Once the class of an input is obtained, we can keep track of how accurately each model in the 'Committee of Machines' handles that class, and derive a better cost function for training a Neural Net participating in the Committee.

(Again, I am using the 'Committee of Machines' nomenclature for this concept because the models co-operate for better prediction. Strictly, the true definition of a Committee of Machines is machines co-operating to perform 'disparate' tasks to achieve a complex objective.)

The setup for a Committee of Machines looks as follows:

Committee of Machines with a gating network.

Here, P1, P2, P3 are the probabilities assigned to the input class. The number of probabilities is equal to the number of experts. An expert can be any machine learning model that can predict the target value given the input feature vector. Correspondingly, y1, y2, y3 are the resulting predictions.

The manager allocates the input vector, based on its class, to a specialized expert (a machine learning model); this allocation is done by the softmax gating network. A softmax gating network can be a Neural Network that simply takes the raw inputs and emits 'N' probabilities, one per expert. Note that N is just the number of experts and is not otherwise related to them.

The softmax function is a normalized exponential that squashes a k-dimensional vector into a vector of real values between 0 and 1 which sum to 1, and hence is a good gating function for capturing the internal representation of the input vector and producing a probability distribution over the N experts.
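A minimal numpy sketch of such a gating output over N = 3 experts (the raw scores are hypothetical):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: squashes a vector of real scores
    into probabilities in (0, 1) that sum to 1."""
    z = z - np.max(z)       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Raw gating scores for three experts; the highest score gets
# the largest share of the probability mass.
p = softmax(np.array([2.0, 1.0, 0.1]))
```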

The base loss function is as follows:
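In symbols, with p_i the gating probability for expert i, y_i its output and t the target, this loss takes the commonly used form:

```latex
E = \sum_{i} p_i \, (t - y_i)^2
```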

Now, if we need a signal to train each expert, we differentiate the error w.r.t. the output of that expert as follows:
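With the loss written as \(E = \sum_{i} p_i (t - y_i)^2\), the gradient w.r.t. expert i's output is:

```latex
\frac{\partial E}{\partial y_i} = -2\, p_i \, (t - y_i)
```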

If we need a signal to train the softmax gating network (which is a supervised-learning Neural Network), we differentiate the error w.r.t. the inputs of the softmax gating network as follows:
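With the same loss \(E = \sum_{i} p_i (t - y_i)^2\) and x_i the softmax input (logit) for expert i, using \(\partial p_j / \partial x_i = p_j(\delta_{ij} - p_i)\) this works out to:

```latex
\frac{\partial E}{\partial x_i} = p_i \left( (t - y_i)^2 - E \right)
```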

Probability of target under a mixture of Gaussians

In fact, Hinton et al., here, propose a better way of arriving at the probability distribution for predicting the target value. They note that the prediction made by each expert can be thought of as a Gaussian distribution around its output with unit variance. In that case, they propose:

  • Think of each expert as a Gaussian distribution around its output with unit variance.
  • The softmax manager chooses a probability based on the scale of each Gaussian; this scale is called the 'mixing proportion'.
  • Maximize the log probability of the target value under the mixture of Gaussians (equivalent to minimizing the squared errors).

Then, the probability of a target can be thought of:
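In symbols, with unit-variance Gaussians centred on each expert's output y_{c,i} and mixing proportions p_{c,i}, the mixture gives:

```latex
p(t_c) = \sum_{i} p_{c,i} \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(t_c - y_{c,i})^2}{2}}
```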

Here,

  • P_c,i is the mixing proportion assigned to expert 'i' for case 'c'. This is the value assigned by the softmax gating function to a specific output class given the input vector.
  • t_c is the target value expected for case 'c'.
  • y_c,i is the output of expert 'i' for case 'c'.
  • The remaining terms form a unit-variance Gaussian applied to the error.
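A small numpy sketch computing this mixture probability (the expert outputs and mixing proportions below are hypothetical):

```python
import numpy as np

def mixture_probability(target, outputs, mixing):
    """Probability of the target under a mixture of unit-variance
    Gaussians, one centred on each expert's output."""
    gauss = np.exp(-0.5 * (target - outputs) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(mixing * gauss)

outputs = np.array([0.9, 2.0, -1.0])   # y_c,i: expert outputs (hypothetical)
mixing  = np.array([0.7, 0.2, 0.1])    # p_c,i: gating proportions, sum to 1
prob = mixture_probability(1.0, outputs, mixing)
```

The expert whose output lands nearest the target, weighted by its mixing proportion, dominates the resulting probability.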

Drop-outs

In fact, drop-outs as regularizers work on a concept similar to the Committee of Machines. Instead of using different networks to predict the target value, it is proposed that for each training example:

  • We randomly drop out hidden units H from prediction with a probability of 0.5.
  • This is akin to creating mini-networks within the same network: we are sampling from 2^H different networks inside the same Neural Net model.
  • Of these 2^H networks, only a few get trained per training example; most may be trained on only one example each.
  • But since all of these networks share weights within the same network, the overall model is more strongly regularized and often achieves better accuracy than with L1 or L2 regularizers.
  • Drop-outs are fast to set up (no need for multiple networks) and powerful enough to predict with better generalization.
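The mechanism above can be sketched in a few lines of numpy. Note that this sketch uses the "inverted" dropout convention, which scales the surviving activations by 1/(1-p) so the expected activation is unchanged at test time; the scaling detail is my addition, not from the post:

```python
import numpy as np

def dropout(h, p=0.5, rng=None, train=True):
    """Zero each hidden activation with probability p during training,
    scaling survivors by 1/(1-p) (inverted dropout). At test time the
    activations pass through unchanged."""
    if not train:
        return h
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p    # True = unit survives
    return h * mask / (1.0 - p)

h = np.ones(1000)                      # hypothetical hidden activations
out = dropout(h, p=0.5, rng=np.random.default_rng(0))
```

Each call samples one of the 2^H sub-networks; because all sub-networks share the same weight matrix, training them jointly regularizes the full model.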

In conclusion, you learnt about:

  • co-operative error optimization, a powerful collaborative predictor but a poor regularizer;
  • the Mixture of Experts model, which regularizes better using a softmax gating function;
  • drop-outs, a regularizer that reuses the same network architecture without having to spawn new networks.

Nature in fact is a great regularizer…