Outrageously Large Neural Networks: The Sparsely-Gated Mixture of Experts (billions of parameters, same computational cost).

Abhishek Kumar Tiwari · Published in techburst · Dec 3, 2017

The more parameters a neural network has, the more information it can absorb. It sounds that simple, but computational cost quickly spoils the party. At first glance it is surprising that no one had successfully implemented this idea at scale before this paper (earlier efforts are aptly credited in it); in practice, there are major challenges in achieving both high performance and high quality. So let's dive in. Game time! Link to paper.

Introduction:

The general understanding of neural networks is that, given a big enough dataset, you can increase the number of parameters almost blindly and the model will generalise better over that humongous dataset and give better accuracy. But in a typical deep learning model the entire network is activated for every example in every batch, which blows up the computational cost. Unfortunately, even modern GPUs fall short of tackling this issue.

Conditional computation is not new; it has been explored for text, audio and images. The idea is to activate only parts of the network for each example, which roughly limits the number of parameters used per iteration. By choosing those parts wisely, we can increase model capacity without a proportional increase in computational cost. To decide which parts of the network to activate for each example, various gating mechanisms have been proposed: binary, sparse, continuous, stochastic or deterministic. Various forms of reinforcement learning and back-propagation have been proposed for training the gating decisions.

The Sparsely Gated Mixture of Experts Layer

The paper introduces a new type of general-purpose neural network component: the Sparsely-Gated Mixture-of-Experts (MoE) layer. The MoE layer:

  • consists of a number of experts, each a simple feed-forward neural network;
  • has a trainable gating network which selects a sparse combination of the experts to process each input;
  • is trained jointly with the rest of the model by back-propagation.

In the paper, the MoE model is applied to language modelling and machine translation tasks. I feel, and hope, that researchers will come up with similar architectures for segmentation tasks, where this kind of network has the potential to beat the SOTA. For the language modelling task, the MoE is applied convolutionally between stacked LSTM layers: the MoE is called once for each position in the text, selecting a potentially different combination of experts at each position. The different experts tend to become highly specialised based on syntax and semantics. Here is an example from the paper.

The structure of the MoE layer

  • The layer consists of a set of n expert networks E1, E2, …, En and a gating network G whose output is a sparse n-dimensional vector.
  • The experts are themselves neural networks, each with parameters of their own.
  • G(x): output of the gating network for input x.
  • Ei(x): output of the i-th expert network given the input x.
  • The output of the MoE layer is y = Σi G(x)i · Ei(x), the gate-weighted sum of the expert outputs (a minimal sketch of this forward pass follows the list).
  • To exploit the sparsity of G(x) and save computation, Ei(x) is not computed whenever G(x)i = 0. Out of thousands of experts, only a handful are used for each example.
  • If the number of experts is very large, computation can be reduced further with a two-level hierarchical MoE: a primary gating network chooses a sparse combination of experts, each of which is itself a secondary MoE with its own gating network.
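To make the structure concrete, here is a minimal NumPy sketch of a single MoE forward pass for one input vector. The sizes, the random weights and the helper names are my own illustrative choices, not the paper's code; the point is only that the output is the gate-weighted sum of the few experts the gate selects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_hidden, d_out, k = 8, 512, 1024, 512, 2

# Each expert is a one-hidden-layer ReLU network with its own parameters.
expert_weights = [
    (rng.standard_normal((d_in, d_hidden)) * 0.01,
     rng.standard_normal((d_hidden, d_out)) * 0.01)
    for _ in range(n_experts)
]
W_g = rng.standard_normal((d_in, n_experts)) * 0.01  # gating weights

def expert_forward(i, x):
    W1, W2 = expert_weights[i]
    return np.maximum(x @ W1, 0.0) @ W2              # ReLU hidden layer, linear output

def moe_forward(x):
    logits = x @ W_g
    top_idx = np.argsort(logits)[-k:]                # indices of the k largest gate logits
    gates = np.zeros(n_experts)
    gates[top_idx] = np.exp(logits[top_idx] - logits[top_idx].max())
    gates[top_idx] /= gates[top_idx].sum()           # softmax over the kept experts only
    # y = sum_i G(x)_i * E_i(x): only experts with non-zero gates are ever evaluated.
    return sum(gates[i] * expert_forward(i, x) for i in top_idx)

y = moe_forward(rng.standard_normal(d_in))
print(y.shape)                                       # (512,)
```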

Experimental setup of MoE Layer Architecture:

  • Each expert in the MoE layer is a feed forward network with one ReLU-activated hidden layer of size 1024 and an output layer of size 512. Thus, each expert contains [512 ∗ 1024] + [1024 ∗ 512] = 1M parameters.
  • The output of the MoE layer is passed through a sigmoid function before dropout.
  • The number of experts varies between models: the experiments in the paper use ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts.
  • For the hierarchical MoE layers, the first-level branching factor is 16, corresponding to the number of GPUs in the authors' cluster.
  • Noisy Top-K Gating (described below) is used with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers. Thus, each example is processed by exactly 4 experts for a total of 4M ops/time-step. The two LSTM layers contribute 2M ops/time-step each, for the desired total of 8M (a quick arithmetic check of these numbers follows the list).
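As a sanity check on the sizes above, the per-expert parameter count and the ops budget can be verified with simple arithmetic; the script below only restates the figures from the bullet points.

```python
d_model, d_hidden = 512, 1024

params_per_expert = d_model * d_hidden + d_hidden * d_model   # weight matrices only
print(params_per_expert)        # 1_048_576, i.e. roughly 1M parameters per expert

k = 4                           # experts active per example (ordinary MoE)
moe_ops = k * params_per_expert # about 4M multiply-adds per time-step in the MoE layer
lstm_ops = 2 * 2_000_000        # two LSTM layers at roughly 2M ops/time-step each
print(moe_ops + lstm_ops)       # roughly 8M ops/time-step in total
```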

Gating Network

  • Softmax Gating: the obvious, simple baseline, where the input is multiplied by a trainable weight matrix Wg and passed through a softmax. It is effective but not sparse: every expert receives a non-zero gate value.
  • Noisy Top-K Gating: before applying the softmax, tunable Gaussian noise is added, and only the top k values are kept, setting the rest to −∞ (which causes the corresponding gate values to equal 0). The noise term helps with load balancing, and the amount of noise per component is controlled by a second trainable weight matrix Wnoise. A minimal sketch of both variants follows this list.
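Here is a minimal NumPy sketch of the two gating variants. Wg and Wnoise are random here purely for illustration; the noise term follows the paper's formulation H(x)i = (x·Wg)i + StandardNormal()·Softplus((x·Wnoise)i), G(x) = Softmax(KeepTopK(H(x), k)).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_experts, k = 512, 8, 2
W_g = rng.standard_normal((d_in, n_experts)) * 0.01
W_noise = rng.standard_normal((d_in, n_experts)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_gating(x):
    # G(x) = Softmax(x . Wg): simple, but dense: every gate value is non-zero.
    return softmax(x @ W_g)

def noisy_top_k_gating(x):
    # H(x)_i = (x . Wg)_i + N(0, 1) * softplus((x . Wnoise)_i)
    clean = x @ W_g
    noise_std = np.log1p(np.exp(x @ W_noise))   # softplus keeps the noise scale positive
    h = clean + rng.standard_normal(n_experts) * noise_std
    # KeepTopK: everything outside the top k is set to -inf, so its softmax weight is 0.
    masked = np.full(n_experts, -np.inf)
    top_idx = np.argsort(h)[-k:]
    masked[top_idx] = h[top_idx]
    return softmax(masked)

x = rng.standard_normal(d_in)
print(softmax_gating(x).round(3))       # dense: all n_experts entries non-zero
print(noisy_top_k_gating(x).round(3))   # sparse: exactly k non-zero entries
```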

Training of the gating network: the gating network is trained by simple back-propagation, jointly with the rest of the model. If k > 1, the gate values of the top k experts have non-zero derivatives, so gradients flow back into the gating network.

The challenges for such networks:

The shrinking batch problem

Say we have n experts and the gating network activates k of them for each example. Then, for a batch size of b, each expert receives a much smaller batch of approximately kb/n examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases. The solution is to give each expert a much larger effective batch size; here are the ways the paper proposes to do this.

  • Mixing Data Parallelism and Model Parallelism: in conventional data parallelism, each GPU keeps a copy of the model, receives its own batch of data, and the parameters are synchronised through a set of parameter servers. Here, these different batches are run synchronously so that they can be combined for the MoE layer. The standard layers of the model and the gating network are distributed according to conventional data-parallel schemes, but only one shared copy of each expert is kept. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices therefore functions as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples, a factor of d improvement in expert batch size (see the arithmetic sketch after this list).
  • Recurrent MoE with increased batch size: the weight matrix of the RNN or LSTM can simply be replaced by an MoE.
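A quick numerical illustration of the shrinking-batch arithmetic; the values of b, n, k and d below are hypothetical, not the paper's exact settings.

```python
b, n, k, d = 1024, 256, 4, 16     # per-device batch, experts, active experts, devices (made up)

naive_per_expert = k * b / n      # one device splitting its own batch across the experts
shared_per_expert = k * b * d / n # all d devices sharing a single copy of each expert

print(naive_per_expert)   # 16.0 examples per expert: far too small to keep a GPU busy
print(shared_per_expert)  # 256.0 examples per expert: a factor of d = 16 improvement
```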

Balancing expert utilization

As a result of training through back-propagation, the network can settle on a few favourite experts: the favoured experts are trained more rapidly than the others, get selected even more often, and the imbalance reinforces itself. To tackle this, the authors take a soft-constraint approach. Define the importance of an expert, relative to a batch of training examples, as the batch-wise sum of the gate values for that expert. Then define an additional loss L_importance, added to the overall loss function of the model, equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor w_importance. This additional loss encourages all experts to have roughly equal importance. A minimal sketch of this loss follows.
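The sketch assumes we already have the matrix of gate values for a batch; it is random and dense here purely for illustration (a real top-k gate matrix would be sparse).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_experts = 32, 8

# gates[b, i] = gate value of expert i for example b (random stand-in for the gating network output)
gates = rng.random((batch_size, n_experts))
gates /= gates.sum(axis=1, keepdims=True)

w_importance = 0.1                           # hand-tuned scaling factor

importance = gates.sum(axis=0)               # batch-wise sum of gate values, one value per expert
cv = importance.std() / importance.mean()    # coefficient of variation of the importance values
L_importance = w_importance * cv ** 2        # added to the model's overall loss

print(importance.round(2), round(float(L_importance), 6))
```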

Conclusion and my take on the paper

  • It's surprising to me that the authors didn't evaluate different architectures for each expert network. That would have been the first use case that comes to my mind. They mention the possibility in the paper, but I would have loved to see experiments for it.
  • However, it seems that you only get large gains with this technique if you have an enormous amount of GPU memory available for all of these parameters. The sheer number of parameters (and back-propagation through all of them) seems to be a nightmare to handle. Basically, if you're Google, this paper is a solid improvement on language models. If you're us, then get some serious investment capital.
  • I love the computational efficiency of the model presented. The authors achieve extreme sparsity yet fully utilise their GPUs. In particular, the authors design the network in such a way that there are very few connections between active and non-active units. If we have, say, a sparsely activated fully-connected network, most computation would be wasted on network connections that start on active units and end on non-active units.
  • As mentioned, I would love to see such an implementation for semantic segmentation tasks, and I really hope it could surpass the SOTA.
  • How to apply this: TensorFlow has excellent support for sparse tensor operations, so I do think this is possible to do in TensorFlow. Check the link.
