GPipe and Pipeline Parallelism in Neural Networks

Saliya Ekanayake
2 min read · Mar 5, 2019


Source: GPipe (https://arxiv.org/pdf/1811.06965.pdf)

In a previous article, I mentioned how one could incorporate pipeline parallelism to improve the training of a neural network. Well, it seems you can now use it in practice with GPipe.

… you can stream the data items in your minibatch, where one item may be in the forward pass in layer X, while the other item is in the forward pass in layer 1

The main objective of that article, however, was to explain why the phrase “model parallelism”, as used in today’s deep learning context, is misleading. To recap, let’s look at the axes of parallelization in a deep neural network.

Source: Model Parallelism in Deep Learning is NOT What You Think (https://medium.com/@esaliya/model-parallelism-in-deep-learning-is-not-what-you-think-94d2f81e82ed)

While model parallelism should mean splitting the model horizontally across processes, it is often used to indicate workload partitioning, which splits the model vertically across processes (or devices). The difference is that in the former, computations can happen concurrently, whereas in the latter each device sits idle waiting for its predecessor partition to finish computing on another device.
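To make the distinction concrete, here is a minimal PyTorch sketch of workload partitioning. The toy model, layer sizes, and device names are all hypothetical; the point is only that the two stages run one after the other.

```python
import torch
import torch.nn as nn

# Hypothetical vertical split of a toy model across two GPUs.
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

def partitioned_forward(x):
    h = stage1(x.to("cuda:0"))     # cuda:1 sits idle while this runs
    return stage2(h.to("cuda:1"))  # cuda:0 sits idle while this runs
```

For a single mini-batch, the two devices never compute at the same time, which is exactly the underutilization described next.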

A recent post from Google introducing GPipe succinctly describes this inefficiency, or limitation, of current workload partitioning as follows.

But due to the sequential nature of DNNs, this naive strategy may result in only one accelerator being active during computation, significantly underutilizing accelerator compute capacity

So how does GPipe overcome this underutilization using pipeline parallelism?

It splits your mini-batch into smaller groups of data items, which the authors call micro-batches, and streams them through the vertical partitions of your model. If you are wondering why this works, it is because in mini-batch training you update weights once per mini-batch, not after each item in the mini-batch. The immediate benefit is that the same weights can be shared across all the micro-batches, so different micro-batches can be in flight on different partitions at the same time. This is how GPipe exploits pipeline parallelism to improve performance.
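As a rough illustration (not the actual GPipe implementation, which also schedules the backward pass and re-materializes activations to save memory), here is a simplified sketch that splits a mini-batch into micro-batches and streams them through the two hypothetical stages from the earlier sketch, so both GPUs can be busy at once:

```python
import torch

def pipelined_forward(x, stage1, stage2, num_micro_batches=4):
    # Split the mini-batch into micro-batches and stream them through
    # the two stages. Because GPU kernel launches are asynchronous,
    # stage1 can start on micro-batch i+1 on cuda:0 while stage2 is
    # still processing micro-batch i on cuda:1. Weights are updated
    # only once per mini-batch, so every micro-batch sees the same weights.
    outputs, pending = [], None
    for mb in x.chunk(num_micro_batches):
        h = stage1(mb.to("cuda:0"))
        if pending is not None:
            outputs.append(stage2(pending))
        pending = h.to("cuda:1", non_blocking=True)
    outputs.append(stage2(pending))
    return torch.cat(outputs)
```

The more micro-batches you stream, the smaller the fraction of time each device spends waiting, which is the performance win GPipe is after.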
