Big Models Era: Techniques, challenges & systems to train and serve large models

Anjali singh
7 min read · Sep 27, 2022


Big models have become prominent, with model sizes growing more than 10x per year. OpenAI's gigantic GPT-3 model, with 175 billion (B) parameters, is recognized as one of the most important breakthroughs of this decade, and people keep training larger and larger models. For example, two months ago Meta open sourced a pre-trained model called OPT with 175B parameters, making large language models at this scale accessible to everyone for the first time. A few days ago the BigScience initiative open sourced a language model called BLOOM with 176B parameters. Larger and larger models are thus now available as open source.

How can we embrace larger and larger models? While larger models provide a lot of benefits, they also pose new computing challenges. Many language models, such as BERT, can fit in the memory of a single device such as a GPU (16–40 GB), so under this assumption computation is conducted on a single device using frameworks like TensorFlow and PyTorch without any problem. However, when the model becomes large, it is hard to fit even a single layer of the model into that limited memory. So, how can we serve such a big model?

The only possible way is to parallelize and partition the big model across more devices. However, this requires a lot of engineering effort that is specific to the model definition and the training cluster.

So, how are big models trained and served? Understanding this helps us embrace such models in research and development.

How big is a big model?

Any model that can't fit into the memory of a single device is a big model. Even if you have access to GPUs of different sizes, say 40 GB, 100 GB, etc., if your model can't fit on one GPU, it is a big model.
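As a rough back-of-the-envelope check (a minimal sketch with illustrative numbers, not from the original talk), you can estimate whether a model's parameters alone fit in one GPU's memory; training needs far more memory on top of this for gradients, optimizer states, and activations.

```python
# Rough check: do a model's parameters alone fit in one GPU's memory?
# (Illustrative numbers; fp16 assumed at 2 bytes per parameter.)

def params_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to store the parameters, in GB."""
    return num_params * bytes_per_param / 1e9

for name, n in [("BERT-large", 340e6), ("GPT-3 / OPT-175B", 175e9)]:
    need = params_memory_gb(n)
    print(f"{name}: ~{need:,.0f} GB of weights "
          f"({'fits' if need <= 40 else 'does not fit'} in a 40 GB GPU)")
```

BERT-large comes out to under 1 GB of weights, while a 175B-parameter model needs roughly 350 GB for the weights alone, far beyond any single GPU.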

The idea is to handle your model with the GPUs you already have, without needing to upgrade them.

In the coming future, training large models in parallel is going to become the norm. But parallel training is an extremely complex problem.

You must be wondering: how can we train a model in parallel? Why can't we just rely on more advanced, higher-performance hardware such as RAM, CPUs, GPUs, TPUs, etc.?

To answer this, we need to understand the supply-and-demand crunch between hardware performance and memory on one side and ML workloads on the other.

Two laws have governed the computing industry for decades: Moore's law (the number of transistors doubles roughly every two years) and Dennard scaling (power density stays constant as transistors shrink).

What this means is that while the number of transistors doubles every two years, power consumption remains constant: "same packaging, same everything, and you are still not going to overcook your chips." There are two important corollaries of these laws.

First, memory capacity doubles roughly every two years without increasing power consumption.

Second, CPU core performance doubles every 18 months. This is because, in addition to the doubling of transistors, the latency between transistors decreases as they are packed closer together. This extra boost turns performance doubling every two years into doubling every 18 months. Unfortunately, Moore's law has ended.

In the referenced plot, the green area shows the golden age of Moore's law, where processor performance was indeed doubling every 18 months. But at the beginning of this millennium the performance increase started to slow down. Today, core performance improves by just a few percent every 18 months. Meanwhile, the compute demands of ML are exploding.

According to the study above, the compute demand to train an ML model has doubled every 5.7 months over the past 12 years, which is equivalent to compute demand growing roughly 10 times every 18 months.
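A quick sanity check on that equivalence (just the arithmetic, sketched in Python):

```python
# Demand doubles every 5.7 months; how much growth is that over 18 months?
growth_over_18_months = 2 ** (18 / 5.7)   # ~3.16 doublings
print(f"{growth_over_18_months:.1f}x")    # ~8.9x, i.e. roughly 10x every 18 months
```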

One reason is that larger models have been shown to achieve much better accuracy.

The plot from the seminal GPT-3 paper published in 2020 shows that accuracy increases significantly with model size across a variety of training and fine-tuning settings, including zero-shot, one-shot, and few-shot learning.

Another reason for the emergence of larger models is that they have been shown to be effective at learning multiple tasks while also improving accuracy on the individual tasks.

Putting things together by overlaying Moore's-law performance growth and actual core performance growth on the previous plot, you can see that the gap between the demand for training state-of-the-art models and core performance is huge and growing rapidly.

But wait, what about specialized hardware?

A plethora of specialized hardware has been released. It has been of huge help, but it still falls short of what is needed to train state-of-the-art models.

This gap continues to grow exponentially. Even if ML models were to stop increasing in size, it would take decades for processors to catch up.

This is also true for memory demand, not just for compute demand (core performance).

Around four years ago the largest models still fit on a single GPU; fast forward to today and we would need hundreds to thousands of GPUs just to fit all the parameters. Hardware specialization won't help here the way it does for compute: for memory, we would still need at least one transistor to store one bit of information. This means that to support ML workloads there is no other way than to parallelize training.

Several distributed systems have been developed to speed up training. The most common approach is data-parallel training: train the model on different batches of data in parallel and periodically synchronize (average) the gradients or weights. However, this solution still assumes the model fits on a single machine/GPU. If the model can't fit on a single GPU, there is no choice but to parallelize the model itself. But how?
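Here is a minimal data-parallel sketch in PyTorch, assuming a multi-GPU machine and a launch via torchrun (the model, data, and loss are placeholders): each process holds a full replica of the model, trains on its own shard of the batch, and DistributedDataParallel averages the gradients across replicas during the backward pass.

```python
# Minimal data-parallel training sketch (PyTorch DDP).
# Assumes a launch like: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                  # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)   # placeholder model that fits on one GPU
    model = DDP(model, device_ids=[rank])            # replicate model, all-reduce gradients
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 1024, device=rank)       # each replica sees its own shard of the batch
        loss = model(x).pow(2).mean()                # toy loss
        opt.zero_grad()
        loss.backward()                              # gradients are averaged across replicas here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```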

Consider the forward pass of a neural network. The most expensive operation is computing the output of each layer, a tensor operation that multiplies the layer's input with a weight matrix.
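In code, that core operation is a single (large) matrix multiplication per layer; a trivial sketch with made-up shapes:

```python
import torch

batch, d_in, d_out = 32, 4096, 4096
x = torch.randn(batch, d_in)   # layer input (activations)
W = torch.randn(d_in, d_out)   # layer weight matrix
y = x @ W                      # the dominant cost of the forward pass: one matmul
print(y.shape)                 # torch.Size([32, 4096])
```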

One option is to partition the model by layers or stages. Each matrix multiplication is still computed on a single GPU, but different operators run on different GPUs. This is called inter-operator parallelism. Its complexity comes from the need to pipeline execution along both the forward and backward propagation paths.
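A minimal inter-operator-parallel sketch (a toy two-layer model, assuming at least two GPUs are available; no pipeline schedule is shown): each stage lives on its own device and the activations are moved between them.

```python
import torch
import torch.nn as nn

# Inter-operator (pipeline-style) parallelism: whole layers placed on different GPUs.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(4096, 4096).to("cuda:0")   # first layer on GPU 0
        self.stage1 = nn.Linear(4096, 4096).to("cuda:1")   # second layer on GPU 1

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))      # runs entirely on GPU 0
        return self.stage1(h.to("cuda:1"))   # activations shipped to GPU 1

model = TwoStageModel()
out = model(torch.randn(32, 4096))
print(out.device)  # cuda:1
```

In a real system, micro-batches are pipelined so that both GPUs stay busy during the forward and backward passes; that scheduling is where most of the engineering complexity lives.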

The GPUs involved can be on the same machine or on different machines, which yields very different communication characteristics and therefore impacts performance.

Another way is to partition a stage or layer itself, which amounts to partitioning the matrix multiplication operation. This is called intra-operator parallelism, since a single operator is now executed across multiple GPUs.
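A minimal intra-operator-parallel sketch (CPU tensors here, just to show the math; a real system would place each shard on a different GPU and add the communication): the weight matrix of one layer is split column-wise, each device computes a partial output, and the results are gathered.

```python
import torch

# Intra-operator (tensor) parallelism: one matmul split across "devices".
x = torch.randn(32, 4096)        # input, replicated on every device
W = torch.randn(4096, 4096)      # full weight matrix of one layer

W0, W1 = W.chunk(2, dim=1)       # column-wise shards, one per device
y0 = x @ W0                      # partial output on device 0
y1 = x @ W1                      # partial output on device 1
y = torch.cat([y0, y1], dim=1)   # gather the shards (an all-gather in practice)

assert torch.allclose(y, x @ W, atol=1e-4)
```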

Challenges:

Optimizing the training of a model that doesn't fit on a single machine/GPU can be extremely challenging.

The search (optimization) space is huge:

a) Data parallelism

b) Model parallelism: inter-operator parallelism and intra-operator parallelism

c) A growing diversity of resources such as GPUs and TPUs; these processors have different characteristics that affect training performance, so they must be taken into account

d) Huge diversity in the architectures of the neural networks we need to train at scale

These challenges have led to a plethora of systems for training very large models being developed over the past couple of years.

At a high level, these systems need to decide how to decompose the model, which device to assign each component of the model to, how to route data between components, and when to schedule the execution of each component. They can be categorized into the following groups:

1. Systems specialized for Transformer-based architectures, which may require hand-tuning; these include NVIDIA's Megatron-LM and Google's GShard (with mixture of experts).

2. Systems that support a large variety of neural network architectures but employ only a subset of parallelization methods. In doing so, these systems effectively trade the quality of the optimization for the time it takes to find it.

3. Alpa, the first fully automatic optimizer for training large models, which can match the performance of existing hand-tuned solutions.

To dive deeper into how these existing systems parallelize training, I will be writing part 2 of this article.

The reference for this article is an ICML talk presented by a professor from UC Berkeley.

Stay tuned for more such detailed posts.

HAPPY LEARNING !!!


Anjali singh

Data Scientist / MLE with more than 6 years of experience in IT and expertise in Python, AWS, Azure & R.