Scaling a massive State-of-the-Art Deep Learning model in production

Lysandre Debut
Jun 24 · 8 min read

Last week, at Hugging Face, we launched a groundbreaking new text editor app. It’s different from traditional text editors in that an NLP model can complete your sentences if you ask it to, bringing a new dimension to “writing with a machine”. It’s based on GPT-2, OpenAI’s language model that can generate syntactically accurate sentences and coherent paragraphs of text.

Telling a story with GPT-2’s help

The demo is live and you’re welcome to try it out! 🦄 Write with transformer is to writing what calculators are to calculus.

This model is part of the latest trend in NLP, which revolves around very large language models that obtain excellent results on a variety of tasks when fine-tuned on those specific tasks. The result is Transformer models with very large numbers of parameters (up to 1.5 billion for GPT-2 Large, or Grover), which are difficult to handle simply because of their size.

Our app allows the user to choose between two models: GPT-2 small, and GPT-2 medium. Loading them both in the computer’s RAM takes a total of 2.4GB of memory.

Here we present the approach we took to scale these models and serve the 10,000 unique users, who wrote the equivalent of more than a hundred books, we saw in the first few days. We explain the reasoning behind it, describe the architecture we found best suited for efficient processing, and discuss what we could have improved.

Issue at hand

This app has several constraints if it is to be enjoyable to use. It must have the lowest possible response time and generate long-enough sentences. The system must offer several possible completions at each trigger so that the user may choose one of them, tripling the amount of data to be generated. The goal is therefore to optimize the computation as much as possible, creating a workflow that takes advantage of the highly parallelizable nature of GPUs.

Setting up our workspace

We used Falcon for the web servers (any other HTTP framework would have worked too) in conjunction with gunicorn to run our instances and balance the load. Our own GPT-2 PyTorch implementation is the backbone of this project. We have a few examples in our examples directory if you’re interested in doing something similar.

Gunicorn sets up “workers” which independently run the application, efficiently balancing the load across them. You can read exactly how they work in the official gunicorn documentation.

3-way autocompletion

The most naïve approach would be a single worker with the model loaded behind it:

Naïve API

Using this architecture, every request would be treated sequentially, and the model would be prompted to generate three different sentences before responding to the incoming request.

This infrastructure could easily be scaled up by adding more workers, keeping in mind that each worker loads the model into RAM, or into VRAM when running on a GPU.

Multi-worker naïve API

Using this approach implies that we have processes each loading the model and operating on it, each asked for three different sentences. If our model is able to perform batch inference, it can generate the three sentences at once. However, if it cannot, it needs to generate each sentence individually, resulting in three model iterations. We will consider the case where batch inference is not available, as it requires a slightly more engineered approach.
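The distinction can be sketched with dummy generation functions standing in for the real GPT-2 forward pass (the function names here are illustrative, not the actual model API):

```python
# Contrast the two inference modes with stand-ins for the model.
def generate_one(context):
    """Single-sequence generation: one model iteration per sentence."""
    return context + " [completion]"


def generate_batch(contexts):
    """Batch generation: one model iteration for the whole batch."""
    return [c + " [completion]" for c in contexts]


context = "Once upon a time"

# Without batch inference: three sequential model iterations.
sequential = [generate_one(context) for _ in range(3)]

# With batch inference: a single iteration produces all three sentences.
batched = generate_batch([context] * 3)
```

Both paths yield three completions, but the sequential path costs roughly three times the wall-clock time of a single forward pass.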

It would be better to parallelize the three iterations as we are looking for the lowest response time on autocompletion. Luckily for us, Python gives us access to several parallelization options that could be of use in our scenario:


Threading

Python threads are constrained by the Global Interpreter Lock (GIL): if a thread is working with our model object, no other thread can run Python code on it until the first thread has finished. This approach is therefore equivalent in execution to not using threads at all, as the three iterations will still be treated sequentially. The only performance difference is the additional time spent starting and joining each thread, which works against our objective.

If one really wanted to use threading, three different models could be loaded into RAM, each used by a separate thread. We did not go this way, as explained further below.


Multiprocessing

A tricky part here is making sure the model doesn’t have to be loaded into RAM every time an inference is computed; big models take a long time to load into memory.

We chose to take yet another, different approach.

Our approach using gunicorn load balancing

Final model with two different Falcon/Gunicorn servers

When a request is sent from the front-end app to our API, it is handled by our first web server. This web server has a single worker that runs our API. This API is responsible for sending three identical requests to the second web server. The requests sent from this API contain the current context (the previous sentences in the document) as well as some information regarding the parameters (small or medium model, specific top_k values, …).

This second web server has several workers which handle requests independently. Three workers handle the three requests received from the API, which can therefore be processed simultaneously. We use separate threads in the API so that the requests are sent to the second web server in parallel rather than sequentially (they are HTTP requests waiting on I/O, so the GIL is not an issue here).
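The fan-out in the first server can be sketched with a thread pool; `fetch_completion` is a hypothetical stand-in for the HTTP POST (e.g. `requests.post(...)`) to the second server, which releases the GIL while waiting on the network:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_completion(payload):
    # Stand-in for an HTTP POST to the second web server; the real call
    # blocks on I/O, so threads genuinely run in parallel here.
    return payload["context"] + " [completion %d]" % payload["seed"]


def autocomplete(context, model="gpt2-small", top_k=40, n=3):
    payloads = [
        {"context": context, "model": model, "top_k": top_k, "seed": i}
        for i in range(n)
    ]
    # Three requests go out at once; gunicorn's load balancing hands
    # each one to a different worker on the second server.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(fetch_completion, payloads))
```

The caller gets all three completions back as one list, ready to be offered to the user.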

This architecture has several advantages that the other methods mentioned previously do not have out of the box:

  • We can spawn as many workers as there are models that fit in our memory. On a distributed system, we split the workers among the different GPUs.
  • Each worker loads a single model into memory. More models can therefore stay loaded (and hence more computing power used) than with approaches that load three models per process, such as the threading approach.
  • Launched as web server workers, the models always stay loaded in memory.
  • We’re making use of gunicorn’s load balancing at every step of our architecture. We are not simply spawning processes that run in parallel; we have a way to ensure each process handles a load proportional to its computing capabilities. If we used two GPUs of different computing power, the bottleneck created by the slower GPU wouldn’t impact the other one as much as it would in a purely multi-process program.
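One way to wire this up is through a gunicorn configuration file that pins each worker of the second server to a GPU before the model loads. This is a hypothetical sketch (the worker count, port, and round-robin scheme are assumptions, not our exact config), using gunicorn’s real `post_fork` server hook:

```python
# gunicorn.conf.py -- hypothetical config for the second web server.
import os

workers = 6                 # as many workers as models fit in memory
bind = "0.0.0.0:8001"

NUM_GPUS = 4                # e.g. a single K80 machine exposes 4 GPUs


def post_fork(server, worker):
    # Runs in each worker right after fork, before the app (and model)
    # is loaded: pin this worker to one GPU, round-robin by worker age.
    gpu_id = worker.age % NUM_GPUS
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```

With `CUDA_VISIBLE_DEVICES` set before the model loads, each worker sees only its assigned GPU, spreading the models evenly across the board.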

Here is a GIF showing how the architecture behaves for memory management during initialization and when two concurrent requests are sent to the API.

Initialization and concurrency behavior


This system is particularly suited to vertical scaling, as it adapts to the system’s memory and computing power. However, it does not match a model that can perform batch inference: this approach keeps three models in memory where batch inference would need only one.

Further improvements

An additional improvement could be the use of the TorchScript module. Since we used PyTorch for our model, we could export a TorchScript version of it that can be used for inference from any programming language. We could therefore have built a leaner, more task-specific web server in a lower-level language if we wanted to optimize to the fullest.
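The export itself boils down to tracing the model and saving the resulting archive. A minimal sketch with a toy module in place of GPT-2 (the real export would trace the actual model with a sample input batch):

```python
import io

import torch


class Toy(torch.nn.Module):
    """Toy stand-in for the real model being exported."""

    def forward(self, x):
        return x * 2


model = Toy().eval()
example = torch.ones(1, 3)

# Trace the model into a TorchScript archive; the archive can then be
# loaded and executed from C++ (libtorch) or other bindings, no Python
# runtime required.
traced = torch.jit.trace(model, example)

buffer = io.BytesIO()          # a file path works here too
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)
```

The reloaded module behaves identically to the traced one, which is what lets a non-Python server run the same computation.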

This system has proven its worth: it has held up under the load so far, handling more than 100,000 requests in a week’s time while running on a single 4-GPU (K80) machine. If you would like to see how our system responds to traffic, you’re welcome to try out the app here 🦄

This concludes this quick post on the system architecture we had to optimize for parallel computing, using our big Transformer model in production. All thoughts and claps are welcome!


Stories @ Hugging Face

Thanks to Clément Delangue, Julien Chaumond, and Victor Sanh


