Scaling a massive State-of-the-Art Deep Learning model in production

Published in

HuggingFace

8 min readJun 24, 2019

Last week, at Hugging Face, we launched a new groundbreaking text editor app. It’s different from traditional text editors in that an NLP model can complete your sentences if you ask it to, bringing a new dimension to “writing with a machine”. It’s based on GPT-2, OpenAI’s language model that can generate syntactically accurate sentences and coherent paragraphs of text.

The demo is live on https://transformer.huggingface.co and you’re welcome to try it out! 🦄 Write with transformer is to writing what calculators are to calculus.

This model is part of the latest trends in NLP which revolve around creating very large language models that obtain excellent results on a variety of tasks when fine-tuned on those specific tasks. This results in models, “Transformers”, with large amounts of parameters (up to 1.5 billion parameters for GPT-2 Large, or Grover), which are difficult to handle because of their weight.

Our app allows the user to choose between two models: GPT-2 small, and GPT-2 medium. Loading them both in the computer’s RAM takes a total of 2.4GB of memory.

Here we offer to show the approach we took in order to scale these models and respond to the 10,000 unique users and the equivalent of more than a hundred books written we got in the first few days. We explain the thoughts that went into it, define the best fitting architecture for optimal processing and discuss what we could have improved on.

Issue at hand

Disclaimer: our approach here is specific to models that cannot perform batch inference. For models that can do batch inference, like the one we used, the shown workaround may not be necessary.

This app has several constraints in order to be enjoyable by users. It must have the lowest possible response time and generate long-enough sentences. The system must offer several possible completions at each trigger so that the user may choose one of them, tripling the amount of data to be generated. The goal is, therefore, to optimize as best as possible the computation, creating a workflow taking advantage of the highly parallelizable aspect of GPUs.

Setting up our workspace

We’ll be building a server-side API to which our front-end app will connect. This API will be responsible for handling the computation needed to generate sentences. We’ll be using Python for this task as most NLP models are readily available. Other lower-level languages such as C++ or Rust would be more appropriate for performance-oriented backends, and we discuss their usage in the last part of this post.

We used falcon for the web servers(any other http framework would have worked too) in conjunction with gunicorn to run our instances and balance the load. Our own GPT-2 Pytorch implementation is the backbone of this project. We have a few examples in our examples directory if you’re interested in doing something similar.

Gunicorn sets up “workers” which will independently run the application, efficiently balancing the load across different workers. You can check exactly how they work on the official gunicorn documentation.

3-way autocompletion

On each autocompletion request, we want our API to return three different possible sentences. These sentences will be displayed to the end-user who will then choose one between the three. This is an essential part of our design, and our API must reflect that. The three sentences should appear at the same time and optimally a unique request should be sent to the server for each autocompletion.

The most naïve approach we could have is using a single worker with a model loaded behind:

Using this architecture, every request would be treated sequentially, and the model would be prompted to generate three different sentences before responding to the incoming request.

This infrastructure could be easily scaled up by adding more workers while keeping in mind that each worker loads the model in the RAM/VRAM according to GPU usage or not.

Using this approach implies that we have processes loading the model and operating on them, requesting three different sentences. If our model is able to perform batch inference, it can generate the three sentences at once. However, if it cannot, it needs to generate each sentence individually — resulting in three model iterations. We will be considering the case where batch inference is not available as it requires a slightly more engineered approach.

It would be better to parallelize the three iterations as we are looking for the lowest response time on autocompletion. Luckily for us, Python gives us access to several parallelization options that could be of use in our scenario:

Multithreading (threading)
Multiprocessing (subprocess or multiprocessing)
Distinct web servers as a form of multiprocessing (our approach)

Multithreading

Multithreading in Python is usually done using the threading class, which allows the program to create several threads that will each go on about their respective operations. The problem with multithreading is the way the Global Interpreter Lock — or GIL — works in Python.

If a thread accesses our model object, then no other thread can access that object until the first thread has finished dealing with it. This approach is therefore similar in execution to not using any thread at all, as the three iterations will be treated sequentially. The only performance difference will be the additional time spent starting/joining each thread, which is detrimental to our objective.

If one really wanted to use threading, three different models could be loaded into the RAM, each being used by a separate thread. We did not choose to go this way as explained further below.

Multiprocessing

Multiprocessing can go two ways; either by booting up completely separate processes and connecting to their input/output (using the subprocess module) or by spawning python processes that can inherit the current Python interpreter process’ resources (bypassing the GIL issue, using the multiprocessing module).

A tricky part here is making sure the model doesn’t have to be loaded into the RAM every time it has to compute an inference; big models take a long time to load in memory.

We chose to take yet another, different approach.

Our approach using gunicorn load balancing

Our approach is slightly different in that we choose to use the power of gunicorn workers to parallelize our work. In order to do so, we add another layer to our previous model. The previously defined architecture can receive several requests and process them all at once on several workers. We will use that to our advantage. The final model is detailed below.

Final model with two different Falcon/Gunicorn servers

When a request is sent from the front-end app to our API, it is handled by our first web server. This web server has a single worker that runs our API. This API is responsible for sending three identical requests to the second web server. The requests sent from this API contain the current context (the previous sentences in the document) as well as some information regarding the parameters (small or medium model, specific top_k values, …).

This second web server has several workers which handle the requests separately. Three workers will handle each request received from the API, which can, therefore, be handled simultaneously. We use separate threads in the API so that requests can be sent to the second web server in parallel rather than sequentially (http requests -> no GIL issue).

This architecture has several advantages that other, previously mentioned methods, do not have out-of-the-box:

We can generate as many workers as the number of models that can fit in our memory. We split the workers among the different GPUs if we have a distributed system.
Each worker loads a single model in memory. Therefore, there may be more models loaded (more computing power) than if three models were loaded each time, such as for the threading approach.
Launched as webserver workers, the models will always stay loaded in memory.
We’re making use of gunicorn’s load balancing at every step in our architecture. We are not simply spawning processes running in parallel, we have a way to make sure each process handles loads relative to its computing capabilities. If we were to use two different GPUs of different computing power, the bottleneck created by the lower computing GPU wouldn’t impact the other one as much as it would in a purely multi-process program.

Here is a GIF showing how the architecture behaves for memory management during initialization and when two concurrent requests are sent to the API.

Results

Unsurprisingly, we obtain large improvements in response time when using a parallel system compared to the initial sequential system. Benchmarking on a single request which has to be broken up in three model iterations, we get a third of the initial response time, the actual local http request only taking a few hundred microseconds.

This system is particularly adapted to vertical scaling as it adapts to the system’s memory and computing power. However, it does not compare to a model that can perform batch inference as this approach will store three models in the memory, versus a single one if using batch inference.

Further improvements

This system was designed to be run on a single machine, so we didn’t consider containerization or horizontal scaling. These are welcome and necessary in the case of a full-blown production system that needs to handle the 100,000s of users. This can be discussed in a future post.

An additional improvement could be the use of the TorchScript module. Since we used Pytorch for our model, we could see a TorchScript version of it that could be used to do inference in any programming language. We could, therefore, have optimized a better, more task-specific web server in a very low-level language if we wanted to optimize to the fullest.

This system has proven its worth as it held the load until now, handling more than 100,000 different requests in a week’s time while running on a single 4-GPU (K80) machine. If you would like to try out the app and see how our system responds to traffic, you’re welcome to try it out here 🦄

This concludes this quick post on the system architecture we had to optimize for parallel computing, using our big Transformer model in production. All thoughts and claps are welcome!