Better performance by optimizing Gunicorn config
Practical advice on how to configure Gunicorn.
TL;DR: For CPU-bound apps, increase workers and/or cores. For I/O-bound apps, use “pseudo-threads”.
Gunicorn implements a UNIX pre-fork web server.
Great, what does that mean?
- Gunicorn starts a single master process that gets forked, and the resulting child processes are the workers.
- The role of the master process is to make sure that the number of running workers matches the number defined in the settings. If any worker dies, the master process starts another one by forking itself again.
- The role of the workers is to handle HTTP requests.
- The pre in pre-forked means that the master process creates the workers before handling any HTTP request.
- The OS kernel handles load balancing between worker processes.
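The pre-fork model described above can be sketched with os.fork. This is a minimal illustration under stated assumptions, not Gunicorn's actual implementation: the function name spawn_workers is hypothetical, each worker only reports its PID through a pipe instead of serving HTTP requests, and dead workers are not replaced. It runs on UNIX-like systems only.

```python
import os


def spawn_workers(num_workers):
    """Fork num_workers children up front and collect the PID each one reports.

    Hypothetical helper for illustration only; Gunicorn's master process
    does far more (signal handling, restarting dead workers, and so on).
    """
    read_fd, write_fd = os.pipe()
    forked_pids = []

    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child process: this is a "worker". A real worker would now
            # accept HTTP requests; here it only reports its PID and exits.
            os.close(read_fd)
            os.write(write_fd, f"{os.getpid()}\n".encode())
            os._exit(0)
        # Parent process: this is the "master". Remember the child's PID.
        forked_pids.append(pid)

    # The master never writes, so it closes its copy of the write end.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported_pids = [int(line) for line in reader]

    # Reap the children so they do not linger as zombies.
    for pid in forked_pids:
        os.waitpid(pid, 0)

    return sorted(forked_pids), sorted(reported_pids)


if __name__ == "__main__":
    forked, reported = spawn_workers(3)
    print("master forked:", forked)
    print("workers reported:", reported)
```

All workers exist before any work arrives, which is the "pre" in pre-fork.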
To improve performance when using Gunicorn we have to keep in mind 3 means of concurrency.
1st means of concurrency (workers, aka UNIX processes)
Each of the workers is a UNIX process that loads the Python application. There is no shared memory between the workers.
The suggested number of workers is (2*CPU)+1. For a dual-core (2 CPU) machine, 5 is the suggested workers value.
gunicorn --workers=5 main:app
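The same value can be derived from the machine instead of being hard-coded. The following is a sketch of a Gunicorn configuration file, assuming it is saved as gunicorn.conf.py; the helper name suggested_workers is hypothetical, while workers is the setting name Gunicorn reads from the file.

```python
# gunicorn.conf.py (sketch)
import multiprocessing


def suggested_workers(cpu_count):
    """Return the (2*CPU)+1 worker count suggested in this article."""
    return 2 * cpu_count + 1


# Gunicorn picks up the module-level name `workers` from its config file.
workers = suggested_workers(multiprocessing.cpu_count())
```

With this file in place, the server could be started with gunicorn -c gunicorn.conf.py main:app.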
2nd means of concurrency (threads)
Gunicorn also allows for each of the workers to have multiple threads. In this case, the Python application is loaded once per worker, and each of the threads spawned by the same worker shares the same memory space.
To use threads with Gunicorn, we use the threads setting. Every time that we use threads, the worker class is set to gthread:
gunicorn --workers=5 --threads=2 main:app
The previous command is the same as:
gunicorn --workers=5 --threads=2 --worker-class=gthread main:app
The maximum number of concurrent requests is workers * threads, which is 10 in our case.
The suggested maximum number of concurrent requests when using workers and threads is still (2*CPU)+1.
So if we are using a quad-core (4 CPU) machine and we want to use a mix of workers and threads, we could use 3 workers and 3 threads, to get 9 maximum concurrent requests.
gunicorn --workers=3 --threads=3 main:app
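The arithmetic behind these two commands can be checked with a short sketch. Both function names are hypothetical, and the limit assumes the (2*CPU)+1 guideline used throughout this article.

```python
def max_concurrent_requests(workers, threads):
    """Each worker handles up to `threads` requests at once."""
    return workers * threads


def within_suggested_limit(workers, threads, cpu_count):
    """Check workers * threads against the (2*CPU)+1 guideline."""
    return max_concurrent_requests(workers, threads) <= 2 * cpu_count + 1


# --workers=5 --threads=2
print(max_concurrent_requests(5, 2))    # 10
# --workers=3 --threads=3 on a quad-core machine
print(max_concurrent_requests(3, 3))    # 9
print(within_suggested_limit(3, 3, 4))  # True
```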
3rd means of concurrency (“pseudo-threads”)
Some Python libraries, such as gevent, enable concurrency through “pseudo-threads”. Gunicorn allows for the usage of these asynchronous Python libraries by setting their corresponding worker class.
Here are the settings that would work for a single-core machine that we want to run using gevent:
gunicorn --worker-class=gevent --worker-connections=1000 --workers=3 main:app
worker-connections is a specific setting for the gevent worker class.
(2*CPU)+1 is still the suggested number of workers. Since we only have 1 core, we’ll be using 3 workers.
In this case, the maximum number of concurrent requests is 3000 (3 workers * 1000 connections per worker)
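The same gevent setup can be written as a configuration file. This is a sketch, assuming it is saved as gunicorn.conf.py and that the gevent package is installed alongside Gunicorn; worker_class, worker_connections and workers are the config-file names of the command-line flags used above.

```python
# gunicorn.conf.py (sketch for a single-core machine)

# Use the gevent worker class, i.e. "pseudo-threads".
worker_class = "gevent"

# Maximum number of simultaneous connections per gevent worker.
worker_connections = 1000

# (2*CPU)+1 with 1 core.
workers = 3

# Upper bound on concurrent requests:
# workers * worker_connections = 3 * 1000 = 3000
```

It could then be loaded with gunicorn -c gunicorn.conf.py main:app.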
Concurrency vs. Parallelism
- Concurrency is when 2 or more tasks are in progress over the same period of time, which might mean that only 1 of them is being worked on while the others are paused.
- Parallelism is when 2 or more tasks are executing at the same time.
In Python, threads and pseudo-threads are a means of concurrency, but not parallelism; while workers are a means of both concurrency and parallelism.
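A small experiment can make the concurrency half of this distinction concrete. The sketch below uses hypothetical function names and time.sleep as a stand-in for a blocking I/O call. Because the threads spend their time waiting rather than computing, the waits overlap and the total time stays close to the time of a single request. CPU-heavy work in the same threads would not overlap in this way on a standard CPython build with the GIL.

```python
import threading
import time


def fake_io_request(delay):
    """Stand-in for a blocking I/O call; sleeping does not occupy the CPU."""
    time.sleep(delay)


def run_concurrently(num_requests, delay):
    """Run num_requests fake I/O calls in threads and return the elapsed time."""
    threads = [
        threading.Thread(target=fake_io_request, args=(delay,))
        for _ in range(num_requests)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_concurrently(4, 0.2)
    # Run one after another, these would take about 0.8s in total.
    print(f"4 requests of 0.2s each finished in {elapsed:.2f}s")
```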
That’s all good theory, but what should I use in my program?
Practical use cases
By tuning Gunicorn settings we want to optimize the application performance.
- If the application is I/O-bound, the best performance usually comes from using “pseudo-threads” (gevent or asyncio). As we have seen, Gunicorn supports this programming paradigm by setting the appropriate worker class and adjusting the value of workers to (2*CPU)+1.
- If the application is CPU-bound, it doesn’t matter how many concurrent requests are handled by the application. The only thing that matters is the number of parallel requests. Due to Python’s GIL, threads and “pseudo-threads” cannot run in parallel. The only way to achieve parallelism is to increase workers to the suggested (2*CPU)+1, understanding that the maximum number of parallel requests is the number of cores.
- If there is a concern about the application’s memory footprint, using threads and the corresponding gthread worker class in favor of workers yields better performance, because the application is loaded once per worker and every thread running on the worker shares some memory. This comes at the expense of some additional CPU consumption.
- If you don’t know what you are doing, start with the simplest configuration, which is only setting workers to (2*CPU)+1, and don’t worry about threads. From that point, it’s all trial and error with benchmarking. If the bottleneck is memory, start introducing threads. If the bottleneck is I/O, consider a different Python programming paradigm. If the bottleneck is CPU, consider using more cores and adjusting the workers value.
Building the system
We software developers commonly think that every performance bottleneck can be fixed by optimizing the application code, but this is not always true.
There are times in which tuning the settings of the HTTP server, using more resources or re-architecting the application to use a different programming paradigm are the solutions that we need to improve the overall application performance.
In this case, building the system means understanding the types of computing resources (processes, threads and “pseudo-threads”) that we have available to deploy a performant application.
By understanding, architecting and implementing the right technical solution with the right resources we avoid falling into the trap of trying to improve performance by optimizing application code.
- Gunicorn is ported from Ruby’s Unicorn project. Unicorn’s design outline helped clarify some of the most fundamental concepts, and Gunicorn’s architecture documentation cemented some of them.
- Opinionated blog post about how Unicorn deferring some of the most critical features to Unix is good.
- Stack Overflow answer about the pre-fork web server model.
- Some more references to understand how to fine tune Gunicorn.