Better performance by optimizing Gunicorn config

Practical advice on how to configure Gunicorn.

Gunicorn architecture

  • Gunicorn starts a single master process that gets forked, and the resulting child processes are the workers.
  • The role of the master process is to make sure that the number of workers is the same as the ones defined in the settings. So if any of the workers die, the master process starts another one, by forking itself again.
  • The role of the workers is to handle HTTP requests.
  • The pre in pre-forked means that the master process creates the workers before handling any HTTP request.
  • The OS kernel handles load balancing between worker processes.

1st means of concurrency (workers, aka UNIX processes)

gunicorn --workers=5 main:app
Gunicorn with default worker class (sync). Note the 4th line in the image: “Using worker: sync”.

2nd means of concurrency (threads)

gunicorn --workers=5 --threads=2 main:app
Gunicorn with threads setting, which uses the gthread worker class. Note the 4th line in the image: “Using worker: threads”.
gunicorn --workers=5 --threads=2 --worker-class=gthread main:app
gunicorn --workers=3 --threads=3 main:app

3rd means of concurrency (“pseudo-threads” )

gunicorn --worker-class=gevent --worker-connections=1000 --workers=3 main:app

Concurrency vs. Parallelism

  • Concurrency is when 2 or more tasks are being performed at the same time, which might mean that only 1 of them is being worked on while the other ones are paused.
  • Parallelism is when 2 or more tasks are executing at the same time.

Practical use cases

  1. If the application is I/O bounded, the best performance usually comes from using “pseudo-threads” (gevent or asyncio). As we have seen, Gunicorn supports this programming paradigm by setting the appropriate worker class and adjusting the value of workersto (2*CPU)+1.
  2. If the application is CPU bounded, it doesn’t matter how many concurrent requests are handled by the application. The only thing that matters is the number of parallel requests. Due to Python’s GIL, threads and “pseudo-threads” cannot run in parallel. The only way to achieve parallelism is to increase workers to the suggested (2*CPU)+1, understanding that the maximum number of parallel requests is the number of cores.
  3. If there is a concern about the application memory footprint, using threads and its corresponding gthread worker class in favor of workers yields better performance because the application is loaded once per worker and every thread running on the worker shares some memory, this comes to the expense of some additional CPU consumption.
  4. If you don’t know you are doing, start with the simplest configuration, which is only setting workers to (2*CPU)+1 and don’t worry about threads. From that point, it’s all trial and error with benchmarking. If the bottleneck is memory, start introducing threads. If the bottleneck is I/O, consider a different python programming paradigm. If the bottleneck is CPU, consider using more cores and adjusting the workers value.

Building the system


  1. Gunicorn is ported from Ruby’s Unicorn project. Its design outline helped on clarifying some of the most fundamental concepts. Gunicorn architecture cemented some of those concepts.
  2. Opinionated blog post about how Unicorn deferring some of the most critical features to Unix is good.
  3. Stack Overflow answer about the pre-fork web server model.
  4. Some more references to understand how to fine tune Gunicorn.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store