ML Model Deployment Option with Concurrency (Flask + uWSGI)

A beginner’s guide to deploying ML solutions with concurrency and load testing, plus a discussion of a data scientist’s deployment responsibilities.

Ivan Kunyankin
8 min read · Feb 8, 2021

For some time, I was exclusively focused on the research aspect of machine learning, developing custom machine learning solutions for different tasks. But lately, new projects have come in, and it’s sometimes faster to take care of the initial deployment myself than to seek help from other developers. I’ve found several deployment options that differ in scale, ease of use, pricing, etc.


Today, we’ll discuss a simple yet powerful approach to machine learning model deployment. It allows us to process multiple requests simultaneously and scale the application if needed. We’ll also discuss a data scientist’s responsibilities when getting machine learning models to production and how you can load test your web app with some handy Python tools.

What We’ll Cover in This Article

  • The Responsibilities of a Data Scientist
  • Flask and concurrency
  • Load testing with Locust
  • Concurrency with uWSGI
  • Summary

The Responsibilities of a Data Scientist

You can find plenty of open-source solutions for almost every task. Some existing services can even take care of data validation and processing, data storage, model training and evaluation, model inference and monitoring, and more.

But what if you still need a custom solution? You’ll have to develop the whole infrastructure yourself. And here comes the question I have been thinking about for quite some time: What are data scientists responsible for, exactly? Is it just the model itself, or do we have to get it to production?

Usually, a data scientist’s responsibilities differ from company to company. I discussed the question with my CTO, and we talked about the areas where a data scientist should have expertise. They should be able to deliver their solution as an API, containerize it, and, ideally, develop the solution to process multiple requests simultaneously.

As for mobile devices, it’s typically enough to provide mobile developers with a model converted to a corresponding format. On top of that, you can provide documentation describing what the model takes as the input and what it returns as the output.

And if one Docker container can’t handle the expected traffic, a data scientist should delegate further scaling to an appropriate specialist.

How do you feel about your role as a data scientist? Let me know in the comments below about your responsibilities and what you think about them!

Flask and Concurrency

We’ll be using Flask as part of a simple application to experiment with. It’s a micro web framework written in Python, designed for small applications.

When receiving a request, this application will send a request to httpbin.org — a service that helps you experiment with different requests. Once the request is sent, our application will receive a response with a two-second delay. We need this delay to experiment with concurrency.
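
The application’s code isn’t shown here, so below is a minimal sketch of what demo.py might look like. The route and exact code are assumptions; httpbin.org’s /delay/2 endpoint returns its response after a two-second delay:

# demo.py - a minimal sketch; the route and details are illustrative
import requests
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # httpbin.org/delay/2 answers after a two-second delay, simulating
    # a slow downstream service (a typical I/O-bound operation)
    response = requests.get("https://httpbin.org/delay/2")
    return response.text

if __name__ == "__main__":
    app.run(host="0.0.0.0")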

Pure Python has its “infamous” GIL restriction, which essentially allows only one Python thread to execute at a time (read about it here). If we want our application to process more requests in a given time, we have two options: threading and multiprocessing. Which one to use depends on the application’s bottleneck.

When to Choose Threading?

You should be using threading whenever you need to overlap the waiting time. The way our application is written represents a typical I/O-bound operation. So, most of the execution time is spent waiting for other services (like an operating system, database, internet connection, etc.). In this case, we can benefit from threading as it helps to overlap the waiting time.

When to Choose Multiprocessing?

On the other hand, when you want to improve application performance, you should be using multiprocessing. Suppose our application actively uses a CPU (for example, forward passing data through a neural network), and its performance depends solely on the CPU’s computational power. This application is described as CPU-bound. To improve our application’s performance, we’d need multiprocessing. Unlike threading, we create separate interpreter instances and execute calculations in parallel.
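
To make the distinction concrete, here is a small illustration (not from the article) using the standard library’s concurrent.futures:

# Threads overlap waiting; processes run computations in parallel.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound_task(_):
    time.sleep(2)  # waiting on I/O: threads can overlap this

def cpu_bound_task(_):
    return sum(i * i for i in range(10_000_000))  # pure computation

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:   # good for I/O-bound work
        list(pool.map(io_bound_task, range(4)))       # ~2 s total instead of ~8 s
    with ProcessPoolExecutor(max_workers=4) as pool:  # good for CPU-bound work
        list(pool.map(cpu_bound_task, range(4)))      # runs on multiple CPU cores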

Flask’s built-in server has been threaded by default since version 1.0. So why not deploy your applications with it alone? Because Flask’s documentation clearly states that “Flask’s built-in server is not suitable for production” as it doesn’t scale well.

We’ll take a look at another deployment solution in a minute. But first, I suggest testing the application to understand how well it handles the load.

Load testing with Locust

It’s important to load test your API before sending traffic to it. One way to do this is to use a Python library called Locust. It runs a web application on localhost and has a simple interface that allows us to customize tests and visualize the testing process.

Getting Locust Running

Let’s run some tests on the Flask application on localhost.

1. Install Locust with the following command:

pip3 install locust

2. Create the load testing script, load_testing.py, and add it to our project directory (a sketch of this script appears below)

3. Run the application with the following command:

python3 demo.py

4. Run our load testing tool with another command:

locust -f load_testing.py --host=http://0.0.0.0:5000/

5. Go to http://localhost:8089 in your browser, and you’ll see the Locust interface

6. Specify the number of users to simulate and the spawn rate (how many new users are started per second)

7. Choose these parameters according to your needs

Locust library. Screenshot by author
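
Here’s a minimal sketch of what load_testing.py might look like. The original script isn’t shown, so the class and task names are illustrative; this assumes Locust 1.x or later (older releases used HttpLocust and TaskSet instead):

# load_testing.py - a minimal sketch; class and task names are illustrative
from locust import HttpUser, task, between

class AppUser(HttpUser):
    # each simulated user waits 1-2 seconds between consecutive requests
    wait_time = between(1, 2)

    @task
    def call_app(self):
        # hit the root endpoint of our Flask application
        self.client.get("/")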

Testing Threading

Now we’ll test the application with and without threading. This will show us whether threading helps process more requests in a given time. Remember that we set up a two-second delay before getting a response from the server.

The following image is with threading turned off:

...
if __name__ == "__main__":
    app.run(host='0.0.0.0', threaded=False)
Not threaded. Screenshot by author

The next test is with threading turned on. To turn it on, remove the threaded=False parameter shown above.

Threaded. Screenshot by author

As you can see, you get a higher RPS (requests per second) rate with threads.

But what if we develop an application to classify images, for example? This operation would actively use our CPU. In this case, instead of threading, it would be better to process requests with separate processes.

Concurrency with uWSGI

To handle concurrent requests properly, we’ll use uWSGI. It’s a tool that gives us more control over multiprocessing and threading, with enough functionality and flexibility to deploy an app while still being accessible.

Let’s change the Flask application we created earlier, so it looks a bit more like a real machine-learning service:

Once run, it initializes the model. It’ll then perform a forward pass of an array of zeros through the model every time it receives a request to simulate a real-world application’s functionality.
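
The modified application isn’t shown here either. As a rough sketch (assuming TensorFlow/Keras is installed; the model below is a made-up stand-in for a real one), it could look something like this:

# demo.py (modified) - a sketch; the Keras model is a stand-in for a real one
import numpy as np
from flask import Flask
from tensorflow import keras

app = Flask(__name__)

# the model is initialized once, when the application starts
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(1024,)),
    keras.layers.Dense(10, activation="softmax"),
])

@app.route("/")
def predict():
    # forward-pass an array of zeros through the model to simulate
    # the CPU-bound work of a real inference request
    dummy_input = np.zeros((1, 1024), dtype=np.float32)
    predictions = model.predict(dummy_input)
    return f"output shape: {predictions.shape}"

if __name__ == "__main__":
    # switch threaded to True for the second test below
    app.run(host="0.0.0.0", threaded=False)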

First, let’s take a look at RPS (requests per second) for the application without uWSGI:

Without uWSGI, Threaded=False. Screenshot by author
Without uWSGI, Threaded=True. Screenshot by author

We tested the “service” as a pure Flask application with threaded=False and threaded=True, respectively. As you can see, although the RPS is higher when threaded=True, the improvement isn’t much. This is because the application still depends mainly on the CPU.

Testing with uWSGI

First, we need to install uWSGI:

pip3 install uwsgi

Then, we’ll need to add a configuration file to our project directory. This file has all the parameters uWSGI needs to run with our app.

Let’s go over the most important parameters that you’ll see in this configuration file:

module = demo:app — This is the name of the script containing our application, followed by the name of the Flask object inside it

master = true — This is the main uWSGI process necessary for repeated calls of workers, logging, and managing other functions. In most cases, this should be set to “true”

processes = 2 / threads = 1 — This specifies the number of processes and threads to be run. You can also use uWSGI’s submodule called cheaper to scale the number of processes and threads automatically

enable-threads = true — This enables Python threads inside uWSGI workers; without it, threads started by the application won’t run

listen = 1024 — This is the size of the backlog queue for incoming requests

need-app = true — This flag prevents uWSGI from starting if it can’t find or load the application. If set to false, uWSGI ignores any import issues and returns a 500 status for every request

http = 0.0.0.0:5000 — This is the address and port used to access the application. It’s only used if a user sends requests directly to the application

By default, uWSGI loads your application and then forks it. But you can specify lazy-apps = true. This way, uWSGI loads your application separately for every worker. It can help avoid errors with TensorFlow models or sharing other data between workers.

Another critical parameter is listen. It’s essential to set this parameter to the maximum number of unique users you expect to enqueue during a reload. Otherwise, some of them might get an error. By default, listen equals 100. Read more about it here.
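
Putting the options above together, the uwsgi.ini file might look roughly like this (a sketch, not the exact configuration used for the tests below):

; uwsgi.ini - a sketch assembled from the parameters described above
[uwsgi]
module = demo:app
master = true
processes = 2
threads = 1
enable-threads = true
listen = 1024
need-app = true
http = 0.0.0.0:5000
; uncomment to load the application separately in every worker
; lazy-apps = true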

uWSGI has many more useful parameters, but for now, let’s run the application:

uwsgi uwsgi.ini

Now, we can look at the load testing results for the Flask application wrapped into uWSGI:

With uWSGI, 1 process/2 threads. Screenshot by author
With uWSGI, 2 processes/1 thread. Screenshot by author

Using two separate processes improved RPS radically (15 without uWSGI and threaded=False versus 30 with uWSGI and two processes). However, whether an application is I/O-bound or CPU-bound isn’t always obvious. And choosing between multiprocessing and multithreading is a bit tricky. Take a look at this post for a better understanding.

The Wrap-Up

Hopefully, now you see why this question has been on my mind all this time. How much should data scientists concern themselves with getting a model ready for production, and where does their responsibility end? With Flask and uWSGI, you’re equipped with the bare essentials to get a model up and running. But the sky is the limit, and your situation may warrant more.

Lastly, if you want your application to be open to the world, you should take care of security. We didn’t cover security in this article because it’s a whole other subject, but you should keep it in mind.

As always, I hope this article has been useful to you. Stay safe.
