Optimize Response Time of your Machine Learning API in Production

Yannick Wolff
Published in Sicara's blog · Jan 13, 2020 · 2 min read

This article demonstrates how building a smarter API to serve Deep Learning models can minimize response time.

Your team worked hard to build a Deep Learning model for a given task (let’s say: detecting purchased products in a store using Computer Vision). Good.

You then developed and deployed an API that integrates this model (let’s keep our example: self-checkout machines would call this API). Great!

The new product is working well and you feel like all the work is done.

But since the manager decided to install more self-checkout machines (I really like this example), users have started to complain about the huge latency each time they scan a product.

What can you do? Buy 10x faster — and 10x more expensive — GPUs? Ask data scientists to try reducing the depth of the model without degrading its accuracy?

Cheaper and easier solutions exist, as you will see in this article.

A basic API with a big dummy model

First of all, we’ll need a model with a long inference time to work with. Here is how I would do that with TensorFlow 2’s Keras API (if you’re not familiar with this Deep Learning framework, feel free to skip this piece of code):
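The snippet is embedded on the original post; below is a minimal sketch of what such a dummy model could look like. The build_big_model name, layer sizes and layer count are my own illustration, chosen only to make inference slow, not the article's exact code.

# big_model.py: a deliberately slow dummy model built with TensorFlow 2's Keras API
import time

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_big_model(input_dim=1024, hidden_dim=4096, n_layers=30):
    """Stack many wide Dense layers purely to inflate inference time."""
    inputs = layers.Input(shape=(input_dim,))
    x = inputs
    for _ in range(n_layers):
        x = layers.Dense(hidden_dim, activation="relu")(x)
    outputs = layers.Dense(input_dim)(x)
    return tf.keras.Model(inputs, outputs)

if __name__ == "__main__":
    model = build_big_model()
    x = np.random.random((1, 1024)).astype("float32")
    model.predict(x)  # warm-up call (graph tracing, memory allocation)
    start = time.perf_counter()
    model.predict(x)
    print(f"Inference time: {time.perf_counter() - start:.3f} s")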

When testing the model on my GeForce RTX 2080 GPU, I measured an inference time of 303 ms. That’s what we can call a big model.

Now, we need a very simple API to serve our model, with only one route to ask for a prediction. A very standard API framework in Python is Flask. That’s the one I chose, along with a WSGI HTTP server called Gunicorn. Our single route parses the input from the request, calls the instantiated model on it, and sends the output back to the user.
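Concretely, the app might look like the sketch below: a wsgi.py module matching the gunicorn command that follows. The /predict route and JSON payload format are my own illustration, and the dummy model is assumed to live in the big_model module sketched above.

# wsgi.py: a single-route Flask API serving the dummy model
import numpy as np
from flask import Flask, jsonify, request

from big_model import build_big_model  # hypothetical module holding the dummy model

app = Flask(__name__)
model = build_big_model()  # instantiate the model once, when the worker starts

@app.route("/predict", methods=["POST"])
def predict():
    # Parse the input numbers from the JSON body of the request
    inputs = np.array(request.get_json()["inputs"], dtype="float32")
    # Call the instantiated model and send its output back to the user
    outputs = model.predict(inputs[np.newaxis, :])[0]
    return jsonify({"outputs": outputs.tolist()})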

We can run our deep learning API with the command:

gunicorn wsgi:app

Okay, I can now send some random numbers to my API and it responds to me with some other random numbers. The question is: how fast?
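For instance, assuming the API runs locally on Gunicorn’s default port 8000 and exposes the hypothetical /predict route sketched above, a quick sanity check could look like this:

import numpy as np
import requests

# Send some random numbers to the API...
payload = {"inputs": np.random.random(1024).tolist()}
response = requests.post("http://localhost:8000/predict", json=payload)
# ...and get some other random numbers back
print(response.json()["outputs"][:5])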

Let’s load test our API

Read the full article on Sicara’s blog.
