Simple Ways to Save Big in Computer Vision Inference

How you can optimize your CPU and GPU utilization

Rashad Moarref
Jun 8 · 8 min read

At GumGum, we use Computer Vision (CV) to leverage page visuals for our contextual targeting and brand suitability product called Verity. We process millions of images every hour, and at this rate, our long-term inference costs dwarf the upfront training costs. So, we tackled this issue head-on. In this post, I’ll benchmark and highlight the importance of multi-threading for I/O operations and batch processing for inference. Note that implementing these strategies may be overkill if your application’s scale is of the order of a few thousand images an hour.

Let’s look at our application components:

A typical workflow in CV inference applications.

API. The API provides an interface between the client and the CV module. Minimally, a client request contains an image url and a task, e.g. “check whether the image depicts violence”. Here, we assume that the API performs at the desired scale and is not a bottleneck in our application.

Message queue. There are at least two reasons for putting a message queue (we use SQS) between the API and the CV module: First, it allows inference to be done asynchronously which is useful since CV tasks could be time-consuming. Second, a managed queue like SQS automatically brings some level of fault tolerance to our application by allowing retries and storing unsuccessful requests in a dead letter queue. Since a standard SQS provides nearly unlimited throughput, it is not an application bottleneck.

CV module. This module is responsible for the following tasks: (a) Get the image url from SQS; (b) Download the image from the internet; (c) Preprocess the image on the CPU, e.g. resizing, and load it into GPU memory; (d) Run inference on the GPU; and (e) Send results to the API through an http callback.

  • I/O operations (a, b, e). If we perform the I/O operations sequentially, i.e. on the main thread, the CPU and GPU can remain idle for relatively extended periods of time depending on the network speed (mainly limited by the servers that host the image and TCP/IP handshake). This idle time, if not addressed properly, can become the main bottleneck.
  • CPU preprocessing (c). The downloaded image should go through a series of transforms by the CPU before being loaded into GPU memory. This ensures that the input to the model is similar to the images that were used during training. Common transforms include resizing, cropping, and modifications to the mean, standard deviation, and color mode. This step can be optimized to not become a bottleneck.
  • GPU processing (d). A GPU achieves high throughput by executing a compute kernel across its thousands of CUDA cores in parallel. The actual number of cores that are used in each run depends on the input size to the GPU. We can increase the number of used cores and save GPU time by sending a batch of images to the GPU. The maximum feasible batch size is determined by the available GPU memory. In the absence of I/O and CPU bottlenecks, the GPU power or memory limits could become a bottleneck.
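To make the batching idea concrete, here is a minimal sketch (shapes and batch size are hypothetical) showing that batching simply stacks preprocessed images along a new leading dimension, so a single GPU call covers all of them:

```python
import numpy as np

def make_batch(images):
    # images: list of HxWxC float arrays, already preprocessed to a fixed size.
    # Stacking adds a leading batch dimension, so one kernel launch can
    # process all N images instead of N launches processing one each.
    return np.stack(images, axis=0)  # shape: (N, H, W, C)

images = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(8)]
batch = make_batch(images)
assert batch.shape == (8, 224, 224, 3)
```

Note that all images in a batch must share the same spatial dimensions, which is one reason resizing to a fixed shape happens during CPU preprocessing.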

I’ll discuss simple strategies for eliminating or reducing the impact of the above bottlenecks. You’ll find references to Python since it’s the language of choice for taking advantage of deep learning frameworks such as PyTorch and TensorFlow. The updated workflow involves multi-threading for I/O operations and batch processing for inference:

Updated inference workflow with multi-threading for I/O operations and batch inference.

Fortunately, multi-threading for I/O operations is not affected by Python’s infamous GIL. In fact, Python’s threading module can be used to completely eliminate the I/O bottleneck.

  • Use Python’s thread-safe queue.Queue to store inputs and outputs of the multi-threaded functions:
from queue import Queue
# ... inside your class
self.download_queue = Queue(maxsize=MAX_QUEUE_SIZE)
self.inference_queue = Queue(maxsize=MAX_QUEUE_SIZE)
self.results_queue = Queue(maxsize=MAX_QUEUE_SIZE)

It’s important that you specify a maximum size for the queues: it coordinates the multi-threaded functions and prevents faster stages from racing ahead of work that slower stages can’t keep up with. For example, without a maximum size, the pull_from_queue method keeps reading from SQS at a much faster pace than the application can process the images. Those messages become visible to all queue consumers again once they reach the visibility timeout. As a result, the application will end up hopelessly processing the same messages over and over!
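The back-pressure behavior of a bounded queue can be demonstrated in a few lines; the queue size here is arbitrary:

```python
from queue import Queue, Full

# A bounded queue applies back-pressure: a put() on a full queue blocks
# (or raises Full when block=False), so fast producers can't run ahead.
q = Queue(maxsize=2)
q.put("msg-1")
q.put("msg-2")

overflow = False
try:
    q.put("msg-3", block=False)  # non-blocking put on a full queue raises Full
except Full:
    overflow = True  # with block=True (the default) this would wait instead

assert overflow and q.qsize() == 2
```

In the multi-threaded workflow, the blocking form of put() is what pauses the SQS reader until the downstream stages catch up.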

The value of MAX_QUEUE_SIZE should be large enough that the underlying methods can do their work without depleting the queue, but not so large that a message pulled from SQS reaches its visibility timeout before it is processed. The following equation gives an upper bound on MAX_QUEUE_SIZE:

MAX_QUEUE_SIZE ≤ RPS × VISIBILITY_TIMEOUT / NUM_QUEUES

where RPS (requests/second) is the number of messages that the application can process in one second, VISIBILITY_TIMEOUT is the SQS visibility timeout in seconds, and NUM_QUEUES=3 since we have three back-to-back queues in our application.
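As a worked example of the bound (the visibility timeout value below is illustrative, not a setting from the article):

```python
RPS = 28                 # throughput in requests/second (the benchmark's best case)
VISIBILITY_TIMEOUT = 60  # seconds; hypothetical SQS visibility timeout
NUM_QUEUES = 3           # download, inference, and results queues

# A message pulled from SQS may wait in up to NUM_QUEUES internal queues,
# so all buffered work must drain within the visibility timeout:
max_queue_size = (RPS * VISIBILITY_TIMEOUT) // NUM_QUEUES
print(max_queue_size)  # 560
```

With these numbers, any MAX_QUEUE_SIZE up to 560 keeps every pulled message within its visibility window; the benchmark below uses a much more conservative 300.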

  • Create a wrapper method for each of the pull_from_queue, download_image, and send_results methods. Here is an example for the download_image method:
def download_image_wrapper(self):
    while True:
        # read from download_queue
        # blocks if download_queue is empty
        image_url = self.download_queue.get()
        # download the image
        image = self.download_image(image_url)
        # put image in inference_queue
        # blocks if inference_queue has reached max size
        self.inference_queue.put(image)

Then start a desired number of threads that run this wrapper:

from threading import Thread
for _ in range(N_DOWNLOAD_THREADS):
    Thread(target=self.download_image_wrapper, daemon=True).start()
  • There is one more optimization you can do to further speed up the pull_from_queue method, even though it is not related to multi-threading. Specify an SQS_BATCH_SIZE to read up to 10 messages (the SQS maximum) each time receive_message is called. This is much faster than reading SQS messages one by one:
import boto3
sqs = boto3.client("sqs")
messages = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=SQS_BATCH_SIZE,  # at most 10
    WaitTimeSeconds=WAIT_TIME_SECONDS,   # enable long polling
)

As discussed earlier, we can use batch inference to make the most of our GPU time. Now that we have downloaded multiple images and placed them in the inference_queue, it’s time for the CPU to preprocess a batch of images and send them to the GPU.

  • The preprocessing is CPU-intensive and ideally should take advantage of Python’s multiprocessing module. Fortunately, many of the methods that are used for image transformations use libraries such as numpy that run outside the GIL and easily use multiple CPU cores.
  • In addition, we should ensure that CPU preprocessing runs in parallel with GPU processing. If the CPU starts preprocessing the next batch of images only after it receives the results of the previous batch from the GPU, we force the CPU and GPU to wait on each other. One way to avoid this is to use a separate queue, say preprocessed_queue, to store the preprocessed images. Each time the GPU is ready to take the next batch, the CPU can read directly from this queue and load it into GPU memory.
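The producer/consumer pattern described above can be sketched with plain threads; the arithmetic stands in for real preprocessing and inference, and all names here are illustrative:

```python
import threading
from queue import Queue

preprocessed_queue = Queue(maxsize=4)

def preprocess_worker(raw_images):
    # CPU thread: transforms images and hands them off without waiting
    # for the GPU to finish the previous batch.
    for img in raw_images:
        preprocessed_queue.put(img * 2)   # stand-in for resize/normalize
    preprocessed_queue.put(None)          # sentinel: no more work

def gpu_worker(results):
    # "GPU" thread: consumes preprocessed items as soon as they are ready.
    while True:
        item = preprocessed_queue.get()
        if item is None:
            break
        results.append(item + 1)          # stand-in for model inference

results = []
t1 = threading.Thread(target=preprocess_worker, args=([1, 2, 3],))
t2 = threading.Thread(target=gpu_worker, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
assert results == [3, 5, 7]
```

Because the bounded queue sits between the two threads, preprocessing of image N+1 overlaps with inference on image N instead of alternating with it.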

Alternatively, if you’re using frameworks like PyTorch, take advantage of the provided DataLoader class with memory-pinning since it does the above parallelization while utilizing Python’s multiprocessing module and fast loading into GPU under the hood.

Let’s look at a study we did to investigate the effect of the above strategies on a Threat Classifier module developed by GumGum. The benchmark is done on a g4dn.2xlarge EC2 instance (Tesla T4 GPU, 8 vCPUs, 32 GB RAM, 16 GB GPU Memory).

Benchmarks results on a g4dn.2xlarge instance for a Threat Classifier developed by GumGum.
  • Grey column: The baseline case represents a module that runs all tasks sequentially on the main thread. The resulting request/second (RPS) is 3 and CPU and GPU utilizations are in single digits.
  • Blue column: This case removes I/O idle time by using 10 threads for image download, and 3 threads for reading from SQS and sending the results callback. A maximum queue size of 300 is used for the underlying queues. We see that the RPS is increased to 9, three times larger than baseline. The CPU and GPU utilizations have also increased. The used GPU memory is unchanged since we are still sending a batch of size one to the GPU.
  • Green column: This case runs CPU preprocessing in parallel with GPU processing using Pytorch’s DataLoader. The number of I/O threads are the same as the blue case, and we are sending one image to the GPU. We see that the RPS is increased to 21, seven times larger than baseline and more than two times larger than the blue case that runs preprocessing and processing sequentially. The CPU and GPU utilizations have also doubled compared to the blue case.
  • Red column: This case is the same as the green case except that it uses a batch of 8 images for inference. We see that the RPS is increased to 28, almost ten times larger than baseline, and 33% larger than the green case where a batch size of one was used. The CPU utilization has increased to 75%. The GPU memory usage has increased while the GPU duty cycle has slightly decreased. This is because the GPU runs a batch of images in one shot, using more memory but less compute time.

GumGum performs different tasks on images like threat classification, scene classification, or object detection. These tasks have their own model and can use different frameworks. To make our codebase easier to maintain, we have abstracted away the shared functions in an inference library. If your application, like ours, relies on multiple CV modules, I recommend you consider a similar solution. Here is how we did it:

  • The inference library handles multi-threading for reading from SQS, downloading images, and sending the results callback.
  • It also provides an abstract class with abstract methods for the only two module-specific tasks, i.e. preprocess and process. All that the CV modules need to do is to extend this abstract class and implement the required concrete methods.
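The abstract-class pattern described above can be sketched as follows; the class and method bodies are hypothetical stand-ins, and only the preprocess/process split reflects the design in the text:

```python
from abc import ABC, abstractmethod

class InferenceModule(ABC):
    # Shared library code would wrap these with SQS reading, downloading,
    # multi-threading, and the results callback.

    @abstractmethod
    def preprocess(self, image):
        """Module-specific CPU transforms (resize, normalize, ...)."""

    @abstractmethod
    def process(self, batch):
        """Module-specific GPU inference on a preprocessed batch."""

class ThreatClassifier(InferenceModule):
    def preprocess(self, image):
        return image.lower()              # stand-in for real transforms

    def process(self, batch):
        return [f"ok:{x}" for x in batch]  # stand-in for model inference

m = ThreatClassifier()
assert m.process([m.preprocess("IMG")]) == ["ok:img"]
```

A new CV module then only implements these two methods and inherits the I/O threading and queue management for free.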

Your inference costs could be significantly lower if you incorporate the following simple strategies:

  • It’s always a good idea to use Python’s threading module for I/O operations. It becomes a must when you’re paying top dollar for GPU time.
  • Run CPU preprocessing in parallel with GPU processing. If you’re using frameworks like PyTorch, take advantage of the provided DataLoader class.
  • Consider using batch inference in GPU. A good practice is to increase the batch size as long as it increases throughput without running into GPU out-of-memory error.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram


Thoughts from the GumGum tech team

Rashad Moarref

Written by

Software Engineer with entrepreneurial spirit. Passionate about building Machine Learning applications at scale. PhD in ECE, Univ. Minnesota. Caltech Alumnus.

