At GumGum, we use Computer Vision (CV) to leverage page visuals for our contextual targeting and brand suitability product called Verity. We process millions of images every hour, and at this rate, our long-term inference costs dwarf the upfront training costs. So, we tackled this issue head-on. In this post, I’ll benchmark and highlight the importance of multi-threading for I/O operations and batch processing for inference. Note that implementing these strategies may be overkill if your application’s scale is on the order of a few thousand images an hour.
Bottlenecks in a Typical Inference Workflow
Let’s look at our application components:
API. The API provides an interface between the client and the CV module. Minimally, a client request contains an image url and a task, e.g. “check whether the image depicts violence”. Here, we assume that the API performs at the desired scale and is not a bottleneck in our application.
Message queue. There are at least two reasons for putting a message queue (we use SQS) between the API and the CV module. First, it allows inference to be done asynchronously, which is useful since CV tasks can be time-consuming. Second, a managed queue like SQS automatically brings some level of fault tolerance to our application by allowing retries and storing unsuccessful requests in a dead letter queue. Since a standard SQS queue provides nearly unlimited throughput, it is not an application bottleneck.
CV module. This module is responsible for the following tasks: (a) Get the image url from SQS; (b) Download the image from the internet; (c) Preprocess the image in CPU, e.g. resizing, and load it into GPU memory; (d) Run inference in GPU; and (e) Send results to the API through an http callback.
- I/O operations (a, b, e). If we perform the I/O operations sequentially, i.e. on the main thread, the CPU and GPU can remain idle for relatively extended periods of time depending on the network speed (mainly limited by the servers that host the image and TCP/IP handshake). This idle time, if not addressed properly, can become the main bottleneck.
- CPU preprocessing (c). The downloaded image should go through a series of transforms by the CPU before being loaded into GPU memory. This ensures that the input to the model is similar to the images that were used during training. Common transforms include resizing, cropping, and modifications to the mean, standard deviation, and color mode. This step can be optimized to not become a bottleneck.
- GPU processing (d). A GPU achieves high throughput by executing a compute kernel across its thousands of CUDA cores in parallel. The actual number of cores that are used in each run depends on the input size to the GPU. We can increase the number of used cores and save GPU time by sending a batch of images to the GPU. The maximum feasible batch size is determined by the available GPU memory. In the absence of I/O and CPU bottlenecks, the GPU power or memory limits could become a bottleneck.
Alleviating Application Bottlenecks
I’ll discuss simple strategies for eliminating or reducing the impact of the above bottlenecks. You’ll find references to Python since it’s the language of choice for taking advantage of deep learning frameworks such as PyTorch and TensorFlow. The updated workflow involves multi-threading for I/O operations and batch processing for inference.
Eliminating the I/O Bottleneck Using Multi-threading
- Use Python’s thread-safe queue.Queue to store inputs and outputs of the multi-threaded functions:
from queue import Queue
# ... inside your class
self.download_queue = Queue(maxsize=MAX_QUEUE_SIZE)
self.inference_queue = Queue(maxsize=MAX_QUEUE_SIZE)
self.results_queue = Queue(maxsize=MAX_QUEUE_SIZE)
It’s important that you specify a maximum size for the queues to bring some coordination between the multi-threaded functions and prevent faster methods from doing work that slower methods can’t keep up with. For example, without a maximum size, the pull_from_queue method keeps reading from SQS at a much faster pace than the application can process the images. Once their visibility timeout expires, these unprocessed messages become visible to all the queue consumers again. As a result, the application will end up hopelessly processing the same messages over and over!
The value of MAX_QUEUE_SIZE should be large enough that the underlying methods can do their work without depleting the queue, but not so large that a message pulled from SQS reaches its visibility timeout before it is processed. The following inequality gives an upper bound on MAX_QUEUE_SIZE:
MAX_QUEUE_SIZE < SQS_VISIBILITY_TIMEOUT x RPS / NUM_QUEUES
Here, RPS (requests/second) is the number of messages that the application can process in one second, and NUM_QUEUES=3 since we have three back-to-back queues in our application.
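As a worked example of this bound, the short sketch below plugs in illustrative numbers. The 60-second visibility timeout and the RPS of 28 are assumptions chosen for the arithmetic, and max_queue_size_bound is a hypothetical helper name. The intuition is that, in the worst case, a message waits behind roughly NUM_QUEUES x MAX_QUEUE_SIZE others that drain at RPS messages per second, and that wait has to stay below the visibility timeout.

```python
def max_queue_size_bound(visibility_timeout_s, rps, num_queues=3):
    """Upper bound on MAX_QUEUE_SIZE: timeout x RPS / number of queues."""
    return visibility_timeout_s * rps / num_queues


# Assumed values, for illustration only.
SQS_VISIBILITY_TIMEOUT = 60  # seconds
RPS = 28                     # messages processed per second

bound = max_queue_size_bound(SQS_VISIBILITY_TIMEOUT, RPS)
print(bound)  # 560.0
```

With these assumed numbers, any MAX_QUEUE_SIZE below 560 satisfies the inequality; in practice you would likely leave some headroom for slow downloads.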
- Create a wrapper method for each of the pull_from_queue, download_image, and send_results methods. Here is an example for the download_image method:
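def download_worker(self):  # wrapper name is illustrative
    while True:
        # read from download_queue
        # blocks if download_queue is empty
        image_url = self.download_queue.get()
        # download the image
        image = self.download_image(image_url)
        # put image in inference_queue
        # blocks if inference_queue has reached max size
        self.inference_queue.put(image)
Then start a desired number of threads that run this wrapper:
from threading import Thread
for _ in range(N_DOWNLOAD_THREADS):
    # daemon threads exit together with the main program
    Thread(target=self.download_worker, daemon=True).start()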
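To see bounded queues and download threads working together, here is a minimal, self-contained sketch. It is not our production code: fake_download, feed, run, and the STOP sentinel are hypothetical names, the download is simulated with a short sleep, and the queue size is kept deliberately small so that the blocking behaviour actually kicks in.

```python
import time
from queue import Queue
from threading import Thread

MAX_QUEUE_SIZE = 4       # deliberately small so the blocking behaviour is exercised
N_DOWNLOAD_THREADS = 3
STOP = object()          # sentinel telling a worker to exit


def fake_download(url):
    """Stand-in for a real HTTP download; the sleep mimics network latency."""
    time.sleep(0.01)
    return url.encode()


def download_worker(download_queue, inference_queue):
    while True:
        url = download_queue.get()               # blocks while download_queue is empty
        if url is STOP:
            break
        inference_queue.put(fake_download(url))  # blocks while inference_queue is full


def feed(download_queue, urls, n_workers):
    for url in urls:
        download_queue.put(url)                  # blocks while download_queue is full
    for _ in range(n_workers):
        download_queue.put(STOP)                 # one sentinel per worker


def run(urls):
    download_queue = Queue(maxsize=MAX_QUEUE_SIZE)
    inference_queue = Queue(maxsize=MAX_QUEUE_SIZE)
    workers = [
        Thread(target=download_worker, args=(download_queue, inference_queue))
        for _ in range(N_DOWNLOAD_THREADS)
    ]
    feeder = Thread(target=feed, args=(download_queue, urls, N_DOWNLOAD_THREADS))
    for t in workers + [feeder]:
        t.start()
    # The main thread plays the role of the inference step and drains the queue.
    images = [inference_queue.get() for _ in urls]
    for t in workers + [feeder]:
        t.join()
    return images


urls = [f"https://example.com/{i}.jpg" for i in range(20)]
images = run(urls)
print(len(images))  # 20
```

Because several threads download concurrently, the images may arrive in a different order than the URLs were submitted, so each result should carry whatever identifier the callback needs.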
- There is one more optimization you can do to further speed up the pull_from_queue method even though it is not related to multi-threading. Set SQS_BATCH_SIZE so that up to 10 messages are read each time receive_message is called. This is much faster than reading SQS messages one by one:
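import boto3
sqs = boto3.client("sqs")
# SQS_QUEUE_URL is a placeholder for your queue URL; SQS_BATCH_SIZE must be between 1 and 10
messages = sqs.receive_message(QueueUrl=SQS_QUEUE_URL, MaxNumberOfMessages=SQS_BATCH_SIZE)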
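A batched read returns a list of messages rather than a single one, so the caller has to unpack it before filling the download queue. The sketch below shows one way to do that on a hand-written response in the shape that boto3 returns. The JSON body and its "image_url" field are assumptions about the message format, and extract_image_urls is a hypothetical helper.

```python
import json


def extract_image_urls(response):
    """Return (receipt_handle, image_url) pairs from a receive_message response."""
    pairs = []
    # The "Messages" key is absent when no messages were returned.
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])      # assumes the body is JSON
        # "image_url" is a hypothetical field name in the message body.
        pairs.append((message["ReceiptHandle"], body["image_url"]))
    return pairs


# Hand-written sample response, for illustration only.
sample_response = {
    "Messages": [
        {"MessageId": "1", "ReceiptHandle": "rh-1",
         "Body": json.dumps({"image_url": "https://example.com/a.jpg"})},
        {"MessageId": "2", "ReceiptHandle": "rh-2",
         "Body": json.dumps({"image_url": "https://example.com/b.jpg"})},
    ]
}

pairs = extract_image_urls(sample_response)
print(len(pairs))  # 2
```

Keeping the receipt handle next to the URL matters because it is what you need later to delete the message once it has been processed.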
Reducing CPU and GPU Bottlenecks
As discussed earlier, we can use batch inference to make the most of our GPU time. Now that we have downloaded multiple images and placed them in the inference_queue, it’s time for the CPU to preprocess a batch of images and send them to the GPU.
- The preprocessing is CPU-intensive and ideally should take advantage of Python’s multiprocessing module. Fortunately, many of the methods that are used for image transformations use libraries such as numpy that run outside the GIL and easily use multiple CPU cores.
- In addition, we should ensure that CPU preprocessing runs in parallel with GPU processing. If the CPU starts preprocessing the next batch of images only after it receives the results of the previous batch from the GPU, we are making the CPU and GPU wait on each other. One way to avoid this is to use a separate queue, say preprocessed_queue, to store the preprocessed images. Each time the GPU is ready to take the next batch, the batch can be read directly from this queue and loaded into GPU memory.
Alternatively, if you’re using frameworks like PyTorch, take advantage of the provided DataLoader class with memory-pinning since it does the above parallelization while utilizing Python’s multiprocessing module and fast loading into GPU under the hood.
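The separate-queue approach can be sketched with the standard library alone. In this minimal example, a CPU thread keeps filling a preprocessed queue while the consumer loop assembles batches of up to BATCH_SIZE items. The preprocess and fake_model functions are stand-ins operating on plain integers, and next_batch, run, and the STOP sentinel are hypothetical names. A real module would apply image transforms and a GPU forward pass instead.

```python
from queue import Queue, Empty
from threading import Thread

BATCH_SIZE = 8
STOP = object()  # sentinel marking the end of the input


def preprocess(image):
    """Stand-in for CPU transforms such as resizing and normalization."""
    return image * 2


def fake_model(batch):
    """Stand-in for a GPU forward pass over a whole batch."""
    return [x + 1 for x in batch]


def preprocess_worker(inference_queue, preprocessed_queue):
    while True:
        image = inference_queue.get()
        if image is STOP:
            preprocessed_queue.put(STOP)   # pass the sentinel downstream
            break
        preprocessed_queue.put(preprocess(image))


def next_batch(q, max_size, timeout=0.05):
    """Wait for one item, then keep taking items (waiting at most `timeout`
    seconds for each) up to max_size. Returns (batch, done); done is True
    once the sentinel has been seen."""
    first = q.get()
    if first is STOP:
        return [], True
    batch = [first]
    while len(batch) < max_size:
        try:
            item = q.get(timeout=timeout)
        except Empty:
            break                          # run a partial batch rather than stall
        if item is STOP:
            return batch, True
        batch.append(item)
    return batch, False


def run(images):
    inference_queue = Queue(maxsize=64)
    preprocessed_queue = Queue(maxsize=64)
    worker = Thread(target=preprocess_worker,
                    args=(inference_queue, preprocessed_queue))
    worker.start()
    for image in images:                   # small input that fits in the queue
        inference_queue.put(image)
    inference_queue.put(STOP)
    results, batch_sizes, done = [], [], False
    while not done:
        batch, done = next_batch(preprocessed_queue, BATCH_SIZE)
        if batch:
            results.extend(fake_model(batch))
            batch_sizes.append(len(batch))
    worker.join()
    return results, batch_sizes


results, batch_sizes = run(range(20))
print(results[:3])  # [1, 3, 5]
```

The timeout in next_batch is a design choice: it trades a little latency for fuller batches, and it lets a partial batch run when traffic is low instead of waiting indefinitely.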
Nothing Speaks the Truth Like a Benchmark!
Let’s look at a study we did to investigate the effect of the above strategies on a Threat Classifier module developed by GumGum. The benchmark is done on a g4dn.2xlarge EC2 instance (Tesla T4 GPU, 8 vCPUs, 32 GB RAM, 16 GB GPU Memory).
- Grey column: The baseline case represents a module that runs all tasks sequentially on the main thread. The resulting requests/second (RPS) is 3, and CPU and GPU utilizations are in single digits.
- Blue column: This case removes I/O idle time by using 10 threads for image download, and 3 threads for reading from SQS and sending the results callback. A maximum queue size of 300 is used for the underlying queues. We see that the RPS is increased to 9, three times larger than baseline. The CPU and GPU utilizations have also increased. The used GPU memory is unchanged since we are still sending a batch of size one to the GPU.
- Green column: This case runs CPU preprocessing in parallel with GPU processing using PyTorch’s DataLoader. The number of I/O threads is the same as in the blue case, and we are still sending one image at a time to the GPU. We see that the RPS is increased to 21, seven times larger than baseline and more than two times larger than the blue case that runs preprocessing and processing sequentially. The CPU and GPU utilizations have also doubled compared to the blue case.
- Red column: This case is the same as the green case except that it uses a batch of 8 images for inference. We see that the RPS is increased to 28, almost ten times larger than baseline, and 33% larger than the green case where a batch size of one was used. The CPU utilization has increased to 75%. The GPU memory usage has increased while the GPU duty cycle has slightly decreased. This is because the GPU runs a batch of images in one shot, using more memory but less compute time.
Bonus: An Inference Library
GumGum performs different tasks on images like threat classification, scene classification, or object detection. These tasks have their own model and can use different frameworks. To make our codebase easier to maintain, we have abstracted away the shared functions in an inference library. If your application, like ours, relies on multiple CV modules, I recommend you consider a similar solution. Here is how we did it:
- The inference library handles multi-threading for reading from SQS, downloading images, and sending the results callback.
- It also provides an abstract class with abstract methods for the only two module-specific tasks, i.e. preprocessing the images and running inference on them (process). All that the CV modules need to do is to extend this abstract class and implement the required concrete methods.
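As a rough illustration of this design, and not the actual GumGum library, the sketch below defines an abstract base class with two abstract methods and one toy subclass. The names InferenceModule, preprocess, run_batch, and ToyThreatClassifier are hypothetical, and the "model" is a string check standing in for a real classifier.

```python
from abc import ABC, abstractmethod


class InferenceModule(ABC):
    """Hypothetical base class; shared I/O threading would live here."""

    @abstractmethod
    def preprocess(self, images):
        """Turn raw inputs into model-ready inputs."""

    @abstractmethod
    def process(self, batch):
        """Run the model on a preprocessed batch and return results."""

    def run_batch(self, images):
        # Shared logic that every module inherits unchanged.
        return self.process(self.preprocess(images))


class ToyThreatClassifier(InferenceModule):
    """Toy subclass; real modules would wrap a trained model."""

    def preprocess(self, images):
        return [name.lower() for name in images]

    def process(self, batch):
        return ["threat" if "weapon" in name else "safe" for name in batch]


module = ToyThreatClassifier()
print(module.run_batch(["Weapon.jpg", "cat.jpg"]))  # ['threat', 'safe']
```

Because the base class is abstract, a module that forgets to implement one of the two methods fails at instantiation time rather than at the first request.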
Your inference costs could be significantly lower if you incorporate the following simple strategies:
- It’s always a good idea to use Python’s threading module for I/O operations. It becomes a must when you’re paying top dollar for GPU time.
- Run CPU preprocessing in parallel with GPU processing. If you’re using frameworks like PyTorch, take advantage of the provided DataLoader class.
- Consider using batch inference on the GPU. A good practice is to increase the batch size as long as it increases throughput without running into a GPU out-of-memory error.
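The batch-size advice in the last point can be turned into a simple search. The sketch below doubles the batch size until throughput stops improving or the measurement raises an out-of-memory style error. find_batch_size and fake_measure are hypothetical names, the throughput numbers are made up for the demonstration, and catching RuntimeError assumes your framework reports GPU out-of-memory that way.

```python
def find_batch_size(measure_throughput, start=1, max_batch=256):
    """Double the batch size while throughput improves and memory holds."""
    best_size, best_rps = None, 0.0
    size = start
    while size <= max_batch:
        try:
            rps = measure_throughput(size)
        except (MemoryError, RuntimeError):
            break                      # treat as out-of-memory and stop growing
        if rps <= best_rps:
            break                      # no further throughput gain
        best_size, best_rps = size, rps
        size *= 2
    return best_size, best_rps


def fake_measure(batch_size):
    """Made-up throughput curve that flattens out, then runs out of memory."""
    if batch_size > 32:
        raise RuntimeError("out of memory (simulated)")
    return {1: 10.0, 2: 18.0, 4: 30.0, 8: 41.0, 16: 40.0, 32: 39.0}[batch_size]


print(find_batch_size(fake_measure))  # (8, 41.0)
```

In a real setup, measure_throughput would run the full pipeline for long enough at each batch size to give a stable RPS reading.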