Optimizing ResNet-50: 8X inference throughput with just a few commands

Niksa Jakovljevic
7 min read · Aug 18, 2023

ResNet-50 is one of the most downloaded models on HuggingFace and a very popular choice for image classification. ResNet was introduced in 2015 in the paper Deep Residual Learning for Image Recognition and has many fine-tuned versions available on the HuggingFace Hub.

Trying out this model with the Transformers API is straightforward. The challenge arises when deploying at scale or transitioning to production: understanding model throughput, latency, and capacity planning… and ultimately what it costs to run inference, and how to reduce that cost.
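
For example, a quick classification with the Transformers pipeline takes only a few lines (the image path below is just a placeholder):

from transformers import pipeline

# Load the image-classification pipeline backed by ResNet-50
classifier = pipeline("image-classification", model="microsoft/resnet-50")

# "cat.jpg" is a placeholder; any local image path or URL works
print(classifier("cat.jpg", top_k=3))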

Over the years of my engineering career, my brain was trained to always think about system throughput, scalability, and performance, and to make sure that the systems I build can handle the load and provide certain service guarantees. I believe all of these criteria apply to serving machine learning models as well. In fact, they are probably even more important given the high computational cost of running model inference.

To get the best performance out of the model we need to optimize inference and get the most out of the available hardware. Optimization is not easy and requires a lot of evaluation and benchmarking, which is usually manual and time-consuming work. Luckily, we can use HuggingBench to iterate through this process quickly and increase ResNet inference throughput 📈.

In this blog post I’ll walk you through leveraging https://github.com/huggingbench/huggingbench to minimize GPU and memory usage while serving ResNet-50. Minimizing resource usage is directly correlated with minimizing inference costs 💰.

If you want to understand what HuggingBench is and its origins I recommend reading Introducing HuggingBench: A Path to Optimized Model Serving 🚀.

⚙️ HuggingBench Setup

We’ll obviously need the appropriate hardware for testing, as well as a suitable environment to run HuggingBench. While you’ll likely prefer running it on a machine equipped with a GPU, a CPU-only machine, like your laptop, can suffice. However, be aware that opting out of GPU utilization may compromise performance. That said, some models can still deliver decent performance on a CPU, especially when considering cost-effectiveness.

For this tutorial, I’ve set up an instance on Genesis Cloud, equipped with an Nvidia RTX 3080 boasting 10GB of GPU memory, paired with 32GB of system memory and an 8-core processor.

Please refer to the GitHub README for detailed instructions on installing HuggingBench and its associated dependencies on your chosen machine. HuggingBench heavily utilizes containers to ensure a seamless user experience, so be prepared for a short wait: it might take a few minutes to build and download the necessary Docker images.

The tool relies on the following Docker images:

  • The Nvidia Triton Docker image for inference, available here.
  • Three custom-built Docker images that handle conversion between model formats (ONNX, OpenVINO, and TensorRT) and quantization.
  • Prometheus, which is used to gather metrics during benchmarks.
  • Grafana for visualization and charting.

👀 📊 Observe and visualize

Maintaining clear visibility is crucial during optimizations. It’s essential to grasp how our system responds under load, and the combination of Prometheus and Grafana is invaluable in this regard. The Nvidia Triton server provides metrics in the Prometheus format, and HuggingBench comes with a built-in Grafana Dashboard, streamlining the process for generating basic charts. For details on initiating the Observability stack, please refer to the provided instructions.

cd docker/observability && ./start-docker-compose.sh
Check if Prometheus and Grafana Docker containers are running

If you are running on a remote machine, you might want to open ports 3000 and 9090 to access Grafana and Prometheus. You can try accessing the respective ports in your browser to make sure the services are up. The Grafana login credentials can be found and tweaked in this file.
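
As a quick sanity check, you can also hit the standard health endpoints (a small sketch using the requests library; 9090 and 3000 are the default Prometheus and Grafana ports mentioned above):

import requests

# Replace localhost with your machine's address if running remotely
# Prometheus health endpoint returns HTTP 200 when the server is up
print(requests.get("http://localhost:9090/-/healthy").status_code)

# Grafana exposes a JSON health endpoint
print(requests.get("http://localhost:3000/api/health").json())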

Grafana with pre-built HuggingBench Dashboard

☀️ Warm up

For a start, let’s run the tool using its default settings. In doing so, the tool will:

  1. Download the resnet-50 model from HuggingFace.
  2. Transition the model from PyTorch to ONNX (roughly as in the sketch after this list).
  3. Deploy the model on the Nvidia Triton server with a basic configuration (keep in mind this will utilize only the CPU).
  4. Execute our load-testing tool using simulated data.
  5. Provide a summary upon completion.
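
Step 2 is handled by HuggingBench internally; conceptually it boils down to a torch.onnx.export call along these lines (an illustrative sketch only, not HuggingBench's actual code, and the tensor names, input size, and opset version are assumptions):

import torch
from transformers import ResNetForImageClassification

model = ResNetForImageClassification.from_pretrained("microsoft/resnet-50")
model.eval()

# Dummy input with the usual 3x224x224 ImageNet resolution (an assumption)
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "resnet50.onnx",
    input_names=["pixel_values"],
    output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}},  # allow variable batch sizes
    opset_version=17,
)

With that in mind, the whole default run is a single command: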
hbench triton --id microsoft/resnet-50
HuggingBench Summary

From the results, it’s evident that the default CPU setup can handle roughly 20 inferences per second. While default settings might not be particularly intriguing, they do offer a glimpse into the potential inference capacity. Now, let’s dive deeper!

The batch size

Inference can be executed by sending individual requests sequentially. Alternatively, we can consolidate multiple requests into a single batch and observe the model’s response. Certain models excel at processing numerous inputs simultaneously, which can enhance inference performance. Let’s delve into the details and explore how the GPU, along with different batch sizes, impacts our throughput. Wondering about the ideal batch size? The command below might shed some light on that.

hbench triton --id microsoft/resnet-50 --device gpu --client_workers 8 --batch_size 2 4 8
HuggingBench console summary

We display summary statistics in the console output, complemented by several generated charts. This provides an immediate glimpse into system performance, bypassing the need to check the Grafana dashboard.

Top throughput

The above chart clearly indicates that batching effectively doubles the inference throughput! We’ve uncovered a straightforward strategy that could potentially halve our inference costs! Let’s check Grafana charts for more insights.

Grafana charts for inference with different batch sizes

From the data, it’s evident that with a batch size of 8, we achieve approximately 1100 inferences per second. The GPU usage hovers around 90%, suggesting that we’re nearing the GPU’s capacity. Notably, the increased batch size and inference rate didn’t have a major impact on GPU utilization.
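
For a feel of what a single batched request looks like from the client side, here is a minimal sketch using the tritonclient library. This is not HuggingBench's own load-testing client, and the model name, tensor names, and input shape depend on the Triton configuration HuggingBench generates, so treat them as assumptions:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Eight images sent as one batch instead of eight separate requests
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("pixel_values", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer(model_name="resnet-50", inputs=[inp])
print(result.as_numpy("logits").shape)  # one logits row per image in the batch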

🏗️ Model format and model instances

TensorRT, a model format and runtime from Nvidia, is highly optimized for inference. We’re keen to see if it can bolster our inference performance. Alongside assessing the model format, it might be worthwhile to investigate whether running multiple model instances offers any advantages. By default, the Triton server operates just one instance per model, but could running additional instances enhance our ability to handle more inference requests? Naturally, there’s a catch: deploying more model instances could strain GPU memory, so it’s a balancing act.

hbench triton --id microsoft/resnet-50 --device gpu --client_workers 8 --batch_size 8 --format trt --instance_count 1 2 4
HuggingBench TensorRT
Inferences for different number of model instances with TensorRT

It’s evident that the TensorRT format delivers a twofold boost in inference performance! The number of model instances doesn’t appear to significantly affect this performance. Moreover, with over 2K inferences per second, CPU utilization is nearing its maximum at 100%, while the GPU lingers around 50%. This suggests that by augmenting the CPU capacity — which is relatively cheap — we might potentially double our inference rate!
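
Incidentally, if you want to see exactly which configuration HuggingBench generated for a run, including the instance_group block that --instance_count maps onto, you can ask the Triton server directly while it is up (the model name here is an assumption):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Returns the deployed model's configuration as a dict, including the
# instance_group section controlled by --instance_count
print(client.get_model_config("resnet-50"))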

💡 Quantization

Quantization is often used in ML to improve performance. Let’s briefly explain what it is. Wikipedia offers a definition in broad terms:

“Quantization is the process of mapping continuous infinite values to a smaller set of discrete finite values”.

HuggingFace provides a more precise definition from an ML perspective:

“Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).”
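
To make that concrete, half precision literally halves the bytes needed per value, which is easy to verify with a tiny PyTorch snippet (unrelated to HuggingBench itself):

import torch

weights = torch.randn(1000)           # float32 by default
print(weights.element_size())         # 4 bytes per value
print(weights.half().element_size())  # 2 bytes per value: half the memory and bandwidth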

Let’s see if quantization can help us achieve even better inference performance with ResNet-50 on the Triton server. The default model uses 32-bit floating-point precision, and we will try 16-bit floating-point precision instead.

hbench triton --id microsoft/resnet-50 --device gpu --client_workers 8 --batch_size 8 --format trt --precision fp16 
HuggingBench quantization
Half-precision inference

Wow! Remarkably, by employing half-precision (FP16), we’ve managed to again double the inference throughput, reaching up to 4,000 inferences per second. We observe that GPU utilization remains around 50%, but the CPU is nearly at its limits. It seems likely that by allocating a few more CPU cores — which is cost-effective — we could further enhance our inference rate.

📝 Concluding remarks

Our exploration has taken us from a starting point of mere hundreds to thousands of inferences per second. Even better, we did it all by running just a couple of commands in our terminal! Some noteworthy conclusions:

  • A batch size of 8 doubled the throughput
  • The TensorRT format doubled the throughput compared to ONNX
  • Using half precision (fp16) doubled the throughput again
  • We can most likely increase inference throughput even further by provisioning more CPU cores

Keep an eye out for further benchmark insights!
