Decoding JPEGs: OpenCV Falls Behind TensorFlow and torchvision in Speed Tests

Vladimir Iglovikov
5 min read · Feb 26, 2024


In machine learning, specifically for computer vision tasks, the speed at which images are processed isn’t just a tech issue — it’s also a cost issue. If your image loading is sluggish, it can slow down batch preparation to the point where your GPU is just waiting around. Here’s how that hits your wallet:

  1. Longer training times: In the cloud, you pay for GPU usage by the hour. If your batches are slow to prepare, you’re burning through cash without actually training the model.
  2. Slower iterations: More time per iteration means slower progress. It’s not just about patience; it’s about efficiency and cost.
  3. Underutilized talent: While your GPUs idle, so do your machine learning engineers and researchers. Their time is expensive; if they’re waiting on slow model training, that’s money down the drain.

So, speedy image processing isn’t a luxury — it’s essential to keeping costs down and productivity up.

This post presents the results of benchmarking several popular image-reading libraries against each other.

For those who prefer a hands-on approach and wish to run the benchmark themselves, here’s the link to the GitHub repository containing all the necessary code.

Contents:

  1. Introduction
  2. Benchmark Configuration
  3. Benchmark Methodology
  4. Results
  5. Conclusion and Recommendations
  6. Running the Benchmark Yourself

Introduction

Low GPU utilization is a tricky beast in the world of machine learning and can stem from a myriad of issues, each with its own set of solutions. But today, we’re zeroing in on one particular scenario that’s deceptively impactful: the process of reading JPG images from disk into numpy arrays, which are then typically used for further image processing with tools like Albumentations or ImgAug.

Why focus on this? Unlike hardware upgrades, which can be costly and time-consuming, switching the image reading library is a quick fix — a few lines of code.

The real question is, with all libraries available, how do they stack up against each other in terms of performance?

Benchmark Configuration

Libraries and Versions

We evaluated several popular image-reading libraries, including OpenCV, torchvision, and TensorFlow; the full list with pinned versions is available in the benchmark repository.

Hardware Setup

Tests were performed on an AMD Ryzen Threadripper 3970X 32-Core Processor with these storage devices:

  • QVO 870: SAMSUNG 870 QVO SATA III 2.5" SSD 8TB
  • Samsung 960Pro: Samsung 960 PRO Series — 512GB PCIe NVMe — M.2
  • EVO 850: Samsung 850 EVO 1 TB 2.5-Inch SATA III Internal SSD
  • WDC 8TB: WD Red Pro 8TB NAS Internal Hard Drive — 7200 RPM, SATA 6 Gb/s
  • WDC 6TB: WD Red Pro 6TB 3.5-Inch SATA III 7200rpm NAS Internal Hard Drive
  • Toshiba 8TB: Toshiba X300 8TB Performance Desktop Gaming Hard Drive 7200 RPM 128MB Cache SATA 6.0Gb/s 3.5 inch

Benchmark Methodology

We utilized the first 2000 images from the ImageNet validation dataset for this evaluation. Our methodology included a warm-up run for the disks, which is particularly important for HDDs, to ensure they were at operational speed. This was followed by five measurement runs, during which we shuffled the order in which libraries were tested to prevent bias.
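The methodology above can be sketched as a small timing harness. This is a simplified illustration, not the benchmark's actual code; the `readers` dict maps a library name to a one-file decode function and its entries are supplied by the caller:

```python
import random
import time

def benchmark(readers: dict, paths: list, runs: int = 5) -> dict:
    """Time each reader over all paths, with one warm-up pass and
    several measurement runs in shuffled library order."""
    # Warm-up pass: read every file once so the OS page cache is populated
    # and HDDs are at operational speed before any timing starts.
    for path in paths:
        with open(path, "rb") as f:
            f.read()

    timings = {name: [] for name in readers}
    names = list(readers)
    for _ in range(runs):
        # Shuffle the library order each run to prevent ordering bias.
        random.shuffle(names)
        for name in names:
            start = time.perf_counter()
            for path in paths:
                readers[name](path)
            timings[name].append(time.perf_counter() - start)
    return timings
```

Because the warm-up pass pulls the files into the OS cache, the measured times are dominated by decoding rather than disk I/O, which matches the intent described above.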

Due to the caching mechanisms inherent in modern operating systems and storage devices, this benchmark primarily measures the JPEG decoding speed of each library rather than the raw random-access speed of reading images from disk. In other words, it isolates decoding performance from disk I/O.

Results

Source: https://github.com/ternaus/imread_benchmark
  • There are no significant fluctuations in image reading speeds across different disk types in this benchmark setup. While SSDs (QVO 870, Samsung 960Pro, EVO 850) and HDDs (WDC 8TB, WDC 6TB, Toshiba 8TB) perform similarly under these conditions, in real-world scenarios the impact of random-access read speed can be quite pronounced. That aspect, however, falls outside the scope of this benchmark, which evaluates JPEG decoding speed independent of disk read times.
  • Though not shown here, removing the tensor-to-numpy array conversion in torchvision and TensorFlow and the BGR-to-RGB conversion in OpenCV does not significantly change the results.
  • An unexpected finding is that OpenCV decodes JPEGs almost twice as slowly as the other libraries, which could be a significant consideration for projects relying heavily on image processing.

Conclusion and Recommendations

If low GPU utilization during training (which can be monitored using tools such as nvidia-smi or nvtop) is not accompanied by high CPU usage (observable through tools like htop), the bottleneck is likely not in CPU-heavy image augmentation.

In such cases, checking the efficiency of image reading could be beneficial. For those using OpenCV, switching to another library might mitigate issues with minimal code changes, potentially leading to better utilization of hardware resources.
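As an illustration of how small that code change can be, here is a hypothetical drop-in replacement for an OpenCV-based reader, using Pillow (Pillow-SIMD shares the same API). The function name `load_rgb_pil` is ours, not from the benchmark:

```python
import numpy as np
from PIL import Image

def load_rgb_pil(path: str) -> np.ndarray:
    """Decode a JPEG into an RGB uint8 numpy array via Pillow."""
    with Image.open(path) as img:
        # Pillow decodes to RGB directly, so unlike OpenCV there is
        # no BGR-to-RGB conversion step.
        return np.asarray(img.convert("RGB"))
```

Since the function returns the same RGB numpy array an OpenCV-based loader would (after its BGR-to-RGB conversion), the rest of the augmentation pipeline is unaffected by the swap.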

Keep an eye on both your GPU and CPU metrics to ensure that your machine-learning pipelines are running as efficiently as possible. If you observe imbalances, such as low GPU and CPU utilization, consider revising your data preprocessing and image loading strategies.

Running the Benchmark Yourself

Access the benchmark code here for a direct experience and to test different image sets or hardware configurations. This can provide insights tailored to your specific setup.

Engagement

If you found this analysis helpful, consider supporting it with claps on Medium, which allows up to 50 per article.

Additionally, if the benchmark code aids your project, a “star” on the repository would be greatly appreciated.

I’m open to connecting on various platforms.

Your feedback and connections are highly valued.
