How We Boosted Video Processing Speed 5x by Optimizing GPU Usage in Python

Lightricks Tech Blog · Oct 11, 2022

By: Nitsan Hasson

Those of us in the ML field are familiar with the difficulty of taking a cool algorithm and making it production-ready. On the Lightricks research infrastructure team, our job is to take the groundbreaking work done in Lightricks’ research department and make it run faster, so our users can enjoy the best possible experience.

Some of our use cases involve working with videos, so being able to efficiently decode and preprocess them becomes an important task. In this blog post, I will show how we managed to boost the decoding and preprocessing speed of videos by 5x, while leveraging the GPU’s amazing parallel processing power, and still keeping the code simple and easy to maintain.

I started working with a great team of researchers on a video feature that is processed in the cloud, and saw that the existing way of loading and preprocessing videos worked just fine. The code receives a path to a file, reads it frame by frame in a loop, resizes the frames according to some predefined configuration, and returns a list of the downsampled frames.

It looked something like this:
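The original snippet is embedded in the post; here is a minimal sketch of what such a CPU loop typically looks like, assuming OpenCV (cv2) as the decoder and an illustrative file path and target resolution:

```python
import time

import cv2  # OpenCV is an assumption here; any frame-by-frame decoder fits the same pattern


def preprocess_on_cpu(video_path, target_size=(1152, 720)):
    """Read a video frame by frame and resize every frame on the CPU."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # Resize according to some predefined configuration.
        frames.append(cv2.resize(frame, target_size))
    capture.release()
    return frames


print("Preprocessing on CPU…")
start = time.time()
frames = preprocess_on_cpu("input_video.mp4")  # path is illustrative
print(f"Preprocess on CPU time: {time.time() - start}")
```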

There was only one problem — the execution time. Here is the resulting output:

Preprocessing on CPU…
Preprocess on CPU time: 15.3462252616882

For a 10-second 3584x2240 resolution video, the implementation above takes over 15 seconds to complete. During these 15 seconds, the rest of the pipeline is simply idle.

High-level flow of the project, with the preprocessing stage highlighted

A possible way around this would be prefetching: load and preprocess the video on one thread while the rest of the algorithm runs on another. While the algorithm thread is processing frame X, the preprocessing thread is already handling frame X+1, which will be ready by the time the algorithm is done with frame X. In our case, however, the algorithm works on the entire video file and needs all of the frames at once, so it has to wait for preprocessing to finish; prefetching therefore doesn’t help us.
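For reference, a prefetch pattern usually looks something like the sketch below (a background thread feeding a bounded queue); it is shown only to illustrate the idea we ruled out:

```python
import threading
from queue import Queue


def prefetch(frame_iterator, maxsize=8):
    """Decode/preprocess frames on a background thread and yield them as they become ready."""
    queue = Queue(maxsize=maxsize)
    sentinel = object()  # marks the end of the stream

    def producer():
        for frame in frame_iterator:
            queue.put(frame)
        queue.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := queue.get()) is not sentinel:
        yield item
```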

Another issue is that the returned object is a list of frames. Python lists are not optimized for batch processing, and converting this list into a tensor or a NumPy array is expensive, as it requires copying all of the data into a new contiguous buffer.
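To get a sense of the cost, consider the 10-second, 30 fps, 3584x2240 clip from the benchmark: turning its frame list into a single array means allocating and filling a buffer of several gigabytes. A rough illustration, assuming uint8 RGB frames:

```python
import numpy as np

# ~300 frames of 3584x2240 uint8 RGB, roughly 24 MB each.
frames = [np.zeros((2240, 3584, 3), dtype=np.uint8) for _ in range(300)]

# np.stack (like torch.stack or np.asarray on the list) must allocate one
# contiguous buffer of ~7.2 GB and copy every frame into it.
batch = np.stack(frames)
print(batch.shape, f"{batch.nbytes / 1e9:.1f} GB")
```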

Using the GPU to Speed Things Up

We wanted to see how we could use the GPU for the preprocessing. Nvidia provides codecs and libraries that do exactly what we were looking for, such as Nvidia’s Video Codec SDK or VPF (Video Processing Framework). The downside of using them directly is that they require quite a bit of setup work, and they are harder to use compared to the easy life we had with our previous Python implementation.

While researchers will most likely prefer to stick with their slow-but-working Python preprocessing pipeline, in production speed is a higher priority. It was starting to look like we would have to integrate one of the lower-level alternatives instead. That solution could get cumbersome, with one implementation for training and another for production. It would also most likely produce slightly different outputs from the preprocessing stage, caused by small implementation or precision differences, which could show up as artifacts in the final algorithm results.

We were looking for a way to speed things up for both use cases, and provide an easy way for everyone to use it.

Torchvision’s Solution

Torchvision implements a video reader that decodes videos on the GPU. It is still in beta and requires some setup, as you must build Torchvision from source. The steps are pretty straightforward: install FFmpeg, download Nvidia’s Video Codec SDK, and make sure you build the package against a matching Torch version. The end result is a Torchvision package with GPU video decoding capabilities that’s super easy to use. You can then publish this package to your internal package repository, making it accessible to everyone on your team.
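Once the custom build is installed, a quick sanity check can confirm that frames really land on the GPU. This is a sketch: the device argument and the sample path are assumptions based on how recent GPU-enabled builds expose the decoder, so check the VideoReader signature of the version you built:

```python
import torch
from torchvision.io import VideoReader

# Assumes the GPU-enabled build exposes a `device` argument on VideoReader
# and that a CUDA device is visible to the process.
assert torch.cuda.is_available()

reader = VideoReader("sample.mp4", device="cuda")  # path is illustrative
first_frame = next(iter(reader))["data"]
print(first_frame.device, first_frame.shape, first_frame.dtype)  # should report a CUDA device
```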

Here is the Torchvision version of the previous code:
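The exact code is embedded in the original post; below is a sketch of the same idea under a couple of assumptions: the GPU-backed VideoReader accepts a device argument (as in recent GPU-enabled builds), the frames it yields are CHW CUDA tensors, and torchvision.transforms.functional.resize handles the batched resize. Names such as preprocess_on_gpu are illustrative.

```python
import time

import torch
from torchvision.io import VideoReader
from torchvision.transforms.functional import resize


def preprocess_on_gpu(video_path, target_size=(720, 1152), frames_per_cycle=30):
    """Decode frames straight into GPU memory and resize them in batches of frames_per_cycle."""
    reader = VideoReader(video_path, device="cuda")  # GPU decoder from the source build
    resized_batches, batch = [], []
    for frame in reader:
        batch.append(frame["data"])  # assumed to be a CHW CUDA tensor; layout may depend on your build
        if len(batch) == frames_per_cycle:
            # Resize a whole batch at once, then drop the full-size frames.
            resized_batches.append(resize(torch.stack(batch), list(target_size)))
            batch = []
    if batch:  # leftover frames that didn't fill a full cycle
        resized_batches.append(resize(torch.stack(batch), list(target_size)))
    return torch.cat(resized_batches)


print("Preprocessing on GPU…")
start = time.time()
frames = preprocess_on_gpu("input_video.mp4")  # path is illustrative
torch.cuda.synchronize()  # make sure the GPU work is finished before timing
print(f"Preprocess on GPU time: {time.time() - start}")
```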

This code initializes the VideoReader object, then reads frames into a list. Keep in mind that we want to perform the resize on large batches of frames for better performance. However, holding all the full-size frames in GPU memory could cause a crash, depending on the video length and the amount of GPU memory available, so we use the frames_per_cycle parameter: every frames_per_cycle frames, we resize the current batch and move on to the next one. This parameter can be fine-tuned depending on the GPU used and the video’s original resolution; I used a T4 GPU.

The execution time improved greatly. Here is the output of the new implementation:

Preprocessing on GPU…
Preprocess on GPU time: 3.00951886177063

We see a speedup of about 5x in the preprocessing time!

Here is a diagram showing the results for a number of other videos:

Benchmarking results for several videos

Notice how our new GPU implementation is about 2.5 times faster than the old CPU one for all of the 1152x720 videos, except for the 10-second one, where the CPU version is actually a bit faster. It seems that for short, low-resolution videos, the overhead of launching work on the GPU outweighs the gain over simply using the CPU. For the 3584x2240 videos, the GPU version is 5 times faster across all video lengths.

What Happened Here?

Let’s take a look at the profiling results of these two implementations. Using Nvidia’s Nsight Systems, we’ll be able to understand why the new implementation actually works faster. Nsight enables us to examine the code’s workflow, and see how different resources are being used.
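The post doesn’t show the profiling setup itself. One common way to get labeled blocks in the Nsight Systems timeline is to wrap the interesting stages in NVTX ranges via torch.cuda.nvtx and capture the run with the nsys CLI; everything below (sizes, range names, the nsys invocation in the comment) is an illustrative sketch, not taken from the original code:

```python
import torch

# Stand-in for a batch of decoded frames already on the GPU (sizes are illustrative).
frames = torch.randn(30, 3, 720, 1152, device="cuda")

# Named NVTX ranges show up as labeled blocks in the Nsight Systems timeline
# (capture with something like `nsys profile -o report python script.py`).
torch.cuda.nvtx.range_push("resize_batch")
resized = torch.nn.functional.interpolate(frames, size=(360, 576), mode="bilinear")
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()  # ensure the GPU work lands inside the captured range
print(resized.shape)
```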

First let’s look at the first implementation’s results:

Old implementation’s profiling results

We can see there is no sign of GPU usage. The NVTX row shows us the stack trace — the code just reads the video frame by frame and resizes it, using only the CPU.

Now let’s look at the new implementation’s profiling results:

New implementation’s profiling results

Marked in red is everything that runs on the CPU. We can see that we now have a row called “CUDA API”, which shows us the CUDA operations happening on the CPU. Marked in purple is a new segment that we didn’t see in the previous results; it shows everything that runs on the GPU. The row called “>99.9% Kernels” shows all the CUDA kernels launched by the application. The higher and denser the blocks in this row are, the more utilized the GPU is at that given time.

You’ll notice that even when using the GPU, there are still a lot of gaps in its graph, meaning it’s still idle sometimes. This is due to the fact that we’re still operating in a Python environment — we have a Python loop reading frames as tensors and appending them into a Python list, before resizing them on the GPU.

We can also see several peaks in the GPU utilization, one for each resize operation. Since we ran the profiling on a 10-second video of 30 fps, and resized batches of 30 frames, we have exactly ten peaks. They all start with the CPU launching the resize on the GPU, which is marked with the purple arrows. Here is a closer look:

A closer look at the new implementation’s profiling results

In between each pair of peaks (marked in pink), we can see a more subtle GPU utilization marked in green. This is the part that actually reads the frames one by one, directly into GPU memory.

To improve our benchmarks even more, and avoid idle GPU time as much as possible, we would have to use a lower-level solution. I mentioned earlier that Torchvision is using Nvidia’s Video Codec SDK under the hood, which provides an API for hardware accelerated video encoding and decoding. We can use that SDK to write our own C++ pipeline that decodes and preprocesses the video, then sends it to the algorithm. Since we wanted to increase the preprocessing speed, but still keep things Pythonic and simple, using Torchvision was an acceptable compromise for us.

Conclusions

Turns out it doesn’t have to get ugly to go faster! We managed to utilize our GPU better and get the preprocessing stage running 5x faster, with almost no added complexity in the codebase. The end result is easy to work with and maintain, and suits our runtime constraints. If you run into a similar situation, where you can get a great performance boost without compromising your code’s simplicity, you should definitely give it a go.


Create magic with us
We’re always on the lookout for promising new talent. If you’re excited about developing groundbreaking new tools for creators, we want to hear from you. From writing code to researching new features, you’ll be surrounded by a supportive team who lives and breathes technology.
Sounds like you? Apply here.
