A Kickstart in Deep Learning Real-Time Video Processing

A 0 to 100 guide to get you started with Video Processing within Deep Learning. Image processing, video formats, re-encoding, streaming through HTTP, WebSockets, and WebRTC.

Alex Razvant
Decoding ML
9 min read · May 9, 2024


Lately, the whole ML space seems to have been overshadowed by LLMs and RAG. It’s natural for it to be that way, since a lot of use cases can benefit from these new foundation models, but there’s still a gap they have to cover on non-text data.

I tend to compare this current stage in ML to the car industry and the shift between gasoline and electric cars. There’s a whole infrastructure set up for gasoline-based cars (car services, accessible gas stations, etc.), whereas electric charging stations and specialized electric car service locations are not mature enough yet, but they’re catching up.

Where I’m going with this comparison is the following idea: transformer-based models have proven their utility in a lot of use cases, but they still need time to overtake the existing, proven systems on vision tasks.

However, today’s focus is on engineering — especially how one could tackle the problem of latency in embedded video streaming applications that use ML since video reading/processing/streaming is at the core of vision systems.

Table of Contents

  1. What is Video Processing?
    1.1 Video Codec
    1.2 Bitrate
    1.3 Resolution
    1.4 Framerate
    1.5 Video Containers
    1.6 How video re-encoding works
  2. Common vision libraries
    2.1 OpenCV
    2.2 Albumentations
    2.3 PyAV (FFmpeg bindings)
  3. Video Streaming Methods
    3.1 Streaming with HTTP
    3.2 Streaming with Websockets
    3.3 Streaming with WebRTC
  4. Conclusion

What is video processing?

Video processing refers to a set of techniques and methods used to manipulate and analyze video streams. Let’s go over the key components one must know when describing video processing:

  1. Codec
    A codec is a hardware- or software-based process that compresses (encodes) and decompresses (decodes) large amounts of video and audio data. Codecs are essential for reducing the size of video/audio files and streams, as a single RAW video file can take up a very large amount of space.
    Let’s take an example and verify the raw size of a 60-second, 1920x1080, 30 FPS video file.
W   = Width (pixels)
H   = Height (pixels)
FPS = Frame Rate (frames/s)
BIT = Bit Depth (bits per pixel)
DUR = Duration (video length in seconds)

File Size (bytes) = W x H x FPS x (BIT / 8) x DUR
File Size (bytes) = 1920 x 1080 x 30 x (24 / 8) x 60 = 11,197,440,000 bytes
File Size (MB)    = 11,197,440,000 / (1024 ** 2) ≈ 10,678.71 MB
File Size (GB)    = 10,678.71 / 1024 ≈ 10.43 GB
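
As a quick sanity check, here is the same arithmetic in a few lines of Python (nothing beyond the numbers above):

width, height = 1920, 1080   # W x H
fps = 30                     # frames per second
bit_depth = 24               # bits per pixel (8 bits x 3 channels)
duration_s = 60              # seconds

size_bytes = width * height * fps * (bit_depth / 8) * duration_s
print(f"{size_bytes:,.0f} bytes")         # 11,197,440,000 bytes
print(f"{size_bytes / 1024**2:,.2f} MB")  # ~10,678.71 MB
print(f"{size_bytes / 1024**3:,.2f} GB")  # ~10.43 GB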

If videos were stored and streamed in RAW form, YouTube would only be able to store and stream PewDiePie’s channel and nothing else, due to storage and network limits.

Here are the most popular codecs used for video compressing:

  • H.264 (AVC): Highly efficient, balances quality with relatively low file sizes. Compatible with almost all video players and streaming services.
  • H.265 (HEVC): Better data compression at the same level of video quality as H.264.
  • VP9: Developed by Google, used primarily for streaming high-definition video on platforms like YouTube.

2. Bitrate
Refers to the amount of data processed in a given amount of time, typically measured in bits per second (bps). In video, bitrate is crucial as it directly affects the quality and size of the video:

  • High Bitrate: More data per second, leading to higher video quality but larger file sizes.
  • Low Bitrate: Reduces file size but leads to poorer video quality, manifesting as blurriness or blockiness in the video.

3. Resolution
Indicates the number of pixels in each dimension that can be displayed. We’re all familiar with HD (1280x720), FHD (1920x1080), and 4K (3840x2160), which are the most widely used resolutions.

4. Frame Rate
Describes how many individual images (frames) are shown each second. I still remember the 9 FPS I got when playing GTA 4 on a crappy PC.

5. Container Formats
Containers such as MP4 and AVI encapsulate video, audio, and metadata. They manage how data is stored and interchanged without affecting the quality. When your media player streams a video, it processes blocks of data from within a container.

On a more detailed note, the way a video container is structured makes it simple to convert from one video format to another. In this process, the following key terms are employed:

  • SOURCE — the video in format A.
  • DEMUX — the component that splits the video stream from the audio stream.
  • DECODER — decompresses both streams (from compressed format to RAW format)
  • ENCODER — re-compresses the RAW streams using new Video and Audio codecs.
  • MUX — re-links and synchronizes the video stream with the audio stream.
  • TARGET — dumps the new data stream (video + audio) into a new container.
(Diagram of the re-encoding pipeline: SOURCE → DEMUX → DECODER → ENCODER → MUX → TARGET. Image by author.)
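
To make the chain above concrete, here is a minimal re-encoding sketch using PyAV (covered in the next section). The file names and the H.264 / 30 FPS output settings are placeholder assumptions, and audio is omitted for brevity:

import av  # PyAV: Pythonic bindings over FFmpeg

source = av.open("input.avi")             # SOURCE: container in format A
target = av.open("output.mp4", mode="w")  # TARGET: new container in format B

in_stream = source.streams.video[0]

# ENCODER: re-compress the decoded frames with a new codec (H.264 here).
out_stream = target.add_stream("h264", rate=30)
out_stream.width = in_stream.codec_context.width
out_stream.height = in_stream.codec_context.height
out_stream.pix_fmt = "yuv420p"

# DEMUX + DECODE: pull RAW frames out of the source video stream.
for frame in source.decode(in_stream):
    # ENCODE + MUX: compress each frame and write it into the new container.
    for packet in out_stream.encode(frame):
        target.mux(packet)

# Flush any buffered packets, then close both containers.
for packet in out_stream.encode():
    target.mux(packet)

target.close()
source.close()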

Common libraries to work with video in Python

When working on Computer Vision projects, image processing and manipulation are mandatory.
They show up everywhere: from data preparation, labeling, QA, augmentations, and model training, to the pre-processing/post-processing steps required after a model is deployed in production.

Here’s a list of libraries and tools a Computer Vision Engineer has to know/work with:

1. OpenCV

The de-facto standard library for image and video processing: reading and writing frames, color conversions, resizing, drawing, and a large collection of classic vision algorithms.
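
Since the original snippet is embedded as an image, here is a minimal stand-in sketch; the video path is a placeholder:

import cv2  # OpenCV

cap = cv2.VideoCapture("input.mp4")  # placeholder path; 0 would open the default webcam

while cap.isOpened():
    ok, frame = cap.read()           # frame is a BGR NumPy array of shape (H, W, 3)
    if not ok:
        break

    frame = cv2.resize(frame, (640, 360))         # downscale for faster processing
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # most models expect RGB input

    # ... run inference, draw results, write frames, etc. ...

cap.release()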

2. Albumentations
A fast and efficient library widely used for dataset augmentation on vision tasks. Its transforms are built on top of highly optimized OpenCV and NumPy routines, which keeps CPU augmentation pipelines fast.

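The original example is an image as well; here is a small illustrative pipeline (the specific transforms and image path are assumptions):

import albumentations as A
import cv2

# A few common training-time augmentations, composed into one pipeline.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Resize(height=640, width=640),
])

image = cv2.imread("sample.jpg")             # placeholder path (BGR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
augmented = transform(image=image)["image"]  # augmented NumPy array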

3. PyAV
Provides the Python bindings for FFmpeg. PyAV is recommended when more detailed control over the raw video frames or audio packets is required.
Here, frames are decoded in the YUV420p pixel format, where the Y (luma) plane stores brightness and the U, V (chroma) planes store color information, which is lighter than RGB.

+-------------+----------------------------------+----------------------------------+
| Feature     | YUV420                           | RGB                              |
+-------------+----------------------------------+----------------------------------+
| Channels    | Y, U, V (luminance and           | Red, Green, Blue                 |
|             | two chrominance)                 |                                  |
+-------------+----------------------------------+----------------------------------+
| Storage     | Less storage due to chroma       | More storage required, since all |
| Efficiency  | subsampling                      | three color channels are stored  |
+-------------+----------------------------------+----------------------------------+
| Bandwidth   | Highly efficient for             | Requires more bandwidth, all     |
| Usage       | transmission                     | channels are fully sampled       |
+-------------+----------------------------------+----------------------------------+
| Complexity  | Higher                           | Lower                            |
+-------------+----------------------------------+----------------------------------+
| Suitability | Better for video compression     | Better for image editing;        |
|             | and transmission                 | universal compatibility          |
+-------------+----------------------------------+----------------------------------+
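
A minimal PyAV decoding sketch, assuming a local file path, showing where the yuv420p frames and their Y/U/V planes come from:

import av

container = av.open("input.mp4")  # placeholder path

for frame in container.decode(video=0):
    # Decoded frames are av.VideoFrame objects; many codecs output yuv420p.
    print(frame.format.name, frame.width, frame.height)

    # Access the Y (luma) and U, V (chroma) planes directly...
    y_plane, u_plane, v_plane = frame.planes[0], frame.planes[1], frame.planes[2]

    # ...or convert to an RGB NumPy array when a model needs it.
    rgb = frame.to_ndarray(format="rgb24")
    break

container.close()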

Video Streaming Methods

When real-time streaming is required in production, Computer Vision Engineers often have to develop video processing workflows optimized for low compute, especially if the deployment also includes vision models such as Object Detectors or Segmentation models and is intended to run at the Edge.

Video decoding is CPU-hungry, and because hardware resources at the Edge are limited, one should aim to get the most out of the deployed system while keeping the resource and energy footprint low.
We’ll get into that in a future article.

In a large majority of Computer Vision projects, the processing is done at the Edge, either on servers that have access to RTSP Cameras or on devices that dump frames locally or stream them through Ethernet.

For instance, to tackle the problem of detecting manufactured pieces that would fail the quality test on a factory line, one could train and deploy stacks that use a live video feed and Segmentation models to identify key areas at risk.

Image taken from viso.ai

Another instance would be identifying shelf-restocking times in a store by employing Object Detection, Depth Prediction, and Semantic Segmentation to alert staff in real time whenever shelves need restocking.

Image taken from viso.ai

For the scope of this article, let’s start simple and first focus on common video streaming methods one could implement in Python to stream frames in real time from an API to a client app.

For this, we’ll use FastAPI as our streaming API and a basic React application as our client to demonstrate the concept.

We’ll cover three methods: HTTP, WebSockets, and WebRTC.
For each one, we’ll iterate over the code, both FastAPI and React, and note when the method applies best.

Streaming through HTTP

This is a quick and practical approach, making it the most straightforward one to validate the concept of streaming a video to a Web Application.

For small use cases this might work, but once the application scales and has to support many devices or workflow streams, latency, the overhead added by HTTP headers, and bandwidth usage start to impose challenges.

  • FastAPI Endpoint (a minimal sketch is shown after this list)
  • React Web Endpoint
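
Since the endpoint code is embedded as an image, here is a minimal sketch of what an HTTP (MJPEG) streaming endpoint typically looks like in FastAPI; the route name and the video source are assumptions:

import cv2
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def mjpeg_generator(source=0):
    """Yield JPEG-encoded frames as multipart/x-mixed-replace parts."""
    cap = cv2.VideoCapture(source)  # placeholder source (webcam)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)
            if not ok:
                continue
            yield (b"--frame\r\n"
                   b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
    finally:
        cap.release()

@app.get("/video/http")
def stream_http():
    return StreamingResponse(
        mjpeg_generator(),
        media_type="multipart/x-mixed-replace; boundary=frame",
    )

On the React side, a stream like this can usually be rendered by pointing a plain <img> tag at the endpoint URL, which is part of why HTTP is the quickest method to validate.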

Streaming through WebSockets

WebSockets provide a more efficient alternative to HTTP, as they allow for lower latency, real-time interaction, and a more optimized way to send data.

HTTP is stateless, meaning you trigger the endpoint and get a response. With sockets, once the handshake is complete, data keeps streaming for as long as the connection is in an open state.

That leads to the requirement of managing and “storing” the socket state, making them stateful.

  • FastAPI Endpoint (a minimal sketch is shown after this list)
  • React Web Endpoint
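
Again, a minimal sketch rather than the author’s exact code: the server accepts the WebSocket handshake, then pushes JPEG-encoded frames as binary messages for as long as the connection stays open. The route name, source, and frame pacing are assumptions.

import asyncio
import cv2
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/video/ws")
async def stream_ws(websocket: WebSocket):
    await websocket.accept()       # complete the WebSocket handshake
    cap = cv2.VideoCapture(0)      # placeholder source (webcam)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                # Each frame travels as one binary message over the open socket.
                await websocket.send_bytes(jpeg.tobytes())
            await asyncio.sleep(1 / 30)  # crude pacing at ~30 FPS
    except WebSocketDisconnect:
        pass                        # the client closed the connection
    finally:
        cap.release()

A React client would typically open a WebSocket to this route, receive the binary frames, and draw them onto a canvas or an <img> element.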

Streaming through WebRTC

WebRTC (Web Real-Time Communication) is a technology standard that enables real-time communication over P2P (peer-to-peer) connections without the need for complex server-side implementations.

This is a more complex protocol to understand, compared to HTTP and Websockets, and it’s tackling video/audio streaming specifically.

Be it Zoom calls, FaceTime, Teams, or Google Meet, that’s RTC at play!

Here are its main components to get you going:

  • Data channels: Enable the arbitrary exchange of data between peers, be it browser-to-browser or API-to-client.
  • Encryption: All communication, including audio and video, is encrypted, ensuring secure sessions.
  • SDP (Session Description Protocol): During the WebRTC handshake, both peers exchange SDP offers and answers. In short, SDP describes the media capabilities of each peer so they can gather information about the session. The SDP offer describes the type of media the peer is requesting, and the SDP answer confirms that the offer was received and, likewise, exchanges its own media configuration.
  • Signaling: The method through which the offer-answer exchange is achieved (Sockets, REST API). In our use case, we’re using a POST endpoint to open the channel.

Let’s walk through the code:

  • FastAPI Endpoint (a minimal sketch is shown after this list)
  • React Web Endpoint
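
The original endpoint is shown as an image, so here is a hedged sketch of the signaling piece using aiortc (an assumption, since the article does not name the library): the client POSTs its SDP offer, the server attaches a video track and replies with its SDP answer.

import numpy as np
from aiortc import RTCPeerConnection, RTCSessionDescription, VideoStreamTrack
from av import VideoFrame
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
peer_connections = set()  # keep references so connections aren't garbage collected

class Offer(BaseModel):
    sdp: str
    type: str

class SyntheticVideoTrack(VideoStreamTrack):
    """Placeholder track that emits black frames; swap in real camera frames."""
    async def recv(self):
        pts, time_base = await self.next_timestamp()
        frame = VideoFrame.from_ndarray(np.zeros((480, 640, 3), dtype=np.uint8), format="bgr24")
        frame.pts, frame.time_base = pts, time_base
        return frame

@app.post("/offer")                      # the signaling endpoint (plain HTTP POST)
async def offer(params: Offer):
    pc = RTCPeerConnection()
    peer_connections.add(pc)
    pc.addTrack(SyntheticVideoTrack())   # the media the browser will receive

    await pc.setRemoteDescription(RTCSessionDescription(sdp=params.sdp, type=params.type))
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)

    # The SDP answer goes back to the browser to finish the handshake.
    return {"sdp": pc.localDescription.sdp, "type": pc.localDescription.type}

Once the answer is applied on the client, media flows directly between the two peers over the WebRTC connection rather than through the HTTP endpoint.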

Now that we’ve iterated through the streaming methods, let’s see them live.
Once you’ve cloned the 🔗 Repository and installed the required packages described in the README file, run the following:

# To start the FastAPI
make run_api

# To start the React Web App
make run_ui

Inspecting Results

Once you’ve started both the FastAPI backend and the React web frontend, head over to localhost:3000 in your browser and check the results.

The React UI — image by Author.

Conclusion — End of Part I

In this article, we covered the structure of a video format and the key components one must grasp to understand how video works.

We’ve also iterated over a few widely known libraries that make it easy to start working with video/image data, and we ended with a walkthrough of three video streaming methods: HTTP, WebSockets, and WebRTC.

In Part II, in the same manner, we’ll extend our little project into a system where we’ll detach Ingestion from Inference, deploy our model using Triton, and check results on our React App — in real-time.

Stay tuned!

Ending note!

If you’ve found this article educational, don’t forget to support it with a clap!
Also, to stay up-to-date with more articles like this, follow me and subscribe to the Decoding ML Newsletter.
