Understanding the Inner Workings of Digital Video

Patrick Plaatje
6 min read · Jun 5, 2023


Part II of the series: “Decoding the magical technology of OTT Live Streaming: Insights Unveiled!”

Introduction

Before delving deeper into the components of a (live) streaming architecture, it’s crucial to comprehend the fundamentals of digital video itself. This article will explore what video is, how it becomes digital, and the concepts that apply to digital video.

Video

Video is essentially a collection of still frames, much like the pictures in a photo album or a flipbook. Each frame captures a moment in time, similar to a snapshot. When these frames are played back rapidly, they create the illusion of motion. Our visual system exhibits a phenomenon called persistence of vision, where an image lingers for a split second after it disappears. By displaying a series of frames in quick succession, our eyes blend them together, resulting in the perception of fluid movement on a screen.

Taken from a YouTube flipbook video: https://www.youtube.com/watch?v=e0vuOiZzCA4

Capturing video

The process of capturing frames to construct a video is similar to that of a regular photo camera. Light reflecting off objects around us is captured by the camera. This light passes through the lens and reaches a photosensitive surface. Using clever techniques, the camera interprets the captured light to determine the actual colours in the image.

These frames are represented as a collection of pixels. A full-HD frame, for example, contains over 2 million pixels. Depending on the compression mode chosen (most cameras let you select one), a single frame can range from 3.5 MB to 6 MB in size. Multiply that by the desired frame rate, say 60 frames per second at 3.5 MB each, and you end up with roughly 210 MB for every second of video, or about 1.7 Gbps. It is clear that broadcasting that much data to viewers is impractical. This is where encoding comes to the rescue!
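
As a quick sanity check of those numbers, here is a small Python sketch that simply multiplies the figures quoted above (3.5 MB per frame at 60 frames per second); the values come from the text, not from measurements on any specific camera.

```python
# Rough arithmetic: how much data a stream of lightly-compressed
# full-HD frames would produce per second of video.

width, height = 1920, 1080
pixels_per_frame = width * height          # 2,073,600 pixels (> 2 million)

mb_per_frame = 3.5                         # megabytes for a single frame (figure from the text)
fps = 60                                   # frames per second (figure from the text)

mb_per_second = mb_per_frame * fps         # 210 MB of data for each second of video
mbps = mb_per_second * 8                   # the same rate expressed in megabits per second

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{mb_per_second:.0f} MB/s, i.e. roughly {mbps:.0f} Mbps")
```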

Encoding

Video encoding is the process of converting raw video data into a compressed format that is easily stored, transmitted, and played back on various devices. Compression algorithms are applied during encoding to reduce the size of the video file while maintaining an acceptable level of visual quality. However, we won’t delve deeper into the technical aspects of encoding in this article.

Codecs

Video codecs, short for “coder-decoder,” are software or hardware-based technologies that compress and decompress video data. They are crucial components of the video encoding and decoding processes. Video codecs utilise algorithms and techniques to efficiently analyse, compress, and store video data. They play a vital role in reducing video file sizes while preserving visual quality. Popular codecs today include AVC/H.264, HEVC/H.265, and AV1.

When it comes to live streaming, the codecs mentioned above do not compress every frame in isolation. Instead, one frame, the intra-frame (I-frame), carries a complete image for the group of pictures that forms the resulting video. Subsequent frames do not contain all the data needed to build a full image; they include only the information required to derive the frame at that point, such as motion vectors. These are known as P-frames (predictive frames). Additionally, there is a special frame called a B-frame, which uses data from both an earlier and a later reference frame to be constructed.
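
To make that structure a little more tangible, here is a minimal Python sketch that walks over a hypothetical group of pictures; the "IBBPBBPBB" pattern is an illustrative assumption, and real encoders choose their own structures.

```python
# A simplified sketch of a group of pictures (GOP) and what each frame type relies on.
# The pattern below is just an example; real codecs may reference frames differently.

gop = "IBBPBBPBB"

for position, frame_type in enumerate(gop):
    if frame_type == "I":
        print(f"frame {position}: I-frame - complete image, no references")
    elif frame_type == "P":
        print(f"frame {position}: P-frame - predicted from an earlier frame")
    else:  # "B"
        print(f"frame {position}: B-frame - predicted from an earlier AND a later frame")
```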

Codec example

If you consider the image below (taken from Wikipedia: https://en.wikipedia.org/wiki/Video_compression_picture_types), you see a small Pac-Man starting to eat dots. As every pixel is fully present in the image, this would be considered an intra-frame (I-frame).

When the next frame is created, the encoder looks back at the intra-frame, compares it with the frame it needs to produce, and keeps only the parts that changed between the two. The next frame therefore does not have to carry all of the image information, just a small subset describing the changes. This frame is called a P-frame (a predictive frame). There is also a special frame called a B-frame: it carries data of its own, but instead of predicting the image only from the previous I-frame, it also uses data from a later reference frame to be constructed.
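
As a rough illustration of the idea behind P-frames, the Python sketch below stores only the pixels that differ from a reference frame. Real codecs operate on blocks and motion vectors rather than individual pixel values, so treat this purely as a conceptual toy.

```python
# Toy version of predictive coding: instead of storing the whole next frame,
# keep only the pixels that changed relative to a reference frame.

reference_frame = [10, 10, 10, 10, 10, 10]   # the "intra" frame: stored in full
next_frame      = [10, 10, 99, 99, 10, 10]   # only two pixels have changed

# Keep just the (position, new_value) pairs that differ from the reference.
delta = [(i, new) for i, (old, new) in enumerate(zip(reference_frame, next_frame)) if old != new]
print(delta)   # [(2, 99), (3, 99)] - far less data than the full frame

# A decoder can rebuild the frame from the reference plus the delta.
rebuilt = list(reference_frame)
for i, value in delta:
    rebuilt[i] = value
assert rebuilt == next_frame
```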

Bitrate

An encoder controls the amount of data used to represent a video stream through its bitrate. Bitrate refers to the number of bits used per unit of time, usually measured in kilobits per second (Kbps) or megabits per second (Mbps).

When encoding a video, the encoder analyses the content and makes decisions on how to allocate bits efficiently. It aims to strike a balance between preserving visual quality and reducing file size. The higher the bitrate, the more data is allocated to represent the video, resulting in potentially higher quality but larger file sizes. Conversely, lower bitrates result in smaller file sizes but may lead to reduced quality.
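
The relationship between bitrate and data volume is simple arithmetic. The sketch below shows the approximate size of one hour of video at a few example bitrates; the bitrate values are arbitrary illustrations, not recommendations.

```python
# How much data one hour of video produces at a given bitrate.

duration_seconds = 60 * 60            # one hour of video

for bitrate_mbps in (1.5, 4, 8):      # example bitrates in megabits per second
    total_megabits = bitrate_mbps * duration_seconds
    total_gigabytes = total_megabits / 8 / 1000
    print(f"{bitrate_mbps} Mbps for one hour -> ~{total_gigabytes:.1f} GB")
```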

There are several bitrate models used in video encoding to control the allocation of bits and achieve the desired quality and file size. Here are two commonly used models, followed by a small sketch comparing them:

  1. Constant Bitrate (CBR): In CBR encoding, the bitrate remains constant throughout the entire video. Each frame is allocated the same amount of bits, regardless of the complexity of the content. CBR is useful in scenarios where a fixed bandwidth is available for streaming or when predictable file sizes are required. However, CBR may result in inefficiencies as it may allocate too many bits for simple scenes and not enough for complex scenes, leading to potential quality variations.
  2. Variable Bitrate (VBR): VBR encoding dynamically adjusts the bitrate based on the complexity of the video content. More bits are allocated to complex scenes with a lot of detail, and fewer bits are used for simpler scenes. This allows for higher quality in complex scenes and better compression efficiency overall. VBR can provide a more consistent visual quality compared to CBR, but the resulting file sizes may vary depending on the content.
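
The toy sketch below contrasts the two models: a made-up per-frame complexity score drives the VBR allocation, while CBR gives every frame the same share. Real encoders derive complexity from the content itself, so this is only meant to show the difference in spirit.

```python
# Toy comparison of CBR and VBR bit allocation over a handful of frames.
# Both models spend the same total budget, but VBR shifts bits towards complex frames.

frame_complexity = [1, 1, 5, 8, 2]          # arbitrary example scores
total_bit_budget = 1_700_000                # bits available for these frames

# CBR: every frame gets the same share, regardless of complexity.
cbr = [total_bit_budget / len(frame_complexity)] * len(frame_complexity)

# VBR: each frame gets a share proportional to its complexity.
vbr = [total_bit_budget * c / sum(frame_complexity) for c in frame_complexity]

for i, (c, b_cbr, b_vbr) in enumerate(zip(frame_complexity, cbr, vbr)):
    print(f"frame {i} (complexity {c}): CBR {b_cbr:,.0f} bits, VBR {b_vbr:,.0f} bits")
```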

Challenges to solve

All of the above becomes extremely important for the upstream video flow during live streaming. A typical live stream runs at around 30 to 50 frames per second. As illustrated earlier, due to the nature of B-frames, the encoder needs access to a later reference frame before earlier frames can be completed. The frequency of I-frame insertion therefore impacts the stream's latency: if I-frames are inserted rarely (e.g., one every 30 seconds), the encoder has to wait longer, and frames that depend on the missing reference are placed in a buffer before being sent out. If there are too many I-frames, on the other hand, the stream becomes too large to transport.

Broadcasters strive to find the optimal balance for inserting I-frames (keyframes). It is generally recommended to use a combination of all three frame types with a keyframe interval of 2 seconds, i.e. inserting an I-frame every 2 seconds. Sometimes this value is expressed as a group-of-pictures (GOP) size: at 30 fps, a 2-second interval is equivalent to a GOP of 60, since a keyframe is inserted every 60 frames.
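
The keyframe interval, frame rate and GOP size relate to each other through a single multiplication, as the small helper below shows; the function name gop_size is just an illustrative choice.

```python
# Relationship between keyframe (I-frame) interval, frame rate and GOP size,
# using the 2-second interval recommended above.

def gop_size(fps: int, keyframe_interval_seconds: float) -> int:
    """Number of frames in one group of pictures."""
    return int(fps * keyframe_interval_seconds)

print(gop_size(fps=30, keyframe_interval_seconds=2))   # 60, as in the example above
print(gop_size(fps=50, keyframe_interval_seconds=2))   # 100 for a 50 fps stream
```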

As laid out above, the decision between CBR and VBR is also not straightforward. VBR might result in peak bitrates that cannot be supported by the connection used, while CBR might lead to encoding inefficiencies and/or unnecessary bandwidth requirements.

Understanding video and encoding (along with its parameters) is needed to understand how the video is transported and processed further down the digital streaming flow, and how it is ultimately delivered to end viewers. The next article will focus on how the stream is transported to a live streaming platform that transforms (transcodes) the content for wider viewership. See you there!
