How DISH Wireless Telco Network Allows for 10x More Video Transmission via Compression

DISH Wireless DevEx
Jan 23, 2024 · 9 min read

By: Ankita Patil, Hamza Khokhar, Monisha Gnanaprakasam, Tamanna Kawatra, Vinayak Sharma, Members of Scientific Staff, DISH Wireless

Introduction and Problem Statement

Did you know that video is responsible for over 65% of internet traffic? The rising dominance of video is enabling edge systems with functional autonomy and elevated decision making. However, transmitting video is incredibly data-intensive. DISH Wireless R&D is pushing past the limits of traditional video transfer over networks. Here at DISH Wireless we have developed machine learning frameworks that enable 10x better compression than incumbent compression frameworks. We believe in enabling our customers to focus on decision making and autonomy systems while we alleviate their data transmission problems.

We are hyper-focused on disrupting the security footage industry. We want to enable security camera providers to transmit 10x more video at the same price point. To do this, we compress on the edge device and decompress in our telecom network. Optimizing for robust video quality ensures that no downstream applications (e.g., theft detection or object classification) are degraded.

Part I of this blog focuses on how we researched different techniques for video file size reduction and reconstruction and selected the top three solutions based on a variety of evaluation criteria.

Part II will focus on how we built scalable, modularized infrastructure that is easily deployed through GitHub.

About the techniques

Super Resolution

Resolution is a key factor that determines the quality of a video. It refers to the number of pixels contained in each frame, typically expressed as width x height. For example, a 1080p video has 1920x1080 pixels in each frame. Higher resolution implies greater detail thanks to the increased pixel count, but it also leads to larger file sizes.

When compressing high-resolution videos to reduce file size, resolution reduction is one of the most commonly explored techniques. Here, the video file size is reduced through downsampling, i.e., reducing the number of pixels in each frame using methods such as averaging or omitting pixels. Lanczos resampling is one such method: it aims to preserve the quality of the original image while resizing by using a weighted average of neighboring pixels.
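To make the downsampling step concrete, here is a minimal sketch using OpenCV's Lanczos filter; the input file name and the 4x scale factor are illustrative rather than our production settings.

```python
import cv2

# Hypothetical example: downsample a single frame by 4x using Lanczos resampling.
frame = cv2.imread("frame.png")        # assumed input frame
h, w = frame.shape[:2]
scale = 4                              # illustrative downscale factor

low_res = cv2.resize(
    frame,
    (w // scale, h // scale),
    interpolation=cv2.INTER_LANCZOS4,  # Lanczos-windowed sinc filter
)
cv2.imwrite("frame_lowres.png", low_res)
```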

Once the video resolution has been reduced for efficient storage and transmission, it is imperative to restore video quality for playback. Super resolution is used to enhance the resolution and quality of a video beyond its stored size and dimensions. Increasing the resolution means increasing the number of pixels per frame, which in turn requires inferring the missing pixels.

Neural networks can be trained to help with this recreation by looking at similar samples and inferring what the missing pixels should be. They are trained on large datasets of paired high-resolution and low-resolution images to learn patterns and relationships. These pre-trained models then predict high-resolution images from low-resolution inputs.

We explored several pre-trained Convolutional Neural Networks (CNNs) to super-resolve downsampled frames within videos, including the Fast Super-Resolution Convolutional Neural Network (FSRCNN), the Efficient Sub-Pixel Convolutional Neural Network (ESPCN) and the Laplacian Pyramid Super-Resolution Network (LapSRN). These models learn a mapping between low-resolution and high-resolution images through a series of convolutional layers and use it to predict high-resolution frames from low-resolution ones, enhancing video resolution.
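As a rough illustration of how a pre-trained model can be applied, the sketch below uses OpenCV's dnn_superres module with FSRCNN weights; it assumes opencv-contrib-python is installed and that the weights file (name shown is illustrative) has been downloaded separately.

```python
import cv2

# Sketch: super-resolve a downsampled frame with a pre-trained FSRCNN model.
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("FSRCNN_x4.pb")      # pre-trained FSRCNN weights for 4x upscaling
sr.setModel("fsrcnn", 4)          # model name and scale factor

low_res = cv2.imread("frame_lowres.png")
high_res = sr.upsample(low_res)   # predicted high-resolution frame
cv2.imwrite("frame_superres.png", high_res)
```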

Another kind of model we explored was the Super-Resolution Generative Adversarial Network (SRGAN). These models combine two neural networks, a generator and a discriminator. The generator aims to produce high-resolution images that are indistinguishable from real ones, while the discriminator aims to become better at distinguishing real images from generated ones. This adversarial training process drives the generator to improve its super-resolution capabilities.

Using the pre-trained models mentioned above, each frame within the videos was compressed by downsampling and then super-resolved back to the original resolution while preserving video quality. This process resulted in nearly a 15x reduction in video file size.

Reconstruction using Super Resolution

Neural Network Codec

Codecs, short for coder-decoders, are hardware or software components used to encode and decode data. They play a crucial role in devices like cameras and in software applications such as Windows Media Player for managing audio and video files. For instance, when a digital camera captures an image, it gathers substantial data, including high-resolution detail and color/light information, which codecs compress to reduce data size for storage and transmission. These codecs also decode the compressed data when you view or edit images or videos. In the context of video, a video codec is a technology for compressing and decompressing digital video data. Its primary functions are efficient storage and transmission through compression while maintaining playback quality, striking the right balance between file size reduction and video quality preservation.

There are two types of video codecs: traditional and machine learning codecs. Traditional codecs rely on algorithmic techniques to compress data, using various methods to remove redundancy and compress video frames. They are designed with fixed compression techniques and parameters. H.264 and MPEG-4 are two of the most widely used traditional codecs.

On the other hand, machine learning codecs use machine learning techniques, like deep neural networks. They learn to optimize compression parameters based on patterns and features within the video data. They can be configured to prioritize significant features while compressing less essential ones. They can continuously improve their compression performance as they are exposed to more data. However, it’s important to note that these codecs demand substantial training and are computationally intensive during encoding and decoding.

For our use case, we used a machine learning codec to compress videos. We implemented an end-to-end deep video compression framework[1] that uses deep neural networks to extract generalized ‘features’ (encoding) and reconstruct the original frames within a video (decoding).

Deep Video Compression Framework[1]

The codec works as follows:

Encoding (Compression)

  • Extract relevant features from the data using neural networks
  • Capture the differences (residuals) between the original data and the extracted features
  • Quantize and compress the features and residuals
  • Store the compressed data

Decoding (Decompression)

  • Decode the compressed data
  • Reconstruct features from the quantized data
  • Add back the residuals to improve quality
  • Generate the final decompressed data

While encoding, three categories of features are extracted from the individual frames of a video. The first is residual features, which capture fine details and the differences between the original frame and its downscaled version. The second is residual prior features, which convey information about characteristics common across videos and video frames. Finally, there are motion features, which help the model understand the motion between frames, allowing it to primarily remember frame-to-frame changes rather than storing information from every individual frame. After the video frames are encoded, these features are decoded and the final videos are reconstructed.
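The exact architecture is described in the paper[1]; the toy sketch below only illustrates the overall encode, quantize, decode and residual flow with stand-in convolutional networks in PyTorch, not the actual DVC model or our implementation.

```python
import torch
import torch.nn as nn

# Toy conceptual sketch of the encode/quantize/decode flow (not the model from [1]).
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2),
        )

    def forward(self, x):
        return self.net(x)          # compact feature map (the "features" to transmit)

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, z):
        return self.net(z)          # reconstructed frame

encoder, decoder = ToyEncoder(), ToyDecoder()
frame = torch.rand(1, 3, 256, 256)              # stand-in for one video frame

features = encoder(frame)
quantized = torch.round(features)               # quantization step before entropy coding
reconstruction = decoder(quantized)
residual = frame - reconstruction               # residual that would also be compressed and sent
```

In the real framework, the quantized features and residuals would additionally pass through entropy coding before being stored or transmitted.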

By implementing this NN codec, each frame within the videos was encoded and decoded, resulting in nearly a 15x reduction in video file size.

Background Subtraction and Addition

Background subtraction is a technique used to separate the foreground objects from the background in a video or image sequence. It involves separating the moving or dynamic elements from the static or stationary elements within the visual frame. In simpler terms, it’s separating the actors from the stage. Its applications are diverse, ranging from surveillance and security to virtual backgrounds and gesture recognition. Background subtraction is commonly done using Mixture of Gaussians (MOG) or K-Nearest Neighbors, but in order to get better results for our footage, we calculated the median frame to obtain the static image.

For our project, to calculate the static frame from the video we took n random frames from the entire video and, for each pixel position (i, j), calculated the median value across all frames in the stack at that position. Because the security cameras are mounted in a fixed spot in parking lots, the same static background would otherwise be transmitted with every clip. Using this approach, we can reduce the file size by transmitting only the dynamic content.

After finding the static background frame, we calculate the absolute difference between each frame in the video and the static frame to extract the foreground objects from the background.
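A minimal sketch of this step is shown below, assuming OpenCV and NumPy; the video file name and the number of sampled frames are illustrative.

```python
import cv2
import numpy as np

# Sketch of median-frame background estimation and foreground extraction.
cap = cv2.VideoCapture("parking_lot.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Sample n random frames and take the per-pixel median as the static background.
n = 25
ids = np.random.choice(n_frames, size=min(n, n_frames), replace=False)
samples = []
for i in ids:
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
    ok, frame = cap.read()
    if ok:
        samples.append(frame)
background = np.median(np.stack(samples), axis=0).astype(np.uint8)

# The foreground of any frame is its absolute difference from the static background.
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
ok, frame = cap.read()
foreground = cv2.absdiff(frame, background)
```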

This method works best when the background is relatively static and there are no persistently moving objects. With this technique, we were able to achieve an 8x reduction in file size.

Background subtraction Principle[2]

For reconstructing the video in the cloud, we used a background addition technique: the same static frame and the masked (foreground-only) video are blended back together to rebuild the original footage.
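As a simplified illustration of the reconstruction step, the sketch below adds the transmitted foreground back onto the stored static frame; file names are illustrative and the blending used in our pipeline may differ.

```python
import cv2

# Simplified reconstruction sketch: add the transmitted foreground (difference image)
# back onto the stored static background frame.
background = cv2.imread("static_background.png")
foreground = cv2.imread("foreground_frame.png")

# Saturating per-pixel addition restores moving objects onto the static scene.
reconstructed = cv2.add(background, foreground)
cv2.imwrite("reconstructed_frame.png", reconstructed)
```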

Frame removal and Interpolation

Videos are inherently a streamed set of frames. The commonly used metric ‘frames per second’ refers to the number of still images (frames) streamed per second to create the visual effect of continuous video. The frame removal technique provides compression by reducing the frames per second. The frame interpolation technique then reconstructs the video, based on the research paper “Many-to-many Splatting for Efficient Video Frame Interpolation.”[3]

We provide frame reduction as an API call on the edge device to provide compression. A 2x reduction in frames provides approximately a 2x compression of video size. Then, during transmission of the video, frames are reconstructed on larger compute modules, also via an API call, using our in-house many-to-many splatting implementation. A minimal sketch of the reduction step is shown below.
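The sketch keeps every other frame of a clip using OpenCV; file names, the codec fourcc and the 2x drop factor are illustrative, not our production API.

```python
import cv2

# Sketch of frame removal: keep every second frame, halving the frame count
# (and roughly halving the file size). Reconstruction is done later by interpolation.
cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

out = cv2.VideoWriter("reduced.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps / 2, (w, h))
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % 2 == 0:          # keep every other frame (2x reduction)
        out.write(frame)
    idx += 1
cap.release()
out.release()
```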

Keyframe extraction

Keyframe extraction is a technique used in video analysis and processing to select representative frames from a video sequence. It is based on the principle that keyframes capture the essential content of the video, and that reconstruction techniques such as neural-network-based interpolation can then fill in the frames between them.

To extract features from the frames, we used a pre-trained CNN model, and we grouped similar frames into clusters using KNN. Using a minimum-distance criterion, we chose the keyframe from each cluster.
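The sketch below illustrates this idea, using ResNet-18 features and k-means clustering as stand-ins for the pre-trained CNN and grouping step described above; the `frames` input and the number of clusters are assumptions for illustration.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import KMeans

# Sketch of keyframe selection. `frames` is assumed to be a list of HxWx3 uint8
# numpy arrays already read from the video.
def select_keyframes(frames, n_clusters=5):
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()        # use pooled features, not class logits
    backbone.eval()
    preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

    with torch.no_grad():
        feats = torch.stack([backbone(preprocess(f).unsqueeze(0)).squeeze(0) for f in frames])
    feats = feats.numpy()

    kmeans = KMeans(n_clusters=n_clusters, n_init="auto").fit(feats)

    # Pick the frame closest to each cluster centroid as that cluster's keyframe.
    keyframe_ids = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - kmeans.cluster_centers_[c], axis=1)
        keyframe_ids.append(int(members[np.argmin(dists)]))
    return sorted(keyframe_ids)
```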

For transmitting the video, as a reduction module, we encoded it using the H.264 codec.

For reconstructing the video, we experimented with linear interpolation. The interpolated frames were created by blending two keyframes at a time according to an interpolation factor. Linear interpolation determines intermediate values between two keyframes and helps transition smoothly from one keyframe to the next. With this technique, we were able to achieve a 9x reduction in file size.
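A minimal sketch of the blending step, assuming OpenCV; the number of in-between frames is illustrative.

```python
import cv2

# Sketch of linear interpolation between two keyframes: each in-between frame is a
# weighted blend controlled by the interpolation factor alpha.
def interpolate(keyframe_a, keyframe_b, n_between=3):
    frames = []
    for i in range(1, n_between + 1):
        alpha = i / (n_between + 1)   # interpolation factor in (0, 1)
        frames.append(cv2.addWeighted(keyframe_a, 1.0 - alpha, keyframe_b, alpha, 0))
    return frames
```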

Evaluation Criteria

Video or image quality is measured using subjective and objective methods. The subjective methods are based on human perception. We conducted a survey in which each respondent was shown 12 videos reconstructed with the techniques above. They evaluated the videos on resolution, movement continuity, clarity and color accuracy. The scores from these criteria were aggregated into a mean opinion score. In addition to the human grading, we used two open-source object detection models, YOLO and MediaPipe, to evaluate the quality of the output video. The evaluation metric is the mean of the average confidences of all object detections in a frame; this mean average confidence was measured for both the original and the reconstructed videos.
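For reference, a sketch of the mean-average-confidence computation using the Ultralytics YOLO package is shown below; the model weights and file names are illustrative, and this is not necessarily the exact evaluation script we used.

```python
import cv2
from ultralytics import YOLO

# Sketch: compute the mean of per-frame average detection confidences over a video.
def mean_average_confidence(video_path, weights="yolov8n.pt"):
    model = YOLO(weights)
    cap = cv2.VideoCapture(video_path)
    frame_means = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]
        confs = result.boxes.conf.tolist()    # confidence of each detection in the frame
        if confs:
            frame_means.append(sum(confs) / len(confs))
    cap.release()
    return sum(frame_means) / len(frame_means) if frame_means else 0.0

# Compare original vs. reconstructed footage (paths are illustrative).
print(mean_average_confidence("original.mp4"), mean_average_confidence("reconstructed.mp4"))
```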

The objective methods are based on computational models that predict perception. The main metric considered was the file size, or compression ratio, since the primary objective was to identify methods that could efficiently compress videos. Other objective metrics include frames per second (FPS), bitrate, codec and resolution.

Our findings from the two methods are outlined below.

Top 3 results

Based on our analysis, we find the Neural Network Codec to be promising, as the changes introduced by this technique are imperceptible to both the human eye and object detection models. Resolution reduction with super-resolution reconstruction, and FPS reduction with frame interpolation, performed almost on par with the NN codec.

Conclusion

Various codec techniques are used to improve the transmission of video files. The current neural-network-based codecs seem to reduce file sizes better than traditional codecs while maintaining video quality. With further development, the changes made by NN-based codecs will be imperceptible to any downstream model.

Future Steps

We would like to revise our experiments by combining reduction techniques to retain quality, exploring more ML/DL-based reconstruction techniques and their combinations, evaluating more videos using quality metrics comparable to those of video streaming platforms, and utilizing downstream models beyond object detection.

References

[1] https://arxiv.org/pdf/1812.00101.pdf

[2] https://docs.opencv.org/3.4/d1/dc5/tutorial_background_subtraction.html

[3] https://ieeexplore.ieee.org/document/9878793

About the authors
Ankita Patil, Monisha Gnanaprakasam, Tamanna Kawatra and Vinayak Sharma are Members of Scientific Staff (MSS) at DISH Wireless. The MSS are focused on data science, engineering and development work for our cloud-native developer ecosystem, to enable technology transformation in the intelligent systems domain. Building on top of our cloud-native network of networks, the MSS develop bleeding-edge technologies that will enable the fourth industrial revolution.

