Video Streaming Analytics (VSA)

Amrutha Rajeshwari
Niometrics Tech Blog
8 min read · Nov 27, 2020

Introduction

Thankfully, we are living in a world where encryption adoption is ever-increasing. Unfortunately, this has hurt the ability of Communications Service Providers (CSPs) to understand the Quality of Experience (QoE) being offered to their subscribers. This post will describe how to use Machine Learning to infer the Key Quality Indicators (KQIs) of video streaming sessions of applications like YouTube or Netflix.

Streaming Is Essential to Consumers

The global video streaming market was valued at USD 42.60 billion in 2019 and is projected to grow at a compound annual growth rate (CAGR) of 20.4% from 2020 to 2027. The digitalisation of entertainment media is a significant factor boosting the streaming market, with convenience and personalisation as the main drivers behind streaming services. The growing popularity of video-focused social media platforms and the use of digital mediums for branding and marketing are anticipated to further fuel demand.

What Consumers Expect

While consumers expect a good content mix, competitive pricing and ease of use from the streaming platform, they count on their network service provider for higher-quality video and far more immersive experiences.

During the pandemic, lockdown measures led to a rise in video streaming viewership of approximately 10%. On the one hand, more people were forced to stay at home to curb the spread of the novel coronavirus; on the other hand, beyond entertainment, the video medium is on an upward trajectory in social media, corporate conferencing, online learning and medical services. This resulted in an unexpected surge in both fixed and mobile network traffic. Due to network limitations, streaming service providers such as Netflix and YouTube temporarily reduced their streaming quality. However, as the situation stabilises and CSPs continuously work to improve their network capacity, the rapacious consumer appetite for higher-quality video will return.

Measuring User Experience

Background - Adaptive Bitrate Streaming

In recent years, most video streaming platforms have moved to adaptive bitrate streaming to make the best use of the available bandwidth by minimising rebuffering time, while delivering the optimum resolution. This works by splitting videos into chunks and transcoding at many different bitrate/resolution combinations. Each chunk is just a small part of the entire video.

Splitting videos into chunks and transcoding at many different bitrate/resolution combinations

At the initiation of a video playback, the first thing the client video player downloads is the manifest file, which lists the location of each chunk at every resolution/bitrate combination. The resolution of the next chunk to be requested by the client is then decided based on the client’s current network conditions, the buffer health of the player, and other software and hardware parameters. The figure below shows a qualitative example of varying bandwidth and the delivered resolution.

Qualitative example of varying bandwidth and the delivered resolution
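To make the mechanism concrete, here is a minimal, purely illustrative sketch of an ABR decision. The bitrate ladder, safety factor and buffer threshold are all made up for the example; real players use far more sophisticated logic.

```python
# Illustrative ABR sketch (not any real player's algorithm): pick the next
# chunk's rung from a hypothetical manifest ladder based on an estimate of
# throughput and the current buffer level.

LADDER = [  # (resolution, bitrate in kbps) -- invented manifest entries
    ("240p", 400), ("360p", 750), ("480p", 1500),
    ("720p", 3000), ("1080p", 6000),
]

def choose_next_chunk(throughput_kbps: float, buffer_s: float,
                      safety: float = 0.8, min_buffer_s: float = 5.0):
    """Pick the highest bitrate sustainable at `safety` * throughput;
    fall back to the lowest rung when the buffer is nearly empty."""
    if buffer_s < min_buffer_s:          # protect against an imminent stall
        return LADDER[0]
    budget = throughput_kbps * safety
    viable = [rung for rung in LADDER if rung[1] <= budget]
    return viable[-1] if viable else LADDER[0]

print(choose_next_chunk(throughput_kbps=4000, buffer_s=20))  # ('720p', 3000)
print(choose_next_chunk(throughput_kbps=4000, buffer_s=2))   # ('240p', 400)
```

The second call shows the buffer-protection path: even with ample bandwidth, a nearly empty buffer forces the lowest rung to avoid a stall.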

QoE Metrics

The widely used measures of throughput, packet loss, jitter and delay capture the network conditions, i.e., the network Quality of Service (QoS), but do not take into account user factors or the service context. Those are reflected in the Application QoS, which is expressed by factors such as:

  • the video startup time (or initial buffering time): the time between the user pressing play and the video actually starting;
  • the median video quality or resolution;
  • variations of the video quality (resolution upgrades/downgrades);
  • the rebuffering ratio: the ratio between the total stalling duration and the duration of the playback.

The Application QoS KQIs can be combined into a single score, the Mean Opinion Score (MOS), which expresses the opinion of the average user/subscriber.

Network QoS, Application QoS, User QoE (MOS)
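The Application QoS KQIs have simple operational definitions. As a sketch, here is how the startup time and rebuffering ratio could be computed from a hypothetical player event log; the event names and timestamps are invented for illustration.

```python
# Made-up playback session; the KQI definitions follow the text:
# startup time = play pressed -> first frame,
# rebuffering ratio = total stall time / playback duration.
events = [  # (timestamp_s, event)
    (0.0, "play_pressed"), (1.8, "playback_started"),
    (40.0, "stall_start"), (43.0, "stall_end"),
    (120.0, "playback_ended"),
]

def compute_kqis(events):
    start = next(t for t, e in events if e == "play_pressed")
    first_frame = next(t for t, e in events if e == "playback_started")
    end = next(t for t, e in events if e == "playback_ended")
    stalls = sum(t2 - t1 for (t1, e1), (t2, e2) in zip(events, events[1:])
                 if e1 == "stall_start" and e2 == "stall_end")
    playback = end - first_frame
    return {"startup_time_s": first_frame - start,
            "rebuffering_ratio": stalls / playback}

print(compute_kqis(events))  # startup ~1.8 s, rebuffering ratio ~0.025
```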

We live in a world where encryption adoption is ever-increasing and video content is transmitted encrypted, which is great for privacy. However, telcos and ISPs can no longer measure the QoE offered to their subscribers; they are limited to the network QoS level. Combined with video QoE becoming the best proxy for network service quality due to video’s increasing traffic share, this has significantly hampered the ability of telcos to validate their network planning and optimisation strategies.

Our Solution - Machine Learning-Based QoE Estimation

Given the limitations caused by encrypted traffic, we developed a solution that uses supervised Machine Learning to estimate the KQIs and ultimately the QoE of each video session.

Introduction to Supervised Machine Learning

Supervised Machine Learning uses datasets consisting of input features paired with the target variable. The objective of a supervised learning model is to predict the target variable accurately for new input data. During training, the algorithm will search for recurring patterns in the data that correlate well with the target variable, in order to produce a trained model that can generalise well to previously unseen inputs.

Supervised Machine Learning
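As a toy illustration of the idea, the following pure-Python nearest-centroid classifier "learns" one pattern (a centroid) per label from labelled feature vectors and then predicts labels for unseen inputs. The features and labels are invented for the example and far simpler than anything used in production.

```python
# Minimal supervised-learning sketch (no libraries): a nearest-centroid
# classifier trained on labelled feature vectors.

def train(X, y):
    """Learn one centroid (mean feature vector) per label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign the label whose centroid is closest (squared Euclidean)."""
    return min(centroids,
               key=lambda l: sum((a - b) ** 2 for a, b in zip(centroids[l], x)))

# Hypothetical inputs: [mean chunk size (KB), burst rate] -> resolution label
X = [[200, 1.1], [220, 1.0], [900, 3.9], [950, 4.2]]
y = ["480p", "480p", "1080p", "1080p"]

model = train(X, y)
print(predict(model, [230, 1.2]))   # 480p
print(predict(model, [880, 4.0]))   # 1080p
```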

Ground Truth Data Collection

As the actual network traffic is encrypted and therefore unlabelled, we need to build a dataset on which to train and evaluate the models. Subscribers stream videos from different services varying in content type and streaming mechanisms. In this project, we collected a large number of YouTube and Netflix videos from the top-watched list under various bandwidth-latency profiles (by utilising network conditioners to simulate different network conditions, such as 2G, 3G, 4G, fixed connections, sudden connection losses, etc.). To increase the diversity of the dataset, multiple player sizes were used to approximate user experience on different devices. In addition, we captured data from both QUIC and TCP connections, as well as connections using IPv4 and IPv6. Sampling was performed across multiple platforms to ensure that the ground truth was representative of real-world scenarios. After all, predictive models are only as good as the data from which they are built.

Video Streaming Analytics pipeline

Feature Engineering

Payload signals

We can extract features from the traffic volume time series aggregated at fixed intervals (e.g., 0.5 seconds). The payload buckets provide a first overview of the bursty patterns caused by the amount of data being transferred for a specific part of the video. It is intuitive that bursts with larger magnitude will be observed in high-resolution videos.
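The bucketisation step can be sketched in a few lines. The packet trace below is made up; the 0.5-second bucket width follows the text.

```python
# Sketch: aggregate a (timestamp, bytes) packet trace into fixed-width
# payload buckets, indexed from the first packet.

BUCKET_S = 0.5

def bucketise(packets, bucket_s=BUCKET_S):
    """Return per-bucket downlink volume for a chronologically sorted trace."""
    t0 = packets[0][0]
    n = int((packets[-1][0] - t0) // bucket_s) + 1
    buckets = [0] * n
    for t, size in packets:
        buckets[int((t - t0) // bucket_s)] += size
    return buckets

packets = [(0.1, 1400), (0.2, 1400), (0.7, 1400), (1.6, 700)]
print(bucketise(packets))  # [2800, 1400, 0, 700]
```

The empty third bucket is exactly the kind of gap between bursts that the payload signal is meant to expose.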

Chunk signals

In addition to payload signals, we can further improve the modelling by extracting chunk signals. Due to adaptive bitrate streaming, the video content is transferred in discrete segments, commonly referred to as chunks. The packets are encrypted at the transport layer, but chunk sizes and inter-arrival times are visible to the CSPs. Compared to payload buckets at fixed intervals, aggregating traffic data at the chunk level offers better insight into the streaming characteristics.

Since the actual chunk sequence information is not directly visible to the network due to encryption, we have developed a methodology to detect video chunks from the network traffic with high accuracy. The idea is pretty simple: a chunk is the segment of the video consisting of the packets between two consecutive requests. A chunk starts with a GET request from the client, with which the player asks for a video segment from a particular server. Subsequently, the video segment is carried over TCP (or UDP in the case of QUIC) packets, and the entire chunk is fully downloaded before the start of the next chunk download and its corresponding GET request. Therefore, the total amount of data downloaded between two consecutive uplink request packets can be considered equal to the chunk size. Note that we consider uplink packets as carrying GET requests only if their payload is above a certain threshold.

The player can open multiple connections (i.e., flows) in parallel to speed up streaming, but chunks are requested sequentially within a flow. While the size of audio chunks is almost constant, video chunks can vary greatly in size. Intuitively, video chunk sizes are correlated with the video resolution. We of course validated our chunk detection algorithm by comparing the sizes of the detected chunks against the actual chunk sizes as seen by the browser.
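The detection idea can be sketched as follows. The request threshold and the packet trace are illustrative assumptions, not our production values; the trace represents a single flow.

```python
# Sketch of chunk detection within one flow: an uplink packet whose payload
# exceeds a threshold is treated as a GET request, and the downlink bytes
# between two consecutive requests form one chunk.

REQUEST_THRESHOLD = 300  # bytes of uplink payload; an illustrative value

def detect_chunks(packets, threshold=REQUEST_THRESHOLD):
    """packets: (timestamp, direction 'up'/'down', payload_bytes) tuples."""
    chunks, current = [], None
    for t, direction, size in packets:
        if direction == "up" and size >= threshold:   # new request seen
            if current is not None:
                chunks.append(current)
            current = {"start": t, "bytes": 0}
        elif direction == "down" and current is not None:
            current["bytes"] += size
    if current is not None:                           # close the last chunk
        chunks.append(current)
    return chunks

trace = [(0.0, "up", 450), (0.1, "down", 1400), (0.2, "down", 1400),
         (2.0, "up", 500), (2.1, "down", 1400), (2.2, "down", 900)]
print(detect_chunks(trace))
# [{'start': 0.0, 'bytes': 2800}, {'start': 2.0, 'bytes': 2300}]
```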

With the bucketised payload and chunk size time series extracted from the traffic, we generate features by applying several statistics to the signals. We also introduce up to nth-order differences (i.e., lags) to capture the relative changes between buckets or chunks. Intuitively, a large difference between two consecutive chunks could signal a change in the resolution.

Generating features by applying several statistics on the signals
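A minimal version of this feature generation step might look as follows; the chunk-size sequence and the particular statistics are illustrative, not our production feature set.

```python
# Sketch: turn a chunk-size sequence into features via summary statistics
# over the raw signal and its lagged (nth-order) differences.

import statistics

def diffs(xs, order=1):
    """nth-order difference of a sequence (order=0 returns it unchanged)."""
    for _ in range(order):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

def features(chunk_sizes):
    feats = {}
    for order in (0, 1, 2):                     # signal plus 1st/2nd diffs
        sig = diffs(chunk_sizes, order)
        feats[f"d{order}_mean"] = statistics.mean(sig)
        feats[f"d{order}_std"] = statistics.pstdev(sig)
        feats[f"d{order}_max"] = max(sig)
    return feats

sizes = [500, 520, 510, 900, 880]   # KB; the jump hints at a resolution change
print(features(sizes)["d1_max"])    # 390 -- the 510 -> 900 jump
```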

Modelling

We trained several classifiers and regressors to predict the KQI metrics, with XGBoost at the core of the predictive models. Standard training techniques were employed, including cross-validation and actual field testing. We use Random Search for hyper-parameter tuning and a combination of over-sampling/under-sampling using SMOTE to handle data imbalance. Handling data imbalance is crucial in our modelling, since videos with stalls or resolution changes are quite rare even under bad network conditions.
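The shape of that recipe can be sketched with stand-ins: below, sklearn's GradientBoostingClassifier substitutes for XGBoost and naive random over-sampling substitutes for SMOTE, purely to keep the snippet dependency-light; the data is synthetic, with the stall class deliberately rare.

```python
# Hedged sketch of the training loop: balance a rare class, then run a
# small Random Search over boosting hyper-parameters with cross-validation.
import random
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

random.seed(0)
# Imbalanced toy set: stalls (label 1) are rare, as in real sessions.
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(190)]
y = [0] * 190
X += [[random.gauss(4, 1), random.gauss(4, 1)] for _ in range(10)]
y += [1] * 10

# Naive over-sampling of the minority class (SMOTE stand-in).
minority = [x for x, l in zip(X, y) if l == 1]
X_bal = X + [random.choice(minority) for _ in range(180)]
y_bal = y + [1] * 180

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    {"n_estimators": [50, 100], "max_depth": [2, 3],
     "learning_rate": [0.05, 0.1]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_bal, y_bal)
print(search.best_params_)
```

In the real pipeline SMOTE synthesises new minority samples by interpolating between neighbours rather than duplicating them, which avoids the overfitting risk of plain duplication.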

We carefully assess the performance of the models using standard metrics: RMSE and MAE for regression, and precision/recall, F1 and the Matthews correlation coefficient for classification. The computational resources required for both feature calculation and inference are also investigated, given our context of dealing with massive input data from internet traffic.
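For concreteness, here is how those metrics are computed on a pair of made-up prediction sets, one regression KQI and one classification KQI:

```python
# Sketch of the evaluation metrics on invented predictions.
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             f1_score, matthews_corrcoef)

# Regression KQI (e.g., startup time in seconds)
y_true_r, y_pred_r = [1.0, 2.0, 4.0], [1.2, 1.8, 4.4]
rmse = mean_squared_error(y_true_r, y_pred_r) ** 0.5
mae = mean_absolute_error(y_true_r, y_pred_r)

# Classification KQI (e.g., "did the session stall?")
y_true_c, y_pred_c = [0, 0, 0, 1, 1, 0], [0, 0, 1, 1, 1, 0]
f1 = f1_score(y_true_c, y_pred_c)
mcc = matthews_corrcoef(y_true_c, y_pred_c)

print(round(rmse, 3), round(mae, 3), round(f1, 3), round(mcc, 3))
# 0.283 0.267 0.8 0.707
```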

A good Machine Learning model is the result of trial and error across multiple experiments, and we are happy to have learned some interesting lessons along the way:

  • Our attempt to use automatic feature creation resulted in too many irrelevant features, especially in our problem, where network traffic domain knowledge was needed.
  • Another attempt to convert time series signals into spectral images (i.e., by applying Fourier/Wavelet Transformations) also failed, since no visual patterns could be observed. One reason was that the traffic generally lacked periodic properties.
  • On the subject of feature engineering, we tested a rolling window approach, in which the KQIs were predicted for a short window instead of the whole video. This approach, which was more computationally intensive, resulted in model overfitting and accumulation of the per-window errors.
  • Finally, we tried Deep Learning using Recurrent Neural Networks for rolling window prediction. A significant advantage of Deep Learning was that we did not need to handcraft features; however, it could not beat the traditional XGBoost in this case.

VSA in action

Machine Learning, no matter how sophisticated, is totally useless if it cannot be deployed at scale. Luckily, we can leverage the top-notch infrastructure used by Niometrics to deploy VSA at CSP scale. Some statistics to highlight the scale of our system at one of our deployments:

  • Several PBs of video volume from hundreds of millions of YouTube sessions per day.
  • Up to 50 million predictions per day per DPI engine, with minimal overhead, negating any need to derate the engines.
  • Results aggregated by hundreds of thousands of locations (CGIs), core network elements, and thousands of device models.

To achieve efficient operation at scale, we need to find the optimal deployment of the parts of the Machine Learning pipeline — centralised feature generation and inference, or feature generation on edge and centralised inference, or both feature generation and inference on the edge. It is a matter of sizing and available resources.

Machine Learning deployment

Acknowledgments

We would like to express our very great appreciation to Constantinos Halevidis and Trung Ly for their significant contribution to this article.
