Using video session clustering to get new insights on AB-test results

Igor Mukam
Lumen Engineering Blog
6 min read · May 9, 2023

Context

The performance of a video session powered by Lumen® Mesh Delivery can vary depending on many configuration parameters. These parameters allow us to fine-tune internal mechanisms according to the broadcaster’s use case. The way we set these parameters can have a great impact on our key performance metrics, among which are:

  • P2P efficiency (ratio of the amount of video downloaded through peer-to-peer over the total amount of video downloaded during the session), which we want to be as high as possible
  • Buffering ratio (ratio of the time spent rebuffering over the total time spent watching the video), which we want to be as low as possible.

For example, we can control the maximum number of peers a viewer can connect to. The more peers, the higher the probability of having good P2P efficiency, but the stronger the impact on the device CPU. Say that we know that a customer’s traffic is concentrated in a country where streaming devices are likely to be low-end; we might choose to configure a low maximum number of peers for this customer so as not to strain the CPU of the end user devices.

It is important to point out that it is possible to set specific configurations for Lumen Mesh Delivery:

  • for a given customer
  • for a given viewer platform (desktop browsers, mobile browsers, smart TVs, Android / iOS native apps, etc.)
  • for a given stream type (Live / VOD)
  • for a given ISP

Our initial intention was to find other criteria that inherently affect the outcome of a session, so that we could also optimize the configuration of Mesh Delivery according to these criteria. That’s why we started a project aiming to categorize all video sessions into performance-related groups (the performance metrics are defined later in the article) and to see whether we could correlate those groups with pre-existing characteristics of the video session.

The information available at the beginning of a session (the “pre-existing characteristics”) includes:

  • Media information (video track count, bitrates, media duration…)
  • Stream information (number of new users in the past x minutes, total number of users currently on the stream…)
  • Client information (platform, user agent string, from which we can derive the browser, browser version, OS, etc.)
  • Others (time of day…)

This article will explain how we managed to build, visualize and interpret video session clusters, as well as how we use them to get more insights for our R&D.

Building the clusters

We started by defining the information that we wanted to give as an input to the clustering algorithm:

Dimensions (categorical data)

  • Version (version of the technology, as many versions can co-exist in production)
  • Platform
  • Device name
  • Browser name
  • OS name

Initial metrics (numerical data, available at the beginning of each session)

  • Nb_users: number of users that arrived on the stream in the same 5-minute window as the considered session
  • SegmentDuration (duration of the video segments for the considered content)
  • PlaylistLength (in minutes)
  • MediaDuration (in minutes)
  • TrackCount
  • BitRate (in Mb/s)

Performance metrics (numerical data, only available when the session ends)

  • P2pRatio
  • UploadRatio
  • WithBufferRatio (ratio of the time spent with video “frozen” because of high CPU usage over the total time spent watching the video, capped at 0.05 to consider 5% as the maximum acceptable rebuffering rate)
  • WithoutBufferRatio (ratio of the time spent with video rebuffering because segments had not been received over the total time spent watching the video, capped at 0.05 to consider 5% as the maximum acceptable rebuffering rate)
  • OverheadRatio (ratio of the data downloaded through P2P that was not actually played over the total volume of data downloaded through P2P, capped at 0.1)
  • CountAvg: average number of peers that were connected during the session

All numerical data was normalized to bring values between 0 and 1.
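
As an illustration, here is a minimal sketch of this preprocessing step, assuming the sessions are stored in a pandas DataFrame. The column names, the capping step and the use of scikit-learn’s MinMaxScaler are illustrative assumptions, not a description of our actual pipeline.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative column names; the real feature set is the one listed above.
NUMERICAL_COLS = [
    "nb_users", "segment_duration", "playlist_length", "media_duration",
    "track_count", "bitrate", "p2p_ratio", "upload_ratio",
    "with_buffer_ratio", "without_buffer_ratio", "overhead_ratio", "count_avg",
]
CATEGORICAL_COLS = ["version", "platform", "device_name", "browser_name", "os_name"]

def prepare_features(sessions: pd.DataFrame) -> pd.DataFrame:
    """Cap the relevant ratios, then min-max scale numerical features to [0, 1]."""
    out = sessions.copy()
    out["with_buffer_ratio"] = out["with_buffer_ratio"].clip(upper=0.05)
    out["without_buffer_ratio"] = out["without_buffer_ratio"].clip(upper=0.05)
    out["overhead_ratio"] = out["overhead_ratio"].clip(upper=0.1)
    out[NUMERICAL_COLS] = MinMaxScaler().fit_transform(out[NUMERICAL_COLS])
    return out
```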

At first, we tried to leverage the categorical data to build the clusters, so we used the KPrototypes algorithm. A good implementation of that algorithm can be found here: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py

To compute distances between data points, we used:

  • The Hamming distance for categorical features (d=0 if the two data points have the same value for the feature, d=1 otherwise)
  • The Euclidean distance for all numerical features

We also tried using the more classical KMeans algorithm, considering only the numerical features.
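
For reference, here is a rough sketch of how both models can be fitted, assuming the feature matrix has been prepared as described above, with categorical columns placed first. The function names, parameters and random seeds are illustrative, not our production code.

```python
import numpy as np
from kmodes.kprototypes import KPrototypes
from sklearn.cluster import KMeans

def fit_kprototypes(X: np.ndarray, n_categorical: int, k: int = 9):
    """Mixed categorical + numerical clustering; categorical columns come first in X."""
    model = KPrototypes(n_clusters=k, init="Cao", n_init=5, random_state=42)
    labels = model.fit_predict(X, categorical=list(range(n_categorical)))
    return model, labels

def fit_kmeans(X_num: np.ndarray, k: int = 9):
    """Numerical-only clustering on the normalized metrics."""
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_num)
    return model, labels
```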

We used a whole week of data for a specific broadcaster; each data point represented a video session, with all the features mentioned above.

The following plot shows the inertia of the clusters, for K ranging from 4 to 15, for both KMeans and KPrototypes:

We finally went with the KMeans algorithm, with K = 9, as the inertia did not decrease much when we added more clusters beyond that point.
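
Here is a sketch of how such an inertia curve can be computed with scikit-learn (KMeans exposes an inertia_ attribute; the kmodes implementation exposes a similar cost_ attribute). The K range follows the plot above; everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(X_num: np.ndarray, k_min: int = 4, k_max: int = 15) -> dict:
    """Within-cluster sum of squares for each K, used to locate the elbow."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_num).inertia_
        for k in range(k_min, k_max + 1)
    }

# inertias = inertia_curve(X_num)
# Plotting K against inertia shows where the curve flattens; in our case around K = 9.
```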

Validating the clusters

To make sure our clusters were robust and reliable, we fitted a KMeans model with different initializations on four different weeks of data for the same broadcaster.

To find the “equivalent” centroids across different KMeans runs, we used the Hungarian algorithm (the linear_sum_assignment function in the SciPy library).

Comparing the distances between “equivalent” centroids allowed us to see that they were quite stable from one run to the other. The heatmap below shows the distances between the centroids of two different runs, and the diagonal values are the interesting distances: they are very close to 0, which means “equivalent” centroids are close to one another.
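
As a sketch, the matching between two sets of centroids can be done as follows, assuming both runs produced the same number of clusters; the helper name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_centroids(centroids_a: np.ndarray, centroids_b: np.ndarray):
    """Pair each centroid of run A with its counterpart in run B so that the
    total Euclidean distance is minimal (Hungarian algorithm)."""
    cost = cdist(centroids_a, centroids_b)          # pairwise distance matrix
    rows, cols = linear_sum_assignment(cost)        # optimal one-to-one assignment
    return list(zip(rows, cols)), cost[rows, cols]  # matched pairs and their distances

# pairs, dists = match_centroids(run1.cluster_centers_, run2.cluster_centers_)
# Small distances for every pair indicate the clustering is stable across runs.
```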

Interpreting the clusters

After validating that our clusters were stable, we tried to characterize them. To do so, we plotted the 25th, 50th and 75th percentiles of each feature for each cluster and compared them to the percentiles of the global population. The results for two of the nine clusters are shown below:

On those charts, the feature percentiles of the considered cluster are in red, and the percentiles of the overall population are in black. This shows very quickly which features make the sessions in a cluster different from the rest of the sessions. For example, cluster 0 contains sessions with a very high withoutBufferRatio, while sessions in cluster 2 have a very low countAvg.
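
A minimal sketch of how such a percentile profile can be computed, assuming the normalized numerical features are in a pandas DataFrame and the labels come from the fitted KMeans model; the function name is illustrative.

```python
import numpy as np
import pandas as pd

def percentile_profile(features: pd.DataFrame, labels: np.ndarray, cluster_id: int):
    """25th/50th/75th percentiles of each feature, for one cluster and for the whole population."""
    quantiles = [0.25, 0.50, 0.75]
    cluster_pct = features[labels == cluster_id].quantile(quantiles)
    overall_pct = features.quantile(quantiles)
    return cluster_pct, overall_pct

# cluster_pct, overall_pct = percentile_profile(features_df, kmeans.labels_, cluster_id=0)
# Plotting cluster_pct (red) against overall_pct (black) highlights the distinguishing features.
```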

Conclusion: using the cluster centroids and sizes

In the end, we found it difficult to establish meaningful correlations between the performance metrics of a video session and its pre-existing characteristics. Thus, we were not able to find useful new optimization criteria based on those pre-existing characteristics.

However, A/B testing proved to be a very good use case for our session clusters. We often work with our customers to A/B test new features and new algorithms for Mesh Delivery. Visualizing changes in the size of each cluster gave us completely new insights into how an algorithm or a feature affected audiences. While overall metrics may be stable across the two populations of an A/B test, the size of specific session clusters can reveal otherwise hidden nuances. For example, reducing the size of the cluster of sessions with a very high withoutBufferRatio is a very good side effect; on the other hand, increasing the size of the cluster with a low uploadRatio means we concentrated the upload needs of the entire network on a smaller number of peers.

We saved the centroid coordinates from the KMeans model, and we added the cluster size calculation as an automatic analysis in our A/B testing statistical tests.
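
As an illustration of this analysis, here is a minimal sketch, assuming the saved centroids and the two test populations (already normalized with the same scaler) are available as NumPy arrays; the names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def cluster_shares(X_num: np.ndarray, centroids: np.ndarray) -> pd.Series:
    """Assign each session to its nearest saved centroid and return each cluster's share."""
    labels = cdist(X_num, centroids).argmin(axis=1)
    return pd.Series(labels).value_counts(normalize=True).sort_index()

# shares_a = cluster_shares(X_population_a, saved_centroids)
# shares_b = cluster_shares(X_population_b, saved_centroids)
# print((shares_b - shares_a).sort_values())  # which clusters grew or shrank under the test
```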

Special thanks to Joseph El Hachem, who spent a lot of time with me working on this project. Don’t hesitate if you have any questions on this article, and any feedback would be much appreciated!

This content is provided for informational purposes only and may require additional research and substantiation by the end user. In addition, the information is provided “as is” without any warranty or condition of any kind, either express or implied. Use of this information is at the end user’s own risk. Lumen does not warrant that the information will meet the end user’s requirements or that the implementation or usage of this information will result in the desired outcome of the end user. All third-party company and product or service names referenced in this article are for identification purposes only and do not imply endorsement or affiliation with Lumen. This document represents Lumen products and offerings as of the date of issue.

Igor Mukam is a Data Scientist / Engineer and Data / R&D Team Lead at Lumen Technologies (formerly Streamroot), writing for the Lumen Engineering Blog.