More with less: Client Side Rollups for Video Quality Metrics

Saving millions while getting a more granular reading on Quality of Experience

Crunchyroll
Feb 19, 2020

Authored By Michael Dale

Crunchyroll’s challenge:

Crunchyroll fans today stream well over 2 billion minutes of video content a month. Within any given minute of playback, numerous events can fire (bitrate switches, playback position updates, buffering events and ad events), and all of them are relevant to capturing the quality of the viewing experience. Crunchyroll’s initial implementation of player telemetry attempted to capture and transmit many of these events, and also duplicated this data stream to a dedicated QoE platform.

Within our own telemetry this mapped to north of 16 billion events a month. Managing this event volume gets costly. Looking just at the front door, CloudFront-integrated WAF at $0.01 per 10K requests maps to $16K a month before that data hits any back end infrastructure; as you can imagine, cost grows many times over once you start to do anything useful with data at this scale. Likewise, dedicated QoE platforms incurred significant hard costs to support our usage levels. If another solution could be found, we could save millions of dollars annually at Crunchyroll scale.
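As a quick check of the front door figure, the arithmetic can be written out in a few lines. This is only a sketch using the numbers quoted above; the function and variable names are illustrative and not part of any Crunchyroll code.

```typescript
// Request cost at a flat price per 10K requests.
function wafRequestCostUsd(requests: number, usdPer10kRequests: number): number {
  return (requests / 10_000) * usdPer10kRequests;
}

const monthlyEvents = 16_000_000_000; // "north of 16 billion events"
const usdPer10k = 0.01;               // one cent per 10K requests

// 1.6 million blocks of 10K requests at one cent each: about $16,000 a month.
const monthlyCostUsd = wafRequestCostUsd(monthlyEvents, usdPer10k);
```

Sending one rollup per viewing session instead of one request per event shrinks the `requests` term directly, which is where the savings at this layer come from.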

This high fidelity event beaconing approach resulted in a massive volume of data being sent and processed within our data infrastructure to produce reports with key performance indicators around user experience on our service. Many tools we used in other parts of the organization for analyzing user app activity, feeding lifecycle marketing efforts, funnel analysis and churn modeling did not map cleanly onto our raw video QoE data. Server side rollups were challenging to deploy and maintain consistently across multiple data spec version implementations. In addition, finding cycles within the data team’s already full schedule to build and maintain ETLs for QoE rollups was difficult. As a result, we could not produce and evolve useful dashboards against the KPIs we were interested in without complicated and costly development cycles, which never materialized. In the end we were paying a lot of money for data telemetry that was difficult to leverage effectively.

To address these challenges we decided to explore client side rollups. This approach enabled us to surface key performance indicators at the video viewing session level without overloading our back ends with high frequency telemetry. Because we run these calculations locally, we were actually able to increase the fidelity of the events that we capture. Those events feed QoS metrics, which we fold into a single rollup event that we send to the back end event data pipeline.

What are useful Quality of Experience KPIs for video?

Fortunately, some smart people have thought about this problem a lot. We based our client side rollup model on the “Streaming Quality of Experience Events, Properties and Metrics” (CTA) spec, which was authored by video community members who care deeply about quality of experience. This spec has been discussed at FOMS conferences that we participate in and serves as a good baseline for relevant KPIs.

How do client side rollups work?

Client side rollups maintain a per video viewing session object that keeps a running tally of all relevant data. We leverage the React Native AsyncStorage library in connection with our multi-platform React player approach for fast, local and persistent storage of QoE state. Because we are just storing this data locally, we are able to monitor events at much higher fidelity than if we were sending the events to the back end.
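A minimal sketch of such a per-session tally is shown here. The event shapes and field names (`PlayerEvent`, `SessionRollup`, `applyEvent`) are assumptions made for illustration; the article does not publish its event spec. Each player event is folded into the running object, and the JSON string at the end is the kind of value that could be handed to AsyncStorage's `setItem(key, value)`.

```typescript
// Hypothetical player events; the real spec is not shown in this article.
type PlayerEvent =
  | { kind: "bitrateSwitch"; bitrateKbps: number }
  | { kind: "bufferStart"; atMs: number }
  | { kind: "bufferEnd"; atMs: number }
  | { kind: "played"; durationMs: number };

interface SessionRollup {
  sessionId: string;
  watchedMs: number;
  bufferingMs: number;
  bufferingEvents: number;
  bitrateSwitches: number;
  lastBitrateKbps: number | null;
  bufferingSinceMs: number | null; // set while a stall is in progress
}

function newRollup(sessionId: string): SessionRollup {
  return {
    sessionId,
    watchedMs: 0,
    bufferingMs: 0,
    bufferingEvents: 0,
    bitrateSwitches: 0,
    lastBitrateKbps: null,
    bufferingSinceMs: null,
  };
}

// Fold one event into the running tally without mutating the input.
function applyEvent(r: SessionRollup, e: PlayerEvent): SessionRollup {
  switch (e.kind) {
    case "bitrateSwitch":
      return { ...r, bitrateSwitches: r.bitrateSwitches + 1, lastBitrateKbps: e.bitrateKbps };
    case "bufferStart":
      return { ...r, bufferingEvents: r.bufferingEvents + 1, bufferingSinceMs: e.atMs };
    case "bufferEnd":
      if (r.bufferingSinceMs === null) return r; // unmatched end, ignore
      return {
        ...r,
        bufferingMs: r.bufferingMs + (e.atMs - r.bufferingSinceMs),
        bufferingSinceMs: null,
      };
    case "played":
      return { ...r, watchedMs: r.watchedMs + e.durationMs };
  }
}

const sampleEvents: PlayerEvent[] = [
  { kind: "bitrateSwitch", bitrateKbps: 3000 },
  { kind: "played", durationMs: 10000 },
  { kind: "bufferStart", atMs: 10000 },
  { kind: "bufferEnd", atMs: 11500 },
  { kind: "played", durationMs: 20000 },
  { kind: "bitrateSwitch", bitrateKbps: 5000 },
];

const sessionRollup = sampleEvents.reduce(applyEvent, newRollup("session-1"));
const serializedRollup = JSON.stringify(sessionRollup); // value to persist locally
```

Keeping the update as a pure function makes the rollup logic easy to unit test, which matters once this logic lives in the client.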

For example, we monitor every segment that is delivered and build per-host delivery speeds into the rollup event. With our older beacon solution there was no easy way to tag every segment delivered, since the model was built on event-level tracking rather than on correlating what was relevant. This tracking is important because we often leverage more than a single CDN for delivery of our video content but need to evaluate CDNs independently. By consolidating these concerns in the player team, useful metrics become a few lines of code in a single codebase rather than the coordination and orchestration of multiple back end systems.
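The per-host tally described above might look roughly like the following. This is an illustrative sketch, not the player's actual code: `SegmentDelivery`, `recordSegment` and the example host names are assumed for the purpose of the example.

```typescript
// One record per delivered media segment (hypothetical shape).
interface SegmentDelivery {
  host: string;       // CDN host that served the segment
  bytes: number;      // payload size
  downloadMs: number; // time spent downloading
}

interface HostStats {
  segments: number;
  bytes: number;
  downloadMs: number;
}

// Add one segment to the running per-host totals.
function recordSegment(stats: Map<string, HostStats>, s: SegmentDelivery): void {
  const cur = stats.get(s.host) ?? { segments: 0, bytes: 0, downloadMs: 0 };
  stats.set(s.host, {
    segments: cur.segments + 1,
    bytes: cur.bytes + s.bytes,
    downloadMs: cur.downloadMs + s.downloadMs,
  });
}

// Bits per millisecond is numerically equal to kilobits per second.
function throughputKbps(h: HostStats): number {
  return h.downloadMs > 0 ? (h.bytes * 8) / h.downloadMs : 0;
}

// Flatten to a plain object so it can be embedded in the rollup event.
function perHostKbps(stats: Map<string, HostStats>): Record<string, number> {
  const out: Record<string, number> = {};
  stats.forEach((h, host) => {
    out[host] = throughputKbps(h);
  });
  return out;
}

const hostStats = new Map<string, HostStats>();
recordSegment(hostStats, { host: "cdn-a.example.com", bytes: 1_000_000, downloadMs: 800 });
recordSegment(hostStats, { host: "cdn-a.example.com", bytes: 500_000, downloadMs: 400 });
recordSegment(hostStats, { host: "cdn-b.example.com", bytes: 1_000_000, downloadMs: 2000 });

const deliverySpeeds = perHostKbps(hostStats);
// cdn-a: 12,000,000 bits / 1200 ms = 10000 kbps; cdn-b: 8,000,000 bits / 2000 ms = 4000 kbps
```

Summing bytes and time before dividing weights each host's figure by the data actually transferred, rather than averaging per-segment speeds.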

How do client side rollups avoid losing data?

Once the video session is complete or the user abandons video playback, the video QoE rollup action and event are triggered. In the case of web abandonment we attempt to leverage the beforeunload event to send the rollup data and register this attempt.

In any case, if the user abandons during a network outage, or the application level beacon is not dispatched for any reason, the detailed QoE state is stored locally and flushed the next time the user starts up the application. We have found this approach results in a very small percentage of lost data relative to the real time event telemetry approach we leveraged before.
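The persist-first, flush-on-startup flow described above can be sketched as follows. Several things here are assumptions for illustration: the `KeyValueStore` interface is a synchronous simplification (the real AsyncStorage API is promise-based, so a production version would await each call), the storage key is hypothetical, and the `Send` callback loosely mirrors the shape of `navigator.sendBeacon`, which returns a boolean.

```typescript
// Simplified synchronous stand-in for local storage (assumption, see above).
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Returns true if the payload was accepted for delivery.
type Send = (payload: string) => boolean;

const PENDING_KEY = "qoe:pendingRollups"; // hypothetical storage key

function readPending(store: KeyValueStore): string[] {
  const raw = store.getItem(PENDING_KEY);
  return raw === null ? [] : (JSON.parse(raw) as string[]);
}

function writePending(store: KeyValueStore, pending: string[]): void {
  store.setItem(PENDING_KEY, JSON.stringify(pending));
}

// Try every pending rollup; keep the ones that could not be sent.
// Returns how many were sent. Call this on application startup.
function flushPending(store: KeyValueStore, send: Send): number {
  const pending = readPending(store);
  const remaining = pending.filter((payload) => {
    try {
      return !send(payload);
    } catch {
      return true; // treat a throwing transport as a failed send
    }
  });
  writePending(store, remaining);
  return pending.length - remaining.length;
}

// Call when the session completes or is abandoned: persist first, then send.
function completeSession(store: KeyValueStore, send: Send, rollupJson: string): void {
  writePending(store, readPending(store).concat([rollupJson]));
  flushPending(store, send);
}

// In-memory store used only for this example.
function memoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}

const exampleStore = memoryStore();
const delivered: string[] = [];
const offline: Send = () => false;
const online: Send = (payload) => {
  delivered.push(payload);
  return true;
};

completeSession(exampleStore, offline, '{"sessionId":"s1"}'); // outage: stays local
const pendingAfterOutage = readPending(exampleStore).length;
const flushedOnStartup = flushPending(exampleStore, online);   // next app start
const pendingAfterFlush = readPending(exampleStore).length;
```

Writing the rollup to local storage before attempting the send gives at-least-once delivery: a rollup can occasionally be sent twice, so including a session identifier in the payload lets the back end de-duplicate.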

In the old model all the events flowed into the Crunchyroll event collection pipeline directly. In the new model we roll up relevant events client side before we send them to the event collection pipeline.

Multi-CDN metrics?

With client side rollups it’s a lot easier to measure top level KPIs against the ASN data for the client’s source IP. This reduces the number of events we need to process from hundreds of millions to millions a day for useful CDN tagging and performance modeling. The data also more directly reflects the metrics we are interested in. This makes it practical to insert this data into a multi-CDN data store that can roll up into ASN-level weights for edge based decisioning. AWS provides good best practice guides for building these sorts of multi-CDN solutions that leverage these types of performance data points.
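To make the idea of rolling session data up into per-network CDN weights concrete, here is one possible sketch. The weighting rule (each CDN's share of the summed mean throughput within a network) is an assumption chosen for simplicity, not the method used at Crunchyroll, and `CdnSample`, `networkId` and `cdnWeights` are hypothetical names. The `networkId` stands for whatever network identifier is derived from the client's source IP.

```typescript
// One row per (session, CDN host), taken from the rollup event (assumed shape).
interface CdnSample {
  networkId: string;      // network identifier derived from the source IP
  cdnHost: string;
  throughputKbps: number;
}

interface Tally {
  sum: number;
  count: number;
}

// networkId -> cdnHost -> weight in [0, 1]; weights per network sum to 1.
type CdnWeights = Record<string, Record<string, number>>;

function cdnWeights(samples: CdnSample[]): CdnWeights {
  const byNetwork = new Map<string, Map<string, Tally>>();
  for (const s of samples) {
    let byCdn = byNetwork.get(s.networkId);
    if (byCdn === undefined) {
      byCdn = new Map<string, Tally>();
      byNetwork.set(s.networkId, byCdn);
    }
    const cur = byCdn.get(s.cdnHost) ?? { sum: 0, count: 0 };
    byCdn.set(s.cdnHost, { sum: cur.sum + s.throughputKbps, count: cur.count + 1 });
  }

  const out: CdnWeights = {};
  byNetwork.forEach((byCdn, networkId) => {
    // Mean throughput per CDN within this network.
    const means: Array<[string, number]> = [];
    let total = 0;
    byCdn.forEach((t, cdnHost) => {
      const mean = t.sum / t.count;
      means.push([cdnHost, mean]);
      total += mean;
    });
    // Normalize the means so they can be used as routing weights.
    const weights: Record<string, number> = {};
    for (const [cdnHost, mean] of means) {
      weights[cdnHost] = total > 0 ? mean / total : 0;
    }
    out[networkId] = weights;
  });
  return out;
}

const exampleWeights = cdnWeights([
  { networkId: "net-1", cdnHost: "cdn-a.example.com", throughputKbps: 7000 },
  { networkId: "net-1", cdnHost: "cdn-a.example.com", throughputKbps: 5000 },
  { networkId: "net-1", cdnHost: "cdn-b.example.com", throughputKbps: 2000 },
  { networkId: "net-2", cdnHost: "cdn-b.example.com", throughputKbps: 4000 },
]);
// net-1: means 6000 and 2000, so weights 0.75 and 0.25; net-2: a single CDN, weight 1
```

Because each input row is already a per-session aggregate, a job like this runs over millions of rows a day rather than hundreds of millions of raw events.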

How did these events integrate into our data ecosystem?

By pre-computing these values we make it much easier for data science, business intelligence and product teams to work with this data. For example, churn prediction model dashboards that we have built and displayed in Looker are able to incorporate the rollup QoS/QoE data much more easily than before.

With these rollup events we are able to build dashboards that integrate video QoE into traditional product analysis tools. For example, we can look at how the average bitrate consumed relates to subscription conversion or propensity for churn.

How do you address the challenges of Client side Rollups?

There are definitely some considerations if employing client side rollups for your QoE tracking.

Reduced real time signal.

  • If playback is cut completely, this data may flow in slightly delayed, as users abandon playback or reload applications, rather than at the exact moment the degradation occurs, as it would with heartbeat-like events.
  • In practice we did not find this to be that detrimental. Additionally, we have service health metrics against our AWS infra that also surface these alerts, and other services, like playhead position, that are triggered regularly without hitting data infra.
  • Having a main QoE event helps drive clarity in analyzing service distress.

Heavy dependency on client implementation consistency.

  • With “simple” events you could have multiple runtimes consistently implementing an event spec, while with rollup logic you’re adding complexity to the client.
  • In practice even our “simple” event spec became complex and had to be normalized across clients on the server. The new approach depends on normalization on the client, with a consistent JavaScript runtime across platform targets. Maintaining rigorous QA automation & testing is as critical as ever with this approach.

Next steps:

We will continue to explore client side rollups, adding metrics as needed. The Velocity player runtime will continue to roll out to the remaining platforms that are not yet covered. We are working to incorporate these video metrics into actionable decisions towards delivering the best possible experience to our fans.

Thanks for reading!
