Katana: Lessons Learned from Slicing HLS and Dicing Dash on the FaaS Edge

By Michael Dale & Maxim Levkov
Crunchyroll
Nov 10, 2020


Crunchyroll is a top provider of anime globally. In 2020, we experienced significant growth in minutes streamed; today, Crunchyroll streams billions of minutes of video every month, and we continue to see significant growth both domestically and internationally.

As Crunchyroll’s global audience has continued to grow, so has the need to improve our global service latency for video delivery. Within the Crunchyroll technology stack, the services in the per-user “playback domain” map to the following areas:

  • Resource Discovery & Content Security
  • Multi-CDN decisioning for optimal delivery across multiple CDNs
  • Server Side Ad Insertion for the advertising supported streams on the service

Historically, these domains lived with per-vendor solutions. With the need for stable, low-latency startup times globally, and the increasing sophistication of global compute platforms such as Lambda@Edge, we set about a project to dramatically improve our end-user experience. We aimed to eliminate the latencies identified in the detailed global performance monitoring data captured by the client-side rollup telemetry we deployed earlier in the year. Because we did not have a large team for the effort, we were also looking to minimize operational complexity in building a new solution. Functions as a Service (FaaS), a.k.a. serverless edge compute, was attractive for minimizing operational cost while reducing latency globally.

We identified many services that were not yet geo-provisioned everywhere our users were:

  • one API endpoint to get the list of streams;
  • another to manipulate the manifest for multi-CDN;
  • one for DRM token auth;
  • yet another to do stream manipulation for SSAI;
  • and finally, serving the manifest itself was yet another round trip for the playback client.

Unsurprisingly, running each of these as separate domains / services was catastrophic for performance, especially for users accessing the service internationally who had to open network connections to many of these servers in blocking sequence at varying degrees of proximity.

Borrowing ideas from GraphQL and high performance service constructs, we set about building Katana, or Edge PlaybacK OrchestrAtion for STAteless Multi-CDN & SSAI (don’t worry, the service is more clever than its acronym). With Katana, we aimed to empower the player to build a custom request for everything needed for playback in a single round trip. We aimed to do so in an edge application that (once cache hot) could be fully resolved within a given edge lambda for SVOD. For AVOD, it would support persistent connections for high speed server-to-server communication with ad servers, with minimal latency across requests. The request model in that case looks something like the following:
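(The original diagram is not reproduced here. As a rough stand-in, here is a hypothetical sketch of the shape of a composed single-round-trip playback response; the field names are illustrative, not the actual Katana schema.)

```typescript
// Hypothetical shape of a single-round-trip playback response.
// Field names are illustrative; the real Katana schema differs.
interface PlaybackResponse {
  streams: {
    url: string;          // manifest URL on the recommended CDN
    cdnOrder: string[];   // multi-CDN "recommended order" for the client
    drmToken?: string;    // DRM token resolved in the same request
    manifest?: string;    // optional embedded DASH/HLS manifest payload
  }[];
  subtitles: { locale: string; url: string }[];
  ads?: {
    vmapUrl: string;      // for AVOD, the sequenced ad schedule
  };
}

// For SVOD the whole object can be assembled from the edge lambda's
// in-memory cache; for AVOD the ad fields require keep-alive
// server-to-server calls to ad servers within the same invocation.
```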

We were able to achieve performance targets of single digit millisecond (ms) server-side runtime, and a significant improvement over our baseline for ad-served streams. Katana achieves these numbers by leaning heavily on lambda object persistence, adopting an “edge monolithic” model, using an optimal in-memory cache with presentation-level transforms (rather than object copies), and using deterministic hashing for ad streams to enable in-memory sequencing after VMAP parsing.

All in all, this dramatically improved our time to first frame globally, reduced video abandonment rates, dramatically improved on baseline ad fill and render rates, and ultimately has provided us with much deeper control over the ad experience of our users.

From Zero to Katana

As we set about building the service, we knew we wanted to support service composition, so we started with each service individually. For example, the multi-CDN service operated as a drop-in replacement for our multi-CDN manifest service, followed by our SSAI service operating similarly on distinct URLs for VMAP and sequenced HLS or DASH. Beyond replacing our existing services, we added capabilities relative to our own product needs.

Things to keep in mind with Multi-CDN

  • Manifest-level multi-CDN decisioning needs to be fast. We targeted 10 ms for the CDN rules engine, i.e. we were comparing against traditional CDN-cached manifests.
  • Manifest-level multi-CDN let us roll out to standard HLS / DASH clients quickly, with no client updates needed. See Bitmovin’s note on how to represent multiple CDNs in DASH and HLS in a standard way. To do this with multi-CDN tokens, concatenate the signatures in the template.
  • Where the player ABR system is controlled, we can use the multi-CDN list as the “recommended order” and have the client do congestion-aware negotiation on top of that list.
  • We built in flexible business-rule allocations (see the sketch after this list):
  • Modulo content-hash distributions for fair CDN comparisons with comparable cache hit rates while only handling a portion of our traffic.
  • ASN (autonomous system number) maps for large scale CDN deals that may reflect ISP interconnect relationships.
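A minimal sketch of how such business-rule allocations can be expressed; the CDN names, shares, and ASN numbers below are placeholders, not our production configuration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical business-rule allocation: a stable modulo over a content
// hash gives each CDN a fixed slice of titles (comparable cache hit rates),
// while an ASN map can override the choice for specific ISP deals.
const CDN_WEIGHTS: Array<{ cdn: string; share: number }> = [
  { cdn: "cdn-a", share: 80 },
  { cdn: "cdn-b", share: 20 },
];

const ASN_OVERRIDES: Record<number, string> = {
  64512: "cdn-b", // e.g. an ISP with a direct interconnect relationship
};

function pickCdn(contentId: string, clientAsn?: number): string {
  if (clientAsn !== undefined && ASN_OVERRIDES[clientAsn]) {
    return ASN_OVERRIDES[clientAsn];
  }
  // Hash the content id so the same title always maps to the same CDN,
  // keeping per-CDN cache hit rates comparable.
  const digest = createHash("sha256").update(contentId).digest();
  const bucket = digest.readUInt16BE(0) % 100;
  let cumulative = 0;
  for (const { cdn, share } of CDN_WEIGHTS) {
    cumulative += share;
    if (bucket < cumulative) return cdn;
  }
  return CDN_WEIGHTS[0].cdn;
}
```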

Things to keep in mind with manifest level SSAI

  • You will want to take advantage of HTTP keep-alive across lambda handler invocations for faster HTTP calls to ad servers.
  • Deterministic hashing of the ad stream URL maps directly to cache keys for entirely on-edge-lambda sequencing (see the sketch after this list).
  • Note: some ad servers use tokens and signatures in their media URLs (take this into account with an ad media URL template engine).
  • Fully RESTful URLs carry the sequence state in the URI. Any node can service any request; we have no “session” construct.
  • Leveraging the same system for ads as for content enables new codec rollouts, perfectly aligned content and ad profiles, and no new connections to new domains for ads vs. content.
  • Combine with the multi-CDN manipulations so ads get the same multi-CDN benefits.
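A hedged sketch of the deterministic-hashing idea; the key derivation and URI layout below are illustrative assumptions, not the actual Katana scheme.

```typescript
import { createHash } from "node:crypto";

// Hash the ad media URL (which may carry per-request tokens and signatures)
// down to a stable cache key, so signed variants of the same creative share
// one key and any lambda can resolve it from state carried in the URI.
function adCacheKey(adMediaUrl: string): string {
  // Illustrative assumption: strip volatile query params before hashing so
  // the "stable" part of the URL determines the key.
  const stable = new URL(adMediaUrl);
  stable.search = "";
  return createHash("sha1").update(stable.toString()).digest("hex");
}

// Sequence state can then live in the request path itself, e.g.
// /ssai/<contentId>/<adCacheKey>/<breakIndex>/<segmentIndex>.m4s
// so there is no server-side session and any node can answer any request.
```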

Considerations with DRM token auth cert proxy

  • We were able to drive significant performance gains in the DRM flow by caching the certificate server side, which enables proxying the license-request portion (a sketch follows this list).
  • Leveraging persistent server-side keep-alive connections removed the time a client would otherwise have spent opening a new connection.
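A minimal sketch of the certificate-caching and keep-alive idea, not our production code; the function names are illustrative and the license-server URL is assumed to come from configuration.

```typescript
import https from "node:https";

// Cache the DRM certificate in lambda memory and reuse a keep-alive agent
// for proxied calls, so clients avoid opening new connections per request.
const keepAliveAgent = new https.Agent({ keepAlive: true });
let cachedCert: Buffer | undefined; // persists for the lambda's lifetime

function fetchOverKeepAlive(url: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        const chunks: Buffer[] = [];
        res.on("data", (chunk) => chunks.push(chunk));
        res.on("end", () => resolve(Buffer.concat(chunks)));
      })
      .on("error", reject);
  });
}

async function getCertificate(certUrl: string): Promise<Buffer> {
  if (!cachedCert) cachedCert = await fetchOverKeepAlive(certUrl);
  return cachedCert; // warm invocations return the in-memory copy
}
```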

Serverless and AWS Lambda@Edge

When building a system at this scale (hundreds of millions of Lambda@Edge invocations mapping to a half dozen underlying services), it’s important to understand the underlying execution model of lambdas.

There are plenty of more detailed overviews of the lambda execution model but, essentially, a single lambda will handle a single request at a time, in sequence, for its lifecycle. This gives us predictability and avoids a lot of concurrent service pressure scenarios.

We closely monitor the runtime of the lambdas and see an average lambda lifetime of a few hours under constant request throughput. During those few hours a given lambda’s cache will hold thousands of DASH and HLS documents and serve them from memory. We use the 1024 MB lambda allocation, the majority of which is given to the in-memory cache. AWS outlines this pattern of persisting objects outside of the handler in its guidance on leveraging external data in lambdas. In our case, the external data sits very close to the lambda behind the CloudFront layer, so a “cold” fill costs only a few milliseconds, while in-memory cache hits return in microseconds. (As we are not duplicating the cached object during retrieval, this involves just returning a reference.)

In summary, the first cache tier is the memory of the lambda itself. As outlined, we leverage CloudFront as our fail-through cache, and we go all the way to origin only in the rare cases where we have a CloudFront miss. This model applies universally to anything used within playback orchestration, including our resource objects that describe all the subtitles and encodes across formats, codecs, bitrates, and tracks available.
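A minimal sketch of the "object outside the handler" pattern with the tiered fallback described above; the cache key, CloudFront URL, and response shape are placeholders.

```typescript
// Module scope survives across invocations of a warm lambda, so this Map
// acts as the first cache tier; a miss falls through to CloudFront and
// only rarely all the way to origin behind it.
const manifestCache = new Map<string, { body: string; fetchedAt: number }>();

async function getManifest(key: string, cfUrl: string): Promise<string> {
  const hit = manifestCache.get(key);
  if (hit) return hit.body;           // microseconds: return a reference

  const res = await fetch(cfUrl);     // milliseconds: CloudFront "cold" fill
  const body = await res.text();
  manifestCache.set(key, { body, fetchedAt: Date.now() });
  return body;
}

export const handler = async (event: unknown) => {
  // ...parse the request, derive the cache key and CloudFront URL...
  const manifest = await getManifest(
    "asset-123",
    "https://example-cf-distribution.example.com/manifest.mpd"
  );
  return { status: "200", body: manifest }; // response shape simplified
};
```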

What is presentation time manipulation?

To squeeze out a few milliseconds of performance gains, we leverage shared objects with presentation-layer manipulation instead of full object clones. Reading from our in-memory cache looks something like the following:
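(The original illustration is not reproduced here; below is a minimal hypothetical sketch of the idea, with illustrative names like `renderManifest` and the `{cdnHost}` placeholder.)

```typescript
// The cached manifest object is returned by reference and CDN-specific
// details are applied only at render time, instead of deep-cloning the
// object for every request.
interface CachedManifest {
  template: string; // manifest text with a {cdnHost} placeholder
}

const cache = new Map<string, CachedManifest>();

function renderManifest(key: string, cdnHost: string): string | undefined {
  const entry = cache.get(key);       // shared object, no copy made
  if (!entry) return undefined;
  // Presentation-level transform: the cached object is never mutated.
  return entry.template.replaceAll("{cdnHost}", cdnHost);
}
```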

This avoids data duplication and helps the overall runtime of the edge lambda across multiple cache reads. We leverage non-destructive presentation-level transforms for the multi-CDN representation.

Did someone say Edge Monolith?

While the code base is structured for use as distinct lambda modules, performance concerns necessitated that each orchestrated function operate within a single lambda response handler cycle. This approach has significant benefits with the lambda execution model and for in-memory cache reuse across requests.

For example, a single instance of the SSAI playback flow needs to read the duration of content from the DASH manifest to first manipulate the VMAP per playback sequence, and then read that same DASH manifest object to create the multi-period DASH manifest fed into the player. Using the edge monolith model enables storing a single element in memory for use across multiple sub-tasks within a single response runtime. The monolith approach also provides inline communication and reuse of common functions across internal sub-contexts sharing the same in-memory cache.
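A hedged sketch of that flow under the edge-monolith model; the types and builder functions are illustrative names, not the actual Katana modules.

```typescript
// One handler invocation runs both the VMAP manipulation and the
// multi-period DASH build, and both read the same cached manifest object.
interface ContentManifest {
  durationSeconds: number;
  periodXml: string;
}

const contentCache = new Map<string, ContentManifest>();

function buildVmap(content: ContentManifest): string {
  // Ad breaks are positioned against the content duration read once here.
  return `<!-- VMAP built against duration ${content.durationSeconds}s -->`;
}

function buildMultiPeriodDash(content: ContentManifest, adPeriods: string[]): string {
  return ["<MPD>", content.periodXml, ...adPeriods, "</MPD>"].join("\n");
}

export function handleSsaiRequest(assetId: string) {
  const content = contentCache.get(assetId);     // single cached object
  if (!content) throw new Error("cold cache: fill from CloudFront/origin");
  const vmap = buildVmap(content);               // first use of the object
  const mpd = buildMultiPeriodDash(content, []); // second use, same object
  return { vmap, mpd };
}
```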

If the lambda work were instead distributed into separate requests, aside from the huge latency penalty of a client-side round trip (50–500 ms), the best case runtime is constrained by hit rates spread across dozens of concurrent lambdas within a given region. In other words, your chances of cache hits across the dozens of ads and content resources referenced would drop, likely incurring a fail-through cache penalty. Consequently, this would have added 3–25 ms of latency to obtain a stored object across disparate requests or micro-service hops.

Orchestration

As outlined above, orchestration benefits both the server and the client. As we built each piece, we also built in GraphQL-like service composition to hit the goal of a single round trip for all playback resources.

Our base resource discovery would normally provide a link to our multi-CDN URLs:
/asset/kpv0kr0nytw7wrn?streamTypes=drm_adaptive_dash

On web, where it is straightforward for a service worker to intercept the XHR manifest requests from the playback engine (Shaka Player 3 in our case), we extended the API with embedManifest, which includes the payload of the stream manifest in that same response and allows us to avoid a round trip for the DASH manifest:

/asset/kpv0kr0nytw7wrn?streamTypes=drm_adaptive_dash&embedManifest=-
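A minimal sketch of the web-side intercept; the cache name and manifest MIME type are assumptions, and it presumes the embedded manifest from the /asset response has already been stored.

```typescript
// Assumes the "WebWorker" lib in tsconfig for service worker types.
declare const self: ServiceWorkerGlobalScope;

// manifest URL -> manifest body extracted from the /asset?embedManifest response
const embeddedManifests = new Map<string, string>();

self.addEventListener("fetch", (event: FetchEvent) => {
  const body = embeddedManifests.get(event.request.url);
  if (body) {
    // Serve the DASH manifest without a network round trip.
    event.respondWith(
      new Response(body, {
        headers: { "Content-Type": "application/dash+xml" },
      })
    );
  }
});
```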

When loading ads, we pass the stream URL into our VMAP call. This would normally involve one request to get the stream resources, one request to get the VMAP, and then one call to get the manifest, but with orchestration it also collapses to a single call:

/asset/kpv0kr0nytw7wrn?streamTypes=drm_adaptive_dash&adParams={params}&embedManifest=-

Observability

AWS lambdas can seamlessly leverage CloudWatch to graph arbitrary metrics. In our case, we roll up as much useful information as we can into a JSON format, which we then use to graph arbitrary dimensions within CloudWatch (a logging sketch follows these lists). Some items we keep track of:

  • Network request time for all resources loaded.
  • Number of items in cache, memory used, cache hit / miss at what level.
  • XML-to-JavaScript object model parse time for all ad VMAP, VAST, and DASH XML parsed.
  • Per-service request graphs: /drm, /vmap, SSAI manifest, SVOD multi-CDN.
  • Total runtime as observed by the lambda handler.

Because we can graph and log arbitrary dimensions, we also log a lot of business metrics, including:

  • MultiCDN resolver: what CDN we choose & why
  • Paid ads vs house ads: what ad networks filled how many ads against what latency
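A sketch of what such a per-invocation rollup can look like; the field names are illustrative, and the graphing is assumed to happen downstream via CloudWatch Logs metric filters or Logs Insights.

```typescript
// One JSON line per invocation, sliced into arbitrary dimensions later.
interface InvocationRollup {
  service: "drm" | "vmap" | "ssai-manifest" | "svod-multicdn";
  runtimeMs: number;                 // total runtime seen by the handler
  cacheItems: number;                // items held in the in-memory cache
  cacheHits: number;
  cacheMisses: number;
  networkMs: Record<string, number>; // per-resource request time
  xmlParseMs: number;                // VMAP / VAST / DASH parse time
  cdnChosen?: string;                // multi-CDN resolver decision
  cdnReason?: string;                // and why it was chosen
}

function logRollup(rollup: InvocationRollup): void {
  // A single structured log line per request keeps log volume modest while
  // still letting us graph any dimension after the fact.
  console.log(JSON.stringify(rollup));
}
```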

Some more things to keep in mind with edge serverless:

It’s important not to have a lambda call another lambda directly; concurrency becomes difficult to model. Event-based architectures through SQS / SNS are the way to go. (This is also true for microservices nowadays.)

In our case, one issue we ran into related to our edge lambda triggering an origin lambda for ad ingestion. Once throttled, ingestion did not take place, which resulted in longer-running or failing ingest calls and a percentage of failed requests. AWS calculates concurrency limits across edge and origin, and the size of lambda memory allocations can impact the listed limits for a given region. This means origin lambda concurrency is counted in the same bucket as the edge lambdas in that region.

We have since switched to SQS with deduplication on the queue itself, which avoids thousands of origin lambda invocations performing that duplication check and keeps large sets of new ad creatives from driving high origin lambda concurrency or risking the concurrency limits.
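A minimal sketch of the queue hand-off using an SQS FIFO queue's deduplication; the queue URL, message shape, and creative hash are placeholders.

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// The edge lambda enqueues the creative instead of invoking an origin
// lambda directly; the FIFO deduplication id collapses repeat requests
// for the same creative within the dedup window.
const sqs = new SQSClient({});

async function enqueueAdIngest(creativeUrl: string, creativeHash: string) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl:
        "https://sqs.us-east-1.amazonaws.com/123456789012/ad-ingest.fifo",
      MessageBody: JSON.stringify({ creativeUrl }),
      MessageGroupId: "ad-ingest",
      MessageDeduplicationId: creativeHash, // same creative -> one ingest
    })
  );
}
```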

What’s next?

  • Roll out the resource-graph consumption API request / response handling in the player.
  • Enhance server side recommended CDN order with realtime per ASN performance data.
  • Enhance player-level negotiation on top of server-based multi-CDN recommendations. We would like to work with the video dev community on ways to signal this in a standard way in DASH and HLS manifest formats.
  • Open source some useful stuff!
