Under pressure

ProSiebenSat.1 Tech Blog
Jun 11, 2021


by Fabian Desoye

You’ve probably been asked this before in interviews, in one way or another:

How do you perform under pressure?

Since the latest addition to our zoo of services, my team has earned the right to say

Even better than in idle mode!

About two years ago, we decided to build the API for a new system that manages our video metadata on AWS AppSync rather than Apollo GraphQL or another framework.
While the service was intended to be a pure backend service, the team found it might be handy to also expose a subset of our data directly to our websites.

When we put the thing into production, it was only a few days before the finale of one of our flagship shows. For the last decade, this event has been responsible for the highest traffic peaks across all our systems. And of course it was a first proof point for how our API performs under pressure.
If you are in the video streaming domain, you might recognize the following usage graph of a normal day:

Requests per minute on a “normal” day

During the season of that flagship show, we already see quite a bit of increased traffic on show days and the following day, but on the day of the finale everything goes through the roof:

Requests per minute on a finale day

The figures above are normalized, with 100% representing typical afternoon traffic. Just after the airing of the finale ended, users went online to stream the highlights or even the complete show again. The two peaks at the beginning of the show and just after its end are pretty obvious.

More interesting, though, is that this spike comes right after the show airs, i.e. within a minute or so our systems had to scale by a factor of 10 or more.

For traditionally built systems (and let me count “containerized” as traditional here), this would probably have caused quite a headache. Since we have this credo of “serverless first”, we were not scared at all. We already knew from the previous week’s semi-finale that things would be all right, but I was still amazed when I looked at the metrics the day after.

Let me add the p95 latency to the normal day graph from above:

Req/min and latency on a “normal” day

As you can see, while the number of requests increases over the day, AWS warms up more and more resources for our AppSync API, with the effect that latency actually goes down. The same happened during the finale:

Req/min and latency on a finale day

Latency went down to 13 ms when requests hit their peak, which is about a third of what we see under normal load. This “of course” came with no errors or other issues.

And the best of it: that API is almost a “Hello World” implementation of AWS AppSync.
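
To give a feel for how little is involved, here is a minimal sketch of such a setup with the AWS CDK in TypeScript. It is not our actual stack: the schema, table layout, and field names are assumptions for illustration, and exact construct props vary slightly between aws-cdk-lib versions.

```ts
import * as cdk from 'aws-cdk-lib';
import * as appsync from 'aws-cdk-lib/aws-appsync';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

// schema.graphql (simplified, hypothetical field names):
//   type Video { id: ID!  title: String  topics: [String] }
//   type Query { getVideo(id: ID!): Video }

export class VideoMetadataApiStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // On-demand DynamoDB table holding the video metadata
    const table = new dynamodb.Table(this, 'VideoMetadata', {
      partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    // The AppSync API itself: a schema plus one unit resolver, nothing more
    const api = new appsync.GraphqlApi(this, 'VideoMetadataApi', {
      name: 'video-metadata',
      schema: appsync.SchemaFile.fromAsset('schema.graphql'),
    });

    // Wire Query.getVideo straight to DynamoDB via the built-in mapping templates
    const source = api.addDynamoDbDataSource('VideoTableSource', table);
    source.createResolver('GetVideoResolver', {
      typeName: 'Query',
      fieldName: 'getVideo',
      requestMappingTemplate: appsync.MappingTemplate.dynamoDbGetItem('id', 'id'),
      responseMappingTemplate: appsync.MappingTemplate.dynamoDbResultItem(),
    });
  }
}
```

Most of that is boilerplate; the only “logic” is the pair of built-in DynamoDB mapping templates.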

At the time of writing, AppSync did not let us configure a custom domain name the way API Gateway does, so we put CloudFront in front of AppSync to provide a known domain name for our API. Although CloudFront could also cache the requests, we enabled caching neither on CloudFront nor on AppSync, but let every request hit our DynamoDB table. While this can obviously be improved, this load test showed us that we can get by very well and still provide reasonable response times without caching.
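
Sticking with the CDK sketch above, fronting the API with CloudFront and caching disabled could look roughly like this. The helper name, custom domain, and certificate handling are hypothetical; it would be called from the stack shown earlier with the `api` construct and a certificate ARN.

```ts
import { Fn } from 'aws-cdk-lib';
import * as appsync from 'aws-cdk-lib/aws-appsync';
import * as acm from 'aws-cdk-lib/aws-certificatemanager';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import { Construct } from 'constructs';

// Puts a CloudFront distribution with a known domain name in front of an AppSync API.
// The domain name and certificate ARN are hypothetical placeholders.
export function addCustomDomain(scope: Construct, api: appsync.GraphqlApi, certificateArn: string) {
  // api.graphqlUrl looks like https://<id>.appsync-api.<region>.amazonaws.com/graphql;
  // pick out the hostname so CloudFront can use it as a custom HTTP origin.
  const appsyncDomain = Fn.select(2, Fn.split('/', api.graphqlUrl));

  new cloudfront.Distribution(scope, 'ApiDistribution', {
    domainNames: ['metadata-api.example.com'], // hypothetical custom domain
    certificate: acm.Certificate.fromCertificateArn(scope, 'ApiCert', certificateArn),
    defaultBehavior: {
      origin: new origins.HttpOrigin(appsyncDomain),
      // GraphQL requests are POSTs, so all HTTP methods must be allowed
      allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,
      // as described above: no caching anywhere, every request goes through to DynamoDB
      cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,
    },
  });
}
```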

Thanks to all those AWS-managed (serverless) resources, our next answer to the question of how we perform under pressure might even be

We sleep well!

p.s.: You might ask what this API actually does: it is consumed by our player application when the user starts watching a video. It provides a list of topics related to that video so that our ad servers can place more relevant ads.
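
For illustration, such a call from the player could look roughly like the sketch below. The domain, query, and field names are assumptions, not the real API.

```ts
// Hypothetical client-side call: fetch the topics for a video so the
// ad server can pick more relevant ads.
const VIDEO_TOPICS_QUERY = `
  query VideoTopics($id: ID!) {
    getVideo(id: $id) {
      id
      topics
    }
  }
`;

async function fetchVideoTopics(videoId: string): Promise<string[]> {
  // POST to the CloudFront-fronted GraphQL endpoint (placeholder domain)
  const response = await fetch('https://metadata-api.example.com/graphql', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': 'da2-xxxxxxxx', // assuming AppSync API-key auth; placeholder value
    },
    body: JSON.stringify({ query: VIDEO_TOPICS_QUERY, variables: { id: videoId } }),
  });
  const { data } = await response.json();
  return data?.getVideo?.topics ?? [];
}
```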

p.p.s.: Now you might ask how we assign those topics to the videos. To find out more, have a look at the articles from our AI team, like this one by Dr. Anca-Roxana Tudoran.
