Counting is easy… until it’s hard

Philip Kendall
Published in disney-streaming
Jan 25, 2021 · 7 min read
Image from Pixabay user Deedster

At Disney Streaming, we’re passionate about bringing the world’s greatest stories to our global Disney+ subscribers in the highest quality possible. One of the ways we do this is to have various kinds of monitoring in place that let us tell efficiently whether our subscribers, or a subset of them, are having trouble receiving our content. That way we can get on top of any issues and resolve them, hopefully before our subscribers are even aware there’s a problem.

Counting is easy

As an example of the kind of monitoring we have for Disney+, we want to know how many of our subscribers are watching our content at any given moment in time. Oversimplifying this a lot, we can do this by thinking about a bucket — every time we see a subscriber request some of our content, we put a token in the bucket. After a minute, we look at how many tokens there are in the bucket, and then start a new bucket for the next minute; for example if it’s currently 09:03:30, the recent buckets would look something like this:

Token buckets separated by time

In this scenario, the 09:00–09:01, 09:01–09:02 and 09:02–09:03 buckets are “closed” while the 09:03–09:04 bucket is “open” and still receiving new tokens. Whilst we’ve got a lot of subscribers watching our content at once, computers are pretty fast these days so they can increment the count for each bucket very quickly. This can be done in real time as the new tokens are received, so doesn’t require any processing at the end of each minute.
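As a rough sketch of the idea (the class and method names here are mine, not the production system’s), the real-time work per token is a single counter increment, keyed by the minute the request arrived in:

```python
from collections import defaultdict
from datetime import datetime, timezone

class MinuteBuckets:
    """Count one token per content request, bucketed by minute."""

    def __init__(self):
        self.buckets = defaultdict(int)  # minute -> token count

    def record_request(self, when: datetime) -> None:
        # Incrementing a counter is all the real-time work there is;
        # no end-of-minute processing is needed for a single count.
        key = when.replace(second=0, microsecond=0)
        self.buckets[key] += 1

    def count(self, minute: datetime) -> int:
        return self.buckets[minute.replace(second=0, microsecond=0)]

counter = MinuteBuckets()
t = datetime(2021, 1, 25, 9, 3, 30, tzinfo=timezone.utc)
counter.record_request(t)                      # lands in the 09:03-09:04 bucket
counter.record_request(t.replace(second=45))   # same bucket
```

Truncating the timestamp to the minute means the “open” bucket is simply whichever key the current time maps to; older keys are automatically “closed”.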

After a few minutes, we can look at the number of tokens in each bucket, and if the number of tokens in the most recent bucket is significantly less than in the previous few buckets, that indicates that we may well have an issue that we need to resolve in order to ensure that we’re delivering the content in the way we want to.
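A minimal version of that check might compare the newest closed bucket against the average of the earlier ones; the 50% threshold here is an arbitrary illustration, not our real alerting logic:

```python
def looks_like_a_drop(closed_counts, threshold=0.5):
    """Flag a possible incident when the newest closed bucket holds
    significantly fewer tokens than the average of the earlier ones."""
    *earlier, latest = closed_counts
    baseline = sum(earlier) / len(earlier)
    return latest < threshold * baseline

# 09:00-09:03 buckets were steady; the 09:03-09:04 bucket dropped sharply.
assert looks_like_a_drop([1000, 980, 1010, 400]) is True
assert looks_like_a_drop([1000, 980, 1010, 990]) is False
```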

Counting gets harder

Unfortunately, monitoring isn’t as simple as that in the real world. If an issue affects only a small fraction of our subscribers, it can still give a large number of people a bad experience while showing up as less than a 1% drop in our overall numbers. That said, most issues we see affect well-defined subsets of our subscribers, which gives us a way forward. Again oversimplifying, instead of just one bucket we could create two: one for “North American subscribers” and another for “subscribers outside North America”. When we see a subscriber requesting content, we put a token into the appropriate bucket, and at the end of each minute we look at how many tokens there are in each. Now we have twice as much work to do, as we have to count the tokens in two different buckets, but computers are still fast so that’s OK. Our buckets would now look something like this:

Token buckets separated by time and geography

At the end of every minute, we can calculate the total number of tokens simply by adding up the counts in each bucket for that minute. This does require some “end of minute” processing, but it’s more efficient than incrementing two counts (the geography-specific bucket and the total) every time a token arrives, at least once you have more than two buckets per minute.
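Sketching that trade-off in Python (again with illustrative names): each token still costs a single increment, and the total is derived once per minute by summing over that minute’s buckets:

```python
from collections import defaultdict

# One bucket per (minute, region) pair; the total is derived at the
# end of the minute rather than maintained on every token.
buckets = defaultdict(int)

def record(minute: str, region: str) -> None:
    buckets[(minute, region)] += 1  # single increment per token

def minute_total(minute: str) -> int:
    # End-of-minute processing: one pass over that minute's buckets.
    return sum(n for (m, _region), n in buckets.items() if m == minute)

record("09:03", "north-america")
record("09:03", "north-america")
record("09:03", "rest-of-world")
```

With only two regions the saving is negligible, but the same code works unchanged when “region” becomes 200,000 zip codes, which is exactly where summing at the end of the minute starts to matter.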

Now, while splitting into North America and outside is a good start, it’s not specific enough to spot a lot of our issues, so instead of splitting into just two buckets, we have approximately 200,000 buckets, one for each of the zip codes we see from our subscribers. This makes the end of minute processing non-trivial in that we need to sum the counts from those 200,000 buckets, but computers are really quite fast these days so we can still manage that without any difficulty.

Counting gets really quite hard

As well as geographically based issues, we can also see issues which affect subscribers using a specific device model. On its own, this wouldn’t present a problem, as we see around 10,000 different devices from our subscribers (there are a lot of Android devices out there…) and we can sum 10,000 buckets without difficulty.

The problem comes when we want to be able to determine both geographical issues and device-specific issues. Naively, there are a couple of approaches we could take:

  1. Create one bucket for every combination of zip code and device. In theory this is fine, but this would mean that we would have 200,000 x 10,000 = 2 billion buckets. Now, while our computers are fast, they’re not fast enough to count 2 billion buckets every minute.
  2. Create a separate set of buckets for each of the zip codes and the devices. This would give us 200,000 + 10,000 = 210,000 buckets which isn’t an issue, but we would now have to count every token twice. While we could manage that, we actually want to split our data by more than 30 different properties which would mean we’d have to count every token 30 times, which is inefficient to say the least. This scheme would also not let us spot issues which are specific to a device and a location, which can happen.

In reality, what we did was to try approach 1, but it didn’t give us anything like the performance we needed — while we didn’t actually have 2 billion buckets every minute (we don’t see every device in every zip code every minute), by having so many buckets to sum, we couldn’t give a quick answer to the simple and important question “how many tokens are there in total?”

Taking a different approach

OK, so if we can’t count every bucket we want to count (because there are too many buckets) and we can’t count each dimension separately (because that means counting everything 30+ times), is there anything we can do?

To best see how this can work, we’ll have to expand the set of dimensions we’re looking at a bit. As well as the “problematic” dimensions with high cardinality like zip code and device model, we’ll take some easier ones:

  • country: which of the ~50 countries where Disney+ is available is the subscriber in?
  • platform: a high-level indicator of the type of device in use (Apple, Android, Web, Xbox, etc)
  • subtitles: a simple boolean indicating if subtitles were in use

When we count tokens, we now count each token twice, but in slightly different ways:

  1. We look at the full set of dimensions available, creating the very large number of buckets.
  2. We ignore the zip code and device model dimensions. As there are far fewer possible values for the other dimensions, this means that we have a much smaller number of buckets — in our case, a few thousand which is easily within our ability to sum quickly.

This now lets us answer all sorts of queries efficiently. For queries which don’t involve zip code or device model, e.g. “how many tokens in total?” or “how many tokens in Australia?”, we can use the small number of “Type 2” buckets and get an answer quickly. But we can still answer more specific queries like “how many tokens on a Google Pixel 4?” by looking at the 200,000 or so “Type 1” buckets which contain Google Pixel 4 tokens.
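The double-counting scheme can be sketched as follows. This is my own toy illustration, not the production code: tokens are dicts of dimension values, and a query routes to the small Type 2 table unless it filters on a high-cardinality dimension:

```python
from collections import defaultdict

# Illustrative dimensions only; the real system splits on 30+.
HIGH_CARDINALITY = ("zip_code", "device_model")

type1 = defaultdict(int)  # full-dimension buckets (many, cheap to update)
type2 = defaultdict(int)  # rollup buckets without zip code / device (few)

def record(token: dict) -> None:
    # Each token is counted twice: once with every dimension...
    type1[tuple(sorted(token.items()))] += 1
    # ...and once with the high-cardinality dimensions dropped.
    rollup = {k: v for k, v in token.items() if k not in HIGH_CARDINALITY}
    type2[tuple(sorted(rollup.items()))] += 1

def query(filters: dict) -> int:
    # Cheap path: no high-cardinality filter, so the small Type 2
    # table is enough; otherwise fall back to the Type 1 buckets.
    table = type1 if any(k in HIGH_CARDINALITY for k in filters) else type2
    return sum(n for key, n in table.items()
               if all(dict(key).get(k) == v for k, v in filters.items()))

record({"country": "AU", "platform": "Android",
        "zip_code": "2000", "device_model": "Pixel 4"})
record({"country": "AU", "platform": "Apple",
        "zip_code": "2001", "device_model": "iPhone 11"})
```

A query like `query({})` (“how many tokens in total?”) or `query({"country": "AU"})` touches only the tiny Type 2 table, while `query({"device_model": "Pixel 4"})` pays the cost of scanning Type 1, which is acceptable because such specific queries are rarer.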

At a slightly more abstract level, the “Type 2” buckets partially precompute the results for what we know are some of the more common queries we expect to answer. The trick here was to get the balance right between the additional up-front work to precompute the data for the “Type 2” buckets and the work that has to be done for every query.

Does this approach work? Yes. When we deployed this in our monitoring infrastructure inside Disney Streaming, it gave us a more than 10x improvement in query speed for some of our really important use cases, like “total number of tokens”.

Technology and conclusions

I’ve deliberately avoided mentioning specific technology choices in this article. Whilst we at Disney Streaming certainly do have our favourite technologies, the improvements from partial pre-computation are independent of the technology in use. We have spent a reasonable amount of time optimising the data pipeline we use, and though that’s important and has produced decent results, sometimes you just run up against the fundamental limits of how fast you can perform computations. At that point you need to step back and look at the algorithms you’re using, rather than relying on marginal gains from optimisations.

ZX Spectrum geek, boardgamer, photographer, runner and Staff Engineer @disney-streaming. Views here are personal.