Scaling: Part 2 — Not scaling, scaling delays, and caching

Jonathan Parkin
BBC Product & Technology
Jan 27, 2022

In the first part of this blog mini-series we introduced some of the terms we use to discuss scaling software systems in the cloud. In this part we’ll be taking a bit of a deeper look at things we need to think about when scaling, then in the final part we will look at some of the pros and cons of a way to architect our systems to minimise our scaling headaches.

We’ll be focusing here on approaches to handling legitimate traffic that doesn’t follow a perfectly predictable pattern. Handling non-legitimate traffic, like Denial of Service (DoS/DDoS) attacks, should also be considered as part of system design.

Not scaling

In the first part we talked about using scaling as a mechanism for keeping us in the zone where costs are low and we’re handling all our work effectively. As developers it’s often a point of pride with us that our systems are responsive, resilient to all eventualities, and just generally work well. However we always need to bear in mind the cost of doing this. Adding in more complexity around scaling, backups, and so on comes at the cost of not doing other work. If the impact of your system being down is small — like some part of your website not updating for a little while — it might not be worth investing a lot of effort to make sure it’s always available. While we want to keep our failure rates low, it’s even more important to focus on giving the best overall user experience. Letting some pieces of our systems be less-than-perfect so we have time to focus elsewhere can be good engineering.

Instead of scaling it might be appropriate to look at other approaches to dealing with a lot of traffic. For example if you have a few heavy users that are taking up all your capacity and preventing you from serving more typical users, you might be able to use a rate limiting approach. This means giving each user a quota of how many requests they can make to you within a given time window and then cutting them off, preserving your service for the majority of your users. Alternatively it might be that if your system is running out of resources then instead of scaling out you could switch over to fallback content, returning a degraded but good-enough default response rather than your usual one. That might look like giving a user a list of the ten most popular things on your website, rather than ten things that are tailored to that user’s specific interests.
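As a very rough sketch of the first idea, here's a minimal fixed-window rate limiter in Python. The quota, window length, and client identifier are illustrative choices rather than anything from a real system.

```python
import time
from collections import defaultdict

class RateLimiter:
    """Fixed-window rate limiter: each client gets a quota per time window."""

    def __init__(self, quota: int = 100, window_seconds: int = 60):
        self.quota = quota
        self.window_seconds = window_seconds
        # (client_id, window number) -> requests seen in that window.
        # A real implementation would also expire old windows to bound memory.
        self.counts = defaultdict(int)

    def allow(self, client_id: str) -> bool:
        window = int(time.time() // self.window_seconds)
        self.counts[(client_id, window)] += 1
        return self.counts[(client_id, window)] <= self.quota

limiter = RateLimiter(quota=100, window_seconds=60)
if not limiter.allow("client-123"):
    print("429 Too Many Requests")  # cut off the heavy user, protect everyone else
```

A heavy user burns through their quota and gets turned away, while typical users never notice the limit is there.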

It’s also often possible to get scaling behaviours without doing it yourself. Cloud providers, and many other companies, offer managed services — anything from complete software systems to things like running a database for you, or “serverless” technologies for running your code. Managed services generally handle scaling for you. This does take a number of the concerns around scaling off your plate — but it also takes them out of your control. That can sometimes cause issues, rather than just solving them. For example, if you hand off responsibility for scaling a system that makes use of a database, and under the hood that system is horizontally scaled out to meet demand, your database now has an arbitrary number of incoming connections to handle. (This is a problem so common that services like AWS RDS Proxy have been created to help handle it.)

Delays are inevitable

Scaling is not a silver bullet for dealing with very rapid changes in traffic load. When it comes to dealing with a sudden influx of requests the only way to make sure you have capacity to deal with them is to already be scaled out to handle that capacity — or in other words to be over-provisioned for the work you were doing before that traffic spike. Doing that puts us outside the cost/benefit zone we use scaling to stay within. If you know in advance when a traffic spike is coming you can preemptively scale out to be ready for it, otherwise you’re faced with a choice. Do you set the metric thresholds that trigger your scaling actions so that you’re typically over-provisioned but have the capacity for handling spikes should they occur, or set your scaling thresholds for typical use and accept that if you experience a sudden spike in traffic you’ll be under-provisioned until your scaling can catch up?
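If the spike is known about in advance (a big episode landing at a fixed time, say), the preemptive scale-out can be automated. Here's a sketch assuming an AWS EC2 Auto Scaling group; the group name, capacities, and timing are all made up for illustration.

```python
from datetime import datetime, timezone
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical group and numbers: raise capacity half an hour before an
# expected 20:00 UTC spike. A matching action later would bring it back down.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="my-service-asg",
    ScheduledActionName="pre-spike-scale-out",
    StartTime=datetime(2022, 1, 27, 19, 30, tzinfo=timezone.utc),
    MinSize=20,
    DesiredCapacity=20,
    MaxSize=40,
)
```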

This is because of the delays that might come about when scaling. We mentioned pause time in the previous post; giving the system metrics a chance to reflect changes before you re-evaluate if you need more/less capacity. If this is too short you could end up scaling a second time before the first scaling action takes effect, but if it’s too long you might face a delay between scaling out once to deal with an increase in traffic and then scaling out a second time — meaning for a time you might still be under-provisioned and your users would suffer the consequences. This tends to encourage people to scale out in big steps, and to scale in in small steps, which errs on the side of being over-provisioned but serving the clients’ needs. The more sudden you expect your traffic increases to be, the larger your scale-out steps would normally be.
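Sticking with the AWS Auto Scaling assumption, "big steps out, small steps in" might be expressed as a pair of step scaling policies along these lines (names and numbers are illustrative):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out aggressively: add several machines at once when load rises,
# and even more if it rises a long way past the alarm threshold.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-asg",
    PolicyName="scale-out-big-steps",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 4},
        {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 8},
    ],
    EstimatedInstanceWarmup=300,  # roughly how long a new machine takes to become useful
)

# Scale in cautiously: remove one machine at a time when load falls.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-asg",
    PolicyName="scale-in-small-steps",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[{"MetricIntervalUpperBound": 0, "ScalingAdjustment": -1}],
)
```

Each policy is then attached to a CloudWatch alarm (one for high load, one for low) that decides when it fires, which brings us to the next source of delay.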

Another form of delay is metric collection time. Dynamic scaling actions happen as a result of metric changes, and metrics are gathered from logs and counters in our systems as a result of activity. No matter how rapidly you gather those metrics together, they always tell you about the past, so you’re responding to things that have already happened and guessing that the trend will continue in the near future. But how quickly should we respond? If you scale based off a metric that is collected once an hour, and one second after that hour you get an increase in traffic, your system won’t scale to handle it for another hour. On the other hand, if your metric is collected ten times a second, a brief blip might be enough to make you scale out unnecessarily. Typically the teams that Cloud Engineering works with choose a middle ground, using metrics that update at 5-minute or 1-minute intervals.
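To make that concrete, here's what a 1-minute metric might look like as a CloudWatch alarm, again assuming AWS and with the metric, threshold, and policy ARN as placeholders. Requiring two consecutive breaching periods means a single brief blip doesn't trigger a scale-out.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-service-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-service-asg"}],
    Statistic="Average",
    Period=60,            # metric evaluated over 1-minute intervals
    EvaluationPeriods=2,  # two breaching periods in a row before we act
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],  # placeholder: the scale-out policy's ARN
)
```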

Even when you’ve made the decision to scale out there are more delays! It takes time for your scaling action to request new machines and for your system to start. When we scale out we’re effectively ordering and then turning on new computers, so they have to go through their start-up processes and load our software. We have to wait until our system health checks report that the new machines are ready to handle requests before we pass user traffic to them. Depending on your technology stack there might be ways to minimise these delays, but the only way to avoid them is to already have a machine up and running before you need it — and if you don’t need it yet you’re over-provisioned. You also need to make sure your pause time is longer than these provisioning and start-up times, or you will end up taking a second scaling action before the first one has had time to take effect.

Diagram: the pause time needs to extend until the metrics have updated following a scaling action, so that we can decide whether we need to scale again.
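Those provisioning and start-up delays feed straight into configuration. Continuing the AWS sketch, the health check grace period and the group's cooldown (roughly the "pause time" discussed above) both want to be at least as long as your measured start-up time; the numbers below are purely illustrative.

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-service-asg",
    HealthCheckGracePeriod=300,  # don't judge a new machine's health until it has had time to boot
    DefaultCooldown=600,         # pause time: longer than provisioning + start-up + metric lag
)
```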

Where increases and decreases in traffic aren’t very sudden, or a sudden spike is anticipated and scaled for in advance, these delays don’t cause too much in the way of difficulty. But where systems need to cope with user behaviours shifting dramatically in a short space of time you can end up with an outage before your system can cope. Critical systems will often look at ways to degrade gracefully — like using fallback content — even though they hope always to scale quickly enough to meet user demand.

Caching

Caching of content is a key way to mitigate load on your system. Caching means keeping a copy of results for a while (a period known as the Time To Live, or TTL); if we’re asked for them again within that time we serve the saved copy rather than re-calculating it. As our systems don’t have to re-do work they consume fewer resources, and that means we have to scale out less. This can be particularly good for dealing with traffic spikes caused by a sudden burst of interest in a few particular items, for example when a new episode of a popular TV show is released.
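The mechanics are simple enough to sketch in a few lines of Python. This is a toy in-memory cache with a per-entry TTL; a production system would more likely lean on an HTTP cache or a shared store, and would bound its memory use.

```python
import time
from typing import Any, Callable

class TTLCache:
    """Toy in-memory cache: entries are served until they are ttl_seconds old."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries: dict[str, tuple[float, Any]] = {}  # key -> (expiry time, value)

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # still fresh: serve the saved copy
        value = compute()                      # expired or missing: do the work once
        self.entries[key] = (now + self.ttl, value)
        return value

# Even a 1-second TTL means a popular item is computed at most once per second,
# however many requests arrive for it in that second.
cache = TTLCache(ttl_seconds=1.0)
most_popular = cache.get_or_compute("most-popular", lambda: ["ep1", "ep2", "ep3"])
```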

It’s often worth caching even if you need to use low TTLs. If you have popular items that take a lot of traffic then caching them for even a single second can reduce the load on your system. It’s worth remembering that for a great deal of data most humans won’t even notice update delays of seconds, minutes, or even longer. There’s also a lot of data that just doesn’t change very often — like company logos — where you could likely have a cache TTL of weeks or more.

Many caches are shared by their systems across all requests, and so across all users, which is fine for data that is the same for everyone — but what about data that is specific to one user? It can still be worth caching: perhaps the same piece of a user’s data is used to build multiple website pages that they’ll rapidly be clicking through, or they’re browsing content on a home page, dipping in and out of articles and re-visiting the same page over and over again. In both cases caching user-specific data is worthwhile.

It might be possible to do caching on user devices. For example your system might return a web page that’s specific to a user, and so can’t be effectively cached within your system. In such cases you might send the web page back with a Cache-Control: private header, indicating that the response may be cached in private places like the user’s own web browser, but not in any shared caches. This can be useful when caching personalised experiences, or other views of a user’s personal data.
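As an illustration, using Flask purely as an example framework (the route, payload, and max-age are made up), a personalised response can be marked as cacheable only in private caches such as the user's browser:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/homepage/<user_id>")
def personalised_homepage(user_id: str):
    # Imagine this is expensive, user-specific work.
    page = {"user": user_id, "recommendations": ["item-1", "item-2"]}
    response = jsonify(page)
    # Cache in the user's own browser for 60 seconds, never in shared caches.
    response.headers["Cache-Control"] = "private, max-age=60"
    return response
```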

Next time

Here we’ve looked at ways to avoid scaling, at how the delays inherent in dynamic scaling make sudden traffic increases difficult to scale for, and at how caching can reduce the load on our systems. Next time we’ll look at taking an asynchronous static publishing approach to help sidestep scaling issues, by organising our systems to take caching ideas one step further.
