Design Patterns and Principles That Support Large Scale Systems
Today even small startups may have to work with terabytes of data or build services that support hundreds of thousands of events per minute (or even per second!). By “scale”, we usually refer to a large number of requests/data/events that the system has to handle in a short time period.
Trying to implement services that need to handle large scale in a naive way is, at best, very costly and, at worst, doomed to failure.
This article will describe a few principles and design patterns that enable systems to handle large scale. When we discuss large-scale (and mostly also distributed) systems we usually judge how good or stable they are by looking at three attributes:
- Availability: a system should be available as much as possible. Uptime percentage is key for customer experience, not to mention that an application is not useful if no one can use it. Availability is measured in “nines” (e.g. 99.9% uptime is “three nines”).
- Performance: a system should continue to function and perform its tasks even under heavy loads. Furthermore, speed is critical for customer experience: experiments show that it’s one of the most important factors in preventing churn!
- Reliability: a system should accurately process data and return a correct result. A reliable system does not fail silently or return incorrect results or create corrupted data. A reliable system is architected in a way that puts effort into avoiding failures, and when it’s not possible it will detect, report, and maybe even try to auto-remediate them.
We can scale a system in two ways:
- Scale vertically (scale-up): deploy the system on a stronger server, i.e. a machine with a stronger CPU, more RAM, or both
- Scale horizontally (scale-out): deploy the system on more servers, meaning, spin up more instances or containers that enable the system to serve more traffic or process more data/events
Scaling up is usually less preferable, mainly for two reasons:
- it usually requires some downtime
- there are limits (we can’t scale up “forever”)
On the other hand, in order to be able to scale a system out, it has to have certain characteristics that allow this kind of scaling. For example, in order to be scaled horizontally, a system has to be stateless (which is why, e.g., most traditional databases can’t easily be scaled out).
This article’s intention is to give you a taste of many different design patterns and principles that enable systems to scale out while maintaining reliability and resiliency. Because it covers so much ground, I wasn’t able to go deep into every topic but rather just provide an overview. That said, for each subject I tried to add useful links to more thorough resources.
So let’s dig in!
Idempotency
This term is borrowed from mathematics, where it’s defined as:
f(f(x)) = f(x)
This may look intimidating at first, but the idea behind it is simple: no matter how many times we apply the function to x, we’ll get the same result. This property provides great stability to a system because it allows us to simplify code and it also makes our operational life easier: an HTTP request that fails can safely be retried, and a process that crashed can be restarted without worrying about side effects.
Furthermore, a long-running job can be split into multiple parts each of which can be idempotent on its own, meaning, when the job crashes and gets restarted, all the parts that already got executed will get skipped (resumability).
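The idea can be sketched in a few lines. This is a minimal illustration (the event IDs, account store, and function names are hypothetical, not from any specific library): each event carries a unique ID, and an event whose ID was already handled becomes a no-op, so replays and retries cause no extra side effects.

```python
# A minimal sketch of idempotent event processing: duplicate deliveries
# of the same event (e.g. a retried HTTP request or a restarted job)
# are detected by their unique ID and skipped.
processed_ids = set()
balances = {}

def apply_deposit(event_id, account, amount):
    """Apply a deposit exactly once, no matter how often it's delivered."""
    if event_id in processed_ids:          # duplicate delivery: no-op
        return balances[account]
    balances[account] = balances.get(account, 0) + amount
    processed_ids.add(event_id)
    return balances[account]
```

Calling `apply_deposit("evt-1", "acct-7", 100)` twice credits the account only once, which is exactly what makes blind retries safe.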
Asynchronicity
When we make a synchronous call, the execution path is blocked until a response is returned. This blocking has a resource overhead, mainly the cost of memory and context switches. We can’t always design our systems using async calls only, but when we can, we make our system more efficient. One example that shows how async provides good efficiency/performance is Node.js, which has a single-threaded event loop yet holds its own against many multi-threaded languages and frameworks.
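To make the single-threaded-concurrency point concrete, here is a minimal sketch using Python’s `asyncio` (the `fake_io_call` coroutine is a stand-in for a real network round-trip): three I/O-bound calls run concurrently on one thread, so the total time is close to the slowest call rather than the sum.

```python
import asyncio

# While one coroutine awaits I/O, the event loop runs the others,
# so three 0.1s "calls" complete together in roughly 0.1s total.
async def fake_io_call(name, delay):
    await asyncio.sleep(delay)   # stands in for a network round-trip
    return name

async def main():
    return await asyncio.gather(
        fake_io_call("a", 0.1),
        fake_io_call("b", 0.1),
        fake_io_call("c", 0.1),
    )

if __name__ == "__main__":
    print(asyncio.run(main()))
```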
Healthcheck
This pattern is specific to microservices: every service should implement a /health route which returns very quickly after running a quick sanity check of the system. Assuming everything’s okay, it should return HTTP code 200; if the service is malfunctioning, it should return a 500 error. Some errors will not be caught by a healthcheck, but assuming that a system under stress will function poorly and become latent, the healthcheck will become more latent as well, which can help us identify that there’s an issue and auto-generate an alert that an on-call engineer can pick up. We may also choose to temporarily remove the node from the fleet (see Service Discovery below) until it becomes stable again.
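A healthcheck handler can be sketched framework-independently like this (the check functions are hypothetical placeholders, not any specific framework’s API): run a few cheap sanity checks and map the aggregate result to an HTTP status code.

```python
# A minimal /health handler sketch: each check is a quick, cheap probe;
# the handler returns 200 only when all of them pass.
def check_database():
    return True   # placeholder: e.g. run "SELECT 1" against the DB

def check_disk_space():
    return True   # placeholder: e.g. verify free space above a threshold

def health():
    checks = {"database": check_database(), "disk": check_disk_space()}
    if all(checks.values()):
        return 200, {"status": "ok", **checks}
    return 500, {"status": "error", **checks}
```

In a real service this function would be wired to the `/health` route of whatever web framework is in use.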
Circuit Breaker
Circuit breaker is a term borrowed from the electricity domain: when a circuit is closed the electricity flows, and when a circuit is open the flow stops.
When a dependency is not reachable, all the requests to it will fail. According to the Fail Fast principle, we’ll prefer our system to fail fast when we try to make the call and not wait until there’s a timeout. This is a great use-case for Circuit breaker design pattern: by wrapping a call to a function with a circuit-breaker, the circuit-breaker will identify when calls to a specific destination (e.g. a specific IP) fail, and will start failing the calls without really making them, allowing the system to fail-fast.
The circuit-breaker will hold a state (open/close) and will refresh its state by retrying an actual call every interval of time.
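The mechanics described above can be sketched as a small wrapper class (a simplification, not a production implementation; libraries like resilience4j or pybreaker provide hardened versions): after a threshold of consecutive failures the circuit opens and calls fail fast, and after a timeout one trial call is allowed through to probe whether the dependency recovered.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call fails fast."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast: circuit is open")
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0              # success closes the circuit again
        return result
```

Once the breaker is open, callers get an immediate `CircuitOpenError` instead of waiting on a doomed network call, which is the fail-fast behavior described above.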
Kill Switch / Feature Flag
Another common practice today is to perform a “silent deploy” of new features. It is achieved by gating the feature with a conditional if that checks whether the feature flag is enabled (or alternatively, by checking that the relevant kill-switch flag is disabled). This practice doesn’t provide a 100% guarantee that our code is bug-free, but it does reduce the risk of deploying a new bug to production. Furthermore, if we enable the feature flag and see new errors in the system, it’s easy to disable the flag and “go back to normal”, which is a huge win from an operational perspective.
Bulkhead
A bulkhead is a dividing wall or barrier between compartments at the bottom of a ship. Its job is to isolate an area in case there’s a hole in the hull, so that water floods only the compartment in which the hole was created rather than the whole ship.
The same principle can be applied to software by building it with modularity and segregation in mind. One example is thread pools: we can create different thread pools for different components to ensure that a bug that exhausts all the threads in one of them will not affect the others.
Another good example is ensuring that different microservices will not share the same DB. We also avoid sharing configuration: different services should have their own configuration settings, even if it requires some kind of duplication, in order to avoid a scenario where a configuration bug in one service affects a different service.
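The thread-pool example above can be sketched in a few lines (the “payments” and “recommendations” components are hypothetical): each component gets its own pool, so even if every task in one pool hangs and exhausts its workers, the other pool still has free threads to serve its own work.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools act as bulkheads between components.
payments_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments")
recommendations_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="recs")

def handle_payment(order_id):
    # A runaway payment task can only exhaust payments_pool's 4 threads.
    return payments_pool.submit(lambda: f"charged order {order_id}")

def handle_recommendation(user_id):
    return recommendations_pool.submit(lambda: f"recs for user {user_id}")
```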
Service Discovery
In the dynamic world of microservices, where instances/containers come and go, we need a way to know when a node joins or leaves the fleet. Service discovery (also called a service registry) is a mechanism that solves this problem by allowing nodes to register in a central location, like the yellow pages. This way, when service B wants to call service A, it first asks the service discovery for a list of available nodes (IPs), which it caches and uses for a period of time.
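A toy registry makes the register/lookup flow concrete (this is a sketch only; in practice the registry is a separate, highly available system such as Consul, etcd, or ZooKeeper): nodes register with a TTL and must periodically re-register (heartbeat) to stay listed, so crashed nodes age out of the results.

```python
import time

class ServiceRegistry:
    """Minimal in-memory service registry with heartbeat-based expiry."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self.entries = {}   # service name -> {address: last_heartbeat}

    def register(self, service, address):
        # Called both on startup and as a periodic heartbeat.
        self.entries.setdefault(service, {})[address] = time.monotonic()

    def lookup(self, service):
        # Return only addresses whose heartbeat is fresher than the TTL.
        now = time.monotonic()
        nodes = self.entries.get(service, {})
        return [addr for addr, seen in nodes.items() if now - seen < self.ttl]
```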
Timeouts, Sleep & Retries
Any network may suffer from transient errors, delays, and congestion issues. When service A calls service B, the request may fail, and if a retry is initiated the second request may pass successfully. That said, it’s important not to implement retries in a naive way (a loop) without “baking” into it a mechanism of delay (also called “sleep”) between retries. The reason is that we should be considerate of the called service: there may be multiple other services calling service B simultaneously, and if all of them just keep retrying, the result will be a “retry storm”: service B will get bombarded with requests, which may overwhelm it and bring it down. To avoid a “retry storm”, it’s common practice to use an exponential backoff retry mechanism, which introduces an exponentially growing delay between retries and an eventual “time-out” which stops any additional retry.
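An exponential backoff helper can be sketched like this (a simplification; mature libraries such as tenacity offer richer policies). Adding random jitter on top of the exponential delay spreads out retries from many clients so they don’t all hit the dependency at the same instant:

```python
import random
import time

def call_with_retries(func, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry func() with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise                         # out of attempts: give up
            # Delay doubles each attempt (0.1s, 0.2s, 0.4s, ...) up to a cap,
            # and the random factor de-synchronizes competing clients.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.random())
```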
Fallback
Sometimes we just need a “plan B”. Say we’re using a recommendation service in order to get the best and most accurate recommendations for a customer. But what can we do when the service goes down or becomes unreachable for a few moments?
We can have another service as a fallback: that other service may keep a snapshot of recommendations of all our customers, refresh itself every week and when it’s called all it needs to do is return the relevant record for that specific customer. This kind of information is static and easy to cache & serve. These fallback recommendations are a bit stale, true, but it’s much better to have recommendations that are not perfectly up-to-date than not having anything at all.
A good engineer considers such options when architecting their system!
Note that a circuit-breaker implementation may include an option to serve fallbacks!
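The recommendation fallback described above can be sketched like this (the snapshot, defaults, and simulated outage are all illustrative): try the live service first, and on any failure serve the stale-but-useful snapshot instead of returning nothing.

```python
# Fallback data: a weekly snapshot per user, plus generic defaults.
WEEKLY_SNAPSHOT = {"user-42": ["movie-a", "movie-b"]}   # refreshed weekly
DEFAULT_RECS = ["movie-popular-1", "movie-popular-2"]

def get_live_recommendations(user_id):
    # Simulate the primary service being down.
    raise ConnectionError("recommendation service unreachable")

def get_recommendations(user_id):
    try:
        return get_live_recommendations(user_id)
    except Exception:
        # Plan B: slightly stale recommendations beat no recommendations.
        return WEEKLY_SNAPSHOT.get(user_id, DEFAULT_RECS)
```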
Metrics, Monitoring & Alarms
When operating a large-scale system, it’s not a question of if the system will fail but rather a question of when it will fail: at high scale, even something so rare that it happens only once in a million events will eventually happen.
Now that we understand and accept errors as “part of life”, we have to figure out the best ways to deal with them.
In order to have a reliable system that is also available, we need to be able to detect (MTTD, mean time to detect) and fix (MTTR, mean time to recover) errors quickly, and for that, we need to gain observability into our system. This can be achieved by publishing metrics, monitoring these metrics, and raising an alarm whenever our monitoring system detects a metric that “goes off”.
Google defines 4 metrics as golden signals (latency, traffic, errors, and saturation), but that doesn’t mean we shouldn’t publish other metrics as well. We can categorize metrics into 3 buckets:
- Business metrics: a metric that is derived from a business context, for example, we may publish a metric every time an order is placed, approved, or canceled
- Infrastructure metrics: a metric that measures size/usage of part of our infrastructure, for example, we can monitor the CPU usage, memory, and disk space used by our application
- Feature metrics: a metric that publishes information about a specific feature in our system. An example can be a metric that is published on an A/B test we’re running to provide insight about users that are allocated to different cells of the experiment
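Publishing such metrics can be as simple as incrementing named counters; here is a minimal sketch using an in-memory counter (in a real system `emit` would forward to a metrics backend such as statsd, Prometheus, or CloudWatch, and the metric name is illustrative):

```python
from collections import Counter

metrics = Counter()

def emit(metric_name, value=1):
    # Stand-in for a real metrics client; monitoring graphs and alarms
    # would be built on top of these counters.
    metrics[metric_name] += value

def place_order(order_id):
    emit("orders.placed")   # a business metric, per the first bucket above
    return order_id
```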
Small anecdote: back in the days when I worked at Netflix, one of the things my team and I did was develop Winston, to enable teams to have their services auto-heal from known scenarios by creating programmatic runbooks!
Rate-limiting
Rate-limiting or throttling is another useful pattern that helps reduce stress on the system.
Throttling can be applied at different levels: per client/user (limiting how many requests each caller can issue in a timeframe), per endpoint, or globally across the whole service (load shedding).
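One common way to implement rate-limiting is the token bucket, sketched here as a simplification (not production-ready; it ignores thread-safety): tokens refill at a fixed rate up to a burst capacity, each request spends one token, and when the bucket is empty the request is rejected (or queued/delayed).

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter."""

    def __init__(self, rate, capacity):
        self.rate = rate               # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        # Refill tokens according to how much time has passed.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1           # spend a token for this request
            return True
        return False                   # over the limit: reject/throttle
```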
Backpressure
Backpressure is a technique used to handle a situation where requests arrive from an upstream service at a higher rate than can be handled. One way of dealing with backpressure is signaling the upstream service that it should rate-limit itself.
There’s a dedicated HTTP response code, 429 “Too Many Requests”, that’s designed to signal the client that the server is not ready to accept more requests at the current rate. Such a response is usually returned with a Retry-After header to indicate how many seconds the client should wait before it retries.
Two other ways to handle backpressure are throttling (sometimes described as “throwing requests on the floor”, i.e. dropping them) and buffering.
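A well-behaved client honors these signals; here is a minimal sketch of a caller that respects 429 + Retry-After (`send` is a stand-in for a real HTTP call returning a status, headers, and body):

```python
import time

def call_with_backpressure(send, max_attempts=3):
    """Retry on 429, sleeping for the server-suggested Retry-After."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Honor the server's pacing hint instead of hammering it.
        retry_after = float(headers.get("Retry-After", 1))
        time.sleep(retry_after)
    return status, body   # still throttled after all attempts
```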
Canary Release
Canary testing is a technique used to roll out changes gradually to production: the new version is first deployed to a small subset of nodes (the “canaries”). When the monitoring system catches an issue, the canaries are rolled back automatically with minimal damage to production traffic.
Keep in mind that in order to enable canary release we need to be able to monitor the canary cluster separately from the “normal” nodes, then we can use the “regular” fleet of nodes as a baseline and compare it with the metrics we receive from the canaries. For example, we can compare the rate of 500 errors we receive in both and if the canaries produce a higher rate of errors we can roll it back.
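The baseline-vs-canary comparison above can be sketched as a simple rollback decision (the tolerance factor is illustrative; real canary analysis is usually statistical and looks at more signals than just 500s):

```python
def should_roll_back(baseline_errors, baseline_requests,
                     canary_errors, canary_requests, tolerance=1.5):
    """Roll back if the canary's 500-error rate is meaningfully worse
    than the baseline fleet's rate."""
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    return canary_rate > baseline_rate * tolerance
```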
There’s also a more conservative method of doing canaries using shadow traffic from production.
That’s it for today folks, I hope you were able to learn something new!
If you think there’s an important pattern/principle I missed — please write a comment and I’ll add it.