Chapter 1 — Reliable, Scalable, and Maintainable Applications

Welcome to the first step on the path to becoming a system design expert!



The main functionality a typical data system needs to provide:

  • Store data so that the same application, or another one, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

There are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka).

If you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.
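A minimal sketch of what "keeping the cache in sync" means in application code (the class and names here are hypothetical; plain dicts stand in for the database and for a Memcached-style cache):

```python
class UserService:
    def __init__(self, db, cache):
        self.db = db        # main database (a dict stands in here)
        self.cache = cache  # application-managed cache (a dict stands in here)

    def update_user(self, user_id, data):
        self.db[user_id] = data
        # The application, not the database, is responsible for keeping the
        # cache consistent: either update the entry in place or invalidate
        # it so the next read refills it from the database.
        self.cache.pop(user_id, None)

    def get_user(self, user_id):
        if user_id in self.cache:      # cache hit
            return self.cache[user_id]
        value = self.db[user_id]       # cache miss: read from the database...
        self.cache[user_id] = value    # ...and refill the cache
        return value
```

The same pattern applies to a full-text search index: every write path in the application must also update (or invalidate) the derived data.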


We can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero.

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
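A quick back-of-the-envelope check of that claim (plain arithmetic, no library assumptions):

```python
# Each disk fails at a rate of roughly 1/MTTF per year; multiply by the
# number of disks to get the expected cluster-wide failure rate.
disks = 10_000
for mttf_years in (10, 50):
    failures_per_day = disks / (mttf_years * 365)
    print(f"MTTF {mttf_years}y: ~{failures_per_day:.1f} disk failures/day")
# MTTF 10y: ~2.7 disk failures/day
# MTTF 50y: ~0.5 disk failures/day
```

So depending on where in the 10–50 year range the disks fall, the cluster sees somewhere between half a failure and a few failures per day, i.e. on the order of one per day.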

AWS’s virtual machine platforms are designed to prioritize flexibility and elasticity over single-machine reliability.


Scalability is about answering questions like: "If the system grows in a particular way, what are our options for coping with the growth?" and "How can we add computing resources to handle the additional load?"

The load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.

Two main features of Twitter as of 2012:

  1. Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
  2. Home timeline: A user can view tweets posted by the people they follow (300k requests/sec).

There are two ways of handling fan-out (described below), and Twitter switched from approach 1 to approach 2. Since the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, in this case it is preferable to do more work at write time and less at read time. Twitter eventually moved to a hybrid of both approaches.

  1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them.
  2. Maintain a cache for each user’s home timeline — like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. This approach doesn’t perform well for users with millions of followers.
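Approach 2 (fan-out on write) can be sketched in a few lines; the data structures and function names here are hypothetical, with in-memory dicts standing in for the timeline caches:

```python
from collections import defaultdict

followers = defaultdict(set)   # author -> set of follower ids
timelines = defaultdict(list)  # user -> cached home timeline ("mailbox")

def follow(follower, author):
    followers[author].add(follower)

def post_tweet(author, text):
    # Write-time fan-out: one insert per follower. Cheap for typical users,
    # expensive for celebrities with millions of followers — which is why
    # Twitter's hybrid handles those accounts with approach 1 instead.
    for f in followers[author]:
        timelines[f].append((author, text))

def home_timeline(user):
    return timelines[user]  # a read is just a cache lookup
```

The trade-off is visible in the code: `post_tweet` does O(followers) work, while `home_timeline` does O(1) work.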

In a batch processing system such as Hadoop, we usually care about throughput — the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled — during which it is latent, awaiting service.

Random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack, or many other causes.

Use the median response time instead of the average (arithmetic mean), because the mean doesn’t tell you how many users actually experienced a given delay. If the median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that. The median is also known as the 50th percentile.
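Computing a percentile is straightforward: sort the samples and pick the value p% of the way through. A minimal sketch using the nearest-rank method (function name is hypothetical):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(len(ranked) * p / 100))
    return ranked[idx]

times_ms = [30, 45, 50, 120, 200, 210, 250, 300, 900, 1500]
print(percentile(times_ms, 50))  # median response time
print(percentile(times_ms, 99))  # tail latency (p99)
```

Note how the p99 is dominated by the rare slow requests, which a mean would average away.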

An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time.

It only takes a small number of slow requests to hold up the processing of subsequent requests — an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.

When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time.
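This is the difference between an open-loop and a closed-loop load generator: if the client waits for each response before sending the next request, queues stay artificially short and the latency problems you are trying to measure are hidden. A minimal open-loop sketch (all names are hypothetical):

```python
import threading
import time

def generate_load(send_request, rate_per_sec, duration_sec):
    """Fire requests on a fixed schedule, independent of response times."""
    interval = 1.0 / rate_per_sec
    end = time.monotonic() + duration_sec
    next_fire = time.monotonic()
    while next_fire < end:
        # Fire-and-forget: the sender never blocks waiting for a response,
        # so slow responses cannot throttle the offered load.
        threading.Thread(target=send_request, daemon=True).start()
        next_fire += interval
        time.sleep(max(0.0, next_fire - time.monotonic()))
```

A thread per request is only for illustration; real load generators (and async runtimes) use far cheaper mechanisms, but the scheduling principle is the same.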

If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.
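A naive sketch of such a rolling window (a hypothetical monitoring helper; timestamps are passed explicitly so it can be tested deterministically):

```python
import time
from collections import deque

class RollingPercentiles:
    def __init__(self, window_sec=600):  # 10-minute window
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, response_time_ms), in arrival order

    def record(self, response_time_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, response_time_ms))

    def percentile(self, p, now=None):
        now = time.monotonic() if now is None else now
        # Evict samples that have aged out of the window.
        while self.samples and self.samples[0][0] < now - self.window_sec:
            self.samples.popleft()
        ranked = sorted(t for _, t in self.samples)
        if not ranked:
            return None
        idx = min(len(ranked) - 1, int(len(ranked) * p / 100))
        return ranked[idx]
```

Sorting the whole window on every read is fine for a sketch but wasteful at scale; production systems typically use approximation structures such as HdrHistogram or t-digest instead.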

In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.

The architecture of systems that operate at a large scale is usually highly specific to the application — there is no such thing as a generic, one-size-fits-all scalable architecture.


Three design principles for software systems:

  • Operability: Make it easy for operations teams to keep the system running smoothly.
  • Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as the simplicity of the user interface.)
  • Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.


An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some non-functional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter, we discussed reliability, scalability, and maintainability in detail.

Reliability means making systems work correctly, even when faults occur.

Scalability means having strategies for keeping performance good, even when load increases.

Maintainability has many facets, but in essence, it’s about making life better for the engineering and operations teams who need to work with the system.



