Designing Data-Intensive Applications: The Essentials of Reliability, Scalability, and Maintainability — Part 1

Archana Vellanki
7 min read · May 19, 2024


In today’s digital landscape, applications are becoming increasingly data-intensive, no longer bound by raw CPU power but by the vast amounts and complexities of data they process. This evolution from simple server setups to intricate architectures requires a deep understanding of what makes systems reliable, scalable, and maintainable.

Inspired by Martin Kleppmann’s “Designing Data-Intensive Applications,” this series of articles explores the core concepts that underpin the modern data systems supporting large-scale applications.

The first chapter discusses Scalability, Reliability, and Maintainability. Because it covers so much ground, I've split it into a two-part article. In Part 1, I dive into Scalability and explore practical strategies for handling real-world load.


Why Data Systems Matter

As applications evolve, the conventional boundaries between databases, caches, message queues, and search indexes blur. These components often converge into what we now refer to as "data systems". Typical examples include Redis, which is used both as a data store and as a message queue, and Apache Kafka, which combines message queuing with database-like durability guarantees.

1. Scalability

Scalability is the system’s ability to handle increased loads without compromising performance. It requires careful planning and the right architectural choices to distribute data and requests efficiently across a growing infrastructure.

1.1. Describing Load

Understanding the load on a system is crucial for scaling effectively. Load parameters vary by type of service:

  • Web Server: Measured in requests per second.
  • Database: Ratio of read operations to write operations.
  • Cache: Hit rate, which reflects how often data requested is found in the cache.
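As a rough illustration, these load parameters can be computed from a simple request log. The log format and field names below are invented for the example, not taken from any particular system:

```python
# A hypothetical request log collected over a 60-second window.
# Each entry is (service, operation, cache_hit); all values are illustrative.
log = [
    ("web", "GET /home", None),
    ("db", "read", None),
    ("db", "read", None),
    ("db", "write", None),
    ("cache", "get", True),
    ("cache", "get", False),
    # ... many more entries in a real system
]

window_seconds = 60

web_requests = sum(1 for svc, _, _ in log if svc == "web")
db_reads = sum(1 for svc, op, _ in log if svc == "db" and op == "read")
db_writes = sum(1 for svc, op, _ in log if svc == "db" and op == "write")
cache_lookups = [hit for svc, _, hit in log if svc == "cache"]

print("Web server: requests/sec =", web_requests / window_seconds)
print("Database: read/write ratio =", db_reads / max(db_writes, 1))
print("Cache: hit rate =", sum(cache_lookups) / max(len(cache_lookups), 1))
```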

1.2. Case Study: Twitter’s Scalability Evolution

Twitter's scalability challenges provide a practical framework for understanding how real-world systems adapt over time. Initially, Twitter struggled not with the sheer volume of tweets but with the "fan-out" problem. Fan-out refers to the number of requests to other services that must be made to serve one incoming request: a single action like posting a tweet could overwhelm the system because the home feeds of all the author's followers had to be updated.

Evolution of Twitter’s Architecture:

  1. Global Collection Approach (Pre-2012): Originally, tweets were inserted into a single global collection. Each time a user opened their home timeline, the system had to look up everyone they follow, fetch all of those users' tweets, and merge them in time order.
  2. Cached Timeline Approach (Post-2012): Twitter switched to maintaining a cached home timeline for each user, making reads very fast. When a user posts a tweet, the system inserts the new tweet into the cached timeline of every follower, similar to adding a letter to multiple mailboxes.
Twitter’s Data Pipeline for delivering tweets to followers, as of November 2012

  3. Hybrid Approach: Today, Twitter uses a hybrid model: tweets from most users are fanned out to their followers' cached timelines at write time, while tweets from high-profile accounts with huge followings, which would cause an enormous fan-out, are handled with the global collection model and merged in at read time.
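A minimal sketch of the first two approaches, using plain in-memory Python dictionaries as stand-ins for Twitter's real storage layer (all names and data structures here are illustrative, not Twitter's actual APIs):

```python
from collections import defaultdict

tweets = []                      # the "global collection": (author, text) pairs
followers = defaultdict(set)     # author -> set of follower ids
timelines = defaultdict(list)    # user -> cached home timeline

# Approach 1: fan-out on read (the pre-2012 global collection).
# Reads are expensive: every home-timeline request scans and filters.
def home_timeline_on_read(following):
    return [t for t in tweets if t[0] in following]

# Approach 2: fan-out on write (the post-2012 cached timelines).
# Writes do the heavy lifting; reads just return a precomputed list.
def post_tweet(author, text):
    tweets.append((author, text))
    for follower in followers[author]:               # like adding a letter
        timelines[follower].append((author, text))   # to each mailbox

def home_timeline_cached(user):
    return timelines[user]

followers["alice"] = {"bob", "carol"}
post_tweet("alice", "hello world")
print(home_timeline_cached("bob"))   # [('alice', 'hello world')]
```

The trade-off is visible in the code: with cached timelines, the cost of posting a tweet grows with the author's follower count, which is exactly why accounts with millions of followers are handled differently in the hybrid model.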

1.3. Techniques for Handling Increased Load

1. Sharding: Dividing data into smaller, manageable parts across multiple servers to distribute the workload. This allowed Twitter to spread the load across many machines and handle the growing volume of data (a simple sketch of hash-based sharding follows this list).

2. Caching: Storing frequently accessed data in temporary storage to reduce database load. By caching frequently accessed data, Twitter reduced the number of database queries and improved performance.

3. Decoupling: Separating services into distinct components that can scale independently improves the system's ability to manage different types of load. The home timeline service was separated from other services such as the post-tweet service, so it could be scaled on its own without being overloaded by traffic meant for other services.

4. Fault Tolerance in Post Services: Enhancing reliability with techniques such as write-ahead logs to ensure data integrity and quick recovery from failures. The post-tweet service records every incoming request in a write-ahead log, allowing Twitter to recover from failures quickly and reducing the risk of data loss.
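As a concrete, if simplified, example of sharding, a hash-based partitioning function spreads users across database servers. The server names and scheme below are purely illustrative, not Twitter's actual setup:

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]   # hypothetical database servers

def shard_for(user_id: str) -> str:
    # Hash the key so users spread roughly evenly across shards;
    # the same user always maps to the same server.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"), shard_for("bob"))
```

Note that naive modulo hashing reassigns most keys whenever the number of shards changes; production systems typically use consistent hashing or an explicit partition map to avoid that reshuffling.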

1.4. Describing Performance

When we increase a load parameter, how much do we need to increase the resources to keep performance unchanged?

The following performance metrics help us understand how resources need to change as load varies:

For batch processing systems like Hadoop:

  • Throughput: the number of records processed per second, or the total time it takes to run a job on a dataset of a given size. Higher throughput indicates more efficient data handling and processing.

For interactive services:

  • Latency — Waiting time before a request begins to be processed
  • Service time — Time to process the request
  • Response time — The total time from when a user sends a request to when they receive a response. This includes service time, plus any network and queuing delays.

1.5. Understanding Variability in Response Times

Response times can vary greatly, even for identical requests, due to factors like context switching, network packet loss, or pauses due to garbage collection. This variability makes it important to consider the distribution of response times rather than relying on a single average value.

Percentiles as a Performance Measure

Illustrating mean and percentiles: response times for a sample of 100 requests for a service

While the average, or arithmetic mean, can summarize the distribution of response times in a single value, it doesn't always represent the user experience accurately. Consider this: if 100 users access a service, and 99 of them see a response time of 1 second while one user experiences a 10-minute delay, the average response time is skewed to nearly 7 seconds. That misrepresents the actual experience, where 99% of users get a response in only 1 second.

Instead of relying on the average, using percentiles offers a more insightful measure of performance. A percentile indicates the threshold below which a certain percentage of observations fall. For instance, if the response time at the 99th percentile is 1 second, it means that 99% of the requests are processed in 1 second or less.

Effective monitoring involves tracking response times over a rolling window and calculating the median and other percentiles to capture the system's behavior under varying load. Algorithms such as forward decay, t-digest, and HdrHistogram can compute these metrics efficiently, minimizing CPU and memory cost.

p50 (Median): Half of all requests are faster than this value.

p99: 99% of requests take less time than this, a crucial metric for understanding the worst-case scenarios short of the absolute maximum.
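The skewed-average example above is easy to reproduce. Below is a small sketch using a simple nearest-rank percentile rather than a production estimator like t-digest:

```python
import statistics

# 99 requests at 1 second each, plus one outlier at 10 minutes (600 s).
response_times = [1.0] * 99 + [600.0]

def percentile(samples, p):
    # Nearest-rank percentile: the value below which p% of samples fall.
    ranked = sorted(samples)
    k = max(0, round(p / 100 * len(ranked)) - 1)
    return ranked[k]

print("mean =", statistics.mean(response_times))   # ~7 s, but misleading
print("p50  =", percentile(response_times, 50))    # 1.0 s
print("p99  =", percentile(response_times, 99))    # 1.0 s
print("max  =", max(response_times))               # 600.0 s, the one outlier
```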


Tail latencies, or the high percentiles of response times, are significant because they often determine the user experience under peak load. Although only a minority of requests suffer extreme latencies, those requests tend to hit the most valuable users: the customers who make the most requests are also the most likely to encounter a slow one. Amazon, for example, has observed that even slight increases in page load time measurably hurt customer behavior and satisfaction.

1.6. SLOs and SLAs

To prevent user attrition due to poor performance, it is critical to manage and minimize high-percentile response times. One practical approach is to establish performance agreements such as Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs), which define the expected performance and availability of a service. For example, an SLO might state that 99% of all requests within a week must complete within 1 second. In practical terms, if there are 1 million requests in a week, up to 10,000 of those requests may take longer than one second without breaching the SLO.
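A toy compliance check for an SLO of that shape; the threshold and target below simply mirror the example above:

```python
def meets_slo(response_times, threshold_s=1.0, target_fraction=0.99):
    # The SLO holds if at least target_fraction of requests
    # complete within threshold_s seconds.
    within = sum(1 for t in response_times if t <= threshold_s)
    return within / len(response_times) >= target_fraction

# 1,000,000 weekly requests; 10,000 slow ones is exactly the allowed budget.
week = [0.2] * 990_000 + [2.5] * 10_000
print(meets_slo(week))   # True: 99.0% of requests finished within 1 second
```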

Implementing such measures not only ensures a better experience for users but also aligns technical performance closely with strategic business goals, ensuring high service standards are maintained across the board.

Understanding Queueing Delays and Their Impact on Performance

Queueing delays are a significant factor in system performance, often leading to two notable issues:

  • Head-of-Line Blocking: This occurs when a slow request at the front of the queue holds up the requests behind it. Even if those later requests could be processed quickly, they must wait until the slow one completes, which inflates their overall response times.
  • Tail Latency Amplification: When serving a single user request requires calls to multiple backend services, the chance that at least one call is slow grows with the number of calls. Even if most backend operations are fast, one slow call in the chain delays the entire request, so a small percentage of slow backend operations can disproportionately degrade overall performance and user experience.
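The amplification is easy to quantify. Assuming each backend call is independently slow 1% of the time (that is, slower than its p99), the chance that a user request touching several backends hits at least one slow call grows quickly:

```python
p_slow = 0.01   # probability that any single backend call exceeds its p99

for n_calls in (1, 10, 100):
    # The end-user request is slow if at least one backend call is slow.
    p_request_slow = 1 - (1 - p_slow) ** n_calls
    print(f"{n_calls:>3} backend calls -> {p_request_slow:.0%} of user requests are slow")
```

With 100 backend calls per request, about 63% of user requests end up slower than a single backend's p99, even though only 1% of individual calls are slow.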

By understanding and addressing these issues, systems can be optimized to deliver more consistent and reliable performance, enhancing user satisfaction.

1.7. Strategies for Coping with Load

- Scaling Up (Vertical Scaling): Increasing the power of existing machines.

- Scaling Out (Horizontal Scaling): Distributing the load across many smaller machines, an approach known as a shared-nothing architecture.

- Elasticity: Automatically adding or removing computing resources as changes in load are detected.

1.8. Conclusion

Scalability is a cornerstone of building robust, data-intensive applications. By understanding and implementing effective load management strategies — such as sharding, caching, and decoupling services — along with leveraging performance metrics like percentiles, you can ensure your system handles increased demand without compromising performance. Real-world examples, like Twitter’s evolution, highlight the practical applications of these principles.

Stay tuned for Part 2, where I will dive into the equally critical aspects of Reliability and Maintainability. We’ll explore how to keep systems running smoothly, even under adverse conditions, and how to ensure long-term system health and adaptability.

References

Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, 2017.

Want to Connect?

If you have enjoyed this article, please follow me here on Medium for more stories about Computer Science and System Design.

LinkedIn: Archana Vellanki
