Patterns for the Distributed Systems in Cloud — Part 1

8 min read · Dec 7, 2019

Quote: “necessity is the mother of invention.”

This article distills a set of patterns published in the AWS builders-library at https://aws.amazon.com/builders-library/. The series might evolve as the library expands. The goal is to give a simplified view of the patterns and their solutions, with visuals, and to let the reader turn to the library itself for a deep dive, since the full articles are too long for a quick overview.

These patterns are organized into two categories:

Architectural
Software Delivery and Operations

These patterns are generic enough to apply to any distributed system, though some of them reference specific AWS services as examples.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Challenges with distributed systems (Architectural)

Distributed systems can be of the following types:

Offline: Batch processing systems, analytics clusters, and similar systems fall into this category. They enjoy the benefits of scalability and fault tolerance without the complex failure modes and non-determinism.
Soft-realtime: These are critical systems that must run continually to produce or update results but have a generous time window in which to do so. Examples are a search indexer or an EC2 role updater; if one of these does not run for a while, the customer impact is very low.
Hard-realtime: These are typically request-response services. Examples are front-end web servers, order systems, telephony, credit card transactions, and AWS APIs.

Request-Reply across a network

In this typical scenario, the client posts a message over the network, and the server processes it, updates its state, and sends a reply back over the network.

This scenario can fail in eight ways:

The client fails to post the request (network issue or server rejection)
The network delivers the request, but the server crashes before receiving it
Server-side validation of the request fails
Updating the server-side state fails
The server fails to post a reply (server issue)
The network fails to deliver the reply
Client-side validation of the reply fails
Updating the client-side state fails

As the number of components grows, the permutations of failure scenarios grow enormously.
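The eight failure points above can be sketched as a small model. This is only an illustration under stated assumptions: the step names are hypothetical labels for the list above, the steps run strictly in order, a failure stops the round trip, and the server sends no error reply. It shows that for six of the eight failures the client receives nothing, and in two of those the server state has already changed.

```python
from enum import Enum


class Step(Enum):
    """Hypothetical labels for the eight steps of one request-reply round trip."""
    POST_REQUEST = 1
    DELIVER_REQUEST = 2
    VALIDATE_REQUEST = 3
    UPDATE_SERVER_STATE = 4
    POST_REPLY = 5
    DELIVER_REPLY = 6
    VALIDATE_REPLY = 7
    UPDATE_CLIENT_STATE = 8


def outcome(failed_step):
    """Return (server_state_changed, client_received_reply) for one failure.

    Assumes every step before the failed one succeeded and nothing after it ran.
    """
    server_state_changed = failed_step.value > Step.UPDATE_SERVER_STATE.value
    client_received_reply = failed_step.value > Step.DELIVER_REPLY.value
    return server_state_changed, client_received_reply


for step in Step:
    changed, replied = outcome(step)
    print(f"{step.name:20} server_changed={changed!s:5} client_got_reply={replied}")
```

A failure at POST_REPLY or DELIVER_REPLY is the awkward case: the server has done the work, but the client cannot tell that apart from a request that never arrived.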

Challenges

Engineers can’t combine error conditions. Instead, they must consider many permutations of failures. Most errors can happen at any time, independently of (and therefore, potentially, in combination with) any other error condition.
The result of any network operation can be UNKNOWN, in which case the request may have succeeded, may have failed, or may have been received but not processed.
Distributed problems occur at all logical levels of a distributed system, not just low-level physical machines.
Distributed problems get worse at higher levels of the system, due to recursion.
Distributed bugs often show up long after they are deployed to a system.
Distributed bugs can spread across an entire system.
Many of the above problems derive from the physical laws of networking, which can’t be changed.

For more details see

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Avoiding insurmountable queue backlogs (Architectural)

Queues are powerful tools for building reliable asynchronous systems. At the expense of some latency, queues can increase durability and availability by processing messages efficiently, even amid failures and outages of systems. Queue-based systems suit long-running tasks, coordination of multiple steps, and batching (aggregating results over time and processing them together).

While the benefits of queues look obvious, if a queue starts accumulating a backlog due to failures, the processing delay can outweigh those benefits.

Some examples of queue-based systems are:

AWS Lambda functions, which make sure that every request gets executed even if there are server failures
AWS IoT Core, which allows devices to publish and subscribe to messages durably when several of them may be offline frequently.

AWS SQS provides a durable platform for designing queue-based systems.

When dealing with a backlog, the capacity needed to process it multiplies. For example, clearing an hour of backlog within the next hour takes double the usual capacity for that hour: one share for the live traffic and one for the backlog. Until the backlog is cleared, this could add latency and even saturate other dependent systems.
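The arithmetic can be sketched as follows. This is a minimal illustration, assuming steady arrival and processing rates; the function name and the numbers are hypothetical.

```python
def hours_to_drain(backlog_msgs, arrival_rate, processing_rate):
    """Hours needed to clear a backlog while new messages keep arriving.

    Rates are in messages per hour. Only the capacity left over after
    handling live traffic can be spent on the backlog.
    """
    spare = processing_rate - arrival_rate
    if spare <= 0:
        return float("inf")  # the backlog never shrinks
    return backlog_msgs / spare


arrival = 1000           # messages per hour (assumed)
backlog = 1 * arrival    # one hour of missed processing

print(hours_to_drain(backlog, arrival, 2 * arrival))     # 1.0
print(hours_to_drain(backlog, arrival, 1.25 * arrival))  # 4.0
print(hours_to_drain(backlog, arrival, arrival))         # inf
```

With double capacity the hour of backlog clears in one hour; with only 25% headroom it takes four hours, and with no headroom it never clears.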

It is key to measure latency and availability, and there are several ways to do so:

Producer-side latency typically depends on the latency of the queue system itself, such as SQS.
The number of unprocessed messages in a Dead Letter Queue (DLQ) can be another indicator.
The age of messages can provide useful insight into how much we are behind.
In the case of IoT, AgeOfFirstAttempt is a useful metric, and AgeOfFirstSubscriberFirstAttempt can be useful when sending messages to a large number of devices.
For distributed tracing, a request ID needs to be propagated so that the pieces can be put together into a big picture.
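As an illustration of the message-age measurement mentioned in the list above, here is a minimal in-memory sketch. It is not the SQS API; the class and method names are hypothetical, and timestamps can be passed in explicitly to make the behaviour easy to check.

```python
import time
from collections import deque


class TimedQueue:
    """In-memory FIFO queue that remembers when each message was enqueued."""

    def __init__(self):
        self._items = deque()  # entries are (enqueue_time, message)

    def put(self, message, now=None):
        enqueued_at = time.time() if now is None else now
        self._items.append((enqueued_at, message))

    def get(self):
        return self._items.popleft()[1]

    def depth(self):
        return len(self._items)

    def age_of_oldest(self, now=None):
        """Seconds the oldest waiting message has spent in the queue."""
        if not self._items:
            return 0.0
        now = time.time() if now is None else now
        return now - self._items[0][0]


q = TimedQueue()
q.put("a", now=100.0)
q.put("b", now=105.0)
print(q.depth(), q.age_of_oldest(now=130.0))  # 2 30.0
```

Emitting depth and age together on a schedule gives a direct reading of how far behind the consumers are.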

In a multi-tenant system, it is essential to protect tenants from one another by exposing only the necessary interfaces: for example, providing a lightweight API rather than exposing an internal queue. Imposing rate limits also helps prevent runaway clients from affecting other tenants.

It is vital to protect each layer of an asynchronous system, whereas for a synchronous system it may be sufficient to protect the entry point. Using multiple queues can provide tenant isolation. Last-in-first-out (LIFO) ordering may be preferable so that the latest data is processed sooner, while the backlog is cleared when capacity is available.
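One common way to impose a per-tenant rate limit is a token bucket. The sketch below is an assumption about how such a limit could look, not something the library prescribes; the class name and parameters are hypothetical.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, up to `burst` tokens."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant, now=None):
        """Return True if the tenant may send one more request right now."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(tenant, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[tenant] = (tokens - 1, now)
            return True
        self._buckets[tenant] = (tokens, now)
        return False


limiter = TenantRateLimiter(rate_per_sec=1, burst=2)
print([limiter.allow("noisy", now=0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("quiet", now=0.0))                      # True
```

Because each tenant has its own bucket, a runaway client only exhausts its own tokens.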

Strategies for Resilient and Multi-Tenant asynchronous systems

Isolation: Use separate queues for different workloads.
Shuffle-sharding: Instead of a single queue for all customers or one queue per customer, a fixed number of queues can be used, with each customer's requests hashed onto a targeted subset of those queues (AWS Lambda and Route 53 use this technique). See the next section for a full review of this approach.
Sidelining excess traffic to a separate queue: When a specific customer exceeds its rate, the excess messages are moved to a different queue and processed independently when resources are available, while the main queue operates normally.
Sidelining old traffic to a separate queue: Here we work on live traffic first and queue the older traffic to catch up on later (the LIFO approach discussed earlier).
Dropping old messages: Some systems can afford to lose messages because they synchronize periodically. In that case, dropping old messages may make more sense than processing them late.
Limiting threads (and other resources) per workload: When the concurrency of requests changes, one workload could exhaust the thread pools, so limiting threads per workload is necessary.
Sending backpressure upstream: When appropriate, the queue depth can be measured and requests rejected at the producer level once the queue grows too deep. This might not suit systems such as order taking, but it could be useful for something like a promotion sign-up, where the user can be asked to check back later.
Using delay queues to put off work until later: This is a variation of LIFO in which a batch of messages is moved to a different queue for delayed delivery, giving workers a chance to handle fresh messages.
Avoiding too many in-flight messages: Sometimes a bug means the code never deletes a message after processing it, and the message reappears after the visibility timeout. This situation can lead to too many in-flight messages.
DLQ: Dead-letter queues can be used to set aside messages that cannot be processed. The cause could be a bug, and the messages can be reprocessed once the issue is resolved. DLQs can also be monitored to alert when they hold too many messages, which indicates a more significant problem.
Ensuring additional buffer in polling threads per workload: Always plan for a buffer of threads to absorb sudden spikes.
Heartbeating long-running messages: When a message expires, we should either stop the work or use a heartbeat to remind SQS that work is still happening. Otherwise, it can lead to churn and cascading brownouts.
Plan for cross-host debugging: Use techniques such as periodically recording queue depth and propagating trace IDs. AWS provides services like X-Ray and Step Functions to enable distributed tracing and advanced workflow options.
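Two of the strategies above, sidelining excess traffic and sending backpressure upstream, can be sketched together in memory. This is a simplified illustration with hypothetical names: it limits how many messages a customer may hold in the main queue rather than measuring a true rate, and it signals backpressure by raising an exception once the total depth reaches a fixed cap.

```python
from collections import Counter, deque


class QueueFull(Exception):
    """Raised to push back on the producer when the backlog is too deep."""


class SideliningQueue:
    def __init__(self, per_customer_limit, max_depth):
        self.per_customer_limit = per_customer_limit
        self.max_depth = max_depth
        self.main = deque()
        self.sideline = deque()
        self._in_main = Counter()  # messages each customer has in `main`

    def put(self, customer, message):
        if len(self.main) + len(self.sideline) >= self.max_depth:
            raise QueueFull("backlog too deep, retry later")
        if self._in_main[customer] >= self.per_customer_limit:
            # Excess traffic from one customer waits in the sideline queue.
            self.sideline.append((customer, message))
        else:
            self._in_main[customer] += 1
            self.main.append((customer, message))

    def get(self):
        """Serve the main queue first; drain the sideline only when it is empty."""
        if self.main:
            customer, message = self.main.popleft()
            self._in_main[customer] -= 1
            return customer, message
        if self.sideline:
            return self.sideline.popleft()
        return None


q = SideliningQueue(per_customer_limit=2, max_depth=5)
for i in range(4):
    q.put("noisy", i)
q.put("quiet", "q0")
print(q.get(), q.get(), q.get())  # ('noisy', 0) ('noisy', 1) ('quiet', 'q0')
```

The quiet customer's message is served ahead of the noisy customer's third and fourth messages, even though it arrived after them.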

For more details, see

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Workload isolation using shuffle-sharding (Architectural)

Shuffle sharding is a core pattern for providing a cost-effective multi-tenant service that gives each client an experience close to single tenancy.

Without sharding, consider a fleet of eight workers that serve all customers equally. If one worker is taken down by, say, a DDoS attack, the load shifts to the remaining seven workers. Then the next one goes down, and the failure cascades, eventually taking down every worker.

In regular sharding, we divide the eight workers into four shards of two workers each, and each customer is assigned to a shard. When a customer is impacted, it can take down the two workers of its shard, resulting in a 25% reduction in capacity.

In shuffle sharding, we assign each customer a virtual shard of two workers chosen from the eight, so that different customers get different combinations of workers. When a customer is impacted, it can still take down its two workers, but the actual customer impact is much smaller: a customer that shares only one of those workers is still served by its other, healthy worker. Eight workers can be paired in 28 unique combinations, so only customers with exactly the same pair are fully affected, and the impact is now 1/28 compared to 1/4 in regular sharding.

In the case of Route 53, Amazon uses 2,048 virtual name servers and assigns each customer a shuffle shard of four of them, which gives about 730 billion possible combinations, so an attack on a single customer is unlikely to have any impact on other customers.
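The combination counts above can be checked directly, and the assignment itself can be sketched with a hash. This is a minimal illustration, not how Route 53 implements it: the function name, worker names, and the use of SHA-256 are assumptions, and a shard size of four for the 2,048-server case is assumed because it reproduces the 730 billion figure. Enumerating every combination is only practical for a small fleet like the eight-worker example.

```python
import hashlib
import math
from itertools import combinations


def shuffle_shard(customer_id, workers, shard_size):
    """Deterministically pick `shard_size` workers for a customer.

    Hashes the customer ID onto one of the C(n, shard_size) combinations.
    """
    all_shards = list(combinations(sorted(workers), shard_size))
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(all_shards)
    return all_shards[index]


workers = [f"worker-{i}" for i in range(1, 9)]

print(math.comb(8, 2))     # 28
print(math.comb(2048, 4))  # 730862190080
print(shuffle_shard("customer-a", workers, 2))
```

The same customer ID always maps to the same pair of workers, and two customers are fully exposed to each other only if they hash to the identical pair.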

For more details, see


Written by Sathiya Shunmugasundaram