How Vanguard operates business-critical event-driven workloads in a multi-region cloud environment

Vanguard Tech
Published May 23, 2024 · 20 min read

Vanguard’s mission is to take a stand for all investors, to treat them fairly, and to give them the best chance for investment success. To deliver on this mission, it is imperative that our clients know they can trust us; that Vanguard is always there for them whenever and wherever they need us. To ensure we are always available for our clients, we have made the strategic decision to operate our business-critical workloads in multiple geographic regions in the cloud to increase our overall system reliability and availability. Operating across multiple geographic regions in the cloud, though, comes with additional complexities, trade-offs, and risks. In this blog, we will discuss how we operate our business-critical event-driven workloads across multiple regions while maintaining data consistency, integrity, and meeting the availability needs of our clients.

Background: multi-region for business continuity

Over the past few years, Vanguard has made the deliberate decision to operate our critical workloads in multiple AWS regions, effectively increasing the overall reliability and availability of our systems and providing business continuity in the event of a regional AWS service outage. For Vanguard, multi-region architecture is not a one-size-fits-all pattern. We have mission-critical systems that need to operate on a global scale, and we have business-critical systems that can meet their operational requirements through a dual-region architecture. These systems are in turn composed of web applications, asynchronous event-based workloads, synchronous application programming interface (API)-based workloads, event streaming, data replication, and many other lower-level architecture patterns. Each of these capabilities’ architectures needs to be fine-tuned and designed to meet the business capability’s requirements around Recovery Time Objective (RTO), Recovery Point Objective (RPO), and Service Level Objectives (SLOs), many of which stem from strict compliance and regulatory requirements.

The challenges and trade-offs with distributed, multi-region applications

First, let’s level-set on the following terms that will be used to expand on the challenges and trade-offs that can be made in distributed, multi-region systems.

  • Consistency: A system is consistent when all clients see the same data at the same time, no matter which node or server they connect to.[1]
  • Availability: A system is available when every working node successfully returns a response for every request made.
  • Partition tolerance: A system is tolerant to partitions and network failures when it can continue operating even when messages or packets are lost or fail.
  • Latency: The time difference between making a request and receiving a response.
  • Recovery Time Objective (RTO): The maximum acceptable time delay between the interruption of a service and its restoration.[2] The acceptable amount of downtime a system can incur.
  • Recovery Point Objective (RPO): The maximum acceptable amount of time since the last data recovery point.[2] The acceptable amount of data loss incurred during an outage.
  • Cost: The total amount of money spent architecting, building, optimizing, and operating the system.

Many of the trade-offs in multi-region architecture patterns deal with “adjusting the dials” of these characteristics to fine-tune each of them to meet the requirements of the business capability that they support. Some of these characteristics have inverse relationships with one another. Consistency, availability, and partition tolerance are the subject of Eric Brewer’s CAP Theorem,[1] which states that systems can maximize only two of the three characteristics and must make trade-offs between them.

In a multi-region architecture where the network is naturally partitioned, architecting for high availability comes at the expense of consistency and vice versa. To put it simply, there is no one-size-fits-all multi-region architecture pattern that provides the best of everything; the best pattern is the one that fits your business requirements the closest.

At a high level, multi-region architecture patterns range from the Backup & Restore [3] pattern to multi-region Active/Active patterns, as seen in the diagram below,[4] each of which provides varying degrees of RTO, RPO, consistency, availability, and cost that you need to match to your business requirements. In the case of the Active/Active pattern, availability is very high and RTO, RPO, and end-user latency are relatively low, but it comes at a high financial cost, and the difficulty of implementing a performant and consistent data solution leads most implementations to settle for eventual consistency. Conversely, simply backing up data and restoring the system in another region when a failure occurs is relatively cheap, but the RTO and RPO can be extremely high.

Figure 1 — AWS Multi-Region Architecture Patterns [5]

The good news is that high-availability patterns are more like a spectrum than a one-size-fits-all choice, and choosing between these or other high-availability patterns is largely a matter of determining which of these characteristics matter most to your business and fine-tuning your architecture to fit the business needs the closest. For Vanguard to meet the business requirements around our business-critical capabilities and achieve RTO and RPO near zero while minimizing cost, we’ve chosen to implement high-availability patterns that fall somewhere between the Active/Active and Active/Warm-Standby ends of the spectrum, as seen in the diagram above.

Data consistency across multiple regions

Our stringent requirements around data consistency also come with the need to ensure exactly-once processing. When a client requests a withdrawal of money from their brokerage account, our applications and systems of record need to ensure that the cash is withdrawn only once and that data remains consistent across multiple regions.

A critical part of meeting these challenges is ingraining engineering best practices into our applications to provide data integrity and application resiliency while operating in complex, multi-region environments that contain network partitions and additional latency between regions. One important software engineering best practice is architecting our systems to be idempotent. Idempotency is the property of a request such that, no matter how many times said request is executed, the actual execution happens only once. To further our above example, in an idempotent multi-region system, when a client requests to withdraw cash from their brokerage account, that request can be sent and executed in geographically independent regions simultaneously, but the actual withdrawal of the cash from the brokerage account happens only once. In short, by implementing idempotency in our systems, we can maintain state and data consistency across multiple regions, allowing our systems the freedom and flexibility to operate more resiliently in the process.

In a distributed system, applications work by making network calls to other applications that perform some tasks and return the results back over that same network. While multi-region architecture patterns have many benefits, one drawback is that they introduce a significant point of failure: the network that these requests and responses must traverse is geographically dispersed and naturally partitioned, which significantly increases the likelihood of faults or dropped messages between two systems. Thankfully, many of the network errors that applications may encounter are transient; if the application just retries that same request, it will more often than not succeed.[6] Therefore, we can add additional resiliency to our systems while maintaining consistency by building our capabilities and processes to be retriable without side effects, meaning we can retry the request as many times as we need to execute it exactly once. To do this, our system needs to be idempotent. We’ll cover how to implement idempotency in a later section. First, we will discuss how idempotency enables safe multi-region architectures.
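Retrying transient failures is itself a small discipline. Below is a minimal sketch of a retry loop with capped exponential backoff and full jitter, the approach described in [6]; the function names, delays, and the simulated flaky request are illustrative, not from our production code:

```python
import random
import time

def retry_with_backoff(request, max_attempts=5, base_delay=0.05, max_delay=1.0):
    """Retry a callable on transient errors with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:  # stand-in for a transient network fault
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# A simulated request that fails twice with a transient error before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient fault")
    return "200 OK"
```

Note that this loop is only safe because the downstream operation is idempotent; otherwise a retry after a lost response could execute the work twice.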

Architecting business-critical capabilities to operate in multiple regions safely and effectively

As mentioned previously, we have many “dials to tune” while operating systems across distributed, multi-region workloads to best fit their business requirements. To demonstrate the trade-offs that we can make, we will continue to build on the capability to allow a client to withdraw cash from their brokerage account for the rest of the article.

Below is a purposefully over-simplified architecture diagram of the business capability to withdraw cash, which just begins to express the complexities of distributed systems. In this process, we have both synchronous API requests and asynchronous events that stem from the cash withdrawal micro-service when a client makes a request to withdraw cash from their account.

Figure 2 — Cash withdrawal simplified

Before taking a deeper dive into an individual event-driven process of the cash withdrawal capability, let’s expand our overly simplified architecture diagram of the cash withdrawal capability into multiple regions. You can see very quickly that as the number of regions we are operating in grows, the underlying processes become more and more complex, even in a significantly over-simplified picture. The biggest takeaway from the below diagram is that there are databases spread across both active regions and that these databases are constantly receiving writes — maintaining data consistency and integrity across these data stores is paramount for ensuring that we’re meeting both our business requirements and always acting in the best interest of our clients.

Figure 3 — Cash withdrawal expanded to two regions

The business requirements

To make appropriate decisions about architectural tradeoffs, it is necessary to understand the application’s functional and non-functional requirements. We will begin by looking at the relevant requirements for the cash withdrawal capability.

  • The capability needs to have a maximum allowable downtime of less than an hour per year, or “4 x 9s” (99.99%) availability.
  • Operations should be automated across all AWS regions.
  • Recovery Time Objective (RTO) of less than one minute
  • Recovery Point Objective (RPO) as close to zero as possible, minimizing the risk of data loss at all costs.
  • Ensure data consistency and exactly once processing of cash withdrawals.
  • Use a repeatable pattern that delivers consistency into the Vanguard platform for how we operate multi-region, event-driven architectures at scale across thousands of applications and hundreds of development teams.
  • The cost of the architecture should be kept to a minimum.
  • Ensure business continuity at scale. Remove dependencies on manual processes, individual development teams, and business-line toggles to seamlessly maintain operations. We should be able to change the active region in which we’re operating in an instant, for any reason, and continue business-as-usual without having someone need to be hands-on-keyboard.

Executing a cash withdrawal

Now that we have our requirements to help position the architectural trade-offs, let’s focus in on the event-driven process that supports the overall business capability for withdrawing cash. We’ll zoom into a single region to explain each individual component and step in the process.

Figure 4 — Single region event-driven process

Let’s imagine that in this scenario a client is withdrawing $10,000 from their brokerage account.

Steps in the withdrawal process:

  1. The client logs into their Vanguard.com account and submits a withdrawal request to have $10,000 withdrawn from their brokerage account and deposited into their personal checking account at their bank. Route53, Amazon’s DNS routing service, routes the request to the appropriate region and compute instance, using either geolocation or latency-based routing.
  2. Within the Vanguard network hosted on AWS, the Cash Withdrawal microservice publishes a message to the Cash Withdrawal topic on Amazon Simple Notification Service (SNS) with the details of the client’s withdrawal request.
  3. An Amazon Simple Queue Service (SQS) queue subscribes to the Cash Withdrawal SNS topic, and the message is added to the queue.
  4. Next, the AWS Lambda function processes each message from the queue, effectively withdrawing the cash from the client’s account.
  5. Finally, after processing the withdrawal successfully, the Lambda function writes the transaction to DynamoDB for historical purposes.
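Step 2 of the flow above, publishing the withdrawal event to SNS, can be sketched as follows. The message fields, topic ARN, and stub client are illustrative assumptions; in production the client would be a real boto3 SNS client, whose `publish` call is the actual API used:

```python
import json
import uuid

def publish_withdrawal(sns_client, topic_arn, account_id, amount, destination):
    """Publish a cash-withdrawal event (step 2) to the Cash Withdrawal SNS topic."""
    message = {
        "requestId": str(uuid.uuid4()),  # lets downstream consumers trace the request
        "accountId": account_id,
        "amount": amount,
        "destination": destination,
    }
    # With boto3, publish(TopicArn=..., Message=...) is the real SNS API call.
    return sns_client.publish(TopicArn=topic_arn, Message=json.dumps(message))

# Stand-in client so the sketch can run without AWS credentials.
class StubSns:
    def __init__(self):
        self.published = []

    def publish(self, TopicArn, Message):
        self.published.append((TopicArn, Message))
        return {"MessageId": "m-1"}
```

From there, SNS fans the message out to the subscribed SQS queues, and the Lambda consumer picks it up.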

Implementing cash withdrawal in multiple regions

Now that we have the business and non-functional requirements, and we’ve described each step in the event-driven workflow, let’s expand the capability to multiple regions and break down each step in the process to walk through the trade-offs we can make to “tune the dials” and meet our business requirements. We’ll continue to use the example of a client withdrawing $10,000 from their brokerage account.

For the cash withdrawal capability, we chose to use an active/active routing pattern to ensure ultra-high availability, near-zero RTO, near-zero RPO, and low latency from the client’s browser to the cash withdrawal compute instances, allowing us to both provide a better user experience and meet our stringent availability and data loss requirements.

For other business capabilities which are not as critical, an active/warm-standby pattern is more appropriate. To demonstrate the trade-offs between the active/active and active/warm-standby patterns, we’ll do a walkthrough of both to compare the two for the rest of the article.

Routing

In the active/active pattern, the client logs into their Vanguard.com account and submits a withdrawal request. Route53, Amazon’s DNS routing service, routes the request to the appropriate region and compute instance, using either geolocation or latency-based routing.

Figure 5 — Active/Active routing

In the active/warm-standby pattern, routing would be the primary lever we could use to select the region of operation, by integrating automated health checks and/or a toggling mechanism into our Route53 configurations. Here the request would route to a compute instance in the primary region of operation, effectively reducing our compute costs by scaling down our compute in the secondary region to an absolute minimum. The trade-off is that we cannot achieve “4 x 9s” availability with this pattern, so it may not be appropriate for certain use cases.

Figure 6 — Active/Warm-Standby routing

Messaging

In the active/active pattern, once the request is received, the cash withdrawal micro-service publishes the message to the cash withdrawal SNS topic in the same region. As alluded to previously, to minimize our Recovery Point Objective (RPO), or potential for data loss, we chose to have both regions’ SQS queues subscribe to both regions’ SNS topics in a fan-out pattern, intentionally duplicating messages cross-regionally. The fan-out pattern ensures that in the event of a regional AWS service outage or disruption, messages can continue to be processed in the second region, meeting our requirements for near-zero RPO (no data loss) and high availability. Cash withdrawals can continue to be processed as usual, but it does come with a cost: while there’s no charge for delivering messages from SNS to SQS, AWS applies data transfer fees when an SNS notification is sent to an SQS queue in a different region using the fan-out pattern.

Figure 7 — Active/Active messaging

In the active/warm-standby pattern, to reduce cost compared to the active/active pattern, only the active region’s compute layer publishes to the SNS topic in the active region.

Figure 8 — Active/Warm-Standby messaging

In the active/warm-standby pattern, we still fan out the messages sent to the primary region’s SNS topic. The fan-out pattern ensures that in the event of a regional outage of any service in the primary region, in-flight messages are delivered to the standby region so that we do not incur data loss and the standby region can pick up message processing where the primary region left off. We do incur data transfer costs here, but they are halved compared to the active/active pattern. To further optimize cost, messages in the standby region’s SQS queues live for the specified retention period and are then subsequently deleted by SQS, incurring no compute invocations or costs associated with managing these messages when inactive. More on this later.
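The retention behavior described above is controlled by SQS’s `MessageRetentionPeriod` queue attribute. A sketch of setting it follows; `set_queue_attributes` is the real boto3 SQS API, but the queue URL, the four-day default, and the stand-in client are illustrative assumptions:

```python
def set_standby_retention(sqs_client, queue_url, retention_seconds=4 * 24 * 3600):
    """Set how long unconsumed messages live in a standby queue before SQS deletes them.

    sqs_client is a boto3 SQS client in production; set_queue_attributes is the
    real API call. SQS expects the retention period as a string of seconds.
    """
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"MessageRetentionPeriod": str(retention_seconds)},
    )

# Stand-in client so the sketch can run without AWS credentials.
class StubSqs:
    def __init__(self):
        self.attributes = {}

    def set_queue_attributes(self, QueueUrl, Attributes):
        self.attributes[QueueUrl] = Attributes
```

With a retention window like this, unconsumed standby messages simply age out at no compute cost.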

Processing

Continuing in the active/active pattern, messages in the queues are processed by the Cash Withdrawal Lambda function in both regions simultaneously. Each region’s SQS queue is configured as the trigger for that region’s Lambda function, so each cash withdrawal is processed in the same region where it was queued.

Vanguard’s Functions as a Service (FaaS) platform provides a Lambda Control Plane in which we can toggle on or off the SQS trigger in either region either via automated health checks or a manual toggle, giving us the operational flexibility to control compute, costs, and in which region messages are being processed from the queue at any given time.

Figure 9 — Active/Active processing

Compared to the active/active pattern (Figure 9), the active/warm-standby pattern (Figure 10) operates with the secondary region’s SQS queue having the Lambda consumer “turned off,” or not configured as the trigger.

Figure 10 — Active/Warm-Standby processing

In the event of a regional service outage, network blip, or disaster, an alarm is triggered, and the Lambda Control Plane, via a Lambda function utilizing the AWS CDK, can determine whether to change the primary region in which we are processing messages. The Lambda Control Plane adds the Lambda consumer to the standby region’s SQS queue as a trigger while simultaneously “turning off” the primary region’s. Messages that were stored in the standby region’s SQS queues can now be processed by the compute layer, essentially continuing business operations as usual, but in a different AWS region. For cost optimization purposes, messages in the standby or inactive region’s SQS queues live for the specified retention period, or time-to-live (TTL), and are then subsequently deleted by SQS, incurring no compute invocations or costs associated with managing these messages in the standby region. By doing so, we’re able to significantly reduce compute costs through the warm-standby pattern. Again, this pattern seeks to optimize for operational cost at the expense of overall availability, so this is a trade-off you will need to consider when deciding between the two patterns.
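At the API level, a region swap like this amounts to flipping the `Enabled` flag on each region’s SQS event source mapping. The sketch below uses the real boto3 Lambda call `update_event_source_mapping`, but the mapping UUIDs, function names, and stub clients are illustrative; Vanguard’s actual Lambda Control Plane is internal:

```python
def fail_over(old_primary_client, new_primary_client, old_mapping_uuid, new_mapping_uuid):
    """Swap the active region by toggling each region's SQS -> Lambda trigger.

    Each *_client is a boto3 Lambda client bound to one region;
    update_event_source_mapping(UUID=..., Enabled=...) is the real AWS API call.
    """
    # Stop consuming from the old primary region's queue...
    old_primary_client.update_event_source_mapping(
        UUID=old_mapping_uuid, Enabled=False)
    # ...and start consuming from the standby region's queue.
    new_primary_client.update_event_source_mapping(
        UUID=new_mapping_uuid, Enabled=True)

# Stand-in client so the sketch can run without AWS credentials.
class StubLambdaClient:
    def __init__(self):
        self.calls = []

    def update_event_source_mapping(self, UUID, Enabled):
        self.calls.append((UUID, Enabled))
        return {"UUID": UUID, "State": "Updating"}
```

Because the standby queue has been accumulating the fanned-out copies of each message, enabling its trigger lets processing resume with no data loss.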

It’s important to note that because we’ve chosen to intentionally create duplicate messages across multiple regions in both patterns to ensure an ultra-low Recovery Point Objective, deduplication logic using idempotency is included in the Lambda consumer layer to ensure data consistency across all regions, allowing us to process each message safely.

What is idempotency?

We glossed over the topic of idempotency above, but as it is a fundamental element of operating safely and ensuring data integrity in distributed systems, we felt it was important to define it explicitly. Idempotency is the property of a request or an operation such that no matter how many times you execute it, you achieve the same result. To give an example, let’s say a client has a balance of $100,000 in their brokerage account and wants to withdraw $10,000. In an idempotent system, we could process that request 1, 100, or 1,000 times and the result would be exactly the same: $10,000 was successfully withdrawn and the client’s balance is now $90,000. Conversely, if the same system were not idempotent, processing that request more than once would result in significantly more than $10,000 being withdrawn from the account; this behavior is unacceptable and needs to be avoided at all costs.

By putting technical guardrails enforcing exactly-once processing into the processing layer of many of our critical capabilities, such as cash withdrawals, we can guarantee they are idempotent; this ensures that we are both meeting our regulatory requirements and always acting in the best interest of our clients.

How is idempotency achieved?

There are two main requirements that need to be in place to implement idempotent processing:

  1. A persistence layer that can be used to check whether an identical request has already been processed.
  2. A method to determine whether two requests are identical.

For a horizontally scalable distributed system, the first requirement should be met by an external data store. For most of our systems, that data store is Amazon DynamoDB due to its simplicity and low latency. Other systems may choose to use Amazon Aurora as the data store as it may better fit their business or technical requirements.

The second requirement may be met using an idempotency key unique to each request. This key does not need to be randomly generated. Rather, a best practice is to identify a set of fields that makes a request unique and then use a hashing function on those fields to create a reproducible hash that serves as the idempotency key. Whenever a request comes in with fields that generate the same hash, the request will be considered identical. For instance, a withdrawal might be considered identical if it contains the same account identifier, withdrawal amount, and destination.
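As a concrete sketch of deriving such a key from the three fields named above (the field names and SHA-256 choice are illustrative assumptions, not our production scheme):

```python
import hashlib
import json

def idempotency_key(account_id, amount, destination):
    """Derive a reproducible idempotency key from the fields that make a withdrawal unique."""
    # Serialize deterministically so identical requests always hash identically.
    canonical = json.dumps(
        {"accountId": account_id, "amount": amount, "destination": destination},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same account, amount, and destination produce the same key, so duplicates can be matched no matter which region delivered them.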

With both the persistence store and idempotency key generation in place, a service can process idempotently with the following high-level logic:

  1. Use a hashing function on the unique fields of a request to generate the idempotency key.
  2. Check whether this idempotency key exists in the idempotency persistence store.
  3. If the key exists, return the saved result from the idempotency store to the consumer.
  4. If the key does not exist, continue to the next step.
  5. Process the request.
  6. Save the result of the request to the idempotency store.
  7. Return the result of the request to the consumer.
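The steps above can be sketched as follows, with a plain dictionary standing in for the DynamoDB persistence layer and a toy withdrawal handler; both are illustrative, not our production code:

```python
import hashlib
import json

def process_idempotently(request, store, handler):
    """Run `handler(request)` exactly once per unique request.

    `store` is a plain dict standing in for the DynamoDB persistence layer;
    `request` must be a JSON-serializable dict of the request's unique fields.
    """
    # Step 1: derive the idempotency key from the request's unique fields.
    key = hashlib.sha256(
        json.dumps(request, sort_keys=True).encode("utf-8")).hexdigest()
    # Steps 2-3: if an identical request was already processed, return the saved result.
    if key in store:
        return store[key]
    # Steps 4-5: otherwise, process the request.
    result = handler(request)
    # Step 6: save the result against the idempotency key.
    store[key] = result
    # Step 7: return the result to the consumer.
    return result

# Toy handler: debits an in-memory balance (illustrative only).
balances = {"acct-1": 100_000}

def withdraw(req):
    balances[req["accountId"]] -= req["amount"]
    return {"status": "ok", "balance": balances[req["accountId"]]}
```

Processing the same withdrawal request twice with this sketch debits the balance once and returns the saved result on the second attempt.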

A scalable solution to data integrity across multiple regions

Our last goal was to create and utilize a repeatable pattern that delivers consistency into the Vanguard platform for how we operate multi-region, event-driven architectures at scale across thousands of applications and hundreds of development teams.

It is worth noting that when operating in multiple regions in an idempotent manner there are several edge cases to be considered: What if multiple requests come in at nearly identical times? What should services do with an idempotency record if request processing fails? Should certain requests only be made idempotent within a limited timeframe, and if so, how should that be implemented?

Because of the complexity of these edge cases, it is highly encouraged to put idempotency logic into a common library rather than reimplement it for each business capability or service. For our Lambda-based processes, we make use of Powertools for AWS Lambda, which nicely encapsulates these edge cases and ensures we’re handling them consistently across the organization. As Vanguard has a lot of Lambda functions written in Node, we’ve chosen to partner with AWS on the build-out of the idempotency feature within the TypeScript version of the Powertools for AWS Lambda library to further help us achieve data consistency and integrity across our organization. Along with the Node library, the Python and Java versions of the Powertools libraries have the idempotency feature as well.

Before the Lambda function executes the cash withdrawal from the brokerage account, it first must check whether the exact message has already been processed, either in the same region or a different region.

Figure 11 — Idempotent Active/Active processing

The functions do so by creating an idempotency key, using a hashing function to create a unique identifier based on the body of the message from SQS. This ensures that duplicate messages for a single cash withdrawal will produce the same hash and can be matched cross-regionally to one another. We utilize the Lambda Powertools library to do this, as it handles idempotency in a consistent manner across languages and greatly simplifies the logic we need to code into our functions.

In the example code snippet below, written in Java, we’re simply streaming events from SQS into our Lambda function and invoking the process method with the messageId and messageBody as parameters.

To make the process method idempotent, all we need to do is add the Idempotent annotation to the method and use the IdempotencyKey annotation to tell Powertools that we want to use the messageBody as the idempotency key.

Once we have defined our idempotency key, the Powertools library will check the persistence layer in DynamoDB to determine if the request has already been processed. If the idempotency key is not in our persistence layer, our function can proceed with the request as normal. However, if the idempotency key is already in the persistence layer, our function can simply return the result of the previous response and does not need to execute the transaction again.

Using the previous example, when a second request comes in with the same cash withdrawal, the same idempotency key will be generated using the hash function; therefore, we can determine that the request has already been processed and return the result from the previous execution.

Figure 12 — Idempotency implementation [7]

As we mentioned before, many of the network errors that applications may encounter are transient; if the application just retries that same request, it will succeed more often than not. To give an example, in the above diagram, the initial request to process the withdrawal of cash was successfully executed, but the response was not received by the client. This could have been caused by one or more of the retriable exceptions documented in Figure 13, such as network connectivity issues. Because we have made our functions idempotent, we are able to successfully retry this request to get a 200 response and ensure that we withdraw the cash from the brokerage account only a single time. The cash withdrawal function can write the transaction to its operational DynamoDB table for historical purposes and move on to the next cash withdrawal request. Behind the scenes, the transaction history is asynchronously replicated across the globe using DynamoDB Global Tables.

Figure 13 — Retriable and non-retriable exceptions [8]

Why is the idempotency persistence layer in a single region only?

Referencing back to CAP Theorem, we know that in distributed systems we can only achieve, at most, two out of three guarantees for a database — Consistency, Availability, or Partition tolerance. DynamoDB is a database that is designed to deliver availability and partition tolerance at the expense of consistency, meaning that two DynamoDB tables in different regions configured as Global Tables will only ever be eventually consistent, and that data being read has a chance of being stale (not representative of current state).

For this reason, choosing DynamoDB as the database means the idempotency persistence layer must exist and operate only within a single region to guarantee data consistency. For business continuity purposes, it has been proposed to have a secondary DynamoDB table, enabled as a Global Table, as a standby database in case the primary region has an outage of the DynamoDB service. However, even if we put controls in place to measure the ReplicationLatency and PendingReplicationCount metrics between the two regions, we still cannot ensure with 100% accuracy and confidence that data will always be consistent between the two. Therefore, we have chosen to operate the idempotency persistence layer in only a single region for now.

We plan to explore and test other persistence layers, such as Aurora Global Database and its global write forwarding feature, to see if they could meet our stringent requirements and serve as an alternative idempotency persistence layer in the future.

Summary

Through the use of serverless AWS-native services, a strategic use of idempotency to maintain data integrity and consistency cross-regionally, and platform automation at scale via the Lambda Control Plane, we are able to optimize for reliability, operational excellence, performance efficiency, and cost, all while meeting, and in many cases exceeding, our requirements for our critical, event-driven business capabilities. These event-driven processes and capabilities, once reliant on a single AWS region for uptime and availability, are now able to operate seamlessly in multiple regions with data integrity and minimal to no business impact in the event of a regional service outage or blip in the network.

Ultimately, though, the biggest beneficiaries are our clients, who can rest assured that Vanguard has their best interests in mind; that we will keep their data safe and consistent, preserve its integrity, and ultimately ensure that they have access to their assets whenever and wherever they may need them.

Taking a stand for all investors

Vanguard’s mission, mentioned in the beginning, inspires us to come to work each day ready to come up with solutions to difficult technology problems for our clients. As Vanguard technologists, the unwavering focus on clients empowers us as we build and optimize intuitive client experiences and modernize our applications across multiple regions in the cloud while tackling the complex data integrity challenges that come along with it. In the technology space, making high-impact trade-offs such as the ones discussed in this article is not often an easy feat, but it is made easier by keeping the “north star” in mind: our mission.

References

[1] What is the CAP theorem? IBM. https://www.ibm.com/topics/cap-theorem

[2] https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_planning_for_recovery_objective_defined_recovery.html

[3]https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-ii-backup-and-restore-with-rapid-recovery/

[4]https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/

[5] https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-ii-backup-and-restore-with-rapid-recovery/

[6] Brooker, Marc. “Timeouts, retries and backoff with jitter.” aws.amazon.com, https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/

[7] Beswick, James. “Handling Lambda functions idempotency with AWS Lambda Powertools.” aws.amazon.com, https://aws.amazon.com/blogs/compute/handling-lambda-functions-idempotency-with-aws-lambda-powertools/

[8] Chew, John. “Avoiding Double Payments in a Distributed Payment System.” medium.com, https://medium.com/airbnb-engineering/avoiding-double-payments-in-a-distributed-payments-system-2981f6b070bb

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 50+ million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.

©2024 The Vanguard Group, Inc. All rights reserved.
