Vanguard’s global multi-region approach

Vanguard Tech · Published in Vanguard Tech · Jul 31, 2023 · 19 min read

Vanguard’s Global Investments and Financial Services (GIFS) IT subdivision is a truly global organization supporting investment teams composed of portfolio managers, traders, and other operations functions. GIFS IT supports multiple investment sites across Australia (AU), the United Kingdom (UK), and the United States (US). The enduring mission of GIFS IT is to enable better global system resiliency for these investment sites; however, past solutions traded off user experience due to latency. We wanted to improve the user experience by decreasing latency while simultaneously increasing resiliency, distributing our systems across multiple regions and replicating data between them. The following blog explores how we analyzed our usage, created architectural patterns, and developed tools to allow our applications to run simultaneously across multiple regions.

This topic was presented in part at the AWS 2022 New York Summit and AWS re:Invent 2022.

We followed four fundamentals when developing our global multi-region strategy.

Fundamental #1 — Understand the requirements.

Fundamental #2 — Understand the data.

Fundamental #3 — Understand dependencies.

Fundamental #4 — Ensure operational readiness.

Understand the requirements

Many organizations see multi-region simply as a resiliency initiative. With multiple investment sites accessing the same US-based system, however, network latency and bandwidth have a significant impact on user experience, so GIFS initially saw multi-region as a means to improve that experience and only later as a means to increase availability.

We already used several techniques that reduced the impact of network data transfers in “chatty” client-server applications. One example is sophisticated global caching filesystems that decreased the time it took for desktop applications to open files used around the globe. Another is virtual desktop infrastructure and application streaming, which allow a user to interact with an application “streamed” from a remote server. We also employed network traffic compression. We wanted to do better, though, and nothing compares to having both compute and data located close to our users.

We also wanted our business-critical systems to be highly available. It wasn’t enough to have disaster recovery (DR) solutions: we didn’t just want to be able to recover from an outage caused by a failure; we wanted failures to have an insignificant impact on our users. We looked at several options for increasing our resiliency but felt that multi-region provided the best trade-off between resilience and complexity.

Other requirements included keeping costs low and supporting applications developed using techniques such as event-driven processing.

These requirements had to be met within some technical constraints. Our data stores fell into two categories: primary/secondary data stores and multi-primary data stores.

Figure 1: Primary/Secondary data stores.

A primary/secondary data store configuration is depicted in Figure 1. With a primary/secondary data store configuration, only the primary region supports reads and writes, while the secondary regions support only reads. Using this configuration, there is no possibility of data corruption caused by write conflicts because all writes happen in the same region. If the primary fails, one of the secondary regions is promoted to primary. Examples of primary/secondary data stores include Amazon Aurora, Azure SQL Database, and Google Cloud SQL, all of which have a single primary region that supports read and write activity and replicates data out to multiple secondary regions that support read-only activity.
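As a concrete illustration, the sketch below shows how an application layer might route traffic against a primary/secondary store: reads go to the local replica, while every write targets the single primary endpoint. The endpoint names and region list are hypothetical placeholders, not our actual configuration.

```python
# Hypothetical endpoints for a primary/secondary data store; in practice these
# would come from configuration or service discovery, not hard-coded literals.
ENDPOINTS = {
    "writer": "globaldb.cluster-primary.us-east-1.example.internal",
    "readers": {
        "us-east-1": "globaldb.cluster-ro.us-east-1.example.internal",
        "eu-west-2": "globaldb.cluster-ro.eu-west-2.example.internal",
        "ap-southeast-2": "globaldb.cluster-ro.ap-southeast-2.example.internal",
    },
}

def endpoint_for(operation: str, local_region: str) -> str:
    """Reads are served by the local replica; all writes go to the one primary."""
    if operation == "read":
        return ENDPOINTS["readers"].get(local_region, ENDPOINTS["writer"])
    return ENDPOINTS["writer"]
```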

Figure 2: Multi-primary data stores.

A multi-primary data store configuration is depicted in Figure 2. With a multi-primary data store configuration, every region supports read and write activity, and changes are propagated to all regions using a “last-writer-wins” conflict resolution strategy. This can lead to data corruption that is undetectable in real time. One way to eliminate data corruption is to introduce “guards” that force multi-primary data stores to behave like primary/secondary data stores. Examples of multi-primary data stores include Amazon DynamoDB, Azure Cosmos DB, and Google Cloud Spanner.
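The snippet below is a minimal illustration (not any vendor’s actual resolution code) of why unguarded multi-primary replication is risky: with “last-writer-wins”, the update carrying the earlier timestamp is silently discarded even though both regions believed their write succeeded.

```python
# Illustrative "last-writer-wins" resolution between two conflicting versions of
# the same item, written in different regions before replication converged.
def resolve_last_writer_wins(version_a: dict, version_b: dict) -> dict:
    """Keep whichever version carries the later update timestamp."""
    return version_a if version_a["updated_at"] >= version_b["updated_at"] else version_b

us_write = {"item_id": "T-1", "qty": 500, "updated_at": "2023-07-31T14:00:00.120Z"}
au_write = {"item_id": "T-1", "qty": 800, "updated_at": "2023-07-31T14:00:00.090Z"}

# The AU update loses the race and is dropped without any error being raised.
assert resolve_last_writer_wins(us_write, au_write) == us_write
```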

Understand the data

We next examined workloads across GIFS to understand how they used data. Specific details included:

  • Who or what needed to read the data?
  • Who or what was responsible for adding and updating the data?
  • Where did the data originate?
  • Where will the data finally reside?
  • When and where were those actions performed?
  • What downtime (RTO) could be tolerated?
  • Which technologies can support these workloads?

Based on the answers to these questions, we categorized our workloads into various patterns.

Deployment patterns

We categorize active/passive patterns as deployment patterns. These are the simplest and most cost-effective of the three sets of patterns that we use to build global multi-region applications.

Active/passive workloads

We have some workloads that are entirely automated, except in the event of an error. When systems rather than users are responsible for interacting with the data, user experience can take a back seat to cost, and acceptable downtime becomes the most important factor. When the passive region becomes active (which we call a “rising primary”), processes run, and resources may need to be deployed before the application is available again. As a result, we call these patterns our “Deployment Patterns” and categorize them from “cold”, where the processes take a long time and/or the number of resources to deploy is significant, through “warm” to “hot”, where there is very little processing and few or no resources to deploy. For example, “hot” environments would be ready to take over the load instantly, “warm” environments might need to scale up from a bare minimum number of resources, while “cold” environments might need to restore databases from snapshots.
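As a rough sketch (the tiers, actions, and recovery times below are illustrative examples rather than our actual figures), the deployment tiers can be thought of as a mapping from tier to the work a rising primary must complete before taking traffic:

```python
# Illustrative mapping of deployment tiers to the work a "rising primary" must
# complete before it can serve traffic. Actions and typical recovery times are
# examples only.
DEPLOYMENT_TIERS = {
    "hot": {
        "actions": ["shift traffic"],
        "typical_recovery": "minutes",
    },
    "warm": {
        "actions": ["scale out a minimal fleet", "shift traffic"],
        "typical_recovery": "tens of minutes",
    },
    "cold": {
        "actions": ["deploy infrastructure", "restore databases from snapshots", "shift traffic"],
        "typical_recovery": "hours",
    },
}
```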

These workloads can work with both primary/secondary data stores and multi-primary data stores.

Our Transfer Agency system is a good candidate for an active/passive solution. It receives data feeds from third parties, updates the recordkeeping system, and sends files back to the third parties. There is no customer impact even with single-digit hours of downtime.

Figure 3 shows how the Deployment Patterns fit within the GIFS Global Multi-Region approach.

Figure 3: Deployment patterns.

Data distribution patterns

For workloads that are active/active across multiple regions, managing the data becomes critically important. Our data distribution patterns support different types of usage.

Hub-and-spoke workloads

Centralized workloads perform work (read/write) in one region, and the results are used (read-only) globally. For example, our Risk Management process is usually performed by a single centralized team in the US that updates the investment risk data by writing to a data store on the US East coast. The results of this process are read by teams based in the UK and AU, and low-latency read access is possible because compute and data are located close to the users in those regions. Very rarely do the UK or AU teams need to make updates (writes), but when they do, the writes are forwarded to the hub. This ensures data integrity because one region handles all the writes.

Figure 4: Hub-and-Spoke Pattern.

We refer to this type of pattern as “hub-and-spoke” where the hub is the single central data store and the spokes are replicated copies. This pattern can be used with either primary/secondary data stores like Amazon Aurora, or multi-primary data stores like Amazon DynamoDB using a “guard”, so we categorize this pattern as a Primary/Secondary Data Distribution Pattern.
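A minimal sketch of a hub-and-spoke guard is shown below, assuming a duck-typed store object with local read/write methods and a hypothetical internal endpoint for forwarding; our actual guards are microservices built on the GMRlib described later in this post.

```python
# Hub-and-spoke guard sketch: reads are always served locally, writes run locally
# only in the hub region and are otherwise forwarded to the hub's guard service.
import requests

HUB_REGION = "us-east-1"  # assumed hub; in practice supplied by an orchestration service

def handle_request(op: str, payload: dict, local_region: str, store):
    if op == "read":
        return store.read(payload)          # spokes serve reads from the local replica
    if local_region == HUB_REGION:
        return store.write(payload)         # the hub handles every write locally
    return forward_to_hub(op, payload)      # spokes forward writes to the hub

def forward_to_hub(op: str, payload: dict):
    # Hypothetical internal URL for the hub region's guard microservice.
    url = f"https://guard.{HUB_REGION}.internal.example.com/{op}"
    return requests.post(url, json=payload, timeout=5).json()
```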

If the hub fails, though, the spokes don’t get updates, and the data will get stale over time. The solution to this is “switchable hubs”.

Figure 5: Hub switched to the AU region.

With switchable hubs, in the event of a failure in the primary region, we switch the primary of the data store to a new region and perform the replication from there instead. In this contingency situation, there would be increased latency leading to a deteriorated experience for the risk management team. However, the system would still be available and a team in another region could run the risk management process locally if necessary.

Figure 6: The US region has failed. The hub has switched to the UK region. US-based users experience increased latency, but still have access to data.

Follow-the-sun workloads

Vanguard uses a global pass-the-book model when we execute trades. A portfolio manager generates an order or trade list, and then all relevant regions cooperate to execute that trade, but only during their working day. As the sun rises in the east, the traders in AU need write access to the data to record their trades. Later, the AU traders pass the outstanding trades on, and the traders in the UK pick them up and need write access to the data to record their trades. Then once more, the UK traders pass the outstanding trades on, and the traders in the US pick them up and need write access to the data to record their trades. The need for write access aligns to the working day in each geographic region, hence the name “follow-the-sun.”

This pattern is actually very similar to the hub-and-spoke model, except that instead of switching the hubs only in the event of a failure, the hub is switched at specific intervals throughout the day. During the day in AU, the AU region is the hub (primary) and the UK and US are the spokes (secondaries).

Figure 7: During the day in Australia, users in AU have read/write access to the data store.

Then during the day in the UK, the UK region is the hub (primary) and AU and the US are the spokes (secondaries).

Figure 8: During the day in Europe, users in the UK have read/write access to the data store.

Later again during the day in the US, the US region is the hub (primary) and AU and the UK are the spokes (secondaries). Again, occasional writes are forwarded to the primary region.

Figure 9: During the day in the Americas, users in the US have read/write access to the data store.

The follow-the-sun pattern can also be used with either primary/secondary data stores like Amazon Aurora, or multi-primary data stores like Amazon DynamoDB, Azure Cosmos DB, or Google Cloud Spanner, using a “guard”, so it is also categorized as a Primary/Secondary Data Distribution Pattern. Due to the frequent switching of the primary region, it is easier to achieve using a guarded multi-primary data store because it is easier to switch the guard than it is to do a planned failover of a primary/secondary data store.

Failures are handled in the same way as the hub-and-spoke model.
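A minimal sketch of schedule-driven primary selection follows, assuming fixed UTC handover times; in our actual implementation the switch times come from the orchestration tooling described later rather than being hard-coded.

```python
# Follow-the-sun sketch: the region allowed to write is determined by a handover
# schedule. The UTC times below are illustrative assumptions.
from datetime import datetime, timezone, time

SCHEDULE = [
    (time(21, 0), "ap-southeast-2"),  # AU takes the book at ~21:00 UTC
    (time(7, 0), "eu-west-2"),        # UK takes the book at ~07:00 UTC
    (time(13, 0), "us-east-1"),       # US takes the book at ~13:00 UTC
]

def current_primary(now=None) -> str:
    """Return the region that currently holds write access to the book."""
    now = now or datetime.now(timezone.utc)
    t = now.time()
    primary = "ap-southeast-2"  # the AU slot wraps past midnight UTC, so it is the default
    for start, region in sorted(SCHEDULE):
        if t >= start:
            primary = region
    return primary
```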

Segmented workloads

The hub-and-spoke and follow-the-sun patterns grant read-only or read/write access on an all-or-nothing basis across the data store. A region is either primary, with read/write access to the entire data store, or secondary, with read-only access to the entire data store. Some systems, however, need to support writes to individual items from specific regions.

Our portfolio managers (PMs) work around the world but use a single global portfolio management system. PMs in each region manage funds generally offered in that region. For example, a PM based in the US might be responsible for the Vanguard 500 Index Fund, while a PM based in Australia might be responsible for the Vanguard Australian Shares Index Fund. Each needs write access in their region to the data “owned” by that region.

Figure 10 shows how data in segment #1, which is owned by the US, is read, written, and replicated.

Figure 10: Segment #1 (top) is owned by the US. Users in the US have read/write access. The guards in the EU and Australia allow local read access but forward writes to the US.

Figure 11 shows how data in segment #2, which is owned by the UK, is read, written, and replicated.

Figure 11: Segment #2 (middle) is owned by the UK. Users in the UK have read/write access. The guards in the US and Australia allow local read access but forward writes to the UK.

Figure 12 shows how data in segment #3, which is owned by AU, is read, written, and replicated.

Figure 12: Segment #3 (bottom) is owned by AU. Users in AU have read/write access. The guards in the US and UK allow local read access but forward writes to AU.

This pattern can only be used with multi-primary data stores like Amazon DynamoDB, Azure Cosmos DB, or Google Cloud Spanner because every region is able to both read and write to the data store. In order to prevent corruption, a “guard” ensures that each region only writes to the data segment that it “owns”. These guards are implemented as microservices, and all access to the data stores is managed through them. In the event of a failure, another region (which may be the closest, or the region that owns the least data) takes over from the failed region. As a result, this pattern is known as “Segmented”, and is categorized as a Guarded Multi-Primary Data Distribution Pattern.

In the event that a failure makes one region inoperable, another region takes over as owner of that region’s data. Figure 13 shows how data in segment #1, which was originally owned by the US, is now owned by the UK, and how data from segments #1/#2 and #3 is read, written, and replicated.

Figure 13: The US region has failed (Segment #1, top), and the UK now owns the US data as well as its own (Segments #1, top, and #2, middle). This diagram shows the paths from each region to write to segments 1/2 and 3, and the replication of segments 1/2 and 3.
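A sketch of a segmented guard is shown below. The ownership map, fund identifiers, and forwarding hook are illustrative assumptions that mirror the US/UK/AU example above, not our production configuration.

```python
# Segmented-guard sketch: every region can read locally, but a write is executed
# locally only when the local region owns the segment that the item belongs to;
# otherwise the operation is forwarded to the owning region.
SEGMENT_OWNER = {
    "us-fund-1": "us-east-1",
    "uk-fund-2": "eu-west-2",
    "au-fund-3": "ap-southeast-2",
}

def guarded_write(fund_id: str, payload: dict, local_region: str, store, forward):
    """Write locally when this region owns the fund's segment, else forward."""
    owner = SEGMENT_OWNER[fund_id]
    if owner == local_region:
        return store.write(fund_id, payload)
    return forward(owner, fund_id, payload)

def reassign_segments(failed_region: str, takeover_region: str) -> None:
    """On a regional failure, another region takes ownership (cf. Figure 13)."""
    for fund_id, owner in SEGMENT_OWNER.items():
        if owner == failed_region:
            SEGMENT_OWNER[fund_id] = takeover_region
```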

Unguarded multi-primary workloads

There are some situations where multiple regions need to update data that they do not “own”, such as when multiple PMs make decisions to issue trades for multiple securities for multiple funds spanning multiple regions. These are known as multi-block transactions and may occur when a benchmark index, used by multiple funds across more than one region, undergoes a change. For example, if security A is held in US fund 1, UK fund 2, and AU fund 3, all following the same benchmark, a single multi-block order for security A is generated for all three funds spanning the US, UK, and AU. As a result, there is a risk of data corruption due to conflicting writes, which is technically complex and expensive to resolve.

Figure 14: Unguarded multi-primary.

Instead of relying on a technological “guard” to prevent conflicts, we use business processes to coordinate across regions and ensure global data integrity. Effectively, for this pattern the “guard” changes from a system to a business process. As a result, there are no systems supported by GIFS that need to use a purely technical Unguarded Multi-Primary Data Distribution pattern. Because business processes protect against conflicts, the guards can be switched off when systems using the Segmented by Geography pattern, such as the Portfolio Management system, need to support multi-block transactions.

Figure 15 shows how both the Deployment Patterns, and Data Distribution Patterns fit within the GIFS Global Multi-Region approach.

Figure 15: Deployment and Data Distribution Patterns.

Event-driven patterns

Our deployment patterns and data distribution patterns are just two of the three sets of patterns that we use to build global multi-region applications. The investment and financial systems that we support rely on data received from, and sent to, many different organizations (some internal to Vanguard) in many formats using various types of technologies. During our migration to the public cloud, we have evolved from time-based scheduling to event-driven systems. These event-driven patterns are often used to load data from third parties into applications built on the deployment or data distribution patterns, distribute data from those applications, or as a method of integrating with them.

There are three types of event-driven patterns:

  • Fan-out patterns
  • Fan-in pattern
  • Multi-point pattern

Each of these patterns relies on two or three components: a producer, a consumer, and a target resource.

Fan-Out Patterns

The Fan-Out Event-Driven pattern is used to distribute data from one region to multiple regions. For example, we have vendors who write data directly to our cloud object stores and we use a fan-out pattern to quickly create copies or derivatives of the data in other regions.

A producer runs in one region and is responsible for sending data, either directed to just one other region (like active/passive) or distributed to all the other regions (like active/active). Sometimes, the producer is the consumer of another process.

The consumer might consist of a single active listener in one region receiving data from the producer while the consumers in the other regions are disabled. Alternatively, it might be multiple active listeners, one in each region receiving data from the producer. And sometimes the consumer is the producer of another process.

Finally, the target resource is where the data ends up. Target resources may be regional, such as a Redis cache, or global, such as a DynamoDB Global Table.

Figure 16: Fan-out patterns, to a global resource (left) and a regional resource (right).

In total, the fan-out pattern has eight variants, combining a directed or distributed producer, a single or multiple active consumers, and a regional or global target resource. Some of these variants are more useful than others.
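One concrete variant, sketched below under assumed bucket names, is a function triggered when a vendor drops an object into the local object store that copies the object to buckets in the other regions. Our production fan-out may rely on native replication or eventing instead of explicit copies.

```python
# Fan-out sketch: an S3-notification-style handler copies a newly written object
# to destination buckets in the other regions. Bucket names are hypothetical.
import boto3

DESTINATIONS = {
    "eu-west-2": "gifs-vendor-data-eu-west-2",
    "ap-southeast-2": "gifs-vendor-data-ap-southeast-2",
}

def handler(event, context):
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        for region, dest_bucket in DESTINATIONS.items():
            s3 = boto3.client("s3", region_name=region)
            s3.copy_object(
                Bucket=dest_bucket,
                Key=key,
                CopySource={"Bucket": source_bucket, "Key": key},
            )
```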

Fan-in pattern

Sometimes, there are specific resources which are available at a particular location or locations that perform specialist functions or have special connectivity. Examples include third-party products that require certain hardware that is only available in our data centers, or connectivity to financial services over dedicated lines. In the fan-in pattern, the target resource is always a regional resource, never a global resource.

Figure 17: Fan-in pattern showing three regions sending events/data to a consumer in a single region.

In contrast to the fan-out pattern, the fan-in pattern has multiple producers running in various regions, and a single consumer running at a single location. The location of the consumer in the fan-in pattern is unlikely to be a cloud provider region, and more likely to be a co-lo facility or corporate data center.

The producers running in all regions would send traffic to the consumer, and often the producers are part of a latency-based DNS resolution group.

The consumer is a single point of failure and should be architected to be as resilient as possible in the region in which it runs. Buffers may sit between the producer and the consumer to help protect the consumer from spikes in requests. The consumer may be the target resource, or more likely, it may be an integration component that interfaces with the target resource.
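A minimal fan-in sketch follows, assuming the buffer is a queue fronting the consumer’s location; the queue URL and region are placeholders. The buffer absorbs spikes from the regional producers while the single consumer drains it at its own pace.

```python
# Fan-in sketch: producers in every region push to one buffered queue in the
# consumer's location; a single consumer drains it. Queue URL is hypothetical.
import json
import boto3

CONSUMER_REGION = "us-east-1"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/fan-in-buffer"

def produce(message: dict) -> None:
    """Called from any region; all traffic funnels into the one buffer."""
    sqs = boto3.client("sqs", region_name=CONSUMER_REGION)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))

def consume(handle) -> None:
    """Single consumer, co-located with the specialist resource, drains the buffer."""
    sqs = boto3.client("sqs", region_name=CONSUMER_REGION)
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        handle(json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```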

Multi-point pattern

Event-driven processes often consist of multiple chains of components which act as both consumers and producers. The multi-point pattern provides high availability but also reduces the latency that could result from a workload continually hopping between regions. It achieves this by having specific points in the chain that are designated hopping points.

Figure 18: Multi-point pattern. The green line shows an optimal path, while the red line shows worst-case. Both are successful at updating the target resource.

In the multi-point pattern, different cloud technologies are used for different parts of the chain. The first stage of processing might use three serverless functions, the second stage might use two containers, and the third stage might run on three sets of virtual machines. Entire cloud regions rarely fail; it is far more likely that individual services within a region experience outages. So the multi-point pattern might run the first stage of three jobs in a region where containers are offline, then hop to a different region to run the two jobs that require containers.
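The sketch below illustrates the routing idea behind the multi-point pattern under assumed stage names and health data: the chain prefers to stay in its current region and hops only when the runtime needed by the next stage is unhealthy there.

```python
# Multi-point routing sketch: hop to another region only when the current region
# cannot run the next stage. Stage names and health data are illustrative.
STAGE_HEALTHY_REGIONS = {
    "stage-1-functions": {"us-east-1", "eu-west-2", "ap-southeast-2"},
    "stage-2-containers": {"eu-west-2", "ap-southeast-2"},  # containers degraded in us-east-1
    "stage-3-vms": {"us-east-1", "eu-west-2", "ap-southeast-2"},
}

def plan_path(stages, start_region):
    """Prefer the current region; hop only when forced (cf. Figure 18)."""
    region, path = start_region, []
    for stage in stages:
        healthy = STAGE_HEALTHY_REGIONS[stage]
        if region not in healthy:
            region = sorted(healthy)[0]  # naive choice; latency-aware in practice
        path.append((stage, region))
    return path

# Stage 2 cannot run in us-east-1, so the chain hops once and then stays put.
print(plan_path(["stage-1-functions", "stage-2-containers", "stage-3-vms"], "us-east-1"))
```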

The entire set of patterns is shown in this chart:

Figure 19: Deployment, Event-Driven, and Data Distribution Patterns.

Understand dependencies

Our dependencies fell into three areas: on-prem dependencies, third-party dependencies, and failover orchestration.

On-prem dependencies

We still use applications installed on-prem in our own data centers, so some of our systems run in a hybrid model. While we are working to transition more workloads into the cloud, we are also transitioning cloud workloads to multiple regions. We use several techniques, described in the “Fan-Out Patterns” section above, to ensure that our on-prem systems are able to leverage multi-region and reduce the number of single points of failure.

Third-party dependencies

Our investments and financial divisions interact with many third parties. For example, we pull stock market indices from some financial data vendors and provide details of our products to other financial institutions.

One file transfer mechanism that financial services companies have used for many years, and continue to use today, is Secure File Transfer Protocol (SFTP). In order to increase our availability, we host SFTP servers in multiple regions and use DNS routing to ensure an available server is always reachable. The SFTP servers use cloud object stores, and the cloud service provider’s native replication mechanisms ensure copies are distributed globally.

More recently, when financial services companies share a cloud service provider, there are often faster, more convenient, more cost-effective, and arguably more secure mechanisms for transferring data between organizations. Bypassing the SFTP servers, our partners and vendors write directly to our object stores. Again, the native replication mechanisms ensure object copies are distributed globally.

Figure 20: Multi-region support for vendors using both S3 and SFTP.
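For the object-store path, the replication can be the cloud provider’s own feature rather than anything we build. The sketch below shows one way to enable S3 cross-region replication with boto3, assuming versioning is already enabled on both buckets; the role ARN and bucket names are placeholders.

```python
# Minimal sketch of enabling native S3 cross-region replication on the bucket a
# vendor writes to. Versioning must already be enabled on source and destination.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.put_bucket_replication(
    Bucket="gifs-vendor-drop-us-east-1",                          # hypothetical source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder role ARN
        "Rules": [
            {
                "ID": "replicate-vendor-data-to-eu",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::gifs-vendor-drop-eu-west-2"},
            }
        ],
    },
)
```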

One question that is often asked is whether we need multi-region solutions on our side if the third party is not multi-region. It’s important to realize that service outages are more common than regional outages, and just because the service we use to process the data provided by the third party is down does not mean the service that the third party uses in the same region is down. One-sided multi-region solutions can still increase overall system availability.

Failover orchestration

While active/passive failover is relatively simple to orchestrate, the other patterns described above involve switching hubs, changing the origin of replication, and using “guards” to prevent data writes in secondary regions, all of which are more complex to coordinate.

We use microservices as the “guards,” but they need to be aware of the region in which they are running, and the primary and secondary regions. They would need to perform different actions depending on whether they were in the primary or secondary regions. In the primary region, both reads and writes are made locally, but in the secondary regions, reads are made locally but writes are forwarded to the primary region. The code must also support receiving forwarded writes. There may be a switch in progress though, so the primary region may not be ready to accept the write, the region which receives the request might not consider itself the primary region, or some other error could occur. The ability to handle errors and buffer the writes and then replay them later might be necessary. This is a significant amount of work for every development team and every time conditional logic is used, all paths must be tested. Testing how systems handle various failures is sometimes difficult to achieve.

Our “guards” solution was to build two components which work together and relieve the development team from the burden of dealing with multi-region enablement. One component was the Global Orchestration and Status Tool (GOaST) which maintains the state and coordinates the transitions of primary and secondary regions. The other component is the Global Multi-Region library (GMRlib). Although they work together, they were designed not to be dependent on each other so the system would continue to function if one failed.

Microservices include the GMRlib, which is initialized as part of startup and attempts to provide GOaST with status information. GOaST in return provides details of the primary and secondary regions, a list of times when the primary region will change, and when the GMRlib should next provide status information. This dynamic interval allows for infrequent interactions when things are stable, and more frequent interactions if failures occur. Note that there is nothing to prevent GMRlib from calling GOaST more frequently if it deems it necessary, for example if calls take longer or fail altogether.

Figure 21: GOaST and GMRlib abstract multi-region away from development teams, who only need to define operations initially, and later pass data when they invoke them by name.

Development teams using GMRlib do need to refactor their applications. Instead of interacting directly with the data stores, they must pre-define read and write operations, and later invoke those defined operations with the data required. This separation of operation from data enables much of the functionality required from GMRlib and allows the development team to offload multi-region support to the library. GMRlib knows which operation has to run and has the data required; it can run the operation immediately, buffer it, or forward it. Because every microservice has the same operations defined, GMRlib can forward write requests by sending the operation name and the data, without the development team needing to write code that explicitly handles write forwarding.
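The sketch below is an illustrative reconstruction of that idea, not Vanguard’s actual library code: operations are registered by name up front, reads always run locally, and writes either run locally, get forwarded to the primary by operation name, or are buffered and replayed if a regional switch is in progress.

```python
# GMRlib-style sketch: pre-defined named operations, with the library deciding at
# invocation time whether to run, forward, or buffer a write.
from collections import deque

class MultiRegionLib:
    def __init__(self, local_region, get_primary, forward):
        self._ops = {}                  # operation name -> (callable, is_write)
        self._buffer = deque()          # writes held while a region switch is in progress
        self.local_region = local_region
        self.get_primary = get_primary  # e.g. a call to a GOaST-like orchestration service
        self.forward = forward          # sends (region, op_name, data) to another region

    def define(self, name, fn, is_write=False):
        self._ops[name] = (fn, is_write)

    def invoke(self, name, data):
        fn, is_write = self._ops[name]
        if not is_write:
            return fn(data)                        # reads always run locally
        primary = self.get_primary()
        if primary == self.local_region:
            return fn(data)                        # this region currently owns writes
        if primary is None:                        # switch in progress: hold the write
            self._buffer.append((name, data))
            return None
        return self.forward(primary, name, data)   # forward by operation name plus data

    def replay_buffered(self):
        while self._buffer:
            name, data = self._buffer.popleft()
            self.invoke(name, data)
```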

Ensure operational readiness

An untested DR strategy is no DR strategy. How do we ensure that a failure will not result in an outage? The various patterns provide increasing levels of functionality, availability, and confidence in the strategy.

Figure 22: Steps towards full operational readiness.

Building a hub-and-spoke pattern supports a higher level of availability than a single region solution. It still allows read-only access in the secondary regions (spokes) if the primary region (hub) fails, although the data may become stale over time. Some business processes or data sets may not be sensitive to stale data, particularly when they rely on data from the stock market open or close, so they might still be fully operational.

Switchable hubs support full high availability: even if the primary region (hub) fails, one of the secondary regions takes over, and data will not get stale because another region handles writes. Regional failover, though, isn’t trivial. It requires awareness from the microservices and, potentially, orchestration of the data store if it is primary/secondary rather than multi-primary.

Follow-the-sun makes this regional failover part of the standard operation procedure and runs it three times a day. GOaST and GMRlib use the same processes to support regional switching in both the follow-the-sun and switchable hub patterns so we can be confident that regional failover will be successful because GOaST and GMRlib do it all the time for follow-the-sun.

In a globally-distributed system, we want to use every available data point to help us understand the state of the global system. Microservices use a function of the GMRlib to notify GOaST of any issues that they detect. This is currently designed to support human decision making, and will soon support automated failover. Over time we hope to train machine learning models on this data to predict failures before they occur and pre-emptively fail over.

One possible scenario is that region A will think that region B is the primary, but region B won’t realize it is the primary. We plan to have GOaST simulate such a scenario by sending the invalid status to the regions, and then checking the notifications it receives from the GMRlib.

Conclusion

Improving resiliency requires effort, and building a global multi-region solution that supports reads in all regions and writes in the primary region takes a significant amount of it. This approach leads to increased resiliency, a better user experience, and the elimination of single points of failure.

There is no doubt that it is costly to have extra copies of the infrastructure running. However, we run some critical workloads. If the multi-region solution keeps one of these critical workloads online in the event of a failure, the cost of the infrastructure will be covered for many years.

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 50 million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.
