Multi-Site Clustering and Rack Awareness

Jagjeet Singh
Aerospike Developer Blog
10 min read · Nov 3, 2020

--

Aerospike has recently launched “Multi-Site Clustering” (MSC). This is a major release that greatly expands the ability to support globally distributed applications by leveraging Active-Active architecture, which ensures that writes are synchronously replicated across the sites and all the sites are available to receive reads and writes. Running a Multi-Site Cluster provides always-on, strongly consistent, globally distributed transactions at scale. It preserves strong consistency with no data loss and provides 100% availability during site failures. Multi-Site Clustering provides a true real-time Active-Active solution for global companies.

This blog explains read and write performance in a cluster spread across widely distributed sites. It covers the write-latency buckets and how read performance depends on the consistency level you choose. We’ll also see how MSC can reduce TCO significantly by avoiding cross-region data transfer charges for reads.

It also shows how average ~100ms write and sub-ms read performance is possible in an MSC spanning the US east and west coasts while achieving strongly consistent and resilient data. The observations are based on actual measurements and deployments at multiple clients.

Why Active/Active?

In the new digital environment, national and global brands must field applications that are always-on and perform in near real-time. This has led to the adoption of distributed technology for applications as well as the databases serving those applications. In order to best serve a wide geography and to gain higher levels of availability, organizations have distributed their digital presence across geographies, whether data centers of their own, or the zones represented by the public cloud vendors, or a combination of both.

I have personally worked with a few customers that have a mandate to adopt an Active-Active deployment architecture for all P1 (critical) applications.

Applications such as trade settlements, global supply chain management, currency exchanges, parcel tracking, smart contracts, and others typically require a highly resilient, geographically distributed database platform with strong consistency and reasonable runtime performance to meet their target service level agreements (SLAs).

The Aerospike Advantage

With Aerospike, your applications don’t need to cope with complex conflict detection and resolution scenarios. Aerospike actively prevents conflicting writes, so you don’t need to worry about lost updates or other data inconsistencies. With Aerospike, new types of applications involving globally distributed transactions are now feasible and relatively straightforward to implement.

Cross-Region Deployment

In a Multi-Site clustering configuration, the nodes comprising a single Aerospike cluster are distributed across sites. A site can be a physical rack in a data center, an entire datacenter, an availability zone in a cloud region, or a cloud region.

Here is an example of a cross-region deployment using replication factor 3. The Aerospike cluster has nodes in all three regions (West, Central, and East).

Rack Awareness

In a Multi-Site cluster, Aerospike’s Rack Aware (RA) feature forces replicas of data records (grouped in data partitions) to be stored on different hardware failure groups (i.e., different racks). Through data replication factor settings, administrators can configure each rack to store a full copy of all data. Doing so maximizes data availability and local read performance.

With an Aerospike cluster configured for Strong Consistency (SC), a Multi-Site cluster guarantees that all writes will be replicated across sites without data loss. Such a system can survive the loss of an entire site/region (rack) with no data loss and continue to operate. Therefore, such a cluster is an active/active configuration with strong consistency and availability during site failure scenarios.

The data distribution looks like this:

Record 1: Master(West Rack), Replica1(Central Rack), Replica2(East Rack)
Record 2: Master(Central Rack), Replica1(East Rack), Replica2(West Rack)
Record 3: Master(East Rack), Replica1(West Rack), Replica2(Central Rack)
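The listing above can be checked with a small sketch. The record names and rack labels are copied from the listing; the two helper functions are hypothetical illustrations and are not part of any Aerospike tool or API.

```python
# Placement copied from the listing above: master first, then replicas.
PLACEMENT = {
    "record1": ["West", "Central", "East"],
    "record2": ["Central", "East", "West"],
    "record3": ["East", "West", "Central"],
}


def copies_on_distinct_racks(placement):
    """True if no record keeps two of its copies on the same rack."""
    return all(len(set(racks)) == len(racks) for racks in placement.values())


def racks_holding_full_copy(placement):
    """Return the racks that store a copy of every record."""
    all_racks = {rack for racks in placement.values() for rack in racks}
    return sorted(
        rack for rack in all_racks
        if all(rack in racks for racks in placement.values())
    )


print(copies_on_distinct_racks(PLACEMENT))   # True
print(racks_holding_full_copy(PLACEMENT))    # ['Central', 'East', 'West']
```

With replication factor 3 and three racks, every rack ends up with a full copy, which is what makes purely local reads possible.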

The main trade-off is latency: reads can be served locally with low latency, while writes must cross sites.

  • Applications running on a given site can be configured to read with low latency from the local nodes running nearest to them. An entire copy of the cluster’s data is available in nodes in the local rack, region, or site.
  • Application writes might experience additional latency depending on the effective distance between the two sites, whether that is actual physical distance or latency caused by network configuration. For example, latency could be as little as a couple of milliseconds for sites that are a few miles apart on the ground. On the other hand, you can expect much higher latency for sites that are thousands of miles apart and connected via satellite link.

Setting up a Multi-Site Cluster is straightforward in a public cloud environment once you have VPC peering set up (another blog topic!).

I did a quick test and set up a cluster across two regions (US-West and US-East); AWS doesn’t have a US Central region. Each node can also be placed in a separate Availability Zone for HA within its region if needed.

A common practice is to ensure that the replication factor is set equal to the number of racks so that Apps can read data locally. A Multi-Site cluster relies on the distributed clustering algorithms intrinsic to Aerospike itself, independent of the distance between sites. We are using replication factor 2.

Let’s ensure that we have set up the rack-id appropriately.

Admin> show config like node, rack-id -flip
~test Namespace Configuration
NODE         rack-id
node1:3000   1
node2:3000   1
node3:3000   1
node4:3000   2
node5:3000   2
node6:3000   2

[root@node1 ~]# ping node4
PING node4 (10.1.0.16) 56(84) bytes of data.
64 bytes from node4 (10.1.0.16): icmp_seq=1 ttl=255 time=70.3 ms
64 bytes from node4 (10.1.0.16): icmp_seq=2 ttl=255 time=70.0 ms
64 bytes from node4 (10.1.0.16): icmp_seq=3 ttl=255 time=71.1 ms
64 bytes from node4 (10.1.0.16): icmp_seq=4 ttl=255 time=70.5 ms
64 bytes from node4 (10.1.0.16): icmp_seq=5 ttl=255 time=70.0 ms
64 bytes from node4 (10.1.0.16): icmp_seq=6 ttl=255 time=70.1 ms

As you can see, Node1–3 (US-West region) have rack-id 1, and Node4–6 (US-East) have rack-id 2. When the rack-aware feature is enabled by assigning a rack-id to each node in the cluster, Aerospike makes sure that the master and replica for a particular record are not stored in the same rack.

A quick ping test across regions (Node1 to Node4) shows the approximate latency. I have observed that latency fluctuates between 65ms and 75ms between the US-West and US-East regions.

When The Rubber Hits The Road

Let’s run some tests and see how the rack-aware feature works. The idea behind this blog is to focus on how read and write latency is affected by the cross-region deployment, and how the rack-aware feature can be used to achieve sub-millisecond latency for reads. This blog is not focused on achieving higher throughput or testing resiliency. Aerospike can be scaled vertically and horizontally.

My colleague Neel Phadnis has written a good blog on Resiliency in Aerospike Multi-Site Clusters. Please have a look if you are interested in learning about resiliency.

Write Workload

I am using the Aerospike Java benchmark to generate the workload. The client runs in the US-West region on its own virtual machine. I inserted 100K keys with a 1KB object size.

One interesting observation: as you can see, 50% of writes complete in under 71ms (the 64ms to 128ms bucket), and the other 50% complete in under 145ms (the 128ms to 256ms bucket).

Aerospike provides immediate write consistency. The client sends data to the master node, and the master node sends data to the replica[s]. The client is given the acknowledgment when data is successfully written to the master and all the replicas.

We can have 2 possible scenarios here.

Scenario 1: Master Record in the same region where client app is running. In our scenario, let’s call it Local-to-Remote path (shortest path).
Scenario 2: Master Record in the remote region. Remote-to-Remote path (longest path).

As we have learned, the client sends data to the master, and the master sends data to the replica. Let’s see how each scenario impacts response time.

Scenario 1

Scenario 1: Master and client in the same region. Local-to-Remote path (shortest path)

We assume that the master record is in the US-West region, where the client (c1) is running. The client reaches the master node in about 1ms, and the master takes about 70ms to send the data to the replica in the US-East region. Overall latency is around 71ms.

Scenario 2

Scenario 2: Master and client in different regions. Remote-to-Remote path (longest path)

Now let’s assume that the master record is in the remote region, US-East. The client in US-West needs approximately 70ms to reach the master in US-East, and the master needs another 70ms to reach the replica in US-West. Overall latency is under 145ms.

That’s why 50% of write transactions should complete in under 71ms and the remaining 50% in under 145ms, for an average write latency of roughly 105ms.
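The arithmetic behind the two scenarios can be written down as a rough model. This is only a sketch: it treats the ping times as the whole cost of each hop and ignores server processing time, and the function names are made up for illustration.

```python
LOCAL_RTT_MS = 1.0          # client to a node in its own region (from the text)
CROSS_REGION_RTT_MS = 70.0  # US-West to US-East, from the ping test


def write_latency_ms(master_is_local):
    """Client-to-master hop plus master-to-replica hop, replication factor 2."""
    client_to_master = LOCAL_RTT_MS if master_is_local else CROSS_REGION_RTT_MS
    # With two racks, the replica always sits in the other region from the master.
    master_to_replica = CROSS_REGION_RTT_MS
    return client_to_master + master_to_replica


def expected_write_latency_ms(local_master_fraction=0.5):
    """Average latency when a given fraction of masters is in the client's region."""
    local = write_latency_ms(True)
    remote = write_latency_ms(False)
    return local_master_fraction * local + (1 - local_master_fraction) * remote


print(write_latency_ms(True))        # 71.0  (Scenario 1)
print(write_latency_ms(False))       # 140.0 (Scenario 2)
print(expected_write_latency_ms())   # 105.5
```

The model lands close to the measured buckets, which suggests that network round trips dominate write latency in this setup.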

What if you have three regions and use replication factor 3? The master node sends data to all the replicas simultaneously, so I still expect overall latency to stay in approximately the same range. However, there will be an extra acknowledgment message sent back to the replicas, and we can expect increased fabric traffic and SSD write load.
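A rough way to reason about the three-region case is to let the slowest replica set the wait, since the replicas are written in parallel. The sketch below assumes hypothetical round-trip times for the Central region; only the West to East figure of about 70ms comes from the ping test, and the model ignores processing time.

```python
# Hypothetical round-trip times in milliseconds. Only West<->East (~70 ms)
# comes from the ping test; the Central figures are placeholders.
RTT_MS = {
    frozenset(["West", "Central"]): 35.0,
    frozenset(["Central", "East"]): 35.0,
    frozenset(["West", "East"]): 70.0,
}
LOCAL_RTT_MS = 1.0
REGIONS = ["West", "Central", "East"]


def rtt(a, b):
    """Round-trip time between two regions (or within one region)."""
    return LOCAL_RTT_MS if a == b else RTT_MS[frozenset([a, b])]


def write_latency_ms(client_region, master_region):
    """Client-to-master hop plus the slowest master-to-replica hop."""
    replicas = [r for r in REGIONS if r != master_region]
    # Replicas are written in parallel, so the slowest one sets the wait.
    slowest_replica = max(rtt(master_region, r) for r in replicas)
    return rtt(client_region, master_region) + slowest_replica


latencies = {m: write_latency_ms("West", m) for m in REGIONS}
print(latencies)
print(sum(latencies.values()) / len(latencies))
```

Under these placeholder numbers a client in West sees 71ms, 70ms, and 140ms depending on where the master lives, which is in the same range as the two-region case.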

Read Workload

Let’s have a look at the Reads. Aerospike offers four read consistency modes. Irrespective of the mode you choose, you won’t get dirty reads and successfully written data will not be lost.

  1. Linearize / Global Consistency: imposes a linear view of the data on all clients that access it. This is the strictest read mode. Aerospike does additional coordination on every read to avoid stale reads. For example, once an application completes a write operation, all later reads by any application will see the value of that write.
  2. Session Consistency (default): is more practical. It guarantees that all reads and writes from the same session are ordered, so the same application will read its own writes. Other applications may or may not (a very rare scenario), depending on the timing of operations.
  3. Allow replica: allows reads from the master or any replica. This could violate session consistency in rare cases.
  4. Allow unavailable: allows reads from certain partitions even though the partition is unavailable. However, this mode is out of scope for this blog.

Next, I will show side by side performance comparison of choosing various consistency modes.

Linearize / Global Consistency

As described above, Linearize consistency gives a linearized view of all the data. To avoid stale reads, however, the master node coordinates with each replica on every read. I expect latency similar to writes, because every read involves a cross-region hop.

As you can see, the performance profile is similar to the write workload: 50% of reads are under 71ms (the 64ms to 128ms bucket), and the other 50% are under 145ms (the 128ms to 256ms bucket). Throughput stays under 3,000 TPS.

Session Consistency

Session Consistency is expected to perform better than linearize mode. It comes with a very rare scenario where strong consistency may be violated. Session Consistency reads from the master. We can assume 50% of masters in the local region and 50% in the remote region.

Testing shows that approximately 50% of read transactions complete in under 1ms and the other 50% in under 71ms (the 64ms to 128ms bucket).
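The difference between the two modes covered so far can be sketched with the same round-trip numbers. This is a simplified model, not a description of Aerospike internals: it assumes a Session read costs one hop to the master, and a Linearize read costs that hop plus one cross-region round trip for the master to coordinate with the replica. The function name and mode strings are illustrative only.

```python
LOCAL_RTT_MS = 1.0          # client to a node in its own region
CROSS_REGION_RTT_MS = 70.0  # US-West to US-East, from the ping test


def read_latency_ms(mode, master_is_local):
    """Rough per-read latency for the two modes that always go to the master."""
    client_to_master = LOCAL_RTT_MS if master_is_local else CROSS_REGION_RTT_MS
    if mode == "session":
        return client_to_master
    if mode == "linearize":
        # The master also coordinates with the replica in the other region.
        return client_to_master + CROSS_REGION_RTT_MS
    raise ValueError(f"unsupported mode: {mode}")


for mode in ("linearize", "session"):
    local = read_latency_ms(mode, True)
    remote = read_latency_ms(mode, False)
    # Average assumes half of the masters are in the client's region.
    print(mode, local, remote, (local + remote) / 2)
# linearize 71.0 140.0 105.5
# session 1.0 70.0 35.5
```

The model reproduces the shape of the measurements: Linearize reads look like writes, while Session reads split between about 1ms and about 70ms.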

Two distinct financial institutions in the United States and Europe are using Aerospike Multi-Site clusters (Session Consistency) to transfer money between member banks within seconds. In each case, Aerospike stores the state of payment transactions and guarantees the immediate consistency of data shared across geographically distributed applications.

Allow Replica

We are most interested in exploring the Allow replica consistency mode.

Applications running on a given site can be configured to read with low latency from the rack located at the same site. An entire copy of the cluster’s data is available in nodes in the local rack. The application can choose a preferred rack.

Allow replica mode can also read from the master if a replica doesn’t contain all current data because it is undergoing data migrations. Aerospike automatically switches to the master as and when needed. In rare cases, a session may get a stale read (but never a dirty read) while the cluster is recovering and migrations are in progress.

Please note the parameters I passed for this workload: -rackId 1 (US-West) and -readModeSC set to “allow_replica”. An application can similarly be configured to specify its preferred rack.

We could achieve 620k reads per second at under 1ms latency. The screenshot above shows that reads (Ops/Sec) are being served by Node1–3, because I passed “rackId 1”, which is the US-West region.

I have compiled the table below.

Again, the focus of this blog isn’t on throughput. One can get higher throughput by adding more clients/threads for Linearize and Session consistency. I wanted to do a quick comparison of read performance.

Allow replica offers sub-millisecond reads and lowers your overall TCO by avoiding cross-region data transfer charges. One of our customers saved millions of dollars by reading data from the local region.
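To get a feel for where the savings come from, the volume of read traffic that crosses regions can be estimated. The sketch below is illustrative only: the read rate is a hypothetical workload rather than a measured figure, it counts only record payload, and it assumes that half of the masters are remote under Session consistency and that Allow replica with a preferred rack serves every read locally. Multiply the result by your provider’s cross-region transfer price to get a cost.

```python
SECONDS_PER_DAY = 86_400


def cross_region_read_bytes_per_sec(reads_per_sec, record_bytes, remote_fraction):
    """Payload bytes per second that must cross regions to serve reads."""
    return reads_per_sec * record_bytes * remote_fraction


READS_PER_SEC = 100_000   # hypothetical workload, not a measured figure
RECORD_BYTES = 1024       # 1KB objects, as in the benchmark above

# Session consistency: reads go to the master, about half of which are remote.
session = cross_region_read_bytes_per_sec(READS_PER_SEC, RECORD_BYTES, 0.5)
# Allow replica with a preferred rack: reads are served from the local rack.
allow_replica = cross_region_read_bytes_per_sec(READS_PER_SEC, RECORD_BYTES, 0.0)

for name, rate in (("session", session), ("allow_replica", allow_replica)):
    gb_per_day = rate * SECONDS_PER_DAY / 1e9
    print(f"{name}: {gb_per_day:.1f} GB/day across regions")
```

Even at this modest hypothetical rate, Session reads would move several terabytes per day between regions, while local reads move none.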

Summary

Aerospike is uniquely positioned to deliver a compelling, cost-effective solution. Simply put, Aerospike enables firms to deploy a single cluster across multiple geographies with high resiliency, automated failovers, and no loss of data. Aerospike can be run in active/active configurations regardless of whether the data is stored in a private cloud, a public cloud, or any combination of both.
