Ripple effect in Aerospike cluster: Production issue debugging

Ghanshyam Verma · Published in Naukri Engineering · Nov 30, 2021

At Infoedge, we use Aerospike for many distributed caching use cases and run multiple clusters in production. In this post, we discuss an incident that happened on one of these clusters.

A dependent application was getting timeout errors, indicating some kind of slowness in Aerospike. On further debugging, we found it was the ripple effect of an SSD malfunctioning on a particular node. Let's deep dive into it.

In Naukri, Aerospike is a critical component that provides fast access for read-heavy traffic. We make heavy use of Aerospike features like multi-bin records and secondary indexes to solve business problems at scale. The reason behind choosing Aerospike: it is a multi-threaded, distributed caching solution that claims to deliver 1 million TPS from a single node.
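For illustration, here is a minimal sketch with the Aerospike Java client showing a multi-bin record and a secondary-index query. The namespace, set and bin names are hypothetical, and a secondary index on the city bin is assumed to already exist:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class MultiBinExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("aerospike-host", 3000);

        // Write several bins (fields) under one key.
        Key key = new Key("userProfiles", "profiles", "user:42");
        client.put(null, key, new Bin("name", "alice"), new Bin("city", "delhi"), new Bin("age", 30));

        // Read back only the bins we need.
        Record record = client.get(null, key, "name", "city");
        System.out.println(record);

        // Query via the (assumed) secondary index on the "city" bin.
        Statement stmt = new Statement();
        stmt.setNamespace("userProfiles");
        stmt.setSetName("profiles");
        stmt.setFilter(Filter.equal("city", "delhi"));
        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                System.out.println(rs.getRecord());
            }
        } finally {
            rs.close();
        }
        client.close();
    }
}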

As it is a clustered solution, it provides automatic data replication, which gives both fault tolerance and scale. The number of replicas can be configured at the namespace level via the replication factor. The cluster can operate as Available and Partition-tolerant (AP) or Consistent and Partition-tolerant (CP), depending on configuration. By default, Aerospike operates as AP:

  1. Read transactions fetch from the master copy of the record.
  2. Write transactions write locally on the master, synchronously write to all replicas, and only then return success (see the sketch after this list).
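As a rough sketch, these behaviors map to policies in the Aerospike Java client (the policy objects below are illustrative, not our production configuration):

import com.aerospike.client.policy.CommitLevel;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;
import com.aerospike.client.policy.WritePolicy;

public class DefaultPolicies {
    public static void main(String[] args) {
        // Route reads to the master copy of the partition.
        Policy readPolicy = new Policy();
        readPolicy.replica = Replica.MASTER;

        // Return success only after all replicas have been written (the default commit level).
        WritePolicy writePolicy = new WritePolicy();
        writePolicy.commitLevel = CommitLevel.COMMIT_ALL;
    }
}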

Incident:

In Naukri, we are on an SOA architecture with more than 200 microservices. An architecture at this scale mandates monitoring and observability. We use ELK and the SolarWinds APM tool for this.

We got timeouts from one of the services, and although we have SolarWinds for APM, the slowness was not detected in it. So we used the old-school practice of adding profile logging via Spring AOP. On analyzing the logs, we found the slowness was coming from Aerospike, which is why SolarWinds was unable to detect it: it does not instrument the Aerospike codebase.
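A minimal sketch of the kind of profiling aspect we are talking about (the pointcut and package names here are hypothetical): it simply logs how long each data-access call takes, which is enough to spot slow Aerospike calls.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class ProfilingAspect {
    private static final Logger log = LoggerFactory.getLogger(ProfilingAspect.class);

    // Wrap every method in the (hypothetical) DAO package and log its wall-clock time.
    @Around("execution(* com.example.dao..*(..))")
    public Object profile(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();
        } finally {
            log.info("{} took {} ms", pjp.getSignature().toShortString(),
                    System.currentTimeMillis() - start);
        }
    }
}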

So we were getting slowness from the Aerospike cluster, from all the nodes. We wanted to be sure that it was not application-to-cluster network slowness and that the slowness was on the Aerospike cluster itself. We analyzed the latencies via the Aerospike asloglatency utility.

Fig 1: Incident - Aerospike 3-node cluster

Useful commands:

asloglatency: Aerospike, by default, logs histograms around reads, writes and queries. Via asloglatency we can convert these histograms into latency measurements.

Command: asloglatency -l AEROSPIKE-LOG-FILENAME -h write -N NAMESPACE -f 'Oct 14 2021 10:10:00' -d 180 -n 12 -e 1

With default logging, asloglatency gives high-level information about read and write latency. All 3 nodes showed higher latency on writes. To get deeper details about latency, Aerospike provides the option of micro-benchmarking.

Aerospike micro-benchmarking: On enabling it, Aerospike dumps end-to-end measurements for each slice of the enabled transaction type. E.g. enabling it for writes gives the latency of the wait time in the incoming queue, de-dup resolution time, time to write to disk, time to write to the replicas (write-repl-write) and response return time. With this, it is easy to drill down to the root cause of the slowness.

These can be enabled via the following commands:

asinfo -v 'set-config:context=namespace;id=NAMESPACE;enable-benchmarks-write=true'
asinfo -v 'set-config:context=namespace;id=NAMESPACE;enable-benchmarks-read=true'
asinfo -v 'set-config:context=namespace;id=NAMESPACE;enable-benchmarks-storage=true'

Observations:
Node 1

Fig 2: Node 1 asloglatency output for read, write-master, write-repl-write

Node 2:

Fig 3: Node 2 asloglatency output for read, write-master, write-repl-write
  1. The cluster has 3 nodes, and the service was getting high write latency from all of them.
  2. Read latency: At 17:08:32, around 4.01% of requests took more than 512 ms on Node 1. On Node 2, all read requests were served under 64 ms.
  3. Write-master latency: At 17:08:42, around 4.17% of requests took more than 512 ms on Node 1. On Node 2, all write-master transactions finished under 32 ms.
  4. Write-replica latency (write-repl-write): At 17:08:52, around 2.74% of requests took more than 512 ms on Node 1. On Node 2, only 0.62% of requests took more than 512 ms.
  5. I/O wait on all nodes was in the same range, below 1%.

Analysis:

The SSD on Node 1 was responding slowly, because of which reads and writes slowed down on Node 1. Writes on Node 2 and Node 3 also slowed down because every write synchronously replicates to Node 1, which is why write-repl-write latency was on the higher side on those two nodes.

Conclusion:

The SSD on one node malfunctioned and added latency to writes hitting that disk. Because Aerospike writes synchronously to replicas, the cluster as a whole started responding with higher write latencies, a ripple effect onto the sibling nodes.
We verified this by removing the faulty node from the cluster, and all write latency got resolved after that. Later, we replaced the faulty node with a new one.
