An Elasticsearch Disaster Recovery Solution

Deepak Paliwal
Published in Globant
Jul 18, 2023 · 7 min read
Source: https://www.acronis.com/en-gb/blog/posts/disaster-recovery/

As a company's infrastructure grows, risks such as data loss, system downtime, and natural disasters increase drastically. In such scenarios, business continuity requires a well-designed Disaster Recovery plan that helps the infrastructure withstand these challenges.

A Disaster Recovery plan first evaluates risks, identifies assets, and lists backup strategies; the plan is then implemented, tested, and optimised to protect the infrastructure from unfortunate disaster events.

This article is the first in a two-part series, where we will dive into the details of Disaster Recovery (DR), Elasticsearch, and Elasticsearch Cross-Cluster Replication (CCR). We will cover topics ranging from the solution design to the implementation, and how to keep data replicating in the event of a disaster.

This article will discuss the following points:

  • Assumptions/Prerequisites
  • Elasticsearch
  • Elasticsearch Multi-region DR cluster architecture
  • Implementation
  • Conclusions
  • References

Assumptions/Prerequisites

There are a few assumptions or prerequisites needed for this article. Please make sure to have all suggested items below in place before starting.

  • An AWS Account with access to IAM, EC2 & VPC networking services.
  • Understanding of Linux systems and basic commands.
  • Understanding of Elasticsearch and its API commands.

Elasticsearch

Elasticsearch is a well-known search engine widely used for log analytics, full-text search, security intelligence, business analytics, and operational intelligence use cases. Elasticsearch provides features to mitigate disasters, one of which is cross-cluster replication of real-time data, leveraging indexes and index patterns.

In this article, we will learn how to set up 3-node Elasticsearch clusters, with nodes spread across multiple Availability Zones, and enable Cross-Cluster Replication between them.

Elasticsearch Multi-region DR cluster architecture

As a part of the Disaster Recovery plan, we created two Elasticsearch clusters in two AWS regions.

Below is the Elasticsearch Cross-Cluster Replication (CCR) architecture diagram, where we have created an Elasticsearch cluster comprising three nodes in the primary region (one master node and two data nodes). We have also deployed a dedicated application in the primary region to ingest data into Elasticsearch. We will replicate the same setup in the secondary region as part of the Disaster Recovery Plan.

For Cross-Cluster Replication, we leverage Elasticsearch's bi-directional cross-cluster replication approach, which ensures data replication between the clusters. To achieve this, we establish leader and follower indexes across the clusters.

Aliases are created for the indexes in both Elasticsearch clusters so that the application can continue its read and write operations on the indexes smoothly in the event of a disaster. Using aliases, we can read and write data across the clusters without referring to specific index names.

Here, data ingested in the primary cluster is continuously replicated to the secondary cluster, and vice versa, keeping both clusters/indexes in sync.

Failover and Fallback Scenario

In this section, we will look at the various stages of a disaster, their impact on our infrastructure, and how we can mitigate them with a parallel DR setup that is invoked immediately after the disaster.

1. Primary cluster down

In this scenario, we assume the primary cluster is down due to an unexpected event: CCR stops, and the link between the primary region application and the Elasticsearch cluster breaks. The application then points to the secondary cluster, which continues to support both read and write operations. The diagram below depicts this scenario:

Primary Cluster Down

2. Primary cluster up after outage

When the primary cluster comes back up, the CCR link and the app-to-cluster link resume automatically. As soon as the remote cluster connection resumes, the primary region application starts using the primary cluster for further ingestion. Once CCR resumes, the data in both clusters gets synchronized. The diagram below depicts this scenario:

Primary cluster up after outage
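
One way to confirm that replication has caught up once the link is restored is the CCR stats API, which reports the progress of every follower index. A minimal sketch, which can be run against either cluster:

# Check replication progress for all follower indexes
GET /_ccr/stats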

3. Primary region down

In this scenario, we assume that the whole environment is down, and the CCR link between clusters stops. The application deployed in the secondary region will start acting as the primary application and use the secondary Elasticsearch cluster as the primary for operations. The application deployed in the secondary region was idle until the event occurred. The diagram below depicts this scenario:

Primary region down

4. Primary Region up after outage

When the primary region is up, the CCR link between the clusters resumes, and data gets replicated automatically without manual intervention. Elasticsearch handles the data replication by itself.

Now the application in the primary region acts as the primary and will start working like before. The data replication will also resume from the primary to the secondary cluster. The diagram below depicts this scenario:

Primary region up after outage

Implementation

Below are the implementation steps to follow to set up Elasticsearch and enable cross-cluster replication between the clusters.

1. Elasticsearch Cluster Setup

Please follow the GIST below to configure the Elasticsearch clusters in both AWS regions.

https://gist.github.com/ideepakpaliwal/ae7f30ad7f33cd3e1e50034e83b579da#file-elasticsearch-cluster-setup-on-ec2-linux-machines-md

Now that both Elasticsearch clusters have been created, the next step is to configure the remote cluster settings in the Kibana UI of each cluster and create the connection needed to leverage the CCR feature.

2. Enabling cross-cluster replication

To use the Elasticsearch CCR feature, we must activate the product's license. Upon installation, we get a 30-day free trial; please activate it from the Kibana UI:

Kibana UI
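
If you prefer an API-based approach, the 30-day trial can also be started and verified with the license API instead of the Kibana UI (a minimal sketch):

# Start the 30-day trial license
POST /_license/start_trial?acknowledge=true

# Verify the currently active license
GET /_license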

3. Steps to set up Bi-directional Replication with Elasticsearch CCR

Please refer to the leader and follower index architecture shown below:

a. Define remote clusters

Remote cluster setup is required for both clusters (primary and secondary). We want to make sure our primary-cluster knows about the secondary-cluster, and vice versa.

We can set up remote clusters from the Kibana UI or through the remote cluster settings API:

Kibana UI — Adding remote cluster
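
As an alternative to the Kibana UI, the remote cluster can be registered through the cluster settings API. A minimal sketch, assuming the alias secondary-cluster and a placeholder seed address; the mirror-image request must be run on the secondary cluster with a primary-cluster entry:

# On the primary-cluster: register the secondary cluster as a remote cluster
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "secondary-cluster": {
          "seeds": ["<secondary-node-ip>:9300"]
        }
      }
    }
  }
}

# Verify that the remote connection is established
GET /_remote/info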

b. Creating indexes on both clusters for CCR

Create an index called users-dc1 on our primary Elasticsearch cluster. We will replicate this index from the primary cluster to the secondary cluster. On the primary-cluster:

# Create the users-dc1 index
PUT /users-dc1

Create an index called users-dc2 on our secondary cluster. We will replicate this index from the secondary cluster to the primary cluster. On the secondary-cluster:

# Create the users-dc2 index
PUT /users-dc2

c. Create follower indexes in both Clusters

Create a follower index for each leader index in the appropriate cluster.

The primary cluster will have users-dc2 as a follower index of the users-dc2 leader index in the secondary cluster. The secondary cluster will have users-dc1 as a follower index of the users-dc1 leader index in the primary cluster.

We can create the follower indexes from Kibana UI under the CCR section:

Kibana UI — Adding follower index
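
Alternatively, the follower indexes can be created with the CCR follow API. A minimal sketch, assuming the remote cluster aliases primary-cluster and secondary-cluster defined during the remote cluster setup above:

# On the primary-cluster: follow the users-dc2 leader index in the secondary cluster
PUT /users-dc2/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "secondary-cluster",
  "leader_index": "users-dc2"
}

# On the secondary-cluster: follow the users-dc1 leader index in the primary cluster
PUT /users-dc1/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "primary-cluster",
  "leader_index": "users-dc1"
}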

d. Define read aliases

We define the same read alias on the leader indexes of both clusters so that the application can search across both indexes without referencing the index names.

On the primary-cluster, add the alias to the index:

# Add the read alias to the users-dc1 index on the primary cluster.
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "users-dc1",
        "alias": "users"
      }
    }
  ]
}

On the secondary-cluster, add the alias to the index:

# Add the read alias to the users-dc2 index on the secondary cluster.
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "users-dc2",
        "alias": "users"
      }
    }
  ]
}
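
With the read alias in place, the application can search through the users alias on either cluster without knowing the underlying index name (a minimal sketch):

# Search via the read alias
GET /users/_search
{
  "query": {
    "match_all": {}
  }
}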

e. Define write aliases

Here, we define the same write alias on the leader index of each cluster, so the application can write using the alias instead of the index name. This keeps the application changes required for CCR to a minimum, since the alias determines the correct index to write to depending on the data center.

On the primary-cluster:

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "users-dc1",
        "alias": "users",
        "is_write_index": true
      }
    }
  ]
}

On the secondary-cluster:

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "users-dc2",
        "alias": "users",
        "is_write_index": true
      }
    }
  ]
}
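
With is_write_index set, the application can index documents through the alias and Elasticsearch routes them to the local write index. A minimal sketch; the document fields are purely illustrative:

# Index a document via the write alias
POST /users/_doc
{
  "name": "jane.doe",
  "created_at": "2023-07-18"
}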

f. Enable stack monitoring

This will help us look at the CCR-related metrics under the management tab in Kibana UI:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}

Conclusions

In this article, we discussed what a Disaster Recovery Plan is, why we need one, and walked through an Elasticsearch Disaster Recovery (DR) use case. We reviewed the different failover and fallback scenarios of a disaster, their impact on a single-region Elasticsearch deployment, and the DR setup that would handle all operations during an outage. We also learned how to create 3-node Elasticsearch clusters in multiple regions, spanning different subnets for each node.

Furthermore, we explored the Elasticsearch Cross-Cluster Replication (CCR) feature and leveraged it for our DR purposes. Lastly, we set up bi-directional replication between our Elasticsearch clusters, enabling near-real-time replication of index data across the clusters spanning multiple regions.

In the next part, we will try to simulate the Disaster Recovery use cases to see the impact and validate the point-in-time replication between the clusters. Stay tuned.

References
