Building a resilient Amazon OpenSearch cluster with AWS CDK (part 1)

Mikhail Chumakov
Life at Apollo Division

--

Disaster recovery (DR) is a set of methods for responding to and recovering from an event that negatively affects business operations. AWS Global Infrastructure provides different ways to design and operate applications and databases that fail over automatically and without interruption. In this blog series, we will focus on the caveats and problems we encountered while implementing DR with AWS CDK for one AWS-managed service: Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service).

To prevent data loss and minimize Amazon OpenSearch cluster downtime in the event of a service disruption, you can distribute nodes across two or three Availability Zones in the same Region, a configuration known as Multi-AZ. Availability Zones are isolated locations within each AWS Region. You can also add dedicated master nodes to avoid cluster downtime in case of AZ disruption.
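
As a reference point, here is a minimal CDK sketch (TypeScript, aws-cdk-lib v2) of such a Multi-AZ domain with dedicated master nodes; the instance types, node counts, and volume size are illustrative rather than our production values:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

export class SearchClusterStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new opensearch.Domain(this, 'Domain', {
      version: opensearch.EngineVersion.ELASTICSEARCH_7_10,
      // Spread data nodes across three Availability Zones (Multi-AZ).
      zoneAwareness: { enabled: true, availabilityZoneCount: 3 },
      capacity: {
        dataNodes: 3,
        dataNodeInstanceType: 'r6g.large.search',
        // Dedicated master nodes keep the cluster manageable if one AZ is disrupted.
        masterNodes: 3,
        masterNodeInstanceType: 'm5.large.search',
      },
      ebs: { volumeSize: 100 },
    });
  }
}
```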

We had implemented DR for our cluster this way before, and it worked well for us until December 2021, when failover within a single Region was no longer enough. You can read more about that event in the AWS Post-Event Summary.

There are many ways to implement DR for an OpenSearch cluster, and you can read about them in detail in this series. In our case, we decided to follow the active-passive strategy and run our cluster in two Regions.

Before diving into implementation details, let’s look at the architecture we have.

Conceptual Architecture

As you can see, we have two clusters serving different business needs. Also, to follow best practices for fine-tuning access control in Amazon OpenSearch Service, we control access to a cluster at several levels.

At the network level, we put the cluster in a VPC, in a private subnet. In the same private subnet, we put the Lambda functions that need access to the cluster and an EC2 instance (we will explain below why we need it).
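
A sketch of that network-level setup, assuming a CDK v2 stack; the VPC layout, Lambda runtime, and asset path are illustrative:

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

export class NetworkLevelStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 3 });

    // The domain lives only in private subnets and gets no public endpoint.
    const domain = new opensearch.Domain(this, 'Domain', {
      version: opensearch.EngineVersion.ELASTICSEARCH_7_10,
      vpc,
      vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
      zoneAwareness: { enabled: true, availabilityZoneCount: 3 },
      capacity: { dataNodes: 3 },
    });

    // A Lambda that talks to the cluster is placed in the same private subnets.
    const indexer = new lambda.Function(this, 'Indexer', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      vpc,
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      timeout: Duration.seconds(30),
    });

    // Allow the Lambda's security group to reach the domain over HTTPS.
    domain.connections.allowFrom(indexer, ec2.Port.tcp(443));
  }
}
```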

At the domain access policy level, after a request reaches a domain endpoint, the resource-based access policy allows or denies it access to a given URI. The access policy accepts or rejects requests at the “edge” of the domain, before they reach OpenSearch itself.
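
In CDK terms, such a resource-based policy can be attached to the domain with addAccessPolicies; the principal ARN and the allowed actions below are placeholders for illustration:

```typescript
import * as iam from 'aws-cdk-lib/aws-iam';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

// 'domain' is the opensearch.Domain construct defined elsewhere in the stack;
// the role ARN below is purely illustrative.
declare const domain: opensearch.Domain;

domain.addAccessPolicies(
  new iam.PolicyStatement({
    effect: iam.Effect.ALLOW,
    principals: [new iam.ArnPrincipal('arn:aws:iam::123456789012:role/search-api-role')],
    actions: ['es:ESHttpGet', 'es:ESHttpPost', 'es:ESHttpPut'],
    resources: [`${domain.domainArn}/*`],
  }),
);
```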

Now, let’s look at what we want to build and jump into the DR implementation for our infrastructure.

Amazon OpenSearch Service offers cross-cluster replication, which suits the active-passive strategy we chose above. With cross-cluster replication, we can replicate indexes, mappings, and metadata between OpenSearch Service domains in different Regions. So, to avoid cluster downtime in case of a Region outage, we will roll out an additional (passive) cluster in a secondary Region and replicate data between the two.
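
A minimal sketch of how the active and passive domains could be declared from one CDK app; the Region names, stack names, and the SearchClusterStack import refer to the earlier illustrative stack and are our assumptions, not fixed requirements:

```typescript
import { App } from 'aws-cdk-lib';
// Hypothetical stack module from the earlier Multi-AZ sketch.
import { SearchClusterStack } from './search-cluster-stack';

const app = new App();

// Active (primary) domain serving live traffic.
new SearchClusterStack(app, 'SearchActive', {
  env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: 'us-east-1' },
});

// Passive (follower) domain in the secondary Region; cross-cluster
// replication keeps its indexes in sync with the active domain.
new SearchClusterStack(app, 'SearchPassive', {
  env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: 'us-west-2' },
});

app.synth();
```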

Here are the prerequisites for setting up cross-cluster replication (a CDK sketch covering them follows the list):

  • Your domain must run Elasticsearch 7.10 or OpenSearch 1.1 or later. This was not an issue for us because we already had a domain on Elasticsearch 7.10.
  • Node-to-node encryption must be enabled.
  • Fine-grained access control must be enabled. This is where we hit most of the problems.
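
Put together, these prerequisites translate into a handful of domain properties in CDK. The helper function and the admin role below are hypothetical; note that fine-grained access control in turn requires encryption at rest and enforced HTTPS:

```typescript
import * as iam from 'aws-cdk-lib/aws-iam';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';
import { Construct } from 'constructs';

// Hypothetical helper that configures the prerequisites listed above;
// 'scope' would typically be the stack, and the admin role is illustrative.
export function createReplicableDomain(scope: Construct, adminRole: iam.IRole): opensearch.Domain {
  return new opensearch.Domain(scope, 'ReplicableDomain', {
    version: opensearch.EngineVersion.ELASTICSEARCH_7_10, // Elasticsearch 7.10 / OpenSearch 1.1+
    nodeToNodeEncryption: true,                           // prerequisite
    // Fine-grained access control additionally requires these two settings.
    encryptionAtRest: { enabled: true },
    enforceHttps: true,
    fineGrainedAccessControl: {
      masterUserArn: adminRole.roleArn,                   // IAM principal as master user
    },
  });
}
```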

Fine-grained access control (FGAC) is the third and final layer in the multilayered OpenSearch Service security model. After a resource-based access policy allows a request to reach a domain endpoint, FGAC evaluates the user credentials and either authenticates the user or denies the request. If FGAC authenticates the user, it fetches all roles mapped to that user and uses the complete set of permissions to determine how to handle the request. There are three options for the underlying authentication method:

  • Cognito authentication for OpenSearch Dashboards
  • Internal user database
  • SAML authentication with an external identity provider

In our case, we will use the first option, Cognito authentication for OpenSearch Dashboards, because we want to use IAM for user management.
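
A sketch of wiring Cognito authentication for OpenSearch Dashboards in CDK, assuming a dedicated user pool and identity pool; the construct names and the Cognito domain prefix are illustrative:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

export class DashboardsAuthStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // User pool and identity pool that back the Dashboards login page.
    const userPool = new cognito.UserPool(this, 'DashboardsUserPool');
    userPool.addDomain('Domain', {
      cognitoDomain: { domainPrefix: 'my-search-dashboards' }, // illustrative, must be globally unique
    });
    const identityPool = new cognito.CfnIdentityPool(this, 'DashboardsIdentityPool', {
      allowUnauthenticatedIdentities: false,
    });

    // Role that lets OpenSearch Service configure the Cognito resources for Dashboards.
    const cognitoConfigRole = new iam.Role(this, 'CognitoAccessRole', {
      assumedBy: new iam.ServicePrincipal('es.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonOpenSearchServiceCognitoAccess'),
      ],
    });

    new opensearch.Domain(this, 'Domain', {
      version: opensearch.EngineVersion.ELASTICSEARCH_7_10,
      cognitoDashboardsAuth: {
        userPoolId: userPool.userPoolId,
        identityPoolId: identityPool.ref,
        role: cognitoConfigRole,
      },
      // ...plus the fine-grained access control and encryption settings shown earlier.
    });
  }
}
```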

With the above in mind, we can move on to the implementation.

We are ACTUM Digital and this piece was written by Mikhail Chumakov, Senior .NET Developer of Apollo Division. Feel free to get in touch.
