Building a resilient Amazon OpenSearch cluster with AWS CDK (part 5)

Mikhail Chumakov
Life at Apollo Division

--

In the previous part, we set up cross-cluster replication and access for our Lambda functions, which sit behind AWS API Gateway. So, we have an infrastructure in two different regions, and all we need is to switch all incoming requests from one region to the other if the first one experiences an outage. But how do we know that something is wrong with our infrastructure? Health checks let us detect and respond to these kinds of problems automatically.

Many things can break an infrastructure, and there are many places in our system where we measure its health. What exactly should we check, and which options do we have for that?

Liveness checks — test the basic connectivity to a service. They are often performed by a load balancer or external monitoring agent, and they are unaware of the details of how an application works.

Local health checks — go further than liveness checks to verify that the application is likely to be able to function. These health checks test local resources, so they are unlikely to fail on many instances simultaneously (which prevents a false positive from spreading across the entire system). They test for conditions such as the inability to write to or read from disk, critical processes crashing or breaking, etc.

Dependency health checks — thoroughly inspect an application’s ability to interact with its dependencies (like database servers, AWS OpenSearch clusters, etc.). These checks ideally catch problems local to the service, but they can also produce false positives when the dependency itself has problems. Because of those false positives, we must be careful about how we react to dependency health check failures.

There are other health check types, but we don’t need them for our infrastructure.

Initially, we thought of going with the liveness check, since it was the most straightforward approach to implement and seemed to solve our problem. But in September 2022, there was an interesting incident with Systems Manager.

Our Lambda functions use Systems Manager to store some configuration parameters, and a liveness health check wouldn’t have helped us in that situation, since the system appeared almost healthy (it responded to user requests) but with increased latency and error rates. So, we decided to go with a deep health check. Deep health checks can combine liveness checks and local health checks with checks on the health of interactions with external dependencies that you define, such as databases or external APIs.

In our case, the deep health check will include a liveness check, a Systems Manager connection check, and an OpenSearch connection check. For that purpose, we will create a small Lambda function that reads some settings from the Systems Manager Parameter Store and tries to connect to the OpenSearch cluster.

Before we move on, let’s look at the overall architecture we are going to implement:

As you can see from the diagram above, a Route 53 health check needs an endpoint to call, and the DNS records use a failover routing policy.

Let’s start with the lambda handler implementation.

Note: the examples in this article are in .NET and C#, since we use them for all our backend services.

To check communication with the Parameter Store, we will use the .NET configuration provider for AWS Systems Manager (you can find more details in the official AWS repo). We will create small wrappers around the standard library so we can switch SSM off for local testing and configure a root parameter path.
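A minimal sketch of such a wrapper, assuming the Amazon.Extensions.Configuration.SystemsManager package; the extension method name and the Ssm:Enabled / Ssm:RootPath keys are illustrative, not the exact names from our codebase:

```csharp
using Microsoft.Extensions.Configuration;

public static class SsmConfigurationExtensions
{
    public static IConfigurationBuilder AddSystemsManagerIfEnabled(
        this IConfigurationBuilder builder)
    {
        // Build the configuration gathered so far (appsettings, environment
        // variables) to read the SSM-related switches.
        var settings = builder.Build();

        // Allows SSM to be switched off for local testing.
        if (!settings.GetValue<bool>("Ssm:Enabled"))
            return builder;

        // Root path under which all parameters of this service live,
        // e.g. "/my-service/production".
        var rootPath = settings.GetValue<string>("Ssm:RootPath");

        // AddSystemsManager comes from Amazon.Extensions.Configuration.SystemsManager.
        return builder.AddSystemsManager(rootPath);
    }
}
```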

Now we can add Systems Manager to our default configuration builder.
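For example, along these lines (reusing the illustrative wrapper from the snippet above):

```csharp
var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: true)
    .AddEnvironmentVariables()
    .AddSystemsManagerIfEnabled() // no-op when SSM is switched off locally
    .Build();
```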

The next step is configuring our lambda function handler’s host builder and registering all necessary dependencies.
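A condensed sketch of that wiring; the OpenSearch:Endpoint configuration key is an assumption, and the HealthCheckService class is shown right after:

```csharp
using System;
using Elasticsearch.Net;
using Elasticsearch.Net.Aws;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Nest;

var host = Host.CreateDefaultBuilder()
    .ConfigureAppConfiguration(builder => builder
        .AddEnvironmentVariables()
        .AddSystemsManagerIfEnabled()) // the wrapper shown earlier
    .ConfigureServices((context, services) =>
    {
        var endpoint = context.Configuration["OpenSearch:Endpoint"];

        services.AddSingleton<IElasticClient>(_ =>
        {
            // AwsHttpConnection (from Elasticsearch.Net.Aws) signs every
            // request with AWS Signature Version 4 using the credentials
            // of the Lambda execution role.
            var pool = new SingleNodeConnectionPool(new Uri(endpoint));
            var settings = new ConnectionSettings(pool, new AwsHttpConnection());
            return new ElasticClient(settings);
        });

        services.AddSingleton<HealthCheckService>();
    })
    .Build();
```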

A few things are worth mentioning here. As noted in the previous part, from this point on, all requests to our OpenSearch cluster must be signed with AWS Signature Version 4 (we use this package to do that); that’s why we register ElasticClient here with the particular connection class AwsHttpConnection. The HealthCheckService class is responsible for calling the OpenSearch health check endpoint. The code of this class is pretty simple.
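A minimal sketch of it, using the NEST cluster health API (the method name is illustrative):

```csharp
using System.Threading.Tasks;
using Elasticsearch.Net;
using Nest;

public class HealthCheckService
{
    private readonly IElasticClient _client;

    public HealthCheckService(IElasticClient client) => _client = client;

    public async Task<bool> IsOpenSearchHealthyAsync()
    {
        // GET _cluster/health: a valid response with a non-red cluster
        // status is treated as healthy.
        var response = await _client.Cluster.HealthAsync();
        return response.IsValid && response.Status != Health.Red;
    }
}
```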

Now, we can go to the final part and define the handler for our Lambda function.
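A sketch of what the handler can look like; the OpenSearch:Endpoint key and the SIMULATE_UNHEALTHY variable are illustrative names:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Amazon.Lambda.APIGatewayEvents;
using Microsoft.Extensions.Configuration;

public class HealthCheckHandler
{
    private readonly IConfiguration _configuration;
    private readonly HealthCheckService _healthCheckService;

    public HealthCheckHandler(IConfiguration configuration, HealthCheckService healthCheckService)
    {
        _configuration = configuration;
        _healthCheckService = healthCheckService;
    }

    public async Task<APIGatewayProxyResponse> HandleAsync(APIGatewayProxyRequest request)
    {
        // Lets us force a failing result to test the region switch.
        if (IsSimulationEnabled())
            return Respond(HttpStatusCode.ServiceUnavailable);

        // Parameter Store check: this value can only be present if the SSM
        // configuration provider loaded it successfully.
        var ssmHealthy = !string.IsNullOrEmpty(_configuration["OpenSearch:Endpoint"]);

        // OpenSearch check: call the cluster health endpoint.
        var openSearchHealthy = await _healthCheckService.IsOpenSearchHealthyAsync();

        return Respond(ssmHealthy && openSearchHealthy
            ? HttpStatusCode.OK
            : HttpStatusCode.ServiceUnavailable);
    }

    private static bool IsSimulationEnabled() =>
        bool.TryParse(Environment.GetEnvironmentVariable("SIMULATE_UNHEALTHY"), out var on) && on;

    private static APIGatewayProxyResponse Respond(HttpStatusCode status) =>
        new APIGatewayProxyResponse { StatusCode = (int)status };
}
```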

You might notice that we introduced a strange method, IsSimulationEnabled. We need it to simulate false health check results for testing purposes (e.g., we can force the desired health check result through an environment variable and verify that the region switch happens).

We are done with the Lambda handler; let’s move on to the CDK implementation of the infrastructure in the picture above. For brevity, some trivial CDK code snippets (like creating a Lambda function with CDK) will be skipped, and we will focus only on the essential parts.

Let’s assume we’ve created the Lambda function. We should remember that we’ve enabled fine-grained access control for our OpenSearch cluster, which means we must grant our Lambda function the corresponding permissions. To do that, we reuse the solution introduced in part 2 of this series: a custom resource and a Lambda that sends the related requests to the OpenSearch cluster.
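The exact construct lives in part 2, so the snippet below is only an outline of the idea; openSearchRequestProvider, the OpenSearch role name, and the property names are assumptions:

```csharp
using System.Collections.Generic;
using Amazon.CDK;

// Map the Lambda execution role to an OpenSearch role through the part-2
// custom resource backed by our request-sending Lambda.
var roleMapping = new CustomResource(this, "HealthCheckLambdaRoleMapping", new CustomResourceProps
{
    ServiceToken = openSearchRequestProvider.ServiceToken, // provider from part 2
    Properties = new Dictionary<string, object>
    {
        // PUT _plugins/_security/api/rolesmapping/<role> on the cluster.
        ["Path"] = "_plugins/_security/api/rolesmapping/health_check_role",
        ["Body"] = new Dictionary<string, object>
        {
            ["backend_roles"] = new[] { healthCheckLambda.Role!.RoleArn }
        }
    }
});
```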

Now, we should create an API that will be called by our Route 53 health check.
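A minimal sketch of that API in CDK, assuming healthCheckLambda is the function from above; the construct names are illustrative:

```csharp
using Amazon.CDK.AWS.APIGateway;

// REST API with a single GET /health route backed by the health check Lambda.
var api = new LambdaRestApi(this, "HealthCheckApi", new LambdaRestApiProps
{
    Handler = healthCheckLambda,
    Proxy = false // we expose only the /health route explicitly
});

var health = api.Root.AddResource("health");
health.AddMethod("GET");
```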

We are now close to the finish. Let’s put everything together and create a Route 53 health check.
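CDK has no L2 construct for Route 53 health checks, so we fall back to the L1 CfnHealthCheck. A sketch with a placeholder regional API domain:

```csharp
using Amazon.CDK.AWS.Route53;

var healthCheck = new CfnHealthCheck(this, "RegionHealthCheck", new CfnHealthCheckProps
{
    HealthCheckConfig = new CfnHealthCheck.HealthCheckConfigProperty
    {
        Type = "HTTPS",
        // The regional API Gateway domain exposing the /health route.
        FullyQualifiedDomainName = "api.eu-west-1.example.com",
        ResourcePath = "/health",
        Port = 443,
        RequestInterval = 30, // seconds between checks
        FailureThreshold = 3  // consecutive failures before "unhealthy"
    }
});
```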

We simply specify failover for the routing policy to create an active-passive failover configuration with one primary record and one secondary record. When the primary resource is healthy, Route 53 responds to DNS queries using the primary record; when it is unhealthy, Route 53 responds using the secondary record. The block of code below demonstrates how to do this.
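A sketch of the two failover records, assuming hostedZone and the regional domains from the previous snippet (all names are placeholders):

```csharp
// Primary record: used while the health check reports the region as healthy.
var primary = new CfnRecordSet(this, "PrimaryRecord", new CfnRecordSetProps
{
    HostedZoneId = hostedZone.HostedZoneId,
    Name = "api.example.com",
    Type = "CNAME",
    Ttl = "60",
    Failover = "PRIMARY",
    SetIdentifier = "primary",
    HealthCheckId = healthCheck.AttrHealthCheckId,
    ResourceRecords = new[] { "api.eu-west-1.example.com" }
});

// Secondary record: Route 53 answers with it when the primary is unhealthy.
var secondary = new CfnRecordSet(this, "SecondaryRecord", new CfnRecordSetProps
{
    HostedZoneId = hostedZone.HostedZoneId,
    Name = "api.example.com",
    Type = "CNAME",
    Ttl = "60",
    Failover = "SECONDARY",
    SetIdentifier = "secondary",
    ResourceRecords = new[] { "api.us-east-1.example.com" }
});
```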

Finally, it is time to run the deployment and start using the OpenSearch cluster. But if you think this is the end of the journey, you are mistaken. The next part will uncover other obstacles we encountered during the DR implementation.

We are ACTUM Digital and this piece was written by Mikhail Chumakov, Senior .NET Developer of Apollo Division. Feel free to get in touch.
