Building a resilient Amazon OpenSearch cluster with AWS CDK (part 2)

Mikhail Chumakov
Life at Apollo Division

--

In the previous part, we chose a strategy and covered some key points; now let's move on to implementation.

Implementing DR with AWS CDK

Creating a fully configured, secure AWS Elasticsearch cluster with Kibana using Cognito authentication is doable with the AWS CDK. A Git repo with an example from the AWS team gives you a good starting point. In this article, we focus on the main problems and caveats we ran into.

To keep things simple, the code examples in this article will mainly cover a single cluster in a single region, unless explicitly stated otherwise.

The first step is to create and configure the Cognito user and identity pools. For simplicity, we will use the same roles and security settings for both OS domains (in real projects, set up security properly and according to best practices). We've wrapped the code up in a stack and will extend it throughout this article. Your Amazon Cognito user pool must have a domain name; OpenSearch Service uses this domain name to redirect users to a login page for accessing Dashboards. Besides a domain name, the user pool doesn't require any non-default configuration.
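A minimal sketch of this stack could look as follows (construct IDs and the dr-demo-auth domain prefix are illustrative):

```ts
import * as cdk from 'aws-cdk-lib';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import { Construct } from 'constructs';

export class DrCognitoStack extends cdk.Stack {
  public readonly userPool: cognito.UserPool;
  public readonly identityPool: cognito.CfnIdentityPool;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // The user pool itself needs no non-default configuration...
    this.userPool = new cognito.UserPool(this, 'DrUserPool');

    // ...except a domain name, which OpenSearch Service uses to redirect
    // users to the Dashboards login page.
    this.userPool.addDomain('DrUserPoolDomain', {
      cognitoDomain: { domainPrefix: 'dr-demo-auth' },
    });

    // Two extra app clients; not needed for DR, but they mimic a realistic
    // pool (in our case, Cognito also secures an API Gateway).
    this.userPool.addClient('ApiClientOne');
    this.userPool.addClient('ApiClientTwo');

    // The identity pool; OpenSearch Service will later register its own
    // authentication provider here.
    this.identityPool = new cognito.CfnIdentityPool(this, 'DrIdentityPool', {
      allowUnauthenticatedIdentities: false,
    });
  }
}
```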

Notice that the code creates two user pool clients. Those clients are not needed in terms of disaster recovery and are created just to demonstrate a more realistic situation (in our case, Cognito is used for securing API Gateway). We will come back to these clients later in this article.

In the next step, we will prepare basic roles for different access levels to the OpenSearch service and for access to Cognito.

OpenSearch Service needs access to Cognito to create and configure an app client in the user pool and an identity in the identity pool. To make this possible, we create a role with the corresponding permissions for OpenSearch.
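A sketch of such a role; the AmazonOpenSearchServiceCognitoAccess managed policy exists for exactly this purpose (the construct ID is illustrative):

```ts
import * as iam from 'aws-cdk-lib/aws-iam';

// Role that OpenSearch Service assumes to create and configure the app
// client in the user pool and the identity in the identity pool.
const cognitoAccessRole = new iam.Role(this, 'OsCognitoAccessRole', {
  assumedBy: new iam.ServicePrincipal('es.amazonaws.com'),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName(
      'AmazonOpenSearchServiceCognitoAccess',
    ),
  ],
});
```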

Roles are the core way of using fine-grained access control. In this case, roles are distinct from IAM roles. Roles contain any combination of permissions: cluster-wide, index-specific, document level, and field level. After configuring a role, you map it to one or more users (or IAM roles).

Users are people or applications that make requests to the OpenSearch cluster. Users have credentials (either IAM access keys or a username and password) that they specify when they make requests. With fine-grained access control on Amazon OpenSearch Service, you choose one or the other for your master user when you configure your domain. The master user has full permissions on the cluster and manages roles and role mappings. In our case, we've chosen IAM for user management; moreover, we would like to do the initial security configuration from CDK using a Lambda function. For that, we create a corresponding role for this Lambda and set it as the master-user role of the OpenSearch domain.
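A sketch of the Lambda role, assuming the function itself is defined elsewhere in the stack:

```ts
import * as iam from 'aws-cdk-lib/aws-iam';

// Role for the Lambda that performs the initial security configuration.
// It is also set as the master-user role of the OpenSearch domain.
const masterUserRole = new iam.Role(this, 'OsMasterUserRole', {
  assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
  managedPolicies: [
    // The function runs inside the domain's VPC, so it needs permissions
    // to manage ENIs and write logs.
    iam.ManagedPolicy.fromAwsManagedPolicyName(
      'service-role/AWSLambdaVPCAccessExecutionRole',
    ),
  ],
});
```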

There is one important thing: our OpenSearch domain belongs to a VPC (remember the architecture diagram in part 1 of this series), so the Lambda function should belong to the same VPC and have the corresponding execution permissions. That's why we added the "AWSLambdaVPCAccessExecutionRole" policy to the Lambda role.

For simplification, we will assume that we have just two user groups: administrators and users with limited access. For each group, we create a corresponding IAM role, then the group itself in Cognito, and assign the created role to the group. Logged-in users (authenticated by Cognito) can then assume a role depending on the group they belong to, as shown in the sketch below.
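Assuming the userPool and identityPool from the earlier sketch, this could look as follows (group names and construct IDs are illustrative):

```ts
import * as iam from 'aws-cdk-lib/aws-iam';
import * as cognito from 'aws-cdk-lib/aws-cognito';

// One IAM role per Cognito group; authenticated users assume the role of
// their group through the identity pool (web identity federation).
const makeGroupRole = (id: string): iam.Role =>
  new iam.Role(this, id, {
    assumedBy: new iam.FederatedPrincipal(
      'cognito-identity.amazonaws.com',
      {
        StringEquals: {
          'cognito-identity.amazonaws.com:aud': identityPool.ref,
        },
        'ForAnyValue:StringLike': {
          'cognito-identity.amazonaws.com:amr': 'authenticated',
        },
      },
      'sts:AssumeRoleWithWebIdentity',
    ),
  });

const adminRole = makeGroupRole('OsAdminRole');
const limitedRole = makeGroupRole('OsLimitedRole');

// Cognito groups with the matching role attached.
new cognito.CfnUserPoolGroup(this, 'AdminsGroup', {
  userPoolId: userPool.userPoolId,
  groupName: 'administrators',
  roleArn: adminRole.roleArn,
});
new cognito.CfnUserPoolGroup(this, 'LimitedUsersGroup', {
  userPoolId: userPool.userPoolId,
  groupName: 'limited-users',
  roleArn: limitedRole.roleArn,
});

// Allow both roles to call the OpenSearch HTTP endpoint.
for (const role of [adminRole, limitedRole]) {
  role.addToPolicy(
    new iam.PolicyStatement({
      actions: ['es:ESHttp*'],
      resources: [
        `arn:aws:es:${this.region}:${this.account}:domain/dr-domain/*`,
      ],
    }),
  );
}
```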

Notice that in the last part of the provided code, we've also attached a policy to our roles that allows them to make calls to the OpenSearch cluster.

Before we jump into the CDK implementation for our domain, we will do one more thing. For our OS domain, we want to enable a custom domain endpoint (you can read this document for details); you will see later in this article how we are going to utilize this feature. For that, we have to choose a hostname and provide an SSL certificate. Since our project already has a domain name for managing the HTTP API (let's say demodomain.link), we will use the following names for our endpoints (remember, we have two OS domains, according to the diagram in part 1): dr-domain1.demodomain.link and dr-domain2.demodomain.link. And we will use a wildcard certificate that covers both endpoints.
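A sketch of the certificate, assuming the demodomain.link hosted zone is passed in through stack props:

```ts
import * as acm from 'aws-cdk-lib/aws-certificatemanager';

// Wildcard certificate covering both custom endpoints
// (dr-domain1.demodomain.link and dr-domain2.demodomain.link).
const certificate = new acm.Certificate(this, 'OsEndpointCertificate', {
  domainName: '*.demodomain.link',
  validation: acm.CertificateValidation.fromDns(props.hostedZone),
});
```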

Keep in mind that the OpenSearch domain name and custom endpoint name are different things, and they may have different values.

So, now we are ready to jump into the OpenSearch domain itself.

A detailed explanation of all the properties of the Domain construct is out of the scope of this article, so I would recommend reading this document first.
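A condensed sketch of the domain definition follows. The instance types, the isPrimaryRegion flag, and the dr-domain name are illustrative, and masterUserRole, cognitoAccessRole, and certificate come from the earlier sketches:

```ts
import * as iam from 'aws-cdk-lib/aws-iam';
import * as kms from 'aws-cdk-lib/aws-kms';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

// Access policy. Anonymous access is for demonstration only; fine-grained
// access control and VPC placement still gate every request.
const accessPolicies = [
  new iam.PolicyStatement({
    actions: ['es:ESHttp*'],
    principals: [new iam.AnyPrincipal()],
    resources: [
      `arn:aws:es:${this.region}:${this.account}:domain/dr-domain/*`,
    ],
  }),
];

// The remote (leader) domain additionally needs es:ESCrossClusterGet so
// that the follower can start replication.
if (props.isPrimaryRegion) {
  accessPolicies.push(
    new iam.PolicyStatement({
      actions: ['es:ESCrossClusterGet'],
      principals: [new iam.AnyPrincipal()],
      resources: [
        `arn:aws:es:${this.region}:${this.account}:domain/dr-domain`,
      ],
    }),
  );
}

// Key for encryption of data at rest.
const kmsKey = new kms.Key(this, 'OsKmsKey', { enableKeyRotation: true });

const domain = new opensearch.Domain(this, 'DrDomain', {
  domainName: 'dr-domain',
  version: opensearch.EngineVersion.OPENSEARCH_1_3,
  vpc: props.vpc,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
  accessPolicies,
  // Three dedicated master nodes spread across three AZs: if one AZ goes
  // down, the remaining two masters can still elect a leader.
  capacity: {
    masterNodes: 3,
    masterNodeInstanceType: 'm6g.large.search',
    dataNodes: 3,
    dataNodeInstanceType: 'r6g.large.search',
  },
  zoneAwareness: { enabled: true, availabilityZoneCount: 3 },
  // Encryption at rest plus node-to-node encryption and enforced HTTPS,
  // all required for fine-grained access control.
  encryptionAtRest: { enabled: true, kmsKey },
  nodeToNodeEncryption: true,
  enforceHttps: true,
  fineGrainedAccessControl: { masterUserArn: masterUserRole.roleArn },
  cognitoDashboardsAuth: {
    identityPoolId: identityPool.ref,
    userPoolId: userPool.userPoolId,
    role: cognitoAccessRole,
  },
  customEndpoint: {
    domainName: 'dr-domain1.demodomain.link',
    certificate,
    // hostedZone is omitted on purpose: we create the CNAME ourselves.
  },
});
```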

Regarding the code example above, it is worth mentioning a few things.

At the beginning, we create an access policy for our cluster. For demonstration purposes, we allow anonymous access to it; don't use this without fine-grained access control, VPC support, or IP-based restrictions. You can find more information about the recommended configuration in this article. Then, in order to start replication, we must include the "es:ESCrossClusterGet" permission on the remote (leader) domain, which is why we need to know whether we are deploying the domain to the primary region or not.

Then we create a KMS key to enable encryption of data at rest for our domain. Together with node-to-node encryption, it helps us prevent unauthorized access to the data.

Also worth mentioning: in the code sample above, we assume that the VPC, hosted zone, and some other information are prepared in advance and passed in through stack props.

Other important settings are zone awareness and the dedicated master node configuration; these options increase cluster stability. In the provided example, we assume the following behavior in case one AZ experiences a disruption: no downtime. OpenSearch Service distributes the three dedicated master nodes across three AZs, so the remaining two can still elect a master. For non-production environments, this configuration can be relaxed to minimize the cost of an environment. I would recommend reading these articles for further details: multi-AZ domains and dedicated master nodes.

One small but necessary step is left. Remember we've decided to enable a custom endpoint for our domain? To complete it, we must create a CNAME record in Amazon Route 53 that routes traffic from the custom endpoint to the domain.
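Assuming the same hosted zone from the stack props, the record could look like this:

```ts
import * as cdk from 'aws-cdk-lib';
import * as route53 from 'aws-cdk-lib/aws-route53';

// CNAME from the custom endpoint to the generated domain endpoint.
new route53.CnameRecord(this, 'OsCustomEndpointCname', {
  zone: props.hostedZone,
  recordName: 'dr-domain1', // dr-domain1.demodomain.link
  domainName: domain.domainEndpoint,
  ttl: cdk.Duration.minutes(5),
});
```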

So, we created a domain, and at the same time OpenSearch Service created something for us internally. What exactly? It created an app client in the Cognito user pool and an identity in the identity pool. And here is the problem.

Every identity in your identity pool is either authenticated or unauthenticated. Authenticated identities belong to users who are authenticated by a login provider (remember, we've chosen Amazon Cognito as the authentication provider). For each identity type, there is an assigned role, and this role and its policy dictate which AWS services that role can access. When Amazon Cognito receives a request, the service determines the identity type, determines the role assigned to that identity type, and uses the policy attached to that role to respond. Since we allow users to authenticate using Cognito, we should specify how to choose the role for a user based on the claims in the user's token (remember, we are following best practice and the principle of least privilege). To do this, we need the name of the provider that was created for us together with the OS domain. The provider name looks like this: `cognito-idp.${region}.amazonaws.com/${userPool.ref}:${clientId}`.

Unfortunately, there is no way in AWS CDK to get back the client ID that was created for us. In the AWS example we referred to above, the developers try to solve this problem with a CDK custom resource. Briefly, their solution is to fetch the user pool app clients and rely on the index of the first one (because they have only one app client). But as we showed at the beginning of this article, in real projects you can have more than one user pool app client (if you remember, we created two other app clients for different purposes). Moreover, we have two clusters, which means each of them will create a user pool app client. There are two issues related to this problem in the aws-cdk repo on GitHub (this and this). Based on the comments on those issues, the only way to retrieve the generated app client identifier is with a custom resource and a Lambda function call. We found this approach complicated, and we had deadlines for the implementation, so we decided to go the simple way (in the next part of this series, we will show how to do this properly). In the simplified solution, we map the limited role to all authenticated users and then manually configure the identity pool and the Cognito authentication provider role mapping based on the claims in the token.
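The simplified mapping is short in CDK (reusing limitedRole and identityPool from the earlier sketches); the per-group rules for the admin role are then configured by hand in the console:

```ts
import * as cognito from 'aws-cdk-lib/aws-cognito';

// Every authenticated identity falls back to the limited role; the mapping
// for administrators is added manually, because the app client id generated
// by OpenSearch Service is not available to CDK.
new cognito.CfnIdentityPoolRoleAttachment(this, 'IdentityPoolRoles', {
  identityPoolId: identityPool.ref,
  roles: {
    authenticated: limitedRole.roleArn,
  },
});
```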

The next step is to create roles and define permissions inside the OpenSearch cluster and map them to our IAM roles (remember, we discussed above that those roles are different things). At this point, we decided not to reinvent the wheel and reused the implementation from the AWS example we referred to above (you can look at the code of this function on GitHub).

Our Lambda must send a set of requests to the OpenSearch domain to configure internal security.
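Conceptually, the calls look like the sketch below. The paths follow the OpenSearch security plugin REST API, the role names are predefined roles of that plugin, and the signing and sending of the requests is taken from the AWS example; the environment variable names are assumptions:

```ts
// Role mappings the Lambda pushes to the cluster. ADMIN_ROLE_ARN and
// LIMITED_ROLE_ARN are assumed to be passed in as environment variables.
const securityRequests = [
  {
    method: 'PUT',
    // Map our admin IAM role to the predefined all_access role.
    path: '_plugins/_security/api/rolesmapping/all_access',
    body: { backend_roles: [process.env.ADMIN_ROLE_ARN] },
  },
  {
    method: 'PUT',
    // Map the limited IAM role to the predefined read-only role.
    path: '_plugins/_security/api/rolesmapping/readall',
    body: { backend_roles: [process.env.LIMITED_ROLE_ARN] },
  },
];
```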

It is worth mentioning here that we added the Lambda function to the same VPC as the cluster and, for the same reason as before, added the "AWSLambdaVPCAccessExecutionRole" policy to the Lambda role.

Also, you can see from the requests we send that we are using some predefined roles to configure cluster security. Later in this series, I will show you how to create your own role (you can read more about predefined roles here).

Now we can move to the next part.

We are ACTUM Digital and this piece was written by Mikhail Chumakov, Senior .NET Developer of Apollo Division. Feel free to get in touch.
