Implementing an Amazon OpenSearch Service Production Cluster

Denis Choquehuanca
Published in Globant · 15 min read · Mar 13, 2024

In this article, we will walk through all the steps required to deploy an Amazon OpenSearch cluster on AWS in production mode, and we will delve into the origin of the service, some of its key concepts, good practices, and a few recommendations.

Origin of the Service

In September 2021, AWS launched the first version of Amazon OpenSearch Service as the successor to Amazon Elasticsearch Service, making it easier for developers to launch and operate large-scale search clusters. The service runs OpenSearch, an open-source, Java-based search and analytics engine for real-time application monitoring, log analytics, and stream analysis.

You may wonder why Amazon renamed Amazon Elasticsearch Service to Amazon OpenSearch Service. That’s easy to answer, but let’s review something first. OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications, licensed under Apache 2.0 and forked from Elasticsearch. It is built on Apache Lucene and driven by the OpenSearch Project community. OpenSearch provides a highly scalable system for rapid access and response to large volumes of data, with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data.

This is where AWS stepped in, adopted an early OpenSearch release (then at version 1.0), and renamed Amazon Elasticsearch Service to Amazon OpenSearch Service. This managed service makes it easy to deploy, run, and scale OpenSearch clusters without worrying about managing, monitoring, or maintaining the underlying infrastructure.

The following image shows a general overview of the main capabilities supported by the service, from interactive log analysis and real-time application monitoring to website search, as well as the visualization capabilities provided through OpenSearch Dashboards and, for legacy Elasticsearch versions 1.5 to 7.10, Kibana:

Source: aws.amazon.com

It is worth mentioning that the service is compatible with only 19 open-source Elasticsearch versions, from 1.5 to 7.10. Keep this in mind if you have an Elasticsearch cluster running a version higher than those mentioned and plan to migrate your workloads to Amazon OpenSearch Service.

As of today, AWS does not support Elasticsearch versions released after 7.10.2, since later versions are no longer open source and are not released under the ALv2 license. However, AWS will continue to support new open-source OpenSearch versions to provide new features for customers’ various use cases, such as log analytics, search, and observability. Here is a list of the currently supported versions of OpenSearch and legacy open-source Elasticsearch.

Node Types and Storage Tiers

Now, with this introduction to Amazon OpenSearch Service complete, let’s review some of its features and the compute and storage infrastructure components the service offers. (I will not go deeper into terms such as indexes, shards, and documents, as these concepts are well known if you have already worked with open-source Elasticsearch.)

Let’s start by reviewing the different types of nodes and the storage tiers that Amazon OpenSearch offers for data processing and for the operation of the cluster itself. Here is a representative image of each of them:

Types of Amazon OpenSearch nodes according to storage level (Prepared by the author)

Master

These nodes perform administration tasks for the cluster itself; they do not store data or respond to data upload requests. Decoupling these operational tasks from the cluster’s primary function ensures its permanent stability. Some of the functions performed by this type of node are the following:

  • Track the status of all nodes.
  • Track the number of existing indexes.
  • Track the number of shards per index.
  • Maintain routing information for all nodes in the cluster.
  • Update the cluster state when changes occur to indexes or nodes, as well as during update maintenance tasks.
  • Monitor node health by periodically sending heartbeat signals to ensure the availability of the data nodes in the cluster.

The choice of the ideal instance type for this kind of node is closely related to the number of instances, indexes, and shards we must manage. Here is a list of instance types recommended by AWS, along with some considerations for calculating the number of shards needed to support your indexes and workloads with a uniform distribution of the data.

A first recommendation for this node type is to have at least one master node instance in each Availability Zone, providing high availability for production environments under a Multi-AZ scheme.

Storage tiers

Every so often, it is a big challenge to persist large amounts of data (hundreds of GB, TB, or even PB) with almost immediate access to it for search or processing, without incurring huge storage bills. In most of these cases, we must retain the data because of strict business requirements or compliance processes we are subject to, whether for an audit or for company regulations.

Fortunately, we have Amazon S3, a large-capacity storage service that is secure, durable, highly scalable, and low-cost. Best of all, Amazon OpenSearch Service can use this technology to host node data in some of the storage tiers it offers, which we can see below:

Amazon OpenSearch Storage Tiering (Prepared by the author)

Here are some final recommendations for each storage tier:

  • Hot: Use for frequently accessed data that needs low latency for reads and queries.
  • UltraWarm: Use for large amounts of infrequently accessed, read-only data, balancing cost and performance through its S3-backed caching layer.
  • Cold: Use for historical data that must be archived for forensic investigation scenarios or retained for compliance reasons.

Another crucial point is that Amazon OpenSearch also lets us configure custom management policies that automate routine lifecycle tasks applied to existing indexes. A policy usually contains a default state and a list of subsequent states through which the indexes transition from one stage to another. These states indicate the storage tier, so transitions migrate indexes between the different node types. (We will configure one of these policies at the end of this article.)

States of an Index Management Policy (Prepared by the author)

OpenSearch Main Features

Now let’s take a look at a summary of the most relevant features offered by Amazon OpenSearch Service, grouped under some pillars of the Well-Architected Framework:

  • Scalability: It offers instance types with varied CPU, memory, and storage capacity (including Graviton instances), and supports up to 3 PB of storage, with Cold and UltraWarm tiers for read-only data and customizable EBS volume types for data nodes.
  • Security: It integrates with the IAM service for granular role-based access control and can be deployed inside an Amazon VPC with security groups. It enables encryption at rest and in transit between nodes using TLS 1.2. It offers multiple authentication mechanisms: Amazon Cognito over HTTP, or SAML federation through the AWS IAM Identity Center service for integrated login to OpenSearch Dashboards. It also provides granular access control at the index, document, and field levels, as well as management policies for index state handling. Furthermore, it integrates with Amazon CloudWatch for audit, application, search, and index logs. Lastly, it provides granular access to Dashboards (multi-user) and tenants (isolated and shared workspaces per domain).
  • Reliability: Data placement across Regions and Availability Zones, with Multi-AZ support for high availability. It offers dedicated master nodes that orchestrate and scale according to the defined configuration, and in-place upgrades follow a blue/green strategy with no downtime.
  • Flexibility: SQL support for custom queries and integrations with BI applications.
  • Performance efficiency: The Auto-Tune feature analyzes performance and cluster usage metrics to issue memory-related configuration suggestions that improve cluster speed and stability, and it can automatically apply the changes required to enhance the cluster’s efficiency and performance.

Here are some everyday use cases where we can take advantage of the capabilities offered by Amazon OpenSearch:

  • Full-text search.
  • Observability (collect, detect, investigate, and remediate).
  • Log analytics.
  • Monitoring.
  • Real-time analytics.
  • Dashboards.
  • SIEM and security.

Reference Architecture

The following architecture diagram shows the main components we will implement in our Amazon OpenSearch cluster for a production environment: three hot data nodes, three UltraWarm nodes, and cold storage for historical or audit data. We will also configure federated access via SAML through the AWS IAM Identity Center service so users can access the OpenSearch data management interface:

Cluster Reference Architecture (Prepared by the author)

Prerequisites

Before proceeding further, we assume that we already have a VPC with three Availability Zones and at least one subnet in each zone. I won’t go into the steps of deploying a standard VPC, as that is not the purpose of this article, but here is a link so you can deploy one.

Now, let’s start with the implementation:

1. Go to Amazon OpenSearch Service in the console and select the Create domain option. In this first section, enter a name for the cluster domain and then choose the Standard create option:

Domain creation method options

2. For this example, select the configuration template for a Production environment use case:

Types of configuration deployment templates

3. Now, choose the Domain with standby deployment option to ensure 99.99% availability of our cluster domain under a failover-based recovery scheme (with two instances in active mode and one in passive, or standby, mode):

Deployment options according to the desired scope of availability

Note: With this option selected, we activate the Auto-Tune functionality automatically. We will verify this in the following steps.

4. Choose the engine version of the cluster (it is always recommended to use the newest version to access the latest OpenSearch features). It is also recommended to enable the compatibility mode option to extend support for open-source integrations such as Logstash or Heartbeat, which verify the cluster version before connecting. This option makes OpenSearch report its version as if it were Elasticsearch OSS 7.10, which facilitates connections from open-source clients that support that version:

OpenSearch/Elasticsearch engine version configuration options

5. Next, choose the compute and storage resource types to use for the data node instances associated with the Hot storage tier for immediate access to data (indexes).

For this demonstration, we have selected the following minimum values required for a production deployment:

Properties of computational resources for Hot type instances (Prepared by the author)

Note: The volume size shown is for reference only. Depending on the use case, consider the actual size of your applications’ log records, the number of indexes and their estimated sizes, and the retention period defined for them, in order to better size the computational resources you need.

Configuration of data nodes

Tip #1: Using gp3 volumes provides lower latency and high throughput to support indexing and fast query processing, with cost savings of up to 9.6% compared to the gp2 type.

The following table briefly describes the baseline IOPS and throughput that AWS offers for this volume type according to volume size, as well as the maximum values that can be provisioned:

Performance properties according to volume size (Prepared by the author)

You can find the complete list of values recommended by AWS at the following link.
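If you manage the domain with Terraform, as mentioned in Tip #2 later in this article, this gp3 choice maps to the ebs_options block of the aws_opensearch_domain resource (shown more fully after step 7 below). This is a minimal sketch; the volume size is hypothetical, and 3,000 IOPS with 125 MiB/s is the gp3 baseline from the table above:

```hcl
# Sketch: gp3 volumes for the data nodes, inside an aws_opensearch_domain resource.
ebs_options {
  ebs_enabled = true
  volume_type = "gp3"
  volume_size = 100  # GiB per data node; size it to your indexes and retention
  iops        = 3000 # gp3 baseline; can be provisioned higher
  throughput  = 125  # MiB/s, gp3 baseline
}
```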

6. The next step is to select the instance type we will use for the UltraWarm nodes and to enable the Cold storage tier.

Configuration of warm and cold nodes

By default, this option will be enabled and initialized with three instances, because we initially chose to work with three data nodes. This implies working across multiple Availability Zones, with at least one node deployed in each.

Currently, the only UltraWarm instance types supported are:

Instance type options available for Ultrawarm nodes (Prepared by the author)

7. Now, configure the dedicated master nodes (by default, as with the UltraWarm nodes, three instances will be selected, evenly distributed across the Availability Zones). In this case, choose an instance based on Graviton2, the R6g, which offers up to 40% better price-performance than the R5 and is ideal for memory-intensive workloads, in-memory caching, and real-time big data analytics:

Configuration of master nodes
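For readers automating this with Terraform instead of the console, steps 3 through 7 map roughly to the cluster_config block of the aws_opensearch_domain resource. The following is a minimal sketch under that assumption; the domain name is hypothetical, and the instance types and counts mirror the values chosen above:

```hcl
# Sketch: node layout for the production domain (steps 3-7),
# assuming a recent AWS provider (v5+).
resource "aws_opensearch_domain" "production" {
  domain_name    = "my-production-domain" # hypothetical name
  engine_version = "OpenSearch_2.11"      # step 4: prefer the newest version

  cluster_config {
    # Step 3: Multi-AZ with standby across three Availability Zones
    multi_az_with_standby_enabled = true
    zone_awareness_enabled        = true
    zone_awareness_config {
      availability_zone_count = 3
    }

    # Step 5: hot data nodes
    instance_type  = "r6g.large.search"
    instance_count = 3

    # Step 6: UltraWarm nodes and the cold storage tier
    warm_enabled = true
    warm_type    = "ultrawarm1.medium.search"
    warm_count   = 3
    cold_storage_options {
      enabled = true
    }

    # Step 7: dedicated master nodes, one per Availability Zone
    dedicated_master_enabled = true
    dedicated_master_type    = "r6g.large.search"
    dedicated_master_count   = 3
  }

  # The ebs_options block from Tip #1, plus the blocks sketched in the
  # following steps, also go inside this resource.
}
```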

8. Optionally, we can also configure a custom domain name for the access URLs of our cluster domain and OpenSearch Dashboards. Ensure you have previously generated a certificate with the AWS Certificate Manager service:

Configuration of Custom Domain and Certificate
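In Terraform, this custom domain name could be sketched with the domain_endpoint_options block; the hostname and certificate ARN below are hypothetical placeholders for a certificate already issued in ACM:

```hcl
# Sketch: custom endpoint for the domain (step 8), inside aws_opensearch_domain.
domain_endpoint_options {
  custom_endpoint_enabled         = true
  custom_endpoint                 = "search.example.com" # hypothetical hostname
  custom_endpoint_certificate_arn = "arn:aws:acm:us-east-1:123456789012:certificate/example-id"
}
```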

9. In this section, we will choose the VPC access option to restrict public access to our domain. In addition to selecting the VPC and the private subnets we deployed along with the initial security group, we can add extra security groups. This is useful if we want to allow inbound traffic to the cluster from other compute resources, such as EC2 instances or EKS nodes shipping logs through an ingestion agent:

Configuring Subnets and VPC for private access
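The equivalent Terraform sketch uses the vpc_options block; the subnet and security group references are hypothetical pointers to the pre-existing VPC resources from the prerequisites:

```hcl
# Sketch: private VPC placement (step 9), inside aws_opensearch_domain.
vpc_options {
  subnet_ids = [
    aws_subnet.private_a.id, # one private subnet per Availability Zone
    aws_subnet.private_b.id,
    aws_subnet.private_c.id,
  ]
  security_group_ids = [aws_security_group.opensearch.id]
}
```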

10. The next step is to enable the fine-grained access control mechanism for a more granular level of security and to define the credentials of the master user, who will have all cluster administration permissions.

Some important access mechanisms provided by fine-grained access control are:

  • It enables role-based access control integrations.
  • It provides security control for access at the level of indexes, documents, or even individual document fields.
  • It allows multi-user access to the OpenSearch Dashboards interface.
  • It supports SAML authentication to OpenSearch Dashboards through an IdP (Identity Provider) for a single sign-on-based login.

Activation of detailed control access and administrator user settings

Tip #2: If you use Terraform for this implementation, rely on a secrets management tool that secures sensitive data, such as AWS Secrets Manager or HashiCorp Vault, to manage and securely access this credential.
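As a sketch of this tip, assuming a secret named opensearch/master-user already exists in AWS Secrets Manager as a JSON document with username and password keys, the master user credentials could be read at plan time instead of being hardcoded:

```hcl
# Sketch for Tip #2: pull the master user credentials from AWS Secrets Manager.
data "aws_secretsmanager_secret_version" "opensearch_master" {
  secret_id = "opensearch/master-user" # hypothetical secret name
}

locals {
  master_user = jsondecode(data.aws_secretsmanager_secret_version.opensearch_master.secret_string)
}

# Step 10, inside aws_opensearch_domain: enable fine-grained access control.
advanced_security_options {
  enabled                        = true
  internal_user_database_enabled = true
  master_user_options {
    master_user_name     = local.master_user["username"]
    master_user_password = local.master_user["password"]
  }
}
```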

11. Now configure a secure and scalable authentication mechanism that allows access to the OpenSearch Dashboards console. We recommend using SAML-based authentication for single sign-on (SSO) through AWS IAM Identity Center.

Tip #3: To make this configuration possible, we must have already deployed the cluster domain, since it is the domain that generates the Identity Provider (IdP) parameters, such as the Service Provider Entity ID (the SAML audience) and the IdP-initiated SSO URL (the assertion consumer URL). This information will serve as input to create our SAML application in AWS IAM Identity Center.

Here is a quick reference on how to configure a SAML application in the AWS IAM Identity Center:

  • First, from our previously deployed cluster domain, go to Actions > Edit security configuration, and in the SAML authentication for OpenSearch Dashboards/Kibana section, check the Enable SAML authentication option:
Enable the SAML authentication option
  • Once this option is activated, the IdP configuration URLs of the cluster domain will be generated:
Opensearch domain IDP configuration URLs
  • Assuming we have already configured our SAML application in AWS IAM Identity Center (here is a link with the steps needed to achieve this), we must download the XML file containing the application’s IdP configuration metadata, and then import it in the Import IdP metadata section of our Amazon OpenSearch cluster domain:
Application XML file metadata configuration section
  • In addition, we need to specify the name of the attribute that we set in the SAML application and define the SSO session’s TTL:
SAML assertion validation key parameter configuration section and session expiration time

Finally, save and wait a few minutes for the modifications to be applied. Verify that the status of our domain is Active:

Cluster domain status
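The whole SAML flow above can also be sketched in Terraform through the separate aws_opensearch_domain_saml_options resource. The entity ID, metadata file path, attribute key, and session timeout below are hypothetical examples of the values gathered in the previous bullets:

```hcl
# Sketch: SAML authentication for OpenSearch Dashboards (step 11).
resource "aws_opensearch_domain_saml_options" "sso" {
  domain_name = aws_opensearch_domain.production.domain_name

  saml_options {
    enabled = true
    idp {
      # Entity ID of the IdP and the metadata XML exported from IAM Identity Center
      entity_id        = "https://portal.sso.us-east-1.amazonaws.com/saml/assertion/example"
      metadata_content = file("${path.module}/idp-metadata.xml")
    }
    roles_key               = "Role" # the attribute name defined in the SAML application
    session_timeout_minutes = 60     # the SSO session TTL
  }
}
```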

A similar option for maintaining federated authentication is Amazon Cognito. You can review the documentation for that integration process if you want to try it.

12. Continuing with the configuration of our cluster domain, select the access policy we will use for custom control of requests to our Amazon OpenSearch domain, based on identities provided by the IAM service. For this, choose the Configure domain level access policy option.

Domain access policy options
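A domain-level access policy is a standard IAM resource policy attached to the domain. Here is a hypothetical sketch; the account ID and role name are placeholders to be replaced with your own identities:

```hcl
# Sketch: domain-level access policy (step 12), inside aws_opensearch_domain.
access_policies = jsonencode({
  Version = "2012-10-17"
  Statement = [{
    Effect    = "Allow"
    Principal = { AWS = "arn:aws:iam::123456789012:role/opensearch-ingest" }
    Action    = "es:ESHttp*" # HTTP verbs against the domain endpoint
    Resource  = "arn:aws:es:us-east-1:123456789012:domain/my-production-domain/*"
  }]
})
```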

13. Depending on our compliance requirements, we can use a KMS key managed by AWS or one of our own. This helps us reinforce the security and encryption of the data by ensuring that:

  • Only HTTPS traffic is allowed to the cluster domain.
  • Communication between nodes is encrypted.
  • Data is encrypted at rest.
Configuration section for use of KMS
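These three guarantees map to the following Terraform sketch. The KMS key reference is hypothetical (omit kms_key_id to fall back to the AWS-managed key), and the domain_endpoint_options settings belong in the same block as the custom endpoint from step 8:

```hcl
# Sketch: encryption settings (step 13), inside aws_opensearch_domain.
encrypt_at_rest {
  enabled    = true
  kms_key_id = aws_kms_key.opensearch.key_id # hypothetical customer-managed key
}

node_to_node_encryption {
  enabled = true
}

domain_endpoint_options {
  enforce_https       = true # reject plain HTTP traffic
  tls_security_policy = "Policy-Min-TLS-1-2-2019-07"
}
```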

14. Enable the cluster performance analyzer, Auto-Tune, to get improvement and optimization suggestions based on current workloads, and configure a maintenance window schedule for those automatic deployments.

Auto-Tune customization and maintenance window configuration section

Optimization maintenance that involves hardware-level property changes to improve performance will be executed under a blue/green deployment scheme.
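In Terraform, Auto-Tune and its maintenance window could be sketched as follows; the start date and weekly cron expression are hypothetical and should match a low-traffic window for your workload:

```hcl
# Sketch: Auto-Tune with a recurring maintenance window (step 14),
# inside aws_opensearch_domain.
auto_tune_options {
  desired_state       = "ENABLED"
  rollback_on_disable = "NO_ROLLBACK"

  maintenance_schedule {
    start_at = "2024-04-01T03:00:00Z" # hypothetical first window
    duration {
      value = 2
      unit  = "HOURS"
    }
    cron_expression_for_recurrence = "cron(0 3 ? * SUN *)" # weekly, Sundays 03:00 UTC
  }
}
```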

15. The next step is to enable the automatic software update functionality so that we always stay on a stable, recently released version of OpenSearch and can take advantage of new features:

Enable automatic update option

16. Finally, we need to define some mandatory tags. As a good practice, include keys that are meaningful to the business units, to make deployed resources easier to find and to attribute the costs they incur.
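Steps 15 and 16 round out the Terraform sketch with two small additions; the tag keys and values are examples to be aligned with your organization’s tagging standard:

```hcl
# Sketch: automatic software updates (step 15) and tags (step 16),
# inside aws_opensearch_domain.
software_update_options {
  auto_software_update_enabled = true
}

tags = {
  Environment = "production"
  Team        = "platform"      # example business-unit key
  CostCenter  = "observability" # example cost-attribution key
}
```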

With this, we have completed the implementation steps for an Amazon OpenSearch cluster in a production environment. Here are some cluster and node health metrics that we can visualize immediately:

Cluster metrics section in the AWS console

Here is a more detailed picture of the metrics for each node:

Health section of cluster master nodes in the AWS console

If SAML authentication for SSO login is enabled, we should be able to access the centralized URL of our application portal and, from there, be redirected immediately to the OpenSearch Dashboard.

To achieve this, go to the main section of the deployed cluster domain, copy the OpenSearch Dashboards URL, and open it in your browser:

General informative section of the Domain Cluster

This will redirect us to the single sign-on URL, where we enter our user credentials. Once we access the centralized page, we can click on our application’s icon, which will finally redirect us to the OpenSearch Dashboards administration console:

Amazon Dashboard console

Tip #4: Before deploying your workloads (ingesting logs from your services), define state management policies for the indexes, also called ISM policies. Configuring these policies before creating any index ensures that new indexes are included in them.

To configure a state management policy for the indexes, access the OpenSearch Dashboards interface, go to State Management Policies > Policy Managed Indices, and select Create Policy:

ISM policy configuration section

In this link, you can find some examples of policy templates according to the use cases you need.
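As an illustration, here is a minimal ISM policy sketch that could be created through the Create Policy form above or from the Dev Tools console. The policy name, index pattern, and retention ages are hypothetical; it keeps new indexes hot for seven days, migrates them to UltraWarm, and deletes them after ninety days:

```json
PUT _plugins/_ism/policies/log-retention
{
  "policy": {
    "description": "Hypothetical lifecycle: hot -> warm -> delete",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [{ "warm_migration": {} }],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [{ "delete": {} }],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["app-logs-*"], "priority": 100 }
    ]
  }
}
```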

Once the ISM policy is configured, it keeps storage on the cluster nodes functioning correctly and avoids any risk of running out of space on the volumes (disk). It also frees us from worrying about index maintenance and deletion:

ISM policy listing section in OpenSearch Dashboard console

Conclusions

In this article, we have learned how to implement a production Amazon OpenSearch cluster and reviewed several important configuration aspects to ensure high availability, manageability, and maintainability based on AWS best practices.

We learned how to use an identity provider with OpenSearch Dashboards for SAML-based authentication, which provides single sign-on (SSO) for Amazon OpenSearch domains.

We learned how to monitor our cluster domain and check its status through the metrics the service itself offers, at both the cluster and node level.

We also learned about the different node types and storage tiers offered by this service, which is useful when choosing the right location for the type of data we want to store.

Finally, using tags can simplify finding resources and consolidating their billing. Certain tags are mandatory, and they can be very useful in any environment, especially production.


Denis Choquehuanca
Cloud & DevOps Engineer at Globant with experience in AWS.