How to Set Up and Manage Your Own Elasticsearch Cluster on AWS

Ayush Jaiswal
Traveloka Engineering Blog
11 min read · Dec 8, 2021
Illustration by Qhoirunisa

Editor’s Note:

If you’ve been on the fence about adopting a self-managed Elasticsearch cluster as part of your infrastructure, Ayush shares three compelling reasons (supported by a cost analysis), a concise how-to guide, and invaluable risk-mitigating tips to persuade you to take that first step.

Ayush Jaiswal is a software engineer with the Search Engineering team, responsible for improving search result quality and infrastructure across all of Traveloka’s products.

This is the first of a two-part series. In this part, I will give a general overview of the self-managed Elasticsearch cluster; the second part will go deeper into the technical specifics of each component and its associated configuration.

Introduction

At Traveloka, Elasticsearch is one of the core components serving search traffic, and it constitutes a significant part of our tech stack. Improvements in our search functionality usually come from researching and understanding more about Elasticsearch. While looking into the feasibility of a self-managed Elasticsearch cluster, we found that there are different flavors of Elasticsearch. We decided to go with the Basic — Free and Open (at the time of this writing) version, which suits our needs. To create a functioning Elasticsearch cluster on our AWS-based infrastructure, we need the following four must-have starting components:

  • Instances to host Elasticsearch service (EC2 instances)
  • Load balancer to distribute incoming load evenly (AWS ELB)
  • DNS mapping to your load balancer (Route 53)
  • Autoscaling to scale your cluster depending on the load (AWS ASG)

If you use a different cloud provider, you should be able to find equivalent components.

Why Did We Build It?

When established cloud providers already offer services to manage an Elasticsearch cluster effortlessly, there is very little incentive to set up and manage your own cluster. However, these were the three compelling factors we considered:

  • In our comparison, we found that an AWS Elasticsearch instance costs roughly 50% more than an EC2 instance of the same type. For example, a c5.xlarge.elasticsearch instance costs $0.289/hr (+47%), whereas a c5.xlarge EC2 instance costs $0.196/hr in the ap-southeast-1 region. Since our team was spending close to $20,000 monthly on AWS-managed Elasticsearch, this ratio implied a potential saving of around $6,000 monthly.
  • We had faced a few production issues in early 2020. Not only does AWS block the APIs needed to resolve such issues (like retrying failed allocations) on AWS-managed Elasticsearch clusters, but their support was also not forthcoming with a resolution, which left our cluster in a yellow state for quite some time. Hopefully, it has improved since then.
  • We were also looking to build expertise in self-managing Elasticsearch clusters.

Therefore, we built an experimental cluster to test our hypothesis. When we analyzed its cost, the savings turned out to be significant, which confirmed our hypothesis. Consequently, we committed to setting up and managing our own Elasticsearch clusters.

Why Should You Build One?

There are some incentives to set up and manage your own Elasticsearch cluster.

Cost

The cost savings were a big motivation for us. If you look at the calculations in the Cost Analysis section, you will see that the savings for the hypothetical cluster are about 24%. However, the savings depend a lot on the size of the clusters being compared. With bigger clusters, the proportion of fixed costs (costs that vary very little with cluster size) declines relative to variable costs (e.g., EC2 instances), which yields a higher percentage of savings.

Expertise

While researching how to create a functioning self-managed Elasticsearch cluster that meets all the criteria of a production environment, we decided on every small configuration ourselves. This not only acquainted us with Elasticsearch’s innards, but also helped us understand potential problems that could arise later in our production systems and how to fix them. Understanding an Elasticsearch cluster from the infrastructure point of view also helps you pick the right machine sizes for different node types and calculate buffer capacity allocation to avoid potential issues or bottlenecks. Both result in cost savings.

Performance

When we compared performance between self-managed and AWS-managed ES clusters, the former showed gains across the majority of metrics, though we could not find a concrete reason for them. Therefore, performance gains may vary in your self-managed cluster.

How You Can Build One Too

Elasticsearch’s customizability and flexibility mean you have to find (and set) the right value for each setting. The steps discussed below apply to Elasticsearch version 6.8 and above. For other versions, you might have to tweak things accordingly.

Setting up Elasticsearch

Download Elasticsearch for Linux. Once it is installed and verified locally, add the configurations specific to your environment/infrastructure that are needed to form a cluster to the elasticsearch.yml file inside the /etc/elasticsearch directory (see the Instance Type and Configuration section below).
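
As a concrete illustration, here is a minimal install sketch for a Debian/Ubuntu EC2 instance. The version number, package format, and paths are examples only; pick whatever release and packaging suit your environment.

```bash
# Example install on a Debian/Ubuntu EC2 instance; 7.10.2 is only an
# illustrative version, use the release you actually target.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-amd64.deb
sudo dpkg -i elasticsearch-7.10.2-amd64.deb

# Package installs keep cluster settings in /etc/elasticsearch/elasticsearch.yml
# and JVM settings in /etc/elasticsearch/jvm.options.
sudo systemctl daemon-reload
sudo systemctl enable --now elasticsearch

# Verify the node answers locally before adding cluster-specific settings.
curl -s http://localhost:9200
```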

Heap Settings

Setting the correct heap size is an important part of fine-tuning each node for its intended use. Elasticsearch is a search engine based on the Lucene library and stores data in Lucene segments. Since these segments are immutable, they can be cached in the filesystem cache. It is therefore recommended to allocate at most 50% of total RAM to the Java heap so that the rest of the memory remains available to the filesystem cache. Elasticsearch also recommends that you don’t allocate more than 32GB to the heap. You can read more about heap allocation in this thread.
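
For instance, on a data node with 64GB of RAM you would cap the heap at around 31GB and leave the rest to the filesystem cache. A minimal sketch of the relevant lines (package installs read /etc/elasticsearch/jvm.options; the 31g figure is an assumption for a 64GB machine):

```
# /etc/elasticsearch/jvm.options (excerpt)
# Keep min and max heap equal, at most ~50% of RAM and below ~32GB so the JVM
# can keep using compressed object pointers.
-Xms31g
-Xmx31g
```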

Node Types

An instance of Elasticsearch is known as a node. Usually, each node in a cluster runs on a separate machine (EC2 instance). Though it is possible to run multiple nodes on a single machine, doing so defeats the purpose of having multiple nodes. In a cluster, a node can assume multiple roles, or it can be dedicated to a single role, depending on the requirement. The three most common node types are:

  • Master-eligible node: Master-eligible nodes can be elected as the master node when voting happens. The elected master node maintains the cluster state. If you have a large cluster, you should ideally have dedicated master nodes. It is recommended to have at least 3 master-eligible nodes in order to avoid the split-brain problem, potential data loss, and inconsistency. You might need to change the number of master-eligible nodes as you add more nodes to the cluster.
  • Data node: As the name suggests, it stores the data of indices.
  • Coordinating node: A node with no dedicated role that accepts incoming requests, such as search or index requests, forwards them to the relevant data nodes, and then collates their responses into the final result returned through the REST API.

There are other node types in Elasticsearch as well; you will need to configure each node according to its desired role, as sketched below. Also refer to Important Elasticsearch configurations when setting up clusters.
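
As a minimal sketch, this is how a dedicated master-eligible node could be declared using the legacy-style role flags that apply to 6.8/7.x; the node name is illustrative, and the comments describe the variants for the other roles.

```bash
# Role flags for a dedicated master-eligible node (6.8/7.x legacy-style flags).
cat <<'EOF' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
node.name: master-1
node.master: true
node.data: false
node.ingest: false
EOF
# A dedicated data node would use node.master: false and node.data: true;
# a coordinating-only node sets master, data, and ingest all to false.
# On 6.8, also set discovery.zen.minimum_master_nodes to a quorum of the
# master-eligible nodes (2 when there are 3); 7.x manages this automatically.
```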

Instance Type and Configuration

It is important to choose the EC2 instance type that best suits each node type by considering the following parameters:

  • For master nodes, memory size is usually secondary to CPU requirements, which can be higher depending on the cluster configuration. If an elected master node becomes busy managing cluster operations, CPU usage usually spikes due to two frequent contributing factors:
    • Too many shards in your cluster, which overwhelm the master node.
    • Frequent updates to the mappings of your indices.
  • For data nodes, CPU and RAM are both important. However, with AWS, you should also consider the IOPS available to each of your nodes depending on the I/O rate you expect. You should also try to balance the allocated EBS size against the provisioned IOPS for better cost optimization.
  • When allocating an Elastic Block Store (EBS) volume, its size should be double the projected disk usage of each node. Since Lucene segments are immutable, a merge writes a new segment the size of the participating segments combined, and the old segments remain on disk until the merge completes. In our clusters, we have set an alarm that triggers when disk utilization exceeds 60%.

Cluster Formation

Elasticsearch offers discovery plugins for popular service providers, which help nodes discover other nodes of the same cluster. For AWS, you can use the EC2 discovery plugin and specify a tag key/value that the plugin should look for to find relevant nodes (EC2 instances). All the settings for EC2 discovery are listed here. This plugin relies on AWS’ DescribeInstances API, which lists all EC2 instances matching given criteria. Note that AWS throttles requests to this API, so be mindful of how many nodes in your account are calling it; too many calls can even prevent you from accessing your AWS console. If you must have a lot of nodes calling that API, you can tune the plugin’s discovery.ec2.node_cache_time property to reduce the call rate and avoid throttling.
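
A minimal sketch of EC2 discovery follows, assuming the plugin is installed on every node and that instances carry an illustrative tag es-cluster=search-prod; the cluster name, tag key/value, and cache time are all placeholders for your own choices.

```bash
# Install the EC2 discovery plugin on each node (package install layout).
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install discovery-ec2

# Then point discovery at instances carrying an agreed-upon tag
# (7.x style shown; on 6.8 use "discovery.zen.hosts_provider: ec2" instead).
cat <<'EOF' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
cluster.name: search-prod
discovery.seed_providers: ec2
# Only EC2 instances with this tag are considered as cluster members.
discovery.ec2.tag.es-cluster: search-prod
# Cache DescribeInstances results to reduce AWS API calls and avoid throttling.
discovery.ec2.node_cache_time: 120s
EOF
```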

High Availability

High availability is a must in production environments, especially for databases. AWS generally has more than one Availability Zone (AZ) per region, which you can leverage for replication by placing primary and replica shards in different AZs. If you set cloud.node.auto_attributes: true in elasticsearch.yml, EC2 discovery will populate the node’s AZ attribute automatically, and with allocation awareness enabled, replicas won’t be assigned to nodes hosted in the same AZ as the primary shards. In a multi-AZ configuration, a search request arriving at a coordinating node is routed to a node holding either a primary or a replica shard, preferably in the same AZ.
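
A minimal sketch of AZ-aware shard allocation on top of EC2 discovery; aws_availability_zone is the attribute the plugin populates when auto attributes are enabled.

```bash
# Make the EC2 discovery plugin attach the instance's AZ as a node attribute,
# then use that attribute for allocation awareness so a primary and its
# replica never land in the same AZ.
cat <<'EOF' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
EOF
```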

Backup

Backup is an essential process in a database system; it lets us restore the database to the latest known stable snapshot in case of adverse events such as corruption. Elasticsearch has a snapshot API, which you can call at regular intervals to take incremental backups of your cluster. To save on storage costs, you should also create a policy to delete old backups. If you prefix your snapshot names with the date and time, you can use a date-* pattern to list the snapshots for a specific date and delete them one by one. From version 7.6 onwards, Elasticsearch also ships with a snapshot lifecycle management API, which simplifies listing and deleting all relevant snapshots.
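
Here is a minimal sketch of S3-backed snapshots. It assumes the repository-s3 plugin is installed on every node, the instances have IAM permissions to the bucket, and the repository and bucket names (s3_backup, es-backups-example) are illustrative.

```bash
# Install the S3 repository plugin on each node (restart required afterwards).
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-s3

# Register an S3 bucket as a snapshot repository.
curl -s -X PUT "http://localhost:9200/_snapshot/s3_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "s3", "settings": {"bucket": "es-backups-example"}}'

# Take an incremental snapshot, prefixed with date and time so old ones are
# easy to find later (run this from cron or a scheduled Lambda).
curl -s -X PUT "http://localhost:9200/_snapshot/s3_backup/$(date +%F-%H%M)?wait_for_completion=false"

# Delete snapshots older than your retention window, one by one.
curl -s -X DELETE "http://localhost:9200/_snapshot/s3_backup/2021-11-01-0300"
```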

Mitigating risks

Any production system is bound to run into problems occasionally, no matter how many resources you throw at it. Therefore, it is prudent to be prepared with tools that can help you when a problem occurs. Let’s discuss four of them that we use:

Logging

You should consider setting up a robust logging mechanism, such as CloudWatch Logs, to categorically filter different kinds of logs, which helps when troubleshooting issues such as an unresponsive or unreachable node. The information in the logs can help you identify the root cause and avoid similar issues in the future.

Contingency Planning

A robust contingency plan is a prerequisite for running self-managed Elasticsearch in production. Some of the events that can happen in production systems:

  • An unresponsive/unreachable node eventually gets terminated, and Elasticsearch will report either a yellow or red cluster status depending on your configuration. Once the node is replaced, if shards are still in an unassigned state, you should try to re-allocate them with the cluster reroute API (see the sketch below). You can read about a similar contingency planning drill here.
  • If the cluster becomes red, you might need to restore your indices from a known good backup, which makes regular backup practice a must.

Therefore, it is important to document your contingency plans for the various known scenarios, because every second is precious when an incident happens.
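
For the unassigned-shard scenario above, a minimal sketch of the diagnose-and-retry flow using the standard cluster APIs (run against any node of the cluster):

```bash
# Check overall health and see how many shards are unassigned.
curl -s "http://localhost:9200/_cluster/health?pretty"

# Ask Elasticsearch why a shard is unassigned before acting on it.
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty"

# Retry allocations that previously failed (e.g., after a node was replaced).
curl -s -X POST "http://localhost:9200/_cluster/reroute?retry_failed=true"
```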

Buffer Capacity

Whenever you create a cluster, you should maintain buffer capacity across CPU, memory, and storage according to your traffic projections. To know when to pre-scale your resources, pair that buffer capacity with proper alerts on metrics such as CPU usage, JVM heap usage (or frequent garbage collection), disk usage, and DescribeInstances API usage.
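
As one illustration, the sketch below creates a CPU alert with the AWS CLI; the alarm name, Auto Scaling group name, threshold, and SNS topic ARN are all placeholders to adapt to your own setup.

```bash
# Alert when average CPU across the data-node Auto Scaling group stays above
# 70% for 15 minutes, leaving time to scale before the buffer is exhausted.
aws cloudwatch put-metric-alarm \
  --alarm-name es-data-nodes-cpu-high \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=es-data-nodes \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 70 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:es-alerts
```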

Monitoring & Alerts

A carefully crafted dashboard that monitors crucial metrics is very helpful, with the following benefits:

  • It helps you calculate future traffic projections.
  • It helps you pinpoint the root cause of issues quickly during an emergency.

Apart from monitoring, if you have set up the right alerts, you can also get early warnings through alerting services such as PagerDuty. Being proactive about production issues can save you a lot of trouble.

At Traveloka, we use Datadog and PagerDuty extensively for monitoring and alerting, respectively, with the metrics piped into a dashboard that monitors our Elasticsearch clusters. Have a look at some important sections of our dashboard below:

Figure 1. Dashboard to monitor our Elasticsearch clusters

Additional Costs

Although the savings are quite substantial, self-managed Elasticsearch does carry additional costs that can easily be overlooked, such as:

  • ELB capacity units, S3 backup storage, and data transfer costs across availability zones (there are no such charges with AWS-managed Elasticsearch).
  • Developer cost associated with designing solutions, creating a knowledge base, maintaining clusters, and keeping system versions up to date.

Cost Analysis

Let’s break down the costs of a self-managed Elasticsearch cluster using the following components, which we use ourselves, as an example:

  • EC2 instances
  • Elastic load balancer
  • Autoscaling groups
  • S3 storage
  • AWS lambda
  • Cloudwatch events
  • Cloudwatch logs
  • Route 53
  • Elastic network interface

We have calculated the cost of a comparable configuration on AWS-managed and self-managed Elasticsearch for each of those components, assuming a hypothetical cluster with the following configuration:

  • 4 r5.xlarge nodes as data nodes
  • 3 c5.large nodes as dedicated master nodes
  • EBS storage per data node = 100GB
  • Total primary shard data = 100GB

All the EC2 instances are on-demand instances.

The cost components discussed below are monthly costs based on pricing data available on the AWS pricing page for the Asia Pacific (Singapore) region.

Figure 2. Cost comparison of self-managed ES (left) and AWS-managed ES (right)

I believe the cost analysis makes a compelling case for adopting a self-managed Elasticsearch cluster in your organization as well.

In a nutshell, Elasticsearch offers a lot of features for setting up clusters, including plugins for almost all major cloud providers that perform provider-specific operations such as cluster backup or node discovery. I have listed all the components needed to build a self-managed Elasticsearch cluster, along with relevant links, which should help you set one up in your own infrastructure as well. For other relevant topics or a deeper understanding, refer to the excellent Elasticsearch documentation.

We at Traveloka constantly work to optimize our systems and explore new technologies. Check out Traveloka’s career page and join us on our adventure!
