Journey from Self Hosted Elasticsearch to AWS Managed Elasticsearch Service

Tejas Nayak
Jul 26, 2021 · 7 min read


Every day in this tech world, we come across a problem that outgrows itself, and managing it becomes more painful than the actual engineering work. One such problem is managing ES clusters and Kibana. Our production Elasticsearch cluster would frequently go OOM, suffer from long-running queries, and need disk space added ever more often, and we spent ample time on Kibana whenever cluster load was high. The decision was taken to move to AWS-managed Elasticsearch to take the load off the team, move to a true ES cluster, and reduce maintenance overhead. As part of moving to the AWS Managed ES Service, we also planned to upgrade our ES cluster from 6.3 to 6.8.

Source : https://makeameme.org/meme/one-does-not-ao1pfo

Creating and configuring the AWS Managed ES Cluster

Why dedicated Master nodes?

Dedicated master nodes perform just one job: cluster orchestration. AWS best practices recommend dedicated master instances for all production workloads; the added stability and protection against split-brain is a major benefit. We went with three dedicated master nodes, which provides two backup nodes in the event of a master node failure and the quorum necessary to elect a new master. More on the advantages of dedicated master nodes can be found in the AWS ES Developer Guide.
We performed failover testing for two scenarios: three dedicated master nodes vs. none (data nodes only). Master nodes do not serve queries; they manage cluster state and shard allocation. Having dedicated master nodes (a minimum of three, so a quorum can be enforced) keeps the cluster stable under high load. Our tests showed that without dedicated master nodes, a cluster carrying an enormous number of shards became completely stuck and unrecoverable. With dedicated master nodes, the data nodes were able to recover and even respond to APIs, taking about 15 minutes to recover after being over-stressed.
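If you want to confirm which nodes in the domain are acting as dedicated masters, the standard _cat/nodes API is a quick sanity check; the endpoint and credentials below are placeholders for your own domain.

# List nodes with their roles: dedicated masters show only "m" in node.role,
# while data nodes show "d" (plus "i" for ingest, depending on configuration)
curl -u admin "https://domain-endpoint:443/_cat/nodes?v&h=name,node.role,master"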

Creating the AWS Managed ES Domain

Once you have worked out how much storage you need on the AWS side, along with the number and type of instances, you can configure your own AWS ES cluster. It takes only five or six clicks in the “Create Elasticsearch domain” wizard, or you can use a CloudFormation template and validate your ES cluster.
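If you prefer scripting over the wizard, the domain can also be created with the AWS CLI; the sketch below is illustrative only, with the domain name, instance types, counts, and volume size as placeholders rather than our production values.

# Create a 6.8 domain with three dedicated master nodes and EBS-backed data nodes
# (placeholder sizes - pick instance types and storage for your own workload)
aws es create-elasticsearch-domain \
  --domain-name es-prod \
  --elasticsearch-version 6.8 \
  --elasticsearch-cluster-config "InstanceType=r5.large.elasticsearch,InstanceCount=3,DedicatedMasterEnabled=true,DedicatedMasterType=c5.large.elasticsearch,DedicatedMasterCount=3" \
  --ebs-options "EBSEnabled=true,VolumeType=gp2,VolumeSize=512"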

Migrate the data

Due to limitations of AWS Managed Elasticsearch, migration cannot be done by connecting the two ES clusters and streaming data from one to the other while reindexing on the go. The workaround is to use an AWS S3 bucket with snapshot and restore operations. We had around 2 TB of data to move from our hosted ES to AWS Managed ES, so data migration was the most crucial part of all. I will walk you through the step-by-step procedure for the entire data migration.

  • Create an s3 bucket, say es-bucket-prod
  • Create a policy in IAM named s3role-es-prod
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::es-bucket-prod"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "iam:PassRole"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::es-bucket-prod/*"
      ]
    }
  ]
}
  • Create an IAM user with programmatic access, say prod_es_user (it needs full permissions on the S3 bucket where you will store the snapshots), and attach the s3role-es-prod policy to it
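If you want the same user setup scripted rather than clicked through the console, the equivalent AWS CLI calls look roughly like this; ACCOUNT_ID is a placeholder, and the access key returned by create-access-key is what goes into the Elasticsearch keystore later.

# Create the user, generate programmatic access keys, and attach the S3 policy
aws iam create-user --user-name prod_es_user
aws iam create-access-key --user-name prod_es_user
aws iam attach-user-policy --user-name prod_es_user \
  --policy-arn arn:aws:iam::ACCOUNT_ID:policy/s3role-es-prod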
  • We would also need to have a role for our AWS ES service so that it can write to S3.
aws iam create-role --role-name es-prod-s3-repository --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Sid": "", "Effect": "Allow", "Principal": {"Service": "es.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
  • Now go to IAM in the console, select the role es-prod-s3-repository, and attach the policy s3role-es-prod. Once all the permissions are in place on the AWS side, we can go ahead with the snapshot and restore.
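The console step above can also be done from the CLI if you keep everything scripted (ACCOUNT_ID is again a placeholder):

# Attach the S3 access policy to the role that the ES service will assume
aws iam attach-role-policy --role-name es-prod-s3-repository \
  --policy-arn arn:aws:iam::ACCOUNT_ID:policy/s3role-es-prod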
  • On the hosted ES side, install the S3 repository plugin
    Install the repository-s3 plugin on the ES instances running on the physical servers. You will need to restart Elasticsearch afterwards (on each node).
[tejas@data02 ~]$ sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-s3
[sudo] password for tejas:
-> Downloading repository-s3 from elastic
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
Continue with installation? [y/N]y
-> Installed repository-s3
  • Set up the access keys for user prod_es_user in the Elasticsearch keystore
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.access_key
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.secret_key
  • Register repository to S3
    → on the hosted ES side
[tejas_nayak@data02 ~]$ curl -u admin -H "Content-Type: application/json" -XPUT 'https://ip:port/_snapshot/s3_repository_data02' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "es-bucket-prod",
    "region": "us-east-1"
  }
}'

→ On the AWS ES Side

curl -u admin -XPOST "https://domain-endpoint:443/_snapshot/s3_repository_data02" -H "Content-Type: application/json" -d'
{
  "type": "s3",
  "settings": {
    "bucket": "es-bucket-prod",
    "region": "us-east-1",
    "role_arn": "arn:aws:iam::***********:role/es-prod-s3-repository"
  }
}'
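Before pushing 2 TB of snapshots, it is worth confirming that the repository registered cleanly on both clusters; the standard snapshot APIs work for this (same placeholder endpoint and credentials as above).

# List the registered snapshot repositories and verify s3_repository_data02 appears
curl -u admin "https://domain-endpoint:443/_snapshot/_all?pretty"
curl -u admin "https://domain-endpoint:443/_cat/repositories?v"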
  • Create a snapshot named snapshot1 in repository s3_repository_data02
[tejas_nayak@data02 ~]$ curl -k -u admin -H "Content-Type: application/json" -XPUT "https://ip:port/_snapshot/s3_repository_data02/snapshot1?wait_for_completion=true" -d'
{
  "indices": "logstash-2020.01.05,logstash-2020.01.10",
  "include_global_state": false
}'

Here snapshot1 is the name that has to be used during the restore operation as well. Essentially, you create snapshots from the hosted ES, store them in the S3 repository you created, and then use the _restore ES API to put the data back into the AWS Managed ES Service.
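A 2 TB snapshot takes a while, so it helps to track its progress from the hosted side while it runs; the standard snapshot status APIs can be polled for this (endpoint and credentials are placeholders as before).

# Check the state and per-shard progress of the running snapshot
curl -u admin "https://ip:port/_snapshot/s3_repository_data02/snapshot1/_status?pretty"
# Or list all snapshots in the repository along with their states
curl -u admin "https://ip:port/_cat/snapshots/s3_repository_data02?v"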

  • Restore the Snapshot
curl -XPOST -u admin https://domain-endpoint:443/_snapshot/s3_repository_data02/snapshot1/_restore

All the indices captured in the snapshot (in our case logstash-2020.01.05 and logstash-2020.01.10) are now migrated to the AWS Managed Elasticsearch Service. Any index-specific settings or index-level tweaks that were made (say, an alias) get migrated along with them by this restore API.
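A quick sanity check after each restore saves surprises later; the commands below assume the same domain endpoint and logstash index naming used above.

# Confirm the restored indices exist, are healthy, and have the expected doc counts
curl -u admin "https://domain-endpoint:443/_cat/indices/logstash-2020.01.*?v&h=index,health,docs.count,store.size"
# Watch shard recovery progress while a restore is still running
curl -u admin "https://domain-endpoint:443/_cat/recovery?v&active_only=true"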

In this way, we migrated all the ES indices from our hosted ES to AWS-managed ES by creating snapshots in relevant chunks and restoring them on the AWS side.

Switch the Logstash Elasticsearch output

Once you have migrated all the data, you don't want to sit and sync it up every day until the applications and consumers actually switch over to the AWS Managed Elasticsearch. A small tweak on the Logstash side lets us send data to both the current ES cluster and the newly added AWS-managed ES cluster.

output {
  elasticsearch {
    hosts => "vpc-endpoint.us-east-1.es.amazonaws.com:443"
    ssl => true
    ssl_certificate_verification => false
    retry_max_interval => 15
    user => "logstash_es_user"
    password => "********"
  }
  elasticsearch {
    hosts => ["ip1:port","ip2:port","ip3:port"]
    ssl => true
    ssl_certificate_verification => false
    retry_max_interval => 15
    user => "logstash_es_user"
    password => "********"
  }
}

In this way, the entire data migration becomes a one-time job, and the data stays in sync until the actual switch is made.

What about Kibana?

  • If you have set up your AWS Elasticsearch domain outside the VPC (public), you will get a Kibana URL once the cluster is up and running. You can access it directly by clicking on the URL, or you can proxy-pass the URL through the web server you are already using.
  • But if you have set up your Elasticsearch inside a VPC and want to access Kibana from outside the VPC, AWS provides three solutions. I will be writing a separate blog on how we used Cognito to automate the user access management process, and on the configuration we used to access Kibana outside the VPC.

Using Ultrawarm Nodes

  • Once you have enabled UltraWarm nodes, you can use them to store read-only data for longer periods at a lower cost. UltraWarm is best suited to immutable data, such as logs.
  • If you have finished writing to an index and no longer need the fastest possible search performance for it, migrate it from hot to warm storage.
# Migrate an index from hot to warm storage
curl -XPOST -u admin https://vpc-endpoint.us-east-1.es.amazonaws.com:443/_ultrawarm/migration/my-index/_warm
# List all warm indices
curl -XGET -u admin https://vpc-endpoint.us-east-1.es.amazonaws.com:443/_cat/indices/_warm
# Migrate an index from warm back to hot storage
curl -XPOST -u admin https://vpc-endpoint.us-east-1.es.amazonaws.com:443/_ultrawarm/migration/my-index/_hot
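Migrations to and from UltraWarm are queued and can take a while for large indices; the status endpoint documented alongside the migration APIs can be polled to see where an index is in the process (my-index is the same placeholder as above).

# Check the current state of a hot-to-warm (or warm-to-hot) migration
curl -XGET -u admin https://vpc-endpoint.us-east-1.es.amazonaws.com:443/_ultrawarm/migration/my-index/_status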

Sample Architecture Diagram of the AWS ES Cluster

AWS Managed ES Cluster

Monitoring?

The beauty of moving anything to an AWS managed service is the built-in ability to create CloudWatch alarms. We have enabled all the recommended CloudWatch alarms for Elasticsearch, plus some alarms specific to our own monitoring.
You can set up these alarms easily across all environments by using a CloudFormation template.
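As an illustration, one of the recommended alarms (cluster status red) could be created with the CLI roughly as follows; the domain name, account ID, and SNS topic are placeholders, and you should lift the exact thresholds from the AWS recommended-alarms list.

# Alarm when the cluster status is red (at least one primary shard is unassigned)
aws cloudwatch put-metric-alarm \
  --alarm-name es-prod-cluster-status-red \
  --namespace AWS/ES \
  --metric-name ClusterStatus.red \
  --dimensions Name=DomainName,Value=es-prod Name=ClientId,Value=ACCOUNT_ID \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:es-alerts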

Six months after moving to the AWS Managed Elasticsearch Service, it has removed a lot of operational overhead and allowed us to focus more on building things rather than managing them. And if any issue arises, the AWS Support Team is always a button click away. It's a shared responsibility model: AWS manages ES uptime, deployments, Kibana, and security patch releases, while our responsibility is to monitor our ES cluster and apply patches as we need them.

I will come up with another blog on Kibana Management and Accessing Kibana outside the VPC.

Thank you for taking the time to read this!!!
