Lessons Learned from Migrating our Production Elasticsearch Cluster

Joseph Park
Imagine Learning Engineering
2 min read · Oct 22, 2018

All was good. Our support team could easily access our Elasticsearch cluster for their investigative logging needs. Until one day, it stopped keeping up with our Kinesis stream (and Lambda function) and eventually became unresponsive. Elastic.co would not respond to our cry for help for at least 3 days (or really at all). So we told our support team we would have it back up by the end of the week. A week and a half later, we had a stable self-hosted Elasticsearch cluster on AWS. Here’s what we learned:

  1. Elasticsearch configuration is extremely complicated.
  2. X-Pack configuration is even more complicated.
  3. If you don’t configure the cluster to use the X-Pack trial license from the start, it will store a basic license on persistent storage. This is only bad if you are using Kubernetes and forget to delete the persistent volume claim (the exact setting is sketched after this list).
  4. If you want to install any plugins, they have to be installed before the Elasticsearch binary ever runs. This also really only affects Kubernetes users, since you are forced to build your own Dockerfile and image (a Dockerfile sketch covering this and the next point follows the list).
  5. Same thing with adding AWS credentials for the repository-s3 plugin. It’s best to pass the access and secret keys to docker build so they are only exposed at build time, on your local machine.
  6. EBS optimization is not turned on by default for AWS EC2 instances provisioned through kops on Kubernetes 1.10. You have to use the Debian Stretch AMI for 1.10 and set rootVolumeOptimization: true (instance group sketch below).
  7. Make sure you set iopsPerGB in the storage class if you are using provisioned IOPS, or set the IOPS through the AWS console. Otherwise it defaults to 100. Yeah, exactly. It defaults to 100. (StorageClass sketch below.)
  8. Use dedicated master nodes. We did not, but only because X-Pack costs $6,600 per node and we don’t want to pay it. However, if you have the cash or don’t use X-Pack, then you should definitely do it. Otherwise the cluster becomes unstable, and it can take a while for the shards to be redistributed (or even acknowledged). (Role settings sketched below.)
  9. Large queries consume A LOT of CPU. Our warm node has an SSD (because the HDD was just too slow), and 30-day queries (we are generating about 100 GB of logs a day) pegged all 4 CPUs at 100%. We bumped the instance to 16 CPUs and it handled them better. Not perfect, just better.
  10. Use coordinating-only nodes, if it makes sense. We did because our Kinesis stream and Lambda function are constantly throwing requests at our cluster. We provisioned 2 nodes with 4 CPUs each and they get bombarded. However, this lets our data nodes focus on indexing and searching (also sketched below).
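
For point 3, the fix is a one-line setting. This is a minimal sketch of the relevant elasticsearch.yml line, assuming Elasticsearch 6.3+, where the self-generated license type is configurable:

```yaml
# elasticsearch.yml (sketch): ask for the trial license up front so a basic
# license never gets written to the data path on the persistent volume.
xpack.license.self_generated.type: trial
```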
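
For points 4 and 5, a hypothetical Dockerfile along these lines bakes the plugin in and keeps the keys out of the running container’s config (the image tag, keystore usage, and build-arg names are illustrative, not our exact file):

```dockerfile
# Illustrative sketch: install the plugin at build time and pass the AWS keys
# as build args so they are only needed on the machine doing the build.
FROM docker.elastic.co/elasticsearch/elasticsearch:6.4.2

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY

# Plugins have to be in place before the Elasticsearch binary ever runs.
RUN bin/elasticsearch-plugin install --batch repository-s3

# Put the S3 credentials in the Elasticsearch keystore instead of elasticsearch.yml.
RUN [ -f config/elasticsearch.keystore ] || bin/elasticsearch-keystore create
RUN echo "$AWS_ACCESS_KEY_ID" | bin/elasticsearch-keystore add --stdin s3.client.default.access_key && \
    echo "$AWS_SECRET_ACCESS_KEY" | bin/elasticsearch-keystore add --stdin s3.client.default.secret_key
```

You would build it locally with docker build --build-arg AWS_ACCESS_KEY_ID=… --build-arg AWS_SECRET_ACCESS_KEY=… . and push the image to your registry. Keep in mind that build args can still surface in the image history, so treat the image itself as sensitive.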
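
For point 6, the kops instance group spec looks roughly like this (cluster name, image date, and instance sizes are placeholders for whatever you run):

```yaml
# kops InstanceGroup sketch: EBS optimization has to be requested explicitly
# via rootVolumeOptimization, and on 1.10 that means the Stretch image.
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster.example.com   # placeholder cluster name
  name: es-data-nodes
spec:
  role: Node
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-08-17   # illustrative image name
  machineType: m5.xlarge
  rootVolumeOptimization: true
  minSize: 3
  maxSize: 3
  subnets:
  - us-east-1a
```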
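
For point 7, the StorageClass is where iopsPerGB goes (the name and ratio are placeholders; pick a ratio that gives your volume sizes the IOPS you actually want):

```yaml
# StorageClass sketch for provisioned-IOPS (io1) EBS volumes; without an
# explicit iopsPerGB you end up with the low default mentioned above.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: es-data-io1
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "30"
  fsType: ext4
reclaimPolicy: Retain
```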
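
For point 8, a dedicated master is just a node with every other role switched off; in 6.x-style elasticsearch.yml that is:

```yaml
# Dedicated master-eligible node: manages cluster state, holds no data,
# runs no ingest pipelines.
node.master: true
node.data: false
node.ingest: false
```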
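
And for point 10, a coordinating-only node is the same idea with everything turned off, so it does nothing but accept requests and fan them out to the data nodes:

```yaml
# Coordinating-only node: not master-eligible, holds no data, no ingest;
# it only routes indexing and search requests.
node.master: false
node.data: false
node.ingest: false
```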

I think the above list is long enough. We learned a lot more, including using Curator, the advantages of a hot-warm architecture, restoring a snapshot and associating it with a StatefulSet, shard sizing and allocating shards evenly across data nodes, etc. Maybe I’ll rant about it further in a part 2…
