How we reduced our Elasticsearch shards by 90% to improve performance

Tim Little
Kudos Engineering
Oct 1, 2019
Photo by Edu Grande on Unsplash

We have been using Elasticsearch as our time series database for a while now, and we plan to rely on it a lot more in our new product, Kudos Pro. So we recently reviewed our usage of the platform to make sure we were getting the most out of it.

Investigation

When we started to look at the cluster, one thing that was immediately apparent was that we were a major version behind. We were on 6.7 and 7.2 had just been released, so we wanted to upgrade to 7.2 before we started building anything else on top of Elasticsearch.

Diving deeper into performance and usage, we noticed that JVM memory pressure on the cluster was very high, averaging about 80%, and there was a nice big banner in the Elastic console warning that this could affect performance.

Warning on Elastic console

We also checked back through our ticketing system and found that upgrades could take a very long time, and that we frequently had to contact Elastic for help completing them.

Another problem was that we had far too many shards in the cluster. We only had two data nodes, yet a total of 3,000 shards across 650 indices.

It turned out these problems were related: the root cause of the slow upgrades was the sheer number of shards in the cluster. So we decided to address that before moving forward with the upgrade.

We started to investigate why we had so many shards for such a small cluster, and discovered that the version of Elasticsearch we were using defaulted to 5 primary shards per index (a default that was lowered to 1 in Elasticsearch 7).

We also discovered that we were rotating indices far too frequently, creating a new index every week for most of our use cases, and sometimes even daily. These indices were tiny, on average about 2MB each.
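If you want to get the same picture for your own cluster, the _cat APIs make it easy to see how index sizes and shard counts stack up. For example, a request like this from the Kibana Dev Tools console lists every index with its primary shard count, replica count and size on disk (the column selection is just one we found useful):

```
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size
```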

Remediation

Photo by danny howe on Unsplash

We decided to tackle the frequency of our index rotation first.

Our daily indices held things like Heartbeat data, Watcher history and some SLO monitoring data. Most of this data is only relevant for the first month after collection, so we decided to simply remove indices older than 60 days.

We already have a Kubernetes Job that runs every day and calls Elasticsearch Curator for this kind of housekeeping, so we added a Curator task to delete indices matching those index patterns after 60 days.
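For reference, this is just a standard Curator delete_indices action; a minimal sketch looks something like the following, where the heartbeat- prefix is a placeholder rather than our real index patterns:

```yaml
actions:
  1:
    action: delete_indices
    description: "Delete daily monitoring indices older than 60 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: heartbeat-   # placeholder; repeat for each index pattern to clean up
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60
```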

This helped reduce our number of shards and indices by about 350, but we were still well over the soft limit of 1000 shards per node.

Soft limit of shards per node

Next we moved on to the weekly indices. These held things like reporting data and build monitoring metrics, ingested by Logstash running in Kubernetes. Most of this data was low volume, sometimes only a few MB per index, so we wanted to switch to yearly indices to save space and shards.

The first thing we did was switch Logstash to write to a yearly index. This was a simple change to our config, and new data started flowing into the yearly index straight away. We kept the names similar so they still matched the index patterns set up in Kibana for reporting.

Switch to yearly index
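In the Logstash Elasticsearch output this amounts to a one-line change; a rough sketch, with a placeholder index name and date patterns rather than our actual config:

```
output {
  elasticsearch {
    hosts => ["${ES_HOST}"]
    # before: a new index every week
    # index => "reporting-%{+xxxx.ww}"
    # after: one index per year
    index => "reporting-%{+YYYY}"
  }
}
```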

Once we had created the new yearly index, we needed to move the data from the many smaller indices into it. To do this we wrote a Python script that would list the weekly indices for each year, reindex them into the yearly index, and delete each old weekly index (after checking that its data had been moved correctly).
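The core of that script is just the reindex API plus a document-count check before deleting anything. A simplified sketch using the official Python client (index names and the cluster URL are placeholders, and error handling is omitted):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

year = "2018"
yearly_index = f"reporting-{year}"        # e.g. reporting-2018
weekly_pattern = f"reporting-{year}.*"    # e.g. reporting-2018.34

for weekly in sorted(es.indices.get(index=weekly_pattern)):
    source_count = es.count(index=weekly)["count"]
    before = (
        es.count(index=yearly_index)["count"]
        if es.indices.exists(index=yearly_index)
        else 0
    )

    # Copy the weekly index into the yearly one and wait for it to finish.
    es.reindex(
        body={"source": {"index": weekly}, "dest": {"index": yearly_index}},
        wait_for_completion=True,
        request_timeout=600,
    )
    es.indices.refresh(index=yearly_index)

    # Only delete the weekly index once every document is accounted for.
    moved = es.count(index=yearly_index)["count"] - before
    if moved == source_count:
        es.indices.delete(index=weekly)
    else:
        print(f"Count mismatch for {weekly}: expected {source_count}, got {moved}")
```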

Once that was done, we had reduced the cluster from 650 indices and 3,000 shards to 270 indices and 300 shards.

This had a huge impact on the cluster, reducing JVM memory pressure from about 80% to 40% on each of the nodes.

We wanted to future-proof this success, so we decided to ensure that all new indices were created with the right number of shards. To do this we modified the Logstash index template and set it to use only one shard per index (the new default in Elasticsearch 7).

We took an infrastructure-as-code approach and stored the index template in a repo, along with a Python script that parses the JSON and applies it via the Elasticsearch API. That way the engineering team can add to the index template and apply it the same way every time.
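A minimal sketch of that apply script, using the official Python client and the legacy _template API that Elasticsearch 6.x/7.x uses (the file path and template name are placeholders, not our real layout):

```python
import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# The template JSON in the repo sets "number_of_shards": 1 in its settings,
# so every new Logstash index starts with a single primary shard.
TEMPLATE_NAME = "logstash"                 # placeholder template name
TEMPLATE_FILE = "templates/logstash.json"  # placeholder path in the repo

with open(TEMPLATE_FILE) as f:
    template = json.load(f)

# Equivalent to PUT _template/<name>: create or update the index template.
es.indices.put_template(name=TEMPLATE_NAME, body=template)
print(f"Applied index template {TEMPLATE_NAME}")
```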

The upgrade

Once the new index template was in place and all the data was being ingested into yearly indices, we checked the Upgrade Assistant in Kibana to see if there was anything else we needed to do before upgrading.

There were a few settings that needed updating, mostly notification settings for our watches, which had to be moved into the Elasticsearch keystore via the Elastic console.

Then we started the upgrade, going from 6.7 to 6.8 first to see whether the changes we had made had any impact on how smoothly the upgrade process ran.

Success! 🎉

We were able to upgrade the minor revision without issue.

The next step was the big one: a major version upgrade to Elasticsearch 7.2.

With Elastic Cloud we can upgrade at the push of a button, so we made sure we had a snapshot of all the cluster's indices stored in a GCS bucket, and then we started the upgrade.

Unfortunately, it did not go smoothly.

One of the master nodes failed to come online correctly, so the cluster was stuck in limbo between 6.8 and 7.2 for a while. We managed to kick-start the upgrade to 7.2 and it completed, but the 6.8 nodes were still up and configured. We ended up opening a support case with Elastic, and they soon resolved the problem.

Conclusion

It was a long road, but the end result is an Elasticsearch cluster running 7.2 and tuned much better for our future use. JVM memory pressure is far lower, and all of our indices now start out with sensible shard counts.

Going through this process we learned a huge amount about Elasticsearch and how to administer the cluster, and it gave us more insight into which of its features we can make better use of going forward.

If all this sounds interesting to you, why not consider joining Kudos?
