How to turn a nice day into a nightmare: an Elasticsearch story

Stratio · Mar 30, 2023

When we first started using Elasticsearch for our predictive maintenance platform, about three years ago, we were a small company with a limited set of features, a small 3-node cluster plus one node for Kibana, and a small infrastructure team.

At the time, we chose to host our systems on AWS. We then had to decide whether to use a managed service or run everything ourselves on top of EC2 instances (for those not familiar with AWS, those are virtual machines). Since the managed service cost roughly double, going with EC2 instances was an easy choice.

Today, we are in a much different position. We have many more customers, a wider range of features, and way more vehicles to monitor. The cluster went from 3 nodes up to 15, and is now managed by a bigger infrastructure team.

A few months ago, we removed one node from the cluster. Everything was normal, the cluster was green, no issues whatsoever. The following morning, one of our senior developers noticed that Kibana was failing to perform some queries: it looked like it was attempting to query nodes that didn't exist anymore. Looking into Kibana's configuration, we realised that, while rotating nodes over a period of three years, we had forgotten to update it to include the new nodes that had been added since. The night before, we had removed the last node Kibana had knowledge of.
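For context, this is the kind of setting we are talking about. The snippet below is a hedged example with placeholder node names, not our actual configuration; in recent Kibana versions the list lives in kibana.yml under elasticsearch.hosts:

```yaml
# kibana.yml (illustrative, placeholder hostnames)
# Kibana only talks to the Elasticsearch nodes listed here, so rotating
# cluster nodes without updating this list eventually leaves Kibana
# pointing at machines that no longer exist.
elasticsearch.hosts:
  - "https://es-node-13:9200"
  - "https://es-node-14:9200"
  - "https://es-node-15:9200"
```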

So, we updated Kibana's configuration to match the current nodes in the cluster and restarted it, an operation that looked pretty simple… or so we thought. In less than a minute, our monitoring started beeping like crazy: Elasticsearch was in trouble.

We immediately started looking at logs and running the usual queries to try to understand what had happened. We saw that some nodes were reported as down, so we logged into those nodes and saw that the services were running. Strange, right?

Meanwhile, Elasticsearch was pretty much unresponsive. Several nodes were failing to connect to the master node, and responses were inconsistent: one minute we had 12 nodes, the next we had 10 or 14. It was chaos.
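To give an idea of the kind of checks we were running (this is a hedged sketch with placeholder hostnames and credentials, not our actual tooling), you can ask every node for its own view of the cluster and compare the answers:

```python
# Sketch: query each node's own view of cluster health (placeholder hosts/credentials).
import requests

NODES = ["es-node-01", "es-node-02", "es-node-03"]  # placeholders, not our real hosts
AUTH = ("elastic", "changeme")                      # placeholder credentials

for host in NODES:
    try:
        health = requests.get(
            f"https://{host}:9200/_cluster/health",
            auth=AUTH, verify=False, timeout=5,
        ).json()
        print(host, health["status"],
              "nodes seen:", health["number_of_nodes"],
              "unassigned shards:", health["unassigned_shards"])
    except requests.RequestException as exc:
        print(host, "unreachable:", exc)
```

When different nodes report different node counts, the cluster has a split view of itself, which matched exactly what we were seeing.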

Seeing this, we decided to restart the master node to force another node to take its place. That helped: several nodes were still down, but at least the cluster seemed stable. So we started recovering the missing nodes, stopping the service on each one and then bringing them back up one at a time.

The first nodes came back fine and the number of unassigned shards was decreasing, so we proceeded to bring back one more, Node 6. As soon as we did, the cluster started going berserk again! ‘What the hell?’ At least we now had something more concrete to go on. We stopped the node, took a better look at its logs, and found a message that said “Failed to authenticate user (…)”.

This was strange and, to make things worse, when we turned Node 6 on, we started receiving the same message on other nodes. We decided to check that node’s SSL certificate and found that yes, it had expired, more than a month earlier. It was a problem, but not enough to explain why it was suddenly causing this cascade of issues on other nodes.

We renewed the SSL certificate for Node 6, checked that it was the only one that had expired (thankfully it was), and restarted the same process of stopping the problematic nodes and bringing them back one by one. When we got to the last one, the cluster was running and shards were recovering… to some extent.
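A small sketch of the check we should have had in place all along (placeholder hostnames; it needs the `cryptography` package): fetch each node’s TLS certificate and report when it expires.

```python
# Sketch: report TLS certificate expiry for each node (placeholder hostnames).
import ssl
from datetime import datetime
from cryptography import x509

NODES = ["es-node-01", "es-node-02", "es-node-03"]  # placeholders

for host in NODES:
    # No validation here, so even an already-expired certificate can still be fetched.
    pem = ssl.get_server_certificate((host, 9200))
    cert = x509.load_pem_x509_certificate(pem.encode())
    days_left = (cert.not_valid_after - datetime.utcnow()).days
    print(f"{host}: expires {cert.not_valid_after:%Y-%m-%d} ({days_left} days left)")
```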

While we were busy bringing back the nodes, Elasticsearch tried to recover itself, and in the process we ended up with approximately 600 shards that would not recover at all.

No panic. The cluster was up and working properly, so we figured we’d run a few queries that could point us to something. Minutes later, we noticed that these shards were stuck unassigned with the reason “node left the cluster”, along with an error message saying the shard copies were “either stale or corrupt”.
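For anyone hitting the same wall, these are the kind of queries that surface that information (a hedged sketch, with placeholder endpoint, credentials, and index name): the allocation explain API tells you why a given shard is unassigned, and the cat shards API lists the recorded reason for every unassigned shard.

```python
# Sketch: find unassigned shards and the reason Elasticsearch recorded for them.
import requests

ES = "https://es-node-01:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")   # placeholder credentials

# Why is this particular primary unassigned? (index/shard are illustrative)
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": "vehicles-2023.03", "shard": 2, "primary": True},
    auth=AUTH, verify=False,
).json()
print(explain.get("unassigned_info", {}).get("reason"))  # e.g. NODE_LEFT

# Every unassigned shard, with its recorded reason.
shards = requests.get(
    f"{ES}/_cat/shards?h=index,shard,prirep,state,unassigned.reason&format=json",
    auth=AUTH, verify=False,
).json()
for s in shards:
    if s["state"] == "UNASSIGNED":
        print(s["index"], s["shard"], s["prirep"], s["unassigned.reason"])
```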

I am not going to lie, at this point most of the team was sweating. We had backups and, even better, everything we keep in Elasticsearch can be rebuilt. But that would take time and effort, and this was a business day, with services impacted and some systems down. Not a great picture.

Searching online, we found an interesting article where someone seemed to be outlining a similar problem. We decided to try the approach of Remco Verhoef, the author of the piece, who advised instructing Elasticsearch to reroute the failing shards to another node.
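The command in question is the cluster reroute API with an allocate_stale_primary command, which tells Elasticsearch to accept the (possibly stale) copy of a primary that lives on a specific node. Here is a hedged sketch with placeholder index, node, endpoint, and credentials, not our exact call:

```python
# Sketch: force-allocate a stale primary shard onto the node holding its data.
import requests

ES = "https://es-node-01:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")   # placeholder credentials

body = {
    "commands": [{
        "allocate_stale_primary": {
            "index": "vehicles-2023.03",  # illustrative index name
            "shard": 2,
            "node": "es-node-06",         # the node where the shard copy lives
            # Mandatory for this command: you explicitly accept that the stale
            # copy may be missing the latest writes.
            "accept_data_loss": True,
        }
    }]
}
resp = requests.post(f"{ES}/_cluster/reroute", json=body, auth=AUTH, verify=False)
print(resp.status_code, resp.json().get("acknowledged"))
```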

Thankfully, that worked. Shards recovered, and the team got ready for lunch (it was a team-building day, talk about luck…).

But of course, that wasn’t the end of it: the reroute command recovers one shard at a time, and notice the part where we have to specify the node.

That is the node where the shard’s data actually lives… and guess what? There is no easy way to find it (at least, none that we could find). This meant that for each shard we would have to try every node, and we had 600 shards! What a mess (if you got this far, don’t give up, wait till the end, it will be worth it, I promise). So the solution was to match each shard against every node, one attempt at a time.
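For the record, this is roughly what that brute-force matching looks like (a rough sketch of the idea, not the open-sourced script mentioned at the end; endpoint and credentials are placeholders): for every unassigned primary, try each node until the reroute command is accepted.

```python
# Sketch: try every node for every stuck primary until one accepts the reroute.
import requests

ES = "https://es-node-01:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")   # placeholder credentials

def get_json(path):
    r = requests.get(f"{ES}{path}", auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json()

nodes = [n["name"] for n in get_json("/_cat/nodes?h=name&format=json")]
shards = get_json("/_cat/shards?h=index,shard,prirep,state&format=json")
stuck = [s for s in shards if s["state"] == "UNASSIGNED" and s["prirep"] == "p"]

for shard in stuck:
    for node in nodes:
        cmd = {"commands": [{"allocate_stale_primary": {
            "index": shard["index"],
            "shard": int(shard["shard"]),
            "node": node,
            "accept_data_loss": True,
        }}]}
        r = requests.post(f"{ES}/_cluster/reroute", json=cmd, auth=AUTH, verify=False)
        if r.ok:
            print(f"{shard['index']}[{shard['shard']}] -> {node}")
            break  # this node had a copy of the shard; move on to the next one
```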

We were able to bring everything back up, with no corrupt shards. From failure to full recovery it took approximately 3 hours, and we were still able to join the rest of the team for a (late) lunch (at least that part was awesome).

Now, let’s get to the lessons learnt, which hopefully will help someone else make it to lunch on time and avoid the same stress:

  • Be aware of your choices, all the time. A decision we made 3 years ago had an impact here (ahh, the devil known as “tech debt”). Let me be honest: we were aware of this tech debt and were targeting it, but sometimes you just can’t move fast enough, so our advice is to monitor ANYTHING that could potentially go wrong, at all times.
  • Following the conclusion above, if you use Elasticsearch with TLS, for the love of god monitor your Certificates!!
  • Finally, make sure you have the right “auto-recovery” options in place. In our case, these actually hurt us, because nodes kept coming back to life and failing again. To give a little more detail: we had configured our instances to restart the Elasticsearch service on failure, but the cluster was also set to promote replicas to primaries whenever a primary failed. By doing what we did, we pushed the cluster into a loop that ended up leaving those shards “stuck”. A sketch of one safeguard for this follows right after this list.
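As an example of the kind of safeguard that would have helped, here is a hedged sketch (placeholder endpoint and credentials) of the usual “disable shard allocation while you restart a node” routine, which keeps the cluster from frantically reshuffling shards every time a node bounces:

```python
# Sketch: temporarily restrict shard allocation around a node restart.
import requests

ES = "https://es-node-01:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")   # placeholder credentials

def set_allocation(mode: str):
    # "primaries" keeps data served but stops replicas being shuffled around;
    # "all" restores normal behaviour.
    body = {"persistent": {"cluster.routing.allocation.enable": mode}}
    r = requests.put(f"{ES}/_cluster/settings", json=body, auth=AUTH, verify=False)
    r.raise_for_status()

set_allocation("primaries")  # before stopping/restarting a node
# ... restart the node and wait for it to rejoin ...
set_allocation("all")        # once the node is back in the cluster
```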

Things we still haven’t been able to figure out:

  • How in the world a certificate that had been expired for over a month only caused trouble now, and not the minute it expired;
  • Why everything failed right after a Kibana restart. It seems like a huge coincidence (and experience tells us there is no such thing), but we still can’t find the connection.

Now, if you got down to this point, you deserve a cookie.

You probably got to this article because you searched for the error about shards not recovering. As stated above, recovering shard by shard and going node by node is a huge time burden, so Diogo Duarte (one of the heroes in this mess) wrote a script that does it for you.

We are open sourcing this to the community: https://github.com/stratio-automotive/elastic-fix-stale-shards

This is our first open source community contribution, but we plan to do a lot more going forward!

Stay tuned!

Stratio: the world’s #1 predictive fleet maintenance platform.