Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness

Vaidik Kapoor
Dec 14, 2015 · 9 min read

Some Background

Our product search and navigation are served from Elasticsearch. Every day we build a new index that holds products, related merchant data and locality data, each under its own mapping. Once the build finishes, the fixed alias that our services query is switched from the existing index to the latest one. This works well.
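
The alias switch described above can be sketched as a request body for Elasticsearch's `_aliases` endpoint. This is only an illustration under assumptions, not our actual tooling: the helper name `build_alias_swap` and the alias and index names are hypothetical.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` to `new_index`.

    Both actions travel in one request, so the alias is repointed in a
    single step and never left without an index behind it.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical alias and index names, for illustration only.
    body = build_alias_swap(
        "products", "products_2015_12_13", "products_2015_12_14"
    )
    print(json.dumps(body, indent=2))
```

Swapping the two index arguments gives the body for rolling back to the previous day's index.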

The Incident

A couple of weeks back, our consumer mobile apps stopped working because our Consumer API was returning a 500 response for every request that depends on Elasticsearch. This happened at around 4:30 AM, when obviously nobody from our engineering team was awake.

Postmortem

Although our service was working normally again, that was only because we had switched to an old index, and we could not keep serving stale data.

sort[<custom:\"price\": org.elasticsearch.index.fielddata.fieldcomparator.LongValuesComparatorSource@60d18f72>]: Query Failed [Failed to execute main query]]; nested: ElasticsearchException[java.lang.NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)]

What changed?

For the upcoming sale, the business teams behind the campaign wanted to run promotions in our app. To make promotion content controllable via the CMS, our engineering team added features that allow our Content Team to manage the promotions shown in the app. The feature also made promotions queryable through Elasticsearch, since promotions are also location specific.

Understanding Elasticsearch Better

With Elasticsearch, things may seem very nice and simple from the outside. In reality, they may not be as nice and simple. One of the nice things about Elasticsearch, you may say, is that it is schema-less and uses JSON for storing documents. JSON by its very nature is very flexible to work with: add or remove things as you please, and Elasticsearch will take care of the rest. At least, that is what most of us beginning with Elasticsearch believe to be true. The reality is quite different.

How Elasticsearch really handles mappings

An Elasticsearch index is nothing but a collection of one or more shards. Practically speaking, one index with 5 shards and 5 indices with one shard each are essentially the same.

How we fixed it

As I mentioned, we had immediately switched back to the previous day’s index, but that was just a quick interim fix. We had new data to show our consumers for the sale that was due to start about 6 hours after the failure.

How could we have avoided this?

Better testing

  1. The CMS code that was pushed live had indeed been tested, but only locally, and it was pushed to production because the requirements were urgent. But everything is urgent in a startup. It’s easy to dig your own grave if you don’t follow the practices that are important. We cannot get careless.
  2. How was it not caught locally? The team testing the feature only tested the APIs used for getting promotions. They did not even imagine that the APIs that make use of the products mapping could be affected. On top of this, the issue surfaced only when our content team started using the feature. We were actually sitting on a time bomb. Thank god it happened at night. Automated regression testing would have caught this issue. This is something we don’t have yet and are prioritizing with every release.
  3. Put a review process around everything that goes into Elasticsearch, just as you would around your RDBMS (you do that, right?).
  4. Separating concerns is important. Because of circumstances early in how our systems were built, we relied on the CMS to index documents in Elasticsearch for the consumer app to consume. Perhaps, if the consumer app team had indexed their documents themselves, this issue would have been caught in their testing process. Maybe it is time for us to move the responsibility of indexing documents for search out of the CMS.
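
One way an automated check or a review step for mappings might look is sketched below. It assumes the risk being guarded against is the same field name carrying different types under different mappings of one index. The function name `find_type_conflicts` and the sample field types are hypothetical and are not taken from our real mappings.

```python
def find_type_conflicts(mappings):
    """Return {field: {mapping_name: type}} for fields whose declared type
    differs between mappings of the same index.

    `mappings` has the shape {mapping_name: {"properties": {field: {"type": ...}}}}.
    Only top-level fields are compared; nested objects are not inspected.
    """
    seen = {}
    for mapping_name, mapping in mappings.items():
        for field, spec in mapping.get("properties", {}).items():
            field_type = spec.get("type")
            if field_type is not None:
                seen.setdefault(field, {})[mapping_name] = field_type
    return {
        field: by_mapping
        for field, by_mapping in seen.items()
        if len(set(by_mapping.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical mappings: the field types here are invented for the example.
    sample = {
        "products": {"properties": {"price": {"type": "long"}}},
        "promotions": {"properties": {"price": {"type": "string"}}},
    }
    for field, by_mapping in find_type_conflicts(sample).items():
        print(field, by_mapping)
```

A check like this could run in a regression suite or before the daily index is switched under the alias, failing the build whenever the returned dictionary is not empty.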

Read The F***ing Docs

Very aptly put in our world. I can’t emphasize this phrase enough. Nothing is magic. Nothing just works. As engineers, it’s our responsibility to get to the root of things and understand how they work. Had we made an effort to understand mappings better, which is such an essential part of running Elasticsearch, we would not have made this mistake.

Conclusion

  1. Elasticsearch is NOT bad at all. We love Elasticsearch. It would be difficult to imagine a lot of things we do today without Elasticsearch. But the low barrier to entry can make you careless and have you shoot yourself in the foot.
  2. This incident proved how careless we were with testing changes in pre-production environments. One of the reasons why this happened to us was not having sanitized pre-production environments for testing. We are working on bringing more dev-prod parity in our software release cycles.
  3. This incident also taught us the importance of understanding the systems we use.


Written by Vaidik Kapoor, Software Engineer, Building Tech at Grofers

Thoughts on software engineering and technology
