The HOW & WOW of Our ElasticSearch Upgrade

Deepak Varshney
Tokopedia Engineering
7 min read · Feb 18, 2021

Welcome Back, Guys!!!

This is the second and final station of our Elastic Search upgrade journey. In this part, I will explain our migration plan and we will learn how we planned and executed this activity on a grand scale.

In the end, I will share the results that we achieved, and also highlight the issues that we faced during this journey.

Just to recap, in part 1, we learned about the salient features of ES 7 and why we decided to upgrade our Elastic Search. If you haven’t gone through that you can check it out here.

Let’s see the 4 ground rules before embarking on this journey

  • Make code changes so the codebase is compatible with the breaking changes in ES 7.
  • Run a POC on 1 server.
  • Request an ES 7 cluster and route all traffic to ES 7.
  • Duplicate code cleanup: we replicated the ES 5 functions to make them compatible with ES 7. After completion, we should do the holy activity of removing the ES 5 code.

Let’s begin the masterplan

Time to roll up our sleeves and make our guest (ES 7) a TopAds family member :)

FYI, the left one is MG(teammate) and the right one is me!!! :D

Day (1–15): Code Change and POC(Proof of Concept) Phase

This was the most time-consuming phase and we had to keep our eyes open all the time :(

Don’t worry baby, everything will be fine!!!
  • We learned about the breaking changes in ES 7 and changed our code accordingly. For example:
      * The standard token filter has been removed
      * Mapping types are deprecated
      * The _all field has been removed

Infra Setup for POC

  • We requested an ES 7.7.1 cluster with 1 D-Node for the POC.
  • Created all index mappings without type.
  • Configured the new ES 7 cluster with settings similar to the ES 5 cluster.
  • Added ES libraries such as dynamic-synonym and hunspell in versions compatible with ES 7. Don’t forget to add these!!!
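
To make the "mappings without type" step concrete, here is a minimal sketch of the rewrite involved. It assumes the ES 5 index body nests its properties under a single type name, which ES 7 expects to be lifted one level up. The helper name StripMappingType and the sample type "ad" are hypothetical and only for illustration, not the code we ran.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// StripMappingType converts an ES 5 style index body, where "mappings"
// holds exactly one type name, into the typeless form used by ES 7.
// Hypothetical helper, for illustration only.
func StripMappingType(es5Body []byte) ([]byte, error) {
	var body map[string]json.RawMessage
	if err := json.Unmarshal(es5Body, &body); err != nil {
		return nil, err
	}
	raw, ok := body["mappings"]
	if !ok {
		return es5Body, nil // no mappings section, nothing to rewrite
	}
	var byType map[string]json.RawMessage
	if err := json.Unmarshal(raw, &byType); err != nil {
		return nil, err
	}
	if _, typeless := byType["properties"]; typeless {
		return es5Body, nil // already in the typeless form
	}
	if len(byType) != 1 {
		return nil, errors.New("expected exactly one mapping type")
	}
	for _, inner := range byType {
		body["mappings"] = inner // lift the type's content one level up
	}
	return json.Marshal(body)
}

func main() {
	es5 := []byte(`{"mappings":{"ad":{"properties":{"title":{"type":"text"}}}}}`)
	es7, err := StripMappingType(es5)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(es7))
}
```

Settings and analyzers sit outside the type level, so in this sketch they pass through untouched.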

Indexing In ES

  • We removed type from all queries in the codebase as this is deprecated in ES 7.
  • We have an NSQ consumer that indexes ES docs.
  • We created another channel from the indexing consumers and replicated all ES 5 code to make it compatible with ES 7. This is done to keep ES 7 indexing code decoupled from ES 5 indexing code. So if something breaks, it won’t affect the existing system.

This is how we achieved dual indexing on ES 5 as well as ES 7 cluster.
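
The decoupling idea can be sketched without NSQ at all. The example below uses plain Go channels as a stand-in for two channels on one topic: every doc ID is delivered to both indexing paths, and each path counts its own failures. The names Indexer, DualIndex, alwaysOK and failOn are assumptions made for this sketch, not our consumer code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Indexer is a hypothetical stand-in for the ES 5 or ES 7 indexing code path.
type Indexer func(docID string) error

// DualIndex delivers every doc ID to both indexers on separate goroutines,
// mimicking two channels subscribed to one topic. Each side keeps its own
// error count, so a failing ES 7 path never fails the ES 5 path.
func DualIndex(docIDs []string, es5, es7 Indexer) (es5Errs, es7Errs int) {
	run := func(idx Indexer, in <-chan string, errs *int, wg *sync.WaitGroup) {
		defer wg.Done()
		for id := range in {
			if err := idx(id); err != nil {
				*errs++
			}
		}
	}
	ch5 := make(chan string, len(docIDs))
	ch7 := make(chan string, len(docIDs))
	var wg sync.WaitGroup
	wg.Add(2)
	go run(es5, ch5, &es5Errs, &wg)
	go run(es7, ch7, &es7Errs, &wg)
	for _, id := range docIDs {
		ch5 <- id // each channel gets its own copy of the message
		ch7 <- id
	}
	close(ch5)
	close(ch7)
	wg.Wait()
	return es5Errs, es7Errs
}

// alwaysOK simulates a healthy indexing path.
func alwaysOK(string) error { return nil }

// failOn simulates an indexing path that breaks for one document.
func failOn(badID string) Indexer {
	return func(id string) error {
		if id == badID {
			return errors.New("indexing failed for " + id)
		}
		return nil
	}
}

func main() {
	e5, e7 := DualIndex([]string{"ad-1", "ad-2", "ad-3"}, alwaysOK, failOn("ad-2"))
	fmt.Printf("es5 errors: %d, es7 errors: %d\n", e5, e7)
}
```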

Debugger

  • We added an on-demand debugger to compare the response of ES 5 and ES 7, and print errors from ES 7 query execution.
  • The comparison logic was that if we do a query, the number of documents for each ad score in ES 5 should match the number of documents in ES 7 for the same ad score.
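
The comparison rule above can be sketched as follows. This is a simplified illustration that assumes ad scores are integers and that each response has already been reduced to a "score to document count" map; the names ScoreCounts and DiffScoreCounts are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoreCounts maps an ad score to the number of documents returned with it.
// Integer scores are an assumption made for this sketch.
type ScoreCounts map[int]int

// DiffScoreCounts lists every ad score whose document count differs
// between the ES 5 and ES 7 responses, in ascending score order.
func DiffScoreCounts(es5, es7 ScoreCounts) []string {
	seen := make(map[int]struct{})
	for s := range es5 {
		seen[s] = struct{}{}
	}
	for s := range es7 {
		seen[s] = struct{}{}
	}
	scores := make([]int, 0, len(seen))
	for s := range seen {
		scores = append(scores, s)
	}
	sort.Ints(scores)

	var diffs []string
	for _, s := range scores {
		// A score missing on one side reads as a count of 0.
		if es5[s] != es7[s] {
			diffs = append(diffs, fmt.Sprintf("score %d: es5=%d es7=%d", s, es5[s], es7[s]))
		}
	}
	return diffs
}

func main() {
	es5 := ScoreCounts{10: 3, 8: 2}
	es7 := ScoreCounts{10: 3, 8: 1, 5: 1}
	for _, d := range DiffScoreCounts(es5, es7) {
		fmt.Println(d)
	}
}
```

An empty result means the two clusters agree for that query.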

Migrator

  • We have an “Awesome Migrator” to migrate all ads from PostgresDB to Elasticsearch in batches to prevent indexing load on ElasticSearch.
  • Guess what!!! After migration, the document counts on ES 5 and ES 7 were similar.
  • There was a minor discrepancy of 1%, which could be due to the dynamic nature of ElasticSearch (docs keep getting indexed and deleted).
  • This meant that we didn’t miss any indexing flow in the code. Aww yeah!!! :D
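
Here is a rough sketch of the batching idea and the count check, under stated assumptions: ads are identified by integer IDs, the index callback stands in for a bulk request to Elasticsearch, and a fixed pause between batches is how load is limited. MigrateInBatches and CountDiscrepancyPct are hypothetical names; the real migrator is not shown in this post.

```go
package main

import (
	"fmt"
	"time"
)

// MigrateInBatches feeds ids to index in slices of at most batchSize,
// sleeping between batches so the cluster is not flooded.
func MigrateInBatches(ids []int, batchSize int, pause time.Duration, index func(batch []int) error) error {
	if batchSize <= 0 {
		return fmt.Errorf("batchSize must be positive, got %d", batchSize)
	}
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		if err := index(ids[start:end]); err != nil {
			return fmt.Errorf("batch starting at %d: %w", start, err)
		}
		if end < len(ids) {
			time.Sleep(pause) // no pause needed after the last batch
		}
	}
	return nil
}

// CountDiscrepancyPct returns how far the ES 7 document count is from the
// ES 5 count, as a percentage of the ES 5 count.
func CountDiscrepancyPct(es5Count, es7Count int) float64 {
	if es5Count == 0 {
		return 0
	}
	diff := es5Count - es7Count
	if diff < 0 {
		diff = -diff
	}
	return float64(diff*100) / float64(es5Count)
}

func main() {
	ids := []int{1, 2, 3, 4, 5}
	err := MigrateInBatches(ids, 2, time.Millisecond, func(batch []int) error {
		fmt.Println("indexing batch:", batch)
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("discrepancy: %.1f%%\n", CountDiscrepancyPct(1000, 990))
}
```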

I recommend keeping a migrator handy at all times so you can quickly migrate ads from the DB to ElasticSearch.

It’s time for the querying part, baby!!!

  • Create a clone of functions that are querying ES 5 to make them compatible with ES 7 and add a switcher. E.g.:

Original: func BuildMainQueryMapping(dqs *DisplayQueryStruct, pdp *ProductDisplayparams)
Clone: func BuildMainQueryMappingv7(dqs *DisplayQueryStruct7, pdp *ProductDisplayparams7)

if enableES7 || es7ABExperiment {
    return BuildMainQueryMappingv7()
} else {
    return BuildMainQueryMapping()
}

Yes, we kept a backup with us to go back to ES 5 anytime using just a toggle switch.

  • We announced on the Slack channel, and reminded SEs again and again, that changes must be made in both functions so that they remain in sync during the migration process.

Coding part done till here.. woaaahh!!!

T for Testing, Testing, and Testing….

  • Tested all API flows on ES 7 cluster and results were similar to ES 5.

Yes, yes, yes, we were gaining confidence as the results were meeting our expectations.

So another step close to heaven :)

Day (15–25): Going to Production Phase

Ladies and Gentlemen, please fasten your seatbelts. We are going to Production.

We rolled out in different phases.

With the POC setup ready, we requested a similar cluster with 15 D-Nodes and 2 C-Nodes on ES 7.7.1. We created all index mappings without type, kept similar settings for the number of shards and analyzers, enabled the dual-indexing consumers to start indexing in real time on ES 7, and ran the migrator.

  1. The migrator took 1 day to sync the documents between the ES 5 and ES 7 clusters.
  2. After the migration completed, we enabled the ElasticSearch AB experiment for 2% of traffic on the 1st day, then 4% on the 2nd day and 30% on the 3rd day.
  3. We enabled the on-demand debugger to compare the responses of ES 5 and ES 7.
  4. We continuously monitored the important business metrics on Datadog, and special Datadog alerts were set up to inform us of any mishaps.
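
A percentage ramp like the one in step 2 is often implemented by hashing a stable request key into 100 buckets. The sketch below shows that approach. Note that this post does not describe how our AB experiment actually split traffic, so the hashing scheme, the function name UseES7 and the sample keys are all assumptions; only the enableES7 flag mirrors the switcher shown earlier.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// UseES7 decides which cluster serves a request. enableES7 is the global
// switcher; otherwise the request key is hashed into one of 100 buckets
// and the lowest abPercent buckets go to ES 7.
func UseES7(requestKey string, enableES7 bool, abPercent uint32) bool {
	if enableES7 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(requestKey)) // Write on a hash.Hash never returns an error
	return h.Sum32()%100 < abPercent
}

func main() {
	for _, pct := range []uint32{2, 4, 30} {
		hits := 0
		for i := 0; i < 10000; i++ {
			if UseES7(fmt.Sprintf("user-%d", i), false, pct) {
				hits++
			}
		}
		fmt.Printf("ramp %d%%: %d of 10000 requests routed to ES 7\n", pct, hits)
	}
}
```

Because the bucket of a key never changes, a key routed to ES 7 at 2% stays on ES 7 at 4% and 30%, which keeps results consistent as the ramp widens.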

Now my friend, it’s time to wait and watch the beast performing live on production for 1 week.

Time for some initial results from AB Experiment:

  1. Surprisingly, 80% of the ES 7 responses were the same as the ES 5 responses. This gave us the confidence that we were going in the right direction.
  2. The other 20% differed due to the dynamic nature of ES during continuous indexing, updates, and deletions.
  3. There was no drop in revenue and ES query time also improved.

Please wait a little more for the final results ;)

Day (25–30): Enable for All traffic

  1. We already had a switcher to route 100% traffic to ES 7 for particular pages.
  2. After watching some mouth-watering results from ES 7 AB experiment, we routed all the traffic to ES 7 with the switcher.
  3. Of course, continuous monitoring of Business Metrics and ES 7 metrics was there for us.

Can you guess what happened next????

ElasticSearch 7.7.1 passed with FLYING COLOURS under the huge traffic of TopAds, and we became the first team in the entire company to upgrade Elastic Search to the latest version, i.e., 7.7.1.

Zero downtime!!! Yes, you heard me right!!!

Post-Upgrade Cleanup Activity

  1. Stop Indexing on ES 5 to free the old ES 5 servers.
  2. Clean up old ES 5 Golang app code.

Here are the amazing results that we have been waiting for!!!

We achieved an improvement of more than 2.5x in ES node CPU usage for data nodes (55% to 20%)

ES query latency dropped by roughly 2x (15ms to 7ms)

After seeing such amazing initial results, we couldn’t control our emotions, so we went the last mile and completed the migration for all ad types.

Difficulties we encountered:

With a migration of this scale, we were expecting some issues due to so much code replication and so many breaking changes. And yes, they came :(

Client Library issues:

We were using olivere as the Go client library for ES. The v7-compatible version of olivere uses Go modules for dependency management, while we were still using dep, which is deprecated.

So we also needed to upgrade our Golang version from 1.10.4, because Go modules were only introduced in Golang 1.11.

We upgraded Golang from 1.10.4 to 1.14.4. But for that story, look out for our next blog!!!

Haha.. Double benefits!!! :P

Yeahh.. I know!!!

Code related issues:

  1. Maintaining dual-code changes, one in the replicated new functions and the other one in the old functions.
  2. SEs forgetting to make changes in both the original and the replicated functions. So we socialized this and reminded everyone again and again to prevent it.
  3. Clean up the replicated code.

This whole process took us around 1 month, mainly because of the code replication we needed to do for ES 7 due to the olivere library incompatibility.

Conclusion

In the end, the result exceeded our expectations. We were able to reduce costs (nearly 40% due to the smaller cluster) while improving performance and stability so it was definitely a big win for everyone.

Please shower lots of 👏 👏 if you liked our Journey!!!

References

  1. https://www.elastic.co/guide/en/elasticsearch/reference/7.0/breaking-changes-7.0.html
  2. https://github.com/golang/go/issues?q=milestone%3AGo1.14.4+label%3ACherryPickApproved
