Tales from an Elasticsearch upgrade

Idan Koch
9 min read · May 3, 2022


I’m a team lead in the Server Infra Core group at Wix, and our team maintains several services in production. Some are legacy, some are new. One of our legacy services performs searches over an Elasticsearch cluster. Recently we tackled upgrading Elasticsearch from v6 to v7 (or newer) due to the end of life (EOL) of v6. If you have worked in the industry for a few years, you probably know what it is like to maintain a legacy service, adding features to it but never really deep diving into it. To start the upgrade process there was a lot of “catchup” on our Elasticsearch knowledge; my team and I needed to close the gap while still delivering to production. Along the way we took an online course, read a lot of documentation, and consulted with experts across Wix and outside of it. Even though our work is not complete yet, here are some of the lessons learned from this process.

A retrospective look

System parameters:

  • We have a synchronous blocking server that calls Elasticsearch v6
  • Max read requests per minute (RPM): 40K
  • Max write RPM: 60K
  • Each document is an aggregation of multiple data sources
  • We have more than 500M documents spread across 7 nodes with 20 shards and one replica
  • Schema is dynamic
  • Index size including replica: ~780GB per node (~5.6TB cluster total)
  • Our queries span across tenants, meaning we hit all the shards in every query

What have we learned?

  1. Raw data vs. preprocessed data — our old index was composed of data collected from several data sources and aggregated into one big document using dynamic mapping. Dynamic mapping made it easy to add new fields fast. Additionally, the raw data was saved as is, which made it easier to expose new fields based on that data to clients without reindexing for every new requirement. The downside was that all the raw data was saved and documents got to be pretty big, which resulted in timeouts when deserializing large pages of large documents. To mitigate the timeouts, we created a smaller index with a strict schema and indexed only what we expose to our clients. This resulted in smaller documents and faster deserialization. Additionally, we saw a huge decrease in our index size (from 2.8TB without replica to 1TB), which cut costs a bit. The downside of a strict schema is that you need to migrate all documents for every newly added field, but you keep a clear view of what you need (see the mapping sketch after this list).
  2. Page size limits — in general, be mindful when creating a listing or bulk API and keep the page size limit as low as possible. Otherwise you will find yourself with issues that impact performance, a large memory footprint, or outright failures. Lowering page sizes later entails communicating the change to other teams, having those teams prioritize it, and keeping tabs to make sure it actually gets done. So it’s best to start with a low page size limit and increase it over time.
  3. Verifying changes — one of the drawbacks of changing your schema is that you need to make sure the results remain the same. You want to test actual production queries without taking up too many resources from your system, and definitely without SLA degradation. We verified our changes by running a separate thread pool that sampled queries in the background, ran the same queries against the newer version of Elastic, and compared the results. Going over the logs of these comparisons allowed us to find a bug that would otherwise have taken us some time to discover (see the comparison sketch after this list).
  4. Number of shards — we got a recommendation from our DBAs that our shards were too big and that a shard should not exceed 50GB (which correlates with Elastic’s own recommendation). We decided to increase the number of shards from 20 to 170 (calculated so that each shard stays under 50GB with room to grow), under the assumption there wouldn’t be any performance impact. However, our queries (even the basic ones) hit all shards, since we query by several fields that collect documents from several shards. Routing would not help for the same reason: we query across several tenants. As a result, we saw an increase in response time. In hindsight this makes sense, since you need to aggregate from more shards, which results in more IO. With the decrease in index size (see the previous item) we landed on 30 shards spread across 7 nodes (see the sizing sketch after this list). However, this experience led to the next lesson learned, and the most important one.
  5. Number of changes — when you start a project, especially one with many unknown variables, it’s best to prioritize changes. List everything on your wishlist and decide what is most important to you. Sounds trivial, right? In our upgrade process, every change in cluster configuration required migrating 500M documents. The data is not static and needs to be retrieved from several data sources, so each migration iteration can take a lot of time (more on that in the next item). It is best to allow yourself room to fail with minimal parameter changes. In our case, changing the document to preprocessed data instead of raw data was a must (the raw data led to failures in production), but changing the cluster config (number of shards) should have been postponed to a later iteration, after going live with the upgraded cluster. This set us back a migration cycle or two.
  6. Indexing — my team owns many services with large amounts of data and high RPM (some services’ RPM is in the millions), so migrating large amounts of data is our bread and butter. Even so, there were a few details to learn from this migration. As I stated before, we hold 500M documents collected from several data sources. We have an internal tool that lets you trigger a migration based on a key (in our case, the document ID). The migration tool calls an RPC/gRPC endpoint and returns the result (failed/success/skip), and our endpoint in turn calls several services at Wix to collect up-to-date data for the index. The strength of the migration tool is the visualization of the process: you can see how long it takes and cap the RPM if needed. When we started the migration we saw that, at that rate, it would have taken weeks if not months. The bottleneck was the RPM at which we could call external services and the response time of each endpoint. We halted the process and did some homework. Using table views of other teams’ data, we saw that calling every endpoint for every document was redundant. Two services had extremely high response times; fortunately for us, the data those services provided was relevant to less than 10 percent of the documents. We broke the data collection down into several indexing passes, which allowed us to “hit” most of the data at a relatively higher speed, and the data from the slow services was later added to the smaller portion of documents that needed it.
  7. Using nested fields — our document contains nested objects whose fields need to be queried with AND and OR conditions. However, Elastic indexes inner objects by flattening their fields, so search queries with AND conditions can end up with the wrong result. The perfect example is in the Elastic documentation, and on that same documentation page is the solution (or so we thought): marking those fields as nested should provide the needed functionality. However, when we reached production we got an insane amount of IOPS and a degradation in performance. Why? Each item in a nested field is indexed as a separate document. For write-intensive usage of Elastic (like us with 40K RPM writes), this results in loads of merges. If you are asking yourself what a merge is, I think it is best explained in the Elastic documentation or in Sravanthi Naraharisetti’s great summary post. The general concept is that once you update a document, the old version is marked for deletion, and the actual deletions happen when segment merges occur. In our index we have a field that is a sequence of complex objects, and that sequence can contain over a hundred items. For ease of calculation: if you have 1 document with a nested field of length 100, you actually have 101 documents. This results in crazy IOPS (Elastic is busy deleting documents) and a degradation in search read performance. We delayed our solution for now (see item 5 on this list, number of changes), but we have discussed “fusing” all the relevant fields into one searchable field in the future to support AND and OR queries (see the nested-mapping sketch after this list).
  8. Cache — in our flow, the expected user behavior is to query their data infrequently (no more than a few calls in the span of a few minutes), and the number of documents each user has is capped at a reasonable size. From time to time we saw an increasing amount of IOPS alongside long queries. Looking at our slow query log we saw some users with abnormal traffic spikes, and going through the logs we realized we were being “attacked” by our own automation users, who also had an abnormal number of documents. Adding cache configuration relieved the performance issue considerably (see the cache sketch after this list). However, as I said in my opening comments, the service calling the Elasticsearch cluster is a legacy service implemented in a blocking fashion, which means a slow response from external calls exhausts all threads in the service pretty quickly. Even with autoscaling defined, automation traffic spiked faster than the autoscaler could detect the RPM spike. Those spikes caused our health checks to not run (there were no threads left to run them), essentially killing our service in production. We were able to get to the root of the problem with the relevant automation developers. Still, to protect ourselves in the future, we need to rewrite the service to be asynchronous. Additionally, our group is developing a general rate limiter that will prevent abuse of the system.
  9. Cold starts — every time we opened the feature toggle to use the new version of Elastic, we saw “insane” response times for a few minutes before they died down, and the issue returned on every data center traffic switch. This pointed to the fact that we need a warmup process for Elastic. We implemented it so that every pod that comes up first queries Elastic 5 times with a general query (see the warmup sketch after this list). The inactive DC has a minimal number of pods to save cost, but when traffic switches, autoscaling kicks in, which warms up Elastic pretty quickly before actual traffic reaches the DC.
  10. Work locally — installing Elasticsearch on your computer and using Kibana is a great way to cut down the development lifecycle. You can test schema changes, queries, and commands easily and without the stress of changing production on the fly.
  11. Course — Udemy has a great course … highly recommended as a starting point
  12. Ask a lot of questions — I feel lucky to be working at a place like Wix. Our structure of companies and guilds allows our group to move as fast as a startup while enjoying the resources of a big company. One of the major perks is that our backend guild has an incredible number of seasoned developers with tons of experience, so more often than not you will find an answer to any technical question in one of our Slack channels.
  13. Working as a team — when more than one person works on such a project, you need to divide tasks properly and find the tasks that “intersect” or have dependencies between them. In our team, I started by developing a framework; by the time other team members joined the effort, the tasks were divided into writing tasks and reading tasks. When read and write tasks “collided”, team members simply sat together and pair programmed. But the crucial thing is communication. I’m not a big fan of rigid management with strict processes, but a daily sync (30 minutes each morning) helped raise a lot of issues fast and allowed us to resolve them quickly. Not to mention that during COVID-19 it’s just nice to see your teammates for a morning coffee, even if it’s over Zoom.
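
A few of the lessons above are easier to see in code. For lesson 1, here is a minimal sketch of creating a strict-schema index with the Python Elasticsearch client; the index name and fields are hypothetical, and our real mapping has many more fields.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "dynamic": "strict" rejects documents that carry fields not declared below,
# so the index only ever contains what we actually expose to clients.
es.indices.create(
    index="search-v7",  # hypothetical index name
    body={
        "settings": {"number_of_shards": 30, "number_of_replicas": 1},
        "mappings": {
            "dynamic": "strict",
            "properties": {
                "tenantId": {"type": "keyword"},
                "title": {"type": "text"},
                "status": {"type": "keyword"},
                "updatedDate": {"type": "date"},
            },
        },
    },
)
```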
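
For lesson 3, this is roughly the shape of our verification step: a small background thread pool that replays a sample of production queries against both clusters and logs any mismatch. The client setup, index names, and sampling rate here are assumptions, not our actual values.

```python
import logging
import random
from concurrent.futures import ThreadPoolExecutor

from elasticsearch import Elasticsearch

log = logging.getLogger("query-compare")

es_v6 = Elasticsearch("http://es-v6:9200")  # current production cluster (assumed address)
es_v7 = Elasticsearch("http://es-v7:9200")  # candidate cluster (assumed address)

# A small pool so the comparison never competes with serving threads.
compare_pool = ThreadPoolExecutor(max_workers=2)

SAMPLE_RATE = 0.01  # compare roughly 1% of production queries


def compare_results(index: str, query: dict) -> None:
    """Run the same query on both clusters and log any difference in hit IDs."""
    old_ids = [h["_id"] for h in es_v6.search(index=index, body=query)["hits"]["hits"]]
    new_ids = [h["_id"] for h in es_v7.search(index=index, body=query)["hits"]["hits"]]
    if old_ids != new_ids:
        log.warning("mismatch for query=%s old=%s new=%s", query, old_ids, new_ids)


def maybe_compare(index: str, query: dict) -> None:
    """Called from the serving path; fires a sampled comparison and never blocks on it."""
    if random.random() < SAMPLE_RATE:
        compare_pool.submit(compare_results, index, query)
```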
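
The shard math from lesson 4 is simple enough to spell out. The sizes are the ones quoted in this post; the headroom we left is a judgment call rather than a formula.

```python
# Primary index size after moving to the strict schema (lesson 1).
primary_size_gb = 1000        # ~1TB of primary data
target_shard_size_gb = 50     # the commonly cited upper bound per shard

minimum_shards = primary_size_gb / target_shard_size_gb  # 20 shards at the 50GB ceiling
chosen_shards = 30                                        # what we landed on

print(primary_size_gb / chosen_shards)  # ~33GB per shard today, leaving room to grow
```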
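
Lesson 7 in code: with plain object fields Elasticsearch flattens the inner objects, so an AND across two inner fields can match values that came from different items. Declaring the field as nested fixes correctness, at the cost of every item becoming its own hidden document. The index and field names below are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "type": "nested" keeps each item's fields together, so an AND condition only
# matches within a single item. The trade-off: a document with 100 items is
# stored as 101 Lucene documents, which is what drove our merge-heavy IOPS.
es.indices.create(
    index="orders",  # hypothetical
    body={
        "mappings": {
            "properties": {
                "lineItems": {
                    "type": "nested",
                    "properties": {
                        "sku": {"type": "keyword"},
                        "status": {"type": "keyword"},
                    },
                }
            }
        }
    },
)

# A nested query with an AND (bool/must) that is evaluated per item,
# not against the flattened union of all items.
es.search(
    index="orders",
    body={
        "query": {
            "nested": {
                "path": "lineItems",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"lineItems.sku": "ABC-1"}},
                            {"term": {"lineItems.status": "shipped"}},
                        ]
                    }
                },
            }
        }
    },
)
```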
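
On lesson 8, the post doesn’t spell out exactly which cache configuration we changed, so treat the following only as an illustration of one common knob: the shard request cache, enabled at the index level and toggled per request. Keep in mind that by default Elasticsearch only caches responses for requests that return no hits (size=0), such as aggregations and counts.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Enable the shard request cache on an existing index (assumed index name).
es.indices.put_settings(
    index="search-v7",
    body={"index.requests.cache.enable": True},
)

# The cache can also be toggled per request; this aggregation-only query
# (size=0) is the kind of request the cache helps with out of the box.
es.search(
    index="search-v7",
    body={"size": 0, "aggs": {"by_status": {"terms": {"field": "status"}}}},
    request_cache=True,
)
```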
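
Finally, the warmup from lesson 9 is just a handful of cheap queries fired before a pod starts serving traffic. A minimal sketch, assuming a hypothetical index name and a match_all query standing in for our “general query”:

```python
import logging

from elasticsearch import Elasticsearch

log = logging.getLogger("es-warmup")

WARMUP_ROUNDS = 5  # "query Elastic 5 times" from lesson 9


def warm_up(es: Elasticsearch, index: str) -> None:
    """Run a few general queries so caches and connections are hot before real traffic arrives."""
    for attempt in range(WARMUP_ROUNDS):
        try:
            es.search(index=index, body={"query": {"match_all": {}}, "size": 10})
        except Exception:  # a failed warmup query should never block startup
            log.warning("warmup query %d failed", attempt, exc_info=True)


if __name__ == "__main__":
    warm_up(Elasticsearch("http://localhost:9200"), "search-v7")
```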

Final thoughts:

As I mentioned in the beginning, I’m not an expert on Elasticsearch. There are some more changes we would like to make, mainly changing the schema so that hits are more localized to a single shard (or at least not all of the shards). Going through the upgrade process, and talking to other developers with more Elastic experience and even to paid professionals, it seems that the road to production with Elastic takes time and tuning cycles. There appears to be no rule of thumb; every index and its set of queries needs to be addressed differently.

Lastly, we are still faced with a conundrum of crazy IOPS spikes, even though traffic is the same as it was on our v6 cluster. We suspect some cluster configuration is missing. Once we figure it out, I will update this post (or write a new one). If you have ever encountered such behavior when upgrading to v7, feel free to hit me up…. I’m hiring :)
