How We Serverlessly Migrated 1.58 Billion Elasticsearch Documents

Nicholas Ionata
Published in Stream Monkey
5 min read · Oct 1, 2020
Stream Monkey platform dashboard

The Stream Monkey platform features a real-time stats dashboard for rich audience analysis. Behind the scenes, the data is stored in AWS Elasticsearch Service, a managed, distributed, open-source search and analytics engine. Our original production cluster ran v2.3. With the latest version of the service now at v7.4, it was time for an upgrade. With ~2,100 GB of data, representing 1.58 billion documents across 195 indices, this was easier said than done.

Upgrade options

A rolling upgrade is a great option when moving to a minor version or the next major version. Unfortunately, the leap from v2.3 to v7.4 includes fundamental changes to mapping types and the underlying Lucene index. If a simple upgrade is off the table, a cluster migration must be performed. There are several different approaches:

  1. Elasticsearch reindex API (v5.x+)
  2. Restore from a snapshot
  3. Manually reindex

The first two options would not work for our use case due to the reindex API's version restriction and snapshot version incompatibility, respectively. Manually reindexing our v2.3 cluster into a fresh v7.4 cluster would give us the most flexibility, but it meant we needed to build a tool.

Migration tool

Concept

Concurrent index migration

A cluster is an instance of Elasticsearch. A cluster can run on multiple nodes and contains many indices. Each index is a logical grouping of data. As such, indices can be migrated concurrently with independent workers. Since indices are generally large (tens of millions of documents), each worker's migration is broken down into jobs. The first job for an index creates a search context over all the documents in the source cluster, retrieves a batch of docs, transforms them, and bulk indexes them into the destination cluster. Subsequent jobs perform the same function by retrieving the next batch of docs from the search context. A concurrency setting determines the number of indices being migrated at once.
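For illustration, a minimal sketch of one worker's loop, using the official Python client with placeholder endpoints, a placeholder index name, and a hypothetical transform (and glossing over the client/cluster version mismatch between v2.3 and v7.4), might look like this:

    from elasticsearch import Elasticsearch, helpers

    # Placeholder endpoints and index name for illustration only.
    source = Elasticsearch("https://old-cluster.example.com")
    dest = Elasticsearch("https://new-cluster.example.com")
    INDEX = "logs-2020-09"

    def transform(hit):
        # Hypothetical transform: map a v2.x document onto the v7.x mapping.
        return hit["_source"]

    # First job: open a search context (scroll) over the whole source index.
    resp = source.search(index=INDEX, scroll="5m", size=1000,
                         body={"query": {"match_all": {}}})
    scroll_id, hits = resp["_scroll_id"], resp["hits"]["hits"]

    while hits:
        # Transform the batch and bulk index it into the destination cluster.
        actions = ({"_index": INDEX, "_id": h["_id"], "_source": transform(h)}
                   for h in hits)
        helpers.bulk(dest, actions)
        # Subsequent jobs: pull the next batch from the same search context.
        resp = source.scroll(scroll_id=scroll_id, scroll="5m")
        scroll_id, hits = resp["_scroll_id"], resp["hits"]["hits"]

    source.clear_scroll(scroll_id=scroll_id)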

Implementation

Migration tool implementation

Our implementation builds on the aforementioned concurrent index migration concept. Two Lambda functions handle requests to create and start a migration. They interface with two DynamoDB tables (migrations and indices) to store migration state. New entries to the indices table trigger a third function, migrate index batch. Each batch’s state is saved in a third table (jobs). After a batch completes, a recursive call will be made to process the next batch until all the documents in the index’s search context have been retrieved. The recursive calls avoid the Lambda timeout restriction, make it easy to retry failed batches, and help manage concurrency. In the event of a batch failure, all state data will be dumped to a JSON file and saved to S3 before it is retried. Deployment of the entire tool to AWS is managed by the Serverless Framework, providing a frictionless experience.
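As a rough sketch of the migrate index batch function (assuming a Python runtime, hypothetical table and bucket names, and a hypothetical migrate_batch helper that wraps the scroll-and-bulk loop above), the recursive pattern could look like this:

    import json
    import os

    import boto3

    lambda_client = boto3.client("lambda")
    jobs_table = boto3.resource("dynamodb").Table(os.environ["JOBS_TABLE"])
    s3 = boto3.client("s3")

    def handler(event, context):
        """Migrate one batch of an index, record its state, then recurse."""
        index = event["index"]
        job_id = event.get("job_id", 0)
        try:
            # Hypothetical helper wrapping the scroll/transform/bulk loop;
            # returns the next scroll id (or None when the index is done).
            next_scroll_id, doc_count = migrate_batch(index, event.get("scroll_id"))
            jobs_table.put_item(Item={"index": index, "job_id": job_id,
                                      "documents": doc_count, "status": "COMPLETE"})
        except Exception:
            # Dump the batch state to S3 so the failure can be inspected and retried.
            s3.put_object(Bucket=os.environ["STATE_BUCKET"],
                          Key=f"failures/{index}/{job_id}.json",
                          Body=json.dumps(event))
            raise

        if next_scroll_id:
            # Recursive async invocation for the next batch keeps each run
            # well under the Lambda timeout.
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",
                Payload=json.dumps({"index": index, "job_id": job_id + 1,
                                    "scroll_id": next_scroll_id}),
            )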

Carrying out the migration

CloudWatch dashboard

With a fresh v7.4 cluster running, we kicked off our migration of 191 indices with a concurrency value of 2.

“We need to turn up the speed!!!”

Chris

With single-digit CPU utilization on both the source and destination clusters, we increased the concurrency to eight and eventually fourteen. While CPU utilization peaked at 35.4%, JVM memory pressure was consistently hitting 75%. This triggered garbage collection on the cluster and kept us from pushing the concurrency any further. Within five hours, over a billion documents had been migrated.

Garbage collection kicks in at 75%

Progress stalled for a while due to failed indices: their jobs were timing out during garbage collection. To address this, we increased the retry time and took some additional measures to get the troubled indices through. After kicking off a few more migrations, we had successfully migrated all but our most recent index (each index represents a month of logs). Once we pointed our ingest at the new v7.4 cluster, we were able to run one last migration to officially hit the 1.58 billion document mark.

Internal migration dashboard

Lessons learned

Managing concurrency is hard. Kicking off and maintaining n concurrent index migrations worked for the most part. However, race conditions occasionally became an issue when multiple indices finished around the same time, leaving the effective concurrency above the desired n. To create a more fault-tolerant solution, the logic for maintaining concurrency should be moved from each worker into a central broker.
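One way to make that slot claim atomic (not how our tool works today, just a sketch of the broker idea) is a DynamoDB conditional update on a hypothetical active_workers counter:

    import boto3
    from botocore.exceptions import ClientError

    # Hypothetical table and attribute names for illustration only.
    migrations_table = boto3.resource("dynamodb").Table("migrations")

    def try_claim_slot(migration_id, max_concurrency):
        """Atomically claim a worker slot; fail if the cap is already reached."""
        try:
            migrations_table.update_item(
                Key={"migration_id": migration_id},
                UpdateExpression="ADD active_workers :one",
                ConditionExpression="attribute_not_exists(active_workers) "
                                    "OR active_workers < :max",
                ExpressionAttributeValues={":one": 1, ":max": max_concurrency},
            )
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # cap reached; wait for another index to finish
            raise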

Elasticsearch is complex. There are a lot of moving pieces and a ton of important configuration settings. Some have short-term effects, while others can have lasting consequences (see the sketch after this list):

  • Refresh interval: We disabled refresh while migrating an index by setting it to -1. Once the index was migrated, the setting was set to 1s to support normal ingest and search. We saw a decrease in CPU utilization and an increase in indexing speed with this strategy.
  • Number of replicas: We set this value to 0 while migrating an index and 1 afterwards. Similar to the last strategy, we saw a huge decrease in CPU utilization and increased indexing speed.
  • Batch size: We tested different batch sizes to find an optimal size that did not exceed the maximum content length. Since we have two types of log indices, short term and long term, each with its own document density, we had to test and use drastically different sizes.
  • Number of shards: When we migrated to the new cluster, we overlooked the default 5 shards per index. For the size of our indices, we were oversharding, leading to needless overhead during search. By setting the number of shards per index to 1, we achieved better search performance.
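To give a flavor of how the refresh interval, replica, and shard settings fit together, here is a hedged sketch (placeholder endpoint and index name, Python client) of creating a destination index for bulk ingest and restoring normal settings afterwards:

    from elasticsearch import Elasticsearch

    # Placeholder endpoint and index name for illustration only.
    dest = Elasticsearch("https://new-cluster.example.com")
    INDEX = "logs-2020-09"

    # Create the destination index with a single primary shard and
    # bulk-ingest-friendly settings before migrating into it.
    dest.indices.create(index=INDEX, body={"settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,    # add the replica only after the migration
        "refresh_interval": "-1",   # disable refresh while bulk indexing
    }})

    # ... migrate the index ...

    # Restore normal ingest/search settings once the index is migrated.
    dest.indices.put_settings(index=INDEX, body={"index": {
        "number_of_replicas": 1,
        "refresh_interval": "1s",
    }})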

Final thoughts

Building a migration tool with a serverless architecture removes the unnecessary steps of allocating and managing infrastructure. This setup allowed us to move quickly and even deploy changes during an active migration without a hiccup. Additionally, the tool can be left dormant until it is needed again, making it extremely cost effective.
