The Master Guide To Moving Data Between Elasticsearch and OpenSearch

5 min read · Oct 16, 2022

Elasticsearch is one of the most commonly used full-text indexed distributed databases. Some companies use it to store the logs and metrics from their infrastructure, while others use it to provide search and analytics functionality.

Since the feud between Amazon AWS and Elastic, Amazon has launched its fork of Elasticsearch and named it OpenSearch. In response, Elastic updated its client libraries to show an error message when they detect an OpenSearch distribution. Elastic also recently announced a complete revamp of its architecture to make it stateless. I think this is a good thing, but it will also create more divergence between OpenSearch and Elasticsearch. In this fight between AWS and Elastic, the only victims are the users.

Release notes for Elasticsearch Python client v7.14.0.

I used to self-manage my Elasticsearch cluster (provisioned with Terraform) when I first started my startup, since it was cost-effective. Over time, as the data and requirements grew, I decided to move to the AWS-managed Elasticsearch service called OpenSearch. I had hoped moving the data would be easy, but I have never been so wrong. In my quest to move the data by hook or by crook, I ended up discovering all the possible ways to move data from Elasticsearch to OpenSearch, which I have listed below.

Moving Data Between Elasticsearch/OpenSearch servers.

Solution 1

The most obvious way to move your cluster is the snapshot backup and restore feature provided by Elasticsearch (here is a full guide), but this might not be an option if you are moving to a cluster with a different Elasticsearch version. The latest version of OpenSearch supports Elasticsearch version 7.10.0.

This option won’t work for you if your current cluster version is 7.12 or later.
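When your versions are compatible, the snapshot route boils down to registering the same repository on both clusters, taking a snapshot on the source, and restoring on the destination. A minimal sketch of the two request bodies, assuming an S3 repository; the bucket, region, and index pattern below are placeholders, not values from this post:

```python
import json

def repo_registration_body(bucket: str, region: str) -> dict:
    """Body for PUT /_snapshot/<repo>: registers an S3 snapshot repository.
    (Self-managed clusters need the repository-s3 plugin installed.)"""
    return {"type": "s3", "settings": {"bucket": bucket, "region": region}}

def restore_body(indices: str) -> dict:
    """Body for POST /_snapshot/<repo>/<snapshot>/_restore on the destination.
    Skipping global cluster state avoids importing source-cluster settings."""
    return {"indices": indices, "include_global_state": False}

# Placeholder bucket/region/index pattern -- substitute your own:
print(json.dumps(repo_registration_body("my-es-backups", "us-east-1")))
print(json.dumps(restore_body("logs-*")))
```

You would send these with curl or any HTTP client; the repository has to be registered as read-only on the destination so both clusters can see the same snapshots.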

If you cannot move the data using the snapshot feature, the next option for you to try is the reindex API provided by Elasticsearch.

Solution 2

reindex API to move data from source index to destination index.

The reindex API lets a user define source and destination indices (and hosts) and takes care of moving the data itself. While the reindex API can transfer data between incompatible versions of Elasticsearch, there are still reasons why it might fail. For example, if you store a UNIX timestamp in milliseconds in a date field, Elasticsearch returns the value in scientific notation: 1.45464534E12. Reindexing will then fail with a `date_time_parse_exception`, because Elasticsearch no longer accepts scientific notation for date fields since version 6.8. You could potentially use an ingest pipeline to parse the date field while it is being reindexed, but unfortunately there is no date-time parser that can convert scientific notation to epoch milliseconds.

Sometimes the reindex API also fails if your source server is low on resources.
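A remote reindex is a single `_reindex` request run on the destination cluster, which pulls documents from the source. A sketch of the request body; the host, credentials, and index names are placeholders, and the source host must also be listed in `reindex.remote.whitelist` on the destination cluster:

```python
import json

def remote_reindex_body(source_host: str, user: str, password: str,
                        src_index: str, dest_index: str) -> dict:
    """Body for POST /_reindex on the *destination* cluster, pulling
    documents from a remote source cluster."""
    return {
        "source": {
            "remote": {"host": source_host,
                       "username": user,
                       "password": password},
            "index": src_index,
        },
        "dest": {"index": dest_index},
    }

# Placeholder host, credentials, and index names:
body = remote_reindex_body("https://old-cluster:9200", "admin", "secret",
                           "logs-2022", "logs-2022")
print(json.dumps(body, indent=2))
```

Create the destination index with the right mappings first; otherwise the destination will infer mappings from the incoming documents.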

Solution 3

If the reindex API has failed you, the next thing to try is the scroll API provided by Elasticsearch, which lets you move the data manually. This solution might not seem feasible if you are moving a large amount of data, but it really is the last resort. The scroll feature paginates the results and gives you one page at a time; you can iterate over all the pages and push the data to the destination cluster. This option should work for most users, but there are times when scroll fails too. When the source cluster has limited or low resources, shards can fail. When that happens, the cluster cannot retain the search context it needs to serve the scroll requests, and you won't be able to fetch all the documents from the source Elasticsearch.

Elasticsearch creates pages of results and scroll_id is used to specify the page number to fetch.

Sometimes the reindex API fails for this very reason too: it uses the scroll functionality behind the scenes and fails when the source server's shards are not stable and cannot retain the search context.
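When the cluster is healthy enough for scroll, the loop is short. A sketch against the elasticsearch-py 7.x low-level client interface (`search`/`scroll` methods); the index name and page size are placeholders, and the stub at the end only exists to exercise the paging logic without a live cluster:

```python
def scroll_all(client, index, page_size=1000, keep_alive="2m"):
    """Yield every document from `index` by iterating over scroll pages."""
    resp = client.search(index=index, scroll=keep_alive,
                         body={"size": page_size,
                               "query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        yield from hits
        # Each scroll call returns the next page and refreshes the context.
        resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]

# Tiny stub standing in for a live cluster, just to demonstrate the loop:
class _Stub:
    def __init__(self, pages): self._pages = list(pages)
    def search(self, **_): return self._next()
    def scroll(self, **_): return self._next()
    def _next(self):
        hits = self._pages.pop(0) if self._pages else []
        return {"_scroll_id": "demo", "hits": {"hits": hits}}

docs = list(scroll_all(_Stub([[{"_id": 1}, {"_id": 2}], [{"_id": 3}]]), "logs"))
print(len(docs))  # 3 documents fetched across two scroll pages
```

In real use, `client` would be an `Elasticsearch(...)` instance pointed at the source cluster, and each yielded batch would be bulk-indexed into the destination.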

Last Resort

If you are at this step, you have been betrayed by all three options (snapshot, reindex, and scroll), and it may seem impossible to extract the data from your Elasticsearch cluster. At this point, all we want is a way to get the data out in chunks and push it to the destination cluster. So instead of scroll, we can use a basic query that sorts the data and passes the `search_after` argument on the sorted field.

Say you have a DateTime field called `created_utc`. You can request the first 10,000 documents by passing 0 in the `search_after` field. Once you get that first batch, you can request the next 10,000 by running the same query, this time passing the value of `created_utc` from the last record of the previous batch into `search_after`. This tells Elasticsearch to return the next 10,000 records, so you can keep fetching chunks of 10,000 documents at a time and pushing them to the destination cluster. This option works where scroll fails, because scroll needs a live search context to remember what data has already been returned, whereas here we simply sort the data on a field and ask for everything after the last value from the previous batch.

search_after attribute in Elasticsearch query to get data after a particular record.
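The paging described above can be sketched as a generator. This assumes the elasticsearch-py 7.x `search` interface; `created_utc` is the example field from the text, the stub merely simulates two result pages, and note that `search_after` takes the `sort` values array from the last hit:

```python
def search_after_pages(client, index, sort_field="created_utc",
                       page_size=10000):
    """Yield batches of documents sorted on `sort_field`, paging with
    search_after using the sort values of the previous batch's last hit."""
    last_sort = None
    while True:
        body = {"size": page_size,
                "sort": [{sort_field: "asc"}],
                "query": {"match_all": {}}}
        if last_sort is not None:
            body["search_after"] = last_sort
        hits = client.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield hits
        last_sort = hits[-1]["sort"]

# Stub simulating two pages of results, just to demonstrate the paging:
class _Stub:
    def __init__(self, pages): self._pages = list(pages)
    def search(self, **_):
        hits = self._pages.pop(0) if self._pages else []
        return {"hits": {"hits": hits}}

batches = list(search_after_pages(_Stub([
    [{"_id": 1, "sort": [100]}, {"_id": 2, "sort": [200]}],
    [{"_id": 3, "sort": [300]}],
]), "logs", page_size=2))
print(len(batches))  # 2 batches
```

Unlike scroll, each request here is an ordinary stateless search, which is why it keeps working when the cluster cannot hold a search context open.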

Conclusion…

These are all the possible ways to move data between different setups/versions of Elasticsearch. If none of these methods work for you then you should give up, set your old Elasticsearch server on fire, and start afresh on your new Elasticsearch/OpenSearch cluster.
