Data store migrations with no downtime

Neetish Jethra
Jun 14

Data store migrations can be complex, especially when the application cannot afford maintenance downtime. In this post we will explore some strategies that can help you navigate that complexity.

In our case, it was a migration from a NoSQL data store (DynamoDB) to a relational data store (MySQL). However, the approaches covered here are generic enough to apply regardless of the source and target data stores.

We set out to migrate our data from one data store to another while keeping the application running seamlessly for our users. I will start off by introducing some of the common migration approaches, and then cover how we went about it.

Migration Approaches

Offline migration

An offline migration stops the application, copies all data from the source data store to the target, and then brings the application back up against the target. The biggest drawback of this approach is the downtime involved, which could be anywhere from minutes to days depending on the size of your data. For an application with active users, the Service Level Agreements (SLAs) will generally determine whether this is even a feasible option.
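
To make this concrete, here is a minimal sketch of an offline batch copy in Python. The table name, columns, connection details, and the boto3/pymysql clients are illustrative assumptions, not our actual implementation.

    # Minimal offline copy: scan every record in the source (DynamoDB) and
    # insert it into the target (MySQL). Runs while the application is down.
    import boto3
    import pymysql

    source_table = boto3.resource("dynamodb").Table("users")  # hypothetical table
    conn = pymysql.connect(host="db-host", user="app", password="secret", database="app")

    def copy_all_records():
        scan_kwargs = {}
        with conn.cursor() as cursor:
            while True:
                page = source_table.scan(**scan_kwargs)  # DynamoDB scans are paginated
                for item in page["Items"]:
                    cursor.execute(
                        "INSERT INTO users (id, email, name) VALUES (%s, %s, %s)",
                        (item["id"], item["email"], item["name"]),
                    )
                conn.commit()
                if "LastEvaluatedKey" not in page:  # no more pages left to scan
                    break
                scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]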

Online migration

An online migration keeps the application running and moves records lazily: on each read, the application checks the target data store first and, if the record is missing, fetches it from the source data store and writes it across. This adds more complexity, with the two data stores “live” at the same time, e.g. differing behaviour of the data stores with regards to query performance, transactions etc. It also means incurring a performance penalty on the first access, as there may be multiple requests to the data stores before the data is retrieved. The flowchart below shows the different paths that an online migration may take.
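
Below is a minimal sketch of that read path, continuing the hypothetical users example from the previous snippet: check MySQL first, and on a miss fetch the record from DynamoDB and copy it across.

    # Online (lazy) migration on the read path, reusing the hypothetical
    # source_table and conn from the earlier sketch.
    def get_user(user_id):
        with conn.cursor() as cursor:
            cursor.execute("SELECT id, email, name FROM users WHERE id = %s", (user_id,))
            row = cursor.fetchone()
        if row is not None:
            return row  # hit: already in the target data store

        item = source_table.get_item(Key={"id": user_id}).get("Item")
        if item is None:
            return None  # record not found in either data store

        # miss: copy the record across so the next read is served from MySQL
        with conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO users (id, email, name) VALUES (%s, %s, %s)",
                (item["id"], item["email"], item["name"]),
            )
        conn.commit()
        return (item["id"], item["email"], item["name"])

The two round trips on a miss are exactly where the first-access performance penalty mentioned above comes from.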

Backfill

This is where running the offline migration in conjunction with the online one can accelerate things. It is important, though, that the offline migration is idempotent (think upserts instead of inserts) if it runs alongside the online migration. We refer to this as a backfill, which can be executed as a batch job.
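
As a sketch of what idempotent means here, the batch job could use MySQL's INSERT ... ON DUPLICATE KEY UPDATE so that a record already migrated online is simply overwritten rather than duplicated or rejected. This again uses the hypothetical schema and connection from the earlier snippets.

    # Idempotent upsert used by the backfill batch job: safe to run while the
    # online migration is also writing to the target, and safe to re-run.
    def upsert_user(item):
        with conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO users (id, email, name) VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE email = VALUES(email), name = VALUES(name)",
                (item["id"], item["email"], item["name"]),
            )
        conn.commit()

The batch job itself would loop over the same paginated scan as the offline copy, calling upsert_user for each record.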

Monitoring and execution

To track progress, we emitted statsd metrics for the different migration paths (a minimal sketch of the instrumentation follows the list):

  • hit: emitted when a record is found in the target data store.
  • miss: emitted when a record is migrated online, i.e. copied across on first access.
  • backfill: emitted when a record is migrated by the offline batch job.
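
Here is roughly what that instrumentation might look like with the generic statsd Python client; the host, port, prefix, and metric names are illustrative assumptions.

    # Counters for the three migration paths, graphed in Datadog.
    import statsd

    metrics = statsd.StatsClient("localhost", 8125, prefix="user_migration")

    # Called from get_user(): a hit means the record was already in MySQL,
    # a miss means it had to be copied across from DynamoDB.
    def record_hit():
        metrics.incr("hit")

    def record_miss():
        metrics.incr("miss")

    # Called from the backfill batch job, once per upserted record.
    def record_backfill():
        metrics.incr("backfill")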

We found these very useful in giving us confidence that things were going to plan. Here is what our progress tracking looked like: it started off with a lot of misses, which quickly tapered off as records were migrated across, with a long tail that remained until the backfill job was executed.

Visualising the statsd metrics in Datadog

Once the backfill job was run, all data had been moved across to the target data store. At this stage, the application relied solely on the target data store and stopped accessing the source. We still left the source data store around for a month (without it being accessed by the application) to verify data integrity. Any issues with missing data in the target data store were expected to be caught by our regular alerting mechanisms, i.e. the “Record not found” scenario in the flowchart.

Some other considerations

Gradual migration

Data Structure changes

Shared data stores

Conclusion

The key is to have visibility into the progress of the migration, and to do some careful planning before jumping into execution.

“Measure twice, cut once”!

Thanks to my teammates Toby Corkindale, Amy Guo, Derrick Cheng, Marcelo Garcia, Patric Cui, and Simon Fenton for making this happen, and to Marius Gherman and Steven McPhillips for reviewing and helping refine this article.
