Data Migration Guidelines

Yaniv Bronhaim
Yotpo Engineering

--

How to migrate your data to a new micro-service without surprises

It’s been almost three years since my team at Yotpo started the “Breaking the monolith” process to switch to a micro-services architecture. Breaking a monolith, in general, means re-writing or moving logic with a specific responsibility into a separate entity that we call a “service”.

The reasons to split into micro-services can be related to scaling costs, code maintenance, refactoring, upgrading infrastructure (e.g. the data store), and more.

Services can expose APIs, react to external data streams (e.g. Kafka), or run scheduled jobs.

Moving to a “micro-services architecture” requires performing sensitive operations to keep the integrity of the data we work with. In this post I will summarise our retrospective after migrating 158,435,340 data records in a week.

I would also like to suggest guidelines, with the hope of giving you some pointers to the things that were most helpful to us.

Simple Service Example

Let’s look at the domain of an e-commerce website that shows and sells millions of products. One might think that a micro-service managing products’ metadata would be helpful and self-contained. For this example, it might expose APIs to retrieve and update the products’ metadata. Whoever uses the Product micro-service doesn’t need to know the internal implementation of how we save and query products’ metadata.

We can think of many parts of the code like ProductService: a UserService that takes care of user information, an AuthService for authentication data, a BillingService, and more. Each might use a different technology as its data-store.
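
To make the example concrete, here is a minimal sketch of what such a service’s surface could look like. The class and field names are hypothetical, not Yotpo’s actual implementation; the point is only that callers see get/update operations while storage stays internal.

```python
from dataclasses import dataclass


# Hypothetical metadata record; the fields are illustrative only.
@dataclass
class ProductMetadata:
    product_id: int
    title: str
    description: str


class ProductService:
    """A sketch of the API the Product micro-service might expose.

    Callers only see get/update operations; how the metadata is saved
    and queried stays an internal detail of the service.
    """

    def __init__(self, store):
        self._store = store  # any data-store implementation

    def get_metadata(self, product_id: int) -> ProductMetadata:
        return self._store.fetch(product_id)

    def update_metadata(self, metadata: ProductMetadata) -> None:
        self._store.save(metadata)
```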

To start working with our new ProductService after breaking it out of the monolith, we should:

(1) Move all current products’ metadata to the new service data-store

(2) Remove the related code from the monolith (cleanup is always fun)

(3) Verify that our new pipes work well with both new and old incoming requests

With that, bear in mind that the data is sensitive, and it is important for us to continue serving it and keep backward compatibility while preparing and running the migration to the new micro-service’s API.

The migration

The data migration process can be done before, or in parallel with, starting to serve the new service in production. Migrating the data allows us to refactor the data structure, verify data correctness and upgrade the data-store technology.

In either case, the data can be copied directly (with a migration script) or added to the service via its API so that it follows the code logic (validations, new ERD structure, etc.).

The following provides guidelines on how the latter should be done in order to keep backward compatibility and integrity.

Preparations

Dry run

Verify that your data-store can handle the format of the data. A dry run can be done with almost every database, and it allows you to verify your actions before committing them.
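
One cheap way to dry-run a batch of inserts is to wrap them in a transaction and roll it back. The sketch below uses SQLite from Python’s standard library just to keep it self-contained, and assumes the target table already exists; the same idea applies to most relational stores.

```python
import sqlite3


def dry_run_inserts(rows):
    """Attempt the inserts inside a transaction, then roll back.

    Any format problem (wrong type, NULL in a NOT NULL column,
    duplicate key, ...) surfaces here without committing anything.
    """
    conn = sqlite3.connect("products.db")  # assumes the table below exists
    try:
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO product_metadata (id, title) VALUES (?, ?)",
            rows,
        )
        print(f"dry run ok: {cur.rowcount} rows would be inserted")
    finally:
        conn.rollback()  # never commit in a dry run
        conn.close()
```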

Transactions log

Be sure you have logs for each transaction, with enough retention to document both failure and success flows.

Running code that adds millions of new records to any data-store takes time. We usually do it by running code that is basically a “one-time task” that fetches a record, creates a request to the service and sends it. The service receives it, runs validations and commits the record. After each task we need to collect all the failures in order to re-run them. I recommend automating reports of commit or validation failures as much as possible instead of digging for them in logs. StatsD metrics can also be useful here.
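
A simplified sketch of such a one-time task could look like the following. The endpoint, payload shape and StatsD keys are assumptions for illustration, not our real ones; the important parts are the per-record log line, the failure collection and the metrics.

```python
import logging

import requests
import statsd

logging.basicConfig(filename="migration.log", level=logging.INFO)
metrics = statsd.StatsClient("localhost", 8125, prefix="product_migration")

SERVICE_URL = "https://product-service.internal/api/products"  # hypothetical endpoint


def migrate(records):
    failures = []
    for record in records:
        payload = {"id": record["id"], "title": record["title"]}
        resp = requests.post(SERVICE_URL, json=payload, timeout=10)
        if resp.ok:
            logging.info("migrated record %s", record["id"])
            metrics.incr("success")
        else:
            # Log enough context to re-run the record later.
            logging.error("failed record %s: %s %s",
                          record["id"], resp.status_code, resp.text)
            metrics.incr("failure")
            failures.append(record["id"])
    return failures  # feed these into the fix-up task
```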

Missing or corrupt records will probably have very expensive side-effects. This is no different from importing data: those processes can be done using a queue mechanism with retry abilities and a DLQ (Kafka, Sidekiq, Redis; with the correct configuration, failures can be re-triggered easily and monitored).
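
The same idea expressed as code: retry a few times, and park anything that still fails in a dead-letter queue for later inspection. The sketch below uses a Redis list as a stand-in for whichever queue technology you choose.

```python
import json

import redis

r = redis.Redis()  # assumes a reachable Redis instance

MAX_RETRIES = 3


def process_with_dlq(record, send):
    """Try to migrate one record, retrying before giving up to a DLQ."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            send(record)
            return
        except Exception as exc:  # sketch only; narrow this in real code
            last_error = str(exc)
    # After MAX_RETRIES failures, park the record so it can be re-triggered.
    r.rpush("product_migration:dlq",
            json.dumps({"record": record, "error": last_error}))
```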

Common failures:

(1) Load balancer issues that cause 502 errors

(2) Server errors that require hot-fixes and re-running the same data (in our case, for example, we had encoding issues, elaborated on later)

(3) Unexpected errors or terminations in the task’s code (operational issues), which require re-running the migration from scratch

You are flying blind if you don’t summarise your work in phases. We added an Excel sheet to organise the whole process by record IDs.

Table columns settings

Character set, collation, type, uniqueness, nullability and other data-store validation settings are available when creating or modifying table columns.

Fixing an index or table structure later is very expensive and can lead to very long downtime.

The logic you use to save and query the data is very important when you define your indexes, types and collation values. I suggest following the same settings you had in your old data store. In our case we defined a varchar column with the default utf8 (case-insensitive) collation, while the old configuration used utf8_bin (case-sensitive).

During the data transfer we repeatedly hit duplication errors.

We thought the cause was a race between the transfer and the production flows. After investigating, we understood that this specific column contained the same strings with different letter cases, and the SQL comparison was not case-sensitive due to the collation definition.
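
Our issue was with MySQL’s utf8 collations, but the failure mode is easy to reproduce anywhere. Here is a small sketch using SQLite’s NOCASE collation (playing the role of a case-insensitive collation, while the old store behaved like utf8_bin) to show how ‘abc’ and ‘ABC’ turn into a duplication error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE: case-insensitive comparison, like MySQL's default utf8 collation.
# With a binary collation (utf8_bin), 'abc' and 'ABC' would be distinct rows.
conn.execute("CREATE TABLE products (sku TEXT COLLATE NOCASE UNIQUE)")
conn.execute("INSERT INTO products VALUES ('abc')")
try:
    conn.execute("INSERT INTO products VALUES ('ABC')")
except sqlite3.IntegrityError as err:
    # This is the duplication error we kept hitting during the transfer.
    print("duplicate:", err)
```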

Fix-ups flows

Preparing a task in advance that re-runs failures will save you time later. This task should allow you to run the specific parts of the data that you documented for a later fix. We did it by creating a task that runs in production, reads CSV files from external storage containing all the corrupted record IDs, and re-runs the process for them.
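
A minimal sketch of such a fix-up task, assuming the corrupted IDs sit in a CSV with an “id” column; in production we read the files from external storage, but a local path keeps the sketch self-contained.

```python
import csv


def rerun_failures(csv_path, migrate_one):
    """Read corrupted record IDs from a CSV file and re-run them.

    `migrate_one` is whatever function pushes a single record through
    the normal migration flow (fetch from the old store, send to the
    new service, validate, commit).
    """
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            migrate_one(int(row["id"]))
```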

Feedback and handling failures

Data integrity validations

We prepared queries to run after the transfer is done, and based on this information we added Grafana graphs that show sync gaps between the monolith and the new service’s data stores.
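
One of the simplest such checks compares record counts between the two stores. A sketch, assuming you can open a DB-API connection to both; the table names are illustrative.

```python
def count_gap(monolith_conn, service_conn):
    """Return the sync gap between the monolith and the new service."""
    old = monolith_conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    new = service_conn.execute("SELECT COUNT(*) FROM product_metadata").fetchone()[0]
    # Report this number to Grafana and alert when it grows instead of shrinking.
    return old - new
```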

Check for race conditions

If your code runs in production while the transfer is running, and your new tables already receive production calls, you need to be sure that your transferred data doesn’t override newer data: use an “updated_at” timestamp field for each record and verify that it was not changed during the data transfer task run.
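
A common way to enforce this is an optimistic check on write: only apply the transferred record if its updated_at has not moved since we read it from the old store. A sketch, reusing the hypothetical table from earlier; `read_updated_at` is an illustrative field name for the snapshot value.

```python
def safe_write(conn, record):
    """Apply a transferred record only if production has not touched it since we read it."""
    cursor = conn.execute(
        "UPDATE product_metadata SET title = ? "
        "WHERE id = ? AND updated_at = ?",
        (record["title"], record["id"], record["read_updated_at"]),
    )
    if cursor.rowcount == 0:
        # Production changed the row after our snapshot (or it doesn't exist
        # yet); don't overwrite it, record the ID for the fix-up task instead.
        return False
    return True
```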

Monitor the process

Set up good observability graphs and alerts based on metrics.

Setting smart alerts before running any migration is good practice. We need visibility into both the source and the destination of the migration; we don’t want to overload the old database or the new service with too many requests. You should always monitor CPU usage, response time, and health status. If you’re using RDS infrastructure, monitor IOPS values as well, since they can affect your run pretty badly.

High IOPS load once caused each transfer to run 10 times slower than usual.

If you want to accelerate the process by running parallel tasks, you can end up with race conditions, overloading, high memory usage and other crashes. Grafana is a very good tool for querying metrics and setting alerts.
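
If you do parallelise, keep the concurrency and pace explicit so you can back off when the source or destination starts suffering. A small sketch using a thread pool with a fixed worker count and a per-request delay; both knobs are hypothetical values you would tune against your CPU, response-time and IOPS graphs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4       # tune against the new service's CPU / response time
DELAY_SECONDS = 0.05  # tune against the old database's load / IOPS


def migrate_in_parallel(records, migrate_one):
    def throttled(record):
        migrate_one(record)
        time.sleep(DELAY_SECONDS)  # crude back-pressure between requests

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(throttled, records))
```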

Stay cool

The data migration process can indeed be rough and long, depending on the scale. However, it’s also pretty simple to be safe and smart here. I can promise you that if you keep your work organised, you will be able to sleep peacefully even when you move millions of important data records.
