How to get a service migration to the finish line successfully

Dalia Simons
Published in Wix Engineering
Aug 3, 2023 · 4 min read

For the past 2 years, I’ve worked as a Backend Tech Lead on Wix’s Ecom Platform. One of the biggest tasks we’ve been working on is rewriting our old microservices.

A rewrite has a lot of different stages:
- Analyzing the old service
- Writing the new code
- Moving traffic to the new service
- Migrating data from the old system to the new

Most articles about rewrites focus on migrating the data, but I feel one of the hardest parts is feeling confident enough to start moving traffic to the new service.
This is especially true for complex services that have been running for a long time and have a large number of customers, not all of whom are known to us. In our case, we had an even bigger scare: we were rewriting systems that make purchase-related calculations, touching the most sensitive area of all, MONEY!

Image by Julita from Pixabay

How did we do it so far?

The classic way is to run a few cycles of QA testing on the new system, covering all known flows, and to open bugs for the problems found.
Then we open the new system to traffic (hopefully gradually, using an FT (feature toggle) system) and close it again whenever we get complaints from users or see exceptions.
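To make "gradually" concrete, here is a minimal sketch of a percentage-based toggle. It is not how Wix's FT system works; the function name `in_rollout`, the experiment name, and the hashing scheme are all assumptions made for illustration. The idea is that each user lands in a stable bucket, so raising the percentage only ever adds users to the new system and never flips existing ones back.

```python
import hashlib


def in_rollout(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user should get the new system.

    Hypothetical helper: hashes the experiment name and user ID into one of
    10,000 stable buckets, then opens the lowest `percent` percent of them.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Example: open the (made-up) "new-product-service" experiment to 5% of users.
use_new = in_rollout("user-123", "new-product-service", 5)
```

Closing the experiment after a complaint is then just a matter of setting the percentage back to 0.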

This is an exhausting process, and it takes a long time to finally move all the traffic.

But can we have a faster cycle?

We can have a shadow system that runs under the hood where the users are doing the QA without even knowing about it.
This will give us high traffic, faster cycles, and much better visibility.

Using a Compare tool — Is this the correct solution?

It seems like the best way to create such a shadow system is to send the same request to both servers and compare the responses to make sure they are identical.
At Wix, we have an internal tool for this with the funny name DiffMatoky.

Comparison Example

This tool receives pairs of JSON objects over a Kafka topic and compares them.
Let’s consider the example above: we rewrote the ProductService and created a proxy service that decides whether to call the old or the new service during the rollout. (The proxy is optional; the same decision can be made inside the old or the new service.)
How do we compare? We create an experiment switch in the proxy with 3 states: Old, Compare, and New.
Old and New simply call the old or new API. When Compare is on, we call the old API and return its response (so we don’t hurt the API’s performance), and in the background we run an async task that calls the new API and then produces a message containing the request, the response from the new server, and the response from the old server.
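The three-state switch can be sketched roughly as follows. This is an illustration under assumptions, not Wix's actual proxy: the names `Mode`, `ProductProxy`, and `publish` are made up, the two services are plain callables, and `publish` stands in for whatever produces the message to the Kafka topic.

```python
import enum
from concurrent.futures import ThreadPoolExecutor


class Mode(enum.Enum):
    OLD = "old"
    COMPARE = "compare"
    NEW = "new"


class ProductProxy:
    """Hypothetical proxy that routes calls based on an experiment switch."""

    def __init__(self, old_api, new_api, publish, mode=Mode.OLD):
        self.old_api = old_api  # callable: request -> response
        self.new_api = new_api  # callable: request -> response
        self.publish = publish  # callable: message dict -> None (stand-in for a Kafka producer)
        self.mode = mode
        self._executor = ThreadPoolExecutor(max_workers=4)

    def handle(self, request):
        if self.mode is Mode.NEW:
            return self.new_api(request)
        old_response = self.old_api(request)
        if self.mode is Mode.COMPARE:
            # The shadow call runs in the background, so the caller never waits for it.
            self._executor.submit(self._shadow, request, old_response)
        return old_response

    def _shadow(self, request, old_response):
        message = {"request": request, "old": old_response}
        try:
            message["new"] = self.new_api(request)
        except Exception as exc:
            # A failure in the new service is reported, never surfaced to the user.
            message["error"] = repr(exc)
        self.publish(message)

    def close(self):
        # Wait for any pending shadow calls before shutting down.
        self._executor.shutdown(wait=True)


# Usage with fake services and an in-memory list instead of a Kafka topic.
published = []
proxy = ProductProxy(
    old_api=lambda req: {"id": req["id"], "price": 10},
    new_api=lambda req: {"id": req["id"], "price": 12},
    publish=published.append,
    mode=Mode.COMPARE,
)
response = proxy.handle({"id": 1})  # the user always gets the old response
proxy.close()
```

One thing this sketch glosses over: shadowing is only safe for calls without side effects, since in Compare mode every request reaches both services.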

DiffMatoky compare tool

The DiffMatoky service uses a simple JSON compare library and gives us:
1. The percentage of requests where the responses are identical.
2. Details of requests that failed with errors.
3. A list of the fields that failed the comparison, with the number of failures for each (so we know which problem is the most common).
4. Stack traces for diffs and errors.
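DiffMatoky is internal, so the following is only a sketch of how the first and third outputs could be computed, not its real implementation. The helper names `diff_paths` and `summarize` are assumptions, and the responses are taken to be already-parsed JSON (dicts, lists, and scalars).

```python
from collections import Counter


def diff_paths(old, new, path=""):
    """Return the dotted paths at which two parsed JSON values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in old or key not in new:
                paths.append(child)  # field exists on one side only
            else:
                paths.extend(diff_paths(old[key], new[key], child))
        return paths
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        paths = []
        for i, (a, b) in enumerate(zip(old, new)):
            paths.extend(diff_paths(a, b, f"{path}[{i}]"))
        return paths
    # Scalars, mismatched types, or lists of different lengths.
    return [] if old == new else [path or "<root>"]


def summarize(pairs):
    """pairs: iterable of (old_response, new_response) tuples.

    Returns the percentage of identical pairs and the failing fields,
    most frequent first.
    """
    total = identical = 0
    field_failures = Counter()
    for old, new in pairs:
        total += 1
        paths = diff_paths(old, new)
        if paths:
            field_failures.update(paths)
        else:
            identical += 1
    percent = 100.0 * identical / total if total else 0.0
    return percent, field_failures.most_common()
```

Sorting the failing fields by count is what makes the report actionable: the field at the top of the list is usually the one bug worth fixing first.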

Now we can go and fix the problems, and once we get to 100% we can switch traffic. Sounds simple, but does it work?

Image by Arek Socha from Pixabay

There is no magic

After trying this solution on a few rewrites, we found some problems with it and came up with a few rules to make it more efficient:

  1. Only start using the compare tool when you’ve finished the QA process for all the main flows. This might sound trivial, but at first, we thought we had a magic tool at hand. We spent a long time trying to fix small field differences we found in the comparison when in fact there were more fundamental problems we hadn’t addressed first.
  2. You’ll never get to 100%. However hard you try, you will always have differences. A few examples:
    - a difference of a few milliseconds in creationTime
    - a null value in the old service that was given a default value in the new service
    - enums with different values (a value was added or removed in the new service).
    We’ve added a mechanism to ignore such fields in the dashboard, but we still had edge cases we didn’t want to ignore.
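An ignore mechanism of this kind can be approximated by normalizing both responses before comparing them. The sketch below is an assumption about how that might look, not the dashboard's actual logic: `drop_path`, `normalize`, and `equivalent` are hypothetical names, ignored fields are given as dotted paths, and defaults are applied to top-level fields only.

```python
import copy


def drop_path(doc, dotted_path):
    """Remove a dotted path such as "order.creationTime" from a nested dict, if present."""
    keys = dotted_path.split(".")
    node = doc
    for key in keys[:-1]:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return  # the path does not exist in this document
    if isinstance(node, dict):
        node.pop(keys[-1], None)


def normalize(response, ignored_paths=(), defaults=None):
    """Return a copy of a response dict with known, accepted differences removed."""
    result = copy.deepcopy(response)  # never mutate the real response
    for path in ignored_paths:
        drop_path(result, path)
    for key, default in (defaults or {}).items():
        # Treat a missing or null top-level field as its default value.
        if result.get(key) is None:
            result[key] = default
    return result


def equivalent(old, new, ignored_paths=(), defaults=None):
    """Compare two responses after applying the same normalization to both."""
    return normalize(old, ignored_paths, defaults) == normalize(new, ignored_paths, defaults)


old = {"id": 1, "creationTime": 1000, "note": None}
new = {"id": 1, "creationTime": 1003, "note": ""}
same = equivalent(old, new, ignored_paths=["creationTime"], defaults={"note": ""})
```

Keeping the ignore list short and explicit matters here: every ignored field is a place where a real regression can hide.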

You’ve got to be brave

Image by Mohamed Hassan from Pixabay

At the end of the day, you will almost never get to 100% confidence. But you can get to a stage where you’re confident enough in your main flows to start moving traffic. You will still need to do it gradually and patiently, monitor it carefully, and be ready to close the experiment if a problem arises.

Be confident in yourself: you’ve tested all the main flows, and you know no major issue should happen.
Good luck!

Dalia Simons
Wix Engineering

I’m an experienced software engineer. Writing backend code has been my passion and my career for the last 12 years. Currently I enjoy working for Wix.com.