IDEA 2.0 — A look at migration from older IDEA

Abhishek Jain
Published in Myntra Engineering
Aug 29, 2021

In this article, we look at how we rolled out IDEA 2.0 without any downtime. If you are not familiar with IDEA 2.0, please read this blog first: https://medium.com/myntra-engineering/idea-2-0-a-look-at-scalable-micro-service-architecture-9eee5669767f

Problem Statement

  • Migrate data from Idea 1.0 to 2.0 without any data loss, downtime, or impact on existing production traffic.
  • During the transition, continue serving data from both systems at the same time to allow an incremental rollout.
  • Any change in one DB should be reflected in the other with minimal delay, so that the two are eventually consistent.

Options Available

  • We could have gone the DB route, replicating every commit made in one DB to the other

Problems:

The DB schemas on the two sides are completely different, so this is not just a plain MySQL server upgrade.

The MySQL versions differ: one side is on MySQL 5 and the other on MySQL 8, which makes keeping them always in sync a challenge.

  • Using Snapshots

Problems:

Because the DB schemas differ between the two sides, every snapshot would need to be modified.

Snapshots cannot keep two databases always in sync; they would only be good for a one-time migration.

  • Using Airbus (Our in-house kafka managed service)

Problems:

What if an event fails to be produced?

What if during consumption we fail to consume the event?

When to produce the event for transactions?

What if transaction fails?

The best option was Airbus; the only remaining challenge was to solve the problems that come with syncing through Airbus.

Terminologies

  • Alpha — The sync from Idea 1.0 to Idea 2.0 is called alpha
  • Beta — The sync from Idea 2.0 to Idea 1.0 is called beta
  • Alien — The % of traffic that is being served from Idea 2.0 is called alien
  • URT — Unique Request Transaction, which is used to maintain multi-request transactions at database levels
  • OTP — One Time Password
  • Gateway — Myntra’s Public API Platform
  • Rabbit-Graylog — Our internal log management system built over Graylog
  • Airbus — Our in-house kafka managed service

Answering the Whys and Hows of the solution-ing

Why Dual Sync-Up?

  • To provide 0 downtime and smooth transition of users from 1.0 to 2.0
  • To provide capability of incremental rollout of users with some users serving from 1.0 and others from 2.0
  • To provide rollback capability in case something goes wrong at Idea 2.0, we can immediately rollback to 1.0 without any data loss or user requests left hanging
  • This enables us to have zero risk at production for any flow, and we have a lot of control on how we want to serve the users and from where, without any losses.

How are we providing rollback capability?

  • Data is always in sync at both 1.0 and 2.0
  • If any issue happens and we want to roll back, we just reduce the alien % with a feature gate flag, and traffic starts being served from 1.0 once the ongoing requests are completed.
  • To serve transactional flows we have the concept of a URT (Unique Request Token), with which the entire transactional flow is associated. How this flow is handled is explained in the URT and OTP migration section.

How are we providing 0 downtime?

  • With both alpha and beta always enabled, the data always remains in sync
  • During the migration stage, we migrate the data from 1.0 to 2.0 using alpha, producing the events in Airbus at a throttled rate
  • Alpha also stays enabled during the migration stage, which keeps the data in sync for any incremental changes that happen for users

How do we ensure that data is consistent and any diff is also synced?

  • Instead of producing the entire data in Airbus, we produce only the unique identifier of the changed record for each table (Ex: UIDX)
  • The consumer then makes an API call to the other side to get the data, which ensures we have the latest changes available at that point in time, and saves it in the DB
  • Even if the data changes after our fetch, a new event with the same identifier is produced in the Airbus topic; when it is consumed, the latest data is fetched and saved in the DB
  • This ensures data is always in sync and production traffic is not affected at all
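The identifier-only events described above can be sketched as follows. This is a minimal illustration, not the Airbus API: the `Store` class, the list used as a topic, and the function names are all assumptions, and the dictionary lookup stands in for the API call to the other side.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for one side's database."""
    rows: dict = field(default_factory=dict)


def produce_sync_event(topic: list, uidx: str) -> None:
    # The event carries only the identifier, never the row itself.
    topic.append({"uidx": uidx})


def consume_sync_event(event: dict, source: Store, sink: Store) -> None:
    # Fetch the current state from the source at consumption time
    # (stands in for the API call to the other side), then persist it.
    latest = source.rows[event["uidx"]]
    sink.rows[event["uidx"]] = dict(latest)


idea1, idea2, alpha_topic = Store(), Store(), []

idea1.rows["u1"] = {"email": "old@example.com"}
produce_sync_event(alpha_topic, "u1")
# A second write lands before the first event is consumed.
idea1.rows["u1"] = {"email": "new@example.com"}
produce_sync_event(alpha_topic, "u1")

for event in alpha_topic:
    consume_sync_event(event, idea1, idea2)
```

Because each event is resolved against the source when it is consumed, a stale or duplicated event can only rewrite the sink with the current state, so the order in which the payloads were written stops mattering.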

What is the role of Alien?

  • Alien provides a mechanism to control where each user is served from (Idea 1.0 or Idea 2.0)
  • The logic can be controlled in 3 ways:

% based: A random number between 0 and 100 (up to 2 decimal places) is generated, and the user is served from 1.0 or 2.0 depending on where that number falls relative to the defined %.

List based: A defined list of users (Phone, Email, UIDX); if the user is in the list, the flows are served from Idea 2.0.

Pattern based: If the user matches the defined pattern (Ex: @myntra.com), the flows are served from Idea 2.0.
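The three alien controls could be combined along these lines. This is a sketch under assumptions: `AlienConfig`, `is_alien` and the order in which the checks run are hypothetical, and the pattern check is a plain substring match.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlienConfig:
    """Hypothetical shape of the alien constructs held in the feature gate."""
    percent: float = 0.0                      # share of traffic for Idea 2.0
    users: set = field(default_factory=set)   # phones, emails or UIDXs
    patterns: tuple = ()                      # e.g. ("@myntra.com",)


def is_alien(user: str, config: AlienConfig, roll: Optional[float] = None) -> bool:
    """Return True when the request should be served from Idea 2.0."""
    if user in config.users:                       # list based
        return True
    if any(p in user for p in config.patterns):    # pattern based (substring match)
        return True
    if roll is None:                               # % based
        # Random number in [0.00, 99.99] with two decimal places.
        roll = random.randrange(0, 10000) / 100
    return roll < config.percent
```

Passing `roll` explicitly makes the percentage branch easy to test; in a live call it would be left as `None` so that a fresh random number is drawn.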

What is the role of Adapter?

  • Until all the clients migrate directly to Idea 2.0, their requests keep coming to Idea 1.0 only, which brings the issue of contract binding: clients still expect the Idea 1.0 contracts.
  • An adapter has 2 roles:

Control the alien logic and re-route the request to be served from either Idea 1.0 or Idea 2.0.

Act as a transformation layer: a request-response transformer that converts between the contracts of Idea 1.0 and Idea 2.0, so that a request is served from Idea 2.0 but with the older contracts of Idea 1.0.
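The transformation role of the adapter can be pictured with the sketch below. The real contracts are not shown in this article, so every field name here (`uidx`, `attributes`, `accountId`, `profile`) and the `fake_idea2` stand-in are assumptions made purely for illustration.

```python
def to_v2_request(v1_request: dict) -> dict:
    """Translate an Idea 1.0 style request into the Idea 2.0 contract."""
    # Hypothetical field names: "uidx"/"attributes" in 1.0, "accountId"/"profile" in 2.0.
    return {
        "accountId": v1_request["uidx"],
        "profile": v1_request.get("attributes", {}),
    }


def to_v1_response(v2_response: dict) -> dict:
    """Translate an Idea 2.0 response back into the Idea 1.0 contract."""
    return {
        "uidx": v2_response["accountId"],
        "attributes": v2_response.get("profile", {}),
    }


def fake_idea2(v2_request: dict) -> dict:
    # Stand-in for the real Idea 2.0 call; it simply echoes the account back.
    return {"accountId": v2_request["accountId"], "profile": v2_request["profile"]}


v1_request = {"uidx": "u1", "attributes": {"firstName": "Asha"}}
v1_response = to_v1_response(fake_idea2(to_v2_request(v1_request)))
```

The client only ever sees the Idea 1.0 shape on both the request and the response, which is what lets the serving side change without a client release.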

How are we providing Incremental Rollout?

  • The alien constructs are kept in the feature gate at each API level
  • Whenever we want to change where users are served from, we change the alien constructs (percentage, list, pattern) in the feature gate, and within the feature gate's refresh interval the flow starts being served from the chosen version

How are we ensuring data is synced correctly and is consistent? What about post rollout production metrics?

  • To check that data is consistent between alpha and beta, we created a diff API, which reports, for a given range, the difference between the number of users created in IDEA 1.0 and IDEA 2.0
  • To check that the sync is happening correctly, we created JMX metrics and use the Airbus Grafana metrics
  • The metrics consist of adapter timings, RPM, Airbus alpha and beta RPMs, and their error queues.
  • We also plotted JMX metrics for the production APIs to monitor latency and RPM, with relevant alerts on them so that we can make proactive callouts and resolve issues quickly.
  • Rabbit-Graylog is also enabled to monitor the logs in real time for any issues with the APIs, both at IDEA for the adapter and at Account Service (IDEA 2.0).

Now let's look at the solution and at how some typical problems were solved.

Onboarding of IDEA 2.0 APIs in Gateway

Gateway needs to serve two sets of IDEA APIs: the existing set for IDEA 1.0 and another for IDEA 2.0. After the entire migration, this moves to serving only the IDEA 2.0 APIs.

Dual-way synch-up from source to sink

We'll use Airbus to produce events from 1.0; 2.0 will consume them and sync the changes to its database, and vice versa.

  1. The event is produced after the DB write in the source
  2. If the DB call fails, no event is produced, so the write and the event behave like one atomic operation
  3. The event is published in Airbus

Every event contains one identifier used to fetch the data from the other side

Data is not passed in the event

What happens when the event producer goes down? See the FAQs

  4. The sink consumes the event and makes an API call to the source to get the data
  5. The sink persists the changes in its DB

What happens when the event consumer is down? See the FAQs

  6. Versioning: merging and conflict resolution for the data

The same flow applies to the sync in reverse, from sink to source.

Incremental Go live plan

Phase 1 — Making it live, Observe dual writes from 1.0 to 2.0

  1. Code is live for data migration in Idea 1.0 and 2.0
  2. Start the data migration script for migrating data from 1.0 to 2.0
  3. Keep two feature gates: Alpha to enable writes from 1.0 to 2.0 and Beta to enable writes from 2.0 to 1.0. These gates can be configured per client as well.
  4. Data migration script is complete
  5. We make Alpha = true
  6. Dual-way synch-up starts and incremental writes are pushed from 1.0 to 2.0
  7. Observe the results and check consistency
  8. If all is good, we internally declare Phase 1 done
  9. In case of inconsistency, we fix it and repeat from step 6
  10. Done

Phase 2 — Enable for a single client and observe dual writes from 2.0 to 1.0

  1. Keep a feature gate named Aliens in IDEA 1.0 to serve the traffic of a set of migrants from 2.0
  2. The definition of Aliens can vary: it may be a client, or it may be 1% of the user base
  3. We enable a set of predefined migrants using the feature-gated Aliens; we also set Beta = true
  4. Gateway calls for Alien users still come to Idea 1.0 with the older APIs
  5. Idea 1.0 uses the Adapter to translate the request and redirects it to Idea 2.0
  6. Idea 2.0 processes the request and returns the response to Idea 1.0
  7. Idea 1.0 again uses the Adapter to transform the response and returns it as per the older API contract, so nothing breaks for the client
  8. For the writes made in step 5, Dual-way synch-up pushes the incremental writes from 2.0 to 1.0
  9. Rollback option

We remove the migrants from Aliens and requests start being served from 1.0

We observe the issue in 2.0 and fix it

We test the fix with a test client, then repeat from step 2

Phase 3 — Enable for all clients

  1. We repeat Phase 2 until all the clients and all the users are migrated
  2. Idea 1.0 now acts only as a Gateway with the Adapter

Phase 4 — Migration of clients to Idea 2.0 APIs

  1. Major clients: Gateway and internal clients such as Logistics, Warehouse, Seller Portal, etc.
  2. We'll publish the new API contract of IDEA and ask the clients to prioritize the integration in their plans
  3. Clients integrate and go live with the new contract
  4. We remove the client from the Aliens, so Beta will be off for those clients

We can keep an observation period of 7 days to get sign-off from the client before turning Beta off; until then a rollback may still happen, so Beta should stay enabled (dual write)

  5. Once all the clients move to Idea 2.0, the Aliens size becomes zero
  6. We observe zero calls on Idea 1.0
  7. We deprecate 1.0

What IDEA versions will do

IDEA 1.0

  • Maintain a map of UIDX to Idea version (we can also run an A/B test)
  • If UIDX does not belong to Aliens

Serve the response from Idea 1.0

If Alpha is enabled for the client, do the Dual-synch up (push the data to 2.0)

  • Else

Use the Adapter to transform the request

Redirect the request to Idea 2.0

Use the Adapter to transform the response

  • Return the response
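The IDEA 1.0 flow above can be sketched as a single dispatch function. Every callable passed in (`serve_v1`, `serve_v2`, `to_v2`, `to_v1`, `push_to_alpha`) is a hypothetical hook standing in for the real service code, and the sketch pushes an alpha event on every call, whereas a real service would presumably do so only for writes.

```python
def handle_in_idea1(request, aliens, alpha_enabled,
                    serve_v1, serve_v2, to_v2, to_v1, push_to_alpha):
    """Route one request that arrived at Idea 1.0."""
    uidx = request["uidx"]
    if uidx not in aliens:
        # Not an alien: serve locally and sync the change towards 2.0.
        response = serve_v1(request)
        if alpha_enabled:
            push_to_alpha(uidx)
    else:
        # Alien: transform, redirect to 2.0, transform the response back.
        response = to_v1(serve_v2(to_v2(request)))
    return response
```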

IDEA 2.0

  • Serve API with new contract
  • If Beta is enabled for a client in Alien then it does Dual-synch up (push the data to 1.0)

FAQs

Which client will be migrated first?

We can migrate clients from MYNTRA tenant first, and then INSIDE clients. This is done as INSIDE clients have much more complex dependencies and requirements with IDEA than MYNTRA clients.

Scopes of Alien? How can we migrate 1% traffic first?

  1. We'll use the Alien feature gate for this
  2. If Alien = {1% traffic of myntra}, we'll use the approach below

For every uidx, hash it, and use that hash to generate a number using modulo

Using the feature gate config for 1.0 to 2.0 calls, we use that number to redirect traffic to a specific IDEA version.

This way, all the calls of a uidx go to the same IDEA version.

  3. If Alien = {Client A}, redirect all the calls of Client A to 2.0
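The hash-and-modulo idea can be sketched as below. The choice of SHA-256, the 10000 buckets, and the function names are assumptions; the source only says that the uidx is hashed and reduced with a modulo. A cryptographic digest is used here instead of Python's built-in `hash()` because the latter is salted per process for strings and would not give the same bucket across restarts.

```python
import hashlib


def bucket_for(uidx: str, buckets: int = 10000) -> int:
    """Map a uidx to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(uidx.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def served_from_v2(uidx: str, alien_percent: float) -> bool:
    # With 10000 buckets, each bucket is 0.01% of the user base.
    return bucket_for(uidx) < alien_percent * 100
```

Since the bucket depends only on the uidx, raising the percentage from 1% to 5% keeps every user who was already on 2.0 there and only adds new ones.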

When do clients migrate to IDEA 2.0?

Clients migrate to IDEA 2.0 after Phase 2.

What happens when the event producer goes down?

  1. A producer may fail to produce an event; we handle this by applying multiple retries on the producer.
  2. We produce the event synchronously.
  3. If the event still fails to be produced, we fail the transaction and do not store the change in the database.
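A minimal sketch of the retry-then-fail behaviour described above. The `send` callable, the retry count, and the dictionary used as a database are assumptions; restoring the previous value stands in for failing the real DB transaction.

```python
class EventNotProduced(Exception):
    """Raised when the event could not be produced after all retries."""


def produce_with_retries(send, event, max_attempts=3):
    """Call send(event) synchronously, retrying on failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            send(event)
            return
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise EventNotProduced(event) from last_error


def write_user(db, uidx, data, send):
    """Apply the change and produce the sync event; undo the change if producing fails."""
    previous = db.get(uidx)
    db[uidx] = data
    try:
        produce_with_retries(send, {"uidx": uidx})
    except EventNotProduced:
        # Stand-in for failing the DB transaction.
        if previous is None:
            del db[uidx]
        else:
            db[uidx] = previous
        raise
```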

What happens when the event consumer is down?

  1. Airbus provides an at-least-once guarantee for event consumption, so if the event consumer goes down, the events it was consuming are not committed.
  2. The next time the consumer is up, it resumes consuming from the last committed offset (so some events could be consumed twice).
  3. If event processing fails, we send the event to the error queue with infinite retries, in sync. Later we consume those events from the error queue and process them in the same fashion.
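The consumer behaviour above can be illustrated with a toy log and offset. This is not the Airbus or Kafka client API: the list used as a log, the `consume` function, and the in-memory error queue are assumptions. The point it shows is that re-consuming an event is harmless when processing simply re-fetches and overwrites.

```python
def consume(log, committed_offset, process, error_queue):
    """Process events from the last committed offset; return the new offset."""
    offset = committed_offset
    while offset < len(log):
        event = log[offset]
        try:
            process(event)
        except Exception:
            error_queue.append(event)   # park it for a later retry
        offset += 1                     # advance only after the event is handled
    return offset


source = {"u1": "v1", "u2": "v2"}
sink = {}


def process(event):
    # Idempotent upsert: fetch the latest value and overwrite.
    sink[event["uidx"]] = source[event["uidx"]]


log = [{"uidx": "u1"}, {"uidx": "u2"}, {"uidx": "missing"}]
process(log[0])   # handled once, but the consumer dies before committing
errors = []
new_offset = consume(log, 0, process, errors)   # restart from offset 0
```

After the restart, the first event is processed a second time with no visible effect, and the event that cannot be processed ends up in the error queue instead of blocking the rest.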

When to produce events for transactions? What if the transaction fails?

  1. Wait for the transaction to complete and produce the event afterCommit.
  2. If the transaction fails, a rollback happens and no event is produced.
  3. Even if an exception happens, the transaction rolls back and no event is produced for syncing to alpha or beta.
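The produce-after-commit rule can be sketched with a toy transaction. The `Transaction` class below is an assumption made for illustration and not the framework hook used in IDEA; it buffers writes and runs the registered hooks only when the block exits without an exception.

```python
class Transaction:
    """Toy transaction: buffers writes and runs hooks only after a commit."""

    def __init__(self, db):
        self.db = db
        self.pending = {}
        self.after_commit = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            return False              # rollback: drop writes and hooks, re-raise
        self.db.update(self.pending)  # commit
        for hook in self.after_commit:
            hook()
        return False


db, topic = {}, []
with Transaction(db) as tx:
    tx.pending["u1"] = {"phone": "9999999999"}
    tx.after_commit.append(lambda: topic.append({"uidx": "u1"}))
```

If anything raises inside the `with` block, neither the write nor the event survives, which matches points 2 and 3 above.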

URT and OTP migration

This is a special case as these have no historical value in real-time calls and are only valid for a maximum of 15 minutes.

Approach

UIDX/Phone/Email are hereafter called the entity

Alpha Enabled

  • URT is present

Check in Redis for the URT in 1.0 (Feature Gate Configurable)

If present, redirect the request to 1.0

Else, redirect the request to 2.0

  • No URT but entity is present

Check whether the entity belongs to 1.0 or 2.0 (Aliens)

Priority for Entity: UIDX > Phone > Email

If the entity belongs to Aliens, redirect the request to 2.0

Else, redirect the request to 1.0

  • No URT, No Entity

Redirect the request to 1.0 or 2.0 as per the APIs that are migrated to Alien

Rollback Option

  • URT is present

Check in Redis for the URT in 1.0 (Feature Gate Configurable)

If present, redirect the request to 1.0

Else, redirect the request to 2.0

Disable the Redis check for URT after 30 minutes and redirect all the requests to 1.0

  • No URT but entity is present

Redirect the request to 1.0

  • No URT, No Entity

Redirect the request to 1.0
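The routing rules for both modes can be summarized in one function. This is a sketch under assumptions: the function and parameter names are hypothetical, a Python set stands in for the Redis lookup of URTs issued by 1.0, and `default_version` represents the per-API alien decision for requests with neither URT nor entity. The source only describes switching off the Redis check during rollback; here a disabled check sends URT requests to 1.0 in either mode.

```python
def pick_entity(uidx=None, phone=None, email=None):
    # Priority for the entity: UIDX > Phone > Email
    return uidx or phone or email


def route(urt, entity, urt_store_v1, aliens, rollback=False,
          urt_check_enabled=True, default_version="1.0"):
    """Pick the Idea version for a request in a URT/OTP flow."""
    if urt is not None:
        if not urt_check_enabled:
            # Check switched off (after 30 minutes of rollback): go to 1.0.
            return "1.0"
        # An in-flight transaction stays where its URT was issued.
        return "1.0" if urt in urt_store_v1 else "2.0"
    if rollback:
        return "1.0"
    if entity is not None:
        return "2.0" if entity in aliens else "1.0"
    return default_version
```

Keeping a request with a known URT on the side that issued it is what allows a multi-request flow to finish even while the alien % is being changed underneath it.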

Callouts

  1. For Phone Login V1 case (No URT)

The Send OTP call is served from 1.0

Alien migration has started.

After migration, the entity belongs to Alien

The Verify OTP call is served from 2.0, which would result in a failure

This is true for both beta enabled and rollback

The maximum callout for this API is a few hundred RPH, as the migration happens at night around 3 AM.

  2. Some customer escalations can happen for any URT flow if a rollback is done and the rollback is due to a bug in the URT flow in 2.0.

Profile Calls Migration

This is another special case, as the profile APIs are not coded in Idea 2.0. To serve the Profile calls without any data loss or user experience issues, we had to solve this in a different way.

Approach

On a call with Alien for Profile Update calls

  • IsRedirectToAccountService
  • If True — Case 2
  • Else — Case 1

Idea 1.0 — I

Idea 2.0 — A

Case 1: User is to be served from IDEA 1.0 (I)

  • Profile Update happens in I
  • Final DB call happens in sync manner in I
  • Generate an event to alpha topic
  • Consume in A, make a call to I, and save the changes

Case 2: User is to be served from IDEA 2.0 (A)

  • Profile Update happens in I
  • Final DB call happens in sync manner in A
  • Generate an event to beta topic
  • Consume in I, make a call to A, and save the changes

In both cases, if the call fails on one side, the data is not synced to the other side either: as per our transaction handling, we do not produce the event to alpha or beta if the transaction fails in the respective service.

This also ensures that users are always in sync and see the change right after it is made: if the user is served from Idea 1.0, the profile update happens synchronously in 1.0 and asynchronously in 2.0, and vice versa.

Where are we at?

Currently we are in Phase 4 of this plan, which will lead to the deprecation of IDEA 1.0 once we observe zero calls on it.

All the phases so far have been successful, with zero to minimal issues in the rollout and no downtime to business functions across Myntra.

We shall keep you updated on the progress after the Phase 4 rollout. Stay tuned! Thanks for the read. Comments welcome!
