Myntra’s BCP/DR Journey

Sanjeev Kumar Dangi
Published in Myntra Engineering
Sep 1, 2021 · 13 min read

Business Continuity Plan

Myntra has a datacenter in region A. For disaster recovery (DR), Myntra added another datacenter in region B. So region B became our DR region for region A.

A disaster can make an entire region unavailable. So having at least two data centres in two different regions makes sense for business continuity.

We were using the region A datacenter to serve production traffic. The region B datacenter was used for real-time state replication of databases and to keep database backups.

So if region A becomes unavailable due to a disaster, we can use the replicated and backed-up state from region A to restore state in region B, provision the stateless services in region B, and then enable production traffic in region B.

For disaster recovery, we need to do a Business Impact Analysis of all the tech systems and arrive at acceptable RTO and RPO thresholds.

RTO (Recovery Time Objective): the acceptable threshold for how long it can take to restore a business function after the disaster event.

RPO (Recovery Point Objective): the acceptable threshold for how far back in time before the disaster you can go while still restoring data in a usable format, i.e. the maximum tolerable data loss.
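As a concrete illustration, here is a minimal sketch (Python, with hypothetical timestamps and thresholds, not our actual numbers) of checking a DR drill's result against agreed RTO/RPO:

```python
from datetime import datetime, timedelta

# Agreed thresholds from the Business Impact Analysis (illustrative values).
RTO = timedelta(hours=4)       # max acceptable time to restore the business function
RPO = timedelta(minutes=15)    # max acceptable data loss, measured backwards from the disaster

# Hypothetical timestamps captured during a DR drill.
disaster_at        = datetime(2021, 8, 1, 2, 0)
last_replicated_at = datetime(2021, 8, 1, 1, 50)   # last state safely copied to region B
restored_at        = datetime(2021, 8, 1, 5, 30)   # business function live again in region B

recovery_time = restored_at - disaster_at          # actual downtime
data_loss     = disaster_at - last_replicated_at   # how far back in time we had to go

print("RTO met:", recovery_time <= RTO)
print("RPO met:", data_loss <= RPO)
```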

BCP/DR first phase

In the first iteration, we wanted to move a couple of services from region A to region B along with their production traffic.

We wanted to solve two things with this:

  • Do the groundwork on production infra, network connectivity and platform services setup in both regions, and build the automation needed to move services from one region to another.
  • Make optimal use of our reserved capacity in the other region.

As per our agreement with the cloud provider, we had bought reserved cores in both regions.

Reserved cores are bought in batches, i.e. a couple of thousand cores per batch.

Myntra had bought one batch of cores in region B. We were using only a couple of hundred cores for state replication; the rest were idle while we were paying for the whole batch. Meanwhile, our consumption in region A was growing quarter on quarter and approaching full capacity. So we decided to use region B's idle capacity for future growth.

Identification of the candidate subset for movement from region A to region B

Now we needed to decide which set of services we could move to region B.

Myntra's service ecosystem is broadly decoupled into three major sets:

Storefront — services in the order capture path

SCM — services in the order fulfilment path

Myntra Data Platform (MDP) — ingests, processes and stores all of Myntra's transactional and clickstream data. This data is used for reporting, analytics and training of DS pipelines.

We had two options:

  • Move the Data Platform to region B. With this approach, we would utilise only about a third of region B's idle capacity.
  • Move SCM services to region B. With this approach, we would utilise about two-thirds of region B's idle capacity, and we could use the remaining idle capacity for hardware augmentation during End of Reason Sale (EORS). EORS is Myntra's major sale event, where we get 15–20x more traffic than on normal days, so we provision additional hardware to support the increase in traffic.

So we finalised option 2.

Placement of services

SCM services are mostly decoupled from Storefront services. Once an order is captured by the Storefront, the rest of the order fulfilment pipeline is asynchronous.

We still had a few services in SCM that interacted with both Storefront and SCM services. We needed to decide the placement of these services across the two regions.

We considered the below three parameters to arrive at a decision (see the placement sketch after this list).

  • Latency — As the regions are 1000–2000 km apart, network latency between them is 30 to 60 ms. So if we host a service in region A, services in region B will incur high latency while calling it, and vice versa. Storefront services are user facing and latency sensitive, but SCM services have mostly ERP interactions where latency doesn't have that much impact.
  • Consistency — If a service is called from both regions, what are the consistency guarantees for its use cases: strong consistency or eventual consistency?
  • Cross region network traffic — Cloud providers don't charge you for network transfer within a datacenter in a region, but if data flows across regions, you are charged for the amount of data transferred. So if a service is called from both regions, you need to keep the cross region network transfer cost low, i.e. you try to place the service in the same region as the services it chats with most.
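The sketch below (Python, with made-up numbers and a simplified scoring rule, not our actual tooling) shows how these three parameters can be weighed to pick a region for a shared service:

```python
# Illustrative placement helper: pick the region that minimises the cross region
# penalty for a service called from both regions. All numbers are hypothetical.
CROSS_REGION_LATENCY_MS = 45          # roughly 30-60 ms between the regions
COST_PER_GB_CROSS_REGION = 0.02       # cloud cross-region transfer cost, assumed

def placement_penalty(region, callers):
    """callers: list of (caller_region, calls_per_day, latency_sensitive, gb_per_day)."""
    penalty = 0.0
    for caller_region, calls, latency_sensitive, gb in callers:
        if caller_region != region:
            if latency_sensitive:
                penalty += calls * CROSS_REGION_LATENCY_MS      # latency hit for remote callers
            penalty += gb * COST_PER_GB_CROSS_REGION * 1000     # network transfer cost, scaled
    return penalty

callers = [
    ("A", 1_000_000, True, 50),    # Storefront callers: latency sensitive, chatty
    ("B", 200_000, False, 400),    # SCM callers: async, heavy data volume
]
best = min(["A", "B"], key=lambda r: placement_penalty(r, callers))
print("Place service in region", best)
```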

Patterns

  • For some services we were OK with eventual consistency, so we had the read path set up in both regions. DB state was replicated to the other region via native async DB replication. Writes were enabled in only one region. If write latency was not an issue, we kept writes in the region that would minimise cross region network traffic. If the additional cross region write latency was not acceptable, we enabled writes on the latency-sensitive side, even though we took a hit on cross DC network transfer cost for the other use cases.
  • For some services we wanted strong consistency, so we placed them in a single region alongside the latency sensitive services, irrespective of network cost.

Platform services multi region topologies

Myntra's core platform systems had to support use cases in both regions for the services deployed there.

We used the below topology solutions for this:

Control plane in one region, data planes in both regions

  • Airbus — Myntra’s messaging platform

Airbus was architected to enable an active-active setup across both regions.

Messages for a topic produced in one region can be replicated to the other region, and vice versa, via a replicator service.

The admin UI and admin APIs, i.e. the control plane, were available only in one region, i.e. region A.

The control plane uses a metadata store for read/write operations. So the metadata store was in read-write mode in region A and read-only mode in region B.

With this setup, producers could produce messages locally in either region, and consumers could consume messages from the local region or from all regions by enabling cross region replication for their use case (a replicator sketch follows).
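Airbus internals aren't described in this post; purely as an illustration, assuming a Kafka-like broker underneath, a cross region replicator boils down to consuming a topic in one region and re-producing it in the other. A minimal sketch using the kafka-python client, with hypothetical broker addresses and topic names:

```python
# Minimal cross region topic replicator sketch, assuming a Kafka-like broker
# behind Airbus (an assumption for illustration, not Airbus' actual design).
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "order-events",                                   # hypothetical topic name
    bootstrap_servers="broker.region-a.internal:9092",
    group_id="region-b-replicator",
    enable_auto_commit=False,
)
producer = KafkaProducer(bootstrap_servers="broker.region-b.internal:9092")

for msg in consumer:
    # Re-produce each region A message into the same topic in region B,
    # preserving the key so partitioning stays stable.
    producer.send("order-events", key=msg.key, value=msg.value)
    producer.flush()
    consumer.commit()                                 # commit only after the copy is durable
```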

  • ScaleIT

ScaleIT is Myntra's load testing platform. To support load tests in any region, we made the ScaleIT data plane, i.e. the ScaleIT load generator cluster, multi region aware. We kept the control plane in only one region.

Totally isolated setup in each region, each an exact copy

  • Logging

For logging, we set up a totally isolated, dedicated stack in the other region.

We didn't require a single log view because, with a few exceptions, most services were in only one region. And we didn't want to incur cross region data transfer cost without much benefit.

For the few services present in both regions, we would go to each region's log index to troubleshoot.

For UI access, we had two different DNS names, i.e. region-a-logging.myntra.com and region-b-logging.myntra.com.

  • Sentry

We use Sentry for error tracking and troubleshooting.

Sentry had a setup similar to logging above: two disjoint setups, one in each region.

Setup in only one region

  • Chronicle

Chronicle is Myntra's auditing platform. For an entity change, any client can publish the entity with its old state, new state, actor and timestamp. With this, clients can see an audit of an entity's state changes in a timeline view.

We kept Chronicle in a single region with the SCM services, which have almost all of our audit use cases. For the remaining use cases from the other region, we replicated the audit events via Airbus to keep things simple, with the tradeoff of additional cost for cross region replication.
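Chronicle's client API isn't shown in this post; the sketch below only illustrates the shape of an audit event as described above (entity, old state, new state, actor, timestamp), published via a hypothetical client function:

```python
import json
from datetime import datetime, timezone

# Shape of a Chronicle audit event as described above; field names are
# illustrative, not Chronicle's actual schema.
audit_event = {
    "entity_type": "shipment",
    "entity_id": "SHP-12345",
    "old_state": {"status": "PACKED"},
    "new_state": {"status": "SHIPPED"},
    "actor": "warehouse-service",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Hypothetical publish call; in practice the event would reach Chronicle
# directly, or via Airbus replication when produced in the other region.
def publish_to_chronicle(event: dict) -> None:
    print("publishing audit event:", json.dumps(event))

publish_to_chronicle(audit_event)
```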

Cohosted database schema separation

For MongoDB and MySQL databases, we use a single VM with a single MySQL/MongoDB process that hosts multiple schemas.

Tier 1 service schemas are hosted on separate VMs, i.e. no sharing. Myntra's teams/PODs are organised around platform capabilities; for example, all logistics services are owned by one team, all warehouse services by another, and so on. So schemas of tier 2 services from a single POD are hosted on shared database VMs. This is done for efficient resource utilisation and lower operational cost for database engineers.

We had cross region async replication enabled for databases from region A to region B as part of state propagation to the DR region. Async replication is done at the MySQL process level, so the replication direction for all schemas on a VM can only be either from region A to region B or vice versa, not a mix.

Suppose there are three SCM services: X, Y and Z. Service Z is used by both Storefront and SCM services, but it needs to be placed in region A with the Storefront services to avoid cross region latency.

After services X and Y are moved to region B, we would want async replication for X and Y from region B to region A, and async replication for Z from region A to region B.

So we did an exercise to identify all such schemas and move them onto different VMs (a sketch of the grouping is below).
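A minimal sketch of that exercise (Python, with hypothetical schema names): group schemas by the replication direction they will need after the move, so that every database VM ends up hosting schemas that all replicate the same way.

```python
from collections import defaultdict

# Hypothetical mapping: schema -> region where its owning service will run.
schema_home_region = {
    "service_x_db": "B",   # SCM service moving to region B
    "service_y_db": "B",   # SCM service moving to region B
    "service_z_db": "A",   # shared service staying in region A with Storefront
}

# Replication flows from the schema's home region to the other region, and a
# MySQL/MongoDB process replicates all its schemas in a single direction.
groups = defaultdict(list)
for schema, home in schema_home_region.items():
    direction = f"{home} -> {'B' if home == 'A' else 'A'}"
    groups[direction].append(schema)

for direction, schemas in groups.items():
    print(f"VM group replicating {direction}: {schemas}")
# Schemas in different groups must live on different database VMs.
```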

Whitelisting of third party services

Myntra services use many third party SaaS providers.

We use IP whitelisting for incoming traffic from these providers, and they also use IP whitelisting for incoming traffic from Myntra.

We asked all our SaaS partners to also whitelist Myntra's region B gateway IPs, so that if Myntra's traffic to them comes from either region, there are no connectivity issues.

X.X.X.X is the forward proxy gateway's IP range for region A

Y.Y.Y.Y is the forward proxy gateway's IP range for region B

Z.Z.Z.Z is the forward proxy gateway's IP range for the third party SaaS provider
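As an illustration of the whitelisting model (not our actual gateway configuration), the sketch below checks an incoming source IP against a CIDR allowlist using Python's standard ipaddress module; the ranges are placeholders standing in for the gateway IP ranges above.

```python
import ipaddress

# Placeholder ranges standing in for the X.X.X.X / Y.Y.Y.Y gateway IP ranges above.
ALLOWED_RANGES = [
    ipaddress.ip_network("10.10.0.0/24"),   # region A forward proxy gateway (hypothetical)
    ipaddress.ip_network("10.20.0.0/24"),   # region B forward proxy gateway (hypothetical)
]

def is_whitelisted(source_ip: str) -> bool:
    """Return True if the source IP falls inside any allowed gateway range."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in net for net in ALLOWED_RANGES)

print(is_whitelisted("10.20.0.17"))   # True: traffic egressing via region B's gateway
print(is_whitelisted("192.0.2.1"))    # False: unknown source, would be rejected
```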

DNS Setup

We kept DNS names the same for services in both regions. For service X, x.myntra.com would resolve to service X in either region.

Below are the different use cases for the DNS setup (a resolution sketch follows this list):

  • Service X is placed only in region A but is called by services in both regions. x.myntra.com from region A resolves to service X's LB. From region B it resolves to region B's forward proxy, and region A's reverse proxy then forwards the request to service X's LB.
  • Service X is set up in both regions in active-active mode. x.myntra.com resolves to the local service X LB in each region.
  • Service X is set up in both regions, with region A having both reads and writes enabled and region B having only reads enabled. To handle this use case, we created new DNS names with read/write prefixes: r-x.myntra.com resolves to the read-only service instance and rw-x.myntra.com resolves to the read-write service instance. From region A, both r-x.myntra.com and rw-x.myntra.com resolve to the local service X LB. From region B, r-x.myntra.com resolves to the local service X LB, while rw-x.myntra.com resolves to region B's forward proxy, and region A's reverse proxy then forwards the request to service X's LB.
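A minimal sketch of the split-horizon resolution described in the last bullet (Python; service names and LB/proxy targets are illustrative):

```python
# Split-horizon DNS view for the read/write-prefixed names described above.
# Targets are illustrative; "fwd-proxy.region-b" means the call leaves region B
# via its forward proxy and region A's reverse proxy forwards it to the real LB.
RESOLUTION = {
    ("r-x.myntra.com", "A"):  "service-x-lb.region-a",
    ("rw-x.myntra.com", "A"): "service-x-lb.region-a",
    ("r-x.myntra.com", "B"):  "service-x-lb.region-b",   # local read replica
    ("rw-x.myntra.com", "B"): "fwd-proxy.region-b",      # writes hop back to region A
}

def resolve(name: str, caller_region: str) -> str:
    return RESOLUTION[(name, caller_region)]

print(resolve("rw-x.myntra.com", "B"))   # fwd-proxy.region-b
```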

Object storage

Myntra uses cloud object storage to store images, videos and files.

Our object storage accounts were in region A. We were OK with the write latency, since writes are generally asynchronous due to large payload sizes, and there were very few write interactions on the Storefront order taking path, with a very low write frequency of a couple of requests a day.

Storefront use cases serve images and videos from object storage via the CDN, so the read path via the CDN is not impacted by whichever region hosts the object storage accounts.

We could have migrated the object storage accounts for the services moving to region B, but we didn't, because there was no big downside apart from the cross region network traffic cost on the write path. We were OK with incurring that cost for the time being.

Initial production state

Once an order is captured, it flows via the messaging platform, i.e. Airbus, for fulfilment through the SCM services pipeline.

ERP portals are accessed by employees via a private gateway and by partners via a public gateway.

Transactional data gets ingested into the data warehouse via dedicated replicas of the source databases.

Stages for production traffic rollout in region B

Validation stage

Provisioning of SCM services in region B, functional validation and load testing

Pre rollout stage

Preparation steps for rollout

Production traffic rollout stage

Enable live order processing in region B instead of region A

Validation setup

  • Cross region network connectivity infra was set up. MPLS connectivity from offices and warehouses/hubs to region B was also set up.
  • We set up the SCM services, i.e. the candidate services identified for movement, in region B.
  • For the MySQL and MongoDB datastore setup, we created fresh replicas with production state, then cut off replication and enabled writes on them for the validation phase.
  • For Cassandra, we had an active-active setup running in both DCs, so applications started using the local cluster in region B for validation.
  • For Elasticsearch, we recreated the cluster from region A snapshots.
  • To enable a test path for order fulfilment, SCM services added a feature gate that let them process live production traffic, test traffic, or both. In region A, we set the feature gate to process only live production traffic, and in region B, only test traffic (see the sketch after this list).
  • For the UI/ERP portals of SCM services in region B, we created separate DNS names during the validation phase.
  • After functional testing, we did a couple of vertical and horizontal load tests to certify the setup.
  • We also whitelisted the third party SaaS gateway IPs during the validation stage.
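A minimal sketch of the traffic feature gate mentioned above (Python; the gate values and order fields are illustrative, not the actual SCM implementation):

```python
from enum import Enum

class TrafficMode(Enum):
    LIVE = "live"    # process only live production orders
    TEST = "test"    # process only synthetic validation orders
    BOTH = "both"

# Per-region gate values during the validation phase (illustrative).
REGION_GATE = {"A": TrafficMode.LIVE, "B": TrafficMode.TEST}

def should_process(order: dict, region: str) -> bool:
    """Decide whether this region's SCM pipeline should process the order."""
    gate = REGION_GATE[region]
    if gate is TrafficMode.BOTH:
        return True
    is_test_order = order.get("is_test", False)
    return is_test_order == (gate is TrafficMode.TEST)

print(should_process({"id": 1, "is_test": True}, "B"))   # True: test order in region B
print(should_process({"id": 2, "is_test": False}, "B"))  # False: live orders stay in region A
```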

Pre production traffic rollout activities in region B

We planned to do the production rollout, i.e. enable live order fulfilment processing in region B instead of region A, during the night.

We started the pre production rollout activities after all our testing and certification was complete. This was done one day before the actual production traffic rollout.

  • We reduced the DNS TTL from 30 minutes to 1 minute. This was required so that DNS changes would reflect across clients almost immediately.
  • For MySQL and MongoDB in region B, we re-enabled the sync from region A's production data and kept them in read-only mode.
  • Redis and Elasticsearch are used for caching and search; they are not source-of-truth databases. We recreated the state in these stores from the source databases locally in region B.

D day — Production traffic rollout

For the rollout during the night, we took a couple of hours of downtime for our ERP systems, with some impact on warehouse processing during this window.

During the rollout, there was no impact on B2C users interacting with the Storefront.

Production rollout run-book:

  • Shut down cron jobs in region A. Some services use cron jobs for reconciliation and failure handling; these had to be shut down before shutting down the services in region A.
  • Put the maintenance banner on ERP apps.
  • Disable the SCM order processing pipeline, i.e. order taking kept working but order processing was disabled.
  • Drain all the messages from Airbus, i.e. wait till all in-flight messages in the processing pipeline are processed (a drain-check sketch follows this run-book).
  • After this, the SCM systems in region A were not receiving any traffic, and we shut down the SCM services in region A.
  • For MySQL and MongoDB databases in region B, we cut off the replication from region A to region B and reversed it, i.e. set up replication from region B to region A.
  • For Redis and Elasticsearch, application teams restored the incremental state changes since the previous restore point, i.e. one day before.
  • Then we did one round of sanity testing for our applications.
  • Once everything was working fine, we changed the DNS for the ERP apps to the region B gateway, removed the maintenance banner and enabled the fulfilment pipeline by re-enabling order processing.
  • Once production traffic started flowing through SCM in region B, we enabled the cron jobs in region B.
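Referenced from the drain step above: a minimal sketch of how the "all in-flight messages processed" check can be done, again assuming a Kafka-like broker behind Airbus (consumer lag per partition dropping to zero); broker addresses, topic and group names are hypothetical.

```python
import time
from kafka import KafkaConsumer, TopicPartition

# Assumes a Kafka-like broker behind Airbus (illustrative, not Airbus' actual API).
consumer = KafkaConsumer(
    bootstrap_servers="broker.region-a.internal:9092",
    group_id="scm-fulfilment-pipeline",          # the consumer group being drained
    enable_auto_commit=False,
)

def total_lag(topic: str) -> int:
    """Sum of (end offset - committed offset) across all partitions of the topic."""
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += end_offsets[tp] - committed
    return lag

# Block the run-book until the pipeline has fully drained.
while total_lag("order-events") > 0:
    time.sleep(10)
print("Airbus drained; safe to shut down SCM services in region A")
```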

Outcome

With this, we built core multi region infrastructure capabilities: cross region connectivity, connectivity from company premises to both regions, and connectivity for third party interactions. Core platform services were also enabled in both regions, serving production traffic.

The whole order fulfilment pipeline, comprising more than 200 services, was moved to another region successfully.

Learnings and way forward

Currently our tech stack runs on VMs. For some services, the master images were not at parity with production because some packages had been installed along with new releases. We identified these during validation, which made the validation phase take more time. Our plan is to move to containers, which will give us true portability.

We need to build a service repository with extensive metadata about each service: its components, the profile of each component, DNS configuration, HAP routes, public vs private whitelisting rules and so on. We will use a single canonical service repository ID across all platform services. This would help us fully automate service bootstrapping in any region, or lift and shift of any service from one region to another on demand.
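A minimal sketch of what such a service repository record could look like (Python dataclass; the fields simply mirror the metadata listed above plus a canonical ID, and are not an existing internal schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ServiceRecord:
    """One canonical entry in the service repository (illustrative schema)."""
    canonical_id: str                    # single ID used across all platform services
    components: List[str]                # e.g. app, worker, cron
    component_profiles: Dict[str, str]   # component -> sizing/profile name
    dns_names: List[str]
    hap_routes: List[str]                # HAP routes, as listed above
    public_whitelist: List[str] = field(default_factory=list)
    private_whitelist: List[str] = field(default_factory=list)

svc = ServiceRecord(
    canonical_id="scm-logistics-svc",            # hypothetical service
    components=["app", "cron"],
    component_profiles={"app": "8c-16g", "cron": "2c-4g"},
    dns_names=["logistics.myntra.com"],
    hap_routes=["/logistics/*"],
)
# With this metadata, bootstrapping the service in any region (or lifting and
# shifting it between regions) can be driven entirely by automation.
print(svc.canonical_id, svc.dns_names)
```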

We are working on the next phases of our BCP/DR journey, where we will incorporate these learnings. So stay tuned for future updates.

Credits

Thanks to the Myntra engineering leadership for support, the program team for planning, and the SRE, DBA and SCM teams for execution with cross team collaboration.
