IDEA 2.0 — A look at scalable Micro-Service Architecture

Abhishek Jain
Published in Myntra Engineering · Aug 29, 2021 · 16 min read

In this article, we will talk about how we re-architected our Identity and Authentication service into a micro-service architecture to achieve much better reliability and scale for the next 10 years. The IDEA system was a one-stop solution for all authentication, profile and token needs, which bottlenecked the service in many ways. There was scope to build clarity around which functions the service should cater to, as well as to define clear domain boundaries for it. We will talk in detail about how we broke the service into different domains and how we were able to scale it by many folds.

What is IDEA and what does it do at Myntra?

IDEA is a one-stop solution for all identity and authentication needs at Myntra. Any service or client that needs an authenticated user to perform its operations connects with IDEA, creates or uses an existing user account, authenticates whether the user is valid, and then performs its operations.

Capabilities of IDEA

  1. Authentication (Not Authorisation)
  2. Session management
  3. User profile data access
  4. Account security components
  5. Whitelisting & 2FA

Terminologies

  • OMS — Order Management Service
  • Gateway — Myntra Public API Platform
  • Knuth — Request Processor to Gateway
  • DS — Data Science Models
  • COD — Cash on Delivery
  • PII — Personally Identifiable Information
  • AT/RT — Access Token / Refresh Token

Current Architecture

All the capabilities of IDEA are defined under one single service. A better picture of this can be seen in the high-level architecture diagram below:

As we can see, a single service acts as the token manager, login handler, user profiler, information retriever and much more. There is no clearly defined boundary of what the service is supposed to do. The database is clogged with all types of data, which bottlenecks retrieval of the main account data at high throughput.

Current Client & Tenant Structure

  1. Myntra (app, web) is a tenant
  2. Internal Myntra applications are all clients
  3. Tenant status management
  4. Client status management
  5. Client wise session management

Why do we want to re-architect?

  • Availability challenges — IDEA has multiple use cases and data stores; a bottleneck in the user profile fetching flow impacts the authentication flow
  • Scalability challenges — we observed a lack of scale during HRD days, when we were not able to serve user profile info to different teams
  • Legacy system — no clear segregation of tenants/clients and their functions
  • Confusing and unintuitive implementation of SSO
  • Lack of a scalable data model design — getting a user profile makes 12 MySQL queries + 1 Cassandra query (profile images)
  • Existing flows are heavy on database operations and response payload size, so even caching all the users won't solve the problem
  • Device authentication bypasses Knuth, changing our read-write pattern and making it less scalable
  • Using Cassandra in quorum mode — forcing an AP-oriented DB to act like a CP DB

Re-Architecture Objectives

  1. Separation of concern with Individual micro services
  2. Defined responsibility and scope of the service
  3. Independent scaling of each micro-service
  4. Very high availability (99.99%) and Scalability (Million RPM)
  5. Segregation of Myntra consumer app users from INSIDE application users
  6. Device authentication
  7. API security and control of usage (Authorisation, Rate Limiter and Circuit Breaker)
  8. PII data protection

Metrics

  1. High Availability — 99.99% (avg. Myntra services availability is 99.95%)
  2. Certify for 1M RPM
  3. Reduce customer escalations by 80%
  4. Reduce customer login issues by 50%

Our Solution

We decided to divide the IDEA service into 4 main micro-services, catering to each part with a different view of scale and reliability. These 4 systems cater to the following:

  • User Account Service — owns the user attributes required for account creation and management at Myntra. The account service stores user credentials, primary/secondary emails/phones, gender, age and other attributes (the full list can be found in follow-up sections). All these attributes are at the account level and do not hold any other domain/service's information. This service also manages the different states of an account, such as active, deleted and blocked.
  • OTP Service — a generic service which caters to different use-cases of sending and verifying OTPs. OTPs can be of different lengths and sent over different channels.
  • Token Service — a service to create, delete and refresh authentication tokens. This is used in the Sign-In, Sign-Up, Sign-Out and Secure Refresh flows.
  • User Profile Service — a service for keeping user profiles, including multiple profiles for a user under a single account.

All these 4 systems have different scale requirements and reliability aspects, and we'll cater to each of them one by one, scaling and revamping each part differently.

High Level Architecture

Typical Flow of IDEA

This is how we imagined a typical Authentication flow at IDEA would be catered to using these micro-services:

As of now the Profile Service is not active and is in the design phase, but the other 3 services are live and active. We will talk about the Profile Service in later posts.
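To make this flow concrete before diving into each service, here is a minimal sketch of how a gateway-side handler might stitch the three live services together for an OTP-based sign-in. The interfaces, method names and the "find or create account" step are illustrative assumptions, not the actual IDEA APIs.

```java
// Hypothetical orchestration of the three live services for an OTP-based sign-in.
// Service interfaces and method names are illustrative assumptions, not the real IDEA APIs.
public class OtpSignInFlow {

    interface OtpService {
        String sendOtp(String phone, String useCase, String clientId);   // returns a URT
        boolean verifyOtp(String urt, String otp);
    }

    interface AccountService {
        String findOrCreateAccount(String phone, String tenant);         // returns a uidx
    }

    interface TokenService {
        String issueTokens(String uidx, String clientId);                // returns an AT/RT pair
    }

    private final OtpService otpService;
    private final AccountService accountService;
    private final TokenService tokenService;

    public OtpSignInFlow(OtpService otp, AccountService account, TokenService token) {
        this.otpService = otp;
        this.accountService = account;
        this.tokenService = token;
    }

    /** Step 1: the user asks for an OTP; the URT ties the later verification to this request. */
    public String requestOtp(String phone) {
        return otpService.sendOtp(phone, "SIGN_IN", "gateway");
    }

    /** Step 2: verify the OTP, resolve the account, and hand back fresh tokens. */
    public String completeSignIn(String urt, String otp, String phone) {
        if (!otpService.verifyOtp(urt, otp)) {
            throw new IllegalArgumentException("OTP verification failed");
        }
        String uidx = accountService.findOrCreateAccount(phone, "myntra_tenant");
        return tokenService.issueTokens(uidx, "gateway");
    }
}
```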

Now we will look at these 3 micro-services, their architecture and the design choices we made, in detail.

Account Service

The User Account Service owns the user attributes required for account creation and management at Myntra. It stores user credentials, primary/secondary emails/phones, gender, age and other attributes (the full list can be found in the following sections). All these attributes are at the account level and do not hold any other domain/service's information. The service also manages the different states of an account, such as active, deleted and blocked. Since multiple users may share a single account and different sources can contribute profile-level information, a separate Profile Service will handle the different profiles under an account.

Actions

  • Create a new user Account, with user information and credentials.
  • Authenticate users using credentials.
  • Return user information by userId.
  • Verify user email and phone number.
  • Update user information.
  • Send Recovery Email and OTP.
  • Update user status to blocked, active and deleted.

High Level Design

Multi-Tenancy

In the Account Service, we decided to implement multi-tenancy at the database level itself.

MasterDB

  1. MasterDB will contain all tenant and client configs
  2. This allows us to control different tenants in one place.
  3. Tomorrow, if different tenants are scaled on different servers using the second deployment option, MasterDB becomes the central connector between all tenants and clients.
  4. Any password policy or common configs can be defined at the global level
  5. If changes are needed across all tenants, they can be made in one central place and loaded into the application from MasterDB.

TenantsDB

  1. Myntra will be one single tenant for App/Web consumers, thus one DB named myntra_tenant.
  2. All the internal apps fall under a separate tenant called INSIDE, thus one DB named inside_tenant.
  3. Other tenants will be deprecated.
  4. In future, if any additional tenant is introduced, it will be added as a new DB to existing DB cluster.

Tenant Functions

  1. Tenant-wise, a user is scoped to that tenant only
  2. User status will be one of the following: ACTIVE, SUSPENDED, BLACKLISTED, TEMP_BLOCK, DEACTIVATED
  3. User status will be managed at the tenant level

Security & Scaling capabilities

  1. Every tenant will have its own DB, so data is segregated and more secure
  2. Every tenant can be deployed on different MySQL servers
  3. Every tenant's requests can be scaled independently
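To make the database-level multi-tenancy concrete, below is a minimal sketch of routing a request to its tenant's database. The tenant names mirror the DBs above (myntra_tenant, inside_tenant), while the JDBC URLs, credentials and the idea of loading the mapping from MasterDB at startup are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;

// Minimal sketch of database-level multi-tenancy: each tenant maps to its own DB
// (and potentially its own MySQL server), so data stays segregated and each tenant
// can be scaled independently. URLs and credentials are illustrative assumptions.
public class TenantDataSourceRouter {

    // In the real system this mapping would be loaded from MasterDB at startup.
    private static final Map<String, String> TENANT_DB_URLS = Map.of(
            "myntra", "jdbc:mysql://myntra-db-host:3306/myntra_tenant",
            "inside", "jdbc:mysql://inside-db-host:3306/inside_tenant"
    );

    public Connection connectionFor(String tenant) throws SQLException {
        String url = TENANT_DB_URLS.get(tenant);
        if (url == null) {
            throw new IllegalArgumentException("Unknown tenant: " + tenant);
        }
        // Separate credentials per tenant keep the blast radius of a leak small.
        return DriverManager.getConnection(url, tenant + "_app_user", "secret");
    }
}
```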

Clients

There are two types of clients that will exist in the system.

Authentication Feature Seekers

Clients

  1. Gateway is a single client for tenant Myntra
  2. The INSIDE tenant has internal Myntra clients — for example Warehouse, Logistics, Delivery, Seller Portals
  3. 3rd party tools will be clients with some custom AT/RT

Functions

  1. One tenant can have multiple clients, but all clients are bound by the basic authentication rules and PII data protection rules.
  2. Each client has the freedom to choose the following:
     • Customised AT/RT fields, e.g. first name as full name
     • Policy for session management — transient/static
     • AT/RT expiration
     • Enable/disable and maximum number of concurrent sessions allowed
  3. Callouts
     • User-client mapping and status will not be managed in IDEA; this already exists in the Security service (Authorisation Service).
     • Internal/Seller/3rd-party user life cycle management will be done via the Security service, linked with IDEA via API.

User Account Detail Seekers

Clients

  1. Client 1 — needs mostly gender
  2. Client 2 — needs first name
  3. Client 3 — needs email/phone to send communications

Functions

  1. IDEA 2.0 will use a clientID & API key to enable account detail access.
  2. Every client will be registered with a user object template, and it will receive only those fields which are registered against that client.
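A small sketch of how such per-client templates could be enforced is shown below: a client registered only for gender never receives email or phone. The client ids, field names and in-memory registry are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of per-client field filtering: every client is registered with a user object
// template, and only the fields in that template are returned. Client ids, field names
// and the in-memory registry are illustrative assumptions.
public class ClientFieldFilter {

    private final Map<String, Set<String>> clientTemplates = Map.of(
            "client-1", Set.of("gender"),
            "client-2", Set.of("firstName"),
            "client-3", Set.of("email", "phone")
    );

    /** Returns only the attributes the given client is registered to receive. */
    public Map<String, Object> filterForClient(String clientId, Map<String, Object> userAccount) {
        Set<String> allowed = clientTemplates.getOrDefault(clientId, Set.of());
        Map<String, Object> response = new HashMap<>();
        for (Map.Entry<String, Object> entry : userAccount.entrySet()) {
            if (allowed.contains(entry.getKey())) {
                response.put(entry.getKey(), entry.getValue());
            }
        }
        return response;
    }
}
```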

Database

One of the most critical aspects of designing the system is the database choice and its modelling.

Our objective

  1. Strong consistency
  2. Highly reliable, with in-house expertise to manage backup, recovery and migration
  3. Multi-key and group querying capability
  4. Secondary indexing is required

DB choice thought process

  • We did not consider Cassandra/Aerospike because we wanted strong consistency.
  • We considered HBase, but our data volume (a few GBs) is not large enough to justify it, and HBase doesn't support secondary indexing.
  • We considered MongoDB, but our access pattern is 90–10 reads to writes, so we would not be using MongoDB to its strengths, whereas MySQL is well proven in the industry as reliable and consistent.
  • We also considered Vitess over MySQL to implement sharding, but based on our data sizing and yearly growth it did not look fruitful; even if the data grows to a few TBs in the next 10 years, a single MySQL master can take the load.
  • We eventually found MySQL to be the best fit for our requirements.

Why MySQL

  1. Industry proven: more than 20 years as a highly reliable and consistent database, with a huge amount of collective experience in data management, HA, DR, backup and migration.
  2. The current write load is a few thousand RPM per server; with MySQL we can easily scale to 10 times that, even with a single master.
  3. Our read/write pattern is a 90–10 ratio, which suits MySQL well; around 70% of reads will be served from Redis and only 30% will go to MySQL.
  4. MySQL 8 also provides InnoDB Cluster, which uses Group Replication to support quorum-style reads and writes with strong consistency across multiple masters; switching back to plain vanilla MySQL is just a config change.

MySQL — InnoDB cluster

We were planning to use MySQL 8, and at the same time we were evaluating the multi-master cluster model; we found there was a good case for using it.

Group Replication

Group Replication makes virtually synchronous replication among the nodes of a group a reality, whereas the classic MySQL replication feature is asynchronous (or at most semi-synchronous). This allows better high-availability guarantees, because transactions are delivered to all members in the same order (even though each member applies them at its own pace after they are accepted).

Group Replication does this via a distributed state machine with strong coordination among the servers assigned to a group. This communication allows the servers to coordinate replication automatically within the group. More specifically, groups maintain membership, so that data replication among the servers is consistent at any point in time. If servers leave the group and are later added back, consistency is re-established automatically. There is also a failure-detection mechanism for servers that go offline or become unreachable. The figure below shows how Group Replication is used with our applications to achieve high availability.

InnoDB Cluster

InnoDB Cluster is designed to make high availability easier to set up, use and maintain. It works with the AdminAPI (via MySQL Shell), Group Replication and MySQL Router to take high availability and read scalability to a new level. That is, it combines InnoDB's data-cloning features with Group Replication, MySQL Shell and MySQL Router to provide a new way to set up and manage high availability.

Advantages:

  • The cluster is set up with a single primary (think master in standard replication parlance), which is the target for all writes (updates).
  • Multiple secondary servers (replicas) maintain copies of the data, which can be read from without burdening the primary, enabling read-out scalability (though all servers participate in consensus and coordination).
  • The incorporation of Group Replication means the cluster is fault tolerant and group membership is managed automatically.
  • MySQL Router caches the metadata of the InnoDB Cluster and performs high-availability routing to the MySQL server instances, making it easier to write applications that interact with the cluster.

As this cluster provided better scalability and fault tolerance, we decided to go with InnoDB Cluster instead of a single-master, multiple-replica setup.
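With InnoDB Cluster, application traffic goes through MySQL Router rather than to a fixed primary. The sketch below illustrates the idea of splitting writes and reads across the Router's read-write and read-only ports (6446 and 6447 in a default Router installation); the host name, schema and credentials are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of how an application talks to an InnoDB Cluster through MySQL Router.
// In a default Router installation, port 6446 routes to the current primary
// (read-write) and 6447 to the secondaries (read-only); Router re-routes
// automatically on failover. Host and credentials are illustrative assumptions.
public class ClusterConnections {

    private static final String ROUTER_HOST = "mysql-router.idea.internal";

    /** Writes go to the primary via the Router's read-write port. */
    public static Connection readWrite() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:mysql://" + ROUTER_HOST + ":6446/myntra_tenant", "idea_app", "secret");
    }

    /** Reads that can tolerate replica lag go to the secondaries via the read-only port. */
    public static Connection readOnly() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:mysql://" + ROUTER_HOST + ":6447/myntra_tenant", "idea_app", "secret");
    }
}
```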

DB modeling highlights

  1. Every tenant will have its own schema, most likely similar across tenants but possibly different based on need
  2. For now, there will be 3 DB schemas:
  3. Master — contains tenants & clients
  4. Myntra — all Myntra consumers
  5. INSIDE — Logistics, Warehouse, Sellers and all internal users
  6. A single user table to fetch user details
  7. User credentials — one-way hashed
  8. Social links as a separate table — might be deprecated in future and not required by the user profile requirement
  9. Image records — moved from Cassandra into MySQL and served on demand
  10. Email, phone & social archive tables
  11. URT — to maintain multi-request transactions at the database level
  12. Recovery request — For recovering accounts
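To illustrate the single user table (point 6) and the single-query access discussed in the next section, here is a hedged sketch of a minimal schema and its one-query profile fetch; the actual column names and types in IDEA may differ.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the "single user table" idea: the main profile attributes live in one row,
// so the hot GetUserByUIDX path needs a single indexed query instead of a dozen joins.
// Column names and types are assumptions based on the fields mentioned in this post.
public class SingleUserTable {

    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS user (" +
            "  uidx VARCHAR(64) PRIMARY KEY," +
            "  first_name VARCHAR(100)," +
            "  last_name VARCHAR(100)," +
            "  gender VARCHAR(16)," +
            "  email VARCHAR(255)," +
            "  phone VARCHAR(32)," +
            "  registered_on TIMESTAMP," +
            "  status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE'" +
            ")";

    public static void createSchema(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(CREATE_TABLE);
        }
    }

    /** The hot path: one primary-key lookup returns the whole main profile. */
    public static void printMainProfile(Connection conn, String uidx) throws SQLException {
        String sql = "SELECT first_name, last_name, gender, email, phone, registered_on " +
                     "FROM user WHERE uidx = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, uidx);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getString("first_name") + " " + rs.getString("last_name"));
                }
            }
        }
    }
}
```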

Answering some of the whys for DB Schema

  • Why single user table?

To get user account data, it used to take several joins to assemble the data (phone, email, profile data, image data, etc.), which hurt performance a lot.

With that model we were never able to scale the access APIs and always hit a DB bottleneck. With a single user table, the main profile data (FirstName, LastName, Gender, Email, Phone, RegistrationOn) is available in a single query, drastically improving performance.

  • Why social link as a separate table?

Social links can be of two types, Facebook and Google; if we kept them in the same table, there would be multiple entries for the same uidx, which we didn't want.

We might remove social links in the future, so dropping the table itself would be much easier than dropping a column.

Social links are not part of the main profile data, so they are not queried that frequently, and thus there are minimal queries over the join of user + social links.

  • What’s the purpose of archive tables?

The archive tables keep the history of emails or phone numbers when the user decides to change them.

This is for logging purposes and historical behaviour of the user.

As of now we use it to send communications when a user's number changes, sending comms over both the new number (user table) and the old number (archive).

Tomorrow, this can be used to perform operations on numbers that were once verified for a user but are now unlinked.

  • What is the URT table meant for?

URT stands for Unique Request Token; this unique token is stored against each request and remains valid until the token is closed or expires for a transaction.

This helps make multiple requests for a user (e.g. SendOTP → VerifyOTP → Signup) part of the same transaction, creating a sort of session between them so that multiple operations can be performed consistently across servers.

This also helps create isolation between different requests in case they originate from the same user.

  • What is recovery request?

Recovery Request is meant to generate Recovery Reset Links for the user when they try to reset their password.

This creates a reset key with a set expiration; once the reset-password request comes in, the reset key is used to validate the request and reset the user's password to recover the account.
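A minimal sketch of this recovery flow is shown below: a reset key is issued with a fixed expiration and honoured at most once. The in-memory store, 30-minute window and key format are illustrative assumptions; the real service persists recovery requests in the recovery request table described earlier.

```java
import java.security.SecureRandom;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the recovery-request idea: a reset key with a fixed expiration is issued,
// and a reset-password call is honoured only if the key is known and not yet expired.
// The in-memory store, 30-minute window and key format are illustrative assumptions;
// the real system persists recovery requests in MySQL.
public class RecoveryRequests {

    private record RecoveryRequest(String uidx, Instant expiresAt) {}

    private final SecureRandom random = new SecureRandom();
    private final Map<String, RecoveryRequest> pending = new ConcurrentHashMap<>();

    /** Creates a reset key for the user and records when it stops being valid. */
    public String createResetKey(String uidx) {
        byte[] bytes = new byte[32];
        random.nextBytes(bytes);
        String resetKey = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        pending.put(resetKey, new RecoveryRequest(uidx, Instant.now().plus(30, ChronoUnit.MINUTES)));
        return resetKey;
    }

    /** Validates the reset key; returns the uidx whose password may now be reset, or null. */
    public String validateResetKey(String resetKey) {
        RecoveryRequest request = pending.get(resetKey);
        if (request == null || Instant.now().isAfter(request.expiresAt())) {
            return null;
        }
        pending.remove(resetKey); // single use
        return request.uidx();
    }
}
```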

Cache Management

User id to UserInfo Cache

The API to get UserInfo from a user id needs to serve the user path at 20k RPS with very tight latency requirements. A fast in-memory store is more efficient in this case. We do not rely on the MySQL data cache, as a complex SQL query can very easily pollute it.

We maintain a separate Redis cluster to answer these queries. As write operations are minuscule compared to reads, we maintain consistency between the cache and the DB in the write path.

Cache Management for Access Flows

  • The aim is to serve 95% of calls in the GETUserByUIDX flow from cache, and the remaining 5% from the MySQL DB.
  • Cache storage: key = prefix + ID

How and what data is going to be cached?

  • Data from the MySQL nodes, with a join over the user and image_entries data.
  • The entire result of the join between tables is cached, and any custom views are served from that cached data.
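Putting the cache pieces together, here is a minimal cache-aside sketch of the GETUserByUIDX path using Jedis: read from Redis with a prefixed key, fall back to the DB loader on a miss, and rewrite the cache on the write path to keep the two consistent. The key prefix, TTL and JSON serialisation are assumptions.

```java
import java.util.function.Function;
import redis.clients.jedis.Jedis;

// Cache-aside sketch of the GETUserByUIDX path: serve from Redis when possible,
// fall back to MySQL on a miss, and refresh the cache on the write path so the
// two stay consistent. Key prefix, TTL and JSON serialisation are assumptions.
public class UserInfoCache {

    private static final String KEY_PREFIX = "userinfo:";
    private static final int TTL_SECONDS = 24 * 60 * 60;

    private final Jedis redis;
    private final Function<String, String> dbLoader;   // uidx -> user JSON from MySQL

    public UserInfoCache(Jedis redis, Function<String, String> dbLoader) {
        this.redis = redis;
        this.dbLoader = dbLoader;
    }

    /** Read path: aim to answer ~95% of calls from Redis and the rest from MySQL. */
    public String getUserByUidx(String uidx) {
        String key = KEY_PREFIX + uidx;
        String cached = redis.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = dbLoader.apply(uidx);
        if (fromDb != null) {
            redis.setex(key, TTL_SECONDS, fromDb);
        }
        return fromDb;
    }

    /** Write path: after MySQL is updated, rewrite the cached entry in the same flow. */
    public void onUserUpdated(String uidx, String updatedUserJson) {
        redis.setex(KEY_PREFIX + uidx, TTL_SECONDS, updatedUserJson);
    }
}
```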

OTP Service

OTP Service needs to be segregated from IDEA, for the following reasons:

  • OTP service is a generic service, which anyone can use to send OTPs
  • Having it inside IDEA makes it difficult for other services to call, and every service would need to create its own OTP sub-module.
  • Segregation of concerns

Actions

  • Transient Data store to store OTP temporarily
  • Send OTP API — To send the OTP
  • Verify OTP API — Verify Existing OTP for a user
  • Send OTP via SMS or Email
  • Generate OTP for different lengths
  • Generate OTP for different clients (Generic Service)
  • Same User can have OTPs for different use-cases and clients
  • Different TTL for different use-cases and clients
  • Per-client/use-case configuration to send a different OTP or the same OTP on resend within the TTL
  • Keep templates for sending OTPs per use-case
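Since OTP length varies per client and use-case, here is a small sketch of generating a numeric OTP of configurable length with SecureRandom; the supported length bounds are an assumption.

```java
import java.security.SecureRandom;

// Sketch of generating numeric OTPs of configurable length (per client / use-case).
// The supported length bounds are illustrative assumptions.
public class OtpGenerator {

    private final SecureRandom random = new SecureRandom();

    public String generate(int length) {
        if (length < 4 || length > 8) {
            throw new IllegalArgumentException("Unsupported OTP length: " + length);
        }
        StringBuilder otp = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            otp.append(random.nextInt(10));   // one digit at a time, so leading zeros are allowed
        }
        return otp.toString();
    }
}
```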

A Special Case — How to handle Concurrency?

  • There is a special case here: what if 2 calls come for the same phone number or email, both wanting an OTP (the one-user, multiple-devices case)?
  • In this case 2 OTPs will be generated, but we don't know which one is valid for which device.
  • To handle this case, the URT concept is introduced.
  • A URT (Unique Request Token) is generated on each Send OTP call.
  • During Verify OTP, the call carries the URT so we know which OTP should be validated for that request.
  • If 2 requests come with the same phone number or email, they will have different URTs, so we can identify which OTP belongs to which request.
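Below is a minimal sketch of this URT mechanism: each Send OTP call stores its OTP under a fresh URT, and Verify OTP resolves the OTP by URT rather than by phone number. The in-memory map and 5-minute TTL are assumptions standing in for the Mongo-backed store described next.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of URT-based handling of concurrent OTP requests: two Send OTP calls for the
// same phone number get two different URTs, and Verify OTP resolves the right OTP via
// the URT instead of the phone number. The in-memory store and 5-minute TTL are
// assumptions; the real service keeps this state in MongoDB.
public class UrtOtpStore {

    private record OtpEntry(String phone, String otp, Instant expiresAt) {}

    private final Map<String, OtpEntry> byUrt = new ConcurrentHashMap<>();

    /** Send OTP: store the OTP against a fresh URT and return the URT to the caller. */
    public String sendOtp(String phone, String otp) {
        String urt = UUID.randomUUID().toString();
        byUrt.put(urt, new OtpEntry(phone, otp, Instant.now().plus(5, ChronoUnit.MINUTES)));
        return urt;
    }

    /** Verify OTP: the URT identifies exactly which outstanding OTP this request refers to. */
    public boolean verifyOtp(String urt, String submittedOtp) {
        OtpEntry entry = byUrt.get(urt);
        if (entry == null || Instant.now().isAfter(entry.expiresAt())) {
            return false;
        }
        boolean ok = entry.otp().equals(submittedOtp);
        if (ok) {
            byUrt.remove(urt);   // an OTP is consumed once verified
        }
        return ok;
    }
}
```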

Database

We experimented with different databases here. The main requirement was high consistency, while also being partition tolerant to allow for data sharding and scalability in the future.

For this CP-oriented use case, MongoDB seemed to fit the bill perfectly. We experimented with HBase too, but HBase's lack of secondary index support, combined with MongoDB's native support for complex queries, tipped the decision in favour of MongoDB.

DB modelling

  • The configs are kept in the JVM cache.
  • The configs are reloaded every hour.
  • The TTL for OTP records is kept at a few months and should be configurable.
  • We keep a counter at the client level, which is reset every hour based on the created time, handled at the application level.
  • Created time is set when the counter is set for the first time or reset after an hour.
  • Phone is stored as ISD code + number.
  • The ISD code is also kept separately.

Following are the DB configs for a client, showing how it is registered in Mongo:

Following is the OTP collection, where the actual OTPs are stored:

The star-marked fields are all secondary indexes, used while querying data from the OTP collection.
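To give a feel for the OTP collection, here is a hedged sketch (using the MongoDB Java driver) of inserting an OTP document and declaring its TTL and secondary indexes; the field names are assumptions based on the modelling bullets above, not the exact schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Date;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the OTP collection using the MongoDB Java driver. Field names
// (urt, phone, isd, email, clientId, useCase, otp, createdAt) are assumptions based
// on the modelling bullets above; the exact schema in the OTP service may differ.
public class OtpCollectionSetup {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://otp-mongo:27017")) {
            MongoCollection<Document> otps = client.getDatabase("otp_service").getCollection("otp");

            // TTL index: Mongo expires documents some time after createdAt; the window is configurable.
            otps.createIndex(Indexes.ascending("createdAt"),
                    new IndexOptions().expireAfter(90L, TimeUnit.DAYS));

            // Secondary indexes used while querying the OTP collection.
            otps.createIndex(Indexes.ascending("urt"));
            otps.createIndex(Indexes.compoundIndex(
                    Indexes.ascending("phone"), Indexes.ascending("clientId"), Indexes.ascending("useCase")));

            // An OTP document: phone is stored as ISD code + number, with the ISD code kept separately.
            Document otpDoc = new Document("urt", "example-urt-1")
                    .append("phone", "+919999999999")
                    .append("isd", "+91")
                    .append("email", null)
                    .append("clientId", "gateway")
                    .append("useCase", "SIGN_IN")
                    .append("otp", "482913")
                    .append("createdAt", new Date());
            otps.insertOne(otpDoc);
        }
    }
}
```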

Token Service

For the Token Service, we decided to leverage the existing IDEA system for now. The token-related APIs were re-used from the older IDEA with minor changes.

This is going to be revamped in coming months into a new service, and we will have more updates on the same then.

Reaping the Benefits — Benchmark

Unless a system is correctly benchmarked, we would never know whether it reaped any considerable benefits.

We benchmarked various aspects of real-world simulations on the new system using Myntra’s in-house benchmark platform and we saw many promising results.

For login calls, we saw the same number of calls at half the latency of the older system, roughly doubling user responsiveness during on-boarding.

For user access details, we saw 10x the number of calls at the same latencies, allowing us to handle high peaks during HRDs with much more confidence. It also paved the way for new use-cases which were not possible before, such as real-time gender-based targeting.

For token refresh calls, we saw 3x the number of calls at the same latencies. This allows us to serve shorter AT sessions during HRDs without compromising users' session security.

Lastly, for the very first time during HRDs, it enabled us to serve bulk user access calls in real time, solving bulk use-cases such as personalised coupons for Shout & Earn campaigns during sale events (which had become almost impossible due to PII compliance requirements on personalised coupons).

What’s next?

Now that we have seen the new architecture and understood it in detail, there are a couple of things in scope for the future:

  • Token Service will be revamped to 2.0 standards.
  • A new profile service will be created to cater to User Profile needs of Myntra.

Data Migration

We have left out one important aspect of this whole re-architecture: data migration from the older IDEA to the new one. We will look at this in a separate blog post, as it was a project in itself and demands a detailed treatment.

Link: https://medium.com/myntra-engineering/idea-2-0-a-look-at-migration-from-older-idea-efc0c67c898f

Credits

The project was delivered by a team of motivated engineers. Our members are:

Abhishek Jain (Myself, Project Lead), Amarjeet Singh (Lead Architect), Prashant Kumar (Manager), Abhishyam Chennayipalem, Anshuman Kaushik, Sourav Prem, Vishvesh Oza, Venu Babu Narra (IDEA 1.0 Perspective), Pawan Gaur (DBA)

We shall keep you posted on the progress. Stay tuned! Thanks for the read. Comments welcome!
