IDEA 2.0 — A look at scalable Micro-Service Architecture
In this article, we will talk about how we re-architected our Identity and Authentication service into a micro-service architecture to achieve much better reliability and scale for the next 10 years. The IDEA system was a one-stop solution for all authentication, profile and token needs, which bottlenecked the service many times over. There was scope to build clarity on which functions the service should cater to, and to define domain boundaries for it. We will talk in detail about our approach: how we broke the service into different domains, and how we were able to scale it many times over.
What is IDEA and what does it do at Myntra?
IDEA is a one-stop solution for all Identity and Authentication needs of Myntra. Any service or client which needs an authenticated user to perform its operations connects with IDEA, creates or uses an existing user account, authenticates whether the user is valid, and then performs its operations.
Capabilities of IDEA
- Authentication (Not Authorisation)
- Session management
- User profile data access
- Account security components
- Whitelisting & 2FA
Glossary
- OMS — Order Management Service
- Gateway — Myntra Public API Platform
- Knuth — Request Processor to Gateway
- DS — Data Science Models
- COD — Cash on Delivery
- PII — Personally Identifiable Information
All the capabilities of IDEA are defined under one single service. A better picture of this can be seen in the high-level architecture diagram below:
As we can see, a single service acts as a token manager, login handler, user profiler, information retriever and much more. There is no clearly defined boundary of what the service is supposed to do, and the database is clogged with all types of data, which bottlenecks retrieval of the main account data at high throughput.
Current Client & Tenant Structure
- Myntra (app, web) is a tenant
- Internal Myntra applications are all clients
- Tenant status management
- Client status management
- Client wise session management
Why did we want to re-architect?
- Availability challenge — IDEA has multiple use cases and data stores, so a bottleneck in the user-profile-fetching flow impacts the Authentication flow
- Scalability challenges — We observed a lack of scale during HRD days, when we were not able to serve user profile info to different teams
- Legacy system — No clear segregation of tenants/clients and their functions
- Confusing and unintuitive implementation of SSO
- Lack of scalable data-model design — fetching a user profile makes 12 MySQL queries + 1 Cassandra query (profile images)
- Existing flows are heavy on database operations and response payload size, so even caching all the users won't solve the problem
- Device authentication bypasses Knuth, changing our read-write pattern and making it less scalable
- Using Cassandra in quorum mode — forcing an AP-oriented DB to act like a CP DB
- Separation of concern with Individual micro services
- Defined responsibility and scope of the service
- Independent scaling of each micro-service
- Very high availability (99.99%) and Scalability (Million RPM)
- Segregation of Myntra consumer app users from INSIDE application users
- Device authentication
- API security and control of the usages (Authorisation, Rate Limiter and Circuit Breaker)
- PII data protection
- High Availability — 99.99% (avg. Myntra services availability is 99.95%)
- Certify for 1M RPM
- Reduce customer escalations by 80%
- Reduce customer login issues by 50%
We decided to divide the IDEA service into 4 main micro-services, catering to each part with a different view of scale and reliability. These 4 systems cater to the following:
- User Account Service — Owns the user attributes required for account creation and management at Myntra. The Account Service stores user credentials, primary/secondary emails/phones, gender, age, and other attributes (the full list can be found in follow-up sections). All these attributes are at the account level and do not hold any other domain's or service's information. This service also manages the different states of an account, such as active, deleted and blocked.
- OTP Service — A generic service which caters to different use-cases of sending and verifying OTPs. OTPs can be of different lengths and sent over different channels.
- Token Service — A service to create, delete and refresh authentication tokens. This is used in the Sign-In, Sign-Up, Sign-Out and Secure Refresh flows.
- User Profile Service — A service for keeping user profiles, supporting multiple profiles for a user under a single account.
All these 4 systems have different kinds of scale requirements and reliability aspects and we’ll cater to each of them one-by-one, scaling and revamping each part differently.
High Level Architecture
Typical Flow of IDEA
This is how we imagined a typical Authentication flow at IDEA would be catered to using these micro-services:
As of now, the Profile Service is not active and is in the design phase, but the other 3 services are live and active. We will talk about the Profile Service in later posts.
Now we will look at these 3 micro-services, their architecture, and the design choices we made, in detail.
As described earlier, the User Account Service owns the account-level user attributes, credentials and account states. With the possibility of multiple users sharing a single account, and different sources contributing user-profile-level information, a separate Profile Service will understand the different profiles under an account.
- Create a new user Account, with user information and credentials.
- Authenticate users using credentials.
- Return user information by userId.
- Verify user email and phone number.
- Update user information.
- Send Recovery Email and OTP.
- Update user status to blocked, active and deleted.
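As a rough illustration of the capabilities above, here is a minimal in-memory sketch of an account service. The class, method names, statuses and signatures are assumptions for illustration only, not the production API:

```python
import hashlib
import uuid

# Minimal in-memory sketch of the Account Service capabilities listed above.
# Names, statuses and signatures are illustrative assumptions, not the real API.
VALID_STATUSES = {"ACTIVE", "BLOCKED", "DELETED"}

def _hash(password: str) -> str:
    # credentials are stored one-way hashed, never in plain text
    return hashlib.sha256(password.encode()).hexdigest()

class AccountService:
    def __init__(self):
        self._users = {}     # userId -> account record
        self._by_email = {}  # email -> userId

    def create_account(self, email, password, **attrs):
        """Create a new user account with user information and credentials."""
        user_id = str(uuid.uuid4())
        self._users[user_id] = {"email": email, "credential": _hash(password),
                                "status": "ACTIVE", **attrs}
        self._by_email[email] = user_id
        return user_id

    def authenticate(self, email, password):
        """Return the userId if credentials match an ACTIVE account, else None."""
        user_id = self._by_email.get(email)
        if user_id is None:
            return None
        user = self._users[user_id]
        if user["status"] != "ACTIVE" or user["credential"] != _hash(password):
            return None
        return user_id

    def get_user(self, user_id):
        """Return user information by userId."""
        return self._users.get(user_id)

    def update_status(self, user_id, status):
        """Move the account between active, blocked and deleted states."""
        if status not in VALID_STATUSES:
            raise ValueError(status)
        self._users[user_id]["status"] = status
```

Note how blocking an account immediately makes `authenticate` fail, mirroring the status management described above.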
High Level Design
In the Account Service, we decided to implement multi-tenancy at the database level itself.
- MasterDB will consist of all the tenants' and clients' configs
- This allows us to control different tenants in one place.
- If tomorrow different tenants are scaled on different servers (deployment option 2), the master DB becomes the central connector between all tenants and clients.
- Any password policy or common configs can be defined at a global level
- If any changes are needed across all tenants, they can be done in one central place and loaded into the application from MasterDB.
- Myntra will be one single tenant for App/Web consumers, thus one DB named myntra_tenant.
- All the internal apps under a separate tenant called INSIDE, thus one DB named inside_tenant.
- Other tenants will be deprecated.
- In future, if any additional tenant is introduced, it will be added as a new DB to existing DB cluster.
- A tenant's users are scoped to that tenant only
- User status will be one of the following: ACTIVE, SUSPENDED, BLACKLISTED, TEMP_BLOCK, DEACTIVATED
- User status will be managed at the tenant level
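The tenant model above can be sketched as a small routing layer: a MasterDB-style registry holds per-tenant configs, and each request resolves to its tenant's dedicated database. The host names, DSN format and config shape here are purely illustrative assumptions:

```python
# Sketch of database-level multi-tenancy: a MasterDB-style registry maps each
# tenant to its own DB, so requests are routed per tenant. The config shape,
# hosts and DSN format are illustrative assumptions.
MASTER_DB = {
    "tenants": {
        "MYNTRA": {"db": "myntra_tenant", "host": "mysql-a.internal"},
        "INSIDE": {"db": "inside_tenant", "host": "mysql-b.internal"},
    },
    # common configs (e.g. password policy) live once, at the global level
    "global": {"password_min_length": 8},
}

def resolve_tenant_dsn(tenant: str) -> str:
    """Return the connection string for a tenant's dedicated database."""
    cfg = MASTER_DB["tenants"].get(tenant.upper())
    if cfg is None:
        raise KeyError(f"unknown tenant: {tenant}")
    return f"mysql://{cfg['host']}/{cfg['db']}"
```

Adding a new tenant then becomes a new entry in the registry plus a new schema on the cluster, and each tenant's traffic can be moved to its own MySQL server independently.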
Security & Scaling capabilities
- Every tenant will have their own DB so data is segregated and more secure
- Every tenant can be deployed on different MySQL servers
- Every tenant's requests can be scaled independently
There are two types of clients that will exist in the system.
Authentication Feature Seekers
- Gateway is a single client for tenant Myntra
- The INSIDE tenant has internal Myntra clients — for example Warehouse, Logistics, Delivery, Seller portals
- 3rd-party tools will be clients with some custom AT/RT (access/refresh tokens)
- One tenant can have multiple clients, but all clients are bound to the basic authentication rules and PII data-protection rules.
- Clients will have the freedom to choose the following:
  - Customise AT/RT fields, e.g. first name as full name
  - Policy for session management — transient/static
  - Enable/disable and allow a max number of concurrent sessions
- User-client mapping and status will not be managed in IDEA — this already exists in the Security service (Authorisation Service).
- Internal/Seller/3rd-party user life-cycle management will be done via the Security service, linked with IDEA via API.
User Account Detail Seekers
- Client 1 — needs mostly gender
- Client 2 — needs first name
- Client 3 — needs email/phone to send communication
- IDEA 2.0 will use the clientID & API key to enable account-detail access.
- Every client will be registered with a user object template, and will receive only those fields which are registered against that client.
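The template idea can be sketched as a simple filter: each client's registered template is the set of fields it may receive, and everything else is stripped from the response. The client IDs and field names below are the hypothetical examples from the list above:

```python
# Sketch of per-client field templates: each client is registered with the set
# of user fields it may read, and every response is filtered to that set.
# Client IDs and field names are illustrative assumptions.
CLIENT_TEMPLATES = {
    "client-1": {"gender"},
    "client-2": {"firstName"},
    "client-3": {"email", "phone"},
}

def filter_user_for_client(client_id: str, user: dict) -> dict:
    """Return only the fields registered against this client."""
    allowed = CLIENT_TEMPLATES.get(client_id, set())
    return {k: v for k, v in user.items() if k in allowed}
```

An unregistered client receives an empty object, so PII never leaks to a client that has not declared a need for it.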
One of the most critical aspects of designing the system is the choice of database and its modelling. Our requirements were:
- Strong consistency
- Highly reliable and having expertise to manage backup, recovery, and migration
- Multi-key and group querying capability
- Secondary indexing is required
DB choice thought process
- We did not consider Cassandra/Aerospike because we wanted strong consistency.
- We did consider HBase, but our data is not huge enough to warrant it (ours is a few GBs), and HBase doesn't support secondary indexing.
- We did consider MongoDB, but our access pattern is 90-10 read-write, so we would not utilise MongoDB well, whereas MySQL comes well proven, reliable and consistent in the industry.
- We also considered Vitess over MySQL to implement sharding, but based on data sizing and yearly growth we found it was not very fruitful. Even if the data grows to a few TBs in the next 10 years, a single MySQL master can take the load.
- We eventually found MySQL best suited to our requirements:
- Industry proven: more than 20 years as a highly reliable and consistent database, with a huge amount of industry experience in data management, HA, DR, backup and migration.
- The current write load is a few thousand RPM per server; with MySQL we can easily scale to 10 times that, even with a single master.
- A 90-10 read-write pattern is more suitable for MySQL, where ~70% of reads will be covered by Redis and only ~30% will go to MySQL.
- MySQL 8 also provides InnoDB Cluster (Group Replication) to support quorum-like reads and writes with strong consistency across multiple masters; switching back to plain vanilla MySQL is just a config change.
MySQL — InnoDB cluster
We were planning to use MySQL 8, and at the same time we evaluated the multi-master cluster model and figured out there was a good possibility of using it.
Group Replication makes eventually synchronous replication (among the nodes belonging to the same group) a reality, whereas the existing MySQL replication feature is asynchronous (or at most semi-synchronous). Better high-availability guarantees can therefore be provided, because transactions are delivered to all members in the same order (despite being applied at each member's own pace after being accepted).
Group Replication does this via a distributed state machine with strong coordination among the servers assigned to a group. This communication allows the servers to coordinate replication automatically within the group. More specifically, groups maintain membership so that data replication among the servers is always consistent at any point in time. Even if servers are removed from the group, consistency is re-established automatically when they are added back. Further, there is also a failure-detection mechanism for servers that go offline or become unreachable. The figure on the side shows how Group Replication is used with our applications to achieve high availability.
InnoDB Cluster is designed to make high availability easier to set up, use and maintain. It works with the AdminAPI via the MySQL Shell, Group Replication, and MySQL Router to take high availability and read scalability to a new level. That is, it combines InnoDB's data-cloning features with Group Replication, the MySQL Shell and MySQL Router to provide a new way to set up and manage high availability.
- The cluster is set up with a single primary (think "master" in standard replication parlance), which is the target for all writes (updates).
- Multiple secondary servers (replicas) maintain copies of the data, which can be read from, enabling read scalability without burdening the primary (though all servers participate in consensus and coordination).
- The incorporation of Group Replication means the cluster is fault tolerant and group membership is managed automatically.
- MySQL Router caches the metadata of the InnoDB Cluster and performs high-availability routing to the MySQL server instances, making it easier to write applications that interact with the cluster.
As this cluster gave us better scalability and fault tolerance, we decided to go with InnoDB Cluster instead of a single-master, multiple-replica setup.
DB modeling highlights
- Every tenant will have their own schema, most likely similar or may be different based on need
- For now, there will be 3 DB schemas
- Master — Contains tenant & clients
- Myntra — All Myntra consumers
- INSIDE — Logistics, Warehouse, Sellers and all internal users
- Single user table to fetch user details
- User credentials — one-way hashed
- Social links as a separate table — might be deprecated in future & not required for the user profile requirement
- Image records — moved from Cassandra to MySQL and served on demand
- Email, phone & social archive tables
- URT — to maintain multi-request transactions at the database level
- Recovery request — For recovering accounts
Answering some of the whys for DB Schema
- Why single user table?
To get user account data, it used to take a couple of joins (phone, email, profile data, image data, etc.), which dropped performance a lot.
With that model we were never able to scale the access APIs and always hit a DB bottleneck. With a single user table, the main profile data (FirstName, LastName, Gender, Email, Phone, RegistrationOn) is available in a single query, drastically improving performance.
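To make the contrast concrete, here is a small sketch of the single-user-table read path, with SQLite standing in for MySQL and column names assumed from the text: the hot lookup becomes one indexed query with no joins.

```python
import sqlite3

# Illustrative single user table (SQLite stands in for MySQL here): the main
# profile fields live in one row, so the hot GetUserByUIDX path is a single
# indexed query with no joins. Column names are assumptions based on the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user (
        uidx TEXT PRIMARY KEY,
        first_name TEXT, last_name TEXT, gender TEXT,
        email TEXT, phone TEXT, registered_on TEXT
    )
""")
conn.execute(
    "INSERT INTO user VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("u1", "Asha", "Rao", "F", "asha@example.com", "+91-9000000000", "2023-01-01"),
)

def get_user_by_uidx(uidx):
    # one primary-key lookup replaces the old multi-join fan-out
    return conn.execute(
        "SELECT first_name, last_name, gender, email, phone, registered_on "
        "FROM user WHERE uidx = ?", (uidx,)
    ).fetchone()
```

Side tables (social links, images, archives) stay out of this hot path and are only joined in for the flows that actually need them.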
- Why social link as a separate table?
Social links can be of two types, Facebook and Google; if we kept them in the same table, there would be multiple entries for the same uidx, which we didn't want.
We might remove social links in the future, so dropping the table itself would be much easier than dropping a column.
Social links are not part of the main profile data, so this data is not queried that frequently, and thus there are minimal queries over the join of user + social links.
- What’s the purpose of archive tables?
The archive tables keep the history of emails or phone numbers when the user decides to change them.
This is for logging purposes and the historical behaviour of the user.
As of now we use it to send communications to the user in case of a number change, sending comms over both the new number (user table) and the old number (archive).
Tomorrow this can be used for operations over numbers that were once verified for a user but are now unlinked.
- What is the URT table meant for?
URT stands for Unique Request Token; this table stores a unique token against each request, keeping it valid until the token closes or expires for a transaction.
This helps make multiple requests (e.g. SendOTP → VerifyOTP → Signup) for a user part of the same transaction, creating a sort of session between them to support multiple operations consistently across servers.
It also helps create isolation between different requests in case they originate from the same user.
- What is recovery request?
A recovery request is meant to generate recovery reset links for the user when they try to reset their password.
It creates a reset key with a set expiration; when the reset-password request comes, the reset key is used to validate the request and reset the user's password, recovering the account.
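A recovery request of this shape can be sketched as follows; the TTL value, token length and in-memory store are assumptions for illustration:

```python
import secrets
import time

# Sketch of the recovery-request flow: a random reset key with an expiry is
# stored per user, and the reset call validates both the key and its freshness.
# TTL, token length and storage shape are illustrative assumptions.
RESET_TTL_SECONDS = 15 * 60
_recovery_requests = {}  # user_id -> (reset_key, issued_at)

def create_recovery_request(user_id: str) -> str:
    """Generate a reset key for the user; this key goes into the reset link."""
    key = secrets.token_urlsafe(32)
    _recovery_requests[user_id] = (key, time.time())
    return key

def validate_reset_key(user_id: str, key: str) -> bool:
    """True only if the key matches the stored one and has not expired."""
    entry = _recovery_requests.get(user_id)
    if entry is None:
        return False
    stored_key, issued_at = entry
    fresh = (time.time() - issued_at) < RESET_TTL_SECONDS
    return fresh and secrets.compare_digest(stored_key, key)
```

Using `secrets.compare_digest` avoids timing side-channels when comparing the submitted key against the stored one.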
User id to UserInfo Cache
The API to get UserInfo from a user id needs to serve the user path at 20k RPS with very tight latency requirements. A fast in-memory store is more efficient in this case. We do not rely on the MySQL data cache, as a complex SQL query can very easily pollute it.
We maintain a separate Redis cluster to answer these queries. As write operations are minuscule compared to reads, we maintain consistency between cache and DB in the write path.
Cache Management for Access Flows
- The aim is to serve 95% of calls in the GetUserByUIDX flow from cache, and the remaining 5% from the MySQL DB.
- Storage key for the cache: prefix + ID
How and what data is going to be cached?
- Data from the MySQL nodes, with a join over the user and image_entries data.
- The entire data after the join between tables is cached, and any custom views are served from that cached data.
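The strategy above is essentially cache-aside on reads, with the cache refreshed on the write path to stay consistent. A minimal sketch, with plain dicts standing in for the Redis cluster and for MySQL, and an assumed key prefix:

```python
# Sketch of the GetUserByUIDX cache strategy: read-through from a Redis-like
# cache, with the cache updated on the write path so cache and DB stay
# consistent. Dicts stand in for Redis and MySQL; the key prefix is assumed.
CACHE_PREFIX = "user:"
cache = {}                                          # stand-in for Redis
db = {"u1": {"firstName": "Asha", "gender": "F"}}   # stand-in for MySQL

def get_user(uidx):
    key = CACHE_PREFIX + uidx
    hit = cache.get(key)
    if hit is not None:
        return hit            # ~95% of calls should land here
    user = db.get(uidx)       # the remaining ~5% fall through to MySQL
    if user is not None:
        cache[key] = user     # populate the cache for subsequent reads
    return user

def update_user(uidx, **fields):
    db[uidx].update(fields)
    cache[CACHE_PREFIX + uidx] = db[uidx]  # keep cache consistent on writes
```

Because writes are a tiny fraction of traffic, refreshing the cache synchronously in the write path is cheap and avoids stale reads.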
OTP Service needs to be segregated from IDEA, for the following reasons:
- OTP service is a generic service, which anyone can use to send OTPs
- Having it inside IDEA makes it difficult for other services to call, and every service would need to create its own OTP sub-module.
- Segregation of concerns
- Transient Data store to store OTP temporarily
- Send OTP API — To send the OTP
- Verify OTP API — Verify Existing OTP for a user
- Send OTP via SMS or Email
- Generate OTPs of different lengths
- Generate OTPs for different clients (generic service)
- The same user can have OTPs for different use-cases and clients
- Different TTLs for different use-cases and clients
- Per-client/use-case configuration to send a different OTP or the same OTP on resend within the TTL
- Keeps templates for sending OTPs per use-case
A Special Case — How to handle Concurrency?
- There is a special case here: what if 2 calls come for the same phone number or email, both wanting an OTP — the one-user, multiple-devices case?
- In this case 2 OTPs are generated, but we don't know which one is valid for which device.
- To handle this case, the URT concept is introduced.
- A URT (Unique Request Token) is generated on each Send OTP.
- During Verify OTP, the call is made with the URT to know which OTP should be checked for that request.
- If 2 requests come with the same phone number or email, each has a different URT, and thus we can identify which OTP to check for each request.
We experimented with different databases here. The main requirement was high consistency, while remaining partition tolerant to allow for data sharding and scalability in the future.
For this CP-oriented use-case, MongoDB seemed to fit the bill perfectly. We experimented with HBase too, but HBase's lack of secondary-index support, versus native secondary indexing and complex-query support in Mongo, favoured Mongo much more for this use-case.
- The configs are kept in the JVM cache.
- The configs are reloaded every hour.
- The TTL for OTP records is a few months and is configurable.
- We keep a counter at the client level, which is reset every hour based on the created time, handled at the application level.
- The created time is set when the counter is first set, or when it is reset after an hour.
- Phone numbers will be kept as ISD code + number.
- The ISD code will also be kept separately.
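The hourly client-level counter can be sketched like this; the window length and storage shape are assumptions based on the bullets above:

```python
import time

# Sketch of the per-client send counter: the counter resets after an hour,
# based on the stored created time, handled at the application level.
# Window length and storage shape are illustrative assumptions.
WINDOW_SECONDS = 3600
_counters = {}  # client_id -> {"count": int, "created": float}

def increment_and_get(client_id: str, now=None) -> int:
    """Bump the client's counter, resetting it if the hourly window elapsed."""
    now = time.time() if now is None else now
    entry = _counters.get(client_id)
    if entry is None or now - entry["created"] >= WINDOW_SECONDS:
        # counter is (re)set and the created time is stamped here
        entry = {"count": 0, "created": now}
        _counters[client_id] = entry
    entry["count"] += 1
    return entry["count"]
```

A rate limiter would then compare the returned count against the client's configured hourly quota before sending an OTP.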
Following are the DB configs for a client, as registered in Mongo:
Following is the OTP collection, where the actual OTPs are stored:
The star-marked fields are all secondary indexes, which are used while querying the data from the OTP collection.
For the Token Service, we decided to leverage the existing IDEA system for now. All the APIs apart from the token-management parts were re-used from the older IDEA, with minor changes to the APIs.
This is going to be revamped in coming months into a new service, and we will have more updates on the same then.
Reaping the Benefits — Benchmark
Unless a system is correctly benchmarked, we would never know whether it reaped any considerable benefits.
We benchmarked various aspects of real-world simulations on the new system using Myntra’s in-house benchmark platform and we saw many promising results.
For login calls, we saw the same volume of calls at 50% of the latency of the older system, making the experience twice as responsive during user on-boarding.
For user access details, we saw 10x the volume of calls at the same latencies, allowing us to handle high peaks during HRDs with much more confidence. It also paved the way for many new use-cases which were not possible before, such as real-time gender-based targeting.
For token refresh calls, we saw 3x the volume of calls at the same latencies. This allows us to serve shorter AT sessions during HRDs without compromising users' session security.
Lastly, it enabled us for the very first time to serve bulk user access calls in real time during HRDs, solving bulk use-cases such as personalised coupons for Shout & Earn campaigns during sale events (which had become almost impossible due to PII compliance on personalised coupons).
Now that we have seen the new architecture and understood it in detail, there are a couple of things in scope for the future:
- Token Service will be revamped to 2.0 standards.
- A new profile service will be created to cater to User Profile needs of Myntra.
We have skipped one important aspect of this whole re-architecture: data migration from the older IDEA to the newer one. We will look at this in a separate blog post, as it was a project in itself and demands a detailed treatment.
The project was built by a team of motivated engineers. Our members are:
Abhishek Jain (Myself, Project Lead), Amarjeet Singh (Lead Architect), Prashant Kumar (Manager), Abhishyam Chennayipalem, Anshuman Kaushik, Sourav Prem, Vishvesh Oza, Venu Babu Narra (IDEA 1.0 Perspective), Pawan Gaur (DBA)
We shall keep you posted on the progress. Stay tuned! Thanks for the read. Comments welcome!