Azure AD Architecture Explained

Published in LearnWithNK · 6 min read · Mar 9, 2022

In this blog, we will learn about the architecture of Azure AD and see how various design patterns are used in its design. Check out my Azure AD Explained blog for a basic understanding of Azure Active Directory.

Azure AD Architecture uses a lot of design patterns to ensure:

High Availability
Fault Tolerance and Fault Isolation
Scalability
Security
Collection of logs and metrics
Automated Recovery

I have added a Key Takeaways section at the end of this blog; you can jump directly to it if you prefer.

Azure AD runs stateless gateways, front-end services, and back-end services in all available datacenters. It also runs sync servers in every datacenter.

Overview

Azure AD is composed of independent, scalable units (aka partitions). Front-end servers provide read and write capability through geographically distributed data centres.

Each partition has two kinds of replicas:

Primary Replicas
Secondary Replicas

Primary Replica

It handles all write operations, and each write operation is performed from the nearest data centre.

It is further classified into two types:

Active Primary: It is a single clustered write replica. In normal operation, all write requests are directed to this replica. Once a write completes, the data is also written to the Passive Primary.
Passive Primary: It is also a single clustered write replica. In normal operation, it receives data from the Active Primary. If the Active Primary fails, the Passive Primary takes over the active role, and once the old Active Primary is back, that replica becomes the Passive Primary. (The process of changing roles is also known as Leader Election.)

Data must be written to at least one more datacenter, apart from the one receiving the write operation, to avoid data loss in case of failure.
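To make the active/passive idea concrete, here is a minimal sketch in Python. It is a toy model, not how Azure AD is implemented: the names `Replica`, `PrimaryPair`, `dc-west`, and `dc-east` are all made up for illustration, and a plain dictionary stands in for the directory store.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)


class PrimaryPair:
    """Toy model of an active/passive primary pair (illustrative only)."""

    def __init__(self, active: Replica, passive: Replica):
        self.active = active
        self.passive = passive

    def failover(self) -> None:
        # Swap roles: the passive primary becomes the active one.
        if not self.passive.healthy:
            raise RuntimeError("no healthy primary available")
        self.active, self.passive = self.passive, self.active

    def write(self, key: str, value: str) -> str:
        # Fail over first if the active primary is down.
        if not self.active.healthy:
            self.failover()
        self.active.data[key] = value
        # Copy to the passive primary so the write exists in two places.
        # This toy skips the copy when the passive is down; a real system
        # would need a second durable copy before acknowledging.
        if self.passive.healthy:
            self.passive.data[key] = value
        return self.active.name


pair = PrimaryPair(Replica("dc-west"), Replica("dc-east"))
pair.write("user:1", "alice")        # handled by dc-west, copied to dc-east
pair.active.healthy = False          # simulate a failure of the active primary
handled_by = pair.write("user:2", "bob")
print(handled_by)                    # dc-east
```

Because every write was copied to the passive replica first, the replica that takes over already holds `user:1` when the failure happens.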

Secondary Replica

It handles all read operations, and each read operation is performed from the nearest data centre.

It comprises multiple clustered read replicas, located in different geographical locations.

All read replicas receive data asynchronously, which results in eventual consistency rather than strong consistency. (With eventual consistency, a write is not visible on all replicas immediately, whereas with strong consistency every read after a write returns the new data.)

With eventual consistency, there is always a chance of reading stale data.
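The stale-read behaviour is easy to see in a small simulation. The sketch below is only an illustration under my own assumptions: `Directory`, `ReadReplica`, and the explicit `sync()` call are hypothetical stand-ins for the primary, the secondary replicas, and the background sync process.

```python
from collections import deque


class ReadReplica:
    def __init__(self):
        self.data = {}


class Directory:
    """Toy store: one primary dict, read replicas updated asynchronously."""

    def __init__(self, replica_count: int):
        self.primary = {}
        self.replicas = [ReadReplica() for _ in range(replica_count)]
        self.pending = deque()  # changes not yet applied to the replicas

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replicate later, not now

    def read(self, key, replica_index=0):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Stand-in for the background sync process.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


directory = Directory(replica_count=2)
directory.write("user:1", "alice")
stale = directory.read("user:1")   # None: the replica has not caught up yet
directory.sync()
fresh = directory.read("user:1")   # "alice" once replication has run
```

The read between `write()` and `sync()` returns old data, which is exactly the window that eventual consistency allows.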

Azure AD uses the Graph API for writes. Each Graph API service maintains a logical session with a specific secondary replica, and during a write operation it gets the response synchronously from that secondary replica only. Once the response is returned, the other replicas are updated asynchronously, as discussed above.
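Here is a rough sketch of that session behaviour, assuming the session's replica is updated synchronously and the rest are updated later. `SessionStore`, `drain_backlog`, and the replica names `eu` and `us` are invented for this example and are not part of any Azure AD or Graph API interface.

```python
class SessionStore:
    """Toy model of a write with affinity to one secondary replica."""

    def __init__(self, replica_names):
        self.primary = {}
        self.replicas = {name: {} for name in replica_names}
        self.backlog = []  # (replica_name, key, value) still to be applied

    def write(self, session_replica, key, value):
        self.primary[key] = value
        # Synchronous copy to the replica this session reads from.
        self.replicas[session_replica][key] = value
        # Every other replica is updated later.
        for name in self.replicas:
            if name != session_replica:
                self.backlog.append((name, key, value))

    def read(self, replica, key):
        return self.replicas[replica].get(key)

    def drain_backlog(self):
        # Stand-in for asynchronous replication catching up.
        for name, key, value in self.backlog:
            self.replicas[name][key] = value
        self.backlog.clear()


store = SessionStore(["eu", "us"])
store.write("eu", "user:1", "alice")
same_session = store.read("eu", "user:1")   # "alice": the session sees its own write
other_replica = store.read("us", "user:1")  # None until the backlog is drained
store.drain_backlog()
caught_up = store.read("us", "user:1")      # "alice"
```

The point of the design is that a caller who writes and then reads in the same session does not see stale data, even though the system as a whole is only eventually consistent.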

NOTE: I have named only a few of the design patterns used in the design of AAD; there may be more.

High Availability

To ensure a highly available architecture, it uses the following design patterns:

Health Endpoint Monitoring: It continuously monitors the health of all services, at regular intervals.
Deployment Stamps: It has independent copies of services, along with databases.
Geodes: It has services that are distributed in a set of different geographical nodes.
Throttling: It controls the consumption of resources within an application.

Geographically distributed data centres (using Deployment Stamps and Geodes) play a significant role in high availability.

Continuous monitoring of services (using Health Endpoint Monitoring) detects unhealthy services. If a service is unhealthy, the Gateway Service performs load balancing and routes requests to healthy services.

It uses a single-master system (the Active Primary), with carefully orchestrated and deterministic failover to the Passive Primary.
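The combination of health checks and gateway routing can be sketched as follows. This is a simplified illustration, not the actual gateway: `Backend`, `Gateway`, and the `dc-*` names are hypothetical, and `health()` stands in for a real HTTP health probe.

```python
import itertools


class Backend:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def health(self) -> bool:
        # Stand-in for calling a health endpoint on the service.
        return self.healthy


class Gateway:
    """Round-robin router that skips backends failing their health check."""

    def __init__(self, backends):
        self.backends = backends
        self._cycle = itertools.cycle(backends)

    def route(self) -> str:
        # Try each backend at most once per request.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend.health():
                return backend.name
        raise RuntimeError("no healthy backend available")


gateway = Gateway([Backend("dc-1"), Backend("dc-2", healthy=False), Backend("dc-3")])
routed = [gateway.route() for _ in range(4)]
print(routed)  # ['dc-1', 'dc-3', 'dc-1', 'dc-3']
```

The unhealthy `dc-2` never receives a request, and the callers do not need to know that it is down.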

Fault Tolerance And Isolation

To ensure fault tolerance and isolation, it uses the following design patterns:

Health Endpoint Monitoring: It continuously monitors the health of all services, at regular intervals.
Circuit Breaker: It prevents errors from cascading in case of failure.
Compensating Transaction: It undoes all completed steps if a failure occurs in the middle of a write operation.

Each service of Azure AD works in de-correlated mode, which prevents the failure of a single service from bringing down the entire system (using Circuit Breaker).
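For readers who have not seen a circuit breaker before, here is a minimal, generic version of the pattern. It is a sketch under my own assumptions (the thresholds, the `CircuitBreaker` class, and the injectable `clock` are illustrative) and says nothing about how Azure AD implements it internally.

```python
import time


class CircuitBreaker:
    """Reject calls for a cool-down period after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast instead of piling requests onto a service that is already struggling, which is what stops one failure from spreading to its neighbours.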

Health Endpoint Monitoring detects unhealthy services, and when a service is unhealthy, the Gateway Service performs load balancing to route around it.

If a failure occurs in the middle of a write operation, the Compensating Transaction undoes all the operations already performed.
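A compensating transaction can be sketched as a list of (action, undo) pairs. The example below is generic and hypothetical: `run_with_compensation`, `create_user`, and `assign_license` are names I made up, and a dictionary stands in for the directory.

```python
def run_with_compensation(steps):
    """Run (action, undo) pairs; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, undo in steps:
            action()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()
        return False
    return True


directory = {}


def create_user():
    directory["user:1"] = "alice"


def delete_user():
    directory.pop("user:1", None)


def assign_license():
    # Simulate a failure in the second step of the write.
    raise RuntimeError("licensing service unavailable")


ok = run_with_compensation([
    (create_user, delete_user),
    (assign_license, lambda: None),
])
print(ok, directory)  # False {}
```

The first step succeeded, but because the second one failed, its undo ran and the directory is back to its original state.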

High Availability, together with Fault Tolerance and Isolation, contributes to the Continuous Availability of Azure AD.

Scalability

To ensure scalability, it uses the following design patterns:

CQRS (Command Query Responsibility Segregation): It uses different replicas for read and write operations. The Command side of CQRS covers operations that change data (create, update, delete), and the Query side covers fetching data from datastores.
Sharding (aka horizontal partitioning): It uses different datastores for different sets of clients.
Caching: It stores data in a key-value store (e.g. Redis), from which data can be fetched faster.

Partitioning (see the Overview section) plays a key role in write scalability (using the Active Primary and Passive Primary). For read operations (using Secondary Replicas), Azure AD maintains multiple replicas. Using different replicas for reads and writes is achieved with CQRS.

Different datastores for different sets of clients ensure that each client can work without affecting the work of others (using Sharding).
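One common way to map clients to datastores is to hash a client identifier. The snippet below is a generic sketch of that idea; the source does not say how Azure AD assigns tenants to partitions, so `shard_for`, the tenant names, and the shard count are assumptions for illustration only.

```python
import hashlib


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index using a stable hash."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % shard_count


shards = {index: {} for index in range(4)}  # four hypothetical datastores

for tenant in ["contoso", "fabrikam", "northwind"]:
    shards[shard_for(tenant, 4)][tenant] = {"users": []}

# The same tenant always lands on the same shard.
assert shard_for("contoso", 4) == shard_for("contoso", 4)
```

A stable hash (rather than Python's built-in `hash()`, which is randomized per process for strings) keeps the mapping the same across restarts.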

Security

To ensure security, it uses the following:

MFA (Multi-Factor Authentication)
Auditing
Just-in-time Privileged Access Management

To access Azure AD, users need to register their account in the Authenticator app. Whenever a user wants to log in, Azure AD sends an approval request to the user's phone, or the user can enter the passcode provided by the Authenticator app (using MFA).
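To give a feel for how such a passcode can be generated, here is a small sketch of a time-based one-time password (TOTP) as described in RFC 6238. I am assuming the Authenticator passcode is a TOTP of this kind; the function name `totp` is my own, and the secret below is the published RFC test secret, not anything related to a real account.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the given time."""
    counter = unix_time // step  # number of time steps since the epoch
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F      # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Test secret and timestamp taken from the RFC test vectors.
sample = totp(b"12345678901234567890", 59)
print(sample)  # 287082
```

Both the app and the server derive the code from a shared secret and the current time, so the server can verify the passcode without it ever being sent to the phone.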

If anyone needs access temporarily, Azure AD uses a just-in-time elevation system (using Just-in-time Privileged Access Management).
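The core idea of just-in-time access is a grant that expires on its own. Here is a minimal sketch of that idea; `JitAccess`, the user, and the role name are hypothetical, and this is not the Azure AD Privileged Identity Management API.

```python
from datetime import datetime, timedelta, timezone


class JitAccess:
    """Toy model of time-bound role elevation."""

    def __init__(self):
        self._grants = {}  # (user, role) -> expiry time

    def grant(self, user: str, role: str, minutes: int, now: datetime) -> None:
        self._grants[(user, role)] = now + timedelta(minutes=minutes)

    def is_allowed(self, user: str, role: str, now: datetime) -> bool:
        expiry = self._grants.get((user, role))
        return expiry is not None and now < expiry


access = JitAccess()
start = datetime(2022, 3, 9, 12, 0, tzinfo=timezone.utc)
access.grant("alice", "Directory Admin", minutes=60, now=start)

during = access.is_allowed("alice", "Directory Admin", start + timedelta(minutes=30))
after = access.is_allowed("alice", "Directory Admin", start + timedelta(minutes=61))
print(during, after)  # True False
```

Nobody has to remember to revoke the access: once the window closes, the check simply fails.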

Logging and Metrics Collection Capability

Logging and metrics collection play a very significant role in keeping the system highly available, scalable, and secure.

They verify that the other design patterns behave in a way that provides the best customer experience.

It continuously analyzes and monitors all the key health metrics of a service.

It helps with tuning based on metrics: for example, if CPU usage is high, the system takes the necessary action to bring it down.

It helps restore a service that is not working properly.

It quickly detects problems in a live site and instructs the system to take the necessary actions.

Azure AD focuses on minimizing the time to detect (TTD), which at the time of writing is less than five minutes (TTD < 5 mins). Once an issue is identified, the time to mitigate (TTM) is less than thirty minutes (TTM < 30 mins).
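As a small worked example, the check below compares an incident's timestamps against those two targets. It assumes TTD is measured from the start of the issue to its detection and TTM from detection to mitigation; the function `meets_targets` and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

TTD_TARGET = timedelta(minutes=5)    # time to detect target from the text
TTM_TARGET = timedelta(minutes=30)   # time to mitigate target from the text


def meets_targets(started: datetime, detected: datetime, mitigated: datetime) -> bool:
    """Return True if both the TTD and TTM targets were met."""
    ttd = detected - started
    ttm = mitigated - detected  # assumption: measured from detection
    return ttd < TTD_TARGET and ttm < TTM_TARGET


started = datetime(2022, 3, 9, 10, 0)
within = meets_targets(started, started + timedelta(minutes=3),
                       started + timedelta(minutes=25))
missed = meets_targets(started, started + timedelta(minutes=8),
                       started + timedelta(minutes=20))
print(within, missed)  # True False
```

In the second incident the mitigation was quick, but detection took eight minutes, so the TTD target was missed.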

Key Takeaways

Azure AD is mainly used for Authentication and lookups.
It has 2 types of replicas: Primary and Secondary.
The Primary Replica is further classified into Active and Passive Primary.
All write operations are performed by the Primary Replica (Active Primary).
Secondary Replicas perform all read operations.
If the Active Primary fails, the Passive Primary takes over its role.
Secondary Replicas receive data asynchronously through the Sync Service, and this results in eventual consistency.
Data needs to be written in at least two data centres before it is acknowledged.
All services work in a de-correlated mode that prevents cascading of error in case of failure.
Time to Detect any issue is less than 5 mins.
The time to mitigate is less than 30 mins.
Azure AD uses soft delete instead of hard delete to prevent accidental deletion of data.

Please reach out to me if you would like a blog on a specific Azure AD or Azure topic; I would be happy to write it.

Please let me know if you find any flaws in this article.

Comments and feedback are most welcome.

Follow me on LinkedIn, GitHub, and Medium to stay updated.

Thanks for reading. Happy Learning 😊