Microsoft Azure AD was Down! Is Cloud Really Reliable?
What happens when a critical service provided by Microsoft fails, bringing down Microsoft 365, Teams, SharePoint, the Azure Portal, and your application that depends on Azure Active Directory?
--
Monday, September 28th, 2020. It’s 5:30 pm EST, and I try to log in to an internal application integrated with Azure AD. Error.
I try again, thinking I might have typed the wrong username. Error. Damn! Something is wrong with my application!
I reach out on a Teams channel and discover that other colleagues are having similar issues, and not only with the application that I support.
The Microsoft Azure Status page shows everything in green, but when I start researching on the web and Twitter, I find other users reporting the same problem. Suddenly, my token expires, and I get logged out of the Azure Portal, which I cannot log in to again.
Almost two hours have passed, and service still has not been restored. It is a big deal. It is affecting thousands of applications that are integrated with Azure AD and all the applications from Microsoft itself that depend on AAD, like Microsoft 365, SharePoint, Teams, and even the Azure Portal.
I start checking some tweets related to this subject, and it is interesting to see some users just reporting, others ranting, and a few sending HugOps to the team that is working on this incident.
While waiting for the service to come back up, I spent some time reflecting on it:
Cloud is resilient. But it is not fail-proof
Hardware fails. Software fails. And it is no different at Microsoft, AWS, or Google. Gmail suffered an outage today, but it recovered quickly.
However resilient the service and infrastructure are, they can fail and will fail.
Providers state in the contract the service they will offer, the promised SLA, and the credits they provide if they fail to meet that SLA.
It is crucial to know the SLA of every service you consume from the cloud, no matter if it is IaaS, PaaS, or SaaS.
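To make an SLA percentage more tangible, it helps to convert it into minutes of allowed downtime. The sketch below is only an illustration: the percentages and the 30-day measurement period are assumptions, not the terms of any specific Azure contract, and the function names are hypothetical. Always check the actual SLA document for how your provider measures availability.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Maximum downtime, in minutes, that still meets the SLA over the period.

    Assumes availability is measured as uptime / total time, which is a
    simplification; real SLAs define their own measurement rules.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def achieved_availability(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability percentage for the period, given the total downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # Illustrative SLA targets, not taken from any provider's contract.
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% over 30 days allows about {minutes:.1f} minutes of downtime")

    # A two-hour outage, like the one described above, in a 30-day month.
    availability = achieved_availability(120)
    print(f"A 120-minute outage leaves {availability:.3f}% availability")
```

Under these assumptions, a 99.9% target leaves room for roughly 43 minutes of downtime in a 30-day month, so a single two-hour outage would already exceed it.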