About the 5-hour Microsoft Outage
What self-sabotaging action could take your company down for 5 hours?
This post is related to, but not strictly part of, my series on Automating Cybersecurity Metrics and the code that goes with it. I’m taking a break today to write about the recent Microsoft outage because it is so important for all of us to learn from these incidents and figure out how we can prevent and withstand future outages.
I was waiting for Microsoft to publish a report on the matter publicly but ended up obtaining the information from other sources. I’m considering the impact of this incident and what might have prevented it in this blog post. Also, what can you do about outages like this that affect your business?
Initially, I wondered if the attack was related to the West’s recent agreement to send tanks to Ukraine, something they had been haggling over for months. But no, it was a self-imposed disaster, apparently. That’s all I can tell at the moment, unless Microsoft reports some other root cause beneath the surface of what appears to be the reason for the incident. I can think of a few possibilities, which are listed below, but so far it seems to be a case of human error.
What caused the Microsoft Outage?
The technical source of this outage is that someone changed a router configuration. A router is a device that controls where packets get sent on the Internet. But it wasn’t just one router: the person pushed a command that went out to a whole group of routers, and that change caused them all to send traffic to the wrong place. The result was a 5-hour outage that impacted many people who were trying to get work done with Microsoft products.
Technically, the outage was caused by a change involving BGP (Border Gateway Protocol), an error-prone protocol that automatically propagates new routes out to network devices. The problem is that when the wrong routes are pushed, they can spread very quickly as well.
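To make the guardrail idea concrete, here is a minimal sketch of validating proposed route announcements against an allow-list before anything is pushed to routers. The prefixes are made up, and real deployments use mechanisms like RPKI and router-side prefix filters rather than a script like this:

```python
import ipaddress

# Hypothetical allow-list of prefixes this network is authorized to originate.
AUTHORIZED_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_authorized(announcement: str) -> bool:
    """Return True only if the announced prefix falls within an authorized block."""
    prefix = ipaddress.ip_network(announcement)
    return any(
        prefix.subnet_of(allowed) and prefix.prefixlen <= 24  # reject overly specific leaks
        for allowed in AUTHORIZED_PREFIXES
    )

def validate_change(announcements: list[str]) -> list[str]:
    """Return the announcements that should be blocked before pushing."""
    return [a for a in announcements if not is_authorized(a)]

# 192.0.2.0/24 is outside the allow-list, so it gets flagged.
blocked = validate_change(["203.0.113.0/24", "192.0.2.0/24"])
```

The point is not the specific code but the pattern: a change to routing should fail a cheap automated check before it ever reaches production devices.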
I first saw this report by ThousandEyes who deduced the problem by analyzing packet loss, disruptions and BGP routes:
Microsoft Outage Analysis: January 25, 2023
ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global…
Organizations cannot put many technical guardrails around these changes within BGP itself. However, companies can architect proper governance and controls around BGP.
How do the engineers that make the changes to BGP systems get access to the systems? How many people are involved in the change? Have automated processes and testing been architected around these changes to prevent errors? Are processes simply a matter of trust — or are they automated with proper checks and balances? Are the people making the changes fully qualified? Are employees rotated on a regular basis to prevent collusion?
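One of those checks and balances, the two-person rule, can be sketched in a few lines. This is a hypothetical illustration of the concept, not any provider’s actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class CriticalChange:
    """A change request to critical infrastructure (e.g., a BGP policy update)."""
    description: str
    author: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # Authors cannot rubber-stamp their own work.
        if reviewer == self.author:
            raise ValueError("Authors cannot approve their own changes")
        self.approvals.add(reviewer)

    def may_deploy(self, required_approvals: int = 2) -> bool:
        """Deployment is blocked until enough independent reviewers sign off."""
        return len(self.approvals) >= required_approvals

change = CriticalChange("Update BGP route filters", author="alice")
change.approve("bob")
assert not change.may_deploy()   # one reviewer is not enough
change.approve("carol")
assert change.may_deploy()       # two independent reviewers required
```

Enforcing this in software, rather than in a policy document, is what turns “trust” into a control.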
I am not sure of the legality of this point, so consult a lawyer. But do companies that manage highly critical infrastructure test employees to determine whether they are willing to make changes when offered money by third parties? Also, are employees trained on how they might be approached by someone who “makes friends” with them or becomes romantically involved with them to obtain information or, in some cases, offers to pay them to make a change? See the link to the Tesla article below.
Are Clouds Considered Critical Infrastructure?
They should be.
I don’t know about you, but when I can’t see my calendar or access communication systems, my day is pretty shot. That’s why I have multiple means of accessing those systems. But if the system itself is down, I’m a bit out of luck. I might not be able to log into certain systems if I cannot access my email or text messages, for example.
Think about how much of America runs on the major cloud providers these days. What sort of impact would an extended outage have on US operations (and those of other countries as well)?
What is critical infrastructure in the US?
CISA defines critical infrastructure as follows:
Critical infrastructure describes the physical and cyber systems and assets that are so vital to the United States that their incapacity or destruction would have a debilitating impact on our physical or economic security or public health or safety. The Nation’s critical infrastructure provides the essential services that underpin American society.
Cloud providers control access to much of our nation’s data. This data runs many systems that help the US function on a daily basis. That’s why it is so, so important for the global technology companies that support these systems not to make a mistake.
We have enough problems with attackers and cybersecurity that we don’t need more self-imposed disasters. And if we take a look through some of the largest outages in cloud provider history — that’s exactly what is causing some of the biggest disruptions.
In addition, should a mistake happen, how quickly can a cloud provider recover? Do they plan for and test rollbacks or corrections to misconfigurations?
Cloud Outage History
AWS has had a couple of S3 outages, among other things, that took down large parts of the Internet because so many systems rely on S3. I think the first major incident that made people understand how reliant we are on cloud providers occurred in 2017 with AWS S3:
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We'd like to give you some additional information about the service disruption that occurred in the Northern Virginia…
In the above outage, the error was caused by a typo. A human error. One single person took out a major portion of the Internet. Now mind you, I’m not blaming that person, I’m saying you should take a look at the processes that could allow such an incident to happen in your organization.
Google has also had some outages, with one of the largest in 2020.
Google explains the cause of the recent YouTube, Gmail outage
Google says that the global authentication system outage which affected most consumer-facing services on Monday was caused by a bug in the automated quota management system impacting the Google User ID Service.
Once again, this massive cloud outage was a self-imposed error, not a malicious cyber attack. In this case it was not a human configuration mistake, but somewhere along the way improper testing — by a human — allowed a bug to slip through that caused this problem. I wrote about proper testing a number of times on this blog.
Better testing for better outcomes
Infrastructure, disaster recovery, product, and penetration testing
Cloudflare had an incident in June 2022 that was also caused by a BGP change:
Massive Cloudflare outage caused by network configuration error
Cloudflare says a massive outage that affected more than a dozen of its data centers and hundreds of major online…
Cloudflare outage on June 21, 2022
Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately…
Microsoft Outages and Incidents
Having monitored cloud outages for a while and being a user of all the major cloud providers, I seem to recall Microsoft having the largest number of outages, cybersecurity incidents, and periods of downtime. I haven’t sat down to record and count the incidents and data breaches, though I’ve thought about it.
I wrote about my own experience not even being able to create a VM for days here:
Here’s an example of customer data exposed through a misconfiguration of a storage bucket:
Investigation Regarding Misconfigured Microsoft Storage Location
October 28, 2022 update: Added a Customer FAQ section. Security researchers at SOCRadar informed Microsoft on September…
Here’s another one caused by a server misconfiguration:
250 Million Microsoft Customer Support Records Exposed Online
Unprotected Microsoft Database Exposed 250 Million Customer Support Records Online
The Firewall Times has compiled a list of breaches, some of which are actual cyberattacks:
Microsoft Data Breaches: Full Timeline Through 2022
The most recent Microsoft breach occurred in October 2022, when data on over 548,000 users was found on an…
The number of visible incidents could relate to different approaches to governance at different cloud providers. See more on governance below, but first, let’s look at some other recent incidents and their root causes.
Taking a look at the root cause of incidents in general
You can view a few recent outage reports here, and I’m very thankful that Microsoft is transparent about this information. I’m simply looking for solutions to problems in this post by finding the root cause. What is causing the most cloud outages and security incidents? In a lot of cases, it is human error. We can take a look at the incidents that affected Microsoft Azure in January:
Azure status history
What happened? Between 07:08 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with network…
1/31/2023 — Lack of alerting and/or response to resource exhaustion
1/25/2023 — Proper steps not followed during a change.
1/23/2023 — Hardware fault (not a human error)
1/18/2023 — BGP change
Two of the above (50%) are human errors.
I would argue that resource exhaustion should be monitored and addressed before the problem occurs, but I don’t have all the details. I’m sure it’s not simple, but it can be done.
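As a simple illustration of what “addressed before the problem occurs” might look like, here is a minimal capacity check with hypothetical thresholds that escalates well before a resource is actually exhausted:

```python
def check_capacity(used: float, total: float,
                   warn_at: float = 0.7, page_at: float = 0.85) -> str:
    """Classify resource utilization so teams act before exhaustion, not after.

    Thresholds here are illustrative; real values depend on how fast the
    resource grows and how long remediation takes.
    """
    utilization = used / total
    if utilization >= page_at:
        return "page"      # wake someone up: exhaustion is imminent
    if utilization >= warn_at:
        return "warn"      # file a ticket: plan capacity now
    return "ok"

assert check_capacity(60, 100) == "ok"
assert check_capacity(75, 100) == "warn"
assert check_capacity(90, 100) == "page"
```

The key design choice is that the alert fires on a trend toward exhaustion, not on the outage itself.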
As for the hardware failure, these things happen, but it seems like it should have been possible to remove the failing device from rotation even if its management interface was not available. Perhaps Microsoft needs to reconsider how those systems are managed and add some additional automation and testing.
Running a cloud platform is not easy, and especially one of the scale of the major cloud providers. (Running systems in one rack in a data center was not easy for me and I didn’t enjoy it.) But if these cloud providers are so critical to so many systems we need to evaluate mistakes after they happen and find ways to prevent them in the future.
By the way, you can also view the incident history for other cloud providers to see what types of incidents they are facing and how they are addressing them.
AWS incident history:
The AWS Health Dashboard provides the latest health information for AWS services across the globe, as well as the…
Google incident history:
Google Cloud Service Health
This page provides status information on the services that are part of Google Cloud. Check back here to view the…
Cloud Governance at Cloud Providers
The AWS outage above resulted from a human typo that took down the S3 service. Since that time, AWS has learned from this mistake and has taken steps to limit human error. Though I’ve still heard of a few incidents related to human error at AWS, it seems like there are far fewer of these at AWS than at Azure.
Perhaps the reason is that AWS has implemented better governance, as I wrote about in these posts:
At some point after that incident I heard Stephen Schmidt, CISO at AWS, say in a presentation that it now takes two people to make a critical change. Microsoft should do the same.
I’ve been writing a lot about separation of duties in my latest blog series and why it matters. Separation of duties not only helps prevent data breaches, it helps prevent human error that can lead to outages.
Automating Cybersecurity Metrics (ACM)
A series of blog posts on cybersecurity metrics and security automation
Sometimes an outage can have a similar impact to a data breach, and availability is one of the key factors in the security CIA triad — confidentiality, integrity, and availability.
Another reason why the possibility of these human errors is so concerning is that attackers use phishing and social engineering to trick insiders, or even pay them to make changes.
In this case at Tesla, an insider got offered a substantial amount of money to try to insert malware into the Tesla network:
Tesla Insider Works with FBI to Turn the Tables on Russia's Million Dollar Attempt to Hijack the…
On August 25, the Department of Justice announced the arrest of Egor Igorevich Kriuchkov, a citizen of Russia, for…
Kriuchkov asked the insider to insert into the organization’s computer network malware provided by Kriuchkov and his associates. Following insertion, a distributed denial of service attack would occur which was designed to occupy the information security team. Meanwhile the malware would be inserted into the network and corporate data would be extracted. The company would be asked to pay to get their data back. For his cooperation, the insider and Kriuchkov would both be well compensated.
How much would an insider need to be paid to get them to make a rogue BGP change? To spy on mobile network traffic at a telecom company? To steal a mobile SIM card? To insert a proxy to spy on traffic as it passes through routing devices on major Internet backbone devices?
Governance architecture in your own environment
Separation of duties applies not just to major cloud providers. Have you considered what changes within your own organization could cause a major outage and take down your systems? In my latest series, I’ve been considering changes that could be used to take over cloud environments, like this one:
Backdoors and Privilege Escalation Via Cloud Account Users
ACM.134 Preventing a user from leveraging another user with permissions they don’t have
In subsequent posts, I proposed a potential solution to that problem and am now testing it. I simply paused to write this post because it is timely.
Have you thought through the ways in which a single insider could take out your organization through:
- A mistake?
- Compromised credentials?
- A social engineering or phishing attack?
What kind of governance and security architecture have you put in place to prevent these egregious errors that cause significant damage? Do you have automated governance that prevents mistakes rather than simply trusting people to do the right thing? People make mistakes, can be tricked, or may simply be offered a lot of money to make an unwanted change. Your reliance on people may be misguided, as I wrote about in my book at the bottom of this post.
Architect proper governance. That is really what my blog series has turned into, and partly where it will end up.
Architect for failure
In addition to segregation of duties and proper governance, have you architected systems that can withstand failure where possible?
Even though AWS went down in the above example, it only went down in one region. Companies that rely on the cloud for their infrastructure and could not afford that outage should have built multi-region support into their architecture. I was hosting a meetup in Seattle at the time (yes, I’m working on it; stay tuned in March) and Netflix came to speak. I asked them if they went down during the above AWS S3 outage.
They had designed their infrastructure to withstand that outage. Someone else told me that customers could not start new movies but existing movie-watchers were unaffected.
Where possible, architect systems for failure. Think through what could go wrong and how systems can automatically recover or withstand the impact of an outage. Test your architecture and your recovery plan.
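As a toy example of that principle (the region names and fetch function are hypothetical, not Netflix’s or AWS’s actual mechanism), a client can be written to fail over to a secondary region when the primary is unreachable:

```python
# Sketch of client-side multi-region failover: try the primary region first,
# then fall back to a secondary if the primary is unavailable.

REGIONS = ["us-east-1", "us-west-2"]  # hypothetical region ordering

def fetch_with_failover(fetch, regions=REGIONS):
    """Call `fetch(region)` against each region in order; return the first success."""
    last_error = None
    for region in regions:
        try:
            return fetch(region)
        except ConnectionError as exc:
            last_error = exc  # record the failure and try the next region
    raise RuntimeError("all regions unavailable") from last_error

def fake_fetch(region):
    """Stand-in for a real service call; simulates an S3-style regional outage."""
    if region == "us-east-1":
        raise ConnectionError("region down")
    return f"served from {region}"

result = fetch_with_failover(fake_fetch)
# The request succeeds from us-west-2 despite the us-east-1 outage.
```

Real failover also has to handle replicated state and partial failures, which is exactly why the recovery path needs to be tested before the outage, not during it.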
When you don’t control the infrastructure — ask the right questions
Where possible, customers need to architect and plan for cloud failures. But in the case of Microsoft Teams and O365, customers do not have control over that infrastructure or architecture, since those are SaaS products. You are at Microsoft’s mercy. Have you performed a proper security assessment and asked them about their portion of the shared responsibility model, as I wrote about here?
What are AWS’s Security Responsibilities, Anyway?
ACM.144 A deeper dive into the shared responsibility model
I suppose I should add BGP to the list in that post.
Now back to trying to architect separation of duties into authentication and authorization in a cloud environment. I’m taking a look at Okta in future posts to see if and how that system may be able to help.
Follow for updates.
Teri Radichel | © 2nd Sight Lab 2023
If you liked this story ~ use the links below to show your support. Thanks!
Clap for this story or refer others to follow me.
Follow on Medium: Teri Radichel
Sign up for Email List: Teri Radichel
Follow on Twitter: @teriradichel
Follow on Mastodon: @email@example.com
Follow on Post: @teriradichel
Like on Facebook: 2nd Sight Lab
Buy a Book: Teri Radichel on Amazon
Buy me a coffee: Teri Radichel
Request services via LinkedIn: Teri Radichel or through IANS Research
Slideshare: Presentations by Teri Radichel
Speakerdeck: Presentations by Teri Radichel
Recognition: SANS Difference Makers Award, AWS Hero, IANS Faculty
Education: BA Business, Master of Software Engineering, Master of Infosec
How I got into security: Woman in tech
Company (Penetration Tests, Assessments, Training): 2nd Sight Lab
Cybersecurity for Executives in the Age of Cloud on Amazon
Cloud Security Training (virtual now available):
2nd Sight Lab Cloud Security Training
Is your cloud secure?
Hire 2nd Sight Lab for a penetration test or security assessment.
Have a Cybersecurity or Cloud Security Question?
Ask Teri Radichel by scheduling a call with IANS Research.
More by Teri Radichel:
Cybersecurity and Cloud security classes, articles, white papers, presentations, and podcasts