5-Minute DevOps: How to Avoid the CrowdStrike Mistake
Unless you live in a cave, you're probably aware of the cybersecurity company CrowdStrike. If you weren't aware of them before, their incident on July 19, 2024, probably made you aware. On that Friday morning, they released a defective change to their customers. The impact was a global disruption of services from airlines to emergency call centers. The overall cost of the outage will be staggering. Delta Airlines alone estimates a cost of $500 million over five days. The entire industry must learn the right lessons to improve how we deliver change instead of pointing fingers and assuming we are too smart to break things.
Houston, we have a problem.
CrowdStrike provides their Falcon security software for multiple operating systems to detect unusual system activity and block attackers. On Windows OS, their software must operate with kernel-level access. Here's a great video from a former Windows developer explaining this. When CrowdStrike creates a new version of the Falcon sensor, it goes through Microsoft's driver certification process before they release it to the public. They also deploy "Rapid Response Content" files to define the rules the sensor uses to detect and block attacks. CrowdStrike says this about the content files:
“Rapid Response Content is a representation of fields and values, with associated filtering. This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver.”
After the "Rapid Response Content" file is created, it goes through their Content Validator application and is then pushed to CrowdStrike's customers.
That morning, a rule file change to address a known vulnerability was released using their standard process for configuration changes. The Content Validator failed to detect that the binary file was corrupted, and the corrupt file triggered an out-of-bounds memory read in the sensor code, which caused the OS kernel to panic. The outcome was approximately 8.5 million Windows machines showing the Blue Screen of Death.
Since the sensor works as a kernel driver and is loaded at boot time, the affected machines were put into an infinite reboot loop. CrowdStrike quickly released information on how to remediate the problem, but the fix required physical access to the systems and could only be executed by their customers. Some companies were able to recover in a few hours, but as I write this, ten days later, some systems are still impacted.
The "Experts" React
The day after the CrowdStrike incident, I saw the usual response that follows every major incident: armchair experts, with no facts about what caused the failure, loudly proclaiming why they're too smart to cause a similar problem.
It's interesting to note that Azure Central US had an outage the Thursday before due to a configuration change, so "I'm too smart" seems like hubris.
Some of the "simple" solutions proposed were reasonable, and we'll cover those. But they were offered without knowing what actually failed, as if "just do this" were enough, and systemic failures don't lend themselves to simple fixes. Others were unreasonable. Here are some hot takes that made my face hurt from palming:
- "Don't deploy on Friday!" There are two reasons why that's a bad takeaway. First, if you are afraid of deploying on Friday, you won't be able to safely when you need to. Some people said that anything could wait until Monday morning. Criminals love people who wait until Monday.
- "We need to make developers criminally liable so they won't make mistakes like this!" So you think a developer is responsible? I can (and may) write an entire post about why this is dumb. However, for now, let's keep this simple. No single developer is responsible for this problem. We'll see that as we dive into their remediation plan.
- "They should have had staged rollouts!" Sure, everyone should. Would that have prevented this problem? No. It may have only taken down Delta Airlines and a few 911 centers. That's not a fix.
The Preliminary Postmortem
Running a postmortem is hard, especially in the heat of the moment right after an incident. I can only imagine doing it when a large portion of the planet wants a pound of your flesh. The pressure to scapegoat is real. The final postmortem is pending, but I was shocked that they released preliminary findings only four days later.
While not part of the published causes of the outage, it was interesting to learn from other sources that anti-trust regulators in the European Union contributed to the conditions for it. In 2009, EU regulators required Microsoft to give other security vendors the same level of access to Windows that Microsoft's own security software has. That made failures like this one more likely by forcing Microsoft to make Windows less secure in the name of competition. Beware of unintended consequences.
You can read their preliminary postmortem for more details, but I want to focus on their remediation plan, what it implies, and my takeaways.
Add additional validation checks to the Content Validator for Rapid Response Content.
According to their postmortem, they had a history of successfully deploying configuration changes. However, automated tests can only test for things that we think of. If the tests pass all the time and we don't see a production issue, it's easy to become complacent and think we've covered everything that can fail. We haven't. Tests that never fail worry me. It usually means we lack imagination about what can fail.
How can we test our tests? What are we doing to explore edge cases? No test framework is perfect, so "red teaming" our tests with random inputs, corrupt data, or anything else we can think of should be a continuous part of hardening our validation process. CrowdStrike did this; it's just that it was unintentional and happened in production.
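To make that concrete, here is a minimal sketch of what red-teaming a validator can look like. The file format and the validate_content function are invented for illustration (this is not CrowdStrike's code): the tests feed the validator empty, truncated, and random-garbage files and require that it returns a clean "no" rather than crashing or reading past the end of the buffer.

```python
import os
import struct

MAGIC = 0xC0FFEE01
HEADER = struct.Struct("<II")  # magic, entry count
ENTRY_SIZE = 16


def validate_content(blob: bytes) -> bool:
    """Hypothetical validator: reject anything that is not structurally sound."""
    if len(blob) < HEADER.size:
        return False
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        return False
    # Never trust the declared entry count: bounds-check it against the
    # actual file size instead of reading past the end of the buffer.
    return HEADER.size + count * ENTRY_SIZE == len(blob)


def test_rejects_obviously_corrupt_files():
    good = HEADER.pack(MAGIC, 2) + bytes(2 * ENTRY_SIZE)
    assert validate_content(good)
    assert not validate_content(b"")                      # empty file
    assert not validate_content(good[: len(good) // 2])   # truncated file
    assert not validate_content(b"\x00" * len(good))      # wrong magic
    assert not validate_content(HEADER.pack(MAGIC, 999))  # count overruns the file


def test_never_crashes_on_random_garbage():
    # Whatever bytes arrive, the worst allowed outcome is "rejected":
    # never an unhandled exception, or, in kernel terms, an out-of-bounds read.
    for _ in range(1000):
        assert validate_content(os.urandom(64)) in (True, False)
```

The specifics don't matter; the habit does. Every time something corrupt slips through, it becomes a new hostile input in the suite.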
Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
This surprised me. Deploying a change to every endpoint with no dogfooding? That's what a canary deploy is: deploying to a production-like or tightly controlled production system to validate our validation process. We should assume a change has problems until production evidence shows it doesn't. Even then, if we are deploying to many different systems, we should assume there's a configuration out there that won't work with our change. I've been bitten many times by that, and if there's one, there's more than one. That's why we deploy in batches: to limit the blast radius.
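Here is roughly what that batching can look like. This is a sketch only, assuming a hypothetical fleet API with select, deploy, error_rate, rollback, and page_oncall calls; the ring names, bake time, and error budget are made up for illustration, not CrowdStrike's.

```python
import time

# Hypothetical rings: the fraction of the fleet that gets the new content
# at each stage, starting with our own machines.
ROLLOUT_RINGS = [
    ("internal dogfood", 0.001),
    ("canary customers", 0.01),
    ("wave 1", 0.10),
    ("wave 2", 0.50),
    ("everyone", 1.00),
]
ERROR_BUDGET = 0.001   # abort if more than 0.1% of updated hosts report failures
BAKE_SECONDS = 3600    # let each ring soak before expanding the blast radius


def staged_rollout(fleet, content_version: str) -> bool:
    """Deploy to progressively larger portions of the fleet, halting and
    rolling back the moment the current ring shows trouble."""
    for ring_name, fraction in ROLLOUT_RINGS:
        hosts = fleet.select(fraction)
        fleet.deploy(hosts, content_version)
        time.sleep(BAKE_SECONDS)  # wait for real telemetry, not just "it uploaded"
        if fleet.error_rate(hosts, content_version) > ERROR_BUDGET:
            fleet.rollback(hosts, content_version)
            fleet.page_oncall(f"{content_version} failed in ring '{ring_name}'")
            return False
    return True
```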
Now, imagine they did a canary. It worked, they batched releases, and only a few hundred thousand systems were impacted. There's still a good chance that someone like Delta would be taken down. It took days for Delta to recover most of their systems, and I'm still seeing indications that passengers are being impacted. What was THEIR plan for mitigating the impact of a widespread outage?
Rollback testing
I bet they had a rollback plan—I can't fathom that they didn't. However, that plan probably went out the door when the downstream systems became unreachable. The only answer I have for this is "defense in depth." You cannot always back out of every kind of change. What's our backup plan for our rollback plan?
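One common backup plan for the rollback plan is to make the consumer of the change resilient: validate the new artifact at load time and fall back to the last version that worked instead of failing hard. The sketch below shows a generic last-known-good pattern with invented paths and a stubbed parser; it is not how the Falcon sensor actually loads content.

```python
import logging
import shutil
from pathlib import Path

# Paths and the parse step are invented; this is a pattern, not a product design.
CURRENT = Path("/var/lib/agent/content/current.bin")
LAST_GOOD = Path("/var/lib/agent/content/last_known_good.bin")
log = logging.getLogger("agent")


def parse_content(blob: bytes) -> dict:
    """Hypothetical parser; raises ValueError on a malformed file."""
    if len(blob) < 8:
        raise ValueError("content file too short")
    return {"raw": blob}  # real parsing elided


def load_content() -> dict:
    """Prefer the newest content file, but fall back to the last one that
    worked instead of failing hard on a corrupt update."""
    blob = CURRENT.read_bytes()
    try:
        rules = parse_content(blob)
    except ValueError:
        log.error("new content file is corrupt; falling back to last known good")
        return parse_content(LAST_GOOD.read_bytes())
    # Promote to last-known-good only after the new file parsed cleanly.
    shutil.copyfile(CURRENT, LAST_GOOD)
    return rules
```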
Improve monitoring for both sensor and system performance, collecting feedback during Rapid Response Content deployment to guide a phased rollout.
According to the timeline in their original incident update, anyone connected to their servers during a roughly 90-minute window starting at 04:09 UTC received the update. This raises a few questions. Who was monitoring the deployment? CrowdStrike is headquartered in Austin, TX, which means the deploy happened at 11:09 PM on Thursday the 18th from their perspective. Was the team that made the change involved? Was it handed off to another team? Or was their alerting process a customer calling in? Ninety minutes is a long time for a bad file to sit on the download server. The people best able to confirm a change are the people making the change, and that needs to happen during core hours.
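An automated watchdog during the release window is one way to close that gap. The sketch below assumes hypothetical telemetry and release APIs and an invented threshold; the point is that pulling a bad file from the download server and paging the team that shipped it should not depend on a customer phone call at midnight.

```python
import time


def watch_release(telemetry, releases, content_version: str,
                  window_seconds: int = 5400, poll_seconds: int = 60) -> None:
    """During the release window, compare hosts going dark against the
    pre-release baseline and pull the artifact automatically on a spike."""
    baseline = telemetry.offline_rate()  # fraction of hosts not checking in
    deadline = time.monotonic() + window_seconds
    while time.monotonic() < deadline:
        time.sleep(poll_seconds)
        current = telemetry.offline_rate(version=content_version)
        if current > max(3 * baseline, 0.001):  # crude spike detector
            releases.pull(content_version)      # stop serving the file
            releases.page_owning_team(
                content_version,
                reason=f"offline rate {current:.2%} vs baseline {baseline:.2%}",
            )
            return
```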
Rapid Response Content Deployment
When the first reports came out, I speculated that they might have different delivery processes for normal and rapid change types. As I dug into it more, it turned out I was almost correct. They had one method for the sensor itself and a different method for the sensor's configuration files. Any change to the sensor requires additional testing and certification by Microsoft, which would be an untenable process for delivering rule changes to block new attacks. However, based on the preliminary postmortem, the testing process CrowdStrike used for the configuration files was very different from the one used for the sensor. It's very common to treat data and configuration differently from code. It's common, but it's wrong: they are all the same thing.
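Treating them the same means every change type rides the same pipeline. A minimal sketch, with stubbed stages standing in for real gates:

```python
from dataclasses import dataclass


@dataclass
class Change:
    kind: str        # "sensor_code", "rapid_response_content", ...
    version: str
    artifact: bytes


def run_static_checks(change: Change) -> None:
    """Validate the artifact itself; this is where a Content Validator lives."""


def install_on_test_hosts(change: Change) -> None:
    """Actually load the artifact on machines that look like production."""


def run_canary(change: Change) -> None:
    """Ship to a small, closely watched slice before anyone else sees it."""


PIPELINE = [run_static_checks, install_on_test_hosts, run_canary]


def deliver(change: Change) -> None:
    """Code, config, and content all earn production the same way; any stage
    can raise an exception and stop the release."""
    for stage in PIPELINE:
        stage(change)
```

The stages are placeholders, but the shape is the point: there is no "fast lane" that skips installing the change somewhere real before it reaches customers.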
Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.
This shocked me. One of my knee-jerk reactions was to also lay responsibility at the feet of their customers for blindly accepting changes with no internal batched rollout to validate them. I stand corrected. No matter how high the change success rate is, when the risk of a change is "crash the machine," we definitely need some internal sandboxes. I'm glad they are fixing it. Look around and see how many other things you consume like this. You may want to mitigate that.
Lessons We Should Learn
The naive lesson would be how to prevent this specific issue. “They needed a canary release!” Sure, and they identified that as something that they should add. However, what about the next problem that occurs? What if it only affects 10% of the machine configurations they deploy to? Is it better if only 850 thousand machines are impacted? Simple answers are never the solution. This was a perfect storm caused by gaps in many layers of the process. It’s the outcome of many decisions over the years. Plugging one hole doesn’t make it safer. So, what should we learn?
We aren't too smart to fail, either. We have our problems, too. It’s always a good idea to review our quality processes and run fire drills to stress test them. If we need to deploy a change to our system right now, how risky is that? How can we reduce it? No, the answer isn’t to deliver less frequently. That only increases the risk.
Do we trust our recovery plan based on the worst-case failure of our application? We should test those and not hand-wave them away because we’ve never needed to use them for disaster recovery… yet.
We should only have one way to validate and deliver any kind of change to an application. An application isn’t just the code. It consists of everything from the pipeline used to validate it to the infrastructure it runs on.
We should constantly stress-test our delivery process to find ways to make us safer and more secure. Harden our pipelines! Exploratory testing should be a continuous process we are doing, not something that is focused only on a specific feature we are delivering. We make our systems better by trying to use our systems in unexpected ways and adding validations for things that break.
We need defense in depth, not just for security but also against errors we will make and errors we will consume from others. We need to design platforms and test fixtures to make errors more difficult to make, but we also need to apply the "testing mindset" to everything we do: What could go wrong, and how can we be alerted quickly that it did? Preferably before production.
Learn from others or become a lesson to others
I've ridden motorcycles for twenty years. I've had close calls, but I've never had an accident. That's because I have spent 20 years studying how other people have accidents and how those might be avoided.
There will be more large incidents from other companies that impact millions of people. If we point and blame, we set ourselves up to be the next example. Instead, we should be thankful we haven't done it yet and learn everything we can from the incidents of others.