Cybersecurity-Driven Global Tech Outage: Lessons from CrowdStrike, Microsoft, McAfee and Regulations

Saikat
nFactor Technologies
4 min read · Jul 25, 2024

Alright, let’s get real. The CrowdStrike outage was a massive headache for everyone involved. I know because this is the second time this has happened during my career as a CISO, and the two events were uncannily similar.

It was April 21st, 2010, and I had just completed 90 days as the CISO of Varian Medical Systems (a publicly listed global medical device company). My phone would not stop ringing as the world realized that McAfee had released a faulty update that was bringing down large numbers of Microsoft Windows computers, especially those on Windows XP SP3. While we were relatively insulated because most of our computers were not on Windows XP SP3, we launched deep discussions with our Microsoft Account Management team to understand what went wrong. It was fascinating to sit through those discussions, where Microsoft pointed us to EU regulations that required them to provide third-party access to the Windows kernel, and to McAfee, who they felt should have tested their changes before rolling them out. We left these meetings believing that this would never be repeated, given how ugly it had been for the large global customers who were struggling to recover from this devastating mistake.

Fast forward to July 19th, 2024, and the script repeats with the same set of players. It felt like nothing had changed, but the impact was far deeper, bringing Delta's operations to their knees and shutting down a number of key businesses all over the world. I had folks share with me privately that these days an attacker doesn't need to put in the effort to attack systems; just scaring a leading cybersecurity vendor into reacting with malformed updates is enough to achieve their objectives. The remarks get even more scathing when folks realize that this is like reliving April 21st, 2010, with the same players but no apparent maturity in their processes and playbooks.

Here is what you should know about the incident(s):

Cybersecurity firm CrowdStrike’s routine update triggered global chaos, crashing 8.5 million Windows machines worldwide. The update caused the blue screen of death (BSOD) on Windows PCs. Global banks, airlines, hospitals and government offices were disrupted. CrowdStrike released information to fix affected systems, but experts said getting them back online would take time because it required manually weeding out the flawed code. An incident of this scale could happen again, especially if it involves widely used enterprise security software.

Root causes:

Regulation: A compromise driven by a 2009 EU agreement required Microsoft to allow third-party access to the Windows kernel, unlike Apple computers, where such access is not allowed. Under the Microsoft-EU agreement, Microsoft must make the Windows Client and Server operating system APIs used by its own security software, such as Microsoft Defender for Endpoint, available to other developers. According to Neowin, the company must also document these APIs on the Microsoft Developer Network unless they create security risks.

Microsoft made this move after a complaint was filed against it in Europe, and it allowed other vendors to create products that affect Windows at the kernel level. This agreement with the European Commission resulted in a freer market for security products and prevented Microsoft from gaining a monopoly on antivirus and other security suites. Revisiting and updating these policies to balance competition with security could help prevent similar incidents in the future.

Testing (or the lack of it): Most companies test critical changes (such as those that require kernel-level access) in a sandboxed (isolated) environment before releasing them. CrowdStrike seems to have released this change without adequate testing. Such testing could have saved the millions of dollars in losses that companies like Delta are now having to absorb from this incident.
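To make the testing point concrete, here is a minimal sketch of a staged rollout that starts in a sandbox and only moves to larger groups if the previous group stays healthy. This is not CrowdStrike's actual release process; the names `Ring`, `staged_rollout`, `deploy_and_check` and the 1% failure budget are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Ring:
    """One rollout stage (hypothetical): a label and the hosts it covers."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed failure budget per ring
) -> list[str]:
    """Deploy ring by ring and stop once a ring exceeds the failure budget.

    `deploy_and_check` is a caller-supplied function that installs the
    update on one host and returns True if the host is still healthy.
    Returns the names of the rings that completed successfully.
    """
    completed: list[str] = []
    for ring in rings:
        if not ring.hosts:
            continue  # nothing to deploy in an empty ring
        failures = sum(1 for host in ring.hosts if not deploy_and_check(host))
        if failures / len(ring.hosts) > max_failure_rate:
            # Halt here so later, larger rings never receive the update.
            break
        completed.append(ring.name)
    return completed


if __name__ == "__main__":
    rings = [
        Ring("sandbox", ["vm-1", "vm-2"]),
        Ring("internal", ["corp-1", "corp-2", "corp-3"]),
        Ring("customers", ["cust-1", "cust-2"]),
    ]
    crashed = {"corp-2"}  # simulated update that crashes one internal host
    print(staged_rollout(rings, lambda host: host not in crashed))
    # prints ['sandbox']
```

In this simulated run the update passes the sandbox, fails on one internal machine, and never reaches the customer ring, which is the whole point of staging a kernel-level change.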

What can we do now to avoid this in the future:

CrowdStrike: While the response to the incident was mature, there is often very little you can do to control damage once untested, faulty software goes out into a deep install base. Prevention is the name of the game for a cybersecurity leader like CrowdStrike.

Ensure adequate testing of critical changes. Agentic security testing using Generative AI can provide quick results and identify and isolate these issues so that history doesn't have to repeat itself.

Evaluate and deploy options that provide security without having to operate at the kernel level, and be a thought leader in this space. (We gave similar feedback to McAfee in 2010; I am not sure if things have evolved for them since those days.)

Microsoft: The CrowdStrike outage underscores the interconnected nature of modern IT ecosystems and the shared responsibility among different stakeholders. For Microsoft, this incident highlights several areas for improvement:

Kernel-Level Access: The update from CrowdStrike affected the kernel level of the Windows operating system, leading to system crashes. Unlike macOS, which restricts third-party software to user space, Windows allows third-party software to run at the kernel level. This decision, influenced by regulatory requirements, increases the risk of such catastrophic failures. If Microsoft has to continue to allow this, it is important to have monitoring and other controls that prevent such faulty updates from being rolled out. Better still, Microsoft could, like Apple, provide equivalent access at the user level for third-party software, so that damage can be better managed and controlled without manual intervention.
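One kind of control that would limit damage without manual intervention is a boot-time guard that stops loading a third-party component after it has been associated with several failed boots in a row. The sketch below only models the bookkeeping; it is not how Windows works today, and `BootGuard`, `record_boot`, `should_load` and the threshold of three failures are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BootGuard:
    """Tracks consecutive failed boots per component (hypothetical model)."""
    max_consecutive_failures: int = 3  # assumed threshold
    failures: dict[str, int] = field(default_factory=dict)
    quarantined: set[str] = field(default_factory=set)

    def record_boot(self, component: str, succeeded: bool) -> None:
        """Record the outcome of one boot attempt with this component loaded."""
        if succeeded:
            self.failures[component] = 0  # a clean boot resets the streak
            return
        self.failures[component] = self.failures.get(component, 0) + 1
        if self.failures[component] >= self.max_consecutive_failures:
            # Too many crashes in a row: stop loading it automatically.
            self.quarantined.add(component)

    def should_load(self, component: str) -> bool:
        """Return False once a component has been quarantined."""
        return component not in self.quarantined


if __name__ == "__main__":
    guard = BootGuard()
    for _ in range(3):
        guard.record_boot("third-party-sensor", succeeded=False)
    print(guard.should_load("third-party-sensor"))  # prints False
```

With something like this in place, a machine stuck in a crash loop could come back up on its own with the offending component disabled, instead of waiting for someone to remove the flawed file by hand.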

Testing and Quality Assurance: Although the faulty update originated from CrowdStrike, the incident raises questions about the robustness of Microsoft’s ecosystem in handling third-party updates. More stringent testing and quality assurance processes for third-party software, especially software with kernel-level access, could potentially have identified the issue before it caused widespread damage.

Regulatory and Policy Considerations: Microsoft’s decision to allow third-party kernel access stems from a 2009 agreement with the European Commission, aimed at ensuring fair competition. However, this regulatory environment also introduces vulnerabilities. Revisiting and updating these policies to balance competition with security could help prevent similar incidents in the future.
