Technical details of the Windows BSOD disaster due to CrowdStrike

The technical details of what caused this meltdown

B Shyam Sundar
Jul 20, 2024

TL;DR

CrowdStrike, a very popular US company that provides computer security services, has a product called Falcon, which is installed on a lot of machines, especially Windows machines. On 19th July 2024 they pushed an update to this software over the internet, and it rolled out widely. The update caused machines running Windows to crash during boot and show the infamous “Blue Screen of Death” (BSOD). Since the failure happened at the boot stage itself, IT professionals at the affected organizations had to physically boot every Windows machine into Safe Mode and remove a channel file (more details about this below) before the system would boot normally again. Though the root cause was identified rather quickly, fixing it takes time because the remediation is so hands-on.

About Falcon and EDR

To better understand this problem, we will first briefly look at the component concerned and its privilege level. The software that caused this massive BSOD issue was CrowdStrike’s Endpoint Detection and Response (EDR) driver, which is part of the platform called CrowdStrike Falcon Sensor. Now you might ask, what does this software do?

What does Falcon’s EDR do?

EDR is a cybersecurity solution designed to monitor and respond to threats on endpoints such as computers, servers, and mobile devices. Here are the key components and functions of EDR:

  1. Data Collection: EDR collects data from endpoints, including process information, network connections, and file activities. This data is crucial for detecting anomalies and potential threats.
  2. Threat Detection: EDR uses advanced analytics, machine learning, and behavioural analysis to detect suspicious activities. It can identify threats that traditional antivirus solutions might miss.
  3. Incident Response: When a threat is detected, EDR can automatically respond by isolating the affected endpoint, terminating malicious processes, and alerting security teams. This helps contain and mitigate the impact of the threat.
  4. Forensic Analysis: EDR provides tools for detailed forensic analysis, allowing security teams to investigate the root cause of incidents, understand the attack vector, and improve future defences.
  5. Threat Intelligence Integration: EDR solutions often integrate with threat intelligence feeds to provide context about detected threats, including information about known attack techniques and adversaries.
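The collect, detect, and respond steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the `Event` fields, the `KNOWN_BAD_PROCESSES` set, and the response strings are hypothetical and do not reflect how Falcon is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Event:
    endpoint: str  # machine that reported the event
    process: str   # process that performed the action
    action: str    # e.g. "connect" or "write_file"
    target: str    # what the action touched


# Hypothetical detection rule: process names treated as malicious.
KNOWN_BAD_PROCESSES = {"dropper.exe"}


def detect(event):
    """Threat detection: return True when the event matches a rule."""
    return event.process.lower() in KNOWN_BAD_PROCESSES


def respond(event):
    """Incident response: list the actions taken for a detected threat."""
    return [
        f"isolate {event.endpoint}",
        f"terminate {event.process}",
        "alert security team",
    ]


def handle(events):
    """Run every collected event through detection, then response."""
    actions = []
    for event in events:
        if detect(event):
            actions.extend(respond(event))
    return actions
```

Feeding it one benign and one malicious event would produce response actions only for the malicious one.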

The place of the EDR driver

The Falcon Sensor EDR includes a driver component that operates at the kernel level. This driver monitors and collects data from endpoints in real time. It is loaded very early in the boot process, in what is called the ELAM (Early Launch Anti-Malware) phase. ELAM drivers are usually the first drivers to be initialized, so that they can monitor and protect the system from the very start.

The Windows boot manager is responsible for loading the ELAM drivers. It initializes these drivers to ensure that any malware attempting to load early in the boot process can be detected and blocked.

After the ELAM drivers are loaded, the Windows kernel continues the boot process, which we can consider the kernel phase. The ELAM drivers remain active, providing continuous protection as the rest of the operating system components are initialized.

How does the EDR driver receive updates?

Falcon receives updates from CrowdStrike’s cloud infrastructure automatically, and these updates can happen multiple times a day.

The BSOD incident spread so rapidly because of this particular characteristic.

What caused the problem?

Now that we have a basic idea about the software surrounding the incident, let us get into the core of what caused the issue.

An update was pushed by CrowdStrike via its cloud infrastructure to the endpoints. This update was automatically installed on a huge number of Windows systems worldwide.

This update was for a sensor configuration. The update was designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks. Named pipes are a method for inter-process communication, and attackers often exploit them to establish communication channels between compromised systems and their control servers.

The particular update included changes to the sensor’s configuration files, which dictate how the sensor monitors and responds to various system activities. These configuration files are used for adapting the sensor’s behaviour to new threats without requiring a full software update.

A buggy channel file (C-00000291*.sys) was part of this update. A channel file in the context of the Falcon Sensor is a configuration file that defines specific monitoring and response rules for the sensor. These files can include:

  • Detection Rules: Criteria for identifying suspicious activities or anomalies.
  • Response Actions: Predefined actions the sensor should take when a threat is detected, such as isolating the endpoint or alerting security teams.
  • Communication Settings: Parameters for how the sensor communicates with the cloud-based management console and other components of the Falcon platform.

The particular channel file (C-00000291*.sys) controls how Falcon evaluates named pipe execution on Windows systems. This file contained a logic error which caused the operating system to crash and hence enter into a boot loop.
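To make the idea of "rules as data" concrete, here is a minimal sketch of how a configuration file could drive named pipe evaluation without any change to the code that reads it. Everything in it is an assumption for illustration: the `RULES` list stands in for channel file content, the pipe name patterns are made up, and the real channel file format is proprietary and not shown here.

```python
from fnmatch import fnmatchcase

# Hypothetical rule set standing in for the content of a channel file.
# The patterns are invented for illustration only.
RULES = [
    {"pattern": "demo_c2_*", "action": "alert"},
    {"pattern": "evil-????-server", "action": "block"},
]


def evaluate(pipe_name, rules):
    """Return the action of the first rule matching the pipe name."""
    name = pipe_name.lower()
    for rule in rules:
        if fnmatchcase(name, rule["pattern"].lower()):
            return rule["action"]
    return "allow"  # no rule matched
```

Shipping a new `RULES` list changes what the sensor flags without shipping new code, which is also why a bad rule file can break a sensor whose code was never touched.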

Systems running Linux or macOS do not use Channel File 291 and were not impacted.

The update to the channel file triggered a logic error, which in turn caused a memory allocation error. Furthermore, the validation logic for memory allocations was itself flawed. Since it did not detect anything wrong with the allocation, the driver simply proceeded to operate as usual. Owing to the improper memory allocation, the driver then crashed with a PAGE_FAULT_IN_NONPAGED_AREA error.

What can cause a Memory Allocation Error?

The driver allocated buffers for named pipe operations but did not correctly manage these buffers in all scenarios. This led to situations where the buffers were either too small or not properly aligned, causing memory access violations.

The validation logic for these memory allocations was flawed. Instead of checking and ensuring that the memory was correctly allocated and accessible, the driver proceeded with operations that led to invalid memory access.

When the driver attempted to access these improperly managed memory locations, it triggered a PAGE_FAULT_IN_NONPAGED_AREA stop code (bug check 0x50). This stop code is raised when the system references invalid memory: an address that is expected to be resident, such as one in the non-paged area reserved for critical system components and drivers, turns out not to be backed by a valid page.
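The difference between a flawed and a strict validation step can be sketched as follows. This is a simulation in Python, not driver code: `SimulatedPageFault` stands in for the stop code, and the function names and the buffer layout are hypothetical. In a real kernel driver the invalid read would crash the whole machine rather than raise an exception.

```python
class SimulatedPageFault(Exception):
    """Stands in for the PAGE_FAULT_IN_NONPAGED_AREA stop code."""


def read_field(buffer, offset, size):
    """Simulate a raw memory read that faults outside the buffer."""
    if offset < 0 or size < 0 or offset + size > len(buffer):
        raise SimulatedPageFault(f"invalid read at offset {offset}")
    return buffer[offset:offset + size]


def flawed_validate(buffer, offset, size):
    # Only checks that a buffer exists, not that it is large enough.
    return buffer is not None


def strict_validate(buffer, offset, size):
    # Checks that the whole requested range lies inside the buffer.
    return (
        buffer is not None
        and offset >= 0
        and size >= 0
        and offset + size <= len(buffer)
    )


def process(buffer, offset, size, validate):
    """Reject invalid requests instead of touching memory."""
    if not validate(buffer, offset, size):
        return None
    return read_field(buffer, offset, size)
```

With a 16-byte buffer, a request for 8 bytes at offset 12 is rejected by `strict_validate`, while `flawed_validate` lets it through and the read faults.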

Why the driver crash caused a BSOD and boot loop

The Falcon driver crash caused Windows to show a BSOD (Blue Screen of Death) because of the critical role that drivers play in the operating system.

Role of Drivers in Windows

Kernel-Level: Drivers operate at the kernel level, which means they have high privileges and direct access to hardware and system resources. This allows them to perform essential tasks but also means that any errors can have severe consequences.

System Stability: Drivers are responsible for managing communication between the operating system and hardware components. If a driver malfunctions, it can disrupt this communication, leading to system instability.

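Because the Falcon driver operates at the kernel level, the memory access violation caused a critical system fault. Windows detected this fault and, to prevent further damage or data corruption, initiated a BSOD. The BSOD is a protective measure that halts the system to prevent further issues. Since the driver is loaded early on every boot and reads the same faulty channel file each time, the crash repeated on every restart, which is what produced the boot loop.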
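The boot loop, and why removing the channel file in Safe Mode breaks it, can be shown with a toy simulation. It assumes that Safe Mode does not load the third-party driver, which is what makes the manual fix possible. The file names, the validity flags, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase


class BugCheck(Exception):
    """Stands in for a kernel crash (BSOD)."""


def load_driver(channel_files):
    """The driver parses every channel file; a bad one crashes it."""
    for name, is_valid in channel_files.items():
        if not is_valid:
            raise BugCheck(name)


def boot(channel_files, safe_mode=False):
    """Return the state the machine ends up in after one boot attempt."""
    if safe_mode:
        return "safe mode"  # assumption: the driver is not loaded here
    try:
        load_driver(channel_files)
    except BugCheck:
        return "BSOD"
    return "running"


def remediate(channel_files, pattern="C-00000291*.sys"):
    """Delete the channel files that match the faulty pattern."""
    for name in [n for n in channel_files if fnmatchcase(n, pattern)]:
        del channel_files[name]
```

Every normal boot of a machine holding the faulty file ends in "BSOD", no matter how many times it restarts. One Safe Mode boot followed by `remediate` lets the next normal boot reach "running".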

Not Null Bytes

Many early reports suggested that the issue was due to NULL bytes present in the channel file, but CrowdStrike has clarified that this is not the case.

Closing thoughts

For such a critical piece of software, how and why this was not checked on CrowdStrike’s side before a worldwide rollout escapes my comprehension. On top of this, if your OS had BitLocker enabled with the keys stored on other systems in your organization, the remediation becomes even more cumbersome. The most difficult part of the remediation is that every affected system requires manual intervention, which demands an enormous effort that will take days to complete.

This has been a disaster of monumental proportions for a lot of businesses worldwide. We are yet to see what measures companies will take to prevent such incidents from happening again. Most likely, some kind of A/B testing or staggered rollout would have prevented such a massive outage.
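A staggered rollout can be sketched in a few lines. This is a generic illustration, not CrowdStrike's deployment process: the stage percentages, the failure threshold, and the `apply_update` callback (which returns True when a host comes back healthy) are all assumptions.

```python
def staged_rollout(hosts, apply_update, stages=(1, 10, 100), max_failure_rate=0.05):
    """Update hosts in growing waves and halt when a wave fails too often.

    stages lists the cumulative percentage of hosts covered after each wave.
    """
    done = 0
    for percent in stages:
        # Always advance by at least one host, never past the end.
        target = min(len(hosts), max(done + 1, len(hosts) * percent // 100))
        wave = hosts[done:target]
        if not wave:
            continue
        failures = sum(1 for host in wave if not apply_update(host))
        done = target
        if failures / len(wave) > max_failure_rate:
            return {"attempted": done, "halted": True}
    return {"attempted": done, "halted": False}
```

With 1000 hosts and an update that breaks every machine, this sketch stops after the first wave of 10 hosts instead of reaching all 1000.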

We will have to wait and see how things pivot from here.
