Global Windows BSOD blackout caused by CrowdStrike

Zaid Khaishagi
8 min read · Jul 20, 2024


On July 19, 2024, a tech blackout affected millions of devices worldwide. Devices running Windows crashed with a Blue Screen of Death (BSOD) and could not be booted properly afterwards. Even after a restart, an affected device would get stuck in a loop of BSODs and restarts.

The impact of this blackout was huge: it was a global IT meltdown. It shut down hospital operations and hit government systems, media, and telcos across many countries. People could not transfer money due to the service outage. In Australia, the federal government convened a meeting of emergency authorities. Around the world, people could not work because their company-issued Windows devices would not start. Airports had major issues and flights were cancelled. In India, airlines handed out hand-written boarding passes. The outage even affected the London stock market, which delayed its display of opening trades. Electronic billboards in public places, ad panels, kiosk screens, self-checkout screens, and airport flight information displays were all showing the ominous BSOD (link).

Hospitals were hit as well: operations were disrupted at many hospitals across many countries. (link) The outage also disrupted 911 dispatch and flights. (link) (link 2) Sky News went offline in the middle of a broadcast, and the BBC’s CBBC kids channel also blacked out, with its broadcast replaced by an error message. (link)

The event caused major problems at airports, grocery stores, and even coffee shops such as Starbucks outlets. Many people and businesses had to revert to cash payments because all the systems were down.

In fact, there were two blackouts that, though unrelated, happened together. The first was the blackout on Windows devices caused by a problematic update released by CrowdStrike; the second was an outage of Microsoft’s cloud platform Azure, which preceded it by only about half a day.


This massive blackout and IT disruption was, however, not a cyber attack. (link)

EDR stands for Endpoint Detection and Response. EDR software monitors users’ endpoint systems, raises alerts, and helps prevent those systems from being hacked. Endpoint systems are any devices that an end user uses, such as desktops, laptops, and smartphones. The EDR software is installed on the device, where it passively monitors the system and provides analytics, antivirus features, intrusion detection, firewall features, and a whole host of other security features.

CrowdStrike was widely trusted by businesses of all sizes across all sectors, including financial firms, healthcare providers, and energy and tech companies.

In order to provide these security features, the CrowdStrike Falcon EDR software needs to run in a more privileged way. It needs access and permissions on the user’s system beyond what regular software would be given. These give the software access to information such as which processes are running, which files are open, which devices are plugged in, what network communication is happening, and much more. This makes sense, because it needs to be able to view and monitor the system properly in order to detect any kind of malicious activity. In the case of CrowdStrike Falcon, it is installed and then runs as a system driver with system privileges, which goes beyond what regular software has access to. (link)

Drivers are pieces of software that allow the operating system to manage and make use of hardware. For example, the audio driver in your operating system allows it to use audio devices such as speakers and microphones. The system kernel is the core part of the operating system; it contains the functionality for its most important features, and the rest of the operating system is built around it.

So, CrowdStrike Falcon runs as a system driver. Such drivers are started automatically when the computer starts up, along with the drivers for things like audio, display, interface devices, peripherals, etc.

On July 19, 2024 at 04:09 UTC, CrowdStrike published an update to systems with the CrowdStrike Falcon Sensor software installed. This update triggered an error in the software, which caused Windows systems running it to crash with a BSOD. So, a Windows system with the Falcon software installed was affected if it was online and downloaded the update between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC.
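As a rough illustration of that exposure window, here is a minimal Python sketch that checks whether a given download time falls inside it. The function name and the choice to treat the start as inclusive and the end as exclusive are assumptions made for this example; they do not come from CrowdStrike.

```python
from datetime import datetime, timezone

# Window boundaries as stated above (UTC).
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def downloaded_in_window(timestamp: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the window.

    Assumption for this sketch: start is inclusive, end is exclusive.
    """
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Aware datetimes compare correctly across time zones.
    return WINDOW_START <= timestamp < WINDOW_END


print(downloaded_in_window(datetime(2024, 7, 19, 4, 30, tzinfo=timezone.utc)))
print(downloaded_in_window(datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc)))
```

Requiring timezone-aware timestamps avoids silently comparing a local time against the UTC window.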

https://www.crowdstrike.com/blog/technical-details-on-todays-outage/

The update was in the form of a “Channel file” (which has the .sys extension but is not a kernel driver). Channel files provide information about novel attacks and threats so that the EDR software can protect against them. This is the normal way that CrowdStrike provides such updates.

So, how does this get fixed?

At the time of writing, CrowdStrike has fixed the problematic Channel file, whose filename starts with “C-00000291-” and ends with a .sys extension, and is now serving a fixed version of it. The issue was remediated on Friday, July 19, 2024 at 05:27 UTC.

https://www.crowdstrike.com/blog/technical-details-on-todays-outage/

They have also provided remediation steps that users can follow, but these are rather complicated. They all involve gaining access to the affected system, in some cases through another system, and then deleting the problematic Channel file from the broken Windows installation. They have also released instructions described as an “automatic recovery” for Windows instances running on GCP (Google Cloud Platform), but this is still fairly involved and uses similar steps to access the affected Windows instance from a second, unaffected system (such as another Windows instance).

The steps they describe are these.

https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/

Workaround steps for individual hosts:

Reboot the host to give it an opportunity to download the reverted channel file. We strongly recommend putting the host on a wired network (as opposed to WiFi) prior to rebooting as the host will acquire internet connectivity considerably faster via ethernet.

If the host crashes again, then:

  • Boot Windows into Safe Mode or the Windows Recovery Environment
    — NOTE: Putting the host on a wired network (as opposed to WiFi) and using Safe Mode with Networking can help remediation.
  • Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
    — Windows Recovery defaults to X:\windows\system32
    — Navigate to the appropriate partition first (default is C:\), and navigate to the crowdstrike directory:
    — C:
    — cd windows\system32\drivers\crowdstrike
    — Note: On WinRE/WinPE, navigate to the Windows\System32\drivers\CrowdStrike directory of the OS volume
  • Locate the file matching “C-00000291*.sys” and delete it.
    — Do not delete or change any other files or folders
  • Cold Boot the host
    — Shutdown the host.
    — Start host from the off state.
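The "locate and delete" step above comes down to a filename pattern match. The Python sketch below illustrates only that matching logic. It is not part of CrowdStrike's instructions, Python will usually not be available in Safe Mode or WinRE, and the function names and the dry-run default are assumptions made for this example. It could be useful, for instance, against a volume mounted on a second machine.

```python
from pathlib import Path

# Pattern quoted in the workaround steps above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_bad_channel_files(driver_dir):
    """Return the files in driver_dir that match the pattern, sorted."""
    return sorted(Path(driver_dir).glob(CHANNEL_FILE_PATTERN))


def remove_bad_channel_files(driver_dir, dry_run=True):
    """Delete matching files and return the list of matches.

    With dry_run=True (the default in this sketch) nothing is deleted,
    so the matches can be reviewed first. No other file is touched.
    """
    matches = find_bad_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

A call such as `remove_bad_channel_files(r"C:\Windows\System32\drivers\CrowdStrike")` would only list the matches. Passing `dry_run=False` performs the deletion.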

Workaround steps for public cloud or similar environment including virtual:
Option 1:

  • Detach the operating system disk volume from the impacted virtual server
  • Create a snapshot or backup of the disk volume before proceeding further as a precaution against unintended changes
  • Attach/mount the volume to a new virtual server
  • Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
  • Locate the file matching “C-00000291*.sys” and delete it.
  • Detach the volume from the new virtual server
  • Reattach the fixed volume to the impacted virtual server

Option 2:

  • Roll back to a snapshot before 0409 UTC on July 19, 2024.

So, as you can see, these are not simple solutions. What’s more, for individual hosts that are not hosted in the cloud, a user or an IT professional must physically go to the affected system and fix it by hand. This is a huge issue: electronic billboards, kiosks, self-checkout systems, airport displays, ad displays, hospital systems, bank systems, and every affected company laptop at every affected organisation (there are *many* affected organisations) must each be fixed manually, across all of the millions of devices. Remediation is going to take some time…

Technical details:

The specific update that caused this issue was the file C-00000291-00000000-00000032.sys. As found by some security folks, this file seems to have contained only null values (all zeroes).

<pic: null values https://x.com/christian_tail/status/1814299095261147448/photo/1>
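To make the "only null values" observation concrete, here is a small Python sketch of how one might check whether a file consists entirely of zero bytes. The function names are hypothetical. It illustrates the check itself and is not the method used by the people who reported the finding.

```python
def is_all_null(data: bytes) -> bool:
    """Return True if data is non-empty and every byte is 0x00."""
    return len(data) > 0 and not any(data)


def file_is_all_null(path, chunk_size=1 << 20) -> bool:
    """Check a file chunk by chunk so large files need not fit in memory."""
    saw_data = False
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            saw_data = True
            if any(chunk):  # any non-zero byte means the file is not all null
                return False
    return saw_data
```

An empty file is reported as not all-null here, which is a design choice for this sketch: a zero-length file contains no null bytes at all.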

In a stack trace dump of when the system crashes, the following can be seen.

<pic: stack trace (https://x.com/snicoara/status/1814184181863526504)>

To explain, the part marked red in the images shows the values stored in the registers by this program. The program is csagent.sys. There are a couple of register values that are important to draw attention to: the first is the value in r8 and the other is rip. The rip register holds the address of the current program instruction to be executed (the instruction pointer). So, in this trace dump, the instruction where the error occurred is the one pointed to by rip, which is mov r9d, dword ptr [r8] (marked with green).

This instruction tells the computer to go to the memory address held in the r8 register (that is, to use the value of r8 as an address), read the 32-bit value stored there, and move that data into the lower 32 bits of r9 (r9d means the lower 32 bits of r9).
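The relationship between r9 and r9d can be modelled with simple bit masking. The Python sketch below is only an illustration of that register naming, with hypothetical function names. It relies on the x86-64 rule that a write to a 32-bit sub-register zero-extends into the full 64-bit register.

```python
MASK32 = 0xFFFF_FFFF


def read_r9d(r9: int) -> int:
    """Return r9d, i.e. the lower 32 bits of the 64-bit r9 value."""
    return r9 & MASK32


def write_r9d(old_r9: int, value: int) -> int:
    """Return the new 64-bit r9 after a 32-bit write to r9d.

    On x86-64 the upper 32 bits are zeroed by such a write, so the old
    register contents (old_r9) do not survive.
    """
    return value & MASK32


r9 = 0x1122_3344_5566_7788
print(hex(read_r9d(r9)))               # 0x55667788
print(hex(write_r9d(r9, 0xDEADBEEF)))  # 0xdeadbeef
```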

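The problem arises because of the actual address held in r8. This address value is 000000000000009c, which is not a valid address: it is a very low address, of the kind that typically results from adding a small offset to a null pointer, and no memory is mapped there. So the program tries to read from an invalid address in memory, and because this code runs with kernel privileges, the failure crashes Windows instead of just the program.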
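The failing load can be mimicked with a toy memory model in Python. Everything here (the ToyMemory class, the AccessViolation exception, the mapped region at 0x10000) is invented for illustration and is not how Windows manages memory. It only shows why a read through r8 = 0x9c has nothing to read from.

```python
import struct


class AccessViolation(Exception):
    """Stand-in for STATUS_ACCESS_VIOLATION in this toy model."""


class ToyMemory:
    """A few mapped byte regions; every other address is invalid."""

    def __init__(self):
        self._regions = []  # list of (base_address, data) pairs

    def map(self, base, data):
        self._regions.append((base, bytes(data)))

    def load_dword(self, address):
        """Model of `mov r9d, dword ptr [address]`: read 4 bytes, little-endian."""
        for base, data in self._regions:
            if base <= address and address + 4 <= base + len(data):
                return struct.unpack_from("<I", data, address - base)[0]
        raise AccessViolation(f"read of unmapped address {address:#018x}")


memory = ToyMemory()
memory.map(0x10000, struct.pack("<I", 0x12345678))  # hypothetical valid region

r8 = 0x10000
r9d = memory.load_dword(r8)  # valid address: the load succeeds

r8 = 0x9C  # the value seen in the crash dump
try:
    memory.load_dword(r8)
except AccessViolation as error:
    print(error)
```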

We can see from the actual BSOD screen that the error code is SYSTEM_THREAD_EXCEPTION_NOT_HANDLED. If we look up what this error code means in Microsoft’s documentation, we can see that it “indicates that a system thread generated an exception that the error handler didn’t catch.” One of the causes listed for this error is this:

0xC0000005: STATUS_ACCESS_VIOLATION indicates a memory access violation occurred.

This error occurs when an instruction references memory at an invalid address, and it is most likely what caused the BSOD.
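The value 0xC0000005 is an NTSTATUS code, and its bits carry structure. As a sketch based on the NTSTATUS layout described in Microsoft’s MS-ERREF specification (severity in the top two bits, then a customer flag, a reserved bit, a 12-bit facility, and a 16-bit code), the following Python snippet decodes it. The function and dictionary names are made up for this example.

```python
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(status: int) -> dict:
    """Split a 32-bit NTSTATUS value into its fields."""
    status &= 0xFFFF_FFFF
    return {
        "severity": SEVERITY_NAMES[(status >> 30) & 0x3],  # bits 31-30
        "customer": bool((status >> 29) & 0x1),            # bit 29
        "facility": (status >> 16) & 0xFFF,                # bits 27-16
        "code": status & 0xFFFF,                           # bits 15-0
    }


# STATUS_ACCESS_VIOLATION from the bug check documentation
print(decode_ntstatus(0xC0000005))
```

For 0xC0000005 this gives severity "error", facility 0, and code 5, matching its description as a memory access violation.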

https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x7e--system-thread-exception-not-handled

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55

<pic: https://x.com/ruskievityazi/status/1814208244044242967/photo/1>

At first glance, it seems that the null values (the continuous zero values) in the Channel file caused the bad address that the program then tries to access. But CrowdStrike has stated that the issue “is not related to null bytes contained within Channel File 291 or any other Channel File.”

https://www.crowdstrike.com/blog/technical-details-on-todays-outage/
