Could Microsoft have Avoided the Cascading CrowdStrike Outage?

Dan Johnson
Jul 20, 2024

I believe they could.

How did this happen?

According to various reports, on July 19, 2024, after midnight, Microsoft Windows computers (PCs and servers) around the world went down. The outage was reportedly caused by an automatic software update that left the updated computers unable to restart without manual intervention. Although it directly affected only computers running Windows, those using other operating systems may have been indirectly affected, because they relied on services that those Windows computers provided across various computing clouds over the internet.

CrowdStrike, a security software and services company used by many other companies, automatically updates the software that it installs and runs on its customers’ computers. As you may know, this is not an unusual practice. It keeps software up to date, in particular with the latest security protections, and scheduling these updates overnight means that fewer users are affected by any slight reduction in service.

The software update that night reportedly introduced an errant configuration file that caused or exposed a bug, either in CrowdStrike’s own software or perhaps within Microsoft Windows itself. While that may have been the direct cause, Microsoft could have designed into Windows a fairly straightforward mechanism that would have avoided the far-reaching harm the world experienced for at least a good part of a day.

I have no connection with either Microsoft or CrowdStrike; however, I am a retired computer software engineer, performance analyst, and system administrator. Making a few assumptions consistent with news reports, I offer the following as a practical, workable solution that Microsoft might consider to help prevent such an outage (if it happened as I understand it) from happening again, or at least from having such widespread and severe consequences.

Checkpointing could have prevented this.

The mechanism I am suggesting that Microsoft include in Windows involves a technique called “checkpointing” that was developed decades ago to address just this kind of problem. It is in widespread use in just about every database management system to maintain the consistency and coherence of data. I believe there is no insurmountable reason why it can’t be applied to avoid disasters like the outage.

I don’t know whether CrowdStrike’s update process required rebooting the affected computer, or whether it did not require a reboot but nevertheless caused the system to crash during the update process. (Some software and operating systems can be updated without restarting them, but my experience has been that most Microsoft-based software does require a restart, with a brief interruption in service that system redundancy may be able to shoulder.)

Either way, the Windows computer would need to reboot, and as currently reported, when the system tried to restart, a “blue screen” appeared. This is technically not the storied “Blue Screen of Death” (a.k.a. “BSOD”), which occurs during a crash, but it nevertheless requires manual intervention to make the computer operational once again.

Portion of a Microsoft Windows “blue screen” display showing “Recovery” options.
This “blue screen” reads “Recovery: It looks like Windows didn’t load correctly. If you’d like to restart and try again, choose “Restart my PC” below. Otherwise, choose “See advanced repair options” for troubleshooting tools and advanced options. If you don’t know which option is right for you, contact someone you trust to help with this.” There are two buttons: “See advanced repair options” and “Restart my PC”.

In both cases, the effect is the same: the computer stops in its tracks, ceases providing service, and in many cases impedes or interrupts operation on other computers that rely on its services.

The “manual intervention” to recover from a problem like CrowdStrike’s requires a skilled technician to (most likely) make changes to each affected computer system — in this case, perhaps, to remove or replace an offending configuration file — and manually restart the computer, perhaps more than once, and then to confirm that the system resumes operating properly.

Although it may be possible to automate this recovery process to some extent, building and using that kind of automation is akin to building an airplane while you’re flying it. In other words, don’t do that unless you really must.

How could Microsoft achieve this?

I don’t know of any reason why this particular problem needed to occur the way it did, and responsibility for that should lie firmly with Microsoft, not CrowdStrike or any other third party. It is possible to build operating systems that would behave much more reliably in this situation.

Microsoft could have built into its Windows operating system a facility with some additional functionality that could have automatically “healed” computers harmed by an errant software update.

Third-party vendors could design their software installation and update procedures to use this capability, so that an unanticipated circumstance that causes a system failure would not require manual intervention to put the system back into service. Microsoft could either require or strongly recommend that software vendors use this service, particularly if their software could adversely affect the operating system or other software within the system.

How would the checkpoint process work?

Applying such a “checkpoint” mechanism might work something like this. Before updating the system’s software or configuration, the operating system would record its current “state”, e.g. on local disk storage. If a failure occurs during installation or immediately after it completes, the operating system’s checkpoint facility could intervene by “undoing” the installed changes using that previously recorded state and then automatically restart the computer system.
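
To make the idea concrete, here is a minimal sketch in Python of the record-then-undo pattern, applied to a single file. It is only an illustration under my own assumptions, not how Windows or CrowdStrike work: the `checkpoint` helper is a hypothetical name, the file is assumed to exist already, and an ordinary exception stands in for a failed update.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def checkpoint(path: Path):
    """Snapshot one existing file; restore it if the update block raises.

    Hypothetical helper for illustration only.
    """
    backup_dir = Path(tempfile.mkdtemp(prefix="checkpoint-"))
    backup = backup_dir / path.name
    shutil.copy2(path, backup)       # record the current state on local disk
    try:
        yield                        # the update runs inside the "with" block
    except Exception:
        shutil.copy2(backup, path)   # undo the change using the recorded state
        raise                        # still report the failure to the caller
    finally:
        shutil.rmtree(backup_dir)    # the checkpoint is not needed afterwards
```

An update would then be wrapped as `with checkpoint(config_file): apply_update(config_file)`, where `apply_update` is whatever the vendor's installer does. A real facility would also have to survive a crash of the whole machine, which means the checkpoint must be kept on disk across the reboot rather than cleaned up inside one process as it is here.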

That would give the system a good chance of continuing operation without the updated software, and technical staff could be notified of the event rather than having the system’s users and other, dependent systems “discover” it themselves, painfully.

As part of this checkpoint facility that Microsoft could have built into Windows, they could provide to third-party software vendors some software functions to designate system disk files (and other system resources) that the imminent update procedure may modify or add. The facility would construct (record) the checkpoint before allowing installation to continue.
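
As a rough sketch of what such a function might look like, the Python below lets an installer declare the files it is about to modify or add, and records a checkpoint for them before installation proceeds. Everything here is an assumption made for illustration: the name `create_checkpoint`, the `manifest.json` layout, and the `"pending"` status are mine, not part of any real Windows interface.

```python
import json
import shutil
from pathlib import Path


def create_checkpoint(declared, store: Path) -> Path:
    """Back up every declared file into `store` and write a manifest there.

    `declared` is a list of Paths the update may modify or add (hypothetical API).
    """
    store.mkdir(parents=True, exist_ok=True)
    entries = []
    for index, target in enumerate(declared):
        if target.is_file():
            backup = store / f"{index}.bak"
            shutil.copy2(target, backup)   # keep the original contents
            entries.append({"path": str(target), "backup": str(backup)})
        else:
            # The update will add this file, so undoing it means deleting it.
            entries.append({"path": str(target), "backup": None})
    manifest = store / "manifest.json"
    manifest.write_text(
        json.dumps({"status": "pending", "files": entries}, indent=2)
    )
    return manifest
```

Writing the manifest to disk, rather than keeping it in memory, is the design choice that matters: the record has to outlive a crash or reboot so that the operating system can find it at the next startup.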

If the computer restarts during or after the installation and Windows detects that a “recovery” is necessary, rather than displaying a “blue screen”, it would revert the updated files and parameters to their original states using the checkpoint data. Then it would try restarting the system automatically.

If the system restart is successful, the facility would be free to remove the “checkpoint” data and allow system operation to continue normally.
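
The startup decision described in the last two paragraphs can be sketched as a small function. This is a hedged illustration only: `recover_on_boot` is a hypothetical name, the `last_boot_failed` flag stands in for however Windows would detect that a recovery is necessary, and the manifest is assumed to be a JSON file holding a `status` field and a list of `files`, each with a `path` and a `backup` location (or `None` for a file the update added).

```python
import json
import shutil
from pathlib import Path


def recover_on_boot(store: Path, last_boot_failed: bool) -> str:
    """Decide what to do with a checkpoint at startup; return the action taken."""
    manifest_path = store / "manifest.json"
    if not manifest_path.exists():
        return "no-checkpoint"                  # no update was in progress
    manifest = json.loads(manifest_path.read_text())
    if last_boot_failed and manifest["status"] == "pending":
        for entry in manifest["files"]:
            target = Path(entry["path"])
            if entry["backup"] is not None:
                shutil.copy2(entry["backup"], target)  # revert a modified file
            else:
                target.unlink(missing_ok=True)         # remove an added file
        manifest["status"] = "reverted"
        manifest_path.write_text(json.dumps(manifest))
        return "reverted"                       # caller restarts the system next
    if last_boot_failed:
        return "needs-technician"               # the rollback did not help
    shutil.rmtree(store)                        # restart succeeded: drop checkpoint
    return "checkpoint-removed"
```

In this sketch the checkpoint is kept after a rollback and removed only once a later restart succeeds. If the system still fails after the files have been reverted, the function gives up and leaves the problem to manual recovery, since the update was evidently not the only cause.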

What remains to consider?

This may not work for every kind of problem a computer may have following an errant software update, but for certain kinds, like the one we saw this week, it should result in only brief service interruptions and require little or no technical attention for each affected computer system.

There would be additional technical factors to consider in providing this automatic-recovery checkpoint facility, and it would likely need to be somewhat more sophisticated than described above. But it seems technically feasible and, I believe, worthy of Microsoft’s consideration, and perhaps CrowdStrike’s and other enterprise software vendors’ as well.
