What we can learn from the CrowdStrike outage

How investing resources in effective QA helps your organization prevent catastrophic disaster.


You’ve probably heard about the global software outage caused by an update from CrowdStrike. On July 19, 2024, they deployed a software update that exposed a previously unknown bug in production, leaving an estimated 8.5 million Windows computers inoperable until someone with IT expertise manually repaired each machine. The outage is estimated to have cost Fortune 500 companies $5.4B in losses. The fallout for CrowdStrike was severe too: its stock lost around a third of its value compared with a month earlier.

Chart: CrowdStrike’s stock value plummeting by 33.61%

What happened under the hood?

CrowdStrike has since identified the root cause of the crashes and outlined what happened. One of their Rapid Response Content updates, which is essentially a configuration file update, erroneously passed their automated Content Validator. The file contained “problematic content” that triggered an out-of-bounds memory read. The resulting exception could not be handled gracefully, so Windows crashed with a Blue Screen of Death (BSOD).

What CrowdStrike admitted went wrong

In more detail, CrowdStrike had previously released a new component of their security system, called a Template Type, and had shipped a few configuration updates for that type without incident. Those successes led them to trust the Content Validator more than it deserved. In other words, they assumed their testing process was robust because it had worked before. Their planned changes to testing practices are an acknowledgment that the Content Validator alone is not sufficient for effective quality testing.

How CrowdStrike plans to fix their QA process

In response to this outage, CrowdStrike has outlined a comprehensive four-part plan to prevent similar incidents in the future.

Enhanced Software Testing Procedures

In order to make their release pipeline much more robust for Content updates, CrowdStrike plans on integrating a comprehensive list of testing strategies. They outlined the following QA upgrades:

Local developer testing: Having developers test the final product on their local machines before an update is released into production creates a human filter for problematic updates.
Content update and rollback testing: Building a robust system both for testing Rapid Response Content updates before release and for testing the mechanism that rolls back failed updates. This helps ensure that machines hit by unknown edge cases are far less likely to stay broken for an extended period of time.
Stress testing: Making sure updates function correctly even when a user’s machine has limited resources, or when the software itself accidentally consumes too many, is essential for the successful deployment of future releases.
Fuzzing: A technique where a computer feeds a very large number of different input values (sometimes on the order of trillions) into a system. For example, this would help them uncover previously unknown bugs in their content validation system.
Fault injection: Purposefully injecting bugs, failures, and other software issues into test updates verifies that the test system actually does its job of stopping bugs before they cause outages.
Stability testing: CrowdStrike found that one of the configuration files was often full of zeros after a machine impacted by the outage restarted. By finding out what happens to Content updates if a system fails at different points in the update process, CrowdStrike will be able to make their software more robust against future bugs or failures.
Content interface testing: By testing the interfaces that apply Rapid Response Content updates more thoroughly, CrowdStrike’s developers will glean more about how components such as the Content Interpreter fail when given buggy or broken configuration changes. This will help them patch current problems and protect against mass outages in the future.

With these layers of checks in place, future updates to CrowdStrike’s software should be far less likely to fail catastrophically the way the July 19 update did.
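To make the fuzzing idea more concrete, here is a minimal sketch in Python. It is not CrowdStrike’s code: the content format (each byte selects an input field), `FIELD_COUNT`, and the functions `validate_strict`, `validate_off_by_one`, `interpret`, and `fuzz` are all hypothetical names invented for illustration. The sketch feeds random content to a validator and then to an interpreter, and counts how often content passes validation but still crashes the interpreter.

```python
import random
from typing import Callable, List

FIELD_COUNT = 4  # hypothetical number of input fields the interpreter receives


def validate_strict(content: bytes) -> bool:
    """Hypothetical validator: every byte must index an existing field."""
    return len(content) > 0 and all(b < FIELD_COUNT for b in content)


def validate_off_by_one(content: bytes) -> bool:
    """Same validator with a deliberate off-by-one bug (<= instead of <)."""
    return len(content) > 0 and all(b <= FIELD_COUNT for b in content)


def interpret(content: bytes, inputs: List[str]) -> List[str]:
    """Hypothetical interpreter: each byte selects one input field."""
    return [inputs[b] for b in content]  # IndexError if an index is too large


def fuzz(validator: Callable[[bytes], bool],
         iterations: int = 10_000, seed: int = 0) -> int:
    """Count inputs that the validator accepts but the interpreter cannot handle."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    inputs = [f"field{i}" for i in range(FIELD_COUNT)]
    escapes = 0
    for _ in range(iterations):
        length = rng.randrange(0, 6)
        content = bytes(rng.randrange(8) for _ in range(length))
        if not validator(content):
            continue  # rejected content would never ship
        try:
            interpret(content, inputs)
        except IndexError:
            escapes += 1  # validated content still crashed the interpreter
    return escapes
```

Calling `fuzz(validate_strict)` should report no escapes, while `fuzz(validate_off_by_one)` should report some, which is the kind of validator gap fuzzing is meant to surface before a release.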

Start investing in QA with QAComet

Want to invest more resources in QA but not sure where to start? Before diving headfirst into running a full-on QA department, you can start by scheduling a free 30 minute consultation call with QAComet. We offer plans starting at $999/month for a variety of QA services including:

Writing test automation suites with E2E/Integration/Unit tests
API testing for REST/JSON/GraphQL/SOAP APIs
Setting up test environments and integrating tests with CI systems
Stress testing, performance testing
Manual testing and bug reporting
Auditing current testing strategies and suggesting improvements

Not only that, because we are a fractional agency, you benefit from:

Not having to sign contracts, so you can pause or cancel your subscription at any time.
Being able to scale service up and down as needed, so you can get QA services when it matters the most.
Receiving service from a variety of experts, both on-shore and off.

Check out our pricing page or schedule a free 30 minute consultation today.

Third-Party Validation

Beyond integrating a much more robust QA process in their software development lifecycle, CrowdStrike plans on using third parties to validate their QA process from development all the way to deployment. This lets people outside CrowdStrike’s own staff look at their processes and find weak spots the team may have missed. Having fresh eyes inspect your processes helps reduce blind spots caused by assumptions your team may unknowingly hold.

Additionally, they will have third parties review the underlying security of their software. This could include code reviews, security audits, and penetration testing, helping CrowdStrike uncover problems within their systems before cybercriminals can find and exploit them.

Enhanced Resilience and Recoverability

This outage underscores the importance of building resilient software systems that can gracefully handle errors and unexpected situations. Any organization that does not act defensively and guard against future mistakes should expect its software to cause problems at some point.

By strengthening the error handling mechanisms within their software, CrowdStrike can minimize the impact of future problems on end-users.

Some best-practices for enhancing resilience include:

Implementing robust exception handling and logging mechanisms. This helps developers analyze failures in their systems during development, testing, and deployment.
Designing systems with fail-safe mechanisms and graceful degradation capabilities. This helps ensure users can recover from broken releases and continue using their machines even if there’s a software failure.
Conducting regular disaster recovery drills and simulations, such as employing chaos engineering techniques. Implementing this strategy lets CrowdStrike’s team uncover previously unknown edge cases, letting them improve their software’s robustness.
Implementing circuit breakers and bulkheads in their software architecture so system components are more isolated. Using these software design patterns helps both prevent cascading failures and contain failures to a known location. The system will then fail in a more predictable manner, helping engineers develop a more robust product.
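As an illustration of the circuit breaker pattern mentioned above, here is a minimal sketch in Python. It shows the general pattern only, not anything from CrowdStrike’s software; the class name `CircuitBreaker`, its parameters `max_failures` and `reset_after`, and the injectable `clock` are assumptions chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of calling func.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Wrapping a call to a flaky component in `breaker.call(...)` means that after repeated failures the caller gets an immediate `CircuitOpenError` rather than hammering the broken component, which keeps the failure contained in one place.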

Refined Deployment Strategy

From their report, it seems like CrowdStrike was not following modern DevOps best practices for deploying new releases. In their incident review, they outline the following strategies they will use moving forward:

Staggered deployments: When deploying new releases into production, they will start with a canary deployment, letting them test the update on a tiny subset of real users. The next stage deploys to a small subset of systems, before finally moving to a staged rollout for the rest of their users. By carefully breaking up deployments into several stages, any future outage should only impact a much smaller set of users.
Enhanced monitoring and logging: They plan on enhancing their monitoring of sensor and system performance during the staggered deployments. Increased monitoring lets them identify and mitigate issues promptly, protecting users from mass outages. Additionally, they will provide notifications of content updates and their timing, giving more insight into how deployments are performing.
Adding update controls: CrowdStrike is now working on update settings that give customers greater control over new Rapid Response Content updates. They plan to let users select when and where these updates are deployed, with these controls available for each update.
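The staggered deployment idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CrowdStrike’s pipeline: the function `staged_rollout`, the stage sizes (1%, 10%, 100%), and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical. In practice a crashed machine may not be able to report its health or accept a remote rollback, so the main benefit of the early stages is limiting how many machines receive a bad update at all.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy in growing stages; stop and roll back if a stage is unhealthy."""
    updated: List[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        for host in batch:
            deploy(host)
        updated.extend(batch)
        # Health gate: check only the hosts updated in this stage.
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            for host in reversed(updated):
                rollback(host)  # undo every host touched so far
            return False  # later stages never receive the update
    return True
```

With 100 hosts and an update that breaks every machine, this sketch stops after the single canary host, so 99 machines never see the bad release.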

Conclusion

The CrowdStrike incident serves as a valuable lesson for the entire tech industry. It highlights the importance of implementing robust QA and DevOps practices, especially for mission-critical platforms with hundreds of millions of daily users. By implementing a more comprehensive testing process, seeking third-party validation, enhancing system resilience, and refining their deployment strategy, CrowdStrike can significantly reduce the risk of similar incidents in the future. As the cost of software failures becomes more apparent across the world, we can expect to see a growing emphasis on software QA across the tech industry. Catastrophic failures in software systems are a serious risk for any company, and this outage will likely drive more resources towards:

Advanced testing tools and deployment infrastructure
More rigorous QA processes and standards across organizations
Ongoing training and skill development for QA and development teams

This shift will ultimately lead to more reliable and secure software products, creating a better future for end-users.

References

A global tech outage brought many computer systems and businesses to a screeching halt. Here’s what happened. CNN.
Executive summary. CrowdStrike.
Preliminary Post Incident Review. CrowdStrike.
Tech Analysis: Channel File May Contain Null Bytes. CrowdStrike.
CrowdStrike to Cost Fortune 500 $5.4 billion. Parametrix.