The CrowdStrike disaster is a lesson about testing

What’s been dubbed the world’s biggest IT outage should be a wake-up call to the industry.
July 29, 2024

It grounded planes, shut down banks, and took radio stations off the air. It will likely cost some of the world’s biggest companies $5 billion or more in damages. And it has hit the share price of CrowdStrike, wiping billions off the valuation of the endpoint security company almost overnight. 

Two weeks on, what was dubbed the world’s biggest IT outage now has an initial post-mortem. CrowdStrike blames a bug in its testing software, which failed to properly validate an update to its Falcon software that went out worldwide and took down some 8 million Windows machines.

“Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data,” the company said. In short: the automated checks meant to catch and stop issues before they were deployed failed to do so. 

It happened to CrowdStrike. But it could happen to many other companies. “Digital infrastructure today involves complex interconnected applications, where updates happen all the time on various apps, meaning a tiny error in one application can cause a huge ripple effect and shut down businesses,” says Robert Salesas Martin, chief technology officer at Leapwork, an AI-augmented test automation firm. 

“All app updates and changes, big or small, need to be appropriately tested, but the real challenge is that many businesses have outdated and siloed approaches to testing with a team or tool for each feature or application instead of taking a more holistic view across all apps within a user journey,” he said.

The case for testing

Testing seems to have dropped down many priority lists of late, a consequence of not seeing major issues arise for a while, says Mav Turner, chief product and strategy officer at software testing specialist Tricentis. “To keep up with delivery expectations, software development teams all too often think of testing as a chore rather than a critical value-add, deprioritizing this crucial business function,” he says.

This industry-wide underappreciation of the importance of testing makes engineers less likely to thoroughly validate their work, and more likely to discount any advice from the few dedicated testers who exist in the industry.

“Companies need to put the right energy into their testing strategies to deliver high-quality products and business services,” says Turner. “Testing must identify and target risks and defects early, shorten and enhance release cycles, and harness automation to ensure continuous and actionable feedback.”

But testing can’t simply be paid lip service, says Laveena Ramchandani, software testing lead at EasyJet. It requires serious root and branch reform, and a company-wide dedication to testing. “Whoever is managing the engineering teams should have a testing approach,” she says. “What do you want testing to be like? What is your vision? If the vision is there, and the approach is followed, I don’t think a lot of incidents like this are going to happen.”

Testing times

Ramchandani strongly advocates for having dedicated software testers, rather than relying on developers to test their own code. “You don’t know if the developer is going to actually think like a tester does,” she says. “I’ve seen a lot of developers not do testing right. It’s not natural to them.”

But engineers overlooking the importance of testing isn’t necessarily their fault, and doesn’t make them bad engineers. “It’s just not their role,” she explains. Testers are trained to spot edge cases. “We think of not just happy paths. Our first thought is: ‘What can go wrong?’ So we start with the negative parts.”

Taking such a cautious approach may well have helped head off the issue that CrowdStrike faced when it pushed its Falcon update out to the world. 
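The “what can go wrong?” mindset Ramchandani describes can be illustrated with a small sketch. Everything here is hypothetical and chosen only for illustration: `parse_update`, its three-field record format, and the `rejects` helper are assumptions, and none of it reflects how CrowdStrike’s Content Validator actually works. The point is that the malformed inputs get at least as much attention as the happy path.

```python
def parse_update(record: str, expected_fields: int = 3) -> list[str]:
    """Hypothetical parser: split a comma-separated record, rejecting malformed input."""
    if not record.strip():
        raise ValueError("empty record")
    fields = [field.strip() for field in record.split(",")]
    if len(fields) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, got {len(fields)}")
    if any(field == "" for field in fields):
        raise ValueError("blank field")
    return fields


def rejects(record: str) -> bool:
    """Return True if parse_update refuses the record, False if it accepts it."""
    try:
        parse_update(record)
    except ValueError:
        return True
    return False


# Negative paths first: inputs that should never get through.
negative_cases = ["", "   ", "a,b", "a,b,c,d", "a,,c"]
negative_results = [rejects(case) for case in negative_cases]

# Happy path last: a well-formed record is accepted.
happy_result = parse_update("a, b, c")
```

A tester would typically keep extending `negative_cases` as new failure modes are discovered, so that each one becomes a permanent regression check.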

Help is at hand

After years of downplaying the importance of testing, a significant skills gap has emerged that won’t be filled overnight.

“In CrowdStrike’s preliminary incident review, it outlined an existing set of testing/QA protocols including an architecture that was supposed to validate new content,” a group of Forrester analysts wrote in their own post-mortem. “The trouble was that this validation itself had a bug. CrowdStrike presumably used traditional software testing techniques to exercise this feature, but these clearly fell short.” 

Still, some believe automated testing systems could help, especially as modern applications get more and more complex. “Automated testing is a game-changer for software development’s accelerating speed and scale,” says Tricentis’s Turner. “It enhances test coverage and accuracy and provides faster feedback, enabling continuous integration and continuous deployment.”

Automated testing has become particularly effective for regression testing, where it can pinpoint which code changes need retesting and so shorten release cycles. As with all corners of the industry, there is also an opportunity to improve automated testing using generative AI.

However, “some companies are just relying on AI rather than testers,” Ramchandani warns. “I think it would be a bit alarming now just to let AI do everything.”

For now, organizations will need to find a mix of tooling and expertise to de-risk their releases. Salesas Martin makes the case for comprehensive regression testing that mimics real-world scenarios in order to catch issues before deployment. 

“You need safe rollout strategies – so use progressive or canary rollouts to deploy updates gradually, allowing for early issue detection with minimal impact,” he says. “These methods allow for early detection of issues, significantly reducing the risk of widespread disruptions.”
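A progressive rollout of the kind Salesas Martin describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor’s deployment system: the function name `progressive_rollout`, the stage fractions, the 2% failure threshold, and the `deploy` callback (which returns `True` when a host is healthy after the update) are all hypothetical.

```python
from typing import Callable, Sequence


def progressive_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.02,
) -> dict:
    """Deploy to a growing fraction of hosts, halting if a stage looks unhealthy."""
    done = 0  # number of hosts that have received the update so far
    for fraction in stages:
        # Cumulative target for this stage: at least one new host, never past the end.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            # Stop here: the remaining hosts never receive the bad update.
            return {"halted": True, "deployed": done, "failed_stage": fraction}
    return {"halted": False, "deployed": done, "failed_stage": None}


hosts = [f"host-{i}" for i in range(1000)]

# A broken update is caught in the 1% canary stage.
bad = progressive_rollout(hosts, lambda host: False)

# A healthy update proceeds through every stage.
good = progressive_rollout(hosts, lambda host: True)
```

In this sketch a faulty update reaches only the first 10 of 1,000 hosts before the rollout halts, which is the “minimal impact” the quote refers to. A real system would also wait and monitor between stages rather than judging health immediately.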