Don’t wait for an outage to improve your reliability

A reliable system isn’t just about your infrastructure. Learn to effectively leverage game days to train engineers and build resilience before real incidents force you to.

Speakers: Leo Papaloizos

June 02, 2026

Sponsored by

incident.io

No one likes responding to incidents – but they’re also where you learn the most. Every outage reveals weaknesses across your systems and processes that you might not have found any other way.

The problem is, waiting for your biggest incidents isn’t a realistic strategy for building a more reliable system. You can introduce stress tests or extrapolate worst-case behaviours, but there’s an unpredictability to how systems and people behave in combination under real pressure that is hard to fully account for.

Game days are your opportunity to test that combination without frantic firefighting. If you’re not running realistic tests, your safeguards are theoretical, and a real disaster is the worst time to find out whether they work.

At incident.io, we use game days to train engineers, validate fixes from past outages, and build genuine confidence in our systems and response processes. Most recently, we ran game days to successfully validate that we would be resilient in the face of an AWS outage. Every time we run a game day, engineers are asking when the next one will be.

It’s easy to run a simple drill, but half the value of a game day comes from thorough preparation. Figuring out what’s actually feasible to break gives you confidence in the parts of your system that are solid. Finding the holes you *can* poke through gives you avenues to follow up, even before you’ve run anything. And preparing for the different response paths your engineers might take lets you create realistic pressure, making it feel like a real incident rather than a repetitive drill.

In this talk, I’ll cover what we’ve learned running game days, and how you can get the most value out of them. Running a successful game day can make your systems more resilient, but you’ll also be building a culture of reliability in your organisation.

Key takeaways

  • Why game days are one of the most effective tools for reliability and getting ahead of the next potential outage
  • How to run efficient, worthwhile game days that give you confidence in your systems, a clear list of areas to improve, and a chance for your engineers to experience realistic pressure without risking production
  • How to build a culture of reliability by making game days and incidents something engineers find valuable and want to repeat