Outages are coming for you

How to respond when your systems have a significant outage
August 19, 2021

Given a long enough timeline, the probability of your software systems having a significant outage approaches 100%.

This is a fact, and your acceptance of it is reflected in your operational culture.

Avoiding outages is the primary operational goal for many organizations, and for good reason – but it shouldn’t be your only goal. A singular focus on uptime will lead you down a dark and murky path, bereft of inspiration. This path is also littered with piles of wasted money, and is haunted by the shadows of engineers who have left your company after your culture bored or exhausted them.

Outages are coming for you, regardless of your beliefs, goals, or plans. Tactically, you have thousands of ways to approach this problem. Strategically, you only have a few. You must choose between fear and courage, investing or reacting, and how best to balance your primary resources: money, bodies, and brains.

Boundaries of failure

Below is Jens Rasmussen’s ‘migration model’ from his paper Risk Management in a Dynamic Society: A Modelling Problem. You’ve probably seen it before. I first saw it while watching Richard Cook present a keynote at the Velocity Conference in 2013, and it’s been burning in my brain ever since.

Let’s explore this model with a hypothetical example: you’re responsible for operating an application which is critical to your company’s revenue stream. The application has a bug that causes servers to randomly crash. After every minute of uptime, the probability of crashing increases by 10%, so uptime for each server is between 1 and 10 minutes. Initially, you’re just running a single server, since this meets your capacity requirements.

Disclaimer: it’s very possible that some of my math below is incorrect, but I hope that the overarching theory is clear.

Throwing money at the problem

With just one server and the current failure rate, the probability that the server is still running at the 5th minute is 50%. So you decide to run more servers:

server count | probability of uptime at 5th minute
1            | 50%
2            | 75%
4            | 94%
7            | 99%
10           | 99.9%

I’ll use the term ‘probable uptime’ to indicate the probability that at least 1 server is running. Even though 1 server provides sufficient capacity, you need to run 4 servers for 1 nine of probable uptime at the 5th minute, 7 servers for 2 nines, and 10 servers for 3 nines. If your goal is 3 nines of probable uptime, you have to run 10x your required capacity to get there, which will likely also multiply your costs by 10x. As your service continues to expand, you’re heading beyond the boundary of economic failure.
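
As a quick sanity check on that table, here’s a minimal Python sketch of the ‘probable uptime’ calculation, assuming (as above) that each server independently has a 50% chance of still being up at the 5th minute:

    def probable_uptime(servers: int, p_single: float = 0.5) -> float:
        """Probability that at least one of `servers` independent servers
        is still running, given each survives with probability p_single."""
        return 1 - (1 - p_single) ** servers

    for n in (1, 2, 4, 7, 10):
        print(f"{n:>2} servers: {probable_uptime(n):.1%}")

    # -> 50.0%, 75.0%, 93.8%, 99.2%, 99.9% -- matching the table (94% and 99% after rounding)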

Throwing bodies at the problem

You don’t want to increase your costs by 10x, so you consider other options. You could run 1 server and assign an engineer to immediately restart the application after every crash. By my calculations, if we assume the median server uptime is 5 minutes and the application can be restarted within 1 minute of failure, the engineer will be doing this around 230 times per day. You’d have about 230 minutes of downtime per day, which limits you to a maximum uptime of 84%.
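
Here’s that arithmetic sketched in Python, using the assumptions above (roughly 5 minutes of uptime per crash and a 1-minute restart). With an exact 6-minute cycle it lands near 240 restarts and an 83% ceiling – the same ballpark as the ~230 and 84% quoted:

    MINUTES_PER_DAY = 24 * 60                        # 1440

    median_uptime_min = 5                            # assumed minutes of uptime between crashes
    restart_min = 1                                  # assumed minutes to restart after a crash

    cycle_min = median_uptime_min + restart_min      # ~6-minute crash/restart cycle
    restarts_per_day = MINUTES_PER_DAY / cycle_min   # ~240 restarts per day
    downtime_min = restarts_per_day * restart_min    # ~240 minutes of downtime per day
    max_uptime = 1 - downtime_min / MINUTES_PER_DAY  # ~83% uptime ceiling

    print(f"{restarts_per_day:.0f} restarts/day, "
          f"{downtime_min:.0f} min downtime/day, "
          f"max uptime {max_uptime:.0%}")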

If you started a second server 1 minute after you launched the first, you’d decrease the risk of total downtime when either node crashed, at the cost of requiring twice as many restarts. Your engineer now has to do 460 restarts per day. But your engineer is human, and cannot work 24 hours per day, so you’ll need at least 2 engineers working in shifts in order to cover all hours of the day, seven days per week. And as your traffic grows, it’s likely that you’ll need more than one instance running at a time, so the number of required restarts multiplies. This is an absolutely dreadful job, so you’re likely to have low employee retention. You’ll be spending as much time training new people to do the work as they will spend actually doing the work. You’re beyond the boundary of an unacceptable workload.
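
To make that workload concrete, here’s a rough sketch of how the restart burden scales with server count, reusing the ~230 restarts per server per day from above and a hypothetical 12-hour shift:

    restarts_per_server = 230    # per-server daily estimate from the previous section
    shift_hours = 12             # hypothetical shift length, for illustration

    for servers in (1, 2, 4, 8):
        daily = servers * restarts_per_server
        per_shift = daily * shift_hours / 24
        print(f"{servers} server(s): {daily} restarts/day, "
              f"~{per_shift:.0f} per {shift_hours}-hour shift")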

Investing instead of reacting

Why not just run three servers and schedule a job to restart each of them every two minutes?  Or why not actually fix the bug that causes the application to crash? I agree that either of these ideas would be superior to the scenarios I’ve described above. But this was a hypothetical situation with one application and one bug – in reality you may have hundreds of services with thousands of bugs, along with various combinations of deployments, config changes, DNS updates, queues filling, duplicate events, data consistency problems, CPU and latency spikes, and so on. Each of these events increases the probability of failure across your systems.

If you’re running systems with a notable volume of traffic, and you consider the volume of alerts and random operational issues that your engineers have to deal with on a daily basis, ask yourself this – is your environment really that different than having a pair of engineers restart 2 servers 500 times per day? In my experience, many companies are in denial about the amount of repetitive, manual operational work they ask of their engineers. Your strategy is likely much closer to throwing bodies at the problem than you’d like to admit.

Companies in growth mode often throw money at the problem in order to maintain focus on adding product features, and that’s a reasonable choice for as long as the economics allow. More established companies tend to focus on controlling costs, and often resort to throwing bodies at the problem – especially if they can utilize less expensive headcount, such as contractors and offshore teams. But neither of these approaches will enable the innovation you need to improve your stability while also improving your products.

We’re all operating somewhere within Rasmussen’s model – proactively and consistently investing in resiliency allows you to deliberately target your location within the boundaries.

Create a culture of operational innovation

Any day that you don’t get more resilient, you get less resilient. Entropy never sleeps; it’s a constant gravitational force, pulling you towards chaos, and the absence of frequent outages isn’t sufficient evidence that you’ve conquered it. You may have managed to push your operating point away from the failure boundary for now, but your efforts may have also caused you to exceed your budget, or exhaust your workforce, which in the long run are fatal outcomes.

Ask your engineers: what is one thing you wish you had time to build in order to improve our resiliency? You’ll be overwhelmed by the volume of creative and effective solutions they come up with. Then, give them the time to implement at least a few of these solutions with your full support. It’s understood that you cannot have 100% of your engineering efforts focused on resiliency 100% of the time, but what matters most is that the effort is focused on continuous improvement, as opposed to just extinguishing the current fires.

Life is short, and best experienced when you end each day a bit smarter and more fulfilled than when you woke up that morning. An operational culture that prioritizes curiosity over fear and investment over reaction can transform your organization, thus enabling you to convert outages into incidents, and incidents into resiliency. It is possible to make operational work effective, efficient, and fun – the first step is deciding that it’s worth the effort to do so.