How to turn an engineering incident into an opportunity

Harnessing the power of incident analysis
April 04, 2022

Turn incidents into valuable learning opportunities by harnessing the power of incident analysis.

A few years ago, there was an engineer acting as the infrastructure lead for an analytics startup. We’ll call them LeadEng McInfraPants. (This definitely wasn’t me! Of course not!)

LeadEng successfully led many projects that greatly improved reliability and scalability. Despite these successes and their deep experience, one day LeadEng set into motion what became a huge, multi-hour incident that took the company’s services completely offline. Most of us would look at this series of events and say ‘LeadEng made a mistake’.

To grow as an organization, however, we need to understand not just the how, but also the why. The change LeadEng made was the most visible contributor; the most salient ‘cause’. This explains the how, but does not even begin to approach the why! LeadEng didn’t wake up that morning, stretch, and say, ‘I can’t wait to ruin production today by running bad commands.’

Every person involved in an incident is making decisions and taking actions they hope will ‘fix’ the problem. There is more to an incident, however, than just a failed service: incident responders are balancing system stability, customer trust, and their own fatigue, to name a few. To learn about this balancing act we must see these situations as opportunities for learning. We can harness them for growth and improvement in our organizations, and we can lead the charge using incident analysis.

How should we respond to incidents?

The typical response to an incident is to write up a timeline and a ‘factual’ representation of the events that occurred. LeadEng made a configuration change to infrastructure at whatever o’clock and that change spiraled into an outage.

These facts, however, neglect to tap into the rich vein of information around why LeadEng did what they did. Why was the change needed? How was the change vetted? Why did LeadEng not understand the consequences? Was LeadEng fatigued from weeks of foot-to-the-floor releases to satisfy a new contract? Finding these questions – and maybe some answers – is the realm of incident analysis.

Incident analysis steps beyond the collection of a dry timeline and aims to first collect then tell the story of the system, the people, and the organization. The goal is to delve into the areas missed by dry timelines.

Using our earlier example: LeadEng made the change to support some larger reliability goals, but they were missing some fundamental information! They didn’t know how the system was really configured, and they lacked critical information about how the system would react. Neither the organization nor LeadEng McInfraPants had ever seen this failure mode!

The above analysis might sound like blame. We keep saying that LeadEng McInfraPants did this or that. ‘Blameless’ culture as a concept is well-intentioned, but in practice may have gone too far. Attributing errors and actions to individuals is critical to understanding and learning. How are we to learn if we don’t acknowledge that someone took action? How can we learn if we don’t then discuss the actions with the person that took them?

Where we often fail is morality: naming what people did ‘wrong’. As mentioned earlier, the work was a series of decisions and actions taken with the intention of helping the system and meeting the organization’s goals. You often hear this sneaky morality in words like ‘should’: LeadEng should have looked at this dashboard or should have run the command this way. During a post-mortem, LeadEng might even judge themselves with ‘I could have done X’. LeadEng did not do these things though, so it’s counterfactual!

To stop at this explanation – that LeadEng is faulty or broken – is where we give up the riches that incident analysis can provide. Why did LeadEng do what they did?

What are the basics of effective incident analysis?

To learn about the actions, we’ve got to throw on our mining hats and start digging. The basic shape of analysis looks like this:

  • Assign someone to do the analysis. They may or may not have been involved in the incident. Most importantly, they need time to do the analysis!
  • Identify the data you can analyze: chat transcripts, system logs, on-call records, etc!
  • Analyze! Go through the data and create a narrative and timeline of important events, collecting ideas about whom you might interview…
  • Interview folks! Talk about what they experienced, the narrative you’re building, and questions or concerns that might come up.
  • Calibrate everyone by sharing the result of analysis and adjusting based on feedback.
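The ‘identify’ and ‘analyze’ steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed tool: the `Event` record, the `build_timeline` and `interview_candidates` helpers, and the sample entries are all hypothetical and invented for this example. The sketch merges entries from several data sources into one chronological timeline and lists the people who appear in it as possible interviewees.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    """One timestamped entry from any data source (hypothetical shape)."""
    at: datetime
    source: str  # e.g. "chat", "system-log", "on-call"
    who: str     # person who produced the entry; empty for automated entries
    text: str


def build_timeline(*sources):
    """Merge several lists of events into one chronological timeline."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.at)


def interview_candidates(timeline):
    """Return people named in the timeline, in order of first appearance."""
    people = []
    for event in timeline:
        if event.who and event.who not in people:
            people.append(event.who)
    return people


# Invented sample data, for illustration only.
chat = [
    Event(datetime(2022, 1, 1, 10, 5), "chat", "LeadEng",
          "Rolling out the config change now"),
    Event(datetime(2022, 1, 1, 10, 20), "chat", "Teammate",
          "Dashboards look wrong, can I help?"),
]
logs = [
    Event(datetime(2022, 1, 1, 10, 12), "system-log", "",
          "error rate above threshold"),
]

timeline = build_timeline(chat, logs)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.who or '-'}: {event.text}")
print("Interview:", ", ".join(interview_candidates(timeline)))
```

A merged list like this is only raw material. The narrative, and the questions worth asking in interviews, still come from a person reading through it.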

Once you’ve done these steps you can meet up with interested parties, discuss the findings, and really learn. The outputs of this effort – a meeting, a report, and the distribution of the results – are intended to be shared, read, and celebrated. One of the biggest distinctions of this approach is the focus on widely sharing the narrative: a human-readable story about what happened and why the people, organization, and systems reacted the way they did. These outputs are not made to be filed, but to be read and shared.

Incident analysis is particularly exciting because anyone can do it. All that’s needed is a bit of time and effort. Knowing how to collect information about your system’s functions can make you self-sufficient, but even those without that familiarity can ask around. Once you’ve gathered the log of the incident (chat, etc.), only time and access to those involved are needed.

What can we learn from an analysis of LeadEng’s incident?

An analysis of our example mistake at the analytics startup would yield some valuable insights. LeadEng wasn’t familiar enough with the systems in question. The organization had not tested for the failure modes that arose. LeadEng was fatigued from lots of work. 

Not all of the things we find are bad or need fixing. Teammates rallied to help LeadEng, giving a break when needed. Leadership understood the importance of failure and didn’t apply extra pressure. The support teams quickly and effectively communicated. The sum of all of this is a really interesting story with some ups and downs. When the analysis is composed and shared as a story, people are much more interested in reading it or attending a talk about it than they would be in a dry recitation of timestamps, commands run, and passive-aggressive blame language.

LeadEng, despite being implicated as a contributor to the incident, can be proud of helping the organization grow. Yay LeadEng!

How can you kickstart incident analysis in your organization?

If you’re convinced this is a cool idea, you might wonder how to get started:

  • Carve out some time to look into something in your organization that could benefit from analysis. It’s probably an incident, but could also be something like a successfully completed ship or change.
  • Read up on what happened and write a rough outline of the narrative.
  • Talk through this narrative with some folks involved and weave in their insights and ideas.
  • Share your work with them again as you wrap up and see if they have other feedback.

Finally, send the results far and wide! If your organization has an existing post-mortem process, perhaps attach this narrative to that process’s output. In my career, I’ve seen this work done solo, by informal groups that pick interesting incidents, or even by full-blown teams with headcount and budgets! The time spent on the analysis can easily be adjusted. Less time means fewer or shorter interviews, or chasing down fewer questions. Balancing these kinds of trade-offs is often the domain of lead engineers and should be somewhat familiar. Cheap, fast, good: pick two!

With this sort of learning effort in place, you can start to look past ‘shallow’ incident metrics like duration or count. Instead, you can begin looking at the number of teams involved, systems involved, or number of people involved in the response. Heck, even capturing sentiment about staffing levels or familiarity with systems can be intriguing.
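As a rough sketch of what capturing these richer measures could look like, the Python below records who and what was involved in one analysis and summarizes it. Everything here is an assumption made for illustration: the `IncidentAnalysis` record, its field names, the 1 to 5 familiarity scale, and the sample values are invented, not taken from a real incident or tool.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class IncidentAnalysis:
    """Hypothetical record of what one incident analysis uncovered."""
    title: str
    teams: set = field(default_factory=set)
    systems: set = field(default_factory=set)
    responders: set = field(default_factory=set)
    # Self-reported familiarity with the affected systems, 1 (low) to 5 (high).
    familiarity: list = field(default_factory=list)


def summarize(analysis):
    """Count the teams, systems, and people involved; average the sentiment."""
    return {
        "teams_involved": len(analysis.teams),
        "systems_involved": len(analysis.systems),
        "responders": len(analysis.responders),
        "avg_familiarity": (
            mean(analysis.familiarity) if analysis.familiarity else None
        ),
    }


# Invented sample values, for illustration only.
example = IncidentAnalysis(
    title="Config change outage",
    teams={"infrastructure", "support"},
    systems={"config-service", "ingest-pipeline"},
    responders={"LeadEng", "Teammate", "SupportLead"},
    familiarity=[2, 4, 3],
)
summary = summarize(example)
print(summary)
```

Numbers like these are prompts for conversation across many analyses rather than targets to optimize.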

You can help your organization look into the analysis for ways to measure improvement. Incidents will always happen so we must shape our organization to recognize the opportunities for improvement across all the analyses. These metrics will change and develop over time, so make sure you don’t get too fixated!

(For more guidance on incident analysis, check out Jeli’s Incident 101 Series which covers each phase in detail.)

Reflections

In summary, LeadEng and the rest of the organization experienced an incident and used it as an opportunity to learn rather than a situation requiring judgment. This analysis yielded powerful insights into how we prepare and handle changes.

You can lead this charge in your own organization, either individually or by supporting others who might do the analyses. Sharing the results as findings to be read rather than reports to be filed can bring more analysts to the cause. Convincing your management to support these analyses is usually a problem of showing the insights. In turn, leadership’s support can rally more resources to the cause. These analyses, by nature, can be dialed up or down to accommodate the appetite your leadership will support.

LeadEng might have taken down production, but they also learned a lot. The organization learned a lot too by talking to LeadEng and others involved. Asking a few questions can crack open many gaps, shortcomings, and ideas that are otherwise hidden behind the shroud of blame.

Incidents are commonly the most profound moments wherein the expectation of our work differs from its reality. By pushing for investment in this area, you stand to reshape your entire organization by helping everyone in it learn.