Building better on-call routines for engineering teams

Mentions of on-call are usually met with grumbles. Here’s how to flip the narrative and build a more resilient and empathetic process.

July 02, 2024

A couple of years ago, I took over a role overseeing a critical part of our infrastructure. It was a complex undertaking that, left unmanaged, would have had a huge blast radius in terms of major outages and business disruption. Projects like this require well-equipped engineers who know the systems and their dependencies, as well as a solid group of engineers on call during off-hours for emergencies.

In my case, I was met with a well-intentioned group that was unprepared for the unpredictability of on-call duties. I set out to shift on-call from a constant stressor to a picture of stability.

The stress of being on-call

According to Honeycomb CTO Charity Majors, on-call duties represent a significant socio-technical challenge and can be one of the most stressful aspects of a developer’s career. That stress stems from several core challenges:

Unpredictable work hours: On-call requires engineers to be available to respond to emergencies during off-hours. This disrupts personal time and leads to an unpredictable work-life balance.

High stakes, high pressure: On-call means dealing with failures that can disrupt significant business operations, and those stakes bring considerable pressure. Handling complex issues alone during off-hours adds a sense of isolation as well.

Lack of preparation: Without proper training, preparation, and practice, engineers will feel under-equipped to handle the issues they might encounter. This creates anxiety and the fear of making wrong decisions, which further escalates the situation.

Alert fatigue: Frequent, non-critical alerts can desensitize engineers, making it difficult to recognize real issues. This can lead to slower response times, missed critical alerts, compromised system reliability, and increased stress. Over time, it also contributes to overall job dissatisfaction.

Why do we need on-call processes?

Despite its challenges, being on-call is important for maintaining the health and resilience of production systems. Someone needs to be responsible for your services in the off-hours. 

Being close to production systems as an on-call engineer is important as it ensures:

Business continuity: Quick response times are critical in minimizing downtime and mitigating impact on end users and business operations. They are the difference between a minor hiccup and a major outage.

In-depth understanding of systems: Being immersed in the system lets you deeply understand the production environment’s nuances. It helps to build a culture where team members focus on optimizing performance and preventing issues, rather than merely reacting to them. 

Improved soft skills: On-call responsibilities push engineers to develop a broad set of skills, including crisis management, quick decision-making, and effective communication. These skills are also valuable in broader professional contexts and for career growth.

Adapting to a fearless on-call framework

The on-call reality check

In my team, when we initially adapted to the on-call system, we faced a patchwork of quick fixes and a noticeable lack of confidence during deployments. The engineers, though resilient, operated in silos. Each on-call night was a solo (mis)adventure. It was also clear that without better preparation and support, our system’s fragility would soon cause a major outage. We had a ticking time bomb on our hands.

Our transformation began with simple, small foundational steps. 

Establishing a pre-on-call checklist

A pre-on-call checklist is a simple but crucial way to make sure all necessary steps are taken before an engineer begins their shift. It reduces the risk of being underprepared, prevents oversights, and promotes a proactive approach to incident management. The list we created had the following categories, each with specific tasks underneath:

  • Squad-level training
    Training specific to the squad’s on-call responsibilities. This includes role-specific tasks, general on-call procedures, familiarization with the necessary tools and architectures, and simulation exercises.
  • Onboarding documentation
    Have clear documentation so engineers understand their role and responsibilities during the on-call process. This should ideally include detailed descriptions of each role, procedures for handling incidents, escalation paths, and key contacts. Also validate that this documentation is easily accessible and regularly updated.
  • Incident response guidelines
    Guidelines for handling and managing incidents. These should be comprehensive, potentially including training courses, organizational policies, and escalation procedures.
  • On-call schedule
    Provide detailed information about the on-call rotation, including who is on-call and when. Also, check that all scheduled shifts are documented and added to each engineer’s calendar using their relevant scheduling tools. This helps everyone stay informed and prepared. 
  • On-call tools and access
    Make sure that all the necessary tools and levels of access required for on-call duties are granted for engineers. This may include monitoring dashboards, AWS queues, and OpsGenie.
  • Communication channels
    Join relevant Slack channels and Google Groups. This could include channels where information about incidents is posted or discussed, such as an #all-incidents channel in your org or department-specific incident channels. Additionally, join channels for stability groups in your org, cross-functional updates that could impact your services (such as #marketing-updates), and any other channels providing crucial information that might affect users. Creating or joining a dedicated on-call coordination channel with all the members can also facilitate real-time communication during incidents.
  • Runbooks and troubleshooting guides
    Make sure that all team members have access to the relevant runbooks and review them regularly to familiarize themselves with common issues and their resolutions.
  • Post-mortem process
    Establish or solidify a process for analyzing and learning from incidents. This process should ideally include conducting post-mortem meetings where incidents are reviewed in detail to identify root causes and areas for improvement. Engineers should understand the importance of a blameless approach, focusing on learning and improvement rather than assigning fault.
  • Emergency contacts
    Ensure that all on-call members have access to a list of contacts in the case of emergencies. It’s crucial that the team knows when and how to contact emergency personnel.
  • Periodic review
    Attend on-call process review, stability, and operations meetings for your squad or organization. These meetings should take place regularly, such as bi-weekly or monthly, for continuous improvement. Regular reviews help refine the on-call process and promote effective operations.
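
One way to keep a checklist like this honest is to make it machine-checkable, so nobody starts a shift with items silently skipped. Below is a minimal Python sketch of that idea; the `PreOnCallChecklist` class, the item names, and the `ready_for_rotation` helper are all hypothetical and would need to be adapted to your own process and tooling.

```python
from dataclasses import dataclass, field


@dataclass
class PreOnCallChecklist:
    """Tracks which pre-on-call items an engineer has completed (illustrative)."""

    engineer: str
    completed: set = field(default_factory=set)

    # Item names mirror the categories above; adapt them to your own process.
    REQUIRED_ITEMS = frozenset({
        "squad_training",
        "onboarding_docs_read",
        "incident_guidelines_read",
        "shift_in_calendar",
        "tool_access_granted",
        "comm_channels_joined",
        "runbooks_reviewed",
        "postmortem_process_known",
        "emergency_contacts_saved",
    })

    def missing_items(self) -> set:
        return set(self.REQUIRED_ITEMS) - self.completed

    def ready_for_rotation(self) -> bool:
        return not self.missing_items()


checklist = PreOnCallChecklist(engineer="alex")
checklist.completed.update({"squad_training", "runbooks_reviewed"})
print(checklist.ready_for_rotation())     # False: several items still open
print(sorted(checklist.missing_items()))  # what to finish before the first shift
```

In practice, something like this could feed a bot that posts the missing items to your on-call coordination channel ahead of each rotation change.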

Once the engineers got comfortable with the pre-on-call checklist, we added them to the rotation. While they understood the challenges in theory, we reminded them that handling a real incident could be more intense and complex.

If your organization maintains a separate payment structure for on-call engineers, make sure new rotation members are properly included in it. Pagerly has a nice guide to various on-call payment strategies.

Wheel of misfortune: Role-playing issues for practice

We introduced the “wheel of misfortune,” a role-playing game inspired by the Google Site Reliability Engineering (SRE) approach. The idea is simple: we simulate service disruptions to test and improve the response capabilities of on-call engineers in a controlled environment. We conducted some of these simulations to help teams improve their preparedness for real incidents.

“If you have played any role-playing game, you probably already know how it works: a leader such as the Dungeon Master, or DM, runs a scenario where some non-player characters get into a situation (in our case, a production emergency) and interact with the players, who are the people playing the role of the emergency responders,” wrote Jesus Climent, a systems engineer at Google Cloud, in 2019.

It’s an effective way to ensure that engineers can handle rare but critical incidents confidently and competently.
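
A session can be as low-tech as a list of scenarios and a random pick. The sketch below shows that shape in Python; the scenarios and engineer names are made up, and a real game day would draw its scenarios from your own past incidents and runbooks.

```python
import random

# Hypothetical scenarios; in practice, pull these from past outages and runbooks.
SCENARIOS = [
    "Primary database failover did not complete; replicas are read-only",
    "Payment queue backlog growing; consumers are crash-looping",
    "Upstream auth service returning intermittent 500s",
    "Disk nearly full on the logging cluster; ingestion is lagging",
]


def spin_wheel(engineers, seed=None):
    """Pick a responder and a scenario for a game-day exercise."""
    rng = random.Random(seed)
    return rng.choice(engineers), rng.choice(SCENARIOS)


responder, scenario = spin_wheel(["alex", "priya", "sam"], seed=42)
print(f"{responder}, your pager just fired: {scenario}")
```

The person running the exercise then plays the system (and any non-player characters), feeding the responder symptoms as they investigate, while the rest of the team observes and takes notes for the debrief.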

Improving on-call readiness through data

Stage 1: Verification of data

The next step was to ensure we had the right kind of data for good alerts and appropriate dashboards. This step was important because accurate and relevant data is the backbone of effective monitoring and incident response. We also scrutinized whether our understanding of “good” alerts and dashboards was correct in the first place. Verifying data quality and relevance helps prevent false positives and ensures that alerts are reliable indicators of actual issues.

A successful on-call system relies on the ability to link incidents with key system performance or business metrics. For each service, we ensured we had the right metrics, tracked them continuously, and had a dashboard for an overview.
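
One lightweight way to make that link explicit is a small, version-controlled manifest mapping each service to its key metrics and overview dashboard, which can then be audited automatically. The sketch below is illustrative only; the service names, metric names, and dashboard URLs are placeholders rather than any specific tool’s format.

```python
# Hypothetical per-service manifest; every name and URL here is a placeholder.
SERVICES = {
    "checkout-api": {
        "key_metrics": ["request_error_rate", "p99_latency_ms", "orders_per_minute"],
        "dashboard": "https://grafana.example.com/d/checkout-overview",
    },
    "payment-worker": {
        "key_metrics": ["queue_depth", "processing_failures", "retry_rate"],
        "dashboard": "",  # deliberately missing, to show what the audit catches
    },
}


def audit_coverage(services):
    """Flag services that lack key metrics or an overview dashboard."""
    for name, spec in services.items():
        if not spec.get("key_metrics"):
            print(f"{name}: no key metrics defined")
        if not spec.get("dashboard"):
            print(f"{name}: no overview dashboard linked")


audit_coverage(SERVICES)  # prints: payment-worker: no overview dashboard linked
```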

Stage 2: Refining alerts and dashboards

To mitigate alert fatigue – a common issue in on-call systems – we overhauled our monitoring processes:

  • Alert review: Regular reviews of false positives were set up to reduce unnecessary noise and refine alert sensitivity. The main idea was to prevent desensitization that can come from frequent, unneeded alerts. 
  • Dashboard review: We started triaging our dashboards to figure out whether we had the right widgets, proper access, visibility, and overall usefulness. Some key questions we considered: Is the dashboard visible at the right time? Is it actionable? Is it actually used? In a recent blog post, Adrian Howard shared a list of great questions you can ask to triage your dashboards.
  • Leveraging golden signals: We used our golden signals dashboard to track primary business metrics, traffic, and error rates. The four golden signals are latency, traffic, errors, and saturation. Latency measures response times, traffic monitors demand, errors track failed requests, and saturation captures resource usage. This helps on-call engineers assess the health of the system quickly and understand the impact of any changes or issues.
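
To make the four signals concrete, here is a minimal sketch that derives them from a handful of request records. The records, the time window, and the CPU figure are made-up stand-ins; in a real setup these values come from your metrics backend rather than in-process lists.

```python
# Hypothetical request records for one service over a 60-second window.
requests = [
    {"latency_ms": 42, "status": 200},
    {"latency_ms": 87, "status": 200},
    {"latency_ms": 960, "status": 503},
    {"latency_ms": 55, "status": 200},
]
window_seconds = 60
cpu_utilization = 0.72  # sampled separately, standing in for saturation

latencies = sorted(r["latency_ms"] for r in requests)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]  # rough p95, fine for a sketch
traffic_rps = len(requests) / window_seconds
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)

print(f"latency (p95): {p95_latency} ms")
print(f"traffic: {traffic_rps:.2f} req/s")
print(f"errors: {error_rate:.1%}")
print(f"saturation (CPU): {cpu_utilization:.0%}")
```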

Stage 3: Increasing awareness

Once the team had grasped the process and was comfortable with our dashboards, deployments, and health metrics, we began identifying other areas for optimization.

  • Dependency understanding and monitoring: Understanding your upstream and downstream dependencies is important for incident response. We started by creating architecture diagrams so that everyone understood the information flow. These diagrams clarified how different components interact: if service A depends on service B, and service B depends on service C, that chain is visible at a glance. Then, we set up dashboards to monitor these dependencies, allowing us to quickly identify and address any unusual behavior (a minimal health-check sketch follows this list). This helped in maintaining a clear overview of the system’s interdependencies.
  • Dependency communication: We ensured that all engineers knew how to quickly reach out to teams and services we depend on during an incident.
  • Runbooks: We improved our runbooks by making them more accessible. Each runbook held enough detail and example scenarios to let on-call engineers jump straight into problem-solving.
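
As a rough illustration of the dependency monitoring described above, the sketch below polls hypothetical health endpoints for two downstream services. The URLs are placeholders; a real check would hit your services’ actual health endpoints or query your monitoring system instead.

```python
import urllib.request

# Placeholder endpoints; substitute your services' real health checks.
DEPENDENCIES = {
    "service-b": "https://service-b.internal.example.com/healthz",
    "service-c": "https://service-c.internal.example.com/healthz",
}


def check_dependencies(deps, timeout=2):
    """Return a simple up/down view of each dependency."""
    status = {}
    for name, url in deps.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                status[name] = "up" if resp.status == 200 else f"degraded ({resp.status})"
        except Exception as exc:
            status[name] = f"down ({exc.__class__.__name__})"
    return status


for name, state in check_dependencies(DEPENDENCIES).items():
    print(f"{name}: {state}")
```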

Reducing on-call stress

Once things got into a routine, we started optimizing areas that were causing additional effort or knowledge gaps. 

Ritualizing the handover: We turned the handover process into a ritual. Before an engineer’s on-call shift begins, the outgoing engineer formally reviews the system status, ongoing issues, recent incidents, and potential risks. This fully prepares the incoming engineer before their shift.

Celebrating on-call successes: We began to celebrate on-call successes. In each post-on-call review, we acknowledged the challenges faced and victories won. This could range from managing a critical incident to proactively solving a problem before it escalated.

Continuous improvement and knowledge-sharing

Regular post-incident reviews: We held blameless post-incident reviews after each incident to learn and improve. Each session included a root-cause analysis with the involved on-call team, focusing on understanding issues without assigning fault.

Monthly ops-review sessions: We conducted monthly operational review meetings to maintain and enhance our on-call processes. The agenda for these sessions included:

  • Reviewing previous action items
  • Status of on-call rotations: Doing a health check on the current on-call system. This includes introducing new team members to the rotation, reviewing the status of the current group, verifying that all shifts are covered, and assessing the workload
  • Recent incidents: Discuss any incidents that have occurred since the last meeting
  • Alert reviews: Review mean time to recovery (MTTR) reports (see the MTTR sketch after this list)
  • Lessons learned: Share insights gained from recent alerts or incidents
  • Pain points: Share the challenges faced by the team and brainstorm solutions to mitigate these issues
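
Since MTTR comes up in every review, it helps to agree on exactly how it is computed. Here is a minimal sketch with made-up incident timestamps; real data would be exported from whatever incident tracker you use.

```python
from datetime import datetime, timedelta

# Hypothetical incident records for the review period.
incidents = [
    {"opened": datetime(2024, 6, 3, 22, 14), "resolved": datetime(2024, 6, 3, 23, 2)},
    {"opened": datetime(2024, 6, 11, 4, 40), "resolved": datetime(2024, 6, 11, 5, 55)},
    {"opened": datetime(2024, 6, 20, 13, 5), "resolved": datetime(2024, 6, 20, 13, 31)},
]


def mean_time_to_recovery(records) -> timedelta:
    """Average time from an incident being opened to being resolved."""
    durations = [r["resolved"] - r["opened"] for r in records]
    return sum(durations, timedelta()) / len(durations)


print(f"MTTR this period: {mean_time_to_recovery(incidents)}")  # 0:49:40 for this sample
```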

Knowledge base and documentation: Our knowledge base contains documentation, runbooks, and playbooks. It’s continuously updated and linked to our monitoring tools, providing on-call engineers with context-specific guidance during incidents.

What’s next?

Once the firefighting has stopped and you have some time to breathe, that’s the moment to start actively investing in SLOs (Service Level Objectives) and SLIs (Service Level Indicators). They are the only way for your team to be in proactive mode rather than continuously reacting to every single thing that happens. SLOs define the target reliability for your services, like 99.9% uptime. SLIs are the metrics that measure how well you’re meeting those targets, such as the actual uptime percentage.

By establishing and monitoring SLOs and SLIs, your team can focus on meeting predefined performance targets rather than continuously reacting to incidents.
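
As a quick worked example of the arithmetic (with illustrative numbers): a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of error budget, and the SLI is simply the availability you actually measured over the same window.

```python
# Worked example with illustrative numbers; plug in your own SLO and window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                              # 43,200 minutes in 30 days
error_budget_minutes = (1 - slo_target) * window_minutes   # 43.2 minutes of allowed downtime

observed_downtime_minutes = 12                             # measured over the same window
sli = 1 - observed_downtime_minutes / window_minutes       # the SLI: measured availability
budget_remaining = error_budget_minutes - observed_downtime_minutes

print(f"SLI: {sli:.4%}")                                   # 99.9722%
print(f"Error budget remaining: {budget_remaining:.1f} minutes")  # 31.2 minutes
```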

Final thoughts 

“It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on-call does not suck. This is a handshake, it goes both ways…” – Charity Majors

Effective on-call strategies reflect the way engineering leaders structure and value their systems. Your approach to on-call will depend on your organization’s specific needs and history. Adapt your strategy to what works best for your team, considering both your requirements and what you can do without.

This proactive stance isn’t just about maintaining stability; it’s about setting new standards for your operational resilience. It will take effort, it will take maintenance, and it will need constant iteration and ongoing care. The good thing is, it’s a solvable problem.