From on-call firefighting to future-proofing

How to reduce on-call load by 66% and the principles that will help you get there.
July 16, 2025

Estimated reading time: 10 minutes

6 ways to improve your approach, plus an inside look at how we cut on-call load by 66%.

Every engineering leader dreams of a world where their team isn’t woken up at 3 AM by a failing service. Yet for many, that world feels out of reach. Pager fatigue, repeated incidents, and brittle systems often make on-call feel more like firefighting than engineering. But there’s a better way – building systems that anticipate failure and recover on their own.

I’ve seen how intentional design and a shift toward self-healing infrastructure can transform both developer productivity and system reliability, all through practical strategies that reduce operational burden without sacrificing velocity or innovation.

How we reduced on-call load at X 

A few years ago, when I was working at X (previously Twitter) as an engineering manager, we had one large thorn in our operations: the search infrastructure team had an on-call load of 56% – meaning that 56% of the time someone was on-call, they were being paged. Engineers were drained. We were solving the same issues over and over. Requests came in with no triage, ranging from basic questions to severe indexing failures. If our index went down, recovery took over a week.

So we took a different approach. We formed a dedicated squad focused on operational stability. 

We made it our mission to never solve the same problem twice. That meant eliminating ambiguity and duplication in how we responded to operational issues. We created detailed tutorials and internal knowledge banks to capture answers to recurring questions, thus freeing engineers from having to explain the same things in Slack or debug similar issues from scratch. Documenting query paths directly in code gave everyone, from new team members to client engineers, a clearer view of how the system behaved, enabling faster root cause identification.

These resources weren’t just for infrastructure teams; they were for the broader organization. Product and machine learning (ML) teams needed to understand how to safely interact with our search systems, and these materials gave them the confidence (and autonomy) to do so. The result? Fewer ad-hoc requests, fewer interruptions, and a measurable drop in repetitive support tickets.

We also invested in index snapshots, which meant that when failures happened, we weren’t starting from scratch. Previously, recovering from a full index failure could take up to a week. With snapshots, we brought recovery time for multi-terabyte indexes down to under 48 hours. 
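
For teams running Elasticsearch, this kind of snapshot-and-restore workflow can be driven through its snapshot REST API. Below is a minimal sketch, not our production tooling – the cluster endpoint, repository name, and index pattern are placeholders.

```python
# Minimal sketch of snapshot-and-restore via Elasticsearch's snapshot REST API.
# Hostnames, repository names, and index patterns are illustrative placeholders.
import requests

ES = "http://localhost:9200"        # assumed cluster endpoint
REPO = "search_snapshots"           # assumed snapshot repository name

def register_repository(location: str) -> None:
    """Register a shared-filesystem snapshot repository."""
    requests.put(
        f"{ES}/_snapshot/{REPO}",
        json={"type": "fs", "settings": {"location": location}},
        timeout=30,
    ).raise_for_status()

def take_snapshot(name: str) -> None:
    """Snapshot all indexes; runs in the background unless wait_for_completion is set."""
    requests.put(
        f"{ES}/_snapshot/{REPO}/{name}",
        params={"wait_for_completion": "false"},
        timeout=30,
    ).raise_for_status()

def restore_snapshot(name: str, index_pattern: str = "search-*") -> None:
    """Restore matching indexes from a snapshot instead of rebuilding from source."""
    requests.post(
        f"{ES}/_snapshot/{REPO}/{name}/_restore",
        json={"indices": index_pattern, "include_global_state": False},
        timeout=30,
    ).raise_for_status()
```

Restoring from a recent snapshot turns recovery into copying data back into the cluster rather than re-indexing everything from source, which is what makes the difference between days and hours.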

To bring structure, we built a JIRA portal that categorized requests – FAQs, ingestion, search syntax, Apache Lucene issues, Elasticsearch usage. That alone made triage faster and more consistent.

We also gave product teams sandbox environments – so they could experiment with the search infrastructure without depending on us. Before releasing anything, we ran pre-mortems to proactively surface failure modes. Some teams needed more hands-on support early on – this included helping them onboard to our tools, navigate configuration changes, and learn best practices for interacting with the infrastructure. Providing this support required time and attention, but it created champions within those teams who helped scale adoption. Over time, the number of questions dropped, their confidence increased, and we saw higher-quality usage patterns with fewer errors.

To ensure knowledge sharing and redundancy, we rotated engineers through the squad every six months. Eventually, our on-call load dropped to less than 20%, repeat incidents disappeared, and more than 30% of our time was unlocked for innovation.

The 6 principles of well-structured on-call systems

High on-call load and burned-out engineers are problems many engineering organizations face, especially those building complex distributed systems.

Here are six principles I’ve learned along the way that can help streamline on-call systems.

1. Design for discoverability and resilience

Most teams build dashboards and alerts reactively, after something breaks. But when you design them upfront, they become invaluable tools for debugging, onboarding, and learning.

A good dashboard answers:

  • What’s the current health of my system?
  • What changed recently?
  • What’s the most likely root cause?

When paired with traceable logs, tagged metrics, and curated runbooks, dashboards turn into guides, not just monitors. We also embedded structured tagging directly into code to improve discoverability and transparency. This small change made it easier to trace features to logs and identify failure patterns.
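
As a rough illustration of embedding tags in code, here’s a minimal sketch using Python’s standard logging. The feature and component names are hypothetical; the idea is that every log line a code path emits carries the same searchable tags.

```python
# Sketch: attach structured tags (feature, component) to every log line a
# code path emits, so dashboards and log search can trace a feature to its
# failure patterns. Tag names and the example query path are hypothetical.
import functools
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s feature=%(feature)s component=%(component)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("search")

def tagged(feature: str, component: str):
    """Decorator that injects consistent tags into every log line of a code path."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            extra = {"feature": feature, "component": component}
            log.info("enter %s", fn.__name__, extra=extra)
            try:
                return fn(*args, **kwargs)
            except Exception:
                log.exception("failure in %s", fn.__name__, extra=extra)
                raise
        return inner
    return wrap

@tagged(feature="typeahead", component="query_path")
def run_query(text: str) -> list[str]:
    return text.strip().lower().split()  # stand-in for the real query path

if __name__ == "__main__":
    run_query("Self-Healing Systems")
```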

2. Build for self-serve and transparency

In many organizations, every feature change needs coordination with the infrastructure team. This is where platform engineering becomes critical. Building reusable tools, capabilities, and practices that product teams can adopt without relying on infrastructure engineers can be slow and may add operational load in the short term, but the payoff is immense.

However, enabling full self-serve capabilities isn’t always straightforward. Sometimes, product teams prefer that infrastructure teams abstract away complexity and handle operational tasks. The key is finding the right layer of abstraction. Rather than building one-size-fits-all tools, we partnered with product teams to understand their specific needs and built targeted self-serve capabilities. These weren’t always UI-driven – sometimes it was about writing better documentation, exposing a configuration layer, enabling a way to experiment and preview changes, or providing safe defaults.
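
As a sketch of what exposing a configuration layer with safe defaults might look like (the field names and limits below are invented for illustration):

```python
# Sketch of a small configuration layer with safe defaults and validation,
# letting product teams tune search behavior without touching infra internals.
# Field names and limits are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchClientConfig:
    timeout_ms: int = 200          # safe default: fail fast rather than pile up
    max_results: int = 50          # caps fan-out from a single query
    retries: int = 1               # a single retry avoids retry storms
    index_alias: str = "search-prod"

    def __post_init__(self) -> None:
        if not 10 <= self.timeout_ms <= 2000:
            raise ValueError("timeout_ms must be between 10 and 2000")
        if self.max_results > 500:
            raise ValueError("max_results above 500 needs an infra review")

# Teams override only what they need; everything else stays on safe defaults.
config = SearchClientConfig(max_results=100)
```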

With greater self-serve comes the potential for more repetitive or monotonous work on the infrastructure side.  So, we made investments to reduce our own operational load. 

We automated repetitive steps like creating internal dashboards from configuration files, generating metadata from code annotations, and pushing environment changes through safe, pre-tested templates. We built a searchable FAQ for debugging edge cases and created command-line tools that could be used by any team to validate configurations. 
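
A minimal sketch of such a command-line validator might look like this – the expected keys and limits are hypothetical:

```python
#!/usr/bin/env python3
# Sketch of a command-line validator any team could run before shipping a
# config change. The expected keys and limits are hypothetical.
import argparse
import json
import sys

REQUIRED_KEYS = {"index_alias", "timeout_ms", "max_results"}

def validate(path: str) -> list[str]:
    """Return a list of human-readable problems found in the config file."""
    problems = []
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if cfg.get("timeout_ms", 0) > 2000:
        problems.append("timeout_ms > 2000 will trip upstream deadlines")
    return problems

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Validate a search config file")
    parser.add_argument("config", help="path to a JSON config file")
    args = parser.parse_args()
    issues = validate(args.config)
    for issue in issues:
        print(f"ERROR: {issue}")
    sys.exit(1 if issues else 0)
```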

Where applicable, we also began to leverage generative AI to assist with documentation and repetitive code workflows – automatically generating usage examples, summarizing logs, or even pre-populating internal tickets based on error patterns. These efforts helped us avoid repetitive Slack threads and made postmortems faster and simpler.

In parallel, we also focused on making our infrastructure more transparent, as opaque infrastructure makes it harder for teams to troubleshoot or improve. We visualized dependencies and data flows through live topologies (real-time, interactive maps of how services and components connect) and labeled parts of the codebase by performance sensitivity (how components behave under varying load). This showed us which components were most central or failure-prone to the system’s overall functioning and let us document their common failure modes.
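
Even a very simple dependency map makes that centrality visible. The sketch below uses invented service names and counts fan-in: the more services that depend on a component, the more widely a failure there radiates.

```python
# Sketch: a dependency map as plain data, plus a fan-in count to flag the
# components most central to overall functioning. Service names are made up.
from collections import Counter

# "service": [services it calls]
DEPENDENCIES = {
    "web": ["search-api", "user-service"],
    "search-api": ["query-parser", "index-store"],
    "ranking": ["index-store"],
    "ingest": ["index-store"],
}

def fan_in(deps: dict[str, list[str]]) -> Counter:
    """Count how many services depend on each component; high fan-in means
    a failure there radiates widely."""
    counts = Counter()
    for callees in deps.values():
        counts.update(callees)
    return counts

for component, count in fan_in(DEPENDENCIES).most_common():
    print(f"{component}: depended on by {count} services")
```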

3. Automate recovery for better future-proofing 

Runbooks – or step-by-step guides – are useful, but automation is better. Recovery tasks, like restarting a system after a crash or scaling up, can often be triggered automatically by health checks or error-rate thresholds (too many errors in a short period).
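
A minimal sketch of that pattern – poll a health endpoint, count consecutive failures, and trigger a restart once a threshold is crossed – might look like this, with the endpoint and restart command as placeholders:

```python
# Sketch of threshold-based auto-recovery: poll a health endpoint and restart
# the service once consecutive failures cross a threshold. The endpoint and
# restart command are placeholders.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # assumed health endpoint
FAILURE_THRESHOLD = 3
CHECK_INTERVAL_S = 10

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def watch() -> None:
    failures = 0
    while True:
        failures = 0 if healthy() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            # Placeholder restart; in practice this might be systemd, Kubernetes,
            # or an internal orchestration API.
            subprocess.run(["systemctl", "restart", "search-worker"], check=False)
            failures = 0
        time.sleep(CHECK_INTERVAL_S)
```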

Techniques like circuit breakers, backpressure handling, and adaptive scaling help isolate issues and keep systems stable under pressure. 
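
A circuit breaker, for example, fits in a few dozen lines. The sketch below is illustrative rather than production-ready; the thresholds and cool-down would be tuned per service.

```python
# Sketch of a minimal circuit breaker: after repeated failures the breaker
# "opens" and calls fail fast until a cool-down passes. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage: breaker.call(search_client.query, "some text") instead of calling directly.
```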

We also automated post-incident workflows: log tagging, root cause snapshots, and retrospective templates. These tools weren’t just about cleaning up after the fact – they made it easier to detect patterns, document fixes, and share learnings broadly. For example, imagine a launch operator preparing a filter for a high-profile live event, only to discover the expected titles aren’t appearing. They have little knowledge of the deep system internals behind how a filter is set up. In the past, this would have meant a last-minute escalation to engineering. But with automated tagging and clear incident snapshots tied to configuration issues, the operator could pinpoint the problem and resolve it independently, without waiting on a developer.

These systems empowered our partners at the edge, added velocity to the last mile, and allowed engineers to stay focused on long-term improvements rather than repetitive triage. 

4. Prioritize testing discipline

Reliability begins long before a system goes live. One of the most effective ways to build confidence in your code is through test-driven development (TDD). For example, before adding a new feature, developers write small tests that define what the feature should do. This helps catch mistakes early, so issues don’t slip into production.
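
As a toy example of writing the test first (the function and its behavior are invented for illustration):

```python
# Sketch of test-first development: the tests below are written before the
# feature exists and pin down the expected behavior. Names are hypothetical.

def normalize_query(text: str) -> str:
    """Feature under test: trim whitespace and lowercase the query."""
    return " ".join(text.lower().split())

def test_normalize_query_strips_and_lowercases():
    assert normalize_query("  New   York ") == "new york"

def test_normalize_query_handles_empty_input():
    assert normalize_query("") == ""
```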

Strong unit tests check individual pieces of code. Integration tests go a step further, verifying that different parts of the system work together correctly – such as ensuring a payment service properly communicates with the billing system.
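
One lightweight way to exercise that interaction is to stand in a fake billing client and assert that the payment path actually calls it; the services below are hypothetical stand-ins:

```python
# Sketch of an integration-style check that the payment service actually
# talks to the billing system. The services here are hypothetical stand-ins.

class FakeBilling:
    def __init__(self):
        self.invoices = []

    def create_invoice(self, user_id: str, amount_cents: int) -> None:
        self.invoices.append((user_id, amount_cents))

class PaymentService:
    def __init__(self, billing):
        self.billing = billing

    def charge(self, user_id: str, amount_cents: int) -> None:
        # In production this would also talk to a payment gateway.
        self.billing.create_invoice(user_id, amount_cents)

def test_charge_creates_a_billing_invoice():
    billing = FakeBilling()
    PaymentService(billing).charge("user-42", 1999)
    assert billing.invoices == [("user-42", 1999)]
```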

We also relied on regular regression tests, which automatically ran a suite of tests every night to catch unexpected breakages. For instance, if we had deployed a change that unintentionally slowed down search results, tests would catch that problem before customers noticed.
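
A nightly latency guard can be as simple as timing the call and asserting it stays under an agreed budget; the search stub and budget below are illustrative:

```python
# Sketch of a nightly regression check that fails if search latency creeps
# past an agreed budget. The search call and budget are illustrative.
import time

LATENCY_BUDGET_S = 0.5

def search(query: str) -> list[str]:
    time.sleep(0.05)          # stand-in for the real search call
    return [query]

def test_search_stays_within_latency_budget():
    start = time.perf_counter()
    results = search("reliability")
    elapsed = time.perf_counter() - start
    assert results, "search returned no results"
    assert elapsed < LATENCY_BUDGET_S, f"search took {elapsed:.3f}s"
```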

Testing isn’t just a checkbox to mark off at the end of development. It’s a safety net that protects the system from unexpected issues. As systems grow more complex, these testing habits keep problems small, manageable, and less costly to fix.

5. Stick to first principles

Before diving into advanced tools and automation, it’s worth stepping back to focus on some foundational engineering practices. These principles might seem basic or even obvious, but they play a crucial role in building systems that are easier to maintain, understand, and evolve.

  • Modular functions and classes help break down complex problems into smaller, manageable pieces. The codebase becomes easier to read and debug, and enables teams to reuse components rather than rewriting code.
  • Single-responsibility components ensure that each part of the system does one thing well. This reduces unexpected side effects and makes it easier to isolate issues when they arise.
  • Class-level documentation provides clear guidance on what each piece of code does, helping new team members onboard faster and reducing guesswork during maintenance.
  • Rigorous code reviews act as a quality gate, catching potential problems early and encouraging knowledge sharing across the team.
  • Guardrails like linters and static analysis tools automatically enforce coding standards and catch common mistakes before they reach production.
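
As a small illustration of what a homegrown guardrail can look like, here’s a sketch that uses Python’s ast module to flag mutable default arguments – a rule an off-the-shelf linter would normally cover:

```python
# Sketch of a homegrown static-analysis guardrail: flag mutable default
# arguments, a classic source of surprising behavior. The rule is illustrative;
# in practice an off-the-shelf linter would catch this.
import ast
import sys

def find_mutable_defaults(source: str, filename: str = "<string>") -> list[str]:
    findings = []
    tree = ast.parse(source, filename)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for default in node.args.defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append(
                        f"{filename}:{node.lineno} {node.name} has a mutable default"
                    )
    return findings

if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        problems = find_mutable_defaults(f.read(), path)
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```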

Focusing on these fundamentals creates a codebase that is more resilient and adaptable. When the underlying code is clean, modular, and well-documented, engineering teams can respond quickly to changes, whether that’s launching a new feature, pivoting strategy, or fixing a critical bug. These best practices lay the groundwork that lets teams move fast without breaking things.

6. Invest in leadership and culture

Self-healing systems don’t come solely from tooling. They come from intentional leadership and cultural alignment. 

Leaders need to prioritize operational health, not just in moments of crisis, but as a first-class concern in every project. That means baking reliability investments into roadmap planning, not tacking them on later as tech debt.

But to truly shift the load, cultural buy-in is non-negotiable.

Engineers need to be empowered – and expected – to think about long-term system health. That begins with how we frame “done.” A feature isn’t complete if it’s hard to debug, opaque to other teams, or brittle in production. By reinforcing that message early and often in code reviews, design docs, and planning cycles, leaders can normalize operational rigor as a mark of engineering excellence, not an afterthought.

Teams also need support in developing a mindset of ownership. That means not waiting for incidents to drive learning, but proactively doing pre-mortems, writing usage guides, and conducting architecture reviews with operational risks in mind. At X, we saw a shift when we made it clear that everyone owned reliability, not just on-call engineers or site reliability engineers (SREs).

Cultural change isn’t instant. Early on, it may take white-glove support or overcommunication. But once the mindset catches on, it reinforces itself. That’s when resilience becomes sustainable.

It’s worth acknowledging that no system can be completely autonomous. Failures will happen, novel edge cases will arise, and operational realities will evolve. But if the team is culturally aligned and equipped to handle those moments thoughtfully, the result is a more adaptive, less reactive organization.

Final thoughts

Looking back at the systems I’ve worked on, from fragile first versions to mature, reliable platforms, one theme stands out: the best reliability outcomes come from thoughtful design choices and cultural buy-in.

Investing in dashboards, test infrastructure, automation, and knowledge sharing may not seem urgent in the moment – but over time, they create a margin of safety that protects both the system and the people behind it. At X (previously Twitter), those investments gave us back the time to focus on innovation, not just survival.

If I could go back and give myself one reminder, it would be this: the time you spend making your system more resilient is never wasted. It’s the difference between reactive support and sustainable, forward momentum.