How to fix a terrible on-call system

Less time on-call isn’t always better
March 30, 2021

If your on-call system makes people want to pull their hair out all the time, then you have a problem that you need to fix.

Back when our team was small, we put together a single on-call rotation. Every dev was in the rotation and would go on-call for one week at a time. When we first started the rotation, our team was made up of five devs. Years passed, and despite the team's growth, we stuck with the single rotation. Eventually, the team grew so big that people were going on-call only once every three to four months. This may seem like a dream come true, but in reality, it was far from it.

A broken on-call system 

The single on-call rotation was miserable for just about everyone for a variety of reasons.

Infrequent on-call shifts

The large rotation meant that on-call shifts were so infrequent that devs were not able to get the experience and repetitions they needed to know how to handle on-call issues effectively. In addition, our codebase had grown tremendously and there were so many things being developed at once that when a problem arose, there was a good chance the on-call dev knew nothing about it or the code that was causing it.

This led to panicked developers often turning to the Site Reliability Engineering (SRE) team for help with issues. Constantly having to jump in and help with on-call issues quickly began to drain a lot of the SRE team’s time and resources. Essentially, the SRE team began to act as if they were on-call 24/7. The constant bombardment of questions and requests came very close to burning out the entire team and took away the valuable time they needed to work on their own projects. 

No ownership

Besides having a burned-out and inefficiently used SRE team, another gripe was that developers felt like they had no ownership over the code they were supporting. One person would write code and another would debug it if it broke. The app was so big that no one could have a sense of ownership over all the production code; there was just too much of it.

Three teams, one application

Due to the size of our engineering organization, we had three separate dev teams. Each team had five to seven devs on it, plus a manager. Each team also had its own set of projects. However, our main application was still a single monolithic Rails app, and all three teams worked equally across the entire codebase. Unlike apps with clearly separated backend components owned by individual teams, ours had no clear or obvious lines of ownership. Solving this issue would prove to be the hardest task when it came to fixing our broken on-call system.

The solution

Three rotations

We knew we had to break up the rotation if we wanted to continue growing, but the question was how? Despite all of the developers working across a single application with no clearly defined lines of ownership, we devised a plan that broke our single rotation into three: one for each of our three dev teams. This led to shorter rotations, which meant more reps for devs. As backward as it may sound, being on-call more often is a benefit because devs become much more comfortable and experienced with it.
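The frequency math behind "more reps" is simple. A quick sketch (the team sizes here are illustrative, based on the article's three teams of five to seven devs, not exact headcounts):

```python
# Rough arithmetic for how rotation size affects on-call frequency.
# Headcounts are illustrative: one big rotation of ~18 devs versus
# three per-team rotations of ~6 devs each.

def weeks_between_shifts(rotation_size: int, shift_weeks: int = 1) -> int:
    """With week-long shifts, each dev waits one full cycle between shifts."""
    return rotation_size * shift_weeks

single_rotation = weeks_between_shifts(18)   # one rotation, all devs
per_team_rotation = weeks_between_shifts(6)  # three rotations, one per team

print(single_rotation)    # 18 weeks between shifts: roughly once every 4 months
print(per_team_rotation)  # 6 weeks between shifts: three times as many reps
```

Going from one big rotation to three team rotations triples each dev's on-call reps without changing the length of any individual shift.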

Divided application ownership

Three rotations allowed the devs to get more reps being on-call, but that still left the biggest problem of all: ownership. No one wants to support something they don't feel like they own, so to give devs that sense of ownership, we chose to split up the on-call application ownership among the three dev teams.

It didn’t happen overnight, but with a few meetings and a lot of team discussions, we were able to break up everything in our application between the three teams. Now, this gets a little bit in the weeds, but I wanted to share how we did it in case it helps you come up with ideas for how you might split up a monolithic app.

We broke up all the background workers:

  • Team 1: Indexing jobs
  • Team 2: Overnight reporting jobs
  • Team 3: Client communication jobs

We broke up all the individual service alerts:

  • Team 1: Redis alerts, Queue backup alerts
  • Team 2: Elasticsearch alerts, API traffic alerts
  • Team 3: MySQL alerts, User load page alerts

We broke up the application components:

  • Team 1: Users and Alert models and controllers
  • Team 2: Asset and Vulnerability models and controllers
  • Team 3: Reporting and Emailing models and controllers
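To make the alert split concrete: in an Alertmanager-style setup (the article doesn't name our alerting tool, so this is a hypothetical sketch), each alert carries a `team` label and routes to the owning team's on-call pager:

```yaml
# Hypothetical Alertmanager-style routing for the three-team split.
# Label values and receiver names are invented for illustration.
route:
  receiver: sre-fallback            # anything unowned still pages SRE
  routes:
    - matchers: ['team = team-1']   # Redis, queue backup alerts
      receiver: team-1-oncall
    - matchers: ['team = team-2']   # Elasticsearch, API traffic alerts
      receiver: team-2-oncall
    - matchers: ['team = team-3']   # MySQL, user page load alerts
      receiver: team-3-oncall
```

However you implement it, the point is that an alert pages only the team that owns the failing piece, not whoever happens to be holding a global pager.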

Once the lines had been drawn, we made sure to stress to each of the dev teams that despite doing our best to balance the code equally, we might still have to move things around. This showed the devs that we were fully invested in making sure this new on-call rotation was fair and better for everyone. 

On-call training sessions

After the code was split up, the SRE team took time to sit down with each dev team to thoroughly review the app components, workers, and alerts they now owned. We went over everything from common issues to exactly what every single piece of code did and how it affected the rest of the application. These sessions gave devs a lot more confidence in their ability to handle on-call situations because they now had a clear picture of what they owned and how to handle it. Even though they didn’t build some of the code themselves, they had an understanding of exactly how it worked and what it was doing.

CODEOWNERS file

In addition to educating each team on their section of the code, we also took advantage of GitLab's CODEOWNERS file. The CODEOWNERS file lets you specify which users or teams in your organization own a file. When anyone updates that file in a merge request, the owners are automatically tagged for review. This kept the team that supported the code informed of any changes made to it.
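As a rough illustration, a CODEOWNERS file mirroring the split above might look like this (the paths and group names are hypothetical, not our actual ones):

```
# Hypothetical CODEOWNERS sketch for the three-team ownership split.
# Paths and group names are illustrative.

# Team 1: indexing workers, Users and Alert models/controllers
app/workers/indexing/                 @org/team-1
app/models/user.rb                    @org/team-1
app/controllers/alerts_controller.rb  @org/team-1

# Team 2: overnight reporting jobs, Asset and Vulnerability code
app/workers/reports/                  @org/team-2
app/models/asset.rb                   @org/team-2
app/models/vulnerability.rb           @org/team-2

# Team 3: client communication jobs, Reporting and Emailing code
app/workers/client_comms/             @org/team-3
app/mailers/                          @org/team-3
```

Each line maps a path (or directory) to the group that owns it, so any merge request touching those files automatically pulls in the owning team.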

On-call support system

Originally, the SRE team was the support system for the on-call dev; if the on-call dev had questions or needed help, they would talk to the SRE team member who was on-call that week. Our SRE team only had three members at the time, so you can see why they got burned out as the constant go-to. With the new system, if any of the three on-call devs get overwhelmed or stuck on an issue, they are encouraged to reach out to one another for help, rather than the SRE team.

Narrowed on-call responsibility scope 

Prior to these on-call rotation changes, the on-call devs were responsible for determining whether any customer messaging was needed during an incident and for communicating it to the rest of the organization and team. We have since moved the customer messaging responsibility to the support team. The support team is the closest to the customer and is therefore best equipped to communicate any problems.

Communicating the incident internally to the rest of the team and organization became the duty of the on-call dev's manager. When an incident occurred, the on-call dev would notify their manager, who would then relay the message to the rest of the organization. Removing the communication responsibilities from the on-call dev allowed them to focus on what they do best: diagnosing and solving the problem at hand.

The payoff

Improved alerting 

Originally, the SRE team had set up all the alerting and monitoring tools. However, once we turned the alerts over to each of the dev teams, they took them and ran. Because each team felt a renewed sense of ownership over their alerts, they started to improve and build on them. Not only did they make more alerts, but they also improved the accuracy of the existing ones. 

A renewed sense of ownership

Even though one team might edit code that another team supports, the supporting team still keeps a keen sense of ownership; they act almost as the domain experts over their section of code. Using the CODEOWNERS file ensures that the supporting team is made aware of, and can sign off on, any changes made by other teams, and because each chunk of code a team supports is relatively small, the team can actually learn it thoroughly.

Faster incident response 

With three devs on-call at once and each one of them focusing on a smaller piece of the application, they can spot problems a lot faster. Each team also wants to ensure that when things do go wrong they are caught quickly. Many teams choose to tune their alerts to catch problems even earlier than the original thresholds allowed.
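Tuning an alert can be as small as lowering a threshold and shortening its pending window. A hypothetical Prometheus-style rule (metric names and numbers are invented for illustration, not taken from our setup):

```yaml
# Illustrative alert tuning: the owning team tightens its own
# queue-backup rule so it fires earlier than the SRE-era defaults.
groups:
  - name: team-1-queue-alerts
    rules:
      - alert: QueueBackup
        expr: background_queue_depth > 500   # was 2000 before tuning
        for: 5m                              # was 15m before tuning
        labels:
          team: team-1
        annotations:
          summary: "Background job queue is backing up"
```

Because the owning team knows what "normal" looks like for their queue, they can tighten these numbers without drowning themselves in false alarms.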

Triaging and finding the root cause of issues is also a lot faster. Teams are intimately familiar with their alerts and the pieces of code they own, which lets them track down problems more quickly than before.

Never alone

Having three devs on-call at once means that none of them ever feel alone. If things start to fall apart in one section of the application, the dev that owns that part knows there are two others available to help if they need it. Just knowing that you have someone else who is easily accessible can do wonders for your confidence when you are on-call.

Improved cross-team communication

As I stated before, each of the three dev teams worked across the entire application and would often change code that another team supported. Having the CODEOWNERS file ensured that the on-call team was alerted to those changes. This not only allowed for a good technical review process, but it kept each of the teams up to date on what the other teams were working on and doing.

On-call shouldn’t suck

On-call is something that many people in this industry dread and it shouldn’t be that way. If people are dreading on-call then something is broken with your system. Sure, everyone at some point will get that late-night or weekend page that is a pain, but that pain shouldn’t be the norm. If on-call makes people want to pull their hair out all the time, then you have a problem that you need to fix. I hope this article has given you some ideas to help you improve your own on-call system so that it can benefit everyone.