Doing On-Call the Right Way

Improving Quality of Life for Engineers and Quality of Product for Customers

Maria Gullickson
CarMax Engineering Blog
7 min read · May 29, 2024


I’ve worked in software development for many years now. I’ve worked on a lot of different teams in a handful of different companies. Most of these jobs have involved on-call work in some form or another. I’ve had a couple of jobs where the on-call completely burned me out. For a while after that, I would only take jobs with no on-call required, even if it meant taking a step backwards in my career. But in recent years, I’ve found that on-call doesn’t have to be a nightmare.

My role at CarMax currently involves an on-call rotation. I am on call 24/7 one out of every four weeks. And past Maria might be surprised to hear that I really don’t mind it at all. It turns out that it’s not on-call that was hard, but poorly managed on-call. I want to share some strategies that make this responsibility better for the engineers, while also producing a better product for customers.

I’m Only On Call for Things I Work On

At CarMax, we used to have a large on-call rotation with all the engineers in the Product Organization. At first, this seemed great — there were so many of us that I was only on call about one week out of every eight months. But I quickly learned that this system wasn’t so great.

First of all, when I was on call, most of the alerts were for things I didn’t know anything about. Sometimes they were things I’d never even heard of. Often there was really nothing I was able to do to get things working again, other than call other people (who weren’t supposed to be on call) or wait till business hours. If I was able to resolve it myself, it took much longer than it should have, as I had to learn about the systems in the moment. And I got paged a lot, because I was responsible for so many different things that week.

Another big problem was that none of the underlying issues got fixed. If I got alerted about an intermittent problem in a system someone else owned, I wasn't in a position to do anything about it. And I had no visibility into how much of a problem it was. Being on call only once every eight months, I wouldn't know whether it came up every week or only once a year. And the team that owned it wasn't pushed to resolve it, because they too only got paged a couple times a year.

Now that I'm on call only for things that my own team works on, and on call much more often, things work a lot better. If something breaks, it's something I am very familiar with. This means I'm able to address it myself (or at least within my immediate team), and fairly quickly. If it's an alert that is likely to go off again, I am able to get to the underlying cause and actually fix it. If I ever get paged a second time for the same thing, I know it wasn't a fluke and should be cleaned up. This means I'm not only addressing the outage I was paged for, but also preventing future outages. As a result, my future on-calls are quieter. Another result is that the product is working better for our customers, because I've reduced the chance of another outage.

We Are Empowered to Fix the Problems

I’ve worked on teams where the product owners dictate the work that gets done, and what they care about is adding new features. Adding new features is a key part of my job, and it’s something we need to do to keep making our product better for our customers.

But we also need a solid foundation. At CarMax, the teams I have worked on have always empowered the engineers to do what is right. If there is an underlying bug, it can be prioritized and fixed. If there's an alert that is noisier than it needs to be, it can be adjusted. Time can be made for these foundational issues alongside new feature development and other product enhancements. The sooner these things get fixed, the less developer time is wasted dealing with them when they arise and the less burnout they cause, leaving developers free to focus on those new features going forward.

We Are Thoughtful About Alerting Levels

When incidents occur, our engineers don’t just think about addressing the issue, but also about how well the alerting worked. There are a few questions I ask myself whenever I’m responding to an alert or handling an outage:

  • If I got an alert, was it appropriate? Was there action for me to take? If there was nothing to do, maybe I don’t need to be alerted. Reducing the noise of the alerts makes developers’ lives better, and ensures that we pay better attention when alerts do come in.
  • If we got an appropriate alert, was it fast enough? Could the problem have been detected earlier, so it could be resolved earlier?
  • Was the level of the alert correct? Our team has high-priority alerts that page the on-call immediately, and low-priority alerts that simply send an email for the on-call to see when they are at their desk (a minimal sketch of this routing follows the list). If an urgent issue triggers a low-priority alert, the issue could go unnoticed. (In the worst case, it might go off on a Friday evening and not be seen till Monday morning.) If a non-urgent issue triggers a high-priority alert, it creates extra noise. Noisy alerts can lead to developer burnout. Too many noisy alerts can also train a developer to treat alerts as things that can be ignored for a while, which is how a truly urgent alert gets missed.
  • Sometimes we find out about a problem with our service through other means. Maybe we stumble on it in the logs while looking for something else; maybe we run into the problem ourselves while using our services; in the worst case, someone else finds the problem first and brings it to our attention. In a perfect world, a customer (internal or external) should never know that our systems are having problems before we do. When there’s an issue we didn’t detect automatically, we always think about what alerts (or automated tests, if it’s a coding bug) we can add to catch the issue next time.
  • Did the alerts have the necessary information? Did they make it very clear what was wrong and where? If not, we can improve the alerts with more details.
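
To make the severity question concrete, here is a minimal, hypothetical sketch of routing an alert to a paging channel or an email channel based on its priority. The `Severity` enum and the `page_on_call` / `email_on_call` helpers are illustrative stand-ins, not our actual tooling; in practice this routing usually lives in the alerting platform's configuration rather than in application code.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "high"  # urgent: page the on-call engineer immediately
    LOW = "low"    # non-urgent: email the on-call to review during work hours


def page_on_call(message: str) -> None:
    # Hypothetical stand-in for a paging integration.
    print(f"PAGE  -> {message}")


def email_on_call(message: str) -> None:
    # Hypothetical stand-in for an email notification.
    print(f"EMAIL -> {message}")


def route_alert(name: str, severity: Severity, details: str) -> None:
    """Route an alert to the right channel based on its severity."""
    message = f"{name}: {details}"
    if severity is Severity.HIGH:
        page_on_call(message)
    else:
        email_on_call(message)


if __name__ == "__main__":
    route_alert("checkout-api-5xx-rate", Severity.HIGH, "error rate above 5% for 10 minutes")
    route_alert("nightly-report-delay", Severity.LOW, "report finished 30 minutes late")
```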

We Make Alerts Easier to Resolve

When an issue arises, the best-case scenario is that it resolves itself, and no developers need to be woken up. Auto-scale settings that react to increasing load on your resources are one example. You might also build your own framework to detect an issue and take the appropriate action: replay messages that show up on a dead-letter queue (DLQ), reboot a misbehaving server, and so on.
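
As a rough illustration of that second idea, the sketch below runs one pass of self-healing: it replays anything sitting on a dead-letter queue and restarts any instance that fails its health check. Every function here (`read_dlq`, `replay_message`, `is_healthy`, `restart_instance`) is a hypothetical placeholder for whatever your queueing and hosting platform actually provides, and the whole thing would run on a schedule rather than by hand.

```python
# Hypothetical placeholders -- in a real system these would wrap your queueing
# and hosting platform's APIs (a service-bus SDK, a management API, etc.).
def read_dlq(queue_name: str) -> list[str]:
    """Return any messages currently sitting on the dead-letter queue."""
    return []


def replay_message(queue_name: str, message: str) -> None:
    """Put a dead-lettered message back on the main queue for reprocessing."""
    print(f"replaying on {queue_name}: {message}")


def is_healthy(instance: str) -> bool:
    """Call the instance's health endpoint and report the result."""
    return True


def restart_instance(instance: str) -> None:
    print(f"restarting {instance}")


def remediate_once(queue_name: str, instances: list[str]) -> None:
    """One self-healing pass: replay dead-lettered messages, restart unhealthy instances.

    In practice this would run on a schedule (a timer trigger, a cron job, etc.).
    """
    for message in read_dlq(queue_name):
        replay_message(queue_name, message)
    for instance in instances:
        if not is_healthy(instance):
            restart_instance(instance)


if __name__ == "__main__":
    remediate_once("orders", ["web-1", "web-2"])
```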

The next best-case scenario is that the resolution is automated, but a developer needs to trigger it. For example, if we need to restart our services, we have Azure DevOps Pipelines that can manage this for us. Our service runs in three regions, so the pipelines will remove one region from Traffic Manager, wait for existing traffic to drain, restart the service, then add it back to Traffic Manager. This ensures that services can be restarted safely on a rolling basis while other instances are still serving requests. By automating this process, we ensure that a sleepy developer doesn't miss a step, click on the wrong thing, or move too fast. Even a developer who is brand new to the team can use these tools safely. We don't have an automated way to detect that these pipelines need to be run and trigger them. But once a developer decides it's the right thing to do, it's a single button click to make it all happen.
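
Our real implementation is an Azure DevOps pipeline, but the sequence of steps it automates looks roughly like this sketch. The helper functions are hypothetical stand-ins for Traffic Manager and service-restart calls; the point is that the ordering and the waits are encoded once, so nobody has to remember them at 3 a.m.

```python
import time

REGIONS = ["east-us", "central-us", "west-us"]  # illustrative region names


# Hypothetical wrappers around Traffic Manager and the app platform's restart API.
def disable_endpoint(region: str) -> None:
    """Take the region's endpoint out of the Traffic Manager profile."""
    print(f"disabling traffic to {region}")


def enable_endpoint(region: str) -> None:
    """Put the region's endpoint back into the Traffic Manager profile."""
    print(f"re-enabling traffic to {region}")


def restart_service(region: str) -> None:
    print(f"restarting service in {region}")


def wait_for_drain(region: str, seconds: int = 5) -> None:
    """Give in-flight requests time to finish (a real pipeline would wait much longer)."""
    time.sleep(seconds)


def wait_until_healthy(region: str) -> None:
    """Poll the region's health endpoint before moving on (stubbed out here)."""
    print(f"{region} reports healthy")


def rolling_restart() -> None:
    """Restart one region at a time while the other regions keep serving traffic."""
    for region in REGIONS:
        disable_endpoint(region)
        wait_for_drain(region)
        restart_service(region)
        wait_until_healthy(region)
        enable_endpoint(region)


if __name__ == "__main__":
    rolling_restart()
```

Draining before restarting is what keeps the rolling restart invisible to customers: a region never restarts while it is still taking new requests, and it only rejoins the rotation once it is healthy again.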

If nothing else, documentation is helpful. It's great to have a document for each alert that might come through, including information on how to investigate and resolve it (possibly referencing those automated resolution tools I just mentioned). If an alert comes in that doesn't have a documented resolution, adding that documentation is a good action item to complete during normal working hours. As the lead engineer on my team, I also find this great protection for my time. Just because I know how to handle some of these situations doesn't mean everyone on the team, down to the most junior engineer, knows how. If another engineer is dealing with an outage scenario they aren't familiar with, and there is no documentation, they might reach out to me. Having good documentation means that my phone stays completely quiet during the weeks I'm not on call.

Another thing that helps is having great observability. When something goes weird, we've got lots of logs, metrics, and dashboards to help us dig into what is going on, and what else might be related to it. If we can't find the information we need, or it was hard to get to and make sense of, that's something else we fix. We might add more logging, track more metrics, or enhance our dashboards to highlight the important information and show correlations that matter.
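
As one small example of what "add more logging" can mean, the sketch below emits structured log lines tagged with a correlation ID, so that every record for a single request can be tied together later in a dashboard query. The event names and the `process_order` function are invented for illustration; the underlying mechanism is just Python's standard logging with JSON payloads.

```python
import json
import logging
import uuid

logger = logging.getLogger("checkout")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one JSON log line so dashboards can filter and correlate on its fields."""
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))


def process_order(order_id: str) -> None:
    # One correlation ID per request ties all of its log lines together.
    correlation_id = str(uuid.uuid4())
    log_event("order.received", correlation_id, order_id=order_id)
    try:
        # ... real work would happen here ...
        log_event("order.completed", correlation_id, order_id=order_id, duration_ms=42)
    except Exception as exc:
        log_event("order.failed", correlation_id, order_id=order_id, error=str(exc))
        raise


if __name__ == "__main__":
    process_order("12345")
```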

A Common Thread: Continuous Improvement

I am a huge believer in continuous improvement. I'm sure I'll have more posts that talk about how we continuously improve at work. Any time an issue comes up, I like to think about "how could this have gone better," and then take the necessary steps to make it better for the future. This really works. My on-calls are incredibly quiet, and it's not unusual for months to go by without a single high-priority alert or any need to log on outside my regular working hours. This makes me a lot happier, and a lot more productive at work. It's also a sign that our product is working well and not causing issues for the customers who are trying to use it. When you set up a good on-call system, and continually iterate to make it better, everybody wins!
