Why, as a Netflix infrastructure manager, am I on call?
The Twitter discussions on on-call last week (1, 2) motivated me to write a blog post that’s been in my Trello list for some time (actual card title “Why am I still on-call”). I started to consider writing this when one of my excellent teammates asked me bluntly during 360 feedback, “Are you still (as a manager) on-call as you don’t trust us (the team)?”. In working through the feedback, I realized the answer to the question was nuanced and I hadn’t provided the right context. We discussed the answer in a dev meeting, and here are some of the highlights.
I go primary on-call for the team every seven weeks. I have the same responsibilities of “carrying the pager(duty)” that everyone on the team has. If the system has issues at 2AM, I wake up and fix them, doing the same analysis with the same tools as the engineers on the team. I work through critical issues found during on-call that put systemic reliability at risk. I run the on-call handover for that week, summarizing issues found, fixes, and analysis in flight. These are the same responsibilities every engineer on the team holds once every seven weeks. What I don’t do is handle our Slack and email user support during office hours. The team pitches in and does that support for me, as I am in meetings most days and cannot reliably handle support issues that span more than 15 minutes. The team sees this as a fair trade-off for not having to carry the pager that week.
Now, let’s talk about the reasons (both bad and good) why I stay on call.
Bad (Selfish) Reason #1: I miss complex system analysis
I decided to move into management a little over a year and a half ago. I have focused on analysis of complex codebases for most of my career, specifically performance engineering. I love tearing apart codebases I do not know and working to make them more performant and scalable. While on-call analysis isn’t exactly this, with the complexity of our infrastructural services, the approach is similar. I found when I was an engineer on-call that I loved taking a new problem and peeling back the layers of the system until the root cause was identified. I found extreme technical joy in working through these problems on systems of our scale. While, as a manager, this isn’t a good reason to stay on call, I do it selfishly, as my passion for analysis is satisfied once every seven weeks.
Good Reason #1: Empathy
As a manager, I invest a great deal of energy into being the product manager for our service. Good product managers should have empathy for their users and customers. When I’m on-call, I get a good idea of how well the system is working from our users’ perspective. I can make informed trade-offs between new feature work and paying down technical debt that is impacting happiness with existing features. One could argue that I could listen intently during on-call handovers to get the same signal. I have found it hard to get to that level of understanding without being deeply involved every so often. The reason is that our team is so well connected during the week that not doing the job means you lose the context needed to participate in the discussions.
Sitting on the outside of development, it is sometimes easy to gloss over how things are going. Our team knows when our service is too close to falling off a cliff. Without being on call, I can hear how close the edge is, but I can rationalize away the fear more easily than I can when I’m the one driving and looking at the cliff’s edge myself.
This perspective ensures I am empathetic with the team. Not only will I be far more supportive of pulling back when we need to, but I also feel the level of burnout the team has to deal with. On-call, when done well, shouldn’t burn out the on-call engineer. On-call, when done poorly, always burns out the entire team. When I feel burned out during an on-call, you can be sure I’ll be committing to new features more sparingly. In turn, this helps the team stay healthy as well.
Good Reason #2: Hiring
This approach also helps tremendously during hiring. I would say half of the engineers talking to me about career opportunities ask me about on-call and burnout. I can answer directly with “I was on call last week. Let me tell you about the issues and how much time I spent”. I am very proud to discuss openly how we handle systemic reliability of our service. Many candidates who ask about on-call do so because they are burnt out. They are burnt out on on-call because their teams haven’t invested in paying down tech debt the way our team has. I specifically look for engineers who are excited to join a team that is directly on-call for its service and who can demonstrate either experience or thought-based software engineering approaches to taking on reliability issues in a global and fast-growing service.
Good Reason #3: Operational Tooling
As part of making sure I am still able to be on-call, at every handover that closes one of my on-call weeks, I ask with a simple thumbs up/down whether I am still providing the right value when on-call. After one of these votes, a team member pointed out that I actually give a benefit no one else on the team can provide. Specifically, after I discussed gaps I had found in one of our operational tools, the team member said I am a good measuring stick for the quality of our operational tools. He pointed out that operational tooling shouldn’t be usable only by those who wrote the tool or the system it manages, or only during office hours. Therefore, as a non-developer, I am a good gauge of whether our tools work for slightly impaired operators, like those who have to use them at 2AM.
I’ve been asked if being on-call is the best use of my time. Obviously, I believe it is, even with my one selfish reason. Also, people tend to assume that on-call comes at the expense of other important things managers spend time on at Netflix (things like user outreach, strategy and recruiting engineering talent). The fact is, since the team covers my daytime support responsibilities, it doesn’t get in the way much. In some cases, where a systemic reliability issue drags on from nighttime coverage into daytime support, or a daytime page occurs, it does impact my office hours work. However, given we are making sure these cases are exceptional, the occasional trade-off is worth it. Also, if the issues do drag on, I understand clearly how much they are impacting developers’ time as well.
I also realize that for some managers it might be harder to get into on-call. Given I helped on early implementations of our service, I know the technology well. I simply stayed on-call when I transitioned into management. My advice for managers who want to be in this position would be to shadow on-call for a while, working with the operational tools alongside your team to the point where you feel comfortable that you’d be able to solve a problem at 2AM. I am not advocating for all managers to stay on call. For me it was a personal choice that aligned well with the needs of my team and my passion, and it was almost a requirement when the team size was 3 engineers. When the time comes that I am no longer able to do on-call up to the team’s standards due to expanded workload, I plan to invest in keeping a close eye on the reliability of the service as well as the personal health of on-call members. For managers who aren’t on-call today, it is super important to keep track of these signals as part of being a great product manager.
I am proud of our service’s investment in reliability and operability. I value staying deeply involved in that aspect of our service by staying on-call. I am romantic about being on-call and will continue to be so.